Thursday, November 18, 2021

[SOLVED] Using awk to replace and add text

Issue

I have the following .txt file:

##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=.,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##bcftools_viewVersion=1.12-57-g0c2765b+htslib-1.12-45-g1830551
##bcftools_viewCommand=view -h 20Perc.SNPs.mergedAll.vcf; Date=Tue Sep 28 09:46:59 2021

I would like to make two edits to this text. First, in the line:

##FORMAT=<ID=PL,Number=.,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">

I would like to replace "Number=." with "Number=G"

And immediately after the line:

##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">

I would like to add a new line of text (and a line break):

##INFO=<ID=QualityScore,Number=.,Type=Float,Description="Quality score">

I was wondering if this could be done with one or two awk commands.

Thanks for any suggestions!


Solution

My solution is similar to @Daweo's. Consider this script, replace.awk:

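# Change Number=. to Number=G on the PL line only; other FORMAT lines
# (such as AD) also contain Number=. and must stay as they are
/^##FORMAT=<ID=PL,/ { sub(/Number=\./, "Number=G") }

# After the AF line, emit the new QualityScore line
/^##INFO=<ID=AF,/ {
    print
    print "##INFO=<ID=QualityScore,Number=.,Type=Float,Description=\"Quality score\">"
    next
}

# Print all remaining lines, including the edited PL line
1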

Run it:

awk -f replace.awk file.txt
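
Note that awk writes the result to standard output and leaves file.txt untouched. If the goal is to update the file itself, one option is a small shell wrapper that writes to a temporary file first and only overwrites the original when awk succeeds. This is only a sketch: the function name apply_awk is hypothetical (it is not part of the original answer), and it assumes mktemp is available on your system.

```shell
# Hypothetical helper: run an awk script over a file and overwrite
# the file only if awk exits successfully.
apply_awk() {
    script=$1
    target=$2
    tmp=$(mktemp) || return 1
    if awk -f "$script" "$target" > "$tmp"; then
        # Redirecting into the target keeps its existing permissions
        cat "$tmp" > "$target"
        rm -f "$tmp"
    else
        # Leave the original untouched when awk reports an error
        rm -f "$tmp"
        return 1
    fi
}
```

With that defined, the call would be: apply_awk replace.awk file.txt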

Notes

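  • The first rule is a plain substitution: on each line that matches its pattern, sub() replaces the first "Number=." with "Number=G"
  • The next block handles your second requirement. First, the print statement prints the current line
  • The second print statement prints the new line of text
  • The next statement stops processing the current line and moves on to the next input line, so the AF line is not printed a second time by the final rule
  • Finally, the pattern 1 is always true, so awk prints every line that reaches it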
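
Since the question asks whether this can be done in one or two awk commands, the same steps can also be written inline, without a separate script file. The sketch below is a variant rather than the answer's exact script: it prints every line first and then appends the new line after the AF line, so no next statement is needed. The wrapper name fix_header is hypothetical, and matching on the ID=PL and ID=AF prefixes is an assumption about how the header lines are identified.

```shell
# fix_header: hypothetical wrapper. Reads the header from the files given
# as arguments (or from stdin) and prints the edited header to stdout.
fix_header() {
    awk '
        # Substitute on the PL line only, leaving other FORMAT lines alone
        /^##FORMAT=<ID=PL,/ { sub(/Number=\./, "Number=G") }

        # Print every line, edited or not
        { print }

        # Append the new INFO line directly after the AF line
        /^##INFO=<ID=AF,/ {
            print "##INFO=<ID=QualityScore,Number=.,Type=Float,Description=\"Quality score\">"
        }
    ' "$@"
}
```

It can be used as fix_header file.txt, or at the end of a pipeline that produces the header.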


Answered By - Hai Vu