Friday, July 29, 2022

[SOLVED] Add characters to a column if the row starts with a specific character, and do this for odd and even rows separately

Issue

I have multiple alignment format (MAF) files that look like this:

##maf version=1
a       score=-1274
s       Chr10                                            34972197            2927       +       190919061         AACCTTGGGG
s       Chr11                                            36777315            2442       +       244384623         AACCTTGGGG

a       score=-60687
s       Chr1                                             81897274           61972       +       159217232          CGTTTTCCCGG
s       Chr1                                             33997294           32248       +       200980605   

I would like to modify the second column of these files for lines that start with "s", to have something like this:

##maf version=1
a       score=-1274
s       species1.Chr10                                            34972197            2927       +       190919061         AACCTTGGGG
s       species2.Chr11                                            36777315            2442       +       244384623         AACCTTGGGG

a       score=-60687
s       species1.Chr1                                             81897274           61972       +       159217232          CGTTTTCCCGG
s       species2.Chr1                                             33997294           32248       +       200980605          CGTTTTCCCGG   

Using the idea from https://unix.stackexchange.com/questions/154220/adding-a-character-to-every-other-text-line, I am trying things like this:

awk '$1 == "s" {print ((NR%2)? "species1.":"") $0}'

But I am still far from reaching my objective. Do you know how I could achieve this?


Solution

Assumptions:

  • the original whitespace between fields is to be preserved

One awk idea:

awk '
!/^s/ { print; sfx=0 }                    # line does not start with "s": print it unchanged and reset the sfx counter
 /^s/ { n=split($0,a,FS,seps)             # line starts with "s": split it; key is to save each separator as a separate seps[] array entry
        a[2]="species" (++sfx) "." a[2]   # add prefix to value in 2nd field
        for (i=1;i<=n;i++)                # loop through all field/separator pairs
            printf "%s%s", a[i], seps[i]  # print each field followed by its original separator (data is never used as the format string)
        print ""                          # terminate line
      }
' maf.dat

NOTE: requires GNU awk (gawk); the 4th argument to split() is a gawk extension
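
If GNU awk is not available, the same result can probably be had with POSIX awk features only. The following is a minimal sketch, not part of the original answer: it uses match() to find the leading "s" plus the whitespace after it, then splices the prefix in with substr(), so the rest of the line is never re-split and its spacing is left alone. The function name prefix_species is hypothetical, and the sketch assumes fields are separated by spaces or tabs.

```shell
# prefix_species: hypothetical helper, reads MAF text from stdin or from the
# files given as arguments and writes the modified text to stdout.
prefix_species() {
    awk '
    # Any line whose first field is not "s" ends the current alignment block:
    # reset the counter and print the line unchanged.
    $1 != "s" { sfx = 0; print; next }

    # First field is "s": locate "s" plus the whitespace that follows it.
    # RLENGTH is then the offset at which the 2nd field starts.
    match($0, /^[ \t]*s[ \t]+/) {
        $0 = substr($0, 1, RLENGTH) "species" (++sfx) "." substr($0, RLENGTH + 1)
    }

    { print }
    ' "$@"
}
```

It can be called as prefix_species maf.dat, or used at the end of a pipeline. Because only the text in front of the second field is touched, trailing whitespace on a line is also kept.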

This generates:

##maf version=1
a       score=-1274
s       species1.Chr10                                            34972197            2927       +       190919061         AACCTTGGGG
s       species2.Chr11                                            36777315            2442       +       244384623         AACCTTGGGG

a       score=-60687
s       species1.Chr1                                             81897274           61972       +       159217232          CGTTTTCCCGG
s       species2.Chr1                                             33997294           32248       +       200980605          CGTTTTCCCGG
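
Since the question mentions multiple MAF files, one way to apply the program to all of them is a small shell loop. This is only a sketch under a few assumptions: the awk program has been saved to a file (called prefix.awk in the usage line below, a hypothetical name), the function name run_on_all and the output directory are made up for illustration, and the AWK variable can be set to gawk when the program needs GNU awk.

```shell
# run_on_all PROGRAM_FILE OUTDIR FILE...
# Runs the awk program in PROGRAM_FILE on each FILE and writes the result
# to OUTDIR under the same base name, leaving the original files untouched.
run_on_all() {
    prog=$1
    outdir=$2
    shift 2
    mkdir -p "$outdir" || return 1
    for f in "$@"; do
        # One awk process per file, so all awk variables start fresh each time.
        "${AWK:-awk}" -f "$prog" "$f" > "$outdir/$(basename "$f")" || return 1
    done
}
```

For example: AWK=gawk run_on_all prefix.awk prefixed *.maf. Writing to a separate directory rather than editing in place makes it easy to compare the results with the inputs before replacing anything.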


Answered By - markp-fuso
Answer Checked By - Terry (WPSolving Volunteer)