Monday, January 31, 2022

[SOLVED] Merging multiple rows into one row per record, separated by a blank line, in Unix

Issue

I have a text file in which each record starts with a number and a name and ends with a blank line. I would like each record on one row as comma-separated values. I have tried the following code; the script and the input file are sark.awk and biosample.txt.

The Unix command to run the code is:

gawk -f sark.awk biosample.txt

then run:

sed 's/,,/\n/g' <biosample.txt > out.txt

but the resulting out.txt is messy and confusing.

I want each record on one line, with values extracted for the following headers only:

record name
Identifiers
Organism
strain
isolate
serovar
isolation source
collected by
collection date
geographic location
host
host disease
Accession
ID
potential_contaminant
sample type
Description

The value for each header should be picked from each record; records are separated from one another by a blank line.

Thanks


Solution

Here's a straightforward implementation with awk:

BEGIN   { print "record name,Identifiers,Organism,strain,isolate,serovar,"\
                "isolation source,collected by,collection date,"\
                "geographic location,host,host disease,Accession,ID,"\
                "potential_contaminant,sample type,Description"
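          RS="\r\n"       # input lines end in CRLF (multi-character RS is beyond POSIX awk; gawk supports it)
          ORS=""          # no automatic terminator; each output row gets an explicit "\n" below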
        }
sub(/^[0-9]*: /,"")     { r[1] = $0; next }
sub(/^Identifiers: /,""){ r[2] = $0; next }
sub(/^Organism: /,"")   { r[3] = $0; next }
/^ /                    { split($0, a, "=") }
/^ *\/strain=/          { r[4] = a[2] }
/^ *\/isolate=/         { r[5] = a[2] }
/^ *\/serovar=/         { r[6] = a[2] }
/^ *\/isolation source=/{ r[7] = a[2] }
/^ *\/collected by=/    { r[8] = a[2] }
/^ *\/collection date=/ { r[9] = a[2] }
/^ *\/geographic locati/{ r[10] = a[2] }
/^ *\/host=/            { r[11] = a[2] }
/^ *\/host disease=/    { r[12] = a[2] }
/^Accession:/           { r[13] = $2; r[14] = $4 }
/^ *\/potential_contami/{ r[15] = a[2] }
/^ *\/sample type=/     { r[16] = a[2] }
/^Description:/         { getline; r[17] = $0 }
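# A blank line ends a record: emit the 17 collected fields as one CSV row.
/^$/                    { if (r[1]) {   for (i = 1; i < 17; ++i) print r[i]","
                                        print r[i]"\n"  # i is 17 here: last field, no trailing comma
                                        delete r        # start the next record with empty fields
                                    }
                        }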
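
The script above leans on two awk idioms that may be easier to see in isolation: sub() used directly as a pattern (it returns the number of replacements, so a rule fires only when the prefix was found and stripped), and split() on "=" to pick out an attribute value. The following is a minimal sketch only; the two input lines are made up for illustration, use plain LF line endings, and are not taken from biosample.txt.

```shell
# Hypothetical two-line input; real records contain many more fields.
out=$(printf '%s\n' '12: Sample A' '    /strain="K-12"' |
awk '
  # sub() returns 1 when it replaced something, so it doubles as the pattern.
  sub(/^[0-9]*: /, "")  { print "name=" $0; next }
  # Attribute lines start with a space; a[2] holds the text after the first "=".
  /^ /                  { split($0, a, "=") }
  /^ *\/strain=/        { print "strain=" a[2] }
')
printf '%s\n' "$out"
```

Note that a[2] keeps any quotes around the value and stops at a second "=" if the value contains one.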


Answered By - Armali
Answer Checked By - Senaida (WPSolving Volunteer)