Issue
I have a text file in which each record starts with a number and a name and ends with a blank line. I would like one record per row, as comma-separated values. I tried the following code; the script and the input file are attached: biosample.txt, sark.awk
The Unix command to run the code is:
gawk -f sark.awk biosample.txt
then run:
sed 's/,,/\n/g' <biosample.txt > out.txt
but the resulting out.txt is messy and confusing.
I want each record on one line, with values extracted for the following headers only:
record name
Identifiers
Organism
strain
isolate
serovar
isolation source
collected by
collection date
geographic location
host
host disease
Accession
ID
potential_contaminant
sample type
Description
The values for each header should be picked from each record; records are separated by a blank line.
Thanks
Solution
Here's a straightforward implementation with awk:
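BEGIN {
    # Header row; printed before ORS is changed, so it still ends with a newline.
    print "record name,Identifiers,Organism,strain,isolate,serovar,"\
          "isolation source,collected by,collection date,"\
          "geographic location,host,host disease,Accession,ID,"\
          "potential_contaminant,sample type,Description"
    RS = "\r\n"   # input lines are assumed to end in CRLF
    ORS = ""      # separators are written explicitly below
}
# Top-level lines: strip the label and keep the rest of the line.
sub(/^[0-9]*: /, "")     { r[1] = $0; next }
sub(/^Identifiers: /, "") { r[2] = $0; next }
sub(/^Organism: /, "")   { r[3] = $0; next }
# Attribute lines are indented and have the form /key=value; a[2] holds the value.
/^ /                     { split($0, a, "=") }
/^ *\/strain=/           { r[4] = a[2] }
/^ *\/isolate=/          { r[5] = a[2] }
/^ *\/serovar=/          { r[6] = a[2] }
/^ *\/isolation source=/ { r[7] = a[2] }
/^ *\/collected by=/     { r[8] = a[2] }
/^ *\/collection date=/  { r[9] = a[2] }
/^ *\/geographic locati/ { r[10] = a[2] }
/^ *\/host=/             { r[11] = a[2] }
/^ *\/host disease=/     { r[12] = a[2] }
# The Accession line carries both the accession and the ID.
/^Accession:/            { r[13] = $2; r[14] = $4 }
/^ *\/potential_contami/ { r[15] = a[2] }
/^ *\/sample type=/      { r[16] = a[2] }
# The description text is on the line after the label.
/^Description:/          { getline; r[17] = $0 }
# A blank line ends the record: print the 17 collected fields as one CSV row.
/^$/ {
    if (r[1]) {
        for (i = 1; i < 17; ++i) print r[i] ","
        print r[i] "\n"
        delete r
    }
}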
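The script above relies on CRLF line endings and on a blank line after the last record. As a minimal sketch of the same pattern that tolerates both LF and CRLF input and also flushes a final record that has no trailing blank line, the following reduces the output to five columns to stay short. The two-record sample, its values, and the emit function are hypothetical and only assume the layout described in the question; they are not taken from the real biosample.txt.

```shell
# Hypothetical two-record sample (LF line endings, no blank line after the last record).
sample=$(mktemp)
printf '%s\n' \
  '1: Sample A' \
  'Identifiers: BioSample: SAMN00000001' \
  'Organism: Salmonella enterica' \
  'Attributes:' \
  '    /strain="S1"' \
  '    /host="Homo sapiens"' \
  'Accession: SAMN00000001 ID: 101' \
  '' \
  '2: Sample B' \
  'Organism: Escherichia coli' \
  '    /strain="E2"' \
  'Accession: SAMN00000002 ID: 102' > "$sample"

out=$(awk '
  # Print the collected fields as one CSV row, then reset the record.
  function emit(   i) {
    if (!(1 in r)) return            # nothing collected since the last row
    for (i = 1; i < 5; ++i) printf "%s,", r[i]
    print r[5]
    split("", r)                     # portable way to empty the array
  }
  { sub(/\r$/, "") }                 # drop a trailing CR so CRLF input also works
  sub(/^[0-9]+: /, "")   { r[1] = $0; next }
  sub(/^Organism: /, "") { r[2] = $0; next }
  /^ *\/strain=/         { split($0, a, "="); r[3] = a[2]; next }
  /^Accession:/          { r[4] = $2; r[5] = $4; next }
  /^$/                   { emit() }  # a blank line ends a record
  END                    { emit() }  # flush a last record without a blank line
' "$sample")

printf '%s\n' "$out"
rm -f "$sample"
```

The remaining columns can be added in the same way, one pattern per header, with the loop bound and the final index raised to match.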
Answered By - Armali Answer Checked By - Senaida (WPSolving Volunteer)