Issue
I have a sequence file that has a repeated pattern that looks like this:
$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTTAEPTPTTIATETTIATSDSSKTTTIDNITTTTSEAESNTKTESSTIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS
$>g104 | effector probability: 0.65
GIFSSLICATTAVTTGIICHGTVTLATGGTCALATLPAPTTSIAQTRTTTDTSEH
$>g115 | effector probability: 0.99
IAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS
and so on. I want to extract each record (the >g## header line plus the sequence that follows it) into a new file titled protein_g##.faa. In the above example it would create a file called "protein_g34.faa" and it would be:
$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTTAEPTPTTIATETTIATSDSSKTTTIDNITTTTSEAESNTKTESSTIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS
I was trying to use sed but I am not very experienced using it. My guess was something like this:
$ sed -n '/^>g*/s///p; y/ /\n/' file > "g##"
but I can clearly tell that that is wrong... maybe the right thing is using awk? Thanks!
Solution
Yeah, I would use awk for that. As far as I know, sed's w command can only write to file names that are spelled out in the script, so it can't build a file name from the contents of a line.
Here's how I would write that:
< input.txt awk '/^\$>/{fname = "protein_" substr($1, 3) ".faa"; print "sending to " fname} {print $0 > fname}'
Breaking it down into details:
- < input.txt
  This part reads in the input file.
- awk
  Runs awk.
- /^\$>/
  On lines which start with the literal string $>, run the piece of code in braces.
- (If previous step matched) {fname = "protein_" substr($1, 3) ".faa"; print "sending to " fname}
  Take the first field of the matching line. Remove the first two characters of that field. Surround that with protein_ and .faa. Save it as the variable fname. Print a message about switching files.
- {print $0 > fname}
  This next block has no condition before it. Implicitly, that means that it matches every line. Take the entire line, and send it to the filename held by fname. If no file has been selected yet (that is, the input does not start with a $> line), this will cause an error.
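If the input might contain lines before the first header, or a very large number of records, a slightly more defensive variant may be worth considering. The sketch below is only one way to do it, not a required change: it skips anything that comes before the first $> line and closes each output file before moving to the next, since some awk implementations limit how many files can be open at once. The sample input is a shortened, made-up stand-in for the real data, and the script writes into the current directory, so try it in an empty scratch directory.

```shell
# Small sample input: one stray line, then two shortened records.
cat > input.txt <<'EOF'
stray line before any header
$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTT
$>g104 | effector probability: 0.65
GIFSSLICATTAVTTGIIC
EOF

awk '
/^\$>/ {
    if (fname != "") close(fname)            # release the previous output file
    fname = "protein_" substr($1, 3) ".faa"  # "$>g34" -> "protein_g34.faa"
}
fname != "" { print > fname }                # ignore lines seen before the first header
' input.txt
```

One caveat: because each file is closed when the next header is seen, a repeated ID would reopen its file and overwrite the earlier record. If IDs can repeat, using >> instead of > should append instead, at the cost of also appending to files left over from an earlier run.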
Hope that helps!
Answered By - Nick ODell Answer Checked By - Clifford M. (WPSolving Volunteer)