Issue
I am using grep in a while loop to find lines from one file in another file and saving the output to a new file. My file is quite large (226 million lines) and the script is taking forever (12 days and counting). Do you have a suggestion to speed it up, perhaps there is a better way rather than grep?
(I also need the preceding line for the output, therefore grep -B 1.)
Here is my code:
#!/bin/bash
while IFS= read -r line; do
grep -B 1 $line K33.21mercounts.bf.trimmedreads.dumps.fa >> 21mercounts.bf.trimmedreads.diff.kmers.K33;
done <21mercounts.bf.trimmedreads.diff.kmers
Update:
The input file with the lines to look for is 4.7 GB and 226 mio lines and looks like this:
AAAGAAAAAAAAAGCTAAAAT
ATCTCGACGCTCATCTCAGCA
GTTCGTCGGAGAGGAGAGAAC
GAGGACTATAAAATTGTCGCA
GGCTTCAATAATTTGTATAAC
GACATAGAATCACGAGTGACC
TGGTGAGTGACATCCTTGACA
ATGAAAACTGCCAGCAAACTC
AAAAAACTTACCTTAAAAAGT
TTAGTACACAATATCTCCCAA
The file to look in is 26 GB and 2 billion lines and looks like this:
>264638
AAAAAAAAAAAAAAAAAAAAA
>1
AAAGAAAAAAAAAGCTAAAAT
>1
ATCTCGACGCTCATCTCAGCA
>1
GTTCGTCGGAGAGGAGAGAAC
>28
TCTTTTCAGGAGTAATAACAA
>13
AATCATTTTCCGCTGGAGAGA
>38
ATTCAATAAATAATAAATTAA
>2
GAGGACTATAAAATTGTCGCA
>1
GGCTTCAATAATTTGTATAAC
The expected output would be this:
>1
AAAGAAAAAAAAAGCTAAAAT
>1
ATCTCGACGCTCATCTCAGCA
>1
GTTCGTCGGAGAGGAGAGAAC
>2
GAGGACTATAAAATTGTCGCA
>1
GGCTTCAATAATTTGTATAAC
Solution
You can try this grep -f
command without shell loop and using a fixed string search:
grep -B1 -Ff 21mercounts.bf.trimmedreads.diff.kmers \
K33.21mercounts.bf.trimmedreads.dumps.fa > 21mercounts.bf.trimmedreads.diff.kmers.K33
Answered By - anubhava Answer Checked By - Katrina (WPSolving Volunteer)