Friday, January 26, 2024

[SOLVED] Fastest way to find lines from a large file in another file

Issue

I am using grep in a while loop to find lines from one file in another file and saving the output to a new file. The file with the lines to look for is quite large (226 million lines) and the script is taking forever (12 days and counting). Do you have a suggestion to speed it up, perhaps a better way than grep?

(I also need the preceding line for the output, therefore grep -B 1.)

Here is my code:

#!/bin/bash

while IFS= read -r line; do
  # each iteration starts a new grep that scans the whole 26 GB file
  grep -B 1 "$line" K33.21mercounts.bf.trimmedreads.dumps.fa >> 21mercounts.bf.trimmedreads.diff.kmers.K33
done < 21mercounts.bf.trimmedreads.diff.kmers

Update:

The input file with the lines to look for is 4.7 GB with 226 million lines and looks like this:

AAAGAAAAAAAAAGCTAAAAT
ATCTCGACGCTCATCTCAGCA
GTTCGTCGGAGAGGAGAGAAC
GAGGACTATAAAATTGTCGCA
GGCTTCAATAATTTGTATAAC
GACATAGAATCACGAGTGACC
TGGTGAGTGACATCCTTGACA
ATGAAAACTGCCAGCAAACTC
AAAAAACTTACCTTAAAAAGT
TTAGTACACAATATCTCCCAA

The file to look in is 26 GB with 2 billion lines and looks like this:

>264638
AAAAAAAAAAAAAAAAAAAAA
>1
AAAGAAAAAAAAAGCTAAAAT
>1
ATCTCGACGCTCATCTCAGCA
>1
GTTCGTCGGAGAGGAGAGAAC
>28
TCTTTTCAGGAGTAATAACAA
>13
AATCATTTTCCGCTGGAGAGA
>38
ATTCAATAAATAATAAATTAA
>2
GAGGACTATAAAATTGTCGCA
>1
GGCTTCAATAATTTGTATAAC

The expected output would be this:

>1
AAAGAAAAAAAAAGCTAAAAT
>1
ATCTCGACGCTCATCTCAGCA
>1
GTTCGTCGGAGAGGAGAGAAC
>2
GAGGACTATAAAATTGTCGCA
>1
GGCTTCAATAATTTGTATAAC

Solution

You can try a single grep -Ff command, without the shell loop and using a fixed-string search:

grep -B1 -Ff 21mercounts.bf.trimmedreads.diff.kmers \
 K33.21mercounts.bf.trimmedreads.dumps.fa > 21mercounts.bf.trimmedreads.diff.kmers.K33
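This works in a single pass: grep reads all the search strings from the k-mer file once and then scans the large FASTA file one time instead of 226 million times, and -F makes it treat each line as a literal string rather than a regular expression. One caveat: with -B1, GNU grep prints a -- separator line between non-adjacent context groups, which can be suppressed with --no-group-separator if the output must look exactly like the expected sample.

The same single-pass idea can also be written in awk, which avoids the -- separators and makes the header + sequence pairing explicit. A rough sketch, using the filenames from the question and assuming whole-line matches are acceptable (they are here, since every sequence line is a complete 21-mer):

awk 'NR==FNR { want[$0]; next }        # first file: remember each k-mer
     $0 in want { print prev; print }  # match: print the preceding ">" header, then the k-mer
     { prev = $0 }' \
  21mercounts.bf.trimmedreads.diff.kmers \
  K33.21mercounts.bf.trimmedreads.dumps.fa \
  > 21mercounts.bf.trimmedreads.diff.kmers.K33

Either way, all 226 million k-mers are held in memory at once, so expect substantial RAM use (likely tens of gigabytes) on a file this size.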


Answered By - anubhava
Answer Checked By - Katrina (WPSolving Volunteer)