Monday, November 1, 2021

[SOLVED] While read loop appending

Issue

I have to extract the lines of file1 that match a list of words in file2.

I'm wondering what's the difference between doing:

while read line; do grep "${line}" file1; done < file2 > output

while read line; do grep "${line}" file1 >> output; done < file2

Which one is correct and fastest?

Is there any other faster way of doing this than a loop?

Both files I'm working with are large: file1 has 536864856 lines and file2 has 1947 lines.

file1 (look at $7)

NC_045027.1     29500101        T/A     NC_045027.1:29500101    A       101232882       XM_032744187.1  Transcript     3_prime_UTR_variant                                 2764     -       -       -       -       -       MODIFIER        -       -1      -       ARL14EPL        -       -      -                                                   ZFgenomic_tabixprep_nomiRNA.gff.gz       -       -
NC_045027.1     29500102        G/A     NC_045027.1:29500102    A       101232882       XM_032744187.1  Transcript     3_prime_UTR_variant                                 2763     -       -       -       -       -       MODIFIER        -       -1      -       ARL14EPL        -       -      -                                                   ZFgenomic_tabixprep_nomiRNA.gff.gz       -       -
NC_045027.1     29500103        C/A     NC_045027.1:29500103    A       101232882       XM_032744187.1  Transcript     3_prime_UTR_variant                                 2762     -       -       -       -       -       MODIFIER        -       -1      -       ARL14EPL        -       -      -                                                   ZFgenomic_tabixprep_nomiRNA.gff.gz       -       -
NC_045027.1     29500104        C/A     NC_045027.1:29500104    A       101232882       XM_032744187.1  Transcript     3_prime_UTR_variant                                 2761     -       -       -       -       -       MODIFIER        -       -1      -       ARL14EPL        -       -      -                                                   ZFgenomic_tabixprep_nomiRNA.gff.gz       -       -
NC_045027.1     29500105        A/C     NC_045027.1:29500105    C       101232882       XM_032744187.1  Transcript     3_prime_UTR_variant                                 2760     -       -       -       -       -       MODIFIER        -       -1      -       ARL14EPL        -       -      -                                                   ZFgenomic_tabixprep_nomiRNA.gff.gz       -       -
NC_045027.1     29500106        A/C     NC_045027.1:29500106    C       101232882       XM_032744187.1  Transcript     3_prime_UTR_variant                                 2759     -       -       -       -       -       MODIFIER        -       -1      -       ARL14EPL        -       -      -                                                   ZFgenomic_tabixprep_nomiRNA.gff.gz       -       -
NC_045027.1     29500107        G/A     NC_045027.1:29500107    A       101232882       XM_032744187.1  Transcript     3_prime_UTR_variant                                 2758     -       -       -       -       -       MODIFIER        -       -1      -       ARL14EPL        -       -      -                                                   ZFgenomic_tabixprep_nomiRNA.gff.gz       -       -
NC_045027.1     29500108        T/A     NC_045027.1:29500108    A       101232882       XM_032744187.1  Transcript     3_prime_UTR_variant                                 2757     -       -       -       -       -       MODIFIER        -       -1      -       ARL14EPL        -       -      -                                                   ZFgenomic_tabixprep_nomiRNA.gff.gz       -       -
NC_045027.1     29500109        G/A     NC_045027.1:29500109    A       101232882       XM_032744187.1  Transcript     3_prime_UTR_variant                                 2756     -       -       -       -       -       MODIFIER        -       -1      -       ARL14EPL        -       -      -                                                   ZFgenomic_tabixprep_nomiRNA.gff.gz       -       -
NC_045027.1     29500110        G/A     NC_045027.1:29500110    A       101232882       XM_032744187.1  Transcript     3_prime_UTR_variant                                 2755     -       -       -       -       -       MODIFIER        -       -1      -       ARL14EPL        -       -      -                                                   ZFgenomic_tabixprep_nomiRNA.gff.gz       -       -
NC_044998.1     16147   C/A     NC_044998.1:16147       A       100221041       XM_030285707.2  Transcript      3_prime_UTR_variant                                        7416     -       -       -       -       -       MODIFIER        -       -1      -       LOC100221041    -       -      -                                                   ZFgenomic_tabixprep_nomiRNA.gff.gz       -       -
NC_044998.1     16147   C/A     NC_044998.1:16147       A       100221041       XM_030285715.2  Transcript      3_prime_UTR_variant                                        7234     -       -       -       -       -       MODIFIER        -       -1      -       LOC100221041    -       -      -                                                   ZFgenomic_tabixprep_nomiRNA.gff.gz       -       -
NC_044998.1     16147   C/A     NC_044998.1:16147       A       100221041       XM_030285720.2  Transcript      3_prime_UTR_variant                                        7110     -       -       -       -       -       MODIFIER        -       -1      -       LOC100221041    -       -      -                                                   ZFgenomic_tabixprep_nomiRNA.gff.gz       -       -
NC_044998.1     16147   C/A     NC_044998.1:16147       A       100221041       XM_030285728.2  Transcript      3_prime_UTR_variant                                        6856     -       -       -       -       -       MODIFIER        -       -1      -       LOC100221041    -       -      -                                                   ZFgenomic_tabixprep_nomiRNA.gff.gz       -       -
NC_044998.1     16147   C/A     NC_044998.1:16147       A       100221041       XM_030285733.2  Transcript      intron_variant                                             --       -       -       -       -       MODIFIER        -       -1      -       LOC100221041    -       -       -      ZFgenomic_tabixprep_nomiRNA.gff.gz                  --
NC_044998.1     16147   C/A     NC_044998.1:16147       A       100221041       XM_030285738.2  Transcript      3_prime_UTR_variant                                        6637     -       -       -       -       -       MODIFIER        -       -1      -       LOC100221041    -       -      -                                                   ZFgenomic_tabixprep_nomiRNA.gff.gz       -       -
NC_044998.1     16147   C/A     NC_044998.1:16147       A       100221041       XM_030285750.2  Transcript      3_prime_UTR_variant                                        6348     -       -       -       -       -       MODIFIER        -       -1      -       LOC100221041    -       -      -                                                   ZFgenomic_tabixprep_nomiRNA.gff.gz       -       -
NC_044998.1     16147   C/A     NC_044998.1:16147       A       100221041       XM_030285760.2  Transcript      3_prime_UTR_variant                                        7209     -       -       -       -       -       MODIFIER        -       -1      -       LOC100221041    -       -      -                                                   ZFgenomic_tabixprep_nomiRNA.gff.gz       -       -
NC_044998.1     16148   A/C     NC_044998.1:16148       C       100221041       XM_030285707.2  Transcript      3_prime_UTR_variant                                        7415     -       -       -       -       -       MODIFIER        -       -1      -       LOC100221041    -       -      -                                                   ZFgenomic_tabixprep_nomiRNA.gff.gz       -       -
NC_044998.1     16148   A/C     NC_044998.1:16148       C       100221041       XM_030285715.2  Transcript      3_prime_UTR_variant                                        7233     -       -       -       -       -       MODIFIER        -       -1      -       LOC100221041    -       -      -                                                   ZFgenomic_tabixprep_nomiRNA.gff.gz       -       -

file2

XM_032744187.1
XM_030272916.2
XM_032747381.1
XM_030265061.2
XM_030271469.2
XM_030272412.2
XM_032747456.1

Solution

while read line; do grep "${line}" file1; done < file2 > output

while read line; do grep "${line}" file1 >> output; done < file2

Which one is correct and fastest?

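The first one, because it opens the output file only once, whereas >> output inside the loop reopens the output file for every line in file2. The matched lines are the same either way, but note that > truncates output before the loop starts, while >> appends to whatever output already contains.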
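To make the comparison concrete, here is a minimal sketch that runs both forms on tiny made-up inputs. The temporary directory, the sample contents, and the names out1 and out2 are assumptions for illustration only, not part of the question.

```shell
# Sketch only: tiny hypothetical stand-ins for file1 and file2.
tmp=$(mktemp -d)
printf 'a 1\nb 2\nc 3\n' > "$tmp/file1"
printf 'a\nc\n' > "$tmp/file2"

# Form 1: the redirection is on the loop, so out1 is opened (and truncated) once.
while read line; do grep "${line}" "$tmp/file1"; done < "$tmp/file2" > "$tmp/out1"

# Form 2: out2 is opened in append mode on every iteration.
# Remove any stale copy first, because >> never truncates.
rm -f "$tmp/out2"
while read line; do grep "${line}" "$tmp/file1" >> "$tmp/out2"; done < "$tmp/file2"

# The two outputs should be byte-for-byte identical.
cmp "$tmp/out1" "$tmp/out2" && echo "identical"
```

With 1947 lines in file2 the cost of reopening the output is likely small next to 1947 full scans of file1, which is why replacing the loop matters more than the choice of redirection.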

Is there any other faster way of doing this than a loop?

Based on the updated information in the question, this awk command gives exact matches, which grep -Ff cannot guarantee because it matches the fixed string anywhere in the line. awk should be fast too: it reads file1 only once, keeps just the first column of the smaller file (file2) in memory, and does a non-regex string comparison against $7 of each line of file1:

awk 'FNR == NR {seen[$1]; next} $7 in seen' file2 file1 > output
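As a rough illustration of why the exact comparison on $7 matters, the sketch below runs the awk command and a fixed-string grep on made-up data. The IDs XM_1.1, XM_1.10 and XM_2.1, the shortened rows, and the output file names are hypothetical, chosen so that one ID is a prefix of another.

```shell
# Sketch only: hypothetical rows with the transcript ID in column 7.
tmp=$(mktemp -d)
printf '%s\n' \
  'chr 1 T/A id1 A 100 XM_1.1 Transcript' \
  'chr 2 G/A id2 A 100 XM_1.10 Transcript' \
  'chr 3 C/A id3 A 100 XM_2.1 Transcript' > "$tmp/file1"
printf 'XM_1.1\n' > "$tmp/file2"

# Exact comparison of $7 against the IDs loaded from file2.
awk 'FNR == NR {seen[$1]; next} $7 in seen' "$tmp/file2" "$tmp/file1" > "$tmp/awk_out"

# Fixed-string search anywhere in the line, for comparison.
grep -Ff "$tmp/file2" "$tmp/file1" > "$tmp/grep_out"

echo "awk:";  cat "$tmp/awk_out"    # only the XM_1.1 row
echo "grep:"; cat "$tmp/grep_out"   # the XM_1.1 row and the XM_1.10 row
```

Here grep also returns the XM_1.10 row because XM_1.1 is a substring of it, while awk keeps only the row whose seventh field equals the ID.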


Answered By - anubhava