Issue
I have a set of files (/c/Users/Roy/DataReceived) that I want to grep for some information, storing the results as txt files (/c/Users/Roy/Documents/Result).
For example purposes: imagine I have 20 files with different information about cities, and I want to grep the information for the cities listed in a txt file. The matches for each city would then be stored in a separate txt file named after that city (NewYork.txt, Rome.txt, etc.).
The following code is working:
#!/bin/bash
declare INPUT_DIRECTORY=/c/Users/Roy/DataReceived
declare OUTPUT_DIRECTORY=/c/Users/Roy/Documents/Result
while read -r city; do
    echo "$city"
    zgrep -Hwi "$city" "${INPUT_DIRECTORY}/"*.vcf.gz > "${OUTPUT_DIRECTORY}/${city}.txt"
done < list_of_cities.txt
However, this process takes around a week to run fully. My question is: is there a way to decompress the files just once, using awk for example? That alone would make the process roughly twice as fast.
Also, is there any other way of optimizing the process?
Solution
It's a little tricky but the following code should be several times faster than your solution:
zgrep -Hwif list_of_cities.txt /c/Users/Roy/DataReceived/*.vcf.gz |
awk -F ':' '
NR == FNR {
    # first input (list_of_cities.txt): build one alternation regex from all city names
    regex = regex sep "(" $0 ")"
    sep = "|"
    next
}
match($NF,regex) {
    # second input (the zgrep output on stdin): route each matching line to its city file
    city = tolower(substr($NF,RSTART,RLENGTH))
    print > ("/c/Users/Roy/Documents/Result/" city ".txt")
}
' list_of_cities.txt -
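One practical caveat with the print > file redirection: awk keeps every city's output file open at the same time, and with a long city list a non-GNU awk can fail with a "too many open files" error (GNU awk manages the descriptors for you). A minimal sketch of the usual workaround, appending and closing after each write:
zgrep -Hwif list_of_cities.txt /c/Users/Roy/DataReceived/*.vcf.gz |
awk -F ':' '
NR == FNR {
    regex = regex sep "(" $0 ")"
    sep = "|"
    next
}
match($NF,regex) {
    city = tolower(substr($NF,RSTART,RLENGTH))
    out = "/c/Users/Roy/Documents/Result/" city ".txt"
    print >> out    # append so reopening the file does not truncate earlier matches
    close(out)      # release the descriptor before reading the next line
}
' list_of_cities.txt -
Since this appends, empty the Result directory before each run, and expect it to be somewhat slower than keeping the files open.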
But if your list_of_cities.txt only contains literal city names (and not regexps), then it'll be faster to do something like this:
zgrep -HwiFf list_of_cities.txt /c/Users/Roy/DataReceived/*.vcf.gz |
awk -F ':' '
NR == FNR {
    # first input (list_of_cities.txt): remember every city name as an array key
    cities[$0]
    next
}
{
    # second input (the zgrep output on stdin): split the matched line into words
    n = split($NF, words, "[^[:alnum:]_]+")
    for (i = 1; i <= n; i++)
        if (words[i] in cities) {
            city = tolower(words[i])
            break
        }
    print > ("/c/Users/Roy/Documents/Result/" city ".txt")
}
' list_of_cities.txt -
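Note that zgrep -i matches the city names case-insensitively while the awk lookup is case-sensitive, so a line containing NEWYORK would not be routed when the list says NewYork. If that matters for your data, here is a sketch of a variant that lower-cases both sides of the lookup (and only prints when a city is actually found):
zgrep -HwiFf list_of_cities.txt /c/Users/Roy/DataReceived/*.vcf.gz |
awk -F ':' '
NR == FNR {
    cities[tolower($0)]            # store the city names lower-cased
    next
}
{
    n = split($NF, words, "[^[:alnum:]_]+")
    for (i = 1; i <= n; i++)
        if (tolower(words[i]) in cities) {   # compare lower-cased words
            city = tolower(words[i])
            print > ("/c/Users/Roy/Documents/Result/" city ".txt")
            break
        }
}
' list_of_cities.txt -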
Limitation: if the matched lines can contain a : character then the awk scripts above will break.
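If the matched lines may legitimately contain colons, one possible workaround (a sketch, assuming the file names themselves contain no : character, which holds for the /c/Users/... paths above) is to drop the -F ':' splitting and instead strip only the leading "filename:" prefix that zgrep -H adds:
zgrep -Hwif list_of_cities.txt /c/Users/Roy/DataReceived/*.vcf.gz |
awk '
NR == FNR {
    regex = regex sep "(" $0 ")"
    sep = "|"
    next
}
{
    line = $0
    sub(/^[^:]*:/, "", line)       # remove only the file name prefix, keep embedded colons
    if (match(line, regex)) {
        city = tolower(substr(line, RSTART, RLENGTH))
        print > ("/c/Users/Roy/Documents/Result/" city ".txt")
    }
}
' list_of_cities.txt -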
Answered By - Fravadona
Answer Checked By - David Marino (WPSolving Volunteer)