Issue
I have a set of files (/c/Users/Roy/DataReceived) that I want to grep for some information, storing the results as txt files (/c/Users/Roy/Documents/Result).
For example purposes: imagine I have 20 files with different information about cities, and I want to grep the information for the cities listed in a txt file. The matches for each city would then be stored in a separate txt file named after that city (NewYork.txt, Rome.txt, etc.).
The following code is working:
#!/bin/bash
declare INPUT_DIRECTORY=/c/Users/Roy/DataReceived
declare OUTPUT_DIRECTORY=/c/Users/Roy/Documents/Result
while read -r city; do
    echo "$city"
    zgrep -Hwi "$city" "${INPUT_DIRECTORY}/"*.vcf.gz > "${OUTPUT_DIRECTORY}/${city}.txt"
done < list_of_cities.txt
However, this process takes around a week to run fully. My question is: is there a way to decompress the files just once, using awk for example? That alone would make the process roughly twice as fast.
Also, is there any other way of optimizing the process?
Solution
It's a little tricky but the following code should be several times faster than your solution:
zgrep -Hwif list_of_cities.txt /c/Users/Roy/DataReceived/*.vcf.gz |
awk -F ':' '
NR == FNR {
    # first input (list_of_cities.txt): build one alternation regex from all city names
    regex = regex sep "(" $0 ")"
    sep = "|"
    next
}
match($NF,regex) {
    # second input (the zgrep output on stdin): route each matching line to its city file
    city = tolower(substr($NF,RSTART,RLENGTH))
    print > ("/c/Users/Roy/Documents/Result/" city ".txt")
}
' list_of_cities.txt -
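One practical caveat with the print > file redirection: awk keeps every city's output file open at the same time, and with a long city list a non-GNU awk can fail with a "too many open files" error (GNU awk manages the descriptors for you). A minimal sketch of the usual workaround, appending and closing after each write:
zgrep -Hwif list_of_cities.txt /c/Users/Roy/DataReceived/*.vcf.gz |
awk -F ':' '
NR == FNR {
    regex = regex sep "(" $0 ")"
    sep = "|"
    next
}
match($NF,regex) {
    city = tolower(substr($NF,RSTART,RLENGTH))
    out = "/c/Users/Roy/Documents/Result/" city ".txt"
    print >> out    # append so reopening the file does not truncate earlier matches
    close(out)      # release the descriptor before reading the next line
}
' list_of_cities.txt -
Since this appends, empty the Result directory before each run, and expect it to be somewhat slower than keeping the files open.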
But if your list_of_cities.txt only contains literal city names (and not regexps), then it'll be faster to do something like this:
zgrep -HwiFf list_of_cities.txt /c/Users/Roy/DataReceived/*.vcf.gz |
awk -F ':' '
NR == FNR {
    # first input (list_of_cities.txt): remember every city name as an array key
    cities[$0]
    next
}
{
    # second input (the zgrep output on stdin): split the matched line into words
    n = split($NF, words, "[^[:alnum:]_]+")
    for (i = 1; i <= n; i++)
        if (words[i] in cities) {
            city = tolower(words[i])
            break
        }
    print > ("/c/Users/Roy/Documents/Result/" city ".txt")
}
' list_of_cities.txt -
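Note that zgrep -i matches the city names case-insensitively while the awk lookup is case-sensitive, so a line containing NEWYORK would not be routed when the list says NewYork. If that matters for your data, here is a sketch of a variant that lower-cases both sides of the lookup (and only prints when a city is actually found):
zgrep -HwiFf list_of_cities.txt /c/Users/Roy/DataReceived/*.vcf.gz |
awk -F ':' '
NR == FNR {
    cities[tolower($0)]            # store the city names lower-cased
    next
}
{
    n = split($NF, words, "[^[:alnum:]_]+")
    for (i = 1; i <= n; i++)
        if (tolower(words[i]) in cities) {   # compare lower-cased words
            city = tolower(words[i])
            print > ("/c/Users/Roy/Documents/Result/" city ".txt")
            break
        }
}
' list_of_cities.txt -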
Limitation: if the matched lines can contain a : character then the awk scripts above will break.
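If the matched lines may legitimately contain colons, one possible workaround (a sketch, assuming the file names themselves contain no : character, which holds for the /c/Users/... paths above) is to drop the -F ':' splitting and instead strip only the leading "filename:" prefix that zgrep -H adds:
zgrep -Hwif list_of_cities.txt /c/Users/Roy/DataReceived/*.vcf.gz |
awk '
NR == FNR {
    regex = regex sep "(" $0 ")"
    sep = "|"
    next
}
{
    line = $0
    sub(/^[^:]*:/, "", line)       # remove only the file name prefix, keep embedded colons
    if (match(line, regex)) {
        city = tolower(substr(line, RSTART, RLENGTH))
        print > ("/c/Users/Roy/Documents/Result/" city ".txt")
    }
}
' list_of_cities.txt -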
Answered By - Fravadona
Answer Checked By - David Marino (WPSolving Volunteer)