Issue
I am trying to find XML files containing a particular string. The files, however, are compressed as .gz. Essentially, I want to search through all of the .gz files in the directory without extracting them. Additionally, I would like to get only the names of the files that match the search pattern, not the matching lines themselves.
I have managed to get the following command to get me the matching output itself from a piped grep command:
gunzip -c *.xml.gz | grep 'idName="M"'
I would like to get the filenames, however. I read somewhere that the -l flag for grep will return the matching filename, but in this case it gives me a result saying (standard input). I assume this is because I need to be piping the filename from gunzip too, but how do I do that?
Edit: I have also had partial success with
gunzip -vc *.xml.gz | grep 'idName="M"'
but this gives me output like
filename_X: 30% -- replaced with stdout
filename_Y: 50% -- replaced with stdout
filename_Z: complete matching output
I would like to suppress the matching output too in this case, and not show all the non-matching filenames.
Solution
The zgrep family of tools exists exactly for this use case.
zgrep -l 'idName="M"' *.xml.gz
If you need the same for *.zip files, look for zipgrep.
If the pattern you are searching for is just a static string, not a regular expression, you can speed up processing by using the -F flag (aka legacy fgrep). This can make a substantial difference if the files are big.
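As a minimal sketch of the fixed-string variant, the following builds two small sample archives (the file names and the scratch directory are made up for the demo) and lists only the one that contains the literal string:

```shell
# Build two small sample archives in a scratch directory (hypothetical names).
dir=$(mktemp -d)
printf '<item idName="M"/>\n' | gzip -c >"$dir/match.xml.gz"
printf '<item idName="X"/>\n' | gzip -c >"$dir/other.xml.gz"

# -l prints only the names of files with a match;
# -F treats the pattern as a literal string rather than a regular expression.
matches=$(cd "$dir" && zgrep -l -F 'idName="M"' *.xml.gz)
printf '%s\n' "$matches"

# Clean up the scratch directory.
rm -rf "$dir"
```

Only match.xml.gz should be listed here, since other.xml.gz does not contain the string.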
If you need this for a file type for which you can't find an existing tool that provides this functionality, a crude implementation looks something like
regex=$1
shift
for file; do
    gzip -dc <"$file" |
    sed -n "/$regex/s|^|$file:|p"
done
... with various complications to handle different options, etc; and with the caveat that this simple sed script has robustness issues in a number of corner cases (the regex can't contain a slash, and the file name can't contain a literal | or a newline).
If you have GNU grep, try something like
regex=$1
options=$(... complex logic to extract grep options ...)
shift
for file; do
    gzip -dc <"$file" |
    grep --label="$file" -H -e "$regex" $options
done
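To illustrate what --label changes, here is a small sketch that assumes GNU grep and uses a hypothetical sample file created just for the demo. Combined with -l, the label is reported where grep would otherwise say (standard input), which is the problem described in the question:

```shell
# Assumes GNU grep; sample.xml.gz is a made-up file for this demo.
dir=$(mktemp -d)
file=$dir/sample.xml.gz
printf '<item idName="M"/>\n' | gzip -c >"$file"

# Without --label, grep only knows that its input came from a pipe.
plain=$(gzip -dc <"$file" | grep -l -e 'idName="M"')

# With --label, grep reports the given name for standard input instead.
labeled=$(gzip -dc <"$file" | grep -l --label="$file" -e 'idName="M"')

printf '%s\n%s\n' "$plain" "$labeled"

# Clean up the scratch directory.
rm -rf "$dir"
```

The loop above applies the same idea to each file in turn.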
In your particular case, this can be reduced to just
regex=$1
shift
for file; do
    gzip -dc <"$file" |
    grep -q "$regex" &&
    echo "$file"
done
without any GNUisms.
Obviously, you'd replace gzip -dc with whatever you need to extract the information from the file type you want to process.
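One way to make that substitution explicit is to pass the extraction command as a parameter. The following is only a sketch: list_matching is a hypothetical helper name, and it assumes the extraction command reads the file on standard input and writes the extracted content to standard output.

```shell
# list_matching DECOMPRESS_CMD REGEX FILE...
# Hypothetical helper: prints the name of every FILE whose extracted
# content matches REGEX. DECOMPRESS_CMD must read stdin and write stdout.
list_matching() {
    decompress=$1
    regex=$2
    shift 2
    for file; do
        # $decompress is deliberately unquoted so that a value such as
        # "gzip -dc" is split into the command and its options.
        if $decompress <"$file" | grep -q -e "$regex"; then
            echo "$file"
        fi
    done
}
```

For example, list_matching 'gzip -dc' 'idName="M"' *.xml.gz behaves like the loop above, and passing 'bzip2 -dc' or 'xz -dc' instead would cover .bz2 or .xz files, assuming those tools are installed.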
Answered By - tripleee Answer Checked By - Candace Johnson (WPSolving Volunteer)