Issue
I have two files:
$hashfile: hashes and ./relative/path/to/file names, one pair per line, separated by 2 spaces
$badfiles: ./relative/path/to/file names that I need to find in $hashfile to get the corresponding hash
Here is an extract of a $hashfile:
c2c99b59f3303cafac85c2c6df6653cc ./vm-mount.sh
058a8fb0b9366f248be32b7390e94595 ./Jerusalem_Canon EOS R5_20210601_031.jpg~
23eba1c54846de5244312047e2709f9a ./rsync-back.sh
ff3f08f7bf45f8e9ef8b33192db3ce9a ./vm-backup.sh
11e0d980f3b2219f65da97a0318e7dce ./Jerusalem_Canon EOS R5_20210601_031.jpg
49fb1fb660dce09acd87861a228c899d ./vm-test.sh
Here is an example of $badfiles containing the search patterns:
./Jerusalem_Canon EOS R5_20210601_031.jpg
./file.txt
I need to search the $hashfile for the patterns inside $badfiles and write the matching lines containing the hash to a third file $new.
So far I've used the following:
grep -Ff "$badfiles" "$hashfile" > "$new"
However, this will match both:
058a8fb0b9366f248be32b7390e94595 ./Jerusalem_Canon EOS R5_20210601_031.jpg~
11e0d980f3b2219f65da97a0318e7dce ./Jerusalem_Canon EOS R5_20210601_031.jpg
I then added a $ at the end of each line in $badfiles and changed the grep command to:
grep -f "$badfiles" "$hashfile" > "$new"
This worked on a small test folder, but I'm concerned that patterns which are no longer interpreted as fixed strings could create havoc on large file systems. I have some 300,000+ file names and hashes, some of which use special characters like "':,;<>()[]? - in short, any character that a Linux ext4 and/or Windows NTFS file system will accept.
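A minimal demonstration of that risk (the file names here are hypothetical): once the patterns are no longer fixed strings, a "[" or "." in a file name is treated as regex syntax, and the anchored pattern misses the very line it is meant to match.

```shell
# Hypothetical sample data: a file name containing "[" and "]".
printf 'abc123  ./file[1].txt\n' > hashfile
printf './file[1].txt$\n' > badfiles   # "$" appended as described above
# Without -F, "[1]" is a bracket expression matching the single char "1",
# so the pattern does not match the literal name "./file[1].txt":
grep -f badfiles hashfile || echo 'no match'
# no match
```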
Any ideas?
EDIT: Solution
Apparently grep does not provide a simple way to anchor a fixed-string (-F) pattern at the end of a line. @anubhava offered the best solution using awk:
awk 'NR == FNR {a[$0]; next}
{b=$0; sub(/^\S+\s+/, "", b)}
b in a' "$badfiles" "$hashfile" > "$new"
Note: $badfiles, $hashfile and $new are variables holding file names.
The above syntax is best described here under "Two-file processing". NR holds the number of lines read so far from all files, whereas FNR holds the number of lines read so far from the current file. So when awk has finished reading $badfiles and reads the first line of $hashfile, NR holds the sum of all lines read so far, while FNR equals 1, since this is the first line of a new file. {a[$0]; next} reads $badfiles into an array; next prevents the subsequent condition-action pairs from executing until the entire $badfiles has been read, that is, until NR == FNR becomes false.
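The NR/FNR behavior is easy to see in isolation with a toy example (file names here are hypothetical):

```shell
# NR counts lines across all input files; FNR restarts at 1 for each file.
printf 'a\nb\n' > first
printf 'x\n' > second
awk '{print FILENAME, "NR=" NR, "FNR=" FNR}' first second
# first NR=1 FNR=1
# first NR=2 FNR=2
# second NR=3 FNR=1
```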
When reading $hashfile, $0 (the line just read) is assigned to b (b=$0). sub(/^\S+\s+/, "", b) replaces one or more non-space characters (\S+) at the beginning of the line (^), followed by one or more space characters (\s+), with "" (the empty string) in variable b. This leaves only the ./path/to/file in variable b.
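That substitution can be tried on its own; note that \S and \s are GNU awk extensions, so the portable POSIX character classes are used in this sketch (the sample line is taken from the extract above):

```shell
# Strip the hash and the whitespace after it, leaving only the file path.
# [^[:blank:]]+ is the portable spelling of GNU awk's \S+, [[:blank:]]+ of \s+.
echo 'c2c99b59f3303cafac85c2c6df6653cc  ./vm-mount.sh' |
awk '{b = $0; sub(/^[^[:blank:]]+[[:blank:]]+/, "", b); print b}'
# ./vm-mount.sh
```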
The last line, b in a' "$badfiles" "$hashfile" > "$new", checks whether variable b is found in array a and, if so, copies the line from $hashfile to file $new. If all lines in $badfiles have a matching entry in $hashfile, the corresponding $hashfile lines with the hashes are copied to $new.
Since the hash value before the file name has a fixed length (a 32-character MD5 hash followed by two spaces), the awk statement can be simplified as follows:
awk 'NR == FNR {a[$0]; next}
{b=substr($0,35)}
b in a' "$badfiles" "$hashfile" > "$new"
The above substr() call takes the input line $0 and strips off the first 34 characters, counting from 1; the substring b then starts at position 35. This is much like substring extraction in bash, for example ${mystring:34}. Note that bash substring extraction starts counting at 0.
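The position arithmetic can be checked directly against a line from the extract above:

```shell
# A 32-character MD5 hash plus two spaces occupies positions 1-34,
# so the file name starts at position 35 (awk counts from 1).
echo 'c2c99b59f3303cafac85c2c6df6653cc  ./vm-mount.sh' |
awk '{print substr($0, 35)}'
# ./vm-mount.sh
```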
I now use a variation of that awk command to create a new hashfile containing all file hashes except for those listed in $deletedfiles:
awk 'NR == FNR {a[$0]; next}
{b=substr($0,35)}
!(b in a)' "$deletedfiles" "$hashfile" > "$new"
With the above command, every line of $hashfile whose string b is NOT found in $deletedfiles is copied to $new. Pay special attention to an empty $deletedfiles: if $deletedfiles is an empty file, the $new file will be empty too, whereas the expected result is a $new file identical to $hashfile.
This solution works very well (and fast), even with 200,000-300,000 filenames in one hashfile.
Solution
This awk solution should work for you:
awk 'FNR == NR {srch[$0]; next}
{s = $0; sub(/^[^[:blank:]]+[[:blank:]]+/, "", s)}
s in srch' badfiles hashfile
11e0d980f3b2219f65da97a0318e7dce ./Jerusalem_Canon EOS R5_20210601_031.jpg
This solution first stores all lines from badfiles in an array srch. Then, for each line of hashfile, it removes the text up to the first whitespace and prints the line if the remaining part is found in the srch array.
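The run above can be reproduced end to end with the sample data from the question (two spaces between hash and file name, as the question specifies):

```shell
# Sample data from the question.
cat > badfiles <<'EOF'
./Jerusalem_Canon EOS R5_20210601_031.jpg
./file.txt
EOF
cat > hashfile <<'EOF'
058a8fb0b9366f248be32b7390e94595  ./Jerusalem_Canon EOS R5_20210601_031.jpg~
11e0d980f3b2219f65da97a0318e7dce  ./Jerusalem_Canon EOS R5_20210601_031.jpg
EOF
# Only the exact name matches; the trailing-"~" backup file does not.
awk 'FNR == NR {srch[$0]; next}
     {s = $0; sub(/^[^[:blank:]]+[[:blank:]]+/, "", s)}
     s in srch' badfiles hashfile
# 11e0d980f3b2219f65da97a0318e7dce  ./Jerusalem_Canon EOS R5_20210601_031.jpg
```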
Answered By - anubhava Answer Checked By - Mary Flores (WPSolving Volunteer)