Issue
fileA.txt has several thousand file paths that look something like this:
/foldera/folder b/folder-c/filename.txt
/foldera/folder b/folder-c/folderd/file name.txt
/foldera/folder b/.file name.txt.abc
fileB.txt has several million lines of file paths:
21 001 Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/filename.txt
21 001 Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/folderd/foldere/filename.txt
21 001 Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/folderd/foldere/filename.txt
21 001 Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/folderd/file name.txt
21 001 Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/folderd/foldere/filename.txt
21 001 Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/.file name.txt.abc
21 001 Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/folderd/foldere/filename.txt
21 001 Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/folderd/foldere/filename.txt
21 001 Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/folderd/foldere/filename.txt
I would like to use the whole path and filename from each line in fileA.txt, to find the same whole path and filename from a line in fileB.txt and return the full line please.
e.g.
Match /foldera/folder b/folder-c/filename.txt
in fileB.txt and return 21 001 Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/filename.txt
Each full line in fileA.txt is used for the match.
I've tried using grep but it just grinds to a halt with this many lines:
grep -f fileA.txt fileB.txt
Is there a quicker way using something like awk please?
I've found various similar solutions but they are usually for single word matches rather than whole paths that may contain spaces and odd characters.
Solution
Use the -F option to search literally (i.e. without regexps, which are slow and not needed here). In the commands below, f1 stands for fileA.txt (the list of paths) and f2 for fileB.txt:
grep -Ff f1 f2
You can use ripgrep to speed up further:
rg -Ff f1 f2
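One thing worth checking before relying on the -F commands: a fixed-string pattern matches anywhere in a line, so a path from f1 also matches longer paths that merely contain it. A minimal sketch with made-up sample data (the temporary directory and the .bak entry are hypothetical, added only for illustration):

```shell
# Hypothetical sample data in a throwaway directory.
tmp=$(mktemp -d)
printf '%s\n' '/foldera/folder b/folder-c/filename.txt' > "$tmp/f1"
cat > "$tmp/f2" <<'EOF'
21 001 Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/filename.txt
21 001 Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/filename.txt.bak
EOF

# -F treats each line of f1 as a literal string; it still matches as a substring,
# so both lines of f2 are selected here.
grep_hits=$(grep -Ff "$tmp/f1" "$tmp/f2")
printf '%s\n' "$grep_hits"

rm -rf "$tmp"
```

If only exact whole-path matches are wanted, comparing the entire path field (as the awk command does) avoids these extra hits.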
With awk, which might end up faster than rg, but assumes that : cannot be part of the filenames:
awk 'NR==FNR{a[$0]; next} $NF in a' f1 FS=':' f2
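The lines from the first file are stored as keys in the array a. Because FS=':' is placed between the two file names, it only takes effect for the second file, which is split into fields on the : character. If the last field is found as a key in a, the line is printed. The comparison is against the whole last field, so only exact path matches are returned.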
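As a quick sanity check, the awk command can be tried on a cut-down copy of the sample data from the question. This is only a sketch: the temporary directory and the `matches` variable are names chosen here for illustration.

```shell
# Small sample files modelled on the question, in a throwaway directory.
tmp=$(mktemp -d)
cat > "$tmp/f1" <<'EOF'
/foldera/folder b/folder-c/filename.txt
/foldera/folder b/folder-c/folderd/file name.txt
/foldera/folder b/.file name.txt.abc
EOF
cat > "$tmp/f2" <<'EOF'
21 001 Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/filename.txt
21 001 Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/folderd/foldere/filename.txt
21 001 Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/folderd/file name.txt
21 001 Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/.file name.txt.abc
EOF

# First file: remember every path as an array key.
# Second file: split on ':' and print lines whose last field is a known path.
matches=$(awk 'NR==FNR{a[$0]; next} $NF in a' "$tmp/f1" FS=':' "$tmp/f2")
printf '%s\n' "$matches"

rm -rf "$tmp"
```

The line containing foldere is not printed, because its path is not listed in f1. The colons inside the timestamp do no harm, since only the field after the last : is compared.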
As far as I know, this code should work with mawk (which will likely be faster than other awk implementations).
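If some filenames might contain a : after all, one possible variant is to cut each line of the second file at the first ":/" instead of splitting on every colon. This is a sketch, not part of the original answer, and it assumes that the prefix before the path never contains ":/" (true for the sample lines in the question). The file:name.txt entry and the variable names are hypothetical.

```shell
# Hypothetical sample data with a colon inside a filename.
tmp=$(mktemp -d)
printf '%s\n' '/foldera/folder b/file:name.txt' > "$tmp/f1"
cat > "$tmp/f2" <<'EOF'
21 001 Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/file:name.txt
21 001 Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/other.txt
EOF

# Last-field rule: $NF is only "name.txt" for the first line, so nothing matches.
by_last_field=$(awk 'NR==FNR{a[$0]; next} $NF in a' "$tmp/f1" FS=':' "$tmp/f2")

# Variant: take everything after the first ":/" separator as the path
# (assumption: the timestamp/ID prefix never contains ":/").
by_prefix=$(awk 'NR==FNR{a[$0]; next} {p = index($0, ":/")} p && (substr($0, p + 1) in a)' "$tmp/f1" "$tmp/f2")

printf 'last-field rule: %s\n' "${by_last_field:-<no match>}"
printf 'index rule:      %s\n' "$by_prefix"

rm -rf "$tmp"
```

Both index and substr are standard awk functions, so this variant should also run under mawk.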
Answered By - Sundeep
Answer Checked By - Clifford M. (WPSolving Volunteer)