Saturday, September 17, 2022

[SOLVED] Given a list of file paths from fileA, how do I match and return a line with the same path in fileB using bash, awk, or sed, in Linux?

September 17, 2022 awk, bash, linux

Issue

fileA.txt has several thousand file paths that look something like this:

/foldera/folder b/folder-c/filename.txt
/foldera/folder b/folder-c/folderd/file name.txt
/foldera/folder b/.file name.txt.abc

fileB.txt has several million lines of file paths:

     21  001  Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/filename.txt
     21  001  Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/folderd/foldere/filename.txt
     21  001  Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/folderd/foldere/filename.txt
     21  001  Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/folderd/file name.txt
     21  001  Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/folderd/foldere/filename.txt
     21  001  Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/.file name.txt.abc
     21  001  Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/folderd/foldere/filename.txt
     21  001  Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/folderd/foldere/filename.txt
     21  001  Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/folderd/foldere/filename.txt

I would like to use the whole path and filename from each line in fileA.txt, to find the same whole path and filename from a line in fileB.txt and return the full line please.

e.g. Match /foldera/folder b/folder-c/filename.txt in fileB.txt and return 21 001 Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/filename.txt

Each full line in fileA.txt is used for the match.

I've tried using grep but it just grinds to a halt with this many lines:

grep -f fileA.txt fileB.txt

Is there a quicker way using something like awk please?

I've found various similar solutions but they are usually for single word matches rather than whole paths that may contain spaces and odd characters.

Solution

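Use grep's -F option so that each line of the first file is treated as a fixed string rather than a regular expression; regex matching is slow and not needed here (f1 and f2 stand for fileA.txt and fileB.txt):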

grep -Ff f1 f2
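One thing to be aware of: -F still matches substrings, so a path from the first file can also hit a longer path in the second file that merely contains it. The sketch below illustrates this with made-up sample data; the file names demoA.txt, demoB.txt and grep_matches.txt are hypothetical and exist only for the demonstration.

```shell
# Hypothetical sample data: the second demoB.txt line holds the demoA.txt
# path only as the beginning of a longer file name.
printf '%s\n' '/foldera/folder b/folder-c/filename.txt' > demoA.txt
printf '%s\n' \
  '     21  001  Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/filename.txt' \
  '     21  001  Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/filename.txt.bak' > demoB.txt

# -F treats every line of demoA.txt as a fixed string, so both lines
# of demoB.txt match, including the .bak one.
grep -Ff demoA.txt demoB.txt > grep_matches.txt

# Count the matched lines.
grep -c '' grep_matches.txt
```

If such near-duplicates can occur in your data, an exact comparison of the path part, as in the awk approach, is probably the safer choice.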

You can use ripgrep (rg), which accepts the same two options, to speed things up further:

rg -Ff f1 f2

Alternatively, use awk, which might end up faster than rg and compares the whole path exactly, but assumes that : never appears in a file name:

awk 'NR==FNR{a[$0]; next} $NF in a' f1 FS=':' f2

The lines of the first file are stored as keys in the array a. Because FS=':' is given after f1 on the command line, it takes effect only for the second file, whose lines are split into fields on the : character. If the last field ($NF) is found as a key in a, the line is printed.
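If file names may contain a colon, one possible variant is to take everything after the first ":/" on each line instead of the last :-separated field. This is only a sketch under the assumption that the prefix (counters, timestamp, numeric id) never contains ":/" itself, which holds for the sample lines in the question. The file names listA.txt, listB.txt and awk_matches.txt and the path containing odd:name.txt are hypothetical test data.

```shell
# Hypothetical sample data in the question's format; one path contains a colon.
printf '%s\n' \
  '/foldera/folder b/folder-c/filename.txt' \
  '/foldera/folder b/odd:name.txt' > listA.txt
printf '%s\n' \
  '     21  001  Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/filename.txt' \
  '     21  001  Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/folder-c/filename.txt.bak' \
  '     21  001  Wed Jul 6 08:00:00 2020 12345678:/foldera/folder b/odd:name.txt' > listB.txt

# First file: remember every path as an array key.
# Second file: find the first ":/" and look up the text after the colon.
awk 'NR==FNR { paths[$0]; next }
     { i = index($0, ":/") }
     i && (substr($0, i + 1) in paths)' listA.txt listB.txt > awk_matches.txt

cat awk_matches.txt
```

The lookup is an exact comparison of the whole path, so the .bak line is skipped while the path containing a colon is still found.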

As far as I know, this code should work with mawk (which will likely be faster than other awk implementations).

Answered By - Sundeep

Answer Checked By - Clifford M. (WPSolving Volunteer)

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
