Friday, February 2, 2024

[SOLVED] Grep the last occurrence of different elements in a big file


Issue

I have a file where different elements are repeated on several lines. My file contains lines like this:

1  $element_(1)
10 $element_(2)
20 $element_(1)
30 $element_(3)
40 $element_(1)
50 $element_(2)
60 $element_(3)
70 $element_(1)

I want to get the last occurrence of each of these elements and put them in a file resultfile.

50 $element_(2)
60 $element_(3)
70 $element_(1)

I tried

for  i in {1..8000} do 
     grep $element_\($i\) sourcefile | tail -1 >> resultfile 
done

But it is giving me errors. Besides, how do I distinguish between the $ that is part of the element name and the $ that expands the loop variable holding the number of the element I am searching for?

Also, I don't know exactly how many elements the file will contain, so I took 8000 as a maximum, but there may be fewer or more.

Solution

Output sorted by element index

You can tell grep to stop after finding the first match (-m 1), and to make this match the last in your file, you can pipe the file in reverse to grep:

for i in {1..8000}; do
    tac sourcefile | grep -m 1 "\$element_($i)"
done > resultfile

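I've also moved the output redirection outside the loop and fixed the quoting in your pattern. The whole pattern is now in double quotes; the first $ has to be escaped so the shell doesn't try to expand a variable $element_, and the parentheses must not be escaped, or grep treats them as a capture group (in grep's default basic regular expressions, \( and \) delimit a group, while plain parentheses are literal). In your attempt, escaping the unquoted parentheses was correct, because the shell would otherwise treat them as syntax, but quoting the whole pattern makes that unnecessary.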

It's usually easier to single quote the pattern so we don't have to care about shell expansion, but in this case, we want $i to actually expand.
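To make the quoting rules concrete, here is a minimal sketch that prints the pattern grep would receive under three quoting styles. The variable names unquoted, double and mixed are only for illustration, and the sketch assumes that no shell variable called element_ is set.

```shell
#!/usr/bin/env bash
# Show the pattern grep would receive under different quoting styles.
# Assumption: no shell variable named element_ is set.
unset element_
i=2

# Unquoted, as in the question: $element_ expands to nothing,
# and the shell removes the backslashes before the parentheses.
unquoted=$(printf '%s' $element_\($i\))

# Double quotes with an escaped dollar sign, as in the answer.
double=$(printf '%s' "\$element_($i)")

# Single quotes for the literal parts, with "$i" spliced in between.
mixed=$(printf '%s' '$element_('"$i"')')

printf 'unquoted: %s\n' "$unquoted"
printf 'double:   %s\n' "$double"
printf 'mixed:    %s\n' "$mixed"
```

The unquoted form loses the $element_ prefix entirely, so grep would search for (2) alone, while the other two forms both produce $element_(2). If the pattern should be taken as a literal string with no regular expression meaning at all, grep -F is another option.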

Your attempt also had a syntax error: the ; (or a newline) that has to come between the brace expansion and do was missing.
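As a quick check that reversing the file and stopping at the first match picks the same line as scanning everything and keeping the final match, the following sketch runs both on the sample data from the question. It assumes GNU tac and a grep that supports -m; the temporary directory and the variable names are only for illustration.

```shell
#!/usr/bin/env bash
# Compare two ways of picking the last matching line.
# Assumptions: tac (GNU coreutils) and grep -m are available; the temporary
# directory is only used for this illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > sourcefile <<'EOF'
1  $element_(1)
10 $element_(2)
20 $element_(1)
30 $element_(3)
40 $element_(1)
50 $element_(2)
60 $element_(3)
70 $element_(1)
EOF

# Scan the whole file, then keep only the final match.
last_via_tail=$(grep "\$element_(1)" sourcefile | tail -n 1)

# Read the file bottom-up and stop at the first match.
last_via_tac=$(tac sourcefile | grep -m 1 "\$element_(1)")

printf '%s\n' "$last_via_tail" "$last_via_tac"
```

Both commands print the line 70 $element_(1). The tac variant lets grep stop as soon as it has a match instead of reading the entire file for every element.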

Output sorted by order of appearance in input file

If the lines have to be in the same order as in the input file, we can prepend line numbers (nl) and sort by them in the end (sort -n) before removing them again with cut:

for i in {1..8000}; do
    nl sourcefile | tac | grep -m 1 "\$element_($i)"
done | sort -n | cut -f 2- > resultfile
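The following sketch runs this pipeline on the sample data from the question and also prints one intermediate line, to show what the line number added by nl looks like. It assumes GNU tac and a grep that supports -m; the upper bound of 5 and the temporary directory are only for illustration. It uses cut -f 2- rather than cut -f 2 so that lines containing tabs would be kept whole, which is a choice made for this sketch.

```shell
#!/usr/bin/env bash
# Keep the last occurrence of each element, in order of appearance.
# Assumptions: tac (GNU coreutils) and grep -m are available; the bound of 5
# and the temporary directory are only used for this illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > sourcefile <<'EOF'
1  $element_(1)
10 $element_(2)
20 $element_(1)
30 $element_(3)
40 $element_(1)
50 $element_(2)
60 $element_(3)
70 $element_(1)
EOF

# Intermediate step: the matching line, prefixed by its line number and a tab.
nl sourcefile | tac | grep -m 1 "\$element_(2)"

for i in {1..5}; do
    nl sourcefile | tac | grep -m 1 "\$element_($i)"
done | sort -n | cut -f 2- > resultfile   # sort by line number, then drop it

cat resultfile
```

With the sample data, resultfile ends up containing the lines for elements 2, 3 and 1, in that order, which matches the output asked for in the question.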

Stop after first unsuccessful search

If we know that the element indices are contiguous and we can stop as soon as we don't find an element, we can tweak the loop as follows (still assuming we want to keep elements in order of appearance in the input file):

i=0
while true; do
    ((++i))
    nl sourcefile | tac | grep -m 1 "\$element_($i)" || break
done | sort -n | cut -f 2- > resultfile

This uses an increasing counter instead of a predetermined sequence. The exit status of a pipeline is that of its last command, here grep, so if grep can't find the element, the status is non-zero and we exit the loop.
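To illustrate the exit status the loop relies on, here is a small sketch on the sample data from the question. It uses grep -q, which prints nothing and only reports through its exit status whether a match exists; the variable names found and missing and the temporary directory are only for illustration.

```shell
#!/usr/bin/env bash
# Show grep's exit status and where a stop-on-first-miss loop ends.
# Assumption: the temporary directory is only used for this illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > sourcefile <<'EOF'
1  $element_(1)
10 $element_(2)
20 $element_(1)
30 $element_(3)
40 $element_(1)
50 $element_(2)
60 $element_(3)
70 $element_(1)
EOF

# grep exits with 0 when it finds a match and 1 when it finds none.
grep -q "\$element_(3)" sourcefile; found=$?
grep -q "\$element_(4)" sourcefile; missing=$?
echo "found=$found missing=$missing"

# Count upwards until the first index that does not occur in the file.
i=0
while true; do
    ((++i))
    grep -q "\$element_($i)" sourcefile || break
done
echo "stopped at i=$i"
```

With three elements in the file, the loop stops at i=4, so only four searches run instead of 8000. A gap in the indices would end the loop early as well, which is why this variant depends on the indices being contiguous.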

Answered By - Benjamin W.

Answer Checked By - Gilberto Lyons (WPSolving Admin)

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.