Issue
I was trying to use sed to remove lines whose ID had already been seen, keeping only the first line for each ID. I found a solution (https://stackoverflow.com/questions/10530438/remove-lines-from-a-file-by-using-an-index-array), but it was not explained at all and I am struggling to understand it.
Example of test.txt (IDs will not always be numerically sorted, but duplicates will always follow each other):
1
2
3
3
4
4
4
5
6
7
7
Result wanted:
1
2
3
4
5
6
7
The code:
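# Create an array of IDs (first tab-separated field of each line)
mapfile -t id_array < <(cut -f1 test.txt)
# Loop over the IDs, starting at the second one so a previous ID always exists
for (( i=1; i < ${#id_array[@]}; i++ ))
do
    prev=$((i-1))
    # Compare each ID with the previous one; if equal, record the line for deletion
    if (( ${id_array[$prev]} == ${id_array[$i]} ))
    then
        # Array indices are 0-based, sed line numbers are 1-based
        index_array+=($((i+1)))
    fi
done
# Line I don't fully understand: deletes the recorded lines from the file
sed -i ''"${index_array[*]/%/d;}"'' test.txt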
The last line deletes, in place, the lines whose numbers are stored in the array. [*] expands all values into a single word ([@] would not work, as it expands each value into its own word). The /%/d; is a parameter expansion that appends d; to the end of each value. But I completely fail to understand the '' on each side. A single quote on each side does not work. Why?
EDIT: it occurred to me that it might be there to keep the inner ' so that the sed expression stays in single quotes as required. Is that true?
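The expansion on the sed line can be inspected on its own, without touching any file. This is a minimal sketch, assuming bash and a hypothetical index_array whose values are made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical sample data: 1-based line numbers that should be deleted.
index_array=(4 6 7 11)

# /%/d; appends "d;" to every element; [*] then joins them with a space.
script="${index_array[*]/%/d;}"
printf '%s\n' "$script"    # prints: 4d; 6d; 7d; 11d;

# '' is an empty string, so this builds exactly the same single word.
with_empty_quotes=''"${index_array[*]/%/d;}"''
if [[ $with_empty_quotes == "$script" ]]; then
    echo "identical"
fi
```

Going by this, the doubled quotes look redundant: sed receives the same argument with or without them. Wrapping the expansion in real single quotes ('"..."') behaves differently, because the double quotes become literal characters and the array is never expanded.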
Solution
The correct awk solution for this is:
awk '$1 != prev{print} {prev=$1}' test.txt
The above stores almost nothing in memory, just 2 $1s at a time. If you did awk '!seen[$1]++' test.txt, on the other hand, you'd get the same output but then you'd have to store all unique $1 values in memory together and so YMMV if your input is massive.
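To see how the two awk commands relate, here is a small sketch on hypothetical input (an ID in column 1 plus a made-up second column). Unlike test.txt, this input deliberately repeats an ID non-adjacently, which is the one case where the two commands differ:

```shell
#!/usr/bin/env bash
# Hypothetical input: ID 3 reappears after ID 2, so its duplicates are not all adjacent.
input=$(printf '%s\n' '1 a' '3 b' '3 c' '2 d' '2 e' '3 f')

# Compares each ID with the previous line only; remembers a single value.
adjacent=$(printf '%s\n' "$input" | awk '$1 != prev{print} {prev=$1}')

# Remembers every ID seen so far; memory grows with the number of unique IDs.
global=$(printf '%s\n' "$input" | awk '!seen[$1]++')

printf 'adjacent:\n%s\n\nglobal:\n%s\n' "$adjacent" "$global"
```

Here the first command keeps "3 f" because the line before it has a different ID, while the second drops it. When duplicates always follow each other, as in the question, both produce the same output.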
Answered By - Ed Morton