Monday, December 4, 2023

[SOLVED] Find duplicate text in a line, delete that whole line and its successor line

December 04, 2023 awk, bash, sed

Issue

Since my last question was closed, and after some testing of my own, I am posting another question in the hope of getting a correct answer.

I want to remove some duplicated lines from a text file. Every relevant line contains a fixed string that should act as the starting point: the text from that string to the end of the line is what gets checked for duplicates. When a duplicate is found, the whole line, including its successor line directly underneath, should be deleted. Every first occurrence should be kept. The text file looks like this:

somerandomtexthere fixedstring somepossibledupetext <- 1st line
line to also delete if dupe right above             <- 2nd line
somerandomtexthere fixedstring somepossibledupetext <- 3rd line
line to also delete if dupe right above             <- 4th line
...

So my idea is to use that fixed string as the starting point, to ensure that dupes are only recognized in lines 13579 etc. and that lines 246810 are not scanned, because the "starting string" only occurs in 13579. Hopefully this will avoid messing up the whole text file, because 246810 could also have dupes.

For example, if lines 1 and 9 have duplicate text between the fixed string and the end of the line, lines 9 and 10 should be removed from the file and lines 1 and 2 should stay. I hope this is as clear as it could be. :)

Is this task possible with sed or awk?

What I tried was something like

awk '!(seen[$0]++&&(skip=1)||skip&&!(skip=0))'
or
awk '!seen[$0]++||getline&&0'

Both will find and eliminate duplicated lines and their respective successor lines, but unfortunately, when I looked more closely at the results, they had messed up my text file at points where lines 246810 and so on also had dupes. This is where my idea of using the fixed string came from.
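The failure mode described above can be reproduced with a small, made-up input. This is only a sketch: the six sample lines are hypothetical and are not taken from the real file. No header line is duplicated here, yet the second "note" line matches the first one as a whole line, so it is dropped together with the line that follows it, which happens to be a unique header.

```shell
# Hypothetical sample: three header/successor pairs, no duplicated headers.
input='a fixedstring X
note
b fixedstring Y
note
c fixedstring Z
other'

# First attempt from above: it keys on the whole line ($0), so the
# repeated "note" line counts as a duplicate and takes the next line with it.
result=$(printf '%s\n' "$input" |
    awk '!(seen[$0]++&&(skip=1)||skip&&!(skip=0))')

printf '%s\n' "$result"
```

With this input the second "note" and the unique "c fixedstring Z" header disappear, while "other" is left without its header.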

Solution

This might be what you're trying to do:

$ cat tst.awk
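# Pattern: true only when fxStr occurs in the line (index() returns 0 otherwise).
fxStrBeg = index($0,fxStr) {
    # Text after fxStr; the "+ 1" also skips the single character
    # (a blank in the sample) that follows it.
    tailOfLine = substr($0,fxStrBeg + length(fxStr) + 1)
    if ( seen[tailOfLine]++ ) {
        # Tail seen before: drop this line and its successor.
        numLinesToSkip = 2
    }
}
numLinesToSkip > 0 {
    if ( numLinesToSkip-- > 0 ) {
        next
    }
}
{ print }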

$ awk -v fxStr='fixedstring' -f tst.awk file
somerandomtexthere fixedstring somepossibledupetext
line to also delete if dupe right above
...

but the example you provided is far from adequate to demonstrate your requirements (what should happen given "fixedstring" twice on a line, "fixedstring" on back-to-back lines, "fixedstring" as a substring of some lengthier string, "fixedstring" containing regexp metachars, what are the references to lines 13579 and 246810 about, where/what are lines 1 and 9, etc., etc.) so it's a guess.
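To make one possible reading of "lines 1 and 9" concrete, here is a self-contained sketch. It assumes a hypothetical ten-line input in which lines 1 and 9 share the text after "fixedstring", and in which lines 4 and 6 are identical successor lines that must survive. The awk program is the same logic as tst.awk, inlined so the snippet runs on its own, with the assignment pattern parenthesized.

```shell
# Hypothetical input: lines 1 and 9 have the same text after "fixedstring";
# lines 4 and 6 are identical successor lines that should NOT be touched.
input='aaa fixedstring dupetext
successor of line 1
bbb fixedstring other1
same successor
ccc fixedstring other2
same successor
ddd fixedstring other3
successor of line 7
eee fixedstring dupetext
successor of line 9'

result=$(printf '%s\n' "$input" | awk -v fxStr='fixedstring' '
(fxStrBeg = index($0,fxStr)) {
    # Only lines containing fxStr are ever checked for duplicates.
    tailOfLine = substr($0,fxStrBeg + length(fxStr) + 1)
    if ( seen[tailOfLine]++ ) {
        numLinesToSkip = 2
    }
}
numLinesToSkip > 0 {
    numLinesToSkip--
    next
}
{ print }
')

printf '%s\n' "$result"
```

Under these assumptions lines 9 and 10 are removed and lines 1 through 8 are printed unchanged, including both copies of "same successor".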

I used meaningfully named variables and generally lengthier-than-necessary code to try to make it clear what the code is doing. See "Printing with sed or awk a line following a matching pattern" for more information on how to idiomatically do tasks like this in general.
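One of the open questions above is what should happen when "fixedstring" is only a substring of some lengthier string. Because index() does a literal substring search, tst.awk treats such a line as a header line too (and, for the same reason, regexp metachars in fxStr are not interpreted). If the requirement turns out to be that the fixed string must stand alone as a whitespace-separated field, a variant along these lines might work. This is a guess at the requirement, not something the question states, and the sample input is hypothetical; note that it rebuilds the tail from fields, so runs of blanks in the tail are normalized to single blanks.

```shell
# Hypothetical input: line 3 contains "fixedstring" only inside a longer word.
input='aaa fixedstring dupetext
keep 1
bbb prefixedstringy dupetext
keep 2
ccc fixedstring dupetext
drop'

result=$(printf '%s\n' "$input" | awk -v fxStr='fixedstring' '
{
    # Find fxStr as a whole field and collect the fields after it.
    tailOfLine = ""
    fxStrFound = 0
    for (i = 1; i <= NF; i++) {
        if ( fxStrFound ) {
            tailOfLine = tailOfLine " " $i
        } else if ( ($i "") == fxStr ) {   # "" forces a string comparison
            fxStrFound = 1
        }
    }
}
fxStrFound && seen[tailOfLine]++ { numLinesToSkip = 2 }
numLinesToSkip > 0 { numLinesToSkip--; next }
{ print }
')

printf '%s\n' "$result"
```

Here line 3 is not treated as a header because "prefixedstringy" is not equal to the fixed string, so only lines 5 and 6 are removed.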

Answered By - Ed Morton

Answer Checked By - Dawn Plyler (WPSolving Volunteer)

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
