Friday, April 15, 2022

[SOLVED] sed: removing duplicated patterns in the log file

Issue

I am post-processing a log file that has the following format:

Finding intramodel H-bonds
Constraints relaxed by 0.55 angstroms and 20 degrees
Models used:
    1.1 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.6 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.10 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.8 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.2 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.3 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.4 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.7 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.5 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.9 SarsCov2_structure31R_nsp5holo_rep1.pdb

6 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
SarsCov2_structure31R_nsp5holo_rep1.pdb #1.3/? ASN 142 ND2   SarsCov2_structure31R_nsp5holo_rep1.pdb #1.3/A UNL 1 N   SarsCov2_structure31R_nsp5holo_rep1.pdb #1.3/? ASN 142 2HD2   3.419  2.541
SarsCov2_structure31R_nsp5holo_rep1.pdb #1.5/? GLN 189 NE2   SarsCov2_structure31R_nsp5holo_rep1.pdb #1.5/A UNL 1 O   SarsCov2_structure31R_nsp5holo_rep1.pdb #1.5/? GLN 189 1HE2   2.883  2.159
SarsCov2_structure31R_nsp5holo_rep1.pdb #1.6/? HIS 163 NE2   SarsCov2_structure31R_nsp5holo_rep1.pdb #1.6/A UNL 1 O   no hydrogen  

From this log I need to keep everything from the 3rd line onward, and then delete every occurrence of the duplicated pattern "SarsCov2_structure31R_nsp5holo_rep1.pdb". Can I use a regex with sed to match any such name in the log (anything ending in .pdb), so that it is removed automatically from each processed log? The expected output is:

Models used:
    1.1 
    1.6 
    1.10 
    1.8 
    1.2 
    1.3 
    1.4 
    1.7 
    1.5 
    1.9 

6 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
 #1.3/? ASN 142 ND2    #1.3/A UNL 1 N    #1.3/? ASN 142 2HD2   3.419  2.541
 #1.5/? GLN 189 NE2    #1.5/A UNL 1 O    #1.5/? GLN 189 1HE2   2.883  2.159
 #1.6/? HIS 163 NE2    #1.6/A UNL 1 O   no hydrogen            3.299  N/A
 #1.7/? GLN 189 NE2    #1.7/A UNL 1 O    #1.7/? GLN 189 1HE2   3.109  2.147
 #1.9/? ASN 142 ND2    #1.9/A UNL 1 O    #1.9/? ASN 142 1HD2   3.032  2.319
 #1.10/? GLN 189 NE2   #1.10/A UNL 1 O   #1.10/? GLN 189 1HE2  3.054  2.125

Here is my attempt without a regex, which does not work yet :-)

cat test.log | tail -n +2 | sed -e "/SarsCov2_structure31R_nsp5holo_rep1.pdb/d" >> ./test2.log

Solution

Using sed

$ sed 's/[[:alnum:]_]*\.pdb//g;1,2d' input_file
Models used:
    1.1
    1.6
    1.10
    1.8
    1.2
    1.3
    1.4
    1.7
    1.5
    1.9

6 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
 #1.3/? ASN 142 ND2    #1.3/A UNL 1 N    #1.3/? ASN 142 2HD2   3.419  2.541
 #1.5/? GLN 189 NE2    #1.5/A UNL 1 O    #1.5/? GLN 189 1HE2   2.883  2.159
 #1.6/? HIS 163 NE2    #1.6/A UNL 1 O   no hydrogen
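
The command above does two things: s/[[:alnum:]_]*\.pdb//g deletes every run of letters, digits and underscores that ends in .pdb, and 1,2d drops the first two lines. (The attempt in the question used /pattern/d, which deletes the whole matching line rather than just the file name.) As a slightly more general sketch, not part of the original answer, the variant below also matches names containing hyphens or extra dots and trims the trailing whitespace left behind. The function name strip_pdb_names is made up for illustration, and it assumes the file names contain no whitespace.

```shell
# Hypothetical helper: drop the first two lines of a log, remove any
# whitespace-free token ending in ".pdb", and trim trailing whitespace.
# Reads the files given as arguments, or stdin if there are none.
strip_pdb_names() {
    sed -e '1,2d' \
        -e 's/[^[:space:]]*\.pdb//g' \
        -e 's/[[:space:]]*$//' \
        "$@"
}
```

It could be called as: strip_pdb_names test.log > test2.log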


Answered By - HatLess
Answer Checked By - David Goodson (WPSolving Volunteer)