Monday, July 25, 2022

[SOLVED] sed/awk: adding missed patterns to the string

Issue

I am working with text files that contain, somewhere, a line like this:

ligand_types A C Cl NA OA N SA HD    # ligand atom types

I need to check the atom types listed on that line (A C Cl NA OA N SA HD) and, if "SA" and/or "HD" is missing, append it to the end of the list.

Sample input:

ligand_types A C Cl NA OA N HD       # ligand atom types
ligand_types A C NA OA N SA          # ligand atom types
ligand_types A C Cl NA OA N          # ligand atom types

Expected output:

ligand_types A C Cl NA OA N HD SA    # ligand atom types
ligand_types A C NA OA N SA HD       # ligand atom types
ligand_types A C Cl NA OA N HD SA    # ligand atom types

Could you suggest a sed or awk solution for this task?

For example, with sed it might look like:

sed -i 's/old-string/new-string-with-additional-patterns/g' my_file

Solution

Update: added support for end-of-line comments.

Here's an awk idea:

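awk -v types="SA HD" '
    BEGIN {
        # types that must be present on every ligand_types line
        typesCount = split(types,typesArr)
    }
    /^ligand_types / {

        # set the end-of-line comment aside, if there is one
        if (i = index($0,"#")) {
            comment = substr($0,i)
            $0 = substr($0,1,i-1)
        } else
            comment = ""

        # start with all the required types...
        for (i = 1; i <= typesCount; i++)
            typesHash[typesArr[i]]

        # ...and drop the ones already on the line
        for (i = 2; i <= NF; i++)
            delete typesHash[$i]

        # append what is left (the order of "for ... in" is unspecified)
        for (t in typesHash)
            $(NF+1) = t

        # put the comment back
        if (comment)
            $(NF+1) = comment
    }
    1
' my_file > my_file.new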

With your examples (each input line is followed by its output):

ligand_types A C Cl NA OA N SA HD    # ligand atom types
ligand_types A C Cl NA OA N SA HD # ligand atom types

ligand_types A C Cl NA OA N HD       # ligand atom types
ligand_types A C Cl NA OA N HD SA # ligand atom types

ligand_types A C NA OA N SA          # ligand atom types
ligand_types A C NA OA N SA HD # ligand atom types

ligand_types A C Cl NA OA N          # ligand atom types
ligand_types A C Cl NA OA N HD SA # ligand atom types
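
Note that awk does not specify the iteration order of for (t in typesHash), so when both types are missing they may come out as "HD SA" or "SA HD" depending on the awk implementation. If the order matters, one possible variant walks the required types in the order they were given. This is only a sketch: the function name add_ligand_types and the array names want and seen are made up for this illustration, and it reads from stdin instead of a file.

```shell
# Hypothetical helper (name chosen for this sketch): reads lines on stdin and
# appends to each "ligand_types" line any type from $1 that is missing,
# in the order given in $1.
# Usage: add_ligand_types "SA HD" < my_file > my_file.new && mv my_file.new my_file
add_ligand_types() {
    awk -v types="$1" '
        BEGIN { n = split(types, want, " ") }
        /^ligand_types / {
            # Split off an end-of-line comment, if any.
            comment = ""
            if (p = index($0, "#")) {
                comment = substr($0, p)
                $0 = substr($0, 1, p - 1)
            }
            # Record the types already present on the line.
            split("", seen)
            for (i = 2; i <= NF; i++)
                seen[$i] = 1
            # Append the missing ones in the order given in "types".
            for (i = 1; i <= n; i++)
                if (!(want[i] in seen))
                    $(NF + 1) = want[i]
            # Put the comment back after the (possibly extended) list.
            if (comment != "")
                $(NF + 1) = comment
        }
        { print }
    '
}
```

With types set to "SA HD", a line missing both would get "SA HD" appended, in that order. As in the original command, any line that is rebuilt has its runs of spaces collapsed to single spaces.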


Answered By - Fravadona
Answer Checked By - Dawn Plyler (WPSolving Volunteer)