Monday, April 11, 2022

[SOLVED] How to add one space character (without changing any other characters) to "one character strings" using awk, sed, or grep?

Issue

I obtained this text file using sed and awk (leap.log):

Template_frcmod
MASS

Pd 0.000         0.000 

BOND
Pd-c
Pd-3e
c-Pd
4p-ca
o-3e
n-3e
Pd-4e
3p-ca
o-4e
n-4e

ANGLE
Pd-c-Pd
Pd-3e-o
Pd-3e-n
Pd-1c-Pd
c-Pd-4p
c-Pd-3e
c-Pd-1c
c-Pd-3p
c-Pd-4e
4p-ca-ca
4p-Pd-3e
4p-Pd-1c
o-3e-n
3e-n-c3
3e-Pd-1c
ca-4p-ca
Pd-4e-o
Pd-4e-n
1c-Pd-4e
3p-ca-ca
3p-Pd-4e
o-4e-n
4e-n-c3
ca-3p-ca

DIHE

 Pd-4p-ca-ca
 Pd-3e-n-c3
 c-Pd-3e-o
 c-Pd-3e-n
 c-Pd-4e-o
 c-Pd-4e-n
 4p-Pd-3e-o
 4p-Pd-3e-n
 o-3e-n-c3
 o-3e-Pd-1c
 n-3e-Pd-1c
 ca-4p-ca-ca
 ca-ca-4p-ca
 Pd-3p-ca-ca
 Pd-4e-n-c3
 1c-Pd-4e-o
 1c-Pd-4e-n
 3p-Pd-4e-o
 3p-Pd-4e-n
 o-4e-n-c3
 ca-3p-ca-ca
 ca-ca-3p-ca

IMPROPER

NONBON

Now I have a problem with "one character" atom names:

c-Pd-4p

In this line, and in all similar lines that contain one-character atom names, "c" must become two characters, "c " (with a trailing space):

c -Pd-4p

Likewise, in 4e-n-c3 the "n" must become "n " (4e-n -c3), and in Pd-c the "c" must become "c ", etc. Every atom name that is one character long must be padded with a space to two characters.

When I simply replace "c" with "c ", "1c" also becomes "1c " (Pd-1c-Pd --> Pd-1c -Pd), but I don't want to change two-character atom names. They must stay the same.

When I try this command:

awk 'BEGIN{FS="-"}{ if(length($2) == 1 ) $2= $2" " } {print $0}' leap.log

the "-" signs disappear. What should I do to append a space to every one-character atom name?

Expected results (the comments are just for this question; the real file will not have comments):

Template_frcmod
MASS

Pd 0.000         0.000 

BOND
Pd-c  #Also the last "c" must be "c " 
Pd-3e
c -Pd
4p-ca
o -3e
n -3e
Pd-4e
3p-ca
o -4e
n -4e

ANGLE
Pd-c -Pd
Pd-3e-o 
Pd-3e-n 
Pd-1c-Pd
c -Pd-4p
c -Pd-3e
c -Pd-1c
c -Pd-3p
c -Pd-4e
4p-ca-ca
4p-Pd-3e
4p-Pd-1c
o -3e-n 
3e-n -c3
3e-Pd-1c
ca-4p-ca
Pd-4e-o 
Pd-4e-n 
1c-Pd-4e
3p-ca-ca
3p-Pd-4e
o -4e-n
4e-n -c3
ca-3p-ca

DIHE

Pd-4p-ca-ca
Pd-3e-n-c3
c -Pd-3e-o #Also the last "o" must be "o "
c -Pd-3e-n #Also the last "n" must be "n " 
c -Pd-4e-o #Also the last "o" must be "o "
c-Pd-4e-n  #Also the last "n" must be "n "  
4p-Pd-3e-o #Also the last "o" must be "o " 
4p-Pd-3e-n #Also the last "n" must be "n " 
o -3e-n-c3
o -3e-Pd-1c
n-3e-Pd-1c
ca-4p-ca-ca
ca-ca-4p-ca
Pd-3p-ca-ca
Pd-4e-n-c3
1c-Pd-4e-o
1c-Pd-4e-n
3p-Pd-4e-o
3p-Pd-4e-n
o -4e-n -c3
ca-3p-ca-ca
ca-ca-3p-ca

IMPROPER

NONBON

Solution
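
On why the "-" signs disappeared in the attempt from the question: assigning to a field makes awk rebuild the record using OFS, which defaults to a single space when only FS is set. The sketch below is a minimal illustration of that behaviour. The sample line and the variable names (sample, lost, kept) are assumptions made for the demo, and it pads only field 1 to keep the example short.

```shell
# Hypothetical sample line, not read from leap.log
sample='c-Pd-4p'

# Only FS is set: assigning $1 rebuilds the record with the default OFS (a space)
lost=$(printf '%s\n' "$sample" |
       awk 'BEGIN { FS="-" } { if (length($1) == 1) $1 = $1 " " } 1')

# FS and OFS are both "-": the rebuilt record keeps its hyphens
kept=$(printf '%s\n' "$sample" |
       awk 'BEGIN { FS=OFS="-" } { if (length($1) == 1) $1 = $1 " " } 1')

# Brackets make the spaces easy to see
printf '[%s]\n[%s]\n' "$lost" "$kept"
# prints:
# [c  Pd 4p]
# [c -Pd-4p]
```

Setting OFS alongside FS is the smallest change to the original command; it still has to loop over all fields to handle names in any position.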

Assumptions:

  • the lines of interest are exactly the lines that contain a -
  • in each line of interest, only one white-space-delimited field contains a -
  • every string between - delimiters is tested, and each one with length()==1 gets a space ( ) appended to its end
  • leading white space in a line can be ignored/removed

One awk idea (strips leading white space):

awk '
/-/ { n=split($1,arr,"-")                          # split field #1 into arr[] array based on "-" delimiter
      x=delim=""
      for (i=1;i<=n;i++) {                         # loop through array
          # piece together our new field
          x=x delim arr[i] ( length(arr[i]) == 1 ? " " : "")
          delim="-"
      }
      $1=x                                         # replace field #1 with value in variable "x"
    }
1
' leap.log

Another awk idea (maintains leading white space):

awk '
BEGIN { FS=OFS="-" }                   # define input/output field delimiter == "-"
NF>1  { for (i=1;i<=NF;i++) {          # if more than one "-" delimited field then ...
            old=$i
            gsub(/ /,"",old)           # strip any (leading) spaces from field
            if (length(old) == 1)      # if length() == 1 then ...
               $i=$i " "               # append space to current field
        }
      }
1
' leap.log

Both scripts generate the following, apart from the leading white space under DIHE (see the NOTE after the output):

Template_frcmod
MASS

Pd 0.000         0.000

BOND
Pd-c
Pd-3e
c -Pd
4p-ca
o -3e
n -3e
Pd-4e
3p-ca
o -4e
n -4e

ANGLE
Pd-c -Pd
Pd-3e-o
Pd-3e-n
Pd-1c-Pd
c -Pd-4p
c -Pd-3e
c -Pd-1c
c -Pd-3p
c -Pd-4e
4p-ca-ca
4p-Pd-3e
4p-Pd-1c
o -3e-n
3e-n -c3
3e-Pd-1c
ca-4p-ca
Pd-4e-o
Pd-4e-n
1c-Pd-4e
3p-ca-ca
3p-Pd-4e
o -4e-n
4e-n -c3
ca-3p-ca

DIHE

 Pd-4p-ca-ca
 Pd-3e-n -c3
 c -Pd-3e-o
 c -Pd-3e-n
 c -Pd-4e-o
 c -Pd-4e-n
 4p-Pd-3e-o
 4p-Pd-3e-n
 o -3e-n -c3
 o -3e-Pd-1c
 n -3e-Pd-1c
 ca-4p-ca-ca
 ca-ca-4p-ca
 Pd-3p-ca-ca
 Pd-4e-n -c3
 1c-Pd-4e-o
 1c-Pd-4e-n
 3p-Pd-4e-o
 3p-Pd-4e-n
 o -4e-n -c3
 ca-3p-ca-ca
 ca-ca-3p-ca


IMPROPER

NONBON

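NOTE: with the 1st awk script, the entries under DIHE lose their leading white space. Spaces appended at the end of a line (e.g. the "c " in Pd-c) are not visible in the listing above.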
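
Since a space appended at the end of a line is invisible on screen, one way to check it is to mark each line end before printing. This is only a sketch: show_eol is a hypothetical helper name, the two input lines are samples rather than the real leap.log, and the awk program is the second script from above written compactly.

```shell
# Hypothetical helper: append "|" to every line so trailing spaces become visible
show_eol() { sed 's/$/|/'; }

# Two sample lines: one ends in a one-character atom name, the other does not
marked=$(printf 'Pd-c\nPd-1c-Pd\n' |
         awk 'BEGIN { FS=OFS="-" }
              NF>1 { for (i=1;i<=NF;i++) {
                         old=$i
                         gsub(/ /,"",old)      # ignore spaces when measuring
                         if (length(old) == 1)
                            $i=$i " "          # pad one-character names
                     }
                   }
              1' |
         show_eol)

printf '%s\n' "$marked"
# prints:
# Pd-c |
# Pd-1c-Pd|
```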



Answered By - markp-fuso