Friday, October 22, 2021

[SOLVED] sed: set multiple lines below with same beginning


Issue

I have a text file as follows

# jkakjshkjh
  *   drink  (2 spaces *  2 spaces)(non hash starting)
 *   biscuit  (1 space * 2 spaces)(non hash starting)
* paper       (* 1 space)(non has starting)
... (many more lines) of non hash starting
     *  tea   (7 spaces * 3 space)(non has starting)
# happy
* cup       (* 1 space)(non has starting)
  *   bat  (2 spaces *  2 spaces)(non hash starting)
 *   scooter  (1 space * 2 spaces)(non hash starting)
... (many more lines) of non hash starting
     *  disk   (7 spaces * 3 space)(non has starting)

I want every line that does not start with a hash to have the same beginning as the first non-hash line after the preceding hash line, i.e.:

# jkakjshkjh
  *   drink  (2 spaces *  2 spaces)(non hash starting)
  *   biscuit  (2 spaces *  2 spaces)(non hash starting)
  *   paper  (2 spaces *  2 spaces)(non hash starting)
   ... (many more lines of non hash starting)
  *   tea  (2 spaces *  2 spaces)(non hash starting)
# happy
* cup       (* 1 space)(non has starting)
* bat       (* 1 space)(non has starting)
* scooter       (* 1 space)(non has starting)
... (many more lines) of non hash starting
* disk       (* 1 space)(non has starting)

Now there is a twist in the above problem.

1) The first non-hash line does not always start with (2 spaces * 2 spaces). It can be (1 space * 1 space) or (a random number of leading spaces * a random number of trailing spaces).

2) Any line in between that starts with a hash must be left untouched.

So how can I solve this with sed?

I have tried the below:

sed -Ez 's/(\n)([^#]\s+\*\s+)([^\n]*\n)([^#]\s+\*\s+)([^\n]*\n)/\1\2\3\2\5/g' filename

The above only handles two consecutive lines at a time. The problem is that it treats each pair of lines as one unit, so each group of two lines gets the same beginning. But I want all of them to have the same beginning as the first non-hash line.

Solution

In case you're OK with a non-sed solution: with GNU awk for the 3rd arg to match():

$ cat tst.awk
{
    match($0,/^(\s*(\S)\s*)(.*)/,a)
    currHead = a[1]
    currChar = a[2]
    currTail = a[3]
}
currChar == "#" { indent = currHead }
currChar != "#" { indent = (prevChar == "#" ? currHead : indent) }
{ printf "%s%s\n", indent, currTail; prevChar = currChar }

$ awk -f tst.awk file
# jkakjshkjh
  *   drink  (2 spaces *  2 spaces)(non hash starting)
  *   biscuit  (1 space * 2 spaces)(non hash starting)
  *   paper       (* 1 space)(non has starting)
  *   .. (many more lines) of non hash starting
  *   tea   (7 spaces * 3 space)(non has starting)
# happy
* cup       (* 1 space)(non has starting)
* bat  (2 spaces *  2 spaces)(non hash starting)
* scooter  (1 space * 2 spaces)(non hash starting)
* .. (many more lines) of non hash starting
* disk   (7 spaces * 3 space)(non has starting)

With other awks, which lack the 3rd argument to match(), you'd use substr() calls to extract the parts that gawk's match() puts in a[], and use [[:space:]] and [^[:space:]] in place of \s and \S respectively.
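A minimal sketch of that portable variant, assuming a POSIX awk that supports bracket expressions such as [[:space:]]. The wrapper name normalize_indent and the tmp variable are hypothetical and not from the original answer. It uses the 2-argument match() with RLENGTH plus substr() in place of the a[] array:

```shell
# Hypothetical wrapper (name not from the original answer) around a
# POSIX-awk rewrite of tst.awk that avoids the gawk-only 3-argument match().
normalize_indent() {
    awk '
    {
        # Head = leading blanks + first non-blank char + blanks after it.
        if (match($0, /^[[:space:]]*[^[:space:]][[:space:]]*/)) {
            currHead = substr($0, 1, RLENGTH)
            currTail = substr($0, RLENGTH + 1)
            # The first non-blank char is whatever follows the leading blanks.
            tmp = currHead
            sub(/^[[:space:]]+/, "", tmp)
            currChar = substr(tmp, 1, 1)
        } else {
            # No non-blank char on this line: leave all three parts empty,
            # as the gawk version does when its match() fails.
            currHead = currChar = currTail = ""
        }
    }
    currChar == "#" { indent = currHead }
    currChar != "#" { indent = (prevChar == "#" ? currHead : indent) }
    { printf "%s%s\n", indent, currTail; prevChar = currChar }
    ' "$@"
}
```

Called as normalize_indent file, this should behave like awk -f tst.awk file on the sample input. With no file argument it reads standard input.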

To help you understand the syntax, if I were writing the above in a C-like language then it'd be:

while ( read(FILENAME,line) ) {                 # awk does this for you
    NR++;                                       # awk does this for you
    NF = split(line into $1, $2, $3, ... $NF);  # awk does this for you
    match(line,/^(\s*(\S)\s*)(.*)/,a);
    currHead = a[1];
    currChar = a[2];
    currTail = a[3];
    if (currChar == "#") { indent = currHead; }
    if (currChar != "#") { indent = (prevChar == "#" ? currHead : indent); }
    printf "%s%s\n", indent, currTail; prevChar = currChar;
}                                               # awk does this for you

and in fact you can duplicate that syntax in awk's BEGIN section with:

BEGIN {
    filename = ARGV[1]
    ARGV[1] = ""
    ARGC--
    while ( (getline line < filename) > 0 ) {
        nr++
        nf = split(line,flds)
        match(line,/^(\s*(\S)\s*)(.*)/,a)
        currHead = a[1]
        currChar = a[2]
        currTail = a[3]
        if (currChar == "#") { indent = currHead }
        if (currChar != "#") { indent = (prevChar == "#" ? currHead : indent) }
        printf "%s%s\n", indent, currTail; prevChar = currChar
    }
}

but see http://awk.freeshell.org/AllAboutGetline for why not to do that unless you have a very specific need.

Answered By - Ed Morton

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
