Issue
I have a text file as follows:
# jkakjshkjh
* drink (2 spaces * 2 spaces)(non hash starting)
* biscuit (1 space * 2 spaces)(non hash starting)
* paper (* 1 space)(non has starting)
... (many more lines) of non hash starting
* tea (7 spaces * 3 space)(non has starting)
# happy
* cup (* 1 space)(non has starting)
* bat (2 spaces * 2 spaces)(non hash starting)
* scooter (1 space * 2 spaces)(non hash starting)
... (many more lines) of non hash starting
* disk (7 spaces * 3 space)(non has starting)
I want every line that does not start with a hash to have the same beginning as the first non-hash line that follows the preceding hash line,
i.e.:
# jkakjshkjh
* drink (2 spaces * 2 spaces)(non hash starting)
* biscuit (2 spaces * 2 spaces)(non hash starting)
* paper (2 spaces * 2 spaces)(non hash starting)
... (many more lines of non hash starting)
* tea (2 spaces * 2 spaces)(non hash starting)
# happy
* cup (* 1 space)(non has starting)
* bat (* 1 space)(non has starting)
* scooter (* 1 space)(non has starting)
... (many more lines) of non hash starting
* disk (* 1 space)(non has starting)
Now there is a twist in the above problem.
1) The first non-hash line does not always start with (2 spaces * 2 spaces).
It can vary: (1 space * 1 space) or (random number of leading spaces * random number of trailing spaces).
2) If a line starting with a hash appears in between, that line should not be touched.
So how can I solve the above with sed?
I have tried the below:
sed -Ez 's/(\n)([^#]\s+\*\s+)([^\n]*\n)([^#]\s+\*\s+)([^\n]*\n)/\1\2\3\2\5/g' filename
The above only checks two consecutive lines. The problem is that it treats two lines as one unit, so each pair of lines gets the same beginning. But I want all of them to have the same beginning as the first non-hash line.
Solution
In case you're OK with a non-sed solution: with GNU awk for the 3rd arg to match():
$ cat tst.awk
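{
    # Split the line into: leading spaces + first non-space char + the spaces
    # after it (a[1]), that first non-space char alone (a[2]), and the rest (a[3]).
    match($0,/^(\s*(\S)\s*)(.*)/,a)
    currHead = a[1]
    currChar = a[2]
    currTail = a[3]
}
# A hash line keeps its own beginning.
currChar == "#" { indent = currHead }
# The first non-hash line after a hash line sets the beginning for the lines after it.
currChar != "#" { indent = (prevChar == "#" ? currHead : indent) }
{ printf "%s%s\n", indent, currTail; prevChar = currChar }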
$ awk -f tst.awk file
# jkakjshkjh
* drink (2 spaces * 2 spaces)(non hash starting)
* biscuit (1 space * 2 spaces)(non hash starting)
* paper (* 1 space)(non has starting)
* .. (many more lines) of non hash starting
* tea (7 spaces * 3 space)(non has starting)
# happy
* cup (* 1 space)(non has starting)
* bat (2 spaces * 2 spaces)(non hash starting)
* scooter (1 space * 2 spaces)(non hash starting)
* .. (many more lines) of non hash starting
* disk (7 spaces * 3 space)(non has starting)
With other awks you'd just use substr()s to get the parts that match() is putting in a[] for gawk, and use [[:space:]] and [^[:space:]] for \s and \S respectively.
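As a rough illustration of that portable variant, here is a minimal sketch that assumes a POSIX awk with bracket-expression support (some very old mawk builds lack it). The wrapper function name normalize_bullets and the sample lines are made up for the example; the three rules at the end are unchanged from tst.awk. RLENGTH from the two-argument match() plus substr() stand in for a[1] and a[3], and stripping the spaces from a copy of the head yields the character that gawk puts in a[2].

```shell
#!/bin/sh
# normalize_bullets is a hypothetical helper name, not part of the original answer.
normalize_bullets() {
    awk '
    {
        # Two-argument match() only sets RSTART/RLENGTH, so substr() carves
        # out the pieces that gawk would have stored in a[1] and a[3].
        if (match($0, /^[[:space:]]*[^[:space:]][[:space:]]*/)) {
            currHead = substr($0, 1, RLENGTH)
            currTail = substr($0, RLENGTH + 1)
            # Strip the spaces from a copy of the head to get the single
            # non-space character (a[2] in the gawk version).
            currChar = currHead
            gsub(/[[:space:]]/, "", currChar)
        } else {
            # Blank line: nothing matched, so all three parts are empty.
            currHead = currChar = currTail = ""
        }
    }
    currChar == "#" { indent = currHead }
    currChar != "#" { indent = (prevChar == "#" ? currHead : indent) }
    { printf "%s%s\n", indent, currTail; prevChar = currChar }
    ' "$@"
}

# Usage with made-up sample lines; a file name can be passed instead of stdin.
printf '%s\n' '# first' '  *  drink' ' *  biscuit' '* paper' '# second' '* cup' '  *  bat' |
    normalize_bullets
```

With that sample, the three lines under "# first" should all start with two spaces, the asterisk, and two spaces, and the two lines under "# second" should start with the asterisk and one space.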
To help you understand the syntax, if I were writing the above in a C-like language then it'd be:
while ( read(FILENAME,line) ) {                    # awk does this for you
    NR++;                                          # awk does this for you
    NF = split(line into $1, $2, $3, ... $NF);     # awk does this for you
    match(line,/^(\s*(\S)\s*)(.*)/,a);
    currHead = a[1];
    currChar = a[2];
    currTail = a[3];
    if (currChar == "#") { indent = currHead; }
    if (currChar != "#") { indent = (prevChar == "#" ? currHead : indent); }
    printf "%s%s\n", indent, currTail; prevChar = currChar;
}                                                  # awk does this for you
and in fact you can duplicate that syntax in awk's BEGIN section with:
BEGIN {
    filename = ARGV[1]
    ARGV[1] = ""
    ARGC--
    while ( (getline line < filename) > 0 ) {
        nr++
        nf = split(line,flds)
        match(line,/^(\s*(\S)\s*)(.*)/,a)
        currHead = a[1]
        currChar = a[2]
        currTail = a[3]
        if (currChar == "#") { indent = currHead }
        if (currChar != "#") { indent = (prevChar == "#" ? currHead : indent) }
        printf "%s%s\n", indent, currTail; prevChar = currChar
    }
}
but see http://awk.freeshell.org/AllAboutGetline for why not to do that unless you have a very specific need.
Answered By - Ed Morton