Friday, May 27, 2022

[SOLVED] select only a word that is part of colon

Issue

I have a text file that uses a markup language (similar to Wikipedia articles):

cat test.txt
This is a sample text having: colon in the text. and there is more [[in single or double: brackets]]. I need to select the first word only.
and second line with no [brackets] colon in it.

I need to select only the word "having:", because it is part of the regular text and not inside brackets. I tried:

grep -v '[*:*]' test.txt

This correctly avoids the tags, but it does not select the expected word.
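To make the behaviour of that attempt concrete, here is a small sketch. It assumes the two sample lines from the question are held in a variable called sample (my own name, used only for illustration) and that your grep supports -o, which GNU, BSD and busybox grep all do.

```shell
# Hypothetical variable mirroring the contents of test.txt from the question.
sample='This is a sample text having: colon in the text. and there is more [[in single or double: brackets]]. I need to select the first word only.
and second line with no [brackets] colon in it.'

# [*:*] is a bracket expression: it matches ONE character that is * or :.
# -v then drops every line containing such a character, so only line 2 is left.
printf '%s\n' "$sample" | grep -v '[*:*]'

# -o prints only the matched text, but grep alone cannot tell whether
# a match sits inside brackets, so this prints both having: and double:
printf '%s\n' "$sample" | grep -o '[^ ]*:'
```

So grep can either drop whole lines or pick out every colon word, but something has to keep track of the brackets.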


Solution

A combined solution using sed and awk:

sed 's/ /\n/g' test.txt | gawk 'i==0 && $0~/:$/{ print $0 }/\[/{ i++} /\]/ {i--}'
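  • sed replaces every space with a newline, so each word ends up on its own line
  • awk (or gawk) prints every line that matches /:$/ (ends with a colon), as long as i equals zero
  • The last part of the awk program tracks whether we are inside brackets: i is incremented for a line containing [ and decremented for a line containing ].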
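The same idea can be sketched in a slightly more defensive way. This is a variant, not the command above: it uses tr instead of \n in the sed replacement (which not every sed accepts), plain awk instead of gawk, and it counts every bracket in a token rather than one per line. The function name colon_words_outside is made up for this example.

```shell
# colon_words_outside: hypothetical helper name, not part of the original answer.
colon_words_outside() {
    # One token per line; -s squeezes runs of spaces into a single newline.
    tr -s ' ' '\n' |
    awk '
        # Token ends with a colon and we are outside all brackets: print it.
        depth == 0 && /:$/ { print }
        {
            # gsub returns the number of matches; replacing a bracket
            # with itself leaves the token unchanged.
            opens  = gsub(/\[/, "[")
            closes = gsub(/\]/, "]")
            depth += opens - closes
        }
    '
}

printf '%s\n' 'text having: colon [[in single or double: brackets]]. done' |
    colon_words_outside
```

Counting each bracket matters when the opening and closing brackets are spread unevenly over tokens, for example [[b ... d] ... f], where a one-per-line count would drop back to zero too early.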

Another solution using sed and grep:

sed -r -e 's/\[.*\]+//g' -e 's/ /\n/g' test.txt  | grep ':$'
  • 's/\[.*\]+//g' removes the text between brackets, brackets included. The .* is greedy, so the match runs from the first [ to the last ] on a line.
  • 's/ /\n/g' replaces every space with a newline
  • grep then keeps only the lines ending with :
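If a line can contain more than one bracketed group, a non-greedy variant may be safer. The following is only a sketch under that assumption: it swaps .* for [^]]* so that a match cannot run across a closing bracket, and it uses sed -E and tr. It does not handle brackets nested inside other brackets with text in between. The function name strip_then_grep is invented for this example.

```shell
# strip_then_grep: hypothetical helper name, not part of the original answer.
strip_then_grep() {
    # [^]]* cannot cross a closing bracket, so each bracketed group is
    # removed on its own and the text between two groups is kept.
    sed -E 's/\[+[^]]*\]+//g' |
    tr -s ' ' '\n' |
    grep ':$'
}

# word: sits between two bracketed groups and is still found.
printf '%s\n' 'see [one] word: and [[two: three]] end' | strip_then_grep
```

With the greedy pattern, everything from [one] up to the final ]] would be deleted in one go, taking word: with it.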

A third one, using only awk:

gawk '{ for (t=1;t<=NF;t++){ 
            if(i==0 && $t~/:$/) print $t; 
            i=i+gsub(/\[/,"",$t)-gsub(/\]/,"",$t) }}' test.txt
  • gsub returns the number of replacements it made.
  • The variable i holds the bracket nesting level. Every [ increments it by 1 and every ] decrements it by 1, which works because gsub(/\[/,"",$t) returns the number of replaced characters. For a token like [[][ the count is increased by (3-1=) 2.
  • When a token has brackets AND a colon, my code will fail: the token is tested for a trailing : before its brackets are counted, so it is printed whenever i is still zero at that point.
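One possible way around that last limitation is to count the brackets of a token first and test for the colon afterwards. The sketch below is my own reading of what "outside brackets" should mean: a token is printed only if the nesting level is zero both before and after it. The function name colon_words_charwise and the variable names are made up for the example.

```shell
# colon_words_charwise: hypothetical helper name, not part of the original answer.
colon_words_charwise() {
    awk '
    {
        for (t = 1; t <= NF; t++) {
            start = depth                  # nesting level before this token
            n = length($t)
            for (c = 1; c <= n; c++) {     # walk the token one character at a time
                ch = substr($t, c, 1)
                if (ch == "[") depth++
                else if (ch == "]") depth--
            }
            # Print only tokens that begin and end outside brackets
            # and whose last character is a colon.
            if (start == 0 && depth == 0 && $t ~ /:$/) print $t
        }
    }'
}

printf '%s\n' 'plain: [tag: inside: end]: after:' | colon_words_charwise
```

Here [tag: is skipped because the level is 1 once its bracket has been counted, and end]: is skipped because it starts inside the brackets. Whether a token such as [x]: should count as regular text is a judgment call; this sketch prints it.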


Answered By - Luuk
Answer Checked By - Clifford M. (WPSolving Volunteer)