Monday, November 27, 2023

[SOLVED] Basic parsing of braces in Bash/sed/AWK

Tags: awk, bash, parsing, sed

Issue

Being from a math background, I've successfully Pavlov's-dogged myself into only being able to write in (Plain) TeX, so I often use TeX to do a lot of non-math text editing that people more commonly use MS Word or Google Docs to do. (Once you are used to the elegance of TeX-produced PDFs the clunkiness of word-processor typesetting really becomes an eyesore.) The issue arises once in a while when I'm asked to submit something in .docx format, which usually requires me to copy my TeX code over to a Google doc, spend a few minutes to half an hour manually making changes, and then export that as a .docx file. I have a little Bash function in my .zshrc that I use to do some of the really easy stuff (European accents and such) that I don't want to do by hand:

tex2doc() {
    tr '\n' ' ' < "$1.tex" |
    sed 's/  /\n\t/g' |
    sed "s/\\\'e/é/g" |
    sed "s/\\\^o/ô/g" |
    sed "s/\\\'E/É/g" |
    sed 's/```/“‘/g' |
    sed "s/'''/’”/g" |
    sed 's/``/“/g' |
    sed "s/''/”/g" |
    sed 's/`/‘/g' |
    sed "s/'/’/g" |
    sed 's/\\\"//g' |
    sed 's/\\~n/ñ/g' |
    sed 's/\~/ /g' |
    sed 's/\\ / /g' |
    sed 's/\\-//g' |
    sed 's/\\c{c}/ç/g' |
    sed 's/\\\///g' |
    sed 's/---/—/g' |
    sed 's/--/–/g' > "$1.txt"
}

With this stuff out of the way, what I'm left with is the (computationally) more complex task of converting italics from the format {\it ... } to maybe something like Markdown, so that I can open the Markdown in some kind of Markdown reader and just copy-paste that. This would be easy if I only ever used italics, but sometimes I might use bold, in the format {\bf ... }, or small caps, in the format {\sc ... }. So the functionality I might like is the following:

Upon seeing {\it , replace it with _.
Upon seeing {\bf , replace it with **.
Upon seeing {\sc , replace it with nothing. (I don't mind not having small caps in my .docx output.)
Upon seeing }, replace it with _ if the last token encountered was a {\it , with a ** if we last saw a {\bf , or nothing if we last saw a {\sc . (And then pop the "stack", however that's implemented.)
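The four rules above amount to a small stack machine. A minimal sketch of that idea in AWK, wrapped in a shell function, might look like the following. The function name tex_braces_to_md is hypothetical, and the sketch makes a few assumptions that are not in the rules themselves: a single space after the tag is swallowed, plain { ... } groups keep their braces, and the stack carries over from one input line to the next.

```shell
# Hypothetical helper: reads TeX on stdin, writes Markdown-ish text on stdout.
tex_braces_to_md() {
    awk '
    BEGIN { closer["it"] = "_"; closer["bf"] = "**"; closer["sc"] = "" }
    {
        n = length($0)
        i = 1
        while (i <= n) {
            c      = substr($0, i, 1)
            head   = substr($0, i, 4)        # candidate opener such as {\it
            next_c = substr($0, i + 4, 1)    # character right after the tag
            if ((head == "{\\it" || head == "{\\bf" || head == "{\\sc") && next_c !~ /[A-Za-z]/) {
                tag = substr(head, 3)        # "it", "bf" or "sc"
                stack[++top] = closer[tag]   # remember what the matching } becomes
                printf "%s", closer[tag]
                i += 4
                if (next_c == " ") i++       # assumption: swallow one space after the tag
            } else if (c == "{") {
                stack[++top] = "}"           # plain group: keep its braces
                printf "{"
                i++
            } else if (c == "}" && top > 0) {
                printf "%s", stack[top--]    # pop and emit the matching closer
                i++
            } else {
                printf "%s", c               # ordinary character (or unmatched })
                i++
            }
        }
        printf "\n"
    }'
}
```

For example, feeding it `{\it foo {\bf bar} baz} and {\sc Abc}` prints `_foo **bar** baz_ and Abc`.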

Because this is a (very simple form of) parsing, I considered using a Python-based lexer and parser to parse everything and then output Markdown, but I feel that would take a while, and it has been a super long time since I used lex and yacc. It also feels like using a sledgehammer to hang up a picture. Is there a simple "Bash-esque" solution to this problem? I know sed by itself probably cannot solve this, since it's more for regexes, but maybe AWK might help. I've used AWK a bit in the past, but I remember its syntax being quite abstruse and difficult, and I'm not sure of its limitations, so I would appreciate any pointers! Thanks in advance :)

Solution

Assuming you have GNU sed and the original text doesn't contain a BEL character (ASCII 7):

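sed '
    s/{\\/\a/g                     # replace each {\ with BEL character (\a)
:a
    s/\ait *\([^}\a]*\)}/_\1_/g    # replace innermost italics (space after the tag is dropped)
    ta                             # something changed: start the loop again
    s/\abf *\([^}\a]*\)}/**\1**/g  # replace innermost bolds
    ta
    s/\asc *\([^}\a]*\)}/\1/g      # replace innermost small caps
    ta
    s/\a/{\\/g                     # restore any {\ that was not converted
'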

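This version supports nesting. Each s command only matches an innermost {\xx ... } group, that is, one whose contents contain neither another {\ nor a }. Whenever a substitution succeeds, ta jumps back to the :a label, so enclosing groups are converted on later passes once their inner groups are gone. Any {\ that does not start an it, bf, or sc group is restored by the final command.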
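To see the nesting behavior on a concrete input, here is a hedged variant of the same loop wrapped in a shell function. The name tex_fonts_to_md is hypothetical. Unlike the script above, this sketch generates the BEL character with printf rather than relying on sed's \a escape, passes each command with -e, and drops any spaces after the tag so that the Markdown markers sit directly against the text. Note that it sets a shell variable named bel as a side effect.

```shell
# Hypothetical wrapper: reads TeX on stdin, writes converted text on stdout.
tex_fonts_to_md() {
    bel=$(printf '\a')    # literal BEL, so the script does not need sed's \a escape
    sed -e "s/{\\\\/${bel}/g" \
        -e ':a' \
        -e "s/${bel}it *\([^}${bel}]*\)}/_\1_/g" \
        -e 'ta' \
        -e "s/${bel}bf *\([^}${bel}]*\)}/**\1**/g" \
        -e 'ta' \
        -e "s/${bel}sc *\([^}${bel}]*\)}/\1/g" \
        -e 'ta' \
        -e "s/${bel}/{\\\\/g"
}

# Nested input: the inner bold group is converted first, the outer italics on a later pass.
printf '%s\n' '{\it a {\bf b} c}' | tex_fonts_to_md
```

The last command prints `_a **b** c_`. A group with an unrecognized tag, such as `{\tt x}`, passes through unchanged because the final command puts the `{\` back.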

Answered By - M. Nejat Aydin

Answer Checked By - Candace Johnson (WPSolving Volunteer)

This answer was collected from Stack Overflow and tested by the PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
