Issue
Being from a math background, I've successfully Pavlov's-dogged myself into only being able to write in (Plain) TeX, so I often use TeX to do a lot of non-math text editing that people more commonly use MS Word or Google Docs to do. (Once you are used to the elegance of TeX-produced PDFs, the clunkiness of word-processor typesetting really becomes an eyesore.) The issue arises once in a while when I'm asked to submit something in .docx format, which usually requires me to copy my TeX code over to a Google Doc, spend a few minutes to half an hour manually making changes, and then export that as a .docx file. I have a little Bash function in my .zshrc that I use to do some of the really easy stuff (European accents and such) that I don't want to do by hand:
tex2doc() {
(tr '\n' ' ' < $1.tex) |
sed 's/  /\n\t/g' | # two spaces (a former blank line) start a new indented paragraph
sed "s/\\\'e/é/g" |
sed "s/\\\^o/ô/g" |
sed "s/\\\'E/É/g" |
sed 's/```/“‘/g' |
sed "s/'''/’”/g" |
sed 's/``/“/g' |
sed "s/''/”/g" |
sed 's/`/‘/g' |
sed "s/'/’/g" |
sed 's/\\\"//g' |
sed 's/\\~n/ñ/g' |
sed 's/\~/ /g' |
sed 's/\\ / /g' |
sed 's/\\-//g' |
sed 's/\\c{c}/ç/g' |
sed 's/\\\///g' |
sed 's/---/—/g' |
sed 's/--/–/g' > $1.txt
}
With this stuff out of the way, what I'm left with is the (computationally) more complex task of converting italics from the format {\it ... } to maybe something like Markdown, so that I can open the Markdown in some kind of Markdown reader and just copy-paste that. This would be easy if I only ever used italics, but sometimes I might use bold, in the format {\bf ... }, or small caps, in the format {\sc ... }. So the functionality I might like is the following:
- Upon seeing {\it, replace it with _.
- Upon seeing {\bf, replace it with **.
- Upon seeing {\sc, replace it with nothing. (I don't mind not having small caps in my .doc output.)
- Upon seeing }, replace it with _ if the last token encountered was a {\it, with a ** if we last saw a {\bf, or nothing if we last saw a {\sc. (And then pop the "stack", however that's implemented.)
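The four rules above map fairly directly onto an explicit stack of closing delimiters. The following is a minimal sketch in AWK wrapped in a shell function, offered as one possible reading of those rules rather than a definitive converter. The name tex_groups_to_md is hypothetical, and the sketch makes two assumptions that are not in the rules themselves: a single space right after {\it, {\bf or {\sc is dropped, and plain { ... } groups are passed through unchanged so that their closing brace does not pop a font group. Escaped braces such as \{ are not handled.

```shell
# Hypothetical helper: reads TeX on stdin, writes Markdown-ish text on stdout.
tex_groups_to_md() {
  awk '
    BEGIN {
      # Closing delimiter to emit for each opening token.
      close_for["{\\it"] = "_"
      close_for["{\\bf"] = "**"
      close_for["{\\sc"] = ""
      depth = 0                      # stack height; persists across lines
    }
    {
      out = ""
      n = length($0)
      i = 1
      while (i <= n) {
        tok = substr($0, i, 4)       # candidate token such as {\it
        nxt = substr($0, i + 4, 1)   # character following the candidate
        ch  = substr($0, i, 1)
        if ((tok in close_for) && nxt !~ /[A-Za-z]/) {
          stack[++depth] = close_for[tok]   # remember how to close this group
          out = out close_for[tok]
          i += 4
          if (substr($0, i, 1) == " ") i++  # assumption: drop one space after the token
        } else if (ch == "{") {
          stack[++depth] = "}"              # plain group: closes with a literal brace
          out = out ch
          i++
        } else if (ch == "}" && depth > 0) {
          out = out stack[depth--]          # pop and emit the matching closer
          i++
        } else {
          out = out ch
          i++
        }
      }
      print out
    }
  '
}

printf '%s\n' 'a {\it b {\bf c} d} e {\sc F}' | tex_groups_to_md
# prints: a _b **c** d_ e F
```

Because the stack lives outside the per-line block, a group opened on one line and closed on a later one is still matched.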
Because this is a (very simple form of) parsing, I considered using a Python-based lexer and parser to parse everything and then output Markdown, which I feel would take a while, and also it has been a super long time since I used lex and yacc. This also feels like using a sledgehammer to hang up a picture. Is there a simple "Bash-esque" solution to this problem? I know sed itself probably cannot solve this, since it's more for regexes, but maybe AWK might help. I've used AWK a bit in the past but I remember its syntax being quite obtuse and difficult, and also I'm not sure of its limitations, so I would appreciate any pointers! Thanks in advance :)
Solution
Assuming you have GNU sed and the original text doesn't contain a BEL character (ASCII 7):
sed '
s/{\\/\a/g # replace each {\ with BEL character (\a)
:a
s/\ait\([^}\a]*\)}/_\1_/g # replace italics
ta
s/\abf\([^}\a]*\)}/**\1**/g # replace bolds
ta
s/\asc\([^}\a]*\)}/\1/g # replace small caps
ta
s/\a/{\\/g
'
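This version supports nesting. Each s command replaces the inner-most {\xx ... } group, since [^}\a]* cannot span a closing brace or another BEL marker; the ta branches restart the loop until no substitution succeeds, so enclosing groups are converted on later passes.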
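To try the script on nested input, it can be wrapped in a shell function that reads stdin. The sketch below is a small variation, not the script exactly as posted: the function name tex_fonts_to_md is hypothetical, and each pattern gains a " *" after the control word so that the space following {\it, {\bf or {\sc is dropped. Without that addition, {\it foo} becomes "_ foo_", which CommonMark-style renderers generally do not treat as emphasis. It still assumes GNU sed and input free of BEL characters, and since sed works line by line, a group has to open and close on the same line.

```shell
# Hypothetical wrapper around the sed script; assumes GNU sed.
tex_fonts_to_md() {
  sed '
    s/{\\/\a/g                       # mark every {\ with a BEL character
    :a
    s/\ait *\([^}\a]*\)}/_\1_/g      # innermost italics
    ta
    s/\abf *\([^}\a]*\)}/**\1**/g    # innermost bold
    ta
    s/\asc *\([^}\a]*\)}/\1/g        # innermost small caps
    ta
    s/\a/{\\/g                       # restore any {\ that was not converted
  '
}

printf '%s\n' 'a {\it b {\bf c} d} e {\sc F}' | tex_fonts_to_md
# prints: a _b **c** d_ e F
```

On this input the bold group is converted first, because the italic group still contains a BEL marker on the first pass; the italic group is converted once the loop restarts.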
Answered By - M. Nejat Aydin Answer Checked By - Candace Johnson (WPSolving Volunteer)