Friday, April 15, 2022

[SOLVED] How to sed replace UTF-8 characters with HTML entities in another file?

Issue

I'm running Cygwin under Windows 10.

I have a dictionary file (1-dictionary.txt) that looks like this:

labelling   labeling
flavour flavor
colour  color
organisations   organizations
végétales   v&#x00E9;g&#x00E9;tales
contrôlée   contr&#x00F4;l&#x00E9;e
"   &#x0022;

The separators between the two columns are TABs (\t).

The dictionary file is encoded as UTF-8.

I want to replace words and symbols in the first column with words and HTML entities in the second column.

My source file (2-source.txt) has the target UTF-8 and ASCII symbols. The source file also is encoded as UTF-8.

Sample text looks like this:

Cultivar was coined by Bailey and it is generally regarded as a portmanteau of "cultivated" and "variety" ... The International Union for the Protection of New Varieties of Plants (UPOV - French: Union internationale pour la protection des obtentions végétales) offers legal protection of plant cultivars ...Terroir is the basis of the French wine appellation d'origine contrôlée (AOC) system

I run the following sed one-liner in a shell script (./3-script.sh):

sed -f <(sed -E 's_(.+)\t(.+)_s/\1/\2/g_' 1-dictionary.txt) 2-source.txt > 3-translation.txt

The substitution of English (en-GB) words with American (en-US) words in 3-translation.txt is successful.

However, the substitution of ASCII symbols, such as the quote symbol, and of UTF-8 words produces this result:

vvégétales#x00E9;gvégétales#x00E9;tales)
contrcontrôlée#x00F4;lcontrôlée#x00E9;e (AOC)

If I use only the specific symbol (not the full word) in the dictionary, I get results like this:

vé#x00E9;gé#x00E9;tales
"#x0022;cultivated"#x0022;
contrô#x00F4;lé#x00E9;e

The ASCII quote symbol is appended with &#x0022; - it is not replaced.

Similarly, the UTF-8 symbol is appended with its HTML entity - not replaced with the HTML entity.

The expected output would look like this:

v&#x00E9;g&#x00E9;tales
&#x0022;cultivated&#x0022;
contr&#x00F4;l&#x00E9;e

How can I modify the sed script so that the target ASCII and UTF-8 symbols are replaced with their HTML entity equivalents as defined in the dictionary file?


Solution

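I tried it: replacing every & with \& in your 1-dictionary.txt solves the problem. In the to part of sed's s command, an unescaped & stands for the whole matched text, which is why the match was kept and the rest of the entity was appended after it.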
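To see the difference in isolation, here is a minimal sketch using the quote symbol from the question. It assumes a sed that follows the usual s command rules (GNU sed on Cygwin does); the variable names broken and fixed are only for illustration.

```shell
# Unescaped &: sed expands it to the matched text, so the quote is kept
# and "#x0022;" is appended after it.
broken=$(printf '"cultivated"\n' | sed 's/"/&#x0022;/g')

# Escaped \&: sed inserts a literal ampersand, so the quote is replaced.
fixed=$(printf '"cultivated"\n' | sed 's/"/\&#x0022;/g')

printf '%s\n%s\n' "$broken" "$fixed"
```

The first line printed is "#x0022;cultivated"#x0022; and the second is &#x0022;cultivated&#x0022;.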

Sed's substitute command treats the from part as a regex, so when you generate commands like this, watch for regex metacharacters in the first column and put a \ in front of them so that they are matched literally.

The to part has special characters too, mainly \ and &. Put an extra \ in front of those as well.
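If you would rather leave 1-dictionary.txt untouched, the escaping can be done while the sed script is generated. The following is only a sketch under some assumptions: the temporary file names and the three-line sample dictionary are made up for the example, both columns are non-empty and contain no TAB, and the patterns are basic regular expressions (no -E). It also escapes the / delimiter, which would otherwise end the generated s command early.

```shell
#!/bin/sh
# Hypothetical scratch files; in the question these would be
# 1-dictionary.txt and 2-source.txt.
tmp=$(mktemp -d)
printf 'colour\tcolor\n"\t&#x0022;\né\t&#x00E9;\n' > "$tmp/dict.txt"

tab=$(printf '\t')
while IFS="$tab" read -r from to; do
    # From part: escape BRE metacharacters and the / delimiter.
    from=$(printf '%s\n' "$from" | sed 's|[][\\.*^$/]|\\&|g')
    # To part: escape \, & and the / delimiter.
    to=$(printf '%s\n' "$to" | sed 's|[\\&/]|\\&|g')
    printf 's/%s/%s/g\n' "$from" "$to"
done < "$tmp/dict.txt" > "$tmp/rules.sed"

# Apply the generated rules to a sample line.
result=$(printf 'the "colour" é\n' | sed -f "$tmp/rules.sed")
printf '%s\n' "$result"
rm -rf "$tmp"
```

For the sample line this prints the &#x0022;color&#x0022; &#x00E9;. The generated rules.sed can be passed to sed -f in place of the process substitution used in the question.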

The above refers to GNU sed's documentation; for other sed versions, you can also check man sed.



Answered By - Tiw
Answer Checked By - Mary Flores (WPSolving Volunteer)