Issue
I got two unquoted and single column TSV files (exported from a database) with a few thousand people names and I need to find the names that appear in both files. Both files are UTF-8
, CRLF
terminated, and start with the BOM 0xEF 0xBB 0xBF
.
A simple join
or comm
command could have done the trick but there are a few differences in the names:
# cat file1.tsv
A. Einstein
Louis Pasteur
Diego Armando Maradona
Isaac Newton
Frava D’onä
D Rüge
Françoise Barré-Sinoussi
# cat file2.tsv
Diego Maradona
Albert Einstein
Francoise, BARRE SINOUSSI
Louis Pasteur
frava d'ona
Marie-Louise Von FRANZ
Dimitri Rüge
The expected matches in file2.tsv
would be:
Diego Maradona
Albert Einstein
Francoise, BARRE SINOUSSI
Louis Pasteur
frava d'ona
Dimitri Rüge
I've wrote this bash
sed
awk
grep
script that dynamically generates a regex for matching the last names:
#!/bin/bash
# U+0300 = 0xCC80 = 52352
# U+033F = 0xCCBF = 52415
# U+0340 = 0xCD80 = 52608
# U+036E = 0xCDAE = 52654
_COMBINING_CHARS_=()
for i in {52352..52415} {52608..52654}
do
hex=$(printf %04X "$i")
_COMBINING_CHARS_+=( "$(printf '\x'"${hex:0:2}"'\x'"${hex:2:2}")" )
done
_COMBINING_CHARS_ERE_=$(IFS='|'; printf %s "${_COMBINING_CHARS_[*]}")
# Function that removes the BOM, CRLF, and COMBINING characters:
sanitize() {
LANG=C sed -E \
-e $'1s/^\xEF\xBB\xBF//' \
-e $'s/\r$//' \
-e "s/$_COMBINING_CHARS_ERE_//g" \
-- "$@"
}
# Function that generates a regex for the _lastname_:
toERE() {
awk '
{
if ( $0 ~ /,/) {
n = split($0, a, ",");
$0 = a[n];
} else {
$0 = $NF
}
sub("^[[:space]]+","");
sub("[[:space]]+$","");
gsub("[[:space:]-]+"," ");
}
{
ere = ""
sep = "";
for ( nf = 1; nf <= NF; nf++ ) {
n = split($nf, c, "");
for ( i = 1; i <= n; i++ ) {
ere = ere "[[=" c[i] "=]]"
}
ere = sep ere
sep = "[[:space:]-]+"
}
print ere "[[:space:]]*$"
}
' < <(sanitize "$@")
}
grep -E -f <(toERE "$1") <(sanitize "$2")
Unfortunately, the result with the given input is:
grep: illegal byte sequence
UTF-8 multibyte characters seems to be the problem but I can't think of a way to handle it with awk
Solution
In the end I did the job with ruby
but I post an awk
solution nevertheless.
There was two problems:
POSIX
[= =]
doesn't work for diacritic charactersawk
doesn't know about multi-byte characters
Working around that is possible by converting the input files to ASCII. iconv
can do it somewhat accurately for Romanian characters, which is what I needed:
#!/bin/bash
to_ascii() {
case $(uname) in
Darwin)
iconv -f UTF-8 -t UTF-8-MAC "$@" |
iconv -f UTF-8 -t ASCII//TRANSLIT//IGNORE
;;
Linux)
iconv -f UTF-8 -t ASCII//TRANSLIT//IGNORE "$@"
;;
esac |
sed $'s/\r$//'
}
Now we just need to do a little normalization and find the perfect matches in the family names:
awk '
{
gsub("-+","-");
gsub("\27+","\27");
gsub("[.[:space:]]+"," ");
sub("^[[:space:]]+","");
sub("[[:space:]]+$","");
}
{
if ($0 ~ /,/) {
n = split($0,a,"[[:space:]]*,[[:space:]]*");
lastname = a[n];
} else {
lastname = $NF;
}
gsub("[-[:space:]]+"," ",lastname);
lastname = tolower(lastname);
}
FNR == NR {
keys[lastname] = $0;
next;
}
{
count = 0;
for (n in keys) {
if (n == lastname) {
matches[++count] = keys[n]
}
}
if (count > 0) {
print $0
for (i = 1; i <= count; i++) {
print "\t" matches[i]
}
}
}
' <(to_ascii "$2") <(to_ascii "$1")
output
A Einstein
Albert Einstein
Louis Pasteur
Louis Pasteur
Diego Armando Maradona
Diego Maradona
Frava D'ona
frava d'ona
D Ruge
Dimitri Ruge
Francoise Barre-Sinoussi
Francoise, BARRE SINOUSSI
Answered By - Fravadona Answer Checked By - David Marino (WPSolving Volunteer)