Issue
How do I match the corpus file with second column in stem and return the first column?
corpus.txt
this
is
broken
testing
as
told
Only the fist 2 columns are important in this file:
stem.csv
"test";"tested";"test";"Suffix";"A";"7673";"321: 0 xxx"
"test";"testing";"test";"Suffix";"A";"7673";"322: 0 xxx"
"test";"tests";"test";"Suffix";"b";"5942";"001: 0 xxx"
"break";"broke";"break";"Suffix";"b";"5942";"002: 0 xxx"
"break";"broken";"break";"Suffix";"b";"5942";"003: 0 xxx"
"break";"breaks";"break";"Suffix";"c";"5778";"001: 0 xxx"
"tell";"told";"tell";"Suffix";"c";"5778";"002: 0 xx"
If the word is missing in the stem file, it should be replaced with XXX
expected.txt
XXX
XXX
break
test
XXX
tell
It can be done using SQL queries like this...
CREATE TABLE `stem` (
`column1` varchar(100) DEFAULT NULL,
`column2` varchar(100) DEFAULT NULL
) ;
INSERT INTO `stem` VALUES ('break','broken'),('break','breaks'),('test','tests');
CREATE TABLE `corpus` (
`column1` varchar(100) DEFAULT NULL
)
INSERT INTO `corpus` VALUES ('tests'),('xyz');
_____
mysql> select ifnull(b.column1, 'XXX') as result from corpus as a left join stem as b on a.column1 = b.column2;
+--------+
| result |
+--------+
| test |
| XXX |
+--------+
But I am looking for a way to process text files directly so that I do not need to import them in mysql.
Solution
Using awk:
$ awk -F';' ' # delimiter
NR==FNR { # process the stem file
gsub(/"/,"") # off with the double quotes
a[$2]=$1 # hash
next
}
{
if($1 in a) # if corpus entry found in stem
print a[$1] # output
else
print "XXX"
}' stem corpus
Output:
XXX
XXX
break
test
XXX
tell
Answered By - James Brown