Issue
I have a semicolon-separated CSV file containing a Danish lexicon, from which I need to extract the stems and suffixes. It has to be done in AWK.
File:
adelig;adelig;adj.;1
adelig;adelige;adj.;2
adelig;adeligt;adj.;3
adelig;adeligst;adj.;5
voksen;voksen;adj.;1
voksen;voksne;adj.;2
voksen;voksent;adj.;3
voksen;voksnest;adj.;5
virkemiddel;virkemiddel;sb.;1
virkemiddel;virkemidlet;sb.;2
virkemiddel;virkemidlets;sb.;3
virkemiddel;virkemiddels;sb.;4
virkemiddel;virkemidlerne;sb.;5
virkemiddel;virkemidlernes;sb.;6
virkemiddel;virkemiddel;sb.;7
virkemiddel;virkemidler;sb.;7
virkemiddel;virkemiddels;sb.;8
virkemiddel;virkemidlers;sb.;8
Expected output:
adelig;adelig; ,e,t,*,st
voksen;voks; ,ne,ent,*,nest
virkemiddel;virkemid ,let,lets,dels,lerne,lernes,del;ler,dels;lers
Column four is the form number. When a form is missing, its suffix is replaced by an asterisk, as in adelig;adelig; ,e,t,*,st (form 4 is absent).
If a form number is repeated, its suffixes are separated by a semicolon, as in virkemiddel;virkemid ,let,lets,dels,lerne,lernes,del;ler,dels;lers (forms 7 and 8 each occur twice).
I started writing the code below, but I can't work out an algorithm that handles more than one possible stem, as in the case of virkemiddel:
BEGIN{
    FS=";"
}
{
    lemm=$1;
    form=$2;
    if(match(form, lemm) > 0)
    {
        root=lemm;
        sub(root,"",form);
        suf[$1]=suf[$1]","form;
    }
    else
    {
        split($1,a,"");
        split($2,b,"");
        s="";
        for(i in a)
        {
            if(b[i]!=a[i])
            {
                break;
            }
            s = s "" a[i];
        }
    }
    root=s;
}
Solution
Here's some awk code that finds the length of the common prefix and builds the list of suffixes. It does not handle missing forms or repeated form numbers, but it should give you a start:
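#!/usr/bin/gawk -f
BEGIN { FS = OFS = ";" }

# Collect every form of a lemma into one FS-separated string.
{ words[$1] = words[$1] FS $2 }

END {
    # Note: "for (word in words)" visits lemmas in an unspecified order.
    for (word in words) {
        sub("^"FS, "", words[word])    # drop the leading separator
        num_words = split(words[word], these_words)
        prefix_length = common_prefix_length(these_words, num_words)
        suffixes = ""
        sep = ""
        for (i=1; i<=num_words; i++) {
            suffixes = suffixes sep substr(these_words[i], prefix_length+1)
            sep = ","
        }
        print word, substr(these_words[1], 1, prefix_length), suffixes
    }
}

# Length of the longest prefix shared by w[1..n].
# i, j, minlen and char are local variables.
function common_prefix_length(w, n,    i, j, minlen, char) {
    # The prefix cannot be longer than the shortest word.
    minlen = length(w[1])
    for (i=2; i<=n; i++)
        if (length(w[i]) < minlen)
            minlen = length(w[i])
    # Compare character by character; stop at the first mismatch.
    for (i=1; i <= minlen; i++) {
        char = substr(w[1], i, 1)
        for (j=2; j <= n; j++)
            if (substr(w[j], i, 1) != char)
                return i-1
    }
    return minlen
}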
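Given your input, the output is (the order of the lemmas may vary, because awk does not specify the traversal order of for-in loops):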
voksen;voks;en,ne,ent,nest
virkemiddel;virkemid;del,let,lets,dels,lerne,lernes,del,ler,dels,lers
adelig;adelig;,e,t,st
Answered By - glenn jackman Answer Checked By - Robin (WPSolving Admin)