Issue
I'm trying to compute the similarity between two sentences in a shell script.
I have two sentences that share some words. For example, the input data in the file my_text.txt is:
Shell Script.
Linux Shell Script.
The intersection of both sentences:
Shell
Script
The size of the union of both sentences:
3
The correct output for the similarity of these two sentences (2 shared words / 3 distinct words):
0.66666666666666666666
The similarity is defined as the number of words in the intersection of the two sentences divided by the size of the union of the two sentences.
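This definition corresponds to what is usually called the Jaccard index. As a sketch, assuming A and B denote the sets of distinct words in the first and second sentence (symbols introduced here only for illustration), it can be written as:

```latex
\[
  \operatorname{sim}(A, B) = \frac{|A \cap B|}{|A \cup B|}
\]
```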
The problem: I have searched a lot for a shell script that does this, but I have not found a solution.
Solution
The following script should do the trick. It also ignores words duplicated within a sentence, filler words, and non-alphabetic characters, as you described in the comment section.
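# Lowercase the text, emit one "line:word" entry per word, drop filler words,
# deduplicate words within each sentence, then keep only the sorted words.
words=$(
    < my_text.txt tr 'A-Z' 'a-z' |
    grep -Eon '\b[a-z]*\b' |
    grep -Fwvf <(printf '%s\n' is a to be by the and for) |
    sort -u | cut -d: -f2 | sort
)
# Every distinct word counts toward the union.
union=$(uniq <<< "$words" | wc -l)
# A word that still occurs twice appears in both sentences.
intersection=$(uniq -d <<< "$words" | wc -l)
echo "similarity is $(bc -l <<< "$intersection/$union")"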
For the two example sentences above (2 shared words out of 3 distinct words), the script prints "similarity is .66666666666666666666" (= 2/3).
Answered By - Socowi