Monday, November 1, 2021

[SOLVED] How can I find both identical and similar strings in a particular field in a text file in Linux?

Issue

My apologies ahead of time - I'm not sure that there is an answer for this one using only Linux command-line fu. Please note I am not a programmer, but I have been playing around with bash and python a bit over the last few years.

I have a large text file with rows and columns that resemble the following (note - fields are separated with tabs):

1074    Beetle  OOB11061MNH 12/22/16    Confirmed   
3430    Hightop 0817BESTYET 08/07/17    Queued  
3431    Hightop 0817BESTYET 08/07/17    Queued  
3078    Copland 2017GENERAL 07/07/17    Confirmed   
3890    Bartok  FOODS   09/11/17    Confirmed
5440    Alphapha    00B1106IMNH 01/09/18    Queued  

What I want to do is find and output only those rows where the third field is either identical OR similar to another in the list. I don't really care whether the other fields are similar or not, but they should all be included in the output. By similar, I mean no more than [n] characters are different in that particular field (for example, no more than 3 characters are different). So the output I would want would be:

1074    Beetle  OOB11061MNH 12/22/16    Confirmed   
3430    Hightop 0817BESTYET 08/07/17    Queued  
3431    Hightop 0817BESTYET 08/07/17    Queued  
5440    Alphapha    00B1106IMNH 01/09/18    Queued  

The line beginning 1074 has a third field that differs in 3 characters from the one in the line beginning 5440, so both of them are included. 3430 and 3431 are included because their third fields are exactly identical. 3078 and 3890 are eliminated because their third fields are not similar to any others.
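In other words, what I'm after is essentially the Hamming distance between the field values. Here is a rough Python sketch of the comparison I have in mind (the function name is my own; extra characters on the longer string would count as differences):

```python
def hamming(x, y):
    # Count positions where the strings differ, character by character;
    # any difference in length counts as additional mismatches.
    n = min(len(x), len(y))
    diffs = sum(1 for i in range(n) if x[i] != y[i])
    return diffs + abs(len(x) - len(y))

# The two serial numbers from the example differ in exactly 3 positions:
print(hamming("OOB11061MNH", "00B1106IMNH"))  # 3
```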

Through googling the forums I've managed to piece together this rather longish pipeline to be able to find all of the instances where field 3 is exactly identical:

awk 'BEGIN { OFS = FS = "\t" }
{
  if (count[$3] > 1) print $0;                           # third or later occurrence: print it
  else if (count[$3] == 1) { print save[$3]; print $0 }  # second occurrence: print the saved first row, then this one
  else save[$3] = $0;                                    # first occurrence: save it in case a duplicate shows up
  count[$3]++
}' inputfile.txt > outputfile.txt

I must confess I don't really understand awk all that well; I'm just copying and adapting from the web. But that seemed to work great at finding exact duplicates (i.e., it would output only 3430 and 3431 above). But I have no idea how to approach trying to find strings that are not identical but that differ in no more than 3 places.
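For comparison, here is how I understand what that awk logic does, sketched in Python (two passes over the data instead of awk's single pass; the names are mine):

```python
from collections import Counter

def exact_duplicate_rows(lines):
    """Keep only rows whose third tab-separated field appears more than once."""
    rows = [line.rstrip("\n").split("\t") for line in lines]
    counts = Counter(row[2] for row in rows)
    return [row for row in rows if counts[row[2]] > 1]

data = [
    "1074\tBeetle\tOOB11061MNH\t12/22/16\tConfirmed",
    "3430\tHightop\t0817BESTYET\t08/07/17\tQueued",
    "3431\tHightop\t0817BESTYET\t08/07/17\tQueued",
]
for row in exact_duplicate_rows(data):
    print("\t".join(row))
```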

For instance, in my example above, it should match 1074 and 5440 because they would both fit the pattern: ??B1106?MNH

But I would want it to be able to match also any other random pattern of matches, as long as there are no more than three differences, like this: 20?7G?N?RAL

These differences could be arbitrarily in any position.

The reason for needing this is we are trying to find a way to automatically find typographical errors in a serial-number-like field. There might be a mis-key, or perhaps a letter "O" replaced with a number "0", or the like.

So... any ideas? Thanks for the help!


Solution

You can use this script:

 $ more hamming.awk

  # Hamming distance: count positions where x and y differ, character by
  # character; extra length on the longer string counts as differences.
  # Note: split(x, xs, "") splits into single characters in gawk.
  function hamming(x, y,    xs, ys, nx, mx, min, max, i, h) {
    if (x == y) return 0;
    nx = split(x, xs, "");
    mx = split(y, ys, "");
    min = nx < mx ? nx : mx;
    max = nx < mx ? mx : nx;
    h = 0;
    for (i = 1; i <= min; i++) if (xs[i] != ys[i]) h++;
    return h + (max - min);
  }
  BEGIN   { FS = OFS = "\t" }
  NR==FNR {                        # first pass: record which line numbers to print
      if ($3 in a) nrs[NR];        # exact duplicate of an earlier key
      for (k in a)
        if (hamming(k, $3) < 4) {  # distance 0..3: mark this line and the earlier one
           nrs[NR];
           nrs[a[k]];
        }
      a[$3] = NR;
      next
  }

  FNR in nrs                       # second pass: print the marked lines

Usage (file{,} is shell brace expansion for "file file"; the same file is read twice, once per pass):

$ awk -f hamming.awk file{,}

It's a double-scan algorithm: it computes the Hamming distance (the measure you described) between the keys. Note that it is an O(n^2) algorithm, so it may not be suitable for very large data sets. However, I'm not sure any other algorithm can do better for this problem.
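For reference, the same pairwise idea can be sketched in Python (assuming TAB-separated rows; the names below are my own, not taken from the awk script):

```python
def hamming(x, y):
    # Position-wise mismatches, plus any difference in length.
    n = min(len(x), len(y))
    return sum(1 for i in range(n) if x[i] != y[i]) + abs(len(x) - len(y))

def similar_rows(lines, max_diff=3):
    """Return rows whose third field is within max_diff of another row's."""
    rows = [line.rstrip("\n").split("\t") for line in lines]
    keep = set()
    for i in range(len(rows)):
        for j in range(i):
            if hamming(rows[i][2], rows[j][2]) <= max_diff:
                keep.update((i, j))
    return [rows[k] for k in sorted(keep)]
```

Like the awk version, this compares every pair of keys, so it is O(n^2) in the number of rows.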

NB: an additional note based on a comment I initially missed. This algorithm compares the keys character by character, so displacements (characters shifted by an insertion or deletion) won't be identified. For example, 123 and 23 will give a distance of 3.
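A quick check of that caveat, using the same position-wise comparison as the awk function above (sketched in Python):

```python
def hamming(x, y):
    # Position-wise mismatches, plus any difference in length.
    n = min(len(x), len(y))
    return sum(1 for i in range(n) if x[i] != y[i]) + abs(len(x) - len(y))

# "1" vs "2" and "2" vs "3" both mismatch, and the extra character adds one more:
print(hamming("123", "23"))  # 3
```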



Answered By - karakfa