Monday, January 22, 2024

[SOLVED] Bash script Find Duplicate Values and modify them

January 22, 2024 awk, bash, csv, parsing, sed

Issue

Have a script that parses csv file but it does not work as needed Here is some content from csv file

id,location_id,name,title,email,department
1,1,Susan houston,Director of Services,,
2,1,Christina Gonzalez,Director,,
3,2,Brenda brown,"Director, Second Career Services",,
4,3,Howard Lader,"Manager, Senior Counseling",,
5,4,Kimberly Pesavento,Commercial director,,
6,5,Joe Bloom,Financial Empowerment Programs Program Director,,
7,6,peter Olson,Director,,
8,6,Bart charlow,Executive Director,,
9,7,Bart Charlow,Executive Director,,

#!/bin/bash

#create path to redirect accounts.csv to same directory as accounts_new.csv
path=$(dirname $1)


# Substituted commmas with vertical lines, so sed command works 
awk -F'"' -v OFS='"' '{ for (i=2; i<=NF; i+=2) gsub(",", "|", $i) } 1' accounts.csv |

# Changed first letters of names to uppercase
awk -F, -v col=3 ' 
NR > 1{
  n=split(tolower($col),a," ")
  $col=toupper(substr(a[1],1,1)) substr(a[1],2)
  for(i=2;i<=n;i++) {
    $col=$col " " toupper(substr(a[i],1,1)) substr(a[i],2)
  }
}1' OFS="," |

# Generated email addresses
sed -E 's/([^,]*,([^,]*),) ?(([[:alpha:]])[^ ]* +)(([^,]*),[^,]*,)[^,]*/\1\u\3\u\5\L\4\6\@abc.com/' |

awk -F, '{for (i=5;i<=5;i++){if (v[i,$i]++){b[$i]=$i; 
  $i=split($3,arr," ")
  val=(substr($3,1,1) arr[2]$2"@abc.com")
  $5=tolower(val)
  }};print $0}' OFS="," |

# Added missing commas and sent output to new file
sed -E 's/\|/\,/g' > $path"/accounts_new.csv"

here is the output of the script

id,location_id,name,title,email,department
1,1,Susan Houston,Director of Services,[email protected],
2,1,Christina Gonzalez,Director,[email protected],
3,2,Brenda Brown,Director, Second Career Services,[email protected],
4,3,Howard Lader,Manager, Senior Counseling,[email protected],
5,4,Kimberly Pesavento,Commercial director,[email protected],
6,5,Joe Bloom,Financial Empowerment Programs Program Director,[email protected],
7,6,Peter Olson,Director,[email protected],
8,6,Bart Charlow,Executive Director,[email protected],
9,7,Bart Charlow,Executive Director,[email protected],

but desired output is this

id,location_id,name,title,email,department
1,1,Susan Houston,Director of Services,[email protected],
2,1,Christina Gonzalez,Director,[email protected],
3,2,Brenda Brown,Director, Second Career Services,[email protected],
4,3,Howard Lader,Manager, Senior Counseling,[email protected],
5,4,Kimberly Pesavento,Commercial director,[email protected],
6,5,Joe Bloom,Financial Empowerment Programs Program Director,[email protected],
7,6,Peter Olson,Director,[email protected],
8,6,Bart Charlow,Executive Director,[email protected],
9,7,Bart Charlow,Executive Director,[email protected],

As you can see from last 2 rows location_id is appended only the second equal email but the first one remains untouched, how can I add location_id to the first one also?

Can you help me? Thanks!

I tried to use while instead of if statement but it didn't help me

Solution

Assumptions/understandings:

some input fields may be wrapped in double quotes; these double quotes are not to be removed (unlike OP's desired output)
since commas show up as both delimiter and data we'll temporarily convert the data instances to another character that does not show up in the data; for this exercise we'll use the pipe character (|), otherwise we'll need to pick another substitution character
all name fields consist of exactly two space-delimited strings (ie, don't have to worry about names like Bob B Burns III nor Lastname, Firstname)
both of the space-delimited strings in the name field are to have just the initial character uppercased (eg, we don't have to worry about names with mixed case like dePalma or McDonald)
if we end up with duplicate email accounts the location_id is guaranteed to be unique
input file can fit into memory (as awk array)

Once awk is pulled into the mix there's rarely a need for sed nor additional awk scripts, ie, we should be able to perform the entire operation with a single awk script.

One awk idea:

awk '
BEGIN { FS="\""; OFS="," }                              # input is delimited by double qutoes
NR==1 { print; next }
      { line=""
        for (i=1;i<NF;i+=2) {                           # loop through odd numbered fields
            gsub(/,/,"|",$(i+1))                        # in even numbered double-quote-delimited fields replace commas with pipes
            line=line $i FS $(i+1) FS                   # rebuild the current line
        }
        line=line $NF                                   # add last field to new line

        split(line,a,",")                               # split new line on commas
        split(tolower(a[3]),b,/[[:space:]]+/)           # split tolower(name field) on white space

        # rebuild name with first characters of first/last names uppercased

        name=toupper(substr(b[1],1,1)) substr(b[1],2) " " toupper(substr(b[2],1,1)) substr(b[2],2)

        acct=substr(b[1],1,1) b[2]                      # build email acct name

        lines[NR]=a[1] OFS a[2] OFS name OFS a[4]       # rebuild current line based on first 4 fields
        locid[NR]=a[2]                                  # make note of location_id for current line
        email[NR]=acct                                  # make note of email acct for current line
        count[acct]++                                   # keep count of number of times we see this email acct
      }

END   { for (i=2;i<=NR;i++) {                           # loop through our lines of output
            gsub(/\|/,",",lines[i])                     # replace pipes with original commas

            # print final line of output; if email acct has been seen more than once then append the location_id to the email acct; add the "@abc.com" domain and the trailing comma

            print lines[i] OFS email[i] (count[email[i]] > 1 ? locid[i] : "") "@abc.com" OFS
        }
      }
' accounts.csv

This generates:

id,location_id,name,title,email,department
1,1,Susan Houston,Director of Services,[email protected],
2,1,Christina Gonzalez,Director,[email protected],
3,2,Brenda Brown,"Director, Second Career Services",[email protected],
4,3,Howard Lader,"Manager, Senior Counseling",[email protected],
5,4,Kimberly Pesavento,Commercial director,[email protected],
6,5,Joe Bloom,Financial Empowerment Programs Program Director,[email protected],
7,6,Peter Olson,Director,[email protected],
8,6,Bart Charlow,Executive Director,[email protected],
9,7,Bart Charlow,Executive Director,[email protected],

Answered By - markp-fuso

Answer Checked By - Cary Denson (WPSolving Admin)

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Monday, January 22, 2024

[SOLVED] Bash script Find Duplicate Values and modify them

Issue

Solution

Popular Posts

Labels