Issue
Have a script that parses csv file but it does not work as needed Here is some content from csv file
id,location_id,name,title,email,department
1,1,Susan houston,Director of Services,,
2,1,Christina Gonzalez,Director,,
3,2,Brenda brown,"Director, Second Career Services",,
4,3,Howard Lader,"Manager, Senior Counseling",,
5,4,Kimberly Pesavento,Commercial director,,
6,5,Joe Bloom,Financial Empowerment Programs Program Director,,
7,6,peter Olson,Director,,
8,6,Bart charlow,Executive Director,,
9,7,Bart Charlow,Executive Director,,
#!/bin/bash
#create path to redirect accounts.csv to same directory as accounts_new.csv
path=$(dirname $1)
# Substituted commmas with vertical lines, so sed command works
awk -F'"' -v OFS='"' '{ for (i=2; i<=NF; i+=2) gsub(",", "|", $i) } 1' accounts.csv |
# Changed first letters of names to uppercase
awk -F, -v col=3 '
NR > 1{
n=split(tolower($col),a," ")
$col=toupper(substr(a[1],1,1)) substr(a[1],2)
for(i=2;i<=n;i++) {
$col=$col " " toupper(substr(a[i],1,1)) substr(a[i],2)
}
}1' OFS="," |
# Generated email addresses
sed -E 's/([^,]*,([^,]*),) ?(([[:alpha:]])[^ ]* +)(([^,]*),[^,]*,)[^,]*/\1\u\3\u\5\L\4\6\@abc.com/' |
awk -F, '{for (i=5;i<=5;i++){if (v[i,$i]++){b[$i]=$i;
$i=split($3,arr," ")
val=(substr($3,1,1) arr[2]$2"@abc.com")
$5=tolower(val)
}};print $0}' OFS="," |
# Added missing commas and sent output to new file
sed -E 's/\|/\,/g' > $path"/accounts_new.csv"
here is the output of the script
id,location_id,name,title,email,department
1,1,Susan Houston,Director of Services,[email protected],
2,1,Christina Gonzalez,Director,[email protected],
3,2,Brenda Brown,Director, Second Career Services,[email protected],
4,3,Howard Lader,Manager, Senior Counseling,[email protected],
5,4,Kimberly Pesavento,Commercial director,[email protected],
6,5,Joe Bloom,Financial Empowerment Programs Program Director,[email protected],
7,6,Peter Olson,Director,[email protected],
8,6,Bart Charlow,Executive Director,[email protected],
9,7,Bart Charlow,Executive Director,[email protected],
but desired output is this
id,location_id,name,title,email,department
1,1,Susan Houston,Director of Services,[email protected],
2,1,Christina Gonzalez,Director,[email protected],
3,2,Brenda Brown,Director, Second Career Services,[email protected],
4,3,Howard Lader,Manager, Senior Counseling,[email protected],
5,4,Kimberly Pesavento,Commercial director,[email protected],
6,5,Joe Bloom,Financial Empowerment Programs Program Director,[email protected],
7,6,Peter Olson,Director,[email protected],
8,6,Bart Charlow,Executive Director,[email protected],
9,7,Bart Charlow,Executive Director,[email protected],
As you can see from last 2 rows location_id is appended only the second equal email but the first one remains untouched, how can I add location_id to the first one also?
Can you help me? Thanks!
I tried to use while instead of if statement but it didn't help me
Solution
Assumptions/understandings:
- some input fields may be wrapped in double quotes; these double quotes are not to be removed (unlike OP's desired output)
- since commas show up as both delimiter and data we'll temporarily convert the data instances to another character that does not show up in the data; for this exercise we'll use the pipe character (
|
), otherwise we'll need to pick another substitution character - all
name
fields consist of exactly two space-delimited strings (ie, don't have to worry about names likeBob B Burns III
norLastname, Firstname
) - both of the space-delimited strings in the
name
field are to have just the initial character uppercased (eg, we don't have to worry about names with mixed case likedePalma
orMcDonald
) - if we end up with duplicate email accounts the
location_id
is guaranteed to be unique - input file can fit into memory (as
awk
array)
Once awk
is pulled into the mix there's rarely a need for sed
nor additional awk
scripts, ie, we should be able to perform the entire operation with a single awk
script.
One awk
idea:
awk '
BEGIN { FS="\""; OFS="," } # input is delimited by double qutoes
NR==1 { print; next }
{ line=""
for (i=1;i<NF;i+=2) { # loop through odd numbered fields
gsub(/,/,"|",$(i+1)) # in even numbered double-quote-delimited fields replace commas with pipes
line=line $i FS $(i+1) FS # rebuild the current line
}
line=line $NF # add last field to new line
split(line,a,",") # split new line on commas
split(tolower(a[3]),b,/[[:space:]]+/) # split tolower(name field) on white space
# rebuild name with first characters of first/last names uppercased
name=toupper(substr(b[1],1,1)) substr(b[1],2) " " toupper(substr(b[2],1,1)) substr(b[2],2)
acct=substr(b[1],1,1) b[2] # build email acct name
lines[NR]=a[1] OFS a[2] OFS name OFS a[4] # rebuild current line based on first 4 fields
locid[NR]=a[2] # make note of location_id for current line
email[NR]=acct # make note of email acct for current line
count[acct]++ # keep count of number of times we see this email acct
}
END { for (i=2;i<=NR;i++) { # loop through our lines of output
gsub(/\|/,",",lines[i]) # replace pipes with original commas
# print final line of output; if email acct has been seen more than once then append the location_id to the email acct; add the "@abc.com" domain and the trailing comma
print lines[i] OFS email[i] (count[email[i]] > 1 ? locid[i] : "") "@abc.com" OFS
}
}
' accounts.csv
This generates:
id,location_id,name,title,email,department
1,1,Susan Houston,Director of Services,[email protected],
2,1,Christina Gonzalez,Director,[email protected],
3,2,Brenda Brown,"Director, Second Career Services",[email protected],
4,3,Howard Lader,"Manager, Senior Counseling",[email protected],
5,4,Kimberly Pesavento,Commercial director,[email protected],
6,5,Joe Bloom,Financial Empowerment Programs Program Director,[email protected],
7,6,Peter Olson,Director,[email protected],
8,6,Bart Charlow,Executive Director,[email protected],
9,7,Bart Charlow,Executive Director,[email protected],
Answered By - markp-fuso Answer Checked By - Cary Denson (WPSolving Admin)