Issue
I have the following dataset and I would like to implement an iteration that checks line by line(awk or for) in an awk file that after executing it in the following way:
gawk -f file.awk dataset.csv
Allow me to get a file with the records without duplicates and the float in the last column rounded to two decimals. Below, I attach a sample of my dataset and as you can see there should be only one record per country.
40462186,US,177827,7671,4395,190
2872296,US,273870,3492,95349,1216
45236699,US,265691,6874,5873,152
Since my level is not advanced, I don't mind if the code is long so I can familiarise myself with the steps the code goes through.
awk '{a[$1]++}END{for (i in a)if (a[i]>1)print i;}' file
I found that this command can help in such functionality it would be a shell script in not an awk script.
Thank you in advance for your help
Solution
Your original code:
awk '{a[$1]++}END{for (i in a)if (a[i]>1)print i;}' file
has the test inverted: a[i]>1
should be a[i]==1
to only print unique lines.
Some ways to implement truncation of n
to 2 decimal places are:
n = substr(n,1,match(n,/[.]/)+2)
n = sprintf("%0.2f",n)
So your script could be:
BEGIN { FS=OFS="," } # delimit columns by comma
# csv must not have embedded commas
NR==1 {print; next} # print header
{ $10 = sprintf("%0.2f", $10) } # truncate column 10
# rewrites $0 so uses OFS
{ a[$0]++ } # using $0 means entire line must be unique
END { for (i in a) if (a[i]==1) print i } # print unique lines
Given your comment about data cleansing, it would probably be better to use a two-pass approach: use your original code to alert you to erroneous input, then truncate in a separate pass.
Note that if a single column changes you will get what appears to be valid input. These lines are different:
Afghanistan,Afghanistan,AFG,40462186,Asia,177827,7671,4395,190,4.313743132
Afghanistan,Afghanistan,AFG,40462106,Asia,177827,7671,4395,190,4.313743132
I suppose you wanted to detect this, so your sanity checks need to be more sophisticated.
Answered By - jhnc Answer Checked By - Cary Denson (WPSolving Admin)