Issue
I have two CSV files without headers that dont exceed 1 GB. For example:
1,75.10,"1","1"
2,12.01,"0","4"
4,26.57,"1","1"
3,100.94,"1","4"
File2.csv with 4 columns:
1,"D",0,5
1,"F",0,5
1,"N",1,3
2,"D",1,8
2,"F",0,6
2,"N",3,0
4,"I",,5
4,"F",2,9
Wherever there is a match between column 1 of File2.csv and column 1 of File1.csv, I want to append the 2nd, 3rd, and 4th columns of File1.csv to File2.csv – to create New.csv that has 7 columns.
Expected output (New.csv):
1,"D",0,5,75.10,"1","1"
1,"F",0,5,75.10,"1","1"
1,"N",1,3,75.10,"1","1"
2,"D",1,8,12.01,"0","4"
2,"F",0,6,12.01,"0","4"
2,"N",3,0,12.01,"0","4"
4,"I",,5,26.57,"1","1"
4,"F",2,9,26.57,"1","1"
Explaining the concepts behind each step will be very helpful for me because I am self-taught so far.
I am extremely new to AWK, and am having trouble finding a solution so far. I would use Excel, but my file sizes are too large and I'd like to keep learning about AWK.
(Edited) Below is as far as I got, through trial/error, where I've appended only the 2nd column from File1.csv to the end of File2.csv. I do not understand how the ternary portion works, so I would really appreciate an explanation.
awk -F "," 'FNR==NR {a[$1]=$2; next} {print $0","(($1 in a)?a[$1]:"NA")}' File1.csv File2.csv > New.csv
Solution
As your files are large, if they are not sorted already, the best is to first sort them (the GNU sort
utility is smart enough to sort large files on memory-limited computers). Once the files are sorted you can use join
(output prefixed with -|
):
join -t, <(sort -nst, -k1,1 File2.csv) <(sort -nst, -k1,1 File1.csv)
-| 1,"D",0,5,75.10,"1","1"
-| 1,"F",0,5,75.10,"1","1"
-| 1,"N",1,3,75.10,"1","1"
-| 2,"D",1,8,12.01,"0","4"
-| 2,"F",0,6,12.01,"0","4"
-| 2,"N",3,0,12.01,"0","4"
-| 4,"I",,5,26.57,"1","1"
-| 4,"F",2,9,26.57,"1","1"
If you absolutely need awk
, again once the files are sorted, you will be able to read them in parallel (getline
) and avoid storing large amounts of data in memory. Example (tested with GNU awk
and the awk
that comes with macOS, output prefixed with -|
):
sort -nst, -k1,1 File1.csv > File1.sorted.csv
sort -nst, -k1,1 File2.csv | awk -v r=1 '
{
f2 = t2 = $0; sub(/,.*/, "", f2); sub(/[^,]*,/, ",", t2)
while(r == 1 && f1 < f2 && (r = (getline < "File1.sorted.csv")) == 1) {
f1 = t1 = $0; sub(/,.*/, "", f1); sub(/[^,]*,/, ",", t1)
}
print f2 t2 ((f1 == f2) ? t1 : "")
}'
-| 1,"D",0,5,75.10,"1","1"
-| 1,"F",0,5,75.10,"1","1"
-| 1,"N",1,3,75.10,"1","1"
-| 2,"D",1,8,12.01,"0","4"
-| 2,"F",0,6,12.01,"0","4"
-| 2,"N",3,0,12.01,"0","4"
-| 4,"I",,5,26.57,"1","1"
-| 4,"F",2,9,26.57,"1","1"
- Read next line of sorted
File2.csv
, store the first field (before first comma) inf2
and the rest of the line (from first comma) int2
. - While
f1
(0 at initialization) is strictly less thanf2
, read the next line of sortedFile1.csv
. If the read is successful store the first field inf1
and the rest of the line int1
. - Print the current line of
File2.csv
and, if the 2 first fields match, append the tail of the line ofFile1.csv
(t1
).
Note: we store the getline
result in variable r
(initialized to 1) and skip subsequent calls if it is not equal to 1 to avoid reading past the end of file, which is inefficient.
Answered By - Renaud Pacalet Answer Checked By - Candace Johnson (WPSolving Volunteer)