Monday, January 8, 2024

[SOLVED] VLOOKUP using AWK to append multiple values as new columns in CSV file

January 08, 2024 awk, csv, unix

Issue

I have two CSV files without headers that dont exceed 1 GB. For example:

File1.csv with 4 columns:

1,75.10,"1","1"
2,12.01,"0","4"
4,26.57,"1","1"
3,100.94,"1","4"

File2.csv with 4 columns:

1,"D",0,5
1,"F",0,5
1,"N",1,3
2,"D",1,8
2,"F",0,6
2,"N",3,0
4,"I",,5
4,"F",2,9

Wherever there is a match between column 1 of File2.csv and column 1 of File1.csv, I want to append the 2nd, 3rd, and 4th columns of File1.csv to File2.csv – to create New.csv that has 7 columns.

Expected output (New.csv):

1,"D",0,5,75.10,"1","1"
1,"F",0,5,75.10,"1","1"
1,"N",1,3,75.10,"1","1"
2,"D",1,8,12.01,"0","4"
2,"F",0,6,12.01,"0","4"
2,"N",3,0,12.01,"0","4"
4,"I",,5,26.57,"1","1"
4,"F",2,9,26.57,"1","1"

Explaining the concepts behind each step will be very helpful for me because I am self-taught so far.

I am extremely new to AWK, and am having trouble finding a solution so far. I would use Excel, but my file sizes are too large and I'd like to keep learning about AWK.

(Edited) Below is as far as I got, through trial/error, where I've appended only the 2nd column from File1.csv to the end of File2.csv. I do not understand how the ternary portion works, so I would really appreciate an explanation.

awk -F "," 'FNR==NR {a[$1]=$2; next} {print $0","(($1 in a)?a[$1]:"NA")}' File1.csv File2.csv > New.csv

Solution

As your files are large, if they are not sorted already, the best is to first sort them (the GNU sort utility is smart enough to sort large files on memory-limited computers). Once the files are sorted you can use join (output prefixed with -| ):

join -t, <(sort -nst, -k1,1 File2.csv) <(sort -nst, -k1,1 File1.csv)
-| 1,"D",0,5,75.10,"1","1"
-| 1,"F",0,5,75.10,"1","1"
-| 1,"N",1,3,75.10,"1","1"
-| 2,"D",1,8,12.01,"0","4"
-| 2,"F",0,6,12.01,"0","4"
-| 2,"N",3,0,12.01,"0","4"
-| 4,"I",,5,26.57,"1","1"
-| 4,"F",2,9,26.57,"1","1"

If you absolutely need awk, again once the files are sorted, you will be able to read them in parallel (getline) and avoid storing large amounts of data in memory. Example (tested with GNU awk and the awk that comes with macOS, output prefixed with -| ):

sort -nst, -k1,1 File1.csv > File1.sorted.csv
sort -nst, -k1,1 File2.csv | awk -v r=1 '
  {
    f2 = t2 = $0; sub(/,.*/, "", f2); sub(/[^,]*,/, ",", t2)
    while(r == 1 && f1 < f2 && (r = (getline < "File1.sorted.csv")) == 1) {
      f1 = t1 = $0; sub(/,.*/, "", f1); sub(/[^,]*,/, ",", t1)
    }
    print f2 t2 ((f1 == f2) ? t1 : "")
  }'
-| 1,"D",0,5,75.10,"1","1"
-| 1,"F",0,5,75.10,"1","1"
-| 1,"N",1,3,75.10,"1","1"
-| 2,"D",1,8,12.01,"0","4"
-| 2,"F",0,6,12.01,"0","4"
-| 2,"N",3,0,12.01,"0","4"
-| 4,"I",,5,26.57,"1","1"
-| 4,"F",2,9,26.57,"1","1"

Read next line of sorted File2.csv, store the first field (before first comma) in f2 and the rest of the line (from first comma) in t2.
While f1 (0 at initialization) is strictly less than f2, read the next line of sorted File1.csv. If the read is successful store the first field in f1 and the rest of the line in t1.
Print the current line of File2.csv and, if the 2 first fields match, append the tail of the line of File1.csv (t1).

Note: we store the getline result in variable r (initialized to 1) and skip subsequent calls if it is not equal to 1 to avoid reading past the end of file, which is inefficient.

Answered By - Renaud Pacalet

Answer Checked By - Candace Johnson (WPSolving Volunteer)

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Monday, January 8, 2024

[SOLVED] VLOOKUP using AWK to append multiple values as new columns in CSV file

Issue

Solution

Popular Posts

Labels