Issue
I'm attempting a scaled-down approach to reading a LARGE (>150GB) CSV file into an R script by chunking it into smaller pieces that can be read in sequentially.
Problem is, one of the column variables is kind of a nested cell, similar to:
ID1,var1,var2,var3,var4,var5,"[[intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6]]"
ID2,var1,var2,var3,var4,var5,"[[intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6]]"
ID3,var1,var2,var3,var4,var5,"[[intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6]]"
I've had some success getting just the nested cell values by running:
cat file.csv | cut -d'"' -f2
But this generates ALL the values in the final column at once. What I would like is to pull out each occurrence sequentially (e.g. everything between the "[[ and ]]" for one ID) and flatten it into a row/vector of some kind in a single file via >>
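On the sample rows above, for instance, that command prints the whole bracketed field for every ID in one go, one field per line (only the ID1 row's output is reproduced here):

[[intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6]]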
I tried variations of this solution: "How to print a single cell in a csv file using bash script or awk".
But it looks like there are some returns in the data that are preventing it from working correctly (everything comes back as either just the first line via head, or blanks).
I'm sure there's a sed, awk, or grep call that can handle this but I'm drawing a blank.
Edit: It has been brought to my attention that it is unclear what I'm asking for. The short answer is that I want to extract everything between the two " characters for a single line/entry in the CSV, so that I can pipe output like:
ID3,var1,var2,var3,var4,var5,"[[intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6]]"
to its own file, with a name containing some variation of ID3.
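For the record, on rows with no embedded returns, even a simple quote-split does what I'm after; a rough sketch (the output_ID.txt naming, and the assumption of exactly one quoted field per record, are just for illustration):

awk -F'"' '{
    id  = substr($1, 1, index($1, ",") - 1)   # the ID is everything before the first comma
    out = "output_" id ".txt"
    print $2 > out                            # the bracketed field between the quotes
    close(out)                                # keep the number of open file handles bounded
}' file.csv

It's the records with returns inside the quoted field that break this kind of line-oriented approach.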
Solution
For non-trivial CSV data, including data with quoted fields, it's ultimately easier to use tools that actually understand the format, for example the utilities in the ever-handy csvkit package.
Example script that extracts the 7th field of each record in your data and saves each one in its own file:
#!/usr/bin/env bash
rm -f output*.txt
while IFS=$'\037' read -r -d $'\036' lineno record; do
    # Skip header line
    if [[ $lineno = line_number ]]; then
        continue
    fi
    printf "%s\n" "$record" > "output$lineno.txt"
done < <(csvcut -l -H -c 7 input.csv | csvformat -M $'\036' -D $'\037' -U3)
csvcut -l -H -c 7
will output each record's line number plus the 7th column of the original CSV file. The csvformat step then rewrites that output so an ASCII record-separator character ($'\036') sits between records instead of a newline (so newlines embedded in a field can't confuse line-oriented tools), a unit-separator character ($'\037') sits between fields instead of a comma, and nothing is quoted. The while loop reads one record at a time using those delimiters and saves it to its own sequentially numbered file for later processing.
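If you would rather have each file named after the record's ID (as the question asks) instead of its line number, a minimal variation, assuming the ID lives in the first column and is safe to use in a filename, is to have csvcut pull that column as well:

#!/usr/bin/env bash
rm -f output_*.txt
while IFS=$'\037' read -r -d $'\036' lineno id record; do
    # Skip the header line, exactly as above
    if [[ $lineno = line_number ]]; then
        continue
    fi
    printf "%s\n" "$record" > "output_${id}.txt"
done < <(csvcut -l -H -c 1,7 input.csv | csvformat -M $'\036' -D $'\037' -U3)

On the sample data that would leave output_ID1.txt, output_ID2.txt and output_ID3.txt, each holding one record's bracketed field.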
Answered By - Shawn | Answer Checked By - Terry (WPSolving Volunteer)