Monday, October 25, 2021

[SOLVED] How can I replace line-feed in csv quoted fields with a blank?

Issue

Have a large csv file with some line feeds in quoted columns. I need to run a shell script, sed, awk, perl are fine and replace only the line feeds inside quotes with a space. The end of line must be preserved and I don't know the number of columns or which fields might have these embedded line feeds.

Further examination of the file shows this as a result of cat -v test_lf.csv

"NORTH ?M-^@?OLMSTED"
"PORT?M-^@?ST?M-^@?LUCIE"

in the csv file which in excel shows a line break where the ?M-^@? is.

I'd like to use tr to replace whatever that is with a space. How would I do that? What is that sequence?

I've now found that a small part of the file looks like the following in characters and in hex.

13:20:59 [woo:~/Development/scripts] > tail -n 8 test_lf.csv | head -n 1 | od -x
0000000      3431    3136    3439    3039    2c39    542c    4c45    3543
0000020      5f36    3430    2c47    4c46    332c    3934    3338    312c
0000040      3634    3931    3934    3930    222c    4f50    5452    80e2
0000060      53a8    e254    a880    554c    4943    2245    462c    2c4c
0000100      3433    3839    2c33    3737    2c32    3237    2c37    3535
0000120      2c2c    5441    334c    2c51    6e43    7463    222c    6f4e
0000140      80e2    4da8    6773    2c22    2c34    3832    312c    0d35
0000160      000a                                                        
0000161
13:21:50 [woo:~/Development/scripts] > tail -n 8 test_lf.csv | head -n 1 | od -c
0000000    1   4   6   1   9   4   9   0   9   ,   ,   T   E   L   C   5
0000020    6   _   0   4   G   ,   F   L   ,   3   4   9   8   3   ,   1
0000040    4   6   1   9   4   9   0   9   ,   "   P   O   R   T   
      **
0000060   **   S   T   
      **  **   L   U   C   I   E   "   ,   F   L   ,
0000100    3   4   9   8   3   ,   7   7   2   ,   7   2   7   ,   5   5
0000120    ,   ,   A   T   L   3   Q   ,   C   n   c   t   ,   "   N   o
0000140    
      **  **   M   s   g   "   ,   4   ,   2   8   ,   1   5  \r
0000160   \n  

I'm on a mac and the file is utf-8, and my locale is utf-8. It appears that the bytes are swapped (little-endian), so hexadecimal 3431 3136 is 1463 in the character representation. So, starting at byte 60 in this output, we have: something, S, T, blank, something, something, L and in hex this is: 53a8 e254 a880 554c, 53 is an S, 54 is a T, and 4c is an L . So, there is an e2, a8, 80 sequence between the T and the L. This produces a line-feed in the Excel spreadsheet field.
How do I find and replace these bytes by a space?


Solution

I'd use the Perl module Text::CSV

#!/usr/bin/perl

use strict;
use warnings;
use feature qw/say/;
use open IO => ':encoding(utf8)';
use open ':std';
use Text::CSV;

my $file = shift @ARGV;
open my $fh, "<", $file or die "cannot open $file: $!\n";

my $csv = Text::CSV->new({binary => 1});

while (my $row = $csv->getline($fh)) {
    my @no_newlines = map {s/\n/ /g; $_} @$row;
    $csv->combine(@no_newlines);
    say $csv->string();
}

close $fh;

And then you can run it like:

/path/to/csvfixer.pl file.csv > fixed.csv


Answered By - glenn jackman