Issue
I'm trying to import a big CSV file to BigQuery (2.2 GB+). This is the error I get:
"Error while reading data, error message: CSV table references column position 33, but line starting at position:254025076 contains only 26 columns."
There are more errors on that file – and on that file only, out of one per state. Usually I would skip the faulty lines, but then I would lose a lot of data.
What can be a good way to check and correct the errors in a file that big?
EDIT: This is what seems to happen in the file. It's one single line and it breaks between "Instituto" and "Butantan". As a result, BigQuery parses it as one line with 26 columns and another with nine columns. That repeats a lot.
As far as I've seen, it's just with Butantan, but sometimes the first word is described differently (I caught "Instituto" and "Fundação"). Can I correct that maybe with grep on the command line? If so, what syntax?
Solution
Actually 2.2GB is quite manageble size. It can be quickly pre-processed with command line tools or simple python script on any +/- modern laptop/desktop or on a small VM in GCP.
You can start from looking at the problematic row:
head -n 254025076 your_file.csv | tail -n 1
If problematic rows just have missing values for last columns - you can use "--allow_jagged_rows" loading CSV option.
Otherwise I'm usually using simple python script like this:
import fileinput
def process_line(line):
# your logic to fix line
return line
if __name__ == '__main__':
for line in fileinput.input():
print(process_line(line))
and run it with:
cat your_file.csv | python3 preprocess.py > new_file.csv
UPDATE: For newline characters in value - try BigQuery "Allow quoted newlines" option.
Answered By - Oleksandr Bushkovskyi Answer Checked By - Clifford M. (WPSolving Volunteer)