Wednesday, October 26, 2022

[SOLVED] Print sum of Nth column at the header of file with existing rows bash

Issue

I have an input file with billions of records and a header. Header consists of meta info, total number of rows and sum of the sixth column. I am splitting the file into small sizes, due to which my header record must be updated as the sum of sixth column and total rows is changed.

This is the sample record

filename: testFile.text

00|STMT|08-09-2022 13:24:56||5|13.10|SHA2
10|000047290|8ddcf4b2356dfa7f326ca8004a9bdb6096330fc4f3b842a971deaf660a395f65|18-01-2020|12:36:57|3.10|00004729018-01-20201|APP
10|000052736|cce280392023b23df2a00ace4b82db8eb61c112bb14509fb273c523550059317|07-02-2017|16:27:49|2.00|00005273607-02-20171|APP
10|000070355|f2e86d2731d32f9ce960a0f5883e9b688c7e57ab9c2ead86057f98426407d87a|17-07-2019|20:25:02|1.00|00007035517-07-20192|APP
10|000070355|54c1fc2667e160a11ae1dbf54d3ba993475cd33d6ececdd555fb5c07e64a241b|17-07-2019|20:25:02|5.00|00007035517-07-20192|APP
10|000072420|f5dac143082631a1693e0fb5429d3a185abcf3c47b091be2f30cd50b5cf4be11|14-06-2021|20:52:21|2.00|00007242014-06-20212|APP

Expected:

filename: testFile_1.text

00|STMT|08-09-2022 13:24:56||3|6.10|SHA2
10|000047290|8ddcf4b2356dfa7f326ca8004a9bdb6096330fc4f3b842a971deaf660a395f65|18-01-2020|12:36:57|3.10|00004729018-01-20201|APP
10|000052736|cce280392023b23df2a00ace4b82db8eb61c112bb14509fb273c523550059317|07-02-2017|16:27:49|2.00|00005273607-02-20171|APP
10|000070355|f2e86d2731d32f9ce960a0f5883e9b688c7e57ab9c2ead86057f98426407d87a|17-07-2019|20:25:02|1.00|00007035517-07-20192|APP

filename: testFile_2.text

00|STMT|08-09-2022 13:24:56||2|7.00|SHA2
10|000070355|54c1fc2667e160a11ae1dbf54d3ba993475cd33d6ececdd555fb5c07e64a241b|17-07-2019|20:25:02|5.00|00007035517-07-20192|APP
10|000072420|f5dac143082631a1693e0fb5429d3a185abcf3c47b091be2f30cd50b5cf4be11|14-06-2021|20:52:21|2.00|00007242014-06-20212|APP

I am able to split the file and calculate the sum but unable to replace the value in header part. This is the script I have made

#!/bin/bash

splitRowCount=$1
transactionColumn=$2

filename=$(basename -- "$3")
extension="${filename##*.}"
nameWithoutExt="${filename%.*}"

echo "splitRowCount: $splitRowCount"
echo "transactionColumn: $transactionColumn"


awk 'NR == 1 { head = $0 } NR % '$splitRowCount' == 2 { filename = "'$nameWithoutExt'_" int((NR-1)/'$splitRowCount')+1 ".'$extension'"; print head > filename } NR != 1 { print >> filename }' $filename


 ls *.txt | while read line
 do
  firstLine=$(head -n 1 $line);
  awk -F '|'  'NR !=1 {sum += '$transactionColumn'}END {print sum} '   $line
 done

Solution

Here's an awk solution for splitting the original file into files of n records. The idea is to accumulate the records until the given count is reached then generate a file with the updated header and the accumulated records:

n=3
file=./testFile.text

awk -v numRecords="$n"  '
    BEGIN {
        FS = OFS = "|"

        if ( match(ARGV[1],/[^\/]\.[^\/]*$/) ) {
            filePrefix = substr(ARGV[1],1,RSTART)
            fileSuffix = substr(ARGV[1],RSTART+1)
        } else {
            filePrefix = ARGV[1]
            fileSuffix = ""
        }

        if (getline headerStr <= 0)
            exit 1
        split(headerStr, headerArr)
    }
    (NR-2) % numRecords == 0 && recordsCount {
        outfile = filePrefix "_" ++filesCount fileSuffix
        print headerArr[1],headerArr[2],headerArr[3],headerArr[4],recordsCount,recordsSum,headerArr[7] > outfile
        printf("%s", records) > outfile
        close(outfile)
        
        records = ""
        recordsCount = recordsSum = 0
    }
    {
        records = records $0 ORS
        recordsCount++
        recordsSum += $6
    }
    END {
        if (recordsCount) {
            outfile = filePrefix "_" ++filesCount fileSuffix
            print headerArr[1],headerArr[2],headerArr[3],headerArr[4],recordsCount,recordsSum,headerArr[7] > outfile
            printf("%s", records) > outfile
            close(outfile)
        }
    }
' "$file"

With the given sample you'll get:

  • testFile_1.text
00|STMT|08-09-2022 13:24:56||3|6.1|SHA2
10|000047290|8ddcf4b2356dfa7f326ca8004a9bdb6096330fc4f3b842a971deaf660a395f65|18-01-2020|12:36:57|3.10|00004729018-01-20201|APP
10|000052736|cce280392023b23df2a00ace4b82db8eb61c112bb14509fb273c523550059317|07-02-2017|16:27:49|2.00|00005273607-02-20171|APP
10|000070355|f2e86d2731d32f9ce960a0f5883e9b688c7e57ab9c2ead86057f98426407d87a|17-07-2019|20:25:02|1.00|00007035517-07-20192|APP
  • testFile_2.text
00|STMT|08-09-2022 13:24:56||2|7|SHA2
10|000070355|54c1fc2667e160a11ae1dbf54d3ba993475cd33d6ececdd555fb5c07e64a241b|17-07-2019|20:25:02|5.00|00007035517-07-20192|APP
10|000072420|f5dac143082631a1693e0fb5429d3a185abcf3c47b091be2f30cd50b5cf4be11|14-06-2021|20:52:21|2.00|00007242014-06-20212|APP


Answered By - Fravadona
Answer Checked By - Senaida (WPSolving Volunteer)