Friday, November 26, 2021

[SOLVED] Split text file based on date tag / timestamp

Tags: bash, debian

Issue

I have a big log file containing date tags. It looks like this:

[01/11/2015, 02:19]
foo
[01/11/2015, 08:40]
bar
[04/11/2015, 12:21]
foo
bar
[08/11/2015, 14:12]
bar
foo
[09/11/2015, 11:25]
...
[15/11/2015, 19:22]
...
[15/11/2015, 21:55]
...

and so on. I need to split this data into one file per day, like:

01.txt:

[01/11/2015, 02:19]
foo
[01/11/2015, 08:40]
bar

04.txt:

[04/11/2015, 12:21]
foo
bar

etc. How can I do that using any Unix tools?

Solution

I don't think there's a tool that will do it without a little programming, but with Awk the little programming really isn't all that hard.

`script.awk`

/^\[[0-3][0-9]\/[01][0-9]\/[12][0-9]{3},/ {
    if ($1 != old_date)
    {
        if (outfile != "") close(outfile);
        outfile = sprintf("%.2d.txt", ++filenum); 
        old_date = $1
    }
}
{ print > outfile }

The first (bigger) block of code recognizes the date string, which is also in $1 (so the condition could be made more precise by referring to $1, but the benefit is minimal to non-existent). Inside the action, it checks whether the date differs from the last date it remembered. If so, it closes the current output file if one is open (close is part of POSIX awk), generates a new file name, and remembers the date it is now processing.

The second smaller block simply writes the current line to the current file.
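
The remark about testing $1 directly can be tried out with a small sketch. The helper name classify_lines is made up for illustration, and the year is spelled out as four bracket expressions rather than {3}, on the assumption that the awk in use might be an older one (mawk 1.3.3, for example) without interval expressions. It only labels lines; it does not write any files.

```shell
# classify_lines: hypothetical helper that labels each line read from stdin.
classify_lines() {
    awk '
    # Anchor the pattern to the first field; it must end right after the comma.
    $1 ~ /^\[[0-3][0-9]\/[01][0-9]\/[12][0-9][0-9][0-9],$/ { print "tag:  " $0; next }
    # Everything else is treated as ordinary log content.
    { print "data: " $0 }
    '
}

printf '%s\n' '[01/11/2015, 02:19]' 'foo' '[01/11/2015,junk]' | classify_lines
```

The third sample line would pass the whole-line pattern used in script.awk, but it is labeled as data here because its first field does not end at the comma. For logs as regular as the one in the question, the two conditions behave the same.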

Invocation

awk -f script.awk data

This assumes you have a file script.awk; you could instead pass the program text directly as the first argument to awk if you prefer. If the whole thing is wrapped in a shell script, I'd put the program inline rather than keep a second file, but I find it convenient during development to use a file. (The shell script would contain awk '…the script…' "$@" with no separate file.)
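
As a rough sketch of that single-file approach, the awk program can sit inline in a shell function or script. The name split_by_date is made up for illustration. Two things differ from script.awk and are assumptions of this sketch: the year is written as four bracket expressions instead of {3}, because some older awk implementations (mawk 1.3.3, for example) do not support interval expressions, and the `outfile != ""` guard is added so that lines before the first date tag are skipped rather than printed to an empty file name.

```shell
# split_by_date: hypothetical wrapper name; the awk program is passed inline,
# so no separate script.awk file is needed. File arguments go straight to awk.
split_by_date() {
    awk '
    /^\[[0-3][0-9]\/[01][0-9]\/[12][0-9][0-9][0-9],/ {
        if ($1 != old_date) {
            if (outfile != "") close(outfile)          # finish the previous day
            outfile = sprintf("%.2d.txt", ++filenum)   # 01.txt, 02.txt, ...
            old_date = $1
        }
    }
    outfile != "" { print > outfile }   # added guard: ignore lines before the first tag
    ' "$@"
}
```

In a stand-alone script, a final line `split_by_date "$@"` would let it be run as `sh split.sh data`, where split.sh is a hypothetical file name.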

Example output files

Given the sample data from the question, the output is in five files, 01.txt .. 05.txt.

$ for file in 0?.txt; do boxecho "$file"; cat "$file"; done
************
** 01.txt **
************
[01/11/2015, 02:19]
foo
[01/11/2015, 08:40]
bar
************
** 02.txt **
************
[04/11/2015, 12:21]
foo
bar
************
** 03.txt **
************
[08/11/2015, 14:12]
bar
foo
************
** 04.txt **
************
[09/11/2015, 11:25]
...
************
** 05.txt **
************
[15/11/2015, 19:22]
...
[15/11/2015, 21:55]
...
$

The boxecho command is a simple script that echoes its arguments in a box of stars:

echo "** $* **" | sed -e h -e 's/./*/g' -e p -e x -e p -e x

Revised file name format

I wish to have output as [day].txt or [day].[month].[year].txt, based on the date in the file. Is that possible?

Yes; it is possible and not particularly hard. The split function is one way of breaking up the value in $1. The regex specifies that an opening square bracket, a slash and a comma are the field separators. There are 5 sub-fields in the value in $1: an empty field before the [, the three numeric components separated by slashes, and an empty field after the comma. The array name, dmy, is mnemonic for the sequence in which the components are stored.
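
Before the full script, a quick way to see those five sub-fields is to run split() on a single tag and print each element. This is only an illustrative sketch; show_date_parts is a made-up helper name, and the separator regex is the same one used in the script that follows.

```shell
# show_date_parts: hypothetical helper; prints every element that split()
# stores in dmy for the first whitespace-separated field of its argument.
show_date_parts() {
    printf '%s\n' "$1" | awk '{
        n = split($1, dmy, "[/\[,]")    # separators: "/", "[" and ","
        for (i = 1; i <= n; i++)
            printf "dmy[%d] = \"%s\"\n", i, dmy[i]
    }'
}

show_date_parts '[01/11/2015, 02:19]'
```

For this tag, elements 2, 3 and 4 hold the day, month and year, while elements 1 and 5 are the empty strings on either side of the outer separators.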

/^\[[0-3][0-9]\/[01][0-9]\/[12][0-9]{3},/ {
    if ($1 != old_date)
    {
        if (outfile != "") close(outfile)
        n = split($1, dmy, "[/\[,]")
        outfile = sprintf("%s.%s.%s.txt", dmy[4], dmy[3], dmy[2])
        old_date = $1
    }
}
{ print > outfile }

Permute the numbers 4, 3, 2 in the sprintf() statement to suit yourself. The given order is year, month, day, which has many merits, including that it follows the ISO 8601 ordering of date components and that the files sort automatically into date order. I strongly counsel its use, but you may do as you wish. For the sample input shown in the question, the files it generates are:

2015.11.01.txt
2015.11.04.txt
2015.11.08.txt
2015.11.09.txt
2015.11.15.txt
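
For the [day].[month].[year].txt naming asked about, permuting the indices might look like the sketch below. The function name split_dmy is hypothetical, and as assumptions of this sketch the year is matched with four bracket expressions instead of {3} (for awks without interval support) and lines before the first date tag are skipped.

```shell
# split_dmy: hypothetical name. Same logic as the revised script, with the
# sprintf() arguments permuted to day, month, year.
split_dmy() {
    awk '
    /^\[[0-3][0-9]\/[01][0-9]\/[12][0-9][0-9][0-9],/ {
        if ($1 != old_date) {
            if (outfile != "") close(outfile)
            split($1, dmy, "[/\[,]")    # dmy[2]=day, dmy[3]=month, dmy[4]=year
            outfile = sprintf("%s.%s.%s.txt", dmy[2], dmy[3], dmy[4])
            old_date = $1
        }
    }
    outfile != "" { print > outfile }   # added guard: ignore lines before the first tag
    ' "$@"
}
```

Using only dmy[2] in the sprintf() call would give [day].txt names, but then dates in different months that share a day number map to the same file, and because `>` truncates a file when it is reopened after close(), the later date would overwrite the earlier one.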

Answered By - Jonathan Leffler

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.