Issue
I have large files that each store results from very long calculations. Here's an example of a file where there are results for five time steps; there are problems with the output at the third, fourth, and fifth time steps.
(Please note that I have been lazy and have used the same numbers to represent the results at each time step in my example. In reality, the numbers would be unique at each time step.)
3
i = 1, time = 1.000, E = 1234567
Mg 22.9985897185 6.9311166109 0.7603733573
O 23.0438129644 6.4358253659 1.5992513709
O 23.8223149199 7.2029442290 0.4030956770
3
i = 2, time = 1.500, E = 1234567
Mg 22.9985897185 6.9311166109 0.7603733573
O 23.0438129644 6.4358253659 1.5992513709
O 23.8223149199 7.2029442290 0.4030956770
3
i = 3, time = 2.000, E = 1234567
Mg 22.9985897185 6.9311166109 0.7603733573
O 23.0438129644 6.4358253659 1.5992513709
O 23.8223149199 (<--Problem: calculation stopped and some numbers are missing)
3
i = 4, time = 2.500, E = 1234567
Mg 22.9985897185 6.9311166109 0.7603733573
O 23.0438129644 6.4358253659 1.5992513709 (Problem: calculation stopped and entire row is missing below)
3
i = 5, time = 3.000, E = 1234567
Mg 22.9985897185 6.9311166109 0.7603733573
O 23.0438129644 6.4358253659 1.5992513709
O 23.8223149199 7.2029442290 0.4030956770 sdffs (<--Problem: rarely, extra characters are printed; I have already figured out how to identify the longest lines in the file, so this case is handled this time)
The problem is that a calculation can fail (and then need to be restarted) while a result is being written to the file. That means that when I try to use the results, I run into problems.
My question is: how can I detect when something has gone wrong and the results file has been corrupted? The most common problem is that a time step does not contain the expected number of result lines: the atom count given at the top of the block (here, 3) plus the header line (the line containing i = ...). If I could find the problem lines, I could then delete that time step.
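To make that concrete, here is a rough sketch of the line-count check I have in mind, written in Python since that is what I already post-process with. The filename file and the three atoms per step are taken from my small example above; note that a check like this only counts lines, so it would not catch the truncated row at i = 3.
NATOMS = 3   # atoms per time step in the example above

with open("file") as f:
    lines = f.read().splitlines()

# header lines look like "i = 1, time = 1.000, E = 1234567"
headers = [k for k, line in enumerate(lines) if line.startswith("i = ")]

for k, h in enumerate(headers):
    # atom rows run from just after this header up to the next step's
    # atom-count line (or to the end of the file for the last step)
    end = headers[k + 1] - 1 if k + 1 < len(headers) else len(lines)
    if end - (h + 1) != NATOMS:
        print("malformed time step:", lines[h])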
Here is an example of error output I get when trying to use a messed-up file:
Traceback (most recent call last):
  File "/mtn/storage/software/languages/anaconda/Anaconda3-2018.12/lib/python3.7/site-packages/ase/io/extxyz.py", line 593, in read_xyz
    natoms = int(line)
ValueError: invalid literal for int() with base 10: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "pythonPostProcessingCode.py", line 25, in <module>
    path = read('%s%s' % (filename, fileext), format='xyz', index=':')  # <-- This line tells me that Python cannot read in a particular time step because the formatting is messed up.
I am not experienced with scripting, awk, etc., so if anyone thinks I have not used appropriate question tags, a heads-up would be welcome. Thank you.
Solution
With 330 result lines per time step in the real data (the example above has only 3), the header plus those 330 lines makes 331 lines of text, and so:
awk 'BEGIN { RS="i =" } { split($0,bits,"\n");if (length(bits)-1==331) { print RS$0 } }' file > newfile
Explanation:
awk 'BEGIN {
    RS="i ="
}
{
    split($0,bits,"\n");
    if (length(bits)-1==331) {
        print RS$0
    }
}' file > newfile
Before processing any lines from the file called file, set the record separator to "i =". Then, for each record, use split to break the record ($0) into an array bits, using a newline as the separator. (split leaves a trailing empty element after the record's final newline, which is why one is subtracted from the array length.) Whenever the length of bits, less 1, equals 331, print the record separator followed by the record, redirecting the output to a new file called newfile.
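If you would rather stay in Python (which you already use for post-processing), here is a minimal sketch of an equivalent filter, not the awk answer itself. It assumes the frame layout shown in your sample (atom-count line, "i = ..." header, then the atom rows) and that each atom row has four fields: an element symbol plus three numeric coordinates. The filenames file and newfile match the awk command above.
import re

HEADER = re.compile(r"^i = ")   # frame headers look like "i = 1, time = ..."

def good_atom_line(line):
    # an atom row should be an element symbol plus three numeric coordinates
    fields = line.split()
    if len(fields) != 4:
        return False
    try:
        for f in fields[1:]:
            float(f)
    except ValueError:
        return False
    return True

with open("file") as src:
    lines = src.read().splitlines()

# index of every header line; each frame is the atom-count line just above
# a header, the header itself, and the rows up to the next frame's count line
starts = [k for k, line in enumerate(lines) if HEADER.match(line)]
bounds = starts[1:] + [len(lines) + 1]

kept = dropped = 0
with open("newfile", "w") as dst:
    for h, nxt in zip(starts, bounds):
        count_line, header = lines[h - 1], lines[h]
        atoms = lines[h + 1 : nxt - 1]   # stop before the next count line
        try:
            natoms = int(count_line)
        except ValueError:
            natoms = -1                  # corrupt count line: drop this frame
        if h > 0 and len(atoms) == natoms and all(good_atom_line(a) for a in atoms):
            dst.write("\n".join([count_line, header] + atoms) + "\n")
            kept += 1
        else:
            dropped += 1

print(f"kept {kept} frames, dropped {dropped}")
On the five-step sample above, this should keep steps 1 and 2 and drop 3, 4, and 5: the per-row field check also catches the truncated row at i = 3 and the stray sdffs characters at i = 5, which a pure line count cannot.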
Answered By - Raman Sailopal