Issue
I have large files that each store results from very long calculations. Here's an example of a file where there are results for five time steps; there are problems with the output at the third, fourth, and fifth time steps.
(Please note that I have been lazy and have used the same numbers to represent the results at each time step in my example. In reality, the numbers would be unique at each time step.)
3
i = 1, time = 1.000, E = 1234567
Mg 22.9985897185 6.9311166109 0.7603733573
O 23.0438129644 6.4358253659 1.5992513709
O 23.8223149199 7.2029442290 0.4030956770
3
i = 2, time = 1.500, E = 1234567
Mg 22.9985897185 6.9311166109 0.7603733573
O 23.0438129644 6.4358253659 1.5992513709
O 23.8223149199 7.2029442290 0.4030956770
3
i = 3, time = 2.000, E = 1234567
Mg 22.9985897185 6.9311166109 0.7603733573
O 23.0438129644 6.4358253659 1.5992513709
O 23.8223149199 (<--Problem: calculation stopped and some numbers are missing)
3
i = 4, time = 2.500, E = 1234567
Mg 22.9985897185 6.9311166109 0.7603733573
O 23.0438129644 6.4358253659 1.5992513709 (Problem: calculation stopped and entire row is missing below)
3
i = 5, time = 3.000, E = 1234567
Mg 22.9985897185 6.9311166109 0.7603733573
O 23.0438129644 6.4358253659 1.5992513709
O 23.8223149199 7.2029442290 0.4030956770 sdffs (<--Problem: rarely, extra characters are printed; I have already figured out how to identify the longest lines in the file, so this case is handled this time)
The problem is that a calculation can fail (and then need to be restarted) while a result is being written to the file. That means that when I try to use the results, I run into problems.
My question is: how can I detect when something has gone wrong and the results file has been corrupted? The most common problem is that a time step does not contain the expected number of result lines: the atom count given at the top of the block (here, 3) plus the header line (the line containing i = ...). If I could find the problem lines, I could then delete that time step.
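To make that concrete, here is a rough sketch of the line-count check I have in mind, written in Python since that is what I already post-process with. The filename file and the three atoms per step are taken from my small example above; note that a check like this only counts lines, so it would not catch the truncated row at i = 3.
NATOMS = 3   # atoms per time step in the example above

with open("file") as f:
    lines = f.read().splitlines()

# header lines look like "i = 1, time = 1.000, E = 1234567"
headers = [k for k, line in enumerate(lines) if line.startswith("i = ")]

for k, h in enumerate(headers):
    # atom rows run from just after this header up to the next step's
    # atom-count line (or to the end of the file for the last step)
    end = headers[k + 1] - 1 if k + 1 < len(headers) else len(lines)
    if end - (h + 1) != NATOMS:
        print("malformed time step:", lines[h])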
Here is an example of error output I get when trying to use a messed-up file:
Traceback (most recent call last):
  File "/mtn/storage/software/languages/anaconda/Anaconda3-2018.12/lib/python3.7/site-packages/ase/io/extxyz.py", line 593, in read_xyz
    natoms = int(line)
ValueError: invalid literal for int() with base 10: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "pythonPostProcessingCode.py", line 25, in <module>
    path = read('%s%s' % (filename, fileext), format='xyz', index=':')  # <-- This line tells me that Python cannot read in a particular time step because the formatting is messed up.
I am not experienced with scripting, awk, etc., so if anyone thinks I have not used appropriate question tags, a heads-up would be welcome. Thank you.
Solution
With 330 result lines per time step in the real data (the example above has only 3), the header plus those 330 lines makes 331 lines of text, and so:
awk 'BEGIN { RS="i =" } { split($0,bits,"\n");if (length(bits)-1==331) { print RS$0 } }' file > newfile
Explanation:
awk 'BEGIN {
    RS="i ="
}
{
    split($0,bits,"\n");
    if (length(bits)-1==331) {
        print RS$0
    }
}' file > newfile
Before processing any lines from the file called file, set the record separator to "i =". Then, for each record, use split to break the record ($0) into an array bits, using a newline as the separator. (split leaves a trailing empty element after the record's final newline, which is why one is subtracted from the array length.) Whenever the length of bits, less 1, equals 331, print the record separator followed by the record, redirecting the output to a new file called newfile.
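If you would rather stay in Python (which you already use for post-processing), here is a minimal sketch of an equivalent filter, not the awk answer itself. It assumes the frame layout shown in your sample (atom-count line, "i = ..." header, then the atom rows) and that each atom row has four fields: an element symbol plus three numeric coordinates. The filenames file and newfile match the awk command above.
import re

HEADER = re.compile(r"^i = ")   # frame headers look like "i = 1, time = ..."

def good_atom_line(line):
    # an atom row should be an element symbol plus three numeric coordinates
    fields = line.split()
    if len(fields) != 4:
        return False
    try:
        for f in fields[1:]:
            float(f)
    except ValueError:
        return False
    return True

with open("file") as src:
    lines = src.read().splitlines()

# index of every header line; each frame is the atom-count line just above
# a header, the header itself, and the rows up to the next frame's count line
starts = [k for k, line in enumerate(lines) if HEADER.match(line)]
bounds = starts[1:] + [len(lines) + 1]

kept = dropped = 0
with open("newfile", "w") as dst:
    for h, nxt in zip(starts, bounds):
        count_line, header = lines[h - 1], lines[h]
        atoms = lines[h + 1 : nxt - 1]   # stop before the next count line
        try:
            natoms = int(count_line)
        except ValueError:
            natoms = -1                  # corrupt count line: drop this frame
        if h > 0 and len(atoms) == natoms and all(good_atom_line(a) for a in atoms):
            dst.write("\n".join([count_line, header] + atoms) + "\n")
            kept += 1
        else:
            dropped += 1

print(f"kept {kept} frames, dropped {dropped}")
On the five-step sample above, this should keep steps 1 and 2 and drop 3, 4, and 5: the per-row field check also catches the truncated row at i = 3 and the stray sdffs characters at i = 5, which a pure line count cannot.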
Answered By - Raman Sailopal