As part of updating my Python script, I have been working on how it behaves when output files are incomplete. I have developed workarounds for two issues: nstates values that are missing or that differ between files, and incomplete files.
To get the number of states, I use a function to find the route line, and then I use a regular expression to extract the nstates value. The function that I use to find the route line can be seen in a previous post.
I use a simple regular expression to find the nstates value from the route line:
nstatesRegex = re.compile(r"nstates=(\d*)")
An example of a route line, which the find_parameters function extracts, is as follows:
# td=(nstates=6) cam-b3lyp/6-31+g(d) geom=connectivity
I use nstatesRegex to get the nstates value from that line, but sometimes the nstates value does not appear in the route line. My script does not work properly if nstates is not specified in the route line of an output file. For now, my code just prints a message listing any files whose route line does not include an nstates value.
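A minimal sketch of the extraction step, written for Python 3. The helper name get_nstates is hypothetical and is not part of the script described above. It also swaps \d* for \d+, on the assumption that an "nstates=" with no digits after it should count as missing rather than match an empty string.

```python
import re

# \d+ requires at least one digit, so a bare "nstates=" is treated as missing.
NSTATES_REGEX = re.compile(r"nstates=(\d+)", re.IGNORECASE)


def get_nstates(route_line):
    """Return the nstates value from a route line as an int, or None if absent."""
    match = NSTATES_REGEX.search(route_line)
    return int(match.group(1)) if match else None


with_nstates = get_nstates("# td=(nstates=6) cam-b3lyp/6-31+g(d) geom=connectivity")
without_nstates = get_nstates("# td cam-b3lyp/6-31+g(d) geom=connectivity")
print(with_nstates, without_nstates)
```

Returning None instead of raising lets the calling loop decide whether to print a message, skip the file, or fall back to a default.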
If a file's nstates value is smaller than the largest value in the directory, I need to compensate for the missing values in the list that contains all the relevant data. I first loop through all the files to get the largest_nstates value. In the second loop over the files, in which I append the relevant data to the master_results list, I do the following after I append the energy values:
for x in xrange(nstates, largest_nstates):
    EE_lst.append(" ")
    abs_EE_lst.append(" ")
    osc_lst.append(" ")
The blank strings stand in for empty cells where the output file expects energy values. I extend my master_results list with those three lists. The padding makes each of these lists the same length as the longest lists found in the directory, so that the lists have a uniform length and can be iterated through more easily.
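The same padding idea can be sketched as a small reusable function in Python 3. The name pad_to_length and the sample energies are made up for illustration; only the " " filler and the largest_nstates idea come from the loop above.

```python
def pad_to_length(values, target_length, filler=" "):
    """Return a copy of values extended with filler up to target_length.

    Lists that are already long enough are returned unchanged (as a copy).
    """
    missing = max(0, target_length - len(values))
    return list(values) + [filler] * missing


largest_nstates = 6            # would come from the first pass over the files
EE_lst = [3.10, 3.45, 3.80]    # made-up energies for a file with nstates=3

padded = pad_to_length(EE_lst, largest_nstates)
print(len(padded))
```

Building a new list instead of appending in place keeps the original values available if they are needed unpadded later.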
Sometimes a user may have incomplete output files in the directory. An easy way to verify that an output file is complete is to check whether it has a job time. This method is imperfect: I recently encountered a file in which Gaussian itself hit an error, yet the file still reported its job time up to the point of the crash. Checking the job time will at least ensure that every file ran until Gaussian stopped, whether or not the calculation failed. In the latest version, my script fails when it tries to read an incomplete file, but it prints a message that tells the user which files did not have job times, so that the user knows which files to remove from the directory.
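A rough sketch of the job-time check in Python 3. It assumes the job time appears on a line containing "Job cpu time" and that a successful run also prints a line containing "Normal termination"; both marker strings should be checked against real output files before relying on them. The function name and the sample lines are made up.

```python
def check_completion(lines):
    """Return (has_job_time, terminated_normally) for the lines of one output file."""
    has_job_time = any("Job cpu time" in line for line in lines)
    terminated_normally = any("Normal termination" in line for line in lines)
    return has_job_time, terminated_normally


# Made-up fragments standing in for the tails of three output files.
finished = [" Job cpu time:  0 days  0 hours  2 minutes 10.0 seconds.",
            " Normal termination of Gaussian."]
crashed = [" Error termination via Lnk1e.",
           " Job cpu time:  0 days  0 hours  0 minutes 30.0 seconds."]
truncated = [" SCF Done"]

for name, lines in [("finished", finished), ("crashed", crashed), ("truncated", truncated)]:
    print(name, check_completion(lines))
```

Checking both markers separates a file that stopped early (no job time) from one that ran to the end but failed (job time present, no normal termination).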
There are several updates that I can make to this script to help it deal with incomplete files. I could have it skip incomplete files (such as those without job times) instead of just printing a message about which files are incomplete, so that the script still reads all the complete files. I have made progress in dealing with files that have different numbers of states, but when nstates is not specified in the route line, I could also fall back to a default of 3 states (if that is indeed the default). I could also have my script check for Gaussian error messages and inform the user of the error message and the pertinent file. These updates are meant to help the user identify problems more quickly and spend less time looking at the script and various output files to determine what went wrong in a failed run.
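Two of these possible updates can be sketched in Python 3 as follows. The default of 3 is taken from the post's own "if that is indeed the default" and has not been verified here. The "Error termination" marker string, the function names, and the sample lines are assumptions for illustration.

```python
DEFAULT_NSTATES = 3  # assumed default; confirm against the Gaussian documentation


def nstates_or_default(parsed_value):
    """Use the parsed nstates if there is one, otherwise fall back to the default."""
    return parsed_value if parsed_value is not None else DEFAULT_NSTATES


def report_errors(logs):
    """Map each filename to its error lines, omitting files without any.

    logs is a dict of filename -> list of lines from that output file.
    """
    report = {}
    for name, lines in logs.items():
        errors = [line.strip() for line in lines if "Error termination" in line]
        if errors:
            report[name] = errors
    return report


# Made-up file contents for illustration.
logs = {
    "ok.log": [" Normal termination of Gaussian."],
    "bad.log": [" SCF Done", " Error termination via Lnk1e."],
}
for name, errors in report_errors(logs).items():
    print("%s: %s" % (name, "; ".join(errors)))
```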
It sounds like you’ve put some good thought into your script, so that it can handle some exceptions that could commonly occur. Nice work.
I’m going to add some comments here that will be good to know about, but I don’t think you need to worry about changing your script at this time. (After all, our job is chemistry, not tweaking our scripts endlessly.)
One is that the way you handle missing data when some files have fewer states than others is what computer programmers would call “padding”. It’s fine. It works. So don’t worry about this. But a more elegant way to handle the issue would be, instead of having one long list that includes ALL of the data, to define a data structure that has multiple smaller lists within a list. The master “structure” (not really a list anymore, more like a matrix) would have one substructure for each data file it has read. Each substructure holds that file’s route line information, along with a list for the energies, a list for the oscillator strengths, and so on. Each of those lists would naturally be only as long as the file’s nstates value. And when you go to print out the data at the end, your code would just skip over any undefined values.
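Here is one way that nested structure might look in Python 3, as a sketch only. The field names, the helper functions, and the sample values are all made up for illustration; the actual script may organize things differently.

```python
def make_record(filename, route_line, energies, osc_strengths):
    """Bundle everything read from one output file into a single substructure."""
    return {
        "filename": filename,
        "route": route_line,
        "energies": list(energies),   # only as long as this file's nstates
        "osc": list(osc_strengths),
    }


def format_rows(records):
    """Build one comma-separated row per state, with blanks for undefined values."""
    largest = max((len(r["energies"]) for r in records), default=0)
    rows = []
    for i in range(largest):
        cells = [str(r["energies"][i]) if i < len(r["energies"]) else ""
                 for r in records]
        rows.append(",".join(cells))
    return rows


# Made-up data: one file with two states and one with a single state.
records = [
    make_record("a.log", "# td=(nstates=2) cam-b3lyp/6-31+g(d)", [3.1, 3.5], [0.01, 0.2]),
    make_record("b.log", "# td=(nstates=1) cam-b3lyp/6-31+g(d)", [2.9], [0.05]),
]
for row in format_rows(records):
    print(row)
```

No padding is stored anywhere. The blank cells are produced only at print time, by checking whether a given file has a value for that state.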
Your comments about incomplete files make sense, and you can combine both approaches. If there’s an incomplete file, print an error message to the screen so the user knows, but the script should just skip over it and keep reading the other files. Basically, you would just have to check whether the output file has any excitation energies at all, and if it doesn’t, print an error. It may even be a complete output file for some other calculation that was accidentally submitted to the script.
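A sketch of that skip-and-continue loop in Python 3. It assumes excitation energies appear on lines shaped like "Excited State   1:   Singlet-A   3.1000 eV ..."; that pattern, the function names, and the sample lines are assumptions to check against real output.

```python
import re

# Assumed line shape: "Excited State <n>: <symmetry label> <energy> eV ..."
EXCITED_STATE_REGEX = re.compile(r"Excited State\s+(\d+):\s+\S+\s+([-\d.]+)\s+eV")


def parse_excitation_energies(lines):
    """Collect the excitation energies (in eV) found in one file's lines."""
    energies = []
    for line in lines:
        match = EXCITED_STATE_REGEX.search(line)
        if match:
            energies.append(float(match.group(2)))
    return energies


def read_all(logs):
    """Return (results, skipped); files with no excitation energies are skipped."""
    results, skipped = {}, []
    for name, lines in logs.items():
        energies = parse_excitation_energies(lines)
        if not energies:
            print("Skipping %s: no excitation energies found" % name)
            skipped.append(name)
            continue
        results[name] = energies
    return results, skipped


# Made-up file contents for illustration.
logs = {
    "td.log": [" Excited State   1:      Singlet-A      3.1000 eV  399.95 nm  f=0.0100"],
    "opt.log": [" Optimization completed."],
}
results, skipped = read_all(logs)
```

Because the check looks only for excitation energies, it catches truncated files and unrelated calculations in the same pass.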