Tag Archives: python

Python Error-Handling: Follow-up to Temporary Files

This post is a follow-up to this previous blog post.

On the new Windows 10 computers at Skidmore, I ran into an issue in which temporary files disrupt the execution of my Python scripts. An example of a suspected temporary file is “C:\Users\jgerard\Desktop\Important2\~$LYP-D_root1.gjf.out”. However, when I enable “View Hidden Files” in my directory, no such file appears. As a last resort, I have my script check each file name for the “$” character:

if "$" in file_name:

and break if True. This adjustment allowed the script to run successfully. However, it is less than elegant, and the script could run into issues if any of my real output files had a “$” in its file name.
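A narrower test than looking for “$” anywhere in the name is to check whether the base name starts with “~$”, the prefix seen in the example above. The sketch below assumes that every unwanted temporary file carries this prefix; is_temp_file and the sample paths are hypothetical names used only for illustration.

```python
import ntpath

def is_temp_file(file_path):
    """Return True if the file's base name starts with the "~$" prefix."""
    # ntpath splits on Windows-style separators regardless of the platform
    return ntpath.basename(file_path).startswith("~$")

# hypothetical paths: one temporary file, two real output files
file_paths = [
    r"C:\data\~$job1.out",
    r"C:\data\job1.out",
    r"C:\data\cost_in_$.out",
]

# keep only the paths that do not look like temporary files
kept = [p for p in file_paths if not is_temp_file(p)]
```

With this test, an output file that merely contains a “$” somewhere in its name is still analyzed.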


Filed under programming

Python Error Handling: Temporary Files and Skipping Failed Jobs

In my scripts that analyze all Gaussian output files in a directory, I use glob to get all the files into a list. The relevant script is as follows:

import glob
import os

#get a folder to search through
folder_location = get_folder_location()

#gets filepaths for everything in the specified folder ending with .LOG or .out
#(os.path.join adds the path separator if folder_location lacks a trailing one)
search_str_log = os.path.join(folder_location, "*.LOG")
search_str_out = os.path.join(folder_location, "*.out")

file_paths_log = glob.glob(search_str_log)
file_paths_out = glob.glob(search_str_out)

file_paths = file_paths_log + file_paths_out

I have been editing and testing output files within the directory, and my script mistakenly read the temporary files that Word created for them as .out files. I needed to make those temporary files visible (see this link for instructions on how to do that). Then I deleted the temporary files so that they would not be read as extra files.

I wanted my script to skip files that had an error termination in Gaussian, so I needed to search the output files for the string “Error termination”. Since I was reading output files with .readlines(), which returns a list of lines (strings), I decided to recombine the list into one string as follows:

#reads the file as a list of its lines of text
content_lines = current_file.readlines()

#recombines the lines into a single string so it can be searched at once
combined = '\t'.join(content_lines)

if "Error termination" in combined:
    print(file_name + " skipped because of error termination in Gaussian.")

else:
    #analyze file
    pass

Another option would have been to call .read() in addition to .readlines() and store the result in a separate variable, content. However, .read() leaves the cursor at the end of the file, so I would need to call current_file.seek(0) to reset the cursor to the beginning before calling .readlines(). (See this post for more information about this issue.) Using .join() was a simpler way of getting the file's text as a single string, without having to reset the cursor.
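For comparison, here is a minimal sketch of the .read() plus .seek(0) alternative. It uses io.StringIO as a stand-in for an opened output file, and the sample text is made up for illustration.

```python
import io

# io.StringIO stands in for an opened Gaussian output file (made-up content)
current_file = io.StringIO(u"line one\nError termination in the sample job\nline three\n")

content = current_file.read()           # whole file as one string; cursor is now at the end
leftover = current_file.readlines()     # nothing is left to read from here
current_file.seek(0)                    # move the cursor back to the beginning
content_lines = current_file.readlines()

found_error = "Error termination" in content
```

Without the seek(0) call, the second read returns an empty list, which is the behavior described above.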


Filed under programming

Python: Error Handling

One part of my updates to my Python script has been to manage what the script does in the case of incomplete output files. I have developed ways of circumventing a couple of issues, which include missing or different nstates values and incomplete files.

To get the number of states, I use a function to find the route line, and then I use a regular expression to extract the nstates value. The function that I use to find the route line can be seen in a previous post.

I use a simple regular expression to find the nstates value from the route line: nstatesRegex = re.compile(r"nstates=(\d*)")

An example of a route line, which the find_parameters function extracts, is as follows: # td=(nstates=6) cam-b3lyp/6-31+g(d) geom=connectivity

I use nstatesRegex to get the nstates value from that line, but sometimes the nstates value does not appear in the route line. My script does not work properly if the nstates is not specified in the route line of an output file. For now, my code just prints a message specifying any files that do not include an nstates value in the route line.
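A minimal sketch of this extraction, including the case where nstates is missing, might look like the following. The function name get_nstates is hypothetical, and the pattern here uses \d+ (at least one digit) and ignores case, both of which are my assumptions rather than the original pattern.

```python
import re

# \d+ requires at least one digit; IGNORECASE is an assumption for illustration
nstates_regex = re.compile(r"nstates=(\d+)", re.IGNORECASE)

def get_nstates(route_line):
    """Return the nstates value as an int, or None if the route line has none."""
    match = nstates_regex.search(route_line)
    if match is None:
        return None
    return int(match.group(1))

with_nstates = get_nstates("# td=(nstates=6) cam-b3lyp/6-31+g(d) geom=connectivity")
without_nstates = get_nstates("# td cam-b3lyp/6-31+g(d) geom=connectivity")
```

Returning None gives the calling code one clear value to test before it prints the message about a missing nstates.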

If a file's nstates value is smaller than the largest nstates value in the directory, I need to compensate for the missing values in the list that contains all the relevant data. I first loop through all the files to get the largest_nstates value. In the second loop through the files, in which I append the relevant data to the master_results list, I do the following after I append the energy values:

for x in xrange(nstates, largest_nstates):
    EE_lst.append(" ")
    abs_EE_lst.append(" ")
    osc_lst.append(" ")

The empty spaces account for empty cells, where the output file expects energy values. I extend my master_results list with those three lists. The empty spaces help each of these lists be the same length as the largest lists I find in the directory, so that the lists are a uniform length and can be iterated through more easily.
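The same padding can be written as a small helper. This is only a sketch: pad_to_length and the sample values are hypothetical and are not taken from the script itself.

```python
def pad_to_length(values, target_length, filler=" "):
    """Return a copy of values padded with filler up to target_length items."""
    missing = max(0, target_length - len(values))
    return values + [filler] * missing

# a file with 2 states in a directory whose largest nstates is 4
EE_lst = pad_to_length(["3.10", "3.45"], 4)

# a list that is already long enough is returned unchanged
full_lst = pad_to_length(["3.10", "3.45", "3.80", "4.02"], 4)
```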

Sometimes a user may have incomplete output files in the directory. An easy way to verify that an output file is complete is to check whether it has a job time. This method is imperfect: I recently encountered a job that crashed inside Gaussian and was therefore incomplete, yet its output file still reported the job time up to the crash. Checking the job time at least ensures that every file comes from a job that has stopped running, whether or not it failed in Gaussian. In the latest version, my script fails when it tries to read an incomplete file, but it prints a message telling the user which files lack job times, so that the user knows which files to remove from the directory.

There are several updates that I can make to this script to help it deal with incomplete files. I could have it skip incomplete files (such as those without job times) instead of just printing a message about which files are incomplete, so that the script still reads all the complete files. Although I have made progress in dealing with files that have different numbers of states, I could set the default to 3 states (if that is indeed the default) when nstates is not specified in the route line. I could also have my script check for Gaussian error messages and inform the user of the error message and the pertinent file. These updates are meant to help the user identify problems more quickly and spend less time looking at the script and various output files to determine what went wrong in a failed run.
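The first of these updates, skipping files without a job time, could be sketched as follows. The code assumes that a finished job's output contains the text “Job cpu time”; the function name, file names, and file contents are all hypothetical.

```python
def has_job_time(file_text):
    """Return True if the output text contains a job time line."""
    # assumption: finished jobs include a line containing "Job cpu time"
    return "Job cpu time" in file_text

# made-up file contents keyed by hypothetical file names
outputs = {
    "complete.out": " SCF Done\n Job cpu time:  0 days  0 hours  1 minutes  2.0 seconds.\n",
    "partial.out": " SCF Done\n",
}

# analyze only the complete files and report the rest to the user
complete = sorted(name for name, text in outputs.items() if has_job_time(text))
skipped = sorted(name for name, text in outputs.items() if not has_job_time(text))
```

The skipped list can then be printed at the end of the run, so the user still learns which files were left out.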


Filed under programming

Python Single-Point Geometry Energy Script Update

Continued from a previous post.

As I created my first completed version of the Python script, I learned several things about Python.

One thing I learned was how file handles work. With the function:

def find_parameters(file_handle):
    for l in file_handle:
        if l.startswith(" #"):
            return l.rstrip()

I was running into an error in which I received the message, NoneType has no attribute…, meaning that my code was passing None to functions that expected strings. That happened after I had looped through the file handle in another find function. I learned that looping through a file handle leaves the “cursor” at the end of the file, so that the very next use of the file handle starts at the end and finds nothing to read. To reset the “cursor” to the beginning of the file, I need to call file_handle.seek(0) before each new pass through the file, as follows:

def find_parameters(file_handle):
    file_handle.seek(0)
    for l in file_handle:
        if l.startswith(" #"):
            return l.rstrip()

Here is a Stack Overflow thread with an explanation.
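The behavior can be reproduced in a few lines. This sketch uses io.StringIO in place of a real file, made-up file content, and the version of find_parameters without the seek(0) call.

```python
import io

def find_parameters(file_handle):
    """Return the first line that starts with " #", or None if there is none."""
    for line in file_handle:
        if line.startswith(" #"):
            return line.rstrip()

# io.StringIO stands in for an opened output file (made-up content)
handle = io.StringIO(u" header\n # td=(nstates=6) cam-b3lyp/6-31+g(d)\n rest of file\n")

first = find_parameters(handle)    # finds the route line
for _ in handle:                   # another loop reads the rest of the file
    pass
second = find_parameters(handle)   # cursor is at the end, so nothing is found
handle.seek(0)                     # reset the cursor to the beginning
third = find_parameters(handle)    # finds the route line again
```

Here second is None, which is the kind of value that leads to the NoneType message once a string method is called on it.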

Another thing I learned had to do with the difference between the .append() and .extend() functions. After some searching and consultation with Prof. Kennerly, I realized that a problem I was seeing came from calling .extend() rather than .append(). I was trying to add a string as a single item to an empty list, but .extend() treats its argument as a sequence, so it added each character of the string as a separate item. Using .append() instead kept the string whole and simply added it as one value to the list. Here is a Stack Overflow thread with more discussion of the difference between .append() and .extend().
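A short illustration of the difference, using a made-up string:

```python
appended = []
appended.append("cam-b3lyp")    # adds the whole string as one item

extended = []
extended.extend("cam-b3lyp")    # adds each character as a separate item

# extending with a one-item list gives the same result as append
wrapped = []
wrapped.extend(["cam-b3lyp"])
```

After this runs, appended and wrapped each hold one item, while extended holds nine single-character items.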

One more error that I do not understand but have since overcome has to do with Unicode. I had an interesting hour or so in which I got long text files of Sanskrit and other unexpected languages instead of my output data, when I opened my results text file in Wordpad. Other text editors, such as Microsoft Word, had no trouble reading the text files. I think that this error had something to do with the .extend() function, but I’m not sure. I recall that the error went away after I corrected my code to use .append() instead of .extend(), but that may not have been the direct source of the error. Further investigation of the Unicode issue is needed for a better understanding of it.

Often error messages may not describe the cause of the problem, but give clues of where the problem is. That is why many different error messages popped up when I had a more fundamental issue somewhere else in the script. In addition, as Prof. Kennerly says, computers always do what you tell them. So, the Sanskrit output made sense, given the code I wrote. To fix problems with the script, I needed more understanding of what certain functions did and how python and Wordpad read files.

I used IDLE on a Windows computer to edit and run my script.

P.S. WordPress is not good at allowing indentations or extra spaces. I have to use a keyword &+nbsp; (without the +) to create spaces in the text editor, and all instances of it disappear if I switch to the visual editor. I think this is an issue with HTML, and I read that I would have to use CSS to allow indents in WordPress. This is an annoying problem, and I am trying to get around it for future posts where I show snippets of code.


Filed under Uncategorized

Python Script for Excited State Energy Calculations at Single-Point Geometries

I am in the process of writing a script in python to read Gaussian output files for single-point energy calculations and produce a text file of the energy values (ground state and six excited states) that can be opened in Excel. This will save me some time, since I am currently testing different combinations of functionals and basis sets with TDDFT in Gaussian, and I need to compare the results from many calculations to determine which combination is best. I am referencing a script written by Kristine Vorwerk. However, while Kristine’s script parses files with cclib and extracts the descriptions for energy values in addition to lambda max values, mine uses regular expressions and extracts only the energy values by themselves in an order I have predetermined. Like Kristine’s script, my script also extracts the method and basis set names and the job CPU time.

Another element of Kristine’s script that I have incorporated into my own is a brief interface on the command line that asks for the file path of the folder that contains the .out or .log files I want to read. The script will read every file of that type in that folder and create a text file in that same folder with all the results of interest. Unfortunately, testing the script on the command line can be a time-consuming and confusing process, since the command line itself does not show any error messages. To debug as I write, I am running my script through PyCharm, so I can see exactly where my code fails.

One of the biggest challenges of adapting this script from Kristine’s is that, as a beginner programmer, I am unsure which functions require cclib and which do not. As I move toward a finished script, I will rewrite many of her definitions and functions in a syntax that I am sure will work without cclib. Most of the adaptations rely on my knowledge of regular expressions. In particular, I am interested in ways that I could simplify parts of my code using loops. Since cclib uses simple functions to parse files, regular expressions should take more code to do the same job. However, I am finding that parts of my code look redundant and could probably be shortened using additional loops. For example, since I am looking for the same types of values for six excited states, my code has blocks such as:

mo1 = energyState1Regex.search(line)
if mo1 is not None:
    splitted = line.split()
    EEs1.extend(["  ", splitted[4]])
    absEEs1.extend(["   ", float(splitted[4]) + groundStateEV])

mo2 = energyState2Regex.search(line)
if mo2 is not None:
    splitted = line.split()
    EEs2.extend(["  ", splitted[4]])
    absEEs2.extend(["   ", float(splitted[4]) + groundStateEV])

mo3 = energyState3Regex.search(line)
if mo3 is not None:
    splitted = line.split()
    EEs3.extend(["  ", splitted[4]])
    absEEs3.extend(["   ", float(splitted[4]) + groundStateEV])

….

etc. within a loop, scanning each line for the regular expressions which indicate the different excited states. I could probably shorten this code with another loop, but I am still thinking about how to do that. I cannot use a loop to change one character in a variable name (e.g. mo1, mo2, etc.), so I may need to change the way my regular expressions search for the data I want. Kristine found a simple solution for that problem a while ago, but I believe that was for a list that could be indexed. Nonetheless, I may look at her latest script that uses regular expressions to see if she has found any ways to simplify the code I am writing, and whether that script may be a better reference for me to use.
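One possible way around numbered variable names is to capture the state number with a single regular expression and use it as a dictionary key. The sketch below is only an illustration, not Kristine's solution: the function name, the sample lines, and the ground-state value are made up, and it assumes that the energy is the fifth whitespace-separated field of each line, as in the blocks above. Unlike those blocks, it stores the energies as floats without the spacing strings.

```python
import re
from collections import defaultdict

# one pattern for every state; the captured group is the state number
excited_state_regex = re.compile(r"Excited State\s+(\d+):")

def collect_energies(lines, ground_state_ev):
    """Group excitation energies and absolute energies by state number."""
    ees = defaultdict(list)
    abs_ees = defaultdict(list)
    for line in lines:
        match = excited_state_regex.search(line)
        if match is not None:
            state = int(match.group(1))
            energy = float(line.split()[4])   # assumed position of the eV value
            ees[state].append(energy)
            abs_ees[state].append(energy + ground_state_ev)
    return ees, abs_ees

# made-up lines in the assumed format
sample_lines = [
    " Excited State   1:      Singlet-A      3.5000 eV  354.24 nm  f=0.1000",
    " Excited State   2:      Singlet-A      4.2500 eV  291.73 nm  f=0.0200",
    " some unrelated line",
]
ees, abs_ees = collect_energies(sample_lines, 1.0)
```

Here ees[1] plays the role of EEs1 and ees[2] the role of EEs2, so adding more states needs no new variables.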

 

cclib webpage

PyCharm webpage


Filed under chemistry, computing