Author Archives: Justin Gerard

253rd ACS National Meeting & Expo

The ACS National Meeting in April was one of the most significant trips I have been on, and by far the largest conference I have attended. I presented my poster, attended fascinating research talks, and networked with chemistry professionals.

Presenting my poster yielded some interesting discussions with the people who stopped by — mostly other computational chemistry professionals, in fact. One evaluated me for the poster award, and he even one-upped me by knowing the directions of the transition dipole moments of indole, which I did not. Presenting to him was a good experience, because it was fun to walk through my entire project. Another professional said that if we could find a successful functional with TD-DFT, perhaps we could eventually publish our results. Talking to other students gave me an idea of how vastly different research topics could be within the domain of computational chemistry.

I went to some great talks. I heard Harry Gray present on Solar Fuels, using catalysts inspired by plant photosystems, giving me a paper topic for my bioinorganic chemistry class. Later I got to hear the big talk on CRISPR, the gene-editing process and its potential implications in our lives. There were a lot of other good talks, but some less than inspiring ones — I heard someone present his graduate research on organic dyes for solar cells, a topic I was familiar with, but I was bored by his lack of passion. However, most of the talks were given with enthusiasm and depth.

Networking was an interesting experience. The career fair got me out of my comfort zone and helped me practice speaking with people in a professional environment. At a speed networking workshop, I talked to a lot of retired chemistry professionals, which was not as useful, because they had only generalized advice to give me and no real leads. Nonetheless, I became more confident in myself after the networking experiences I had at the Meeting.

A cool thing that happened was that I got to speak with the tech support at Gaussian about our research group’s problem with the Gaussian software — the confusion issue with excited state optimizations — and he gave me a few ideas to try. They haven’t worked so far, but they at least gave us something to move us forward on the issue and identify what does not work. I remain hopeful that we will resolve this issue with the software.

Other than that, Kristine and I got to explore the city a bit and walk on the Golden Gate Bridge. Can’t beat the view from there! I hope I get to go on a similar trip in the future.

Filed under conference report

Python Error-Handling: Follow-up to Temporary Files

This post is a follow-up to this previous blog post.

On the new Windows 10 computers at Skidmore, I ran into an issue when running Python: temporary files disrupt the execution of my scripts. An example of a suspected temporary file is “C:\Users\jgerard\Desktop\Important2\~$LYP-D_root1.gjf.out”. However, when I select “View Hidden Files” in my directory, no such file appears. As a last resort, I have my script check each file name for the “$” that marks these temporary files:

if "$" in file_name:

and break if the test is True. This adjustment let the script run successfully. However, it is less than elegant, and the script would also reject any genuine output file that happened to include a “$” in its file name.
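A slightly narrower test is sketched below. It assumes the stray files are Microsoft Office lock files, which Office names by starting the file name with “~$”. The helper name is_office_temp_file and the second path in the list are made up for this example.

```python
import ntpath


def is_office_temp_file(file_path):
    """Return True if the file name (not the folder) starts with '~$'."""
    # ntpath splits on Windows-style backslashes on any operating system
    return ntpath.basename(file_path).startswith("~$")


paths = [
    r"C:\Users\jgerard\Desktop\Important2\~$LYP-D_root1.gjf.out",
    # hypothetical genuine output file in the same folder
    r"C:\Users\jgerard\Desktop\Important2\B3LYP-D_root1.gjf.out",
]

# Keep only the paths that do not look like Office lock files
real_outputs = [p for p in paths if not is_office_temp_file(p)]
```

Because only the start of the file name is tested, an output file with a “$” somewhere else in its name would still be read.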

Filed under programming

Python Error Handling: Temporary Files and Skipping Failed Jobs

In my scripts that analyze all Gaussian output files in a directory, I use glob to get all the files into a list. The relevant script is as follows:

import glob

#get a folder to search through
folder_location = get_folder_location()

#gets filepaths for everything in the specified folder ending with .LOG or .out
search_str_log = folder_location + "*.LOG"
search_str_out = folder_location + "*.out"

file_paths_log = glob.glob(search_str_log)
file_paths_out = glob.glob(search_str_out)

file_paths = file_paths_log + file_paths_out
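The snippet above only works if folder_location already ends with a path separator. A minimal sketch of one way around that is below; the function name collect_output_files is made up here, and it stands in for the search step only, not for get_folder_location().

```python
import glob
import os


def collect_output_files(folder_location):
    """Return sorted paths of the .log/.LOG and .out files in one folder."""
    file_paths = set()
    for pattern in ("*.log", "*.LOG", "*.out"):
        # os.path.join adds the path separator if the folder lacks one
        file_paths.update(glob.glob(os.path.join(folder_location, pattern)))
    # the set removes duplicates on systems where matching ignores case
    return sorted(file_paths)
```

Calling collect_output_files(folder_location) would then replace the two search strings and the two glob.glob() calls.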

I have been editing and testing output files within the directory, and my script has mistakenly read their temporary Word files as .out files. I needed to make those temporary files visible (see this link for instructions on how to do that). Then I deleted the temporary files, so they would not be read as extra files.

 

I wanted my script to skip files that had an error termination in Gaussian, so I needed to search the output files for the string “Error termination”. Since I was reading output files with .readlines(), as a list of lines (strings), I decided to recombine the list of strings into one string as follows:

#reads the file as a list of its lines of text
content_lines = current_file.readlines()

combined = '\t'.join(content_lines)

if "Error termination" in combined:
    print(file_name + " skipped because of error termination in Gaussian.")

else:
    #analyze file
    pass

Another option would have been to call .read() in addition to .readlines() and store the full text in a second variable, but .read() leaves the cursor at the end of the file, and I would need to write current_file.seek(0) to reset the cursor to the beginning of the file. (See this post for more information about this issue.) Using the .join() method was a simpler way of getting the text file as a string, without having to set the cursor back.
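A third route, sketched here as an alternative rather than what my script does, is to read the file only once with .read() and split the text into lines afterwards. The function name read_gaussian_output and the two-line stand-in file are made up for the example.

```python
import io


def read_gaussian_output(file_handle):
    """Read the file once; return (full_text, list_of_lines)."""
    text = file_handle.read()      # moves the cursor to the end, once
    lines = text.splitlines(True)  # True keeps the line endings, like readlines()
    return text, lines


# Made-up two-line stand-in for a real output file
fake_file = io.StringIO(u" SCF Done\n Error termination via Lnk1e\n")
text, content_lines = read_gaussian_output(fake_file)
failed = "Error termination" in text
```

Both the single string and the list of lines come from the same read, so no seek(0) is needed.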

Filed under programming

MERCURY: Guest Speakers and Poster Presentations

This past weekend, our research group attended talks by guest speakers, presented our abstracts, and fielded questions at our posters at the 2016 MERCURY Conference at Bucknell University.

There were six guest speakers: Chris Cramer, Steven Wheeler, Jeffrey Evanseck, Chris Wilmer, Richard Pastor, and Kate Holloway.

Cramer gave a crash course in quantum mechanics and computational chemistry. He referred to the wave function psi in the Schrödinger equation as an oracle, and he said it is important to ask the oracle something you know before you ask it something you don’t know, i.e. to validate a computational method before using it to predict new information. He stated that the total energy of a molecule can be determined exclusively from its electron density. However, he pointed out self-interaction error as the biggest error in DFT: because the calculation treats each electron as an averaged-out density, the electron ends up repelling itself, which is physically impossible. This error needs to be corrected for in the calculation. After the crash course, he discussed his research with MOFs (metal-organic frameworks), porous materials on the nanoscale that can store and separate liquids or gases.

Wheeler delineated important goals and advice for computational chemists. He proposed that you only really learn how quantum chemistry methods work by programming them yourself, that you should learn a programming language (especially Python), and that you should take a course in linear algebra, if possible. Wheeler described three main types of methods: semi-empirical (approximate Hartree-Fock theory for 1000s of atoms), ab initio (MO-based methods for 10s to 100s of atoms, depending on the level of theory), and DFT (for 20 to 200 atoms). He asserted that choosing the appropriate level of theory for calculations is a primary focus of computational chemists. He described his trials to find an appropriate DFT functional for his research. Because traditional DFT functionals completely fail to capture dispersion interactions, he decided on B97-D as a fast, accurate functional, though it had problems with long-range interactions. His research group studies the pi-pi stacking of substituted benzene rings.

Evanseck introduced molecular dynamics. He stated three reasons for computation: to interpret experimental data, to extend (fill gaps in) experimental data, and to make predictions to guide the experiment. He made an analogy to the Pople Diagram (see below), with force field sophistication instead of level of theory. He then broke down the pros and cons of molecular dynamics methods vs. quantum methods. In addition, he pointed out that it is much easier to tweak the existing potential energy functions than to write new ones.

[Figure: Pople diagram]

Wilmer has become successful with a start-up called NuMat Technologies Inc., which develops new MOFs with programs that generate hypothetical MOFs and calculate their ability to store and separate natural gases for motor vehicles and CO2 capture. He advanced with the help of some business majors by participating in business plan competitions.

Pastor informed us of the tremendous difficulty and computing time of membrane simulations, specifically looking at natural antibiotics in epithelial layers. He said it was appropriate for early grad students to be inexperienced and lazy (to learn but not disrupt the work of the professors or pursue fruitless projects), post-docs to be smart and hard-working (to effectively collaborate with professors), and PIs to be smart and lazy (to delegate tasks properly and be able to respond to post-docs and students). He encouraged students to seek truth, work within a community, and to be a good and generous friend by keeping friends out of the wrong quadrants (smart/inexperienced, hardworking/lazy), where inexperienced and hardworking was the most dangerous for everyone.

Holloway has worked on finding cures for AIDS and hepatitis C at Merck. She described her career trajectory, having decided on computational chemistry because she was good at it and didn’t like lab work. She became a computational chemist at Merck straight from her Ph.D. She emphasized the cost (>$2.5 billion) and waiting time (>10 years) for new drugs to get on the market. However, potentially saving people’s lives motivated her and her fellow researchers.

Presenting abstracts seemed like it would be easy enough, but I got some of the ordering of my spiel wrong and flubbed some of the wording. This experience has taught me that memorizing and practicing a short speech might be a good idea in the future.

Poster presentations went much better for me. Because we had rehearsed, I was comfortable with the flow of my presentation, and I adjusted its length and detail based on the knowledge of the person listening (anyone from Steven Wheeler to students with no knowledge of Gaussian). One student named Clorice was particularly interested in my poster, because she was also starting some excited state optimizations with TD-DFT. However, she did not experience the same difficulty optimizing first and second excited energy states, because her project involved pi-pi stacking and enzyme activity, not a molecule with fluorescent properties similar to indole.

A highlight of the stay was hearing Chris Cramer expertly sing some folk songs at karaoke night (not to mention our own group’s hoe-down of “Sweet Home Alabama”). That night, along with conversations with students and faculty in chemistry who are also involved in music, gave me hope that music and art will stay relevant to me if I pursue a career in chemistry.

Filed under conference report

Python: Error Handling

One part of my updates to my Python script has been to manage what the script does in the case of incomplete output files. I have developed ways of circumventing a couple of issues, which include missing or different nstates values and incomplete files.

To get the number of states, I use a function to find the route line, and then I use a regular expression to extract the nstates value. The function that I use to find the route line can be seen in a previous post.

I use a simple regular expression to find the nstates value from the route line: nstatesRegex = re.compile(r"nstates=(\d*)")

An example of a route line, which the find_parameters function extracts, is as follows: # td=(nstates=6) cam-b3lyp/6-31+g(d) geom=connectivity

I use nstatesRegex to get the nstates value from that line, but sometimes the nstates value does not appear in the route line. My script does not work properly if the nstates is not specified in the route line of an output file. For now, my code just prints a message specifying any files that do not include an nstates value in the route line.
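A minimal sketch of how the missing-value case could be handled is below. The function name get_nstates and its default argument are made up for this example, and the pattern differs slightly from mine: it uses \d+ so an empty match cannot slip through, and it ignores case as an assumption about how route lines might be typed.

```python
import re

nstatesRegex = re.compile(r"nstates=(\d+)", re.IGNORECASE)


def get_nstates(route_line, default=None):
    """Return nstates from a route line as an int, or `default` if absent."""
    match = nstatesRegex.search(route_line)
    if match is None:
        # nstates is not written in the route line
        return default
    return int(match.group(1))


route = "# td=(nstates=6) cam-b3lyp/6-31+g(d) geom=connectivity"
nstates = get_nstates(route)
```

When the route line has no nstates, the caller decides what to do with the returned default, for example printing the file name as my script does now.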

If an nstates value is different from the other nstates values, I need to compensate for missing values in the list that contains all the relevant data. I first loop through all the files to get the largest_nstates value. In the second loop over the files, in which I append the relevant data to the master_results list, I do the following after I append the energy values:

#pad each list with one blank cell per missing state
for x in xrange(nstates, largest_nstates):
    EE_lst.append(" ")
    abs_EE_lst.append(" ")
    osc_lst.append(" ")

The empty spaces account for empty cells, where the output file expects energy values. I extend my master_results list with those three lists. The empty spaces help each of these lists be the same length as the largest lists I find in the directory, so that the lists are a uniform length and can be iterated through more easily.
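The same padding can be written once as a small helper, sketched here. The name pad_to_length and the three energy values are made up for illustration.

```python
def pad_to_length(values, target_length, filler=" "):
    """Return a copy of `values` padded with `filler` up to `target_length`."""
    # max() guards against lists that are already long enough
    missing = max(0, target_length - len(values))
    return list(values) + [filler] * missing


# A file with 3 states, padded to match a directory whose largest nstates is 6
EE_lst = pad_to_length(["4.51", "4.78", "5.02"], 6)
```

Each of the three lists would be passed through the helper with largest_nstates as the target length.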

Sometimes a user may have incomplete output files in the directory. An easy way to verify that an output file is complete is to check if it has a job time. This method is imperfect, since I recently encountered an output file that failed inside Gaussian yet still reported its job time up until it crashed. Checking the job time will at least ensure that every file has run to its end, whether or not the job failed in Gaussian. In the latest version, my script fails when it tries to read an incomplete file, but it will print a message that tells the user which files did not have job times, so that the user knows which files to remove from the directory.

There are several updates that I can make on this script to help it deal with incomplete files. I could have it skip incomplete files (such as those without job times) instead of just printing a message about which files are incomplete, so that the script still reads all the complete files. Although I have made progress in dealing with files that have different numbers of states, I could set the default to 3 states (if that is indeed the default), when nstates is not specified in the route line. I could also have my script check for Gaussian error messages and inform the user of the error message and the pertinent file. These updates are meant to help the user more quickly identify problems and spend less time looking at the script and various output files to determine what went wrong in a failed run.
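A rough sketch of the skipping idea is below. It assumes the job-time line in my output files contains the text “Job cpu time” and that failed jobs contain “Error termination”; the function name classify_output and the sample file names and contents are made up.

```python
def classify_output(text):
    """Return 'error', 'incomplete', or 'ok' for the text of one output file."""
    if "Error termination" in text:
        return "error"
    if "Job cpu time" not in text:
        # no job time means the job never reached its end
        return "incomplete"
    return "ok"


# Made-up stand-ins for the text of three output files
sample_texts = {
    "good.out": " SCF Done\n Job cpu time: 0 days 0 hours 1 minutes\n",
    "cut_off.out": " SCF Done\n",
    "failed.out": " Error termination via Lnk1e\n Job cpu time: 0 days\n",
}
statuses = {name: classify_output(text) for name, text in sample_texts.items()}
to_analyze = sorted(name for name, status in statuses.items() if status == "ok")
```

The error check comes first because, as noted above, a failed job can still report a job time.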

Filed under programming

Python Single-Point Geometry Energy Script Update

Continued from a previous post.

As I created my first completed version of the Python script, I learned several things about Python.

One thing I learned was how file handles work. With the function:

def find_parameters(file_handle):
    for l in file_handle:
        if l.startswith(" #"):
            return l.rstrip()

I was running into an error in which I received the message, 'NoneType' object has no attribute…, meaning that my code was trying to use None as its input for functions that were looking for strings. That was after I had already looped through the file handle for another find function. I learned that looping through a file handle leaves the “cursor” at the end of the file, so that the very next use of the file handle starts at the end and finds nothing to read. To reset the “cursor” to the beginning of the file, I need to call file_handle.seek(0) before each new pass through the file, as follows:

def find_parameters(file_handle):
    file_handle.seek(0)
    for l in file_handle:
        if l.startswith(" #"):
            return l.rstrip()

Here is a Stack Overflow thread with an explanation.
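The cursor behavior can be reproduced in a few lines without a real file. This is a small demonstration using io.StringIO as a stand-in for an open output file; the two lines of text are made up.

```python
import io

# In-memory stand-in for an open output file
fake_file = io.StringIO(u" Entering Link 1\n # td=(nstates=6) cam-b3lyp/6-31+g(d)\n")

first_pass = [line for line in fake_file]
second_pass = [line for line in fake_file]  # cursor is at the end: nothing left

fake_file.seek(0)                           # move the cursor back to the start
third_pass = [line for line in fake_file]   # full contents again
```

Here second_pass is an empty list, which is why a function looping over the exhausted handle falls through and returns None.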

Another thing I learned had to do with the difference between the .append() and .extend() methods. After some searching and consultation with Prof. Kennerly, I realized that a problem in my results came from calling .extend() rather than .append(). I was trying to add a string as an item to an empty list, and .extend() treated the string itself as a sequence, adding each character as a separate item. Using .append() instead kept the string as a string and simply added it as a single value to the list. Here is a Stack Overflow thread with more discussion of the difference between .append() and .extend().
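The difference is easy to see with a short string. The functional name below is just an example value.

```python
with_append = []
with_append.append("B3LYP")   # adds the whole string as one item
# with_append is now ['B3LYP']

with_extend = []
with_extend.extend("B3LYP")   # iterates over the string, one character at a time
# with_extend is now ['B', '3', 'L', 'Y', 'P']
```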

One more error that I do not understand but have since overcome has to do with Unicode. I had an interesting hour or so in which I got long text files of Sanskrit and other unexpected languages instead of my output data, when I opened my results text file in Wordpad. Other text editors, such as Microsoft Word, had no trouble reading the text files. I think that this error had something to do with the .extend() function, but I’m not sure. I recall that the error went away after I corrected my code to use .append() instead of .extend(), but that may not have been the direct source of the error. Further investigation of the Unicode issue is needed for a better understanding of it.

Often error messages may not describe the cause of the problem, but give clues of where the problem is. That is why many different error messages popped up when I had a more fundamental issue somewhere else in the script. In addition, as Prof. Kennerly says, computers always do what you tell them. So, the Sanskrit output made sense, given the code I wrote. To fix problems with the script, I needed more understanding of what certain functions did and how python and Wordpad read files.

I used IDLE on a Windows computer to edit and run my script.

P.S. WordPress is not good at allowing indentations or extra spaces. I have to use a keyword &+nbsp; (without the +) to create spaces in the text editor, and all instances of it disappear if I switch to the visual editor. I think this is an issue with HTML, and I read that I would have to use CSS to allow indents in WordPress. This is an annoying problem, and I am trying to get around it for future posts where I show snippets of code.

Filed under Uncategorized

Python Script for Excited State Energy Calculations at Single-Point Geometries

I am in the process of writing a script in python to read Gaussian output files for single-point energy calculations and produce a text file of the energy values (ground state and six excited states) that can be opened in Excel. This will save me some time, since I am currently testing different combinations of functionals and basis sets with TDDFT in Gaussian, and I need to compare the results from many calculations to determine which combination is best. I am referencing a script written by Kristine Vorwerk. However, while Kristine’s script parses files with cclib and extracts the descriptions for energy values in addition to lambda max values, mine uses regular expressions and extracts only the energy values by themselves in an order I have predetermined. Like Kristine’s script, my script also extracts the method and basis set names and the job CPU time.

Another element of Kristine’s script that I have incorporated into my own is a brief interface on the command line that asks for the file path of the folder that contains the .out or .log files I want to read. The script will read every file of that type in that folder and create a text file in that same folder with all the results of interest. Unfortunately, testing the script on the command line can be a time-consuming and confusing process, since the command line itself does not show any error messages. To debug as I write, I am running my script through PyCharm, so I can see exactly where my code fails.

One of the biggest challenges of adapting this script from Kristine’s is that, as a beginner programmer, I am unsure which functions require cclib and which do not. As I move toward a finished script, I will rewrite many of her definitions and functions in a syntax that I am sure will work without cclib. Most of the adaptations rely on my knowledge of regular expressions. In particular, I am interested in ways that I could simplify parts of my code using loops. Since cclib uses simple functions to parse files, regular expressions should take more code to do the same job. However, I am finding that parts of my code look redundant and could probably be shortened using additional loops. For example, since I am looking for the same types of values for six excited states, my code has blocks such as:

mo1 = energyState1Regex.search(line)
if mo1 is not None:
    splitted = line.split()
    EEs1.extend(["  ", splitted[4]])
    absEEs1.extend(["   ", float(splitted[4]) + groundStateEV])

mo2 = energyState2Regex.search(line)
if mo2 is not None:
    splitted = line.split()
    EEs2.extend(["  ", splitted[4]])
    absEEs2.extend(["   ", float(splitted[4]) + groundStateEV])

mo3 = energyState3Regex.search(line)
if mo3 is not None:
    splitted = line.split()
    EEs3.extend(["  ", splitted[4]])
    absEEs3.extend(["   ", float(splitted[4]) + groundStateEV])

….

etc. within a loop, scanning each line for the regular expressions which indicate the different excited states. I could probably shorten this code with another loop, but I am still thinking about how to do that. I cannot use a loop to change one character in a variable name (e.g. mo1, mo2, etc.), so I may need to change the way my regular expressions search for the data I want. Kristine found a simple solution for that problem a while ago, but I believe that was for a list that could be indexed. Nonetheless, I may look at her latest script that uses regular expressions to see if she has found any ways to simplify the code I am writing, and whether that script may be a better reference for me to use.
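One possible way to fold the repeated blocks together is sketched below: a single regular expression captures the state number, and a dictionary keyed by that number replaces the numbered variables. This is only a sketch under assumptions. The pattern, the function name collect_excitation_energies, the sample lines, and the ground state value are all made up, and the line layout is assumed to put the energy in eV at splitted[4] as in my blocks above.

```python
import re

# One pattern for every state; the group captures the state number
excitedStateRegex = re.compile(r"Excited State\s+(\d+):")


def collect_excitation_energies(lines, groundStateEV):
    """Map state number -> (excitation energy, absolute energy), both in eV."""
    results = {}
    for line in lines:
        mo = excitedStateRegex.search(line)
        if mo is not None:
            splitted = line.split()
            excitation = float(splitted[4])
            # a later line for the same state overwrites the earlier one
            results[int(mo.group(1))] = (excitation, excitation + groundStateEV)
    return results


# Made-up lines in the assumed layout
sample_lines = [
    " Excited State   1:      Singlet-A      4.5000 eV  275.52 nm  f=0.0300\n",
    " Excited State   2:      Singlet-A      5.0000 eV  247.97 nm  f=0.1000\n",
]
energies = collect_excitation_energies(sample_lines, -100.0)
```

With this layout, energies[1] holds the values for the first excited state, energies[2] for the second, and so on, so the number of states no longer has to be fixed at six.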

 

cclib webpage

PyCharm webpage

Filed under chemistry, computing