From code to chemistry: A conversation with Connor W. Coley on open science practices and reproducible AI research

The Researcher's Source
By: Fernando Chirigati, Wed Jul 1 2026

Chief Editor, Nature Computational Science

This is the sixth blog post in our series in which researchers share their experiences with open science practices, and reflect on the impact that sharing open data, code, and protocols can have.   

Connor W. Coley is Associate Professor of Chemical Engineering, Electrical Engineering, and Computer Science at the Massachusetts Institute of Technology (MIT). His research focuses on how data and automation can be used to streamline discovery in the chemical sciences. As a tenured professor, he is closely engaged in open science practices in both his own research and that of his students. Professor Coley has contributed to several Nature journals, with recent publications including “A geometric foundation model for enzyme retrieval with evolutionary insights” and “Electron flow matching for generative reaction mechanism prediction”.

We spoke with Professor Coley about his work and perspective on open science, particularly the importance of sharing code that is accessible, usable, and reproducible for the progression of the field. 

Can you briefly introduce yourself and your research area?  

The research that we focus on in the group relates to the development of AI and other computational methods to help the design, synthesis, and analysis of small organic molecules for the most part. We build a lot of open source computational tools that have relevance to the synthesis of new molecules and their design, in particular for the pharmaceutical and chemical industries. 

How did you first become interested in open science practices?  

“Projects and contributions that are open source and reproducible allow people to understand what was done, how to reproduce it, and how to build on it.” 

Connor W. Coley

There wasn't a particular defining moment. Towards the end of my undergraduate studies and into graduate school, I benefited greatly from projects and contributions that were open source and reproducible, allowing me to understand what was done, how to reproduce it, and how to build on top of it. When I read research articles whose methods were described in the text but that weren't released with corresponding open source code, it would present a very clear barrier to understanding and reproducing what was done.

What open science practices do you use? 

We want to make sure that all the research we do is freely available. Sometimes that will mean publishing open access, so prioritising journals that have policies that let us do that. With our grants, there are some journals where publishing open access is rather expensive, and we don't always opt for that choice, but we will preprint almost every publication we have at the time of submission. We will preprint it typically on arXiv, so there's at least a version of record that is fully open that anyone can access. 

What motivated you to adopt these practices?  

I would say an appreciation for how much I benefited from accessible, reproducible tools when I was learning. During my PhD, I worked on some automation equipment and control software that built a lot upon MATLAB and LabVIEW. Those are commercial environments for code development, with LabVIEW being common in laboratory automation for graphical programming workflows.

That was part of a collaboration with Novartis, and we translated those platforms to Novartis for them to use internally for their research. I recall how much of a barrier it was to ask them to purchase a single site licence to be able to use the software that ran this platform, so having a commercial barrier to simply trying out research tools and products is something that I've tried to avoid in my own research more recently. That's why I'm very thankful that the language of choice for much of the research that we do is Python. We build on a lot of open source libraries and toolkits, and I think the open ethos has become much more ingrained in the field in the past few years. 

Have you faced any challenges or barriers in undertaking open science practices?  

We do a lot of research with statistical learning tools that learn from experimental data sets, and so we benefit greatly from training and deploying such tools with proprietary data sets. These may be commercial data sets that we license, for example, from CAS/SciFinder, and it's natural that in those agreements, we are not able to fully publish and redistribute the underlying data sets. 

That's very reasonable, but it does mean that occasionally, when we are pursuing different research questions, we have to make difficult choices about whether to use a data set we know can be open and reproducible versus a data set we believe to be of higher quality. Data set curation is very laborious and expensive, and there are reasons why that intellectual property lies with the people it does, but it poses a little bit of tension when we choose which data sets to work with, knowing that the work can't be fully reproducible like it could be with other data sets.

Is there a particular success story or example you’re proud of that illustrates the benefits of open science?  

“You know that the research contributions you're putting out into the world are actually being adopted and used.” 

Connor W. Coley

Quite often when we publish a new computational method, we'll receive a series of communications from colleagues in industry, from large pharma companies to biotechs, expressing interest in the tools, and so we get to feel that response from the community. I can imagine a situation where you preprint a method and there's no feedback from the community, except maybe a few months later you get some citations, but in this very much open science practice, you know that the research contributions you're putting out into the world are actually being adopted and used. 

I would also say that making sure your publications and methods are open source increases the number of people who are able to read, compare, and benefit from them. It's great for the field. We've encountered plenty of situations where we're unable to compare to a method someone else has published; maybe because it's inadequately described and it's not open source, or it's explicitly commercial and proprietary and their licence forbids comparison. By making your methods open, you become a part of the community. Maybe you become a row in a table that's used for comparisons in the future. It enables faster development in the field overall. 

A question we get asked frequently is about the intellectual property we might seek for the methods we develop. Patenting is a complicated landscape for some of these algorithmic contributions. Copyrighted software that is licensed commercially could be a revenue source for the lab, institution, and me personally, but I think the greater impact on the field offsets those incentives for me.

It also shapes the kinds of collaborations we pursue. It makes the work we do more visible and attracts interest from folks in pharma who are very eager to work with us. One tangible example is a consortium we have at MIT called the Machine Learning for Pharmaceutical Discovery and Synthesis Consortium. This is a partnership between MIT and currently nine pharmaceutical and chemical companies, and there's a commitment that the outputs of this sponsored research be made available, open, and pre-competitive. Not only that, but I'd say peers express appreciation. Certainly, student trainees from other groups around the country and the world frequently express appreciation for our approaches to open science.

Are there any gaps in support or resources about open science you’d like to see addressed?  

One aspect of open science is making sure that the information being shared and published is in a form that can be easily read, adapted, and benefited from. There are some fields where deposition of crystal structures, for example, of inorganic molecules is mandated in the CSD, or you're expected to deposit protein structures in the Protein Data Bank. But in most other fields, for other types of data, there aren't really any clear guidelines or expectations around the structure the data should be in or the repositories it should be placed in.

Several years ago, a few of us in the field created the Open Reaction Database to try to help define what is a useful data structure for capturing information about chemical reactions in a clear, systematic way, with the hopes that down the line, when data of synthetic organic transformations is published, it will be maximally beneficial for people that train models on that data or want to search it more easily. This is very domain- and field-specific, but I'd say that many scientific practices lack well-defined data structures, and that sort of heterogeneity impedes some text and data mining applications that would otherwise be enabled. 

I greatly appreciate when journals are by nature free to read and free to publish. When journals are diamond open access, that's a very big draw for me as an author. 

What advice would you give to researchers considering adopting open science practices?  

“Reproducibility is a necessary element of solid scientific contributions and so having your methods and code be fully reproducible through open access software is the best way to make a contribution to the field that others can truly benefit from.” 

Connor W. Coley

There are almost no downsides to preprinting. It's a great way to increase the reach of your papers and to receive additional feedback, even during the stages of peer review, or in informal ways for software, code, and machine learning-focused contributions.

Reproducibility is a necessary element of solid scientific contributions and so having your methods and code be fully reproducible through open access software is the best way to make a contribution to the field that others can truly benefit from. 

There's also a very big difference between making your code available and making your code accessible, usable, and reproducible. It is true that when we open source software code, we do so with the expectation that there will be work required to respond to issues on GitHub to help people use the software in the way it was intended. That is one thing about publishing open source methods: you sign yourself up for a little bit of additional maintenance of that resource. Whereas for a publication you don't do that; once the publication is accepted, you're essentially done. We have to keep in mind from the beginning of a project that we're signing ourselves up for greater visibility, and maybe greater scrutiny, but that's part of how we share our work. 

I don't know how true this is, but I feel like there is a perception perhaps that sometimes open science comes at the expense of maintaining a competitive advantage in a crowded research field. People might think if they make their code available, someone will quickly improve upon it, or if they make their data set available, people will show that a different modelling approach does better on the same data set. There might be a reluctance to enable those direct comparisons that might be unflattering, but I do think ultimately this is how the field progresses. I just believe it's so inherent to how we need to do science. That should outweigh any fear of embarrassment. 

Learn more about open science and sharing research data, code and protocols & methods openly.


Connor W. Coley, PhD, Associate Professor of Chemical Engineering, Electrical Engineering, and Computer Science, Massachusetts Institute of Technology (MIT)

Connor W. Coley is the Class of 1957 Career Development Professor and an Associate Professor at MIT. The Coley Research Group works at the interface of chemistry and data science to develop models that understand how molecules behave, interact and react.

Dr. Coley earned a bachelor of science degree in chemical engineering from the California Institute of Technology in Pasadena, California and a master of science degree in chemical engineering practice and Ph.D. in chemical engineering from the Massachusetts Institute of Technology in Cambridge, Massachusetts. He completed postdoctoral work at the Broad Institute of MIT and Harvard in Cambridge, Massachusetts. 


Related content:

  • Open science conversations: 
  1. Building trust through transparency: An open science conversation with Geir Kjetil Sandve 
  2. Open science, altruism and impact: An interview with clinical geneticist Zornitza Stark
  3. Bioinformatician Johannes Koester on embodying the spirit of open science 
  4. How to do science that matters: Computational biologist Casey S. Greene on purpose-driven open science practices 
  5. Open science before it had a name: An interview with molecular biologist Steven Henikoff 
  • Best practices for transparency and reuse: 
  1. How to share your research protocols and methods openly 
  2. How to share your research code openly 
  • Supporting open science practices: 
  1. Why share your research data? 
  2. Why sharing protocols matters 
  3. Why sharing your code matters 


Fernando Chirigati

Fernando Chirigati is Chief Editor of Nature Computational Science. He received his PhD in Computer Science from New York University in 2018, and also worked as a postdoctoral research associate at the same institution. He conducted research in various areas, including scientific data management, provenance management and analytics, large-scale data analytics, data mining, computational reproducibility, and data visualization.