Machine learning: What is it and why does it matter in chemistry?

Hi everyone, for my first blog post I’d like to try explaining what my research is about and why it matters, in a way that I hope is accessible to someone who isn’t a scientist actually working on this stuff. This is also my first try at science communication in this format, so I’d welcome any feedback you have!

My research is in applying machine learning methods to do theoretical chemistry. I’ll get to the “theoretical chemistry” part later — first off, what is machine learning?

What is machine learning?

It’s a technology that seems to be everywhere nowadays, and certainly gets its fair share of hype from tech blogs and grant proposals. Part of that is because it’s such a vague, general term: In fact, “machine learning” can mean any technology designed to mimic the human brain’s extraordinary ability to learn from experience and build mental models of the world. To get a better idea of what this means, let’s take a look at two tasks that are typically easy for humans and difficult for computers.

Fig. 1: A collection of points.
Fig. 2: An image labeling problem.

In the first image above, we have a collection of points that seem to follow a curve of some kind. Our brains easily fill in the gaps between the data in order to draw a smooth curve — even accounting for the slight scatter of the points about the line.

In the second image, we have to label some (admittedly crude) hand-drawn pictures. My artistic skills aside, you can probably identify which picture is supposed to be a dog and which one is supposed to be a cat — from just a few lines on paper.

(If an auditory example works better for you than a visual one, try this: Get some background noise going, like a loud fan or a music video in the background, and open another video with someone talking, e.g. from the news. You can probably pick up what they’re saying despite the background noise. Or, try getting someone to hum or whistle a popular song for you. Usually, all we need is a few bars of the tune to be able to tell what song it is, without needing to hear the whole musical accompaniment or even the rest of the song.)

What these two examples have in common is that our brain makes an internal model — a representation — of the data it’s fed. That model somehow manages to pick out the most important features. In the case of the cloud of points, it’s the overall trend and how the points “connect” to one another. In the case of the pictures, it’s a few basic features — the dog’s ears, the cat’s whiskers — that we’ve learned represent “dog” or “cat”, from real life or from watching cartoons. And in the case of the video playing over noise, we’ve learned what our language sounds like in a variety of different situations from many different speakers, so it doesn’t matter if there’s some noise — the signal comes through just fine.

It’s this ability to automatically make a representation of the input data that seems to be so hard for computers to mimic. Take the point cloud, for example. Sure, we can make a computer program to draw a line through the points, but it will probably try to go through every single point, scatter and all. Or we can try to make it smooth over the noise with a moving average, but if we’re not careful it’ll average over the signal as well as the noise: Basically, we have to tell it how big the signal is and how big the noise is; that’s not something it can figure out from the data itself without any prior information. In other words, some human interaction is required to tell the computer whether it did the task correctly! And for the image example, how would you even begin? Just extracting lines from a collection of pixels requires some serious mathematics, and we haven’t even gotten started on what those lines mean: how they combine to form features like ears, eyes, or a nose.
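To make the moving-average trade-off a bit more concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the sine-wave "signal", the noise level, and the three window sizes are my own assumptions, not data from the figure above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: a smooth signal (one period of a sine wave) plus random scatter.
x = np.linspace(0.0, 2.0 * np.pi, 200)
signal = np.sin(x)
noisy = signal + rng.normal(scale=0.2, size=x.size)


def moving_average(y, window):
    """Replace each point by the mean of the `window` points centred on it."""
    kernel = np.ones(window) / window
    return np.convolve(y, kernel, mode="same")


def rms_error(window):
    """Root-mean-square distance between the smoothed data and the true signal."""
    smoothed = moving_average(noisy, window)
    # Skip the edges, where the window runs off the end of the data.
    half = window // 2
    inner = slice(half, x.size - half)
    return np.sqrt(np.mean((smoothed[inner] - signal[inner]) ** 2))


# window = 1 means no smoothing at all; 151 points covers most of the curve.
for window in (1, 15, 151):
    print(f"window = {window:3d} points, error = {rms_error(window):.3f}")
```

A medium-sized window should come out best here: the tiny window leaves all the scatter in, while the huge one flattens the curve itself. The catch is that the program only knows which window is "best" because we handed it the true signal to compare against, which is exactly the prior information we don't have in real life.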

There’s a good reason why companies like Amazon and Facebook spend boatloads of money each year trying to improve their face recognition algorithms. This tutorial from Google should give you an idea of just how hard a problem image classification really is. (Google Image search, by the way, doesn’t seem to associate my drawings with either “cat” or “dog” — the related-image search only returns other hand-drawn pictures in a similar style, but of completely different things. The labels it suggests are “soft” and “sketch” — technically correct, but not very useful.) So the next time you’re feeling down on yourself, or maybe not so confident in your own intelligence, just think: You have, inside your head, an amazing information processing machine that instantly and effortlessly accomplishes tasks that tech companies have been spending many years and billions of dollars trying to replicate on a computer.1

The advances that computer algorithms have made today in tackling “hard” problems like these build upon decades of work in the broader field of artificial intelligence. One of the earliest attempts to actually model how brains work was the perceptron,2 which was designed to model how the neurons (individual cells) in a brain pass information to one another using simple signals. The main principle it’s building off of here is emergent behavior, where many small units each following the same, simple rules interact to give incredibly complex behavior, like pattern-matching, learning, and consciousness. Although we’re still far from understanding what consciousness is or how it arises, research into perceptrons — which later became artificial neural networks (ANNs), and then became “deep” neural networks as researchers figured out how to add more layers of neurons (interconnected nodes) — has cast some light onto how we learn and recognize patterns. One of the most exciting things about these models is that they can learn a representation of the data they’re trying to fit, in a way that’s reminiscent of how a human brain might make a mental model to solve the same task.
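For the curious, here is roughly what a single artificial "neuron" looks like in Python. This is a sketch of the textbook perceptron learning rule, not a reconstruction of the original 1950s machine; the function names and the toy task (learning the logical AND of two inputs) are my own choices for illustration.

```python
import numpy as np

# Truth table for logical AND: the output is 1 only when both inputs are 1.
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([0, 0, 0, 1])


def fire(weights, bias, x):
    """The neuron 'fires' (outputs 1) if its weighted input exceeds zero."""
    return int(np.dot(weights, x) + bias > 0)


def train_perceptron(inputs, targets, epochs=20):
    """Show the examples repeatedly, nudging the weights after each mistake."""
    weights = np.zeros(inputs.shape[1])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(inputs, targets):
            error = target - fire(weights, bias, x)  # -1, 0, or +1
            weights = weights + error * x
            bias = bias + error
    return weights, bias


weights, bias = train_perceptron(inputs, targets)
predictions = [fire(weights, bias, x) for x in inputs]
print(predictions)  # prints [0, 0, 0, 1]
```

One neuron like this can only draw a straight dividing line between "yes" and "no", which is why the interesting behavior only shows up once many of them are wired together.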

Fig. 3: Finding a smooth curve to fit the data.
Fig. 4: Updating the model to label the images correctly.

Artificial neural networks (ANNs, or NNs for short) are now among the most successful machine learning techniques, and part of that is because they’re very flexible models: Most realistic NNs have hundreds to thousands of adjustable parameters, so you can “train” the model to fit almost any data you feed it just by tweaking these parameters in the right way.3 Basically, you train or fit the model by feeding it many examples of data that are already labeled. In the image classification example above, we would feed the network many example images already labeled with “cat” or “dog.” As we adjust the network parameters to reduce classification error (the fraction of images incorrectly labeled), the network “learns” to output the correct label, ideally even for new images it hasn’t seen before. This process is called supervised learning, and it’s similar to the way foreign languages are traditionally taught: just show your students a bunch of flashcards with the word in English, the word in French, maybe a picture, and make them memorize all the cards. Over time, your students will hopefully be able to translate not just these words, but also similar words that they haven’t actually been shown yet.4 In the case of the curve-fitting problem, “supervised learning” means we have a well-defined task: find y as a function of x, rather than just “try to find some structure in this data.” Here, the y-coordinates of the points play the same role as the labels in the image classification problem.
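If you’d like to see what "tweaking the parameters" looks like in practice, here is a minimal sketch in Python of supervised learning on a curve-fitting task. The data, the network size (8 hidden neurons), the step size, and the number of steps are all assumptions I picked to keep the example small; real networks and real training procedures are considerably more elaborate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up labeled examples: inputs x with noisy labels y.
x = np.linspace(-1.0, 1.0, 40).reshape(-1, 1)
y = np.sin(3.0 * x) + rng.normal(scale=0.1, size=x.shape)

# A tiny network: 1 input -> 8 hidden neurons -> 1 output (25 parameters).
n_hidden = 8
params = {
    "W1": rng.normal(size=(1, n_hidden)),
    "b1": np.zeros(n_hidden),
    "W2": rng.normal(size=(n_hidden, 1)),
    "b2": np.zeros(1),
}


def loss_and_gradients(p):
    """Mean squared error on the examples, and its slope for each parameter."""
    hidden = np.tanh(x @ p["W1"] + p["b1"])
    error = hidden @ p["W2"] + p["b2"] - y
    d_out = 2.0 * error / len(x)
    d_hidden = (d_out @ p["W2"].T) * (1.0 - hidden**2)
    grads = {
        "W1": x.T @ d_hidden,
        "b1": d_hidden.sum(axis=0),
        "W2": hidden.T @ d_out,
        "b2": d_out.sum(axis=0),
    }
    return np.mean(error**2), grads


initial_loss, _ = loss_and_gradients(params)
for step in range(2000):
    _, grads = loss_and_gradients(params)
    for name in params:
        params[name] -= 0.02 * grads[name]  # take a small step downhill
final_loss, _ = loss_and_gradients(params)

print(f"error before training: {initial_loss:.3f}")
print(f"error after training:  {final_loss:.3f}")
```

The "learning" is nothing more than that inner loop: measure how wrong the model is on the labeled examples, work out which direction to nudge each parameter to make it less wrong, and repeat.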

Why does it matter to chemistry?

Nowadays, machine learning is finding usage in research fields from geology to psychology, so it shouldn’t be surprising that chemists and physicists are making increasing use of this technology as well. In this case, they’re interested in using machine learning to understand the world of atoms and molecules that make up everything around (and inside) you, as well as the tricky, unintuitive laws of quantum mechanics that govern how they behave.

Just like the examples above, there are many problems in chemistry that seem (at first) easy for humans but are still hard for computers to tackle. Let’s take a look at how our two problems of curve-fitting and image recognition translate into a chemistry context:

Fig. 5: A curve-fitting problem for molecules.
Fig. 6: An image labeling problem. For molecules.

In the first example, we’re trying to find a curve that describes how two molecules interact. Specifically, we want to find the potential energy of the molecules for a given distance: Lower potential energy means (basically) that they’re more likely to be at that distance.
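Here is a small Python sketch of what "lower energy means more likely" looks like in numbers. Two big assumptions: I’m using the Lennard-Jones formula, a standard textbook model for how two simple particles interact, as a stand-in for the real curve in the figure, and I’m using the Boltzmann factor exp(−E/kT) as a rough measure of how likely a given energy is, ignoring geometric details. The units are arbitrary.

```python
import numpy as np


def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Textbook model for the potential energy of two particles a distance r apart."""
    return 4.0 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)


def boltzmann_weight(energy, kT=1.0):
    """Relative likelihood of a state with the given energy at temperature T."""
    return np.exp(-energy / kT)


# Scan over distances and find where the energy is lowest.
distances = np.linspace(0.9, 3.0, 500)
energies = lennard_jones(distances)
r_min = distances[np.argmin(energies)]

print(f"lowest energy at r = {r_min:.2f}")
for r in (r_min, 2.0):
    weight = boltzmann_weight(lennard_jones(r))
    print(f"r = {r:.2f}: energy = {lennard_jones(r):+.3f}, weight = {weight:.2f}")
```

The bottom of the "well" in the curve is the distance where the two particles would most like to sit; too close and they repel each other strongly, too far and they barely notice each other.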

In the second example, we’re trying to figure out what molecule we’re looking at from just an “image” of its atomic positions.5 This is especially important for large, “floppy” molecules (like proteins) that can take many different shapes, and that behave in different ways depending on what shape they’re in.

These two examples only cover a small piece of the many inventive ways that machine learning has been applied in chemistry; one blog post wouldn’t be enough even to give a general overview of the subject.6 Instead, I’ll just focus on the first example because it’s closest to my research.

Suppose you have a way to get the potential energy of any molecule, any arrangement of atoms at all, just by writing down the exact positions of each atom and giving this list to a mysterious oracle. You will get the exact answer, but it will take time; there’s a certain amount of ceremony involved, and the oracle needs time to meditate on the answer. This is fine if you just want to get the energy of one molecule, but what use is that? In reality, molecules are moving and wiggling about constantly7, so you’d need to repeat this process many times if you wanted to simulate, say, how the molecules in each cell of your body actually do their job. But you can’t wait that long; with current technology, that might take more time than the age of the Universe, and require more computer memory than you could even fit on Earth.

Fig. 7: The oracle.

So let’s try our curve-fitting trick! We’ll just ask the oracle for a few energies, plot them against the atomic coordinates, and draw a curve through them like in the example above. Then, anytime we need an energy for a new set of coordinates, we just look at the curve instead of asking the oracle! This is the main idea behind machine learning potential energy surfaces, which is basically just curve-fitting applied to the problem of finding the energy of a collection of atoms. Unfortunately this problem isn’t nearly as simple as the example above, and it would be nearly impossible for a human to solve it by eye.
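In one dimension, the whole trick fits in a few lines of Python. This is only a sketch under stated assumptions: the `oracle` function here is a cheap made-up stand-in (a Morse curve) for the real, slow quantum-mechanical calculation, and Gaussian-kernel regression is just one common way of drawing the curve, chosen for brevity rather than because it is the method used in my research.

```python
import numpy as np


def oracle(r):
    """Stand-in for the slow, exact calculation (here just a Morse curve)."""
    return (1.0 - np.exp(-(r - 1.5))) ** 2 - 1.0


def gaussian_kernel(a, b, length_scale=0.5):
    """Similarity between two sets of distances: 1 if equal, near 0 if far apart."""
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2.0 * length_scale**2))


# Step 1: ask the oracle for a few energies.
r_train = np.linspace(1.0, 4.0, 12)
e_train = oracle(r_train)

# Step 2: fit a smooth curve through them (kernel regression with a tiny
# regularization term to keep the linear solve numerically stable).
K = gaussian_kernel(r_train, r_train)
weights = np.linalg.solve(K + 1e-8 * np.eye(r_train.size), e_train)


def surrogate(r):
    """Fast approximate energy: a weighted sum of similarities to the examples."""
    return gaussian_kernel(np.atleast_1d(r), r_train) @ weights


# Step 3: for new distances, read the curve instead of asking the oracle.
r_new = np.linspace(1.1, 3.9, 50)
max_error = np.max(np.abs(surrogate(r_new) - oracle(r_new)))
print(f"largest error on 50 new distances: {max_error:.1e}")
```

Note that the curve is only trustworthy between the points we asked the oracle about. Ask it for a distance far outside that range and it will happily give you an answer, just not a meaningful one.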

To see why, let’s take a closer look at that example. It’s a dataset from my PhD research project, and the objective is to predict the potential energy of two methane molecules (each composed of one carbon atom and four hydrogen atoms, arranged in a perfect tetrahedral structure) based on their relative positions. The plot shows the pre-computed potential energies (from our quantum-mechanical oracle) on the y axis, versus the distance between the two carbon atoms on the x axis. But this distance isn’t the only variable that changes the potential energy; let’s add another axis to the plot that shows how the molecules are oriented with respect to each other:

Fig. 8: A mess.

Do you think you could draw a curve (well, surface) through that? What if I told you that there are actually six independent variables (distances) that have an effect on the potential energy? And that more complex systems of atoms can easily require hundreds, if not thousands, of variables to describe?8

Fig. 9: This guy again. (Source: and, used under CC-BY-NC 2.5)

As you may have noticed, our brains, and especially our visual processing systems, just aren’t equipped to handle more than the three dimensions we’re used to seeing in the world around us. This is a key limitation when we use only our human brains to try to understand scientific data: There are just too many dimensions to keep track of. Fortunately, the maths and the algorithms we use for machine learning work just fine no matter how many dimensions you throw at them, although the more dimensions you have, the more training examples you’ll need, so we still use tricks to keep the number of dimensions under control. In a way, this is the main difference between machine learning potential energy surfaces and the more traditional curve-fitting approaches that have been used since the 1960s to build potential energy surfaces: Machine learning methods can work in a high number of dimensions in an automated way, requiring much less human intervention. Of course, this also means that the resulting model is much harder to understand, but the speed and accuracy it offers are usually worth it.
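To get a feel for why more dimensions means more training examples, here is a back-of-the-envelope calculation in Python. It assumes the crudest possible strategy, which is to ask the oracle for the energy at every point of a regular grid with 10 points along each axis; that number, and the function names, are arbitrary choices of mine. Nobody builds training sets this way in practice, but it shows how quickly things get out of hand. The 3N − 6 rule is the standard count of independent coordinates for N atoms that don’t all lie on one line.

```python
def internal_degrees_of_freedom(n_atoms):
    """Number of internal coordinates (3N - 6) for N atoms not all on one line."""
    return 3 * n_atoms - 6


def grid_size(n_dimensions, points_per_axis=10):
    """Oracle calls needed to sample every axis at a fixed resolution."""
    return points_per_axis**n_dimensions


# The rigid methane pair from the example above: six variables.
print(f"{grid_size(6):,} grid points in 6 dimensions")

# The same ten atoms if every atom were allowed to move independently.
flexible = internal_degrees_of_freedom(10)
print(f"{flexible} dimensions -> {grid_size(flexible):.0e} grid points")
```

Every extra dimension multiplies the cost by ten in this picture, which is why keeping the number of dimensions under control matters so much.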

There’s also the issue I touched on earlier about how machine learning models can “learn” their own representation of the dataset. In principle, you could just feed the atomic coordinates into the neural network and let it learn its own representation. In practice, however, it’s almost always a good idea to build our own representations that use some of the chemistry and physics knowledge we’ve accumulated over the centuries. But that’s a topic for another blog post…

Why does it matter to us?

So what can we do with these new, fancy curve-fitting algorithms? I already mentioned that this lets us run computer simulations of molecules, but what does that mean for our everyday lives? Well, first off, I think it’s really cool that we’re within reach of technology that will let us just draw a molecule, put it into a computer, and get an idea of how it will behave, interact, or react in its intended environment, all without doing a single experiment! It’s like something out of science fiction (I can definitely remember more than one Star Trek episode where a major plot point involved simulating some molecule or other). But that’s not what we write on our grant applications, unfortunately.9

We do write about two main technologies that could very well change our lives in the next decade or so: Drug discovery and materials discovery. The first one is motivated by public health: We are always in need of new pharmaceuticals to combat the infectious diseases that kill millions each year and impact the lives of many more (I’m writing this in the middle of the COVID-19 pandemic, which has likely changed billions of lives), as well as to improve the lives of those living with other long-term health conditions (including cancer). Computer simulations have been contributing to this search for a long time because they’re much faster than doing real experiments, one drug at a time — with simulations, we can filter out which candidate molecules are most likely to be successful and improve our understanding of why they work in the first place. We hope that machine learning simulations will be able to accelerate the pace of discovery beyond anything that was imaginable before.

As for materials, well, just have a look around you: They make up nearly all the things on, near, and around you. We can always use new materials to improve the devices we use or the clothes we wear, or to minimize their environmental impact. And by discovering new materials for more efficient batteries, solar cells, or superconductors, computational materials discovery has great potential to help in the fight against climate change. Just as in the search for new drugs, machine learning simulations promise to accelerate the pace of scientific discovery and make computational materials science a standard tool in designing the next generation of devices, clothes, and buildings.

Finally, there are many other ways that machine learning can have an impact on our everyday lives. It’s not just in Google image search or in voice recognition in our phones; more and more companies are using it to decide which ads to show you or what price you’ll pay for your airline or train ticket. Even more serious is the way banks are applying this technology to determine someone’s creditworthiness, and whether to give them a loan or a mortgage — decisions that can have a huge impact on someone’s life. And as you might have guessed, these algorithms are not without bias, whether because their training data is not representative of the population as a whole, or because they absorb (or even amplify!) biases already present in society.10 I urge you to read The Myth of the Impartial Machine, by Alice Feng and Shuyan Wu, which allows you to explore the issue by yourself with clear, interactive examples.

To end on a positive note, the authors of that article mention that one of the most important ways we can combat bias in machine learning models is to make sure that the people designing these algorithms come from backgrounds as diverse and varied as the people these algorithms are used on. No matter what your background or experience, you can find a way to participate in the development of the machine learning algorithms that will shape our future. I hope I’ve piqued your interest, and if you’d like to learn more, then great! Stay tuned, stay curious, and thanks for reading!


1Seriously, even if you don’t think you’re “good at maths,” your brain has the capacity to do some pretty heavy mathematics without you even knowing. For instance, if you know how to ride a bicycle, take a look at the equations on this page. In a way, your brain is solving all those equations, many times a second, just to keep you upright and moving in the right direction.

2Yes, I know that sounds like something from 1950s science fiction. That is when it was invented, after all.

3How these parameters are tweaked is a whole story of its own. If you’re interested in the technical details, I strongly suggest you check out the online book Neural Networks and Deep Learning by Michael Nielsen. Not only does it explain the theory in a clear, accessible way, but it comes with programming examples that let you implement this algorithm on your own computer.

4This is one of the biggest issues in fitting a machine learning model. If you’re not careful, it will just memorize the example data (computers are very good at memorization) but give completely useless results on anything else. This problem is called “overfitting” and these technically-correct-but-practically-useless solutions are sometimes called clever Hans solutions. This problem has been plaguing neural networks (and other machine learning models) since they were invented and much researcher time has been spent on finding ways to avoid it.
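Here is a small Python sketch of the memorization problem. The data are made up (a sine curve with some scatter), and I’m using plain polynomials instead of a neural network because they make the effect easy to reproduce; the two polynomial degrees are arbitrary choices for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(2)


def truth(x):
    """The smooth curve that the noisy examples were generated from."""
    return np.sin(x)


# 15 noisy training examples, plus a dense grid of test points.
x_train = np.linspace(0.0, 3.0, 15)
y_train = truth(x_train) + rng.normal(scale=0.1, size=x_train.size)
x_test = np.linspace(0.0, 3.0, 301)


def rms(values):
    return np.sqrt(np.mean(values**2))


def train_and_test_error(degree):
    """Fit a polynomial to the examples, then check it against the true curve."""
    model = Polynomial.fit(x_train, y_train, degree)
    return rms(model(x_train) - y_train), rms(model(x_test) - truth(x_test))


# Degree 3 has 4 adjustable parameters; degree 14 has 15, one per example.
for degree in (3, 14):
    train_error, test_error = train_and_test_error(degree)
    print(f"degree {degree:2d}: training error = {train_error:.1e}, "
          f"test error = {test_error:.1e}")
```

The flexible model has enough parameters to pass through every training example, scatter included, so its error on the examples it was shown is essentially zero. Between those examples it has to bend wildly to do so, which is what makes it so much worse on points it was never shown.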

Also, the example of translating between different languages is just another thing that humans are pretty good at and computers still struggle with. Just remember the last time you typed something into Google Translate and got something a native speaker would find odd, or even incomprehensible. (Check out the hilarious Translator Fails YouTube channel by Malinda Kathleen Reese for more examples of machine translation gone awry.)

5In fact, we train hundreds of thousands of neural networks (sorry, undergraduate students) for this task each year in introductory organic chemistry courses all over the world.

6The comment in this comic about the depth and variety of human subcultures applies equally well to research fields: For almost anything you can think of, someone has researched some specific sub-problem of that.

7At any temperature greater than absolute zero; this is called “thermal motion” and it’s the microscopic basis for the entire field of thermodynamics. (And even near absolute zero, the atoms still move around a bit because of Weird Quantum Effects.)

8A system of N atoms formally has 3N − 6 degrees of freedom, which is the same thing as “dimensions” for our purposes. Given that scientists are trying to simulate systems of anywhere from hundreds to millions (even billions!) of atoms, this means that the number of dimensions quickly gets out of hand even for modern machine learning algorithms. Of course, we’re always applying various tricks and approximations to keep the dimensions under control; more on that in another post.

9For a great perspective on how “computational experiments” could change the way we do science, and how machine learning potentials are bringing this dream ever closer to reality, check out this article by Dr. Miguel Caro (original article: M. A. Caro, Arkhimedes 3, 21 (2018)). It’s this article that first inspired me to think about how computational experiments are like something from science fiction, and I think that’s pretty cool.

10We see the same bias tendencies reflected in our machine learning chemistry models: If your example data contains much more of one type of molecule than another, then of course the model will be biased to predict the most common molecule correctly at the expense of the less common molecules. In order to make our models less chemically biased and more representative of the data they’ll encounter, we use some of the same strategies that machine learning researchers are using to counter societal bias, as described in the linked article.
