Running a drug screening program is like staging an enormous cocktail party—and listening in on the proceedings. At cocktail parties, there's so much small talk, but only a few meaningful conversations. Similarly, in drug screening programs, feeble drug-target interactions greatly outnumber the instances of high-affinity binding.

Imagine if you had to listen to every bit of a cocktail party's banter. Surely, that would be tedious. Now, consider how much worse it would be to evaluate every drug-target interaction in a typical drug screen. Why, that would exhaust even the most patient listener—the typical artificial intelligence (AI) system.

Unfortunately, conventional AI systems take a long time to sift through data about the interactions between drug candidates and protein targets. Most AI systems calculate each target protein's three-dimensional structure from its amino-acid sequence, then use that structure to predict which drug molecules the protein will interact with. The approach is exhaustive, but slow.

To move things along, researchers at MIT and Tufts University have devised an alternative computational approach based on a type of AI algorithm known as a large language model. These models—one well-known example is ChatGPT—can analyze huge amounts of text and figure out which words (or, in this case, amino acids) are most likely to appear together. The large language model developed by the MIT/Tufts team is known as ConPLex. It can match target proteins with potential drug molecules without having to perform the computationally intensive step of calculating the molecules’ structures.

Details about ConPLex appeared June 8 in PNAS, in an article titled, "Contrastive learning in protein language space predicts interactions between drugs and protein targets." ConPLex leverages advances in pretrained protein language models ("PLex") and employs a protein-anchored contrastive coembedding ("Con") to outperform state-of-the-art approaches.
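To give a feel for what a "protein-anchored contrastive" objective can look like, here is a minimal sketch of a generic triplet margin loss in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the article does not specify the loss, and the function names, the margin value, and the toy two-dimensional vectors are all invented for this example. The idea is that the protein acts as the anchor, a known binder is pulled toward it, and a decoy is pushed away.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet loss (assumed here, not taken from the paper).

    The loss is zero once the positive sits closer to the anchor than
    the negative does by at least `margin`; otherwise it is positive.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy 2-D embeddings, made up for illustration.
protein = [0.0, 0.0]      # anchor: the target protein
true_drug = [0.3, 0.4]    # positive: about 0.5 away from the protein
decoy = [3.0, 4.0]        # negative: 5.0 away from the protein

# Binder near the protein, decoy far away: nothing left to correct.
well_separated = triplet_margin_loss(protein, true_drug, decoy)

# Roles swapped (decoy treated as the binder): a large penalty.
confused = triplet_margin_loss(protein, decoy, true_drug)

print(well_separated, confused)
```

In a real system the vectors would come from learned encoders and the loss would drive gradient updates; the sketch only shows how such a loss rewards keeping true binders close and decoys distant.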

"ConPLex achieves high accuracy, broad adaptivity to unseen data, and specificity against decoy compounds," the article's authors wrote. "It makes predictions of binding based on the distance between learned representations, enabling predictions at the scale of massive compound libraries and the human proteome."
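The phrase "distance between learned representations" can be made concrete with a short sketch. The code below assumes that each protein and each drug has already been mapped to a vector in a shared space, and it uses cosine similarity as the closeness measure. Both the measure and the names (`cosine_similarity`, `rank_drugs`, `drug_a` and so on) are assumptions for illustration; the article says only that predictions rest on distance.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_drugs(protein_vec, drug_vecs):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = {name: cosine_similarity(protein_vec, vec)
              for name, vec in drug_vecs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical precomputed embeddings; real ones would have many more dimensions.
protein_vec = [1.0, 0.0, 1.0]
drug_vecs = {
    "drug_a": [0.9, 0.1, 1.1],   # points almost the same way as the protein
    "drug_b": [-1.0, 0.5, 0.0],  # points away from the protein
    "drug_c": [0.0, 1.0, 0.0],   # orthogonal to the protein
}

ranking = rank_drugs(protein_vec, drug_vecs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because each embedding is computed once and every pair is then scored with a handful of multiplications, this style of prediction is cheap enough to repeat across very large compound libraries, which is the scaling property the authors describe.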

The researchers tested their model by screening a library of about 4,700 candidate drug molecules for their ability to bind to a set of 51 enzymes known as protein kinases.

From the top hits, the researchers chose 19 drug-protein pairs to test experimentally. The experiments revealed that of the 19 hits, 12 had strong binding affinity (in the nanomolar range), whereas nearly all of the many other possible drug-protein pairs would have no affinity. Four of these pairs bound with extremely high, sub-nanomolar affinity (so strong that a tiny drug concentration, on the order of parts per billion, will inhibit the protein).
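The scale of the screen and the experimental success rate follow from simple arithmetic on the figures reported above. The snippet below works through it; the library size is treated as exactly 4,700 even though the article says "about," so the pair count is approximate.

```python
# Figures as reported in the article; the library size is approximate.
library_size = 4700      # candidate drug molecules
kinase_count = 51        # protein kinase targets
tested_pairs = 19        # top-ranked pairs chosen for experiments
nanomolar_hits = 12      # pairs with strong (nanomolar) binding
subnanomolar_hits = 4    # subset of those hits with sub-nanomolar binding

# Every drug paired with every kinase.
possible_pairs = library_size * kinase_count

# Fraction of the experimentally tested pairs that bound strongly.
hit_rate = nanomolar_hits / tested_pairs

print(f"possible pairs: {possible_pairs:,}")
print(f"confirmed hit rate: {hit_rate:.1%}")
print(f"sub-nanomolar share of hits: {subnanomolar_hits / nanomolar_hits:.1%}")
```

So the model's shortlist of 19 was drawn from roughly 240,000 possible pairings, and close to two-thirds of that shortlist held up in the lab.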

While the researchers focused mainly on screening small-molecule drugs in this study, they are now working on applying this approach to other types of drugs, such as therapeutic antibodies. This kind of modeling could also prove useful for running toxicity screens of potential drug compounds, to make sure they don't have any unwanted side effects before testing them in animal models.

"This work addresses the need for efficient and accurate in silico screening of potential drug candidates," said Bonnie Berger, an MIT researcher and one of the senior authors of the new study. "[Our model] enables large-scale screens for assessing off-target effects, drug repurposing, and determining the impact of mutations on drug binding."

"Part of the reason why drug discovery is so expensive is because it has high failure rates," noted Rohit Singh, an MIT researcher and one of the study's lead authors. "If we can reduce those failure rates by saying upfront that this drug is not likely to work out, that could go a long way in lowering the cost of drug discovery."