Retrosynthesis prediction with an iterative string editing model
Nature Communications volume 15, Article number: 6404 (2024)
Retrosynthesis is a crucial task in drug discovery and organic synthesis, where artificial intelligence (AI) is increasingly employed to expedite the process. However, existing approaches employ token-by-token decoding methods to translate target molecule strings into corresponding precursors, exhibiting unsatisfactory performance and limited diversity. As chemical reactions typically induce local molecular changes, reactants and products often overlap significantly. Inspired by this fact, we propose reframing single-step retrosynthesis prediction as a molecular string editing task, iteratively refining target molecule strings to generate precursor compounds. Our proposed approach involves a fragment-based generative editing model that uses explicit sequence editing operations. Additionally, we design an inference module with reposition sampling and sequence augmentation to enhance both prediction accuracy and diversity. Extensive experiments demonstrate that our model generates high-quality and diverse results, achieving superior performance with a promising top-1 accuracy of 60.8% on the standard benchmark dataset USPTO-50K.
Designing synthetic reaction pathways for molecules is a fundamental aspect of organic synthesis, holding significant implications for various fields such as biomedical, pharmaceutical, and materials industries. Retrosynthetic analysis is the most widely used approach for developing synthetic routes. It involves breaking down molecules iteratively into simpler and more readily synthesizable precursors using established reactions. This methodology, initially formalized by Corey1,2, led to the development of computer-aided synthesis planning (CASP). CASP utilizes computational methods to predict retrosynthetic pathways, assisting chemists in efficiently identifying the optimal synthesis routes for target molecules. It has become a vital tool for addressing the challenges of organic synthesis planning.
In recent years, artificial intelligence (AI)-driven retrosynthesis has facilitated the exploration of more complex molecules and significantly reduced the time and energy required to design synthetic experiments3,4,5,6,7,8,9,10,11,12. Single-step retrosynthesis prediction is a crucial component of retrosynthetic planning, and several deep learning-based methods have been proposed with promising results. These methods can be broadly categorized into three groups13,14: template-based, template-free, and semi-template-based methods.
Template-based methods regard retrosynthesis prediction as a template retrieval problem and compare the target molecule with precomputed templates. These templates capture the essential features of the reaction center in specific types of chemical reactions. They can be generated manually or automatically and serve as a guide for the model to identify the most suitable chemical transformation for a given molecule. Various works15,16,17,18 have proposed different approaches to prioritize candidate templates. RetroSim15 employed the molecular fingerprint similarity between the given product and the molecules present in the corpus to rank the candidate templates. NeuralSym16 was the pioneering work to utilize deep neural networks for template selection by learning a multi-class classifier. GLN17 built a conditional graph logic network to learn the conditional joint probability of templates and reactants. LocalRetro18 conducted an evaluation of the suitability of local atom/bond templates at all predicted reaction centers for a target molecule and incorporated the non-local effects in chemical reactions through global reactivity attention. It has demonstrated state-of-the-art performance within the template-based methods. Although providing interpretability and molecule validity, template-based models suffer from limited generalization and scalability issues, which can hinder their practical utility19.
Template-free methods utilize deep generative models to generate reactant molecules without relying on predefined templates. Most existing methods reformulate the task as a sequence-to-sequence20,21,22,23,24,25,26 problem, employing the sequence representation of molecules, specifically the simplified molecular-input line-entry system (SMILES)27. Liu et al.20 first utilized a long short-term memory (LSTM)28-based sequence-to-sequence (Seq2Seq) model to convert the SMILES representation of a product to the SMILES of the reactants. Karpov et al.29 further proposed a Transformer-based Seq2Seq method for retrosynthesis. SCROP21 integrated a grammar corrector into the Transformer architecture, aiming to resolve the prevalent problem of grammatical invalidity in Seq2Seq methods. R-SMILES26 established a closely aligned one-to-one mapping between the SMILES representations of the products and the reactants to enhance the efficiency of synthesis prediction in Transformer-based methods. PMSR25 devised three tailored pre-training tasks for retrosynthesis, encompassing auto-regression, molecule recovery, and contrastive reaction classification, thereby enhancing the performance of retrosynthesis and achieving state-of-the-art accuracy among template-free methods. Other studies characterize the task as a graph-to-sequence problem, employing the molecular graph as input14,30. Graph2SMILES30 integrated a sequential graph encoder with a Transformer decoder to preserve the permutation invariance of SMILES. Retroformer14 introduced a local attention head in the Transformer encoder to augment its reasoning capability for reactions. Recent studies, including MEGAN31, MARS32, and Graph2Edits33, have explored end-to-end molecular graph editing models that represent a chemical reaction as a series of graph edits, drawing inspiration from the arrow pushing formalism. However, these approaches usually require time-consuming predictions for sequential graph edit operations. Fang et al.34 developed a substructure-level decoding method by automatically extracting commonly preserved portions of product molecules. However, the extraction of substructures is fully data-driven, its coverage depends on the reaction dataset, and incorrect substructures can lead to erroneous predictions. While template-free methods are fully data-driven, they raise concerns regarding the interpretability, chemical validity, and diversity of the generated molecules19.
Semi-template-based methods leverage the benefits of the two aforementioned approaches. These methods follow a two-stage procedure: first, fragmenting the target molecule into synthons by identifying reactive sites, and subsequently converting the synthons into reactants using techniques such as leaving group selection13, graph generation35, or SMILES generation36,37. RetroXpert36 first identified the reaction center of the target molecule to obtain synthons by employing an edge-enhanced graph attention network, followed by the generation of the corresponding reactants based on the synthons. RetroPrime37 introduced the mix-and-match and label-and-align strategies within a Transformer-based two-stage workflow to mitigate the challenges of insufficient diversity and chemical implausibility. G2Gs35 initially partitioned the target molecular graph into several synthons by identifying potential reaction centers, followed by the translation of the synthons into the complete reactant graphs using a variational graph translation framework. GraphRetro13 first transformed the target into synthons by predicting a set of graph edits and then expanded the synthons into final molecules by attaching pertinent leaving groups. G2Retro38 determined the optimal order of molecule completion by considering all relevant synthon and product structures and attached small substructures to the synthons sequentially. These methods align more closely with the problem-solving intuition of chemists. However, the two learning stages of the framework are independent, leading to increased computational complexity. Additionally, propagating the knowledge and insights acquired from predicting the reactive sites to the reactant completion stage remains a significant challenge.
In this work, we focus on template-free retrosynthesis prediction. Existing methods often use string-based molecule representations due to their ease of manipulation and compatibility with established language models, resulting in higher generation efficiency22,25. Previous studies have shown that Transformer-based retrosynthesis predictions exhibit acceptable generalizability and robustness39. However, these methods generate reactants from scratch through token-by-token auto-regressive decoding, which yields unsatisfactory performance and limited diversity. In practice, chemical reactions often cause local molecular changes, leading to significant overlap between the reactants and products involved. Recognizing this fact, we propose redefining the problem as a molecular string editing task and introduce an Edit-based Retrosynthesis model, EditRetro, which enables high-quality and diverse predictions.
The core concept of this work revolves around generating reactant strings through an iterative editing process using Levenshtein operations40. Our approach draws inspiration from recent advancements in edit-based sequence generation models41,42. Specifically, we adopt the operations from EDITOR42, an edit-based Transformer designed for neural machine translation. The model architecture includes an encoder, a reposition decoder, a placeholder decoder, and a token decoder, as illustrated in Fig. 1b. The decoding process employs reposition, placeholder insertion, and token insertion actions to ensure the accuracy of reactant generation. The reposition policy predicts the index of the input tokens, encompassing the reordering and deletion functions. Subsequently, the placeholder policy predicts the required number of placeholders, followed by the token insertion policy that determines the actual tokens to be inserted. To further enhance prediction diversity, we design an inference module with reposition sampling and sequence augmentation, as shown in Fig. 1a. Sequence augmentation randomly selects the starting atom and direction of the molecular graph enumeration to create variants of canonical molecular SMILES, allowing for diverse editing pathways from product strings to reactants. Reposition sampling samples from the output distribution of the reposition classifier, providing opportunities to identify a wider range of reaction types, as shown in Fig. 1c.
a The workflow of the inference module with reposition sampling and sequence augmentation. Sequence augmentation generates variants of the canonical SMILES. We always include the canonical SMILES as one of the inputs. Reposition sampling generates multiple outputs for an input SMILES. The fusion and ranking module first ranks candidate sequences from the same input based on their probabilities and then assigns each sequence a score using the reciprocal of its local rank. Finally, all canonical SMILES are ranked based on their final scores. b The process of generating candidate reactants for a target molecule by iteratively running the decoders to improve the quality of the generated molecular string until it converges. c The workflow of reposition sampling. In the reposition decoder, greedy search is employed to find the best result at each position. Furthermore, a sampling method is utilized to obtain predictions at each position, resulting in the generation of alternative output sequences. In the token decoder, the best prediction at each placeholder is obtained using greedy decoding, forming the first output sequence. Additionally, we select the second-best prediction at each placeholder to create the second output sequence, and so on.
We have evaluated our proposed method on the public benchmark datasets USPTO-50K and USPTO-FULL. Extensive experimental results demonstrate that the method yields superior performance over other baselines in terms of prediction accuracy, including the state-of-the-art sequence-based method R-SMILES and the graph edit-based method Graph2Edits. In particular, our method achieves a top-1 exact-match accuracy of 60.8% and a top-1 round-trip accuracy of 83.4%, demonstrating that it has effectively learned chemical transformation rules. The quantitative evaluation shows that our method can provide relatively diverse sets of predictions with a high level of generation validity. To enhance understanding of our approach, we provide visualized editing operations during the generation process. Further experiments show that our method is capable of handling complex chemical transformations, including chiral, ring-opening, and ring-forming reactions, as well as multi-step synthesis. This indicates that our method is a highly valuable tool for chemists and researchers in the field.
Chemical reactions involve the participation of reactant molecules, represented by the reactant set R, and the formation of product molecules, represented by the product set P. In the context of this study, we focus on the task of template-free single-step retrosynthesis prediction, which aims to generate the reactant set R corresponding to a given product molecule P, without relying on pre-defined reaction templates or rules. It is important to note that in addition to reactants and products, chemical reactions may involve solvents, catalysts, and reagents. However, for the purpose of this study, we do not consider them in our analysis.
We adopt a string-based representation to encode chemical reactions, using a variable-length string that includes a pair of SMILES notations, one for the reactants and the other for the product compound. To formalize the molecular string editing problem, we introduce a Markov decision process \((\mathbf{S}, \mathbf{A}, \mathrm{E}, \mathrm{F}, \mathbf{s}^0)\). In this formulation, a state \(\mathbf{s} = (s_1, s_2, \cdots, s_L) \in \mathbf{S}\) is a sequence of tokens, where each token \(s_i\) is drawn from a predefined vocabulary \(\mathbf{V}\). The sequence has length \(L\), and the initial sequence to be refined, i.e., the product string, is denoted by \(\mathbf{s}^0\). The set of editing actions that can be applied to the sequence is defined as \(\mathbf{A}\). The reward function \(\mathrm{F}\) is defined as the negative distance \(\mathrm{D}\) between the generated output and the ground-truth sequence, given by \(\mathrm{F}(\mathbf{s}) = -\mathrm{D}(\mathbf{s}, \mathbf{s}^*)\). In this setup, an agent interacts with an environment \(\mathrm{E}\) that receives the agent's editing actions and returns the modified sequence. The agent's behavior is modeled by a policy \(\pi: \mathbf{S} \to P(\mathbf{A})\) that maps the current sequence to a probability distribution over \(\mathbf{A}\). At every decoding step, the model receives an input sequence \(\mathbf{s}\) and selects an editing action \(\mathbf{a} \in \mathbf{A}\) to refine it using the policy \(\pi\), resulting in a new state \(\mathrm{E}(\mathbf{s}, \mathbf{a})\), i.e., the intermediates or the reactants. The objective is to optimize the policy \(\pi\) to maximize the cumulative reward obtained throughout the sequence refinement process.
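For illustration, the reward can be sketched as follows; this is a minimal example assuming \(\mathrm{D}\) is instantiated as a token-level Levenshtein distance (the formulation above does not prescribe a specific implementation):

```python
def levenshtein(a: list, b: list) -> int:
    """Token-level Levenshtein distance: the minimum number of insertions,
    deletions, and substitutions needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, start=1):
        curr = [i]
        for j, tb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ta
                            curr[j - 1] + 1,            # insert tb
                            prev[j - 1] + (ta != tb)))  # substitute ta -> tb
        prev = curr
    return prev[-1]


def reward(s: list, s_star: list) -> int:
    """F(s) = -D(s, s*): the reward is higher the closer the edited
    string is to the ground-truth reactant sequence."""
    return -levenshtein(s, s_star)
```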
Our proposed model incorporates three editing actions, namely sequence reposition, placeholder insertion, and token insertion, to generate reactant strings. It is implemented as a Transformer model consisting of an encoder and three decoders, all composed of stacked Transformer blocks. Our model enhances generation efficiency through its non-autoregressive decoders: although it incorporates additional decoders to iteratively predict editing actions, EditRetro performs the editing actions in parallel within each decoder (i.e., non-autoregressive generation). When given a target molecule, the encoder of our method takes its string as input and generates corresponding hidden representations, which are then used as input for the cross-attention modules of the decoders. The decoders likewise take the product string as input at the first iteration. During each decoding iteration, the three decoders are executed consecutively, as shown in Fig. 1b.
Reposition Decoder: The sequence reposition policy (classifier) πrps predicts a value r for each input position. If the value of r is the index of an input token, the token will be placed at the predicted position. If the value of r is 0, the input token will be deleted. The reposition action involves basic token editing operations such as keeping, deleting, and reordering. It can be compared to the process of identifying a reaction center, involving the reordering and deletion of atoms or groups to obtain the synthons.
Placeholder Decoder: The placeholder insertion policy (classifier), denoted as πplh, predicts the number of placeholders to be inserted between adjacent tokens. It plays a crucial role in determining the structure of the reactants, similar to identifying the locations for adding atoms or groups to the intermediate synthons obtained from the sequence reposition stage.
Token Decoder: The token insertion policy (classifier), denoted as πtok, is responsible for generating candidate tokens for each placeholder. It is essential in determining the actual reactants that can be used to synthesize the target product. This process, in combination with the placeholder insertion action, can be seen as analogous to synthon completion.
This iterative refinement process continues until the termination condition is reached. Detailed model architectures and training strategies can be found in the Methods section.
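Schematically, one full decoding run can be summarized by the following sketch; the method names (`reposition`, `insert_placeholders`, `fill_tokens`) are illustrative stand-ins for the three classifiers described above, not the actual implementation:

```python
def edit_decode(product_tokens, model, max_iters=10):
    """Iteratively edit the product string toward the reactant string(s).

    Each iteration runs the three decoders consecutively; within each
    decoder, edits at all positions are predicted in parallel
    (non-autoregressive generation).
    """
    seq = list(product_tokens)  # the decoder input at the first iteration
    for _ in range(max_iters):
        prev = seq
        seq = model.reposition(seq)           # keep / delete / reorder tokens
        seq = model.insert_placeholders(seq)  # decide where tokens are added
        seq = model.fill_tokens(seq)          # decide which tokens to add
        if seq == prev:                       # termination condition reached
            break
    return seq
```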
To evaluate the effectiveness and performance of our proposed method, we conducted experiments on two widely used benchmark datasets: USPTO-50K43 and USPTO-FULL17. These datasets provide diverse and comprehensive collections of chemical reactions, enabling a thorough evaluation of our model's capabilities in molecule retrosynthesis. USPTO-50K is a high-quality dataset that contains ~50,000 reactions from the U.S. patent literature. These reactions have accurate atom mappings between products and reactants, and they have been categorized into 10 distinct reaction types, facilitating detailed analysis and comparison with other existing methods. It has been extensively used in previous studies, making it suitable for benchmarking our proposed method against state-of-the-art approaches. For the USPTO-50K dataset, we adopt the same split as reported in Coley et al.15 and divide it into a 40K/5K/5K train/validation/test split. The USPTO-FULL dataset is a significantly larger chemical reaction dataset, comprising ~1 million reactions. We partition it into ~800K/100K/100K training/validation/test reactions following Dai et al.17. The USPTO-FULL dataset serves as a validation of our model's performance on a larger and more diverse set of reactions. By conducting experiments on both benchmark datasets, we can assess the performance, generalization ability, and scalability of our proposed method, providing valuable insights into its effectiveness for practical applications in molecule retrosynthesis.
To enhance the learning capabilities of the Transformer model and ensure its generalization ability without being overly reliant on the syntax rules of SMILES canonicalization, we employ the SMILES augmentation technique. In line with the augmentation strategy of previous studies22,24,26,37, we apply 20-fold augmentation to both the training and test sets of USPTO-50K. Similarly, for the USPTO-FULL dataset, we apply 5-fold augmentation to the training and test sets. It is important to note that augmentation is only applied to the products in the test sets, whereas it is applied to both the products and reactants in the training sets. The SMILES structures generated through augmentation remain valid, as they are produced by randomly selecting the starting atom and graph enumeration direction. This random selection ensures that the augmented SMILES structures are chemically plausible and conform to the necessary rules and constraints. In this study, we performed SMILES canonicalization and augmentation using RDKit44.
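As a minimal sketch, randomized SMILES variants of this kind can be produced with RDKit's `doRandom` option (the exact augmentation code used in this work may differ):

```python
from rdkit import Chem

def augment_smiles(smiles: str, n: int = 20) -> list:
    """Return the canonical SMILES plus up to n-1 randomized variants.

    With doRandom=True, RDKit picks a random starting atom and traversal
    direction for the graph enumeration, so each call may yield a
    different (but chemically equivalent) SMILES string.
    """
    mol = Chem.MolFromSmiles(smiles)
    variants = {Chem.MolToSmiles(mol)}  # always include the canonical form
    attempts = 0
    while len(variants) < n and attempts < 10 * n:  # cap for small molecules
        variants.add(Chem.MolToSmiles(mol, doRandom=True))
        attempts += 1
    return sorted(variants)
```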
To obtain SMILES fragments, we employ the SMILES Pair Encoding (SPE) method, which enhances the standard atom-level tokenization approach by incorporating human-readable and chemically explainable SMILES substrings as tokens45. This encoding technique allows for more intuitive and interpretable representations of molecules, facilitating the understanding of chemical transformations. We utilized the SPE tokenizer pre-trained on the ChEMBL dataset, as employed in45, for this study. Furthermore, we align the product and reactant SMILES on the training set, following the method proposed by Zhong et al.26. This alignment process establishes a correspondence between the product and reactant molecules, enabling the model to capture the relationships and transformations between them effectively. By aligning the SMILES strings, we create a connection between the product and reactants, facilitating the learning of retrosynthetic patterns and improving the model's ability to generate accurate and meaningful reactant strings. The combination of the SMILES Pair Encoding method and the alignment technique enhances the quality and interpretability of the input molecule representations. It enables our model to capture the structural and chemical information present in the SMILES strings and utilize it effectively in the retrosynthesis prediction task.
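For reference, tokenization with the pretrained ChEMBL merges might look like the following sketch, assuming the `SmilesPE` package and its `SPE_ChEMBL.txt` merge file from ref. 45 are available locally:

```python
import codecs
from SmilesPE.tokenizer import SPE_Tokenizer

# Load the SMILES Pair Encoding merges pre-trained on ChEMBL (ref. 45).
spe = SPE_Tokenizer(codecs.open('SPE_ChEMBL.txt'))

smiles = 'CC(=O)Oc1ccccc1C(=O)O'  # aspirin, used here only for illustration
tokens = spe.tokenize(smiles)      # returns a space-separated token string
print(tokens.split())              # chemically meaningful SMILES fragments
```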
In evaluating the performance of our proposed EditRetro model for molecule retrosynthesis, we utilize the top-k exact match accuracy as our primary evaluation metric. This metric provides a rigorous assessment by comparing the canonical SMILES of the predicted reactants to the ground truth reactants in the test dataset. By measuring the exact match accuracy, we ensure that the predicted reactants precisely match the ground truth reactants, indicating the model’s ability to generate accurate retrosynthetic predictions. To comprehensively assess the overall performance of EditRetro, we conduct comparative evaluations against a diverse set of state-of-the-art approaches, including template-based, template-free, and semi-template-based approaches. This comparison allows us to gauge EditRetro’s performance in relation to the existing methods and provide insights into its strengths and limitations.
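A minimal sketch of this metric follows, assuming each prediction is a list of reactant SMILES strings ranked by model confidence (simplified relative to the full evaluation pipeline):

```python
from rdkit import Chem

def canonicalize(smiles: str):
    """Canonical SMILES with '.'-separated fragments sorted, so that
    reactant order does not affect the comparison; None if invalid."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return '.'.join(sorted(Chem.MolToSmiles(mol).split('.')))

def top_k_exact_match(predictions, ground_truths, k: int) -> float:
    """Fraction of test reactions whose ground-truth reactants appear
    among the top-k predictions after canonicalization."""
    hits = 0
    for preds, truth in zip(predictions, ground_truths):
        target = canonicalize(truth)
        if any(canonicalize(p) == target for p in preds[:k]):
            hits += 1
    return hits / len(ground_truths)
```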
The results of top-k exact match accuracy on the USPTO-50K dataset, when the reaction class is not provided, are shown in Table 1. Specifically, EditRetro achieves a top-1 accuracy of 60.8% and a top-3 accuracy of 80.6%. In a more detailed comparison, EditRetro reaches state-of-the-art performance among template-free methods and exceeds the notable R-SMILES model by a margin of 4.5% in top-1 accuracy. Moreover, EditRetro also achieves performance comparable to the baseline models for larger values of k, such as k = 5 and 10. We also provide a detailed breakdown of the top-1 exact match accuracy of our model across various reaction types in Supplementary Table 1, revealing that the model's accuracy varies depending on the specific reaction type. A more detailed analysis can be found in Supplementary Notes.
In addition to USPTO-50K, we further evaluate the performance of our method on the larger and more diverse USPTO-FULL dataset, which poses additional challenges due to its extensive collection of chemical reactions. As shown in Table 2, our method achieves superior performance to all baselines in top-1 accuracy (52.2%). It should be noted that template-based approaches, which rely on predefined reaction templates, often struggle with generalizing to new reaction templates and handling the vast number of templates present in large datasets19. This limitation inherently affects their performance and scalability. However, EditRetro, being a template-free approach, exhibits competitive performance on larger datasets. This highlights its ability to generalize well to diverse reaction types and overcome the limitations associated with template-based methods.
The comprehensive results obtained from our evaluations consistently demonstrate the superior capability of EditRetro in generating high-quality reactants for a given product in retrosynthesis. This superiority can be attributed to two key factors: the close correlation between the three editing stages within a single iteration and the model’s ability to self-correct during iterative refinement. The close correlation between the sequence reposition, placeholder insertion, and token insertion stages within each decoding iteration enables EditRetro to effectively capture complex and diverse patterns in the data. Furthermore, the self-correcting nature of EditRetro during iterative refinement contributes to its high level of accuracy. The model continuously learns from its previous predictions and adjusts its subsequent predictions accordingly. This self-correction mechanism allows EditRetro to refine and improve the reactant generation process, leading to the generation of high-quality and chemically valid reactants.
As there may be multiple candidate reactants that can be used to synthesize the same product, we additionally adopt round-trip accuracy to evaluate the model. Round-trip accuracy was formally proposed in a multi-step retrosynthesis study11 and quantifies the percentage of retrosynthetic predictions considered plausible by a forward prediction model. It is calculated by comparing the given product with the product predicted by a forward reaction model that takes the predicted reactants as input. For this purpose, we utilize the pre-trained Molecular Transformer46 as the oracle forward reaction prediction model, following previous work14,18. We adopt the top-k round-trip accuracy definition used in Retroformer14: \(\mathrm{RoundTrip}(k)=\frac{1}{N\times k}\sum_{i=1}^{N}\sum_{j=1}^{k}\mathbb{I}(\text{reach ground-truth product})\), where N is the number of molecules in the test set. The results of round-trip accuracy on USPTO-50K are summarized at the top of Table 3. Our model achieves impressive results, with a top-1 accuracy of 83.4% and a top-3 accuracy of 73.6%. Furthermore, even when considering k values of 5 and 10, our method is comparable to most of the baselines. This demonstrates the capability of EditRetro to effectively learn chemical rules.
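In code, the definition above can be sketched as follows, where `forward_model` is a placeholder for a pretrained forward predictor such as the Molecular Transformer, and `canonicalize` is the helper defined in the earlier sketch:

```python
def round_trip_accuracy(products, predictions, forward_model, k: int) -> float:
    """RoundTrip(k): the fraction of the N*k top-k predictions whose
    reactants are mapped back to the recorded product by the forward model."""
    hits = 0
    for product, preds in zip(products, predictions):
        target = canonicalize(product)
        for reactants in preds[:k]:
            if canonicalize(forward_model(reactants)) == target:
                hits += 1
    return hits / (len(products) * k)
```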
In addition to the roundtrip accuracy, we adopt the MaxFrag accuracy metric22, inspired by classical retrosynthesis, to assess the exact match of the largest fragment. This metric is specifically designed to address prediction limitations caused by unclear reagent reactions in the dataset. The MaxFrag accuracy focuses on evaluating the accuracy of the largest fragment match, providing a more targeted assessment of the model’s ability to predict the main reactant fragment. This metric is particularly valuable in scenarios where the reactant reactions are not explicitly defined or may exhibit uncertainties. By emphasizing the largest fragment, we aim to mitigate the impact of unclear reagent reactions on the overall performance evaluation. The results of top-k MaxFrag accuracy are shown at the bottom of Table 3. EditRetro exhibits superior performance, outperforming all baselines with an accuracy of 65.3% for top-1 predictions and 83.9% for top-3 predictions. Furthermore, when k is equal to 5 and 10, EditRetro’s performance is also slightly better than that of the baselines.
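The largest-fragment comparison underlying MaxFrag accuracy can be sketched as below (simplified; tie-breaking and invalid-SMILES handling may differ from the original metric):

```python
from rdkit import Chem

def max_fragment(smiles: str):
    """Canonical SMILES of the largest '.'-separated fragment,
    measured by heavy-atom count; None if the input is invalid."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    frags = Chem.GetMolFrags(mol, asMols=True)
    largest = max(frags, key=lambda m: m.GetNumHeavyAtoms())
    return Chem.MolToSmiles(largest)

# A prediction counts as a MaxFrag match when its largest fragment equals
# that of the ground-truth reactants:
#   max_fragment(predicted) == max_fragment(ground_truth)
```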
Diversity in predicted reactions is crucial for exploring a broader synthesis space and discovering novel chemical pathways. In our inference module, we incorporate reposition sampling and sequence augmentation to enhance generation diversity. This allows for the identification of multiple reaction centers and the consideration of various attachments, enabling EditRetro to generate diverse reactants with distinct scaffolds and structures.
To gain a more comprehensive understanding of our model’s predictions, we visually analyze two randomly selected molecules along with the top-10 predictions by EditRetro. The first example, illustrated in Fig. 2a, showcases the synthesis of 5-Bromo-3-(3-pyridinylmethoxy)-2-pyridinamine. EditRetro identifies four distinct reactive sites in this synthesis. The first site corresponds to the oxygen atom, which aligns with the ground truth and includes the top-1, 2, 3, 5, and 7 predictions. The top-1 prediction precisely matches the ground truth, representing a Williamson ether synthesis reaction. Similarly, the top-2 and top-3 predictions involve substituting the chlorine atom with hydroxyl and bromine, respectively. Additionally, the top-5 prediction replaces hydroxyl with bromine. However, the top-7 prediction fails to generate the product within a single step. The second site pertains to the amino group and encompasses the top-4, 9, and 10 predictions. The top-4 prediction leads to product formation through the reduction of nitro compounds to amines. The top-9 prediction involves a Hofmann rearrangement reaction, and the top-10 prediction entails the substitution of aromatic halides with nitrogen nucleophiles. The third site corresponds to the bromine atom and is associated with the top-6 prediction, which represents the halogenation of aromatic compounds. The fourth site involves the nitrogen atom in the pyridine ring. However, its associated top-8 reaction is not plausible for the synthesis of the product. All these predictions, except for top-7 and top-8, have been verified by two chemists and can be successfully reproduced using the Molecular Transformer with high confidence. Therefore, they are considered plausible reactions. It is worth noting that the ground-truth reaction only achieves a maximum yield of 29% as reported in47. Conversely, all plausible predictions indicate significantly higher yields estimated by48, with an average yield exceeding 65%.
a Williamson ether synthesis reaction. EditRetro identifies four distinct reactive sites. All these predictions, except for top-7 and top-9, are plausible for the synthesis of the target product. b Alkylation of amines. EditRetro predicts five distinct reactive sites for the product. All of these predictions, except for top-9, are plausible reactions. Different reactive sites are highlighted with different colors.
In Fig. 2b, the second case exemplifies the synthesis of Benzamide, N,N-diethyl-4-[[4-[(4-methylphenyl)methyl]-1-piperazinyl]-8-quinolinylmethyl]-(9CI, ACI). EditRetro identifies five distinct reactive sites for the product. The first site is consistent with the ground truth, including the top-1, 3, 4, and 6 predictions. The top-6 prediction matches the ground truth, which involves the alkylation of amines. The top-1, 3, and 7 predictions use different compounds, i.e., 4-Methylbenzaldehyde, 4-Methylbenzyl chloride, and 4-Methylbenzyl bromide, respectively. Furthermore, the top-4 prediction can yield the product, although with a lower confidence level of 0.46 using the Molecular Transformer. The second site encompasses the top-2 and 10 predictions, which use the amidation reaction. The third site is linked to the top-5 and top-8 predictions, which correspond to the nucleophilic substitution reaction. The fourth site involves the top-7 prediction, corresponding to a carbonyl reduction reaction. The fifth site is associated with the top-9 prediction, which does not produce the desired product. All of these predictions, except for top-9, have been verified by two chemists and can be successfully obtained using the Molecular Transformer.
The examples presented above indicate that predicted reactions from EditRetro that differ from the ground truth reactions can still be synthetically valuable and feasible. This demonstrates that our method possesses the inherent capability to learn the underlying reaction rules and provide highly rational and diverse predictions. To quantitatively evaluate the diversity of the predictive outcomes, we examine the molecular similarities among them, following previous work33,38. We calculate the average Tanimoto similarity between each pair of predicted reactants in the top-10 predictions for each product using concatenated ECFP4 fingerprints49. A lower similarity score indicates a higher diversity in the predicted results. Furthermore, we employ the K-means clustering algorithm to group the products based on the similarity of their predicted reactants. As shown in Fig. 3, the predictions in the first four clusters can be considered to have high diversity, as they exhibit lower prediction similarities (0.28, 0.37, 0.41, and 0.46) and account for ~36% of the test set. The predictions in the middle three clusters have medium diversity, with average similarities of 0.52, 0.56, and 0.59, and account for nearly 44% of the test set. The predictions in the last three clusters are considered to have relatively low diversity, as they exhibit relatively higher prediction similarities (0.63, 0.68, and 0.77). These clusters constitute a small proportion of the test set and indicate that EditRetro can predict similar reactants in some cases. The average similarity score for the entire test set is 0.55. Overall, these results demonstrate that EditRetro is capable of predicting relatively diverse sets of reactants. We also evaluated the diversity of the predictive outcomes on the larger USPTO-FULL dataset, as demonstrated in Supplementary Fig. 2. The results show that EditRetro continues to exhibit promising diversity on larger and more diverse reaction datasets. Furthermore, we have evaluated the chemical validity rates produced by EditRetro, and the results are presented in Supplementary Table 2 and Supplementary Notes.
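The similarity computation can be sketched as below; this simplified version fingerprints each '.'-joined reactant string as a whole, which approximates the concatenated per-reactant ECFP4 fingerprints used in the analysis:

```python
from itertools import combinations
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def average_pairwise_similarity(predicted_reactants) -> float:
    """Mean Tanimoto similarity over all pairs of predicted reactant sets,
    using ECFP4 (Morgan radius-2, 2048-bit) fingerprints; lower values
    indicate more diverse predictions."""
    fps = []
    for smi in predicted_reactants:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            fps.append(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048))
    sims = [DataStructs.TanimotoSimilarity(a, b) for a, b in combinations(fps, 2)]
    return sum(sims) / len(sims) if sims else 0.0
```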
The values displayed above the bars indicate the average similarity of the predicted reactants in the cluster, with lower values indicating higher diversity among the predicted reactants. The results demonstrate that EditRetro is capable of predicting relatively diverse sets of reactants. Source data are provided as a Source Data file.
Chirality is a fundamental property of asymmetry that plays a critical role in stereochemistry and drug discovery. To assess the ability to handle chirality, we compare the performance of EditRetro and a strong baseline method, R-SMILES, on the USPTO-50K test set for reactions with and without chirality. As illustrated in Fig. 4a, when k = 1, EditRetro achieves better results (55.7% and 61.8%) than R-SMILES (51.6% and 56.7%) for both chiral and non-chiral reactions. These results indicate that EditRetro outperforms R-SMILES in handling chirality, demonstrating its ability to accurately predict the correct chiral configurations. Moreover, EditRetro consistently exhibits superior or comparable performance to R-SMILES across different values of k in terms of top-k accuracies. Notably, both methods exhibit better performance on non-chiral reactions than on chiral ones, highlighting the challenges of handling chirality in retrosynthesis prediction. We attribute EditRetro’s superiority in handling chiral reactions to its edit-based generation approach. In a chemical reaction, products and reactants may share several substructures. By generating reactants based on the product’s structure, EditRetro facilitates accurate prediction of chirality during the generation process. These results demonstrate the efficacy and robustness of our edit-based method for retrosynthesis prediction.
a Reactions with/without chirality. Both methods show inferior performance on reactions with chirality compared to those without chirality, highlighting the challenge of chiral reactions. EditRetro exhibits superior performance to the baseline in most cases. b Non-ring, ring-opening, and ring-forming reactions. EditRetro demonstrates superior performance over the baseline in most cases for non-ring reactions and consistently exhibits better performance for ring-opening and ring-forming reactions.
Ring-forming and ring-opening reactions are both essential transformations in organic synthesis, with significant theoretical importance and broad practical applications. Ring-forming reactions allow the synthesis of various cyclic compounds, while ring-opening reactions, such as the epoxide ring-opening reaction, are crucial steps in the synthesis of many organic compounds. To assess the model's capability for predicting these types of reactions, we compare the performance of EditRetro and the baseline R-SMILES on non-ring, ring-opening, and ring-forming reactions. As depicted in Fig. 4b, both models demonstrate weaker performance on ring-opening and ring-forming reactions than on non-ring reactions. This observation indicates the inherent challenges associated with predicting these specific types of reactions. However, EditRetro consistently outperforms or performs comparably to R-SMILES on all types of reactions. In particular, EditRetro shows significant improvements over R-SMILES for ring-opening and ring-forming reactions. For example, when k = 1, EditRetro outperforms R-SMILES by 5.9% for ring-opening reactions and 5.8% for ring-forming reactions. These results further confirm the superiority of our edit-based generation approach over methods that generate structures from scratch. EditRetro leverages the existing product structure to guide the generation of reactants, enabling it to capture the specific requirements and transformations involved in ring-opening and ring-forming reactions. We also compared the effects of the SPE tokenizer with the token-wise tokenizer for our model in Supplementary Fig. 5.
In this subsection, we present empirical examples to demonstrate our model's reasoning process and its iterative refinement capability. Our model performs explicit Levenshtein string editing operations in each decoder step, allowing chemists to easily understand the generation process. This transparency enhances trust in and the utility of the generated results. Additionally, the parallel execution of string editing operations improves efficiency, enabling faster generation and scalability to larger datasets and complex molecules. Moreover, the model features iterative refinement, which allows for self-correction and improvement in subsequent iterations. To evaluate the impact of iterative refinement, we analyze the distribution of refinement iterations for correctly predicted reactions in the test set of USPTO-50K, focusing on the top-1 exact match accuracy.
As depicted in Supplementary Fig. 3, the analysis reveals that the majority of reactions in the test set achieve accurate predictions after just one refinement iteration, accounting for 80.18% of cases. This highlights the strong performance of our model in generating correct reactants in the initial iteration itself, ensuring efficiency and practicality in real-world applications. For a smaller proportion of reactions, additional refinement iterations are necessary to achieve accurate predictions. There are 335 (10.67%) cases that require two refinement iterations, while 64 (2.04%) and 29 (0.92%) cases require three or more iterations, respectively. These instances represent scenarios where the initial prediction may not fully capture the complexity of the chemical transformations or where further optimization is required for accurate reactant generation. Overall, the distribution of refinement iterations demonstrates that our method achieves high accuracy with efficient generation processes, aided by its iterative refinement capability. We also conduct a quantitative evaluation of the inference latency of our model; the results are presented in Supplementary Table 3.
To gain insights into the reasoning process of our model, we randomly select 3 reactions with different reaction types from the test set of USPTO-50K and visualize the generation process. These examples provide a deeper understanding of how EditRetro generates reactants and demonstrate its robustness in iteratively refining its predictions. The first example in Fig. 5a depicts a Wohl-Ziegler bromination reaction, which involves the allylic bromination of hydrocarbons using N-bromosuccinimide and a radical initiator. EditRetro accurately predicts that all atoms are from the reactants without the need for any reposition operations. During the insertion stage, it accurately predicts the placeholders and generates the final two ground-truth reactants with high probability. EditRetro achieves this by combining several substructures to form the Ethyl 1-(2,4-dichlorophenyl)-5-(4-methoxyphenyl)-4-methyl-1H-pyrazole-3-carboxylate (ACI) and generating N-bromosuccinimide from scratch. Figure 5b presents the second example, which showcases two different reactions that can be used to synthesize butyl cinnamate. The top-ranked prediction involves the esterification reaction. EditRetro successfully obtains the two reactants by removing the n-Butane fragment and inserting a 1-Butanol fragment while concatenating other fragments in a successive manner. The second-ranked reaction is a Heck reaction. EditRetro identifies another reaction center using a variant of the canonical SMILES and performs deletion and reordering operations on the fragments. This is followed by inserting corresponding fragments to obtain the ground-truth Butyl acrylate and Bromobenzene. This example showcases EditRetro’s ability to handle diverse reaction types and generate reactants through a combination of editing operations. The third one depicted in Fig. 5c is a nucleophilic addition reaction with two iterations. In the first iteration, EditRetro accurately identifies the reaction center but obtains an invalid molecule and an unavailable molecule. In the subsequent iteration, it continues to refine the intermediate molecules based on the predicted reaction center and successfully obtains the ground truth molecules, 4-Hydrazinylbenzonitrile and 4-Oxocyclohexyl benzoate, with a relatively high probability. This example highlights the robustness of our model in iteratively refining its predictions and correcting any self-generated errors.
a Wohl-Ziegler reaction. EditRetro reliably identifies the placeholders and tokens to generate the ground truth reactants in one iteration. b Esterification reaction and Heck reaction. EditRetro provides two different reactions to synthesize butyl cinnamate. c Nucleophilic addition reaction. During the first iteration, EditRetro generates an invalid molecule and an unavailable molecule. However, in the subsequent iteration, it is capable of detecting and self-correcting the incorrect generation. The P denotes the probability of the model's prediction. The orange, cyan, yellow, and green boxes correspond to the operations of deletion, reordering, placeholder insertion, and token insertion, respectively. The [No Operation] and [Terminate] states are determined by the model. The corresponding molecule graph editing process is illustrated at the bottom of each example for reference.
As analyzed in Fig. 6, EditRetro sometimes produces incorrect predictions. To analyze these incorrect predictions in depth, we first conduct a comprehensive performance comparison by evaluating the four different error categories proposed by the baseline model MEGAN in Supplementary Fig. 4. We observe that EditRetro's top-1 predictions for the first, second, and fourth error categories, i.e., only possible in multiple steps, low yield or side products, and a reactive functional group ignored, are entirely consistent with the ground-truth reactants. Regarding the third error category, namely incorrect chirality, our method also predicts the ground-truth reactants as the top-2 ranked choice. Furthermore, the top-1 prediction is also chemically feasible, identifying another reaction center while correctly handling the chirality. These results further reinforce the high accuracy and reliability of our model.
a Redundant reactant molecule, (b) Discrepancy in reactive sites, and (c) Chemically infeasible reaction. The atoms highlighted in red indicate the reactive sites.
To further investigate the incorrect predictions, we present three instances of inaccurate reactions by EditRetro within its top-10 predictions in Fig. 6. In the first example, shown in Fig. 6a, EditRetro accurately identifies the reactive site and produces three reactants. Among these, two molecules align with the ground truth, while one is redundant and does not participate in the reaction. In Fig. 6b, EditRetro correctly identifies the reactive site in accordance with the ground truth and generates two molecules, of which one aligns with the ground truth. However, the two molecules are incapable of producing the desired product due to the discrepancy between the reactive site of the molecule Cc1ccc(S(=O)(=O)OS(=O)(=O)[N+](C)(C)C)cc1 and that of the ground truth molecule. Moreover, this molecule is not available in CAS SciFinder-n50 and poses challenges for synthesis. Finally, EditRetro sometimes generates chemically infeasible reactions, as seen in Fig. 6c, where the two molecules are generally unable to react. This indicates a limitation in our model's ability to accurately assess the feasibility of certain reactions. These examples provide valuable insights into potential areas for improvement in our model. To rectify these inaccuracies, potential future improvements could involve integrating chemical modules capable of determining the reactivity of different reactive sites.
To assess the practical utility of our one-step prediction method in synthesis planning, we extend EditRetro, which was trained on the USPTO-50K dataset, to enable the design of complete chemical pathways through sequential retrosynthetic predictions. We select four target compounds of significant medicinal importance for our evaluation: Febuxostat51, Osimertinib52, an allosteric activator for GPX4 (ref. 7), and the DDR1 kinase inhibitor INS015_037 (ref. 53).
Febuxostat, presented as the first example in Fig. 7a, is a medication for treating gout that selectively inhibits xanthine oxidase without affecting purine synthesis. Our method accurately predicts a three-step pathway for febuxostat, which is identical to the pathway previously reported by Cao et al.51. The first step involves ester hydrolysis, followed by the Suzuki cross-coupling reaction between 3-cyano-4-isobutoxyphenyl boronic acid and ethyl 2-bromo-4-methylthiazole-5-carboxylate as reactants. The second example is the third-generation EGFR inhibitor Osimertinib for non-small cell lung carcinoma treatment, as illustrated in Fig. 7b. The complete five-step synthesis pathway of this drug was proposed by Finlay et al.52, utilizing easily accessible or obtainable starting materials. In the synthesis pathway for Osimertinib suggested by our model, the first step involves an acylation reaction using acryloyl chloride. Subsequently, the model accurately predicts the reduction of the nitro group in the second step. In the following two steps, the model suggests sequential nucleophilic aromatic substitution reactions (SNAr) to introduce the amino side chain and nitroaniline. Notably, our model deviates from the Friedel-Crafts arylation reported in the literature and instead proposes a Suzuki cross-coupling reaction in the final step, consistent with the baseline Graph2Edits approach, to generate 3-pyrazinyl indole. The third example is an allosteric activator of glutathione peroxidase 4 (GPX4), whose synthetic pathway, illustrated in Fig. 7c, was reported by Lin et al.7. They predicted the synthetic pathway by enumerating different reaction types with a template-free model. Nevertheless, our method successfully predicts all five reaction steps among the top four predictions, even without considering the reaction type, which directly highlights the superiority of our approach. The fourth example presents a challenging but interesting retrosynthesis task for the DDR1 kinase inhibitor INS015_037, as shown in Fig. 7d. INS015_037 is a potential DDR1 kinase inhibitor designed using generative machine learning methods, which has been experimentally demonstrated to exhibit favorable pharmacokinetics in mice. Using a convergent synthesis approach, Zhavoronkov et al.53 first synthesized two precursors separately and then synthesized INS015_037 in the final step. Our model accurately predicts the convergent synthesis pathway with a high-ranked prediction, consistent with the reported synthesis pathway.
a Febuxostat. b The third-generation EGFR (Epidermal Growth Factor Receptor) inhibitor Osimertinib. c An allosteric activator for GPX4. d A DDR1 (Discoidin Domain Receptor 1) kinase inhibitor INS015_037. Distinct colors are used to clearly distinguish the reaction center, as well as the atom and bond transformations, in each step of the reaction. The retrosynthetic pathways generated by EditRetro for four examples closely align with those reported in the literature, with the majority of predictions ranking within the top two. It demonstrates the practical utility of EditRetro in synthesis planning.
All four demonstrated examples yield retrosynthetic pathways that closely align with those reported in the literature, with the majority of predictions ranking within the top two. Among the 16 individual steps considered, ten are accurately predicted at rank 1, with the remaining steps predicted at ranks 2, 3, 4, 6, and 7. These results underscore the potential of our model for practical retrosynthesis prediction. By providing valuable insights and facilitating the design of efficient synthesis routes, our method holds promise for practical applications in retrosynthesis planning.
In this study, we present EditRetro, an edit-based generative model for sequence-based single-step retrosynthesis prediction. Unlike previous sequence-based template-free methods that treat retrosynthesis as a language translation task, EditRetro formulates the problem as a molecular string editing task. The intuition behind our approach is based on the fact that chemical reactions typically occur on local substructures of the reactants, leading to considerable overlap between the reactants and products. EditRetro generates the reactants starting from the structure of the target product by predicting explicit Levenshtein editing operations on the molecular sequence representation (SMILES in this paper). Through an iterative decoding process, EditRetro sequentially performs sequence reposition, placeholder insertion, and token insertion actions that cover basic text editing operations such as keeping, deletion, reordering, and insertion. The decoding process determines whether generation is complete based on the termination conditions. If not, EditRetro self-corrects the intermediate sequence generated in the previous iteration by addressing issues such as invalid molecules and unreasonable reactants. Extensive experiments on the benchmark retrosynthesis dataset USPTO-50K show that EditRetro achieves promising performance with a 60.8% top-1 exact match accuracy. EditRetro also shows superior performance to baseline methods in terms of round-trip and MaxFrag accuracy. Furthermore, we evaluate our method on the larger USPTO-FULL dataset, where it achieves a top-1 exact match accuracy of 52.2%, demonstrating its effectiveness on a more diverse and challenging set of chemical reactions.
Furthermore, we highlight that EditRetro provides diverse predictions through a carefully designed inference module incorporating reposition sampling and sequence augmentation. Reposition sampling draws samples from the predictions of the reposition action, enabling the identification of different reactive sites. Sequence augmentation generates diverse editing pathways from different product variants to the reactants. Together, these two strategies enhance both the accuracy and the diversity of the predictions.
Further experiments verify the superiority of EditRetro in more complex reactions, including chiral, ring-opening, and ring-forming reactions, demonstrating its capability to handle diverse types of chemical transformations. In particular, the successful application of EditRetro in four multi-step retrosynthesis planning scenarios demonstrates its practical utility. These positive results suggest that our model exhibits promising generalization and robustness, highlighting its potential to advance the field of AI-driven chemical synthesis planning.
Despite the promising performance of EditRetro, there are still several challenges that need to be addressed to facilitate its widespread application in chemical synthesis planning.
Firstly, an advanced fragmentation method such as Group SELFIES54 is required to obtain more robust and chemically explainable molecular substructures. We found that some substructures obtained by SMILES Pair Encoding are not chemically explainable, making the editing process less clear. Additionally, the size of the fragments must be considered, as atom-wise editing can be computationally expensive, while excessively large fragments may not be chemically explainable. One possible solution is to combine the SMILES Pair Encoding method with advanced sequence alignment methods, such as those used in55, to obtain more chemically reasonable substructures.
Secondly, the inability of EditRetro to explicitly determine the reactivity of different reactive sites can lead to the generation of implausible and infeasible reactions. One promising solution is to incorporate chemical modules that can solve these issues in the decoding process. These modules can provide valuable insights and constraints to guide the generation of chemically valid and feasible reactions.
Lastly, integrating knowledge of reaction classes into the model poses another challenge. While reaction class information is crucial for guiding retrosynthesis prediction, incorporating it effectively into our model is not straightforward. One approach could involve inputting a reaction class token embedding as a hard constraint during the decoding process, enabling the model to refine its predictions based on specific reaction class information. Additionally, using different reaction class token prompts, as in56, could improve the prediction diversity in terms of reaction class, leading to more comprehensive and diverse synthesis plans.
Addressing these challenges will be essential to enhance the practical utility of EditRetro in real-world chemical synthesis planning scenarios. By improving the generation of chemically explainable substructures, incorporating chemical modules for reactivity assessment, and effectively leveraging reaction class information, EditRetro can further advance the field of AI-driven chemical synthesis planning and enable its widespread application in various domains.
As shown in Fig. 1, our model is built upon the base Transformer encoder and decoder57. Given a target molecule, the encoder component of our model takes its sequence representation \(\mathbf{x} = (x_1, \cdots, x_n)\) as input and maps it to a sequence of hidden representations \(\mathbf{h}^{(\mathrm{enc})} = (\mathbf{h}_1^{(\mathrm{enc})}, \cdots, \mathbf{h}_n^{(\mathrm{enc})})\). This encoder architecture is similar to a conventional sequence-to-sequence (seq2seq) encoder, which encodes the input sequence into a fixed-length vector representation. To enable iterative refinement, we use the product string (which is the same as the encoder input) as the initial input of the decoder. Each refinement iteration consists of three editing decoders: the reposition decoder, the placeholder decoder, and the token decoder.
The reposition decoder reads the input token sequence \(\mathbf{s}\) and gives a categorical distribution over the indexes of the input tokens as their new positions using the reposition classifier \(\pi_{\mathrm{rps}}\):

\[
\mathbf{h}^{(\mathrm{rps})} = \mathrm{Decoder}\left(\mathbf{e}_{\mathbf{s}} + \mathbf{l}_{\mathbf{s}}\right), \qquad
\pi_{\mathrm{rps}}(r \mid i, \mathbf{s}) = \mathrm{softmax}\left(\mathbf{h}_i^{(\mathrm{rps})} \cdot \left[\mathbf{b}; \mathbf{e}_{s_1}; \cdots; \mathbf{e}_{s_n}\right]\right), \qquad
\mathbf{s}' = \mathrm{reposition\_tokens}(\mathbf{s}, \mathbf{r}),
\]

where \(\mathbf{e}_{\mathbf{s}}\) and \(\mathbf{l}_{\mathbf{s}}\) denote the token embeddings and positional encodings of the input sequence \(\mathbf{s}\), respectively; \(\mathbf{h}^{(\mathrm{rps})}\) represents the output hidden representations of the input tokens in the reposition decoder; \(\mathbf{b} \in \mathbb{R}^{d_{\mathrm{model}}}\) is used to predict whether to delete a token; \(\mathbf{r}\) represents the newly predicted indexes of the input tokens under \(\pi_{\mathrm{rps}}\); and the function reposition_tokens places the input tokens in the output sequence \(\mathbf{s}'\) or deletes them. The dot product in the softmax captures the similarity between the hidden state \(\mathbf{h}_i^{(\mathrm{rps})}\) and each input embedding \(\mathbf{e}_{s_j}\) or the deletion vector \(\mathbf{b}\). When r > 0, the r-th input token \(s_r\) is placed at the i-th output position; if r ≤ 0, the token at that position is deleted. To preserve the sequence boundaries, we enforce the constraint \(\pi_{\mathrm{rps}}(0 \mid 0, \mathbf{s}) = \pi_{\mathrm{rps}}(n+1 \mid n+1, \mathbf{s}) = 1\) for the special tokens indicating the beginning and end of the sequence. The single reposition action can accomplish the reordering, deletion, and keeping editing functions. This process can be seen as analogous to identifying a reaction center, which involves reordering and deleting atoms or groups to obtain synthons that are then ready for insertion (synthon completion).
The placeholder decoder receives the output $\mathbf{s}'$ of the reposition decoder as its input and predicts the number of placeholders to insert between adjacent tokens. Specifically, the placeholder policy generates a categorical distribution over the number of placeholders $p \in [0, K_{\max}]$ to insert between each pair of consecutive tokens $(\mathbf{s}'_i, \mathbf{s}'_{i+1})$:

$$\mathbf{h}^{(\mathrm{plh})} = \mathrm{Decoder}_{\mathrm{plh}}\left(\mathbf{e}_{\mathbf{s}'} + \mathbf{l}_{\mathbf{s}'};\ \mathbf{h}^{(\mathrm{enc})}\right), \quad \pi_{\mathrm{plh}}(p \mid i, \mathbf{s}') = \mathrm{softmax}\left(\left[\mathbf{h}_i^{(\mathrm{plh})};\ \mathbf{h}_{i+1}^{(\mathrm{plh})}\right] \mathbf{W}^{(\mathrm{plh})}\right), \quad \mathbf{s}'' = \text{insert\_placeholders}(\mathbf{s}', \mathbf{p})$$

where $\mathbf{e}_{\mathbf{s}'}$ and $\mathbf{l}_{\mathbf{s}'}$ denote the token embeddings and positional encodings of the input sequence $\mathbf{s}'$, respectively; $\mathbf{W}^{(\mathrm{plh})} \in \mathbb{R}^{(2 d_{\mathrm{model}}) \times (K_{\max}+1)}$; $\mathbf{p}$ denotes the number of placeholders predicted at each position; and the function insert_placeholders inserts the placeholders to form the output sequence $\mathbf{s}''$.
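A matching sketch of the placeholder classifier follows; shapes and names are again illustrative assumptions ($K_{\max} = 255$ here mirrors the 256-way output space noted in the training section below).

```python
import torch

def placeholder_counts(h, W_plh):
    """Classify the concatenation of each pair of adjacent hidden states
    into a placeholder count.

    h:     (m, d) decoder hidden states for s'
    W_plh: (2 * d, K_max + 1) projection matrix
    Returns (m - 1,) predicted counts p_i in [0, K_max].
    """
    pairs = torch.cat([h[:-1], h[1:]], dim=-1)  # (m - 1, 2d) adjacent pairs
    return (pairs @ W_plh).argmax(dim=-1)

def insert_placeholders(tokens, counts, plh="<plh>"):
    """Insert counts[i] placeholders between tokens[i] and tokens[i + 1]."""
    out = [tokens[0]]
    for tok, p in zip(tokens[1:], counts.tolist()):
        out.extend([plh] * p)
        out.append(tok)
    return out

tokens = ["<s>", "C", "O", "</s>"]
h, W = torch.randn(4, 8), torch.randn(16, 256)  # d = 8, K_max = 255
print(insert_placeholders(tokens, placeholder_counts(h, W)))
```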
After generating its output $\mathbf{s}''$, the placeholder decoder passes it to the token decoder, which determines the appropriate token for each placeholder using the token policy:

$$\mathbf{h}^{(\mathrm{tok})} = \mathrm{Decoder}_{\mathrm{tok}}\left(\mathbf{e}_{\mathbf{s}''} + \mathbf{l}_{\mathbf{s}''};\ \mathbf{h}^{(\mathrm{enc})}\right), \quad \pi_{\mathrm{tok}}(t \mid i, \mathbf{s}'') = \mathrm{softmax}\left(\mathbf{h}_i^{(\mathrm{tok})} \mathbf{W}^{(\mathrm{tok})}\right)$$

where $\mathbf{e}_{\mathbf{s}''}$ and $\mathbf{l}_{\mathbf{s}''}$ denote the token embeddings and positional encodings of the input sequence $\mathbf{s}''$, respectively; $\mathbf{W}^{(\mathrm{tok})} \in \mathbb{R}^{d_{\mathrm{model}} \times |\mathbf{V}|}$; and the function insert_tokens places the predicted token at each placeholder.
This stage can be regarded as synthon completion: it takes the synthons produced by the reposition stage and predicts the atoms or groups required to complete them.
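A minimal sketch of this token-filling step, with an illustrative toy vocabulary:

```python
import torch

def fill_placeholders(tokens, h, W_tok, vocab, plh="<plh>"):
    """Replace each placeholder with its most likely vocabulary token by
    projecting the hidden state through the (d x |V|) output layer."""
    best = (h @ W_tok).argmax(dim=-1).tolist()  # (len(tokens),) token ids
    return [vocab[best[i]] if t == plh else t for i, t in enumerate(tokens)]

vocab = ["C", "O", "N", "(", ")", "=", "1", "c"]      # toy vocabulary
tokens = ["<s>", "C", "<plh>", "<plh>", "O", "</s>"]  # from the placeholder decoder
h, W = torch.randn(len(tokens), 8), torch.randn(8, len(vocab))
print(fill_placeholders(tokens, h, W, vocab))
```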
While the model generates outputs efficiently, training it is challenging because the parallel decoding within each pass makes it difficult to capture dependencies among target tokens58. To address this, we employ a two-stage strategy with sequence-level self-distillation59 to enhance model training.
In the first stage, we train only the encoder and the token decoder. The token decoder has a much larger output space (3136 tokens, versus 256 for the placeholder decoder), which makes it harder to train well. Specifically, we apply a random mask policy to the reactants and train the token decoder to recover the masked tokens, using the same objective as in the second stage. This provides an initialization for the token decoder and balances its training against the other two decoders.
In the second stage, we begin by training the entire model with all three decoders for a specified number of updates (e.g. 300K updates on USPTO-50K). We then use the trained model to generate multiple candidate reactants for every product in the training set and select high-quality samples by two criteria. First, we check whether the generated reactants can recover the product using a golden forward prediction model; in our experiments, we trained EditRetro itself as this forward model on the corresponding USPTO-50K and USPTO-FULL training sets, with reactants as input and products as output. Second, we require high similarity (e.g. 0.95 in terms of the SMILES sequence) between the generated reactants and the recorded reactants of the corresponding product. The selected generated reactants are then mixed with the raw reactants to further train the model. This self-distillation mitigates the multi-modality issue, where a single product can be synthesized from multiple reactant sets, a phenomenon also observed in non-autoregressive models in natural language processing (NLP)58. The self-distillation process can be repeated multiple times, for example, three times on the USPTO-50K dataset.
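The filtering criteria can be sketched as follows. This is a minimal illustration under our own assumptions: forward_model is a stand-in callable for the golden forward prediction model, and difflib's SequenceMatcher stands in for the unspecified SMILES sequence similarity.

```python
from difflib import SequenceMatcher
from rdkit import Chem

def canonical(smi):
    """Canonicalize a SMILES string; return None if it does not parse."""
    mol = Chem.MolFromSmiles(smi)
    return Chem.MolToSmiles(mol) if mol else None

def keep_candidate(cand, product, gt_reactants, forward_model, threshold=0.95):
    """Keep a generated reactant set if the forward model round-trips it back
    to the product, or if it closely matches the ground-truth reactants."""
    roundtrip = canonical(forward_model(cand))
    if roundtrip is not None and roundtrip == canonical(product):
        return True
    return SequenceMatcher(None, cand, gt_reactants).ratio() >= threshold
```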
All training is performed on the corresponding training dataset to mitigate potential data leakage (e.g. the USPTO-50K model is trained on the USPTO-50K training set in both stages).
The core of the training strategy is the dual-policy imitation learning algorithm41 illustrated in Supplementary Fig. 6, which lets the model imitate behaviors drawn from an oracle policy without the need for additional labeling. The oracle policy $\pi^*$ finds the optimal actions that transform the input sequence $\mathbf{s}$ into its desired output $\mathbf{s}^*$. Based on this, the training objective maximizes the following expectation:

$$\mathbb{E}_{\mathbf{s}_{\mathrm{rps}} \sim d_{\tilde{\pi}_{\mathrm{rps}}}}\left[\sum_i \log \pi_{\mathrm{rps}}\left(r_i^* \mid i, \mathbf{s}_{\mathrm{rps}}\right)\right] + \mathbb{E}_{\mathbf{s}_{\mathrm{ins}} \sim d_{\tilde{\pi}_{\mathrm{ins}}}}\left[\sum_i \log \pi_{\mathrm{plh}}\left(p_i^* \mid i, \mathbf{s}_{\mathrm{ins}}\right) + \sum_i \log \pi_{\mathrm{tok}}\left(t_i^* \mid i, \mathbf{s}'_{\mathrm{ins}}\right)\right]$$

where $\tilde{\pi}_{\mathrm{rps}}$ and $\tilde{\pi}_{\mathrm{ins}}$ are the roll-in policies that induce the state distributions $d_{\tilde{\pi}_{\mathrm{rps}}}$ and $d_{\tilde{\pi}_{\mathrm{ins}}}$ for training the reposition policy and the insertion policies (placeholder policy and token policy), respectively; $r_i^*$, $p_i^*$, and $t_i^*$ are oracle actions; and $\mathbf{s}'_{\mathrm{ins}}$ is the output of inserting placeholders upon $\mathbf{s}_{\mathrm{ins}}$.
The roll-in policies play a critical role in model training. To encourage multiple refinement iterations, we design $d_{\tilde{\pi}_{\mathrm{rps}}}$ as a stochastic mixture between the initial input $\mathbf{s}_0$ and the prediction output of insertion:

$$\mathbf{s}_{\mathrm{rps}} = \begin{cases} \mathbf{s}_0 & \text{if } u < \alpha \\ \mathcal{E}\left(\mathcal{E}(\mathbf{s}', \mathbf{p}^*), \tilde{\mathbf{t}}\right) & \text{otherwise} \end{cases}$$

where $\mathbf{p}^* \sim \pi^*$, $\tilde{\mathbf{t}} \sim \pi_{\mathrm{tok}}^{\theta}$, $\mathbf{s}'$ is any sequence to be inserted, $u \sim \mathrm{Uniform}[0,1]$, and $\alpha \in [0,1]$. Similarly, we use a mixture of the oracle reposition output and a randomly word-dropped sequence derived from the ground truth as $d_{\tilde{\pi}_{\mathrm{ins}}}$:

$$\mathbf{s}_{\mathrm{ins}} = \begin{cases} \mathcal{E}(\mathbf{s}, \mathbf{r}^*) & \text{if } v < \beta \\ \mathcal{E}(\mathbf{s}^*, \tilde{\mathbf{d}}) & \text{otherwise} \end{cases}$$

where $\mathbf{r}^* \sim \pi^*$, $\tilde{\mathbf{d}} \sim \pi^{\mathrm{rnd}}$, $v \sim \mathrm{Uniform}[0,1]$, and $\beta \in [0,1]$. Here $\mathcal{E}$ applies the given edit operations to a sequence.
Put simply, the training input for the reposition decoder consists of two parts: the initially given product string and the predicted sequence from the token decoder. The training input for the placeholder decoder comprises the output sequence of the reposition decoder and sequences from the random shuffle-deletion roll-in, which randomly deletes certain tokens of the target sequence and shuffles others. The training input for the token decoder primarily comes from the output sequence generated by the placeholder decoder; a minimal sketch of these roll-in mixtures is given below. The detailed learning algorithm is shown in Supplementary Learning algorithm for EditRetro.
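The two mixtures can be sketched as follows; the mixture weights alpha and beta and the drop probability are illustrative placeholders, not values reported for EditRetro.

```python
import random

def rollin_reposition(s0, insertion_output, alpha=0.5):
    """Roll-in for the reposition decoder: with probability alpha start from
    the initial product string, otherwise from the model's insertion output."""
    return s0 if random.random() < alpha else insertion_output

def rollin_insertion(oracle_reposition_output, target, beta=0.5, p_drop=0.3):
    """Roll-in for the insertion decoders: the oracle reposition output, or a
    randomly word-dropped and shuffled version of the ground-truth sequence."""
    if random.random() < beta:
        return oracle_reposition_output
    kept = [tok for tok in target if random.random() > p_drop]  # random deletion
    random.shuffle(kept)                                        # random shuffle
    return kept
```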
Refinement Action: The three editing decoders are complementary and are combined in an alternating fashion to form a refinement action during inference. More formally, we define a refinement action as an ordered sequence of reposition (r), placeholder insertion (p), and token prediction (t) operations:

$$\mathbf{a} = \left(r_1, \ldots, r_n;\ p_1, \ldots, p_{m-1};\ t_1, \ldots, t_l\right)$$

where $m = \sum_{i}^{n} \mathbb{I}(r_i > 0)$ and $l = \sum_{i}^{m-1} p_i$. The policy for one iteration is then

$$\pi(\mathbf{a} \mid \mathbf{s}) = \prod_{i=1}^{n} \pi_{\mathrm{rps}}\left(r_i \mid i, \mathbf{s}\right) \cdot \prod_{i=1}^{m-1} \pi_{\mathrm{plh}}\left(p_i \mid i, \mathbf{s}'\right) \cdot \prod_{i=1}^{l} \pi_{\mathrm{tok}}\left(t_i \mid i, \mathbf{s}''\right)$$

with intermediate outputs $\mathbf{s}' = \mathcal{E}(\mathbf{s}, \mathbf{r})$ and $\mathbf{s}'' = \mathcal{E}(\mathbf{s}', \mathbf{p})$. During inference, the initial sequence (the product SMILES) is iteratively refined by applying the refinement action until two consecutive decoding iterations produce identical outputs41 or a predetermined maximum number of iterations is reached60.
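The outer inference loop then reduces to a few lines; here apply_refinement_action is a stand-in for one reposition-placeholder-token pass of the model.

```python
def iterative_refine(product_tokens, apply_refinement_action, max_iters=10):
    """Iteratively edit the product string until two consecutive iterations
    agree or the maximum number of iterations is reached."""
    seq = product_tokens
    for _ in range(max_iters):
        new_seq = apply_refinement_action(seq)  # reposition -> placeholder -> token
        if new_seq == seq:  # converged: the edit left the sequence unchanged
            break
        seq = new_seq
    return seq
```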
Because our model is partially autoregressive, the beam search algorithm commonly used to generate top-k predictions is not applicable. We therefore designed an inference module tailored to our model, as shown in Fig. 1b. Since the prediction at each position of an input sequence is made independently in each decoder, identifying multiple promising output sequences is challenging. We developed reposition sampling to generate multiple predictions, as shown in Fig. 1c. In the reposition decoder, this strategy first selects the best prediction at each position to form an initial output sequence; it then replaces a chosen portion of positions with sampled predictions to create additional sequences. We denote the number of outputs of the reposition decoder as kr. The subsequent placeholder decoder selects the best prediction at each position for each input sequence, keeping kr output sequences. The token decoder then retains the best kt predictions at each placeholder, yielding a total of k = kr × kt predictions per input molecule. When multiple refinement iterations are required, reposition sampling is applied only in the first iteration; subsequent iterations select the best prediction at each position in all decoders. Finally, we rank the predictions by their probabilities in the token decoder, which we refer to as local ranking.
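A sketch of reposition sampling under our own assumptions (the fraction of re-sampled positions, frac, is an unstated hyper-parameter in this illustration):

```python
import torch

def reposition_sampling(logits, k_r, frac=0.3):
    """Build k_r candidate reposition sequences: the greedy argmax sequence,
    plus variants in which a fraction of positions is re-sampled from the
    predicted categorical distributions."""
    probs = torch.softmax(logits, dim=-1)  # (n, n + 1) per-position distributions
    greedy = probs.argmax(dim=-1)
    outputs = [greedy]
    n = probs.size(0)
    for _ in range(k_r - 1):
        variant = greedy.clone()
        idx = torch.randperm(n)[: max(1, int(frac * n))]  # positions to re-sample
        variant[idx] = torch.multinomial(probs[idx], 1).squeeze(-1)
        outputs.append(variant)
    return outputs
```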
Inspired by Augmented Transformer22 and R-SMILES26, during inference on the validation and test data we feed multiple SMILES representations of a molecule individually and obtain multiple sets of outputs. Tetko et al.22 demonstrated that the frequency of a predicted SMILES can serve as a confidence metric for retrosynthesis prediction. After converting all outputs to canonical SMILES, we therefore uniformly score them following refs. 22,26:
where pred is a generated result and gt is the ground-truth SMILES; δ outputs 1 if the generated SMILES is identical to the target and 0 otherwise; α is the weighting hyper-parameter; aug is the number of augmented SMILES; and topk is the number of predictions generated per input SMILES. After applying uniform scoring, we select the outputs with the top-k scores as the final result. Supplementary Fig. 1 illustrates the workflow of the inference module with an example. Note that this approach is specific to sequence-based methods, and we adopt the experimental settings of the strong baseline R-SMILES for a fair comparison.
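A sketch of this aggregation step; since the exact weighting formula is not reproduced here, the rank-based weight 1/rank**alpha below is an assumption standing in for the paper's α-weighted scoring.

```python
from collections import defaultdict
from rdkit import Chem

def score_predictions(pred_lists, alpha=1.0):
    """Aggregate rank-ordered predictions from several augmented inputs:
    canonicalize each candidate and accumulate a rank-weighted frequency.

    pred_lists: one list of `topk` predicted reactant SMILES per augmentation.
    Returns candidates sorted by descending score.
    """
    scores = defaultdict(float)
    for preds in pred_lists:  # one rank-ordered list per augmented input
        for rank, smi in enumerate(preds, start=1):
            mol = Chem.MolFromSmiles(smi)
            if mol is None:  # skip invalid SMILES
                continue
            scores[Chem.MolToSmiles(mol)] += 1.0 / rank ** alpha  # assumed weight
    return sorted(scores, key=scores.get, reverse=True)
```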
The EditRetro model adopts the base Transformer encoder-decoder architecture57. The reposition classifier is implemented as the dot product between the $i$-th hidden state $\mathbf{h}_i$ and each input token embedding $\mathbf{e}_j$ or the deletion vector $\mathbf{b}$, followed by a softmax function. The placeholder and token classifiers are each implemented as a linear layer followed by a softmax. By default, we set $d_{\mathrm{model}} = 512$, $d_{\mathrm{hidden}} = 2048$, $n_{\mathrm{heads}} = 8$, $n_{\mathrm{layers}} = 6$, and $p_{\mathrm{dropout}} = 0.3$. We also apply dropout of 0.3 to the token embeddings and label smoothing with $p_{\mathrm{ls}} = 0.1$. We use Adam61 with an initial learning rate of 0.0002 and a total batch size of ~32,000 tokens per step. The best checkpoint is selected based on validation perplexity. All models are implemented with the Fairseq62 toolkit.
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
The data and predictions that support the results of this study are available at the EditRetro GitHub repo: https://github.com/yuqianghan/editretro. Source data are provided with this paper.
The source code of this work and associated trained models are available at the EditRetro GitHub repo: https://github.com/yuqianghan/editretro63.
Corey, E. J., Long, A. K. & Rubenstein, S. D. Computer-assisted analysis in organic synthesis. Science 228, 408–418 (1985).
Corey, E. J. & Cheng, X.-M. The Logic of Chemical Synthesis 1st edn, 464 (John Wiley & Sons, New York, 1995).
Heifets, A. & Jurisica, I. Construction of new medicines via game proof search. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence. 1564–1570 (AAAI Press, 2012).
Segler, M. H., Preuss, M. & Waller, M. P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555, 604–610 (2018).
Kishimoto, A., Buesser, B., Chen, B. & Botea, A. Depth-first proof-number search with heuristic edge cost and application to chemical synthesis planning. Adv. Neural Inf. Processing Syst. 32, 7224–7234 (2019).
Schreck, J. S., Coley, C. W. & Bishop, K. J. Learning retrosynthetic planning through simulated experience. ACS Central Sci. 5, 970–981 (2019).
Lin, K., Xu, Y., Pei, J. & Lai, L. Automatic retrosynthetic route planning using template-free models. Chem. Sci. 11, 3355–3364 (2020).
Chen, B., Li, C., Dai, H. & Song, L. Retro*: learning retrosynthetic planning with neural guided A* search. In Proc. 37th International Conference on Machine Learning. 1608–1616 (PMLR, 2020).
Kim, J., Ahn, S., Lee, H. & Shin, J. Self-improved retrosynthetic planning. In Proc. 38th International Conference on Machine Learning. 5486–5495 (PMLR, 2021).
Ishida, S., Terayama, K., Kojima, R., Takasu, K. & Okuno, Y. AI-driven synthetic route design incorporated with retrosynthesis knowledge. J. Chem. Inf. Model. 62, 1357–1367 (2022).
Schwaller, P. Predicting retrosynthetic pathways using transformer-based models and a hyper-graph exploration strategy. Chem. Sci. 11, 3316–3325 (2020).
Xie, S. et al. Retrograph: Retrosynthetic planning with graph search. In The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2120–2129 (ACM, 2022).
Somnath, V. R., Bunne, C., Coley, C., Krause, A. & Barzilay, R. Learning graph models for retrosynthesis prediction. In Advances in Neural Information Processing Systems 34, 9405–9415 (2021).
Wan, Y., Hsieh, C.-Y., Liao, B. & Zhang, S. Retroformer: Pushing the limits of end-to-end retrosynthesis transformer. In International Conference on Machine Learning, 22475–22490 (PMLR, 2022).
Coley, C. W., Rogers, L., Green, W. H. & Jensen, K. F. Computer-assisted retrosynthesis based on molecular similarity. ACS Central Sci. 3, 1237–1245 (2017).
Segler, M. H. & Waller, M. P. Neural-symbolic machine learning for retrosynthesis and reaction prediction. Chem. Eur. J. 23, 5966–5971 (2017).
Dai, H., Li, C., Coley, C., Dai, B. & Song, L. Retrosynthesis prediction with conditional graph logic network. In Advances in Neural Information Processing Systems 32 (2019). https://proceedings.neurips.cc/paper_files/paper/2019/file/0d2b2061826a5df3221116a5085a6052-Paper.pdf
Chen, S. & Jung, Y. Deep retrosynthetic reaction prediction using local reactivity and global attention. JACS Au 1, 1612–1620 (2021).
Dong, J., Zhao, M., Liu, Y., Su, Y. & Zeng, X. Deep learning in retrosynthesis planning: datasets, models and tools. Brief. Bioinform. 23, bbab391 (2022).
Liu, B. Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS Central Sci. 3, 1103–1113 (2017).
Zheng, S., Rao, J., Zhang, Z., Xu, J. & Yang, Y. Predicting retrosynthetic reactions using self-corrected transformer neural networks. J. Chem. Inform. Model. 60, 47–55 (2019).
Tetko, I. V., Karpov, P., Van Deursen, R. & Godin, G. State-of-the-art augmented NLP transformer models for direct and single-step retrosynthesis. Nat. Commun. 11, 1–11 (2020).
Kim, E., Lee, D., Kwon, Y., Park, M. S. & Choi, Y.-S. Valid, plausible, and diverse retrosynthesis using tied two-way transformers with latent variables. J. Chem. Inform. Model. 61, 123–133 (2021).
Seo, S.-W. et al. GTA: Graph truncated attention for retrosynthesis. In Thirty-Fifth AAAI Conference on Artificial Intelligence. 531–539 (AAAI Press, 2021).
Jiang, Y. et al. Learning chemical rules of retrosynthesis with pre-training. In Thirty-Seventh AAAI Conference on Artificial Intelligence. 5113–5121 (AAAI Press, 2023).
Zhong, Z. Root-aligned SMILES: a tight representation for chemical reaction prediction. Chem. Sci. 13, 9023–9034 (2022).
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Karpov, P., Godin, G. & Tetko, I. V. A transformer model for retrosynthesis. In International Conference on Artificial Neural Networks (eds. Tetko, I., Kůrková, V., Karpov, P., Theis, F.) 817–830 (Springer, 2019).
Tu, Z. & Coley, C. W. Permutation invariant graph-to-sequence model for template-free retrosynthesis and reaction prediction. J. Chem. Inform. Model. 62, 3503–3513 (2022).
Sacha, M. Molecule edit graph attention network: modeling chemical reactions as sequences of graph edits. J. Chem. Inform. Model. 61, 3273–3284 (2021).
Liu, J. et al. MARS: A motif-based autoregressive model for retrosynthesis prediction. Bioinformatics 40, btae115 (2024).
Zhong, W., Yang, Z. & Chen, C. Y.-C. Retrosynthesis prediction using an end-to-end graph generative architecture for molecular graph editing. Nat. Commun. 14, 3009 (2023).
Fang, L., Li, J., Zhao, M., Tan, L. & Lou, J.-G. Single-step retrosynthesis prediction by leveraging commonly preserved substructures. Nat. Commun. 14, 2446 (2023).
Shi, C., Xu, M., Guo, H., Zhang, M. & Tang, J. A graph to graphs framework for retrosynthesis prediction. In Proc. 37th International Conference on Machine Learning. 8818–8827 (PMLR, 2020).
Yan, C. et al. Retroxpert: Decompose retrosynthesis prediction like a chemist. Adv. Neural Inform. Processing Syst. 33, 11248–11258 (2020).
Wang, X. Retroprime: A diverse, plausible and transformer-based method for single-step retrosynthesis predictions. Chem. Eng. J. 420, 129845 (2021).
Chen, Z., Ayinde, O. R., Fuchs, J. R., Sun, H. & Ning, X. G2Retro as a two-step graph generative models for retrosynthesis prediction. Commun. Chem. 6, 102 (2023).
Yang, Q. Molecular transformer unifies reaction prediction and retrosynthesis across pharma chemical space. Chem. Commun. 55, 12152–12155 (2019).
Levenshtein, V. I. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Phys. Doklady 10, 707–710 (1966).
Gu, J., Wang, C. & Zhao, J. Levenshtein transformer. Adv. Neural Inform. Processing Syst. 32, 11179–11189 (2019).
Xu, W. & Carpuat, M. Editor: An edit-based transformer with repositioning for neural machine translation with soft lexical constraints. Trans. Assoc. Comput. Linguis. 9, 311–328 (2021).
Schneider, N., Stiefl, N. & Landrum, G. A. What’s what: The (nearly) definitive guide to reaction role assignment. J. Chem. Inform. Model. 56, 2336–2346 (2016).
Landrum, G. RDKit: A software suite for cheminformatics, computational chemistry, and predictive modeling (2013).
Li, X. & Fourches, D. SMILES pair encoding: a data-driven substructure tokenization algorithm for deep learning. J. Chem. Inf. Model. 61, 1560–1569 (2021).
Schwaller, P. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Central Sci. 5, 1572–1583 (2019).
Pauls, H. & Berman, J. M. Preparation of 3-heterocyclylacrylamide derivatives as FabI protein inhibitors for treating bacterial infection. World Intellectual Property Organization A2 (2007).
Schwaller, P., Vaucher, A. C., Laino, T. & Reymond, J.-L. Prediction of chemical reaction yields using deep learning. Mach. Learn. Sci. Technol. 2, 015016 (2021).
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inform. Model. 50, 742–754 (2010).
Chemical Abstracts Service. CAS SciFindern. https://scifinder-n.cas.org (2023).
Cao, Q.-M., Ma, X.-L., Xiong, J.-M., Guo, P. & Chao, J.-P. The preparation of febuxostat by Suzuki reaction. Chin. J. New Drugs 25, 1057–1060 (2016).
Finlay, M. R. V. et al. Discovery of a potent and selective EGFR inhibitor (AZD9291) of both sensitizing and T790M resistance mutations that spares the wild type form of the receptor. J. Med. Chem. 57, 8249–8267 (2014).
Zhavoronkov, A. et al. Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nat. Biotechnol. 37, 1038–1040 (2019).
Cheng, A. H. et al. Group SELFIES: a robust fragment-based molecular string representation. Digital Discov. 2, 897 (2023).
Sumner, D., He, J., Thakkar, A., Engkvist, O. & Bjerrum, E. J. Levenshtein augmentation improves performance of smiles based deep-learning synthesis prediction. Chemrxiv https://doi.org/10.26434/chemrxiv.12562121.v2 (2020).
Toniato, A., Vaucher, A. C., Schwaller, P. & Laino, T. Enhancing diversity in language based models for single-step retrosynthesis. Digital Discov. 2, 489–501 (2023).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inform. Processing Syst. 30, 5998–6008 (2017).
Xiao, Y. et al. A survey on non-autoregressive generation for neural machine translation and beyond. IEEE Trans. Pattern Anal. Mach. Intell. 45, 11407–11427 (2023).
Liao, Y., Jiang, S., Li, Y., Wang, Y. & Wang, Y. Self-improvement of non-autoregressive model via sequence-level distillation. In Proc. 2023 Conference on Empirical Methods in Natural Language Processing. 14202–14212 (2023).
Ghazvininejad, M., Levy, O., Liu, Y. & Zettlemoyer, L. Mask-predict: Parallel decoding of conditional masked language models. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. 6111–6120 (Association for Computational Linguistics, 2019).
Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations. 13 (ICLR, 2015).
Ott, M. et al. fairseq: A fast, extensible toolkit for sequence modeling. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 48–53 (Association for Computational Linguistics, 2019).
Han, Y. et al. Retrosynthesis prediction with an iterative string editing model. Zenodo https://doi.org/10.5281/zenodo.11483329 (2024).
This work is funded by NSFC U23B2055 (H.C.), and supported by the Fundamental Research Funds for the Central Universities (226-2023-00138, H.C.), the National Natural Science Foundation of China (62302433, 62301480, U23A20496, Q.Z.; 82202984, H.X.), and the New Generation AI Development Plan for 2030 of China (2023ZD0120802, Q.Z.).
College of Computer Science and Technology, Zhejiang University, Hangzhou, 310027, China
Yuqiang Han, Keyan Ding, Renjun Xu, Qiang Zhang & Huajun Chen
ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
Yuqiang Han, Keyan Ding, Renjun Xu, Qiang Zhang & Huajun Chen
Polytechnic Institute, Zhejiang University, Hangzhou, 310015, China
Xiaoyang Xu
Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310018, China
Chang-Yu Hsieh, Hongxia Xu & Tingjun Hou
Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 311121, China
Hongxia Xu
Zhejiang-University-Ant-Group Joint Center for Knowledge Graphs, Hangzhou, 310000, China
Huajun Chen
Hangzhou Institute of Medicine Chinese Academy of Science, Hangzhou, 310023, China
Huajun Chen
Y.H. conceived the study and designed the methods. Y.H. and X.X. implemented and conducted the experiments. Y.H., X.X., C.Y.H., and H.X. performed the analyses. T.H. provided suggestions on the experimental analysis. Y.H., C.Y.H., K.D., and Q.Z. wrote and revised the manuscript. R.X. provided suggestions on the model design and writing. T.H., Q.Z., and H.C. supervised the project. All authors have read and approved the manuscript.
Correspondence to Tingjun Hou, Qiang Zhang or Huajun Chen.
The authors declare no competing interests.
Nature Communications thanks Calvin Chen, Lei Fang and Thijs Stuyver for their contribution to the peer review of this work. A peer review file is available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Han, Y., Xu, X., Hsieh, CY. et al. Retrosynthesis prediction with an iterative string editing model. Nat Commun 15, 6404 (2024). https://doi.org/10.1038/s41467-024-50617-1
Received: 16 August 2023
Accepted: 09 July 2024
Published: 30 July 2024
DOI: https://doi.org/10.1038/s41467-024-50617-1