Synlett 2023; 34(09): 1012-1018
DOI: 10.1055/a-1937-9113
Cluster
Machine Learning and Artificial Intelligence in Chemical Synthesis

A Novel Application of a Generation Model in Foreseeing ‘Future’ Reactions

Lujing Cao,a Yejian Wu,a Yixin Zhuang,a Linan Xiong,a Zhajun Zhan,a Liefeng Ma,a Hongliang Duana,b

a College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, 310014, P. R. of China
b State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica (SIMM), Chinese Academy of Sciences, Shanghai, 201203, P. R. of China

This project was supported by the National Natural Science Foundation of China (No. 81903438) and the Natural Science Foundation of Zhejiang Province (No. LD22H300004).
 


Abstract

Deep learning is widely used in chemistry and can rival human chemists in certain scenarios. Inspired by molecule generation in drug discovery, we present a deep-learning-based approach to reaction generation with the Trans-VAE model. To examine how exploratory and innovative the model is in reaction generation, we constructed the dataset by time splitting. We used the Michael addition reaction as a generation vehicle, taking reactions reported before a certain date as the training set and exploring whether the model could generate reactions that were reported after that date. We took 2010 and 2015 as time points for splitting the reported Michael addition reactions; among the generated reactions, 911 and 487, respectively, were applied in experiments after the split points, accounting for 12.75% and 16.29% of all reactions reported after each time point. The generated results were in line with expectations, and a large number of new, chemically feasible Michael addition reactions were generated, further demonstrating the ability of the Trans-VAE model to learn reaction rules. Our research provides a reference for the future discovery of novel reactions by using deep learning.



Organic synthesis is one of the most challenging processes in drug discovery, and the exploration of new organic reactions has always been a major stumbling block in the development of synthetic organic chemistry.[1,2] New reactions enrich synthetic routes in the fields of chemistry and materials. Conventionally, most new reactions have been discovered by scientists applying chemical intuition, a complex task that requires a certain degree of luck. For instance, the products of Diels–Alder reactions were known to chemists as early as 1906, but it was not until 1950 that the reaction was applied in total-synthesis experiments.[3,4] The long and intricate process of discovering new reactions hinders progress in drug discovery.[5,6]

When artificial intelligence (AI) was first applied to the field of chemistry, Maryasin and co-workers[7] discussed whether it might one day replace chemists and examined its generality. Over the past few years, AI technology has provided a number of important applications in various aspects of chemistry and has had some disruptive effects.[8–16] Reaching or even surpassing human-level capability by combining chemical reactions with AI remains a new challenge with a broad range of feasible applications. The exploration of the application of AI to chemical reactions has primarily involved reaction prediction,[17,18] retrosynthetic analysis,[19,20] optimization of reaction conditions,[21] and reaction classification.[22]

In principle, reaction prediction can be realized by extracting the rules of various chemical reactions and then directly deriving the relationship between products and reactants. The current mainstream methods usually treat reaction prediction as a similarity transformation of molecular graphs or as text translation, with graph-convolutional neural networks and sequence-to-sequence models as the corresponding architectures.[23,24] The performance of text-based reaction prediction has been significantly improved by the release of the Transformer model (Google Research, Brain Team), which is based entirely on an attention mechanism. The Molecular Transformer proposed by Schwaller et al.,[25] in which all molecules involved in a reaction are represented in Simplified Molecular Input Line Entry System (SMILES) notation, is a state-of-the-art SMILES-based sequence-to-sequence model that reaches 90.4% top-1 accuracy on the USPTO_MIT data set with separated reagents. In addition to innovations in model structure, many strategies can help AI comprehend chemical reactions better, including data augmentation[26] and transfer learning,[27] which have proved effective in low-data regimes. However, the discovery of new reactions by automatically extracting rules from known chemical reactions remains an arduous process.
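Sequence-to-sequence models of this kind operate on tokenized SMILES strings. A minimal sketch of the regex-based tokenization commonly used for this purpose (the token pattern follows the one popularized in the Molecular Transformer work; the helper name and example reaction are ours):

```python
import re

# Regex that splits a SMILES string into chemically meaningful tokens:
# bracket atoms such as [NH3+], two-letter elements (Cl, Br), ring-closure
# digits, bond symbols, and the '>' separators of reaction SMILES.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES (or reaction SMILES) string into tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # A correct tokenization is lossless: the tokens rejoin to the input.
    assert "".join(tokens) == smiles, f"untokenizable characters in {smiles!r}"
    return tokens

# Example: a hypothetical Michael addition written as reaction SMILES
rxn = "CC(=O)CC(=O)C.C=CC#N>>CC(=O)C(CCC#N)C(=O)C"
print(tokenize_smiles(rxn))
```

The lossless-rejoin check is a cheap safeguard: any character the pattern cannot match is detected immediately rather than silently dropped.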

Inspired by molecular generation, in which undiscovered active or target molecules are generated by extracting characteristics from a set of molecules known to have specific biological activities, recent studies have turned their attention to generation models and have proposed de novo reaction generation. In molecular generation, many active molecules aimed at specific targets have been generated successfully, and some have already reached the clinical research stage, providing an excellent reference for finding new reactions through generation models. Reactions generated by a reliable generation model can not only guide future chemical research, but can also provide a wealth of reaction data to drive deep-learning models. However, a chemical reaction, which implies a chemical transformation from reactants to products, is a more intricate object for a computer than a single compound, whose representation contains only SMILES rules and structural information.

The first attempt at reaction generation was presented by Bort et al.,[28] who constructed bidirectional long short-term memory (LSTM) layers and trained their system on the USPTO database. All reactions were converted from the original SMILES notation into the corresponding condensed graph of reaction (SMILES/CGR) form. By visualizing latent variables through generative topographic mapping, the researchers located the position of the Suzuki reaction and found some reactions with particular structural motifs that were not present in the training data. In a subsequent study by Wang et al.,[29] the data set used for reaction generation was restricted to Heck reactions. Transformer-XL, a fully attention-based model better suited to long sequences, was applied in their study. An analysis of the results proved that the generated reactions conformed to the Heck reaction rules, and the model also had a favorable grasp of deeper chemical knowledge, such as site selectivity. They further selected some reactions for laboratory synthesis to verify the reliability of the generated reactions.

Is there a simple and efficient way to test the reliability of a generation model and the novelty of its generated reactions? We have devised a scheme in which the model is trained on the chemical reactions reported in journals before a certain time point, to test whether it can reproduce the corresponding reactions reported after that point. A schematic representation of this method is shown in Figure [1].
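The time-splitting scheme itself is straightforward; a minimal sketch under the assumption that each reaction is stored as a (year, reaction SMILES) record (the record layout is illustrative, not the paper's actual pipeline):

```python
def time_split(reactions, split_year):
    """Partition (year, reaction_smiles) records at a split year.

    Records first reported before the split year form the training pool;
    later records become the 'future' reference set that the model
    should ideally rediscover.
    """
    train = [rxn for year, rxn in reactions if year < split_year]
    future = [rxn for year, rxn in reactions if year >= split_year]
    return train, future

# Toy example with made-up records
records = [(2004, "A>>B"), (2009, "C>>D"), (2012, "E>>F"), (2018, "G>>H")]
train, future = time_split(records, 2010)
print(len(train), len(future))  # 2 2
```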

In this study, we used the classical Michael addition, a representative reaction for carbon-chain growth, carbon-ring formation, and heteroatom introduction in organic synthesis, as the reaction-generation vehicle. The Trans-VAE[30] model, in which both the encoder and the decoder are built with transformers, was applied to accommodate the long sequences involved in reaction generation. We fed the Michael addition reactions reported before a certain date into the model as training input, and some of the reactions generated by our model were indeed verified by chemists and reported in the literature after that date. This result proved the superiority of the model in certain aspects of reaction generation. More importantly, some of the generated reactions were novel Michael addition reactions that have never appeared in the literature and whose chemical feasibility would be valuable to confirm. The successful generation of Michael addition reactions, as supported by the literature, not only provides a simple and effective way to test a chemistry-level generation model, but also sets the stage for generating new types of future reactions in the next phase of our work.

Figure 1 Schematic diagram of the workflow for the generation of Michael addition reactions

The primary purpose of reaction generation is to generate reactions that can be used in future research. As a secondary purpose, it can expand the volume of data for reactions with only small data sets, easing a bottleneck in the application of deep-learning technology to chemistry. Obviously, generating new reactions that meet the demands of researchers is the harder task. Taking 2010 as the split point, we used a total of 3,218 reactions to train the model and then generated 32,979 new Michael addition reactions, 911 of which have been reported in the literature since 2010 and validated experimentally. Similarly, when we divided the data at 2015 and fed the resulting 6,962 training reactions into the model, we generated 81,377 new reactions, 487 of which have been reported in the literature since 2015. As listed in Table [1], the generated reactions reported after 2010 accounted for 12.75% of all reactions reported after that date, whereas with the split at 2015 the ratio was 16.29%. We also show the variation in this ratio over the course of reaction generation in the Supporting Information (SI; Figure S2); this indicates that the model-generated reactions are reliable and supports the future application of the remaining new reactions in chemical research.
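The reported hit rates follow directly from the overlap between the generated set and the post-split reported set; a sketch assuming canonical reaction SMILES strings are used as the comparison key (the counts below are those in Table 1):

```python
def hit_rate(generated, reported_after_split):
    """Fraction of post-split reported reactions that the model generated."""
    hits = set(generated) & set(reported_after_split)
    return len(hits), 100.0 * len(hits) / len(reported_after_split)

# Toy usage with placeholder reaction strings
n_hits, rate = hit_rate(["A>>B", "C>>D"], ["A>>B", "E>>F"])
print(n_hits, rate)  # 1 50.0

# Reproducing the percentages in Table 1 from the published counts
print(round(100.0 * 911 / 7144, 2))  # 12.75
print(round(100.0 * 487 / 2989, 2))  # 16.29
```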

We randomly selected some of the model-generated Michael addition reactions that were reported after 2015, as well as some completely new examples. Figures [2a–c] show model-generated reactions that have been applied in practical studies, whereas Figures [2d–f] are completely new reactions. These examples are consistent with the characteristic rules of the Michael addition reaction. Considering the availability of experimental raw materials, we performed an experimental verification of the entries in Figures [2e] and [2f]; specific experimental details are given in the SI. On the basis of the reactions generated from the dataset with 2015 as the split date, we evaluated the quality of model generation in terms of the distribution and similarity of the generated reactions relative to the training reactions and their chemical properties.

Table 1 Distribution of Michael Addition Reactions with 2010 and 2015 as the Data Split Points

                                   2010     2015
  Generated reported reactions      911      487
  All reported reactions          7,144    2,989
  Rate (%)                        12.75    16.29

Figure 2 Examples of model-generated reactions. (a)–(c): Michael addition reactions reported after 2015.[31] [32] [33] (d)–(f): new Michael addition reactions

Because a complete reaction comprises reactants, products, and the chemical rule relating them, it is necessary to compare the component relationships between the training set and the generated set. As listed in Table [2], we counted the Michael acceptors, donors, and products in the training and generated sets. To visualize the distribution of the generated set relative to the training set, we used the t-distributed stochastic neighbor embedding (t-SNE)[34] method to project the molecular Morgan fingerprints[35] into a low-dimensional space, and we further verified the validity of the generated molecules.

Table 2 Distribution of Michael Reactants and Products in the Training Set and the Generated Set

                       Training set   Generated set   New reactions
  Michael donors              1,436           6,336           5,466
  Michael acceptors           1,694           6,572           5,401
  Products                    6,308          66,543          64,476

Figures [3]A–C show t-SNE plots of the Morgan fingerprints of the Michael donors, acceptors, and products in the generated set together with those of the corresponding molecules in the training set. The training-set molecules overlap well with the corresponding generated sets, showing that the reactant and product molecules generated by the model vary around the training set with a certain novelty while still fitting its distribution.
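Such a joint embedding can be sketched as follows. For brevity, random bit vectors stand in for the 2048-bit Morgan fingerprints (in practice these would be computed with RDKit, e.g. via AllChem.GetMorganFingerprintAsBitVect); the variable names and sizes are illustrative:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-ins for 2048-bit Morgan fingerprints of 50 training-set and
# 50 generated-set molecules; random bits used purely for illustration.
train_fps = rng.integers(0, 2, size=(50, 2048))
gen_fps = rng.integers(0, 2, size=(50, 2048))

# Embed both sets jointly so their 2-D coordinates are directly comparable;
# the first 50 rows belong to the training set, the rest to the generated set.
all_fps = np.vstack([train_fps, gen_fps]).astype(float)
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(all_fps)
print(coords.shape)  # (100, 2)
```

Embedding the two sets together, rather than separately, is what makes the overlap in the resulting scatter plot meaningful.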

Figure 3 Morgan fingerprint distributions of reactants, products, and reactions for the training and generated sets. (A) The t-SNE distribution of Michael acceptors in the training set and the generated set. (B) The t-SNE distribution of Michael donors in the training set and the generated set. (C) The t-SNE distribution of Michael products in the training set and the generated set. (D) The UMAP distribution of rxnfp for the training set and the generated set.

On shifting our attention to the overall reaction level, combining the corresponding reactant and product molecules into a reaction means that the model must learn the Michael addition reaction rule. Although the Michael addition reaction is one of the most widely used catalytic C–C bond-forming tools in organic synthesis, its rule is complicated for the Trans-VAE model. To further demonstrate that the generated reactions are Michael addition reactions, we used the tmap package[36] to visualize their reaction fingerprints (rxnfp). The rxnfp is derived from the reaction representation learned by the bidirectional encoder representations from transformers (BERT) reaction-classifier model. As shown in Figure [4], tmap connects reactions in the generated set (10,000 reactions randomly selected) with those in the training dataset on the basis of their rxnfp similarity, with each reaction represented as a point in a tree diagram. In addition, USPTO-50K, which contains ten major classes of chemical reactions curated by Liu et al.,[37] was downloaded and used to form the backbone of the chemical space. Furthermore, we used uniform manifold approximation and projection (UMAP)[38] to reduce the dimensionality of the rxnfp and validate the distributions of the training set and the generated set (Figure [3]D). The model grasped and reproduced the reaction rules in the training set relatively satisfactorily.

The Michael addition reaction is used in the preparation of complex compounds and has important practical value. To explore in detail whether our model fully understands the Michael addition reaction, we performed an in-depth analysis of the generated reaction set. First, we divided the Michael addition reactions into intermolecular and intramolecular reactions. If a molecule contains both donor and acceptor functional groups, an intramolecular reaction can occur to form a carbocycle or heterocycle. As listed in Table [3], there were 6,707 intermolecular reactions and 255 intramolecular reactions in the training dataset. Intermolecular reactions accounted for 99.6% of the generated reactions, consistent with their predominance in the training set. Figure S3 in the SI shows several representative examples of intermolecular and intramolecular reactions from the training and generated datasets. Because the Michael addition reaction is reversible, the thermodynamically most stable product usually predominates, and five- and six-membered rings are usually more stable because of their lower ring strain. Our model accurately captured this feature: the generated intramolecular Michael addition reactions mainly formed the more stable five- or six-membered rings.
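A crude way to separate the two classes, assuming reagents and solvents have already been stripped from the reaction SMILES during preprocessing (a heuristic sketch, not the paper's exact procedure): a reaction with a single reactant molecule is an intramolecular candidate, because the donor and acceptor must then sit in the same molecule.

```python
def classify_reaction(reaction_smiles: str) -> str:
    """Label a 'reactants>>products' SMILES as inter- or intramolecular.

    Heuristic: dots on the reactant side separate molecules, so a lone
    reactant implies the donor and acceptor are in the same molecule.
    Assumes reagents/solvents were removed during preprocessing.
    """
    reactants = reaction_smiles.split(">>")[0]
    n_molecules = len(reactants.split("."))
    return "intramolecular" if n_molecules == 1 else "intermolecular"

# Two-component reaction: donor and acceptor are separate molecules
print(classify_reaction("CC(=O)CC(=O)C.C=CC#N>>CC(=O)C(CCC#N)C(=O)C"))  # intermolecular
```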

Figure 4 The tmap plot of rxnfp for the training set, the generated set, and USPTO-50K.

Table 3 Distribution of Michael Addition Reactions in the Training Set and the Generated Set

                       Amount                  Rate (%)
  Reaction type      Training   Generated   Training   Generated
  Intermolecular        6,707      81,115       96.3        99.6
  Intramolecular          255         262        3.6         0.4
  Total                 6,962      81,377      100.0       100.0

Besides alkene Michael acceptors, electron-deficient alkynes conjugated with electron-withdrawing groups can also be used as Michael acceptors, although they are less reactive than their alkene counterparts. Table [4] shows the distribution of Michael acceptor types, divided into alkene and alkyne acceptors. Several Michael addition reactions of alkynes selected from the training set and the generated set are shown in Figure S4 of the SI.

Table 4 Distribution of Michael Acceptor Types in the Training Set and the Generated Set

                     Number                  Rate (%)
  C–C bond type    Training   Generated   Training   Generated
  Alkene              6,801      80,837       97.6        99.3
  Alkyne                161         540        2.4         0.7
  Total               6,962      81,377      100.0       100.0

A wide range of donor compounds is available for the Michael addition reaction. As shown in Figure S5 of the SI, molecules with activated C–H bonds adjacent to electron-withdrawing groups typically form stabilized carbanions, and all such molecules can serve as Michael donors. For a simple carbonyl compound with an asymmetric Michael donor, the acceptor reacts mainly at the more-substituted α-carbon atom, depending on the stability of the intermediate enol. In general, the more electron-donating substituents on the double bond, the more stable the enol and the more the Michael addition is promoted. Our model perceived this rule, and most of the generated reactions followed it satisfactorily, as depicted in Figure S6 of the SI.

For stabilized carbanions conjugated to multiple heteroatoms, reactions with an acceptor typically yield 1,4-addition products. Most heteroatom-containing stabilizing groups are good leaving groups and can be considered as conjugated auxiliary groups. In Figure S7 of the SI, we list some instances in which a carbon atom between two carbonyl groups is clearly more acidic than the carbon atoms on the other sides of the carbonyl groups, and is therefore more likely to be deprotonated by a base to form a carbanion. The reactions generated by our model also fitted this signature.

The molecular structure of a Michael acceptor includes an electron-withdrawing group and an unsaturated system. Almost all alkenes substituted with an electron-withdrawing group can be used as Michael acceptors, as shown in Figure S8 of the SI. However, if the acceptor molecule contains two or more electron-withdrawing groups, the regioselectivity of the reaction is usually controlled by the more-active group; Figure S8 of the SI shows some generated examples that conform to this rule. In Figure S8(A2) of the SI, the Michael acceptor bears two electron-withdrawing groups, nitro and cyano. Because the nitro group is the more reactive, the reaction tends to yield the product shown in the scheme. The model learned this principle during training and reflects it in the generated reactions.

It is worth mentioning that, in addition to carbon nucleophiles, some heteroatom groups can also serve as Michael donors because of their nucleophilicity. For example, alkylamines and arylamines are widely used as Michael donors; such reactions show promising chemoselectivity and generally do not generate imine byproducts. We added additional heteroatom Michael addition reactions to the data set with 2015 as the split date. After retraining the model on these data, we examined whether it still grasped the Michael addition reaction when heteroatoms were present; here, we mainly considered the heteroatoms N, S, and O. Table [5] lists the classification and proportions of the heteroatom nucleophiles. The generated carbon, nitrogen, oxygen, and sulfur nucleophiles make up most of the generated reactants, in proportions similar to those of the four donor classes in the training dataset. Examples of heteroatom Michael additions from the training and generated sets are presented in Figure S9 of the SI. These results are exciting, as they prove that our Trans-VAE model is sufficiently expressive to produce correct reactions.

Table 5 Distribution of Heteroatom Nucleophiles in the Training Set and the Generated Set

                   Number                  Rate (%)
  Donor class    Training   Generated   Training   Generated
  C                 6,964      10,416       56.5        76.0
  N                 2,753       1,877       22.3        13.7
  O                   843         129        6.9         1.0
  S                 1,763       1,267       14.3         9.3
  Total            12,323      13,689      100.0       100.0

In summary, we have applied the Trans-VAE model to the task of reaction generation. To explore whether the model is capable of generating novel unreported Michael addition reactions, we simulated this scenario by dividing a dataset of existing reactions at a selected time point. Thanks to its transformer-based encoder and decoder architecture, the model captured both SMILES rules and the features of the Michael addition reaction from a large set of reaction sequences. We first trained the model on reactions reported before 2010; the generated reactions were compared with those reported after 2010, and the model was found to reproduce 12.75% of them, providing initial evidence of the reliability of the Trans-VAE model in reaction generation. To confirm that this result reflects effectiveness rather than chance, we conducted another experiment with 2015 as the split point, for which the hit rate was 16.29%. We then inspected whether the model had mastered the rules of the Michael addition reaction by analyzing the generated reactions in terms of their chemical characteristics. Our analysis showed that the model captures reaction characteristics in a manner consistent with the known chemical rules of the Michael addition reaction, indicating the reliability of applying deep-learning models to reaction generation and laying the foundation for our subsequent exploration of the vast chemical space and the generation of completely new types of chemical reactions.

Methods: Dataset. The reaction-generation model was trained on SMILES files containing only Michael addition reactions extracted from the Reaxys database. During data preprocessing, reactions with invalid SMILES strings or with identical reactants and products were removed, and the remaining reactions were canonicalized with RDKit[39] so that the same compound is always represented by the same SMILES. Finally, non-compliant reactions were filtered out with RDKit on the basis of a Michael addition reaction template. Because the same reaction may be reported in the literature at different times, we assigned each reaction the date of its first report and deleted the later duplicates, giving a dataset of 12,322 Michael addition reactions. Taking 2015 as the split line, the reactions before 2015 were divided into training and validation sets (9:1), whereas those after this date were used as a reference for whether the model could generate ‘future’ reactions.
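The first-report deduplication step described above can be sketched as follows (the record layout is illustrative; the reaction SMILES are assumed to have been canonicalized already, e.g. with RDKit, so that identical reactions compare equal as strings):

```python
def keep_first_reported(records):
    """From (year, reaction_smiles) records, keep each reaction's earliest report."""
    first_seen = {}
    for year, rxn in sorted(records):     # sorted -> earliest year encountered first
        first_seen.setdefault(rxn, year)  # later duplicates are ignored
    return first_seen

# Toy example: the 2016 report of A>>B is a duplicate of the 2008 one
records = [(2016, "A>>B"), (2008, "A>>B"), (2012, "C>>D")]
print(keep_first_reported(records))  # {'A>>B': 2008, 'C>>D': 2012}
```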

Methods: Model. The SMILES representation of a reaction is two to three times longer than that of a single molecule, so the model must perform well on long sequences. We therefore applied the Trans-VAE model proposed by Dollar et al.,[30] which implements a transformer as both the encoder and the decoder. The encoder maps the discrete SMILES to a dense latent representation, transforming it into a continuous fixed-dimensional vector, while the decoder attempts to convert the latent vector back into the input with the smallest possible error. By adding noise to the encoded SMILES, each molecule acquires a probability distribution in the latent space rather than an individual point, and the decoder thereby learns more robust representations from latent points. Training minimizes the reconstruction loss between the original and generated SMILES while constraining the distribution of the generated data to be similar to that of the training data.
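The training objective described above is the standard variational-autoencoder loss; in the usual notation (x the input SMILES sequence, z the latent vector, q_φ the encoder's approximate posterior, p_θ the decoder likelihood, and p(z) a standard normal prior):

```latex
\mathcal{L}(\theta,\phi;x) =
  \underbrace{-\,\mathbb{E}_{q_\phi(z\mid x)}\bigl[\log p_\theta(x\mid z)\bigr]}_{\text{reconstruction loss}}
  \;+\;
  \underbrace{D_{\mathrm{KL}}\bigl(q_\phi(z\mid x)\,\|\,p(z)\bigr)}_{\text{latent regularization}}
```

Minimizing the Kullback–Leibler term is what keeps the distribution of generated data close to that of the training data, as noted above; sampling z from p(z) and decoding then yields new reaction SMILES.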



Conflict of Interest

The authors declare no conflict of interest.

Supporting Information


Corresponding Authors

Liefeng Ma
College of Pharmaceutical Sciences, Zhejiang University of Technology
Hangzhou, 310014
P. R. of China

Hongliang Duan
College of Pharmaceutical Sciences, Zhejiang University of Technology
Hangzhou, 310014
P. R. of China   

Publication History

Received: 14 May 2022

Accepted after revision: 06 September 2022

Accepted Manuscript online:
06 September 2022

Article published online:
07 October 2022

© 2022. Thieme. All rights reserved

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany

