
MultiChem: predicting chemical properties using multi-view graph attention network

Abstract

Background

Understanding the molecular properties of chemical compounds is essential for identifying potential candidates or ensuring safety in drug discovery. However, exploring the vast chemical space is time-consuming and costly, necessitating the development of time-efficient and cost-effective computational methods. Recent advances in deep learning approaches have offered deeper insights into molecular structures. Leveraging this progress, we developed a novel multi-view learning model.

Results

We introduce a graph-integrated model that captures both local and global structural features of chemical compounds. In our model, graph attention layers are employed to effectively capture essential local structures by jointly considering atom and bond features, while multi-head attention layers extract important global features. We evaluated our model on nine MoleculeNet datasets, encompassing both classification and regression tasks, and compared its performance with state-of-the-art methods. Our model achieved an average area under the receiver operating characteristic (AUROC) of 0.822 and a root mean squared error (RMSE) of 1.133, representing a 3% improvement in AUROC and a 7% improvement in RMSE over state-of-the-art models in extensive seed testing.

Conclusion

MultiChem highlights the importance of integrating both local and global structural information in predicting molecular properties, while also assessing the stability of the models across multiple datasets using various random seed values.

Implementation

The code is available at https://github.com/DMnBI/MultiChem.


Introduction

Accurate characterization of the molecular properties of chemical compounds is crucial for identifying lead candidates and evaluating cellular and tissue-level interactions in drug development. With over 100 million molecules available, experimentally screening desirable compounds within this vast chemical space is challenging, time-consuming, and costly. In this context, in silico evaluation provides a more efficient alternative, contingent on the development of accurate predictive models.

In recent years, machine learning methods have become widely adopted in drug discovery and molecular property prediction [43, 66]. These methods, grounded in quantitative structure–activity relationship (QSAR) principles, predict molecular activity based on structural similarities by carefully selecting relevant variables and inference functions [21]. Traditional machine learning models, such as support vector machine [7] and random forests [2], have demonstrated strong performance in estimating relationships between molecular structures and their properties.

In QSAR modeling, molecular fingerprints have become a standard for representing molecular structures [43, 66]. These fingerprints capture features such as electronic, geometric, and steric properties. Commonly used fingerprints, like PubChem fingerprint [30], MACCS Key [8], and extended-connectivity fingerprints (ECFP) [49], can be generated efficiently using tools like PaDel [65], RDKit [32], and CDK [53]. However, these approaches depend on predefined substructures, which may overlook other relevant molecular features [26], highlighting the need for more comprehensive molecular representations.

Deep learning has recently gained prominence across various fields due to its ability to automatically extract features without manual intervention [13]. In drug discovery, graph-based and sequence-based neural networks have shown particular promise by capturing a broader range of substructures than traditional fingerprints [27, 28, 37, 40, 44, 45, 56, 59, 67, 70]. Graph-based models are adept at distinguishing locally positioned substructures, while sequence-based models capture global patterns from distant substructures.

Several graph neural networks (GNNs) have been introduced for molecular property prediction, including graph convolutional network (GCN) [31], message-passing neural network (MPNN) [19], and graph attention network (GAT) [3, 57]. These models focus on individual nodes within molecular graphs to accurately capture local structural features. Studies have shown that GCNs and MPNNs perform comparably to traditional machine learning approaches [6, 9, 19, 29, 31, 39, 51, 55]. GATs and their variants, incorporating attention mechanisms, have shown better performance by assigning varying influence probabilities within the graph structure [34, 62, 68].

Building on the success of atom-centered GNNs, several variant models have been developed to enhance prediction accuracy [15, 36, 38, 54, 64, 69]. Among these, the directional message-passing neural network (DMPNN) [63] is notable for its focus on bond information, treating atoms and bonds as equally important to better differentiate non-isomorphic structures. Unlike atom-centered GNNs, the bond-centered DMPNN utilizes a line graph where the atoms and bonds are assigned to edges and nodes, respectively [18, 60, 63]. Inspired by the success of both atom- and bond-centered GNNs, several studies have attempted to combine atom and bond representations in a single framework [5, 15, 40, 52, 59]. In these studies, either a bond-centered GNN was added to an atom-centered GNN, or both GNNs were used together to exchange information. However, these approaches perform simple summation or averaging to represent atoms and bonds rather than employing more sophisticated attention, potentially limiting their effectiveness.

Sequence-based models have also been employed in molecular property prediction [20, 25, 50, 58]. These models primarily use SMILES sequences to represent molecular structures, enabling them to extract global structural features from entire sequences. Notably, BERT has been adapted for molecular property prediction using fully connected graph approaches [58]. Similar to GNNs, sequence-based models often outperform traditional methods.

Data scarcity is a common challenge in molecular property prediction, leading to the adoption of multi-task learning and transfer learning to address this issue. Multi-task learning improves performance by sharing information across related tasks [43, 61]. Transfer learning, widely used in both graph- and sequence-based models, leverages knowledge from larger datasets to enhance predictive accuracy on smaller datasets [16, 35, 50, 58].

In this work, we developed a model that generates a comprehensive molecular representation by simultaneously capturing local and global structural information. Our approach incorporates two graph encoders to effectively capture local information from atoms and bonds. These local features are jointly aggregated using a graph attention mechanism, which represents a key contribution of our study. Unlike traditional aggregation methods [34, 57, 62], our mechanism more effectively captures structural information from atoms and bonds, making it particularly well-suited for molecular property prediction. To complement the local features, we applied a multi-head attention mechanism to capture global relationships between distant substructures across diverse molecular topologies. By directly applying multi-head attention to the local features from our graph encoders, our model emphasizes local substructural relationships while simultaneously extracting global ones. This multi-view approach integrates local and global information, enabling a more refined representation of molecular structures. When we assessed the impact of individual components within our model, we observed AUROC improvements ranging from 1.0% to 5.1% when comparing the model to variants lacking specific components. In addition, the final model with the ensemble method showed a 2.3% increase in AUROC.

In our experiments, several models showed substantial performance differences across datasets. These inconsistencies may be due to structural differences in the scaffold distributions of the datasets, suggesting that each model focused on different molecular regions during training. To address the challenges of data scarcity and imbalance, we also incorporated an ensemble method, which has shown promising results in biological applications [12, 14]. Ensemble methods reduce variance and bias, mitigating the risk of overfitting. In particular, bagging in our method promotes model diversity by training on data subsets with varied distributions, improving generalization and stability by reducing variance across individual models.

We evaluated our method on nine benchmark datasets from MoleculeNet, including both classification and regression tasks. Compared to state-of-the-art methods [16, 34, 35, 40, 50] on gold-standard datasets, our model achieved better or comparable performance, with an average area under the receiver operating characteristic (AUROC) of 0.822 and a root mean squared error (RMSE) of 1.133, reflecting improvements of 3% in AUROC and 7% in RMSE over current leading methods.

Materials and methods

Datasets

We used classification and regression datasets from MoleculeNet [61], obtained through the DeepChem [47] library. The classification datasets comprise six tasks: BBBP [42], Tox21 [23], ToxCast [48], SIDER [47], ClinTox [61], and BACE. The regression datasets comprise three physical chemistry tasks: ESOL, FreeSolv, and Lipophilicity.

The BBBP dataset provides information on molecular penetration of the blood–brain barrier [42]. Tox21 includes qualitative screening data for nuclear receptor (NR) and stress response (SR) assays [23]. ToxCast comprises extensive in vitro high-throughput screening data [48]. SIDER contains adverse drug reaction data classified by MedDRA [47], and ClinTox includes FDA-approved drugs alongside those that failed due to toxicity issues [61]. The BACE dataset provides biophysical properties related to binding for inhibitors of human beta-secretase 1 (BACE-1). The ESOL dataset contains 1,128 compounds with water solubility values. FreeSolv provides compounds with experimental hydration free energies in water. The Lipophilicity dataset contains experimental octanol/water distribution coefficient values. In the classification datasets, each task labels compounds as active, inactive, or inconclusive. For this study, we focus exclusively on the active and inactive outcomes to maintain accuracy and reliability. Detailed configurations of each dataset are provided in Table 1.

Table 1 Details of the datasets
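The datasets above can be retrieved programmatically. The following is a minimal sketch of loading one MoleculeNet task with the DeepChem library; the featurizer and splitter arguments are illustrative choices, not necessarily the exact configuration used in this study.

```python
# Hedged example: loading a MoleculeNet classification dataset with DeepChem.
# The featurizer/splitter choices are illustrative only.
import deepchem as dc

tasks, datasets, transformers = dc.molnet.load_bbbp(
    featurizer="Raw",       # keep RDKit Mol objects so custom graphs can be built later
    splitter="scaffold",    # MoleculeNet's standard scaffold split
)
train, valid, test = datasets
print(len(tasks), "task(s);", len(train), "training molecules")
```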

Feature initialization

Our model utilizes two graph forms: atom graph and bond graph (Fig. 1A), designed for the atom-centered and bond-centered sub-models, respectively. The atom graph represents atoms as nodes and bonds as edges, with the adjacency matrix indicating the bond presence. The bond graph assigns bonds as nodes and atoms as edges, with the adjacency matrix indicating direct connections between two bonds via an atom. This dual-input approach ensures our model equally considers both atoms and bonds.

Fig. 1

MultiChem: a practical tool for molecular property prediction. A Inputs. We employed two molecular graphs for our model. One is the Atom graph, which assigns atoms and bonds as nodes and edges. The other is the Bond graph, which assigns bonds and atoms as nodes and edges. B GNN for atom. We modified the graph attention network to incorporate the bond information from the other GNN and used it for the Atom graph. C GNN for bond. Similarly, we adapted the graph attention network to reflect the atom information from the atom GNN and applied it to the Bond graph. D Attention. We adopted a self-attention mechanism between the nodes in the molecule to capture long-distance features

To initialize features, we represented the molecule with three matrices (node vectors, edge vectors, and adjacency matrix), a common method in deep learning for drug discovery [63]. Node vectors are based on atomic properties such as atom type, the number of bonds, and mass. Edge vectors include bond properties like bond type, ring presence, and conjugation. Properties are transformed into feature vectors using one-hot encoding or scaling, resulting in a node vector of length 127 and an edge vector of length 12. For example, atom type and mass are converted to a vector of length 100 by one-hot encoding and a vector of length one by scaling, respectively. Bond type is transformed into a vector of length four by one-hot encoding. Feature configurations are detailed in Table 2, with properties extracted from the SMILES string using RDKit [32].

Table 2 Feature initialization
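The initialization above can be sketched in a few lines of Python. The snippet below is a simplified illustration using RDKit: the property lists are abbreviated, so the resulting vectors are shorter than the 127- and 12-dimensional vectors of Table 2, and the line-graph construction follows the directed, non-backtracking convention of DMPNN.

```python
# Simplified sketch of feature initialization and graph construction with RDKit.
# Property lists are abbreviated relative to Table 2 and serve only as an illustration.
from rdkit import Chem

ATOM_TYPES = list(range(1, 101))   # atomic numbers 1..100, one-hot encoded
BOND_TYPES = [Chem.BondType.SINGLE, Chem.BondType.DOUBLE,
              Chem.BondType.TRIPLE, Chem.BondType.AROMATIC]

def one_hot(value, choices):
    return [1.0 if value == c else 0.0 for c in choices]

def atom_features(atom):
    # one-hot atom type plus scaled mass (other properties of Table 2 omitted here)
    return one_hot(atom.GetAtomicNum(), ATOM_TYPES) + [atom.GetMass() * 0.01]

def bond_features(bond):
    # one-hot bond type plus ring membership and conjugation flags
    return one_hot(bond.GetBondType(), BOND_TYPES) + \
           [float(bond.IsInRing()), float(bond.GetIsConjugated())]

def mol_to_graphs(smiles):
    mol = Chem.MolFromSmiles(smiles)
    x = [atom_features(a) for a in mol.GetAtoms()]          # node vectors of the atom graph
    e, edge_index = [], []
    for b in mol.GetBonds():
        i, j = b.GetBeginAtomIdx(), b.GetEndAtomIdx()
        for u, v in ((i, j), (j, i)):                       # two directed edges per bond
            edge_index.append((u, v))
            e.append(bond_features(b))
    # bond-graph (line-graph) adjacency: edge k->i is a neighbor of edge i->j (no backtracking)
    line_index = [(p, q) for p, (a, b_) in enumerate(edge_index)
                  for q, (c, d) in enumerate(edge_index)
                  if b_ == c and a != d]
    return x, e, edge_index, line_index

x, e, edge_index, line_index = mol_to_graphs("CCO")         # ethanol as a toy example
```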

Model construction

In the MultiChem model, we incorporate three key components: a graph attention mechanism to integrate atom and bond information, a multi-head attention mechanism to capture both local and global features, and an ensemble method to enhance model stability. Since molecules are composed of atoms connected by covalent bonds, accurately predicting molecular properties requires models that account for both atomic and bonding information. Recent studies [5, 15, 40, 52, 59] have shown that models integrating both atomic and bond information provide more accurate predictions and richer representations compared to those focusing primarily on atoms and treating bonds merely as connections.

In our approach, we incorporate distinct graphs for atoms and bonds and iteratively update both through a process of mutual reinforcement, where nodes are updated based on edge information and edges are updated based on node information. Further, we advance the integration of atom and bond information by employing a graph attention mechanism, which is known for its flexibility in learning structural information by assigning different weights to neighboring nodes, thus identifying crucial substructures. Our method uses two graph attention encoders, one for atoms and one for bonds, that are interconnected by sharing certain hidden states of atoms and bonds, updated using attention weights derived from neighboring atom and bond hidden states.

After extracting local features with the graph attention encoders, we apply a multi-head attention mechanism to the combined features, aiming to improve predictive power. Multi-head attention is recognized for its reliability and stability compared to single attention mechanisms. While GNNs are typically limited to a predefined number of hops, multi-head attention allows for capturing useful information between distant nodes. By applying multi-head attention to the outputs from our local graph-based module, the model can extract global features from distant parts of the graph, complementing the local features provided by the graph encoders. Further, to integrate local and global features effectively, we used a residual connection between the local and global modules, which helps prevent overfitting in the multi-head attention layers and preserves earlier local information.

To further improve robustness and performance, we adopted bagging, an ensemble method, during model training and inference. Ensemble methods like bagging reduce variance and bias, mitigating risks of overfitting and underfitting. In particular, bagging in our method promotes model diversity by training on data subsets with varied distributions, improving generalization and stability by reducing variance across individual models. Bagging is well-suited for our method because we can generate diverse training data by employing the balanced scaffold splitting method, as described in a later section. For final inference in both classification and regression tasks, we used soft voting, which generally produces more reliable results than hard voting by averaging the predictions from multiple models.

This combined approach, using graph attention for interactive atom-bond representation, multi-head attention for global feature extraction, and ensemble bagging for robustness, provides a comprehensive framework for accurate molecular property prediction.

Sub-model for the node

The node sub-model (Fig. 1B) is based on the graph attention network [3, 57], applying attention to connected nodes in the message-passing layers of the message-passing neural network. Message-passing involves exchanging information (i.e., messages) between two connected nodes, and we incorporated the attention method into this process. The initial hidden vector of the ith node, \({h}_{i}^{0}\), is derived from the initial node vector \({x}_{i}\) through the function \({f}_{init}\) (Eq. 1), which includes learning parameters, a bias, and the exponential linear unit (ELU) activation function.

The message vector \({m}_{j,i}^{l}\) in the message-passing layer is derived from the node vectors \({h}_{j}^{l-1}\), \({h}_{i}^{l-1}\), and the edge vector \({h}_{ji}^{l-1}\) from the previous layer using the function \({f}_{message}\) (Eq. 2). The hidden vector \({h}_{ji}^{l}\) is derived from the directional edge from the jth node to the ith node in the lth hidden layer in the edge sub-model, described in the next section. The attention coefficient \({\alpha }_{j,i}^{l}\) (Eq. 3) is calculated by applying the softmax function to the message vectors \({m}_{j,i}^{l}\), with j in the range of the neighbors of the ith node (N(i)). The node update function (Eq. 4) is formulated to obtain the node vector \({h}_{i}^{l}\), considering the edge vector \({h}_{ji}^{l-1}\), the attention score \({\alpha }_{j,i}^{l}\), and the node vector \({h}_{j}^{l-1}\). Finally, multi-head attention with k heads, a more stable method than single attention, is applied to the node update function (Eq. 5).

$$h_i^0=f_{init}\left(x_i\right)$$
(1)
$${m}_{j,i}^{l} = {f}_{message}\left({h}_{j}^{l-1},{h}_{i}^{l-1},{h}_{ji}^{l-1}\right)$$
(2)
$${\alpha }_{j,i}^{l} = {Softmax}_{j \in N\left(i\right)}\left({m}_{j,i}^{l}\right)$$
(3)
$$h_i^l=\sum_{j\in N(i)}\alpha_{j,i}^l\ast f_{update}\left(h_j^{l-1},h_{ji}^{l-1}\right)$$
(4)
$$h_i^l=\frac{1}{K}\sum_{k=1}^{K}\sum_{j\in N(i)}\alpha_{j,i}^{l,k}\ast f_{update}^{k}\left(h_j^{l-1,k},h_{ji}^{l-1,k}\right)$$
(5)
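A minimal single-head sketch of Eqs. (2)-(4) in PyTorch is given below. The use of plain linear layers for \({f}_{message}\) and \({f}_{update}\), the per-node softmax loop, and the tensor layout are assumptions made for clarity; the released implementation relies on PyTorch Geometric and may differ in detail.

```python
# Minimal single-head sketch of Eqs. (2)-(4): attention-weighted message passing on
# the atom graph. Linear layers for f_message / f_update are simplifying assumptions.
import torch
import torch.nn as nn

class NodeAttentionLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.f_message = nn.Linear(3 * dim, 1)    # scores m_{j,i} from (h_j, h_i, h_ji)
        self.f_update = nn.Linear(2 * dim, dim)   # combines neighbor node and edge states

    def forward(self, h_node, h_edge, edge_index):
        # h_node: [N, dim], h_edge: [E, dim], edge_index: [E, 2] with rows (j, i)
        src, dst = edge_index[:, 0], edge_index[:, 1]
        m = self.f_message(torch.cat([h_node[src], h_node[dst], h_edge], dim=-1)).squeeze(-1)
        # Eq. (3): softmax of messages over the incoming neighbors N(i) of each node i
        alpha = torch.zeros_like(m)
        for i in range(h_node.size(0)):
            mask = dst == i
            if mask.any():
                alpha[mask] = torch.softmax(m[mask], dim=0)
        # Eq. (4): attention-weighted aggregation of updated neighbor/edge states
        upd = self.f_update(torch.cat([h_node[src], h_edge], dim=-1))
        out = torch.zeros_like(h_node)
        out.index_add_(0, dst, alpha.unsqueeze(-1) * upd)
        return out
```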

Sub-model for the edge

The edge sub-model (Fig. 1C), also based on the graph attention network, was constructed similarly to the node sub-model. We applied the attention mechanism to the directional message-passing neural network [63] to incorporate hidden states and messages for the edges, creating a novel model. By focusing on bonds, our model captures features distinct from those obtained by the node model, making it valuable for bond-centric characteristics. The initial hidden vector of the bond between the ith and jth nodes, \({h}_{ij}^{0}\), is derived from the initial edge vector \({e}_{ij}\), node vector \({x}_{i}\), and node vector \({x}_{j}\) through the function \({f}_{init}\) (Eq. 6). The hidden vector of the ith node in the lth hidden layer, \({h}_{i}^{l}\), is inherited from the node sub-model.

With the initial states defined, we constructed a novel message function \({f}_{message}\). The message \({m}_{ki,ij}^{l}\) is calculated using the edge \({h}_{ki}^{l-1}\), the edge \({h}_{ij}^{l-1}\), and the node \({h}_{i}^{l-1}\) (Eq. 7). Here, ki represents the edge from the kth node to the ith node, ij represents the edge from the ith node to the jth node, and l denotes the layer depth. These messages derive the attention score \({\alpha }_{ki,ij}^{l}\) through a softmax function (Eq. 8), which operates over the edges E(ij) directly connected to the edge ij. The aggregating function (Eq. 9) leverages attention scores, edges, and nodes. Finally, we introduced multi-head attention with n heads to this sub-model (Eq. 10).

$${h}_{ij}^{0} = {f}_{init}\left({x}_{i},{x}_{j},{e}_{ij}\right)$$
(6)
$${m}_{ki,ij}^{l} = {f}_{message}\left({h}_{ki}^{l-1},{h}_{ij}^{l-1},{h}_{i}^{l-1}\right)$$
(7)
$${\alpha }_{ki,ij}^{l} = {Softmax}_{ki \in E(ij)}\left({m}_{ki,ij}^{l}\right)$$
(8)
$$h_{ij}^l=\sum_{ki\in E(ij)}\alpha_{ki,ij}^{l}\ast f_{update}\left(h_{ki}^{l-1},h_{i}^{l-1}\right)$$
(9)
$$h_{ij}^l=\frac{1}{N}\sum_{n=1}^{N}\sum_{ki\in E(ij)}\alpha_{ki,ij}^{l,n}\ast f_{update}^{n}\left(h_{ki}^{l-1,n},h_{i}^{l-1,n}\right)$$
(10)
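Because Eqs. (7)-(9) mirror the node sub-model on the line graph, the same attention pattern can be reused there. The toy example below (three atoms, two bonds, hence four directed bonds) reuses the hypothetical NodeAttentionLayer from the previous sketch; the tensor contents and the reuse itself are illustrative assumptions rather than the released implementation.

```python
# Toy sketch of the bond sub-model (Eqs. 7-9): the node-layer attention pattern applied
# to the line graph. Reuses NodeAttentionLayer from the previous sketch (an assumption).
import torch

dim = 8
h_atom = torch.randn(3, dim)                 # atom states h_i from the node sub-model
h_bond = torch.randn(4, dim)                 # directed bonds: 0->1, 1->0, 1->2, 2->1
tail_atom = torch.tensor([0, 1, 1, 2])       # tail atom i of each directed bond i->j
# line-graph edges (p, q): bond p = (k->i) is an incoming neighbor of bond q = (i->j)
line_index = torch.tensor([[3, 1], [0, 2]])  # 2->1 feeds 1->0, and 0->1 feeds 1->2

bond_layer = NodeAttentionLayer(dim)         # same layer class as in the node sketch
shared_atom = tail_atom[line_index[:, 1]]    # atom i shared by each (p, q) pair
h_bond_new = bond_layer(h_bond, h_atom[shared_atom], line_index)   # Eq. (9) analogue
```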

Multi-head attention network

We built our model on graph neural networks that extract features focused on nodes and edges, typically capturing local structural features. To incorporate crucial global features alongside these local ones, we adopted a multi-head attention layer capable of relating distant nodes. Before applying this method, we merged the hidden states from the node and edge sub-models. As shown in Eq. 11, we aggregated the edges connected to the ith node and the node \({h}_{i}^{L}\) in the last (Lth) layer of the sub-models, obtaining a new initial hidden state \({h}_{i}^{t0}\) using the function \({f}_{update}\) (Fig. 1D).

Once the new hidden states containing the local structural features were initialized, we applied the multi-head attention method to our model. The technique allowed us to determine the contributions of these local features by using the softmax function to calculate the importance of each sub-structure. In Eq. 12, the attention score \({\alpha }_{i,j}^{t}\) representing the importance of the jth node relative to the ith node, is derived using the softmax function on the hidden states of nodes \({h}_{i}^{t-1}\) and \({h}_{j}^{t-1}\), where t indicates the layer depth. The hidden state \({h}_{i}^{t}\) is then systematically calculated by summing all the nodes in the graph (N(G)) with their attention scores (Eq. 13). Finally, we applied the multi-head attention mechanism with k heads to our model (Eq. 14).

$${h}_{i}^{t0} = {f}_{update}({h}_{i}^{L},\sum_{j \in N(i)}{h}_{ji}^{L})$$
(11)
$$\alpha_{i,j}^{t} = {Softmax}_{j \in N(G)}\left(f_{query}\left(h_{i}^{t-1}\right)\cdot f_{key}\left(h_{j}^{t-1}\right)^{T}\right)$$
(12)
$${h}_{i}^{t} = {\sum }_{j \in N(G)}{\alpha }_{i,j}^{t}*{f}_{value}({h}_{j}^{t-1})$$
(13)
$$h_{i}^{t} = \frac{1}{K}\sum_{k=1}^{K}\sum_{j \in N(G)}\alpha_{i,j}^{t,k}\ast f_{update}^{k}\left(h_{j}^{t-1,k}\right)$$
(14)
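A hedged sketch of this global module (Eqs. 11-14) is shown below using PyTorch's built-in multi-head attention; the fusion layer, head count, and residual placement are assumptions for illustration, and the released model may implement the heads differently.

```python
# Sketch of the global module (Eqs. 11-14) with a residual connection back to the
# local features. Layer sizes and the use of nn.MultiheadAttention are assumptions.
import torch
import torch.nn as nn

dim, heads = 128, 4
f_update = nn.Linear(2 * dim, dim)            # Eq. (11): fuse node state and summed edge states
mha = nn.MultiheadAttention(dim, heads, batch_first=True)

def global_attention(h_node, h_edge, edge_index):
    # Eq. (11): sum incident directed-edge states into their destination node
    edge_sum = torch.zeros_like(h_node)
    edge_sum.index_add_(0, edge_index[:, 1], h_edge)
    h0 = f_update(torch.cat([h_node, edge_sum], dim=-1))
    # Eqs. (12)-(14): self-attention across all nodes N(G) of the molecule
    h0 = h0.unsqueeze(0)                      # [1, N, dim]: a single-molecule "batch"
    out, _ = mha(h0, h0, h0)
    return (out + h0).squeeze(0)              # residual connection keeps local features
```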

Readout phase

Following the multi-head attention layers, we adopted a readout method in our model to extract the hidden feature \(H\) of the molecule. For all nodes in the graph (N(G)), \({h}_{i}^{T}\) from the last (Tth) multi-head attention layer was averaged (Eq. 15). Finally, a feed-forward neural network was applied to the output of the readout phase, yielding the predicted values.

$$H = \frac{1}{|N\left(G\right)|}{\sum }_{i \in N(G)}{h}_{i}^{T}$$
(15)
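The readout and prediction head can be written in a few lines; the hidden size and the two-layer feed-forward head below are illustrative assumptions.

```python
# Mean-pooling readout of Eq. (15) followed by a feed-forward prediction head.
# Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

ffn = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 1))

def readout(h_nodes):                # h_nodes: [N, 128] from the last attention layer
    H = h_nodes.mean(dim=0)          # Eq. (15): average over all nodes in the graph
    return ffn(H)                    # predicted property (a logit or regression value)
```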

Multi-task learning and ensemble method

We employed binary cross-entropy (BCE) and mean squared error (MSE) loss functions for classification and regression tasks, respectively. For classification tasks, we adopted a multi-task learning approach, training a model on multiple tasks simultaneously [4]. This approach leverages larger datasets combined from multiple tasks, which is advantageous when individual tasks have limited data. Multi-task learning also has a regularizing effect, reducing the risk of overfitting to any specific task. For instance, on the Tox21 dataset, consisting of 12 NR and SR tasks, multi-task learning yielded better performance than single-task learning. We further enhanced our model through bagging, an ensemble method that improves generalization and mitigates overfitting. In bagging, multiple training datasets are created by sampling subsets with varying distributions, addressing data limitations and providing a diverse training base. Independent models are then trained on these samples, and the final prediction is made by combining the outputs from all trained models. For this ensemble inference, we used soft voting to calculate the final predictions. Unlike hard voting, which selects the prediction favored by the majority of models, soft voting averages the outputs from each model to produce the final prediction. As shown in Eq. 16, the result for sample i is derived by averaging the outputs across multiple models \({M}_{k}\).

$${\widehat{y}}_{i}^{t} = \frac{1}{K}{\sum }_{k = 1}^{K}{M}_{k}({x}_{i}^{t})$$
(16)
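The two ingredients of this section can be sketched as follows: a masked multi-task BCE loss (the masking scheme for missing task labels is our assumption) and the soft-voting average of Eq. (16) for the classification case.

```python
# Sketch of a masked multi-task BCE loss and the soft-voting ensemble of Eq. (16).
# The masking scheme and sigmoid averaging are assumptions for the classification case.
import torch
import torch.nn.functional as F

def multitask_bce(logits, labels, mask):
    # labels: [B, T] with 0/1 entries; mask: [B, T] marks which task labels exist
    loss = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1)

def soft_vote(models, x):
    # Eq. (16): average the (sigmoid) outputs of K independently trained models
    with torch.no_grad():
        preds = torch.stack([torch.sigmoid(m(x)) for m in models])
    return preds.mean(dim=0)
```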

Training and evaluation data

Similar to previous studies, we employed the balanced scaffold splitting method for evaluation, which introduces a more rigorous and scientifically meaningful challenge compared to random or scaffold splitting [35, 50]. The balanced scaffold splitting method divides molecules into disjoint scaffold sets based on their structural scaffolds. These scaffold sets are then categorized into large and small sets according to a specified size threshold (e.g., half of the size of a test dataset). To ensure the representativeness of the training set and reduce bias in the validation and test datasets, large scaffold sets are consistently allocated to the training dataset.

To train and evaluate our ensemble model, we initially split a dataset using balanced scaffold splitting with a random seed, allocating 10% for testing and 90% for subsequent processing. We then applied a new random seed, referred to as the e-seed, distinct from the existing seed, to split the remaining data. Once the e-seed was determined, we further divided this data into 80% for training and 10% for validation. This approach enabled us to obtain multiple training-validation dataset pairs using different e-seed values, each pair containing slightly different scaffolds in the training dataset. This variability enabled us to train multiple models independently, supporting our ensemble model.
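A rough sketch of balanced scaffold splitting is shown below, assuming Bemis-Murcko scaffolds computed with RDKit. The size threshold, the train/test split only (validation omitted), and the seeded shuffle are simplifications of the procedure described above.

```python
# Rough sketch of balanced scaffold splitting (simplified to a train/test split).
# The threshold rule and shuffling are approximations of the described procedure.
from collections import defaultdict
import random
from rdkit.Chem.Scaffolds import MurckoScaffold

def balanced_scaffold_split(smiles_list, frac_test=0.1, seed=0):
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi, includeChirality=False)
        groups[scaffold].append(idx)
    scaffold_sets = list(groups.values())
    test_size = int(frac_test * len(smiles_list))
    # large scaffold sets always go to training; small sets are shuffled by the seed
    large = [s for s in scaffold_sets if len(s) > test_size / 2]
    small = [s for s in scaffold_sets if len(s) <= test_size / 2]
    random.Random(seed).shuffle(small)
    train = [i for s in large for i in s]
    test = []
    for s in small:
        (test if len(test) < test_size else train).extend(s)
    return train, test
```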

Model training

We trained our ensemble model using datasets created with balanced scaffold splitting. Before training, graphs and line graphs were extracted from the molecules in the datasets using RDKit [32]. Our model was implemented using PyTorch [1], PyTorch Lightning [11], PyTorch Geometric [17], and TorchVision [41]. PyTorch Geometric and TorchVision were specifically used to implement the graph and multi-head attention layers.

We trained our model by minimizing the BCE and MSE loss functions for classification and regression tasks, respectively, on the preprocessed datasets. To prevent overfitting and improve model efficiency, we applied early stopping based on the minimum validation loss [33, 46]. We trained our model on a single NVIDIA Tesla V100-DGXS-32 GB GPU in a CUDA 11.3 environment.

After training, we obtained multiple models from different e-seed splits, allowing us to make predictions on the test dataset using an ensemble approach with soft voting. In this method, the final prediction for the test set was obtained by averaging the outputs from each model. Final evaluation metrics, including AUROC for classification tasks and RMSE for regression tasks, were computed using Scikit-Learn [10].

During hyperparameter tuning, we applied the early stopping method to prevent overfitting, with a maximum of 1000 epochs and a patience of 30 epochs. Patience denotes the number of epochs that training continues after a new minimum loss value is found. We optimized our model on validation datasets with various hyperparameters: batch size, layer size, layer depth, learning rate, weight decay, and dropout rate. Detailed hyperparameter configurations are provided in Supplementary Table 1.
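For illustration, an early-stopping setup mirroring these settings could look like the following with PyTorch Lightning; the LitMultiChem wrapper and the dataloaders are hypothetical names, not code from the repository.

```python
# Illustrative early-stopping setup (patience 30, max 1000 epochs) with PyTorch Lightning.
# LitMultiChem and the dataloaders are hypothetical placeholders.
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss", mode="min", patience=30)
trainer = pl.Trainer(max_epochs=1000, callbacks=[early_stop])
# trainer.fit(LitMultiChem(), train_dataloader, val_dataloader)   # hypothetical usage
```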

We experimented with different batch sizes to assess their effects on overfitting, underfitting, and generalization. The optimal batch sizes were found to be 64 for smaller datasets (e.g., BBBP) and 256 for larger datasets (e.g., ToxCast). We also tested different sizes and depths for the graph and attention layers, which significantly affected model performance. The best performance was achieved with a graph layer depth of three, an attention layer depth of one, and a layer size of 128 for both layers.

For optimization, we adopted the Adam optimizer, known for its stability and speed. We adjusted two key parameters: learning rate and weight decay. Weight decay, similar to L2 regularization, was effective in preventing overfitting. The optimal model configuration was found with a learning rate of 1e-3 and a weight decay of zero. Additionally, dropout was applied to further mitigate overfitting, with the highest performance observed at a dropout rate of 0.3.

Methods for performance comparison

We conducted a comprehensive comparison of our model with various existing methods based on different approaches, including graph-based [19, 29, 31, 34, 39, 40, 51, 62, 63], SMILES-based [58], and pre-training methods [16, 22, 25, 35, 50, 58], using the MoleculeNet dataset. This thorough evaluation ensured the fairness and reliability of our model.

Firstly, we compared our model with nine graph-based methods built on the message-passing neural network framework, including AttentiveFP [62], DMPNN [63], TrimNet [34], and CD-MVGNN [40]. AttentiveFP integrates a small graph into a large one using a graph attention mechanism to extract the critical substructures. DMPNN treats bonds as nodes to solve the tottering problem, while TrimNet exploits both bond and atom features to enhance the representation of molecular features. CD-MVGNN integrates atom- and bond-based GNNs with disagreement loss, mitigating the difference in predictions between these GNNs.

Secondly, we evaluated our model against the SMILES-BERT model [58], which uses the BERT architecture with SMILES strings as input to represent chemical structures.

Lastly, we compared our model with six pre-training methods, including GROVER [50], MPG [35], and KANO [16]. These methods were pre-trained on large datasets such as ZINC [24] and fine-tuned on smaller datasets such as MoleculeNet. GROVER employs multi-level pretraining tasks that predict masked information and motifs, while MPG uses self-supervised learning with masked atom prediction and pairwise half-graph discrimination. KANO introduces contrastive learning between molecular graphs augmented with an element knowledge graph and molecular graphs enhanced with functional groups.

Results

We evaluated the accuracy of our model by comparing it to state-of-the-art models. To assess the stability and generalization of the models, we tested our model and several models across different data characteristics using various random seed values. Additionally, we analyzed the effectiveness of our ensemble method in handling data scarcity. An ablation study provided insight into the contribution of each model component, offering a comprehensive understanding of their roles and impact.

Performance evaluation on MoleculeNet dataset

We evaluated our model against eight state-of-the-art graph-based models and six pre-trained models [16, 19, 22, 25, 29, 31, 34, 35, 39, 50, 51, 58, 62, 63]. These methods were tested on two datasets: dataset-a, initially used to validate graph-based methods [50], and dataset-b, initially used to evaluate pre-trained models [35]. Both datasets were split using different random seeds for scaffold splitting, resulting in varying scaffold distributions across the training, validation, and test sets. All models, including ours, were evaluated on consistent test datasets to ensure fair comparison. We measured AUROC for classification tasks and RMSE for regression tasks, averaging results over three random seed values for comparison.

On dataset-a, initially used in GROVER [50], our model achieved an average AUROC of 0.821 across six classification tasks, the second highest. GROVER led with an average AUROC of 0.834, followed by KANO in third with 0.814 (Table 3). For regression tasks, GROVER achieved the lowest average RMSE scores of 1.544 and 0.560 on FreeSolv and Lipo, respectively, while our method reached the lowest RMSE of 0.746 on ESOL. On dataset-b, initially used in MPG [35], our model showed strong performance with an average AUROC of 0.851 on six classification tasks, followed by MPG and KANO with AUROCs of 0.841 and 0.833, respectively. In regression tasks on the FreeSolv, ESOL, and Lipo datasets, MPG, KANO, and our method achieved the lowest RMSE scores of 1.269, 0.405, and 0.545, respectively (Table 4).

Table 3 Performance comparison with current state-of-the-art methods on dataset-a
Table 4 Performance comparison with current state-of-the-art methods on dataset-b

Interestingly, several models showed substantial performance differences between the two datasets. Our method and KANO demonstrated stable performance across both datasets, whereas GROVER and MPG showed significant performance variations. These inconsistencies may be due to structural differences in the scaffold distributions of the two datasets, suggesting that each model focused on different molecular regions during training. To address this and ensure fair comparisons, we conducted an extensive evaluation using 30 random seed values, detailed in Sect. "Performance evaluation with multiple random seed values".

Performance evaluation with multiple random seed values

The performance variations observed between dataset-a and dataset-b for several methods (Sect. "Performance evaluation on MoleculeNet dataset") underscore the need to assess robustness across different dataset compositions. Thus, using 30 random seed values, we compared our model to prominent methods, including GROVER, MPG, CD-MVGNN, and KANO, on nine benchmark datasets. GROVER [50] and MPG [35] were selected due to their notable performance difference, KANO [16] as a recent advancement in the field, and CD-MVGNN [40] for its multi-view approach that considers both atoms and bonds for non-isomorphic graph differentiation.

As shown in Table 5, our method attained the highest average AUROC scores of 0.957, 0.740, 0.628, and 0.919 on the BBBP, ToxCast, SIDER, and ClinTox datasets, respectively. In the regression tasks, our method also obtained the lowest average RMSE scores of 0.783 and 0.597 on the ESOL and Lipo datasets. Meanwhile, KANO and CD-MVGNN outperformed other models on the Tox21 and BACE datasets, with AUROC scores of 0.837 and 0.874, respectively. KANO also achieved the best score of 1.970 in the regression task on the FreeSolv dataset. These results demonstrate the robustness of our model in handling data variations, showing the effectiveness of the ensemble method, particularly for small datasets. Statistical significance was validated through Student’s t-test and Welch’s t-test (Supplementary Figs. 1 and 2).

Table 5 Performance comparison with sufficient random seed values

Among the five models evaluated on nine datasets, our model ranked best on six datasets and second-best on two others, highlighting the importance of developing more generalized and integrated models. Additionally, CD-MVGNN displayed the best performance on one dataset and the second-best performance on two others. These results suggest that multi-view approaches consistently improve the prediction of molecular properties. KANO presented the best scores on two datasets and the second-best scores on three others, demonstrating the effectiveness of pretraining and knowledge-guided learning. In this context, the balanced performance of our model across multiple datasets highlights its potential for a wide range of molecular property prediction tasks, and combining our method with pretraining could promise better predictions in future work.

Contribution of the ensemble model

We applied an ensemble method to our model to address data scarcity on the nine MoleculeNet datasets, which contain only a few thousand samples. To validate the effectiveness of the ensemble method, we compared our ensemble model to single models across 30 random seed values. We used balanced scaffold splitting to create six single models with distinct e-seed values (from 0 to 5), training each model on a unique training and validation dataset (Figs. 2 and 3). The ensemble model was then constructed by aggregating these six models using soft voting.

Fig. 2

Performance comparison of the ensemble and individual models on classification tasks. We evaluated our ensemble and individual models across 30 random seeds on the classification datasets to validate the robustness and generalization of the ensemble method. Each e-seed n model was independently trained on different training and validation datasets split with different e-seed values. The Ensemble model used soft voting over the six e-seed n models. The test dataset was fixed, and all models were evaluated on the same test dataset for a given seed value. Notable variations in AUROC values were observed among the e-seed n models. The Ensemble model consistently achieved the best performance across all datasets. Significant improvements of 0.009, 0.018, 0.020, 0.016, 0.022, and 0.022 AUROC over the second-highest scores were observed on BBBP, Tox21, ToxCast, SIDER, ClinTox, and BACE, respectively

Fig. 3

Performance comparison of the ensemble and individual models on regression tasks. We evaluated our ensemble and individual models across 30 random seeds on the regression datasets. Each e-seed n model was independently trained, and the Ensemble model used soft voting over the six e-seed n models. The Ensemble model consistently outperformed the individual models across all datasets. Significant improvements of 0.202, 0.060, and 0.047 RMSE over the second-lowest scores were observed on FreeSolv, ESOL, and Lipo, respectively

Our results show that the ensemble model consistently outperformed each individual model. For classification tasks, the ensemble model achieved higher AUROC scores than all single models (Fig. 2). The ensemble model improved AUROC scores by 1.4%, 2.4%, 3.0%, 3.3%, 2.9%, and 3.4% on average for the BBBP, Tox21, ToxCast, SIDER, ClinTox, and BACE datasets, respectively. For regression, the ensemble model achieved RMSE improvements of 13.8%, 10.3%, and 8.3% on average for the FreeSolv, ESOL, and Lipo datasets, respectively (Fig. 3). These results indicate that the ensemble method significantly enhances performance, especially in regression tasks. Detailed classification and regression results are presented in Supplementary Tables 3 and 4.

Ablation study

We conducted a systematic ablation study to assess the impact of individual components within our model, using the MoleculeNet physiology datasets from dataset-b. We first evaluated the complete model (M1 in Fig. 4), which integrates all components: the node sub-model, the edge sub-model, multi-head attention, and the ensemble method. We then tested a variant of the model without the ensemble method (M2 in Fig. 4). From the M2 model, we built and evaluated three additional models, each independently eliminating one component: the node sub-model, the edge sub-model, or multi-head attention.

Fig. 4

Ablation study evaluating the contribution of four components in our model. M1 (green) represents the full model, incorporating the atom graph, bond graph, attention mechanism, and ensemble method. M2 (light green) includes all components except the ensemble method. The next three models (light blue) each exclude one component (either the attention mechanism, bond graph, or atom graph). The last three models (pink) include only one component each (either the attention mechanism, bond graph, or atom graph). The full model, M1, achieved the highest performance, with an average AUROC score of 0.836 on the MoleculeNet physiology datasets

The ablation study revealed that each component significantly contributed to the overall performance, with the full model achieving the best performance, underscoring the importance of incorporating all components (Fig. 4). The basic models containing only a single component (either the atom graph, bond graph, or attention) showed lower AUROC scores of 0.795, 0.801, and 0.777, respectively, which are approximately 5%, 4%, and 7% lower than our full model. Notably, the comparison between the M1 and M2 models showed a significant 2.3% increase in AUROC score attributed to the ensemble method. Additionally, we observed AUROC improvements ranging from 1.0% to 5.1% when comparing the M2 model to the other models lacking specific components. Detailed results on the physiology datasets are presented in Supplementary Table 5.

Discussion

As with many clinical prediction tasks, this study faced limitations due to the availability of experimentally labeled data, resulting in small sample sizes that can challenge the training of deep learning models and potentially compromise robustness and generalization. Despite these challenges, our multi-view model, which integrates atom, bond, and global features alongside an ensemble method, showed consistently stable performance across most experiments. These findings suggest that our approach to enhancing generalization could be valuable for clinical applications where data availability is limited. We believe this model holds promise for clinical studies, potentially facilitating reliable predictions in scenarios with constrained data resources.

In addition, the ensemble method has a limitation: computational cost grows as the number of models increases. Our experiments indicate that while our model uses more computational resources than other architectures, it still performs efficiently within practical limits. For example, we measured training time for MPG [35] and our method; MPG is a pretrained method, whereas our method relies on the ensemble. To compare these two methods, we chose the BBBP dataset, which includes a small amount of data and one task, and the ToxCast dataset, which contains a relatively large amount of data and 617 tasks. On the BBBP dataset, MPG takes 27 min for fine-tuning, and our method takes 24 min for the ensemble. If a dataset is small, the ensemble method works efficiently through optimizations such as early stopping. For the ToxCast dataset, MPG takes 46 min, and our method takes one hour and 21 min. As anticipated, a large dataset can demand higher computational resources for the ensemble method, but the cost has remained within practical limits. Additionally, several potential optimizations could reduce computational costs when the model is applied to larger datasets or more complex molecular structures, such as reducing the number of attention layers without significant loss of accuracy and leveraging sparse attention mechanisms that selectively focus on critical substructures.

To address these challenges associated with the ensemble method, future work should focus on developing more efficient and practical ensemble approaches for deep learning models. Pretraining methods might be synergistic with the ensemble method by reducing training time through fine-tuning. Considering that KANO achieved better performance in some of our experiments, applying the ensemble method to a pre-trained model could reduce computational costs and enhance performance. Additionally, replacing soft voting with attention-based networks could provide improved interpretability, which is essential for many clinical applications.

Conclusions

This study presents a graph-integrated model designed to enhance the prediction of molecular properties by effectively capturing both local and global substructural features. The model incorporates graph attention encoders and multi-head attention mechanisms for a deeper structural understanding. The graph attention networks enable our model to detect more specific structural properties by considering interactions between atoms and bonds, while the multi-head attention network captures the relationships between both proximate and distant substructures. This design allows our model to reflect both locally and globally essential substructures. Furthermore, we adopted multi-task learning to increase performance by facilitating task cooperation and employed an ensemble method to enhance robustness and generalization.

We validated the effectiveness of our model through experiments across 30 random seed values on nine MoleculeNet datasets, comparing it with several state-of-the-art methods. Our model outperformed other models on four classification and two regression datasets, where the results demonstrate higher predictive accuracy. These results underscore the advantage of our ensemble method and highlight the importance of integrating local and global structural information in predicting molecular properties.

Data availability

No datasets were generated or analysed during the current study.

Abbreviations

AUROC:

Area Under the Receiver Operating Characteristic curve

RMSE:

Root Mean Squared Error

QSAR:

Quantitative Structure–Activity Relationship

ECFP:

Extended-Connectivity FingerPrints

GNN:

Graph Neural Network

GCN:

Graph Convolutional Network

GAT:

Graph Attention Network

MPNN:

Message-Passing Neural Network

DMPNN:

Directional Message-Passing Neural Network

SMILES:

Simplified Molecular Input Line Entry System

Lipo:

Lipophilicity

ELU:

Exponential Linear Unit

BCE:

Binary Cross-Entropy

MSE:

Mean Squared Error

References

  1. Ansel J, et al. Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation. In, Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 2024;2:929–947.

  2. Breiman L. Random forests. Machine Learning. 2001;45:5–32.


  3. Brody S, Alon U, Yahav E. How Attentive are Graph Attention Networks? In, International Conference on Learning Representations. 2022.

  4. Caruana R. Multitask learning. Machine Learning. 1997;28:41–75.


  5. Choudhary K, DeCost B. Atomistic line graph neural network for improved materials property predictions. npj Comput Mat. 2021;7(1):185.


  6. Coley CW, et al. Convolutional embedding of attributed molecular graphs for physical property prediction. J Chem Inf Model. 2017;57(8):1757–72.


  7. Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995;20:273–97.


  8. Durant JL, et al. Reoptimization of MDL keys for use in drug discovery. J Chem Inf Comput Sci. 2002;42(6):1273–80.


  9. Duvenaud DK, et al. Convolutional networks on graphs for learning molecular fingerprints. Advances in neural information processing systems. 2015;28.

  10. Fabian P. Scikit-learn: Machine learning in Python. J Machine Learn Res. 2011;12:2825.


  11. Falcon W, et al. PyTorchLightning/pytorch-lightning: 0.7. 6 release. Zenodo; 2020.

  12. Fan X-Q, et al. I-DNAN6mA: Accurate Identification of DNA N6-Methyladenine Sites Using the Base-Pairing Map and Deep Learning. J Chem Inf Model. 2023;63(3):1076–86.


  13. Fan XQ., et al. Ense-i6mA: Identification of DNA N6-methyl-adenine Sites Using XGB-RFE Feature Se-lection and Ensemble Machine Learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2024.

  14. Fan X, Lin B, Guo Z. Infrared Polarization-Empowered Full-Time Road Detection via Lightweight Multi-Pathway Collaborative 2D/3D Convolutional Networks. 2024. IEEE Transactions on Intelligent Transportation Systems.

  15. Fang X, et al. Geometry-enhanced molecular representation learning for property prediction. Nature Machine Intelligence. 2022;4(2):127–34.


  16. Fang Y, et al. Knowledge graph-enhanced molecular contrastive learning with functional prompt. Nature Machine Intelligence. 2023;5(5):542–53.


  17. Fey M, Lenssen JE. Fast graph representation learning with PyTorch Geometric. arXiv preprint. arXiv:1903.02428. 2019

  18. Gasteiger J, Groß J, Günnemann S. Directional Message Passing for Molecular Graphs. In, International Conference on Learning Representations. 2020.

  19. Gilmer J, et al. Neural message passing for quantum chemistry. In, International conference on machine learning. PMLR; 2017;1263–1272.

  20. Gómez-Bombarelli R, et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent Sci. 2018;4(2):268–76.


  21. Hansch C, Fujita T. p-σ-π Analysis. A Method for the Correlation of Biological Activity and Chemical Structure. J Amer Chem Soc. 1964;86(8):1616–26.


  22. Hu* W, et al. Strategies for Pre-training Graph Neural Networks. In, International Conference on Learning Representations. 2020.

  23. Huang R, et al. Tox21Challenge to build predictive models of nuclear receptor and stress response pathways as mediated by exposure to environmental chemicals and drugs. Front Environ Sci. 2016;3:85.


  24. Irwin JJ, et al. ZINC20—a free ultralarge-scale chemical database for ligand discovery. J Chem Inf Model. 2020;60(12):6065–73.


  25. Jaeger S, Fulle S, Turk S. Mol2vec: unsupervised machine learning approach with chemical intuition. J Chem Inf Model. 2018;58(1):27–35.


  26. Jiang D, et al. Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models. Journal of cheminformatics. 2021;13:1–23.


  27. Jiang M, et al. Drug–target affinity prediction using graph neural network and contact maps. RSC Adv. 2020;10(35):20701–12.


  28. Karimi M, et al. DeepAffinity: interpretable deep learning of compound–protein affinity through unified recurrent and convolutional neural networks. Bioinformatics. 2019;35(18):3329–38.


  29. Kearnes S, et al. Molecular graph convolutions: moving beyond fingerprints. J Comput Aided Mol Des. 2016;30:595–608.


  30. Kim S, et al. PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Res. 2021;49(D1):D1388–95.


  31. Kipf TN, Welling M. Semi-Supervised Classification with Graph Convolutional Networks. In, International Conference on Learning Representations. 2017.

  32. Landrum G. Rdkit documentation Release. 2013;1(1–79):4.


  33. Li M, Soltanolkotabi M, Oymak S. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In, International conference on artificial intelligence and statistics. PMLR; 2020. p. 4313–4324.

  34. Li P, et al. TrimNet: learning molecular representation from triplet messages for biomedicine. Brief Bioinform. 2021;22(4):bbaa266.


  35. Li P, et al. An effective self-supervised framework for learning expressive molecular global representations to drug discovery. Brief Bioinform. 2021;22(6):bbab109.


  36. Li S et al. Geomgcl: Geometric graph contrastive learning for molecular property prediction. In, Proceedings of the AAAI conference on artificial intelligence. 2022. p. 4541–4549.

  37. Lin X, et al. Interpretable multi-view attention network for drug-drug interaction prediction. In, 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE; 2023. p. 422–427.

  38. Liu H, et al. Attention-wise masked graph contrastive learning for predicting molecular property. Brief Bioinform. 2022;23(5):bbac303.


  39. Lu C, et al. Molecular property prediction: A multilevel quantum interactions modeling perspective. In, Proceedings of the AAAI conference on artificial intelligence. 2019. p. 1052–1060.

  40. Ma H, et al. Cross-dependent graph neural networks for molecular property prediction. Bioinformatics. 2022;38(7):2003–9.


  41. Maintainers, T. and Contributors. 2016. TorchVision: PyTorch's Computer Vision library. https://github.com/pytorch/vision

  42. Martins IF, et al. A Bayesian approach to in silico blood-brain barrier penetration modeling. J Chem Inf Model. 2012;52(6):1686–97.


  43. Mayr A, et al. DeepTox: toxicity prediction using deep learning. Front Environ Sci. 2016;3:80.


  44. Nguyen T, et al. GraphDTA: predicting drug–target binding affinity with graph neural networks. Bioinformatics. 2021;37(8):1140–7.


  45. Öztürk H, Özgür A, Ozkirimli E. DeepDTA: deep drug–target binding affinity prediction. Bioinformatics. 2018;34(17):i821–9.


  46. Prechelt L. Early stopping-but when? In: Neural Networks: Tricks of the Trade. Springer; 2002. p. 55–69.


  47. Ramsundar B et al. Deep Learning for the Life Sciences. O'Reilly Media.

  48. Richard AM, et al. ToxCast chemical landscape: paving the road to 21st century toxicology. Chem Res Toxicol. 2016;29(8):1225–51.


  49. Rogers D, Hahn M. Extended-connectivity fingerprints. J Chem Inf Model. 2010;50(5):742–54.


  50. Rong Y, et al. Self-supervised graph transformer on large-scale molecular data. Adv Neural Inf Process Syst. 2020;33:12559–71.


  51. Schütt K, et al. Schnet: A continuous-filter convolutional neural network for modeling quantum interactions. Adv Neural Inform Process Syst. 2017;30.

  52. Somnath VR, et al. Learning graph models for retrosynthesis prediction. Adv Neural Inf Process Syst. 2021;34:9405–15.


  53. Steinbeck C, et al. The Chemistry Development Kit (CDK): An open-source Java library for chemo-and bioinformatics. J Chem Inf Comput Sci. 2003;43(2):493–500.


  54. Sun R, Dai H, Yu AW. Does gnn pretraining help molecular representation? Adv Neural Inf Process Syst. 2022;35:12096–109.


  55. Tang B, et al. A self-attention based message passing neural network for predicting molecular lipophilicity and aqueous solubility. Journal of cheminformatics. 2020;12:1–9.


  56. Torng W, Altman RB. Graph convolutional neural networks for predicting drug-target interactions. J Chem Inf Model. 2019;59(10):4131–49.


  57. Veličković P, et al. Graph Attention Networks. In, International Conference on Learning Representations. 2018.

  58. Wang, S., et al. Smiles-bert: large scale unsupervised pre-training for molecular property prediction. In, Proceedings of the 10th ACM international conference on bioinformatics, computational biology and health informatics. 2019. p. 429–436.

  59. Wang Z, et al. Advanced graph and sequence neural networks for molecular property prediction and drug discovery. Bioinformatics. 2022;38(9):2579–86.


  60. Withnall M, et al. Building attention and edge message passing neural networks for bioactivity and physical–chemical property prediction. Journal of cheminformatics. 2020;12(1):1.


  61. Wu Z, et al. MoleculeNet: a benchmark for molecular machine learning. Chem Sci. 2018;9(2):513–30.


  62. Xiong Z, et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J Med Chem. 2019;63(16):8749–60.


  63. Yang K, et al. Analyzing learned molecular representations for property prediction. J Chem Inf Model. 2019;59(8):3370–88.


  64. Yang N, et al. Learning substructure invariance for out-of-distribution molecular representations. Adv Neural Inf Process Syst. 2022;35:12964–78.


  65. Yap CW. PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. J Comput Chem. 2011;32(7):1466–74.


  66. Zaslavskiy M, et al. ToxicBlend: Virtual screening of toxic compounds with ensemble predictors. Computational Toxicology. 2019;10:81–8.


  67. Zeng Y, et al. Deep drug-target binding affinity prediction with multiple attention blocks. Briefings in bioinformatics. 2021;22(5):bbab117.


  68. Zhang Z, Guan J, Zhou S. FraGAT: a fragment-oriented multi-scale graph attention model for molecular property prediction. Bioinformatics. 2021;37(18):2981–7.


  69. Zhang Z, et al. Motif-based graph self-supervised learning for molecular property prediction. Adv Neural Inf Process Syst. 2021;34:15870–82.


  70. Zhao Q, et al. AttentionDTA: prediction of drug–target binding affinity using attention model. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE; 2019. p. 64–9.



Funding

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea Government (MSIT) [No. 2020–0-01373, Artificial Intelligence Graduate School Program (Hanyang University)] and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT)(RS-2023–00217123).

Author information


Contributions

MR and HM performed conceptualization and developed the model. HM implemented the algorithm. MR and HM evaluated the model and wrote the manuscript.

Corresponding author

Correspondence to Mina Rho.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Moon, H., Rho, M. MultiChem: predicting chemical properties using multi-view graph attention network. BioData Mining 18, 4 (2025). https://doi.org/10.1186/s13040-024-00419-4
