In biotechnology and chemical research, it is common to test the effects of chemicals or materials on animals before humans. Although these studies are controversial (animals may be harmed or killed in the process), they have proven to be the most effective method in certain areas of research, especially toxicology.
But now, as machine learning changes the technology landscape, bio researchers are using ML and related fields to develop an alternative testing technique. If successful in the long run, computers and ML could completely replace animal studies in chemical safety projects.
Combining Machine Learning And Big Data For RASAR
Professor Thomas Hartung and team from Johns Hopkins University designed a novel method called read-across structure activity relationships (abbreviated as ‘RASAR’), which builds on the earlier read-across approaches. The technique combines chemical similarity with supervised learning.
Chemical similarity is determined in two steps: first by computing binary fingerprints for the chemicals, then by using the Jaccard distance to measure similarity between these fingerprints. Hartung and team explain why ML has a significant impact once chemical similarity has been established.
Supervised learning methods then provide a statistical model of the insights deliverable from chemical similarity. Due to automation, the approach can be applied to large datasets and thus validated according to common statistical methods such as cross-validation. Supervised learning models built on chemical similarity also allow assignment of confidence to individual predictions.
This means that ML could draw on the similarities among a large number of chemicals whose toxicity information is already collected in a database, rather than relying on extensive animal testing.
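The fingerprint comparison described above can be illustrated with a minimal Python sketch. It assumes each binary fingerprint is stored as the set of its "on" bit positions; the function names and the three example fingerprints are hypothetical and are not taken from the study.

```python
def jaccard_similarity(fp_a: set, fp_b: set) -> float:
    """Share of 'on' bits that two fingerprints have in common."""
    union = fp_a | fp_b
    if not union:
        return 1.0  # convention (assumption): two empty fingerprints count as identical
    return len(fp_a & fp_b) / len(union)


def jaccard_distance(fp_a: set, fp_b: set) -> float:
    """Jaccard distance is one minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(fp_a, fp_b)


# Hypothetical fingerprints, each listing the positions of its set bits.
chem_a = {1, 4, 7, 9}
chem_b = {1, 4, 8, 9}
chem_c = {2, 3, 5}

print(jaccard_similarity(chem_a, chem_b))  # 3 shared bits out of 5 in total -> 0.6
print(jaccard_distance(chem_a, chem_c))    # no shared bits -> 1.0
```

A distance near 0 marks two chemicals as close structural analogues, while a distance of 1 means their fingerprints share nothing.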
To demonstrate this, the researchers built an ML model called ‘Simple RASAR’, based on logistic regression, to predict the hazards of each chemical from its similarities to other chemicals. Each chemical is labelled either negative (not hazardous) or positive (hazardous) by referring to similar chemicals in the database.
The model was tested on data from the European Union’s REACH (Registration, Evaluation, Authorisation and Restriction of Chemicals) regulation using nine standard toxicology methods. Before testing Simple RASAR, the researchers also evaluated the reproducibility of tests that adhere to OECD guidelines on chemical testing (this is a separate workflow in the study). Apart from this, the effect of structural analogues was also studied.
For RASAR’s database, chemical-related data from REACH was collected together with data from PubChem. Over 80,000 chemicals were analysed, which resulted in more than 800,000 chemical labels. These labels were inferred from other standard chemical hazard regulations.
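As a rough illustration of how logistic regression can turn similarity information into a hazard prediction, here is a small pure-Python sketch. It assumes each chemical is summarised by two hypothetical features (similarity to its closest hazardous analogue and to its closest non-hazardous analogue); the toy data, learning rate and function names are illustrative assumptions, and the study itself used the spark.ml library rather than this hand-written training loop.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x) -> float:
    """Estimated probability that the chemical with features x is hazardous."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Plain batch gradient descent on the logistic loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy features (assumed): [similarity to nearest positive, similarity to nearest negative]
X = [[0.9, 0.2], [0.8, 0.3], [0.7, 0.1], [0.2, 0.9], [0.3, 0.8], [0.1, 0.7]]
y = [1, 1, 1, 0, 0, 0]  # 1 = hazardous, 0 = not hazardous

w, b = train_logistic_regression(X, y)
p_hazard = predict_proba(w, b, [0.85, 0.15])
print(f"estimated hazard probability: {p_hazard:.2f}")
```

The predicted probability can double as a confidence value for an individual prediction: values close to 0 or 1 indicate a confident call, values near 0.5 an uncertain one.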
Simple RASAR And Data Fusion RASAR
Before the ML models are trained, the RASARs are constructed in two steps.
- Unsupervised Learning
- Supervised Learning
In the first step, chemical similarities are established through locality-sensitive hashing methods. This creates local graphs for all the chemicals, which in turn are used to generate feature vectors through k-nearest neighbours. The second step applies supervised learning (logistic regression in this case) to the feature vectors produced by the first step. This forms the core of Simple RASAR, in which logistic regression acts as the aggregation function. Simple RASAR therefore works with 2D vectors built from the positive and negative chemical analogues, while Data Fusion RASAR extends this model by training a random forest on the generated vectors.
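The idea of a 2D vector built from positive and negative analogues can be sketched as follows. This is an assumption-laden simplification: it uses a brute-force search over a tiny hypothetical labelled set instead of locality-sensitive hashing, and the helper name rasar_features is invented for illustration.

```python
def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def rasar_features(query_fp: set, labelled: list) -> list:
    """Return [similarity to closest positive analogue, similarity to closest negative analogue].

    Brute-force scan (assumption); a large database would need an
    approximate search such as locality-sensitive hashing instead.
    """
    best_pos = max((jaccard(query_fp, fp) for fp, label in labelled if label == 1), default=0.0)
    best_neg = max((jaccard(query_fp, fp) for fp, label in labelled if label == 0), default=0.0)
    return [best_pos, best_neg]


# Hypothetical database of (fingerprint, label) pairs: 1 = hazardous, 0 = not hazardous.
labelled = [
    ({1, 2, 3, 4}, 1),
    ({1, 2, 5}, 1),
    ({6, 7, 8}, 0),
    ({3, 6, 7}, 0),
]

features = rasar_features({1, 2, 3}, labelled)
print(features)  # [0.75, 0.2]
```

Here the query chemical sits much closer to a known hazardous analogue (0.75) than to any non-hazardous one (0.2), which is the kind of signal a downstream classifier can aggregate.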
Model Training And Evaluation
Here, the feature vectors differ between the two models, which is why their supervised learning models also differ (logistic regression in Simple RASAR and random forests in Data Fusion RASAR). The study used the spark.ml library package to train these models. The models were trained with more than 300,000 iterations and then evaluated through five-fold cross-validation.
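Five-fold cross-validation itself is simple to sketch in plain Python. The example below is only an illustration of the evaluation scheme: it does not use spark.ml, and the stand-in threshold classifier, toy data and function names are all assumptions rather than the study's models.

```python
import random


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0) -> list:
    """Shuffle the sample indices and deal them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(X, y, fit, predict, k: int = 5, seed: int = 0) -> list:
    """Train on k-1 folds, score accuracy on the held-out fold, repeat k times."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores


# Stand-in classifier (assumption): threshold on the difference between the two features.
def fit_threshold(X_train, y_train) -> float:
    pos = [x[0] - x[1] for x, label in zip(X_train, y_train) if label == 1]
    neg = [x[0] - x[1] for x, label in zip(X_train, y_train) if label == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_threshold(threshold: float, x) -> int:
    return 1 if x[0] - x[1] > threshold else 0


# Toy, cleanly separable data: 10 hazardous and 10 non-hazardous chemicals.
X = [[0.6 + 0.03 * i, 0.2] for i in range(10)] + [[0.2, 0.6 + 0.03 * i] for i in range(10)]
y = [1] * 10 + [0] * 10

scores = cross_validate(X, y, fit_threshold, predict_threshold)
print(scores, sum(scores) / len(scores))
```

Because every chemical is held out exactly once, the mean of the five scores estimates how the model performs on chemicals it has not seen during training.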
Conclusion
In the study, three main results are given. The first is test reproducibility with respect to the OECD guidelines, while the second and third concern the Simple RASAR and Data Fusion RASAR models respectively. Reproducibility was good both in overall accuracy and in chemical specificity (more than 90 percent accurate). This puts it on par with animal tests.
Similarly, Simple RASAR and Data Fusion RASAR achieve an accuracy in the range of 80-95 percent under cross-validation. All of this suggests that ML is close to predicting chemical hazards by comparing the properties of a large collection of dangerous chemicals.
This new study is a good example of how ML could make animal tests redundant, saving both cost and time. However, there is still a lot to achieve before it becomes a full-fledged method.