Enhancing Predictive Accuracy of Biomolecules Partition Coefficients in Aqueous Two-Phase Systems Using Machine Learning

Abstract

The pharmaceutical and biotech industries have experienced remarkable growth in recent years. Biological processes consist of two main stages: upstream and downstream processes. Downstream processes aim to recover and purify chemical products and account for 50 to 80 percent of the total production cost [1]. Extraction, a stage within downstream processing, employs methods such as liquid-liquid extraction. One type of liquid-liquid extraction system is the aqueous two-phase system (ATPS) [2]. Because a significant portion of an aqueous two-phase system is water, it is used to purify biological molecules such as drugs and proteins. This study uses machine learning methods to predict the partition coefficient of drugs in aqueous two-phase systems [3]. The database used in this study comprises data collected from previously published articles that experimentally determined information about the components of the aqueous two-phase system, together with details about the chemical structure and physical properties of the drugs. The study aims to investigate how the properties of drugs affect their partition coefficient in the aqueous two-phase system. To represent chemical structure, binary Morgan fingerprints, count-based Morgan fingerprints, and graph convolution were used. Additionally, physical properties such as melting point, density, and log P were considered. To predict the partition coefficient of drugs in the aqueous two-phase system, various machine learning models were employed, including Random Forests, artificial neural networks (ANN), and ensemble methods. The results show that drug properties have a significant influence on partition coefficient prediction. The best performance is obtained by combining the physical and chemical properties of drugs using the count-based Morgan fingerprint representation. Among the models, the ensemble method performs better than the others. This model achieves an MSE of 0.0079, an MAE of 0.057, an RMSD of 0.0888, and an R² value of 0.84 on the test data.
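For readers unfamiliar with the four reported metrics, the following is a minimal sketch of how MSE, MAE, RMSD, and R² are conventionally computed from true and predicted values. The function name `regression_metrics` and the sample numbers are hypothetical and chosen only for illustration; they are not the study's data or code. Since RMSD is the square root of MSE, the reported values (0.0079 and 0.0888) are consistent with each other up to rounding.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, MAE, RMSD and R^2 for two equal-length sequences.

    Hypothetical helper for illustration; not taken from the study.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    residuals = [t - p for t, p in zip(y_true, y_pred)]

    # Mean squared error and mean absolute error.
    mse = sum(r * r for r in residuals) / n
    mae = sum(abs(r) for r in residuals) / n

    # Root-mean-square deviation is the square root of the MSE.
    rmsd = math.sqrt(mse)

    # Coefficient of determination: 1 - SS_res / SS_tot.
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    r2 = 1.0 - ss_res / ss_tot

    return {"mse": mse, "mae": mae, "rmsd": rmsd, "r2": r2}


# Made-up partition-coefficient values, for demonstration only.
example_true = [0.1, 0.5, 0.9, 1.3]
example_pred = [0.2, 0.4, 1.0, 1.2]
example_metrics = regression_metrics(example_true, example_pred)
print(example_metrics)
```

In practice, a library such as scikit-learn provides equivalent functions; the plain-Python version is shown here only to make the definitions explicit.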

Publication
3rd International & 12th Iranian Conference on Bioinformatics
Fatemeh Zare-Mirakabad
Associate Professor

My research interests include bioinformatics, computational biology and artificial intelligence.

Mahsa Sa'adat
Postdoctoral researcher

Ph.D. candidate in Computer Science specializing in Soft Computing and Artificial Intelligence, with a strong focus on bioinformatics, immunoinformatics, and computational drug design. Experienced in teaching, academic leadership, and organizing international conferences. Passionate about employing AI and machine learning to solve complex problems in healthcare and biology.

Zahra Ghorbanali
Assistant professor