Publications

Drug-Target Binding Affinity Prediction Using Transformers

Drug discovery is generally difficult, expensive, and low success rate. One of the essential steps in the early stages of drug discovery and drug repurposing is identifying drug-target interactions. Binding affinity indicates the strength of drug-target pair interactions. In this regard, several computational methods have been developed to predict the drug-target binding affinity, and the input representation of these models has been shown to be very effective in improving accuracy. Although the recent models predict binding affinity more accurate than the first ones, they need the structure of target proteins. Despite the strong interest in protein structure, there is a massive gap between known sequences and experimentally determined structures. Therefore, finding an appropriate presentation for drug and protein sequences is vital for drug-target binding affinity prediction. In this paper, our primary goal is to assess the drug …

Download PDF Journal Link

DRP-VEM: Drug repositioning prediction using voting ensemble

Traditional drug discovery methods are costly and time-consuming. Drug repositioning (DR) is a common strategy to overcome these issues. Recently, machine learning methods have been used extensively in DR problem. The performance of these methods depends on the features, representations and training dataset. In this problem, feature sets include many redundant features, which have a negative effect on the performance of methods. Moreover, selecting an appropriate training set is influential in the rise of machine learning method accuracy. However, in this problem, we face two obstacles to find the proper training set. First, most methods employ known and unknown drug-disease pairs as positive and negative sets, respectively. While the number of known pairs is much less than unknowns, it leads to machine learning performance error because of biasing to the majority group. Second, the absence of a drug-disease association means this association has not been approved experimentally and may be changed. In this paper, DRP-VEM framework is proposed to overcome the challenges. We assess DRP-VEM based on different parameters: disease and drug feature representations, classification methods, and voting ensemble training approaches. DRP-VEM is evaluated using heterogenous evaluation criteria. Moreover, we compare DRP-VEM using the best combination of parameters with DisDrugPred.

Download PDF Journal Link

The assessment of essential genes in the stability of PPI networks using critical node detection problem

Essential genes and proteins as their products encode the basic functions of a cell in a variety of conditions and are vital for the survival of a cell. Analyzing the characteristics of these proteins provides important biological information. An interesting analysis is to demonstrate the correlation between topological importance of a protein in protein-protein interaction networks and its essentiality. Different centrality criteria such as degree, betweenness, closeness, and eigenvector centralities are used to investigate such a correlation. Despite the remarkable results obtained by these methods, it is shown that the centrality criteria in scale-free networks show a high level of correlations which indicate that they share similar topological information of the networks. In this paper, we use a different approach for analyzing this correlation and use a well-known problem in the field of graph theory, Critical Node Detection Problem …

Journal Link

EIA-CNDP: An exact iterative algorithm for critical node detection problem

In designing reliable and impermeable networks, the robustness of the network is evaluated against the removal and failure of the node or edge where the network robustness (network connectivity) is measured using various metrics (objective functions) such as the number of connected components, size of the largest connected component, and pairwise connectivity. Critical node detection problem (CNDP) is one of the main issues in this literature, which aims to find a set of vertices whose removal maximizes or minimizes some objective function. In this paper, the focus is on solving CNDP, considering the size of the largest connected component as its objective function. In this regard, we introduce a new problem called K-Group-Division-Problem and present a mixed integer linear programming model to solve it. We prove that under certain circumstances, any optimal solution of the new problem is also an optimal …

Journal Link

The assessment of similarity vectors of fingerprint and UMLS in adverse drug reaction prediction

Identifying and controlling adverse drug reactions is a complex problem in the pharmacological field. Despite the studies done in different laboratory stages, some adverse drug reactions are recognized after being released, such as Rosiglitazone. Due to such experiences, pharmacists are now more interested in using computational methods to predict adverse drug reactions. In computational methods, finding and representing appropriate drug and adverse reaction features are one of the most critical challenges. Here, we assess fingerprint and target as drug features; and phenotype and unified medical language system as adverse reaction features to predict adverse drug reaction. Meanwhile, we show that drug and adverse reaction features represented by similarity vectors can improve adverse drug prediction. In this regard, we propose four frameworks. Two frameworks are based on random forest classification and neural networks as machine learning methods called F_RF and F_NN, respectively. The rest of them improve two state-of-art matrix factorization models, CS and TMF, by considering target as a drug feature and phenotype as an adverse reaction feature. However, machine learning frameworks with fewer drug and adverse reaction features are more accurate than matrix factorization frameworks. In addition, the F_RF framework performs significantly better than F_NN with ACC = %89.15, AUC = %96.14 and AUPRC = %92.9. Next, we contrast F_RF with some well-known models designed based on similarity vectors of drug and adverse reaction features. Unlike other methods, we do not remove rare reactions from the data set in …

Download PDF Journal Link

TranDTA: Prediction Of Drug Target Binding Affinity Using Transformer Representations

Drug discovery is generally difficult, expensive and the success rate is low. One of the essential steps in the early stages of drug discovery and drug repurposing is identifying drug target interactions. Although several methods developed use binary classification to predict if the interaction between a drug and its target exists or not, it is more informative and challenging to predict the strength of the binding between a drug and its target. Binding affinity indicates the strength of drug-target pair interactions. In this regard, several computational methods have been developed to predict the drug-target binding affinity. With the advent of deep learning methods, the accuracy of binding affinity prediction is improving. However, the input representation of these models is very effective in the result. The early models only use the sequence of molecules and the latter models focus on the structure of them. Although the recent models predict binding affinity more accurate than the first ones, they need more data and resources for training. In this study, we present a method that uses a pre-trained transformer to represent the protein as model input. Although pretrained transformer extracts a feature vector of the protein sequence, they can learn structural information in layers and heads. So, the extracted feature vector by transformer includes the sequence and structural properties of protein. Therefore, our method can also be run without limitations on resources (memory, CPU and GPU). The results show that our model achieves a competitive performance with the state-of-art models. Data and trained model is available at https://bioinformatics.aut.ac.ir/TranDTA/ .

Download PDF Journal Link

Protein sequence profile prediction using ProtAlbert transformer1

Protein sequences can be viewed as a language; therefore, we benefit from using the models initially developed for natural languages such as transformers. ProtAlbert is one of the best pre-trained transformers on protein sequences, and its efficiency enables us to run the model on longer sequences with less computation power while having similar performance with the other pre-trained transformers. This paper includes two main parts: transformer analysis and profile prediction. In the first part, we propose five algorithms to assess the attention heads in different layers of ProtAlbert for five protein characteristics, nearest-neighbor interactions, type of amino acids, biochemical and biophysical properties of amino acids, protein secondary structure, and protein tertiary structure. These algorithms are performed on 55 proteins extracted from CASP13 and three case study proteins whose sequences, experimental tertiary structures, and HSSP profiles are available. This assessment shows that although the model is only pre-trained on protein sequences, attention heads in the layers of ProtAlbert are representative of some protein family characteristics. This conclusion leads to the second part of our work. We propose an algorithm called PA_SPP for protein sequence profile prediction by pre-trained ProtAlbert using masked-language modeling. PA_SPP algorithm can help the researchers to predict an HSSP profile while there are no similar sequences to a query sequence in the database for making the HSSP profile.

Download PDF Journal Link

The outward shift of clarithromycin binding to the ribosome in mutant Helicobacter pylori strains

Disruption of protein synthesis, by drug‐mediated restriction of the ribosomal nascent peptide exit tunnel (NPET), may inhibit bacterial growth. Here, we have studied the secondary and tertiary structures of domain V of the 23S rRNA in the wild‐type and mutant (resistant) H. pylori strains and their mechanisms of interaction with clarithromycin (CLA).
H pylori strains, isolated from cultured gastric biopsies, underwent CLA susceptibility testing by E test, followed by PCR amplification and sequencing of domain V of 23S rRNA. The homology model of this domain in H pylori, in complex with L4 and L22 accessory proteins, was determined based on the E. coli ribosome 3D structure. The interactions between CLA and 23S rRNA complex were determined by molecular docking studies.
Of the 70 H pylori strains, isolated from 200 dyspeptic patients, 11 (16%) were CLA‐resistant. DNA sequencing identified categories with no (A), A2142G (B), and A2143G (C) mutations. Docking studies of our homology model of 23S rRNA complex with CLA showed deviated positions for categories B and C, in reference to category A, with 12.19 Å and 7.92 Å RMSD values, respectively. In both mutant categories, CLA lost its interactions at positions 2142 and 2587 and gained two new bonds with the L4 accessory protein.
Our data suggest that, in mutant H pylori strains, once the nucleotides at positions 2142 and 2587 are detached from the drug, CLA interacts with and is peeled back by the L4 accessory protein, removing the drug‐imposed spatial restriction of the NPET.

Journal Link

The Assessment of Histone Acetylation marks in the Vicinity of Transcription Factor Binding Sites in Human CD4+ T Cells using Information Theory Methods

The genetic information encoded in structural genes is decoded by an intracellular process called gene expression. This mechanism is regulated by epigenetic processes such as histone acetylation. Histone acetylation, which happens in nucleosomes, exposes DNA (genome) to transcription factors. Therefore, the correlation between histone acetylation and gene expression has been assessed as a fundamental issue in many previous studies. In the proposed research, we investigate which marks of histone acetylation are informative and which ones are redundant in the vicinity of SP1 transcription factor binding sites, in human CD4 + T cell. To achieve this, we use information theory methods. Subsequently, we apply a multilayer perceptron neural network to show that the selected histone acetylation marks by information theory methods are sufficiently informative. Finally, we use the neural network to predict binding sites of 17 other transcription factors on chromosomes 1 and 2. The results suggest that information conveyed by the selected histone acetylation marks are equivalent to that of all 18 marks associated with SP1 transcription factor binding sites on chromosome 1. Furthermore, almost 91.75 % of SP1 binding sites of chromosome 2 are predicted by the selected histone acetylation marks while all 18 marks predict 90.56 % correctly. Moreover, the selected histone acetylation marks are efficient at predicting 17 other types of transcription factor binding sites.

Journal Link

A new hash function and its use in read mapping on genome

Mapping reads onto genomes is an indispensable step in sequencing data analysis. A widely used method to speed up mapping is to index a genome by a hash table, in which genomic positions of k-mers are stored in the table. The hash table size increases exponentially with the k-mer length and thus the traditional hash function is not appropriate for a k-mer as long as a read. We present a hashing mechanism by two functions named score1 and score2 which can hash sequences with the length of reads. The size of hash table is directly proportional to the genome size, which is absolutely lower than that of hash table built by the conventional hash function. We evaluate our hashing system by developing a read mapper and running the mapper on E. coli genome with some simulated data sets. The results show that the high percentage of simulated reads can be mapped to correct locations on the genome.

Journal Link

Draft Genome Sequence of Pseudomonas aeruginosa Strain LMG 1272, an Atypical White Line Reaction Producer

The draft genome sequence of Pseudomonas aeruginosa LMG 1272, isolated from mushroom, is reported here. This strain triggers formation of a precipitate (“white line”) when cocultured with Pseudomonas tolaasii. However, LMG 1272 lacks the capacity to produce a cyclic lipopeptide that is typically associated with white line formation, suggesting the involvement of a different diffusible factor.

Journal Link

A Stochastic Model of DNA Double-Strand Breaks Repair Throughout the Cell Cycle

Cell cycle phase is a decisive factor in determining the repair pathway of DNA double-strand breaks (DSBs) by non-homologous end joining (NHEJ) or homologous recombination (HR). Recent experimental studies revealed that 53BP1 and BRCA1 are the key mediators of the DNA damage response (DDR) with antagonizing roles in choosing the appropriate DSB repair pathway in G1, S, and G2 phases. Here, we present a stochastic model of biochemical kinetics involved in detecting and repairing DNA DSBs induced by ionizing radiation during the cell cycle progression. A three-dimensional stochastic process is defined to monitor the cell cycle phase and DSBs repair at times after irradiation. To estimate the model parameters, a Metropolis Monte Carlo method is applied to perform maximum likelihood estimation utilizing the kinetics of γ-H2AX and RAD51 foci formation in G1, S, and G2 phases. The recruitment of DSB repair proteins is verified by comparing our model predictions with the corresponding experimental data on human cells after exposure to X and γ-radiation. Furthermore, the interaction between 53BP1 and BRCA1 is simulated for G1 and S/G2 phases determining the competition between NHEJ and HR pathways in repairing induced DSBs throughout the cell cycle. In accordance with recent biological data, the numerical results demonstrate that the maximum proportion of HR occurs in S phase cells and the high level of NHEJ takes place in G1 and G2 phases. Moreover, the stochastic realizations of the total yield of simple and complex DSBs ligation are compared for G1 and S/G2 damaged cells. Finally, the proposed stochastic model is validated when DSBs induced by different particle radiation such as iron, silicon, oxygen, proton, and carbon.

Journal Link

The assessment of efficient representation of drug features using deep learning for drug repositioning

De novo drug discovery is a time-consuming and expensive process. Nowadays, drug repositioning is utilized as a common strategy to discover a new drug indication for existing drugs. This strategy is mostly used in cases with a limited number of candidate pairs of drugs and diseases. In other words, they are not scalable to a large number of drugs and diseases. Most of the in-silico methods mainly focus on linear approaches while non-linear models are still scarce for new indication predictions. Therefore, applying non-linear computational approaches can offer an opportunity to predict possible drug repositioning candidates.
In this study, we present a non-linear method for drug repositioning. We extract four drug features and two disease features to find the semantic relations between drugs and diseases. We utilize deep learning to extract an efficient representation for each feature. These representations reduce the dimension and heterogeneity of biological data. Then, we assess the performance of different combinations of drug features to introduce a pipeline for drug repositioning. In the available database, there are different numbers of known drug-disease associations corresponding to each combination of drug features. Our assessment shows that as the numbers of drug features increase, the numbers of available drugs decrease. Thus, the proposed method with large numbers of drug features is as accurate as small numbers.
Our pipeline predicts new indications for existing drugs systematically, in a more cost-effective way and shorter timeline. We assess the pipeline to discover the potential drug-disease associations based on cross-validation experiments and some clinical trial studies.

Journal Link

Disease gene prioritization using network topological analysis from a sequence based human functional linkage network

Sequencing large number of candidate disease genes which cause diseases in order to identify the relationship between them is an expensive and time-consuming task. To handle these challenges, different computational approaches have been developed. Based on the observation that genes associated with similar diseases have a higher likelihood of interaction, a large class of these approaches relay on analyzing the topological properties of biological networks. However, the incomplete and noisy nature of biological networks is known as an important challenge in these approaches. In this paper, we propose a two-step framework for disease gene prioritization:(1) construction of a reliable human FLN using sequence information and machine learning techniques,(2) prioritizing the disease gene relations based on the constructed FLN. On our framework, unlike other FLN based frameworks that using FLNs based on integration of various low quality biological data, the sequence of proteins is used as the comprehensive data to construct a reliable initial network. In addition, the physicochemical properties of amino-acids are employed to describe the functionality of proteins. All in all, the proposed approach is evaluated and the results indicate the high efficiency and validity of the FLN in disease gene prioritization.

Journal Link

Genetic algorithm for dyad pattern finding in DNA sequences

In this paper a novel genetic algorithm is presented for the dyad motif finding problem. The genetic algorithm uses a multi-objective fitness function based on the sum of pairs, the number of matches, and the information content. The individuals required for the population pool in the genetic algorithm are optimized by Gibbs sampling method. Also, new crossover and mutation operators are designed. The algorithm is implemented and tested on the different types of real datasets. The results are compared with other well-known algorithms and the effectiveness of our algorithm is shown.

Journal Link

A CONSTANT TIME ALGORITHM FOR DNA ADD

We present a new molecular algorithm for adding two binary numbers with n bits. Without considering the generation of input, this algorithm can be performed in O(1) in a test tube using O(n) different types of DNA strands, and the output can be detected in O(n). The output strands, prior to read out operation, can serve as the input strands for another round of addition. The algorithm can be easily extended to any other logical operation, and even for adding two decimal numbers.

Journal Link

Finding Motifs based on suffix tree

Pattern discovery or motif finding is one of the most challenging problems in both molecular biology and computer science. In this paper we present an exact exhaustive method, for finding motifs of length ℓ in a set of t sequences of length n with a limited number of mutations d. The algorithm is based on the Depth First Search on a suffix trie with maximum nodes O(tn) and is performed in O(t 2n 2 ℓ 2 ) time complexity. The proposed algorithm is tested on yeast and human transcription factor binding site data sets and the obtained results are compared to the other well-known algorithms. The experimental results demonstrate that the proposed method is working analogous to them algorithms.

Journal Link

RNAComp: A new method for RNA secondary structure alignment

In this paper, a new algorithm for alignment of two RNA secondary structures without pseudoknots is presented. The algorithm is based on finding the longest common sub-structures between two RNA structures and special effort is devoted for aligning the beginning the end parts of the existing stems (base pairs) in the secondary structures of two RNAs. The results of structure alignment of different types of RNA by this algorithm are obtained, and the result of this algorithm show more consistency with the models in the evolution rather than the other existing structure alignment algorithms

Journal Link

New scoring scheme for finding motifs in DNA sequences

Pattern discovery in DNA sequences is one of the most fundamental problems in molecular biology with important applications in finding regulatory signals and transcription factor binding sites. An important task in this problem is to search (or predict) known binding sites in a new DNA sequence. For this reason, all subsequences of the given DNA sequence are scored based on an scoring function and the prediction is done by selecting the best score. By assuming no dependency between binding site base positions, most of the available tools for known binding site prediction are designed. Recently Tomovic and Oakeley investigated the statistical basis for either a claim of dependence or independence, to determine whether such a claim is generally true, and they presented a scoring function for binding site prediction based on the dependency between binding site base positions. Our primary objective is to investigate the scoring functions which can be used in known binding site prediction based on the assumption of dependency or independency in binding site base positions.

Journal Link

Parallel generation of the biological trees 1

Several biological features are presented by different types of trees. Two types of such trees are considered in this paper. the first type is trees with n external nodes that each internal node have at least two children, and are used in neuro-science and called neuronal dendritic trees. The second type is trees with n internal nodes and m external nodes. This type of trees represent the secondary structure of RNA sequences, and called RNA trees. In this paper, we present two new parallel algorithms for generation of these two biological trees. Both algorithms are adoptive and cost-optimal and generate the trees in B-order. Computations run in an SM SIMD model.

Journal Link

The assessment of efficient representation of drug features using deep learning for drug repositioning

De novo drug discovery is a time-consuming and expensive process. Nowadays, drug repositioning is utilized as a common strategy to discover a new drug indication for existing drugs. This strategy is mostly used in cases with a limited number of candidate pairs of drugs and diseases. In other words, they are not scalable to a large number of drugs and diseases. Most of the in-silico methods mainly focus on linear approaches while non-linear models are still scarce for new indication predictions. Therefore, applying non-linear computational approaches can offer an opportunity to predict possible drug repositioning candidates. In this study, we present a non-linear method for drug repositioning. We extract four drug features and two disease features to find the semantic relations between drugs and diseases. We utilize deep learning to extract an efficient representation for each feature. These representations reduce the dimension and heterogeneity of biological data. Then, we assess the performance of different combinations of drug features to introduce a pipeline for drug repositioning. In the available database, there are different numbers of known drug-disease associations corresponding to each combination of drug features. Our assessment shows that as the numbers of drug features increase, the numbers of available drugs decrease. Thus, the proposed method with large numbers of drug features is as accurate as small numbers. Our pipeline predicts new indications for existing drugs systematically, in a more cost-effective way and shorter timeline. We assess the pipeline to discover the potential drug-disease associations based on cross-validation experiments and some clinical trial studies.

Journal Link

Assessing the impact of exact reads on reducing the error rate of read mapping

Nowadays, according to valuable resources of high-quality genome sequences, reference-based assembly methods with high accuracy and efficiency are strongly required. Many different algorithms have been designed for mapping reads onto a genome sequence which try to enhance the accuracy of reconstructed genomes. In this problem, one of the challenges occurs when some reads are aligned to multiple locations due to repetitive regions in the genomes. In this paper, our goal is to decrease the error rate of rebuilt genomes by resolving multi-mapping reads. To achieve this purpose, we reduce the search space for the reads which can be aligned against the genome with mismatches, insertions or deletions to decrease the probability of incorrect read mapping. We propose a pipeline divided to three steps: ExactMapping, InExactMapping, and MergingContigs, where exact and inexact reads are aligned in two separate phases. We test our pipeline on some simulated and real data sets by applying some read mappers. The results show that the two-step mapping of reads onto the contigs generated by a mapper such as Bowtie2, BWA and Yara is effective in improving the contigs in terms of error rate. Assessment results of our pipeline suggest that reducing the error rate of read mapping, not only can improve the genomes reconstructed by reference-based assembly in a reasonable running time, but can also have an impact on improving the genomes generated by de novo assembly. In fact, our pipeline produces genomes comparable to those of a multi-mapping reads resolution tool, namely MMR by decreasing the number of multi-mapping reads. Consequently, we introduce EIM as a post-processing step to genomes reconstructed by mappers.

Journal Link

S-FLN: A sequence-based hierarchical approach for functional linkage network construction

The functional linkage network (FLN) construction is a primary and important step in drug discovery and disease gene prioritization methods. In order to construct FLN, several methods have been introduced based on integration of various biological data. Although, there are impressive ideas behind these methods, they suffer from low quality of the biological data. In this paper, a hierarchical sequence-based approach is proposed to construct FLN. The proposed approach, denoted as S-FLN (Sequence-based Functional Linkage Network), uses the sequence of proteins as the primary data in three main steps. Firstly, the physicochemical properties of amino-acids are employed to describe the functionality of proteins. As the sequence of proteins is a more comprehensive and accurate primary data, more reliable relations are achieved. Secondly, seven different descriptor methods are used to extract feature vectors from the proteins sequences. Advantage of different descriptor methods lead to obtain diverse ensemble learners in the next step. Finally, a two-layer ensemble learning structure is proposed to calculated the score of protein pairs. The proposed approach has been evaluated using two biological datasets, S.Cerevisiae and H.Pylori, and resulted in 93.9% and 91.15% precision rates, respectively. The results of various experiments indicate the efficiency and validity of the proposed approach.

Journal Link

The effect of stochasticity on repair of DNA double strand breaks throughout non-homologous end joining pathway

DNA double strand breaks (DSBs) are the most lethal lesions of DNA induced by ionizing radiation, industrial chemicals and a wide variety of drugs used in chemotherapy. In the context of DNA damage response system modelling, uncertainty may arise in several ways such as number of induced DSBs, kinetic rates and measurement error in observable quantities. Therefore, using the stochastic approaches is imperative to gain further insight into the dynamic behaviour of DSBs repair process. In this article, a continuous-time Markov chain (CTMC) model of the non-homologous end joining (NHEJ) mechanism is formulated according to the DSB complexity. Additionally, a Metropolis Monte Carlo method is used to perform maximum likelihood estimation of the kinetic rate constants. Here, the effects of fluctuating kinetic rates and DSBs induction rate of the NHEJ mechanism are investigated. The stochastic realizations of the total yield of simple and complex DSBs ligation are simulated to compare their asymptotic dynamics. Furthermore, it has been proved that the total yield of DSBs has a normal distribution for sufficiently large number of DSBs. In order to estimate the expected duration of repairing DSBs, the probability distribution of DSBs lifetime is calculated based on the CTMC NHEJ model. Moreover, the variability of total yield of DSBs during constant low-dose radiation is evaluated in the presented model. The findings indicate that in stochastic NHEJ model, when there is no new DSBs induction through the repair process, all DSBs are eventually repaired. However, when DSBs are induced by constant low-dose radiation, a number of DSBs remains un-repaired.

Journal Link

Transcription Factor Binding Sites Prediction Based on Histone Acetylations using Neural Networks

Histone acetylation is one of the most important epigenetic processes that regulate gene expression. In other words, chromatin exposes DNA to transcription factors and gene regulators by histone tail acetylation in nucleosomes. There are some studies to show the relation between gene regulation and histone acetylation. In this paper, our main goal is to propose a computational method for transcription factor binding site prediction based on a pattern of 18 types of histone acetylations. In this regard, we analyze 18 types of histone acetylations near SP1 binding sites on Chromosome 1 in human CD4+T cells. The results show that 12 out of 18 marks are strongly correlated with transcription factor binding sites. Then, we implement a multilayer perceptron neural network with supervised training. This network is trained using binding sites of various transcription factors of SP1 in chromosome 1 and 18 types of histone acetylations near them. Finally, this network is applied for predicting binding sites of various transcription factors on chromosomes 1 and 2.

Journal Link

Evaluating the quality of SHAPE data simulated by k-mers for RNA structure prediction

Finding an effective measure to predict a more accurate RNA secondary structure is a challenging problem. In the last decade, an experimental method, known as selective 2′-hydroxyl acylation analyzed by primer extension (SHAPE), was proposed to measure the tendency of forming a base pair for almost all nucleotides in an RNA sequence. These SHAPE reactivities are then utilized to improve the accuracy of RNA structure prediction. Due to a significant impact of SHAPE reactivity and in order to reduce the experimental costs, we propose a new model called HL-k-mer. This model simulates the SHAPE reactivity for each nucleotide in an RNA sequence. This is done by fetching the SHAPE reactivities for all sub-sequences of length k (k-mers) appearing in helix and loop regions. For evaluating the quality of simulated SHAPE data, ESD-Fold method is used based on the SHAPE data simulated by the HL-k-mer model (k=2, 3, 4). Also, for further evaluation of simulated SHAPE data, three different methods are employed. We also extend this model to simulate the SHAPE data for the RNA pseudoknotted structure. The results indicate that the average accuracies of prediction using the SHAPE data simulated by our models (for k=2, 3) are higher compared to the experimental SHAPE data.

Journal Link

Dimensionality Reduction of Binding Site Sequence Data on Human Genome Using a Deep Autoencoder Neural Network

The use of genomic nucleotide sequences as biochemical signals in machine learning methods is possible by converting these sequences into numerical codes. This conversion results in an unrealistic increase in the dimension of the data and encounters some data analysis operations such as visualization and feature extraction with constraints. Therefore, one should use the dimensionality reduction technics in order to return the data to its real dimension. In this study, a deep autoencoder neural network has been used to reduce the dimension of binding site sequence data on the human genome. In order to determine whether the information of real data is preserved in compressed data, we perform a two-class classification using a support vector machine. The results show that information is almost entirely preserved in compression. Then, compressed data is used for visualization as well as feature selection by analysis of variance. The results show that the first, the tenth and eighth positions in the sequences are the most informative positions. While the majority of the previous works deal with gene expression data of microarrays and compare a few dimension reduction algorithms, this paper for the first time uses an autoencoder on nucleotide sequence data and provides a comprehensive comparison between the performance of the dimension reduction technics and machine learning algorithms.

Journal Link

RNA design using simulated SHAPE data

It has long been established that in addition to being involved in protein translation, RNA plays essential roles in numerous other cellular processes, including gene regulation and DNA replication. Such roles are known to be dictated by higher-order structures of RNA molecules. It is therefore of prime importance to find an RNA sequence that can fold to acquire a particular function that is desirable for use in pharmaceuticals and basic research. The challenge of finding an RNA sequence for a given structure is known as the RNA design problem. Although there are several algorithms to solve this problem, they mainly consider hard constraints, such as minimum free energy, to evaluate the predicted sequences. Recently, SHAPE data has emerged as a new soft constraint for RNA secondary structure prediction.

Journal Link

Evolutionary Algorithm for RNA Secondary Structure Prediction Based on Simulated SHAPE Data

Background Non-coding RNAs perform a wide range of functions inside the living cells that are related to their structures. Several algorithms have been proposed to predict RNA secondary structure based on minimum free energy. Low prediction accuracy of these algorithms indicates that free energy alone is not sufficient to predict the functional secondary structure. Recently, the obtained information from the SHAPE experiment greatly improves the accuracy of RNA secondary structure prediction by adding this information to the thermodynamic free energy as pseudo-free energy. Method In this paper, a new method is proposed to predict RNA secondary structure based on both free energy and SHAPE pseudo- free energy.

Journal Link

Evaluating the accuracy of splice site prediction based on integrating Jensen-Shannon divergence and a polynomial equation of order 2

Advances in DNA sequencing technology have caused generation of the vast amount of new sequence data. It is essential to understand the functions, features, and structures of every newly sequenced data. Analyzing sequence data by different methods could provide important information about the sequence data. One of the essential tasks for genome annotation is gene prediction that can help to understand the features and determine functions of the genes. One of the key steps towards correct gene structure prediction is accurate splice site detection. There are vast numbers of splice site prediction methods, however, a few of them can be incorporated in gene prediction modules because of their complexity. In this paper, a novel model is presented to recognize unknown splice sites in a new genome without using any prior knowledge. Our model is defined based on integrating Jensen-Shannon divergence and a polynomial equation of order 2. Finally, the proposed model is evaluated on Yeast’s genome to predict splice sites. The experimental results suggest that the proposed method is an effective approach for splice site prediction.

Journal Link

MOTIF FINDING IN UPSTREAM REGULATORY REGIONS OF CO-EXPRESSED GENES USING CUCKOO OPTIMIZATION ALGORITHM AND SIMULATED ANNEALING

In this paper, a novel hybrid algorithm is represented named SA-COAMF by using cuckoo optimization algorithm, simulated annealing and expected maximization for motif finding problem. SA-COAMF is very efficient in global optimal convergence. In the algorithm, two models of motif representation, consensus and probability matrix representations, are applied to take the advantage of them. SA-COAMF is run on experimental datasets (SCPD database). The results are compared with some well-known algorithms (GA- DPAF, PSO+ and MEME) to show that our algorithm is efficient.

Journal Link

Evaluating the accuracy of protein design using native secondary sub-structures

Background According to structure-dependent function of proteins, two main challenging problems called Protein Structure Prediction (PSP) and Inverse Protein Folding (IPF) are investigated. In spite of IPF essential applications, it has not been investigated as much as PSP problem.In fact, the ultimate goal of IPF problem or protein design is to create proteins with enhanced properties or even novel functions. One of the major computational challenges in protein design is its large sequence space, namely searching through all plausible sequences is impossible. Inasmuch as, protein secondary structure represents an appropriate primary scaffold of the protein conformation, undoubtedly studying the Protein Secondary Structure Inverse Folding (PSSIF) problem is a quantum leap forward in protein design, as it can reduce the search space.In this paper, a novel genetic algorithm which uses native secondary sub- structures is proposed to solve PSSIF problem.

Journal Link Download GAPSSIF

WCOACH: protein -protein complex prediction in weighted PPT networks

Protein complexes are aggregates of protein molecules that play important roles in biological processes . Detecting protein complexes from protein-protein interaction (PPI) networks is one of the most challenging problems in computational biology, and many computational methods have been developed to solve this problem . Generally , these methods yield high false positive rates . In this article , a semantic similarity measure between proteins , based on Gene Ontology (GO) structure , i s applied to weigh PPI networks. Consequently, one of the well-known methods, COACH, has been improved to be compatible with weighted PPI networks for protein complex detection.

Journal Link

RNA secondary structure prediction based on SHAPE data in helix regions

RNA molecules play important and fundamental roles in biological processes. Frequently, the functional form of single-stranded RNA molecules requires a specific tertiary structure. Classically, RNA structure determination has mostly been accomplished by X-Ray crystallography or Nuclear Magnetic Resonance approaches. These experimental methods are time consuming and expensive. In the past two decades, some computational methods and algorithms have been developed for RNA secondary structure prediction. In these algorithms, minimum free energy is known as the best criterion. However, the results of algorithms show that minimum free energy is not a sufficient criterion to predict RNA secondary structure. These algorithms need some additional knowledge about the structure, which has to be added in the methods. Recently, the information obtained from some experimental data, called SHAPE, can greatly improve the consistency between the native and predicted RNA secondary structure.

Journal Link

A genetic approach to accurately predict RNA secondary structure

RNA molecule plays important and fundamental roles in many biological processes. In the most times, activities of RNAs are determined by their structures. In notice to complexity and costly of laboratory methods to predict RNAs structure, computational approaches are used. There are variety of algorithms to predict RNA secondary structure. In this paper, a genetic algorithm called RNAG is presented to predict the RNA secondary structure based on minimum free energy (MFE). In this algorithm, each individual of population includes some stems. The individuals are increasingly ranked based on fitness value of MFE from stems and loops, and in the follow, crossover and mutation operations are done on individuals to make a new population, respectively. Process of population generation continues until an individual with proper MFE is produced. Finally, this individual is selected as an optimal RNA secondary structure. The proposed algorithm is performed on some RNAs in the bacteria. Results of the paper show that RNAG algorithm has a high accuracy in comparison with the other related methods.

Journal Link

Using temperature effects to predict the interactions between two RNAs

Interaction of two RNA molecules is considered as an important factor that regulates gene expression post-transcriptional process. Most of the ncRNAs prevent the translation of their target mRNA(s) by forming stable bindings with them. Although several computational methods have been proposed to predict the interactions between two RNAs, none of them can produce reliable and accurate results. RESULTS: In this paper, a new approach entitled tempRNAs is presented to accurately predict interaction structure between two RNAs based on a gradual temperature decrease. For each specified temperature, our algorithm contains two main steps. First, the secondary structure of each RNA is determined with respect to the previous base pairs as constraints. Second, two RNAs are concatenated and then the interaction between them is calculated according to the previous base pairs. The secondary structures are determined based on minimum free energy model. The proposed algorithm is evaluated for a set of known interacting RNA pairs. The results show the higher accuracy of the proposed method in comparison to the other state-of-the-art algorithms, namely inRNAs and RactIP.

Journal Link

RNA-RNA interaction prediction using genetic algorithm

Background RNA-RNA interaction plays an important role in the regulation of gene expression and cell development. In this process, an RNA molecule prohibits the translation of another RNA molecule by establishing stable interactions with it. In the RNA-RNA interaction prediction problem, two RNA sequences are given as inputs and the goal is to find the optimal secondary structure of two RNAs and between them. Some different algorithms have been proposed to predict RNA-RNA interaction structure. However, most of them suffer from high computational time. Results In this paper, we introduce a novel genetic algorithm called GRNAs to predict the RNA-RNA interaction. The proposed algorithm is performed on some standard datasets with appropriate accuracy and lower time complexity in comparison to the other state-of-the-art algorithms.

Journal Link

Transcription Factor Binding Sites Prediction Based on Modified Nucleosomes

In computational methods, position weight matrices (PWMs) are commonly applied for transcription factor binding site (TFBS) prediction. Although these matrices are more accurate than simple consensus sequences to predict actual binding sites, they usually produce a large number of false positive (FP) predictions and so are impoverished sources of information. Several studies have employed additional sources of information such as sequence conservation or the vicinity to transcription start sites to distinguish true binding regions from random ones. Recently, the spatial distribution of modified nucleosomes has been shown to be associated with different promoter architectures. These aligned patterns can facilitate DNA accessibility for transcription factors. We hypothesize that using data from these aligned and periodic patterns can improve the performance of binding region prediction.

Journal Link

Inverse RNA folding solution based on multi-objective genetic algorithm and gibbs sampling method

In living systems, RNAs play important biological functions. The functional form of an RNA frequently requires a specific tertiary structure. The scaffold for this structure is provided by secondary structural elements that are hydrogen bonds within the molecule. Here, we concentrate on the inverse RNA folding problem. In this problem, an RNA secondary structure is given as a target structure and the goal is to design an RNA sequence that its structure is the same (or very similar) to the given target structure. Different heuristic search methods have been proposed for this problem. One common feature among these methods is to use a folding algorithm to evaluate the accuracy of the designed RNA sequence during the generation process. The well known folding algorithms take O(n3) times where n is the length of the RNA sequence. In this paper, we introduce a new algorithm called GGI-Fold based on multi-objective genetic algorithm and Gibbs sampling method for the inverse RNA folding problem. Our algorithm generates a sequence where its structure is the same or very similar to the given target structure. The key feature of our method is that it never uses any folding algorithm to improve the quality of the generated sequences. We compare our algorithm with RNA-SSD for some biological test samples. In all test samples, our algorithm outperforms the RNA-SSD method for generating a sequence where its structure is more stable.

Journal Link

PreRkTAG: Prediction of RNA Knotted Structures Using Tree Adjoining Grammars

Background: RNA molecules play many important regulatory, catalytic and structural roles in the cell, and RNA secondary structure prediction with pseudoknots is one the most important problems in biology. An RNA pseudoknot is an element of the RNA sec ondary structure in which bases of a single-stranded loop pair with complementary bases outside the loop. Modeling these nested structures (pseudoknots) causes numerous com putational diffilties and so it has been generally neglected in RNA structure prediction algorithms. Objectives: In this study, we present a new heuristic algorithm for the Prediction of RNA Knotted structures using Tree Adjoining Grammars (named PreRKTAG). Materials and Methods: For a given RNA sequence, PreRKTAG uses a genetic algorithm on tree adjoining grammars to propose a structure with minimum thermodynamic energy.

Journal Link

EPWM: An Extended Position Weight Matrix for Motif Representation in Biological Sequences

In this paper, we design a new MR model based on information theory. This model can be used for the de novo motif finding and motif search problems. We extract some known motifs from JASPAR and TRANSFAC databases to search for common features among them. Based on these features, a new MR model is constructed called EPWM. A jackknife test is used to show the EPWM model is more successful than the other MR models for the motif search problem. The jackknife test with each MR model (the EPWM model and the other MR models) is implemented and performed on the JASPAR and TRANSFAC databases. To verify the efficiency of the EPWM model for the de novo motif finding problem, we implement the EPWM model and the other MR models in the Gibbs sampling method. Finally, the Gibbs sampling method is performed on the JASPAR and TRANSFAC databases. The results show that the EPWM model gives more accurate prediction than the other MR models for the motif search and de novo motif finding problems

Journal Link

A heuristic approach to RNA-RNA interaction prediction

RNA-RNA interaction is used in many biological processes such as gene expression regulation. In this process, an RNA molecule prohibits the translation of another RNA molecule by establishing stable interactions with it. In this regard, some algorithms have been formed to predict the structure of the interaction between two RNA molecules. One common pitfall in the most algorithms is their high computational time. In this paper, we introduce a novel algorithm called TIRNA to accurately predict the secondary structure between two RNA molecules based on minimum free energy (MFE). The algorithm is stand on a heuristic approach which employs some dot matrices for finding the secondary structure of each RNA and between two RNAs. The proposed algorithm has been performed on some standard datasets such as CopA-CopT, R1inv-R2inv, Tar-Tar*, DIS-DIS and IncRNA₅₄-RepZ in the Escherichia coli bacteria. The time and space complexity of the algorithm are 0(k² log k²) and 0(k²), respectively, where k indicates the sum of the length of two RNAs. The experimental results show the high validity and efficiency of the TIRNA.

Journal Link

Transcription factor binding sites identification on human genome using an artificial neural network

Transcription factor binding sites on human DNA are the target locations of specific proteins called transcription factors. Gene expression process begins when a transcription factor binds to its target location in the genome. Expensive experimental methods are used to identify a limited number of these binding sites, hence there is essential need for computational algorithms. In this paper, we train a back propagation neural network to identify SP1 factor binding sites on human chromosome1. Biological data have been extracted from NCBI database which includes a wide variety of genetic information of human and other species. In order to compare the performance of our trained neural network with other classification algorithms, we use Support Vector Machine, Discriminant Analysis and K-Nearest Neighbor algorithm to classify same data. Results show that our trained neural network outperforms other classification algorithms.

Journal Link

PSOMF: An algorithm for pattern discovery using PSO

The task of transcription factor binding sites discovery from the upstream region of gene, without any prior knowledge of what look likes, is very challenging. In this paper we propose an algorithm based on Particle Swarm Optimization (PSO) to identify motif instances in multiple biological sequences. The experimental results on yeast sac-choromyces Cerevisae transcription factor binding sites, demonstrate that the proposed method is working analogous to YMF, MEME and AlignACE algorithms

Journal Link