Fundamentals of Bioinformatics
Spaced words for alignment-free sequence comparison and read mapping
‘Spaced words’ or ‘spaced seeds’ are frequently used in biological sequence analysis, e.g. in database searching. A ‘spaced word’ is a word that contains wildcard characters at certain positions specified by a pre-defined binary pattern of ‘match’ and ‘don’t-care’ positions. It has been shown that methods that rely on spaced words are often more accurate than approaches based on contiguous words. In 2014, we proposed to use spaced words in alignment-free sequence comparison, to estimate phylogenetic distances between genomic sequences. The results of ‘spaced words’ algorithms depend on the underlying pattern of ‘match’ and ‘don’t-care’ positions. We developed a program called ‘rasbhari’ to calculate suitable patterns for database searching, read mapping and alignment-free sequence comparison.
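To make the definition concrete, here is a minimal sketch of how spaced words could be extracted from a sequence. The function names, the `*` wildcard symbol and the `1`/`0` encoding of match and don't-care positions are assumptions made for illustration; this is not the rasbhari implementation.

```python
def spaced_words(seq, pattern, wildcard="*"):
    """Return the spaced words of `seq` for a binary pattern.

    Assumed encoding: '1' marks a match position, '0' a don't-care position.
    """
    if not pattern or set(pattern) - {"0", "1"}:
        raise ValueError("pattern must be a non-empty string of '0' and '1'")
    span = len(pattern)
    words = []
    for start in range(len(seq) - span + 1):
        window = seq[start:start + span]
        # Keep the character at match positions, mask it at don't-care positions.
        words.append(
            "".join(c if p == "1" else wildcard for c, p in zip(window, pattern))
        )
    return words


def shared_spaced_words(seq_a, seq_b, pattern):
    """Return the set of spaced words that occur in both sequences."""
    return set(spaced_words(seq_a, pattern)) & set(spaced_words(seq_b, pattern))


print(spaced_words("ACGTAC", "101"))
print(shared_spaced_words("ACGTAC", "AGGTTC", "101"))
```

In the second call the two sequences differ at the don't-care positions of the windows `ACG`/`AGG` and `TAC`/`TTC`, yet both windows still yield a shared spaced word. A contiguous word of length 3 would miss these matches.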
Using Deep Neural Networks to Reveal Cell Identity from Gene Expression Profiles
Understanding cell identity is an important task in many biomedical areas. Expression patterns of specific marker genes have been used to characterize a limited number of cell types, but for most cell types no exclusive markers are known.
In this talk we introduce a method based on deep neural networks to identify cell type based on gene expression profiles. We have used more than 1000 whole-genome transcription profiles to train and test our model, and reached more than 96% classification accuracy.
RNA Positive Design
RNA positive design is the design of an RNA sequence programmed to adopt a given secondary structure. From a computational perspective, this problem is elusive, and its exact complexity is currently unknown. Consequently, virtually all existing tools for the problem resort to incremental heuristics (genetic algorithms, local search, ant colonies…) with comparable success that is highly dependent on the quality of the initially chosen sequence or set of sequences, also called seeds. In collaboration with the Waldispuhl (Univ. McGill, Montreal, Canada) and Hofacker (TBI Vienna, Austria) groups, we have chosen to focus on seed design, and have introduced sequence sampling techniques which seem to consistently improve the quality of the heuristic optimization step. This sampling technique is also quite flexible, and can be used to capture other recurrent goals of RNA design in applied contexts, including explicit control over forbidden/forced sequence motifs (using techniques inspired by formal languages), GC-content and dinucleotide composition (both using multidimensional Boltzmann sampling), and compatibility with multiple target structures (contributing an FPT algorithm based on a tree-decomposition of the induced compatibility graph).
RNA-SEQ: Read alignment and mapping
Development of novel sequencing technologies has provided a new method to reveal the presence and quantity of RNA in a biological sample for both mapping and quantifying transcriptomes. Understanding the transcriptome is essential for interpreting the functional elements of the genome and revealing the molecular constituents of cells and tissues, and also for understanding development and diseases.
Two approaches are used to assign raw sequence reads or assemble the transcriptome:
De novo: This approach does not require a reference genome to reconstruct the transcriptome, and is typically used if the genome is unknown, incomplete, or substantially altered compared to the reference.
Genome guided: This approach relies on the same methods used for DNA alignment, with the additional complexity of aligning reads that cover non-continuous portions of the reference genome. These non-continuous reads are the result of sequencing spliced transcripts.
In this presentation, different algorithms for transcriptome assembly are reviewed.
Stochastic Modeling of DNA damages
DNA double strand breaks (DSBs) are the most lethal lesions of DNA induced by ionizing radiation, industrial chemicals and a wide variety of drugs used in chemotherapy. In the context of modelling the DNA damage response system, uncertainty may arise in several ways, such as in the number of induced DSBs, in kinetic rates and in the measurement error of observable quantities. Stochastic approaches are therefore needed to gain further insight into the dynamic behavior of the DSB repair process.
In order to estimate the expected duration of DSB repair, the probability distribution of DSB lifetime is calculated based on a continuous-time Markov chain (CTMC) model of non-homologous end joining (NHEJ). Moreover, the variability of the total yield of DSBs during constant low-dose radiation is evaluated in the presented model. The findings indicate that in the stochastic NHEJ model, when no new DSBs are induced during the repair process, all DSBs are eventually repaired. However, when DSBs are induced by constant low-dose radiation, a number of DSBs remain unrepaired.
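The qualitative behavior described here can be illustrated with a deliberately simplified CTMC. The sketch below is not the NHEJ model of the talk: it assumes that each DSB is repaired independently at a hypothetical rate `mu` and that new DSBs are induced at a hypothetical constant rate `lam`. All function names and parameter values are illustrative.

```python
import random


def expected_clearance_time(n0, mu):
    """Expected time to repair all n0 breaks without new induction.

    With k breaks left, the next repair occurs at rate k * mu, so the
    expected total time is the sum of 1 / (k * mu) for k = 1..n0.
    """
    return sum(1.0 / (k * mu) for k in range(1, n0 + 1))


def simulate_clearance_time(n0, mu, rng):
    """Sample one clearance time by simulating the repair-only chain."""
    t, n = 0.0, n0
    while n > 0:
        t += rng.expovariate(n * mu)  # waiting time until the next repair
        n -= 1
    return t


def simulate_with_induction(n0, lam, mu, t_end, rng):
    """Return the number of unrepaired breaks at time t_end.

    Breaks are induced at constant rate lam and each is repaired at rate mu.
    """
    t, n = 0.0, n0
    while True:
        rate = lam + n * mu
        if rate == 0:
            return n  # nothing can happen any more
        dt = rng.expovariate(rate)
        if t + dt > t_end:
            return n
        t += dt
        if rng.random() < lam / rate:
            n += 1  # induction of a new break
        else:
            n -= 1  # repair of an existing break


rng = random.Random(0)
print(expected_clearance_time(10, 1.0))
print(simulate_clearance_time(10, 1.0, rng))
print(simulate_with_induction(0, 5.0, 1.0, 20.0, rng))
```

Under these assumptions the repair-only chain always reaches zero breaks in finite time, whereas with constant induction the number of breaks fluctuates around `lam / mu` and does not settle at zero, which mirrors the two regimes reported above.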
Breast Cancer Drug (ICD-85)
Finding a novel drug is always a challenge. The major concern is the cost of preclinical and clinical trial studies, which are the most expensive and time-consuming parts of the process. Bioinformatics techniques allow many experiments to be performed by computer, saving considerable time and money. Bioinformatics is a powerful tool that accelerates drug development by providing insight into the structure of potential targets and into target-specific sites of signaling molecules or their downstream effectors. From a biological point of view, the process can be divided into four steps: (A) target identification, (B) target validation, (C) lead substance and (D) lead optimization. We will discuss our team's experience in the discovery of ICD-85, biological peptides that suppress the growth of cancer cells. The primary steps include identification of the peptides through trial and error, finding the mechanism of action, safety studies, bio-distribution, exposure-time-related activity, and finally phase 0 and phase 1 clinical trials in breast cancer patients. The challenges we faced during 12 years of work will also be presented. Finally, we will discuss how bioinformatics could join with biologists to ease the process of drug discovery.
Drug Design for Alzheimer’s Disease
Alzheimer’s disease, the most common form of dementia, is a chronic neurodegenerative disorder characterized by progressive cognitive impairment in elderly people. According to the cholinergic hypothesis, memory loss in Alzheimer’s disease is due to decreased levels of the neurotransmitter acetylcholine (ACh), which plays a key role in memory and cognition in cholinergic synapses. Alzheimer’s disease is therefore characterized by low ACh levels in the hippocampus and cortex. Acetylcholinesterase (AChE), one of the most essential enzymes in the family of serine hydrolases, is responsible for the rapid breakdown of ACh to allow repeated signal transmission. Inhibition of AChE in Alzheimer’s disease treatment should increase the level of ACh in the synapses, providing a chance to induce a signal in the downstream nerve. In this study we present an approach for predicting the inhibitory activity of AChE inhibitors by combining docking studies with a structure-based quantitative structure–activity relationship (QSAR) model. Docking analysis revealed that hydrophobic interactions play important roles in the AChE-inhibitor complex. A structure-based QSAR model was also developed to represent the relationship between descriptors created from docking and the activities of the inhibitors. A least squares support vector regression model was constructed using the four most relevant docking descriptors and one molecular structure descriptor. The Q2 value of the model was found to be 0.790.
Protein engineering is an important tool for overcoming the limitations of natural enzymes as biocatalysts. In this regard, computational tools are becoming increasingly important for creating improved or novel enzymes.
Here we describe some strategies for rational protein engineering and summarize the computational tools available. Computational tools can be used to increase the stability, activity and affinity of proteins, and also to design new peptides.
Alignment-free sequence comparison using maximal common substrings
Most methods for alignment-free sequence comparison are based on a fixed word length or on fixed binary patterns of ‘match’ and ‘don’t-care’ positions. The results of these methods therefore depend on the word length or underlying pattern. As an alternative, some approaches have been proposed that are based on the length of common subwords. Haubold et al. (2009) showed how phylogenetic distances can be estimated in a rigorous way based on the average length of common substrings. Generalizing this approach, we proposed to use the length of common substrings with $k$ mismatches in alignment-free sequence comparison. In a recent paper, we showed that the number of substitutions per position in DNA sequences can be accurately estimated from the length distribution of $k$-mismatch common substrings.
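The underlying statistic can be sketched as follows. For every position in one sequence, the code finds the length of the longest substring starting there that also occurs in the other sequence with at most $k$ mismatches. This is a naive, quadratic-time illustration with a hypothetical function name. It computes only the raw match lengths, not the distance estimators from the cited work, and practical tools use far more efficient index structures.

```python
def longest_match_lengths(a, b, k=0):
    """For each position i of `a`, return the length of the longest
    substring starting at i that matches some substring of `b`
    with at most k mismatches (k = 0 gives exact common substrings).
    """
    lengths = []
    for i in range(len(a)):
        best = 0
        for j in range(len(b)):
            mismatches = 0
            length = 0
            # Extend the pair of substrings until the (k+1)-th mismatch
            # or the end of either sequence.
            while i + length < len(a) and j + length < len(b):
                if a[i + length] != b[j + length]:
                    if mismatches == k:
                        break
                    mismatches += 1
                length += 1
            best = max(best, length)
        lengths.append(best)
    return lengths


exact = longest_match_lengths("ACGT", "ACGA", k=0)
relaxed = longest_match_lengths("ACGT", "ACGA", k=1)
print(exact, sum(exact) / len(exact))
print(relaxed, sum(relaxed) / len(relaxed))
```

Intuitively, closely related sequences share long substrings and more distant ones share only short substrings, so the average or the distribution of these lengths carries information about the number of substitutions between the sequences.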
A Systems Approach to Modeling Cell-specific Metabolic Networks
Genome-scale metabolic networks have been widely used to model the metabolic capacities of a variety of cell types, ranging from microorganisms to plants and humans. More specifically, context-specific human metabolic networks have been used during the last decade to understand human physiology and pathology. In the present talk, by reviewing recent publications, I will explain how “omics” data empower the reconstruction and subsequent analysis of such networks. Furthermore, some of the basic computational challenges of the procedure will be discussed.
Comparison of Different Approaches for Identifying Subnetworks in Metabolic Networks
A metabolic network model provides a computational framework for studying the metabolism of a cell at the system level. The organization of metabolic networks has been investigated in different studies. One aspect of organization considered in these studies is the decomposition of a metabolic network into subnetworks. The decompositions produced by different methods differ substantially, and there is no comprehensive evaluation framework for comparing the results with each other. In this study, these methods are first reviewed and compared. They are then applied to six different metabolic network models, and the results are evaluated and compared based on two existing and two newly proposed criteria. The results show that no single method outperforms the others on all criteria, but the methods introduced by Guimera & Amaral and by Verwoerd appear to perform better among metabolite-based methods, and the method introduced by Sridharan et al. performs better among reaction-based ones.