CBRC Tutorials

Drug Repurposing concept
- Drug Discovery
Benefits
- Less risky
- Faster
- Cheaper
- Creates opportunities to treat rare, acute, and neglected diseases
Methods
- Gene expression before and after drug treatment (Connectivity Map (CMap) dataset)
- Same target, same drug (Guilt by Association (GBA))
Main idea
- Drug-drug similarity and disease-disease similarity => drug-disease relationship
drug-drug similarity
- Molecular structure pubchem.ncbi.nlm.nih.gov
- Molecular activity (drug-target) DrugBank
- Phenotype data (side-effect) SIDER
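
A minimal sketch of one way to turn phenotype data into a drug-drug similarity score: the Jaccard index over side-effect sets. The choice of Jaccard is an assumption (the notes do not name a measure), and the drug names and side effects below are made up for illustration, not taken from SIDER.

```python
def jaccard_similarity(a, b):
    """Jaccard index: size of intersection / size of union; 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similarity_matrix(profiles):
    """Pairwise similarity for every ordered pair of keys in `profiles`."""
    names = sorted(profiles)
    return {(p, q): jaccard_similarity(profiles[p], profiles[q])
            for p in names for q in names}


# Hypothetical side-effect profiles (illustrative values only)
side_effects = {
    "drug_A": {"nausea", "headache", "dizziness"},
    "drug_B": {"nausea", "headache", "rash"},
    "drug_C": {"insomnia"},
}

sim = similarity_matrix(side_effects)
print(sim[("drug_A", "drug_B")])  # 2 shared of 4 distinct side effects -> 0.5
```

The same function could be applied to drug-target sets or to structural fingerprints represented as sets of features.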
disease-disease similarity
- Phenotype data (disease symptoms) www.cmbi.ru.nl/MimMiner/suppl.html
drug-disease interaction prediction
- Train an SVM on a gold standard of known drug-disease associations
Evaluation
- Leave-one-drug-out cross-validation
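
A minimal sketch of how leave-one-drug-out splits could be generated: in each fold, every pair involving one drug is held out for testing and the remaining pairs are used for training. The tuple layout `(drug, disease, label)` and the example pairs are assumptions for illustration.

```python
def leave_one_drug_out(pairs):
    """Yield (held_out_drug, train, test); test holds all pairs of one drug."""
    drugs = sorted({drug for drug, _disease, _label in pairs})
    for held_out in drugs:
        train = [p for p in pairs if p[0] != held_out]
        test = [p for p in pairs if p[0] == held_out]
        yield held_out, train, test


# Hypothetical labelled pairs: (drug, disease, 1 = known association, 0 = none)
pairs = [
    ("drug_A", "disease_X", 1),
    ("drug_A", "disease_Y", 0),
    ("drug_B", "disease_X", 0),
    ("drug_B", "disease_Y", 1),
    ("drug_C", "disease_X", 1),
]

folds = list(leave_one_drug_out(pairs))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

Holding out all pairs of a drug at once keeps the classifier from seeing any association of the test drug during training, which mimics predicting for a drug with no known indications.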

Secondary structure prediction
- Input: a protein sequence over the 20 amino acids
- Output: secondary structure of each amino acid (Helix, Strand, Coil)
- Challenge: how to encode the input for an SVM
  - Evolutionary information: multiple sequence alignment with a database
- Several binary SVMs
  - one-versus-rest (H/~H, E/~E, C/~C), where E denotes strand
  - one-versus-one (C/H, C/E, H/E)
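
A minimal sketch of how the per-residue labels could be recoded for the two binary schemes. It uses H, E, C for helix, strand, coil (E for strand, as in the one-versus-one pairs); the annotation string is hypothetical, and the SVM training itself is left out.

```python
from itertools import combinations

CLASSES = ("H", "E", "C")  # helix, strand, coil


def one_versus_rest(labels, positive):
    """+1 for residues of class `positive`, -1 for all other residues."""
    return [1 if y == positive else -1 for y in labels]


def one_versus_one(labels, first, second):
    """Keep only residues of the two classes; return (indices, +1/-1 labels)."""
    idx = [i for i, y in enumerate(labels) if y in (first, second)]
    return idx, [1 if labels[i] == first else -1 for i in idx]


structure = "CCHHHHCCEEEC"  # hypothetical per-residue annotation

# Three one-versus-rest problems: H/~H, E/~E, C/~C
ovr = {c: one_versus_rest(structure, c) for c in CLASSES}

# Three one-versus-one problems: H/E, H/C, E/C
ovo = {(a, b): one_versus_one(structure, a, b)
       for a, b in combinations(CLASSES, 2)}
```

One-versus-rest trains each SVM on all residues, while one-versus-one trains each SVM only on residues from its two classes.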
Fold recognition
- Input: part of protein sequence
- Output: fold type of that part of the protein sequence (27 SCOP folds)
- Challenge: how to encode the input for an SVM
  - Physical and chemical properties: amino acid composition, predicted secondary structure, hydrophobicity, polarity, ...
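
A minimal sketch of the simplest of these encodings, amino acid composition: a sequence of any length becomes a fixed-length vector of 20 fractions, which is the shape an SVM needs. The example sequence is made up.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(sequence):
    """Fraction of each standard amino acid; non-standard letters are ignored."""
    sequence = sequence.upper()
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [c / total for c in counts]


features = amino_acid_composition("MKVLAAGLLK")  # hypothetical sequence
print(len(features))  # 20
```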
Cleavage site identification
- Input: protein sequence
- Output: for every position in the protein, whether it is a cleavage site or not
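
A minimal sketch of how a per-position prediction could be set up: each position is represented by a fixed-width window of neighbouring residues, and each window becomes one example for a binary classifier. The window width and the padding symbol `X` are assumptions, not taken from the notes.

```python
def residue_windows(sequence, half_width=2, pad="X"):
    """One window of 2*half_width+1 residues centred on each position."""
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(sequence))]


windows = residue_windows("MKVLA", half_width=2)  # hypothetical sequence
print(windows)  # ['XXMKV', 'XMKVL', 'MKVLA', 'KVLAX', 'VLAXX']
```

Padding the ends keeps one window per residue, so the output has exactly as many entries as the sequence has positions.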
RNA-binding proteins
- Input: protein sequence
- Output: whether the protein binds to RNA or not
  - Positive examples (proteins that bind to RNA): UniProt
  - Negative examples (proteins that do not bind to RNA): PDB
- Challenge: how to encode the input for an SVM
  - Physical and chemical properties: hydrophobicity, polarity, ...
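
A minimal sketch of one physicochemical feature, mean hydrophobicity, computed with the Kyte-Doolittle hydropathy scale. The choice of this scale is an assumption; the notes do not say which scale is used.

```python
# Kyte-Doolittle hydropathy index per residue (one-letter codes)
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydrophobicity(sequence):
    """Average hydropathy over standard residues; 0.0 for an empty sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    return sum(values) / len(values) if values else 0.0


print(mean_hydrophobicity("MKVLAAGLLK"))  # hypothetical sequence
```

Other properties such as polarity could be averaged the same way with a different lookup table, and the resulting numbers concatenated into one feature vector.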

Input: the gene expression level of each gene for normal and cancer samples
Output: whether a new sample has cancer or not, determined by identifying the genes involved in the disease
Gene expression levels are measured with microarrays
Challenge of microarray data: the number of features (genes) is greater than the number of samples
- Select genes that relate to disease -> Mutual Information \[I(X;Y) = \sum_x \sum_y p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}\]
  - Discrete data -> normalize the microarray data and discretize it into expression levels
  - The relation between normal and cancer in each gene (expressed at different levels)
  - Lower mutual information means that the gene is more disease-specific
- Classify samples to normal and cancer according to the gene expression level
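
A minimal sketch of the mutual information formula for two discrete variables, with probabilities estimated from counts. The `discretize` helper and its thresholds are hypothetical; which two variables are paired for each gene follows the notes and is not fixed by this code.

```python
from collections import Counter
from math import log2


def discretize(values, low, high):
    """Map numeric expression values to 'low' / 'mid' / 'high' (assumed cut-offs)."""
    return ["low" if v < low else "high" if v > high else "mid" for v in values]


def mutual_information(xs, ys):
    """Sum over (x, y) of p(x,y) * log2(p(x,y) / (p(x) p(y))), in bits."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_xy = count / n
        mi += p_xy * log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


# Two perfectly dependent variables share 1 bit; independent ones share 0 bits
dependent = mutual_information(["a", "a", "b", "b"], ["u", "u", "v", "v"])
independent = mutual_information(["a", "b", "a", "b"], ["u", "u", "v", "v"])
print(dependent, independent)  # 1.0 0.0

levels = discretize([0.2, 1.1, 2.7], low=0.5, high=2.0)  # made-up values
print(levels)  # ['low', 'mid', 'high']
```

Pairs that never co-occur are skipped because they contribute zero to the sum, which avoids evaluating log2(0).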