Title: Pharmacogenomics and Bioinformatics
1Pharmacogenomics and Bioinformatics
2What is pharmacogenomics?
- Pharmacogenomics is the use of genomic and sequence data from hosts and pathogens to identify potential drug targets
- Involves a variety of techniques/disciplines such as sequence analysis, protein structure, genomics, microarray analysis and others
- These fields rely heavily on bioinformatics
- Usually focuses on medical or agricultural applications
3Human Genome Project
- Project goals are to
- identify all the approximately 20,000-25,000 genes in human DNA,
- determine the sequences of the 3 billion chemical base pairs that make up human DNA,
- store this information in databases,
- improve tools for data analysis,
- transfer related technologies to the private sector, and
- address the ethical, legal, and social issues (ELSI) that may arise from the project.
From http://www.ornl.gov/hgmis/
4Human Genome Project
- Progress
- Several types of genome maps have already been completed, and a working draft of the entire human genome sequence was announced in June 2000, with analyses published in February 2001.
- An important feature of this project is the federal government's long-standing dedication to the transfer of technology to the private sector. By licensing technologies to private companies and awarding grants for innovative research, the project is catalyzing the multibillion-dollar U.S. biotechnology industry and fostering the development of new medical applications.
From http://www.ornl.gov/hgmis/
5Human Genome Project
- Seven organisms were originally chosen for sequencing.
- E. coli
- Yeast
- Fly
- Worm
- Arabidopsis
- Mouse
- Human
- Why were these chosen?
6Genome Projects
- As of January 2005 there were many more sequenced
- 25 non-plant eukaryotes
- 5 plants
- 213 microbes completed
- 21 Archaea
- 274 microbes in progress
- 1431 viruses in progress
- 833 non-virus organisms with at least one nucleotide sequence submitted
- Why were these chosen?
7Genome Projects
- Chosen by funding agencies
- Four main categories
- Medical applications
- Evolutionary significance
- Environmental impact
- Food production
8How is genomics used for drug target identification?
- The basic idea is to look for genes that are unique to the pathogen and crucial for its survival. Such a gene would be the drug target.
- If the pathogen lives in a host, the gene should be present in the pathogen and absent from the host.
- If the pathogen lives in the environment, the gene should be as specific as possible to the pathogen to avoid harming other organisms that might be beneficial.
9How can this be done?
- Doing this involves genomics, proteomics and bioinformatics.
- In any of these cases bioinformatics tools are necessary.
10Genome Sequencing and Comparison
- As mentioned earlier, many pathogens (viruses, bacteria, and other microorganisms) have been sequenced.
- Once they are sequenced, they are annotated. Annotation is the process by which the functions of the different proteins (genes) are determined.
- In this way, an understanding of the organism's metabolism is gained.
11Malaria
- Malaria is caused by parasites of the genus Plasmodium, with Plasmodium falciparum being the most lethal.
- Its genome has been sequenced
- It is a pathogen that digests proteins for food. It does not contain any amino acid producing genes in its genome, i.e. it does not make its own amino acids.
- Purines are recycled, but there are no genes for purine synthesis.
- Has many ATP-dependent solute transporters and one novel multifunctional transporter.
12How is annotation done?
- Annotation is the process of predicting the function of genes in a genome.
- First all the genes have to be found. This is done by finding the open reading frames (ORFs).
- ORFs are found by gene finding or gene prediction software.
13Gene Prediction
- Analysis by sequence similarity can only reliably identify about 30% of the protein-coding genes in a genome
- 50-80% of new genes identified have a partial, marginal, or unidentified homolog
- Frequently expressed genes tend to be more easily identifiable by homology than rarely expressed genes
14Gene Finding
- Process of identifying potential coding regions in an uncharacterized region of the genome
- Still a subject of active research
- There are many different gene finding software packages and no one program is capable of finding everything
16Eukaryotes vs Prokaryotes
- Eukaryotic DNA is wrapped around histones, which might result in repeated patterns (periodicity of 10) for histone binding. The promoter regions might be near these sites so that they remain hidden.
- Prokaryotes have no introns.
- Promoter regions and start sites are more highly conserved in prokaryotes
- Different codon use frequencies
17Gene finding is species-specific
- Codon usage patterns vary by species
- Functional regions (promoters, splice sites, translation initiation sites, termination signals) vary by species
- Common repeat sequences are species-specific
- Gene finding programs rely on this information to identify coding regions
18The genetic code
19Codon usage
20Identifying ORFs
- Simple first step in gene finding
- Translate the genomic sequence in six frames. Identify stop codons in each frame
- Regions without stop codons are called "open reading frames" or ORFs
- Locate and tag all of the likely ORFs in a sequence
- The longest ORF from a Met codon is a good prediction of a protein-encoding sequence.
- Software: NCBI ORF Finder
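The six-frame scan described above can be sketched in a few lines of Python. This is a minimal illustration, not how NCBI ORF Finder is implemented: the function names, the `min_codons` cutoff, and the choice to require an ATG (Met) start are assumptions made for the example. Coordinates reported for the "-" strand refer to positions on the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons):
    """Yield (start, end) of ATG...stop ORFs in one reading frame."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i                      # first Met codon opens the ORF
        elif start is not None and codon in STOP_CODONS:
            if (i - start) // 3 >= min_codons:
                yield start, i + 3         # end is exclusive, stop included
            start = None


def find_orfs(seq, min_codons=2):
    """Scan all six frames; return (strand, frame, start, end, dna) tuples."""
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                results.append((strand, frame, start, end, s[start:end]))
    # Longest ORF first, since it is the strongest candidate
    return sorted(results, key=lambda r: r[3] - r[2], reverse=True)
```

Calling `find_orfs(genome_string)` returns the candidates sorted by length, so the first entry corresponds to the "longest ORF from a Met codon" heuristic.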
21ORF Finder input
22ORF finder results
23Tests of the Predicted ORF
- Check if the third base in the codons tends to be the same one more often than by chance alone.
- Are the codons used in the ORF the same as those used in other genes (need codon usage frequency).
- Compare the amino acid sequence for similarity with other known amino acid sequences.
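The first two tests need only simple counting. The sketch below tallies codon usage and third-base composition for a candidate ORF; the function names are hypothetical, and comparing the resulting frequencies against a reference codon usage table for the species is left to the reader.

```python
from collections import Counter


def codon_counts(orf):
    """Count each codon in an in-frame DNA string (trailing bases ignored)."""
    orf = orf.upper()
    return Counter(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))


def third_base_frequencies(orf):
    """Fraction of codons ending in each base (third-position composition)."""
    counts = codon_counts(orf)
    total = sum(counts.values())
    if total == 0:
        return {base: 0.0 for base in "ACGT"}
    third = Counter()
    for codon, n in counts.items():
        third[codon[2]] += n
    return {base: third[base] / total for base in "ACGT"}
```

A third-base distribution far from uniform, or codon counts that resemble those of known genes in the same organism, would support the prediction.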
24Problems with ORF finding
- A single-character sequencing error can hide a stop codon or insert a false stop codon, preventing accurate identification of ORFs
- Short exons can be overlooked
- Multiple transcripts or ORFs on complementary strand can confuse results
25Pattern-based gene finding
- ORF finding based on start and stop codon frequency is a pattern-based procedure
- Other pattern-based procedures recognize characteristic sequences associated with known features and genes, such as ribosome binding sites, promoter sites, histone binding sites, etc.
- Statistically based.
26Content-based gene finding
- Content-based gene finding methods rely on statistical information derived from known sequences to predict unknown genes
- Some evaluative measures include "coding potential" (based on codon bias), periodicity in the sequence, sequence homogeneity, etc.
27A standard content-based alignment procedure
- Select a window of DNA sequence from the unknown. The window is usually around 100 base pairs long
- Evaluate the window's potential as a gene, based on a variety of factors
- Move the window over by one base
- Repeat the procedure until the end of the sequence is reached; report continuous high-scoring regions as putative genes
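The window procedure can be sketched as follows. The scoring function here is GC fraction, used purely as a placeholder for a real coding-potential measure; the function names, the threshold, and the region-merging rule are all assumptions for illustration.

```python
def gc_fraction(window):
    """Placeholder score: fraction of G and C bases in the window."""
    return sum(base in "GC" for base in window) / len(window)


def high_scoring_regions(seq, window=100, threshold=0.6, score=gc_fraction):
    """Slide a window one base at a time and merge consecutive windows whose
    score meets the threshold into (start, end) regions (end exclusive)."""
    seq = seq.upper()
    regions = []
    current = None
    for start in range(len(seq) - window + 1):
        if score(seq[start:start + window]) >= threshold:
            end = start + window
            if current is None:
                current = [start, end]     # open a new region
            else:
                current[1] = end           # extend the running region
        elif current is not None:
            regions.append(tuple(current))
            current = None
    if current is not None:
        regions.append(tuple(current))
    return regions
```

Any function that maps a window string to a number can be passed as `score`, which is where a codon-bias or periodicity measure would be plugged in.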
28Combining measures
- Programs rarely use one measure to predict genes
- Different values are combined (using
probabilistic methods, discriminant analysis,
neural net methods, etc.) to produce one "score"
for the entire window
29Drawbacks to window-based evaluation
- A sequence length of at least 100 b.p. is required before significant information can be gained from the analysis
- Results in a +/- 100 b.p. uncertainty in the start site of predicted coding regions, unless an unambiguous pattern can also be found to indicate the start.
30Most are web-based, but...
- Submit sequence: input sequence length may be limited
- Select parameters, if any
- Interpret results
- Most software is first or second generation; results come in non-graphical formats.
- GeneMark, GenScan, Glimmer
31How is annotation done?
- This is done by comparing the DNA sequences of the genes to known genes in a database. If the sequences are similar, then a similar function is assumed.
- The comparison is done using sequence comparison tools such as BLAST
32Database Searching for Similar Sequences
- Database searching for similar sequences is ubiquitous in bioinformatics.
- Databases are large and getting larger
- Need fast methods
33Types of Searches
- Sequence similarity search with query sequence
- Alignment search with profile (scoring matrix with gap penalties)
- Search with position-specific scoring matrix representing ungapped sequence alignment
- Iterative alignment search for similar sequences that starts with a query sequence, builds a multiple alignment, and then uses the alignment to augment the search
- Search query sequence for patterns representative of protein families
From Bioinformatics by Mount
34DNA vs Protein Searches
- DNA sequences consist of 4 characters (nucleotides)
- Protein sequences consist of 20 characters (amino acids)
- Hence, it is easier to detect patterns in protein sequences than DNA sequences
- Better to convert DNA sequences to protein sequences for searches.
35Database Searching Efficacy
- To evaluate searching methods, selectivity and sensitivity need to be considered.
- Selectivity is the ability of the method not to find members known to be of another group (i.e. false positives).
- Sensitivity is the ability of the method to find members of the same protein family as the query sequence.
36Protein Searches
- Easier to identify protein families by sequence similarity rather than structural similarity (same structure does not mean same sequence).
- Use the appropriate gap penalty scorings
- Evaluate results for statistical significance.
37History
- Historically, dynamic programming was used for database sequence similarity searching.
- Computer memory, disk space, and CPU speed were limiting factors.
- Speed is still a factor due to the larger databases and increased number of searches.
- FASTA and BLAST allow fast searching.
38History
- The PAM250 matrix was used for a long time. It corresponds to a period of time over which only 20% of the amino acids have remained unchanged.
- BLOSUM has replaced PAM250 in most applications. BLAST uses the BLOSUM62 matrix. FASTA uses the BLOSUM50 matrix.
39Search Tools
- Similarity Search Tools
- Smith-Waterman Searching
- Heuristic Search Tools
- FASTA
- BLAST
40Malaria Vaccine
- A German and American team used reverse genetics, i.e. they used the sequenced genome, deduced the candidate genes, and then knocked out a particular gene (Uis3).
- This gives 30-day immunity in mice, which is better than vaccines made by traditional methods
41Microarray Data Analysis
- Gene chips allow the simultaneous monitoring of the expression level of thousands of genes. Many statistical and computational methods are used to analyze this data. These include
- statistical hypothesis tests for differential expression analysis
- principal component analysis and other methods for visualizing high-dimensional microarray data
- cluster analysis for grouping together genes or samples with similar expression patterns
- hidden Markov models, neural networks and other classifiers for predictively classifying sample expression patterns as one of several types (diseased, i.e. cancerous, vs. normal)
42What is Microarray Data?
- Although microarrays allow us to simultaneously monitor the expression of thousands of genes, microarray data has some liabilities. Each microarray is very expensive, the statistical reproducibility of the data is relatively poor, and there are a lot of genes and complex interactions in the genome.
- Microarray data is often arranged in an n x m matrix M with rows for the n genes and columns for the m biological samples in which gene expression has been monitored. Hence, m_ij is the expression level of gene i in sample j. A row e_i is the gene expression pattern of gene i over all the samples. A column s_j is the expression level of all genes in sample j and is called the sample expression pattern.
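As a small illustration of this layout, the sketch below stores a matrix as a list of rows and extracts a gene expression pattern (row) and a sample expression pattern (column). The numbers are made-up illustration data and the helper names are hypothetical.

```python
# Rows are genes, columns are samples (values are made-up illustration data)
M = [
    [2.0, 4.0, 8.0],   # gene 0 across samples 0..2
    [1.0, 1.0, 1.0],   # gene 1 across samples 0..2
]


def gene_pattern(matrix, i):
    """Row e_i: expression of gene i across all samples."""
    return list(matrix[i])


def sample_pattern(matrix, j):
    """Column s_j: expression of every gene in sample j."""
    return [row[j] for row in matrix]
```

Here `M[i][j]` plays the role of m_ij, using zero-based indices.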
43Types of Microarrays
- cDNA microarray
- Nylon membrane and plastic arrays (by Clontech)
- Oligonucleotide silicon chips (by Affymetrix)
- Note: Each new version of a microarray chip is at least slightly different from the previous version. This means that the measures are likely to change. This has to be taken into account when analyzing data.
44cDNA Microarray
- The expression level e_ij of gene i in sample j is expressed as a log ratio, log(r_ij/g_i), of its actual expression level r_ij in this sample to its expression level g_i in a control.
- When this data is visualized, e_ij is color coded as red (r_ij >> g_i), green (r_ij << g_i), or a mixture in between.
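A minimal sketch of the log-ratio calculation follows. The slide does not state the base of the logarithm; base 2 is assumed here, which makes a value of +1 mean a two-fold increase over the control and -1 a two-fold decrease.

```python
import math


def log_ratio(r, g):
    """Base-2 log ratio of sample expression r to control expression g.

    Base 2 is an assumption; the source only says "log ratio".
    """
    if r <= 0 or g <= 0:
        raise ValueError("expression levels must be positive")
    return math.log2(r / g)
```

Positive values correspond to the red end of the color scale (r_ij greater than g_i) and negative values to the green end.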
45Nylon Membrane and Plastic Arrays (by Clontech)
- A raw intensity and a background value are measured for each gene.
- The analyst is free to choose the raw intensity or can adjust it by subtracting the background intensity.
46Oligonucleotide Silicon Chips (by Affymetrix)
- These arrays produce a variety of numbers derived from 16-20 pairs of perfect match (PM) and mismatch (MM) probes.
- There are several statistics related to gene expression that can be derived from this data. The most commonly used one is the average difference (AVD), which is derived from the differences PM-MM in the 16-20 probe pairs.
- The next most commonly used method is the log absolute value (LAV), which comes from the ratios PM/MM in the probe pairs.
- Note: The Affymetrix gene-chip software has an absent/present call for each gene on a chip. According to Jagota, the method is complex and arbitrary so they usually ignore it.
47For What Do We Use Microarray Data?
- Genes with similar expression patterns over all samples: We can compare the expression patterns e_i and e_i' of two genes i and i' over all samples.
- If we use cluster analysis, we can separate the genes into groups of genes with similar expression patterns (trees).
- This will allow us to find which unknown genes have altered expression in a particular disease by comparing their patterns to those of genes known to be affiliated with the disease.
- It can also find genes that fit a certain pattern, such as a particular pattern of change with time.
- It can also characterize broad functional classes of new genes from the known classes of genes with similar expression.
48For What Do We Use Microarray Data?
- Genes with unusual expression levels in a sample: In contrast to standard statistical methods where we ignore outliers, here outliers might have particular importance. Hence, we look for genes whose expression levels are very different from the others.
- Genes whose expression levels vary across samples: We can compare gene expression levels of a particular gene or set of genes in different samples. This can be used to compare normal and diseased tissues or diseased tissue before and after treatment.
49For What Do We Use Microarray Data?
- Samples that have similar expression patterns: We might want to compare the expression patterns of all genes between two samples. We might cluster the genes into groups with similar expression patterns to help with the comparison. This can be used to compare normal and diseased tissues or diseased tissue before and after treatment.
- Tissues that might be cancerous (diseased): We can take the gene expression pattern of a sample and compare it to library expression patterns that indicate diseased or not diseased tissue.
50Statistical Methods Can Help
- Experimental Design: Since using microarrays is costly and time consuming, we want to design experiments to use the minimal number of microarrays that will give a statistically significant result.
- Data Pre-processing: It is sometimes useful to preprocess the data prior to visualization. An example of this is the log ratio mentioned earlier. It is often necessary to rescale data from different microarrays so that they can be compared. This is due to variation in chip-to-chip intensity. Another type of preprocessing is subtracting the mean and dividing by the variance.
51Statistical Methods Can Help
- Data Visualization: Principal component analysis and multidimensional scaling are two useful techniques for reducing multidimensional data to two and three dimensions. This allows us to visualize it.
- Cluster Analysis: By associating genes with similar expression patterns, we might be able to draw conclusions about their functional expression.
- Probability Theory: We can use statistical modeling and inference to analyze our data. Probability theory is the basis for these.
52Statistical Methods Can Help
- Statistical Inference: This is the formulation and statistical testing of a hypothesis and alternative hypothesis.
- Classifiers for the Data: We can construct classes from data, such as diseased vs. non-diseased tissue. We can build a model (such as a hidden Markov model) that fits known data for the different classes. This can then be used to classify previously unclassified data.
53Preprocessing Microarray Data
- Before microarray data can be analyzed or stored, a number of procedures or transformations must be applied to it.
- In order to analyze the data correctly, it is important to understand what the transformations might be doing to the data.
54Preprocessing Microarray Data
- Ratioing the data
- Log-transforming ratioed data
- Alternative to ratioing the data
- Differencing the data
- Scaling data across chips to account for chip-to-chip difference
- Zero-centering a gene or a sample expression pattern
- Weighting the components of a gene or sample expression pattern differently
- Handling missing data
- Variation filtering expression patterns
- Discretizing expression data
55Cluster Analysis of Microarray Data
- Recall that microarray data can be thought of as
gene expression patterns or sample expression
patterns. These can be each considered to be
vectors. The first thing we have to do before
applying cluster analysis is to find a distance
between the various expression pattern vectors.
This is done using similarity/dissimilarity
measures such as Euclidean distance, Mahalanobis
distance, or linear correlation coefficients.
Once a distance matrix is computed, the following
clustering algorithms can be used. The clusters
formed can differ significantly depending upon
the distance measure used.
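Two of these measures can be sketched as below. Turning the correlation coefficient r into a dissimilarity as 1 - r is one common convention and is an assumption here, as are the function names. The correlation is undefined for a constant vector (division by zero), which this sketch does not guard against.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two expression pattern vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pearson(u, v):
    """Linear (Pearson) correlation coefficient of two vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def correlation_distance(u, v):
    """1 - r: an assumed way to turn correlation into a dissimilarity."""
    return 1.0 - pearson(u, v)
```

The patterns [1, 2, 3] and [2, 4, 6] show why the choice matters: they have the same shape, so their correlation distance is essentially zero, yet their Euclidean distance is about 3.74.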
56Cluster Analysis of Microarray Data
- Hierarchical Clustering: Assume each data point is in a singleton cluster.
- Find the two clusters that are closest together. Combine these to form a new cluster.
- Compute the distance from all clusters to the new cluster using some form of averaging.
- Find the two closest clusters and repeat.
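The steps above can be sketched as a naive agglomerative procedure. Average linkage with Euclidean distance is assumed as the "form of averaging"; distances are recomputed from scratch at each step for clarity rather than speed, and the function names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage(c1, c2, points):
    """Mean pairwise distance between members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(points, target_clusters=1):
    """Merge the two closest clusters until target_clusters remain.

    Returns (clusters, merges); clusters are tuples of indices into points.
    """
    target_clusters = max(target_clusters, 1)
    clusters = [(i,) for i in range(len(points))]   # singleton clusters
    merges = []
    while len(clusters) > target_clusters:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], points),
        )
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b))                       # record the merge order
    return clusters, merges
```

The recorded merge order is the information a dendrogram (tree) would be drawn from.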
57Cluster Analysis of Microarray Data
- k-Means Clustering: An alternate method of clustering, called k-means clustering, partitions the data into k clusters and finds a cluster mean μ_i for each cluster. In our case, the means will also be vectors. Usually, the number of clusters k is fixed in advance. To choose k, something must be known about the data. There might be a range of possible k values. To decide which is best, optimize a quantity that measures cluster tightness, i.e. minimize the distances between points in a cluster.
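A minimal sketch of k-means follows, using the standard alternation between assigning points to the nearest mean and recomputing the means. The slide does not specify an initialization or a tightness measure; here the caller supplies the starting means, and the within-cluster sum of squared distances is assumed as the quantity to compare across values of k.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def k_means(points, initial_means, max_iter=100):
    """Alternate assignment and mean update. Returns (means, labels)."""
    means = [list(m) for m in initial_means]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(means)), key=lambda c: euclidean(p, means[c]))
            for p in points
        ]
        if new_labels == labels:
            break                              # assignments stable: converged
        labels = new_labels
        for c in range(len(means)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                        # keep old mean if cluster empties
                means[c] = mean_vector(members)
    return means, labels


def within_cluster_ss(points, means, labels):
    """Sum of squared distances to assigned means (lower means tighter)."""
    return sum(euclidean(p, means[lab]) ** 2 for p, lab in zip(points, labels))
```

Running `k_means` for several candidate k values and comparing `within_cluster_ss` is one simple way to judge tightness, keeping in mind that this quantity always tends to decrease as k grows.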
58Cluster Analysis of Microarray Data
- Self-organizing Maps: This is basically an application of neural networks to microarray data. Assume that there is a 2-dimensional grid of cells and a map from a given set of expression data vectors in R^n, i.e. there are n nodes in the input layer and a connection from each of these to each cell. Each cell (i, j) gets its own weights from the n input neurons. The weight vector m_ij is the mean of the cluster associated with cell (i, j). Each data vector d gets mapped to the cell (i, j) whose weight vector is closest to d using Euclidean distance. In order to train the network, the mean vectors m_ij for the cells (i, j) must be learned.
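The mapping step, and one possible training step, can be sketched as below. The slide does not give a learning rule; the update used here (move the winning cell and its grid neighbours a fraction of the way toward the data vector) is the usual Kohonen-style rule and is an assumption, as are the `rate` and `radius` parameters and the dictionary representation of the grid.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def best_cell(weights, d):
    """Return the grid cell (i, j) whose weight vector m_ij is closest to d.

    weights maps (i, j) tuples to weight vectors (lists of floats).
    """
    return min(weights, key=lambda cell: euclidean(weights[cell], d))


def train_step(weights, d, rate=0.5, radius=1):
    """Move the winning cell and its grid neighbours toward d (in place)."""
    bi, bj = best_cell(weights, d)
    for (i, j), w in list(weights.items()):
        if max(abs(i - bi), abs(j - bj)) <= radius:
            weights[(i, j)] = [wk + rate * (dk - wk) for wk, dk in zip(w, d)]
```

Training would repeat `train_step` over the data vectors many times, typically shrinking `rate` and `radius` as it proceeds; that schedule is omitted here.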
59 Sample Microarray
60Correlations
61Clustering of Genes
62Personalized Medicine
- There is a new buzzword called personalized medicine.
- The idea is to develop medicines and treatment plans based on an individual's genetic make-up.
63Proteomics
- Understanding protein function
- Functional genomics
- Multiple approaches: structure, expression levels, biochemistry, modeling etc.
- Combining technologies is necessary to understand in vivo protein function
64Approach
- Use data to determine pathway.
- Use biochemistry to figure out kinetics and concentrations.
- Use new proteomic approaches to determine relative concentrations.
- Apply pathway model to determine functional consequence.
65Pathway Data
- Using molecular biological techniques we can
determine what proteins make up a biochemical
pathway.
(Diagram: pathway with components A, B, C, D)
66Pathways
- Biochemical pathways form complex biochemical reaction networks.
- There might be multiple ways to get from A to B.
- The path chosen depends on biochemical kinetics.
67Biochemistry
- Classical biochemistry isolates proteins from tissue or cells.
- Modern molecular biology allows the production of purified protein.
- The concentration of the protein is determined
- The kinetic properties of the proteins are determined by biochemical assay: rates of reactions, modulating factors, etc.
68Pathway Modeling Methods
- Boolean Models
- Metabolic Control Theory / Flux Balance Analysis
- Biochemical Systems Analysis
- Kinetic Modeling Approach
69Disorders of Thrombophilia
- The functional consequences of nonsynonymous SNPs can be predicted by comparison of protein structures.
- There are various SNPs known
- Activated protein C resistance by Arg 506 to Gln
- Prothrombin polymorphism (G20210A) causing elevated prothrombin levels
- Protein C deficiency
- Protein S deficiency
- Antithrombin deficiency
- Elevated factor VIII levels
70Fibrinogen Abnormalities
- Various polymorphisms are found in the long arm of chromosome 4
- Two dimorphisms of the β-chain gene are of major importance and in linkage disequilibrium with each other.
- These affect plasma fibrinogen levels
71Prothrombin G20210A Polymorphism
- Replacement of a G by an A at nucleotide 20210 in the untranslated region of the prothrombin gene increases translation without altering transcription of the gene.
- This results in elevated synthesis and secretion of prothrombin by the liver.
- This results in increased thrombin levels
72Activated protein C resistance
- The Factor V Leiden R506Q mutation occurs in 8% of the population.
- It is a G→A substitution at nucleotide 1691 in the gene for factor V.
- The mutant factor V is cleaved less efficiently by activated protein C
- Results in deep vein thrombosis, early kidney transplant loss, recurrent miscarriages and other disorders