Pharmacogenomics and Bioinformatics

About This Presentation

Title:

Pharmacogenomics and Bioinformatics

Slides: 73

Provided by: jaf1

Learn more at: http://www.binf.gmu.edu

Transcript and Presenter's Notes

1
Pharmacogenomics and Bioinformatics

M. Saleet Jafri

2
What is pharmacogenomics?

Pharmacogenomics is the use of genomic and
sequence data of hosts and pathogens to identify
potential drug targets
Involves a variety of techniques/disciplines such
as sequence analysis, protein structure,
genomics, microarray analysis and others
These fields rely heavily on bioinformatics
Usually focuses on medical or agricultural
applications

3
Human Genome Project

Project goals are to
identify all the approximately 20,000-25,000
genes in human DNA,
determine the sequences of the 3 billion chemical
basepairs that make up human DNA,
store this information in databases,
improve tools for data analysis,
transfer related technologies to the private
sector, and
address the ethical, legal, and social issues
(ELSI) that may arise from the project.

From http://www.ornl.gov/hgmis/
4
Human Genome Project

Progress
- Several types of genome maps have already
been completed, and a working draft of the entire
human genome sequence was announced in June 2000,
with analyses published in February 2001.
- An important feature of this project is the
federal government's long-standing dedication to
the transfer of technology to the private sector.
By licensing technologies to private companies
and awarding grants for innovative research, the
project is catalyzing the multibillion-dollar
U.S. biotechnology industry and fostering the
development of new medical applications.

From http://www.ornl.gov/hgmis/
5
Human Genome Project

Seven organisms were originally chosen for
sequencing.
E. coli
Yeast
Fly
Worm
Arabidopsis
Mouse
Human
Why were these chosen?

6
Genome Projects

As of January 2005 there were many more sequenced:
25 non-plant eukaryotes
5 plants
213 microbes completed
21 Archaea
274 microbes in progress
1431 viruses in progress
833 non-virus organisms with at least one
nucleotide sequence submitted
Why were these chosen?

7
Genome Projects

Chosen by funding agencies
Four main categories
Medical applications
Evolutionary significance
Environmental impact
Food production

8
How is genomics used for drug target
identification?

The basic idea is to look for genes unique to the
pathogen that are crucial for its survival. Such
a gene would be the drug target.
If the pathogen lives in a host, the gene should
be present in the pathogen and not in the host.
If the pathogen lives in the environment, the
gene should be as specific as possible to the
pathogen to avoid harming other organisms that
might be beneficial.

9
How can this be done?

To do this, genomics, proteomics and
bioinformatics are involved.
In any of these cases bioinformatics tools are
necessary.

10
Genome Sequencing and Comparison

As mentioned earlier, many pathogens (viruses,
bacteria, and other microorganisms) have been
sequenced.
Once they are sequenced, they are annotated.
Annotation is the process by which the functions
of the different proteins (genes) are determined.
In this way, an understanding of the organism's
metabolism is gained.

11
Malaria

Malaria is caused by parasites of the genus
Plasmodium, with Plasmodium falciparum being the
most lethal.
Its genome has been sequenced.
It is a pathogen that digests proteins for food.
It does not contain any amino acid producing
genes in its genome, i.e. it does not make its
own amino acids.
Purines are recycled, but there are no genes for
purine synthesis.
Has many ATP-dependent solute transporters and
one novel multifunctional transporter.

12
How is annotation done?

Annotation is the process of predicting the
function of genes in a genome.
First, all the genes have to be found. This is
done by finding the open reading frames (ORFs),
using gene finding or gene prediction software.

13
Gene Prediction

Analysis by sequence similarity can only reliably
identify about 30% of the protein-coding genes in
a genome
50-80% of new genes identified have a partial,
marginal, or unidentified homolog
Frequently expressed genes tend to be more easily
identifiable by homology than rarely expressed
genes

14
Gene Finding

Process of identifying potential coding regions
in an uncharacterized region of the genome
Still a subject of active research
There are many different gene finding software
packages and no one program is capable of finding
everything

16
Eukaryotes vs Prokaryotes

Eukaryotic DNA is wrapped around histones, which
might result in repeated patterns (periodicity of
10) for histone binding. The promoter regions
might be near these sites so that they remain
hidden.
Prokaryotes have no introns.
Promoter regions and start sites are more highly
conserved in prokaryotes
Different codon use frequencies

17
Gene finding is species-specific

Codon usage patterns vary by species
Functional regions (promoters, splice sites,
translation initiation sites, termination
signals) vary by species
Common repeat sequences are species-specific
Gene finding programs rely on this information to
identify coding regions

18
The genetic code
19
Codon usage
20
Identifying ORFs

Simple first step in gene finding
Translate genomic sequence in six frames.
Identify stop codons in each frame
Regions without stop codons are called "open
reading frames" or ORFs
Locate and tag all of the likely ORFs in a
sequence
The longest ORF from a Met codon is a good
prediction of a protein encoding sequence.
Software: NCBI ORF Finder
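
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not how NCBI ORF Finder is implemented: it scans the three frames of each strand and reports every stretch that runs from the first ATG after a stop codon to the next in-frame stop. The names find_orfs and min_codons are hypothetical, and only the standard stop codons are assumed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Return (strand, frame, start, end, orf) tuples for all six frames.

    start and end are 0-based offsets into the scanned strand (the reverse
    complement for strand '-'); end is exclusive and includes the stop codon.
    min_codons is an assumed cutoff on the number of codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # first ATG seen since the last stop codon
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))
```

Sorting the returned list by ORF length gives the "longest ORF from a Met codon" candidate mentioned above.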

21
ORF Finder input
22
ORF finder results
23
Tests of the Predicted ORF

Check if the third base in the codons tends to be
the same one more often than by chance alone.
Are the codons used in the ORF the same as those
used in other genes (need codon usage frequency).
Compare the amino acid sequence for similarity
with other known amino acid sequences.
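
The first two tests can be illustrated with simple counting. The sketch below is an assumption-laden example, not a published tool: third_base_frequencies and codon_usage are hypothetical helper names, the input is assumed to be a non-empty ORF already in frame, and comparing the resulting frequencies against a species-specific codon usage table is left to the reader.

```python
from collections import Counter


def split_codons(orf):
    """Split an in-frame sequence into complete codons (a trailing partial codon is dropped)."""
    usable = len(orf) - len(orf) % 3
    return [orf[i:i + 3].upper() for i in range(0, usable, 3)]


def third_base_frequencies(orf):
    """Fraction of codons ending in each base; a strong skew hints at coding sequence."""
    codons = split_codons(orf)
    counts = Counter(codon[2] for codon in codons)
    return {base: counts[base] / len(codons) for base in "ACGT"}


def codon_usage(orf):
    """Relative frequency of each codon observed in the ORF."""
    codons = split_codons(orf)
    counts = Counter(codons)
    return {codon: n / len(codons) for codon, n in counts.items()}


print(third_base_frequencies("ATGGCGGAGTTG"))
print(codon_usage("ATGATGTAA"))
```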

24
Problems with ORF finding

A single-character sequencing error can hide a
stop codon or insert a false stop codon,
preventing accurate identification of ORFs
Short exons can be overlooked
Multiple transcripts or ORFs on complementary
strand can confuse results

25
Pattern-based gene finding

ORF finding based on start and stop codon
frequency is a pattern-based procedure
Other pattern-based procedures recognize
characteristic sequences associated with known
features and genes, such as ribosome binding
sites, promoter sites, histone binding sites,
etc.
Statistically based.

26
Content-based gene finding

Content-based gene finding methods rely on
statistical information derived from known
sequences to predict unknown genes
Some evaluative measures include "coding
potential" (based on codon bias), periodicity in
the sequence, sequence homogeneity, etc.

27
A standard content-based alignment procedure

Select a window of DNA sequence from the unknown.
The window is usually around 100 base pairs long
Evaluate the window's potential as a gene, based
on a variety of factors
Move the window over by one base
Repeat procedure until end of sequence is
reached; report contiguous high-scoring regions
as putative genes
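
A minimal sketch of this window procedure follows. The scoring function here is plain GC fraction, which is only a stand-in assumption: real gene finders score coding potential, periodicity and other measures. The names scan_windows, gc_fraction and the threshold value are hypothetical.

```python
def gc_fraction(window):
    """Stand-in score: fraction of G and C bases in the window."""
    return sum(base in "GC" for base in window) / len(window)


def scan_windows(seq, window=100, threshold=0.6, score=gc_fraction):
    """Slide the window one base at a time and merge high-scoring windows.

    Returns (start, end) regions, end exclusive, covering every window whose
    score is at least the threshold; overlapping windows are merged.
    """
    regions = []
    for start in range(len(seq) - window + 1):
        if score(seq[start:start + window]) >= threshold:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend the current region
            else:
                regions.append((start, end))
    return regions


print(scan_windows("AAAAGGGGAAAA", window=4, threshold=0.75))
```

Any function that maps a window to a number can be passed as score, which is where a combined measure would plug in.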

28
Combining measures

Programs rarely use one measure to predict genes
Different values are combined (using
probabilistic methods, discriminant analysis,
neural net methods, etc.) to produce one "score"
for the entire window

29
Drawbacks to window-based evaluation

A sequence length of at least 100 b.p. is
required before significant information can be
gained from the analysis
Results in a +/- 100 b.p. uncertainty in the
start site of predicted coding regions, unless an
unambiguous pattern can also be found to indicate
the start.

30
Most are web-based, but...

Submit sequence; input sequence length may be
limited
Select parameters, if any
Interpret results
Most software is first or second generation;
results come in non-graphical formats.
Examples: GeneMark, GenScan, Glimmer

31
How is annotation done?

This is done by comparing the DNA sequences of
the genes to known genes in a database. If the
sequences are similar, then a similar function is
assumed.
The comparison is done using sequence comparison
tools such as BLAST

32
Database Searching for Similar Sequences

Database searching for similar sequences is
ubiquitous in bioinformatics.
Databases are large and getting larger
Need fast methods

33
Types of Searches

Sequence similarity search with query sequence
Alignment search with profile (scoring matrix
with gap penalties)
Search with position-specific scoring matrix
representing ungapped sequence alignment
Iterative alignment search for similar sequences
that starts with a query sequence, builds a
multiple alignment, and then uses the alignment
to augment the search
Search query sequence for patterns representative
of protein families

From Bioinformatics by Mount
34
DNA vs Protein Searches

DNA sequences consist of 4 characters
(nucleotides)
Protein sequences consist of 20 characters (amino
acids)
Hence, it is easier to detect patterns in protein
sequences than DNA sequences
Better to convert DNA sequences to protein
sequences for searches.

35
Database Searching Efficacy

To evaluate searching methods, selectivity and
sensitivity need to be considered.
Selectivity is the ability of the method not to
find members known to be of another group (i.e.
false positives).
Sensitivity is the ability of the method to find
members of the same protein family as the query
sequence.
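
As a small worked example, both measures can be computed from counts of search hits. The formulas below are one common reading and are an assumption here: sensitivity as the fraction of true family members that were found, and selectivity as the fraction of reported hits that are true family members. The function names are hypothetical.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of real family members that the search found."""
    return true_positives / (true_positives + false_negatives)


def selectivity(true_positives, false_positives):
    """Fraction of reported hits that really belong to the family (assumed definition)."""
    return true_positives / (true_positives + false_positives)


# Hypothetical search: 8 family members found, 2 missed, 8 unrelated hits reported.
print(sensitivity(8, 2), selectivity(8, 8))
```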

36
Protein Searches

Easier to identify protein families by sequence
similarity rather than structural similarity.
(same structure does not mean same sequence)
Use the appropriate gap penalty scorings
Evaluate results for statistical significance.

37
History

Historically dynamic programming was used for
database sequence similarity searching.
Computer memory, disk space, and CPU speed were
limiting factors.
Speed is still a factor due to larger databases
and the increased number of searches.
FASTA and BLAST allow fast searching.

38
History

The PAM250 matrix was used for a long time. It
corresponds to a period of time where only 20% of
the amino acids have remained unchanged.
BLOSUM has replaced PAM250 in most applications.
BLAST uses the BLOSUM62 matrix. FASTA uses the
BLOSUM50 matrix.

39
Search Tools

Similarity Search Tools
Smith-Waterman Searching
Heuristic Search Tools
FASTA
BLAST
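
To make the dynamic programming idea behind Smith-Waterman searching concrete, here is a minimal sketch that computes only the best local alignment score of two sequences. The match, mismatch and gap values are illustrative assumptions (a real protein search would use a substitution matrix such as BLOSUM62 and affine gap penalties), and the function name is hypothetical. FASTA and BLAST avoid filling this full table for every database sequence, which is why they are faster.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))
```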

40
Malaria Vaccine

A German and American team used reverse genetics,
i.e. they used the sequenced genome, deduced the
candidate genes, and then knocked out a
particular gene (Uis3).
This gives 30-day immunity in mice, which is
better than vaccines made by traditional methods

41
Microarray Data Analysis

Gene chips allow the simultaneous monitoring of
the expression level of thousands of genes. Many
statistical and computational methods are used to
analyze this data. These include:
statistical hypothesis tests for differential
expression analysis
principal component analysis and other methods
for visualizing high-dimensional microarray data
cluster analysis for grouping together genes or
samples with similar expression patterns
hidden Markov models, neural networks and other
classifiers for predictively classifying sample
expression patterns as one of several types
(diseased, i.e. cancerous, vs. normal)

42
What is Microarray Data?

In spite of the ability to simultaneously monitor
the expression of thousands of genes, there are
some liabilities with microarray data. Each
microarray is very expensive, the statistical
reproducibility of the data is relatively poor,
and there are a lot of genes and complex
interactions in the genome.
Microarray data is often arranged in an n x m
matrix M with rows for the n genes and columns
for the m biological samples in which gene
expression has been monitored. Hence, m_ij is the
expression level of gene i in sample j. A row e_i
is the gene expression pattern of gene i over all
the samples. A column s_j holds the expression
levels of all genes in sample j and is called the
sample expression pattern.
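
The matrix layout can be illustrated with a nested list. The values are made up and the helper names gene_pattern and sample_pattern are hypothetical; indices are 0-based here.

```python
# Rows are genes, columns are samples (invented expression levels).
M = [
    [2.0, 4.0, 8.0],  # gene 0
    [1.0, 1.0, 1.0],  # gene 1
]


def gene_pattern(matrix, i):
    """Row i: expression of gene i across all samples."""
    return list(matrix[i])


def sample_pattern(matrix, j):
    """Column j: expression of all genes in sample j."""
    return [row[j] for row in matrix]


print(gene_pattern(M, 0))
print(sample_pattern(M, 2))
```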

43
Types of Microarrays

cDNA microarray
Nylon membrane and plastic arrays (by Clontech)
Oligonucleotide silicon chips (by Affymetrix)
Note: Each new version of a microarray chip is
at least slightly different from the previous
version. This means that the measures are likely
to change. This has to be taken into account
when analyzing data.

44
cDNA Microarray

The expression level e_ij of a gene i in sample j
is expressed as a log ratio, log(r_ij/g_i), of
its actual expression level r_ij in this sample
to its expression level g_i in a control.
When this data is visualized, e_ij is color coded
as red (r_ij >> g_i), green (r_ij << g_i), or a
mixture in between.
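
A short worked example of the log ratio follows. The base of the logarithm is not stated above; base 2 is assumed here because it makes a doubling equal +1 and a halving equal -1. The function name log_ratio is hypothetical.

```python
import math


def log_ratio(sample_level, control_level):
    """Log2 of the sample expression level over the control level (base 2 is an assumption)."""
    return math.log2(sample_level / control_level)


print(log_ratio(8.0, 2.0))  # positive: higher in the sample than in the control
print(log_ratio(1.0, 4.0))  # negative: lower in the sample than in the control
```

One reason to take the log is symmetry: a fourfold increase and a fourfold decrease give values of equal size and opposite sign.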

45
Nylon Membrane and Plastic Arrays (by Clontech)

A raw intensity and a background value are
measured for each gene.
The analyst is free to choose the raw intensity
or can adjust it by subtracting the background
intensity.

46
Oligonucleotide Silicon Chips (by Affymetrix)

These arrays produce a variety of numbers derived
from 16-20 pairs of perfect match (PM) and
mismatch (MM) probes.
There are several statistics related to gene
expression that can be derived from this data.
The most commonly used one is the average
difference (AVD), which is derived from the
differences of PM-MM in the 16-20 probe pairs.
The next most commonly used method is the log
absolute value (LAV), which comes from the ratios
PM/MM in the probe pairs.
Note: The Affymetrix gene-chip software has an
absent/present call for each gene on a chip.
According to Jagota, the method is complex and
arbitrary so they usually ignore it.
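
As a rough illustration of the average difference, the sketch below takes the plain mean of PM-MM over the probe pairs. This is a simplifying assumption: the vendor software may trim outlier pairs before averaging. The intensities are invented and the function name is hypothetical. The LAV statistic is not sketched because its exact formula is not given here.

```python
def average_difference(pm, mm):
    """Mean of PM - MM over the probe pairs for one gene (no outlier trimming)."""
    if len(pm) != len(mm):
        raise ValueError("PM and MM lists must have the same length")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)


# Three invented probe pairs instead of the usual 16-20.
print(average_difference([10.0, 20.0, 30.0], [5.0, 10.0, 15.0]))
```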

47
For What Do We Use Microarray Data?

Genes with similar expression patterns over all
samples: We can compare the expression patterns
e_i and e_i' of two genes i and i' over all
samples.
If we use cluster analysis, we can separate the
genes into groups of genes with similar
expression patterns (trees).
This will allow us to find which unknown genes
have altered expression in a particular disease
by comparing their patterns to those of genes
known to be affiliated with the disease.
It can also find genes that fit a certain pattern
such as a particular pattern of change with time.
It can also characterize broad functional classes
of new genes from the known classes of genes with
similar expression.

48
For What Do We Use Microarray Data?

Genes with unusual expression levels in a sample:
In contrast to standard statistical methods
where we ignore outliers, here outliers might
have particular importance. Hence, we look for
genes whose expression levels are very different
from the others.
Genes whose expression levels vary across
samples: We can compare gene expression levels of
a particular gene or set of genes in different
samples. This can be used to compare normal
and diseased tissues or diseased tissue before
and after treatment.

49
For What Do We Use Microarray Data?

Samples that have similar expression patterns:
We might want to compare the expression patterns
of all genes between two samples. We might
cluster the genes into groups of genes with
similar expression patterns to help with the
comparison. This can be used to compare normal
and diseased tissues or diseased tissue before
and after treatment.
Tissues that might be cancerous (diseased): We
can take the gene expression pattern of a sample
and compare it to library expression patterns
that indicate diseased or not diseased tissue.

50
Statistical Methods Can Help

Experimental Design: Since using microarrays is
costly and time consuming, we want to design
experiments to use the minimal number of
microarrays that will give a statistically
significant result.
Data Pre-processing: It is sometimes useful to
preprocess the data prior to visualization. An
example of this is the log ratio mentioned
earlier. It is often necessary to rescale data
from different microarrays so that they can be
compared. This is due to variation in chip to
chip intensity. Another type of preprocessing
is subtracting the mean and dividing by the
variance.

51
Statistical Methods Can Help

Data Visualization: Principal component analysis
and multidimensional scaling are two useful
techniques for reducing multidimensional data to
two and three dimensions. This allows us to
visualize it.
Cluster Analysis: By associating genes with
similar expression patterns, we might be able to
draw conclusions about their functional
expression.
Probability Theory: We can use statistical
modeling and inference to analyze our data.
Probability theory is the basis for these.

52
Statistical Methods Can Help

Statistical Inference: This is the formulation
and statistical testing of a hypothesis and
alternative hypothesis.
Classifiers for the Data: We can construct
classes from data, such as diseased vs.
non-diseased tissue. We can build a model (such
as a hidden Markov model) that fits known data
for the different classes. This can then be used
to classify previously unclassified data.

53
Preprocessing Microarray Data

Before microarray data can be analyzed or stored,
a number of procedures or transformations must be
applied to it.
In order to analyze the data correctly, it is
important to understand what the transformations
might be doing to the data.

54
Preprocessing Microarray Data

Ratioing the data
Log-transforming ratioed data
Alternative to ratioing the data
Differencing the data
Scaling data across chips to account for
chip-to-chip difference
Zero-centering a gene on a sample expression
pattern
Weighting the components of a gene or sample
expression pattern differently
Handling missing data
Variation filtering expression patterns
Discretizing expression data

55
Cluster Analysis of Microarray Data

Recall that microarray data can be thought of as
gene expression patterns or sample expression
patterns. These can be each considered to be
vectors. The first thing we have to do before
applying cluster analysis is to find a distance
between the various expression pattern vectors.
This is done using similarity/dissimilarity
measures such as Euclidean distance, Mahalanobis
distance, or linear correlation coefficients.
Once a distance matrix is computed, the following
clustering algorithms can be used. The clusters
formed can differ significantly depending upon
the distance measure used.
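
The effect of the choice of measure can be seen in a small example. The sketch below implements Euclidean distance and a correlation-based distance (1 minus the Pearson coefficient, one common convention assumed here); Mahalanobis distance is omitted because it needs a covariance matrix. The two invented patterns have the same shape at different scales, so they are far apart by one measure and identical by the other.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two expression pattern vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pearson(u, v):
    """Linear correlation coefficient; assumes neither vector is constant."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def correlation_distance(u, v):
    """0 for perfectly correlated patterns, 2 for perfectly anti-correlated ones."""
    return 1.0 - pearson(u, v)


low, high = [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]
print(euclidean(low, high), correlation_distance(low, high))
```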

56
Cluster Analysis of Microarray Data

Hierarchical Clustering: Assume each data point
is in a singleton cluster.
Find the two clusters that are closest together.
Combine these to form a new cluster.
Compute the distance from all clusters to the new
cluster using some form of averaging.
Find the two closest clusters and repeat.
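
The loop above can be sketched in plain Python. This version assumes Euclidean distance and average linkage (the mean of all pairwise distances between two clusters), and it recomputes distances from the original points instead of updating a distance matrix, which is simpler but slower. The function names are hypothetical.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hierarchical_clustering(points):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples.

    Clusters are tuples of indices into points; the history defines the tree.
    """
    clusters = [(i,) for i in range(len(points))]  # start with singletons
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage(clusters[x], clusters[y], points)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return merges


print(hierarchical_clustering([[0.0], [1.0], [10.0]]))
```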

57
Cluster Analysis of Microarray Data

k-Means Clustering: An alternate method of
clustering, called k-means clustering, partitions
the data into k clusters and finds a cluster mean
μ_i for each cluster. In our case, the means will
be vectors also. Usually, the number of clusters
k is fixed in advance. To choose k, something
must be known about the data. There might be a
range of possible k values. To decide which is
best, optimize a quantity that measures cluster
tightness, i.e. minimize the distances between
points in a cluster.
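
A minimal sketch of k-means follows, using the usual alternation between assigning each vector to its nearest mean and recomputing the means. Taking the first k data vectors as the initial means is a simplification assumed here so the example is deterministic; practical implementations usually use random or smarter initialization. The function names are hypothetical.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def k_means(data, k, iterations=100):
    """Return (means, assignment) where assignment[i] is the cluster of data[i]."""
    means = [list(v) for v in data[:k]]  # assumed initialization: first k vectors
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(v, means[c])) for v in data
        ]
        if new_assignment == assignment:
            break  # no vector changed cluster, so the means are stable
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(data, assignment) if a == c]
            if members:
                means[c] = mean_vector(members)
    return means, assignment


print(k_means([[0.0], [10.0], [1.0], [11.0]], k=2))
```

Running this for several values of k and comparing the summed within-cluster squared distances is one way to compare cluster tightness.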

58
Cluster Analysis of Microarray Data

Self-organizing Maps: This is basically an
application of neural networks to microarray
data. Assume that there is a 2-dimensional grid
of cells and a map from a given set of expression
data vectors in R^n, i.e., there are n nodes in
the input layer and a connection from each of
these to each cell. Each cell (i, j) gets its own
weight vector from the n input neurons. The
weight vector m_ij is the mean of the cluster
associated with cell (i, j). Each data vector d
gets mapped to the cell (i, j) whose weight
vector is closest to d using Euclidean distance.
In order to train the network, the mean vectors
m_ij for the cells (i, j) must be learned.
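
The mapping and one learning step can be sketched as follows. This is a deliberately reduced illustration: a full self-organizing map also moves the weight vectors of neighboring cells and shrinks the learning rate over time, and both are left out here. The dictionary layout, function names and learning rate are assumptions.

```python
def best_matching_cell(d, weights):
    """Return the grid cell (i, j) whose weight vector is closest to d."""
    return min(
        weights,
        key=lambda cell: sum((a - b) ** 2 for a, b in zip(d, weights[cell])),
    )


def train_step(d, weights, learning_rate=0.5):
    """Move the winning cell's weight vector toward d (no neighborhood update)."""
    cell = best_matching_cell(d, weights)
    weights[cell] = [w + learning_rate * (x - w) for w, x in zip(weights[cell], d)]
    return cell


# A 1 x 2 grid with invented starting weights in R^2.
weights = {(0, 0): [0.0, 0.0], (0, 1): [10.0, 10.0]}
print(train_step([2.0, 2.0], weights), weights)
```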

59
Sample Microarray
60
Correlations
61
Clustering of Genes
62
Personalized Medicine

There is a new buzzword called personalized
medicine.
The idea is to develop medicines and treatment
plans based on an individual's genetic make-up.

63
Proteomics

Understanding protein function
Functional genomics
Multiple approaches: structure, expression
levels, biochemistry, modeling etc.
Combining technologies is necessary to understand
in vivo protein function

64
Approach

Use data to determine pathway.
Use biochemistry to figure out kinetics and
concentrations.
Use new proteomic approaches to determine
relative concentrations.
Apply pathway model to determine functional
consequence.

65
Pathway Data

Using molecular biological techniques we can
determine what proteins make up a biochemical
pathway.

(Diagram: pathway components A, B, C, D)
66
Pathways

Biochemical Pathways form complex biochemical
reaction networks.
There might be multiple ways to get from A to B.
The path chosen depends on biochemical kinetics.

67
Biochemistry

Classical biochemistry isolates proteins from
tissue or cells.
Modern molecular biology allows the production of
purified protein.
The concentration of the protein is determined.
The kinetic properties of the proteins are
determined by biochemical assay: rates of
reactions, modulating factors, etc.

68
Pathway Modeling Methods

Boolean Models
Metabolic Control Theory
Flux Balance Analysis
Biochemical Systems Analysis
Kinetic Modeling Approach

69
Disorders of Thrombophilia

The functional consequences of nonsynonymous SNPs
can be predicted by comparison of protein
structures.
There are various SNPs known:
Activated protein C resistance by Arg 506 to Gln
Prothrombin polymorphism (G20210A) causing
elevated prothrombin levels
Protein C deficiency
Protein S deficiency
Antithrombin deficiency
Elevated factor VIII levels

70
Fibrinogen Abnormalities

Various polymorphisms found in the long arm of
chromosome 4
Two dimorphisms of the β-chain gene are of major
importance and in linkage disequilibrium with
each other.
These affect plasma fibrinogen levels

71
Prothrombin G20210A Polymorphism

Replacement of a G by A at nucleotide 20210 in
the 3' untranslated region of the prothrombin gene
increases translation without altering
transcription of the gene.
This results in elevated synthesis and secretion
of prothrombin by the liver.
This results in increased thrombin levels

72
Activated protein C deficiency

Factor V Leiden R506Q mutation occurs in 8% of
the population.
It is a G→A substitution at nucleotide 1691 in
the gene for factor V.
Factor V is cleaved less efficiently by activated
protein C
Results in deep vein thrombosis, early kidney
transplant loss, recurrent miscarriages and other
disorders
