Pharmacogenomics and Bioinformatics - PowerPoint PPT Presentation

Slides: 73
Provided by: jaf1
Learn more at: http://www.binf.gmu.edu

Transcript and Presenter's Notes

Title: Pharmacogenomics and Bioinformatics


1
Pharmacogenomics and Bioinformatics
  • M. Saleet Jafri

2
What is pharmacogenomics?
  • Pharmacogenomics is the use of genomic and
    sequence data of hosts and pathogens to identify
    potential drug targets
  • Involves a variety of techniques/disciplines such
    as sequence analysis, protein structure,
    genomics, microarray analysis and others
  • These fields rely heavily on bioinformatics
  • Usually focuses on medical or agricultural
    applications

3
Human Genome Project
  • Project goals are to
  • identify all the approximately 20,000-25,000
    genes in human DNA,
  • determine the sequences of the 3 billion chemical
    basepairs that make up human DNA,
  • store this information in databases,
  • improve tools for data analysis,
  • transfer related technologies to the private
    sector, and
  • address the ethical, legal, and social issues
    (ELSI) that may arise from the project.

From http://www.ornl.gov/hgmis/
4
Human Genome Project
  • Progress
  • - Several types of genome maps have already
    been completed, and a working draft of the entire
    human genome sequence was announced in June 2000,
    with analyses published in February 2001.
  • - An important feature of this project is the
    federal government's long-standing dedication to
    the transfer of technology to the private sector.
    By licensing technologies to private companies
    and awarding grants for innovative research, the
    project is catalyzing the multibillion-dollar
    U.S. biotechnology industry and fostering the
    development of new medical applications.

From http://www.ornl.gov/hgmis/
5
Human Genome Project
  • Seven organisms were originally chosen for
    sequencing.
  • E. coli
  • Yeast
  • Fly
  • Worm
  • Arabidopsis
  • Mouse
  • Human
  • Why were these chosen?

6
Genome Projects
  • As of January 2005 there were many more sequenced:
  • 25 non-plant eukaryotes
  • 5 plants
  • 213 microbes completed
  • 21 Archaea
  • 274 microbes in progress
  • 1431 viruses in progress
  • 833 non-virus organisms with at least one
    nucleotide sequence submitted
  • Why were these chosen?

7
Genome Projects
  • Chosen by funding agencies
  • Four main categories
  • Medical applications
  • Evolutionary significance
  • Environmental impact
  • Food production

8
How is genomics used for drug target
identification?
  • The basic idea is to look for genes unique to the
    pathogen that are crucial for its survival. This
    would be the drug target.
  • If the pathogen lives in a host, the gene should
    be present in the pathogen and absent from the
    host.
  • If the pathogen lives in the environment, the
    gene should be as specific as possible to the
    pathogen to avoid harming other organisms that
    might be beneficial.
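
The selection rule on this slide can be sketched as a simple set filter. This is only an illustration: the gene names, the set of essential genes, and the set of genes with host homologs are hypothetical inputs that would in practice come from annotation and sequence comparison.

```python
def candidate_targets(pathogen_genes, essential, host_homologs):
    """Return essential pathogen genes that have no homolog in the host."""
    return sorted(
        g for g in pathogen_genes
        if g in essential and g not in host_homologs
    )

# Hypothetical example data (not from the presentation).
pathogen_genes = {"geneA", "geneB", "geneC", "geneD"}
essential = {"geneA", "geneC", "geneD"}   # needed for pathogen survival
host_homologs = {"geneC"}                 # also found in the host

print(candidate_targets(pathogen_genes, essential, host_homologs))
```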

9
How can this be done?
  • To do this, genomics, proteomics and
    bioinformatics are involved.
  • In any of these cases bioinformatics tools are
    necessary.

10
Genome Sequencing and Comparison
  • As mentioned earlier, many pathogens (viruses,
    bacteria, and other microorganisms) have been
    sequenced.
  • Once they are sequenced, they are annotated.
    Annotation is the process by which the functions
    of the different proteins (genes) are determined.
  • In this way, an understanding of the organism's
    metabolism is gained.

11
Malaria
  • Malaria is caused by the genus Plasmodium, with
    Plasmodium falciparum being the most lethal.
  • Its genome has been sequenced
  • It is a pathogen that digests proteins for food.
    It does not contain any amino acid producing
    genes in its genome, i.e. it does not make its
    own amino acids.
  • Purines are recycled, but there are no genes for
    purine synthesis.
  • Has many ATP-dependent solute transporters and
    one novel multifunctional transporter.

12
How is annotation done?
  • Annotation is the process of predicting the
    function of genes in a genome.
  • First all the genes have to be found. This is
    done by finding the open reading frames (ORFs)
    with gene finding or gene prediction software.

13
Gene Prediction
  • Analysis by sequence similarity can only reliably
    identify about 30% of the protein-coding genes in
    a genome
  • 50-80% of new genes identified have a partial,
    marginal, or unidentified homolog
  • Frequently expressed genes tend to be more easily
    identifiable by homology than rarely expressed
    genes

14
Gene Finding
  • Process of identifying potential coding regions
    in an uncharacterized region of the genome
  • Still a subject of active research
  • There are many different gene finding software
    packages and no one program is capable of finding
    everything

15
(No Transcript)
16
Eukaryotes vs Prokaryotes
  • Eukaryotic DNA is wrapped around histones, which
    might result in repeated patterns (periodicity
    of 10) for histone binding. The promoter regions
    might be near these sites so that they remain
    hidden.
  • Prokaryotes have no introns.
  • Promoter regions and start sites are more highly
    conserved in prokaryotes
  • Different codon use frequencies

17
Gene finding is species-specific
  • Codon usage patterns vary by species
  • Functional regions (promoters, splice sites,
    translation initiation sites, termination
    signals) vary by species
  • Common repeat sequences are species-specific
  • Gene finding programs rely on this information to
    identify coding regions

18
The genetic code
19
Codon usage
20
Identifying ORFs
  • Simple first step in gene finding
  • Translate genomic sequence in six frames.
    Identify stop codons in each frame
  • Regions without stop codons are called "open
    reading frames" or ORFs
  • Locate and tag all of the likely ORFs in a
    sequence
  • The longest ORF from a Met codon is a good
    prediction of a protein encoding sequence.
  • Software: NCBI ORF Finder
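
The six-frame scan described above can be sketched in a few lines. This is a minimal illustration, not how NCBI ORF Finder is implemented: it assumes the standard stop codons, reports only ORFs that begin at an ATG (Met) codon, and the `min_codons` cutoff is an arbitrary parameter chosen for the example. Coordinates on the "-" strand refer to the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_frame(seq, frame, min_codons):
    """Yield (start, end) of ATG...stop ORFs in one forward reading frame."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i                      # first Met after the previous stop
        elif start is not None and codon in STOP_CODONS:
            if (i - start) // 3 >= min_codons:
                yield start, i + 3         # end is exclusive, includes the stop
            start = None

def find_orfs(seq, min_codons=2):
    """Scan all six frames: three on each strand."""
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                results.append((strand, frame, start, end, s[start:end]))
    return results

for orf in find_orfs("CCATGAAATTTGGGTAACC"):
    print(orf)
```

Because each ORF starts at the first ATG after the previous stop codon, the reported ORF is the longest one from a Met codon in that stretch of the frame.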

21
ORF Finder input
22
ORF finder results
23
Tests of the Predicted ORF
  • Check if the third base in the codons tends to be
    the same one more often than by chance alone.
  • Are the codons used in the ORF the same as those
    used in other genes (need codon usage frequency).
  • Compare the amino acid sequence for similarity
    with other known amino acid sequences.
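
The first two tests can be illustrated by tabulating third-base and codon frequencies for a candidate ORF. This is a sketch only: deciding whether the frequencies are "more often than by chance" or match other genes requires reference codon usage tables for the species, which are not included here, and the example ORF is made up.

```python
from collections import Counter

def codons(orf):
    """Split an ORF into complete codons, dropping any trailing bases."""
    usable = len(orf) - len(orf) % 3
    return [orf[i:i + 3] for i in range(0, usable, 3)]

def third_base_frequencies(orf):
    """Fraction of codons ending in each base."""
    counts = Counter(c[2] for c in codons(orf))
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}

def codon_usage(orf):
    """Relative frequency of each codon that occurs in the ORF."""
    cs = codons(orf)
    return {c: n / len(cs) for c, n in Counter(cs).items()}

orf = "ATGGCCGCGGCCTAA"  # hypothetical ORF
print(third_base_frequencies(orf))
print(codon_usage(orf))
```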

24
Problems with ORF finding
  • A single-character sequencing error can hide a
    stop codon or insert a false stop codon,
    preventing accurate identification of ORFs
  • Short exons can be overlooked
  • Multiple transcripts or ORFs on complementary
    strand can confuse results

25
Pattern-based gene finding
  • ORF finding based on start and stop codon
    frequency is a pattern-based procedure
  • Other pattern-based procedures recognize
    characteristic sequences associated with known
    features and genes, such as ribosome binding
    sites, promoter sites, histone binding sites,
    etc.
  • Statistically based.

26
Content-based gene finding
  • Content-based gene finding methods rely on
    statistical information derived from known
    sequences to predict unknown genes
  • Some evaluative measures include "coding
    potential" (based on codon bias), periodicity in
    the sequence, sequence homogeneity, etc.

27
A standard content-based alignment procedure
  • Select a window of DNA sequence from the unknown.
    The window is usually around 100 base pairs long
  • Evaluate the window's potential as a gene, based
    on a variety of factors
  • Move the window over by one base
  • Repeat procedure until end of sequence is
    reached; report continuous high-scoring regions
    as putative genes
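
The sliding-window procedure can be sketched as follows. Real gene finders score each window with coding-potential measures; here GC fraction is used purely as a stand-in score, and the window size and threshold are arbitrary example values.

```python
def gc_fraction(window):
    """Stand-in score: fraction of G or C bases in the window."""
    return sum(base in "GC" for base in window) / len(window)

def putative_regions(seq, window=100, threshold=0.6, score=gc_fraction):
    """Slide the window one base at a time and merge overlapping
    high-scoring windows into (start, end) regions (end exclusive)."""
    regions = []
    for start in range(len(seq) - window + 1):
        if score(seq[start:start + window]) >= threshold:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1][1] = end          # extend the current region
            else:
                regions.append([start, end])  # begin a new region
    return [tuple(r) for r in regions]

# Toy sequence: a GC-rich stretch (positions 20-39) flanked by AT repeats.
seq = "AT" * 10 + "GC" * 10 + "AT" * 10
print(putative_regions(seq, window=10, threshold=0.8))
```

In this toy run the reported region is slightly wider than the GC-rich stretch itself, which shows why window-based methods leave the exact boundaries uncertain.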

28
Combining measures
  • Programs rarely use one measure to predict genes
  • Different values are combined (using
    probabilistic methods, discriminant analysis,
    neural net methods, etc.) to produce one "score"
    for the entire window

29
Drawbacks to window-based evaluation
  • A sequence length of at least 100 b.p. is
    required before significant information can be
    gained from the analysis
  • Results in a ±100 b.p. uncertainty in the
    start site of predicted coding regions, unless an
    unambiguous pattern can also be found to indicate
    the start.

30
Most are web-based, but...
  • Submit sequence; input sequence length may be
    limited
  • Select parameters, if any
  • Interpret results
  • Most software is first or second generation;
    results come in non-graphical formats.
  • GeneMark, GenScan, Glimmer

31
How is annotation done?
  • This is done by comparing the DNA sequences of
    the genes to known genes in a database. If the
    sequences are similar, then a similar function is
    assumed.
  • The comparison is done using sequence comparison
    tools such as BLAST

32
Database Searching for Similar Sequences
  • Database searching for similar sequences is
    ubiquitous in bioinformatics.
  • Databases are large and getting larger
  • Need fast methods

33
Types of Searches
  • Sequence similarity search with query sequence
  • Alignment search with profile (scoring matrix
    with gap penalties)
  • Search with position-specific scoring matrix
    representing ungapped sequence alignment
  • Iterative alignment search for similar sequences
    that starts with a query sequence, builds a
    multiple alignment, and then uses the alignment
    to augment the search
  • Search query sequence for patterns representative
    of protein families

From Bioinformatics by Mount
34
DNA vs Protein Searches
  • DNA sequences consist of 4 characters
    (nucleotides)
  • Protein sequences consist of 20 characters (amino
    acids)
  • Hence, it is easier to detect patterns in protein
    sequences than DNA sequences
  • Better to convert DNA sequences to protein
    sequences for searches.

35
Database Searching Efficacy
  • To evaluate searching methods, selectivity and
    sensitivity need to be considered.
  • Selectivity is the ability of the method not to
    find members known to be of another group (i.e.
    false positives).
  • Sensitivity is the ability of the method to find
    members of the same protein family as the query
    sequence.
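
One way to put numbers on these two ideas is sketched below. The formulas are an assumption, since the slide gives no definitions: sensitivity is taken as the fraction of true family members that the search finds, and selectivity as the fraction of reported hits that really belong to the family. The protein identifiers are hypothetical.

```python
def sensitivity(hits, family):
    """Fraction of true family members found by the search."""
    return len(hits & family) / len(family)

def selectivity(hits, family):
    """Fraction of reported hits that are true family members
    (fewer false positives gives a higher value)."""
    return len(hits & family) / len(hits)

family = {"p1", "p2", "p3", "p4", "p5"}   # known members of the query's family
hits = {"p1", "p2", "p3", "x1"}           # what a search returned

print(sensitivity(hits, family), selectivity(hits, family))
```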

36
Protein Searches
  • Easier to identify protein families by sequence
    similarity rather than structural similarity.
    (same structure does not mean same sequence)
  • Use the appropriate gap penalty scorings
  • Evaluate results for statistical significance.

37
History
  • Historically dynamic programming was used for
    database sequence similarity searching.
  • Computer memory, disk space, and CPU speed were
    limiting factors.
  • Speed is still a factor due to the larger
    databases and increased number of searches.
  • FASTA and BLAST allow fast searching.

38
History
  • The PAM250 matrix was used for a long time. It
    corresponds to a period of time over which only
    20% of the amino acids have remained unchanged.
  • BLOSUM has replaced PAM250 in most applications.
    BLAST uses the BLOSUM62 matrix. FASTA uses the
    BLOSUM50 matrix.

39
Search Tools
  • Similarity Search Tools
  • Smith-Waterman Searching
  • Heuristic Search Tools
  • FASTA
  • BLAST
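
Smith-Waterman searching is the dynamic programming approach to local alignment. The sketch below computes only the best local alignment score between two sequences. The match, mismatch, and gap values are arbitrary example parameters; a protein search would instead use a substitution matrix such as BLOSUM62 and, usually, affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: a local alignment may restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The work grows with the product of the two sequence lengths, which is why heuristic tools such as FASTA and BLAST are preferred for large databases.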

40
Malaria Vaccine
  • A German and American team used reverse genetics,
    i.e. they used the sequenced genome, deduced the
    candidate genes, and then knocked out a
    particular gene (Uis3).
  • This gives 30-day immunity in mice, which is
    better than vaccines made by traditional methods

41
Microarray Data Analysis
  • Gene chips allow the simultaneous monitoring of
    the expression level of thousands of genes. Many
    statistical and computational methods are used to
    analyze this data. These include
  • statistical hypothesis tests for differential
    expression analysis
  • principal component analysis and other methods
    for visualizing high-dimensional microarray data
  • cluster analysis for grouping together genes or
    samples with similar expression patterns
  • hidden Markov models, neural networks and other
    classifiers for predictively classifying sample
    expression patterns as one of several types
    (diseased, i.e. cancerous, vs. normal)

42
What is Microarray Data?
  • Although microarrays let us simultaneously
    monitor the expression of thousands of genes,
    microarray data has some liabilities. Each
    microarray is very expensive, the statistical
    reproducibility of the data is relatively poor,
    and there are a lot of genes and complex
    interactions in the genome.
  • Microarray data is often arranged in an n x m
    matrix M with rows for the n genes and columns
    for the m biological samples in which gene
    expression has been monitored. Hence, m_ij is the
    expression level of gene i in sample j. A row e_i
    is the gene expression pattern of gene i over all
    the samples. A column s_j holds the expression
    levels of all genes in sample j and is called
    the sample expression pattern.
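
The matrix layout can be illustrated with a small list-of-lists example. The values are invented, and indices are zero-based as is usual in Python.

```python
# Rows are genes, columns are samples (hypothetical expression levels).
M = [
    [2.1, 0.4, 1.3],   # gene 0
    [0.2, 0.9, 0.5],   # gene 1
]

def gene_expression_pattern(M, i):
    """Row i: expression of gene i over all samples."""
    return list(M[i])

def sample_expression_pattern(M, j):
    """Column j: expression of all genes in sample j."""
    return [row[j] for row in M]

print(gene_expression_pattern(M, 1))
print(sample_expression_pattern(M, 2))
```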

43
Types of Microarrays
  • cDNA microarray
  • Nylon membrane and plastic arrays (by Clontech)
  • Oligonucleotide silicon chips (by Affymetrix)
  • Note: Each new version of a microarray chip is
    at least slightly different from the previous
    version. This means that the measures are likely
    to change. This has to be taken into account
    when analyzing data.

44
cDNA Microarray
  • The expression level e_ij of a gene i in sample j
    is expressed as a log ratio, log(r_ij/g_i), of
    its actual expression level r_ij in this sample
    over its expression level g_i in a control.
  • When this data is visualized, e_ij is color coded
    as red (r_ij >> g_i), green (r_ij << g_i), or a
    mixture in between.
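
A short sketch of the log ratio follows. The slide does not state the base of the logarithm; base 2 is assumed here because it makes a doubling or halving of expression read as +1 or -1.

```python
import math

def log_ratio(r, g):
    """Log ratio of expression in the sample (r) over the control (g).
    Base 2 is an assumption; both values must be positive."""
    return math.log2(r / g)

print(log_ratio(8.0, 2.0))   # sample higher than control: positive (red)
print(log_ratio(1.0, 4.0))   # sample lower than control: negative (green)
```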

45
Nylon Membrane and Plastic Arrays (by Clontech)
  • A raw intensity and a background value are
    measured for each gene.
  • The analyst is free to choose the raw intensity
    or can adjust it by subtracting the background
    intensity.

46
Oligonucleotide Silicon Chips (by Affymetrix)
  • These arrays produce a variety of numbers derived
    from 16-20 pairs of perfect match (PM) and
    mismatch (MM) probes.
  • There are several statistics related to gene
    expression that can be derived from this data.
    The most commonly used one is the average
    difference (AVD), which is derived from the
    differences of PM-MM in the 16-20 probe pairs.
  • The next most commonly used method is the log
    absolute value (LAV), which comes from the ratios
    PM/MM in the probe pairs.
  • Note: The Affymetrix gene-chip software has an
    absent/present call for each gene on a chip.
    According to Jagota, the method is complex and
    arbitrary, so it is usually ignored.
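
The two statistics can be sketched from a list of (PM, MM) probe pairs. The exact formulas used by the Affymetrix software are not given in the slides, so this is an assumption: the average difference is taken as the mean of PM - MM, and the log-based measure as the mean of log2(PM/MM). The probe intensities are invented.

```python
import math

def average_difference(pairs):
    """Mean of PM - MM over the probe pairs."""
    return sum(pm - mm for pm, mm in pairs) / len(pairs)

def mean_log_ratio(pairs):
    """Mean of log2(PM / MM) over the probe pairs (assumed form of the
    ratio-based statistic; all intensities must be positive)."""
    return sum(math.log2(pm / mm) for pm, mm in pairs) / len(pairs)

pairs = [(40.0, 10.0), (80.0, 20.0)]   # hypothetical (PM, MM) intensities
print(average_difference(pairs), mean_log_ratio(pairs))
```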

47
For What Do We Use Microarray Data?
  • Genes with similar expression patterns over all
    samples: We can compare the expression patterns
    e_i and e_i' of two genes i and i' over all
    samples.
  • If we use cluster analysis, we can separate the
    genes into groups of genes with similar
    expression patterns (trees).
  • This will allow us to find what unknown genes
    have altered expression in a particular disease
    by comparing the pattern to genes known to be
    affiliated with a disease.
  • It can also find genes that fit a certain pattern
    such as a particular pattern of change with time.
  • It can also characterize broad functional classes
    of new genes from the known classes of genes with
    similar expression.

48
For What Do We Use Microarray Data?
  • Genes with unusual expression levels in a sample:
    In contrast to standard statistical methods
    where we ignore outliers, here outliers might
    have particular importance. Hence, we look for
    genes whose expression levels are very different
    from the others.
  • Genes whose expression levels vary across
    samples: We can compare gene expression levels of
    a particular gene or set of genes in different
    samples. This can be used to compare normal and
    diseased tissues or diseased tissue before and
    after treatment.

49
For What Do We Use Microarray Data?
  • Samples that have similar expression patterns:
    We might want to compare the expression patterns
    of all genes between two samples. We might
    cluster the genes into groups with similar
    expression patterns to help with the comparison.
    This can be used to compare normal and diseased
    tissues or diseased tissue before and after
    treatment.
  • Tissues that might be cancerous (diseased): We
    can take the gene expression pattern of a sample
    and compare it to library expression patterns
    that indicate diseased or not diseased tissue.

50
Statistical Methods Can Help
  • Experimental Design: Since using microarrays is
    costly and time consuming, we want to design
    experiments to use the minimal number of
    microarrays that will give a statistically
    significant result.
  • Data Pre-processing: It is sometimes useful to
    preprocess the data prior to visualization. An
    example of this is the log ratio mentioned
    earlier. It is often necessary to rescale data
    from different microarrays so that they can be
    compared. This is due to variation in chip to
    chip intensity. Another type of preprocessing
    is subtracting the mean and dividing by the
    standard deviation.
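
The mean-centering and rescaling step can be sketched as below. As an assumption, this sketch divides by the population standard deviation, the usual convention for standardized scores; the example values are invented.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the population standard deviation.
    Raises ZeroDivisionError if all values are identical."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

pattern = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical expression pattern
print(standardize(pattern))
```

After this transformation every pattern has mean 0 and standard deviation 1, so patterns from chips with different overall intensity can be compared on the same scale.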

51
Statistical Methods Can Help
  • Data Visualization: Principal component analysis
    and multidimensional scaling are two useful
    techniques for reducing multidimensional data to
    two and three dimensions. This allows us to
    visualize it.
  • Cluster Analysis: By associating genes with
    similar expression patterns, we might be able to
    draw conclusions about their functional
    expression.
  • Probability Theory: We can use statistical
    modeling and inference to analyze our data.
    Probability theory is the basis for these.

52
Statistical Methods Can Help
  • Statistical Inference: This is the formulation
    and statistical testing of a hypothesis and
    alternative hypothesis.
  • Classifiers for the Data: We can construct
    classes from data, such as diseased vs.
    non-diseased tissue. We can build a model (such
    as a hidden Markov model) that fits known data
    for the different classes. This can then be used
    to classify previously unclassified data.

53
Preprocessing Microarray Data
  • Before microarray data can be analyzed or stored,
    a number of procedures or transformations must be
    applied to it.
  • In order to analyze the data correctly, it is
    important to understand what the transformations
    might be doing to the data.

54
Preprocessing Microarray Data
  • Ratioing the data
  • Log-transforming ratioed data
  • Alternative to ratioing the data
  • Differencing the data
  • Scaling data across chips to account for
    chip-to-chip difference
  • Zero-centering a gene or a sample expression
    pattern
  • Weighting the components of a gene or sample
    expression pattern differently
  • Handling missing data
  • Variation filtering expression patterns
  • Discretizing expression data

55
Cluster Analysis of Microarray Data
  • Recall that microarray data can be thought of as
    gene expression patterns or sample expression
    patterns. These can be each considered to be
    vectors. The first thing we have to do before
    applying cluster analysis is to find a distance
    between the various expression pattern vectors.
    This is done using similarity/dissimilarity
    measures such as Euclidean distance, Mahalanobis
    distance, or linear correlation coefficients.
    Once a distance matrix is computed, the following
    clustering algorithms can be used. The clusters
    formed can differ significantly depending upon
    the distance measure used.
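
Two of the measures named above can be sketched as follows. Turning the linear correlation coefficient into a dissimilarity as 1 - r is a common convention assumed here, not something the slide specifies, and the vectors are invented.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Linear (Pearson) correlation coefficient of two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sxx * syy)

def correlation_distance(x, y):
    """0 for perfectly correlated patterns, 2 for perfectly anti-correlated."""
    return 1 - pearson(x, y)

e1 = [1.0, 2.0, 3.0]   # hypothetical expression patterns
e2 = [2.0, 4.0, 6.0]   # same shape as e1, twice the magnitude
print(euclidean(e1, e2), correlation_distance(e1, e2))
```

The two patterns are far apart in Euclidean terms but identical under correlation distance, which is one reason the resulting clusters depend on the measure chosen.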

56
Cluster Analysis of Microarray Data
  • Hierarchical Clustering: Assume each data point
    is in a singleton cluster.
  • Find the two clusters that are closest together.
    Combine these to form a new cluster.
  • Compute the distance from all clusters to the new
    cluster using some form of averaging.
  • Find the two closest clusters and repeat.
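
The merging loop above can be sketched directly. This is an unoptimized illustration that assumes Euclidean distance and average linkage as the "form of averaging"; it recomputes distances on every pass rather than maintaining a distance matrix, and the data points are invented.

```python
import math
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average_linkage(c1, c2, points):
    """Mean distance over all pairs of points drawn from the two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(points, n_clusters=1):
    """Merge the two closest clusters until n_clusters remain.
    Clusters are tuples of point indices; merges records the merge order."""
    clusters = [(i,) for i in range(len(points))]   # singleton clusters
    merges = []
    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], points))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b))
    return clusters, merges

points = [(0, 0), (0, 1), (5, 5), (5, 7), (10, 0)]   # hypothetical data
clusters, merges = agglomerate(points, n_clusters=3)
print(sorted(clusters), merges)
```

The recorded merge order is what a dendrogram (tree) of the clustering would display.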

57
Cluster Analysis of Microarray Data
  • k-Means Clustering: An alternate method of
    clustering, called k-means clustering, partitions
    the data into k clusters and finds a cluster mean
    μ_i for each cluster. In our case, the means will
    be vectors also. Usually, the number of clusters
    k is fixed in advance. To choose k, something
    must be known about the data. There might be a
    range of possible k values. To decide which is
    best, optimize a quantity that measures cluster
    tightness, i.e. minimize the distances between
    points in a cluster.
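
A minimal k-means sketch follows. To keep it deterministic it starts from the first k points as initial means, which is an assumption made for the example; practical implementations usually use random or smarter initialization and several restarts. The `tightness` function is one possible quantity for comparing different values of k.

```python
def dist2(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(points, k, iterations=100):
    means = [list(p) for p in points[:k]]   # assumption: first k points as seeds
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest mean.
        new_assignment = [min(range(k), key=lambda c: dist2(p, means[c]))
                          for p in points]
        if new_assignment == assignment:
            break                            # converged
        assignment = new_assignment
        # Recompute each mean from its members (empty clusters keep their mean).
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                means[c] = [sum(col) / len(members) for col in zip(*members)]
    return means, assignment

def tightness(points, means, assignment):
    """Within-cluster sum of squared distances; smaller means tighter clusters."""
    return sum(dist2(p, means[a]) for p, a in zip(points, assignment))

points = [(0.0, 0.0), (10.0, 10.0), (0.0, 2.0), (10.0, 12.0)]   # hypothetical data
means, assignment = kmeans(points, k=2)
print(means, assignment, tightness(points, means, assignment))
```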

58
Cluster Analysis of Microarray Data
  • Self-organizing Maps: This is basically an
    application of neural networks to microarray
    data. Assume that there is a 2-dimensional grid
    of cells and a map from a given set of expression
    data vectors in R^n, i.e. there are n nodes in
    the input layer and a connection from each of
    these to each cell. Each cell (i, j) gets its own
    weight vector from the n input neurons. The
    weight vector m_ij is the mean of the cluster
    associated with cell (i, j). Each data vector d
    gets mapped to the cell (i, j) whose weight
    vector is closest to d using Euclidean distance.
    In order to train the network, the mean vectors
    m_ij for the cells (i, j) must be learned.
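
The mapping and one learning step can be sketched as below. This is a simplified illustration under stated assumptions: the learning rate is fixed, the neighbourhood is every cell within a given grid (Manhattan) radius of the winner, and the starting weights are invented. Full self-organizing map training repeats such steps over all data vectors while shrinking the rate and the radius.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def best_matching_cell(weights, d):
    """Cell (i, j) whose weight vector m_ij is closest to data vector d."""
    return min(weights, key=lambda cell: euclidean(weights[cell], d))

def train_step(weights, d, rate=0.5, radius=1):
    """Move the winning cell and its grid neighbours toward d."""
    bi, bj = best_matching_cell(weights, d)
    for (i, j), m in list(weights.items()):
        if abs(i - bi) + abs(j - bj) <= radius:
            weights[(i, j)] = [w + rate * (x - w) for w, x in zip(m, d)]
    return bi, bj

# Hypothetical 2 x 2 grid with 2-dimensional weight vectors.
weights = {
    (0, 0): [0.0, 0.0], (0, 1): [1.0, 1.0],
    (1, 0): [4.0, 4.0], (1, 1): [8.0, 8.0],
}
print(train_step(weights, [8.0, 6.0]), weights)
```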

59
Sample Microarray
60
Correlations
61
Clustering of Genes
62
Personalized Medicine
  • There is a new buzzword called personalized
    medicine.
  • The idea is to develop medicine and a treatment
    plan based on an individual's genetic make-up.

63
Proteomics
  • Understanding protein function
  • Functional genomics
  • Multiple approaches: structure, expression
    levels, biochemistry, modeling etc.
  • Combining technologies is necessary to understand
    in vivo protein function

64
Approach
  • Use data to determine pathway.
  • Use biochemistry to figure out kinetics and
    concentrations.
  • Use new proteomic approaches to determine
    relative concentrations.
  • Apply pathway model to determine functional
    consequence.

65
Pathway Data
  • Using molecular biological techniques we can
    determine what proteins make up a biochemical
    pathway.

(Diagram with nodes A, B, C, D)
66
Pathways
  • Biochemical Pathways form complex biochemical
    reaction networks.
  • There might be multiple ways to get from A to B.
  • The path chosen depends on biochemical kinetics.

67
Biochemistry
  • Classical biochemistry isolates proteins from
    tissue or cells.
  • Modern molecular biology allows the production of
    purified protein.
  • The concentration of the protein is determined
  • The kinetic properties of the proteins are
    determined by biochemical assay: rates of
    reactions, modulating factors, etc.

68
Pathway Modeling Methods
  • Boolean Models
  • Metabolic Control Theory Flux Balance Analysis
  • Biochemical Systems Analysis
  • Kinetic Modeling Approach

69
Disorders of Thrombophilia
  • The functional consequences of nonsynonymous SNPs
    can be predicted by comparison of protein
    structures.
  • There are various SNPs known
  • Activated protein C resistance by Arg 506 to Gln
  • Prothrombin polymorphism (G20210A) causing
    elevated prothrombin levels
  • Protein C deficiency
  • Protein S deficiency
  • Antithrombin deficiency
  • Elevated factor VIII levels

70
Fibrinogen Abnormalities
  • Various polymorphisms found in the long arm of
    chromosome 4
  • Two dimorphisms of the β-chain gene are of major
    importance and in linkage disequilibrium with
    each other.
  • These affect plasma fibrinogen levels

71
Prothrombin G20210A Polymorphism
  • Replacement of a G by A at nucleotide 20210 in
    the untranslated section of the prothrombin gene
    increases translation without altering
    transcription of the gene.
  • This results in elevated synthesis and secretion
    of prothrombin by the liver.
  • This results in increased thrombin levels

72
Activated protein C resistance
  • Factor V Leiden R506Q mutation occurs in 8% of
    the population.
  • It is a G→A substitution at nucleotide 1691 in
    the gene for factor V.
  • Factor V is cleaved less efficiently by activated
    protein C
  • Results in deep vein thrombosis, early kidney
    transplant loss, recurrent miscarriages and other
    disorders