Basic Introduction to Microarrays - PowerPoint PPT Presentation

Title: Basic Introduction to Microarrays


1
Basic Introduction to Microarrays
  • Chitta Baral
  • Arizona State University
  • Feb 3, 2003

2
Basics of Microarrays
  • From "The magic of microarrays", S. Friend and R.
    Stoughton (Scientific American, Feb 2002)
  • A microarray has thousands of spots (or array
    elements)
  • Each spot has thousands to millions of copies of
    a particular single-stranded DNA sequence
    representing a particular gene.
  • Thousands of genes are each assigned a unique
    spot.
  • From two cell samples (say, one treated with a
    drug and one untreated), collect mRNA, make more
    stable cDNA from it, and add fluorescent labels
    (green for untreated, red for treated).
  • Apply the labeled cDNAs to the chip.
  • Binding occurs when cDNA from a sample finds its
    complementary sequence of bases on the chip.
  • Such binding means that the gene represented by
    the chip DNA was active or expressed in the
    sample.
  • Put the chip in a scanner.
  • Calculate the ratio of red to green at each spot
    and generate a color coded readout.
  • Red: gene whose activity strongly increased in
    treated cells
  • Green: gene whose activity strongly decreased in
    treated cells
  • Yellow: gene that was equally active in treated
    and untreated cells
  • Black: gene that was inactive in both groups
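
The color-coded readout above can be sketched in a few lines. This is only an illustration, assuming background-corrected intensities are already available for each spot; the function name `spot_color`, the thresholds `floor` and `fold`, and the sample intensities are made up for the example and do not come from the slides.

```python
import math

def spot_color(red, green, floor=50.0, fold=2.0):
    """Classify one spot from its red (treated) and green (untreated) intensity.

    floor: intensity below which a channel counts as inactive (assumed value).
    fold:  ratio that counts as a strong change (assumed value).
    """
    if red < floor and green < floor:
        return "black"  # inactive in both samples
    # Clamp to 1.0 so the ratio is defined even when one channel is zero.
    log_ratio = math.log2(max(red, 1.0) / max(green, 1.0))
    if log_ratio >= math.log2(fold):
        return "red"    # strongly increased in treated cells
    if log_ratio <= -math.log2(fold):
        return "green"  # strongly decreased in treated cells
    return "yellow"     # about equally active in both samples

# Hypothetical (red, green) intensities for four spots.
spots = {
    "geneA": (4000, 500),
    "geneB": (300, 2400),
    "geneC": (1500, 1400),
    "geneD": (10, 20),
}
readout = {gene: spot_color(r, g) for gene, (r, g) in spots.items()}
print(readout)
```

Working on a log scale treats a two-fold increase and a two-fold decrease symmetrically, which is why the sketch compares the log ratio against plus and minus the same threshold.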

3
Determining the impact of a new drug
  • Suppose the last experiment was about determining
    quickly whether a potential new drug is likely to
    harm the liver.
  • Do the red or green genes correspond to gene
    functions that reflect liver damage? I.e., do
    those genes make proteins whose concentration
    (high for the red ones, low for the green ones)
    reflects liver damage?
  • Or compare the overall expression pattern with
    the patterns produced when cells react to known
    liver toxins.
  • Close similarity would indicate that the new drug
    is probably toxic as well.

4
Application of comparative cDNA hybridization
  • Tissue-specific genes
  • Comparative hybridization experiments (CHE) can
    reveal genes which are preferentially expressed
    in specific tissues. Some of the genes implement
    the behaviors that distinguish the cell's tissue
    type, while other controlling genes make sure
    that the cell performs only the functions of its
    type.
  • Regulatory gene defects in cancer
  • CHE can pinpoint the transcription differences
    responsible for the change from normal to
    cancerous cells
  • CHE can distinguish different patterns of
    abnormal transcription in heterogeneous cancers.
  • Cellular responses to the environment
  • CHE can point to genes whose transcription
    changes in response to an environmental stimulus.
  • Temporal studies can also identify the order of
    changes providing evidence about which genes
    control the response directly and which are only
    indirectly affected by it.
  • Cell cycle variations
  • CHE can be used to distinguish genes that are
    expressed at different times in the cell cycle.
    Thus pathways responsible for basic life
    processes can be uncovered.

5
Microarray applications and uses
  • Characterization of the temporal order of gene
    expression within a cell
  • Determination of the cellular location of gene
    products
  • Prediction of the function of resulting proteins
  • Prediction of how perturbing the cellular
    environment (by changing environmental conditions
    or administering drugs) affects the program of
    gene expression.

6
Computational challenges: interpreting the
scanned image
  • The end point of a CHE is a scanned array image.
  • Intensities can be quantified by measuring the
    average or integrated intensities of the spots.
  • The measurements are subject to noise from
    irregular spots, dust on the slide, and
    nonspecific hybridization.
  • Deciding the threshold (between spots and
    background) can be difficult, especially when the
    spots fade gradually around the edges.
  • Detection efficiency may not be uniform across
    the slide, leading to excessive red intensity on
    one side and excessive green on the other.
  • The ratio of fluorescent intensities for a spot
    is interpreted as the ratio of concentrations of
    its corresponding mRNA in the two cell
    populations.
  • Low levels of cDNA due to reverse transcription
    bias, sample loss, or an inherently rare mRNA can
    cause large uncertainties in these ratios.
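
To see why low signal levels make the ratios unreliable, here is a small worked sketch. It assumes a simple worst-case model in which each channel may be off by the same absolute amount `noise`; the function name `ratio_bounds` and all numbers are hypothetical, and real error models are more involved.

```python
def ratio_bounds(red, green, noise):
    """Worst-case range of red/green when each intensity may be off by +/- noise."""
    if green <= noise:
        raise ValueError("green intensity must exceed the noise level")
    low = (red - noise) / (green + noise)
    high = (red + noise) / (green - noise)
    return low, high

# Both spots have a true ratio of 2, with the same absolute noise of 10 units.
bright = ratio_bounds(2000, 1000, 10)
faint = ratio_bounds(40, 20, 10)
print("bright spot:", bright)
print("faint spot:", faint)
```

For the bright spot the ratio stays within a few percent of 2, while for the faint spot the same absolute noise stretches the possible range from 1.0 to 5.0.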

7
A typical Microarray data set
  • Includes expression levels for thousands of genes
    across hundreds of conditions, such as:
  • From cells of different cell lines
  • From cells under different conditions
  • Pathological tissue specimens from different
    patients
  • Serial time points following a stimulus to a cell
    or organism.
  • Imagine a 2D array of measurements
  • Rows: measurements associated with individual
    genes
  • Columns: measurements associated with conditions
  • Profile: the list of measurements along each row
    or column
  • Features: individual expression measurements
    within each profile (some features are more
    valuable than others, and sometimes focusing on a
    subset improves results)
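
The 2D layout described above can be sketched with plain Python lists. The gene names, condition names, and values below are invented for illustration, and the helper names `gene_profile` and `condition_profile` are assumptions, not part of any particular tool.

```python
# Rows are genes, columns are conditions (made-up log ratios).
genes = ["gene1", "gene2", "gene3"]
conditions = ["cell_line_A", "cell_line_B", "t=1h", "t=2h"]
matrix = [
    [1.2, 0.8, -0.1, 0.3],
    [-0.5, -0.7, 2.1, 1.9],
    [0.0, 0.1, 0.0, -0.2],
]

def gene_profile(name):
    """Profile along a row: one gene measured across all conditions."""
    return matrix[genes.index(name)]

def condition_profile(name):
    """Profile along a column: all genes measured under one condition."""
    j = conditions.index(name)
    return [row[j] for row in matrix]

# Transposing swaps the roles of profiles and features.
transposed = [list(col) for col in zip(*matrix)]

print(gene_profile("gene2"))
print(condition_profile("t=1h"))
print(len(transposed), "condition profiles with", len(transposed[0]), "features each")
```

Each entry of a gene profile is a feature of that gene, and each entry of a condition profile is a feature of that condition.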

8
Analyzing microarray data
  • Any dataset can be analyzed in 2 ways.
  • E.g., 47 expression profiles of 4026 genes
    collected from lymphoma specimens (Nature 403,
    503-511, 2000) can be viewed as:
  • 47 cancer profiles with 4026 available features
  • 4026 gene profiles with 47 available features
  • Supervised and unsupervised methods
  • Supervised: the genes or conditions are
    associated with labels coming from outside the
    experiment, which provide information about a
    preexisting classification. The information may
    include knowledge of gene function or regulation,
    disease subtype, or tissue origin of a cell type.
  • Classification information is used to drive the
    analysis
  • Used for predicting accurate labels for new
    genes.
  • Unsupervised: no additional information is used
  • Geared towards the discovery of patterns in the
    data, unbiased by outside knowledge.
  • Used for exploratory tasks.
  • Clustering

9
Unsupervised grouping: clustering
  • Goal: simplify large gene expression data sets
  • Approach: group similar profiles together based
    on a distance metric (a formula for calculating
    the similarity of two profiles)
  • Distance metrics: distance between two lists of
    numbers
  • Euclidean distance (square root of the sum of
    squared differences)
  • Statistical correlation coefficient (-1 to 1)
  • Clustering strategy
  • Hierarchical clustering: calculate the distance
    between individual data points and then group
    together those that are close.
  • Distances between groups are computed and used to
    create groups of groups.
  • Easy to implement, but suffers because the
    decision about where to create branches and in
    what order to present them is often arbitrary.
  • K-means clustering: requires a parameter k (the
    expected number of clusters)
  • Initially cluster centers are selected randomly
  • In each iteration of the algorithm, all of the
    profiles are assigned to clusters whose center
    they are nearest to, and then the cluster center
    is recalculated based on the profiles within the
    cluster.
  • Self-organizing maps
  • Instead of partitioning, they organize the
    clusters into a map where similar clusters are
    close to each other.
  • The number and topological configuration of the
    clusters are pre-specified
  • Cluster centers are recalculated in each
    iteration using both the profiles within the
    cluster as well as the profiles in adjacent
    clusters.
  • Clustering is sensitive to the features used to
    compute the distance metric.
  • Applications: identify co-regulated genes, genes
    with related functions, signatures of individual
    signalling pathways within the data set, etc.

10
Supervised grouping: classification
  • Approach: take known groupings and create rules
    for reliably assigning genes or conditions to
    these groups.
  • E.g., the problem of classifying unknown genes as
    ribosomal or non-ribosomal
  • Success depends on the availability of
    high-quality labeled sets.
  • Examples of methods (Machine learning)
  • Logistic regression
  • Uses the feature values of the different groups
    to estimate the parameters of a predictor
    function (a linear log-likelihood model)
  • Neural networks
  • Use a set of known examples to create a
    multi-layered computational network
  • Linear discriminant analysis
  • Uses the labeled examples from each set of
    classified cases to estimate a probability
    distribution for the values of the features in
    that set.
  • Given a new example, it determines the closest
    distribution and assigns the example to that set.
  • Inductive logic programming
  • Decision trees

11
Dimension reduction
  • Involves removing features from the data set
  • Features are removed because they do not provide
    significant incremental information and can
    confuse the analysis and make it unnecessarily
    complex
  • Feature selection often attempts to identify a
    minimum set of non-redundant features that are
    useful for classification.
  • E.g., don't select co-regulated genes.
  • Unsupervised dimension reduction: pruning
    uninformative features. Several methods exist,
    such as:
  • Principal component analysis (PCA)
  • Automatically detects redundancies in the data
    and determines a new set of hybrid features
    (multiple features condensed) that are guaranteed
    to be non-redundant.
  • Advantage: makes apparent the outliers and
    clusters in a data set and reduces the noise in
    the data set
  • Disadvantage: throws away weak signals that
    could be important
  • Independent component analysis
  • Supervised dimension reduction: feature selection
  • Goals: selecting relevant underlying features and
    reducing the number of features necessary to
    classify correctly
  • A straightforward method:
  • Iteratively apply a supervised classification
    algorithm that reports weights on all features.
  • After running the classification algorithm the
    first time, the feature with the lowest weight is
    removed from the data set.
  • The algorithm is run again to determine the
    second least important feature.
  • This process is repeated while monitoring the
    classification performance on known examples.
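
The iterative removal procedure above can be sketched as follows. The slides do not name a classifier, so this example assumes a two-class nearest-centroid classifier and uses the absolute difference between the class means as the feature weight; both choices, the function names, and the toy data are assumptions for illustration only. Performance is monitored with leave-one-out accuracy on the known examples.

```python
def centroids(examples, active):
    """Mean of each active feature per class label."""
    grouped = {}
    for features, label in examples:
        grouped.setdefault(label, []).append(features)
    return {
        label: {j: sum(r[j] for r in rows) / len(rows) for j in active}
        for label, rows in grouped.items()
    }

def predict(cents, features, active):
    """Label of the nearest class mean, using only the active features."""
    return min(
        cents,
        key=lambda lab: sum((features[j] - cents[lab][j]) ** 2 for j in active),
    )

def loo_accuracy(examples, active):
    """Leave-one-out accuracy on the known examples."""
    hits = 0
    for i, (features, label) in enumerate(examples):
        rest = examples[:i] + examples[i + 1:]
        hits += predict(centroids(rest, active), features, active) == label
    return hits / len(examples)

def backward_elimination(examples, n_features, min_features=1):
    """Repeatedly drop the lowest-weight feature, recording accuracy each time."""
    active = list(range(n_features))
    history = [(list(active), loo_accuracy(examples, active))]
    while len(active) > min_features:
        cents = centroids(examples, active)
        a, b = sorted(cents)  # assumes exactly two classes
        weights = {j: abs(cents[a][j] - cents[b][j]) for j in active}
        weakest = min(active, key=lambda j: weights[j])
        active.remove(weakest)
        history.append((list(active), loo_accuracy(examples, active)))
    return history

# Made-up data: feature 0 is informative, feature 1 is noise, feature 2 is weak.
examples = [
    ([2.0, 0.3, 0.6], "tumour"),
    ([1.8, -0.2, 0.4], "tumour"),
    ([2.2, 0.1, 0.5], "tumour"),
    ([-1.9, 0.2, -0.1], "normal"),
    ([-2.1, -0.1, 0.0], "normal"),
    ([-2.0, 0.0, 0.1], "normal"),
]
history = backward_elimination(examples, n_features=3)
for active, acc in history:
    print(active, acc)
```

In practice one would stop removing features as soon as the monitored accuracy starts to drop, keeping the smallest feature set that still classifies the known examples well.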

12
Further Details
  • Types of microarrays: spotted cDNA microarrays,
    high-density oligonucleotide microarrays
  • Fluorescent dyes (for glass arrays) vs
    radioactive isotopes (for membrane arrays)
  • Interpretations
  • Identifying individual genes whose regulated
    expression can explain particular biological
    phenomena, or assigning a potential function to
    new genes.
  • Co-regulated genes (often identified using
    cluster analysis) allow functional classification
    (they may participate in similar cellular
    processes or pathways) and potential
    identification of common regulatory elements
    (DNA motifs) in promoter sequences.
  • Assumption: genes with closely related expression
    patterns may be controlled by the same regulatory
    mechanism
  • When researchers see differential expression,
    they may know the probable function of that gene
    (from NCBI databases) and can form a hypothesis
    about the role the gene plays in their system.
  • One cluster of genes related to one pathway and
    another cluster related to another pathway hint
    at an interconnection between those pathways
  • Gene regulatory networks
  • Comparing a large number of samples for a global
    view
  • The common control can be a mixture of all
    samples used
  • Combining many data sets and analyzing the whole
    set is very useful
  • By comparing the expression profiles of tumour
    samples across many genes, it is possible to
    identify those genes whose expression
    characterizes a particular tumour type
  • Compare the expression signature of a particular
    tumour type to data generated by measuring the
    responses of closely related cell lines in
    culture to many different stimuli, such as
    hormones, growth factors, etc. Using this
    strategy one can draw conclusions about which
    signalling pathways are activated in a particular
    tumour type, leading to the identification of
    pathways that might provide therapeutic targets.

13
References: main sources used in this presentation
  • Basic microarray analysis: grouping and feature
    reduction. Raychaudhuri et al. Trends in
    Biotechnology, Vol 19, No 5, May 2001, 189-193.
  • Gene expression microarrays and the integration
    of biological knowledge. Noordewier and Warren.
    Trends in Biotechnology, Vol 19, No 10, Oct 2001,
    412-415.
  • http://www.cs.wustl.edu/jbuhler/research/array
  • The magic of microarrays. Friend and Stoughton.
    Scientific American, Feb 2002.
  • Other sources that I read:
  • Navigating gene expression using microarrays: a
    tech review. Schulze and Downward. Nature Cell
    Biology, Vol 3, Aug 2001, E190.
  • http://www.fargo.ars.usda.gov/ps/micr_the.htm
  • Analysis of gene expression by microarrays: cell
    biologist's gold mine or minefield? Schulze and
    Downward. J. of Cell Science 113, 2000,
    4151-4156.
  • Microarrays: handling the deluge of data and
    extracting reliable information. Hess et al.
    Trends in Biotechnology, Vol 19, No 11, Nov 2001,
    463-468.