Bioc 300: Bioinformatics - PowerPoint PPT Presentation

About This Presentation
Slides: 48
Provided by: JamesG157

Transcript and Presenter's Notes


1
Bioc 300 Bioinformatics
  • www.geneticsplace.com

2
Goals of the Course
  • Understand Methods and Research Questions
  • Analyze Real Data
  • Engage in a Realistic Learning Environment
  • Utilize Online Databases
  • Appreciate Complexity of Research Systems
  • Integrate Different Types of Information
  • Reconsider Cells as Intracellular Ecosystems
  • Integrate Bioinformatics with Biology

3
What is bioinformatics?
  • "Bioinformatics is the term coined for the new
    field that merges biology, computer science, and
    information technology to manage and analyze the
    data, with the ultimate goal of understanding and
    modeling living systems."
  • Genomics and Its Impact on Medicine and Society:
    A 2001 Primer, U.S. Department of Energy Human
    Genome Program

Bioinformatics also represents a paradigm shift
for molecular biology: instead of taking a
reductionist approach, the sub-disciplines of
bioinformatics are more expansionist; they
attempt to study the entire complement of a
particular cellular molecule or process.
4
The omics revolution
  • Genomics

The study of the entire DNA complement of an
organism
5
Genome Sequence Information
Basic Research
  • Acquiring Sequence
  • Human Genome Draft
  • Evolution

Applied Research
  • Identification of Biological Unknowns
  • Biomedical Research

6
Genomic Variations
Ecology
  • Tracking Ivory Sales
  • Diatoms and Global Warming

Human Variations
  • SNPs
  • Disease Analysis

Ethics
  • GMOs
  • Genetic Testing

7
DNA Microarrays
Basic Research
  • Introduction to Method
  • Data Analysis

Applied Research
  • Cancer
  • Pharmacogenomics

8
The omics revolution
  • Genomics

The study of the entire DNA complement of an
organism
  • Proteomics

The study of the entire set of proteins in a
particular cell type
9
Proteomics
Cellular Roles
Protein-Protein Interactions
(with permission from Benno Schwikowski and Stan Fields)
Identification and Quantification
10
The omics revolution
  • Genomics

The study of the entire DNA complement of an
organism
  • Proteomics

The study of the entire set of proteins in a
particular cell type
  • Transcriptomics

The study of all mRNA transcripts in a particular
cell type
  • Metabolomics

The study of all metabolites in a particular cell
type
  • Glycomics

The study of all polysaccharides in a particular
cell type
  • Variomics

The study of all possible drug targets in a
particular cell type
11
Genomic Circuits
Single Gene Circuit
Toggle Switches
www.bio.davidson.edu/courses/genomics/circuits.html
Integrated Circuits
12
Sequencing of Whole Genomes
  • Three Phases of Genome Sequencing
  • Preliminary sequencing
  • Finishing
  • Annotating

13
Preliminary sequencing
  • 1970s
  • Maxam-Gilbert sequencing (chemical cleavage)
  • Sanger sequencing (dideoxy method)

Autorad
You could sequence 100s of bases per day!
14
Genomics took off with automated sequencing
  • 1990s
  • Leroy Hood made modifications to dideoxy
    sequencing
  • ddNTPs were coupled to fluorescent dyes (instead
    of radioactivity)

DNA fragments were separated via capillary gel
electrophoresis
The sequence was read by lasers, and the data were
recorded directly into a computer
Now, instead of an autorad, we have a
15
Chromat!
The newest DNA sequencers can determine millions
of bases of sequence in a day!
The increasing ease of obtaining sequence data
has led to exponential growth of GenBank, the
main repository of sequence data, which is housed
at the National Library of Medicine at NIH.
16
Growth of GenBank
17
Sequencing Entire Organisms
Before the 1990s, sequencing was somewhat
haphazard. Depending on the researcher,
different pieces of different organisms' genomes
had been sequenced.
No concerted effort had been made to sequence the
entire genome of an organism.
HUGO changed all of that. Its mission was to
sequence the human genome, as well as the genomes
of a number of model organisms.
While small genomes could be sequenced directly,
larger genomes were first mapped out.
18
Mapping large genomes
Sequencers needed some reference sequences to
know what part of a genome they were dealing with.
STSs (sequence tagged sites): these are defined by
a pair of PCR primers that amplify only one
segment of a genome (i.e., a unique sequence).
ESTs (expressed sequence tags): these are short
sequences of cDNA that indicate where genes are
located within the genome.
Now genomes could be cut into pieces, sequenced,
and the pieces reassembled.
19
Cutting up genomes
Vectors designed to carry large pieces of DNA
include:
BACs (bacterial artificial chromosomes), which can
carry about 150 kb of insert
YACs (yeast artificial chromosomes), which can
carry up to 1.5 Mb of insert
BACs or YACs containing overlapping DNA can be
assembled into contiguous overlapping fragments.
20
Shotgun sequencing
While HUGO was busy mapping large genomes and
sequencing some small genomes, Craig Venter
founded TIGR.
TIGR took a completely different approach.
Instead of mapping a genome, they simply cut it
into thousands of pieces, sequenced the pieces,
and reassembled the data using overlapping
fragments.
It was TIGR, not HUGO, that produced the world's
first completed genome of a free-living organism,
H. influenzae, in 1995.
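
The cut, sequence, and reassemble idea behind shotgun sequencing can be sketched in a few lines. This is only a toy illustration, assuming error-free reads from one strand and a simple greedy merge of the best suffix/prefix overlap; the function names and the minimum overlap of 3 bases are hypothetical choices, not TIGR's actual assembler.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                k = overlap(a, b, min_len)
                if k > best_len:
                    best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing overlaps any more: the remaining contigs are separated by gaps
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three overlapping reads taken from a made-up 20-base "genome".
genome = "ATGCGTACGTTAGCCTAGGA"
reads = [genome[10:20], genome[0:8], genome[5:13]]
print(greedy_assemble(reads))
```

Real genomes contain repeats and sequencing errors, which is why a greedy merge like this one is not sufficient in practice.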
21
Finishing a Genomic Sequence
  • A finished sequence is defined as one that
    contains no more than 1 error in 10,000 bases.
  • Finishing a sequence involves aligning a number
    of preliminary sequences and correcting any
    inconsistencies.
  • Overlapping segments are combined into larger
    assemblies of contiguous DNA (contigs).
  • If contigs do not overlap, a gap remains in the
    sequence.
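
One common way to express an error rate such as the 1-in-10,000 threshold above is the Phred quality scale, Q = -10 log10(P). The Phred scale is not mentioned on the slide; it is brought in here only to put a number on the threshold, and the function name is hypothetical.

```python
import math


def phred_quality(error_probability):
    """Convert a per-base error probability into a Phred-style quality score."""
    return -10 * math.log10(error_probability)


# "No more than 1 error in 10,000 bases" as a quality score.
finished_threshold = phred_quality(1 / 10_000)
print(round(finished_threshold))
```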

22
Finishing continued
  • The human draft sequence, published in 2001,
    contained 147,821 gaps.
  • The finished sequence, published in 2004,
    contained 341 gaps.
  • A gap usually contains highly repetitive DNA
    that complicates attempts to clone and sequence
    it.
  • Finishing is a very expensive process, so many
    genomes have not been finished.

23
Annotating Genomes
  • Annotation involves the identification of
    functionally important sections of a genome.
  • This includes, but is not limited to, making an
    educated guess about what kind of protein is
    encoded by a given coding sequence.
  • Annotation is performed using various computer
    programs.

24
Locating genes within a genome
Process is different in prokaryotes vs. eukaryotes
  • Prokaryotes contain ORFs with no introns and very
    little intergenic sequence.
  • Eukaryotes contain introns, complex promoters,
    and enhancers
  • Introns range between 70 and 30,000 bp
  • One eukaryotic gene can encode more than one
    protein via alternative splicing mechanisms
  • Eukaryotes also contain pseudogenes, ORFs which
    have been rendered nonfunctional by mutation
  • Mammalian genomes contain about 23 pseudogenes
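
Because prokaryotic genes have no introns, a first-pass gene search can be as simple as scanning for open reading frames. The sketch below assumes the standard genetic code, looks only at the three forward frames, and treats any ATG-to-stop stretch of at least `min_codons` codons as a candidate; the function names and the length cutoff are hypothetical, and real gene finders use statistical models rather than this rule.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))
STOP_CODONS = {codon for codon, aa in CODON_TABLE.items() if aa == "*"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) index pairs of ATG...stop ORFs in the three forward frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # end index includes the stop codon
                start = None
    return orfs


def translate(orf):
    """Translate an ORF into a protein string, dropping the trailing stop."""
    codons = (orf[i:i + 3] for i in range(0, len(orf) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons).rstrip("*")


seq = "CCATGAAATTTGGGTAACC"
for start, end in find_orfs(seq):
    print(start, end, translate(seq[start:end]))
```

A eukaryotic sequence would defeat this scan, since introns interrupt the reading frame.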

25
Tools for gene hunting
  • GeneMark - originally created for prokaryotes but
    adapted for some model eukaryotes
  • GenScan - accepts up to 1 million bp of sequence
    online, more if downloaded
  • Glimmer and GlimmerM - developed by TIGR, accept
    up to 200 kb online, more if downloaded

Once a genome is annotated
  • One can use a genome browser to locate specific
    loci on specific chromosomes
  • One can then use resources such as GeneCards to
    find out more about a specific gene

26
Progress of Genome Sequencing
Sequenced Euk. Genomes
  • Yeast
  • Drosophila
  • C. elegans
  • Arabidopsis
  • Mosquito
  • Human
  • Mouse
  • Rat
  • Chicken
  • Dog
  • Zebrafish

Euk. Genomes in Progress
  • Xenopus
  • Cow
  • Cat
  • Horse
  • Kangaroo
  • Honey Bee
  • Turkey
  • Lobster
  • Bat
  • Hedgehog

and others
27
Tools had to be developed to make sense of the
wealth of genomic data being produced
  • Genomic Search Engines include
  • BLAST- searches sequence information, either
    nucleotide (BLASTn) or protein (BLASTp)
  • BLAST2- aligns two sequences, checking similarity
  • Entrez- searches databases for textual
    information
  • PubMed- searches scientific literature for text
  • ORF finder- finds Open Reading Frames (candidate genes)
  • PREDATOR- predicts secondary structure of
    proteins
  • ExPASy- analysis of protein sequence and
    structure as well as 2D gel information

28
Calculating E(expect)-values
E-values measure the significance of a match;
the smaller the E-value, the better.
  • E-values are calculated using
  • S, the bit score, a measure of the similarity
    between the hit and the query
  • m, the length of the query
  • n, the size of the database

$E = mn \cdot 2^{-S}$
29
So, how do you get the bit score?
S is calculated from the raw score, R:
$R = aI + bX - cO - dG$
where I is the number of identities, X is the number of
mismatched nucleotides, O is the number of gaps, and
G is the number of spaces in the gaps.
a, b, c, and d are the rewards and penalties
for each of these variables.
The defaults for these lower-case letters are set
at 1, -3, 5, and 2, respectively.
These values can be changed on the "Other
advanced" line.
30
Now that we have a raw score, the bit score can
be obtained by normalizing the data:
$S = (\lambda R - \ln K)/\ln 2$
(where $\lambda$ and K are the normalizing parameters)
These parameters are printed at the bottom of a
BLAST report.
Normalization enables a direct comparison of
E-values and bit scores, even if the reward and
penalty variables have been changed by the user.
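
The chain from raw score to bit score to E-value on the last three slides can be followed with a short calculation. This is a sketch of the slide formulas only: the defaults are the slide's 1, -3, 5, and 2, while the function names, the example alignment counts, and the values used for lambda and K are illustrative assumptions; real values of lambda and K should be read from the bottom of an actual BLAST report.

```python
import math


def raw_score(identities, mismatches, gap_opens, gap_spaces, a=1, b=-3, c=5, d=2):
    """R = aI + bX - cO - dG, with the default reward and penalties from the slide."""
    return a * identities + b * mismatches - c * gap_opens - d * gap_spaces


def bit_score(raw, lam, K):
    """S = (lambda * R - ln K) / ln 2."""
    return (lam * raw - math.log(K)) / math.log(2)


def e_value(m, n, bits):
    """E = m * n * 2^(-S), for query length m and database size n."""
    return m * n * 2 ** (-bits)


# Hypothetical alignment: 100 identities, 5 mismatches, 1 gap spanning 2 spaces.
R = raw_score(100, 5, 1, 2)
S = bit_score(R, lam=1.374, K=0.711)      # placeholder lambda and K
E = e_value(m=500, n=1_000_000_000, bits=S)
print(R, S, E)
```

Each additional bit of score halves the E-value, which is why bit scores from searches with different reward and penalty settings can be compared directly.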
31
More databases of interest
  • Swiss-Prot- protein sequence database
  • PDB- contains protein structural information
  • OMIM- catalogs human disease genes
  • TIGR- many searchable genomes, esp. bacterial
    ones
  • GeneCards- genomic, proteomic and phenotypic
    info.
  • UniGene- catalogs human ESTs
  • Human map viewer- shows chromosomal location of
    genes

32
Protein structure and function
For most researchers, the final goal of genomic
research is not the genomic data itself but an
understanding of the proteins encoded by a
genome.
Steps to determining protein structure and
function
  • Find ORFs, or coding sequences (CDSs)
  • Translate ORFs
  • Is this a known protein? If not, find protein
    orthologs, similar proteins in different species
  • Check if 3D structure has been determined
  • Predict hydropathy using a Kyte-Doolittle plot
  • Predict secondary structure of your protein
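
The hydropathy step in the list above amounts to a sliding-window average over per-residue values. The sketch below uses the published Kyte-Doolittle scale; the function name and the default window of 9 residues are assumptions made for illustration, and the window size is normally tuned to the question being asked.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=9):
    """Average hydropathy of each window of residues along the protein."""
    scores = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


# Made-up sequence: a hydrophobic stretch flanked by charged residues.
profile = hydropathy_profile("KKDEKLLIVLLAVIVLKKDEK")
print([round(value, 2) for value in profile])
```

Plotting the profile against residue position gives the familiar plot, with sustained positive stretches suggesting buried or membrane-spanning segments.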

33
What do we mean by function?
The term "function" is too simplistic and is
somewhat outdated. A consortium called Gene
Ontology decided that a complete description of
function must include not only "why?" but also
"what?" and "where?"
  • Why: biological process. The objective toward
    which this protein contributes.
  • What: molecular function. The biochemical
    activity that the protein accomplishes.
  • Where: cellular component. The location of
    protein activity.

34
One example: isocitrate dehydrogenase (IDH)
  • OMIM - IDH3A
  • COG - functional categories, dendrograms
  • isoforms- distinct genes encoding similar
    proteins
  • Enzyme Commission, EC numbers
  • Swiss-Prot
  • Phylogenetic trees
  • rooted vs. unrooted

35
Terms used to describe phylogeny
  • paralogs - genes which arose from a common
    ancestral gene within one species (isoforms)
  • orthologs - genes from two organisms which arose
    from a common ancestral gene
  • synteny - genetic loci located on the same
    chromosome (or multiple genetic loci from
    different species which are located on a
    chromosomal region of common ancestry)
  • homology - sequences which are similar due to a
    common evolutionary origin
  • similarity or identity - terms used to describe
    sequences without regard to evolutionary
    relationships

36
Searching for related proteins
PSI-BLAST allows one to search outward in a
spiraling pattern from a central starting point.
First iteration- finds proteins with similar
sequences.
Second iteration- can be performed using a
consensus sequence computed from your first
iteration.
More iterations can be performed as desired.
Or, one can choose a species and perform another
first iteration using the results of the original
search.
This approach can be used to annotate ORFs from a
newly sequenced genome
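
The "consensus sequence computed from your first iteration" can be pictured as a column-by-column vote over the aligned hits. This is a deliberately simplified sketch: PSI-BLAST itself builds a position-specific scoring matrix rather than a single consensus string, and the function name and toy alignment below are hypothetical.

```python
from collections import Counter


def consensus(aligned_sequences, gap="-"):
    """Most common residue in each column of equal-length aligned sequences."""
    if len({len(s) for s in aligned_sequences}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    result = []
    for column in zip(*aligned_sequences):
        counts = Counter(residue for residue in column if residue != gap)
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)


# Toy alignment of three hits from a first search iteration.
hits = ["MKV-LA", "MKVTLA", "MRVTLG"]
print(consensus(hits))
```

The consensus, or more accurately the per-column residue frequencies behind it, is what lets later iterations pick up more distant relatives than the original query could.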
37
Alternate Splicing
  • 60% of human genes produce more than 1 mRNA
  • Only about 22% of genes in C. elegans fit into
    this category

38
Epigenetic Control
  • It is not just the coding regions which matter.
  • Methylation, such as that found in
    heterochromatin and CpG islands, also plays a
    role in gene expression.
  • At any given time, there are 400,000 mC in a
    given cell. Since there are about 100 different
    human cell types, this totals 40 million
    methylation events in our methylome.
  • Nonmammalian animals lack this form of epigenetic
    control.

39
The number of CpG islands correlates with the
number of genes on a chromosome
40
CpGs are usually associated with genes
41
Imprinting
  • About 20 mammalian genes are known to be
    methylated during gametogenesis in either the
    paternal or maternal copy.
  • Imprinting may represent a genetic tug-of-war
    between male and female interests.

For example, the insulin-like growth factor 2 gene,
Igf2, is expressed only from the paternal allele.
Igf2 promotes the growth of the developing embryo.
The expression of its receptor, Igf2r, is
controlled by the maternally inherited allele.
42
Expression of Paternal Allele of Igf2 in embryo
and placenta
43
How does silencing work?
44
What is the effect of a loss of imprinting?
  • Loss of Igf2 imprinting can lead to colorectal
    cancer and Beckwith-Wiedemann Syndrome
  • There is a cluster of CpG islands in an insulator
    region near Igf2

CTCF is a protein which only binds to
unmethylated DNA.
17/20 tumor samples taken from cancer patients
were found to be hypermethylated in this region.
45
What about the rest of our genome?
  • Since only 1-2% of our genome is coding sequence,
    what does the rest do?

A majority of our DNA is repetitive sequence
  • There are 5 classes of repetitive sequence

1) transposon derived
2) pseudogenes
3) simple repeats such as VNTRs
4) segmental duplications
5) heterochromatic regions
The first category alone accounts for 45% of our
genome!
46
Transposons
  • Transposons fall into 4 categories

1) SINEs (short interspersed elements), such as
Alu, comprise 13% of our genome. These may help a
cell cope with stress: RNA produced from them
binds to an inhibitor of translation.
2) LINEs (long interspersed elements) comprise
21% of our genome
3) LTR retrotransposons comprise 8% of our genome
4) Other DNA transposons comprise 3% of our genome
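
As a quick check, the four transposon classes listed above, read as percentages of the genome, add up to the transposon-derived total quoted on the previous slide. The dictionary name is hypothetical.

```python
# Share of the human genome by transposon class, in percent (figures from the slide).
transposon_percent = {
    "SINEs": 13,
    "LINEs": 21,
    "LTR retrotransposons": 8,
    "DNA transposons": 3,
}

total = sum(transposon_percent.values())
print(f"Transposon-derived sequence: {total}% of the genome")
```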
47
More Transposon Facts
  • About 50 genes appear to be derived from
    transposons, including RAG1 and RAG2, necessary
    for antibody diversity.
  • The X chromosome has the highest concentration of
    transposons: one 525 kb section is 89%
    transposon-derived.
  • The Y chromosome has the highest concentration of
    LINEs; it is the most gene-poor of the
    chromosomes and probably tolerates insertions
    well.