Title: C. elegans Bioinformatics
1. C. elegans Bioinformatics
2. Bioinformatics
- Applies mathematics, statistics and computer science to understand biological processes, usually at the molecular level
- Data often come from high-throughput techniques (sequencing, microarrays etc.)
- Many research areas:
- Sequence alignment
- Sequence pattern finding
- Gene expression - microarray bioinformatics
- Assembly of a genome from short sequences (in genome projects)
- Protein structure prediction from sequence
- Visualization of protein-protein interaction networks
- Modeling of evolution (genetic algorithms)
- The C. elegans genome is quite well annotated and accessible with many bioinformatics tools
3. Overview of lecture
- Sequence analyses
- Sequence alignments
- PCR primer design
- Prediction of transcription factor binding sites
- Wormbase
- Microarray bioinformatics
- Data analysis overview
- Functional annotation
- Data repository GEO
4. Sequence analysis - Identifier
- Name of the sequence
- Different bioinformatics databases and tools operate with different identifiers
- Example: different identifiers for the same gene
- Wormbase locus ID: ahr-1
- Ensembl Gene ID: C41G7.5
- Entrez Gene ID: 172788
- EMBL (GenBank) ID: Z81048
- RefSeq DNA ID: NM_001025865
- WB Gene ID: WBGene00000096
5. Sequence analysis - Format
- How the sequence is written
- Tools require the sequence in the correct format
- Most common format: FASTA
>Y48G10A.1
ATTAATCTTTAGACATCAATAATACTTGCCTCTAAAAAAGCGTTTGGCTCCGCTTGGATCACTCTCAGCATCAAGTTCTTCTTTTTTCCGGGAAGGAAACGCTCATTTTCGATTAATATTAATTTTGATCGATTCGAAATCGATTGAAACCCGTTTTTGTCATCGAAAATTCTGAAAATATCCGTATTTAGCTCGAATAAAACTATTTAATTTCCATTAAAAATCCGTTTTTAATTGAATTCCGTTCGAAATTTCCTGTTGGAAAAATAAATAAATAAATACGAAGAAGCGTGCGGCGCATTGCAAAAAGCCGTGCGGCGCATTGCGAAGGACTGTGCGGCGGGCTTGCGAAAAGGCGCGCCGCACATTGCCCTACTGAAAGCGTTCCTTACGAAAAAATCCCCTTACGAAAACGTACCCCCTCTTTAATTTCGCGAAAAATAGTTTTTTGCCGAAAATAGAGTTAATTAGGCTAAAAAGCTGTTTTACAATCAATTTTGTTAAAGAAAAACCGCAAAAAACCTGAAAATTGACGAAAAAAAGCCAAAAAAAAAAAAAATTTTGCTTTTTAGTTCTACGCAGGAAAAGTGCGGCGATGGGTTTAAAACTAGAGAAATATTAGAAATCGTTAAATTTAATAGTAGAAAATTACAAAAAACCTAGATTTTCTGTGGAAAATACACGAAAAACAACGAAAAACTTTGGGAATTAAATTAAAAATTCGAAATCTAGCAAATCGTTCTCTACGTCTCCACTCTCTACCGCGTGGCGATGGAGCGCGTTTGCTTTTTACTGATTTTAATTAATTTTAATTAATTGAATTTCAGCGTATTTTCGCTGAATTTCTAGTGTTTTCCCGATAAAAACAAATCGAAATTCAACGTTTCCACTAATTTCAAGCTTTTTTCCTCTATTTTCAGGAAGTATACGCAAAAATCTGTATTTTTCTCTGACGCCTCACTCGGCAATTTTCCACAATTTCTTATCAATTTTGTCTCTT
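To make the FASTA layout concrete, here is a minimal Python sketch of a reader: a header line starting with ">" carries the identifier, and the following lines carry the sequence. The function name `parse_fasta` is made up for this illustration, and the example record is a shortened two-line excerpt of the sequence above; established libraries handle more edge cases.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping identifier -> sequence."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: take the first word after ">" as the identifier
            header = line[1:].split()
            current_id = header[0] if header else ""
            records[current_id] = []
        elif current_id is not None:
            # Sequence line: a record may be wrapped over several lines
            records[current_id].append(line.upper())
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


example = """>Y48G10A.1
ATTAATCTTTAGACATCAAT
AATACTTGCCTCTAAAAAAG
"""
records = parse_fasta(example)
print(records["Y48G10A.1"])
```

The wrapped sequence lines are joined into one string per identifier, which is the form most downstream steps expect.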
6. Annotation
- Any information on a sequence
- Can be an identifier, a description, ...
7. Gene Ontology
- Project that provides a controlled vocabulary to describe gene product attributes in any organism
- Ontologies: biological process, molecular function, cellular component
- Ontology term: a code and a common name
- e.g. GO:0007186 - GPCR protein signaling pathway
- Gene Ontology annotation: characterization of gene products using the ontology terms
- Based on wet lab experiments or sequence similarity with other characterized genes
8. BioMart tool
- By EBI (European Bioinformatics Institute)
- Finds annotations from databases, e.g. the Ensembl genome database
- Good tool for e.g. converting gene identifiers and downloading sequences
- Also finds chromosomal locations and Gene Ontology terms
- Free web interface at http://www.biomart.org
9. Using the BioMart tool
- 1. Choose database and organism
- 2. Define what your input is (e.g. a list of Ensembl gene IDs)
- 3. Specify what you want as output (e.g. gene sequences with Entrez gene IDs)
- 4. Run the search and export your results
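The input/output logic of these steps can be sketched in Python with a small in-memory table. This is not BioMart's interface: the table, the field names and the function `convert_ids` are hypothetical, and the single row reuses identifiers listed on the Identifier slide.

```python
# Cross-reference table for one gene, using identifiers from the Identifier
# slide. A real conversion would query a database; this table is illustrative.
GENE_TABLE = [
    {
        "wormbase_locus": "ahr-1",
        "ensembl_gene_id": "C41G7.5",
        "entrez_gene_id": "172788",
        "refseq_dna_id": "NM_001025865",
    },
]


def convert_ids(input_ids, input_type, output_types, table=GENE_TABLE):
    """Return one row per matched input ID, holding only the requested fields."""
    wanted = set(input_ids)
    results = []
    for row in table:
        if row.get(input_type) in wanted:
            results.append({field: row.get(field) for field in output_types})
    return results


# Step 2: the input is a list of Ensembl gene IDs.
# Step 3: the output is the locus name and the Entrez gene ID.
hits = convert_ids(["C41G7.5"], "ensembl_gene_id",
                   ["wormbase_locus", "entrez_gene_id"])
print(hits)
```

Input IDs with no match in the table are simply absent from the result, so it is worth comparing the number of rows returned with the number of IDs submitted.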
10. Sequence alignments - BLAST
- Basic Local Alignment Search Tool
- Sequence similarity search program
- Finds matching sequences in NCBI databases
- The sequence can be nucleotide, protein, translated, genome, ...
- Free web interface at http://www.ncbi.nlm.nih.gov/blast
11. Two types of sequence alignments in BLAST
- 1. Compare two given sequences, e.g.
- Does your PCR product have the right sequence?
- How closely related is your protein of interest to its homolog in another species?
- 2. Compare one sequence against a genome, transcriptome, proteome etc.
- Does a sequence correspond to any known gene or regulatory region?
- Will a PCR primer bind to one or many sites in the genome?
12. Example: Alignment of two sequences with BLAST
button that starts alignment
sequences to be compared
13. Example: The two sequences were almost the same
87 / 96 right nucleotides
3 missing nucleotides
Both sequences were in 5'-3' orientation
6 wrong nucleotides
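The counts above (87 matching, 3 missing and 6 wrong out of 96 positions) are the kind of summary that can be computed from any pairwise alignment. A minimal Python sketch, assuming the two sequences are already aligned and gaps are written as "-"; the function name and the toy alignment are made up for illustration.

```python
def alignment_stats(aligned_a, aligned_b, gap="-"):
    """Count matches, mismatches and gapped columns in a pairwise alignment.

    Both strings must already be aligned (same length, gaps written as "-").
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            gaps += 1          # a "missing" nucleotide in one sequence
        elif a == b:
            matches += 1       # a "right" nucleotide
        else:
            mismatches += 1    # a "wrong" nucleotide
    total = matches + mismatches + gaps
    identity = matches / total if total else 0.0
    return {"matches": matches, "mismatches": mismatches,
            "gaps": gaps, "identity": identity}


# Toy alignment: 8 matches, 1 mismatch, 1 gap
stats = alignment_stats("ACGTTGCA-A", "ACGATGCATA")
print(stats)

# The numbers from the slide: 87 matches out of 96 aligned positions
print(round(87 / 96 * 100, 1))  # 90.6 (percent identity)
```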
14. Using Nucleotide BLAST
The sequence in FASTA format
What type of nucleotides (RNA, genomic DNA, expressed sequence tags etc.)
Organism
Run search
15. Example: gene cloned and sequenced, then compared to the genome
Corresponds to a known gene (cyp-42A1)
The sequence was backwards
Matches closely but not perfectly
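A sequence that comes out "backwards" matches the opposite strand, so it has to be reverse-complemented before it reads like the database entry. A small Python sketch of that check follows; the sequences are made up, and the search here is an exact substring match, whereas BLAST also reports close but imperfect matches.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orientation(query, reference):
    """Return "forward", "reverse" or None, depending on how query occurs."""
    if query in reference:
        return "forward"
    if reverse_complement(query) in reference:
        return "reverse"
    return None


reference = "TTGACCGTAAGCTTGGA"  # made-up stand-in for a genomic region
read = "AAGCTTACGG"              # made-up sequencing read

print(reverse_complement(read))           # CCGTAAGCTT
print(find_orientation(read, reference))  # reverse
```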
16. Multiple sequence alignment
- Compares several given sequences
- Builds a hierarchical tree that shows how closely each sequence is related to the others
- Several tools, e.g. ClustalW
- Several tools to visualize the tree, e.g. HyperTree, JalView, ...
- e.g. a family of proteins in different species: which ones are most closely related?
17. ClustalW multiple sequence alignment tool
- Compares each given sequence to each other given sequence
- Free web interface at http://ebi.ac.uk/Tools/clustalW/index.html
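The idea of comparing every sequence with every other one can be illustrated with a simple all-against-all distance table. The Python sketch below uses the fraction of differing positions between equal-length sequences as the distance. This is a deliberately simplified stand-in, not ClustalW's actual scoring, and the toy sequences and function names are made up.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of differing positions between two equal-length sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)


def pairwise_distances(sequences):
    """Compare each sequence with each other one: {(name1, name2): distance}."""
    return {
        (n1, n2): p_distance(sequences[n1], sequences[n2])
        for n1, n2 in combinations(sorted(sequences), 2)
    }


toy = {"seqA": "ACGTACGT", "seqB": "ACGTACGA", "seqC": "TCGAACTA"}
distances = pairwise_distances(toy)
closest = min(distances, key=distances.get)

for pair, dist in sorted(distances.items()):
    print(pair, dist)
print("most closely related:", closest)
```

A tree-building step would start by joining the closest pair and then repeat with the remaining distances.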
18. Example: Hierarchical tree of C. elegans CYP proteins (HyperTree)
19. PCR primer design with Primer3
- Finds optimized primers from a sequence
- Takes into account the desired melting temperature, GC content and primer length
- Improvements made at the University of Tartu
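The three properties named above can be computed for a candidate primer in a few lines. The Python sketch below uses the Wallace rule, Tm = 2(A+T) + 4(G+C) degrees C, which is only a rough rule of thumb for short oligos and not the melting temperature model that Primer3 uses. The primer sequence and the acceptance ranges are made-up illustrative values.

```python
def gc_content(primer):
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def primer_ok(primer, length_range=(18, 25), gc_range=(0.40, 0.60),
              tm_range=(55, 65)):
    """Check length, GC fraction and Tm against illustrative limits."""
    return (
        length_range[0] <= len(primer) <= length_range[1]
        and gc_range[0] <= gc_content(primer) <= gc_range[1]
        and tm_range[0] <= wallace_tm(primer) <= tm_range[1]
    )


primer = "ATGCGTACCTGAAGCTTGCA"  # made-up 20-mer
print(len(primer), gc_content(primer), wallace_tm(primer), primer_ok(primer))
```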
20. In silico PCR predicts what will be amplified with given primers
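As a rough illustration of the principle, the Python sketch below looks for an exact forward-primer site on the given strand and an exact reverse-primer site on the opposite strand, and returns the stretch between them. Real in silico PCR tools also tolerate mismatches and report every possible product; the template, the primers and the function names here are made up.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def in_silico_pcr(template, forward_primer, reverse_primer):
    """Return the predicted amplicon, or None if a primer has no exact site.

    The reverse primer binds the opposite strand, so its reverse complement
    must occur in the template downstream of the forward primer site.
    Only the first site of each primer is considered.
    """
    start = template.find(forward_primer)
    if start == -1:
        return None
    rev_site = reverse_complement(reverse_primer)
    end = template.find(rev_site, start + len(forward_primer))
    if end == -1:
        return None
    return template[start:end + len(rev_site)]


template = "GGATCCATGAAACCCGGGTTTTAGCTCGAGAA"  # made-up template
forward_primer = "ATGAAACC"
reverse_primer = "CTCGAGCTA"

amplicon = in_silico_pcr(template, forward_primer, reverse_primer)
print(amplicon, len(amplicon))
```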
21. Transcription factors
translation
22. Prediction of transcription factor binding sites
- Transcription factors bind to specific short DNA sequences to induce or repress transcription
- The binding sites of a transcription factor can sometimes vary by one or a few nucleotides
- There are several tools to predict transcription factor binding sites, e.g. POXO
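Because binding sites can differ from the consensus by one or a few nucleotides, a site search usually tolerates mismatches. The following Python sketch scans a sequence for windows within a given number of mismatches of a motif. It is a simple illustration rather than the method of any particular tool; the motif, the sequence and the function names are made up.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def find_sites(sequence, motif, max_mismatches=1):
    """Return (position, site, mismatches) for each window close to motif."""
    sequence, motif = sequence.upper(), motif.upper()
    k = len(motif)
    hits = []
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        d = hamming(window, motif)
        if d <= max_mismatches:
            hits.append((i, window, d))
    return hits


sequence = "AATGACGTCCTGACTTGG"  # made-up upstream sequence
motif = "TGACGT"                 # made-up consensus

print(find_sites(sequence, motif, max_mismatches=0))
print(find_sites(sequence, motif, max_mismatches=1))
```

Allowing one mismatch finds the second, slightly different site that an exact search would miss.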
23. Different tools in POXO
Kankainen, M. et al. Nucl. Acids Res. 2006; 34: W534-W540. doi:10.1093/nar/gkl296
24. Finding enriched patterns
- Put in sequences upstream of your genes of interest (can be obtained from BioMart or POXO sequence retrieval)
- POCO finds which patterns occur frequently
- Compares this to the likelihood of finding each pattern by chance; for example, certain nucleotides are generally more common in the genome than others
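The comparison between how often a pattern occurs and how often it would be expected by chance can be sketched as follows. This Python example counts every pattern of length k and estimates its expected count from the single-nucleotide frequencies, assuming positions are independent. That is a simplification chosen for illustration, not POCO's actual statistics; the sequences and function names are made up.

```python
from collections import Counter


def base_frequencies(sequences):
    """Relative frequency of each nucleotide over all sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


def observed_vs_expected(sequences, k):
    """Map each k-mer to (observed count, expected count by base composition)."""
    freqs = base_frequencies(sequences)
    observed = Counter()
    windows = 0
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            observed[seq[i:i + k]] += 1
            windows += 1
    result = {}
    for kmer, obs in observed.items():
        prob = 1.0
        for base in kmer:
            prob *= freqs[base]  # chance of this k-mer at one position
        result[kmer] = (obs, prob * windows)
    return result


upstream = ["TTGACGTAATGACGT", "CCTGACGTGGA"]  # made-up upstream sequences
table = observed_vs_expected(upstream, k=6)
best = max(table, key=lambda kmer: table[kmer][0] / table[kmer][1])
print(best, table[best])
```

A pattern rich in rare nucleotides gets a low expected count, so the same number of occurrences counts as stronger enrichment.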
25. Clustering of found patterns
- Puts together patterns that have something in common
- Forms longer patterns
- Also allows some differences
26. Checking whether a given pattern is enriched in the sequences
- POBO counts how many times the given pattern is present in the sequences
- Compares this to how many times the pattern is present in the "background"
- The background differs between tools
- POXO creates several lists of random sequences (same number and length as in the given sequence list)
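A background made of random sequence lists can be sketched in Python as below: each random list has the same number of sequences with the same lengths as the input, and the pattern count in the real sequences is compared with the counts in the random lists. The uniform choice of nucleotides, the number of lists, the fixed seed and all names are assumptions made for this illustration; the tools' own background models are not reproduced here.

```python
import random


def count_pattern(sequences, pattern):
    """Count occurrences of pattern (overlaps included) in all sequences."""
    total = 0
    for seq in sequences:
        start = seq.find(pattern)
        while start != -1:
            total += 1
            start = seq.find(pattern, start + 1)
    return total


def random_background(sequences, rng, alphabet="ACGT"):
    """One random list with the same number and lengths of sequences."""
    return ["".join(rng.choice(alphabet) for _ in seq) for seq in sequences]


def empirical_enrichment(sequences, pattern, n_lists=200, seed=0):
    """Return (observed, mean background count, fraction of lists >= observed)."""
    rng = random.Random(seed)
    observed = count_pattern(sequences, pattern)
    background = [
        count_pattern(random_background(sequences, rng), pattern)
        for _ in range(n_lists)
    ]
    mean_bg = sum(background) / n_lists
    fraction = sum(1 for c in background if c >= observed) / n_lists
    return observed, mean_bg, fraction


upstream = ["TTGACGTAATGACGT", "CCTGACGTGGA"]  # made-up upstream sequences
observed, mean_bg, fraction = empirical_enrichment(upstream, "TGACGT")
print(observed, mean_bg, fraction)
```

The smaller the fraction of random lists that reach the observed count, the less likely the pattern is there by chance.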
27. Wormbase
- Major publicly available database of information about C. elegans
- Essential for worm researchers
- Search e.g. for information on a gene
28. Names
29. Sequence: exons, introns
Sequence: exons colored, introns white
30. Chromosomal location, anatomic expression pattern
31. Information on gene product function collected from publications (RNAi, microarray), with links to the publications
33. Information on how a mutant allele differs from the wild-type gene
In this example, a point mutation...
...that leads to a stop codon in the middle of the protein
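How a single base change can create a premature stop codon is easy to show in code. The Python sketch below applies a point mutation to a short, made-up coding sequence and reports the first in-frame stop codon before and after. The sequence and function names are hypothetical and do not correspond to the allele shown on the slide.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop_codon(cds):
    """Return the 0-based codon index of the first in-frame stop, or None."""
    cds = cds.upper()
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i // 3
    return None


def apply_point_mutation(cds, position, new_base):
    """Return a copy of cds with the base at 0-based position replaced."""
    return cds[:position] + new_base + cds[position + 1:]


wild_type = "ATGTGGCAAGGTTAA"                     # ATG TGG CAA GGT TAA
mutant = apply_point_mutation(wild_type, 5, "A")  # TGG becomes TGA

print(first_stop_codon(wild_type))  # 4: the normal stop at the end
print(first_stop_codon(mutant))     # 1: premature stop
```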
34. Link to the Caenorhabditis Genetics Center (CGC), from which you can request worm strains
35. Microarrays
36. Microarray bioinformatics
- Expression levels of tens of thousands of genes in one experiment
- Quantification of intensities from an image
- Data analysis
- Finding annotations for genes - DAVID tool
- Using existing microarray data - GEO
37. Image quantification
- TIGR Spotfinder
- Freely downloadable software
- Input: image files
- Compose a grid
- Each spot gets its own square
- Segmentation method decides which part is spot and which is background
- Intensity value for each spot represents the amount of RNA in the original sample
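The segmentation and quantification steps can be illustrated on a single grid square. The Python sketch below separates spot pixels from background pixels with a fixed intensity threshold and reports the mean spot intensity minus the median background. The fixed threshold is an assumption chosen for simplicity and is not meant to describe the segmentation methods offered by TIGR Spotfinder; the pixel values are made up.

```python
from statistics import median


def quantify_spot(pixels, threshold):
    """Background-corrected intensity for one grid square.

    Pixels at or above the threshold are treated as spot, the rest as
    background. Returns mean(spot) - median(background).
    """
    flat = [value for row in pixels for value in row]
    spot = [v for v in flat if v >= threshold]
    background = [v for v in flat if v < threshold]
    if not spot:
        return 0.0  # empty square: no signal detected
    bg_level = median(background) if background else 0.0
    return sum(spot) / len(spot) - bg_level


square = [
    [10, 12, 11, 10],
    [11, 200, 210, 12],
    [10, 190, 200, 11],
    [12, 10, 11, 10],
]
signal = quantify_spot(square, threshold=100)
print(signal)  # 189.0
```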
38. Data analysis
- Several commercial programs, e.g. GeneSpring
- Normalization
- Sometimes one label (of a 2-color experiment) is stronger than the other, or some chips or chip areas have been hybridized more efficiently
- Normalization makes these different labels, chips or chip areas comparable
- Statistics
- Which genes are significantly under- or over-expressed?
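One simple way to make the two labels of a 2-color experiment comparable is global median normalization: compute the log2 ratio of the two channels for every spot and shift all ratios so that their median is zero. The Python sketch below shows this on made-up intensities. It is only one of several normalization approaches and is not tied to any particular program.

```python
import math
from statistics import median


def log_ratios(red, green):
    """log2(red / green) for each spot."""
    return [math.log2(r / g) for r, g in zip(red, green)]


def median_normalize(ratios):
    """Shift log ratios so that their median becomes zero."""
    center = median(ratios)
    return [m - center for m in ratios]


# Made-up intensities: the red label is about twice as strong overall
red = [200.0, 400.0, 800.0, 1600.0, 400.0]
green = [100.0, 200.0, 400.0, 200.0, 400.0]

raw = log_ratios(red, green)
normalized = median_normalize(raw)
print(raw)         # [1.0, 1.0, 1.0, 3.0, 0.0]
print(normalized)  # [0.0, 0.0, 0.0, 2.0, -1.0]
```

After the shift, most spots sit at zero and only the fourth and fifth spots stand out as candidates for over- and under-expression.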
39. Finding generally over-represented functions in your gene list
- DAVID annotation tool
- Finds which annotations are over-represented in your gene list
- Good for showing general trends in a large gene list
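Over-representation of an annotation is commonly assessed with a hypergeometric test: given how many genes in the background carry the annotation, how likely is it to see at least this many in a list of this size? The Python sketch below implements that tail probability. DAVID's own statistic is not reproduced here; the function name and the numbers are made up for illustration.

```python
from math import comb


def hypergeom_enrichment_p(list_hits, list_size, background_hits,
                           background_size):
    """P(X >= list_hits) when list_size genes are drawn without replacement
    from background_size genes, of which background_hits are annotated."""
    max_hits = min(list_size, background_hits)
    total = comb(background_size, list_size)
    tail = sum(
        comb(background_hits, k)
        * comb(background_size - background_hits, list_size - k)
        for k in range(list_hits, max_hits + 1)
    )
    return tail / total


# Made-up example: 10 background genes, 3 annotated; list of 4 genes, 2 annotated
p_small = hypergeom_enrichment_p(2, 4, 3, 10)
print(p_small)

# Made-up larger example: 40 of 300 list genes annotated, 500 of 20000 in background
p_large = hypergeom_enrichment_p(40, 300, 500, 20000)
print(p_large)
```

When many annotations are tested for the same gene list, the resulting p-values also need a multiple-testing correction.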
40. DAVID Functional Annotation Tool
- By NIAID, National Institutes of Health
- Finds significantly enriched annotations for the gene products in your list
- Gene Ontology terms
- Identifiers
- Protein domains
- Pathways
- etc.
41. Example: Functional annotation chart
42. Microarray data repository - GEO
- Gene Expression Omnibus, NCBI
- MIAME: the Minimum Information About a Microarray Experiment that should be provided
- Submit your own microarray data to GEO
- Browse, search and retrieve microarray data
- http://www.ncbi.nlm.nih.gov/geo/
43. Summary: Bioinformatics
- More and larger data sets available
- -omics level approach
- Performs functions that would be extremely tedious to do manually
- Tools are easy to use
- Information is sometimes a prediction based on similarity with other gene products -> wet lab experiments are still needed