Title: Interface of biology and computers
1What is bioinformatics?
- Interface of biology and computers
- Analysis of proteins, genes and genomes using computer algorithms and computer databases
- Genomics is the analysis of genomes.
- The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.
2Top ten challenges for bioinformatics
1. Precise models of where and when transcription will occur in a genome (initiation and termination)
2. Precise, predictive models of alternative RNA splicing
3. Precise models of signal transduction pathways; ability to predict cellular responses to external stimuli
4. Determining protein-DNA, protein-RNA, protein-protein recognition codes
5. Accurate ab initio protein structure prediction
3Top ten challenges for bioinformatics
6. Rational design of small molecule inhibitors of proteins
7. Mechanistic understanding of protein evolution
8. Mechanistic understanding of speciation
9. Development of effective gene ontologies: systematic ways to describe gene and protein function
10. Education: development of bioinformatics curricula
Source: Ewan Birney, Chris Burge, Jim Fickett
4After Pace NR (1997) Science 276:734
5DNA → RNA → protein → phenotype
6Growth of GenBank
[Figure: growth of GenBank, 1982 to 2002, plotting base pairs of DNA (billions) and sequences (millions) against year. Fig. 2.1, page 17. Updated 8-12-04: >40 billion base pairs.]
7Central dogma of molecular biology
DNA → RNA → protein
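The flow DNA → RNA → protein can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names `transcribe` and `translate` are made up for this example, the input is assumed to be the coding strand containing only A, C, G and T, and the codon table is the standard genetic code written in the usual T, C, A, G ordering.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Coding-strand DNA -> mRNA: same sequence with T replaced by U."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA from its first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")  # the table is keyed on DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTAA")
print(mrna)             # AUGGCCUAA
print(translate(mrna))  # MA
```

Real sequences also contain ambiguity codes such as N, which this sketch does not handle (it would raise a `KeyError`).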
8DNA: genomic DNA databases
RNA: cDNA, ESTs, UniGene
protein: protein sequence databases
phenotype
9There are three major public DNA databases
- GenBank: housed at NCBI (National Center for Biotechnology Information)
- EMBL: housed at EBI (European Bioinformatics Institute)
- DDBJ: housed in Japan
The underlying raw DNA sequences are identical
10Caveats
- Remember what you are looking for
- Seems obvious, but sometimes it isn't
- Formats!
- Due to the hodge-podge history of database development and sequence data acquisition, MANY different formats exist
- Don't feed the wrong format to a search engine or you won't get a response
- Stay focused and alert.
- Many times you'll get a hit that was not exactly what you were looking for. It may lead you someplace you weren't expecting to be, but you may be glad to be there!
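One way to avoid feeding the wrong format to a tool is to check the first non-blank line of a record before submitting it. The sketch below is an illustration only (the function name `guess_sequence_format` is hypothetical). It relies on three common conventions: FASTA records begin with ">", GenBank flat files begin with "LOCUS", and EMBL flat files begin with an "ID" line. Many other formats exist and are simply reported as unknown here.

```python
def guess_sequence_format(text):
    """Guess a sequence record's format from its first non-blank line."""
    for line in text.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        if line.startswith(">"):
            return "fasta"
        if line.startswith("LOCUS"):
            return "genbank"
        if line.startswith("ID "):
            return "embl"
        return "unknown"  # first real line matched none of the markers
    return "unknown"      # empty input


print(guess_sequence_format(">example_1 hypothetical record\nACGT\n"))  # fasta
print(guess_sequence_format("ACGTACGT"))                                # unknown
```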
11National Center for Biotechnology Information
(NCBI) www.ncbi.nlm.nih.gov
12www.ncbi.nlm.nih.gov
14PubMed is
- National Library of Medicine's search service
- 16 million citations in MEDLINE
- links to participating online journals
- PubMed tutorial (via Education on side bar)
15Entrez integrates
- the scientific literature
- DNA and protein sequence databases
- 3D protein structure data
- population study data sets
- assemblies of complete genomes
16Entrez is a search and retrieval system that
integrates NCBI databases
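The slides use the Entrez web page, but NCBI also exposes Entrez searches programmatically through its E-utilities service, which the slides do not cover. As a rough sketch, the helper below only builds an `esearch` request URL and makes no network call. The function name `esearch_url` is made up for this example, and the database names passed in (such as `gene` or `pubmed`) are assumed to be valid Entrez database identifiers.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an Entrez esearch request URL."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


print(esearch_url("gene", "rbp4"))
```

The same term can be searched in a different database by changing only the `db` argument, which mirrors how Entrez integrates several databases behind one query box.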
17BLAST is
- Basic Local Alignment Search Tool
- NCBI's sequence similarity search tool
- supports analysis of DNA and protein databases
- 100,000 searches per day
18Structure site includes
- Molecular Modelling Database (MMDB)
- biopolymer structures obtained from the Protein Data Bank (PDB)
- Cn3D (a 3D-structure viewer)
- vector alignment search tool (VAST)
19Accessing information on molecular sequences
20Accession numbers are labels for sequences
NCBI includes databases (such as GenBank) that
contain information on DNA, RNA, or protein
sequences. You may want to acquire information
beginning with a query such as the name of a
protein of interest, or the raw nucleotides
comprising a DNA sequence of interest. DNA
sequences and other molecular data are tagged
with accession numbers that are used to identify
a sequence or other record relevant to molecular
data.
21What is an accession number?
An accession number is a label that is used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
DNA
- X02775: GenBank genomic DNA sequence
- NT_030059: genomic contig
- rs7079946: dbSNP (single nucleotide polymorphism)
RNA
- N91759.1: an expressed sequence tag (1 of 170)
- NM_006744: RefSeq DNA sequence (from a transcript)
protein
- NP_006735: RefSeq protein
- AAC02945: GenBank protein
- Q28369: SwissProt protein
- 1KT7: Protein Data Bank structure record
22Four ways to access DNA and protein sequences
1. Entrez Gene with RefSeq
2. UniGene
3. European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
4. ExPASy Sequence Retrieval System (separate from NCBI)
23Four ways to access protein and DNA sequences
1. Entrez Gene with RefSeq
Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms. RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_006735).
24From the NCBI home page, type rbp4 and hit Go
28By applying limits, there are now just two entries
29Entrez Gene (top of page)
Note that links to many other RBP4 database
entries are available
30Entrez Gene (middle of page)
31Entrez Gene (bottom of page)
35FASTA format
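A FASTA record is a header line beginning with ">" followed by one or more lines of sequence. As a minimal sketch (the function name `parse_fasta` and the example records are invented for illustration, and the sequences are not real database entries), a parser can collect each header with its joined sequence lines:

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping header (without '>') to sequence."""
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)  # close the previous record
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)  # close the final record
    return records


example = """>example_1 hypothetical record
ACGTACGT
TTGA
>example_2 another hypothetical record
MKWV
"""
for name, seq in parse_fasta(example).items():
    print(name, len(seq))
```

Sequence lines that appear before any header are silently skipped in this sketch; a stricter parser might raise an error instead.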
37NCBI's important RefSeq project: best representative sequences
RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon reference version of a sequence.
RefSeq identifiers include the following formats:
- Complete genome: NC_
- Complete chromosome: NC_
- Genomic contig: NT_
- mRNA (DNA format): NM_, e.g. NM_006744
- Protein: NP_, e.g. NP_006735
38NCBI's RefSeq project: accession for genomic, mRNA, protein sequences
Accession         Molecule   Method          Note
AC_123456         Genomic    Mixed           Alternate complete genomic
AP_123456         Protein    Mixed           Protein products; alternate
NC_123456         Genomic    Mixed           Complete genomic molecules
NG_123456         Genomic    Mixed           Incomplete genomic regions
NM_123456         mRNA       Mixed           Transcript products; mRNA
NM_123456789      mRNA       Mixed           Transcript products; 9-digit
NP_123456         Protein    Mixed           Protein products
NP_123456789      Protein    Curation        Protein products; 9-digit
NR_123456         RNA        Mixed           Non-coding transcripts
NT_123456         Genomic    Automated       Genomic assemblies
NW_123456         Genomic    Automated       Genomic assemblies
NZ_ABCD12345678   Genomic    Automated       Whole genome shotgun data
XM_123456         mRNA       Automated       Transcript products
XP_123456         Protein    Automated       Protein products
XR_123456         RNA        Automated       Transcript products
YP_123456         Protein    Auto. Curated   Protein products
ZP_12345678       Protein    Automated       Protein products
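Because the two-letter prefix of a RefSeq accession encodes the molecule type, the table above can be turned into a small lookup. The sketch below is illustrative only: the name `refseq_molecule` and the regular expression are assumptions made for this example, the pattern is deliberately loose (two capital letters, an underscore, an optional run of letters, digits, and an optional version suffix), and it is not an official validator.

```python
import re

# Prefix -> molecule type, copied from the RefSeq accession table above.
REFSEQ_MOLECULE = {
    "AC": "Genomic", "NC": "Genomic", "NG": "Genomic",
    "NT": "Genomic", "NW": "Genomic", "NZ": "Genomic",
    "NM": "mRNA", "XM": "mRNA",
    "NR": "RNA", "XR": "RNA",
    "AP": "Protein", "NP": "Protein", "XP": "Protein",
    "YP": "Protein", "ZP": "Protein",
}

# Loose shape check: prefix, underscore, optional letters, digits, optional ".version".
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_[A-Z]{0,4}\d+(?:\.\d+)?$")


def refseq_molecule(accession):
    """Return the molecule type for a RefSeq-style accession, or None."""
    match = REFSEQ_PATTERN.match(accession)
    if match is None:
        return None
    return REFSEQ_MOLECULE.get(match.group(1))


for acc in ["NM_006744", "NP_006735", "NZ_ABCD12345678", "X02775"]:
    print(acc, refseq_molecule(acc))
```

An accession such as X02775 (a GenBank record) has no underscore-separated prefix, so it is reported as `None` rather than misclassified.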
40protein
DNA
RNA
complementary DNA (cDNA)
UniGene
41UniGene: unique genes via ESTs
- Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
- UniGene clusters contain many expressed sequence tags (ESTs), which are DNA sequences (typically 500 base pairs in length) corresponding to the mRNA from an expressed gene. ESTs are sequenced from a complementary DNA (cDNA) library.
- UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution.
42Cluster sizes in UniGene
This is a gene with 1 EST associated; the cluster size is 1
43Cluster sizes in UniGene
This is a gene with 10 ESTs associated; the cluster size is 10
44Cluster sizes in UniGene (human)
Cluster size (ESTs)   Number of clusters
1                     42,800
2                     6,500
3-4                   6,500
5-8                   5,400
9-16                  4,100
17-32                 3,300
500-1000              2,128
2000-4000             233
8000-16,000           21
16,000-30,000         8
UniGene build 194, 8/06
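The first rows of the table use bins whose width doubles each time (1, 2, 3-4, 5-8, 9-16, ...). A short sketch of how per-cluster EST counts could be grouped that way follows. The function names and the input list are hypothetical, not UniGene data, and the bin edges follow exact powers of two, so they differ slightly from the rounded labels used for the larger rows.

```python
from collections import Counter


def size_bin(n):
    """Label the doubling-width bin (1, 2, 3-4, 5-8, ...) that holds cluster size n."""
    if n < 1:
        raise ValueError("cluster size must be at least 1")
    upper = 1
    while upper < n:
        upper *= 2           # smallest power of two that is >= n
    lower = upper // 2 + 1   # bin starts just above the previous power of two
    return str(upper) if lower >= upper else f"{lower}-{upper}"


def bin_cluster_sizes(sizes):
    """Count how many clusters fall into each bin."""
    return Counter(size_bin(n) for n in sizes)


# Hypothetical EST counts for six clusters.
print(bin_cluster_sizes([1, 1, 2, 3, 4, 10]))
```

With this binning, a cluster of 10 ESTs falls in the 9-16 row and a singleton falls in the row for size 1.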
45UniGene: unique genes via ESTs
Conclusion: UniGene is a useful tool to look up information about expressed genes. UniGene displays information about the abundance of a transcript (expressed gene), as well as its regional distribution of expression (e.g. brain vs. liver).
47Ensembl to access protein and DNA sequences
Try Ensembl at www.ensembl.org for a
premier human genome web browser.
48click human
49enter RBP4
52ExPASy to access protein and DNA sequences
ExPASy sequence retrieval system (ExPASy: Expert Protein Analysis System). Visit http://www.expasy.ch/
55Example of how to access sequence data: HIV-1 pol
There are many possible approaches. Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol
57Searching for HIV-1 pol: following the genome link yields a manageable three results
58Example of how to access sequence data: HIV-1 pol
For the Entrez query hiv-1 pol there are about 40,000 nucleotide or protein records (and >100,000 records for a search for hiv-1), but these can be reduced in two easy steps:
- specify the organism, e.g. hiv-1[organism]
- limit the output to RefSeq!
59Over 100,000 nucleotide entries for HIV-1; only 1 RefSeq
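The two narrowing steps (restrict by organism, then limit to RefSeq) can also be written directly into the query text. The helper below is a hedged sketch: the name `limit_query` is invented here, the `[organism]` field tag follows the usage on the slide, and `refseq[filter]` is assumed to be an accepted way of expressing the RefSeq limit in the Entrez nucleotide and protein databases. On the web page the same effect is obtained through the Limits controls.

```python
def limit_query(term, organism=None, refseq_only=False):
    """Compose an Entrez query string with optional organism and RefSeq restrictions."""
    parts = [term]
    if organism:
        parts.append(f"{organism}[organism]")  # field tag as written on the slide
    if refseq_only:
        parts.append("refseq[filter]")         # assumed filter name, see lead-in
    return " AND ".join(parts)


print(limit_query("pol"))
print(limit_query("pol", organism="hiv-1", refseq_only=True))
```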
60Examples of how to access sequence data: histone
Query for histone          Results
protein records            21847
RefSeq entries             7544
RefSeq (limit to human)    1108
NOT deacetylase            697
At this point, select a reasonable candidate (e.g. histone 2, H4) and follow its link to Entrez Gene. There, you can confirm you have the right gene/protein.
8-12-06
62Access to Biomedical Literature
63PubMed at NCBI to find literature information
64PubMed is the NCBI gateway to MEDLINE. MEDLINE
contains bibliographic citations and author
abstracts from over 4,600 journals published in
the United States and in 70 foreign countries.
It has >14 million records dating back to 1966.
65MeSH is the acronym for "Medical Subject
Headings." MeSH is the list of the vocabulary
terms used for subject analysis of biomedical
literature at NLM. MeSH vocabulary is used for
indexing journal articles for MEDLINE. The
MeSH controlled vocabulary imposes uniformity
and consistency to the indexing of biomedical
literature.
68PubMed search strategies
- Try the tutorial (education on the left sidebar)
- Use boolean queries (capitalize AND, OR, NOT), e.g. lipocalin AND disease
- Try using limits
- Try Links to find Entrez information and external resources
- Obtain articles on-line via Welch Medical Library (and download pdf files): http://www.welch.jhu.edu/
Page 35
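Since the advice is to capitalize AND, OR and NOT, a query typed in lower case can be tidied before it is pasted into the search box. This is a small illustrative sketch with a made-up function name. It treats every standalone "and", "or" or "not" as an operator, so it would also capitalize those words inside a quoted phrase.

```python
BOOLEAN_OPERATORS = {"and", "or", "not"}


def capitalize_operators(query):
    """Upper-case standalone boolean operators in a search query."""
    words = query.split()
    return " ".join(
        word.upper() if word.lower() in BOOLEAN_OPERATORS else word
        for word in words
    )


print(capitalize_operators("lipocalin and disease"))    # lipocalin AND disease
print(capitalize_operators("histone not deacetylase"))  # histone NOT deacetylase
```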