Title: Fundamentals in Sequence Analysis 1.(part 1)
1 Fundamentals in Sequence Analysis 1 (part 1)
Review of basic biology and database searching in biology.
2 The Flow of Biotechnology Information
Gene
Function
>DNA sequence
AATTCATGAAAATCGTATACTGGTCTGGTACCGGC
AACAC TGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAA TCT
GGTAAAGACGTCAACACCATCAACGTGTCTGACGTTA ACATCGATGAAC
TGCTGAACGAAGATATCCTGATCCTGGG TTGCTCTGCCATGGGCGATGA
AGTTCTCGAGGAAAGCGAA TTTGAACCGTTCATCGAAGAGATCTCTACC
AAAATCTCTG GTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCG
A CGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGC TACGGTT
GCGTTGTTGTTGAGACCCCGCTGATCGTTCAGA ACGAGCCGGACGAAGC
TGAGCAGGACTGCATCGAATTTGG TAAGAAGATCGCGAACATCTAGTAG
A
>Protein sequence
MKIVYWSGTGNTEKMAELIAKGIIESGKDVN
TINVSDVNI DELLNEDILILGCSAMGDEVLEESEFEPFIEEISTKISGK
KVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNE PDEAEQDC
IEFGKKIANI
3 Prerequisites to Sequence Analysis
- Basic biology, so you can understand the language of the databases: the Central Dogma (transcription, translation), prokaryotes, eukaryotes, CDS, 3' UTR, 5' UTR, introns, exons, promoters, operons, codons, start codons, stop codons, snRNA, hnRNA, tRNA, secondary structure, tertiary structure.
- Before you can analyze sequences, you have to understand their structure, and know about basic biological database searching.
4 Central Dogmas of Molecular Biology
1) The concept of a gene is historically defined on the basis of genetic inheritance of a phenotype (Mendelian inheritance).
2) The DNA of an organism encodes the genetic information. It is a double-stranded helix with a backbone of deoxyribose sugars and phosphates, carrying four bases: Adenine (A), Cytosine (C), Guanine (G) and Thymine (T). Note that only 4 values (A, C, G, T) need to be encoded, which can be done using 2 bits. But to allow ambiguity letters (like N, which means any of the 4 nucleotides), one usually resorts to a 4-bit alphabet.
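A minimal sketch of the 4-bit idea, assuming the common one-bit-per-base convention (A=1, C=2, G=4, T=8). The names `BASE_BITS`, `AMBIGUITY`, `encode` and `matches` are illustrative and not taken from any particular library, and only three ambiguity letters are listed here:

```python
# One bit per base; an ambiguity letter is the bitwise OR of the bases it allows.
BASE_BITS = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000}
AMBIGUITY = {
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "N": "ACGT",  # any of the 4 nucleotides
}

def encode(letter):
    """Return the 4-bit code for a base or an ambiguity letter."""
    letter = letter.upper()
    if letter in BASE_BITS:
        return BASE_BITS[letter]
    code = 0
    for base in AMBIGUITY[letter]:
        code |= BASE_BITS[base]
    return code

def matches(letter_a, letter_b):
    """True if the two letters share at least one possible base."""
    return (encode(letter_a) & encode(letter_b)) != 0

print(encode("N"))        # 15
print(matches("R", "A"))  # True
print(matches("R", "C"))  # False
```

With 2 bits per base there would be no spare bit patterns left for letters such as N, which is why the wider alphabet is used.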
5 Central Dogmas of Molecular Biology
3) Each base on one side of the double helix faces its complementary base: A pairs with T, and G pairs with C.
4) Biochemical processes that read off the DNA (replication and transcription) always read it from the 5' side towards the 3' side.
5) A gene can be located on either the plus strand or the minus strand. Rule 4 imposes the orientation of reading, and rule 3 (complementarity) tells us to complement each base. E.g. if the sequence on the + strand is ACGTGATCGATGCTA, the - strand must be read off by reading the complement of this sequence going backwards, i.e. TAGCATCGATCACGT.
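The reverse-complement rule can be sketched in a few lines. This is a minimal illustration that handles only the four unambiguous bases; the function name `reverse_complement` is an assumption of this sketch:

```python
# Translation table implementing rule 3 (A<->T, G<->C).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse so the result reads 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

print(reverse_complement("ACGTGATCGATGCTA"))  # TAGCATCGATCACGT
```

Applying the function twice returns the original sequence, which is a quick sanity check for any implementation.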
6 Central Dogmas of Molecular Biology
6) DNA information is copied over to mRNA that
acts as a template to produce proteins.
We often concentrate on protein-coding genes, because proteins are the building blocks of cells and the majority of bio-active molecules (but let's not forget the various RNA genes).
7 Prokaryotic genes
Prokaryotes (intronless protein-coding genes)
[Figure: a prokaryotic gene. DNA row: upstream (5'), promoter, gene region (labelled TAC), downstream (3'). mRNA row: 5' UTR, CoDing Sequence (CDS) starting at ATG, 3' UTR. Bottom row: protein.]
Transcription: the gene is encoded on the minus strand, and the reverse complement is read into mRNA.
Translation: tRNAs read off the codons, 3 bases at a time, starting at the start codon and continuing until a STOP codon is reached.
8 Why does Nature bother with the mRNA?
- Why would the cell want to have an intermediate between DNA and the proteins it encodes?
- Gene information can be amplified by having many copies of an RNA made from one copy of DNA.
- Regulation of gene expression can be effected by having specific controls at each element of the pathway between DNA and proteins. The more elements there are in the pathway, the more opportunities there are to control it in different circumstances.
- In eukaryotes, the DNA can then stay pristine and protected, away from the caustic chemistry of the cytoplasm.
9 Prokaryotic genes (operons)
Prokaryotes (operon structure)
[Figure: upstream, promoter, Gene 1, Gene 2, Gene 3, downstream]
In prokaryotes, genes that are part of the same functional pathway are sometimes grouped together under a single promoter. They are then transcribed into a single polycistronic mRNA, from which the 3 separate proteins are translated.
10 Bacterial Gene Structure of signals
- Bacterial genomes have simple gene structure.
- Transcription factor binding site.
- Promoters
  - -35 sequence (T82 T84 G78 A65 C54 A45, where the number after each base is its percent conservation) 15-20 bases
  - -10 sequence (T80 A95 T45 A60 A50 T96) 5-9 bases
- Start of transcription: the initiation start is a purine 90% of the time (sometimes it's the A in CAT)
- Translation binding site (Shine-Dalgarno) 10 bp upstream of AUG (AGGAGG)
- One or more Open Reading Frames
  - start codon (unless the sequence is partial)
  - until the next in-frame stop codon on that strand
  - separated by intercistronic sequences.
- Termination
11 Genetic Code
- How does an mRNA specify an amino acid sequence? The answer lies in the genetic code. It would be impossible for each amino acid to be specified by one nucleotide, because there are only 4 nucleotides and 20 amino acids. Similarly, two-nucleotide combinations could only specify 16 amino acids. The conclusion is that each amino acid is specified by a particular combination of three nucleotides, called a codon (4 x 4 x 4 = 64 possible codons).
- Each 3-nucleotide codon codes for one amino acid.
- The first codon is the start codon, and usually codes for the amino acid methionine (M, which has codon ATG).
- The last codon is the stop codon and does NOT code for an amino acid. It is sometimes represented by * to indicate the STOP codon.
- A coding region (abbreviation CDS) starts at the START codon and ends at the STOP codon.
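As a hedged illustration of how a CDS is read codon by codon, the sketch below builds the standard genetic code from the usual compact T, C, A, G ordering and translates until the first stop codon. The helper name `translate` is an assumption of this sketch, and no alternative codon tables are handled. The short test sequence is the start of the coding region of the DNA shown on slide 2, with a stop codon appended:

```python
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons in TTT, TTC, TTA, TTG, TCT, ... order ("*" = stop).
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")  # databases write U as T
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGAAAATCGTATACTGGTAG"))  # MKIVYW
```

The result agrees with the first six residues of the protein sequence on slide 2.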
12 Codon table
- Note the degeneracy of the genetic code. Each amino acid might have up to six codons that specify it.
- Different organisms have different frequencies of codon usage.
- A handful of species vary from the codon association described above, and use different codons for different amino acids.
- How do tRNAs recognize which codon to bring an amino acid to? The tRNA has an anticodon on its mRNA-binding end that is complementary to the codon on the mRNA. Each tRNA only binds the appropriate amino acid for its anticodon.
13 RNA
- RNA has the same primary structure as DNA. It consists of a sugar-phosphate backbone, with bases attached to the 1' carbon of the sugar. The differences between DNA and RNA are that:
  - RNA has a hydroxyl group on the 2' carbon of the sugar (thus the difference between deoxyribonucleic acid and ribonucleic acid).
  - Instead of the base thymine, RNA uses another base called uracil.
- Because of the extra hydroxyl group on the sugar, RNA is too bulky to form a stable double helix. RNA exists as a single-stranded molecule. However, regions of double helix can form where there is some base-pair complementarity (U and A, G and C), resulting in hairpin loops. The RNA molecule with its hairpin loops is said to have a secondary structure.
- Because the RNA molecule is not restricted to a rigid double helix, it can form many different stable three-dimensional tertiary structures.
14 tRNA (transfer RNA)
tRNA is a small RNA that has a very specific secondary and tertiary structure such that it can bind an amino acid at one end, and mRNA at the other end. It acts as an adaptor to carry the amino acid elements of a protein to the appropriate place as coded for by the mRNA.
[Figures: secondary structure of tRNA; three-dimensional tertiary structure of tRNA]
15 Bacterial Gene Prediction
Most of the consensus sequences are known from E. coli studies, so for each bacterium the exact consensus will differ. Most modern gene prediction programs need to be trained, i.e. they find their own consensus and assembly rules given a few example genes. A few programs find their own rules from a completely unannotated bacterial genome by trying to find conserved patterns. This is feasible because ORFs restrict the search space of possible gene candidates. E.g. the selfid program (selfid_at_igs.cnrs-mrs.fr).
16 Open Reading Frames
- The simplest bacterial gene prediction techniques simply
  - identify all open reading frames (ORFs),
  - and blastx them against known proteins.
  - The ORFs with the best homology are retained first.
- This usually densely covers the bacterial genome with genes. rRNA and tRNA are detected separately using tRNAScan or blastn.
17 Open Reading Frames (ORF)
On a given piece of DNA, there are 6 possible reading frames. The ORF can be on either the plus or the minus strand, and in any of 3 possible frames:
- Frame 1: the 1st base of the start codon is at base 1, 4, 7, 10, ...
- Frame 2: the 1st base of the start codon is at base 2, 5, 8, 11, ...
- Frame 3: the 1st base of the start codon is at base 3, 6, 9, 12, ...
(Frames -1, -2, -3 are on the minus strand.) Some programs have other conventions for naming frames (0..5, 1-6, etc.).
Gene finding in eukaryotic cDNA uses ORF finding and blastx as well. http://www.ncbi.nlm.nih.gov/gorf/gorf.html (try with GI 41, or your own piece of DNA)
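The six-frame scan can be sketched as follows. This is a simplified illustration and not the algorithm of the NCBI ORF finder: it assumes ATG is the only start codon, requires an in-frame stop codon, and reports minus-strand coordinates relative to the reverse-complemented sequence. The names `find_orfs` and `min_codons` are made up for this sketch:

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def find_orfs(seq, min_codons=1):
    """Return (frame, start, end) for each ATG...stop ORF in all six frames.

    Coordinates are 1-based and inclusive on the strand that was scanned.
    """
    seq = seq.upper()
    minus = seq.translate(COMPLEMENT)[::-1]
    orfs = []
    for sign, strand in (("+", seq), ("-", minus)):
        for offset in range(3):
            start = None
            for i in range(offset, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first in-frame start codon
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((sign + str(offset + 1), start + 1, i + 3))
                    start = None                   # look for the next ORF
    return orfs

print(find_orfs("CCATGAAATAGCC"))  # [('+3', 3, 11)]
```

In the toy sequence the start codon begins at base 3, so the ORF is reported in frame +3, matching the frame numbering described above.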
18 Eukaryotic Central Dogma
In eukaryotes (cells where the DNA is sequestered in a separate nucleus), the DNA does not contain a contiguous copy of the coding gene; rather, the introns must be spliced out and the exons joined. (Many eukaryotic genes contain no introns, particularly in lower organisms.)
mRNA (messenger RNA) contains the assembled copy of the gene. The mRNA acts as a messenger to carry the information stored in the DNA in the nucleus to the cytoplasm, where the ribosomes can make it into protein.
19 Eukaryotic Nuclear Gene Structure
- Gene prediction for Pol II transcribed genes.
- Upstream enhancer elements.
- Upstream promoter elements.
  - GC box (-90 nt) (20 bp), CAAT box (-75 nt) (22 bp)
  - TATA promoter (-30 nt) (70, 15 nt consensus (Bucher et al. (1990)))
  - 14-20 nt spacer DNA
  - CAP site (8 bp)
- Transcription initiation.
- Transcript region, interrupted by introns. Translation initiation (Kozak signal, 12 bp consensus) 6 bp prior to the initiation codon.
- polyA signal (AATAAA 99, other)
20 Introns
- Transcript region, interrupted by introns. Each intron
  - starts with a donor site consensus (G100 T100 A62 A68 G84 T63 .., where the number after each base is its percent conservation)
  - has a branch site near the 3' end of the intron (one not very conserved consensus: UACUAAC)
  - ends with an acceptor site consensus (12 Py .. N C65 A100 G100)
[Figure labels: UG, UACUAAC, AG]
21 Exons
- The exons of the transcript region are composed of
  - 5' UTR (mean length of 769 bp, with a specific base composition that depends on the local GC content of the genome)
  - AUG (or other start codon)
  - Remainder of coding region
  - Stop codon
  - 3' UTR (mean length of 457, with a specific base composition that depends on the local GC content of the genome)
22 Structure of the Eukaryotic Genome
- 6-12% of human DNA encodes proteins (higher fraction in nematode)
- 10% of human DNA codes for UTR
- 90% of human DNA is non-coding.
23 Non-Coding Eukaryotic DNA
- Untranslated regions (UTRs)
- Introns (there can be genes within the introns of another gene!)
- Intergenic regions
  - repetitive elements
  - pseudogenes (dead genes that may or may not have been retroposed back into the genome as a single-exon gene)
24 Pseudogenes
Pseudogene: a DNA sequence that might code for a gene, but that is unable to result in a protein. This deficiency might be in transcription (lack of a promoter, for example), in translation, or both.
Processed pseudogene: a gene retroposed back into the genome after being processed by the splicing apparatus. Thus it is fully spliced and has a polyA tail. The insertion process flanks the mRNA sequence with short direct repeats. Thus it has no promoter, unless it is accidentally retroposed downstream of a promoter sequence. Do not confuse with single-exon genes.
25 Repeats
Each repeat family has many subfamilies.
- ALU: 300 nt long, 600,000 elements in the human genome. Can cause false homology with mRNA. Many have an AluI restriction site.
- Retroposons (can get copied back into the genome).
  - Telltale sign: direct or inverted repeats flank the repeated element. That repeat was the priming site for the RNA that was inserted.
  - LINEs (Long INterspersed Elements): L1 is 1-7 kb long, 50,000 copies. They have two ORFs, which will cause problems for gene prediction programs.
  - SINEs (Short INterspersed Elements)
26 Low-Complexity Elements
- When analyzing sequences, one often relies on the fact that two stretches are similar to infer that they are homologous (and therefore related). But sequences with repeated patterns will match without there being any phylogenetic relation!
- Sequences like ATATATACTTATATA, which are mostly two letters, are called low-complexity.
- Triplet repeats (particularly CAG) have a tendency to make the replication machinery stutter, so they are amplified.
- The low-complexity sequence can also be hidden at the translated protein level.
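One simple way to put a number on "low complexity" is the Shannon entropy of the letter composition. This is only an illustrative measure, not the method used by any particular masking program, and the function name `shannon_entropy` is an assumption of this sketch:

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Entropy in bits per base: 2.0 for uniform ACGT, lower for biased composition."""
    counts = Counter(seq.upper())
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(shannon_entropy("ACGTACGTACGT"))     # uniform composition
print(shannon_entropy("ATATATACTTATATA"))  # the mostly two-letter example above
```

A sequence using all four bases equally scores 2 bits per base, while the mostly two-letter example scores noticeably lower. Note that composition alone does not catch every repeat (ACGTACGT is periodic but has full entropy), which is one reason dedicated masking tools use richer models.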
27 Masking
- To avoid finding spurious matches in alignment programs, you should always mask the query sequence.
- Before predicting genes it is a good idea to mask out repeats (at least those containing ORFs).
- Before running blastn against a genomic record, you must mask out the repeats.
- Most used programs
  - CENSOR
  - RepeatMasker
  - http://ftp.genome.washington.edu/cgi-bin/RepeatMasker
28 More Non-Protein genes
- rRNA: ribosomal RNA is one of the structural components of the ribosome. It has sequence complementarity to regions of the mRNA so that the ribosome knows where to bind to an mRNA it needs to make protein from.
- snRNA: small nuclear RNA is involved in the machinery that processes RNAs as they travel between the nucleus and the cytoplasm.
- hnRNA: heterogeneous nuclear RNA, the unprocessed primary transcripts (pre-mRNA) found in the nucleus.
29 Protein Processing and localization
- The protein as read off from the mRNA may not be in the final form that will be used in the cell. Some proteins contain
  - Signal peptide (located at the N-terminus (beginning)): this signal peptide is used to guide the protein towards its final cellular localization. It is cleaved off at the cleavage site once the protein has reached (or is near) its final destination.
  - Various post-translational modifications (phosphorylation)
- The final protein is called the mature peptide.
30 Convention for nucleotides in database
Because the mRNA is actually read off the minus strand of the DNA, the nucleotide sequences are always quoted on the minus strand. In bioinformatics the sequence format does NOT make a difference between Uracil and Thymine. There is no symbol for Uracil; it is always represented by a T. Even genomic sequence follows that convention. A gene on the plus strand is quoted so that it is in the same strand as its product mRNA.
31 Biology Information on the Internet
32 Biology Information on the Internet
- Introduction to databases
- Searching the Internet for biology information.
  - General search methods
  - Biology web sites
- Introduction to Genbank file format.
- Introduction to Entrez and Pubmed
- Ref: Chapters 1, 2, 5, 6 of Bioinformatics
33 Databases
- A collection of records.
- Each record has many fields.
- Each field contains specific information.
- Each field has a data type.
  - E.g. money, currency, text field, integer, date, address (text field), citation (text field)
- Each record has a primary key: a UNIQUE identifier that unambiguously defines this record.
Spreadsheet: flat-file version of a database.
34 GI: Genbank Identifier
- GI: unique key, primary key. Changes with each update of the sequence record.
- Accession number: secondary key. Points to the same locus and sequence despite sequence updates.
- Accession.version number: equivalent to GI.
35 Relational Database
Normalizing a database for repeated sub-elements of a database: splitting it into smaller databases, and relating the sub-databases to the first one using the primary key.
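To make the normalization idea concrete, here is a small sketch using Python's built-in `sqlite3` module. The table and column names are hypothetical; the values are taken from the GenBank record X63129 used later in these slides. The repeated sub-element (a record can have several references) is split into its own table and related back through the primary key:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sequence (
    accession TEXT PRIMARY KEY,   -- primary key of the main table
    organism  TEXT
);
CREATE TABLE reference (
    id        INTEGER PRIMARY KEY,
    accession TEXT REFERENCES sequence(accession),  -- link back to the record
    title     TEXT
);
""")
con.execute("INSERT INTO sequence VALUES (?, ?)", ("X63129", "Bos taurus"))
con.executemany(
    "INSERT INTO reference (accession, title) VALUES (?, ?)",
    [
        ("X63129", "Direct Submission"),
        ("X63129", "Complete cDNA sequence of bovine alpha 1-antitrypsin"),
    ],
)

# Relate the sub-table to the main table through the primary key.
rows = con.execute(
    "SELECT s.organism, r.title FROM sequence s "
    "JOIN reference r ON r.accession = s.accession ORDER BY r.id"
).fetchall()
for organism, title in rows:
    print(organism, "|", title)
```

The organism is stored once, however many references the record accumulates.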
36 Types of Relational databases.
- The Internet can be thought of as one enormous relational database.
  - The links/URLs are the primary keys.
- SQL (Structured Query Language)
  - Sybase, Oracle, Access (database systems)
  - Sybase is used at NCBI.
- SRS (one type of database querying system in use in biology)
37 Indexed searches.
- To allow easy searching of a database, make an index.
- An index is a list of primary keys corresponding to a key in a given field (or to a collection of fields).
38 Indexed searches.
- Boolean query: merging and intersecting lists
  - AND (in both lists) (e.g. human AND genome)
  - OR (in either list) (e.g. human OR genome)
[Figure: Venn diagrams of the "human" and "genome" result lists]
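A toy illustration of an index and of Boolean queries as list intersection and merging. The three documents and the helper `build_index` are invented for this sketch; real search systems add stemming, ranking and much more:

```python
from collections import defaultdict

# Hypothetical mini-database: primary key -> text field.
documents = {
    1: "the human genome project",
    2: "human anatomy",
    3: "the mouse genome",
}

def build_index(docs):
    """Map each word to the set of primary keys of the records containing it."""
    index = defaultdict(set)
    for key, text in docs.items():
        for word in text.lower().split():
            index[word].add(key)
    return index

index = build_index(documents)
both = index["human"] & index["genome"]    # AND: intersect the two lists
either = index["human"] | index["genome"]  # OR: merge the two lists

print(sorted(both))    # [1]
print(sorted(either))  # [1, 2, 3]
```

AND can only shrink a result list and OR can only grow it, which is why the OR query finds the most matches.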
39 Search strategies
- Search engines use complex strategies that go beyond Boolean queries.
  - Phrase matching
    - human genome -> "human genome"
  - Togetherness: documents with human close to genome are scored higher.
  - Term expansion: synonyms
    - human -> homo sapiens
  - Neighbours
    - human genome -> genome projects, chromosomes, genetics
  - Frequency of links (www.google.com)
- To avoid this term mapping, enclose your queries in quotes: "human" AND "genome"
40 Search strategies
- Search engines use complex strategies that go beyond Boolean queries.
- To avoid this term mapping, enclose your queries in quotes: "human" AND "genome"
- To require that ALL the terms in your query be important, precede them with a +. This also prevents term mapping.
- To force the order of the words to be important, group sentences within quoted strings: "biology of mammals".
41 Indexed searches.
- Example
  - find the advanced query page at http://www.altavista.com
  - type human (and hit the Search button)
  - type genome
  - type human AND genome
  - type "human genome" (finds the fewest matches)
  - type human OR genome (finds the most matches)
42 Search Engines
- Web spiders: a collection of all web pages. Since web pages change all the time and new ones appear, spiders must constantly roam the web and re-index, or depend on people submitting their own pages.
  - www.google.com (BEST!)
  - www.infoseek.com
  - www.lycos.com
  - www.exite.com
  - www.webcrawler.com
  - www.looksmart.com (country specific)
43 Search Engines
- www.google.com (BEST!)
  - Google ranks pages according to how many pages with those terms refer to the pages you are asking for. Not only must one document contain ALL the search terms, but other documents which refer to this one must also contain all the terms.
  - Great when you know what you are looking for! You can also use quotes to require immediate proximity and order of terms.
  - E.g. type "Web server for the blast program".
  - But Google only indexes about 40% of the web, so you may have to use other web spiders.
  - (Disclaimer: I don't own stock in that company, but I'd like to.)
44 Search Engines
- Curated collections: not comprehensive. They contain a list of the best sites for commonly requested topics, but miss important sites for more specialized topics (like biology).
  - www.yahoo.com (has travel maps too!)
- Answer-based curated collections: easy to use, English-like queries. They first look at a list of predefined answers, then refine the answers based on user interaction. They also answer new questions.
  - www.askjeeves.com
  - www.magellan.com
  - www.altavista.com (has translation TOOLS)
  - www.hotbot.com
45 Search Engines
- Meta-search engines: they poll several search engines and return the consensus of all results. They are likely to miss sites, but the sites they return are very relevant to the query.
- The other operating mode is to return the sum of all the results, which then becomes very sensitive to a very detailed query.
  - www.metacrawler.com
  - www.savvysearch.com
  - www.1blink.com (fast)
  - www.metafind.com
  - www.dogpile.com
46 Virtual Libraries: curated collections of links for biologists (by biologists)
- Pedro's BioMolecular Research Tools (1996)
  - http://www.public.iastate.edu/pedro/
- Virtual Library: Bio Sciences
  - http://vlib.org/Biosciences.html
- Publications and abstract search.
  - http://www.ncbi.nlm.nih.gov/
- Expasy server
  - http://www.expasy.ch
- EBI Biocatalog (software and databases list)
  - http://www.ebi.ac.uk/biocat/
47 Biological Databases
- Nucleotide databases
  - Genbank: international collaboration
    - NCBI (USA), EMBL (Europe), DDBJ (Japan and Asia)
    - A bank: no curation. Submission to these databases is required for publication in a journal.
  - Organism-specific databases (Exercise: find the URLs using search engines)
    - FlyBase
    - ChickGBASE
    - pigbase
    - wormpep
    - YPD (Yeast Protein Database)
    - SGD (Saccharomyces Genome Database)
48 Protein Databases
- NCBI
  - Swiss Prot (free for academic use, otherwise commercial. Licensing restrictions on discoveries made using the DB. 1998 version free of any licensing)
    - http://www.expasy.ch (latest pay version)
    - NCBI has the latest free version.
  - Translated proteins from Genbank submissions
- EMBL
  - TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT
- PIR
49 Structure databases
- PDB: protein structure database.
  - http://www.rscb.org/pdb/
- MMDB: NCBI's version of PDB with Entrez links.
  - http://www.ncbi.nlm.nih.gov
- Genome mapping information
  - http://www.il-st-acad-sci.org/health/genebase.html
  - NCBI (Human)
  - Genome centers
    - Stanford, Washington University, Stanford
  - Research centers and universities
50 Literature databases
- NCBI Pubmed: all biomedical literature.
  - www.ncbi.nlm.nih.gov
  - Abstracts and links to publisher sites for
    - full text retrieval/ordering
    - journal browsing.
- Publisher web sites.
- Biomednet: commercial site for literature search.
- Pathways database
  - KEGG: Kyoto Encyclopedia of Genes and Genomes www.genome.ad.jp/kegg/kegg/html
51 Database Identifiers: primary keys
- GI (changes with each sequence update; for NCBI only)
  - Annotation may change without the GI changing!
- Accession (stable)
- Version (changes with each sequence update)
  - Version also refers to Accession.version
- Secondary accession: records may have been merged in the past, so the records which were not chosen as the primary were made secondary.
52 Primary Databases
- A primary database is a repository of data derived from experiments or from research knowledge.
  - Genbank (nucleotide repository)
  - Protein DB, Swissprot
  - PDB (MMDB) are primary databases.
  - Pubmed (literature)
  - Genome mapping databases.
  - KEGG database (pathways)
53 Secondary Databases
- A secondary database contains information derived from other sources.
  - Refseq (curated collection of Genbank at NCBI)
  - Unigene (clustering of ESTs at NCBI)
- Organism-specific databases are often a mix between primary and secondary.
54 Genbank Records
- A bank: no attempt at reconciliation.
- Submit a sequence -> get an accession number!
- Cannot modify sequences without the submitter's consent.
- No attempt at reconciliation (not a unique collection per LOCUS/gene).
- Entries of various sequence quality and different sources -> separated into various divisions:
  - High-quality sequences in taxon-specific divisions.
  - Low-quality sequences in usage-specific databases.
- A collaboration between NCBI, EMBL and DDBJ. They contain (nearly) the same information; only the data format differs.
EMBL does not differentiate between the different types of RNA records, while NCBI (and DDBJ) do. In Entrez, EMBL records are patched up to add that information.
55 Refseq and LocusLink
- Attempt to produce 1 mRNA, 1 protein, and 1 genomic gene for each frequently occurring allele of a protein-expressing gene.
- www.ncbi.nlm.nih.gov/LocusLink
- Special non-Genbank accession numbers
  - NM_nnnnnn: mRNA refseq
  - NP_nnnnnn: protein refseq
  - NC_nnnnnn: refseq genomic contig
  - NT_nnnnnn: temporary genomic contig
  - NX_nnnnnn: predicted gene
56 Genbank divisions
- Sequences in Genbank are split into various categories based on
  - the quality and type of sequences
  - The high-quality nucleotide sequences are divided into organism-dependent divisions.
57 Genbank entry type (and query to restrict to that field)
- mRNA (1/10000 errors)
  - biomol_mRNA[PROP]
- cDNA (EST, 95-99% accuracy, single pass)
  - gbdiv_EST[PROP]
- genomic (biomol_genomic[PROP])
  - in HTGS division: >99% accuracy
    - gbdiv_HTG[PROP]
  - GSS (low-quality genome survey sequences)
    - gbdiv_GSS[PROP]
  - rest of Genbank: 1/10000 accuracy.
    - Human: gbdiv_PRI[PROP]
    - mouse: gbdiv_ROD[PROP]
    - bovine: gbdiv_MAM[PROP]
- STS (EST or cDNA used in mapping)
  - gbdiv_STS[PROP]
58 FASTA Format
MOST important data format!!!
- >identifier descriptive text
- nucleotide or amino-acid
- sequence on multiple lines if needed.
- Example
  - >gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin
  - GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC
  - CATCACGCGGGGCCTTCTGCTGCTGGC .
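A minimal FASTA reader, written as a sketch rather than a replacement for an established parser. It assumes the defline starts with ">" and that the identifier is everything up to the first space, with NCBI-style "|" separators inside it. The function names are illustrative, and the example record contains only the sequence fragment shown on the slide:

```python
def parse_fasta(text):
    """Return a list of (defline, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span multiple lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def split_defline(header):
    """Split a defline into (identifier, descriptive text)."""
    identifier, _, description = header.partition(" ")
    return identifier, description

example = """>gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin
GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC
CATCACGCGGGGCCTTCTGCTGCTGGC
"""

header, sequence = parse_fasta(example)[0]
identifier, description = split_defline(header)
print(identifier.split("|"))  # ['gi', '41', 'emb', 'X63129.1', 'BTA1AT']
print(len(sequence))          # 70
```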
59 Modified FASTA Format
- A few tools follow the convention that lower-case sequences are masked (RepeatMasker, some versions of blast, megablast, blastz).
- A few analysis tools (like CLUSTAL) want a simplified identifier on the defline, so they can have a short string for the alignment.
  - >X63129.1
  - GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC
  - CATCACGCGGGGCCTTCTGCTGCTGGC .
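The lower-case masking convention can be converted to "hard" masking (masked bases replaced by N) for tools that do not understand lower case. A small sketch with hypothetical helper names and a made-up test sequence:

```python
def hard_mask(seq, mask_char="N"):
    """Replace every lower-case (soft-masked) base with mask_char."""
    return "".join(mask_char if base.islower() else base for base in seq)

def masked_fraction(seq):
    """Fraction of the sequence that is soft-masked (lower case)."""
    if not seq:
        return 0.0
    return sum(base.islower() for base in seq) / len(seq)

soft = "GACCAGatatatatGGAC"  # invented example: a masked AT repeat
print(hard_mask(soft))       # GACCAGNNNNNNNNGGAC
print(masked_fraction(soft))
```

Soft masking keeps the original bases recoverable, whereas hard masking discards them, so it is usually safer to store the soft-masked form and convert on demand.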
60 WIM now will talk about GCG
61 Feature table (NCBI and EMBL/DDBJ)
- http://www.ncbi.nlm.nih.gov/collab/FT/index.html
62 Genbank Data format
- LOCUS BTA1AT 1380 bp mRNA MAM 30-APR-1992
- DEFINITION B.taurus mRNA for alpha-1-antitrypsin.
- ACCESSION X63129
- NID g41
- VERSION X63129.1 GI:41
- KEYWORDS alpha-1 antitrypsin; serine protease inhibitor; serpin.
- SOURCE Bos taurus.
- ORGANISM Bos taurus
- Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria;
- Artiodactyla; Ruminantia; Pecora; Bovoidea; Bovidae; Bovinae; Bos.
63 Genbank References
- LOCUS BTA1AT 1380 bp mRNA MAM 30-APR-1992
- ...
- REFERENCE 1 (bases 1 to 1380)
- AUTHORS Sinha,D.
- TITLE Direct Submission
- JOURNAL Submitted (22-OCT-1991) D. Sinha, Dept of Biochemistry, Temple University, 3400 North Broad Street, Philadelphia, PA 19140, USA
- REFERENCE 2 (bases 1 to 1380)
- AUTHORS Sinha,D., Bakhshi,M.R. and Kirby,E.P.
- TITLE Complete cDNA sequence of bovine alpha 1-antitrypsin
- JOURNAL Biochim. Biophys. Acta 1130 (2), 209-212 (1992)
- MEDLINE 92223096
- FEATURES Location/Qualifiers
64 Genbank Source Qualifier
- LOCUS BTA1AT 1380 bp mRNA MAM 30-APR-1992
- ...
- FEATURES Location/Qualifiers
- source 1..1380
- /organism="Bos taurus"
- /db_xref="taxon:9913"
- /tissue_type="liver"
- /cell_type="hepatocyte"
- /clone_lib="lambda gt11"
- /clone="2f-Ic"
- mRNA <1..>1380
- sig_peptide 33..104
- ...
65 Genbank mRNA and CDS features
- mRNA <1..>1380
- sig_peptide 33..104
- CDS 33..1283
- /codon_start=1
- /product="alpha-1-antitrypsin"
- /protein_id="CAA44840.1"
- /db_xref="PID:g42"
- /db_xref="GI:42"
- /db_xref="SWISS-PROT:P34955"
- /translation="MALSITRGLLLLAALCCLAPISLAGVLQGHAVQETD
DTSHQEAACHKIAPNLANFAFSIYHHLAHQSNTSNIFFSPVSIASAFAML
SLGAKGNTHTEILKGLGFNLTELAEAEIHKGFQHLLHTLNQPNHQLQLTT
GNGLFINESAKLVDTFLEDVKNLYHSEAFSINFRDAEEAKKKINDYVEKG
SHGKIVELVKVLDPNTVFALVNYISFKGKWEKPFEMKHTTERDFHVDEQT
TVKVPMMNRLGMFDLHYCDKLASWVLLLDYVGNVTACFILPDLGKLQQLE
DKLNNELLAKFLEKKYASSANLHLPKLSISETYDLKSVLGDVGITEVFSD
RADLSGITKEQPLKVSKALHKAALTIDEKGTEAVGSTFLEAIPMSLPPDV
EFNRPFLCILYDRNTKSPLFVGKVVNPTQA"
- mat_peptide 105..1280
- /product="alpha-1-antitrypsin"
- polyA_signal 1343..1348
- polyA_site 1368
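Feature locations such as 33..1283 are 1-based and inclusive, and a leading < or trailing > marks a partial end (as in <1..>1380). The sketch below parses only this simple "start..end" form; single positions, joins and complements are deliberately left out, and the function names are illustrative:

```python
import re

# Matches "33..1283" and partial forms such as "<1..>1380".
LOCATION = re.compile(r"^(<?)(\d+)\.\.(>?)(\d+)$")

def parse_location(loc):
    """Return (start, end, partial_5prime, partial_3prime), 1-based inclusive."""
    match = LOCATION.match(loc.strip())
    if match is None:
        raise ValueError("unsupported location: " + loc)
    lt, start, gt, end = match.groups()
    return int(start), int(end), lt == "<", gt == ">"

def extract(seq, loc):
    """Slice the feature out of a sequence (Python slices are 0-based)."""
    start, end, _, _ = parse_location(loc)
    return seq[start - 1:end]

start, end, _, _ = parse_location("33..1283")
length = end - start + 1
print(length, length % 3)            # 1251 0
print(parse_location("<1..>1380"))   # (1, 1380, True, True)
```

The CDS spans 1251 bases, i.e. 417 codons. That is consistent with the other features in the record: 24 codons of signal peptide (33..104), 392 codons of mature peptide (105..1280) and one stop codon.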
66 Genbank Sequence format
- ...
- BASE COUNT 357 a 413 c 322 g 288 t
- ORIGIN
- 1 gaccagccct gacctaggac agtgaatcga taatggcact ctccatcacg cggggccttc
- 61 tgctgctggc agccctgtgc tgcctggccc ccatctccct ggctggagtt ctccaaggac
- 121 acgctgtcca agagacagat gatacatccc accaggaagc agcgtgccac aagattgccc
- 181 ccaacctggc caactttgcc ttcagcatat accaccattt ggctcatcag tccaacacca
- 241 gcaacatctt cttctccccc gtgagcatcg cttcagcctt tgcgatgctc tccctgggag
- 301 ccaagggcaa cactcacact gagatcctga agggcctggg tttcaacctc actgagctcg
- 361 cagaggctga gatccacaaa ggctttcagc atcttctcca caccctgaac cagccaaacc
- ...
- 1321 gtccccccac tccctccatg gcattaaagg atgactgacc tagccccgaa aaaaaaaaaa
- //
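The sequence block is easy to read back into a plain string: drop the position numbers and the spaces, and stop at the // terminator. A sketch under the assumption that the caller passes only the lines after ORIGIN; `parse_origin` is a hypothetical helper name:

```python
from collections import Counter

def parse_origin(lines):
    """Join the sequence lines that follow ORIGIN, up to the // terminator."""
    pieces = []
    for line in lines:
        line = line.strip()
        if line.startswith("//"):
            break
        # Keep letters only: this drops position numbers, spaces and bullets.
        pieces.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(pieces)

origin_lines = [
    "1 gaccagccct gacctaggac agtgaatcga taatggcact ctccatcacg cggggccttc",
    "61 tgctgctggc agccctgtgc tgcctggccc ccatctccct ggctggagtt ctccaaggac",
    "//",
]
sequence = parse_origin(origin_lines)
print(len(sequence))      # 120
print(Counter(sequence))  # per-base counts for these two lines only
```

On the complete record, the counts should reproduce the BASE COUNT line, whose four numbers add up to the 1380 bp given on the LOCUS line.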
67 EMBL DATA FORMAT
- EMBL: http://www.ebi.ac.uk/Databases/
- http://www.ebi.ac.uk/cgi-bin/emblfetch
- Use accession X63129
68 DDBJ DATA FORMAT
- DDBJ: http://www.ddbj.nig.ac.jp/
- http://ftp2.ddbj.nig.ac.jp:8000/getstart-e.html
- Use accession X63129
- The flat file format is the same as the NCBI/Genbank format.
69 Entrez
- Index-based search system. Each field in the database is searchable individually or as an aggregate.
  - (e.g. CDS[FKEY])
  - the default is the aggregate [ALL FIELDS]
- All primary databases are interlinked as one big relational database.
  - (e.g. Pubmed links in Genbank records)
- Phrase matching.
  - Human genome -> "human genome"
70 Entrez
- Available neighbours (related documents or related sequences)
- In Pubmed searches: term mapping to neighbouring documents and neighbouring terms.
- Term mapping to chemical names.
  - In Pubmed, term[All Fields] is term-mapped to chemical names, MeSH terms and Text Fields,
  - unless the term is within double quotes.
71 Entrez
- http://www.ncbi.nlm.nih.gov/Entrez/
- Tutorials
  - http://www.ncbi.nlm.nih.gov/Class/MLACourse/Genetics/index.html
  - http://www.ncbi.nlm.nih.gov/Literature/pubmed_search.html
  - http://www.ncbi.nlm.nih.gov/Database.tut1.html
72 SWISSPROT
http://www.expasy.ch/sprot/sprot_details.html
- Core data: protein sequence data, the citation information and the taxonomic data
- Annotation
  - Function(s) of the protein
  - Domains and sites. For example: calcium binding regions, ATP-binding sites, zinc fingers, homeobox, kringle, etc.
  - Post-translational modification(s). For example: carbohydrates, phosphorylation, acetylation, GPI-anchor, etc.
  - Secondary structure
  - Quaternary structure. For example: homodimer, heterotrimer, etc.
  - Similarities to other proteins
  - Disease(s) associated with deficiencies in the protein
  - Sequence conflicts, variants, etc.
73 SWISSPROT
http://www.expasy.ch/cgi-bin/get-random-entry.pl?S
74 REBASE (Restriction Enzymes dataBASE)
Restriction enzymes have a pattern recognition sequence, and within or a few bases away from that pattern is the actual cutting site.
http://rebase.neb.com/rebase/rebase.html
I prefer the Bairoch format (SWISSPROT format): http://rebase.neb.com/rebase/rebase.f19.html
- ID enzyme name
- ET enzyme type
- OS microorganism name
- PT prototype
- RS recognition sequence, cut site
- MS methylation site (type)
- CR commercial sources for the restriction enzyme
- CM commercial sources for the methylase
- RN count
- RA authors
- RL jour, vol, pages, year, etc.
75 Exercises
- You can work in teams for this.
- 1a) Use the first 6000 bases of your genomic piece, or find a bacterial genomic or mRNA sequence in Entrez with a length between 2000 and 10000.
- 1b) Use the ORF finder to find the gene(s). Compare the answer you get to the annotation you can infer from using blastn against Genbank and from using blastx against a protein database.
- Do the Entrez exercises (separate Word document).