Fundamentals in Sequence Analysis 1.(part 1)

Slides: 76

Provided by: Sico

1
Fundamentals in Sequence Analysis 1.(part 1)
Review of basic biology and database searching in
biology.

Hugues Sicotte
NCBI

2
The Flow of Biotechnology Information
Gene
Function
>DNA sequence
AATTCATGAAAATCGTATACTGGTCTGGTACCGGC
AACACTGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAATCT
GGTAAAGACGTCAACACCATCAACGTGTCTGACGTTAACATCGATGAAC
TGCTGAACGAAGATATCCTGATCCTGGGTTGCTCTGCCATGGGCGATGA
AGTTCTCGAGGAAAGCGAATTTGAACCGTTCATCGAAGAGATCTCTACC
AAAATCTCTGGTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCG
ACGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGCTACGGTT
GCGTTGTTGTTGAGACCCCGCTGATCGTTCAGAACGAGCCGGACGAAGC
TGAGCAGGACTGCATCGAATTTGGTAAGAAGATCGCGAACATCTAGTAG
A
>Protein sequence
MKIVYWSGTGNTEKMAELIAKGIIESGKDVN
TINVSDVNIDELLNEDILILGCSAMGDEVLEESEFEPFIEEISTKISGK
KVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNEPDEAEQDC
IEFGKKIANI
3
Prerequisites to Sequence Analysis

Basic biology, so you can understand the language
of the databases: the Central Dogma (transcription,
translation), prokaryotes, eukaryotes, CDS, 3' UTR,
5' UTR, introns, exons, promoters, operons,
codons, start codons, stop codons, snRNA, hnRNA, tRNA,
secondary structure, tertiary structure.
Before you can analyze sequences, you have to
understand their structure, and know about basic
biological database searching.

4
   Central Dogmas of Molecular Biology
1) The concept of a gene is historically defined
on the basis of the genetic inheritance of a
phenotype (Mendelian inheritance).
2) The DNA of an organism encodes its genetic
information. It is a double-stranded helix with a
deoxyribose sugar-phosphate backbone carrying four
bases: Adenine (A), Cytosine (C), Guanine (G) and
Thymine (T). Note that only 4 values (A, C, G, T)
need to be encoded, which can be done using 2 bits.
But to allow ambiguity codes for combinations of
letters (like N, meaning any of the 4 nucleotides),
one usually resorts to a 4-bit alphabet.
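As a rough sketch of why 4 bits are convenient (the bit assignment and the names `BASE_BITS`, `IUPAC_BITS`, `encode` and `compatible` are illustrative assumptions, not taken from any particular tool): if each base gets its own bit, an ambiguity code is simply the OR of the bases it stands for, and two symbols can match when their bit patterns overlap.

```python
# One bit per base (an arbitrary, illustrative assignment).
BASE_BITS = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000}

# A few ambiguity codes expressed as unions of the base bits.
IUPAC_BITS = dict(BASE_BITS)
IUPAC_BITS.update({
    "R": BASE_BITS["A"] | BASE_BITS["G"],  # purine: A or G
    "Y": BASE_BITS["C"] | BASE_BITS["T"],  # pyrimidine: C or T
    "N": 0b1111,                           # any of the four bases
})


def encode(seq):
    """Encode a sequence as a list of 4-bit integers."""
    return [IUPAC_BITS[base] for base in seq.upper()]


def compatible(code_a, code_b):
    """Two symbols can match if they share at least one base."""
    return (code_a & code_b) != 0


print(encode("ACGTN"))
print(compatible(IUPAC_BITS["N"], IUPAC_BITS["G"]))
print(compatible(IUPAC_BITS["R"], IUPAC_BITS["C"]))
```

With only 2 bits per base the four plain letters fit, but there is no room left for a symbol such as N.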
5
   Central Dogmas of Molecular Biology
3) Each base on one strand of the double helix
faces its complementary base on the other strand:
A pairs with T, and G pairs with C.
4) Biochemical processes that read off the DNA
(replication and transcription) always proceed so
that the resulting sequence runs from its 5' side
towards its 3' side.
5) A gene can be located on either the plus strand
or the minus strand. Rule 4 imposes the orientation
of reading, and rule 3 (complementarity) tells us
to complement each base. E.g. if the sequence on
the + strand is ACGTGATCGATGCTA, the - strand
must be read off by reading the complement of
this sequence going backwards, i.e.
TAGCATCGATCACGT
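The reverse-complement rule can be sketched in a few lines of Python (the names `COMPLEMENT` and `reverse_complement` are illustrative; only the four unambiguous bases are handled here):

```python
# Translation table for the four unambiguous bases, in both cases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then read the result backwards."""
    return seq.translate(COMPLEMENT)[::-1]


# The example from the slide.
print(reverse_complement("ACGTGATCGATGCTA"))  # TAGCATCGATCACGT
```

Applying the function twice returns the original sequence, which is a handy sanity check.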
6
   Central Dogmas of Molecular Biology
6) DNA information is copied over to mRNA that
acts as a template to produce proteins.
We often concentrate on protein coding genes,
because proteins are the building blocks of cells
and the majority of bio-active molecules (but
let's not forget the various RNA genes).
7
Prokaryotic genes
Prokaryotes (intronless protein-coding genes)
[Diagram: DNA with an upstream (5') promoter, the
gene region (labelled TAC) and a downstream (3')
region; below it the mRNA (labelled ATG) with its
5' UTR, CoDing Sequence (CDS) and 3' UTR; below
that the protein.]
Transcription: the gene is encoded on the minus
strand, and the reverse complement is read into mRNA.
Translation: tRNAs read off the codons, 3 bases
at a time, starting at the start codon (ATG) and
continuing until a STOP codon is reached.
8
Why does Nature bother with the mRNA?

Why would the cell want to have an intermediate
between DNA and the proteins it encodes?
Gene information can be amplified by having many
copies of an RNA made from one copy of DNA.
Regulation of gene expression can be effected by
having specific controls at each element of the
pathway between DNA and proteins. The more
elements there are in the pathway, the more
opportunities there are to control it in
different circumstances.
In Eukaryotes, the DNA can then stay pristine and
protected, away from the caustic chemistry of the
cytoplasm.

9
Prokaryotic genes (operons)
Prokaryotes (operon structure)
downstream
promoter
upstream
Gene 1
Gene 2
Gene 3
In prokaryotes, genes that are part of the same
operational pathway are sometimes grouped together
under a single promoter. They are then transcribed
into a single (polycistronic) mRNA, from which the
3 separate proteins are translated.
10
Bacterial Gene Structure of signals

Bacterial genomes have simple gene structure.
- Transcription factor binding site.
- Promoters
-35 sequence (T82 T84 G78 A65 C54 A45), then 15-20 bases
-10 sequence (T80 A95 T45 A60 A50 T96), then 5-9 bases
(the number after each base is the percentage of
promoters with that base at that position)
- Start of transcription: the initiation start is a
purine in 90% of cases (sometimes it's the A in CAT)
- Translation binding site (Shine-Dalgarno, AGGAGG),
about 10 bp upstream of the AUG
- One or more Open Reading Frames, each running from
a start codon (unless the sequence is partial)
until the next in-frame stop codon on that strand,
separated by intercistronic sequences.
- Termination

11
Genetic Code

How does an mRNA specify an amino acid sequence? The
answer lies in the genetic code. It would be
impossible for each amino acid to be specified by
one nucleotide, because there are only 4
nucleotides and 20 amino acids. Similarly,
two-nucleotide combinations could only specify 16
amino acids. Three nucleotides give 4 x 4 x 4 = 64
combinations, which is enough: each amino acid is
specified by a particular combination of three
nucleotides, called a codon.
Each group of 3 nucleotides codes for one amino acid.
The first codon is the start codon, and usually
codes for the amino acid Methionine (M, whose
codon is ATG).
The last codon is the stop codon and does NOT
code for an amino acid. It is sometimes
represented by * to indicate the STOP codon.
A coding region (abbreviation CDS) starts at the
START codon and ends at the STOP codon.
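A minimal sketch of translation using the standard genetic code follows. The names `CODON_TABLE` and `translate_cds` are illustrative, and the function makes a simplifying assumption: it starts at the first ATG it finds, whatever the frame, and stops at the first in-frame stop codon or at the end of the sequence.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # codons starting with T
    "LLLLPPPPHHQQRRRR"   # codons starting with C
    "IIIMTTTTNNKKSSRR"   # codons starting with A
    "VVVVAAAADDEEGGGG"   # codons starting with G
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(dna):
    """Translate from the first ATG up to (not including) the stop codon."""
    dna = dna.upper()
    start = dna.find("ATG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# First line of the DNA sequence shown on slide 2.
print(translate_cds("AATTCATGAAAATCGTATACTGGTCTGGTACCGGC"))  # MKIVYWSGTG
```

The output matches the first ten residues of the protein sequence given on slide 2.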

12
Codon table

Note the degeneracy of the genetic code. Each
amino acid might have up to six codons that
specify it.
Different organisms have different frequencies
of codon usage.
A handful of species vary from the codon
associations described above, and use different
codons for different amino acids.
How do tRNAs recognize which codon to bring an
amino acid to? The tRNA has an anticodon on its
mRNA-binding end that is complementary to the
codon on the mRNA. Each tRNA only binds the
appropriate amino acid for its anticodon.
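Codon usage frequencies can be tallied with a simple count. This is only a sketch: the function name `codon_usage` is illustrative, and it assumes the input is a coding sequence that already starts in frame.

```python
from collections import Counter


def codon_usage(cds):
    """Return the relative frequency of each codon in an in-frame CDS."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: count / total for codon, count in counts.items()}


# Toy CDS: both AAA and AAG code for lysine, but AAA is used twice as often.
toy_cds = "ATG" + "AAA" + "AAA" + "AAG" + "TAA"
for codon, frequency in sorted(codon_usage(toy_cds).items()):
    print(codon, round(frequency, 2))
```

Comparing such tables between organisms is one way to see the differences in codon preference mentioned above.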

13
RNA

RNA has the same primary structure as DNA. It
consists of a sugar-phosphate backbone, with
bases attached to the 1' carbon of the
sugar. The differences between DNA and RNA are
that
RNA has a hydroxyl group on the 2' carbon of the
sugar (thus the difference between
deoxyribonucleic acid and ribonucleic acid).
Instead of the base thymine, RNA uses
another base called uracil.
Because of the extra hydroxyl group on the sugar,
RNA is too bulky to form a stable double helix of
the kind DNA forms. RNA usually exists as a
single-stranded molecule. However, regions of
double helix can form where there is some
base-pair complementarity (U with A, G with C),
resulting in hairpin loops. The RNA molecule with
its hairpin loops is said to have a secondary
structure.
Because the RNA molecule is not restricted to a
rigid double helix, it can form many different
stable three-dimensional tertiary structures.

14
tRNA ( transfer RNA)
is a small RNA that has a very specific secondary
and tertiary structure such that it can bind an
amino acid at one end, and mRNA at the other end.
It acts as an adaptor to carry the amino acid
elements of a protein to the appropriate place as
coded for by the mRNA.
[Figures: secondary structure of tRNA;
three-dimensional tertiary structure]
15
Bacterial Gene Prediction
Most of the consensus sequences are known from
E. coli studies, so for each bacterium the exact
distribution of the consensus will change. Most
modern gene prediction programs need to be
trained, i.e. they find their own consensus and
assembly rules given a few example genes. A few
programs find their own rules from a completely
unannotated bacterial genome by trying to find
conserved patterns. This is feasible because
ORFs restrict the search space of possible gene
candidates. E.g. the selfid program
(selfid_at_igs.cnrs-mrs.fr)
16
Open Reading Frames

The simplest bacterial gene prediction techniques
simply
identify all open reading frames(ORFs),
and blastx them against known proteins.
The ORFs with the best homology are retained
first.
This usually densely covers the bacterial genomes
with genes. rRNA and tRNA are detected separately
using tRNAScan or blastn.

17
Open Reading Frames (ORF)
On a given piece of DNA, there can be 6 possible
frames. The ORF can be either on the plus or the
minus strand, and in any of 3 possible frames:
Frame 1: the 1st base of the start codon is at base
1, 4, 7, 10, ...
Frame 2: the 1st base of the start codon is at base
2, 5, 8, 11, ...
Frame 3: the 1st base of the start codon is at base
3, 6, 9, 12, ...
(Frames -1, -2, -3 are on the minus strand.)
Some programs have other conventions for
naming frames (0..5, 1-6, etc.).
Gene finding in eukaryotic cDNA uses ORF finding
and blastx as well.
http://www.ncbi.nlm.nih.gov/gorf/gorf.html
Try with gi 41 (or your own piece of DNA).
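A six-frame ORF scan can be sketched as follows. This is not the NCBI ORF finder's algorithm; the names `orfs_in_strand` and `find_orfs` are illustrative. Frames 1 to 3 are numbered by the offset of the first codon, and minus-strand frames are numbered the same way on the reverse complement, which is only one of the conventions in use.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons):
    """Yield (frame, orf) for the three frames of one strand."""
    for offset in range(3):
        codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        start = None
        for index, codon in enumerate(codons):
            if start is None and codon == "ATG":
                start = index  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if index - start + 1 >= min_codons:
                    yield offset + 1, "".join(codons[start:index + 1])
                start = None  # look for the next ORF in this frame


def find_orfs(dna, min_codons=2):
    """Return (frame, orf) pairs; negative frames are on the minus strand."""
    dna = dna.upper()
    found = list(orfs_in_strand(dna, min_codons))
    minus = reverse_complement(dna)
    found += [(-frame, orf) for frame, orf in orfs_in_strand(minus, min_codons)]
    return found


# The start codon begins at base 3, so the ORF is reported in frame 3.
print(find_orfs("CCATGAAATAGCC"))  # [(3, 'ATGAAATAG')]
```

In practice `min_codons` would be set much higher (for example 100) so that short chance ORFs are ignored.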
18
Eukaryotic Central Dogma
In Eukaryotes (cells where the DNA is
sequestered in a separate nucleus), the DNA does
not contain a contiguous copy of the coding
sequence; rather, the exons must be spliced
together (although many eukaryotic genes contain
no introns, particularly in lower organisms).
mRNA (messenger RNA): contains the assembled copy
of the gene. The mRNA acts as a messenger to carry
the information stored in the DNA in the nucleus
to the cytoplasm, where the ribosomes can make it
into protein.
19
Eukaryotic Nuclear Gene Structure

Gene prediction for Pol II transcribed genes.
Upstream Enhancer elements.
Upstream Promoter elements.
GC box (-90 nt) (20 bp), CAAT box (-75 nt) (22 bp)
TATA promoter (-30 nt) (70%, 15 nt consensus
(Bucher et al. (1990)))
14-20 nt spacer DNA
CAP site (8 bp)
Transcription Initiation.
Transcript region, interrupted by introns.
Translation Initiation (Kozak signal, 12 bp
consensus) 6 bp prior to initiation codon.
polyA signal (AATAAA 99%, other)

20
introns

Transcript region, interrupted by introns. Each
intron
starts with a donor site consensus
(G100 T100 A62 A68 G84 T63 ...),
has a branch site near the 3' end of the intron (a
single, not very well conserved consensus: UACUAAC),
ends with an acceptor site consensus
(12 Py ... N C65 A100 G100).

UG
AG
UACUAAC
21
Exons

The exons of the transcript region are composed
of
5' UTR (mean length of 769 bp, with a specific
base composition that depends on the local GC
content of the genome)
AUG (or other start codon)
Remainder of coding region
Stop codon
3' UTR (mean length of 457 bp, with a specific base
composition that depends on the local GC content
of the genome)

22
Structure of the Eukaryotic Genome
6-12% of human DNA encodes proteins (higher
fraction in nematode).
10% of human DNA codes for UTR.
90% of human DNA is non-coding.
23
Non-Coding Eukaryotic DNA

Untranslated regions (UTRs)
introns (can be genes within introns of another
gene!)
intergenic regions.
- repetitive elements
- pseudogenes (dead genes that may (or may not)
have been retroposed back into the genome as a
single-exon gene)

24
Pseudogenes
Pseudogenes: DNA sequence that might code for a
gene, but that is unable to result in a protein.
This deficiency might be in transcription (lack
of promoter, for example) or in translation or
both.
Processed pseudogenes: gene retroposed back into
the genome after being processed by the splicing
apparatus. Thus it is fully spliced and has a
polyA tail. The insertion process flanks the mRNA
sequence with short direct repeats. Thus there is
no promoter, unless the gene is accidentally
retroposed downstream of a promoter sequence. Do
not confuse with single-exon genes.
25
Repeats
Each repeat family has many subfamilies.
- ALU: 300 nt long, 600,000 elements in the human
genome. Can cause false homology with mRNA. Many
have an AluI restriction site.
- Retroposons (can get copied back into the genome).
- Telltale sign: direct or inverted repeats flank
the repeated element. That repeat was the priming
site for the RNA that was inserted.
- LINEs (Long INterspersed Elements): L1 is 1-7 kb
long, 50,000 copies. They have two ORFs, which will
cause problems for gene prediction programs.
- SINEs (Short INterspersed Elements)
26
Low-Complexity Elements

When analyzing sequences, one often relies on the
fact that two stretches are similar to infer that
they are homologous (and therefore related). But
sequences with repeated patterns will match
without there being any phylogenetic relation!
Sequences like ATATATACTTATATA, which are mostly
two letters, are called low-complexity.
Triplet repeats (particularly CAG) have a
tendency to make the replication machinery
stutter, so they are amplified.
The low-complexity sequence can also be hidden at
the translated protein level.
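One crude way to flag a "mostly two letters" stretch is to measure how much of it is covered by its two most common letters. This is only an illustrative heuristic (the function name `top_two_fraction` is made up here); real masking programs use more careful statistics.

```python
from collections import Counter


def top_two_fraction(seq):
    """Fraction of the sequence made up of its two most common letters."""
    if not seq:
        return 0.0
    counts = Counter(seq.upper())
    top_two = sum(count for _, count in counts.most_common(2))
    return top_two / len(seq)


print(round(top_two_fraction("ATATATACTTATATA"), 3))  # 0.933
print(top_two_fraction("ACGTACGT"))                   # 0.5
```

A sequence with an even mix of the four bases scores 0.5, while the example from the slide scores 14/15.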

27
Masking

To avoid finding spurious matches in alignment
programs, you should always mask out the query
sequence.
Before predicting genes it is a good idea to mask
out repeats (at least those containing ORFs).
Before running blastn against a genomic record,
you must mask out the repeats.
Most used programs
CENSOR
RepeatMasker
http://ftp.genome.washington.edu/cgi-bin/RepeatMasker

28
More Non-Protein genes
rRNA - ribosomal RNA is one of the structural
components of the ribosome. It has sequence
complementarity to regions of the mRNA so that
the ribosome knows where to bind to an mRNA it
needs to make protein from.
snRNA - small nuclear RNA is involved in the
machinery that processes RNAs (notably splicing)
before they travel from the nucleus to the
cytoplasm.
hnRNA - heterogeneous nuclear RNA: the large,
unprocessed primary transcripts (pre-mRNA) found
in the nucleus.
29
Protein Processing localization.

The protein as read off from the mRNA may not be
in the final form that will be used in the cell.
Some proteins contain:
A signal peptide (located at the N-terminus, i.e.
the beginning). This signal peptide is used to
guide the protein towards its final cellular
localization, and is cleaved off at the cleavage
site once the protein has reached (or is near) its
final destination.
Various post-translational modifications
(e.g. phosphorylation).
The final protein is called the mature peptide.
30
Convention for nucleotides in database
Because the mRNA is actually read off the minus
strand of the DNA, the nucleotide sequences are
always quoted on the minus strand. In
bioinformatics the sequence format does NOT
differentiate between Uracil and Thymine. There is
no symbol for Uracil: it is always represented
by a T. Even genomic sequence follows that
convention. A gene on the plus strand is quoted
so that it is on the same strand as its product
mRNA.
31
Biology Information on the Internet
32
Biology Information on the Internet

Introduction to Databases
Searching the Internet for Biology Information.
General Search methods
Biology Web sites
Introduction to Genbank file format.
Introduction to Entrez and Pubmed
Ref Chapters 1,2,5,6 of Bioinformatics

Databases
A collection of Records.
Each record has many fields.
Each field contains specific information.
Each field has a data type.
E.g. money/currency, text field, integer,
date, address (text field), citation (text field).
Each record has a primary key: a UNIQUE
identifier that unambiguously defines this record.

Spreadsheet: flat-file version of a database.
34
GI (Genbank Identifier): unique key, primary key.
The GI changes with each update of the sequence
record.
Accession Number: secondary key. Points to the
same locus and sequence despite sequence updates.
Accession.Version Number: equivalent to the GI.
35
Relational Database (Normalizing a database for
repeated sub-elements of a database.. Splitting
it into smaller databases, relating the
sub-databases to the first one using the primary
key.)
36
Types of Relational databases.

The Internet can be thought of as one enormous
relational database.
The links/URLs are the primary keys.
SQL (Structured Query Language)
Sybase, Oracle, Access (database systems)
Sybase is used at NCBI.
SRS (one type of database querying system in use
in biology)

37
Indexed searches.

To allow easy searching of a database, make an
index.
An index is a list of primary keys corresponding
to a key in a given field (or to a collection of
fields)

38
Indexed searches.

Boolean query: merging and intersecting lists
AND (in both lists) (e.g. human AND genome)
OR (in either list) (e.g. human OR genome)
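An index that maps each word to the set of primary keys containing it turns Boolean queries into set operations. The sketch below uses a made-up three-record database; `build_index` and the record contents are illustrative only.

```python
def build_index(records):
    """Map each word to the set of primary keys whose text contains it."""
    index = {}
    for key, text in records.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(key)
    return index


# Hypothetical records: primary key -> text field.
records = {
    1: "human genome project",
    2: "mouse genome",
    3: "human anatomy",
}
index = build_index(records)

both = index["human"] & index["genome"]    # AND: keys in both lists
either = index["human"] | index["genome"]  # OR: keys in either list

print(sorted(both))    # [1]
print(sorted(either))  # [1, 2, 3]
```

AND can only shrink the result list and OR can only grow it, which is why the AND query returns fewer matches.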

39
Search strategies

Search engines use complex strategies that go
beyond Boolean queries.
Phrase matching
human genome -> "human genome"
Togetherness: documents with human close to
genome are scored higher.
Term expansion: synonyms
human -> homo sapiens
Neighbours
human genome -> genome projects,
chromosomes, genetics
Frequency of links (www.google.com)
To avoid this term mapping, enclose your query
terms in quotes: "human" AND "genome"

40
Search strategies

Search engines use complex strategies that go
beyond Boolean queries.
To avoid this term mapping, enclose your query
terms in quotes: "human" AND "genome"
To require that ALL the terms in your query be
important, precede them with a +. This also
prevents term mapping.
To force the order of the words to be important,
group them within quotes as a phrase: "biology of
mammals".

41
Indexed searches.

Example
find the advanced query page at
http://www.altavista.com
type human (and hit the Search button)
type genome
type human AND genome
type "human genome" (finds the fewest matches)
type human OR genome (finds the most matches)

Search Engines
Web Spiders Collection of All web pages, but
since Web pages change all the time and new ones
appear, they must constantly roam the web and
re-index.. Or depend on people submitting their
own pages.
www.google.com (BEST!)
www.infoseek.com
www.lycos.com
www.exite.com
www.webcrawler.com
www.lycos.com
www.looksmart.com (country specific)

Search Engines
www.google.com (BEST!)
Google ranks pages according to how many pages
with those terms refer to the pages you are
asking for. Not only must one document contain
ALL the search terms, but other documents which
refer to this one must also contain all the
terms.
Great when you know what you are looking for! You
can also use quotes to require immediate proximity
and order of terms.
E.g. type
"Web server for the blast program"
But Google only indexes about 40% of the web, so
you may have to use other web spiders.
(Disclaimer: I don't own stock in that company,
but I'd like to.)

Search Engines
Curated Collections Not comprehensive Contains
list of best sites for commonly requested topics,
but is missing important sites for more
specialized topics (like biology)
www.yahoo.com (Has travel maps too!)
Answer-based curated collections Easy to use
english-like queries. First looks at list of
predefined answers, then refines answers based on
user interaction. Also answer new questions.
www.askjeeves.com
www.magellan.com
www.altavista.com(has translation TOOLS)
www.hotbot.com

Search Engines
Meta-Search Engines Polls several search
engines, and returns the consensus of all
results. Is likely to miss sites, but the sites
it returns are very relevant to the query.
The other operating mode is to return the sum of
all the results. It then becomes very sensitive to
a very detailed query.
www.metacrawler.com
www.savvysearch.com
www.1blink.com (fast)
www.metafind.com
www.dogpile.com

Virtual Libraries: curated collections of links
for biologists (by biologists).
Pedro's BioMolecular Research Tools (1996)
http://www.public.iastate.edu/pedro/
Virtual Library: Bio Sciences
http://vlib.org/Biosciences.html
Publications and abstract search.
http://www.ncbi.nlm.nih.gov/
Expasy server
http://www.expasy.ch
EBI Biocatalog (software and databases list)
http://www.ebi.ac.uk/biocat/

47
Biological Databases

Nucleotide databases
Genbank International Collaboration
NCBI(USA), EMBL(Europe), DDBJ (Japan and Asia)
A bank: no curation. Submission to these
databases is required for publication in a
journal.
Organism-specific databases (Exercise: find URLs
using search engines)
FlyBase
ChickGBASE
pigbase
wormpep
YPD (Yeast Protein Database)
SGD(Saccharomyces Genome Database)

Protein Databases
NCBI
Swiss Prot(Free for academic use, otherwise
commercial. Licensing restrictions on discoveries
made using the DB. 1998 version free of any
licensing)
http://www.expasy.ch (latest pay version)
NCBI has the latest free version.
Translated Proteins from Genbank Submissions
EMBL
TrEMBL is a computer-annotated supplement of
SWISS-PROT that contains all the translations of
EMBL nucleotide sequence entries not yet
integrated in SWISS-PROT
PIR

Structure databases
PDB: protein structure database.
http://www.rscb.org/pdb/
MMDB: NCBI's version of PDB with Entrez links.
http://www.ncbi.nlm.nih.gov
Genome Mapping Information
http://www.il-st-acad-sci.org/health/genebase.html
NCBI(Human)
Genome Centers
Stanford, Washington University, Stanford
Research Centers and Universities

Literature databases
NCBI Pubmed: all biomedical literature.
www.ncbi.nlm.nih.gov
Abstracts and links to publisher sites for
full-text retrieval/ordering,
journal browsing.
Publisher web sites.
Biomednet: commercial site for literature
search.
Pathways Database
KEGG: Kyoto Encyclopedia of Genes and Genomes
www.genome.ad.jp/kegg/kegg/html

Database Identifiers Primary keys
GI (changes with each sequence update for NCBI
only)
Annotation may change without the gi changing!
Accession(stable)
version(changes with each sequence update)
Version also refers to Accession.version
Secondary accession Records may have been merged
in the past.. So the records which were not
chosen as the primary were made secondary.

52
Primary Databases

A primary Database is a repository of data
derived from experiments or from research
knowledge.
Genbank (Nucleotide repository)
Protein DB, Swissprot
PDB (MMDB) are primary databases.
Pubmed (literature)
Genome Mapping databases.
Kegg Database.(pathways)

53
Secondary Databases

A secondary database contains information derived
from other sources.
Refseq (curated collection of Genbank at NCBI)
Unigene (Clustering of ESTs at NCBI)
Organism-specific databases are often a mix
between primary and secondary.

54
Genbank Records

A bank: no attempt at reconciliation.
Submit a sequence -> get an Accession Number!
Cannot modify sequences without the submitter's
consent.
No attempt at reconciliation (not a unique
collection per LOCUS/gene).
Entries are of various sequence quality and from
different sources -> separated into various
divisions on that basis:
High-quality sequences in taxon-specific
divisions.
Low-quality sequences in usage-specific
databases.
A Collaboration between NCBI, EMBL and DDBJ. They
contain (nearly) the same information, only the
data format differs.

EMBL does not differentiate between the different
types of RNA records, while NCBI (and DDBJ) do.
In Entrez EMBL records are patched up to add that
information.
55
Refseq and LocusLink

Attempt to produce 1 mRNA, 1 protein, and 1
genomic gene for each frequently occurring allele
of a protein-expressing gene.
www.ncbi.nlm.nih.gov/LocusLink
Special non-genbank Accession numbers
NM_nnnnnn mRNA refseq
NP_nnnnnn protein refseq
NC_nnnnnn refseq genomic contig
NT_nnnnnn temporary genomic contig
NX_nnnnnn predicted gene

56
Genbank divisions

Sequences in genbank are split into various
categories based on
The quality and type of sequences
The high-quality nucleotide sequences are divided
into organism-dependent divisions.

Genbank entry type (and query to restrict to
that field)
mRNA (1/10000 errors): biomol_mRNA[PROP]
cDNA (EST, 95-99% accuracy, single pass): gbdiv_EST[PROP]
genomic: biomol_genomic[PROP]
in HTGS division (>99% accuracy): gbdiv_HTG[PROP]
GSS (low-quality genome survey sequences): gbdiv_GSS[PROP]
rest of Genbank (1/10000 accuracy):
Human: gbdiv_PRI[PROP]
mouse: gbdiv_ROD[PROP]
bovine: gbdiv_MAM[PROP]
STS (EST or cDNA used in mapping): gbdiv_STS[PROP]

58
FASTA Format
MOST important data format!!!

>identifier descriptive text
nucleotide or amino-acid
sequence, on multiple lines if needed.
Example
>gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin
GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC
CATCACGCGGGGCCTTCTGCTGCTGGC ...
59
Modified FASTA Format

A few tools follow the convention that lower case
sequences are masked. (repeat masker, some
versions of blast, megablast, blastz)
A few analysis tools (like CLUSTAL) want a
simplified identifier on the defline.. So they
can have a short string for the alignment.
>X63129.1
GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC
CATCACGCGGGGCCTTCTGCTGCTGGC .

WIM now will talk about GCG

61
Feature table (NCBI/EMBL/DDBJ)

http://www.ncbi.nlm.nih.gov/collab/FT/index.html

62
Genbank Data format
41

LOCUS       BTA1AT       1380 bp    mRNA            MAM       30-APR-1992
DEFINITION  B.taurus mRNA for alpha-1-antitrypsin.
ACCESSION   X63129
NID         g41
VERSION     X63129.1  GI:41
KEYWORDS    alpha-1 antitrypsin; serine protease inhibitor; serpin.
SOURCE      Bos taurus.
  ORGANISM  Bos taurus
            Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria;
            Artiodactyla; Ruminantia; Pecora; Bovoidea; Bovidae; Bovinae; Bos.

63
Genbank References

LOCUS       BTA1AT       1380 bp    mRNA            MAM       30-APR-1992
...
REFERENCE   1  (bases 1 to 1380)
  AUTHORS   Sinha,D.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-OCT-1991) D. Sinha, Dept of Biochemistry, Temple
            University, 3400 North Broad Street, Philadelphia, PA 19140, USA
REFERENCE   2  (bases 1 to 1380)
  AUTHORS   Sinha,D., Bakhshi,M.R. and Kirby,E.P.
  TITLE     Complete cDNA sequence of bovine alpha 1-antitrypsin
  JOURNAL   Biochim. Biophys. Acta 1130 (2), 209-212 (1992)
  MEDLINE   92223096
FEATURES             Location/Qualifiers

64
Genbank Source Qualifier

LOCUS       BTA1AT       1380 bp    mRNA            MAM       30-APR-1992
...
FEATURES             Location/Qualifiers
     source          1..1380
                     /organism="Bos taurus"
                     /db_xref="taxon:9913"
                     /tissue_type="liver"
                     /cell_type="hepatocyte"
                     /clone_lib="lambda gt11"
                     /clone="2f-Ic"
     mRNA            <1..>1380
     sig_peptide     33..104
...

65
Genbank mRNACDS features

     mRNA            <1..>1380
     sig_peptide     33..104
     CDS             33..1283
                     /codon_start=1
                     /product="alpha-1-antitrypsin"
                     /protein_id="CAA44840.1"
                     /db_xref="PID:g42"
                     /db_xref="GI:42"
                     /db_xref="SWISS-PROT:P34955"
                     /translation="MALSITRGLLLLAALCCLAPISLAGVLQGHAVQETD
                     DTSHQEAACHKIAPNLANFAFSIYHHLAHQSNTSNIFFSPVSIASAFAML
                     SLGAKGNTHTEILKGLGFNLTELAEAEIHKGFQHLLHTLNQPNHQLQLTT
                     GNGLFINESAKLVDTFLEDVKNLYHSEAFSINFRDAEEAKKKINDYVEKG
                     SHGKIVELVKVLDPNTVFALVNYISFKGKWEKPFEMKHTTERDFHVDEQT
                     TVKVPMMNRLGMFDLHYCDKLASWVLLLDYVGNVTACFILPDLGKLQQLE
                     DKLNNELLAKFLEKKYASSANLHLPKLSISETYDLKSVLGDVGITEVFSD
                     RADLSGITKEQPLKVSKALHKAALTIDEKGTEAVGSTFLEAIPMSLPPDV
                     EFNRPFLCILYDRNTKSPLFVGKVVNPTQA"
     mat_peptide     105..1280
                     /product="alpha-1-antitrypsin"
     polyA_signal    1343..1348
     polyA_site      1368
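Feature locations are 1-based and inclusive, so a span start..end covers end - start + 1 bases. The sketch below handles only the simple start..end form (optionally marked as partial with < or >), not joins or complements; `parse_location` and `feature_length` are illustrative names.

```python
import re

# Simple "start..end" locations; "<" and ">" mark partial ends.
LOCATION = re.compile(r"^(<?)(\d+)\.\.(>?)(\d+)$")


def parse_location(location):
    """Return (start, end, partial_start, partial_end) for a simple span."""
    match = LOCATION.match(location)
    if match is None:
        raise ValueError("unsupported location: " + location)
    partial_start, start, partial_end, end = match.groups()
    return int(start), int(end), bool(partial_start), bool(partial_end)


def feature_length(location):
    """Length in bases of a 1-based, inclusive span."""
    start, end, _, _ = parse_location(location)
    return end - start + 1


cds_length = feature_length("33..1283")
print(cds_length)           # 1251 bases
print(cds_length // 3 - 1)  # 416 amino acids, excluding the stop codon
print(feature_length("33..104") // 3, feature_length("105..1280") // 3)  # 24 392
```

The numbers are consistent with the record: the 24-residue signal peptide plus the 392-residue mature peptide give the 416 residues of the /translation, and the CDS has one extra codon for the stop.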

66
Genbank Sequence format

...
BASE COUNT      357 a    413 c    322 g    288 t
ORIGIN
        1 gaccagccct gacctaggac agtgaatcga taatggcact ctccatcacg cggggccttc
       61 tgctgctggc agccctgtgc tgcctggccc ccatctccct ggctggagtt ctccaaggac
      121 acgctgtcca agagacagat gatacatccc accaggaagc agcgtgccac aagattgccc
      181 ccaacctggc caactttgcc ttcagcatat accaccattt ggctcatcag tccaacacca
      241 gcaacatctt cttctccccc gtgagcatcg cttcagcctt tgcgatgctc tccctgggag
      301 ccaagggcaa cactcacact gagatcctga agggcctggg tttcaacctc actgagctcg
      361 cagaggctga gatccacaaa ggctttcagc atcttctcca caccctgaac cagccaaacc
...
     1321 gtccccccac tccctccatg gcattaaagg atgactgacc tagccccgaa aaaaaaaaaa
//
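The sequence block is easy to turn back into a plain string: everything between ORIGIN and // that is a letter belongs to the sequence, while the coordinates and spaces are only layout. A minimal sketch (the function name `parse_origin` is illustrative):

```python
def parse_origin(text):
    """Extract the sequence from the ORIGIN block of a Genbank record."""
    bases = []
    in_sequence = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("ORIGIN"):
            in_sequence = True
            continue
        if stripped.startswith("//"):
            break  # end-of-record marker
        if in_sequence:
            # Keep letters only: drops coordinates, spaces and "..." lines.
            bases.extend(ch for ch in stripped if ch.isalpha())
    return "".join(bases)


example = """BASE COUNT 357 a 413 c 322 g 288 t
ORIGIN
        1 gaccagccct gacctaggac agtgaatcga taatggcact ctccatcacg cggggccttc
       61 tgctgctggc agccctgtgc tgcctggccc ccatctccct ggctggagtt ctccaaggac
//
"""
sequence = parse_origin(example)
print(len(sequence))  # 120
print(sequence[:10])  # gaccagccct
```

On a complete record, the length of the result should equal the sum of the BASE COUNT line (357 + 413 + 322 + 288 = 1380).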

67
EMBL DATA FORMAT

EMBL: http://www.ebi.ac.uk/Databases/
http://www.ebi.ac.uk/cgi-bin/emblfetch
Use Accession X63129

68
DDBJ DATA FORMAT

DDBJ: http://www.ddbj.nig.ac.jp/
http://ftp2.ddbj.nig.ac.jp:8000/getstart-e.html
Use Accession X63129
Flat file format same as NCBI/Genbank format.

69
Entrez

Index-based search system. Each field in the
database is searchable individually or as an
aggregate.
(e.g. CDS[FKEY])
The default is the aggregate [ALL FIELDS].
All primary databases are interlinked as one big
relational database.
(e.g. Pubmed links in Genbank records)
Phrase matching.
Human genome -> "human genome"

70
Entrez

Available neighbours (related documents or
related sequences).
In Pubmed searches: term mapping to neighbouring
documents and neighbouring terms.
Term mapping to chemical names.
In Pubmed, a term searched in All Fields is mapped
to chemical names, MeSH terms and text fields,
unless the term is within double quotes.

71
Entrez

http://www.ncbi.nlm.nih.gov/Entrez/
Tutorials
http://www.ncbi.nlm.nih.gov/Class/MLACourse/Genetics/index.html
http://www.ncbi.nlm.nih.gov/Literature/pubmed_search.html
http://www.ncbi.nlm.nih.gov/Database.tut1.html

72
SWISSPROT
http://www.expasy.ch/sprot/sprot_details.html

Core data: the protein sequence data, the citation
information and the taxonomic data.
Annotation
Function(s) of the protein
Domains and sites. For example calcium binding
regions, ATP-binding sites, zinc fingers,
homeobox, kringle, etc.
Post-translational modification(s). For example
carbohydrates, phosphorylation, acetylation,
GPI-anchor, etc.
Secondary structure
Quaternary structure. For example homodimer,
heterotrimer, etc.
Similarities to other proteins
Disease(s) associated with deficiencie(s) in the
protein
Sequence conflicts, variants, etc.

73
SWISSPROT
http://www.expasy.ch/cgi-bin/get-random-entry.pl?S

74
REBASE (Restriction enzymes dataBASE)
Restriction enzymes have a pattern recognition
sequence, and then within or a few bases away
from that pattern is the actual cutting site.
http://rebase.neb.com/rebase/rebase.html
I prefer the Bairoch format (SWISSPROT format):
http://rebase.neb.com/rebase/rebase.f19.html
ID enzyme name
ET enzyme type
OS microorganism name
PT prototype
RS recognition sequence, cut site
MS methylation site (type)
CR commercial sources for the restriction enzyme
CM commercial sources for the methylase
RN count
RA authors
RL jour, vol, pages, year, etc.

75
Exercises

You can work in teams for this.
1a) Use the first 6000 bases of your genomic
piece, or find a bacterial genomic or mRNA
sequence in Entrez with length between 2000 and
10000.
b) Use the ORF finder to find the gene(s).
Compare the answer you get to the annotation you
can infer from using blastn against Genbank and
from using blastx against a protein database.
Do the Entrez exercises (separate Word
document).
