1
Fundamentals in Sequence Analysis 1.(part 1)
Review of Basic Biology and Database Searching in
Biology.
  • Hugues Sicotte
  • NCBI

2
The Flow of Biotechnology Information
Gene
Function
>DNA sequence
AATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACAC
TGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAA
TCTGGTAAAGACGTCAACACCATCAACGTGTCTGACGTTA
ACATCGATGAACTGCTGAACGAAGATATCCTGATCCTGGG
TTGCTCTGCCATGGGCGATGAAGTTCTCGAGGAAAGCGAA
TTTGAACCGTTCATCGAAGAGATCTCTACCAAAATCTCTG
GTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCGA
CGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGC
TACGGTTGCGTTGTTGTTGAGACCCCGCTGATCGTTCAGA
ACGAGCCGGACGAAGCTGAGCAGGACTGCATCGAATTTGG
TAAGAAGATCGCGAACATCTAGTAGA
>Protein sequence
MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNI
DELLNEDILILGCSAMGDEVLEESEFEPFIEEISTKISGK
KVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNE
PDEAEQDCIEFGKKIANI
3
Prerequisites to Sequence Analysis
  • Basic biology, so you can understand the language
    of the databases: Central Dogma (transcription,
    translation, prokaryotes, eukaryotes, CDS, 3' UTR,
    5' UTR, introns, exons, promoters, operons,
    codons, start codons, stop codons, snRNA, hnRNA,
    tRNA, secondary structure, tertiary structure).
  • Before you can analyze sequences, you have to
    understand their structure, and know about basic
    biological database searching.

4
   Central Dogmas of Molecular Biology
1) The concept of a gene is historically defined
on the basis of the genetic inheritance of a
phenotype (Mendelian inheritance).
2) The DNA of an organism encodes its genetic
information. It is a double-stranded helix whose
backbone is made of deoxyribose sugars and
phosphates, carrying four bases: Adenine (A),
Cytosine (C), Guanine (G) and Thymine (T). Note
that only 4 values (A, C, G, T) need to be
encoded, which can be done using 2 bits. But to
allow ambiguous letters (like N, meaning any of
the 4 nucleotides), one usually resorts to a
4-bit alphabet.
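The 4-bit idea can be sketched as follows. This is only an illustration: the particular bit assigned to each base and the function names are assumptions made for the example, not a format prescribed by any database. Each base gets one bit, so an ambiguity letter is simply the bitwise OR of the bases it stands for.

```python
# One bit per base (hypothetical assignment chosen for this sketch).
A, C, G, T = 0b0001, 0b0010, 0b0100, 0b1000

IUPAC_BITS = {
    "A": A, "C": C, "G": G, "T": T,
    "R": A | G,          # purine
    "Y": C | T,          # pyrimidine
    "N": A | C | G | T,  # any of the 4 nucleotides
}

def encode(seq):
    """Return the 4-bit code of each letter in seq as a list of integers."""
    return [IUPAC_BITS[base] for base in seq.upper()]

def compatible(x, y):
    """Two letters are compatible if they share at least one possible base."""
    return (IUPAC_BITS[x] & IUPAC_BITS[y]) != 0

print(encode("ACGTN"))       # [1, 2, 4, 8, 15]
print(compatible("N", "G"))  # True
print(compatible("R", "T"))  # False
```

With 2 bits per base there would be no room left for N or the other ambiguity letters, which is why the wider alphabet is used.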
5
   Central Dogmas of Molecular Biology
3) Each base on one side of the double helix faces
its complementary base: A pairs with T, and G
pairs with C.
4) Biochemical processes that read off the DNA
always read it from the 5' side towards the 3'
side (replication and transcription).
5) A gene can be located on either the plus strand
or the minus strand. Rule 4 imposes the
orientation of reading, and rule 3
(complementarity) tells us to complement each
base. E.g. if the sequence on the plus strand is
ACGTGATCGATGCTA, the minus strand must be read
off by reading the complement of this sequence
going backwards, i.e.
TAGCATCGATCACGT
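Rules 3 and 4 together amount to taking the reverse complement. A minimal sketch (the function name is an assumption chosen for this example):

```python
# Translation table mapping each base to its complement (rule 3).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse so the result reads 5' to 3' (rule 4)."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("ACGTGATCGATGCTA"))  # TAGCATCGATCACGT
```

Applying the function twice returns the original sequence, which is a quick sanity check when switching between strands.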
6
   Central Dogmas of Molecular Biology
6) DNA information is copied over to mRNA that
acts as a template to produce proteins.
We often concentrate on protein coding genes,
because proteins are the building blocks of cells
and the majority of bio-active molecules (but
let's not forget the various RNA genes).
7
Prokaryotic genes
Prokaryotes (intronless protein-coding genes)
[Figure: prokaryotic gene structure]
DNA: Upstream (5') | promoter | Gene region | Downstream (3')
Transcription (the gene is encoded on the minus
strand, and the reverse complement is read into
mRNA): TAC on the template strand (read 3' to 5')
corresponds to ATG in the mRNA.
mRNA: 5' UTR | CoDing Sequence (CDS), starting at ATG | 3' UTR
Translation: tRNAs read off the codons, 3 bases
at a time, starting at the start codon until a
STOP codon is reached.
protein
8
Why does Nature bother with the mRNA?
  • Why would the cell want to have an intermediate
    between DNA and the proteins it encodes?
  • Gene information can be amplified by having many
    copies of an RNA made from one copy of DNA.
  • Regulation of gene expression can be effected by
    having specific controls at each element of the
    pathway between DNA and proteins. The more
    elements there are in the pathway, the more
    opportunities there are to control it in
    different circumstances.
  • In Eukaryotes, the DNA can then stay pristine and
    protected, away from the caustic chemistry of the
    cytoplasm.

9
Prokaryotic genes (operons)
Prokaryotes (operon structure)
[Figure: upstream | promoter | Gene 1 | Gene 2 | Gene 3 | downstream]
In prokaryotes, genes that are part of the same
operational pathway are sometimes grouped
together under a single promoter. They are then
transcribed as a single polycistronic mRNA, from
which the 3 separate proteins are translated.
10
Bacterial Gene Structure and Signals
  • Bacterial genomes have simple gene structure.
  • - Transcription factor binding site.
  • - Promoters
  • -35 sequence (T82 T84 G78 A65 C54 A45; the numbers
    are % conservation), then a 15-20 base spacer
  • -10 sequence (T80 A95 T45 A60 A50 T96), then a
    5-9 base spacer
  • - Start of transcription: the initiation start is
    a purine in 90% of cases (sometimes it's the A in CAT)
  • - Translation binding site (Shine-Dalgarno) about
    10 bp upstream of AUG (AGGAGG)
  • - One or more Open Reading Frames:
  • start codon (unless the sequence is partial)
  • until the next in-frame stop codon on that strand,
  • separated by intercistronic sequences.
  • - Termination

11
Genetic Code
  • How does an mRNA specify an amino acid sequence?
    The answer lies in the genetic code. It would be
    impossible for each amino acid to be specified by
    one nucleotide, because there are only 4
    nucleotides and 20 amino acids. Similarly, two
    nucleotide combinations could only specify 16
    amino acids. The conclusion is that each
    amino acid is specified by a particular
    combination of three nucleotides, called a codon.
  • Each group of 3 nucleotides codes for one amino
    acid.
  • The first codon is the start codon, and usually
    codes for the amino acid methionine (M,
    which has the codon ATG).
  • The last codon is the stop codon and does NOT
    code for an amino acid. It is sometimes
    represented by * to indicate the STOP codon.
  • A coding region (abbreviation CDS) starts at the
    START codon and ends at the STOP codon.
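The reading of a CDS can be sketched in a few lines. This is an illustration under stated assumptions: it uses the standard genetic code only, the function name is made up for the example, and the test sequence is the opening of the DNA sequence from slide 2 with a TAG stop codon appended by hand.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate_cds(dna):
    """Translate from the first ATG, 3 bases at a time, up to the first stop codon."""
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon: does not code for an amino acid
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate_cds("AATTCATGAAAATCGTATACTGGTAG"))  # MKIVYW
```

The result matches the first residues of the protein sequence shown on slide 2. Ambiguity letters such as N are not handled in this sketch.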

12
Codon table
  • Note the degeneracy of the genetic code. Each
    amino acid might have up to six codons that
    specify it.
  • Different organisms have different frequencies
    of codon usage.
  • A handful of species vary from the codon
    association described above, and use different
    codons for different amino acids.
  • How do tRNAs recognize which codon to bring an
    amino acid to? The tRNA has an anticodon on its
    mRNA-binding end that is complementary to the
    codon on the mRNA. Each tRNA only binds the
    appropriate amino acid for its anticodon.

13
RNA
  • RNA has the same primary structure as DNA. It
    consists of a sugar-phosphate backbone, with
    bases attached to the 1' carbon of the
    sugar. The differences between DNA and RNA are
    that
  • RNA has a hydroxyl group on the 2' carbon of the
    sugar (thus, the difference between
    deoxyribonucleic acid and ribonucleic acid).
  • Instead of using the base thymine, RNA uses
    another base called uracil.
  • Because of the extra hydroxyl group on the sugar,
    RNA is too bulky to form a stable double helix.
    RNA exists as a single-stranded molecule.
    However, regions of double helix can form where
    there is some base pair complementation (U and A,
    G and C), resulting in hairpin loops. The RNA
    molecule with its hairpin loops is said to have a
    secondary structure.
  • Because the RNA molecule is not restricted to a
    rigid double helix, it can form many different
    stable three-dimensional tertiary structures.

14
tRNA (transfer RNA)
is a small RNA that has a very specific secondary
and tertiary structure such that it can bind an
amino acid at one end, and mRNA at the other end.
It acts as an adaptor to carry the amino acid
elements of a protein to the appropriate place as
coded for by the mRNA.
[Figures: three-dimensional tertiary structure of tRNA; secondary structure of tRNA]
15
Bacterial Gene Prediction
Most of the consensus sequences are known from
E. coli studies, so for each bacterium the exact
distribution of the consensus will change. Most
modern gene prediction programs need to be
trained, i.e. they find their own consensus and
assembly rules given a few example genes. A few
programs find their own rules from a completely
unannotated bacterial genome by trying to find
conserved patterns. This is feasible because
ORFs restrict the search space of possible gene
candidates. E.g. the selfid program
(selfid_at_igs.cnrs-mrs.fr)
16
Open Reading Frames
  • The simplest bacterial gene prediction techniques
    simply
  • identify all open reading frames (ORFs),
  • and blastx them against known proteins.
  • The ORFs with the best homology are retained
    first.
  • This usually densely covers the bacterial genome
    with genes. rRNA and tRNA are detected separately
    using tRNAscan or blastn.

17
Open Reading Frames (ORF)
On a given piece of DNA, there are 6 possible
frames. The ORF can be on either the plus or the
minus strand, and in any of 3 possible frames:
Frame 1: the 1st base of the start codon is at base
1, 4, 7, 10, ...
Frame 2: the 1st base of the start codon is at base
2, 5, 8, 11, ...
Frame 3: the 1st base of the start codon is at base
3, 6, 9, 12, ...
(Frames -1, -2, -3 are on the minus strand.)
Some programs have other conventions for naming
frames (0..5, 1-6, etc.).
Gene finding in eukaryotic cDNA uses ORF finding
plus blastx as well.
http://www.ncbi.nlm.nih.gov/gorf/gorf.html
Try it with gi 41 (or your own piece of DNA).
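A six-frame ORF scan can be sketched as below. This is a simplified illustration, not the algorithm of the NCBI ORF finder: it only reports ORFs that begin with ATG and end at an in-frame stop codon, coordinates are 1-based on the strand being read, and the function names and the `min_codons` parameter are assumptions made for the example.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(dna, min_codons=2):
    """Return (frame, start, end, sequence) for each ATG...stop ORF in all 6 frames."""
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frame = f"{strand}{offset + 1}"
            start = None
            for i in range(offset, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((frame, start + 1, i + 3, seq[start:i + 3]))
                    start = None                   # close it and keep scanning
    return orfs

print(find_orfs("CCATGAAATAGCC"))  # [('+3', 3, 11, 'ATGAAATAG')]
```

The start codon of the example sits at base 3, so the ORF is reported in frame 3 of the plus strand, following the naming used on this slide.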
18
Eukaryotic Central Dogma
In Eukaryotes (cells where the DNA is
sequestered in a separate nucleus), the DNA does
not contain a contiguous copy of the coding
sequence; rather, exons must be spliced together.
(Many eukaryotic genes contain no introns! This
is particularly true in lower organisms.)
mRNA (messenger RNA): contains the assembled copy
of the gene. The mRNA acts as a messenger to
carry the information stored in the DNA in the
nucleus to the cytoplasm, where the ribosomes can
make it into protein.
19
Eukaryotic Nuclear Gene Structure
  • Gene prediction for Pol II transcribed genes.
  • Upstream enhancer elements.
  • Upstream promoter elements.
  • GC box (-90 nt) (20 bp), CAAT box (-75 nt) (22 bp)
  • TATA promoter (-30 nt) (70%, 15 nt consensus
    (Bucher et al. (1990)))
  • 14-20 nt spacer DNA
  • CAP site (8 bp)
  • Transcription initiation.
  • Transcript region, interrupted by introns.
  • Translation initiation (Kozak signal, 12 bp
    consensus) 6 bp prior to initiation codon.
  • polyA signal (AATAAA 99%, other)

20
introns
  • Transcript region, interrupted by introns. Each
    intron
  • starts with a donor site consensus
    (G100 T100 A62 A68 G84 T63 ...; the numbers are
    % conservation)
  • has a branch site near the 3' end of the intron
    (one not very conserved consensus: UACUAAC)
  • ends with an acceptor site consensus
    (12 Py .. N C65 A100 G100)

[Figure labels: GU (donor site), UACUAAC (branch site), AG (acceptor site)]
21
Exons
  • The exons of the transcript region are composed
    of
  • 5' UTR (mean length of 769 bp, with a specific
    base composition that depends on the local GC
    content of the genome)
  • AUG (or other start codon)
  • Remainder of coding region
  • Stop codon
  • 3' UTR (mean length of 457 bp, with a specific
    base composition that depends on the local GC
    content of the genome)

22
Structure of the Eukaryotic Genome
6-12% of human DNA encodes proteins (higher
fraction in nematode). 10% of human DNA codes for
UTRs. 90% of human DNA is non-coding.
23
Non-Coding Eukaryotic DNA
  • Untranslated regions (UTRs)
  • introns (there can be genes within the introns of
    another gene!)
  • intergenic regions.
  • - repetitive elements
  • - pseudogenes (dead genes that may or may not
    have been retroposed back into the genome as a
    single-exon gene)

24
Pseudogenes
Pseudogenes: DNA sequence that might code for a
gene, but that is unable to result in a protein.
This deficiency might be in transcription (lack
of promoter, for example), in translation, or
both.
Processed pseudogenes: gene retroposed back into
the genome after being processed by the splicing
apparatus. Thus it is fully spliced and has a
polyA tail. The insertion process flanks the mRNA
sequence with short direct repeats. Thus there is
no promoter, unless the copy is accidentally
retroposed downstream of a promoter sequence. Do
not confuse with single-exon genes.
25
Repeats
Each repeat family has many subfamilies.
- ALU: 300 nt long, 600,000 elements in the human
genome. Can cause false homology with mRNA. Many
have an AluI restriction site.
- Retroposons (can get copied back into the
genome).
- Telltale sign: direct or inverted repeats flank
the repeated element. That repeat was the priming
site for the RNA that was inserted.
LINEs (Long INterspersed Elements): L1, 1-7 kb
long, 50,000 copies. They have two ORFs, which
will cause problems for gene prediction programs.
SINEs (Short INterspersed Elements)
26
Low-Complexity Elements
  • When analyzing sequences, one often relies on the
    fact that two stretches are similar to infer that
    they are homologous (and therefore related). But
    sequences with repeated patterns will match
    without there being any phylogenetic relation!
  • Sequences like ATATATACTTATATA, which are mostly
    two letters, are called low-complexity.
  • Triplet repeats (particularly CAG) have a
    tendency to make the replication machinery
    stutter, so they are amplified.
  • The low-complexity sequence can also be hidden at
    the translated protein level.
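One simple way to put a number on "mostly two letters" is the Shannon entropy of the base composition in a sliding window. This is a swapped-in illustration, not the method used by any particular masking program; the window size and threshold are arbitrary assumptions, as are the function names.

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Entropy in bits of the letter composition (2.0 for equal A, C, G, T)."""
    counts = Counter(seq.upper())
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def low_complexity_windows(seq, window=12, threshold=1.5):
    """Return (start, entropy) for each 0-based window whose entropy is below threshold."""
    hits = []
    for start in range(len(seq) - window + 1):
        h = shannon_entropy(seq[start:start + window])
        if h < threshold:
            hits.append((start, h))
    return hits

print(shannon_entropy("ACGT"))                      # 2.0
print(shannon_entropy("ATATATACTTATATA") < 1.5)     # True
print(len(low_complexity_windows("ATATATACTTATATA")))
```

Composition alone misses repeats that use all four letters evenly, so this is only a rough first filter.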

27
Masking
  • To avoid finding spurious matches in alignment
    programs, you should always mask out repeats and
    low-complexity regions in the query sequence.
  • Before predicting genes it is a good idea to mask
    out repeats (at least those containing ORFs).
  • Before running blastn against a genomic record,
    you must mask out the repeats.
  • Most used programs:
  • CENSOR
  • Repeat Masker
  • http://ftp.genome.washington.edu/cgi-bin/RepeatMasker

28
More Non-Protein genes
rRNA - ribosomal RNA is one of the structural
components of the ribosome. It has sequence
complementarity to regions of the mRNA so that
the ribosome knows where to bind to an mRNA it
needs to make protein from.
snRNA - small nuclear RNA is involved in the
machinery that processes RNAs (e.g. splicing)
before they travel from the nucleus to the
cytoplasm.
hnRNA - heterogeneous nuclear RNA: the
unprocessed primary transcripts (pre-mRNA) found
in the nucleus.
29
Protein Processing and Localization
  • The protein as read off from the mRNA may not be
    in the final form that will be used in the cell.
    Some proteins contain
  • a signal peptide (located at the N-terminus, i.e.
    the beginning). This signal peptide is used to
    guide the protein towards its
    final cellular localization. The signal peptide
    is cleaved off at the cleavage site once the
    protein has reached (or is near) its final
    destination.
  • various post-translational modifications
    (e.g. phosphorylation)
  • The final protein is called the mature peptide.

30
Convention for nucleotides in database
Because the mRNA is actually read off the minus
strand of the DNA, nucleotide sequences are
always quoted as the strand that matches the mRNA
(the complement of that minus strand). In
bioinformatics the sequence format does NOT make
a difference between Uracil and Thymine. There is
no symbol for Uracil; it is always represented
by a T. Even genomic sequence follows that
convention. A gene on the plus strand is quoted
so that it is in the same strand as its product
mRNA.
31
Biology Information on the Internet
32
Biology Information on the Internet
  • Introduction to Databases
  • Searching the Internet for Biology Information.
  • General Search methods
  • Biology Web sites
  • Introduction to Genbank file format.
  • Introduction to Entrez and Pubmed
  • Ref: Chapters 1, 2, 5, 6 of Bioinformatics

33
  • Databases
  • A collection of Records.
  • Each record has many fields.
  • Each field contains specific information.
  • Each field has a data type.
  • E.g. money, currency, text field, integer,
    date, address (text field), citation (text field)
  • Each record has a primary key: a UNIQUE
    identifier that unambiguously defines this record.

Spreadsheet: flat-file version of a database.
34
GI (Genbank Identifier): unique key, primary
key. The GI changes with each update of the
sequence record.
Accession Number: secondary key. Points to the
same locus and sequence despite sequence
updates.
Accession + Version Number: equivalent to the GI.
35
Relational Database (Normalizing a database for
repeated sub-elements of a database.. Splitting
it into smaller databases, relating the
sub-databases to the first one using the primary
key.)
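As a sketch of normalization, the repeated REFERENCE blocks of a sequence record can be moved to their own table and related back through the primary key. The table and column names are hypothetical and chosen only for this example; the accession and titles are taken from the X63129 record shown later in this deck, and Python's built-in sqlite3 module stands in for a full database system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sequence (
    accession  TEXT PRIMARY KEY,      -- unique identifier of the record
    definition TEXT
);
CREATE TABLE reference (
    id        INTEGER PRIMARY KEY,
    accession TEXT REFERENCES sequence(accession),  -- link back to the record
    title     TEXT
);
""")

conn.execute("INSERT INTO sequence VALUES (?, ?)",
             ("X63129", "B.taurus mRNA for alpha-1-antitrypsin"))
conn.executemany(
    "INSERT INTO reference (accession, title) VALUES (?, ?)",
    [("X63129", "Direct Submission"),
     ("X63129", "Complete cDNA sequence of bovine alpha 1-antitrypsin")],
)

# Relate the sub-table to the main table through the primary key.
rows = conn.execute("""
    SELECT s.accession, r.title
    FROM sequence s JOIN reference r ON r.accession = s.accession
    ORDER BY r.id
""").fetchall()
for row in rows:
    print(row)
```

The sequence record is stored once, however many references point to it.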
36
Types of Relational databases.
  • The Internet can be thought of as one enormous
    relational database.
  • The links/URLs are the primary keys.
  • SQL (Structured Query Language)
  • Sybase, Oracle, Access (database systems)
  • Sybase is used at NCBI.
  • SRS (one type of database querying system in use
    in biology)

37
Indexed searches.
  • To allow easy searching of a database, make an
    index.
  • An index is a list of primary keys corresponding
    to a key in a given field (or to a collection of
    fields)

38
Indexed searches.
  • Boolean query: merging and intersecting lists
  • AND (in both lists) (e.g. human AND genome)
  • OR (in either list) (e.g. human OR genome)
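An index and the Boolean merging of its lists can be sketched with Python sets. The three toy documents and the function name are made up for the example; real search systems add ranking and term mapping on top of this.

```python
from collections import defaultdict

# Toy records: primary key -> text field.
documents = {
    1: "human genome project",
    2: "mouse genome",
    3: "human anatomy",
}

def build_index(docs):
    """Map each word to the set of primary keys of the records containing it."""
    index = defaultdict(set)
    for key, text in docs.items():
        for word in text.lower().split():
            index[word].add(key)
    return index

index = build_index(documents)

both = index["human"] & index["genome"]    # AND: intersect the two lists
either = index["human"] | index["genome"]  # OR: merge the two lists

print(sorted(both))    # [1]
print(sorted(either))  # [1, 2, 3]
```

AND can only shrink the result list and OR can only grow it, which is why the OR query returns the most matches.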

39
Search strategies
  • Search engines use complex strategies that go
    beyond Boolean queries.
  • Phrase matching:
  • human genome -> "human genome"
  • Togetherness: documents with human close to
    genome are scored higher.
  • Term expansion and synonyms:
  • human -> homo sapiens
  • Neighbours:
  • human genome -> genome projects,
    chromosomes, genetics
  • Frequency of links (www.google.com)
  • To avoid this term mapping, enclose your query
    terms in quotes: "human" AND "genome"

40
Search strategies
  • Search engines use complex strategies that go
    beyond Boolean queries.
  • To avoid this term mapping, enclose your query
    terms in quotes: "human" AND "genome"
  • To require that ALL the terms in your query be
    important, precede them with a +. This also
    prevents term mapping.
  • To force the order of the words to be important,
    group phrases within quoted strings: "biology of
    mammals".

41
Indexed searches.
  • Example
  • find the advanced query page at
    http://www.altavista.com
  • type human (and hit the Search button)
  • type genome
  • type human AND genome
  • type "human genome" (finds the fewest matches)
  • type human OR genome (finds the most matches)

42
  • Search Engines
  • Web spiders: a collection of all web pages. Since
    web pages change all the time and new ones
    appear, spiders must constantly roam the web and
    re-index, or depend on people submitting their
    own pages.
  • www.google.com (BEST!)
  • www.infoseek.com
  • www.lycos.com
  • www.excite.com
  • www.webcrawler.com
  • www.looksmart.com (country specific)

43
  • Search Engines
  • www.google.com (BEST!)
  • Google ranks pages according to how many pages
    with those terms refer to the pages you are
    asking for. Not only must one document contain
    ALL the search terms, but other documents which
    refer to this one must also contain all the
    terms.
  • Great when you know what you are looking for! You
    can also use quotes to require immediate
    proximity and order of terms.
  • E.g. type
  • "Web server for the blast program"
  • But Google only indexes about 40% of the web, so
    you may have to use other web spiders.
  • (Disclaimer: I don't own stock in that company,
    but I'd like to.)

44
  • Search Engines
  • Curated collections: not comprehensive. Contain a
    list of best sites for commonly requested topics,
    but are missing important sites for more
    specialized topics (like biology).
  • www.yahoo.com (has travel maps too!)
  • Answer-based curated collections: easy to use,
    English-like queries. First looks at a list of
    predefined answers, then refines answers based on
    user interaction. Also answers new questions.
  • www.askjeeves.com
  • www.magellan.com
  • www.altavista.com (has translation TOOLS)
  • www.hotbot.com

45
  • Search Engines
  • Meta-search engines: poll several search
    engines and return the consensus of all
    results. Likely to miss sites, but the sites
    returned are very relevant to the query.
  • The other operating mode is to return the sum of
    all the results, which then becomes very
    sensitive to a very detailed query.
  • www.metacrawler.com
  • www.savvysearch.com
  • www.1blink.com (fast)
  • www.metafind.com
  • www.dogpile.com

46
  • Virtual libraries: curated collections of links
    for biologists (by biologists)
  • Pedro's BioMolecular Research Tools (1996)
  • http://www.public.iastate.edu/pedro/
  • Virtual Library: Bio Sciences
  • http://vlib.org/Biosciences.html
  • Publications and abstract search.
  • http://www.ncbi.nlm.nih.gov/
  • Expasy server
  • http://www.expasy.ch
  • EBI Biocatalog (software and databases list)
  • http://www.ebi.ac.uk/biocat/

47
Biological Databases
  • Nucleotide databases
  • Genbank: international collaboration
  • NCBI (USA), EMBL (Europe), DDBJ (Japan and Asia)
  • A bank: no curation. Submission to these
    databases is required for publication in a
    journal.
  • Organism-specific databases (Exercise: find the
    URLs using search engines)
  • FlyBase
  • ChickGBASE
  • pigbase
  • wormpep
  • YPD (Yeast Protein Database)
  • SGD(Saccharomyces Genome Database)

48
  • Protein Databases
  • NCBI
  • Swiss-Prot (free for academic use, otherwise
    commercial. Licensing restrictions on discoveries
    made using the DB. The 1998 version is free of
    any licensing.)
  • http://www.expasy.ch (latest pay version)
  • NCBI has the latest free version.
  • Translated Proteins from Genbank Submissions
  • EMBL
  • TrEMBL is a computer-annotated supplement of
    SWISS-PROT that contains all the translations of
    EMBL nucleotide sequence entries not yet
    integrated in SWISS-PROT
  • PIR

49
  • Structure databases
  • PDB: protein structure database.
  • http://www.rcsb.org/pdb/
  • MMDB: NCBI's version of PDB with Entrez links.
  • http://www.ncbi.nlm.nih.gov
  • Genome Mapping Information
  • http://www.il-st-acad-sci.org/health/genebase.html
  • NCBI (Human)
  • Genome Centers
  • Stanford, Washington University
  • Research Centers and Universities

50
  • Literature databases
  • NCBI Pubmed: all biomedical literature.
  • www.ncbi.nlm.nih.gov
  • Abstracts and links to publisher sites for
  • full text retrieval/ordering
  • journal browsing.
  • Publisher web sites.
  • Biomednet: commercial site for literature
    search.
  • Pathways Database
  • KEGG: Kyoto Encyclopedia of Genes and Genomes
    www.genome.ad.jp/kegg/kegg/html

51
  • Database identifiers: primary keys
  • GI (changes with each sequence update; for NCBI
    only)
  • Annotation may change without the GI changing!
  • Accession (stable)
  • Version (changes with each sequence update)
  • Version also refers to Accession.version
  • Secondary accession: records may have been merged
    in the past, so the records which were not
    chosen as the primary were made secondary.

52
Primary Databases
  • A primary Database is a repository of data
    derived from experiments or from research
    knowledge.
  • Genbank (Nucleotide repository)
  • Protein DB, Swissprot
  • PDB (MMDB) are primary databases.
  • Pubmed (literature)
  • Genome Mapping databases.
  • Kegg Database.(pathways)

53
Secondary Databases
  • A secondary database contains information derived
    from other sources.
  • Refseq (curated collection of Genbank at NCBI)
  • Unigene (Clustering of ESTs at NCBI)
  • Organism-specific databases are often a mix
    between primary and secondary.

54
Genbank Records
  • A bank: no attempt at reconciliation (not a
    unique collection per LOCUS/gene).
  • Submit a sequence -> get an Accession Number!
  • Cannot modify sequences without the submitter's
    consent.
  • Entries are of various sequence quality and from
    different sources -> separated into various
    divisions based on
  • High-quality sequences in taxon-specific
    divisions.
  • Low-quality sequences in usage-specific
    databases.
  • A collaboration between NCBI, EMBL and DDBJ. They
    contain (nearly) the same information; only the
    data format differs.

EMBL does not differentiate between the different
types of RNA records, while NCBI (and DDBJ) do.
In Entrez, EMBL records are patched up to add that
information.
55
Refseq and LocusLink
  • Attempt to produce 1 mRNA, 1 protein, and 1
    genomic gene for each frequently occurring allele
    of a protein-expressing gene.
  • www.ncbi.nlm.nih.gov/LocusLink
  • Special non-genbank Accession numbers
  • NM_nnnnnn mRNA refseq
  • NP_nnnnnn protein refseq
  • NC_nnnnnn refseq genomic contig
  • NT_nnnnnn temporary genomic contig
  • NX_nnnnnn predicted gene

56
Genbank divisions
  • Sequences in genbank are split into various
    categories based on
  • The quality and type of sequences
  • The high-quality nucleotide sequences are divided
    into organism-dependent divisions.

57
  • Genbank entry type (and query to restrict to
    that field)
  • mRNA (1/10000 errors)
  • biomol_mRNA[PROP]
  • cDNA (EST, 95-99% accuracy, single pass)
  • gbdiv_EST[PROP]
  • genomic (biomol_genomic[PROP])
  • in HTGS division: >99% accuracy
  • gbdiv_HTG[PROP]
  • GSS (low-quality genome survey sequences)
  • gbdiv_GSS[PROP]
  • rest of Genbank: 1/10000 error rate.
  • Human: gbdiv_PRI[PROP]
  • mouse: gbdiv_ROD[PROP]
  • bovine: gbdiv_MAM[PROP]
  • STS (EST or cDNA used in mapping)
  • gbdiv_STS[PROP]

58
FASTA Format
MOST important data format!!!
  • >identifier descriptive text
  • nucleotide or amino-acid
  • sequence on multiple lines if needed.
  • Example
  • >gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for
    alpha-1-anti-trypsin
  • GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC
  • CATCACGCGGGGCCTTCTGCTGCTGGC .
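Reading this format takes only a few lines. The following is a minimal sketch, not a validating parser; the function names are assumptions made for the example, and the input is the two sequence lines of the record above.

```python
def _finish(header, chunks):
    """Split the defline into identifier and description, and join the sequence lines."""
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)

def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):          # a new defline starts a new record
            if header is not None:
                records.append(_finish(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:          # sequence may span multiple lines
            chunks.append(line)
    if header is not None:
        records.append(_finish(header, chunks))
    return records

example = """>gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin
GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC
CATCACGCGGGGCCTTCTGCTGCTGGC
"""
identifier, description, sequence = parse_fasta(example)[0]
print(identifier)                 # gi|41|emb|X63129.1|BTA1AT
print(identifier.split("|")[3])   # X63129.1
```

Splitting the identifier on the "|" character recovers the GI and the accession.version fields of the defline.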

59
Modified FASTA Format
  • A few tools follow the convention that lower case
    sequences are masked. (repeat masker, some
    versions of blast, megablast, blastz)
  • A few analysis tools (like CLUSTAL) want a
    simplified identifier on the defline.. So they
    can have a short string for the alignment.
  • >X63129.1
  • GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC
  • CATCACGCGGGGCCTTCTGCTGCTGGC .
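The lower-case convention can be handled with two small helpers. This is an illustrative sketch: the function names are assumptions, and the masked region in the example was chosen by hand rather than produced by a masking program.

```python
def hard_mask(seq, mask_char="N"):
    """Replace lower-case (soft-masked) bases with mask_char."""
    return "".join(mask_char if base.islower() else base for base in seq)

def masked_fraction(seq):
    """Fraction of the sequence that is soft-masked (lower case)."""
    return sum(base.islower() for base in seq) / len(seq) if seq else 0.0

soft = "GACCagccctgaCCTAGG"   # hypothetical soft-masked stretch
print(hard_mask(soft))         # GACCNNNNNNNNCCTAGG
print(masked_fraction(soft))
```

Hard masking is useful for tools that ignore letter case and would otherwise align the repeats anyway.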

60
  • WIM now will talk about GCG

61
Feature table (NCBI/EMBL/DDBJ)
  • http://www.ncbi.nlm.nih.gov/collab/FT/index.html

62
Genbank Data format
  • LOCUS BTA1AT 1380 bp mRNA
    MAM 30-APR-1992
  • DEFINITION B.taurus mRNA for alpha-1-antitrypsin.
  • ACCESSION X63129
  • NID g41
  • VERSION X63129.1 GI:41
  • KEYWORDS alpha-1 antitrypsin; serine protease
    inhibitor; serpin.
  • SOURCE Bos taurus.
  • ORGANISM Bos taurus
  • Eukaryota; Metazoa; Chordata;
    Vertebrata; Mammalia; Eutheria;
  • Artiodactyla; Ruminantia; Pecora;
    Bovoidea; Bovidae; Bovinae; Bos.

63
Genbank References
  • LOCUS BTA1AT 1380 bp mRNA
    MAM 30-APR-1992
  • ...
  • REFERENCE 1 (bases 1 to 1380)
  • AUTHORS Sinha,D.
  • TITLE Direct Submission
  • JOURNAL Submitted (22-OCT-1991) D. Sinha, Dept
    of Biochemistry, Temple University, 3400
    North Broad Street, Philadelphia, PA 19140, USA
  • REFERENCE 2 (bases 1 to 1380)
  • AUTHORS Sinha,D., Bakhshi,M.R. and Kirby,E.P.
  • TITLE Complete cDNA sequence of bovine alpha
    1-antitrypsin
  • JOURNAL Biochim. Biophys. Acta 1130 (2),
    209-212 (1992)
  • MEDLINE 92223096
  • FEATURES Location/Qualifiers

64
Genbank Source Qualifier
  • LOCUS BTA1AT 1380 bp mRNA
    MAM 30-APR-1992
  • ...
  • FEATURES Location/Qualifiers
  • source 1..1380
  • /organism="Bos taurus"
  • /db_xref="taxon:9913"
  • /tissue_type="liver"
  • /cell_type="hepatocyte"
  • /clone_lib="lambda gt11"
  • /clone="2f-Ic"
  • mRNA <1..>1380
  • sig_peptide 33..104
  • ...

65
Genbank mRNA and CDS features
  • mRNA <1..>1380
  • sig_peptide 33..104
  • CDS 33..1283
  • /codon_start=1
  • /product="alpha-1-antitrypsin"
  • /protein_id="CAA44840.1"
  • /db_xref="PID:g42"
  • /db_xref="GI:42"
  • /db_xref="SWISS-PROT:P34955"
  • /translation="MALSITRGLLLLAALCCLAPISLAGVLQGHAVQETD
    DTSHQEAACHKIAPNLANFAFSIYHHLAHQSNTSNIFFSPVSIASAFAML
    SLGAKGNTHTEILKGLGFNLTELAEAEIHKGFQHLLHTLNQPNHQLQLTT
    GNGLFINESAKLVDTFLEDVKNLYHSEAFSINFRDAEEAKKKINDYVEKG
    SHGKIVELVKVLDPNTVFALVNYISFKGKWEKPFEMKHTTERDFHVDEQT
    TVKVPMMNRLGMFDLHYCDKLASWVLLLDYVGNVTACFILPDLGKLQQLE
    DKLNNELLAKFLEKKYASSANLHLPKLSISETYDLKSVLGDVGITEVFSD
    RADLSGITKEQPLKVSKALHKAALTIDEKGTEAVGSTFLEAIPMSLPPDV
    EFNRPFLCILYDRNTKSPLFVGKVVNPTQA"
  • mat_peptide 105..1280
  • /product="alpha-1-antitrypsin"
  • polyA_signal 1343..1348
  • polyA_site 1368

66
Genbank Sequence format
  • ...
  • BASE COUNT 357 a 413 c 322 g 288 t
  • ORIGIN
  • 1 gaccagccct gacctaggac agtgaatcga taatggcact
    ctccatcacg cggggccttc
  • 61 tgctgctggc agccctgtgc tgcctggccc ccatctccct
    ggctggagtt ctccaaggac
  • 121 acgctgtcca agagacagat gatacatccc accaggaagc
    agcgtgccac aagattgccc
  • 181 ccaacctggc caactttgcc ttcagcatat accaccattt
    ggctcatcag tccaacacca
  • 241 gcaacatctt cttctccccc gtgagcatcg cttcagcctt
    tgcgatgctc tccctgggag
  • 301 ccaagggcaa cactcacact gagatcctga agggcctggg
    tttcaacctc actgagctcg
  • 361 cagaggctga gatccacaaa ggctttcagc atcttctcca
    caccctgaac cagccaaacc
  • ...
  • 1321 gtccccccac tccctccatg gcattaaagg atgactgacc
    tagccccgaa aaaaaaaaaa
  • //

67
EMBL DATA FORMAT
  • EMBL: http://www.ebi.ac.uk/Databases/
  • http://www.ebi.ac.uk/cgi-bin/emblfetch
  • Use Accession X63129

68
DDBJ DATA FORMAT
  • DDBJ: http://www.ddbj.nig.ac.jp/
  • http://ftp2.ddbj.nig.ac.jp:8000/getstart-e.html
  • Use Accession X63129
  • Flat file format is the same as the NCBI/Genbank
    format.

69
Entrez
  • Index-based search system. Each field in the
    database is searchable individually or as an
    aggregate.
  • (e.g. CDS[FKEY])
  • the default is the aggregate [ALL FIELDS]
  • All primary databases are interlinked as one big
    relational database.
  • (e.g. Pubmed links in Genbank records)
  • Phrase matching.
  • Human genome -> "human genome"

70
Entrez
  • Available neighbours (related documents or
    related sequences)
  • In Pubmed searches: term mapping to neighbouring
    documents and neighbouring terms.
  • Term mapping to chemical names.
  • In Pubmed, term[All Fields] is term mapped to
    chemical names, MeSH terms, and text fields,
  • unless the term is within double quotes.

71
Entrez
  • http://www.ncbi.nlm.nih.gov/Entrez/
  • Tutorials
  • http://www.ncbi.nlm.nih.gov/Class/MLACourse/Genetics/index.html
  • http://www.ncbi.nlm.nih.gov/Literature/pubmed_search.html
  • http://www.ncbi.nlm.nih.gov/Database.tut1.html

72
SWISSPROT
http://www.expasy.ch/sprot/sprot_details.html
  • Core data: protein sequence data, the citation
    information, and the taxonomic data
  • Annotation
  • Function(s) of the protein
  • Domains and sites. For example calcium binding
    regions, ATP-binding sites, zinc fingers,
    homeobox, kringle, etc.
  • Post-translational modification(s). For example
    carbohydrates, phosphorylation, acetylation,
    GPI-anchor, etc.
  • Secondary structure
  • Quaternary structure. For example homodimer,
    heterotrimer, etc.
  • Similarities to other proteins
  • Disease(s) associated with deficiencies in the
    protein
  • Sequence conflicts, variants, etc.

73
SWISSPROT
http://www.expasy.ch/cgi-bin/get-random-entry.pl?S

74
REBASE (Restriction enzymes dataBASE)
Restriction enzymes have a pattern recognition
sequence, and then within or a few bases away
from that pattern is the actual cutting site.
http://rebase.neb.com/rebase/rebase.html
I prefer the Bairoch format (SWISSPROT format):
http://rebase.neb.com/rebase/rebase.f19.html
ID enzyme name
ET enzyme type
OS microorganism name
PT prototype
RS recognition sequence, cut site
MS methylation site (type)
CR commercial sources for the restriction enzyme
CM commercial sources for the methylase
RN count
RA authors
RL jour, vol, pages, year, etc.

75
Exercises
  • You can work in teams for this.
  • 1a) Use the first 6000 bases of your genomic
    piece, or find a bacterial genomic or mRNA
    sequence in Entrez with length between 2000 and
    10000.
  • b) Use the ORF finder to find the gene(s).
    Compare the answer you get to the annotation you
    can infer from using blastn against Genbank and
    from using blastx against a protein database.
  • Do the Entrez exercises (separate Word
    document).