Sequence Analysis Unit 5


Slides: 55

Provided by: irenegab


1
Sequence Analysis Unit 5

BIOL221T Advanced Bioinformatics for
Biotechnology

Irene Gabashvili, PhD igabashvili_at_yahoo.com
2
Reminder

Lectures & Lab: every Wednesday, Duncan Hall,
Room 550, 6:00 pm to 9:45 pm
Office hours: Wednesday, 4pm-6pm (Room 554,
phone: 924-4831) and by appointment
Lecture notes: http://home.comcast.net/igabashvili/221T.htm
Or the SJSU page --
The user name is ewok\biostudents (don't enter
quotation marks)
And the password is 4biolecture (don't enter
quotation marks).

3
Problem Set Review

1. SWISSPROT protein sequence database began in
1987: all answers correct
2. OMIM stands for "Online Mendelian Inheritance
in Man": all answers correct
3. How many genes in human 6q23 region?
Entrez Gene: 6q23 AND human[orgn] (AND alive[prop])
-> 320 (76) (6q23.1-6q23.3);
11 genes exactly in 6q23 (decimal not known)
MapViewer: 1505 genes on chr. 6; 72 between
130,400K and 139,100K; 93 between 6q23.1 and 6q23.3

4
Why different results?
Entrez Gene
6q23 AND human[orgn] AND alive[prop]
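A minimal sketch of how the same Entrez Gene query could be prepared for a script rather than typed into the web form. It only builds a URL for the NCBI E-utilities esearch service and does not fetch anything; the helper name build_esearch_url is hypothetical, and the field tags [orgn] and [prop] are assumed to be the ones used in the query above.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Base address of the NCBI E-utilities "esearch" service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(db, terms):
    """Join already-tagged terms with AND and return an esearch URL (hypothetical helper)."""
    query = " AND ".join(terms)
    return ESEARCH + "?" + urlencode({"db": db, "term": query})

url = build_esearch_url("gene", ["6q23", "human[orgn]", "alive[prop]"])
print(url)
```

Dropping or adding a term such as alive[prop] changes the hit count, which is one reason the numbers differ between tools.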
5
Entrez Map Viewer

Good tools, but lots of bugs: be aware.
Make sure the right numbers are displayed in the
Region Shown boxes; the software tends to null
them and display something else. If nothing is
shown: wrong results! If different numbers are
shown: wrong results! Trust the results only if
the correct numbers/loci are still there.

6
Map View Allowable Values

cytogenetic bands - if the master map is a
cytogenetic map, you can enter band numbers in
formats such as 9p23 to 9p11, or 22q12.1 to
22q13.2.
symbols - gene symbols, marker names, etc., to
display the region of the chromosome between them
numerical positions - cM for a genetic linkage map,
cR for a radiation hybrid map, base pairs if the
master map is a sequence map: 1000000 or
1,000,000 or 1M or 1000K or 1.1M or 1.2K
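The base-pair notations listed above can be normalized to plain integers before comparing regions. The sketch below is an illustration only: parse_position is a hypothetical helper, and it assumes K means 1,000 and M means 1,000,000, which is how the examples on this slide read.

```python
def parse_position(text):
    """Convert strings such as '1,000,000', '1M', '1000K' or '1.2K' to base pairs."""
    cleaned = text.strip().replace(",", "").upper()
    multipliers = {"K": 1_000, "M": 1_000_000}
    factor = 1
    if cleaned and cleaned[-1] in multipliers:
        factor = multipliers[cleaned[-1]]
        cleaned = cleaned[:-1]
    # round() guards against floating-point noise in products such as 1.1 * 1_000_000
    return round(float(cleaned) * factor)

for value in ["1000000", "1,000,000", "1M", "1000K", "1.1M", "1.2K"]:
    print(value, "->", parse_position(value))
```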

7
Entrez Map Viewer

Types of maps might include:
cytogenetic map (banding pattern), also referred
to as ideogram
genetic linkage map (centiMorgans, cM)
radiation hybrid map (centiRays, cR)
sequence map (megabases, Mb)

http://www.ncbi.nlm.nih.gov/mapview/maps.cgi?taxid=9606&chr=7&maps=ideogr&verbose=on

12
Why different results?

OMIM lists not only known genes but also regions
associated with diseases, e.g.
Pulmonary function: 6q21-q22
Stature (height): 6q24-q25
Even if the answer appears right, it might
include a few extra genes -- 6q24-q25
http://www.ncbi.nlm.nih.gov/Omim/getmap.cgi?chromosome=6q23
Make sure you use the right maps!

16
Entrez Map Viewer

Where does gene X exist within the genome of
organism Y? What are some flanking markers?
Which genes exist on a chromosome, and in what
order do they appear?
Show the genes that exist in region R, the
corresponding sequence data for that region.
Display the region of a chromosome between points
A and B.
I know the cytogenetic location of gene X. What
is the corresponding physical location?

17
genome.ucsc.edu UCSC Genome Browser
18
Problem Set Review

4. What gene has SNP rs3761847?
http://www.ncbi.nlm.nih.gov/projects/SNP/
5. OMIM: Hypertension[dis] AND (KCNA1[gene] OR
TRPM3[gene] OR NOS3[gene] OR CHEK2[gene])
6. clinicaltrials.gov: 6
7. crisp.cit.nih.gov: 1400
8. EST

19
Alignment
NT_017582.2
C5 TRAF1 PRO1995
TNF receptor associated factor 1
22
Problem Set Review

9. 90000:100000[SLEN] -> 32K records
10. txid9606 NOT newentry[title] NOT alive[prop]
-> 126K
11. Qualifier to get fewer results: [gene]
12. Gene, OMIM
13. KEGG pathway: Histidine metabolism (00340)
KEGG pathway: Pentose phosphate pathway (00030)
KEGG pathway: Purine metabolism (00230)
Reactome Event: Metabolism of carbohydrates (71387)
14. Yes, Entrez Gene DB provides info on
protein-protein interactions, if known

23
Baxevanis & Ouellette, ch. 1

Intro
Primary and Secondary DBs
Nucleotide Sequence DBs
DB formats
Nucleotide Sequence Flatfiles
The feature table
Third Party Annotation

24
Sequence Databases

1965 - M. Dayhoff's Atlas of Protein Sequences
1979 - Los Alamos Sequence Library (pre-GenBank)
1980 - EMBL, 1st public nucleotide sequence DB
1982 - GenBank (formerly Los Alamos Library)
1984 - DDBJ, DNA Data Bank of Japan
1988 - NCBI created at NIH/NLM
1991 - the 1st version of Entrez distributed on
CD-ROM
1992 - GenBank transferred to NCBI

25
Links

http://www.ncbi.nlm.nih.gov/Genbank/
http://www.ebi.ac.uk/embl/
http://www.ddbj.nig.ac.jp/
The three databases exchange data daily. For
formats, codes, etc., see the feature table:
http://www.ncbi.nlm.nih.gov/collab/FT/index.html

26
Submitting to these DBs

Sakura, a nucleotide sequence data submission
system: http://sakura.ddbj.nig.ac.jp/
Sequin, submitting to GenBank: www.ncbi.nlm.nih.gov/Sequin
Submitting to EMBL: www.ebi.ac.uk/embl/Submission/webin.html

27
About

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.chapter.ch1 (GenBank, e-book)
http://nar.oxfordjournals.org/cgi/content/full/36/suppl_1/D25 (GenBank, NAR-2008)
http://nar.oxfordjournals.org/cgi/content/full/36/suppl_1/D22 (DDBJ, NAR-2008)
http://nar.oxfordjournals.org/cgi/content/full/35/suppl_1/D16 (EMBL, NAR-2007)

28
PubMed: Nucleic Acids Res.[journal] AND "Database
issue"

PRIDE, PRoteomics IDEntifications database:
http://www.ebi.ac.uk/pride/
STITCH, Chemical-Protein Interactions:
http://stitch.embl.de/
The universal protein resource (UniProt):
http://www.uniprot.org
http://nar.oxfordjournals.org/cgi/content/full/36/suppl_1/D707 (ENSEMBL-2008)

29
Baxevanis & Ouellette, ch. 1

RefSeq: ncbi.nlm.nih.gov/RefSeq
EMBL Genome Reviews
GenPept: http://www.ncbi.nlm.nih.gov/sites/entrez?db=protein
UniProt: http://www.pir.uniprot.org/
UniParc: www.ebi.ac.uk/uniparc/
IPI: http://www.ebi.ac.uk/IPI/IPIhelp.html

30
RefSeq

January 11, 2008: RefSeq Release 27 available for
FTP. This release includes:
Proteins: 4,426,609; Organisms: 4,926. Available
at ftp://ftp.ncbi.nih.gov/refseq/release/. To
receive announcements of future RefSeq releases
and incremental large updates, please subscribe to
NCBI's refseq-announce mail list: refseq-announce

31
More databases to know

ExPASy, Expert Protein Analysis System:
http://www.expasy.org/
SwissProt and TrEMBL at ExPASy:
http://www.expasy.org/sprot/
UniProt/SwissProt by EBI: www.ebi.ac.uk/swissprot
UniProt PIR, Protein Information Resource:
http://pir.georgetown.edu/
www.uniprot.org -> beta.uniprot.org

32
More NCBI DBs

GenPept (compiled from SwissProt, PIR, PRF, PDB,
and translations from annotated coding regions in
GenBank and RefSeq):
http://www.ncbi.nlm.nih.gov/sites/entrez?db=protein
The NCBI Entrez Taxonomy:
http://www.ncbi.nlm.nih.gov/sites/entrez?db=taxonomy

33
Sequence Formats

ASCII TEXT
There are at least a couple of dozen sequence
formats in existence at the moment.
Nucleotide (DNA or RNA) sequences are usually
stored in the IUBMB standard codes. Similarly,
protein sequences are usually stored in the IUPAC
standard one-letter codes.

34
IUPAC characters

A   adenine
C   cytosine
G   guanine
T   thymine
U   uracil
R   G or A (puRine)
Y   T or C (pYrimidine)
K   G or T (keto)
M   A or C (amino)
S   G or C (strong interaction, 3 H bonds)
W   A or T (weak interaction, 2 H bonds)
B   G, T or C (not A; B follows A in the alphabet)
D   G, A or T (not C; D follows C in the alphabet)
H   A, C or T (not G; H follows G in the alphabet)
V   G, C or A (not T, not U; V follows U)
N   A, G, C or T (any)
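The ambiguity codes above are what make compact recognition-site patterns possible. As a rough sketch (not part of the original slides), the table can be stored as a dictionary and used to turn an IUPAC pattern into a regular expression. The names IUPAC, to_regex and find_sites are made up for this example, U is treated as T on the assumption that the target is DNA, and the test sequence is invented.

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for (U treated as T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
    "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def to_regex(pattern):
    """Translate an IUPAC pattern into a regular expression over A, C, G, T."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

def find_sites(pattern, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # A lookahead lets the search report matches that overlap each other.
    regex = re.compile("(?=" + to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(sequence.upper())]

print(to_regex("GTYRAC"))                           # GT[CT][AG]AC
print(find_sites("GTYRAC", "AAGTTAACGGGTCGACTT"))   # [2, 10]
```

GTYRAC is the kind of single-letter recognition sequence listed for type II restriction endonucleases: one pattern covers GTTAAC, GTCGAC, GTTGAC and GTCAAC.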

35
Amino Acid Symbols
36
Amino Acid Symbols -2
37
Amino Acid Symbols -3
38
Single-letter code recognition sequences for type
II restriction endonucleases
39
FASTA format

>gi|27827140|gb|BX094486.1|BX094486 BX094486 NCI_CGAP_Kid5 Homo sapiens cDNA clone IMAGp998D214660 IMAGE:1901156 5', mRNA sequence
AGGGCAAGGAGTAAAGGTGGCTGGGTGTGGGTCCGTTGAAGCGAGCCGCC
TCCAGCCCTGTTGAACTGGTGGGCCCAGGGACTGGAGCGGGATTGAAAGG
GATCTTGCTCTCCCTTGAAGCCTCGAGTTGCAGCGATTTCAGTGTCTTCT
CTCCCTGTGTAAGCCTGTCTGGGTGTTTAGGCTGAACTACAGCCACCCCC
TCTCCCGGGGGTGTGCAGGCCAGGGACTGGCCAGGCAGCCATGGCTGACG
AGAAGACCTTCCGGATCGGCTTCATTGTGCTGGGGCTTTTCCTGCTGGCC
CTCGGTACGTTCCTCATGAGCCATGATCGGCCCCAGGTCTACGGCACCTT
CTATGCCATGGGCAGCGTCATGGTGATCGGGGGCATCATCTGGAGCATGT
GCCAGTGCTACCCCAAGATCACCTTCGTCCCTGCTGACTCTGACTTTCAA
GGCATCCTCTCCCCAAAGGCCATGGGCCTGCTGGAGAA

40
FASTA format

Starts with ">"
No more than 80 characters in the header
Sequence codes:
A - Adenosine, C - Cytidine, G - Guanine, T - Thymidine,
U - Uracil, R - G or A (puRine), N - any
http://en.wikipedia.org/wiki/Fasta_format
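Because the format is just a ">" header line followed by sequence lines, a reader fits in a few lines of code. The sketch below is one possible way to do it, with a hypothetical function name (read_fasta) and two invented toy records; it does not enforce the 80-character guideline.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # emit the previous record
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:                # emit the last record
        yield header, "".join(chunks)

# Toy input, invented for illustration.
example = """>seq1 first toy record
ACGTACGT
ACGT
>seq2 second toy record
GGGCCC
"""
records = list(read_fasta(example.splitlines()))
for header, seq in records:
    print(header.split()[0], len(seq))
```

The same loop handles one or many sequences per file, since each new ">" line simply closes the previous record.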

41
Sample questions

What records could be written in FASTA format?
Genes, proteins, whole genomes, one or multiple
sequences, aligned sequences, metabolic pathways?

42
EMBL format

A sequence file in EMBL format can contain
several sequences. One sequence entry starts
with an identifier line ("ID"), followed by
further annotation lines. The start of the
sequence is marked by a line starting with "SQ"
and the end of the sequence is marked by two
slashes ("//").
ID   AB000263 standard; RNA; PRI; 368 BP.
XX
AC   AB000263;
XX
DE   Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ   Sequence 368 BP;
     acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg        60
     ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg       120
     caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc       180
     aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag       240
     gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga       300
     agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca       360
     gacctgaa                                                                368
//
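The three rules stated on this slide (an entry starts at "ID", the sequence starts after "SQ", and "//" ends the entry) are enough for a small reader. The following is a simplified sketch, not a full EMBL parser: parse_embl is a hypothetical name, only the ID, DE and SQ line types are handled, and the two toy entries are invented.

```python
def parse_embl(text):
    """Return a list of dicts with 'id', 'description' and 'sequence' per entry."""
    entries, entry, in_seq = [], None, False
    for line in text.splitlines():
        if line.startswith("//"):                 # end of the current entry
            if entry is not None:
                entries.append(entry)
            entry, in_seq = None, False
        elif in_seq:
            # Sequence lines: keep letters only, dropping spaces and the running count.
            entry["sequence"] += "".join(ch for ch in line if ch.isalpha())
        elif line.startswith("ID"):               # start of a new entry
            entry = {"id": line[2:].split()[0], "description": "", "sequence": ""}
        elif entry is None:
            continue                              # ignore text outside any entry
        elif line.startswith("SQ"):               # sequence starts on the next line
            in_seq = True
        elif line.startswith("DE"):               # description may span several lines
            entry["description"] = (entry["description"] + " " + line[2:].strip()).strip()
    return entries

# Toy file with two entries, invented for illustration.
example = """ID   TOY0001 standard; DNA; UNC; 12 BP.
XX
DE   Toy record for illustration,
DE   not a real entry.
XX
SQ   Sequence 12 BP;
     acgtacgtac gt                                                      12
//
ID   TOY0002 standard; DNA; UNC; 4 BP.
SQ   Sequence 4 BP;
     ggcc                                                                4
//
"""
entries = parse_embl(example)
for e in entries:
    print(e["id"], len(e["sequence"]), e["description"])
```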

43
GENBANK (GBK) format

LOCUS       AB000263     368 bp    mRNA    linear   PRI 05-FEB-1999
DEFINITION  Homo sapiens mRNA for prepro cortistatin like peptide,
            complete cds.
ACCESSION   AB000263
ORIGIN
        1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
       61 ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
      121 caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
      181 aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
      241 gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga
      301 agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca
      361 gacctgaa
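To see how the GenBank layout relates to FASTA, a record like the one above can be reduced to a header plus sequence. This is a simplified sketch under stated assumptions: genbank_to_fasta is a hypothetical helper, it reads only the DEFINITION, ACCESSION and ORIGIN sections, and the toy record is invented. The annotation in the feature table would be lost in such a conversion.

```python
import textwrap

def genbank_to_fasta(text, width=60):
    """Convert one GenBank-style record (as text) to a FASTA string."""
    accession, definition, sequence = "", "", ""
    section = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("//"):                  # end of record
            break
        if line[:1].strip():                       # a keyword starts in column 1
            section, _, rest = line.partition(" ")
            rest = rest.strip()
            if section == "ACCESSION":
                accession = rest.split()[0] if rest else ""
            elif section == "DEFINITION":
                definition = rest
        elif section == "DEFINITION":              # continuation of the definition
            definition += " " + line.strip()
        elif section == "ORIGIN":
            # Keep letters only, dropping the position numbers and spaces.
            sequence += "".join(ch for ch in line if ch.isalpha())
    body = "\n".join(textwrap.wrap(sequence.upper(), width))
    return ">" + accession + " " + definition + "\n" + body + "\n"

# Toy record, invented for illustration.
example = """LOCUS       TOY0001   12 bp    DNA     linear   UNC 01-JAN-2000
DEFINITION  Toy record for illustration,
            not a real entry.
ACCESSION   TOY0001
ORIGIN
        1 acgtacgtac gt
//
"""
fasta = genbank_to_fasta(example)
print(fasta)
```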

44
GENBANK (GBK) format

ACCESSION   BX094486
VERSION     BX094486.1  GI:27827140
     source          1..488
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /clone="IMAGp998D214660 IMAGE:1901156"
                     /tissue_type="2 pooled tumors (clear cell type)"
                     /lab_host="DH10B"
                     /clone_lib="NCI_CGAP_Kid5"
                     /note="Organ kidney Vector pT7T3D-PacI
                     Site_1 Not I Site_2 Eco RI 1st strand cDNA
                     was primed with a Not I - oligo(dT) primer 5'

45
Compare again to FASTA format

>gi|27827140|gb|BX094486.1|BX094486 BX094486 NCI_CGAP_Kid5 Homo sapiens cDNA clone IMAGp998D214660 IMAGE:1901156 5', mRNA sequence
AGGGCAAGGAGTAAAGGTGGCTGGGTGTGGGTCCGTTGAAGCGAGCCGCC
TCCAGCCCTGTTGAACTGGTGGGCCCAGGGACTGGAGCGGGATTGAAAGG
GATCTTGCTCTCCCTTGAAGCCTCGAGTTGCAGCGATTTCAGTGTCTTCT
CTCCCTGTGTAAGCCTGTCTGGGTGTTTAGGCTGAACTACAGCCACCCCC
TCTCCCGGGGGTGTGCAGGCCAGGGACTGGCCAGGCAGCCATGGCTGACG
AGAAGACCTTCCGGATCGGCTTCATTGTGCTGGGGCTTTTCCTGCTGGCC
CTCGGTACGTTCCTCATGAGCCATGATCGGCCCCAGGTCTACGGCACCTT
CTATGCCATGGGCAGCGTCATGGTGATCGGGGGCATCATCTGGAGCATGT
GCCAGTGCTACCCCAAGATCACCTTCGTCCCTGCTGACTCTGACTTTCAA
GGCATCCTCTCCCCAAAGGCCATGGGCCTGCTGGAGAA

46
ASN.1

Abstract Syntax Notation One (ASN.1) is a formal
language for abstractly describing messages to be
exchanged among an extensive range of
applications involving the Internet, intelligent
network, cellular phones, ground-to-air
communications, electronic commerce, secure
electronic services, interactive television,
intelligent transportation systems, Voice Over IP
and others.
Messages defined using ASN.1 can be encoded in
XML format and visualized.

47
ASN.1

Seq-entry seq id general db "dbEST" ,
tag id 16732958 , embl accession "BX094486" ,
version 1 , gi 27827140 ,
title "BX094486 NCI_CGAP_Kid5 Homo sapiens cDNA
clone IMAGp998D214660 IMAGE1901156 5'." ,
subtype clone-lib , name "NCI_CGAP_Kid5" ,
subtype tissue-type , name "2 pooled tumors
(clear cell type)" ,
orgname name binomial genus "Homo" , species
"sapiens" , mod subtype other , subname
"Organ kidney Vector pT7T3D-PacI Site_1 ,

48
ASN.1

Abstract Syntax Notation One (ASN.1) is a formal
language for abstractly describing messages to be
exchanged among an extensive range of
applications involving the Internet, intelligent
network, cellular phones, ground-to-air
communications, electronic commerce, secure
electronic services, interactive television,
intelligent transportation systems, Voice Over IP
and others. Due to its streamlined encoding
rules, ASN.1 is also reliable and ideal for
wireless broadband and other resource-constrained
environments. Its extensibility facilitates
communications between newer and older versions
of applications. In a world of change, ASN.1 is
core technology, constantly adapting to new
technologies. ASN.1 is a critical part of our
daily lives; it's everywhere, but it works so
well it's invisible.
The standardized XML Encoding Rules (XER) allow
ASN.1 specifications (modules) to be used as
ASN.1 schemas against which XML documents can be
validated. As a result, the ASN.1 language now
competes with other XML schema languages, but it
has some additional benefits over them.

49
ASN.1

One of the difficulties in making use of data
from a database is in interpreting the format of
the data. A common approach in the past has been
to release data in a so-called flat file format.
However, this format fails to preserve the
inherent relationships in a more complex data
model--for example, a relational database
management system such as the Sybase software
used for the Plant Genome Database.
A better approach is to make the data available
according to a specification defined in a data
description language. Abstract Syntax Notation 1
(ASN.1) is one type of data description language.
Data from the Plant Genome Database is to be made
available in the ASN.1 format in addition to its
primary means of access, which will be on-line.
Also, data exchange between NAL and our
collaborators will most likely occur using ASN.1.
The article by Jim Ostell of the National Library
of Medicine's National Center for Biotechnology
Information describes ASN.1 and some advantages
in its use:
http://www.nal.usda.gov/pgdic/Probe/v2n2/using.html

50
SEQUENCE ANALYSIS TOOLS

EMBOSS analysis tools:
http://www.molbiol.ox.ac.uk/analysis_tools/EMBOSS/index.shtml
Pasteur Analysis tools:
http://bioweb2.pasteur.fr/intro-en.html
ExPASy: http://www.expasy.ch/tools/
SEQOOL:
http://www.biossc.de/seqool/index.html

51
Sequence Analysis Software

Vector NTI
Accelrys GCG
Reda-Soft Textco
BioSoftware, Inc.
GCK
Virtual Cloning SuiteTM

Vector NTI Advance 10
Vector NTI: sequence analysis, annotation, and
illustration; restriction mapping; recombinant
molecule design, including Gateway and TOPO
cloning; in silico gel electrophoresis;
mol. bio-data management
AlignX: multiple sequence alignment of proteins
and DNAs
ContigExpress: DNA sequence assembly and
sequencing project management using the CAP3
algorithm
GenomBench: analysis and annotation of reference
genomic DNA sequences
BioAnnotator: functional annotation of DNAs and
proteins

54
MATLAB BIOINFORMATICS TOOLBOX DEMO
http://www.mathworks.com/access/helpdesk/help/toolbox/bioinfo/index.html?/access/helpdesk/help/toolbox/bioinfo/ug/fp6010dup1.html
http://www.mathworks.com/access/helpdesk/help/toolbox/bioinfo/bioinfo_product_page.html
