Title: Sequence Analysis Unit 5
1 Sequence Analysis Unit 5
- BIOL221T Advanced Bioinformatics for Biotechnology
Irene Gabashvili, PhD igabashvili_at_yahoo.com
2 Reminder
- Lectures & Lab every Wednesday, Duncan Hall, Room 550, 6:00 pm to 9:45 pm
- Office hours: Wednesday, 4 pm-6 pm (Room 554, phone 924-4831) and by appointment
- Lecture notes: http://home.comcast.net/igabashvili/221T.htm
- Or the SJSU page --
- The user name is ewok\biostudents (don't enter quotation marks)
- And the password is 4biolecture (don't enter quotation marks).
3 Problem Set Review
- 1. SWISSPROT protein sequence database began in 1987: all correct answers
- 2. OMIM stands for "Online Mendelian Inheritance in Man": all correct answers
- 3. How many genes in human 6q23 region?
- 6q23 AND human[orgn] (AND alive[prop]) (Entrez Gene) -> 320 (76) (6q23.1-6q23.3)
- 11 genes exactly in 6q23 (decimal not known)
- MapViewer: 1505 genes on chr. 6; 72 between 130,400K and 139,100K; 93 between 6q23.1 and 6q23.3
4 Why different results?
Entrez Gene
6q23 AND human[orgn] AND alive[prop]
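The same Entrez Gene query can also be prepared for scripted use. The sketch below only builds a search URL and does not contact NCBI. It assumes the NCBI E-utilities `esearch` endpoint, which the slides do not mention, and the helper name `build_esearch_url` is made up for illustration.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities esearch endpoint (not part of the original slides).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=0):
    """Return an esearch URL for an Entrez query string (no network access)."""
    params = {"db": db, "term": term, "retmax": retmax}
    # urlencode percent-encodes the square brackets of the field tags.
    return EUTILS_ESEARCH + "?" + urlencode(params)


query = "6q23 AND human[orgn] AND alive[prop]"
url = build_esearch_url("gene", query)
print(url)
```

Dropping `AND alive[prop]` from `query` gives the broader search, which is one way to compare the two counts side by side.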
5 Entrez Map Viewer
- Good tools, but lots of bugs: be aware
- Make sure the right numbers are displayed in the Region Shown boxes: the software tends to null them and display something else. If nothing is shown: wrong results! If different numbers are shown: wrong results! Only if the correct numbers/loci are still there, trust the results.
6 Map View Allowable Values
- cytogenetic bands - if the master map is a cytogenetic map, you can enter band numbers in formats such as 9p23 to 9p11, or 22q12.1 to 22q13.2.
- symbols - gene symbols, marker names, etc. to display the region of the chromosome between them
- numerical positions - cM for a genetic linkage map, cR for a radiation hybrid map, base pairs if the master map is a sequence map: 1000000 or 1,000,000 or 1M or 1000K or 1.1M or 1.2K
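The base-pair notations listed for a sequence map all describe plain integers. A minimal sketch of how such values could be normalized is shown below. The function name `parse_position` is hypothetical and this is not Map Viewer's own parsing code.

```python
from decimal import Decimal

# Multipliers for the K and M suffixes used in the examples above.
_SUFFIX = {"K": 1_000, "M": 1_000_000}


def parse_position(text):
    """Convert '1,000,000', '1M', '1000K', '1.1M' or '1.2K' to base pairs."""
    s = text.strip().replace(",", "").upper()
    factor = 1
    if s and s[-1] in _SUFFIX:
        factor = _SUFFIX[s[-1]]
        s = s[:-1]
    # Decimal avoids binary floating-point rounding for values such as 1.1M.
    return int(Decimal(s) * factor)


for value in ["1000000", "1,000,000", "1M", "1000K", "1.1M", "1.2K"]:
    print(value, parse_position(value))
```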
7 Entrez Map Viewer
- Types of maps might include
- cytogenetic map (banding pattern), also referred to as ideogram
- genetic linkage map (centiMorgans, cM)
- radiation hybrid map (centiRays, cR)
- sequence map (megabases, Mb)
8 - http://www.ncbi.nlm.nih.gov/mapview/maps.cgi?taxid=9606&chr=7&maps=ideogr&verbose=on
12 Why different results?
- OMIM lists not only known genes but also regions associated with diseases, e.g.
- Pulmonary function: 6q21-q22
- Stature (height): 6q24-q25
- Even if the answer appears right, it might include a few extra genes -- 6q24-q25
- http://www.ncbi.nlm.nih.gov/Omim/getmap.cgi?chromosome=6q23
- Make sure you use the right maps!
16 Entrez Map Viewer
- Where does gene X exist within the genome of organism Y? What are some flanking markers?
- Which genes exist on a chromosome, and in what order do they appear?
- Show the genes that exist in region R, and the corresponding sequence data for that region.
- Display the region of a chromosome between points A and B.
- I know the cytogenetic location of gene X. What is the corresponding physical location?
17 genome.ucsc.edu: UCSC Genome Browser
18 Problem Set Review
- 4. What gene has SNP rs3761847?
- http://www.ncbi.nlm.nih.gov/projects/SNP/
- 5. OMIM: Hypertension[dis] AND (KCNA1[gene] OR TRPM3[gene] OR NOS3[gene] OR CHEK2[gene])
- 6. clinicaltrials.gov: 6
- 7. crisp.cit.nih.gov: 1400
- 8. EST
19 Alignment
NT_017582.2
C5 TRAF1 PRO1995
TNF receptor associated factor 1
22 Problem Set Review
- 9. 90000:100000[SLEN] -> 32K records
- 10. txid9606 NOT newentry[title] NOT alive[prop] -> 126K
- 11. qualifier to get fewer results: [gene]
- 12. Gene, OMIM
- 13. KEGG pathway: Histidine metabolism 00340; KEGG pathway: Pentose phosphate pathway 00030; KEGG pathway: Purine metabolism 00230; Reactome Event: Metabolism of carbohydrates 71387
- 14. Yes, Entrez Gene DB provides info on protein-protein interactions, if known
23 Baxevanis & Ouellette ch. 1
- Intro
- Primary and Secondary DBs
- Nucleotide Sequence DBs
- DB formats
- Nucleotide Sequence Flatfiles
- The feature table
- Third Party Annotation
24 Sequence Databases
- 1965 - M. Dayhoff's Atlas of Protein Sequence and Structure
- 1979 - Los Alamos Sequence Library (pre-GenBank)
- 1980 - EMBL, 1st public nucleotide sequence DB
- 1982 - GenBank (formerly Los Alamos Library)
- 1984 - DDBJ, DNA Data Bank of Japan
- 1988 - NCBI created at NIH/NLM
- 1991 - the 1st version of Entrez distributed on CD-ROM
- 1992 - GenBank transferred to NCBI
25 Links
- http://www.ncbi.nlm.nih.gov/Genbank/
- http://www.ebi.ac.uk/embl/
- http://www.ddbj.nig.ac.jp/
- The three databases exchange data daily. For formats, codes, etc., see the feature table
- http://www.ncbi.nlm.nih.gov/collab/FT/index.html
26 Submitting to these DBs
- Sakura, a nucleotide sequence data submission system: http://sakura.ddbj.nig.ac.jp/
- Sequin, submitting to GenBank: www.ncbi.nlm.nih.gov/Sequin
- Submitting to EMBL: www.ebi.ac.uk/embl/Submission/webin.html
27 About
- http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.chapter.ch1 (GenBank, e-book)
- http://nar.oxfordjournals.org/cgi/content/full/36/suppl_1/D25 (GenBank, NAR-2008)
- http://nar.oxfordjournals.org/cgi/content/full/36/suppl_1/D22 (DDBJ, NAR-2008)
- http://nar.oxfordjournals.org/cgi/content/full/35/suppl_1/D16 (EMBL, NAR-2007)
28 PubMed: Nucleic Acids Res.[journal] AND "Database issue"
- PRIDE, PRoteomics IDEntifications database
- http://www.ebi.ac.uk/pride/
- STITCH, Chemical-Protein Interactions
- http://stitch.embl.de/
- The universal protein resource (UniProt): http://www.uniprot.org
- http://nar.oxfordjournals.org/cgi/content/full/36/suppl_1/D707 (ENSEMBL-2008)
29 Baxevanis & Ouellette ch. 1
- ncbi.nlm.nih.gov/RefSeq
- EMBL Genome Reviews
- GenPept: http://www.ncbi.nlm.nih.gov/sites/entrez?db=protein
- UniProt: http://www.pir.uniprot.org/
- UniParc: www.ebi.ac.uk/uniparc/
- IPI: http://www.ebi.ac.uk/IPI/IPIhelp.html
30 RefSeq
- January 11, 2008: RefSeq Release 27 available for FTP. This release includes
- Proteins: 4,426,609; Organisms: 4,926. Available at ftp://ftp.ncbi.nih.gov/refseq/release/. To receive announcements of future RefSeq releases and incremental large updates, please subscribe to NCBI's refseq-announce mail list: refseq-announce
31 More databases to know
- ExPASy, Expert Protein Analysis System: http://www.expasy.org/
- SwissProt and TrEMBL at ExPASy: http://www.expasy.org/sprot/
- UniProt/SwissProt by EBI: www.ebi.ac.uk/swissprot
- UniProt PIR, protein information resource: http://pir.georgetown.edu/
- www.uniprot.org -> beta.uniprot.org
32 More NCBI DBs
- GenPept (compiled from SwissProt, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq): http://www.ncbi.nlm.nih.gov/sites/entrez?db=protein
- The NCBI Entrez Taxonomy: http://www.ncbi.nlm.nih.gov/sites/entrez?db=taxonomy
33 Sequence Formats
- ASCII TEXT
- There are at least a couple of dozen sequence formats in existence at the moment.
- Nucleotide (DNA or RNA) sequences are usually stored in the IUBMB standard codes. Similarly, protein sequences are usually stored in the IUPAC standard one-letter codes.
34 IUPAC characters
- A = adenine
- C = cytosine
- G = guanine
- T = thymine
- U = uracil
- R = G or A (puRine)
- Y = T or C (pYrimidine)
- K = G or T (keto)
- M = A or C (amino)
- S = G or C (strong interaction, 3 H bonds)
- W = A or T (weak interaction, 2 H bonds)
- B = G, T or C (not A; B follows A in the alphabet)
- D = G, A or T (not C; D follows C in the alphabet)
- H = A, C or T (not G; H follows G in the alphabet)
- V = G, C or A (not T, not U; V follows U in the alphabet)
- N = A, G, C or T (any)
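The ambiguity codes above map directly onto character classes, which is how a degenerate pattern such as a restriction-site recognition sequence can be searched for in a DNA string. The following is a minimal sketch under that assumption. The names `IUPAC_DNA`, `iupac_to_regex` and `find_sites` are illustrative, and the pattern GTYRAC is used only as an example input.

```python
import re

# Each IUPAC nucleotide code mapped to the bases it stands for (U treated as T).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
    "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def iupac_to_regex(pattern):
    """Translate an IUPAC pattern into a regular expression string."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC_DNA[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def find_sites(pattern, sequence):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # The lookahead lets overlapping occurrences be reported as well.
    regex = re.compile("(?=(" + iupac_to_regex(pattern) + "))")
    return [m.start() for m in regex.finditer(sequence.upper())]


print(iupac_to_regex("GTYRAC"))
print(find_sites("GTYRAC", "aaGTTAACccGTCGACtt"))
```

Only the given strand is scanned; a fuller tool would also search the reverse complement.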
35 Amino Acid Symbols
36 Amino Acid Symbols - 2
37 Amino Acid Symbols - 3
38 Single-letter code recognition sequences for type II restriction endonucleases
39 FASTA format
- >gi|27827140|gb|BX094486.1|BX094486 BX094486 NCI_CGAP_Kid5 Homo sapiens cDNA clone IMAGp998D214660 IMAGE:1901156 5', mRNA sequence
AGGGCAAGGAGTAAAGGTGGCTGGGTGTGGGTCCGTTGAAGCGAGCCGCC
TCCAGCCCTGTTGAACTGGTGGGCCCAGGGACTGGAGCGGGATTGAAAGG
GATCTTGCTCTCCCTTGAAGCCTCGAGTTGCAGCGATTTCAGTGTCTTCT
CTCCCTGTGTAAGCCTGTCTGGGTGTTTAGGCTGAACTACAGCCACCCCC
TCTCCCGGGGGTGTGCAGGCCAGGGACTGGCCAGGCAGCCATGGCTGACG
AGAAGACCTTCCGGATCGGCTTCATTGTGCTGGGGCTTTTCCTGCTGGCC
CTCGGTACGTTCCTCATGAGCCATGATCGGCCCCAGGTCTACGGCACCTT
CTATGCCATGGGCAGCGTCATGGTGATCGGGGGCATCATCTGGAGCATGT
GCCAGTGCTACCCCAAGATCACCTTCGTCCCTGCTGACTCTGACTTTCAA
GGCATCCTCTCCCCAAAGGCCATGGGCCTGCTGGAGAA
40 FASTA format
- The header (definition) line starts with ">"
- Lines of text are recommended to be shorter than 80 characters
- Sequence Codes
- A-Adenosine C-Cytidine G-Guanine T-Thymidine U-Uracil R-G or A (puRine) N-any
- http://en.wikipedia.org/wiki/Fasta_format
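Because the format is just a ">" header line followed by sequence lines, a reader for it fits in a few lines. The sketch below is one possible implementation, not a reference parser. The function names are hypothetical, and the example uses the header and the first two sequence lines of the BX094486 record shown earlier.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_ncbi_defline(header):
    """Split an NCBI-style header into its '|'-separated IDs and description."""
    ident, _, description = header.partition(" ")
    return ident.split("|"), description


example = """>gi|27827140|gb|BX094486.1|BX094486 BX094486 NCI_CGAP_Kid5 Homo sapiens cDNA clone
AGGGCAAGGAGTAAAGGTGGCTGGGTGTGGGTCCGTTGAAGCGAGCCGCC
TCCAGCCCTGTTGAACTGGTGGGCCCAGGGACTGGAGCGGGATTGAAAGG
"""

records = parse_fasta(example)
fields, description = split_ncbi_defline(records[0][0])
print(fields)
print(len(records[0][1]))
```

The same function handles multi-record files, since every ">" line starts a new record.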
41 Sample questions
- What records could be written in FASTA format? Genes, proteins, whole genomes, one or multiple sequences, aligned sequences, metabolic pathways?
42 EMBL format
- A sequence file in EMBL format can contain several sequences. One sequence entry starts with an identifier line ("ID"), followed by further annotation lines. The start of the sequence is marked by a line starting with "SQ" and the end of the sequence is marked by two slashes ("//").
ID   AB000263 standard; RNA; PRI; 368 BP.
XX
AC   AB000263;
XX
DE   Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ   Sequence 368 BP;
     acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg        60
     ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg       120
     caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc       180
     aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag       240
     gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga       300
     agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca       360
     gacctgaa                                                                368
//
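The ID / SQ / "//" structure described for this format can be read with a small state machine. The following is a rough sketch that handles only the ID, DE and SQ lines and ignores every other line type. `parse_embl` is a hypothetical name, and the sample is a shortened toy version of the AB000263 entry, not the real record.

```python
def parse_embl(text):
    """Return a list of dicts with 'id', 'description' and 'sequence' keys."""
    entries = []
    entry = None          # entry currently being read, if any
    in_sequence = False   # True between the SQ line and the closing //
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith("//"):
            if entry is not None:
                entry["sequence"] = "".join(entry["sequence"])
                entries.append(entry)
            entry, in_sequence = None, False
        elif line.startswith("ID"):
            entry = {"id": line[2:].split()[0], "description": "", "sequence": []}
            in_sequence = False
        elif entry is None:
            continue
        elif in_sequence:
            # Drop the running base count at the end of each sequence line.
            entry["sequence"].extend(t for t in line.split() if not t.isdigit())
        elif line.startswith("DE"):
            entry["description"] = (entry["description"] + " " + line[2:].strip()).strip()
        elif line.startswith("SQ"):
            in_sequence = True
    return entries


# Shortened toy entry for illustration only (the real AB000263 has 368 bases).
sample = """ID   AB000263 standard; RNA; PRI; 28 BP.
XX
DE   Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
XX
SQ   Sequence 28 BP;
     acaagatgcc attgtccccc gacctgaa        28
//
"""

entries = parse_embl(sample)
print(entries[0]["id"], len(entries[0]["sequence"]))
```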
43 GENBANK (GBK) format
LOCUS       AB000263                 368 bp    mRNA    linear   PRI 05-FEB-1999
DEFINITION  Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
ACCESSION   AB000263
ORIGIN
        1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
       61 ctgccctgcc cctggagggt ggccccaccg gccgagacag cgagcatatg caggaagcgg
      121 caggaataag gaaaagcagc ctcctgactt tcctcgcttg gtggtttgag tggacctccc
      181 aggccagtgc cgggcccctc ataggagagg aagctcggga ggtggccagg cggcaggaag
      241 gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc ctgcaggaac ttcttctgga
      301 agaccttctc ctcctgcaaa taaaacctca cccatgaatg ctcacgcaag tttaattaca
      361 gacctgaa
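A GenBank record carries the same sequence as a FASTA file, with leading coordinates and extra keyword lines. The sketch below shows one way the conversion could be done, assuming single-line DEFINITION and ACCESSION fields as in the record above. `genbank_to_fasta` is a hypothetical helper and the input is a shortened toy version of that record.

```python
def genbank_to_fasta(text, width=60):
    """Convert a minimal GenBank flat-file record to a FASTA string."""
    accession, definition, chunks = "", "", []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):
            break
        if in_origin:
            # The first token on each ORIGIN line is the coordinate; skip it.
            chunks.extend(line.split()[1:])
        elif line.startswith("ACCESSION"):
            parts = line.split()
            accession = parts[1] if len(parts) > 1 else ""
        elif line.startswith("DEFINITION"):
            definition = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True
    sequence = "".join(chunks).upper()
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + accession + " " + definition + "\n" + "\n".join(lines) + "\n"


# Shortened toy record for illustration only.
record = """LOCUS       AB000263                 368 bp    mRNA    linear   PRI 05-FEB-1999
DEFINITION  Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.
ACCESSION   AB000263
ORIGIN
        1 acaagatgcc attgtccccc
//
"""

fasta = genbank_to_fasta(record)
print(fasta)
```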
44 GENBANK (GBK) format
ACCESSION   BX094486
VERSION     BX094486.1  GI:27827140
     source          1..488
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /clone="IMAGp998D214660 IMAGE:1901156"
                     /tissue_type="2 pooled tumors (clear cell type)"
                     /lab_host="DH10B"
                     /clone_lib="NCI_CGAP_Kid5"
                     /note="Organ: kidney; Vector: pT7T3D-PacI; Site_1: Not I; Site_2: Eco RI; 1st strand cDNA was primed with a Not I - oligo(dT) primer 5'
45 Compare again to FASTA format
- >gi|27827140|gb|BX094486.1|BX094486 BX094486 NCI_CGAP_Kid5 Homo sapiens cDNA clone IMAGp998D214660 IMAGE:1901156 5', mRNA sequence
AGGGCAAGGAGTAAAGGTGGCTGGGTGTGGGTCCGTTGAAGCGAGCCGCC
TCCAGCCCTGTTGAACTGGTGGGCCCAGGGACTGGAGCGGGATTGAAAGG
GATCTTGCTCTCCCTTGAAGCCTCGAGTTGCAGCGATTTCAGTGTCTTCT
CTCCCTGTGTAAGCCTGTCTGGGTGTTTAGGCTGAACTACAGCCACCCCC
TCTCCCGGGGGTGTGCAGGCCAGGGACTGGCCAGGCAGCCATGGCTGACG
AGAAGACCTTCCGGATCGGCTTCATTGTGCTGGGGCTTTTCCTGCTGGCC
CTCGGTACGTTCCTCATGAGCCATGATCGGCCCCAGGTCTACGGCACCTT
CTATGCCATGGGCAGCGTCATGGTGATCGGGGGCATCATCTGGAGCATGT
GCCAGTGCTACCCCAAGATCACCTTCGTCCCTGCTGACTCTGACTTTCAA
GGCATCCTCTCCCCAAAGGCCATGGGCCTGCTGGAGAA
46 ASN.1
- Abstract Syntax Notation One (ASN.1) is a formal language for abstractly describing messages to be exchanged among an extensive range of applications involving the Internet, intelligent network, cellular phones, ground-to-air communications, electronic commerce, secure electronic services, interactive television, intelligent transportation systems, Voice Over IP and others.
- Messages defined using ASN.1 can be encoded in XML format and visualized
47 ASN.1
- Seq-entry seq id general db "dbEST" , tag id 16732958 , embl accession "BX094486" , version 1 , gi 27827140 ,
- title "BX094486 NCI_CGAP_Kid5 Homo sapiens cDNA clone IMAGp998D214660 IMAGE:1901156 5'." ,
- subtype clone-lib , name "NCI_CGAP_Kid5" , subtype tissue-type , name "2 pooled tumors (clear cell type)" ,
- orgname name binomial genus "Homo" , species "sapiens" , mod subtype other , subname "Organ: kidney; Vector: pT7T3D-PacI; Site_1 ,
48 ASN.1
- Abstract Syntax Notation One (ASN.1) is a formal language for abstractly describing messages to be exchanged among an extensive range of applications involving the Internet, intelligent network, cellular phones, ground-to-air communications, electronic commerce, secure electronic services, interactive television, intelligent transportation systems, Voice Over IP and others. Due to its streamlined encoding rules, ASN.1 is also reliable and ideal for wireless broadband and other resource-constrained environments. Its extensibility facilitates communications between newer and older versions of applications. In a world of change, ASN.1 is core technology, constantly adapting to new technologies. ASN.1 is a critical part of our daily lives; it's everywhere, but it works so well it's invisible.
- The standardized XML Encoding Rules (XER) allow ASN.1 specifications (modules) to be used as ASN.1 schemas against which XML documents can be validated. As a result, the ASN.1 language now competes with other XML schema languages, but it has some additional benefits over them.
49 ASN.1
- One of the difficulties in making use of data from a database is in interpreting the format of the data. A common approach in the past has been to release data in a so-called flat file format. However, this format fails to preserve the inherent relationships in a more complex data model--for example, a relational database management system such as the Sybase software used for the Plant Genome Database.
- A better approach is to make the data available according to a specification defined in a data description language. Abstract Syntax Notation 1 (ASN.1) is one type of data description language. Data from the Plant Genome Database is to be made available in the ASN.1 format in addition to its primary means of access, which will be on-line. Also, data exchange between NAL and our collaborators will most likely occur using ASN.1. The article by Jim Ostell of the National Library of Medicine's National Center for Biotechnology Information describes ASN.1 and some advantages in its use
- http://www.nal.usda.gov/pgdic/Probe/v2n2/using.html
50 SEQUENCE ANALYSIS TOOLS
- EMBOSS analysis tools
- http://www.molbiol.ox.ac.uk/analysis_tools/EMBOSS/index.shtml
- Pasteur Analysis tools
- http://bioweb2.pasteur.fr/intro-en.html
- ExPASy: http://www.expasy.ch/tools/
- SEQOOL
- http://www.biossc.de/seqool/index.html
51 Sequence Analysis Software
- Vector NTI
- Accelrys GCG
- Reda-Soft Textco
- BioSoftware, Inc.
- GCK
- Virtual Cloning Suite (TM)
52 Vector NTI Advance 10
- Vector NTI: sequence analysis, annotation, and illustration; restriction mapping; recombinant molecule design, including Gateway and TOPO cloning; in silico gel electrophoresis; molecular biology data management
- AlignX: multiple sequence alignment of proteins and DNAs
- ContigExpress: DNA sequence assembly and sequencing project management using the CAP3 algorithm
- GenomBench: analysis and annotation of reference genomic DNA sequences
- BioAnnotator: functional annotation of DNAs and proteins
54 MATLAB BIOINFORMATICS TOOLBOX DEMO
http://www.mathworks.com/access/helpdesk/help/toolbox/bioinfo/index.html?/access/helpdesk/help/toolbox/bioinfo/ug/fp6010dup1.html
http://www.mathworks.com/access/helpdesk/help/toolbox/bioinfo/bioinfo_product_page.html