Sequence Analysis Unit 5 - PowerPoint PPT Presentation

Provided by: irenegab
1
Sequence Analysis Unit 5
  • BIOL221T Advanced Bioinformatics for
    Biotechnology

Irene Gabashvili, PhD igabashvili_at_yahoo.com
2
Reminder
  • Lectures & Lab: every Wednesday, Duncan Hall,
    Room 550, 6:00 pm to 9:45 pm
  • Office hours: Wednesday, 4pm-6pm (Room 554,
    phone: 924-4831) and by appointment
  • Lecture notes: http://home.comcast.net/igabashvili/221T.htm
  • Or the SJSU page --
  • The user name is ewok\biostudents (don't enter
    the quotation marks)
  • And the password is 4biolecture (don't enter
    the quotation marks).

3
Problem Set Review
  • 1. The SWISS-PROT protein sequence database began
    in 1987: all answers correct
  • 2. OMIM stands for "Online Mendelian Inheritance
    in Man": all answers correct
  • 3. How many genes in human 6q23 region?
  • 6q23 AND human[orgn] (AND alive[prop])
    (Entrez Gene) → 320 (76) (6q23.1-6q23.3)
  • 11 genes exactly in 6q23 (decimal not known)
  • MapViewer: 1505 genes on chr. 6; 72 between
    130,400K and 139,100K; 93 between 23.1 and 23.3

4
Why different results?
Entrez Gene
6q23 AND human[orgn] AND alive[prop]
5
Entrez Map Viewer
  • Good tools, but lots of bugs: be aware
  • Make sure the right numbers are displayed in the
    Region Shown boxes: the software tends to null
    them and display something else. If nothing is
    shown: wrong results! If different numbers are
    shown: wrong results! Only if the correct
    numbers/loci are still there, trust the results.

6
Map View Allowable Values
  • cytogenetic bands - if the master map is a
    cytogenetic map, you can enter band numbers in
    formats such as 9p23 to 9p11, or 22q12.1 to
    22q13.2.
  • symbols - gene symbols, marker names, etc., to
    display the region of the chromosome between them
  • numerical positions - cM for a genetic linkage
    map, cR for a radiation hybrid map, base pairs if
    the master map is a sequence map: 1000000 or
    1,000,000 or 1M or 1000K or 1.1M or 1.2K
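
A minimal Python sketch of how the base-pair notations listed above (1000000, 1,000,000, 1M, 1000K, 1.1M, 1.2K) could be normalised to a plain integer. The function name `parse_bp` and the suffix table are illustrative assumptions; this is not how Map Viewer itself parses its input.

```python
import re

# Multipliers for the optional K/M suffix (assumed: K = 1,000 bp, M = 1,000,000 bp).
_SUFFIX = {"": 1, "K": 1_000, "M": 1_000_000}


def parse_bp(text: str) -> int:
    """Convert a position such as '1,000,000', '1000K' or '1.1M' to base pairs."""
    cleaned = text.strip().replace(",", "").upper()
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KM]?)", cleaned)
    if match is None:
        raise ValueError(f"unrecognised position: {text!r}")
    number, suffix = match.groups()
    # round() absorbs floating-point noise from values like 1.1 * 1_000_000.
    return round(float(number) * _SUFFIX[suffix])


for value in ["1000000", "1,000,000", "1M", "1000K", "1.1M", "1.2K"]:
    print(value, "->", parse_bp(value))
```

The first four spellings all resolve to the same coordinate, which makes it easy to check that the numbers typed into the Region Shown boxes are the ones intended.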

7
Entrez Map Viewer
  • Types of maps might include:
  • cytogenetic map (banding pattern), also referred
    to as ideogram
  • genetic linkage map (centiMorgans, cM)
  • radiation hybrid map (centiRays, cR)
  • sequence map (megabases, Mb)

8
  • http://www.ncbi.nlm.nih.gov/mapview/maps.cgi?taxid=9606&chr=7&maps=ideogr&verbose=on

12
Why different results?
  • OMIM lists not only known genes but also regions
    associated with diseases, e.g.
  • Pulmonary function: 6q21-q22
  • Stature (height): 6q24-q25
  • Even if the answer appears right, it might
    include a few extra genes -- 6q24-q25
  • http://www.ncbi.nlm.nih.gov/Omim/getmap.cgi?chromosome=6q23
  • Make sure you use the right maps!

16
Entrez Map Viewer
  • Where does gene X exist within the genome of
    organism Y? What are some flanking markers?
  • Which genes exist on a chromosome, and in what
    order do they appear?
  • Show the genes that exist in region R, the
    corresponding sequence data for that region.
  • Display the region of a chromosome between points
    A and B.
  • I know the cytogenetic location of gene X. What
    is the corresponding physical location?

17
genome.ucsc.edu UCSC Genome Browser
18
Problem Set Review
  • 4. What gene has SNP rs3761847?
  • http://www.ncbi.nlm.nih.gov/projects/SNP/
  • 5. OMIM: Hypertension[dis] AND (KCNA1[gene] OR
    TRPM3[gene] OR NOS3[gene] OR CHEK2[gene])
  • 6. clinicaltrials.gov: 6
  • 7. crisp.cit.nih.gov: 1400
  • 8. EST

19
Alignment
NT_017582.2
C5 TRAF1 PRO1995
TNF receptor associated factor 1
22
Problem Set Review
  • 9. 90000:100000[SLEN] → 32K records
  • 10. txid9606 NOT newentry[title] NOT alive[prop]
    → 126K
  • 11. Qualifier to get fewer results: [gene]
  • 12. Gene, OMIM
  • 13. KEGG pathway: Histidine metabolism (00340);
    KEGG pathway: Pentose phosphate pathway (00030);
    KEGG pathway: Purine metabolism (00230);
    Reactome Event: Metabolism of carbohydrates (71387)
  • 14. Yes, the Entrez Gene DB provides info on
    protein-protein interactions, if known
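
Entrez queries like the ones in this problem set attach a field qualifier in square brackets to a term and combine clauses with AND, OR and NOT. The sketch below only assembles such query strings; it does not contact NCBI. The helper names `tagged`, `all_of` and `any_of` are hypothetical and exist only for this illustration.

```python
def tagged(term: str, field: str) -> str:
    """Attach an Entrez field qualifier, e.g. tagged('NOS3', 'gene') -> 'NOS3[gene]'."""
    return f"{term}[{field}]"


def all_of(*clauses: str) -> str:
    """Join clauses with AND."""
    return " AND ".join(clauses)


def any_of(*clauses: str) -> str:
    """Join clauses with OR and parenthesise the group."""
    return "(" + " OR ".join(clauses) + ")"


genes = ["KCNA1", "TRPM3", "NOS3", "CHEK2"]
omim_query = all_of(
    tagged("Hypertension", "dis"),
    any_of(*(tagged(gene, "gene") for gene in genes)),
)
length_query = tagged("90000:100000", "SLEN")  # sequence-length range

print(omim_query)
print(length_query)
```

Building the string from parts makes it harder to lose a bracket or a parenthesis, which is the usual reason two seemingly identical searches return different counts.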

23
Baxevanis & Ouellette, ch. 1
  • Intro
  • Primary and Secondary DBs
  • Nucleotide Sequence DBs
  • DB formats
  • Nucleotide Sequence Flatfiles
  • The feature table
  • Third Party Annotation

24
Sequence Databases
  • 1965: M. Dayhoff's Atlas of Protein Sequences
  • 1979: Los Alamos Sequence Library (pre-GenBank)
  • 1980: EMBL, 1st public nucleotide sequence DB
  • 1982: GenBank (formerly Los Alamos Library)
  • 1984: DDBJ - DNA Data Bank of Japan
  • 1988: NCBI created at NIH/NLM
  • 1991: the 1st version of Entrez distributed on
    CD-ROM
  • 1992: GenBank transferred to NCBI

25
Links
  • http://www.ncbi.nlm.nih.gov/Genbank/
  • http://www.ebi.ac.uk/embl/
  • http://www.ddbj.nig.ac.jp/
  • The three databases exchange data daily. For
    formats, codes, etc., see the feature table:
  • http://www.ncbi.nlm.nih.gov/collab/FT/index.html

26
Submitting to these DBs
  • Sakura, a nucleotide sequence data submission
    system: http://sakura.ddbj.nig.ac.jp/
  • Sequin, submitting to GenBank: www.ncbi.nlm.nih.gov/Sequin
  • Submitting to EMBL: www.ebi.ac.uk/embl/Submission/webin.html

27
About
  • http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.chapter.ch1
    (GenBank, e-book)
  • http://nar.oxfordjournals.org/cgi/content/full/36/suppl_1/D25
    (GenBank, NAR-2008)
  • http://nar.oxfordjournals.org/cgi/content/full/36/suppl_1/D22
    (DDBJ, NAR-2008)
  • http://nar.oxfordjournals.org/cgi/content/full/35/suppl_1/D16
    (EMBL, NAR-2007)

28
PubMed: Nucleic Acids Res.[journal] AND "Database
issue"
  • PRIDE, PRoteomics IDEntifications database
  • http://www.ebi.ac.uk/pride/
  • STITCH, Chemical-Protein Interactions
  • http://stitch.embl.de/
  • The universal protein resource (UniProt):
    http://www.uniprot.org
  • http://nar.oxfordjournals.org/cgi/content/full/36/suppl_1/D707
    (ENSEMBL-2008)

29
Baxevanis & Ouellette, ch. 1
  • ncbi.nlm.nih.gov/RefSeq
  • EMBL Genome Reviews
  • GenPept: http://www.ncbi.nlm.nih.gov/sites/entrez?db=protein
  • UniProt: http://www.pir.uniprot.org/
  • UniParc: www.ebi.ac.uk/uniparc/
  • IPI: http://www.ebi.ac.uk/IPI/IPIhelp.html

30
RefSeq
  • January 11, 2008: RefSeq Release 27 available for
    FTP. This release includes:
  • Proteins: 4,426,609; Organisms: 4,926
  • Available at ftp://ftp.ncbi.nih.gov/refseq/release/
  • To receive announcements of future RefSeq releases
    and incremental large updates, please subscribe to
    NCBI's refseq-announce mail list
31
More databases to know
  • ExPASy, Expert Protein Analysis System:
    http://www.expasy.org/
  • SwissProt and TrEMBL at ExPASy:
    http://www.expasy.org/sprot/
  • UniProt/SwissProt by EBI: www.ebi.ac.uk/swissprot
  • UniProt PIR, Protein Information Resource:
    http://pir.georgetown.edu/
  • www.uniprot.org → beta.uniprot.org

32
More NCBI DBs
  • GenPept (compiled from SwissProt, PIR, PRF, PDB,
    and translations from annotated coding regions in
    GenBank and RefSeq):
    http://www.ncbi.nlm.nih.gov/sites/entrez?db=protein
  • The NCBI Entrez Taxonomy:
    http://www.ncbi.nlm.nih.gov/sites/entrez?db=taxonomy

33
Sequence Formats
  • ASCII TEXT
  • There are at least a couple of dozen sequence
    formats in existence at the moment.
  • Nucleotide (DNA or RNA) sequences are usually
    stored in the IUBMB standard codes. Similarly,
    protein sequences are usually stored in the IUPAC
    standard one-letter codes.

34
IUPAC characters
  • A adenine
  • C cytosine
  • G guanine
  • T thymine
  • U uracil
  • R G A (puRine)
  • Y T C (pYrimidine)
  • K G T (keto)
  • M A C (amino)
  • S G C <strong interaction, 3 H bonds>
  • W A T <weak interaction, 2 H bonds>
  • B G T C (not A, B follows A in the alphabet)
  • D G A T (not C, D follows C in the alphabet)
  • H A C T (not G, H follows G in the alphabet)
  • V G C A (not T (not U), V follows U)
  • N A G C T (any)
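
The ambiguity codes above are what allow a degenerate recognition sequence to be written in a single line of letters. As a rough sketch, the table can be turned into a regular expression and used to scan a DNA string. The names `IUPAC`, `iupac_to_regex` and `find_sites`, the pattern GTYRAC and the test sequence are all chosen for illustration only.

```python
import re

# Each IUPAC nucleotide code mapped to the bases it stands for.
# U is treated as T here because the sketch scans DNA strings.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA",
    "N": "AGCT",
}


def iupac_to_regex(pattern: str) -> str:
    """Translate an IUPAC pattern into a regular expression over A, C, G, T."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC[code]
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


def find_sites(pattern: str, sequence: str) -> list:
    """Return 0-based start positions of every (possibly overlapping) match."""
    # The lookahead lets overlapping occurrences be reported as well.
    regex = re.compile(f"(?=({iupac_to_regex(pattern)}))")
    return [match.start() for match in regex.finditer(sequence.upper())]


print(iupac_to_regex("GTYRAC"))
print(find_sites("GTYRAC", "AAGTTAACCGTCGACTT"))
```

Here Y matches T or C and R matches G or A, so both GTTAAC and GTCGAC in the test string count as hits.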

35
Amino Acid Symbols
36
Amino Acid Symbols -2
37
Amino Acid Symbols -3
38
Single-letter code recognition sequences for type
II restriction endonucleases
39
FASTA format
  • >gi|27827140|gb|BX094486.1|BX094486 BX094486
    NCI_CGAP_Kid5 Homo sapiens cDNA clone
    IMAGp998D214660 IMAGE1901156 5', mRNA sequence
    AGGGCAAGGAGTAAAGGTGGCTGGGTGTGGGTCCGTTGAAGCGAGCCGCC
    TCCAGCCCTGTTGAACTGGTGGGCCCAGGGACTGGAGCGGGATTGAAAGG
    GATCTTGCTCTCCCTTGAAGCCTCGAGTTGCAGCGATTTCAGTGTCTTCT
    CTCCCTGTGTAAGCCTGTCTGGGTGTTTAGGCTGAACTACAGCCACCCCC
    TCTCCCGGGGGTGTGCAGGCCAGGGACTGGCCAGGCAGCCATGGCTGACG
    AGAAGACCTTCCGGATCGGCTTCATTGTGCTGGGGCTTTTCCTGCTGGCC
    CTCGGTACGTTCCTCATGAGCCATGATCGGCCCCAGGTCTACGGCACCTT
    CTATGCCATGGGCAGCGTCATGGTGATCGGGGGCATCATCTGGAGCATGT
    GCCAGTGCTACCCCAAGATCACCTTCGTCCCTGCTGACTCTGACTTTCAA
    GGCATCCTCTCCCCAAAGGCCATGGGCCTGCTGGAGAA

40
FASTA format
  • Starts with >
  • No more than 80 characters in the header
  • Sequence Codes
  • A-Adenosine, C-Cytidine, G-Guanine, T-Thymidine,
    U-Uracil, R-G or A (puRine), N-any
  • http://en.wikipedia.org/wiki/Fasta_format
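
To make the layout concrete, here is a small Python sketch of a FASTA reader: a line starting with ">" opens a new record, and every following line is appended to that record's sequence. The function name `read_fasta` and the two short records are made up for the example; the first sequence line is borrowed from the BX094486 entry, the second record is purely hypothetical.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for one or more FASTA records."""
    records: Dict[str, List[str]] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]          # drop the leading '>'
            records[header] = []
        elif header is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            records[header].append(line)
    # Sequence lines are only wrapped for display, so join them back together.
    return {name: "".join(chunks) for name, chunks in records.items()}


example = """\
>demo1 first 50 bases of BX094486, split over two lines
AGGGCAAGGAGTAAAGGTGGCTGGG
TGTGGGTCCGTTGAAGCGAGCCGCC
>demo2 hypothetical second record
ACGTACGT
"""
for name, sequence in read_fasta(example.splitlines()).items():
    print(name, len(sequence))
```

Because a file may hold several ">" records back to back, the same reader covers single sequences and multi-sequence files alike.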

41
Sample questions
  • What records could be written in FASTA format?
    Genes, proteins, whole genomes, one or multiple
    sequences, aligned sequences, metabolic pathways?

42
EMBL format
  • A sequence file in EMBL format can contain
    several sequences. One sequence entry starts
    with an identifier line ("ID"), followed by
    further annotation lines. The start of the
    sequence is marked by a line starting with "SQ"
    and the end of the sequence is marked by two
    slashes ("//").
  • ID AB000263 standard; RNA; PRI; 368 BP.
  • XX
  • AC AB000263;
  • XX
  • DE Homo sapiens mRNA for prepro cortistatin like
    peptide, complete cds.
  • XX
  • SQ Sequence 368 BP;
  • acaagatgcc attgtccccc ggcctcctgc tgctgctgct
    ctccggggcc acggccaccg 60
  • ctgccctgcc cctggagggt ggccccaccg gccgagacag
    cgagcatatg caggaagcgg 120
  • caggaataag gaaaagcagc ctcctgactt tcctcgcttg
    gtggtttgag tggacctccc 180
  • aggccagtgc cgggcccctc ataggagagg aagctcggga
    ggtggccagg cggcaggaag 240
  • gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc
    ctgcaggaac ttcttctgga 300
  • agaccttctc ctcctgcaaa taaaacctca cccatgaatg
    ctcacgcaag tttaattaca 360
  • gacctgaa 368
  • //
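
The three landmarks described on this slide (the "ID" line, the "SQ" line and the closing "//") are enough to pull sequences out of an EMBL flat file. The following Python sketch does only that and ignores all other annotation lines. The function name `read_embl` and the shortened 68 bp entry `DEMO0001` are invented for the example.

```python
from typing import Iterable, Iterator, List, Tuple


def read_embl(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) for each entry in EMBL flat-file text."""
    entry_id = None
    chunks: List[str] = []
    in_sequence = False
    for raw in lines:
        line = raw.rstrip()
        if line.startswith("//"):            # end of the current entry
            if entry_id is not None:
                yield entry_id, "".join(chunks)
            entry_id, chunks, in_sequence = None, [], False
        elif in_sequence:
            # Keep letters only: drops the spacing and the running position count.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ID"):
            entry_id = line.split()[1].rstrip(";")
        elif line.startswith("SQ"):          # sequence starts on the next line
            in_sequence = True


example = """\
ID   DEMO0001 standard; RNA; PRI; 68 BP.
XX
DE   Hypothetical shortened entry for illustration.
XX
SQ   Sequence 68 BP;
     acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg        60
     gacctgaa                                                                 68
//
"""
for identifier, sequence in read_embl(example.splitlines()):
    print(identifier, len(sequence))
```

Since every entry ends with "//", the same loop handles a file containing several entries without extra bookkeeping.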

43
GENBANK (GBK) format
  • LOCUS AB000263 368 bp mRNA linear PRI 05-FEB-1999
  • DEFINITION Homo sapiens mRNA for prepro
    cortistatin like peptide, complete cds.
  • ACCESSION AB000263
  • ORIGIN
  • 1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct
    ctccggggcc acggccaccg
  • 61 ctgccctgcc cctggagggt ggccccaccg gccgagacag
    cgagcatatg caggaagcgg
  • 121 caggaataag gaaaagcagc ctcctgactt tcctcgcttg
    gtggtttgag tggacctccc
  • 181 aggccagtgc cgggcccctc ataggagagg aagctcggga
    ggtggccagg cggcaggaag
  • 241 gcgcaccccc ccagcaatcc gcgcgccggg acagaatgcc
    ctgcaggaac ttcttctgga
  • 301 agaccttctc ctcctgcaaa taaaacctca cccatgaatg
    ctcacgcaag tttaattaca
  • 361 gacctgaa
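
A GenBank record carries the same sequence as its FASTA counterpart, only wrapped in labelled lines. As a rough illustration, the sketch below reads the ACCESSION and DEFINITION lines plus the ORIGIN block and emits FASTA text. The function name `genbank_to_fasta` and the shortened record `DEMO0001` are invented here, and the sketch assumes a single record whose DEFINITION fits on one line.

```python
def genbank_to_fasta(text: str, width: int = 60) -> str:
    """Convert one GenBank flat-file record to FASTA text (simplified)."""
    accession = ""
    definition = ""
    chunks = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):            # record terminator
            break
        if in_origin:
            # Keep letters only: drops the leading position numbers and spaces.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ACCESSION"):
            accession = line.split()[1]
        elif line.startswith("DEFINITION"):
            definition = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):      # sequence starts on the next line
            in_origin = True
    sequence = "".join(chunks)
    body = "\n".join(sequence[i:i + width] for i in range(0, len(sequence), width))
    return f">{accession} {definition}\n{body}"


example = """\
LOCUS       DEMO0001                  68 bp    mRNA    linear   PRI 05-FEB-1999
DEFINITION  Hypothetical shortened record for illustration.
ACCESSION   DEMO0001
ORIGIN
        1 acaagatgcc attgtccccc ggcctcctgc tgctgctgct ctccggggcc acggccaccg
       61 gacctgaa
//
"""
print(genbank_to_fasta(example))
```

The conversion is lossy by design: everything GenBank stores beyond the identifier, description and residues has no place in a FASTA header.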

44
GENBANK (GBK) format
  • ACCESSION BX094486
  • VERSION BX094486.1 GI:27827140
  • source 1..488
    /organism="Homo sapiens"
    /mol_type="mRNA"
    /db_xref="taxon:9606"
    /clone="IMAGp998D214660 IMAGE1901156"
    /tissue_type="2 pooled tumors (clear cell type)"
    /lab_host="DH10B"
    /clone_lib="NCI_CGAP_Kid5"
    /note="Organ kidney Vector pT7T3D-PacI
    Site_1 Not I Site_2 Eco RI 1st strand cDNA
    was primed with a Not I - oligo(dT) primer 5'

45
Compare again to FASTA format
  • >gi|27827140|gb|BX094486.1|BX094486 BX094486
    NCI_CGAP_Kid5 Homo sapiens cDNA clone
    IMAGp998D214660 IMAGE1901156 5', mRNA sequence
    AGGGCAAGGAGTAAAGGTGGCTGGGTGTGGGTCCGTTGAAGCGAGCCGCC
    TCCAGCCCTGTTGAACTGGTGGGCCCAGGGACTGGAGCGGGATTGAAAGG
    GATCTTGCTCTCCCTTGAAGCCTCGAGTTGCAGCGATTTCAGTGTCTTCT
    CTCCCTGTGTAAGCCTGTCTGGGTGTTTAGGCTGAACTACAGCCACCCCC
    TCTCCCGGGGGTGTGCAGGCCAGGGACTGGCCAGGCAGCCATGGCTGACG
    AGAAGACCTTCCGGATCGGCTTCATTGTGCTGGGGCTTTTCCTGCTGGCC
    CTCGGTACGTTCCTCATGAGCCATGATCGGCCCCAGGTCTACGGCACCTT
    CTATGCCATGGGCAGCGTCATGGTGATCGGGGGCATCATCTGGAGCATGT
    GCCAGTGCTACCCCAAGATCACCTTCGTCCCTGCTGACTCTGACTTTCAA
    GGCATCCTCTCCCCAAAGGCCATGGGCCTGCTGGAGAA

46
ASN.1
  • Abstract Syntax Notation One (ASN.1) is a formal
    language for abstractly describing messages to be
    exchanged among an extensive range of
    applications involving the Internet, intelligent
    network, cellular phones, ground-to-air
    communications, electronic commerce, secure
    electronic services, interactive television,
    intelligent transportation systems, Voice Over IP
    and others.
  • Messages defined using ASN.1 can be encoded in
    XML format and visualized

47
ASN.1
  • Seq-entry seq id general db "dbEST" ,
    tag id 16732958 , embl accession "BX094486" ,
    version 1 , gi 27827140 ,
  • title "BX094486 NCI_CGAP_Kid5 Homo sapiens cDNA
    clone IMAGp998D214660 IMAGE1901156 5'." ,
  • subtype clone-lib , name "NCI_CGAP_Kid5" ,
    subtype tissue-type , name "2 pooled tumors
    (clear cell type)" ,
  • orgname name binomial genus "Homo" , species
    "sapiens" , mod subtype other , subname
    "Organ kidney Vector pT7T3D-PacI Site_1 ,

48
ASN.1
  • Abstract Syntax Notation One (ASN.1) is a formal
    language for abstractly describing messages to be
    exchanged among an extensive range of
    applications involving the Internet, intelligent
    network, cellular phones, ground-to-air
    communications, electronic commerce, secure
    electronic services, interactive television,
    intelligent transportation systems, Voice Over IP
    and others. Due to its streamlined encoding
    rules, ASN.1 is also reliable and ideal for
    wireless broadband and other resource-constrained
    environments. Its extensibility facilitates
    communications between newer and older versions
    of applications. In a world of change, ASN.1 is
    core technology, constantly adapting to new
    technologies. ASN.1 is a critical part of our
    daily lives; it's everywhere, but it works so
    well it's invisible.
  • The standardized XML Encoding Rules (XER) allow
    ASN.1 specifications (modules) to be used as
    ASN.1 schemas against which XML documents can be
    validated. As a result, the ASN.1 language now
    competes with other XML schema languages, but it
    has some additional benefits over them.

49
ASN.1
  • One of the difficulties in making use of data
    from a database is in interpreting the format of
    the data. A common approach in the past has been
    to release data in a so-called flat file format.
    However, this format fails to preserve the
    inherent relationships in a more complex data
    model--for example, a relational database
    management system such as the Sybase software
    used for the Plant Genome Database.
  • A better approach is to make the data available
    according to a specification defined in a data
    description language. Abstract Syntax Notation 1
    (ASN.1) is one type of data description language.
    Data from the Plant Genome Database is to be made
    available in the ASN.1 format in addition to its
    primary means of access, which will be on-line.
    Also, data exchange between NAL and our
    collaborators will most likely occur using ASN.1.
    The article by Jim Ostell of the National Library
    of Medicine's National Center for Biotechnology
    Information describes ASN.1 and some advantages
    in its use.
  • http://www.nal.usda.gov/pgdic/Probe/v2n2/using.html

50
SEQUENCE ANALYSIS TOOLS
  • EMBOSS analysis tools
  • http://www.molbiol.ox.ac.uk/analysis_tools/EMBOSS/index.shtml
  • Pasteur Analysis tools
  • http://bioweb2.pasteur.fr/intro-en.html
  • ExPASy: http://www.expasy.ch/tools/
  • SEQOOL
  • http://www.biossc.de/seqool/index.html

51
Sequence Analysis Software
  • Vector NTI
  • Accelrys GCG
  • Reda-Soft Textco
  • BioSoftware, Inc.
  • GCK
  • Virtual Cloning SuiteTM

52
  • Vector NTI Advance 10
  • Vector NTI: sequence analysis, annotation, and
    illustration; restriction mapping; recombinant
    molecule design, including Gateway and TOPO
    cloning; in silico gel electrophoresis;
    molecular biology data management
  • AlignX: multiple sequence alignment of proteins
    and DNAs
  • ContigExpress: DNA sequence assembly and
    sequencing project management using the CAP3
    algorithm
  • GenomBench: analysis and annotation of reference
    genomic DNA sequences
  • BioAnnotator: functional annotation of DNAs and
    proteins

54
MATLAB BIOINFORMATICS TOOLBOX DEMO
http://www.mathworks.com/access/helpdesk/help/toolbox/bioinfo/index.html?/access/helpdesk/help/toolbox/bioinfo/ug/fp6010dup1.html
http://www.mathworks.com/access/helpdesk/help/toolbox/bioinfo/bioinfo_product_page.html