Introduction to Bioinformatics 20120

1
Introduction to Bioinformatics 20120
  • Gianluca Pollastri
  • Office: CS A1.07
  • Email: gianluca.pollastri_at_ucd.ie

2
Credits
  • Richard Lathrop and Pierre Baldi's Bioinformatics
    courses at the University of California, Irvine.

3
Course overview
  • Context: DNA, RNA, proteins.
  • Resources: GenBank, PDB, etc.
  • Algorithms for sequence comparison.
  • Phylogenetics.
  • Structural bioinformatics: protein structure
    prediction.

4
Lecture notes
  • http://gruyere.ucd.ie/2007_courses/20120/
  • confidential..

5
Recommended/useful readings
  • No book is actually required
  • Introduction to Bioinformatics (Lesk)
  • Introduction to Computational Molecular Biology
    (Setubal, Meidanis)
  • Bioinformatics: The Machine Learning Approach
    (Baldi, Brunak)

6
  • CS 20120, Introduction to Bioinformatics
  • Assignment 1, 29 January 2007
  • 10% of the overall mark
  • To hand in by midnight of February 12
  • 1. Identify your favourite pet.
  • 2. Get the protein sequence for one of its genes
    on
    a. http://www.ncbi.nlm.nih.gov/entrez/
  • 3. BLAST your sequence against UniProt at
    a. http://www.ebi.ac.uk/blast2/index.html?UniProt
  • 4. If you get fewer than 6 results from 6
    different organisms, go back to 2 and choose
    another protein.
  • 5. Select 6 sequences returned by BLAST, from 6
    different organisms (ticking the appropriate
    boxes and downloading them in FASTA format will
    give you the right input format for the next
    step).
  • 6. Run ClustalW on them using the page (be
    patient, it might take time)
    a. http://www.ebi.ac.uk/clustalw/index.html
  • 7. Draw a phylogenetic tree for your guide tree
    (.dnd) using an online viewer, e.g.
    a. http://bioweb.pasteur.fr/seqanal/interfaces/drawtree.html
  • 8. Email me (gianluca.pollastri_at_ucd.ie)
    a. your protein sequence UniProt record
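
Step 5 relies on the downloaded sequences being in FASTA format. As a rough illustration of what that format looks like to a program, here is a minimal sketch of a FASTA reader in plain Python. The function name `parse_fasta` and the two toy records are made up for this example; they are not real BLAST output.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; join them back together.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input (hypothetical records, for illustration only).
example = """>seq1 hypothetical protein, organism A
MKTAYIAK
QRQISFVK
>seq2 hypothetical protein, organism B
MKTAYLAK
"""

records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```

Counting the records this way is one simple check that a file really holds the 6 sequences the assignment asks for before it is submitted to ClustalW.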

7
DNA summary
  • A string of 4 letters (vs the 20 of proteins).
  • Very, very long, packed several times over: first
    as a double helix, then into more complicated
    shapes.
  • The double helix allows DNA to make copies of
    itself.
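
The copying property comes from base pairing: A pairs with T and C pairs with G, so one strand determines the other. A minimal sketch of that idea, assuming plain upper- or lower-case A/C/G/T strings (the function name is chosen for this example only):

```python
# Watson-Crick pairing: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand):
    """Return the opposite strand, read in its own 5'->3' direction."""
    return strand.upper().translate(COMPLEMENT)[::-1]


strand = "ATGC"
partner = reverse_complement(strand)
print(strand, partner)
# Applying the operation twice gives back the original strand.
print(reverse_complement(partner) == strand)
```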

8
How this works summary
  • DNA -> RNA -> protein, through transcription and
    translation
  • The above is true of genes in DNA. Lots of DNA
    does not make proteins, hence does something else
    (or nothing)
  • Telling where genes are is simple in prokaryotes,
    more complicated in eukaryotes where introns and
    exons kick in.
  • Not all genes are active in every cell at any
    given time.
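
The DNA -> RNA -> protein flow can be sketched in a few lines. This is only an illustration under simplifying assumptions: it uses the standard genetic code, treats the input as the coding strand, starts at the first base, ignores introns and regulation, and stops at the first stop codon. The function names are chosen for this example.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
# (third position varying fastest); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: same letters, with T replaced by U."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Read mRNA three letters at a time until a stop codon or the end."""
    dna_like = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna_like) - 2, 3):
        amino_acid = CODON_TABLE[dna_like[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTAA")
print(mrna, translate(mrna))
```

Letters other than A, C, G, T/U raise a `KeyError` here; a real tool would need to handle ambiguity codes.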

9
Evolution
  • Genotype vs phenotype
  • Genotype changes:
    • Random mutations
    • Recombination
    • Gene duplication
    • Gene flow
  • Phylogenetics: reverse engineering evolution.
  • Molecular phylogenetics: the above, based on DNA
    and protein sequences.
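
One very simple way to turn sequences into evolutionary evidence is to count how often two aligned sequences differ. The sketch below computes that fraction of mismatches (often called the p-distance) for already-aligned sequences. It is a toy illustration, not the method used by any particular tree-building program, and the three sequences are invented.

```python
def p_distance(a, b):
    """Fraction of aligned, non-gap positions at which a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Skip columns where either sequence has a gap character.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(x != y for x, y in pairs) / len(pairs)


# Invented aligned sequences, for illustration only.
aligned = {
    "species_A": "ACGTACGT",
    "species_B": "ACGTACGA",
    "species_C": "ACCTAGGA",
}

names = sorted(aligned)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        print(first, second, p_distance(aligned[first], aligned[second]))
```

Pairs with smaller distances would be placed closer together in a tree built from such a table.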

10
Public resources
  • A gigantic amount of information is now
    available.
  • Most of it is publicly available.
  • Much of it is raw, i.e. unannotated or only
    mildly annotated.

12
Nucleic acid (DNA/RNA) sequence databases
  • One main database arising from a partnership
    between GenBank at the NCBI (National Center for
    Biotechnology Information, USA), the EMBL data
    library at the EBI (European Bioinformatics
    Institute, UK) and the DNA Data Bank of Japan
    (DDBJ) at the NIG (National Institute of
    Genetics, Japan).
  • Daily exchanges between the 3 partners keep
    the databases synchronised.
  • DNA and RNA sequences are curated, archived and
    distributed.
  • Sequences come from genome projects, scientific
    articles and patent applications. Most scientific
    journals require DNA and RNA sequences related to
    each publication to be publicly available.
  • Sequences are deposited early and go through a
    review cycle: unannotated, preliminary,
    unreviewed, standard.
  • Format: human and computer readable.

14
  LOCUS       NM_205854                462 bp    mRNA    linear   PRI 12-JAN-2007
  DEFINITION  Homo sapiens surfactant associated protein G (SFTPG), mRNA.
  ACCESSION   NM_205854 XM_371808
  VERSION     NM_205854.2  GI:122056695
  KEYWORDS    .
  SOURCE      Homo sapiens (human)
    ORGANISM  Homo sapiens
              Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
              Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
              Catarrhini; Hominidae; Homo.
  REFERENCE   1  (bases 1 to 462)
    AUTHORS   Zhang,Z. and Henzel,W.J.
    TITLE     Signal peptide prediction based on analysis of experimentally
              verified cleavage sites
    JOURNAL   Protein Sci. 13 (10), 2819-2824 (2004)
     PUBMED   15340161
  REFERENCE   2  (bases 1 to 462)
    AUTHORS   Clark,H.F., Gurney,A.L., Abaya,E., Baker,K., Baldwin,D., Brush,J.,
              Chen,J., Chow,B., Chui,C., Crowley,C., Currell,B., Deuel,B.,
15
  FEATURES             Location/Qualifiers
       source          1..462
                       /organism="Homo sapiens"
                       /mol_type="mRNA"
                       /db_xref="taxon:9606"
                       /chromosome="6"
                       /map="6p21.33"
       gene            1..462
                       /gene="SFTPG"
                       /note="surfactant associated protein G; synonyms: UNQ541,
                       GSGL541"
                       /db_xref="GeneID:389376"
                       /db_xref="HGNC:18386"
                       /db_xref="HPRD:18265"
       CDS             80..316
                       /gene="SFTPG"
                       /codon_start=1
                       /product="surfactant associated protein G precursor"
                       /protein_id="NP_995326.1"
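
The CDS location in the record above (80..316) says which bases of the mRNA encode the protein. As a small worked example of reading such a feature, the sketch below parses a simple `start..end` location and works out how many codons it spans. It only handles this plain form (no `join(...)` or `complement(...)`), and the function names are chosen for this example.

```python
import re


def cds_span(location):
    """Parse a simple 'start..end' feature location (1-based, inclusive)."""
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location.strip())
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    return int(match.group(1)), int(match.group(2))


def cds_length(location):
    """Number of bases covered by the location, both ends included."""
    start, end = cds_span(location)
    return end - start + 1


length = cds_length("80..316")
codons = length // 3
print(length, length % 3, codons)
```

For 80..316 this gives 237 bases, i.e. 79 complete codons. Assuming, as is usual for GenBank CDS features, that the last codon is the stop codon, the precursor protein would be 78 residues long.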

16
Protein sequence databases
  • UniProt results from the merger of the PIR
    (Protein Information Resource) at the National
    Biomedical Research Foundation at the Georgetown
    University Medical Center (DC, USA), SWISS-PROT
    at the Swiss Institute of Bioinformatics in
    Geneva and TrEMBL at the EBI.
  • SWISS-PROT is human-annotated, TrEMBL is EMBL
    translated (PIR is smaller).
  • TrEMBL makes up more than 90% of UniProt, but is
    somewhat hypothetical: its proteins are inferred
    from genes in DNA, and many of those genes might
    not actually make proteins or be genes at all.
  • Smaller, specialised databases are associated
    with UniProt: ENZYME DB, PROSITE, etc.

21
Databases of structures
  • The main database of protein structures is the
    PDB (Protein Data Bank).
  • The PDB started in 1971 at Brookhaven National
    Labs (NY, USA) and is now a distributed
    organisation (Research Collaboratory for
    Structural Bioinformatics, www.rcsb.org) of US
    partners (Rutgers, NJ; San Diego Supercomputer
    Center, CA; NIST, MD).
  • The PDB includes protein structures (and a few
    DNA and other structures) determined by X-ray
    crystallography and Nuclear Magnetic Resonance.
  • It used to include a few structures predicted
    computationally, but these have recently been
    dropped.

22
  HEADER    ANTIBACTERIAL PROTEIN         17-DEC-82   1ACX        1ACX    3
  COMPND    ACTINOXANTHIN                                         1ACX    4
  SOURCE    (ACTINOMYCES GLOBISPORUS, NUMBER 1131)                1ACXD   1
  AUTHOR    V.Z.PLETNEV,A.P.KUZIN                                 1ACX    6
  REVDAT   7   31-JAN-94 1ACXF   1   REMARK                       1ACXF   1
  REVDAT   6   15-OCT-90 1ACXE   1   REMARK                       1ACXE   1
  REVDAT   5   16-APR-87 1ACXD   1   SOURCE REMARK                1ACXD   2
  REVDAT   4   25-APR-86 1ACXC   1   REMARK                       1ACXC   1
  REVDAT   3   28-FEB-84 1ACXB   1   REMARK                       1ACXB   1
  REVDAT   2   30-SEP-83 1ACXA   1   REVDAT                       1ACXA   1
  REVDAT   1   09-MAR-83 1ACX    0                                1ACXA   2
  REMARK   1                                                      1ACX    7
  REMARK   1 REFERENCE 1                                          1ACX    8
  REMARK   1  AUTH   V.Z.PLETNEV,A.P.KUZIN,L.V.MALININA           1ACX    9
  REMARK   1  TITL   ACTINOXANTHIN STRUCTURE AT THE ATOMIC LEVEL  1ACXB   2
  REMARK   1  TITL 2 (RUSSIAN)                                    1ACX   11
  REMARK   1  REF    BIOORG.KHIM. V. 8 1637 1982                  1ACXB   3
  REMARK   1  REFN   ASTM BIKHD7 UR ISSN 0132-3423 364            1ACXF   2
  REMARK   1 REFERENCE 2                                          1ACXB   4

23
  ATOM      1  N   ALA     1       9.484  -7.014   7.366  1.00  1.00      1ACX 115
  ATOM      2  CA  ALA     1       8.411  -6.863   6.372  1.00  1.00      1ACX 116
  ATOM      3  C   ALA     1       7.066  -6.704   7.086  1.00  1.00      1ACX 117
  ATOM      4  O   ALA     1       6.627  -5.580   7.375  1.00  1.00      1ACX 118
  ATOM      5  CB  ALA     1       8.300  -8.147   5.531  1.00  1.00      1ACX 119
  ATOM      6  N   PRO     2       6.433  -7.831   7.359  1.00  1.00      1ACX 120
  ATOM      7  CA  PRO     2       5.128  -7.820   8.039  1.00  1.00      1ACX 121
  ATOM      8  C   PRO     2       5.252  -7.094   9.380  1.00  1.00      1ACX 122
  ATOM      9  O   PRO     2       6.283  -7.190  10.062  1.00  1.00      1ACX 123
  ATOM     10  CB  PRO     2       4.772  -9.269   8.170  1.00  1.00      1ACX 124
  ATOM     11  CG  PRO     2       6.091 -10.022   8.162  1.00  1.00      1ACX 125
  ATOM     12  CD  PRO     2       7.000  -9.172   7.275  1.00  1.00      1ACX 126
  ATOM     13  N   ALA     3       4.202  -6.375   9.735  1.00  1.00      1ACX 127
  ATOM     14  CA  ALA     3       4.192  -5.622  10.994  1.00  1.00      1ACX 128
  ATOM     15  C   ALA     3       2.765  -5.553  11.542  1.00  1.00      1ACX 129
  ATOM     16  O   ALA     3       1.796  -5.402  10.781  1.00  1.00      1ACX 130
  ATOM     17  CB  ALA     3       4.623  -4.168  10.734  1.00  1.00      1ACX 131
  ATOM     18  N   PHE     4       2.655  -5.665  12.854  1.00  1.00      1ACX 132
  ATOM     19  CA  PHE     4       1.340  -5.621  13.509  1.00  1.00      1ACX 133
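
Each ATOM record carries an atom name, a residue name and number, and x, y, z coordinates in Angstrom. As a rough illustration of using them, the sketch below reads the two CA records for residues 1 and 2 from the 1ACX listing above and computes the distance between them. It splits on whitespace, which happens to work for these records; that is an assumption of this example, since PDB files are defined by fixed column positions.

```python
import math

# Two CA records taken from the 1ACX listing (trailing ID columns dropped).
RECORDS = """\
ATOM 2 CA ALA 1 8.411 -6.863 6.372 1.00 1.00
ATOM 7 CA PRO 2 5.128 -7.820 8.039 1.00 1.00
"""


def parse_atom(line):
    """Split one whitespace-separated ATOM record into named fields."""
    fields = line.split()
    return {
        "name": fields[2],
        "residue": fields[3],
        "res_seq": int(fields[4]),
        "xyz": tuple(float(value) for value in fields[5:8]),
    }


atoms = [parse_atom(line) for line in RECORDS.splitlines()
         if line.startswith("ATOM")]
ca_ala, ca_pro = atoms

# Euclidean distance between the two CA atoms, in Angstrom.
distance = math.dist(ca_ala["xyz"], ca_pro["xyz"])
print(f"{distance:.2f}")
```

The result is about 3.8 Å, the typical spacing between CA atoms of consecutive amino acids.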

24
X-ray crystallography
  • Crystallography is very time- and
    resource-consuming.
  • For this reason, we know the structure of
    only about 40,000 proteins (while we know the
    sequence of nearly 4,000,000 of them).
  • Crystallography might produce high-resolution
    results or not, depending on a number of factors.
    The quality of a structure is expressed by a
    number of indexes, including:
    • Resolution, in Angstrom (1 Å = 10^-10 m). Less
      than 2.5 Å is OK, more is not ideal. (For
      example's sake, the distance between the main
      carbon atoms of two consecutive amino acids is
      about 4 Å.)
    • B-factors: a local measure of
      quality/uncertainty.

25
Nuclear Magnetic Resonance (NMR)
  • Significantly faster than crystallography, but:
    • it cannot deal with long proteins (although it
      is getting better at that)
    • rather than a single structure, it produces a
      bundle of potential structures, more or less
      similar
    • it doesn't provide an explicit measure of
      quality
    • implicitly, the level of similarity between
      the different structures in the bundle provides
      a quality index (very similar structures imply
      good reliability, and vice versa)
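
One common way to put a number on how similar two structures in a bundle are is the root-mean-square deviation (RMSD) between matching atoms. The sketch below is a simplified illustration: it assumes the two models list the same atoms in the same order and are already superimposed, and it skips the optimal-superposition step that a real comparison would need. The coordinates are invented.

```python
import math


def rmsd(model_a, model_b):
    """Root-mean-square deviation between two equal-length coordinate lists."""
    if len(model_a) != len(model_b) or not model_a:
        raise ValueError("models must be non-empty and the same length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(model_a, model_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(model_a))


# Invented coordinates (Angstrom) for three atoms in two models.
model_1 = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model_2 = [(1.0, 0.0, 0.0), (4.8, 0.0, 0.0), (8.6, 0.0, 0.0)]

print(rmsd(model_1, model_1))
print(rmsd(model_1, model_2))
```

A low average RMSD across all pairs of models in the bundle would correspond to the "very similar structures" case described above.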

31
Public tools summary
  • DNA/RNA databases: GenBank/EMBL (3-way
    consortium)
  • Protein sequences: UniProt (SWISS-PROT, TrEMBL,
    etc.)
  • Protein structures: Protein Data Bank (PDB)
  • A massive number of portals, servers, boutique
    databases, etc., some good, some a bit less so.
  • To sort out the above, a lot of servers
    benchmark other servers (e.g. EVA, LiveBench,
    etc.)