1
Bioinformatics Tools

Stuart M. Brown, Ph.D
Dept of Cell Biology
NYU School of Medicine

3
Overview

This lecture will summarize a huge amount of
bioinformatics material that is usually presented
as a full 12 week course.
Data management and analysis of sequences from
the HGP
A quick look at GenBank and ENTREZ.
Gene finding and translation
Similarity searching and alignment (BLAST)
Protein structure and function

4
Data Management and Analysis

The Human Genome Project has generated huge
quantities of DNA sequence data.
This data will lead to many medical advances.
But a great deal of analysis and research will be
needed.

Access to the Data

Organize the genome data and provide access for
scientists
Use the Internet
The data is public, so anyone can access it.

6
GenBank

All Genome Project data is stored in a database
called GenBank managed by the National Center for
Biotechnology Information (NCBI)
The NCBI is a branch of the National Library of
Medicine, which is part of the NIH (National
Institutes of Health).
http://ncbi.nlm.nih.gov

8
GenBank Sections

In addition to DNA sequences of genes, GenBank
has a number of other sections, including:
Protein sequences (translated from DNA)
Short mRNA-derived sequence fragments (ESTs)
Cancer Genome Anatomy Project (CGAP): gene
expression profiles of normal, pre-cancer, and
cancer cells from a wide variety of tissue types
Single Nucleotide Polymorphisms (SNPs), which
represent genetic variations in the human
population
Online Mendelian Inheritance in Man (OMIM): a
database of human genetic disorders

9
Finding Genes

GenBank contains approximately 13 billion bases
in 12 million sequence records (as of August
2001).
These billions of G, A, T, and C letters would be
almost useless without descriptions of what genes
they contain, the organisms they come from, etc.
All of this information is contained in the
"annotation" part of each sequence record.

11
Entrez is a Tool for Finding Sequences

NCBI has created a Web-based tool called Entrez
for finding sequences in GenBank.
Each sequence in GenBank has a unique accession
number.
Entrez can also search for keywords such as gene
names, protein names, and the names of organisms
or biological functions.
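
The two kinds of search described above (by keyword and by accession number) can also be expressed as query URLs. The sketch below assumes NCBI's E-utilities web interface, which these slides do not mention; the helper names and the example search term are made up for illustration, and the accession is the one from the BLAST hit shown in this lecture. Nothing is sent over the network.

```python
from urllib.parse import urlencode

# Assumed base address of NCBI's E-utilities (not part of the original slides).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(term, db="nucleotide", retmax=20):
    """Build (but do not send) a keyword search URL for one Entrez database."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS}/esearch.fcgi?{query}"

def efetch_url(accession, db="nucleotide"):
    """Build a URL that would retrieve one record, by accession number, as FASTA text."""
    query = urlencode({"db": db, "id": accession,
                       "rettype": "fasta", "retmode": "text"})
    return f"{EUTILS}/efetch.fcgi?{query}"

# Hypothetical keyword search: a gene name restricted to one organism.
print(esearch_url("insulin AND Bos taurus[Organism]"))
# Accession-number lookup.
print(efetch_url("BE588357.1"))
```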

13
Entrez has links to Medline

Entrez is much more than just a tool for finding
sequences by keywords.
It contains links to PubMed/Medline
Entrez also contains all known protein sequences
and 3-D protein structures.

15
Entrez is Internally Cross-linked

DNA and protein sequences are linked to other
similar sequences
Medline citations are linked to other citations
that contain similar keywords
3-D structures are linked to similar structures

17

These relationships might include genes in a
multi-gene family, related journal articles, or
other proteins in the same biochemical pathway
This potential for horizontal movement through
the linked databases makes Entrez a dynamic tool.
You can start with only a vague set of keywords
or a sequence from the laboratory and rapidly
access a set of relevant literature and related
database sequences.

18
Similarity Searching

There are a variety of computer programs that are
used for making comparisons between DNA
sequences.
The most popular is known as BLAST (Basic Local
Alignment Search Tool)
BLAST is free at the NCBI website

20
BLAST Searches GenBank

The NCBI BLAST web server lets you compare your
query sequence to various sections of GenBank:
nr: non-redundant (main sections)
month: new sequences from the past few weeks
ESTs
human, Drosophila, yeast, or E. coli genomes
proteins (by automatic translation)
This is a VERY fast and powerful computer.

21
BLAST is Complex

Similarity searching relies on the concepts of
alignment and distance between pairs of
sequences.
Distances can only be measured between aligned
sequences (match vs. mismatch at each position).
A similarity search is a process of testing the
best alignment of a query sequence with every
sequence in a database.
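
To make "match vs. mismatch at each position" concrete, here is a minimal sketch that counts matches, mismatches, and gaps in two already-aligned sequences and turns them into a simple distance (1 minus fraction identical). The two short sequences are invented for the example, and this simple distance is only one of several possible measures.

```python
def compare_aligned(a, b, gap="-"):
    """Count matches, mismatches and gap columns in two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != gap)
    gaps = sum(1 for x, y in zip(a, b) if gap in (x, y))
    mismatches = len(a) - matches - gaps
    return matches, mismatches, gaps

# Invented example alignment: one mismatch and one gap.
query = "ACGTTGCA-TGA"
sbjct = "ACGATGCAGTGA"

m, mm, g = compare_aligned(query, sbjct)
identity = m / len(query)   # fraction of alignment columns that match
distance = 1 - identity     # a simple distance measure
print(m, mm, g)             # prints 10 1 1
print(round(identity, 3), round(distance, 3))
```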

22
Search with Protein not DNA

1) 4 DNA bases vs. 20 amino acids - less random
similarity
2) Can have varying degrees of similarity between
different AAs
- number of mutations, chemical similarity, PAM matrix
3) Protein databanks are much smaller than DNA
databanks.

23
BLAST has Automatic Translation

BLASTX makes automatic translation (in all 6
reading frames) of your DNA query sequence to
compare with protein databanks
TBLASTN makes automatic translation of an entire
DNA database to compare with your protein query
sequence
Only make a DNA-DNA search if you are working
with a sequence that does not code for protein.
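
The "6 reading frames" are three codon offsets on the given strand plus three on its reverse complement. The sketch below only produces the six nucleotide frames, each trimmed to whole codons; each frame would then be translated with the genetic code. The function names and the short test sequence are illustrative, not taken from BLAST itself.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(dna):
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]

def six_frames(dna):
    """Return {frame name: sequence trimmed to whole codons} for +1..+3 and -1..-3."""
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            sub = strand[offset:]
            sub = sub[: len(sub) - len(sub) % 3]   # drop a trailing partial codon
            frames[f"{sign}{offset + 1}"] = sub
    return frames

# Invented 12-base example sequence.
for name, seq in six_frames("ATGGCCAAGTGA").items():
    print(name, seq)
```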

>gb|BE588357.1|BE588357 194087 BARC 5BOV Bos taurus cDNA 5'.
Length = 369
Score = 272 bits (137), Expect = 4e-71
Identities = 258/297 (86%), Gaps = 1/297 (0%)
Strand = Plus / Plus
Query: 17  aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca 76
Sbjct: 1   aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc 59
Query: 77  agcaagggcttgcaggacctgaagcaacaggtggaggggaccgcccaggaagccgtgtca 136
Sbjct: 60  agcaagggcttgcaggacctgaagaagcaagtggagggggcggcccaggaagcggtgaca 119
Query: 137 gcggccggagcggcagctcagcaagtggtggaccaggccacagaggcggggcagaaagcc 196
Sbjct: 120 tcggccggaacagcggttcagcaagtggtggatcaggccacagaagcagggcagaaagcc 179
Query: 197 atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct 256

25
Understand the Statistics!

BLAST produces an E-value for every match.
This is closely related to the P value in a
statistical test: it is the number of matches this
good expected by chance, and for small values the
two are nearly equal.
A match is generally considered significant if
the E-value < 0.05 (smaller numbers are more
significant).
Very low E-values (e-100) are homologs or
identical genes.
Moderate E-values are related genes.
Long regions of moderate similarity are more
important than short regions of high identity.
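
A rough sketch of the statistics, assuming the standard relationships E = m * n * 2^(-S') for a bit score S', query length m and database length n, and P = 1 - exp(-E). The query length of 300 is made up; the database size is the 13 billion bases quoted for GenBank earlier in this lecture. Real BLAST uses corrected (effective) lengths, so its reported values will differ.

```python
import math

def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least this well: m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)

def p_value(e_value):
    """Probability of at least one such chance hit: 1 - exp(-E)."""
    return -math.expm1(-e_value)

# E and P diverge for large E but are nearly equal once E is small.
for e in (10.0, 1.0, 0.05, 1e-5):
    print(e, p_value(e))

# Hypothetical search: 300-base query, 13 billion base database, 50-bit hit.
print(expect_value(50, 300, 13e9))
```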

26
BLAST is Approximate

BLAST makes similarity searches very quickly
because it takes shortcuts.
looks for short, nearly identical words (11
bases)
It also makes errors
misses some important similarities
makes many incorrect matches
easily fooled by repeats or skewed composition
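
The word shortcut can be sketched as follows: index every 11-base word of one sequence, then look up each word of the other to find exact "seed" matches that a fuller alignment would be grown from. This is a simplified illustration with invented function names, not NCBI's implementation (which, among other things, extends and scores the seeds). The two test sequences are the first 30 bases of the query and subject in the alignment shown in this lecture.

```python
from collections import defaultdict

def index_words(seq, k=11):
    """Map every k-letter word in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def find_seeds(query, subject, k=11):
    """Return (query_pos, subject_pos) pairs where an identical k-letter word occurs."""
    index = index_words(subject, k)
    return [(q, s)
            for q in range(len(query) - k + 1)
            for s in index.get(query[q:q + k], [])]

query = "aggatccaacgtcgctccagctgctcttga"
sbjct = "aggatccaacgtcgctgcggctacccttaa"
print(find_seeds(query, sbjct))
```

Because only exact words are looked up, a true similarity with no run of 11 identical bases produces no seed at all, which is one way important matches get missed.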

27
Bad Genome Annotation

Gene finding is at best only 90% accurate.
New sequences are automatically annotated with
BLAST scores.
Bad annotations propagate
It's going to take us 10-20 years or more to sort
this mess out!

28
Protein Function

The ultimate goal of the HGP is to identify all
of the genes and determine their functions
Genes function by being translated into proteins
structural
enzymes
regulatory
signalling

29
Translation

Once we have found the DNA sequence of a gene, we
can decode the amino acid sequence of the
corresponding protein.
The Genetic Code is actually quite simple.
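
As a minimal sketch of how simple the decoding is, the standard genetic code fits in one 64-letter string (codons ordered with bases T, C, A, G; "*" marks a stop). The function name and the short test sequence are illustrative, and only the standard code is assumed, not the variant codes used by some organelles and organisms.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TTT, TTC, TTA, TTG, TCT, ... order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna):
    """Translate a coding DNA sequence from its first base until the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")   # X for codons with unknown bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGGCCAAGTGA"))  # prints MAK
```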

30
Chemical Properties

Some chemical properties of a protein can be
calculated from its amino acid sequence
molecular weight
charge/pH
hydrophobicity
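
A small sketch of the first two calculations. Molecular weight is the sum of average residue masses plus one water; the charge estimate is deliberately crude (near neutral pH, +1 for each Lys or Arg and -1 for each Asp or Glu, ignoring His, the chain ends, and real pKa values). The function names and the test peptides are made up for the example.

```python
# Average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.02

def molecular_weight(protein):
    """Approximate molecular weight of a peptide in daltons."""
    return sum(RESIDUE_MASS[aa] for aa in protein) + WATER

def approx_net_charge(protein):
    """Crude net charge near pH 7: count of K and R minus count of D and E."""
    positive = sum(protein.count(aa) for aa in "KR")
    negative = sum(protein.count(aa) for aa in "DE")
    return positive - negative

print(round(molecular_weight("MAK"), 2))   # prints 348.46
print(approx_net_charge("MAKDE"))          # prints -1
```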

31
Patterns in Proteins
32
Conserved Domains

Proteins are built out of functional units known
as domains (or motifs)
These domains have conserved sequences
Often much more similar than their respective
proteins
Exon splicing theory (W. Gilbert)
Exons correspond to folding domains which in
turn serve as functional units
Unrelated proteins may share a single similar
exon (e.g., ATPase or DNA binding function)

33
Simple Structures

Some motifs form structures that can be
recognized as simple sequence patterns
transmembrane domains
coiled coils
helix-turn-helix
signal peptides
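
Transmembrane domains are a good example: they show up as long runs of hydrophobic residues. The sketch below averages Kyte-Doolittle hydropathy values over a sliding window and flags windows above a cutoff. The window of 19 residues and cutoff of 1.6 are a commonly used rule of thumb, assumed here rather than taken from these slides, and the test sequence is invented.

```python
# Kyte-Doolittle hydropathy scale (positive = hydrophobic).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def hydropathy_profile(protein, window=19):
    """Average hydropathy of every window of the given length along the protein."""
    scores = [KD[aa] for aa in protein]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

def candidate_tm_windows(protein, window=19, threshold=1.6):
    """Start positions of windows hydrophobic enough to suggest a membrane span."""
    profile = hydropathy_profile(protein, window)
    return [i for i, avg in enumerate(profile) if avg > threshold]

# Invented sequence: a 19-residue hydrophobic stretch flanked by charged residues.
toy = "KKDEKRDE" + "LLIVALLVAILLVALIVLL" + "RKDEEKRK"
print(candidate_tm_windows(toy))
```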

34
Functional Motifs

Other functional portions of proteins can be
recognized by their sequence, even if their 3-D
structure is not known.
There are many databases of protein
motifs/domains: ProSite, Pfam, ProDom, etc.

35
Tools for Finding Motifs

Define a motif from a set of known proteins that
share a similar sequence and function.
A pattern is a list of amino acids that can occur
at each position in the motif.
A profile is a matrix that assigns a value to
every amino acid at every position in the motif.
An HMM (hidden Markov model) is a more complex
profile that also scores the transitions between
neighboring positions (pairs of amino acids).
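
The difference between a pattern and a profile can be sketched in a few lines. The pattern is the PROSITE N-glycosylation site N-{P}-[ST]-{P}, written as a regular expression. The profile here is a simple table of residue frequencies per column, built from three invented aligned segments; real profiles use log-odds scores and handle gaps, so treat this as an illustration only.

```python
import re
from collections import Counter

# Pattern: N, then anything but P, then S or T, then anything but P.
# The lookahead lets overlapping sites be reported.
N_GLYC = re.compile(r"(?=N[^P][ST][^P])")

def find_pattern(protein):
    """Start positions (0-based) of every match to the pattern."""
    return [m.start() for m in N_GLYC.finditer(protein)]

def build_profile(aligned):
    """One dict of residue frequencies per column of an ungapped alignment."""
    n = len(aligned)
    return [{aa: count / n for aa, count in Counter(column).items()}
            for column in zip(*aligned)]

def score(profile, segment):
    """Sum, over positions, of the frequency of the residue seen there (0 if never seen)."""
    return sum(column.get(aa, 0.0) for column, aa in zip(profile, segment))

print(find_pattern("MKNGSAANPTQNVTL"))   # prints [2, 11]

profile = build_profile(["NGSA", "NVTL", "NGTA"])   # invented training segments
print(score(profile, "NGTA"), score(profile, "NVSL"), score(profile, "KKKK"))
```

A pattern either matches or it does not, while the profile gives partial credit: a segment made of the less common residues at each position still scores above an unrelated one.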

37
Protein 3-D Structure
38
Structure = Function

Proteins function by 3-D interactions with other
molecules (i.e. physical chemistry).
So for a protein, 3-D structure is function.
But we can't accurately determine 3-D structure
from gene sequence.

39
Structure Prediction

Predicting a protein's 3-D structure from its
amino acid sequence is incredibly complex.
proteins are polypeptides (long chains of amino
acids)
can fold and rotate around bonds within each
amino acid as well as the bonds between them
it is not possible to evaluate every possible
folding pattern for an amino acid sequence
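
A back-of-the-envelope calculation shows why exhaustive evaluation is out of reach. The numbers are assumptions chosen only for illustration: 3 possible conformations per residue, a 100-residue protein, and a computer that checks one trillion folds per second.

```python
def conformations(n_residues, states_per_residue=3):
    """Number of distinct folds if each residue independently takes one of a few states."""
    return states_per_residue ** n_residues

def years_to_enumerate(n_residues, per_second=1e12, states_per_residue=3):
    """Years needed to visit every fold at the given (hypothetical) rate."""
    seconds = conformations(n_residues, states_per_residue) / per_second
    return seconds / (365.25 * 24 * 3600)

print(f"{conformations(100):.3e} possible folds")
print(f"{years_to_enumerate(100):.3e} years to try them all")
```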

40
Secondary Structure

The local structure of the amino acids in a
protein can also be predicted to some extent.
Each amino acid has a tendency to form either an
alpha helix or a beta sheet

41
Threading

Rather than computing a 3-D structure from
scratch, it may be possible to find a similar
structure.
Must have 25% aa sequence identity.
Uses a process called threading to create a new
structure based on a known structure.
This still requires HUGE amounts of computer
power.

43
Protein Data Bank

There is a database of all known protein
structures called the PDB.
These have been determined by X-ray
crystallography and/or NMR.
Anyone can download and view these structures
with a PDB viewer program.

44
RasMol

RasMol is the simplest PDB viewer.
http://www.umass.edu/microbio/rasmol/
It can work together with a web browser to let
you view the structure of any sequence found with
Entrez that has a known 3-D structure.

45
Gene Finding and Translation