Biological Networks CSci 732: Introduction to Bioinformatics - PowerPoint PPT Presentation

Biological Networks CSci 732: Introduction to Bioinformatics Anne Denton Assistant Professor Department of Computer Science North Dakota State University, Fargo, ND – PowerPoint PPT presentation



1
Biological Networks
CSci 732: Introduction to Bioinformatics
  • Anne Denton
  • Assistant Professor
  • Department of Computer Science
  • North Dakota State University, Fargo, ND

2
The Promise of Bioinformatics
  • Rich supply of data from high-throughput
    experiments
  • Sequencing
  • Microarray experiments
  • Scores of specialized high-throughput experiments
  • Data made available
  • Most biological data disseminated in large
    biological databases
  • Dissemination is a condition of funding (in US)
  • Makes biology different from most other
    disciplines!

3
Challenge and Opportunity
  • Traditional approach
  • Studies of one or a few genes at a time
  • Conclusions based on thorough domain knowledge
  • Challenge
  • Standard data analysis techniques inadequate for
    hundreds or thousands of genes
  • Opportunity
  • Massive data valuable to quantify evidence
  • New aspects can be studied, such as network
    structure

4
From Sequencing to Functional Genomics
  • Sequencing genomes is a mature discipline
    experimentally and computationally
  • Whole-genome sequencing centrally involved
    computations
  • But what do the proteins do?
  • Sequence comparison (BLAST) among species
  • Great computational and experimental challenges
  • Networks have a fundamental role in specifying
    relationships

5
Computer Scientist's Introduction to Genomics
  • Encoding
  • Smallest unit of information
  • Computer: 1 bit (0 or 1)
  • Cell: 1 nucleotide of DNA (A, C, T, or G)
  • Most practical unit of information
  • Computer: 8 bits are 1 byte
  • (can represent 256 values)
  • Cell: 3 nucleotides determine 1 amino acid of the
    protein
  • (20 different amino acids)
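
The counting behind this analogy can be checked in a few lines of Python. This is an illustrative sketch that is not part of the original slides; the variable names are made up for the example.

```python
import math
from itertools import product

NUCLEOTIDES = "ACGT"

# One nucleotide distinguishes 4 values, so it carries log2(4) bits.
bits_per_nucleotide = math.log2(len(NUCLEOTIDES))

# A codon is an ordered triple of nucleotides.
codons = ["".join(triple) for triple in product(NUCLEOTIDES, repeat=3)]

print(bits_per_nucleotide)  # 2.0
print(len(codons))          # 64
print(2 ** 8)               # 256 values in one byte
```

Since 64 possible codons encode only 20 amino acids, several codons map to the same amino acid.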

6
Further Analogy between Cells and Computers
  • Encoded values can serve very different purposes
    and are stored in the same location
  • Computer: Instructions and data are stored in the
    same memory (von Neumann architecture)
  • Cell: Proteins in the cell do very different
    things
  • Catalyze chemical reactions (proteins are part of
    the process)
  • Regulate other proteins (proteins change the
    process)

7
Networks in Bioinformatics
  • Many network definitions
  • Protein-protein interactions
  • Biochemical pathways
  • Annotation Networks
  • Different definitions in each category

Scientific American 05/03
8
Why Study Networks?
  • Biochemical pathways tell us about functioning of
    cell
  • Chemical processes (Metabolic pathways)
  • Control of other proteins (Regulatory pathways)
  • Neighbors in networks often have similar function
  • Structure of networks can tell us about evolution
  • Combined study of networks and data can uncover
    yet more information about cells

9
Outline
  • Part 1: Properties of the Networks
  • Scale-free networks
  • Part 2: Networks and data
  • Relational data mining
  • Problem: Similarity between network neighbors
  • Solution: Focus on differences between neighbors
  • Comparison of different networks

10
Example of a Biological Network: Physical
Protein-Protein Interactions
  • Proteins interact (attach to each other)
  • Tests if proteins are stable in a close position
  • Proteins may perform function together (not
    tested)
  • Mathematically: Undirected graph
  • Only one definition of interactions between
    proteins? No!
  • Definition based on function: Genetic
  • Definition based on evolution: Domain fusion

11
Physical Protein-Protein Interactions in Yeast
Scientific American 05/03
12
Scale-free Networks
  • Properties
  • Barabasi, Bonabeau 1998
  • Some nodes have large number of links, most have
    only a few
  • Number of nodes that have a particular number of
    links decreases as a power law
  • Robust against accidental failures
  • Hubs with high connectivity

13
Power-law Behavior
  • Probability that any node is connected to k other
    nodes is proportional to $1/k^n$, with n between 2 and 3
  • For n = 2: a node with twice as many links is a
    quarter as likely

Scientific American 05/03
14
Robust against accidental failures
  • As many as 80% of nodes can fail in a scale-free
    network without breaking up the entire cluster
  • I.e., even with large number of random mutations
    in genes, unaffected proteins continue to work
    together
  • Note that random removal of nodes is unlikely to
    remove hubs
  • Removal of only a few hubs does break up network
    significantly
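
The contrast between random failures and hub removal can be illustrated on a toy graph. The sketch below uses a hub-and-spoke (star) graph, the most extreme case of one hub and many poorly connected nodes. It is not a real scale-free network, and the helper name is an assumption of this example.

```python
from collections import deque

def largest_component(adjacency, removed):
    """Size of the largest connected component after deleting `removed` nodes."""
    seen = set(removed)  # removed nodes are never visited
    best = 0
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:  # breadth-first search over the surviving nodes
            node = queue.popleft()
            size += 1
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best

# Toy graph: node 0 is the hub, nodes 1..9 are linked only to the hub.
star = {0: set(range(1, 10))}
for leaf in range(1, 10):
    star[leaf] = {0}

print(largest_component(star, removed={5}))  # 9: losing a leaf changes little
print(largest_component(star, removed={0}))  # 1: losing the hub isolates every node
```

Because nine of the ten nodes are leaves, a randomly chosen failure is most likely to hit a leaf, which is the intuition behind the robustness claim.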

15
Examples Outside Biology
  • Hyperlink structure of the Internet
  • Physical structure (routers and communication
    lines) of the Internet
  • Social networks
  • Airline system
  • Scientific papers connected by citations

16
Scale-free Network (Airline System)
Scientific American 05/03
17
Random Networks in Contrast
  • Links placed randomly
  • Mathematical model (Erdős and Rényi, 1959)
  • Example: Highway system
  • Bell-shaped (Poisson, similar to Gaussian)
    distribution of the number of links per node
    around a typical value
  • For a large number of links k, the probability of
    k links decreases exponentially
  • Very unlikely to have nodes with a very large
    number of links
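
To see how much faster the tail of a Poisson distribution falls off compared with a power law, the two can be compared numerically. This is a sketch under assumed parameters (a mean of 4 links per node and an exponent of 2.5), chosen only for illustration.

```python
import math

def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k links under a Poisson degree distribution."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

MEAN_DEGREE = 4.0  # assumed typical number of links per node
EXPONENT = 2.5     # assumed power-law exponent, between 2 and 3

# How much rarer is a node with 40 links than a node with 4 links?
poisson_ratio = poisson_pmf(40, MEAN_DEGREE) / poisson_pmf(4, MEAN_DEGREE)
power_law_ratio = (40 / 4) ** (-EXPONENT)

print(f"Poisson:   {poisson_ratio:.3e}")
print(f"Power law: {power_law_ratio:.3e}")
```

Under these assumptions the power law makes a highly connected node rare but plausible, while the Poisson model makes it practically impossible.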

18
Random Network (Highway System)
Scientific American 05/03
19
Reasons for the development of scale-free
networks
  • Networks grow over time
  • Older nodes have had longer to accumulate links
  • The most connected nodes in E.coli metabolic
    network have an early evolutionary history
  • Preferential attachment
  • "The rich get richer"
  • cf. many people hyper-link to Google
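
Growth combined with preferential attachment can be simulated in a few lines. The sketch below is a simplified model in which every new node adds exactly one link; that choice, the function name, and the network size are assumptions of this example, not details from the slides.

```python
import random
from collections import Counter

def grow_network(num_nodes: int, seed: int = 0):
    """Grow a network one node at a time; each new node links to an
    existing node chosen with probability proportional to its degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Every node appears here once per link it has, so a uniform draw
    # from this list is a degree-proportional draw over nodes.
    endpoints = [0, 1]
    for new_node in range(2, num_nodes):
        target = rng.choice(endpoints)
        edges.append((new_node, target))
        endpoints.extend([new_node, target])
    return edges

edges = grow_network(1000)
degree = Counter(node for edge in edges for node in edge)

print("Most connected nodes:", degree.most_common(3))
print("Nodes with a single link:", sum(1 for d in degree.values() if d == 1))
```

In runs of this kind the early nodes tend to end up as hubs, while most late arrivals keep a single link.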

20
"Small World" Property
  • Game "Six Degrees of Kevin Bacon": With acting in
    the same movie as links, most actors are connected
    within 6 steps
  • Internet pages are typically 19 clicks apart
  • Any two chemicals in a cell are only 3 reactions
    apart!
  • Small-world property is not limited to scale-free
    networks

21
Are Protein Networks Pure Scale-free Networks?
  • Clusters of tightly connected nodes
  • Example
  • Proteins that perform function together
  • Recovering scale free network
  • Clustering of nodes leads to groups that interact
    as scale-free networks

22
Summary of Scale-free Networks
  • Characterized by
  • Few hubs with a large number of edges
  • Many nodes with few edges
  • Show small-world property
  • Any node can be reached from any other in only a
    few steps
  • Ubiquitous in biology and outside

23
Has Everything Interesting Related to Networks
Been Done?
  • Graph theory is an old topic
  • Euler 1736
  • Work on scale-free networks has added to it
  • Surely now everything is done!

24
Protein-Protein Interaction Networks
  • Name: YAL003W
  • Function: Protein synthesis
  • Localization: Cytoplasm
  • Class: GTP/GDP-exchange factors
  • Complex: Translation complexes
  • Motif: PDOC00648
25
Outline (2) Data and Networks
  • Relational rather than graph-theoretic approach
    to mining of data on a graph
  • Problem of similarity between neighbors
  • New Algorithm
  • Focuses on differences
  • Comparison of Networks
  • Different definitions of Networks
  • Generalization of difference-based algorithm

26
Questions of Interest
  • How do data between nodes relate?
  • Are there typical patterns among interacting
    proteins?
  • Can we find relationships that are not yet known
    to biologists?
  • It is expected that proteins in the nucleus
    interact with other proteins in the nucleus
  • It is more surprising if proteins in the nucleus
    interact with proteins in the mitochondria
  • How do different networks compare?

27
Frequent Patterns in Tables: Association Rule
Mining
  • 1st step: Finding sets of items that are
    frequent
  • Originally: Items in shopping carts
  • Here: Properties of proteins
  • Support: Fraction of transactions in which the
    set of items occurs
  • 2nd step: Finding associations A → B
  • If we find set A, we are likely to find set B
  • Similar to correlation, but goes in one direction
    only
  • Confidence: Fraction of transactions that have
    A, which also have B
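
The two measures can be written down directly from these definitions. The following is a minimal sketch; the four protein annotation sets are hypothetical and serve only to show that confidence depends on the direction of the rule.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing A that also contain B."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical annotation sets, one per protein.
transactions = [
    {"nucleus", "transcription"},
    {"nucleus", "transcription"},
    {"nucleus"},
    {"cytoplasm"},
]

print(support(transactions, {"nucleus", "transcription"}))       # 0.5
print(confidence(transactions, {"transcription"}, {"nucleus"}))  # 1.0
print(confidence(transactions, {"nucleus"}, {"transcription"}))  # about 0.67
```

Both rules share the same support, but only transcription → nucleus holds with full confidence in this toy data.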

28
Relational Approach
Node Table (participates twice in the join, as instances 0. and 1.)
ORF        Annotations
YPR184W    cytoplasm
YER146W    cytoplasm
YNL287W    SensitivityTOaaaod
YBL026W    transcription, nucleus
YMR207C    nucleus
Edge Table
ORF (instance 0.)   ORF (instance 1.)
YPR184W             YER146W
YNL287W             YBL026W
YBL026W             YMR207C
29
Generalization
  • Works for any number of nodes
  • Even 2-node structure allows complex rules
  • Results become harder to interpret for many nodes
  • Small-world property: Any protein can be
    reached in 3-4 hops

30
Effect of Joining on Item Sets
31
ARM on Joined Tables
  • Protein names (key) used for joining but not in
    ARM
  • One node table participates multiple times
  • Items labeled to keep track of instance
  • Typical transactions
  • Simplest approach had been done
  • overemphasizes similarities

T1: 0.cyto, 0.Cond_pheno, 0.PDOC00030, 1.Hydrolases
T2: 0.Cond_pheno, 0.Transferases, 0.Polymerases, 1.Cond_pheno, 1.Transferases, 1.Polymerases
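
A sketch of how such labeled transactions could be produced by joining the edge table with two instances of the node table. It uses the ORFs from the Relational Approach slide. The function name is hypothetical, and treating each edge once in its listed orientation is an assumption of this example.

```python
node_table = {
    "YPR184W": ["cytoplasm"],
    "YER146W": ["cytoplasm"],
    "YNL287W": ["SensitivityTOaaaod"],
    "YBL026W": ["transcription", "nucleus"],
    "YMR207C": ["nucleus"],
}
edge_table = [
    ("YPR184W", "YER146W"),
    ("YNL287W", "YBL026W"),
    ("YBL026W", "YMR207C"),
]

def joined_transactions(nodes, edges):
    """One transaction per edge; items carry the instance label 0. or 1.
    The ORF names are used for joining only and are not kept as items."""
    transactions = []
    for orf0, orf1 in edges:
        items = [f"0.{a}" for a in nodes[orf0]] + [f"1.{a}" for a in nodes[orf1]]
        transactions.append(items)
    return transactions

for items in joined_transactions(node_table, edge_table):
    print(", ".join(items))
# 0.cytoplasm, 1.cytoplasm
# 0.SensitivityTOaaaod, 1.transcription, 1.nucleus
# 0.transcription, 0.nucleus, 1.nucleus
```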
32
Problem with Naive Implementation
  • Many rules that are not interesting
  • Rules that reflect similarity between neighbors
  • 0.nucleus → 1.nucleus
  • Rules involving only one protein
  • 0.transcription → 0.nucleus
  • Note: Different support and confidence compared
    with ARM on node relation (protein data ignoring
    network)
  • Rules that are a consequence of both above
  • 0.transcription → 1.nucleus

33
Algorithm
Unique Operation
TID   Items of node 0              Items of node 1
1     0.cytoplasm                  1.cytoplasm
2     0.metabolism, 0.nucleus      1.transcription, 1.nucleus
3     0.transcription, 0.nucleus   1.nucleus
Final Transactions
TID   Items of node 0              Items of node 1
2     0.metabolism                 1.transcription
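
The table above can be reproduced with a short sketch. The two rules used here are inferred from the example rows, not taken from the authors' implementation: annotations shared by both nodes are removed, and a transaction is dropped when either node has no items left.

```python
def unique_operation(transactions):
    """Keep only annotations that differ between the two nodes of an edge,
    and drop transactions that no longer involve both nodes."""
    final = []
    for tid, items0, items1 in transactions:
        shared = set(items0) & set(items1)
        rest0 = [item for item in items0 if item not in shared]
        rest1 = [item for item in items1 if item not in shared]
        if rest0 and rest1:  # both nodes must still contribute an item
            final.append((tid, rest0, rest1))
    return final

# The three transactions from the slide, without the 0./1. prefixes.
transactions = [
    (1, ["cytoplasm"], ["cytoplasm"]),
    (2, ["metabolism", "nucleus"], ["transcription", "nucleus"]),
    (3, ["transcription", "nucleus"], ["nucleus"]),
]

for tid, items0, items1 in unique_operation(transactions):
    print(tid, [f"0.{i}" for i in items0], [f"1.{i}" for i in items1])
# 2 ['0.metabolism'] ['1.transcription']
```

Because the pruning happens before rule mining, the output can be passed to any standard ARM implementation.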
34
Properties of Algorithm
  • Significant pruning at transaction level
  • Fewer items in transactions
  • Fewer transactions
  • Note: Pruning at rule or item set level would not
    be consistent
  • Example: 0.transcription → 1.nucleus
  • Differs from the conventional setting, where pruning
    at the itemset level is the gold standard
  • Modular approach
  • Unique operation can be combined with different
    ARM implementations

35
Results for 0.transcription → 1.nucleus
  • Standard ARM
  • Support 0.29%, Confidence 28.38%
  • Rule 0.transcription → 0.nucleus (0.70%,
    69.59%)
  • Rule 0.nucleus → 1.nucleus (5.74%, 29.06%)
  • Differential ARM
  • Support 0.02%, Confidence 2.08%
  • Typical range for differential ARM
  • Support 0.2-2%, Confidence 6-20%

36
Focus on Differences
  • Similarity can be tested through calculation of
    correlation of items with themselves
  • Computationally easy
  • Association meaningless
  • Typical rules that are interesting to biologists
  • Which kinds of proteins show compartmental
    cross-talk?
  • E.g., proteins in the nucleus interacting with
    proteins in the mitochondria
  • How do interaction definitions differ?
  • E.g., which protein families show physical
    interactions but no domain fusion interactions

37
Other Results of Differential ARM
  • Expected cross-talk (past analysis papers)
  • 1.mitochondria → 0.cytoplasm (1.2%, 27.3%)
  • Interesting related rules (not found before)
  • 1.mitochondria → 0.nucleus (0.72%, 16%)
  • 1.ER → 0.mitochondria (0.21%, 6%)

38
Performance
39
Comparison of Different Protein-Protein
Interaction Networks
  • Different Definitions
  • Physical interactions
  • Genetic interactions
  • Domain fusion
  • Are the resulting networks biologically
    equivalent?
  • Common assumption: All networks signify
    similarity in neighbors

40
Physical Interactions
  • Tests whether proteins physically interact when
    brought close together
  • Yeast-2-hybrid method
  • A gene is cut in half, and each half attached to
    one of the proteins in question
  • The gene can only perform its job if its parts
    come close through the protein-protein
    interactions
  • Can be done in any cell, i.e. does not test
    functions in cell

41
Genetic Interactions
  • In vivo analysis, i.e., in the living organism
  • Typical scenario
  • Assume gene A and B can individually be deleted
    and the organism survives
  • Assume deleting A and B together means the
    organism does not survive
  • Other combinations are possible
  • Organism does not survive deletion of A and B
    individually but does survive combined deletion
  • Or: Organism survives but is noticeably changed

42
Domain Fusion Interactions
  • Comparative Genome Analysis
  • Purely computational analysis
  • Based on evolutionary relationships
  • Assume species A has one gene with two domains 1
    and 2
  • Assume species B has two genes that have the same
    evolutionary origin (orthologs), 1' and 2'
  • Likely that proteins from genes 1' and 2'
    interact to generate the same function
  • 24,000 protein-protein interactions in yeast  
  • No experimental verification!

43
Can ARM Results be Compared Between Networks?
  • Not all proteins studied for all interactions
  • Networks have very different properties

Table           Int/orf   Max int   >20    int
Physical        3.55      289       73     14672
Genetic         7.88      157       93     8336
Domain fusion   44.6      231       305    28040
44
Construction of Network Comparison Basis Set
Transactions
C D E
C D K
G D E
G D K
J H I
45
Algorithm
  • Only nodes are considered that are involved in
    both networks
  • Items are eliminated if they occur in either of
    the two interaction types

1.A 1.C 1.D
2.A 2.B
0.C 0.E
46
Results
  • Rules based on physical interactions have higher
    confidence than genetic or domain fusion
  • Example: Physical compared with Domain Fusion
  • 0.ABC trans family signature (PDOC00185) →
    1.ATP/GTP binding site motif A (PDOC00017)
    (0.45%, 90%)
  • ABC family is known to function together with ATP
    binding site
  • Nevertheless, the ABC family signature doesn't occur
    in one gene together with the ATP binding site in
    other species (no domain fusion)
  • Possible reason: many proteins with ATP binding
    site

47
Conclusions of Part 2
  • Differential algorithm to ARM in relations that
    describe networks contrasts
  • Neighbors within neighbors
  • Multiple networks
  • Solves other problems of network setting
  • Overwhelming number of rules due to neighbor
    similarity
  • Problem of rules that don't involve all nodes
  • Rules that follow from combinations of both above
    problems
  • Some results confirmed by biologists
  • Some results interesting but plausible to
    biologists

48
Overall Conclusions
  • Computer science techniques can uncover patterns
    in data that could not be identified by simple
    inspection
  • Large amount of data
  • Complex data
  • Exciting times are only starting
  • Functional genomics only at its beginning
  • Mutual understanding of language takes time

49
Overall Conclusions (Data Miner's Perspective)
  • Bioinformatics excellent playing field for data
    miners
  • Data more easily available than in most other
    disciplines (possible exception: astronomy)
  • Results can directly benefit researchers in
    biology
  • Algorithms can/have to pass test of reality
  • Real data motivate fundamentally new algorithms

50
Acknowledgements (Part 2)
  • Computational side
  • Christopher Besemann
  • Biological interpretation
  • (Collaborators from NDSU Dept. of Biology)
  • Ajay Yekkirala
  • Ron Hutchison
  • Marc Anderson
  • The work was funded by the Dept. of CS, EPSCoR
    and the NDSU Research Foundation