Biological Networks CSci 732: Introduction to Bioinformatics - PowerPoint PPT Presentation

About This Presentation

Title:

Biological Networks CSci 732: Introduction to Bioinformatics

Description:

Biological Networks CSci 732: Introduction to Bioinformatics Anne Denton Assistant Professor Department of Computer Science North Dakota State University, Fargo, ND – PowerPoint PPT presentation

Number of Views:297

Avg rating:3.0/5.0

Slides: 51

Provided by: AnneD96

Learn more at: http://www.cs.ndsu.nodak.edu

Category:

more less

Transcript and Presenter's Notes

Title: Biological Networks CSci 732: Introduction to Bioinformatics

1
Biological NetworksCSci 732 Introduction to
Bioinformatics

Anne Denton
Assistant Professor
Department of Computer Science
North Dakota State University, Fargo, ND

2
The Promise of Bioinformatics

Rich supply of data from high-throughput
experiments
Sequencing
Microarray experiments
Scores of specialized high-throughput experiments
Data made available
Most biological data disseminated in large
biological databases
Dissemination is a condition of funding (in US)
Makes biology different from most other
disciplines!

3
Challenge and Opportunity

Traditional approach
Studies of one or a few gene at a time
Conclusions based on thorough domain knowledge
Challenge
Standard data analysis techniques inadequate for
hundreds or thousands of genes
Opportunity
Massive data valuable to quantify evidence
New aspects can be studied, such as network
structure

4
From Sequencing to Functional Genomics

Sequencing genomes is a mature discipline
experimentally and computationally
Whole-genome sequencing centrally involved
computations
But what do the proteins do?
Sequence comparison (BLAST) among species
Great computational and experimental challenges
Networks have a fundamental role in specifying
relationships

5
Computer Scientists Introduction to Genomics

Encoding
Smallest unit of information
Computer 1 bit (0 or 1)
Cell 1 nucleotide of DNA (A,C,T, or G)
Most practical unit of information
Computer 8 bit are 1 byte
(can represent 256 values)
Cell 3 nucleotide determine 1 amino acid of the
protein
(20 different amino acids)

6
Further Analogy between Cells and Computers

Encoded values can serve very different purposes
and are stored in the same location
Computer Instructions and data are stored in the
same memory (von Neumann architecture)
Cell Proteins in the cell do very different
things
Catalyze chemical reactions (proteins are part of
the process)
Regulate other proteins (proteins change the
process)

7
Networks in Bioinformatics

Many network definitions
Protein-protein interactions
Biochemical pathways
Annotation Networks
Different definitions in each category

Scientific American 05/03
8
Why Study Networks?

Biochemical pathways tell us about functioning of
cell
Chemical processes (Metabolic pathways)
Control of other proteins (Regulatory pathways)
Neighbors in networks often have similar function
Structure of networks can tell us about evolution
Combined study of networks and data can uncover
yet more information about cells

9
Outline

Part 1 Properties of the Networks
Scale-free networks
Part 2 Networks and data
Relational data mining
Problem Similarity between network neighbors
Solution Focus on differences between neighbors
Comparison of different networks

10
Example of a Biological NetworkPhysical
Protein-Protein Interactions

Proteins interact (attach to each other)
Tests if proteins are stable in a close position
Proteins may perform function together (not
tested)
Mathematically Undirected graph
Only one definition of interactions between
proteins? No!
Definition based on function Genetic
Definition based on evolution Domain fusion

11
Physical Protein-Protein in Yeast
Scientific American 05/03
12
Scale-free Networks

Properties
Barabasi, Bonabeau 1998
Some nodes have large number of links, most have
only a few
Number of nodes that have a particular number of
links decreases as a power law
Robust against accidental failures
Hubs with high connectivity

13
Power-law Behavior

Probability that any node is connected to k other
nodes is 1/k n with n between 2 and 3
For k 2 Probability of having twice as many
links is a quarter as likely

Scientific American 05/03
14
Robust against accidental failures

As many as 80 of nodes can fail in a scale-free
network without breaking up the entire cluster
I.e., even with large number of random mutations
in genes, unaffected proteins continue to work
together
Note that random removal of nodes is unlikely to
remove hubs
Removal of only a few hubs does break up network
significantly

15
Examples Outside Biology

Hyperlink structure of the Internet
Physical structure (routers and communication
lines) of the Internet
Social networks
Airline system
Scientific papers connected by citations

16
Scale-free Network (Airline System)
Scientific American 05/03
17
Random Networks in Contrast

Links placed randomly
Mathematical model (Erdos 1959)
Example Highway system
Bell-shaped (Poisson, similar to Gauss)
distribution of number of nodes around typical
value
For large number of links k,
probability of k links
decreases exponentially
Very unlikely to have
nodes with a very large
number of links

18
Random Network (Highway System)
Scientific American 05/03
19
Reasons for the development of scale-free
networks

Networks grow over time
Older nodes have had longer to accumulate links
The most connected nodes in E.coli metabolic
network have an early evolutionary history
Preferential attachment
"The rich get richer
cf. many people hyper-link to Google

20
"Small World" Property

Game "Six Degrees of Kevin Bacon " Acting in the
same movie as links connects most actors in 6
steps
Internet pages are typically 19 clicks apart
Any two chemicals in a cell are only 3 reactions
apart!
Small-world property is not limited to scale-free
networks

21
Are Protein Networks Pure Scale-free Networks?

Clusters of tightly connected nodes
Example
Proteins that perform function together
Recovering scale free network
Clustering of nodes leads to groups that interact
as scale-free networks

22
Summary of Scale-free Networks

Characterized by
Few hubs with a large number of edges
Many nodes with few edges
Show small-world property
Any node can be reached from any other in only a
few steps
Ubiquitous in biology and outside

23
Has Everything Interesting Related to Networks
Been Done?

Graph theory is an old topic
Euler 1736
Work on scale-
free network
has added to it
Surely now
everything is done!

24
Protein-Protein Interaction Networks
Name YAL003W Function Protein
Synthesis Localization Cytoplasm Class
GTP/GDP-exchange factors Complex Translation
complexes MOTIF PDOC00648
25
Outline (2) Data and Networks

Relational rather than graph-theoretic approach
to mining of data on a graph
Problem of similarity between neighbors
New Algorithm
Focuses on differences
Comparison of Networks
Different definitions of Networks
Generalization of difference-based algorithm

26
Questions of Interest

How do data between nodes relate?
Are there typical patterns among interacting
proteins?
Can we find relationships that are not yet known
to biologists?
It is expected that proteins in the nucleus
interact with other proteins in the nucleus
It is more surprising if proteins in the nucleus
interact with proteins in the mitochondria
How do different networks compare?

27
Frequent Patterns in TablesAssociation Rule
Mining

1st step Finding sets of items that are
frequent
Originally Items in shopping carts
Here Properties of proteins
Support Fraction of transactions, in which the
set of items occurs
2nd step Finding associations A -gt B
If we find set A, we are likely to find set B
Similar to correlation, but goes in one direction
only
Confidence Fraction of transactions that have
A, which also have B

28
Relational Approach
Node Table Node Table
ORF Annotations
YPR184W cytoplasm
YER146W cytoplasm
YNL287W SensitivityTOaaaod
YBL026W transcription, nucleus
YMR207C nucleus
Node Table Node Table
ORF Annotations
YPR184W cytoplasm
YER146W cytoplasm
YNL287W SensitivityTOaaaod
YBL026W transcription, nucleus
YMR207C nucleus
0.
1.
Edge Table Edge Table
ORF1 ORF1
YPR184W YER146W
YNL287W YBL026W
YBL026W YMR207C
29
Generalization

Works for any number of nodes
Even 2-node structure allows complex rules
Results become harder to interpret for many nodes
Small world property Any protein can be
reached in 3-4 hops

30
Effect of Joining on Item Sets
31
ARM on Joined Tables

Protein names (key) used for joining but not in
ARM
One node table participates multiple times
Items labeled to keep track of instance
Typical transactions
Simplest approach had been done
overemphasizes similarities

T1 0.cyto, 0.Cond_pheno, 0. PDOC00030, 1.Hydrolases
T2 0.Cond_pheno, 0.Transferases, 0.Polymerases, 1.Cond_pheno, 1.Transferases, 1.Polymerases
32
Problem with Naive Implementation

Many rules that are not interesting
Rules that reflect similarity between neighbors
0.nucleus ? 1.nucleus
Rules involving only one protein
0.transcription ? 0.nucleus
Note Different support and confidence compared
with ARM on node relation (protein data ignoring
network)
Rules that are a consequence of both above
0.transcription ? 1.nucleus

33
Algorithm
TID Unique Operation Unique Operation
1 0.cytoplasm 1.cytoplasm
2 0.metabolism, 0.nucleus 1.transcription, 1.nucleus
3 0.transcription, 0.nucleus 1.nucleus
TID Final Transactions Final Transactions
2 0.metabolism 1.transcription
34
Properties of Algorithm

Significant pruning at transaction level
Fewer items in transactions
Fewer transactions
Note Pruning at rule or item set level would not
be consistent
Example 0.transcription ? 1.nucleus
Differs from conventional setting pruning at
itemset level is gold standard
Modular approach
Unique operation can be combined with different
ARM implementations

35
Results for 0.transcription ? 1.nucleus

Standard ARM
Support 0.29, Confidence 28.38
Rule 0.transcription ? 0.nucleus (0.70,
69.59)
Rule 0.nucleus ? 1.nucleus (5.74, 29.06)
Differential ARM
Support 0.02, Confidence 2.08
Typical range for differential ARM
Support 0.2-2, Confidence 6-20

36
Focus on Differences

Similarity can be tested through calculation of
correlation of items with themselves
Computationally easy
Association meaningless
Typical rules that are interesting to biologists
Which kinds of proteins show compartmental
cross-talk?
E.g., proteins in the nucleus interacting with
proteins in the mitochondria
How do interaction definitions differ?
E.g., which protein families show physical
interactions but no domain fusion interactions

37
Other Results of Differential ARM

Expected Cross-Talk past analysis papers
1.mitochondria ? 0.cytoplasm (1.2, 27.3)
Interesting Related Rules not found before
1.mitochondria ? 0.nucleus (0.72, 16)
1.ER ? 0.mitochondira (0.21, 6)

38
Performance
39
Comparison of Different Protein-Protein
Interaction Networks

Different Definitions
Physical interactions
Genetic interactions
Domain fusion
Are the resulting networks biologically
equivalent?
Common assumption All networks signify
similarity in neighbors

40
Physical Interactions

Tests whether proteins physically interact when
brought close together
Yeast-2-hybrid method
A gene is cut in half, and each half attached to
one of the proteins in question
The gene can only perform its job if its parts
come close through the protein-protein
interactions
Can be done in any cell, i.e. does not test
functions in cell

41
Genetic Interactions

In vivo analysis, i.e., in the living organism
Typical scenario
Assume gene A and B can individually be deleted
and the organism survives
Assume deleting A and B together means the
organism does not survive
Other combinations are possible
Organism does not survive deletion of A and B
individually but does survive combined deletion
Or Organism survives but is noticeably changed

42
Domain Fusion Interactions

Comparative Genome Analysis
Purely computational analysis
Based on evolutionary relationships
Assume species A has one gene with two domains 1
and 2
Assume species B has two genes that have the same
evolutionary origin (orthologs), 1' and 2'
Likely that proteins from genes 1' and 2'
interact to generate the same function
24,000 protein-protein interactions in yeast
No experimental verification!

43
Can ARM Results be Compared Between Networks?

Not all proteins studied for all interactions
Networks have very different properties

Table Int/orf Max int gt20 int
Physical 3.55 289 73 14672
Genetic 7.88 157 93 8336
Domain fusion 44.6 231 305 28040
44
Construction of Network Comparison Basis Set
Transactions
C D E
C D K
G D E
G D K
J H I
45
Algorithm

Only nodes are considered that are involved in
both networks
Items are eliminated if they occur in either of
the two interaction types

1.A 1.C 1.D
2.A 2.B
0.C 0.E
46
Results

Rules based on physical interactions have higher
confidence than genetic or domain fusion
Example Physical compared with Domain Fusion
0.ABC trans family signature(PDOC00185) ?
1.ATP/GTP binding site motif A(PDOC00017)
(0.45, 90)
ABC family is known to function together with ATP
binding site
Nevertheless ABC family signature dont occur in
one gene together with ATP binding site in other
species (no domain fusion)
Possible reason many proteins with ATP binding
site

47
Conclusions of Part 2

Differential algorithm to ARM in relations that
describe networks contrasts
Neighbors within neighbors
Multiple networks
Solves other problems of network setting
Overwhelming number of rules due to neighbor
similarity
Problem of rules that dont involve all nodes
Rules that follow from combinations of both above
problems
Some results confirmed by biologists
Some results interesting but plausible to
biologists

48
Overall Conclusions

Computer science techniques can uncover patterns
in data that could not be identified by simple
inspection
Large amount of data
Complex data
Exciting times are only starting
Functional genomics only at its beginning
Mutual understanding of language takes time

49
Overall Conclusions (Data Miners perspective)

Bioinformatics excellent playing field for data
miners
Data more easily available than in most other
disciplines (possible exception astronomy)
Results can directly benefit researchers in
biology
Algorithms can/have to pass test of reality
Real data motivate fundamentally new algorithms

50
Acknowledgements (Part 2)