Title: Biological Networks CSci 732: Introduction to Bioinformatics
1Biological Networks: CSci 732 Introduction to Bioinformatics
- Anne Denton
- Assistant Professor
- Department of Computer Science
- North Dakota State University, Fargo, ND
2The Promise of Bioinformatics
- Rich supply of data from high-throughput experiments
- Sequencing
- Microarray experiments
- Scores of specialized high-throughput experiments
- Data made available
- Most biological data disseminated in large biological databases
- Dissemination is a condition of funding (in US)
- Makes biology different from most other disciplines!
3Challenge and Opportunity
- Traditional approach
- Studies of one or a few genes at a time
- Conclusions based on thorough domain knowledge
- Challenge
- Standard data analysis techniques inadequate for hundreds or thousands of genes
- Opportunity
- Massive data valuable to quantify evidence
- New aspects can be studied, such as network structure
4From Sequencing to Functional Genomics
- Sequencing genomes is a mature discipline experimentally and computationally
- Whole-genome sequencing centrally involved computations
- But what do the proteins do?
- Sequence comparison (BLAST) among species
- Great computational and experimental challenges
- Networks have a fundamental role in specifying relationships
5Computer Scientist's Introduction to Genomics
- Encoding
- Smallest unit of information
- Computer: 1 bit (0 or 1)
- Cell: 1 nucleotide of DNA (A, C, T, or G)
- Most practical unit of information
- Computer: 8 bits are 1 byte (can represent 256 values)
- Cell: 3 nucleotides determine 1 amino acid of the protein (20 different amino acids)
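The bit/byte versus nucleotide/codon analogy can be checked by counting. The sketch below enumerates all nucleotide triplets; the variable names are illustrative only. It shows that three nucleotides give 64 possible triplets, more than the 20 amino acids they encode, so the code is redundant.

```python
from itertools import product

NUCLEOTIDES = "ACGT"  # the four DNA letters from the slide

# All ordered triplets of nucleotides (codons): 4 * 4 * 4 combinations.
codons = ["".join(triplet) for triplet in product(NUCLEOTIDES, repeat=3)]
num_codons = len(codons)

# One byte of 8 bits, for comparison: 2 ** 8 distinct values.
byte_values = 2 ** 8

print(f"codons: {num_codons}, byte values: {byte_values}")
```

Since 64 triplets encode only 20 amino acids, several triplets map to the same amino acid.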
6Further Analogy between Cells and Computers
- Encoded values can serve very different purposes and are stored in the same location
- Computer: Instructions and data are stored in the same memory (von Neumann architecture)
- Cell: Proteins in the cell do very different things
- Catalyze chemical reactions (proteins are part of the process)
- Regulate other proteins (proteins change the process)
7Networks in Bioinformatics
- Many network definitions
- Protein-protein interactions
- Biochemical pathways
- Annotation Networks
- Different definitions in each category
Scientific American 05/03
8Why Study Networks?
- Biochemical pathways tell us about the functioning of the cell
- Chemical processes (Metabolic pathways)
- Control of other proteins (Regulatory pathways)
- Neighbors in networks often have similar function
- Structure of networks can tell us about evolution
- Combined study of networks and data can uncover
yet more information about cells
9Outline
- Part 1: Properties of the Networks
- Scale-free networks
- Part 2: Networks and data
- Relational data mining
- Problem: Similarity between network neighbors
- Solution: Focus on differences between neighbors
- Comparison of different networks
10Example of a Biological Network: Physical Protein-Protein Interactions
- Proteins interact (attach to each other)
- Tests if proteins are stable in a close position
- Proteins may perform function together (not tested)
- Mathematically: Undirected graph
- Only one definition of interactions between proteins? No!
- Definition based on function: Genetic
- Definition based on evolution: Domain fusion
11Physical Protein-Protein Interactions in Yeast
Scientific American 05/03
12Scale-free Networks
- Properties
- Barabasi, Bonabeau 1998
- Some nodes have a large number of links, most have only a few
- Number of nodes that have a particular number of links decreases as a power law
- Robust against accidental failures
- Hubs with high connectivity
13Power-law Behavior
- Probability that any node is connected to k other nodes is proportional to 1/k^n, with n between 2 and 3
- For n = 2: a node with twice as many links is a quarter as likely
Scientific American 05/03
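The "quarter as likely" statement follows directly from the power law. A minimal sketch, assuming an unnormalized weight 1/k^n (the function name is illustrative, not from the slides):

```python
def relative_frequency(k, n=2.0):
    """Unnormalized power-law weight 1/k^n for a node with k links."""
    return k ** -n

# Doubling the number of links from 10 to 20 with n = 2:
ratio_n2 = relative_frequency(20, n=2.0) / relative_frequency(10, n=2.0)

# The same comparison at the other end of the quoted range, n = 3:
ratio_n3 = relative_frequency(20, n=3.0) / relative_frequency(10, n=3.0)

print(ratio_n2, ratio_n3)  # about 1/4 and 1/8
```

The ratio depends only on the factor by which k grows, not on k itself, which is why such networks are called scale-free.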
14Robust against accidental failures
- As many as 80% of nodes can fail in a scale-free network without breaking up the entire cluster
- I.e., even with a large number of random mutations in genes, unaffected proteins continue to work together
- Note that random removal of nodes is unlikely to remove hubs
- Removal of only a few hubs does break up the network significantly
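The contrast between accidental failures and hub removal can be illustrated on a toy graph. The sketch below is not a real protein network: it assumes a hypothetical graph of 3 mutually linked hubs with 10 low-degree nodes each, and measures the largest connected cluster after removing nodes.

```python
from collections import deque

def build_hub_graph(num_hubs=3, leaves_per_hub=10):
    """Toy hub-and-spoke graph (hypothetical, for illustration only)."""
    hubs = [f"hub{h}" for h in range(num_hubs)]
    adj = {h: set() for h in hubs}
    for i, hub in enumerate(hubs):
        for other in hubs[i + 1:]:          # hubs are linked to each other
            adj[hub].add(other)
            adj[other].add(hub)
        for j in range(leaves_per_hub):     # each leaf has a single link
            leaf = f"{hub}_leaf{j}"
            adj[leaf] = {hub}
            adj[hub].add(leaf)
    return adj, hubs

def largest_component(adj, removed):
    """Size of the largest connected cluster after removing nodes."""
    removed, seen, best = set(removed), set(), 0
    for start in adj:
        if start in removed or start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:                        # breadth-first search
            node = queue.popleft()
            size += 1
            for nb in adj[node]:
                if nb not in removed and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        best = max(best, size)
    return best

adj, hubs = build_hub_graph()
leaves = [node for node in adj if node not in hubs]
after_leaf_failures = largest_component(adj, leaves[:10])  # 10 low-degree nodes fail
after_hub_failures = largest_component(adj, hubs)          # only 3 hubs fail
print(after_leaf_failures, after_hub_failures)
```

Losing ten low-degree nodes leaves all remaining nodes in one cluster, while losing the three hubs isolates every remaining node.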
15Examples Outside Biology
- Hyperlink structure of the Internet
- Physical structure (routers and communication lines) of the Internet
- Social networks
- Airline system
- Scientific papers connected by citations
16Scale-free Network (Airline System)
Scientific American 05/03
17Random Networks in Contrast
- Links placed randomly
- Mathematical model (Erdős and Rényi 1959)
- Example: Highway system
- Bell-shaped (Poisson, similar to Gauss) distribution of the number of links per node around a typical value
- For large number of links k, probability of k links decreases exponentially
- Very unlikely to have nodes with a very large number of links
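To see how differently the two tails behave, the sketch below compares a Poisson distribution with a power law. The mean degree of 4, the exponent 2.5, and the cutoff of 1000 links are assumptions chosen only for illustration.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k links in a random (Poisson) network."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def power_law_pmf(k, n=2.5, k_max=1000):
    """Probability of k links under 1/k^n, normalized over 1..k_max."""
    norm = sum(j ** -n for j in range(1, k_max + 1))
    return k ** -n / norm

MEAN_DEGREE = 4  # hypothetical typical number of links per node

for k in (1, 4, 10, 50):
    print(k, poisson_pmf(k, MEAN_DEGREE), power_law_pmf(k))
```

Near the typical value the Poisson probability is the larger one, but for a node with 50 links it is many orders of magnitude smaller than the power-law probability, so hubs are effectively absent from random networks.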
18Random Network (Highway System)
Scientific American 05/03
19Reasons for the development of scale-free
networks
- Networks grow over time
- Older nodes have had longer to accumulate links
- The most connected nodes in the E. coli metabolic network have an early evolutionary history
- Preferential attachment
- "The rich get richer"
- cf. many people hyper-link to Google
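Growth plus preferential attachment can be simulated in a few lines. This is a minimal sketch, assuming each new node adds exactly one link and picks its partner with probability proportional to the partner's current number of links. The function name and the seed are illustrative.

```python
import random

def grow_network(num_nodes, seed=0):
    """Grow a network one node at a time with preferential attachment.

    Returns a dict mapping node id to its number of links.
    """
    rng = random.Random(seed)
    degree = {0: 1, 1: 1}   # start from two linked nodes
    endpoints = [0, 1]      # each node appears once per link it has
    for new_node in range(2, num_nodes):
        # Choosing uniformly from `endpoints` favors well-connected nodes.
        target = rng.choice(endpoints)
        degree[new_node] = 1
        degree[target] += 1
        endpoints.extend([new_node, target])
    return degree

degree = grow_network(1000)
top_five = sorted(degree.items(), key=lambda item: item[1], reverse=True)[:5]
print(top_five)  # (node id, number of links) for the best-connected nodes
```

Low node ids are the oldest nodes, so the printed list shows whether early nodes end up as hubs in a given run.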
20"Small World" Property
- Game "Six Degrees of Kevin Bacon": Acting in the same movie as links connects most actors in 6 steps
- Internet pages are typically 19 clicks apart
- Any two chemicals in a cell are only 3 reactions apart!
- Small-world property is not limited to scale-free networks
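The number of steps between two nodes is the shortest path length, which breadth-first search finds directly. The sketch below uses a small hypothetical graph. Node names are placeholders, not real proteins or chemicals.

```python
from collections import deque

def hops_between(adj, source, target):
    """Fewest links to get from source to target, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

# Hypothetical undirected graph with one well-connected node.
adj = {
    "A": {"hub"},
    "B": {"hub"},
    "C": {"hub", "D"},
    "D": {"C"},
    "hub": {"A", "B", "C"},
}

print(hops_between(adj, "A", "B"))  # via the hub
print(hops_between(adj, "A", "D"))  # via the hub and C
```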
21Are Protein Networks Pure Scale-free Networks?
- Clusters of tightly connected nodes
- Example
- Proteins that perform function together
- Recovering scale-free network
- Clustering of nodes leads to groups that interact as scale-free networks
22Summary of Scale-free Networks
- Characterized by
- Few hubs with a large number of edges
- Many nodes with few edges
- Show small-world property
- Any node can be reached from any other in only a few steps
- Ubiquitous in biology and outside
23Has Everything Interesting Related to Networks
Been Done?
- Graph theory is an old topic
- Euler 1736
- Work on scale-free networks has added to it
- Surely now everything is done!
24Protein-Protein Interaction Networks
Name: YAL003W
Function: Protein Synthesis
Localization: Cytoplasm
Class: GTP/GDP-exchange factors
Complex: Translation complexes
Motif: PDOC00648
25Outline (2) Data and Networks
- Relational rather than graph-theoretic approach to mining of data on a graph
- Problem of similarity between neighbors
- New Algorithm
- Focuses on differences
- Comparison of Networks
- Different definitions of Networks
- Generalization of difference-based algorithm
26Questions of Interest
- How do data between nodes relate?
- Are there typical patterns among interacting proteins?
- Can we find relationships that are not yet known to biologists?
- It is expected that proteins in the nucleus interact with other proteins in the nucleus
- It is more surprising if proteins in the nucleus interact with proteins in the mitochondria
- How do different networks compare?
27Frequent Patterns in Tables: Association Rule Mining (ARM)
- 1st step: Finding sets of items that are frequent
- Originally: Items in shopping carts
- Here: Properties of proteins
- Support: Fraction of transactions in which the set of items occurs
- 2nd step: Finding associations A -> B
- If we find set A, we are likely to find set B
- Similar to correlation, but goes in one direction only
- Confidence: Fraction of transactions that have A, which also have B
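Support and confidence as defined above can be computed directly from a list of transactions. The sketch below is a plain illustration of the two definitions, not an efficient ARM implementation. The four example transactions are made up.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for transaction in transactions if itemset <= set(transaction))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing A that also contain B."""
    support_a = support(transactions, antecedent)
    if support_a == 0:
        return 0.0  # rule never applies
    return support(transactions, set(antecedent) | set(consequent)) / support_a

# Hypothetical protein annotations, one transaction per protein.
transactions = [
    {"nucleus", "transcription"},
    {"nucleus", "transcription"},
    {"nucleus"},
    {"cytoplasm"},
]

print(support(transactions, {"nucleus", "transcription"}))
print(confidence(transactions, {"transcription"}, {"nucleus"}))
print(confidence(transactions, {"nucleus"}, {"transcription"}))
```

The last two lines differ, which shows the one-directional nature of a rule: every transcription protein here is in the nucleus, but not every nuclear protein is annotated with transcription.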
28Relational Approach
Node Table (participates twice in the join, as instances 0. and 1.)
ORF       Annotations
YPR184W   cytoplasm
YER146W   cytoplasm
YNL287W   SensitivityTOaaaod
YBL026W   transcription, nucleus
YMR207C   nucleus

Edge Table
ORF1      ORF1
YPR184W   YER146W
YNL287W   YBL026W
YBL026W   YMR207C
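Joining the edge table with two instances of the node table can be sketched as follows, using the rows shown on this slide. Two things are assumptions: the first edge column is matched to instance 0 and the second to instance 1, and each edge is used in one orientation only.

```python
# Node table: ORF -> list of annotations (rows from the slide).
node_table = {
    "YPR184W": ["cytoplasm"],
    "YER146W": ["cytoplasm"],
    "YNL287W": ["SensitivityTOaaaod"],
    "YBL026W": ["transcription", "nucleus"],
    "YMR207C": ["nucleus"],
}

# Edge table: one pair of interacting ORFs per row.
edge_table = [
    ("YPR184W", "YER146W"),
    ("YNL287W", "YBL026W"),
    ("YBL026W", "YMR207C"),
]

def join_edges_with_nodes(edges, nodes):
    """Build one transaction per edge from the annotations of both proteins.

    Items are prefixed with 0. or 1. to record which node instance they
    came from; the ORF names themselves are used only as join keys.
    """
    transactions = []
    for orf_0, orf_1 in edges:
        items = [f"0.{a}" for a in nodes[orf_0]] + [f"1.{a}" for a in nodes[orf_1]]
        transactions.append(items)
    return transactions

transactions = join_edges_with_nodes(edge_table, node_table)
for transaction in transactions:
    print(transaction)
```

Because the interaction graph is undirected, a full implementation would also have to decide whether to add the reversed orientation of each edge. This sketch does not.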
29Generalization
- Works for any number of nodes
- Even 2-node structure allows complex rules
- Results become harder to interpret for many nodes
- Small-world property: Any protein can be reached in 3-4 hops
30Effect of Joining on Item Sets
31ARM on Joined Tables
- Protein names (key) used for joining but not in ARM
- One node table participates multiple times
- Items labeled to keep track of instance
- Typical transactions
T1: 0.cyto, 0.Cond_pheno, 0.PDOC00030, 1.Hydrolases
T2: 0.Cond_pheno, 0.Transferases, 0.Polymerases, 1.Cond_pheno, 1.Transferases, 1.Polymerases
- Simplest approach had been done before
- It overemphasizes similarities
32Problem with Naive Implementation
- Many rules that are not interesting
- Rules that reflect similarity between neighbors
- 0.nucleus -> 1.nucleus
- Rules involving only one protein
- 0.transcription -> 0.nucleus
- Note: Different support and confidence compared with ARM on node relation (protein data ignoring network)
- Rules that are a consequence of both above
- 0.transcription -> 1.nucleus
33Algorithm
Unique Operation
TID   0. items                      1. items
1     0.cytoplasm                   1.cytoplasm
2     0.metabolism, 0.nucleus       1.transcription, 1.nucleus
3     0.transcription, 0.nucleus    1.nucleus

Final Transactions
TID   0. items                      1. items
2     0.metabolism                  1.transcription
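The example above suggests how the unique operation behaves. The sketch below encodes one reading of it, inferred only from these three rows: annotations that occur at both nodes are removed, and a transaction is dropped when either node has nothing left. Function names are illustrative, not the authors'.

```python
def unique_operation(items_0, items_1):
    """Remove annotations that both interacting proteins share."""
    shared = set(items_0) & set(items_1)
    unique_0 = [item for item in items_0 if item not in shared]
    unique_1 = [item for item in items_1 if item not in shared]
    return unique_0, unique_1

def differential_transactions(joined):
    """Apply the unique operation and keep transactions with items on both sides.

    Dropping one-sided transactions is an assumption read off the example,
    where TID 3 does not appear among the final transactions.
    """
    final = {}
    for tid, (items_0, items_1) in joined.items():
        unique_0, unique_1 = unique_operation(items_0, items_1)
        if unique_0 and unique_1:
            final[tid] = [f"0.{i}" for i in unique_0] + [f"1.{i}" for i in unique_1]
    return final

# The three joined transactions from the slide, without instance prefixes.
joined = {
    1: (["cytoplasm"], ["cytoplasm"]),
    2: (["metabolism", "nucleus"], ["transcription", "nucleus"]),
    3: (["transcription", "nucleus"], ["nucleus"]),
}

final = differential_transactions(joined)
print(final)
```

Only TID 2 survives, reduced to the items in which the two neighbors differ. The result can then be passed to any standard ARM implementation.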
34Properties of Algorithm
- Significant pruning at transaction level
- Fewer items in transactions
- Fewer transactions
- Note: Pruning at rule or item set level would not be consistent
- Example: 0.transcription -> 1.nucleus
- Differs from conventional setting: pruning at itemset level is gold standard
- Modular approach
- Unique operation can be combined with different ARM implementations
35Results for 0.transcription -> 1.nucleus
- Standard ARM
- Support 0.29, Confidence 28.38
- Rule 0.transcription -> 0.nucleus (0.70, 69.59)
- Rule 0.nucleus -> 1.nucleus (5.74, 29.06)
- Differential ARM
- Support 0.02, Confidence 2.08
- Typical range for differential ARM
- Support 0.2-2, Confidence 6-20
36Focus on Differences
- Similarity can be tested through calculation of correlation of items with themselves
- Computationally easy
- Association meaningless
- Typical rules that are interesting to biologists
- Which kinds of proteins show compartmental cross-talk?
- E.g., proteins in the nucleus interacting with proteins in the mitochondria
- How do interaction definitions differ?
- E.g., which protein families show physical interactions but no domain fusion interactions
37Other Results of Differential ARM
- Expected cross-talk (known from past analysis papers)
- 1.mitochondria -> 0.cytoplasm (1.2, 27.3)
- Interesting related rules not found before
- 1.mitochondria -> 0.nucleus (0.72, 16)
- 1.ER -> 0.mitochondria (0.21, 6)
38Performance
39Comparison of Different Protein-Protein
Interaction Networks
- Different Definitions
- Physical interactions
- Genetic interactions
- Domain fusion
- Are the resulting networks biologically equivalent?
- Common assumption: All networks signify similarity in neighbors
40Physical Interactions
- Tests whether proteins physically interact when brought close together
- Yeast-2-hybrid method
- A gene is cut in half, and each half attached to one of the proteins in question
- The gene can only perform its job if its parts come close through the protein-protein interactions
- Can be done in any cell, i.e., does not test functions in cell
41Genetic Interactions
- In vivo analysis, i.e., in the living organism
- Typical scenario
- Assume genes A and B can individually be deleted and the organism survives
- Assume deleting A and B together means the organism does not survive
- Other combinations are possible
- Organism does not survive deletion of A and B individually but does survive combined deletion
- Or: Organism survives but is noticeably changed
42Domain Fusion Interactions
- Comparative Genome Analysis
- Purely computational analysis
- Based on evolutionary relationships
- Assume species A has one gene with two domains 1 and 2
- Assume species B has two genes that have the same evolutionary origin (orthologs), 1' and 2'
- Likely that proteins from genes 1' and 2' interact to generate the same function
- 24,000 protein-protein interactions in yeast
- No experimental verification!
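The domain fusion argument can be written as a small inference step. This is a toy sketch with hypothetical gene and domain names. It assumes domain annotations are already available, and it skips the sequence comparison a real analysis would need to establish orthology.

```python
from itertools import combinations

def predict_domain_fusion_links(fused_genes, separate_genes):
    """Predict interactions between genes whose domains are fused elsewhere.

    fused_genes:    gene in species A -> set of domains it contains
    separate_genes: gene in species B -> the single domain it contains
    Returns a set of unordered gene pairs from species B.
    """
    gene_for_domain = {domain: gene for gene, domain in separate_genes.items()}
    links = set()
    for domains in fused_genes.values():
        for d1, d2 in combinations(sorted(domains), 2):
            if d1 in gene_for_domain and d2 in gene_for_domain:
                links.add(frozenset((gene_for_domain[d1], gene_for_domain[d2])))
    return links

# Species A: one gene carrying domains 1 and 2 (names are placeholders).
fused_genes = {"geneA": {"domain1", "domain2"}}
# Species B: two genes, each carrying one of those domains.
separate_genes = {"gene1_prime": "domain1", "gene2_prime": "domain2"}

links = predict_domain_fusion_links(fused_genes, separate_genes)
print(links)
```

The output is a predicted link only. As the slide notes, nothing in this procedure verifies the interaction experimentally.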
43Can ARM Results be Compared Between Networks?
- Not all proteins studied for all interactions
- Networks have very different properties
Network         Int/ORF   Max int   >20   Int
Physical        3.55      289       73    14672
Genetic         7.88      157       93    8336
Domain fusion   44.6      231       305   28040
44Construction of Network Comparison Basis Set
Transactions
C D E
C D K
G D E
G D K
J H I
45Algorithm
- Only nodes are considered that are involved in both networks
- Items are eliminated if they occur in either of the two interaction types
1.A 1.C 1.D
2.A 2.B
0.C 0.E
46Results
- Rules based on physical interactions have higher confidence than genetic or domain fusion
- Example: Physical compared with Domain Fusion
- 0.ABC transporters family signature (PDOC00185) -> 1.ATP/GTP binding site motif A (PDOC00017) (0.45, 90)
- ABC family is known to function together with ATP binding site
- Nevertheless, the ABC family signature does not occur in one gene together with the ATP binding site in other species (no domain fusion)
- Possible reason: many proteins with ATP binding site
47Conclusions of Part 2
- Differential approach to ARM on relations that describe networks contrasts
- Neighbors within a network
- Multiple networks
- Solves other problems of network setting
- Overwhelming number of rules due to neighbor similarity
- Problem of rules that don't involve all nodes
- Rules that follow from combinations of both above problems
- Some results confirmed by biologists
- Some results new to biologists but plausible
48Overall Conclusions
- Computer science techniques can uncover patterns in data that could not be identified by simple inspection
- Large amount of data
- Complex data
- Exciting times are only starting
- Functional genomics only at its beginning
- Mutual understanding of language takes time
49Overall Conclusions (Data Miner's Perspective)
- Bioinformatics is an excellent playing field for data miners
- Data more easily available than in most other disciplines (possible exception: astronomy)
- Results can directly benefit researchers in biology
- Algorithms can/have to pass test of reality
- Real data motivate fundamentally new algorithms
50Acknowledgements (Part 2)
- Computational side
- Christopher Besemann
- Biological interpretation
- (Collaborators from NDSU Dept. of Biology)
- Ajay Yekkirala
- Ron Hutchison
- Marc Anderson
- The work was funded by the Dept. of CS, EPSCoR
and the NDSU Research Foundation