1
Graphs and Networks with Bioconductor
Wolfgang Huber, EMBL/EBI
Bioconductor Conference 2005
Based on chapters from "Bioinformatics and Computational Biology Solutions using R and Bioconductor", Gentleman, Carey, Huber, Irizarry, Dudoit, Springer Verlag.
2
Graphs
Set of nodes and set of edges.
Nodes: objects of interest
Edges: relationships between them
A useful abstraction to talk about relationships and interactions (think of integer numbers, apples and fingers).
Edges may have weights, directions, types.
3
Practicalities
As always, need to distinguish between the true, underlying property of nature that you want to measure, and the actual result of a measurement (experiment):
1. False positive edges (are in your data, but are not there in nature)
2. False negative edges (were tested, were not found, but are there in nature)
3. Untested edges (were not tested, are not in your data, but are there in nature)
Uncertainty is not usually considered in mainstream graph theory, but cannot be ignored in functional genomics.
Nice application of these concepts to protein interactions: Gentleman and Scholtens, SAGMB 2004
4
Representation
Node-edge lists
Adjacency matrix (straightforward)
Adjacency matrix (sparse)
From-To matrix
They are equivalent, but may be hugely different in performance and convenience for different applications. Can coerce between the representations.
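The equivalence of these representations can be sketched in base R. The helper names ft_to_adj and adj_to_ft are hypothetical and only serve as illustration; the graph package ships its own coercion functions (for example ftM2adjM), which this sketch does not use.

```r
# Hypothetical helpers (not part of the graph package).

# From-to matrix (one row per directed edge) -> adjacency matrix.
ft_to_adj <- function(ft, n = max(ft)) {
  adj <- matrix(0L, nrow = n, ncol = n, dimnames = list(1:n, 1:n))
  adj[ft] <- 1L            # a two-column matrix indexes (row, col) pairs
  adj
}

# Adjacency matrix -> from-to matrix, ordered by source node.
adj_to_ft <- function(adj) {
  ft <- which(adj != 0, arr.ind = TRUE)
  ft <- ft[order(ft[, 1], ft[, 2]), , drop = FALSE]
  unname(ft)
}

ft   <- rbind(c(1, 2), c(2, 3), c(3, 1), c(4, 4))
adj  <- ft_to_adj(ft)
back <- adj_to_ft(adj)     # round trip recovers the original edge list
```

The dense matrix costs n^2 cells regardless of the number of edges, while the from-to matrix grows with the number of edges only, which is one reason the choice matters for large sparse graphs.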
5
Algorithms
Bioconductor project emphasizes re-use and interfacing to existing, well-tested software implementations rather than reimplementing everything from scratch ourselves.
RBGL package: interface to Boost Graph Library, started by V. Carey, R. Gentleman, now driven by Li Long.
6
Example: a pathway
7
Elementary computations on IMCA pathway
> library("graph")
> data("integrinMediatedCellAdhesion")
> class(IMCAGraph)
> s = acc(IMCAGraph, "SOS")
           Ha-Ras               Raf               MEK
                1                 2                 3
              ERK              MYLK               MYO
                4                 5                 6
          F-actin cell proliferation
                7                  5
8
Machine-readable pathway databases
KEGG
reactome
BioCarta (biocarta.com)
National Cancer Institute cMAP
9
Gene Ontology (GO)
A structured vocabulary to describe molecular function of gene products, biological processes, and cellular components.
Plus: a set of "is a", "is part of" relationships between these terms
Directed acyclic graph
10
GO graphs
> tfG = GOGraph("GO:0003700", GOMFPARENTS)
11
Gene-Literature graphs
DKC1
12
Graphs vocabulary
Directed, undirected graphs
Adjacent nodes
Accessible nodes
Self-loop
Multi-edge
Node degree
Walk: alternating sequence of nodes and incident edges
Closed walk
Distance between nodes: shortest walk
Trail: walk with no repeated edges
Path: trail with no repeated nodes (except possibly first/last)
Cycle
Connected graph
Weakly connected directed graph (see next page)
13
Strong and weak connectivity
14
Graphs vocabulary
Cut: remove edges to disconnect a graph
Cut-set: remove nodes to disconnect a graph
Connectivity of a graph
Cliques
15
Special types of graphs
16
Bipartite graph
17
Bipartite graphs
A_G: adjacency matrix (n x m) of a bipartite graph G with node sets U, V
One-mode graphs:
A_U = A_G^t A_G
A_V = A_G A_G^t
(Boolean algebra)
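A minimal base R sketch of the one-mode projection follows. It assumes that rows of the adjacency matrix index V and columns index U, so that the two products have matching dimensions; the toy papers and genes and the helper bool_prod are invented for illustration.

```r
# Assumed toy data: rows index V (papers), columns index U (genes).
AG <- matrix(c(1, 1, 0,
               0, 1, 1,
               0, 0, 1),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("paper1", "paper2", "paper3"),
                             c("geneA", "geneB", "geneC")))

# Boolean matrix product (hypothetical helper): an entry is TRUE if the
# two nodes have at least one common neighbour in the other node set.
bool_prod <- function(X, Y) (X %*% Y) > 0

AU <- bool_prod(t(AG), AG)   # genes linked through a shared paper
AV <- bool_prod(AG, t(AG))   # papers linked through a shared gene
```

Dropping the comparison with zero gives the integer product instead, whose entries count the shared neighbours and can serve as edge weights.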
18
Multigraphs
Can have different types of edges
19
Hypergraphs
Set of nodes, set of hyperedges
A hyperedge is a set of nodes (can be more than 2)
A directed hyperedge: pair (tail and head) of sets of nodes
20
Directed acyclic graphs
Useful for representing hierarchies, partial orderings (e.g. in time, from general to special, from cause to effect)
Many applications: GO, MeSH, graphical models
21
Random Edge Graphs
  • n nodes, m edges
  • p(i,j) = 1/m
  • with high probability:
  • m < n/2: many disconnected components
  • m > n/2: one giant connected component, size ~ n
  • (next biggest: size ~ log(n))
  • degrees of separation: log(n)
  • Erdős and Rényi 1960
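The threshold behaviour around m = n/2 can be explored with a small simulation. This is a base R sketch under the assumption that the m edges are drawn uniformly without replacement from all node pairs; largest_component is a hypothetical helper, not a function of the graph or RBGL packages.

```r
# Size of the largest connected component of a random graph with
# n nodes and m edges drawn uniformly without replacement.
largest_component <- function(n, m) {
  pairs <- t(combn(n, 2))                      # all possible undirected edges
  edges <- pairs[sample(nrow(pairs), m), , drop = FALSE]
  comp <- seq_len(n)                           # component label per node
  for (k in seq_len(nrow(edges))) {
    a <- comp[edges[k, 1]]
    b <- comp[edges[k, 2]]
    if (a != b) comp[comp == b] <- a           # merge the two components
  }
  max(table(comp))
}

set.seed(1)
n <- 200
sparse <- largest_component(n, m = 50)    # m < n/2
dense  <- largest_component(n, m = 400)   # m > n/2
```

Repeating the call over a range of m and plotting the result against m/n is a simple way to see where the giant component appears.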

22
Random graphs
Random edge graph: randomEGraph(V, p, edges)
V: nodes
either p: probability per edge
or edges: number of edges
Random graph with latent factor: randomGraph(V, M, p, weights=TRUE)
V: nodes
M: latent factor
p: probability
For each node, generate a logical vector of length length(M), with P(TRUE)=p. Edges are between nodes that share >= 1 elements. Weights can be generated according to number of shared elements.
Random graph with predefined degree distribution: randomNodeGraph(nodeDegree)
nodeDegree: named integer vector; sum of all node degrees must be even
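The latent factor construction described above can be written out in a few lines of base R. This is a sketch of the idea only, not the implementation behind randomGraph; the function name latent_graph and the toy arguments are assumptions.

```r
# Latent-factor random graph (hypothetical helper): each node draws a
# logical membership vector over the levels of M; two nodes are joined
# if they share at least one level.
latent_graph <- function(V, M, p) {
  member <- matrix(as.numeric(runif(length(V) * length(M)) < p),
                   nrow = length(V), dimnames = list(V, M))
  shared <- member %*% t(member)   # number of shared levels per node pair
  diag(shared) <- 0                # no self-loops
  shared                           # edge weights; shared > 0 marks the edges
}

set.seed(42)
w <- latent_graph(V = letters[1:6], M = 1:4, p = 0.3)
edges_present <- w > 0
```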
23
Random edge graph
100 nodes, 50 edges
degree distribution
24
Random graphs versus permutation graphs
For statistical inference, one can consider null
hypotheses based on aforementioned random graph
models and ones based on node permutation of
data graphs. The second is often more
appropriate.
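A node permutation null can be sketched as follows in base R: the structure of one graph is kept fixed and only its node labels are shuffled. The helpers common_edges, perm_null and rand_adj, as well as the toy graphs, are assumptions made for illustration.

```r
# Number of edges shared by two undirected graphs, given as symmetric
# 0/1 adjacency matrices over the same node set.
common_edges <- function(A, B) sum(A & B) / 2

# Null distribution: relabel the nodes of B at random, keeping its structure.
perm_null <- function(A, B, nperm = 1000) {
  n <- nrow(A)
  replicate(nperm, {
    idx <- sample(n)
    common_edges(A, B[idx, idx])
  })
}

# Helper for the toy example: random undirected graph with m edges.
rand_adj <- function(n, m) {
  A <- matrix(0, n, n)
  pairs <- t(combn(n, 2))
  e <- pairs[sample(nrow(pairs), m), , drop = FALSE]
  A[e] <- 1
  A[e[, 2:1, drop = FALSE]] <- 1
  A
}

set.seed(7)
A <- rand_adj(30, 40)
B <- A                                  # identical graphs: maximal overlap
observed <- common_edges(A, B)
null <- perm_null(A, B, nperm = 200)
p_value <- mean(null >= observed)       # fraction of permutations as extreme
```

Because the permuted graph keeps its degree distribution and clustering, this null is closer to the data than a random edge graph with the same number of edges.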
25
Cohesive subgroups
For data graphs, the concept of clique is usually too restrictive (false negative or untested edges).
n-clique: distance between all members is <= n. (Clique: n=1)
k-plex: maximal subgraph G in which each member is neighbour of at least |G|-k others. (Clique: k=1)
k-core: maximal subgraph G in which each member is neighbour of at least k others. (Clique: k=|G|-1)
After: Social Network Analysis, Wasserman and Faust (1994)
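Of the three definitions, the k-core is the simplest to compute. The base R sketch below uses iterative peeling, a standard technique that is not taken from the slides; k_core and the toy graph are assumptions.

```r
# k-core by iterative peeling (hypothetical helper): repeatedly discard
# nodes with fewer than k neighbours among the nodes still kept.
k_core <- function(A, k) {
  keep <- rep(TRUE, nrow(A))
  repeat {
    deg <- rowSums(A[, keep, drop = FALSE])
    low <- keep & deg < k
    if (!any(low)) break
    keep[low] <- FALSE
  }
  rownames(A)[keep]
}

# Toy graph: a triangle a-b-c with a pendant node d attached to c.
nodes <- c("a", "b", "c", "d")
A <- matrix(0, 4, 4, dimnames = list(nodes, nodes))
edges <- rbind(c("a", "b"), c("b", "c"), c("a", "c"), c("c", "d"))
A[edges] <- 1
A[edges[, 2:1]] <- 1          # make the adjacency matrix symmetric

core2 <- k_core(A, 2)         # the triangle survives, d is peeled off
```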
26
graph, RBGL, Rgraphviz
graph: basic class definitions and functionality
RBGL: interface to graph algorithms
Rgraphviz: rendering functionality. Different layout algorithms. Node plotting, line type, color etc. can be controlled by the user.
27
Creating our first graph
> library("graph"); library(Rgraphviz)
> myNodes = c("s", "p", "q", "r")
> myEdges = list(
    s = list(edges = c("p", "q")),
    p = list(edges = c("p", "q")),
    q = list(edges = c("p", "r")),
    r = list(edges = c("s")))
> g = new("graphNEL", nodes = myNodes, edgeL = myEdges, edgemode = "directed")
> plot(g)
28
Querying nodes, edges, degree
> nodes(g)
[1] "s" "p" "q" "r"
> edges(g)
$s
[1] "p" "q"
$p
[1] "p" "q"
$q
[1] "p" "r"
$r
[1] "s"
> degree(g)
$inDegree
s p q r
1 3 2 1
$outDegree
s p q r
2 2 2 1
29
Graph manipulation
> g1 <- addNode("e", g)
> g2 <- removeNode("d", g)
> ## usage: addEdge(from, to, graph, weights)
> g3 <- addEdge("e", "a", g1, pi/2)
> ## usage: removeEdge(from, to, graph)
> g4 <- removeEdge("e", "a", g3)
> identical(g4, g1)
[1] TRUE
30
Adjacent and accessible nodes
> adj(g, c("b", "c"))
$b
[1] "b" "c"
$c
[1] "b" "d"
> acc(g, c("b", "c"))
$b
a c d
3 1 2
$c
a b d
2 1 1
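The numbers printed by acc are path lengths from the start node to each accessible node. How such a result can be obtained is sketched below with a breadth first search in base R; reach is a hypothetical helper and the adjacency list reuses the s, p, q, r graph created earlier, not the graph shown in this output.

```r
# Breadth first search over an adjacency list (named list of successors).
# Returns the number of steps to every node reachable from `start`.
reach <- function(adj, start) {
  dist <- setNames(0, start)
  queue <- start
  while (length(queue) > 0) {
    v <- queue[1]
    queue <- queue[-1]
    for (w in adj[[v]]) {
      if (is.na(dist[w])) {          # w has not been visited yet
        dist[w] <- dist[v] + 1
        queue <- c(queue, w)
      }
    }
  }
  dist[names(dist) != start]         # report all nodes except the start
}

# Directed graph from the "Creating our first graph" slide.
adj <- list(s = c("p", "q"), p = c("p", "q"), q = c("p", "r"), r = "s")
from_s <- reach(adj, "s")
from_r <- reach(adj, "r")
```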
32
Graph representations: from-to-matrix
> ft
     [,1] [,2]
[1,]    1    2
[2,]    2    3
[3,]    3    1
[4,]    4    4
> ftM2adjM(ft)
  1 2 3 4
1 0 1 0 0
2 0 0 1 0
3 1 0 0 0
4 0 0 0 1
33
GXL: graph exchange language
<gxl>
  <graph edgemode="directed" id="G">
    <node id="A"/>
    <node id="B"/>
    <node id="C"/>
    <edge id="e1" from="A" to="C">
      <attr name="weights"> <int>1</int> </attr>
    </edge>
    <edge id="e2" from="B" to="D">
      <attr name="weights"> <int>1</int> </attr>
    </edge>
  </graph>
</gxl>
from graph/GXL/kmstEx.gxl
GXL (www.gupro.de/GXL) is "an XML sublanguage designed to be a standard exchange format for graphs". The graph package provides tools for importing and exporting graphs as GXL.
34
RBGL: interface to the Boost Graph Library
Connected components
cc = connComp(rg)
table(listLen(cc))
 1  2  3  4 15 18
36  7  3  2  1  1
Choose the largest component
wh = which.max(listLen(cc))
sg = subGraph(cc[[wh]], rg)
Depth first search
dfsres = dfs(sg, node = "N14")
nodes(sg)[dfsres$discovered]
 [1] "N14" "N94" "N40" "N69" "N02" "N67" "N45" "N53"
 [9] "N28" "N46" "N51" "N64" "N07" "N19" "N37" "N35"
[17] "N48" "N09"
rg
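What the two steps compute can be illustrated without RBGL. The base R sketch below finds connected components by repeated depth first search on a small undirected adjacency list; dfs_order, components and the toy graph are assumptions, and RBGL delegates the real work to the Boost Graph Library.

```r
# Recursive depth first search on an adjacency list; returns the nodes
# in the order in which they are discovered.
dfs_order <- function(adj, start) {
  seen <- character(0)
  visit <- function(v) {
    seen <<- c(seen, v)
    for (w in adj[[v]]) if (!(w %in% seen)) visit(w)
  }
  visit(start)
  seen
}

# Connected components: start a search from every node not yet assigned.
components <- function(adj) {
  comps <- list()
  for (v in names(adj)) {
    if (!(v %in% unlist(comps))) {
      comps[[length(comps) + 1]] <- dfs_order(adj, v)
    }
  }
  comps
}

# Toy undirected graph with components of size 4, 2 and 1.
adj <- list(a = c("b", "c"), b = c("a", "d"), c = "a", d = "b",
            e = "f", f = "e", g = character(0))
cc <- components(adj)
sizes <- lengths(cc)
largest <- cc[[which.max(sizes)]]   # analogous to choosing the largest component
```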
35
Depth / breadth first search
dfs(sg, "N14")
36
Connected components
sc = strongComp(g2)
nattrs = makeNodeAttrs(g2, fillcolor="")
for(i in 1:length(sc))
  nattrs$fillcolor[sc[[i]]] = myColors[i]
plot(g2, "dot", nodeAttrs=nattrs)
37
Shortest path algorithms
Different algorithms for different types of graphs
o all edge weights the same
o positive edge weights
o real numbers
and different settings of the problem
o single pair
o single source
o single destination
o all pairs
Functions: bfs, dijkstra.sp, sp.between, johnson.all.pairs.sp
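For the case of positive edge weights and a single source, the textbook method is Dijkstra's algorithm. The following is a compact base R sketch on a weight matrix; it is not the code behind dijkstra.sp, and the function name dijkstra and the toy weights are assumptions.

```r
# Dijkstra's algorithm on a weight matrix W, where W[i, j] is the
# non-negative weight of edge i -> j and Inf marks a missing edge.
dijkstra <- function(W, source) {
  n <- nrow(W)
  dist <- setNames(rep(Inf, n), rownames(W))
  dist[source] <- 0
  done <- setNames(rep(FALSE, n), rownames(W))
  while (!all(done)) {
    cand <- dist
    cand[done] <- Inf
    u <- names(which.min(cand))            # closest unfinished node
    if (is.infinite(cand[u])) break        # remaining nodes are unreachable
    done[u] <- TRUE
    dist <- pmin(dist, dist[u] + W[u, ])   # relax all edges leaving u
  }
  dist
}

# Toy directed graph: A -> B (1), B -> C (2), A -> C (5); D is isolated.
nodes <- c("A", "B", "C", "D")
W <- matrix(Inf, 4, 4, dimnames = list(nodes, nodes))
W["A", "B"] <- 1
W["B", "C"] <- 2
W["A", "C"] <- 5
d <- dijkstra(W, "A")
```

With all weights equal, a breadth first search gives the same distances at lower cost; with negative weights this method is no longer valid.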
38
Shortest path
39
Shortest path
40
Minimal spanning tree
41
Connectivity
Consider graph g with single connected component.
Edge connectivity of g: minimum number of edges in g that can be cut to produce a graph with two components.
Minimum disconnecting set: the set of edges in this cut.
> edgeConnectivity(g)
$connectivity
[1] 2
$minDisconSet
$minDisconSet[[1]]
[1] "D" "E"
$minDisconSet[[2]]
[1] "D" "H"
43
Rgraphviz: the different layout engines
dot: directed graphs. Works best on DAGs and other graphs that can be drawn as hierarchies.
neato: undirected graphs using spring models
twopi: radial layout. One node (root) chosen as the center. Remaining nodes on a sequence of concentric circles about the origin, with radial distance proportional to graph distance. Root can be specified or chosen heuristically.
44
Rgraphviz: the different layout engines
45
Rgraphviz: the different layout engines
46
Combining R graphics and graphviz: custom node drawing functions
47
Combining graphviz layout and R plot
48
ImageMap
lg = agopen(g, ...)
imageMap(lg, con=file("imca-frame1.html", open="w"),
  tags=list(HREF = href, TITLE = title,
            TARGET = rep("frame2", length(AgNode(nag)))),
  imgname=fpng, width=imw, height=imh)
Show Drosophila interaction network example
49
Application: comparing gene co-expression and protein interaction data
Nodes: all yeast genes
Graph 1: co-expression clusters from yeast cell cycle microarray time course
Graph 2: protein interactions reported in the literature
Graph 3: protein interactions found in a yeast-two-hybrid experiment
Questions: Do the graphs overlap more than random? Is there anything special about overlapping edges?
50
Application: comparing gene co-expression and protein interaction data
51
Application: comparing gene co-expression and protein interaction data
nPdist: number of common edges as computed by a node label permutation model.
Number observed in data: 42
52
Further questions for exploratory data analysis
Which expression clusters have intersections with which of the literature clusters?
Are known cell-cycle regulated protein complexes indeed clustered together in both graphs?
Are there expression clusters that have a number of literature cluster edges going between them, suggesting that expression clustering was too fine, or that literature clusters are not cell-cycle regulated?
Is the expression behavior of genes that are involved in multiple protein complexes different from that of genes that are involved in only one complex?
53
Generalization
Nothing in the preceding treatment was specific to physical protein interactions or microarray clustering. Can use similar reasoning for many other graphs, e.g. genomic vicinity, domain composition similarity.
54
Application: Using GO to interpret gene lists
55
Using GO to interpret gene lists
Packages: GOstats, Rgraphviz
56
Using GO to interpret gene lists
57
Gene-Literature graphs
DKC1
58
The bipartite gene-literature graph: actor and event size adjustment
actors: genes
actor size: number of papers that a gene appears in
event: paper
event size: number of genes that appear in a paper
Example: R. Strausberg et al. Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences. PNAS 99:16899-903, 2002 cites 15,000 genes
59
Are two genes remarkably often co-cited?
Note: usually one count (w.l.o.g. n22) is much larger than all the others. Use test statistics that do not depend on n22.
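One statistic with this property is the concordance n11 / (n11 + n12 + n21). It is offered here as an example and is not necessarily the statistic used in the talk. The base R sketch assumes a paper-by-gene incidence matrix and the cell labelling given in the comments; cocite_table, concordance and the toy data are invented for illustration.

```r
# 2 x 2 co-citation table for genes g1 and g2 from a paper-by-gene
# incidence matrix (rows: papers, columns: genes).
cocite_table <- function(inc, g1, g2) {
  x <- inc[, g1] > 0
  y <- inc[, g2] > 0
  c(n11 = sum(x & y),     # papers citing both genes
    n12 = sum(x & !y),    # papers citing g1 only
    n21 = sum(!x & y),    # papers citing g2 only
    n22 = sum(!x & !y))   # papers citing neither (typically very large)
}

# Concordance: does not involve n22 by construction.
concordance <- function(tab) {
  unname(tab["n11"] / (tab["n11"] + tab["n12"] + tab["n21"]))
}

# Toy incidence matrix with 6 papers and 2 genes.
inc <- cbind(geneA = c(1, 1, 1, 0, 0, 0),
             geneB = c(1, 1, 0, 1, 0, 0))
tab <- cocite_table(inc, "geneA", "geneB")
conc <- concordance(tab)

# Adding 100 papers that cite neither gene changes n22 only.
inc_big <- rbind(inc, matrix(0, 100, 2))
conc_big <- concordance(cocite_table(inc_big, "geneA", "geneB"))
```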
60
Closing gene lists with literature
Boundary of gene list L: set of all genes that have co-citation (above threshold weight) with genes in L.
(Figure: co-citation graph with nodes Gene 1, Gene X, Gene 2, Gene 3, Gene Y, Gene 4, Gene 5)
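The boundary definition translates into a short base R function. The sketch assumes a symmetric matrix of co-citation weights and reads "boundary" as the genes outside L; the function name boundary, the threshold values and the toy weights are invented and do not reproduce the figure.

```r
# Boundary of gene list L: genes outside L whose co-citation weight with
# at least one gene in L exceeds the threshold.
boundary <- function(W, L, threshold = 0) {
  outside <- setdiff(rownames(W), L)
  hit <- apply(W[outside, L, drop = FALSE] > threshold, 1, any)
  outside[hit]
}

# Toy symmetric co-citation weights (invented values).
genes <- c("g1", "g2", "g3", "x", "y")
W <- matrix(0, 5, 5, dimnames = list(genes, genes))
W["g1", "x"] <- W["x", "g1"] <- 3
W["g2", "y"] <- W["y", "g2"] <- 1
W["x", "y"]  <- W["y", "x"]  <- 5

L <- c("g1", "g2", "g3")
b_all    <- boundary(W, L)                 # any co-citation counts
b_strong <- boundary(W, L, threshold = 2)  # only weights above 2 count
```

Adding the boundary to L and repeating the step until no new genes appear would give one possible notion of a closed gene list.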
61
A pathway graph
62
A pathway graph
63
CGH aberration data
Genetic aberrations
From B. Gunawan et al., Cancer Res. 63:6200-6205 (2003)
Tumours
64
Graphical model for CGH aberration data
oncotree package by Anja von Heydebreck
Summary
Graphs are a natural way to represent relationships, just as numbers are a natural way to represent quantities.
Three main applications:
(1) to represent data (e.g. PPI)
(2) to represent knowledge (e.g. GO)
(3) to represent high-dimensional probability distributions
Bioconductor provides a rich set of tools mainly for (1) and (2). Various parts of R for (3), see also gR project.
There are still many challenges that call for methods to model uncertainty, make inference, and predictions.
66
Further exercises
Fine control of graph rendering
GOstats example