Multilinear Algebra for Analyzing Data with Multiple Linkages

1
Multilinear Algebra for Analyzing Data with
Multiple Linkages
  • Tamara G. Kolda
  • plus Brett Bader, Danny Dunlavy, Philip
    Kegelmeyer
  • Sandia National Labs
  • TRICAP 2006, Chania, Greece, June 4-9, 2006

2
Linear Algebra for Data with Linkages
Circle-Square Matrix
SVD Rank-k Approximation (k = 2)
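A minimal NumPy sketch of the rank-k approximation named on this slide. The 4 x 3 link matrix below is made up for illustration and is not the matrix from the original figure.

```python
import numpy as np

def rank_k_approx(X, k):
    """Best rank-k approximation of X in the Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep only the k largest singular triplets.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Hypothetical link matrix: rows are "circles", columns are "squares".
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
X2 = rank_k_approx(X, k=2)
print(np.round(X2, 3))
```

The approximation error in the Frobenius norm equals the norm of the discarded singular values, here just the third one.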
3
Latent Semantic Indexing (LSI) for Text Retrieval
Term-Document Matrix
SMART Retrieval System: G. Salton (1971)
LSI: S. Dumais et al. (1988)
Query
Car Service
  • S. T. Dumais, G. W. Furnas, T. K. Landauer,
    S. Deerwester, and R. Harshman. Using latent
    semantic analysis to improve access to textual
    information. In CHI '88, pp. 281-285, 1988.
  • S. C. Deerwester, S. T. Dumais, T. K. Landauer,
    G. W. Furnas, and R. A. Harshman. Indexing by
    latent semantic analysis. J. Am. Soc. Inform.
    Sci., 41(6):391-407, 1990.
  • M. W. Berry, S. T. Dumais, and G. W. O'Brien.
    Using linear algebra for intelligent information
    retrieval. SIAM Rev., 37(4):573-595, 1995.

4
Applications of LSI
Graph the Results using U2 and V2
[Figure: the terms (car, service, military, repair) and the documents (d1, d2, d3) plotted in the two-dimensional space given by U2 and V2.]
Similarity matrices: Term-Document, Term-Term, Document-Document
5
Caveats for LSI
  • How to use ??
  • Term-document matrix weighting is critical!

Local weight: log, where f_ij is the frequency of term i in document j
Global term weight: inverse document frequency, where N = total docs and n_i = docs with term i
Normalization factor: cosine
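The three weighting steps can be sketched as follows. The exact formulas did not survive in the transcript, so this sketch assumes one common choice: local weight log(1 + f_ij), global weight log(N / n_i), and cosine normalization of each document column. The count matrix is hypothetical.

```python
import numpy as np

def weight_term_doc(F):
    """Log local weight, IDF global weight, cosine-normalized columns.

    F[i, j] is the raw frequency of term i in document j.
    The specific formulas are assumptions, not taken from the slide.
    """
    F = np.asarray(F, dtype=float)
    N = F.shape[1]                               # total number of documents
    n = np.count_nonzero(F, axis=1)              # documents containing term i
    local = np.log1p(F)                          # log(1 + f_ij)
    global_w = np.log(N / np.maximum(n, 1))      # guard against unused terms
    W = local * global_w[:, None]
    norms = np.linalg.norm(W, axis=0)
    norms[norms == 0] = 1.0                      # leave all-zero documents alone
    return W / norms

# Hypothetical counts: rows = car, service, military, repair; columns = d1, d2, d3.
F = np.array([[2, 1, 0],
              [1, 0, 1],
              [0, 0, 3],
              [0, 1, 0]])
W = weight_term_doc(F)
print(np.round(W, 3))
```

Under this choice a term that occurs in every document gets global weight zero, which is one reason the weighting changes the LSI result so strongly.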
6
Citation/Link Analysis (Same Nodes)
Link Matrix
Hub Scores
Doc 1 is the most important hub!
Co-Citation Matrix
Authority Scores
Examples: citation data, Web links
Doc 3 is the most important authority!
Co-Reference Matrix
J. M. Kleinberg. Authoritative sources in a
hyperlinked environment. J. ACM, 46(5):604-632,
1999.
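A small sketch of hub and authority scores in the sense of Kleinberg's method: with link matrix L (L[i, j] = 1 if doc i links to doc j), hub scores are the principal eigenvector of L L^T and authority scores are the principal eigenvector of L^T L, which are the leading left and right singular vectors of L. The four-document graph is hypothetical, chosen so that doc 1 is the top hub and doc 3 the top authority, as in the slide's example.

```python
import numpy as np

def hits_scores(L):
    """Hub and authority scores from the leading singular vectors of L."""
    U, s, Vt = np.linalg.svd(L)
    # Singular vectors are defined up to sign; scores are taken nonnegative.
    hubs = np.abs(U[:, 0])
    authorities = np.abs(Vt[0, :])
    return hubs, authorities

# Hypothetical links: doc 1 -> docs 2, 3, 4; doc 2 -> doc 3; doc 4 -> doc 3.
L = np.array([[0, 1, 1, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)
hubs, authorities = hits_scores(L)
print("top hub: doc", np.argmax(hubs) + 1)
print("top authority: doc", np.argmax(authorities) + 1)
```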
7
Multiple Links?
Suppose the connections between nodes are
labeled in some fashion. In other words, we
have meta-data on the connections. Can we
somehow use multilinear algebra for link analysis?
8
PARAFAC
  • PARAFAC: Parallel Factors
  • a.k.a. CANDECOMP: Canonical Decomposition
  • Higher-order analogue of the SVD
  • Columns of A, B, and C are not orthonormal
  • If R is minimal, then R is called the rank of the
    tensor (Kruskal 1977)
  • Can have rank(X) > min{I, J, K}
  • Often guaranteed to be a unique rank
    decomposition!

[Figure: the I x J x K tensor X expressed through the factor matrices A (I x R), B (J x R), and C (K x R) and an R x R x R identity tensor I.]
  • R. A. Harshman. Foundations of the PARAFAC
    procedure: Models and conditions for an
    "explanatory" multi-modal factor analysis. UCLA
    Working Papers in Phonetics, 16:1-84, 1970.
  • J. D. Carroll and J. J. Chang. Analysis of
    individual differences in multidimensional
    scaling via an N-way generalization of
    "Eckart-Young" decomposition. Psychometrika,
    35:283-319, 1970.

9
Many ways to write PARAFAC
Kruskal Operator
Easy to write N-way case
J. B. Kruskal. Three-way arrays: Rank and
uniqueness of trilinear decompositions, with
application to arithmetic complexity and
statistics. Linear Algebra Appl., 18(2):95-138,
1977.
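The slide's formulas are not in the transcript, so the following is only a sketch of what a Kruskal operator computes: the sum of R outer products of matching columns of the factor matrices, for any number of modes. The function name `kruskal` and the random factors are illustrative choices.

```python
import numpy as np
from functools import reduce

def kruskal(factors, weights=None):
    """Assemble a tensor from factor matrices that all have R columns.

    X = sum_r w_r * (a_r o b_r o c_r o ...), where "o" is the outer product.
    """
    R = factors[0].shape[1]
    w = np.ones(R) if weights is None else np.asarray(weights, dtype=float)
    X = np.zeros([U.shape[0] for U in factors])
    for r in range(R):
        # Outer product of the r-th column from every mode.
        X += w[r] * reduce(np.multiply.outer, [U[:, r] for U in factors])
    return X

rng = np.random.default_rng(0)
A, B, C = rng.random((4, 2)), rng.random((3, 2)), rng.random((2, 2))
X = kruskal([A, B, C])
print(X.shape)
```

Because the function only loops over a list of factor matrices, the N-way case needs no extra code.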
10
Properties of the Kruskal Operator
PARAFAC core for a Tucker decomposition
Matricize (arbitrary map of indices to rows and
columns)
Mode-n matricize
Norm of a PARAFAC decomposition
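Two of these properties can be checked numerically. The sketch below uses the standard identities, which may differ in notation from the slide: the mode-1 matricization of a three-way PARAFAC model is A (C ⊙ B)^T, with ⊙ the Khatri-Rao product, and the squared norm is the sum of the entries of the elementwise product of the Gram matrices A^T A, B^T B, C^T C. Helper names are illustrative.

```python
import numpy as np

def khatri_rao(P, Q):
    """Column-wise Kronecker product; row index runs over (p, q) with q fastest."""
    return np.einsum('ir,jr->ijr', P, Q).reshape(P.shape[0] * Q.shape[0], -1)

def kruskal_norm(factors):
    """Frobenius norm of a PARAFAC model without forming the full tensor."""
    R = factors[0].shape[1]
    G = np.ones((R, R))
    for U in factors:
        G *= U.T @ U          # Hadamard product of Gram matrices
    return np.sqrt(G.sum())

rng = np.random.default_rng(1)
I, J, K, R = 5, 4, 3, 2
A, B, C = rng.random((I, R)), rng.random((J, R)), rng.random((K, R))
X = np.einsum('ir,jr,kr->ijk', A, B, C)

# Mode-1 matricization with the j index varying fastest along the columns.
X1 = X.reshape(I, J * K, order='F')
print(np.allclose(X1, A @ khatri_rao(C, B).T))
print(np.isclose(kruskal_norm([A, B, C]), np.linalg.norm(X)))
```

The norm formula costs only a few R x R products, which matters when the full tensor is too large to form.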
11
PARAFAC for sparse data approximations
  • Our interest in the mathematical operations is
    motivated on two fronts:
  • (1) Sparse computations
  • (2) Using tensor decompositions for approximation
  • Ex.: Considering how to efficiently implement
    PARAFAC-ALS for sparse data
  • Can PARAFAC be used for the best rank-k
    approximation, rather than finding an exact
    decomposition (excepting noise)?
  • What does it even mean in this case??
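One way to picture PARAFAC-ALS on sparse data is sketched below. It is not the authors' implementation: it is a textbook alternating least squares loop in which the expensive step (the matricized tensor times Khatri-Rao product) touches only the stored nonzeros. The coordinate layout `subs`/`vals` and all function names are assumptions for illustration.

```python
import numpy as np

def mttkrp_sparse(subs, vals, factors, mode):
    """Matricized-tensor-times-Khatri-Rao product using only the nonzeros.

    subs: (nnz, N) integer indices, vals: (nnz,) values of the sparse tensor.
    """
    R = factors[0].shape[1]
    prod = np.repeat(vals[:, None], R, axis=1)
    for m, U in enumerate(factors):
        if m != mode:
            prod = prod * U[subs[:, m], :]     # factor rows picked per nonzero
    M = np.zeros((factors[mode].shape[0], R))
    np.add.at(M, subs[:, mode], prod)          # scatter-add into output rows
    return M

def cp_als_sparse(subs, vals, shape, R, n_iter=50, seed=0):
    """Plain ALS: update one factor at a time by linear least squares."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((n, R)) for n in shape]
    for _ in range(n_iter):
        for mode in range(len(shape)):
            G = np.ones((R, R))
            for m, U in enumerate(factors):
                if m != mode:
                    G *= U.T @ U               # Hadamard product of Grams
            M = mttkrp_sparse(subs, vals, factors, mode)
            factors[mode] = M @ np.linalg.pinv(G)
    return factors

# Hypothetical sparse rank-1 tensor with 4 nonzeros out of 24 entries.
a0 = np.array([1.0, 0.0, 2.0, 0.0])
b0 = np.array([0.0, 3.0, 0.0])
c0 = np.array([1.0, 1.0])
X = np.einsum('i,j,k->ijk', a0, b0, c0)
subs, vals = np.argwhere(X != 0), X[X != 0]

A, B, C = cp_als_sparse(subs, vals, X.shape, R=1)
X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
print(np.linalg.norm(X - X_hat))
```

The dense `X` is formed here only to build and check the toy example; the ALS loop itself works from `subs` and `vals`.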

12
Multilink Analysis using PARAFAC
  • Quick Review: Tensors for Web Link Analysis
  • page x page x anchor text (TOPHITS)
  • New work: Tensors for Publication Data Analysis
  • Case 1: doc x doc x similarity
  • Case 2: term x doc x author (HO-LSA??)

13
TOPHITS: PARAFAC for Web Link Analysis
Graph representation shows basic connectivity
A set of four hyperlinked web pages
Labeled edges capture context
14
Analyzing Publication Data: Doc x Doc x
Similarity Representation
15
Computing Different Doc-Doc Similarities
Computing term-based similarities (k = 1, 2, 3)
  • 5022 papers
  • 16617 unique terms (ignoring stop words, words
    with length less than 3 or greater than 30
    characters, and words that appear fewer than 2
    times)
  • Titles: 5164
  • Abstracts: 15752
  • Keywords: 5248
  • 6891 authors
  • 2659 citations

Enforces sparseness!
Computing author similarities (k = 4)
16
PARAFAC for Doc x Doc x Similarity
Central idea: Each triplet provides a core
grouping of the data, i.e., a specific topic.
  • H = hubs
  • A = authorities
  • C = connections
  • Rank-30 decomposition

17
Sample Grouping 1
18
Sample Grouping 10
19
Applications of the H,A,C Decomposition
  • Latent document similarities
  • Calculate S = ½ HH^T + ½ AA^T
  • Analyzing a body of work
  • c_h = hub centroid, c_a = authority centroid
  • s = ½ H c_h + ½ A c_a
  • Disambiguation (EXAMPLE)
  • Calculate centroids using A (could also use H or
    AH)
  • Calculate similarities of centroids
  • Journal prediction
  • Use matrix A as features for input to a decision
    tree ensemble classifier
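The first two applications reduce to a few matrix products. In the sketch below H and A are random stand-ins for the hub and authority factors, and each centroid is assumed to be the mean of the factor rows for the documents in the body of work; the slide does not spell out that definition.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, R = 6, 3
H = rng.random((n_docs, R))   # stand-in hub factors, one row per document
A = rng.random((n_docs, R))   # stand-in authority factors

# Latent document-document similarities.
S = 0.5 * H @ H.T + 0.5 * A @ A.T

# Body of work: a hypothetical set of documents, e.g. by one author.
body = [0, 2, 5]
c_h = H[body].mean(axis=0)    # hub centroid (assumed: mean of rows)
c_a = A[body].mean(axis=0)    # authority centroid
s = 0.5 * H @ c_h + 0.5 * A @ c_a

ranking = np.argsort(-s)      # documents most similar to the body of work first
print(ranking)
```

With mean centroids, s is simply the average of the columns of S that belong to the body of work, so the two formulas on the slide are consistent with each other.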

20
Example of Disambiguation Results
Two authors with missing middle initials.
3 possible matches
Matrix of Similarities
21
Analyzing Publication Data: Term x Doc x Author
Representation
Terms must appear in at least 3 documents and no
more than 10% of all documents. Moreover, each term
must have at least 2 characters and no more than 30.
767 documents, 2251 terms, 1072 authors, 59738
nonzeros
Element (i,j,k) is nonzero only if author k wrote
document j using term i.
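A sketch of how such a tensor might be stored in coordinate form. The toy corpus, the author labels, and the use of term counts as values are all assumptions; the slide only states when an element is nonzero.

```python
# Hypothetical toy corpus: per-document term counts and author lists.
corpus = {
    "d1": ({"tensor": 2, "rank": 1}, ["author_a", "author_b"]),
    "d2": ({"tensor": 1, "graph": 3}, ["author_a"]),
    "d3": ({"graph": 1}, ["author_c", "author_d"]),
}

def index_of(items):
    """Map each distinct name to a position along one tensor mode."""
    return {name: n for n, name in enumerate(sorted(items))}

term_ix = index_of({t for terms, _ in corpus.values() for t in terms})
doc_ix = index_of(corpus)
author_ix = index_of({a for _, authors in corpus.values() for a in authors})

# entries[(i, j, k)] is nonzero only if author k wrote document j using term i.
entries = {}
for doc, (terms, authors) in corpus.items():
    for term, count in terms.items():
        for author in authors:
            key = (term_ix[term], doc_ix[doc], author_ix[author])
            entries[key] = float(count)   # assumed value: raw term count

shape = (len(term_ix), len(doc_ix), len(author_ix))
print(shape, len(entries))
```

Each document contributes (number of terms) x (number of authors) nonzeros, so the tensor stays very sparse when most documents have few authors.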
22
Different Graph Interpretations for Term x Doc x
Author
  • term-doc with author links
  • term-author with doc links
  • author-doc with term links
  • term-doc-author with links

Term
Doc
Different author links represented by different
colors
23
Author Data is Too Sparse
Result: The tensor has just a few nonzero
columns in each lateral slice.
[Figure: tensor with term, doc, and author modes.]
Experimentally, PARAFAC seems to overfit such
data and not do a good job of mixing different
authors.
24
Idea: Use Tucker Transformation to Compress
We transform the tensor to a smaller tensor as
follows
or, equivalently
This transformation forces the authors to be
mixed and produces a dense result. Main problem:
How to transform a sparse tensor without creating
dense intermediate results?
Compute rank-25 PARAFAC on compressed tensor and
transform.
25
Tucker + PARAFAC
  • Want PARAFAC for X in term x doc x author space
  • First, apply dimensionality reduction to X to
    obtain Y
  • Y in conceptual space
  • Next, compute PARAFAC on Y
  • Finally, reassemble results to yield PARAFAC for
    X
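The compress, decompose, reassemble steps can be sketched on a small dense example. The transcript does not say how the compression matrices are chosen, so this sketch assumes they are leading left singular vectors of each mode's matricization; `U`, `V`, `W` and the function names are illustrative. The PARAFAC step on Y is replaced by a stand-in with known factors to keep the example short.

```python
import numpy as np

def leading_left_singular_vectors(X, mode, r):
    """Orthonormal basis for the dominant r-dimensional subspace of one mode."""
    Xn = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)
    U, _, _ = np.linalg.svd(Xn, full_matrices=False)
    return U[:, :r]

def tucker_compress(X, U, V, W):
    """Y = X x1 U^T x2 V^T x3 W^T (multiply every mode by a transposed basis)."""
    return np.einsum('ijk,ia,jb,kc->abc', X, U, V, W)

def lift_factors(Ay, By, Cy, U, V, W):
    """Map PARAFAC factors of Y back to the original index spaces."""
    return U @ Ay, V @ By, W @ Cy

# Hypothetical tensor with an exact rank-2 PARAFAC structure.
rng = np.random.default_rng(0)
A0, B0, C0 = rng.random((8, 2)), rng.random((7, 2)), rng.random((6, 2))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)

U, V, W = (leading_left_singular_vectors(X, m, 2) for m in range(3))
Y = tucker_compress(X, U, V, W)          # small 2 x 2 x 2 tensor

# Stand-in for running PARAFAC on Y: these factors reproduce Y exactly.
Ay, By, Cy = U.T @ A0, V.T @ B0, W.T @ C0
A, B, C = lift_factors(Ay, By, Cy, U, V, W)
print(np.allclose(np.einsum('ir,jr,kr->ijk', A, B, C), X))
```

This dense version deliberately ignores the main problem stated above, namely carrying out the transformation on a sparse tensor without dense intermediate results.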

26
Three-Way Fingerprints
  • Each of the Terms, Docs, and Authors has a rank-k
    (k = 25) fingerprint from the PARAFAC approximation
  • All items can be directly compared in concept
    space
  • Thus, we can compare any of the following
  • Term-Term
  • Doc-Doc
  • Term-Doc
  • Author-Author
  • Author-Term
  • Author-Doc
  • The fingerprints can be used as inputs for
    clustering, classification, etc.
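Comparing fingerprints across modes can be sketched with cosine similarity between rows of the factor matrices. The similarity measure is an assumption (the transcript does not name one), and the random matrices stand in for the term, doc, and author factors.

```python
import numpy as np

def cosine_similarity(P, Q):
    """Cosine similarity between every row of P and every row of Q."""
    Pn = P / np.maximum(np.linalg.norm(P, axis=1, keepdims=True), 1e-12)
    Qn = Q / np.maximum(np.linalg.norm(Q, axis=1, keepdims=True), 1e-12)
    return Pn @ Qn.T

rng = np.random.default_rng(0)
k = 25                                   # fingerprint length
terms = rng.standard_normal((10, k))     # stand-in term fingerprints
docs = rng.standard_normal((8, k))       # stand-in doc fingerprints
authors = rng.standard_normal((5, k))    # stand-in author fingerprints

# Any pair of modes can be compared because all rows live in the same space.
author_term = cosine_similarity(authors, terms)
author_doc = cosine_similarity(authors, docs)

# Three terms closest to each author.
closest_terms = np.argsort(-author_term, axis=1)[:, :3]
print(closest_terms)
```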

27
Sample Results: Term
28
Sample Results: Term
29
Sample Results: Author
30
Summary & Future Work
  • PARAFAC provides a technique for analyzing
    semantic graphs
  • Third dimension captures different connection
    types
  • Or may consider it as the interconnection of 3
    different node types
  • Analyzed journal articles using different tensor
    representations
  • Doc x Doc x Connection
  • Need to make definitive case of why 3D is better
    than 2D
  • Term x Doc x Author
  • Too sparse?
  • Still working towards large-scale, sparse
    problems
  • Need implicit compression for PARAFAC
  • 5M nonzeros
  • Other decompositions?
  • Other hybrids
  • Symmetry

31
Acknowledgments & More Information
Thank You!
  • Thanks to
  • Brett Bader, Danny Dunlavy, Philip Kegelmeyer
  • Web data: Joe Kenny, Travis Bauer et al., Ken
    Kolda
  • Journal data: Kevin Boyack
  • Graph viz: Ann Yoshimura
  • Related papers
  • Algorithm xxx: MATLAB Tensor Classes for Fast
    Algorithm Prototyping (with B. W. Bader), ACM
    TOMS, to appear.
  • Multilinear algebra for analyzing data with
    multiple linkages (with D. Dunlavy and W. P.
    Kegelmeyer), Technical Report SAND2006-2079, Apr.
    2006.
  • Temporal analysis of social networks using
    three-way DEDICOM (with B.W. Bader and
    R.Harshman), Technical Report SAND2006-2161, Apr.
    2006.
  • Multilinear operators for higher-order
    decompositions. Technical Report SAND2006-2081,
    Apr. 2006.
  • The TOPHITS model for higher-order web link
    analysis (with B. Bader), in Proc. Workshop on
    Link Analysis, Counterterrorism and Security,
    SDM06, Apr. 2006
  • Higher-order web link analysis using multilinear
    algebra (with B. W. Bader), ICDM 2005, pp. 242-249,
    Nov. 2005.
  • Contact Info
  • tgkolda_at_sandia.gov
  • http://csmr.ca.sandia.gov/tgkolda/