FINA - PowerPoint PPT Presentation

Provided by: cseOhi

1
  • FINA
  • Fast Index News Archive
  • Progress Report

2
Problems addressed this week
  • LSI Program
  • SVD Processing Time
  • Memory requirements
  • Similarity metric calculation between documents
  • Big numbers for all results
  • Implement a non-linearized document matrix to
    compare results with the linearized version

3
Resolved SVD bottleneck
  • Tried different implementations of SVD
      • one in Java that promised sparse matrix
        operations, but its SVD used a dense matrix
        implementation
      • one in Java that imported a package from
        Fortran and had little documentation
      • one in C that had rather difficult syntax
  • Chose the SVDLIB tool to compute the SVD (C
    implementation)
      • very fast
      • implementation optimized for sparse matrices
      • the calculation can be limited to a fixed number
        of desired singular values (in our case, 200)
      • easy to use
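
The benefit of asking for only a fixed number of singular values can be illustrated with a minimal sketch. It assumes a sparse matrix stored as a list of `{column: value}` dictionaries and uses plain power iteration to find just the largest singular value. Power iteration is a simpler, swapped-in technique chosen for illustration, not a description of what SVDLIB does internally, and the name `top_singular_triplet` is hypothetical.

```python
import math
import random


def top_singular_triplet(rows, n_cols, iters=200, seed=0):
    """Estimate the largest singular value of a sparse matrix A.

    rows   : list of dicts {column_index: value}, one dict per matrix row
    n_cols : number of columns of A
    Returns (sigma, v) where v is the matching right singular vector.
    Only nonzero entries are visited, so the cost per iteration is
    proportional to the number of nonzeros, not rows * columns.
    """
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(n_cols)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    sigma = 0.0
    for _ in range(iters):
        # u = A v
        u = [sum(val * v[j] for j, val in row.items()) for row in rows]
        # w = A^T u = (A^T A) v
        w = [0.0] * n_cols
        for ui, row in zip(u, rows):
            for j, val in row.items():
                w[j] += val * ui
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
        # For a unit vector v, ||A^T A v|| converges to sigma^2.
        sigma = math.sqrt(norm)
    return sigma, v
```

Getting the next singular values this way would need deflation or a block method, which is why a dedicated sparse SVD tool is the practical choice for 200 values.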

4
Speed and Memory Improvements in LSA program
  • Change data types in computing the initial matrix
      • double -> float
      • int -> short
  • We compared the matrices obtained in both cases.
  • Approximation errors were minimal, so we kept
    these new data types.
  • As the matrix grows bigger, saving memory becomes
    an issue.
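
The trade-off described above can be sketched with Python's standard `array` module, whose type codes map to C types (`"d"` for double, `"f"` for float, `"h"` for short). The sample values below are hypothetical stand-ins for matrix entries, not data from the project.

```python
from array import array


def max_abs_error(values):
    """Largest absolute error introduced by storing doubles as C floats."""
    as_float = array("f", values)  # single precision copy
    return max(abs(a - b) for a, b in zip(values, as_float))


# Hypothetical matrix entries and term counts, for illustration only.
weights = [0.1 * i for i in range(1000)]
counts = [i % 50 for i in range(1000)]

doubles = array("d", weights)  # C double
floats = array("f", weights)   # C float
shorts = array("h", counts)    # C short

bytes_double = doubles.itemsize * len(doubles)
bytes_float = floats.itemsize * len(floats)
bytes_short = shorts.itemsize * len(shorts)
```

On common platforms the float copy takes half the space of the double copy, while `max_abs_error(weights)` stays several orders of magnitude below the values themselves.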

5
SVD statistics
  • SVDLIB times and statistics
  • First matrix (3462 x 1312)
      • MATRIX DENSITY: 3.36%
      • MAX. NO. OF EIGENPAIRS: 200
      • ELAPSED CPU TIME: 9.52 sec.
      • (9.59 sec. with double format. In S, the
        difference is in only one element, at the 5th
        decimal place. There are also very small
        differences in the U and V matrices.)
  • 10K matrix (9367 x 10047)
      • MATRIX DENSITY: 1.46%
      • MAX. NO. OF EIGENPAIRS: 200
      • ELAPSED CPU TIME: 38.77 sec.
      • (computing all the eigenvalues takes approx.
        15 hours)
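
To put the density figures in perspective, the number of nonzero entries can be estimated from the dimensions. This is a rough sketch that assumes the density is a percentage of all cells; because the reported densities are rounded, the results are approximate.

```python
def nonzero_estimate(rows, cols, density_percent):
    """Approximate count of nonzero cells in a rows x cols matrix."""
    return round(rows * cols * density_percent / 100)


first_matrix_nonzeros = nonzero_estimate(3462, 1312, 3.36)   # about 153 thousand
big_matrix_nonzeros = nonzero_estimate(9367, 10047, 1.46)    # about 1.37 million
```

The 10K matrix has roughly 94 million cells but only about 1.4 million of them are nonzero, which is why a sparse implementation matters.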

6
Linearization of the Document Matrix
  • The DS matrix, which represents the document
    vectors, initially has values in [-1, 1] (the
    vectors have norm 1)
  • We decided to keep short int data types, which
    are much faster and take less space
  • Each value DS(i,j) in the matrix has been
    transformed to
      • (int)((DS(i,j) * 126) + 128)
  • New values are between 2 and 254
  • With this linearization the cosine is no longer
    just the inner product

7
COS calculation
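
One way a cosine could be computed directly on linearized values is sketched below. It assumes the transform `int(x * 126 + 128)` from the previous slide: subtracting the offset 128 from each stored value gives a vector that is approximately 126 times the original, and the scale factor cancels in the cosine ratio. The function names are hypothetical and this is not necessarily the formula used in FINA.

```python
import math

SCALE = 126   # scale factor from the linearization transform
OFFSET = 128  # offset from the linearization transform


def linearize(vec):
    """Map values in [-1, 1] to integers in [2, 254]."""
    return [int(x * SCALE + OFFSET) for x in vec]


def cosine_linearized(q, r):
    """Cosine between two linearized vectors.

    The offset is removed first; the common scale factor cancels
    between the numerator and the denominator.
    """
    a = [x - OFFSET for x in q]
    b = [y - OFFSET for y in r]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two unit vectors whose true cosine is 0.96.
q = linearize([0.6, 0.8])
r = linearize([0.8, 0.6])
similarity = cosine_linearized(q, r)
```

The truncation to integers is where the loss of precision comes from; in this small example the rounded values happen to preserve the cosine.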

8
Linearized vs. Non-linearized Document Matrix
  • Linearized
      • 10,047 records
      • 200 smallint columns (2 bytes each)
      • Loss of precision
  • Non-Linearized
      • 10,047 records
      • 200 real columns (4 bytes each)
  • We implemented both Linearized and Non-Linearized
    searches and compared the results
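
The raw storage for the two layouts follows directly from the figures above. This sketch counts only the column data and ignores any per-row or index overhead a database would add.

```python
RECORDS = 10_047
DIMENSIONS = 200


def table_bytes(records, dimensions, bytes_per_value):
    """Bytes needed for the vector columns alone."""
    return records * dimensions * bytes_per_value


linearized_bytes = table_bytes(RECORDS, DIMENSIONS, 2)      # smallint columns
non_linearized_bytes = table_bytes(RECORDS, DIMENSIONS, 4)  # real columns
```

That is about 4.0 MB for the linearized table against about 8.0 MB for the non-linearized one.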

9
How the search is executed in FINA
  • Linear Search
  • Compute the inner product between a query and
    each document in the DB
  • Retrieve the top K similar documents
  • A scan of all the records cannot be avoided
  • In general, results are more accurate than with a
    clustering approach
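
The linear search described above can be sketched as follows. The in-memory dictionary stands in for the database table, and the name `top_k` is hypothetical; the actual FINA search runs against the DB.

```python
import heapq


def top_k(query, documents, k):
    """Score every document against the query and keep the k best.

    documents : dict mapping document id -> feature vector
    Returns a list of (score, doc_id) pairs, highest score first.
    Every record is scanned once, as in any linear search.
    """
    scored = (
        (sum(q * d for q, d in zip(query, vector)), doc_id)
        for doc_id, vector in documents.items()
    )
    return heapq.nlargest(k, scored)


# Tiny illustrative collection with 2-dimensional vectors.
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.6, 0.8]}
best = top_k([1.0, 0.0], docs, 2)
```

Using `heapq.nlargest` avoids sorting the full score list when K is much smaller than the number of documents.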

10
Linearized vs. Non-linearized Document Matrix
  • We executed both searches for the most similar 10
    documents of each document
  • Search Time
      • Linearized: 1 query, 1 sec; 100 queries, 72 sec
      • Non-Linearized: 1 query, 3 sec; 100 queries,
        313 sec
  • Comparison of the results
      Difference    %
      0             62.28
      1             32.98
      2              4.12
      3              0.54
      4              0.07
      7              0.01
  • The improvement in speed/space justifies using
    the linearization
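
A comparison of this kind could be tabulated as in the sketch below. It assumes that "difference" means the number of documents in one top-10 list that are missing from the other; the slides do not define the measure, so this interpretation and both function names are assumptions.

```python
from collections import Counter


def result_difference(top_a, top_b):
    """Number of ids in top_a that do not appear in top_b."""
    return len(set(top_a) - set(top_b))


def difference_histogram(results_a, results_b):
    """Percentage of queries at each difference count.

    results_a, results_b : lists of top-K id lists, one per query,
    in the same query order.
    """
    counts = Counter(
        result_difference(a, b) for a, b in zip(results_a, results_b)
    )
    total = sum(counts.values())
    return {diff: 100.0 * n / total for diff, n in sorted(counts.items())}


# Hypothetical top-3 results for four queries under each search.
linearized = [[1, 2, 3], [1, 2, 3], [4, 5, 6], [7, 8, 9]]
non_linearized = [[1, 2, 3], [3, 2, 1], [4, 5, 0], [0, 8, 10]]
histogram = difference_histogram(linearized, non_linearized)
```

Order within a list is ignored here, so two lists with the same documents in a different ranking count as a difference of 0.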

11
Future plans
  • Split the document feature vector into 2 tables
    each of 100 dimensions and see whether execution
    time improves
  • Attempt to implement relevance feedback
  • Change the interface and adapt it to the comments
    and new features