Auto-grouping Emails for Faster eDiscovery - PowerPoint PPT Presentation

About This Presentation
Title:

Auto-grouping Emails for Faster eDiscovery

Description:

Auto-grouping Emails for Faster eDiscovery Sachindra Joshi, Danish Contractor, Kenney Ng*, Prasad M Deshpande, and Thomas Hampp* IBM Research India *IBM Software ... – PowerPoint PPT presentation

Learn more at: https://www.vldb.org

Transcript and Presenter's Notes

Title: Auto-grouping Emails for Faster eDiscovery


1
Auto-grouping Emails for Faster eDiscovery
  • Sachindra Joshi, Danish Contractor, Kenney Ng*,
    Prasad M Deshpande, and Thomas Hampp*
  • IBM Research India; *IBM Software Group

2
Outline of the Talk
  • eDiscovery Process
  • A new way of eDiscovery review: group-level
    review
  • Creating Syntactic Groups
  • Creating Semantic Groups
  • Experiments and Conclusion

3
eDiscovery Process
  • Discovery Process in pre-trial phase
  • Produce relevant information
  • eDiscovery: FRCP 2006 amendment
  • Produce relevant Electronically Stored
    Information (ESI)
  • Emails, chats, word docs, presentations etc.
  • Huge volumes of ESI - Process is expensive
  • 60% of cases warrant some form of eDiscovery
  • A 4.8 billion dollar industry in 2011

4
eDiscovery Process
  • High cost due to review stage
  • Lawsuit between the Clinton administration and
    tobacco companies (U.S. v. Philip Morris)

Apply Text Mining Techniques to reduce high costs
involved in eDiscovery Process
5
Architecture of eDiscovery Review Systems
6
Group Level Review
  • Review groups of documents that are related
    instead of individual documents
  • Mark whole group as responsive/unresponsive or
    privileged
  • Efficient and consistent
  • Syntactically similar documents: automated
    messages, near and exact duplicates
  • Semantically similar documents: threads,
    semantic categories

7
Detecting Syntactic Groups: Automated Messages
8
Detecting Near Duplicates
  • S1: I am away from 17/2/2011 to 19/2/2011. Please
    mail xyz_at_in.ibm.com in case of any need
  • S2: I am away from 26/7/2011 to 31/7/2011. Please
    mail abc_at_us.ibm.com in case of any need
  • Notion of similarity: resemblance
  • Use fingerprints (Rabin) of the chunks instead
    of the actual chunks.
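
The resemblance idea on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that chunks are overlapping K-word windows, that resemblance is the overlap of the two fingerprint sets divided by their union, and it uses CRC32 from the Python standard library as a stand-in for a Rabin fingerprint. The function names and the choice K = 3 are hypothetical.

```python
import zlib

def chunks(text, k=3):
    """Return the set of overlapping k-word chunks of a text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def fingerprints(text, k=3):
    """Map each chunk to a 32-bit integer (CRC32 stands in for a Rabin fingerprint)."""
    return {zlib.crc32(c.encode("utf-8")) for c in chunks(text, k)}

def resemblance(a, b, k=3):
    """Size of the fingerprint intersection divided by the size of the union."""
    fa, fb = fingerprints(a, k), fingerprints(b, k)
    if not fa and not fb:
        return 1.0  # two texts too short to have any chunk are treated as identical
    return len(fa & fb) / len(fa | fb)

s1 = ("I am away from 17/2/2011 to 19/2/2011. Please "
      "mail xyz_at_in.ibm.com in case of any need")
s2 = ("I am away from 26/7/2011 to 31/7/2011. Please "
      "mail abc_at_us.ibm.com in case of any need")
print(resemblance(s1, s2))
```

With such a small window, the three changed tokens in S1 and S2 touch most of the chunks, so the score for these two short messages is well below 1 even though a reader would call them near duplicates; the similarity threshold has to be chosen with that in mind.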

9
Efficient Detection of Near Duplicates
  • A document of n words has n - K + 1 chunks
    for a window size of K
  • It suffices to keep for each document a
    relatively small, fixed-size signature
  • Let Sn be the set of permutations of {1, ..., n}
  • And let P be chosen uniformly at random over Sn
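
The formula that these two definitions lead up to did not survive in the transcript. For reference, the standard min-wise hashing property they set up is given below; this is the well-known textbook result, not text recovered from the slide. Here the sets F_A and F_B (names introduced only for this note) are the chunk fingerprints of documents A and B, taken as subsets of {1, ..., n}.

```latex
r(A, B) = \frac{|F_A \cap F_B|}{|F_A \cup F_B|},
\qquad
\Pr_{P \in S_n}\bigl[\, \min P(F_A) = \min P(F_B) \,\bigr] = r(A, B)
```

Repeating the experiment with several independent permutations and counting how often the minima agree therefore gives an estimate of the resemblance from a fixed-size signature.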

10
Signature Annotator
  • In practice, choosing the permutations randomly
    is hard
  • Use a set of n one-to-one functions fi and keep
    only the smallest value for each fi
  • Keep only the j least significant bits of each
    value
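
A minimal sketch of such a signature, under stated assumptions: the functions fi are approximated here by salted 64-bit BLAKE2b hashes (not truly one-to-one, and not necessarily what the authors used), and the defaults n = 64 and j = 8 are arbitrary illustrative values.

```python
import hashlib

def f(i, chunk):
    """i-th hash function: BLAKE2b salted with i (a stand-in for a one-to-one fi)."""
    digest = hashlib.blake2b(chunk.encode("utf-8"), digest_size=8,
                             salt=i.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "big")

def signature(chunk_set, n=64, j=8):
    """For each fi keep the smallest value over all chunks, truncated to j low bits."""
    if not chunk_set:
        raise ValueError("signature of an empty chunk set is undefined")
    mask = (1 << j) - 1
    return [min(f(i, c) for c in chunk_set) & mask for i in range(n)]

def estimate_resemblance(sig_a, sig_b):
    """Fraction of signature positions on which the two documents agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

a = {"i am away", "am away from", "in case of", "case of any", "of any need"}
b = {"i am away", "am away from", "in case of", "case of any", "please mail abc"}
print(estimate_resemblance(signature(a), signature(b)))
```

Truncating to j bits shrinks the signature but lets unrelated minima collide with probability about 1 in 2^j per position, so very small j inflates the estimate.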

11
Discovering Automated Messages
  • Generating groups of near duplicates: index-based
    clustering
  • For each document d in index I do
      • If d is not covered
          • Let S = {S1, S2, ..., Sn} be the signature
            of document d
          • D = Query(I, atleast(S, k))
          • For each document d' in D
              • Mark d' as covered
  • Discovering groups of automated messages
  • Automated messages, groups of bulk emails,
    groups of forwarded emails
  • Use MD5 to detect bulk emails. Emails with one
    segment are automated messages
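
The index-based clustering loop above can be sketched in a few lines. The sketch assumes that the index maps each (position, value) pair of a signature to the documents containing it, and that atleast(S, k) means "agrees with S on at least k positions"; the helper names and the toy signatures are hypothetical, and a production system would issue the query against a real search index.

```python
from collections import Counter, defaultdict

def build_index(signatures):
    """Inverted index: (position, value) -> set of document ids."""
    index = defaultdict(set)
    for doc_id, sig in signatures.items():
        for pos, value in enumerate(sig):
            index[(pos, value)].add(doc_id)
    return index

def query_at_least(index, sig, k):
    """Documents whose signature matches sig on at least k positions."""
    counts = Counter()
    for pos, value in enumerate(sig):
        for doc_id in index.get((pos, value), ()):
            counts[doc_id] += 1
    return {doc_id for doc_id, c in counts.items() if c >= k}

def cluster(signatures, k):
    """One pass over the index; every uncovered document seeds a group."""
    index = build_index(signatures)
    covered, groups = set(), []
    for doc_id, sig in signatures.items():
        if doc_id in covered:
            continue
        group = query_at_least(index, sig, k)
        covered |= group  # every returned document is marked as covered
        groups.append(sorted(group))
    return groups

toy = {"a": [1, 2, 3, 4], "b": [1, 2, 3, 9], "c": [7, 8, 9, 0]}
print(cluster(toy, k=3))
```

As in the slide's pseudocode, a document returned by a later query is not removed from the result even if it is already covered, so groups can overlap when k is small.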

12
Detecting Semantic Groups: Email Threads
  • A tree like structure
  • A link denotes that the child node was written as
    a reply to the parent node.
  • Capture the context in which an email was written

13
Detecting Email Threads
  • Meta data based methods
  • Headers are not consistently used
  • Content of old mail remains in the new mail
  • A segment contains text of only one communication
  • An email ei contains ej iff ei approximately
    contains all the segments of ej

14
Method for Thread Detection
  • Email Segment Generator (ESG)
  • Splits an email into segments, where each
    segment contains the content of only one email.
  • Segment Signature Generator (SSG)
  • Generates a signature for a segment
  • Use near duplicate signatures
  • For a practical implementation, we limit the
    number of segment signatures (N) that can be
    associated with an email, e.g. 20 segments.
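
A rough sketch of the two components and of the containment test they support. Several things here are assumptions made for illustration only: quoted mails are assumed to be separated by an "-----Original Message-----" line, the segment signature is an MD5 of the whitespace-normalized text (an exact-match stand-in for the near-duplicate signatures the slide calls for), and all function names are hypothetical.

```python
import hashlib
import re

# Assumed separator between the new text and each quoted earlier mail.
SEPARATOR = re.compile(r"^-+\s*Original Message\s*-+$",
                       re.MULTILINE | re.IGNORECASE)
MAX_SEGMENTS = 20  # the limit N from the slide

def segments(body):
    """Email Segment Generator: one segment per communication in the body."""
    parts = [p.strip() for p in SEPARATOR.split(body)]
    return [p for p in parts if p][:MAX_SEGMENTS]

def segment_signature(segment):
    """Segment Signature Generator (exact-match stand-in using MD5)."""
    normalized = " ".join(segment.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def segment_signatures(body):
    return [segment_signature(s) for s in segments(body)]

def contains(ei, ej):
    """ei contains ej if every segment signature of ej also occurs in ei."""
    return set(segment_signatures(ej)) <= set(segment_signatures(ei))

root = "Can we meet on Monday?"
reply = ("Yes, Monday works.\n"
         "-----Original Message-----\n"
         "Can we meet on Monday?")
print(contains(reply, root), contains(root, reply))
```

Swapping the MD5 for a min-wise signature and an "at least k positions agree" comparison would turn the exact test into the approximate containment the slides describe.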

15
Method: Processing at Indexing Time
16
Method: Processing at Query Time
[Figure: query email q; labels: Use Signature of First Segment, Generating Candidate Thread Set]
17
Detecting Email Threads
  • Given a Candidate Thread Set
  • Identify the email that contains only the root
    segment
  • An email ec is a child of an email ep if ec
    minimally contains ep
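
One way to read "minimally contains" is sketched below: among all emails whose segments are a proper subset of the child's segments, the parent is the one with the most segments. That reading, the representation of each email as a set of segment signatures, and the name build_thread are assumptions made for this illustration.

```python
def build_thread(emails):
    """emails: dict of email id -> set of segment signatures.

    Returns a dict of email id -> parent id (None for the root).
    """
    parent = {}
    for child, child_segs in emails.items():
        # Every email whose segments are all present in the child is a candidate.
        candidates = [(len(segs), other) for other, segs in emails.items()
                      if other != child and segs < child_segs]
        # The largest contained email is the closest ancestor, i.e. the parent.
        parent[child] = max(candidates)[1] if candidates else None
    return parent

candidate_thread_set = {
    "e1": {"s1"},               # only the root segment
    "e2": {"s1", "s2"},         # reply to e1
    "e3": {"s1", "s2", "s3"},   # reply to e2
    "e4": {"s1", "s4"},         # second reply to e1
}
print(build_thread(candidate_thread_set))
```

The email with no parent is the root of the tree, which matches the first step on the slide.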

18
Creating Semantic Categories
  • Focus Categories
  • Documents that are likely to be responsive
  • Legal Content, Financial Communication,
    Intellectual Property
  • High recall
  • Filter Categories
  • Documents that are likely to be unresponsive
  • Bulk emails, Private communication, Jokes
  • High precision

19
Creating Semantic Categories
  • Email Segmentation
  • Pattern-based annotation: use a SystemT-based
    method
  • Consolidation
  • Each concept is independent
  • Apply additional constraints over concepts
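
The annotate-then-consolidate flow can be illustrated with a toy example. Everything in it is hypothetical: the slides name SystemT for annotation, whereas this sketch uses plain Python regular expressions, and the concept patterns and consolidation rules are invented only to show independent concepts being combined under extra constraints.

```python
import re

# Each concept is detected independently of the others.
CONCEPTS = {
    "legal_term": re.compile(r"\b(lawsuit|subpoena|attorney|litigation)\b",
                             re.IGNORECASE),
    "money": re.compile(r"\$\s?\d[\d,]*(\.\d+)?"),
    "joke_marker": re.compile(r"\b(joke|funny|lol)\b", re.IGNORECASE),
}

def annotate(text):
    """Return the names of all concepts whose pattern occurs in the text."""
    return {name for name, pattern in CONCEPTS.items() if pattern.search(text)}

def categorize(text):
    """Consolidation: apply constraints across concepts to assign categories."""
    found = annotate(text)
    categories = set()
    if "legal_term" in found and "joke_marker" not in found:
        categories.add("Legal Content")
    if "money" in found:
        categories.add("Financial Communication")
    if "joke_marker" in found and "legal_term" not in found:
        categories.add("Jokes")
    return categories

print(categorize("Our attorney received the subpoena."))
print(categorize("Invoice total is $1,200 for March."))
```

Focus categories would favor broad patterns (high recall), while filter categories would use stricter constraints (high precision), in line with the previous slide.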

20
Experiments: Near Duplicate Detection
  • Enron Corpus
  • 517K emails from 150 users
  • Measuring precision
  • Manually evaluated near duplicate set for 500
    queries
  • With more bits, precision is 100% even with a
    40% similarity threshold
  • Only 33.3% of emails are unique

21
Experiments: Email Thread Detection
  • No ground truth for threads
  • Subject approximation method: based on Re:,
    Fw:, etc. in the subject
  • Manually verified the thread results for our
    method and for the subject approximation method
  • The union of the correct emails in a thread from
    both approaches is treated as ground truth.
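
The subject approximation baseline can be sketched as grouping emails by their subject after stripping reply and forward prefixes. The exact prefix list and normalization used in the experiments are not given in the slides, so the regular expression and function names below are assumptions.

```python
import re
from collections import defaultdict

# Leading "Re:", "Fw:", "Fwd:" prefixes, possibly repeated.
PREFIX = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)

def normalize_subject(subject):
    """Strip reply/forward prefixes and normalize case and whitespace."""
    return PREFIX.sub("", subject).strip().lower()

def group_by_subject(emails):
    """emails: iterable of (email id, subject). Returns subject -> list of ids."""
    groups = defaultdict(list)
    for email_id, subject in emails:
        groups[normalize_subject(subject)].append(email_id)
    return dict(groups)

mails = [
    ("m1", "Budget review"),
    ("m2", "RE: Budget review"),
    ("m3", "Fw: RE: Budget review"),
    ("m4", "Lunch?"),
]
print(group_by_subject(mails))
```

Such a baseline misses replies whose subject was edited and merges unrelated mails that share a generic subject, which is one reason the results of both methods were verified manually.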

22
Experiments: Semantic Groups
  • Ground truth: sampled 2200 emails using generic
    keywords and then manually labeled them

23
Conclusions
  • We developed a framework that allows group-level
    review of documents
  • We developed methods for finding syntactic
    groups, such as automated messages
  • We developed methods for finding email threads
    and semantic groups
  • We showed a significant reduction in review time
    using group-level review and integrated the
    proposed techniques with the IBM InfoSphere
    eDiscovery Analyzer product