Auto-grouping Emails for Faster eDiscovery

About This Presentation

Title:

Auto-grouping Emails for Faster eDiscovery

Description:

Auto-grouping Emails for Faster eDiscovery Sachindra Joshi, Danish Contractor, Kenney Ng, Prasad M Deshpande, and Thomas Hampp IBM Research India *IBM Software ... – PowerPoint PPT presentation

Number of Views:106

Avg rating:3.0/5.0

Slides: 24

Provided by: IBMU192

Learn more at: https://www.vldb.org

Category:

more less

Transcript and Presenter's Notes

Title: Auto-grouping Emails for Faster eDiscovery

1
Auto-grouping Emails for Faster eDiscovery

Sachindra Joshi, Danish Contractor, Kenney Ng,
Prasad M Deshpande, and Thomas Hampp
IBM Research India IBM Software Group

2
Outline of the Talk

eDiscovery Process
A new way of eDiscovery Review Group Level
Review
Creating Syntactic Groups
Creating Semantic Groups
Experiments and Conclusion

3
eDiscovery Process

Discovery Process in pre-trial phase
Produce relevant information
eDiscovery FRCP 2006 amendment
Produce relevant Electronically Stored
Information (ESI)
Emails, chats, word docs, presentations etc.
Huge volumes of ESI - Process is expensive
60 of cases warrant some form of eDisovery
4.8 billion dollars industry in 2011

4
eDiscovery Process

High cost due to review stage
Lawsuit between Clinton administration and
tobacco companies (U.S. Vs. Philip Morris)

Apply Text Mining Techniques to reduce high costs
involved in eDiscovery Process
5
Architecture of eDiscovery Review Systems
6
Group Level Review

Review groups of documents that are related
instead of individual documents
Mark whole group as responsive/unresponsive or
privileged
Efficient and consistent
Syntactically Similar Documents
Automated messages, Near and exact duplicates
Semantically Similar Documents
Threads, semantic categories

7
Detecting Syntactic Groups Automated Messages
8
Detecting Near Duplicates

S1 I am away from 17/2/2011 to 19/2/2011. Please
mail xyz_at_in.ibm.com in case of any need
S2 I am away from 26/7/2011 to 31/7/2011. Please
mail abc_at_us.ibm.com in case of any need
Notion of Similarity Resemblance

Use fingerprinting (Rabin) instead of actual
chunks.

9
Efficient Detection of Near Duplicates

For a document of length n words there would be
n-K1 chunks with a window size of K
It suffices to keep for each document a
relatively small fixed size signature
Let Sn be the set of permutations of n
And let P be chosen uniformly at random over Sn

10
Signature Annotator

In practice choosing the permutations randomly is
hard
Use a set of n one-to-one functions fi and keep
only the smallest value for each fi
Keep only j lowest significant bits for each
value

11
Discovering Automated Messages

Generating groups of near duplicate Index Based
Clustering
For each document d in index I do
If d is not covered
Let S S1, S2, , Sn be the signature of
document d
D Query(I, atleast(S,k))
For each document d in D
d is covered
Discovering Groups of Automated Messages
Automated Messages, Group of bulk emails, Group
of forward emails
Use MD5 to detect bulk emails. Emails with one
segment are automated messages

12
Detecting Semantic Groups Email Threads

A tree like structure
A link denotes that the child node was written as
a reply to the parent node.
Capture the context in which an email was written

13
Detecting Email Threads

Meta data based methods
Headers are not consistently used
Content of old mail remains in the new mail
A segment contains text of only one communication
An email ei contains ej iff ei approximately
contains all the segment of ej

14
Method for Thread Detection

Email Segment Generator (ESG)
Creates segments of it where each segment
contains content of only one email.
Segment Signature Generator (SSG)
Generates a signature for a segment
Use near duplicate signatures
For practical implementation, we limit on the
number of segment signatures (N) that can be
associated with an email, e.g. 20 segments.

15
Method Processing at Indexing Time
16
Method Processing at Query Time
q
Use Signature Of First Segment
Generating Candidate Thread Set
17
Detecting Email Threads

Given a Candidate Thread Set
Identify the email with only root segment
An email ec is child of an email ep if ec
minimally contains ep

18
Creating Semantic Categories

Focus Categories
Documents that are likely to be responsive
Legal Content, Financial Communication,
Intellectual Property
High recall
Filter Categories
Documents that are likely to be unresponsive
Bulk emails, Private communication, Jokes
High precision

19
Creating Semantic Categories

Email Segmentation
Pattern based annotation Use System T based
method
Consolidation
Each concept is independent
Apply additional constraints over concepts

20
Experiments Near Duplicate Detection

Enron Corpus
517K emails from 150 users
Measuring precision
Manually evaluated near duplicate set for 500
queries
With more bits precision is 100 even with 40
similarity threshold
Only 33.3 emails are unique

21
Experiments Email Thread Detection

No ground truth for threads
Subject approximation Method Based on Re,
Fw etc in subject
Manually verified the results of thread for our
method and subject approximation method
The union of correct emails in thread for both
approaches is treated as ground truth.

22
Experiments Semantic Group

Ground truth Sampled 2200 emails using generic
keywords and then manually labeled

23
Conclusions

We developed a framework that allow group level
review of documents
We developed methods for finding syntactic groups
such as automated messages for creating groups
We developed methods for finding email threads
and semantic groups
We showed significant reduction in the review
time by using the group level review and
integrated the proposed techniques with IBM
Infosphere eDiscovery Analyzer product