Title: Identifying Duplicates
1Identifying Duplicates
- Presented by
- Kurt Jensen, President
2Set Similarity
- Set Similarity (Jaccard measure)
- View sets as columns of a matrix, with one row
for each element in the universe. a_ij = 1
indicates presence of item i in set j.
- Example:
C1 C2
0  1
1  0
1  1
0  0
1  1
0  1
simJ(C1, C2) = 2/5 = 0.4
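The Jaccard measure on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `jaccard_similarity` is hypothetical, and the two columns assume the example matrix reads row by row as (0,1), (1,0), (1,1), (0,0), (1,1), (0,1), which is consistent with the stated result of 2/5.

```python
def jaccard_similarity(col_a, col_b):
    """Jaccard similarity of two equal-length 0/1 columns.

    Rows where both columns are 1 form the intersection;
    rows where at least one column is 1 form the union.
    """
    if len(col_a) != len(col_b):
        raise ValueError("columns must have the same number of rows")
    both = sum(1 for a, b in zip(col_a, col_b) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(col_a, col_b) if a == 1 or b == 1)
    # Two empty sets are treated as having similarity 0 here (an assumption).
    return both / either if either else 0.0


# Columns assumed from the slide's example matrix.
c1 = [0, 1, 1, 0, 1, 0]
c2 = [1, 0, 1, 0, 1, 1]

print(jaccard_similarity(c1, c2))  # 2 shared rows / 5 non-empty rows
```

Rows where both columns are 0 do not count toward either the intersection or the union, which is why the denominator is 5 rather than 6.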
3Identifying Duplicates Can Be Tricky
On the surface, identifying duplicates may seem
like a simple process.
However, in today's world of emails, backup tapes,
and copy/paste functions, it is more complicated
than one might think.
4Why De-Duplicate?
Why should you care that you may have duplicate
documents?
- To minimize the number of documents that will
need to be reviewed
- To ensure that documents are not altered in
Native productions.
5Hashing - The Process
The process by which an algorithm reads every
byte of the file and creates a fixed-length,
alphanumeric value that is, for practical
purposes, unique to that content.
- A Fingerprint
- MD5
- SHA-1 and SHA-2
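As a rough illustration of the hashing process, the sketch below uses Python's standard `hashlib` module to fingerprint a byte string with MD5, SHA-1, and SHA-256 (a member of the SHA-2 family). The helper name `fingerprint` and the sample contents are made up for the example. Note that hash values are unique only in practice: collisions are theoretically possible, and deliberate collisions have been demonstrated for MD5 and SHA-1.

```python
import hashlib


def fingerprint(data: bytes) -> dict:
    """Return hex digests of the given bytes under three hash algorithms."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


# Hypothetical file contents: an original, an exact copy, and a one-byte edit.
original = b"Quarterly report, final version."
exact_copy = b"Quarterly report, final version."
edited = b"Quarterly report, final version!"

# Identical bytes give identical fingerprints; any change gives a new one.
print(fingerprint(original) == fingerprint(exact_copy))
print(fingerprint(original)["md5"] == fingerprint(edited)["md5"])
```

For real files, the bytes would typically be read in binary mode (for example with `open(path, "rb")`) before hashing.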
6Example of Hash Values
7What is Hashed?
- Text
- Metadata (selected or all)
- File Contents
- Varies from Vendor to Vendor
More dupes are identified if you hash the text
only. Fewer are identified if you hash text and
metadata.
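A small sketch can show why the choice of hashed fields changes the duplicate count. Everything here is assumed for illustration: the field names (`text`, `sent`, `custodian`), the sample emails, and the helpers `hash_fields` and `count_duplicates` are hypothetical and do not reflect any particular vendor's method.

```python
import hashlib


def hash_fields(record, fields):
    """Hash only the selected fields of a record, in the given order."""
    h = hashlib.sha1()
    for name in fields:
        # A separator byte keeps adjacent field values from running together.
        h.update(name.encode("utf-8") + b"\x00")
        h.update(str(record.get(name, "")).encode("utf-8") + b"\x00")
    return h.hexdigest()


def count_duplicates(records, fields):
    """Count records whose hash matches an earlier record's hash."""
    seen = set()
    dupes = 0
    for record in records:
        key = hash_fields(record, fields)
        if key in seen:
            dupes += 1
        else:
            seen.add(key)
    return dupes


# Hypothetical emails: the first two share text but differ in sent time.
emails = [
    {"text": "Meeting moved to 3 pm.", "sent": "2005-01-10 09:00", "custodian": "A"},
    {"text": "Meeting moved to 3 pm.", "sent": "2005-01-10 09:02", "custodian": "B"},
    {"text": "Lunch on Friday?", "sent": "2005-01-11 12:00", "custodian": "A"},
]

print(count_duplicates(emails, ["text"]))          # text only: 1 duplicate
print(count_duplicates(emails, ["text", "sent"]))  # text + metadata: 0 duplicates
```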
8The Results
- Emails
- Emails may have the exact same text but varying
metadata fields
Are they dupes?
9Duplicate Emails??
10Metadata Discrepancies
Even though the emails may look the same, the
metadata tells a different story.
11The Results, Cont.
- Attachments
- Loose Files
- Attachments & Loose Files can be duplicates but
have different file names. (Hash only the File
Content)
Are they dupes?
12Vertical vs. Horizontal
- Vertical
- De-duping within a Custodian (files)
- Tape backup of a Custodian's Email from Jan. 2005
to Dec. 2005
- Horizontal
- De-duping across Custodians (files)
- Emails from HR to Multiple Custodians, identify
all and flag one for review
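The difference between the two scopes can be sketched as a change in the key used to detect repeats. This is a minimal illustration, not a description of any specific tool: the document fields (`id`, `custodian`, `hash`), the placeholder hash strings, and the function `docs_to_review` are all hypothetical.

```python
def docs_to_review(documents, horizontal):
    """Return ids of documents kept for review after de-duplication.

    Vertical (horizontal=False): a hash repeats only within one custodian.
    Horizontal (horizontal=True): a hash repeats across all custodians.
    """
    seen = set()
    keep = []
    for doc in documents:
        # Vertical de-duping scopes the key by custodian; horizontal does not.
        scope = None if horizontal else doc["custodian"]
        key = (scope, doc["hash"])
        if key not in seen:
            seen.add(key)
            keep.append(doc["id"])
    return keep


# Hypothetical collection; "h1" and "h2" stand in for real hash values.
documents = [
    {"id": 1, "custodian": "Smith", "hash": "h1"},  # HR email
    {"id": 2, "custodian": "Smith", "hash": "h1"},  # same email from a backup tape
    {"id": 3, "custodian": "Jones", "hash": "h1"},  # same HR email, other custodian
    {"id": 4, "custodian": "Jones", "hash": "h2"},
]

print(docs_to_review(documents, horizontal=False))  # [1, 3, 4]
print(docs_to_review(documents, horizontal=True))   # [1, 4]
```

Horizontal de-duping leaves fewer documents to review, but only one custodian's copy is flagged, so a record of which custodians held the duplicates usually needs to be kept separately.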
13Designations & Consequences
- Duplicate Documents
- Duplicate Families
- Document Level Designations
- Family Designations
- Duplicate Family Designations
14Near Duplicates
- Apply to Emails & Email Threads
- Attachments
- Duplicate Text
- Duplicate Text & Metadata
- Neither Duplicate Text, Nor Duplicate Metadata
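The slides do not say how near duplicates are detected. One common approach, assumed here purely for illustration, is to break each text into overlapping word sequences (shingles) and compare the resulting sets with the Jaccard measure from slide 2. The function names, the shingle size `k=3`, and the `0.5` threshold are arbitrary choices for the sketch.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in the text (case-insensitive)."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle; empty text has none.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, threshold=0.5, k=3):
    """True when the shingle sets overlap at or above the threshold."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold


# Hypothetical email bodies.
a = "please review the attached contract before friday"
b = "please review the attached contract before monday"
c = "lunch on friday at noon works for me"

print(is_near_duplicate(a, b))  # one word differs: 4 of 6 shingles shared
print(is_near_duplicate(a, c))  # unrelated text: no shingles shared
```

Unlike hashing, where a single changed byte produces a completely different value, this comparison degrades gradually as the texts diverge, which is what makes it suitable for email threads and lightly edited attachments.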
15Questions??