Identifying Duplicates - PowerPoint PPT Presentation

About This Presentation
Title:

Identifying Duplicates

Description:

View sets as columns of a matrix; one row for each element in the universe. ... looks at every byte of the file and creates a unique, alpha-numeric value. ... – PowerPoint PPT presentation

Slides: 16
Provided by: pamelad5

Transcript and Presenter's Notes


1
Identifying Duplicates
  • Presented by
  • Kurt Jensen, President

2
Set Similarity
  • Set Similarity (Jaccard measure)
  • View sets as columns of a matrix, with one row for
    each element in the universe. a_ij = 1 indicates
    presence of item i in set j
  • Example

    C1  C2
     0   1
     1   0
     1   1
     0   0
     1   1
     0   1

    simJ(C1,C2) = 2/5 = 0.4
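
The Jaccard measure in the example can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: it assumes the example's rows read (0,1), (1,0), (1,1), (0,0), (1,1), (0,1), and the function name jaccard is chosen here for illustration.

```python
# Columns C1 and C2, one entry per row (one row per element of the universe).
# Assumed reading of the slide's example; 1 means the element is in the set.
c1 = [0, 1, 1, 0, 1, 0]
c2 = [1, 0, 1, 0, 1, 1]


def jaccard(col_a, col_b):
    """Jaccard similarity of two sets encoded as 0/1 columns of equal length."""
    if len(col_a) != len(col_b):
        raise ValueError("columns must have the same number of rows")
    # Rows where both sets contain the element (size of the intersection).
    both = sum(1 for a, b in zip(col_a, col_b) if a == 1 and b == 1)
    # Rows where at least one set contains the element (size of the union).
    either = sum(1 for a, b in zip(col_a, col_b) if a == 1 or b == 1)
    # Treating two empty sets as identical is an assumption of this sketch.
    return 1.0 if either == 0 else both / either


print(jaccard(c1, c2))  # 2 rows in both sets, 5 rows in at least one
```

Rows where neither set contains the element (the 0 0 row) do not affect the result, which is why the denominator is 5 rather than 6.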
3
Identifying Duplicates Can Be Tricky
On the surface, identifying duplicates may seem
like a simple process.
However, in today's world of emails, backup tapes
and copy/paste functions, it is more complicated
than one might think.
4
Why De-Duplicate?
Why should you care that you may have duplicate
documents?
  • To minimize the number of documents that will
    need to be reviewed
  • To ensure that documents are not altered in
    Native productions.

5
Hashing - The Process
The process by which an algorithm reads every
byte of the file and produces an alphanumeric
value that is, for practical purposes, unique to
that content.
  • A Fingerprint
  • MD5
  • SHA-1 & SHA-2
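
As a rough illustration of the fingerprint idea, the sketch below uses Python's standard hashlib module to hash the same bytes with MD5, SHA-1 and SHA-256 (one member of the SHA-2 family). The sample byte strings and the helper name fingerprints are made up for this example; a real tool would read the bytes from a file.

```python
import hashlib


def fingerprints(data: bytes) -> dict:
    """Return hex digests of the given bytes under MD5, SHA-1 and SHA-256."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


# Hypothetical document contents.
original = b"Quarterly report, final version."
exact_copy = b"Quarterly report, final version."
edited = b"Quarterly report, final version!"  # only the last byte differs

print(fingerprints(original))
# Identical bytes give identical digests under every algorithm.
print(fingerprints(original) == fingerprints(exact_copy))
# Changing a single byte changes the digests.
print(fingerprints(original) == fingerprints(edited))
```

Because every byte feeds into the digest, two files compare as duplicates only when their contents match exactly.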

6
Example of Hash Values
7
What is Hashed?
  • Text
  • Metadata (selected or all)
  • File Contents
  • Varies from Vendor to Vendor

More dupes are identified if you hash the text
only. Fewer are identified if you hash text and
metadata.
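
The effect of the hashing scope can be sketched in Python. The email records, field names and helper functions below are hypothetical and only illustrate the point; actual vendors choose their own fields and normalization rules.

```python
import hashlib

# Hypothetical email records; the field names are illustrative only.
emails = [
    {"text": "Please review the attached contract.", "sent": "2005-03-01T09:00", "to": "alice@example.com"},
    {"text": "Please review the attached contract.", "sent": "2005-03-01T09:02", "to": "bob@example.com"},
    {"text": "Lunch on Friday?", "sent": "2005-03-02T12:00", "to": "alice@example.com"},
]


def hash_text_only(email):
    """Hash the body text and nothing else."""
    return hashlib.sha256(email["text"].encode("utf-8")).hexdigest()


def hash_text_and_metadata(email, fields=("sent", "to")):
    """Hash the body text together with the selected metadata fields."""
    parts = [email["text"]] + [email[f] for f in fields]
    # A separator character keeps adjacent fields from running together.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def count_dupes(records, hasher):
    """Number of records whose hash was already seen earlier in the list."""
    seen, dupes = set(), 0
    for record in records:
        digest = hasher(record)
        if digest in seen:
            dupes += 1
        seen.add(digest)
    return dupes


print(count_dupes(emails, hash_text_only))          # text-only scope
print(count_dupes(emails, hash_text_and_metadata))  # text plus metadata scope
```

In this toy data the first two emails share the same text but differ in send time and recipient, so they match under the text-only hash and not under the text-plus-metadata hash.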
8
The Results
  • Emails
  • Emails may have the exact same text but varying
    metadata fields

Are they dupes?
9
Duplicate Emails??
10
Metadata Discrepancies
Even though the emails may look the same, the
metadata tells a different story.
11
The Results, Cont.
  • Attachments
  • Loose Files
  • Attachments and loose files can be duplicates but
    have different file names. (Hash only the file
    content.)

Are they dupes?
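
Hashing only the file content can be sketched as follows. This is an illustrative Python example with made-up file names and helper functions: it writes three temporary files and groups them by a SHA-256 digest of their bytes, so the file name never enters the comparison.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def content_hash(path: Path) -> str:
    """SHA-256 of the file's bytes; the file name is not part of the input."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def group_by_content(paths):
    """Map each content digest to the sorted names of the files that share it."""
    groups = defaultdict(list)
    for path in paths:
        groups[content_hash(path)].append(path.name)
    return {digest: sorted(names) for digest, names in groups.items()}


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Hypothetical files: an attachment and a loose file with the same bytes.
    (folder / "contract.doc").write_bytes(b"same bytes")
    (folder / "contract_copy_from_email.doc").write_bytes(b"same bytes")
    (folder / "memo.doc").write_bytes(b"different bytes")
    groups = group_by_content(sorted(folder.iterdir()))

# Any group with more than one name is a set of content duplicates.
duplicate_sets = [names for names in groups.values() if len(names) > 1]
print(duplicate_sets)
```

The two contract files end up in the same group even though their names differ, while the memo stays on its own.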
12
Vertical vs. Horizontal
  • Vertical
  • De-duping within a single custodian's files
  • Example: tape backup of a custodian's email from
    Jan. 2005 to Dec. 2005
  • Horizontal
  • De-duping across custodians' files
  • Example: emails from HR to multiple custodians;
    identify all copies and flag one for review
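
The difference between the two scopes comes down to what counts as "already seen". The Python sketch below is an assumption-laden illustration, not a vendor's method: the custodian names and the placeholder digests "h1" and "h2" are invented, and the first copy encountered is the one kept.

```python
# Hypothetical (custodian, content hash) pairs; "h1" and "h2" stand in for real digests.
documents = [
    ("Smith", "h1"),
    ("Smith", "h1"),  # same file twice for Smith, e.g. mailbox plus tape backup
    ("Jones", "h1"),  # same file held by a second custodian
    ("Jones", "h2"),
]


def dedupe(docs, horizontal):
    """Keep the first copy of each document; flag later copies as duplicates."""
    seen, kept, dupes = set(), [], []
    for custodian, digest in docs:
        # Horizontal: compare across all custodians.
        # Vertical: compare only within the same custodian.
        key = digest if horizontal else (custodian, digest)
        if key in seen:
            dupes.append((custodian, digest))
        else:
            seen.add(key)
            kept.append((custodian, digest))
    return kept, dupes


vertical_kept, vertical_dupes = dedupe(documents, horizontal=False)
horizontal_kept, horizontal_dupes = dedupe(documents, horizontal=True)
print(len(vertical_kept), len(horizontal_kept))
```

Vertical de-duplication leaves Jones's copy of "h1" in the review set because it belongs to a different custodian; horizontal de-duplication flags it as a duplicate of Smith's copy.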

13
Designations & Consequences
  • Duplicate Documents
  • Duplicate Families
  • Document Level Designations
  • Family Designations
  • Duplicate Family Designations

14
Near Duplicates
  • Apply to Emails & Email Threads
  • Attachments
  • Duplicate Text
  • Duplicate Text & Metadata
  • Neither Duplicate Text, Nor Duplicate Metadata

15
Questions??