Near-Duplicate Detection for eRulemaking

1
Near-Duplicate Detection for eRulemaking
  • Hui Yang, Jamie Callan
  • Language Technologies Institute
  • School of Computer Science
  • Carnegie Mellon University

  • Stuart Shulman
  • Library and Information Science
  • School of Information Sciences
  • University of Pittsburgh
2
Duplicates and Near-Duplicates in eRulemaking
  • U.S. regulatory agencies must solicit, consider,
    and respond to public comments.
  • Special interest groups make form letters
    available for generating comments via email and
    the Web
      • Moveon.org, http://www.moveon.org
      • GetActive, http://www.getactive.org
  • Modifying a form letter is very easy

3
Form Letters
  • [Screenshot of moveon.org showing a form letter and the
    "enter your comment here" field. Labeled regions: Form Letter,
    Individual Information, Personal Notes]

4
Duplicate - Exact
5
Near Duplicate - Block Edit
6
Near Duplicate - Minor Change
7
Minor Change and Block Edit
8
Near Duplicate - Block Reordering
9
Near Duplicate - Key Block
10
Near-duplicate Detection Strategy
  • Group near-duplicates based on
      • Text similarity
          • Similar vocabulary
          • Similar word frequencies
      • Editing patterns
      • Metadata
  • Hints to the clustering algorithm about how to
    group documents
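
The two text-similarity signals named on this slide can be sketched as follows. This is a minimal illustration, not the presenters' implementation: using Jaccard overlap for "similar vocabulary" and cosine similarity for "similar word frequencies" is an assumption, and `tokenize`, `vocabulary_similarity`, and `frequency_similarity` are hypothetical names.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def vocabulary_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two documents' sets of distinct words (assumed measure)."""
    va, vb = set(tokenize(a)), set(tokenize(b))
    if not va and not vb:
        return 1.0
    return len(va & vb) / len(va | vb)


def frequency_similarity(a: str, b: str) -> float:
    """Cosine similarity of the two documents' word-frequency vectors (assumed measure)."""
    fa, fb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(fa[w] * fb[w] for w in fa.keys() & fb.keys())
    norm_a = math.sqrt(sum(c * c for c in fa.values()))
    norm_b = math.sqrt(sum(c * c for c in fb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Vocabulary overlap ignores how often a word is used, while the frequency-based score does not, so a block that is repeated or deleted moves the second score more than the first.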

11
Must-links
  • Two instances must be in the same cluster
  • Created when there is
      • complete containment of the reference copy (key
        block), or
      • word overlap > 95% (minor change).
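
A must-link test along these lines might look like the sketch below. The slide does not define how word overlap is computed, so the definition here (shared word occurrences divided by the length of the longer document) is an assumption, as are the function names and the token-level containment check.

```python
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def contains_reference(doc: str, reference: str) -> bool:
    """True if the reference copy appears as a contiguous run of words in doc."""
    ref = " ".join(tokens(reference))
    if not ref:
        return False
    # Pad with spaces so a match cannot start or end in the middle of a word.
    return f" {ref} " in f" {' '.join(tokens(doc))} "


def word_overlap(doc_a: str, doc_b: str) -> float:
    """Shared word occurrences divided by the longer document's length (assumed definition)."""
    ca, cb = Counter(tokens(doc_a)), Counter(tokens(doc_b))
    longest = max(sum(ca.values()), sum(cb.values()))
    if longest == 0:
        return 1.0
    return sum((ca & cb).values()) / longest


def is_must_link(doc: str, reference: str, threshold: float = 0.95) -> bool:
    """Key block (full containment) or minor change (overlap above the threshold)."""
    return contains_reference(doc, reference) or word_overlap(doc, reference) > threshold
```

With this definition, a 40-word form letter with one word changed scores 39/40 = 0.975 and is linked, while a letter wrapped in a long personal note is linked through containment rather than overlap.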

12
Cannot-links
  • Two instances cannot be in the same cluster
  • Created when two documents
      • cite different docket identification numbers
        (some people submit comments to the wrong place)
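
One way to derive such a constraint is to extract docket identifiers from each comment and compare them. The sketch below is illustrative only: the regular expression is an assumption modeled on the two identifiers that appear in this deck (USEPA-OAR-2002-0056 and USDOT-2003-16128) and may not cover every agency's format.

```python
import re

# Assumed pattern: agency code, optional sub-office codes, 4-digit year, serial number.
DOCKET_ID = re.compile(r"\b[A-Z]{2,}(?:-[A-Z]+)*-\d{4}-\d{3,}\b")


def docket_ids(text: str) -> set[str]:
    """Return every docket identifier cited in the text."""
    return set(DOCKET_ID.findall(text))


def is_cannot_link(doc_a: str, doc_b: str) -> bool:
    """True only when both documents cite a docket and they share none."""
    ids_a, ids_b = docket_ids(doc_a), docket_ids(doc_b)
    return bool(ids_a) and bool(ids_b) and ids_a.isdisjoint(ids_b)
```

Documents that cite no docket at all are left unconstrained here, since a missing identifier is not evidence that two comments belong apart.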

13
Family-links
  • Two instances are likely to be in the same
    cluster
  • Created when two documents have
      • the same email relayer,
      • the same docket identification number,
      • similar file sizes, or
      • the same footer block.
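
The four family-link conditions can be sketched as a simple predicate over per-comment metadata. The `Comment` record, its field names, and the 5% size tolerance are all hypothetical; the slide does not say how "similar file sizes" is measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    """Hypothetical metadata record for one submitted comment."""
    relayer: str      # email relayer the comment came through
    docket_id: str    # docket identification number it cites
    size_bytes: int   # file size
    footer: str       # footer block text


def similar_size(a: int, b: int, tolerance: float = 0.05) -> bool:
    """True if the sizes differ by at most `tolerance` of the larger one (assumed rule)."""
    larger = max(a, b)
    return larger == 0 or abs(a - b) / larger <= tolerance


def is_family_link(x: Comment, y: Comment, size_tolerance: float = 0.05) -> bool:
    """True if any one of the four family-link conditions holds."""
    return (
        (bool(x.relayer) and x.relayer == y.relayer)
        or (bool(x.docket_id) and x.docket_id == y.docket_id)
        or similar_size(x.size_bytes, y.size_bytes, size_tolerance)
        or (bool(x.footer) and x.footer == y.footer)
    )
```

Empty fields are ignored so that two comments with missing metadata are not linked merely because both values are blank.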

14
Experimental Results
  • Comparison with human-human intercoder agreement
    (measured in AC1)
  • Datasets
      • USEPA-OAR-2002-0056 (EPA Mercury dataset)
      • USDOT-2003-16128 (DOT SUV dataset)
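
For readers unfamiliar with the metric, the sketch below computes AC1 for two coders, assuming the slide refers to Gwet's first-order agreement coefficient: AC1 = (p_a - p_e) / (1 - p_e), where p_a is the observed agreement and p_e = (1 / (Q - 1)) Σ π_q (1 - π_q) over the Q categories, with π_q the average share of items the two coders assign to category q. The function name and label values are illustrative.

```python
from collections import Counter


def gwet_ac1(rater_1: list[str], rater_2: list[str]) -> float:
    """Gwet's AC1 for two raters who labeled the same items."""
    if len(rater_1) != len(rater_2) or not rater_1:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_1)
    categories = set(rater_1) | set(rater_2)
    q = len(categories)
    if q < 2:
        return 1.0  # only one category was ever used, so the raters agree everywhere
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    counts = Counter(rater_1) + Counter(rater_2)
    # pi_q: average proportion of items placed in category q by the two raters
    shares = [counts[c] / (2 * n) for c in categories]
    chance = sum(p * (1 - p) for p in shares) / (q - 1)
    return (observed - chance) / (1 - chance)
```

For example, with labels `["dup", "dup", "dup", "non"]` and `["dup", "dup", "non", "non"]`, the observed agreement is 0.75, the chance agreement is 0.46875, and AC1 is 9/17, about 0.53.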
15
Experimental Results
  • Comparison with other duplicate detection
    algorithms (measured in F1)
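
The slide does not say how F1 is computed for a grouping of documents. One common choice, assumed here purely for illustration, is pairwise F1: a pair of documents counts as a positive when both fall in the same cluster, and precision and recall are taken over those pairs. The function names are hypothetical.

```python
from itertools import combinations


def same_cluster_pairs(labels: list[int]) -> set[tuple[int, int]]:
    """All index pairs (i, j), i < j, whose documents share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_f1(predicted: list[int], gold: list[int]) -> float:
    """Harmonic mean of pairwise precision and recall (assumed evaluation scheme)."""
    if len(predicted) != len(gold):
        raise ValueError("label lists must have the same length")
    pred_pairs = same_cluster_pairs(predicted)
    gold_pairs = same_cluster_pairs(gold)
    true_positives = len(pred_pairs & gold_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_pairs)
    recall = true_positives / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)
```

Cluster label values themselves do not matter under this scheme, only which documents are grouped together.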
16
Impact of Instance-level Constraints
  • Number of Constraints vs. F1.
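
To make concrete how instance-level constraints can steer a clustering algorithm, the sketch below tracks must-links with a union-find structure and rejects any merge that would place a cannot-linked pair in the same group. The deck does not specify the clustering algorithm used, so this is an assumed, generic mechanism, and `ConstraintSet` is a hypothetical name.

```python
class ConstraintSet:
    """Tracks must-link groups and cannot-link pairs over n documents."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))
        self.cannot: set[tuple[int, int]] = set()

    def find(self, i: int) -> int:
        """Return the representative of i's must-link group (with path halving)."""
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]
            i = self.parent[i]
        return i

    def add_must_link(self, i: int, j: int) -> None:
        """Force i and j (and their groups) into the same cluster."""
        self.parent[self.find(i)] = self.find(j)

    def add_cannot_link(self, i: int, j: int) -> None:
        """Record that i and j must end up in different clusters."""
        self.cannot.add((min(i, j), max(i, j)))

    def can_merge(self, i: int, j: int) -> bool:
        """False if joining the groups of i and j would violate a cannot-link."""
        merged = {self.find(i), self.find(j)}
        return all({self.find(a), self.find(b)} != merged for a, b in self.cannot)
```

A clustering loop could call `can_merge` before joining two groups; each added constraint then removes some candidate merges from consideration.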