Title: AdLaw04
1. The Internet Still Might (but Probably Won't) Change Everything
Jamie Callan, Carnegie Mellon University
Eduard Hovy, USC/Information Sciences Institute
Stuart Shulman, University of Pittsburgh
Stephen Zavestoski, University of San Francisco
Presented October 21, 2004 at the American Bar Association's Administrative Law and Regulatory Practice Conference and October 22, 2004 before the Regulatory Affairs Committee of the United States Chamber of Commerce
2. Acknowledgements
- This work has been supported by grants from the National Science Foundation
- EIA-0089892
- SGER: Citizen Agenda-Setting in the Regulatory Process: Electronic Collection and Synthesis of Public Commentary
- EIA-0327979, 0328175, 0328914, 0328618
- SGER Collaborative: A Testbed for eRulemaking Data
- IIS-0429293
- Collaborative Research: Language Processing Technology for Electronic Rulemaking
- SES-0322662
- Democracy and E-Rulemaking: Comparing Traditional vs. Electronic Comment from a Discursive Democratic Framework
- Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the National Science Foundation
3. An eRulemaking Testbed
- A repository of public comments
- For example
- USDA's National Organic standard
- we are missing lots of paper comments
- EPA's Definition of US Waters (Post-SWANCC ANPR)
- we are missing lots of electronic mail
- DOT's latest CAFE standard
- electronic versus paper presorted, but some sticky PDFs
- A testbed for new tools to analyze the text
- Goal: public and agency personnel experiment with the new tools and provide continuous feedback
- Invitation for other researchers to join the fray
- Real-world data with immediate consequences
http://hartford.lti.cs.cmu.edu/eRulemaking/Data.html
5. What if it were all paper?
- For our research, mercury is the best dataset yet
- 530,000 emails, all plain text, with many duplicative (similar or identical) comments
- Average length about one typed page
- If all 1.8 gigabytes were on paper (which it is)
- it would weigh 5,350 pounds (about 2.7 tons)
- it would make a stack 214 feet high
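As a rough check on the paper figures above, the sketch below reproduces them from a set of assumed values that are not stated on the slide: about 535,000 single-sided pages (roughly one per comment), 500-sheet reams that weigh 5 pounds and stand 2.4 inches tall, and 2,000-pound short tons. These values were picked because they happen to match the quoted numbers; the authors' actual inputs are not given.

```python
# Back-of-envelope check of the paper weight and stack height.
# Every constant below is an assumption, not a value from the slide.
PAGES = 535_000              # assumed page count, about one page per comment
SHEETS_PER_REAM = 500        # assumed ream size
POUNDS_PER_REAM = 5.0        # assumed weight of one ream
INCHES_PER_REAM = 2.4        # assumed thickness of one ream
POUNDS_PER_SHORT_TON = 2000  # US short ton


def paper_estimate(pages: int) -> dict:
    """Convert a page count into reams, pounds, short tons, and stack height."""
    reams = pages / SHEETS_PER_REAM
    pounds = reams * POUNDS_PER_REAM
    return {
        "reams": reams,
        "pounds": pounds,
        "short_tons": pounds / POUNDS_PER_SHORT_TON,
        "stack_feet": reams * INCHES_PER_REAM / 12,
    }


estimate = paper_estimate(PAGES)
for name, value in estimate.items():
    print(f"{name}: {value:,.1f}")
```

Changing any of the assumed constants (for example, a thinner ream) shifts the totals proportionally, so the result is best read as an order-of-magnitude illustration.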
6. Scientific Research Objectives
- Applied Research Objectives
- Help agencies handle the information
- Measure efficiency and quality improvements
- Basic Research Objectives
- Advance Natural Language Processing
- Analyze and categorize text according to several novel dimensions: stakeholders, opinions, arguments
- Advance social science methods for measuring the impact of IT on democracy
- Develop metrics for assessing the quality and content of the public comment
- Develop a public comment database and coding scheme that facilitates greater empirical social science research
7. Problem: Duplicate Detection
- Many public comments are form letters or edited form letters
- Real grassroots or astroturf created by interest groups, lobbies
12. Duplicate Detection Solutions
- Duplicate detection algorithms
- Generate summary counts
- Identify the reference copy
- Summarize differences from reference copy
- Near-duplicate detection techniques
- Use cosine correlation to identify similar documents
- Identify near-duplicates using document fingerprints
- Sequences of words that match in each document
- Output
- A reliable and easy count of duplicates
- Unique passages isolated and displayed/clustered
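The two near-duplicate techniques listed above can be illustrated with a small sketch. It assumes plain bag-of-words term counts for the cosine measure and hashed word n-grams ("shingles") as document fingerprints. The tokenizer, the shingle length, the function names, and the sample comments are all illustrative choices, not the project's actual implementation or data.

```python
import hashlib
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fingerprints(text: str, n: int = 4) -> set[str]:
    """Hash every run of n consecutive words into a short fingerprint."""
    words = tokenize(text)
    shingles = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return {hashlib.sha256(s.encode("utf-8")).hexdigest()[:12] for s in shingles}


def fingerprint_overlap(a: str, b: str, n: int = 4) -> float:
    """Jaccard overlap of the two documents' fingerprint sets."""
    fa, fb = fingerprints(a, n), fingerprints(b, n)
    union = fa | fb
    return len(fa & fb) / len(union) if union else 0.0


# Invented sample comments: a form letter, an edited copy, an unrelated text.
form_letter = ("Please adopt the strongest possible limits on mercury "
               "emissions from power plants.")
edited_copy = ("As a parent, I ask you: please adopt the strongest possible "
               "limits on mercury emissions from power plants.")
unrelated = "The proposed rule would impose unreasonable costs on small utilities."

print(cosine_similarity(form_letter, edited_copy))
print(fingerprint_overlap(form_letter, edited_copy))
print(cosine_similarity(form_letter, unrelated))
```

In this sketch the edited copy scores high on both measures while the unrelated comment shares no fingerprints with the form letter. A threshold on either score could then drive the duplicate counts, and the non-matching shingles point to the unique passages.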
15-20. Near-Duplicate Detection Examples
21. Capturing the Public's Comments
Par 2.2(a)
I am for this
I am against this
because
This is what should be done (please be specific)
- Help the public to formulate comments
- Link to existing commentary, regulation draft,
other material
pollution _____________ (see similar views)
expense ______________ (see similar views)
unfit for elderly ____________ (see similar views)
unfit for children ___________ (see similar views)
add a new reason ___________ (explore all views)
relax requirement c _____________ (similar ideas)
phase in alternative _____________ (similar ideas)
add a new suggestion _______________________ _________________ (explore all suggestions)
22. Clustering By Opinion
- For each (sub)topic
- Group together all Yes, No, and In-betweens
- Create these categories manually and have the system learn to duplicate that
- Extract reasons/motivations/authorities using characteristic phrases
- "against it because I think X"
- "Y will have such a beneficial effect"
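Reason extraction with characteristic phrases, as outlined above, might look like the following minimal sketch. The two regular expressions and the sample comments are hypothetical illustrations built from the slide's example phrases. A working system would presumably need many more patterns, or would learn them from manually labeled comments.

```python
import re

# Hypothetical cue-phrase patterns; each one captures the stated reason.
REASON_PATTERNS = [
    ("anti", re.compile(r"\bagainst (?:it|this) because (?P<reason>[^.!?]+)", re.I)),
    ("pro", re.compile(r"(?P<reason>[^.!?]+?) will have such a beneficial effect", re.I)),
]


def extract_reasons(comment: str) -> list[tuple[str, str]]:
    """Return (stance, reason) pairs for every cue phrase found in the comment."""
    found = []
    for stance, pattern in REASON_PATTERNS:
        for match in pattern.finditer(comment):
            found.append((stance, match.group("reason").strip()))
    return found


# Invented sample comments for illustration only.
samples = [
    "I am against it because I think the cost is too high.",
    "Cleaner air will have such a beneficial effect on children.",
]
for text in samples:
    print(extract_reasons(text))
```

Comments that match no pattern return an empty list, which makes it easy to route them to a human reader instead of guessing a stance.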
23. An Analyst's Workbench
Main Opinions
- pro (19,566)
- pollution (11,003)
- energy-efficient (9,812)
- safe or safety (534)
-
- guarded pro (4,661)
- energy (3,652)
- safe or safety (1,202)
-
- anti (8,002)
- cost, expense, expensive (7,314)
-
- guarded anti (758)
- cost (500)
- difficult (105)
-
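Counts like those above could be produced by tallying opinion labels and keyword hits over labeled comments. The sketch below assumes each comment already carries an opinion label (for example, from a trained classifier) and uses a small hypothetical keyword table. The sample comments are invented for illustration and are unrelated to the counts shown on the slide.

```python
from collections import Counter, defaultdict

# Hypothetical keyword table: display label -> trigger words.
KEYWORD_GROUPS = {
    "pollution": ["pollution"],
    "safe or safety": ["safe", "safety"],
    "cost, expense, expensive": ["cost", "expense", "expensive"],
}


def tally(comments):
    """Count comments per opinion and, within each opinion, per keyword group."""
    opinion_counts = Counter()
    keyword_counts = defaultdict(Counter)
    for opinion, text in comments:
        opinion_counts[opinion] += 1
        # Crude tokenization: strip commas and periods, then split on whitespace.
        words = set(text.lower().replace(",", " ").replace(".", " ").split())
        for label, terms in KEYWORD_GROUPS.items():
            if words & set(terms):
                keyword_counts[opinion][label] += 1
    return opinion_counts, keyword_counts


# Invented (opinion, text) pairs for illustration only.
comments = [
    ("pro", "This rule will cut pollution."),
    ("pro", "It reduces pollution and keeps families safe."),
    ("anti", "The cost is far too high."),
]
opinions, keywords = tally(comments)
for opinion, total in opinions.most_common():
    print(f"- {opinion} ({total:,})")
    for label, count in keywords[opinion].most_common():
        print(f"  - {label} ({count:,})")
```

A comment is counted once per keyword group it mentions, so the keyword counts under an opinion can add up to more than that opinion's total.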
24. Display Ideas
- Grouped by opinion and writer type
- Grouped by topic and cross-correlated
- Par 2.2(a1)
- Con
- 150, 818: impossible to maintain
- 272: too expensive for elderly
- Pro
- 169, 213, 391, 392, 394: already being done in Alaska
- 18: extend to children
25. Putting It All Together
- Assemble tools in a Reg-Writer Workbench
- With interface/display, reg-writer is able to
- switch modules on or off
- move smoothly from one display to another
- Ex., from comments to regulation to cluster
- drag information easily into the response-writer window
- produce integrated regulatory text, response, comments, etc.
26. Available online at http://erulemaking.ucsur.pitt.edu
28. Thanks!
- Dr. Stuart W. Shulman, University of Pittsburgh
- Shulman@pitt.edu (email)
- http://shulman.ucsur.pitt.edu (home page)
- 412.918.1651 (voice)