Title: Daniel G. Bobrow
1. Enhancing Legal Discovery with Linguistic Processing
- Daniel G. Bobrow
- Research Fellow
- Palo Alto Research Center Inc.
- with Tracy King and Lawrence Lee
- June 4, 2007
2. The problems in Legal Discovery
- Recall
- Nothing relevant left behind
- Precision
- Very little irrelevant to ignore
- Scalability
- Need to handle more and more
- Privacy
- What they see is only what they should get
3. Today: negotiated keyword search protocol
- "All documents discussing or referencing scientific research on the effects of secondhand smoking published prior to 1985."
- Defendants' Initial Proposal: secondhand smok! and (finding or science or or research) and (1985 or 1984 or 1983 or 1982 or 1981 or 1980 or 197! or 196! or 195!)
- Plaintiffs' Rejoinder: ((find! or result! or effect!) w/page (secondhand or second hand)) or (other! w/5 smok!)
- "All documents relating to destruction of records under defendants' records retention policies and practices."
- Defendants' Initial Proposal: records and destruction
- Plaintiffs' Counterproposal: destr! or elim! or dispos! or purg! or recycl! or retain! or reten!
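The `!` truncation operator and the `w/5` proximity connector in these proposals can be approximated with regular expressions. The following is a minimal Python sketch, assuming `!` means "any word ending" and `w/5` means "within five words in either order"; the function names are hypothetical, and real search platforms may interpret these operators differently.

```python
import re

def truncation_to_regex(term):
    """Turn a term such as 'destr!' into a whole-word regex (assumed semantics)."""
    if term.endswith("!"):
        return re.compile(r"\b" + re.escape(term[:-1]) + r"\w*", re.IGNORECASE)
    return re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)

def matches_any(terms, text):
    """True if any term of an OR-list matches somewhere in the text."""
    return any(truncation_to_regex(t).search(text) for t in terms)

def within(term_a, term_b, n, text):
    """True if term_a and term_b occur within n words of each other."""
    words = re.findall(r"\w+", text)
    rx_a, rx_b = truncation_to_regex(term_a), truncation_to_regex(term_b)
    pos_a = [i for i, w in enumerate(words) if rx_a.fullmatch(w)]
    pos_b = [i for i, w in enumerate(words) if rx_b.fullmatch(w)]
    return any(abs(i - j) <= n for i in pos_a for j in pos_b)

# The plaintiffs' counterproposal from the slide, as an OR-list.
PLAINTIFF_TERMS = ["destr!", "elim!", "dispos!", "purg!", "recycl!", "retain!", "reten!"]
```

Under these assumptions, `matches_any(PLAINTIFF_TERMS, "The memo was purged last week")` is true, while a sentence such as "The files were shredded" is missed because no listed stem covers "shredded". That is the recall gap a negotiated keyword list leaves open.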
4. Linguistic enhancement of keyword queries
- Inflexional morphology: forms of verbs
- destroy → destroys, destroyed, destroying, ...
- comply → complies, complied, complying
- Derivational morphology: verbs → nouns
- destroy → destruction, destroyer, ...
- comply → compliance, ...
- retain → retention, ...
- Word taxonomy (e.g. WordNet)
- result → consequence, effect, outcome, result, event, issue, upshot
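The expansion described on this slide can be sketched as a lookup that turns each keyword into an OR-group of related forms. This is a minimal Python illustration, not the presenters' implementation: the three tables contain only the examples listed on the slide, and the names `expand_term` and `expand_query` are hypothetical. A real system would use a morphological analyzer and a resource such as WordNet in place of the hand-built tables.

```python
# Tiny hand-built lexicon holding only the slide's examples (an assumption
# for illustration; not a real morphological analyzer or WordNet lookup).
INFLECTIONS = {
    "destroy": ["destroys", "destroyed", "destroying"],
    "comply": ["complies", "complied", "complying"],
}
DERIVATIONS = {
    "destroy": ["destruction", "destroyer"],
    "comply": ["compliance"],
    "retain": ["retention"],
}
SYNONYMS = {
    "result": ["consequence", "effect", "outcome", "event", "issue", "upshot"],
}

def expand_term(term):
    """Return the term followed by its known inflected, derived, and related forms."""
    expanded = [term]
    for table in (INFLECTIONS, DERIVATIONS, SYNONYMS):
        for form in table.get(term, []):
            if form not in expanded:
                expanded.append(form)
    return expanded

def expand_query(terms):
    """Expand each keyword into an OR-group and join the groups with 'and'."""
    return " and ".join("(" + " or ".join(expand_term(t)) + ")" for t in terms)
```

For example, `expand_query(["retain", "result"])` yields `(retain or retention) and (result or consequence or effect or outcome or event or issue or upshot)`. Terms missing from the tables pass through unchanged.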
5. Processing the collection rather than the queries
ASKER: A Semantically-indexed Knowledge Repository
[Diagram labels: Intelligence Source Documents; Text Passages; Passage, AKR index terms; Query; Query AKR; Expand; Simplify; Query index terms; Retrieved passages AKR; Filtered answers]
6. Normalize to Semantic Representation
- Syntactic Normalization
- morphological
- bought → buy + past
- structural
- the file was lost by Mary → Mary lost the file
- derivational
- the destruction of the memo by the CEO → the CEO destroyed the memo
- Semantic normalization
- word to list of WordNet synsets
- buy → buy, purchase, ...
- Connect predicate and arguments
- Pred: destroy, Agent: CEO, Theme: memo
- Fill in implicit arguments
- Ed was easy to please → Ed was pleased
7. Improved Recall (Google and Asker on Wikipedia)
- Query: How many terrorists have died?
- Google
- In addition to the 19 hijackers, 2973 people died in the terrorist attack ...
- Although there were security alerts at many locations, no other terrorist incidents occurred outside central London.
- This is a list of sportspeople who have died
- Asker
- The encounter resulted in the deaths of two terrorists of the Al Omar Tanzeem
- In blazing gunfire, five of the insurgents perished
- see to it that those terrorists die and are broken
8. Improved Precision (Using argument roles for relevance test)
- Query: What terrorists have been killed?
- Google
- .. not include most people killed in big terrorist bombings
- act of terrorism in which 93 innocent people have been killed or are missing in the ruins
- Asker
- During a two-hour gun battle in Mdantsane, police kill a terrorist or freedom fighter
- All the three terrorists killed in this incident have been identified as Pakistani Nationals.
- the former Socialist government carried out a covert campaign in which 27 suspected Basque terrorists were killed.
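The difference between keyword matching and a relevance test on argument roles can be illustrated with a toy example. In this Python sketch the `Fact` record is a simplified stand-in for a normalized predicate-argument representation, and the role assignments in `FACTS` are hand-written assumptions about how three of the quoted passages might be analyzed. None of it reflects the actual Asker data structures.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Fact:
    """A simplified predicate-argument record for one passage (hypothetical)."""
    pred: str
    agent: Optional[str]
    theme: Optional[str]
    passage: str

# Hand-normalized records for three passages quoted on the slide (assumed analyses).
FACTS = [
    Fact("kill", "police", "terrorist", "police kill a terrorist or freedom fighter"),
    Fact("kill", None, "terrorist", "27 suspected Basque terrorists were killed"),
    Fact("kill", None, "people",
         "act of terrorism in which 93 innocent people have been killed"),
]

def keyword_match(fact, stems):
    """Keyword-style test: every stem appears somewhere in the passage text."""
    text = fact.passage.lower()
    return all(stem in text for stem in stems)

def role_match(fact, pred, theme):
    """Role-based test: the predicate matches and the theme fills the queried role."""
    return fact.pred == pred and fact.theme == theme

keyword_hits = [f for f in FACTS if keyword_match(f, ["terror", "kill"])]
role_hits = [f for f in FACTS if role_match(f, "kill", "terrorist")]
```

Here all three passages pass the keyword test, but only the two in which a terrorist is the one killed pass the role test. The "93 innocent people" passage is filtered out.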
9. Scalability (Cost of doing linguistic processing at scale)
- Linguistic processing time: < 1 CPU sec/sentence
- parsing, semantic normalization, indexing
- Assumptions
- Average collection size: 100 million documents
- Document size: 25 sentences
- 8-core processor: $6K or $250/month (depreciated and housed for 3 years)
- 2.5 million seconds/month → 100,000 documents/core/month
- Cost for handling 100 million documents/month
- 1000 cores = 125 processors × $250 ≈ $32,000
- Use human review; query costs are in the noise
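The arithmetic behind this estimate can be reproduced in a few lines. The Python sketch below uses only the slide's own figures and assumes the monetary amounts are US dollars; the constant and function names are hypothetical.

```python
import math

SECONDS_PER_SENTENCE = 1             # upper bound from the slide (< 1 CPU sec/sentence)
SENTENCES_PER_DOC = 25
DOCS_PER_MONTH = 100_000_000
CORE_SECONDS_PER_MONTH = 2_500_000   # slide's rounded figure (30 days is about 2.59 million s)
CORES_PER_PROCESSOR = 8
DOLLARS_PER_PROCESSOR_MONTH = 250    # assumes the slide's 250/month is in US dollars

def monthly_cost():
    """Return (docs per core, cores, processors, dollars) for one month of processing."""
    seconds_per_doc = SECONDS_PER_SENTENCE * SENTENCES_PER_DOC
    docs_per_core = CORE_SECONDS_PER_MONTH // seconds_per_doc
    cores = math.ceil(DOCS_PER_MONTH / docs_per_core)
    processors = math.ceil(cores / CORES_PER_PROCESSOR)
    return docs_per_core, cores, processors, processors * DOLLARS_PER_PROCESSOR_MONTH
```

With these inputs the function returns 100,000 documents per core, 1000 cores, 125 processors, and 31,250 dollars per month, which the slide rounds to 32,000.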
10. Privacy
- Identify sensitive content by entity type and relationship (linguistic processing)
- e.g. phone numbers of people
- Encrypt content to make content unreadable (PARC security technology)
- Provide content-specific keys for those people with a need to know specific information
- Additional PARC security technologies can identify additional content to be redacted to mitigate inference channels
- Can redacted information be discovered based on what is available?
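The first step above, identifying sensitive content by entity type, can be sketched with a simple pattern matcher. This Python example is only a stand-in: it uses a simplified regular expression for North American phone numbers rather than linguistic processing, it replaces matches with a label instead of encrypting them, and it does not model the PARC security technologies mentioned on the slide. The pattern and the `redact` function are hypothetical.

```python
import re

# Simplified North American phone-number pattern (an assumption for illustration).
PHONE = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")

def redact(text, label="[PHONE]"):
    """Replace each phone number with a label; return the new text and the removed values."""
    found = PHONE.findall(text)
    return PHONE.sub(label, text), found

redacted, removed = redact("Call Mary at 650-555-0100 or (650) 555-0199.")
```

Here `redacted` is "Call Mary at [PHONE] or [PHONE]." and `removed` holds the two numbers. In a fuller design the removed values would be encrypted under content-specific keys rather than discarded.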
11. Linguistic processing can be useful in legal discovery
With good Recall, Precision, Scalability, Privacy