Title: L'INFORMAZIONE GIURIDICA
1. A SEMANTIC MODELING APPROACH FOR THE RETRIEVAL OF LEGISLATION AND DOCTRINE: AN ITALIAN EXPERIENCE
Ginevra Peruginelli, Enrico Francesconi
Institute of Legal Information Theory and Techniques (ITTIG-CNR), Florence, Italy
8th Int. Conf. Law Via The Internet, 26 October 2007
2. Contents
- The Project of a Portal to Legal Literature
- Phases
- Standards for legal document identification
- Software architecture of the Portal to Legal Literature
- Standard implementation
- Metadata generation
3. LEGAL INFORMATION CONTENT
- Legislation
- Case law
- Administrative acts
- Legal literature
Links between these legal sources are essential.
4. PROJECT FOR A PORTAL TO LEGAL LITERATURE
Developed as a feasibility study by the Institute of Legal Information Theory and Techniques
- 1st phase
- - Develop integrated tools for generating and capturing metadata for structured and unstructured web documents
- 2nd phase
- - Providing multilingual access to foreign legal resources
- 3rd phase
- - Enhancement of the existing solutions oriented to semantic searching
- - Implementation of specific facilities to support legal users of the Portal in semantic querying
- 4th phase
- - Linking legal sources in a shared web environment
- - A resolution mechanism for legal doctrine
5. THE 1st PHASE: A FEDERATION SYSTEM
- Data of the Portal can be divided into two classes:
- - structured data (repositories in libraries, using specific metadata schemes)
- - web documents (usually without any particular metadata scheme), selected by a group of ITTIG legal experts and used to perform a selective exploration of the web and to train the modules aimed at selecting and classifying web documents
The different data sources are retrieved by exploiting a common metadata scheme (DC, the Dublin Core Metadata Scheme) and harvested according to the OAI-PMH protocol.
6. THE 2nd PHASE: MULTILINGUAL ACCESS
Aim: to foster and facilitate worldwide communication in the legal academic world, the legal professional sector, the business world, and public administration services to citizens
Obstacles:
1) Complexity and richness of each legal language
2) Differences between legal concepts inherent to the diverse national legal systems
System-bound nature of legal terminology
Example:
- In England, a mortgagee becomes a conditional owner of the property mortgaged to him, but not its possessor
- In Spain and in France, the hypothécaire gains neither ownership nor possession of the mortgaged property unless he enforces the mortgage
7. CROSS-LANGUAGE RETRIEVAL OF LEGAL INFORMATION
- Querying and retrieving multi-language documents involves problems of query translation
- Especially in the legal domain, a word in a native query language can be ambiguous
- A word can have different translations in a target language, each corresponding to a legal category in the target legal system
8. QUERY EXAMPLE
- Italian user query: "Give me all the documents related to dolo"
- Italian system: documents related to dolo
- English system: dolo is an ambiguous word, with two translations
- - fraud (private law): documents related to fraud
- - malice (criminal law): documents related to malice
Query contextualization is a key issue for focused multi-language document retrieval.
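The contextualization step can be illustrated with a minimal sketch. It assumes a hypothetical bilingual table (`TRANSLATIONS`) holding only the dolo example from this slide; the function name and data layout are illustrative and not taken from the Portal.

```python
# Hypothetical bilingual vocabulary: each Italian term maps to English
# translations keyed by category of law (only the slide's example is included).
TRANSLATIONS = {
    "dolo": {"private law": "fraud", "criminal law": "malice"},
}


def translate_query(term, category=None):
    """Return the candidate English translations of an Italian legal term."""
    by_category = TRANSLATIONS.get(term, {})
    if category is None:
        # Without a category of law every translation is a candidate,
        # so the translated query stays ambiguous.
        return sorted(by_category.values())
    translation = by_category.get(category)
    return [translation] if translation else []


print(translate_query("dolo"))                  # ['fraud', 'malice']
print(translate_query("dolo", "criminal law"))  # ['malice']
```

Supplying the category of law collapses the two candidates into one, which is the effect the slide attributes to query contextualization.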
9. THE 3rd PHASE: SEMANTIC QUERYING
- Assigning high-quality metadata (dc:subject, category of law)
- - Adoption of DC metadata
- - Query contextualization according to a category of law
- Exploitation of a controlled legal vocabulary, allowing query interpretation based on approximation
10. DC METADATA AND RETRIEVAL PRECISION
DC metadata (in particular dc:subject) mainly enhance the precision of retrieval.
[Figure: query "Evidence" in the category "Criminal Procedure"; two documents with dc:description "Evidence", one with dc:subject "Criminal Procedure" and one with dc:subject "Civil Procedure".]
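A small sketch of how a dc:subject filter can raise precision, using the two documents from this slide. The `Record` class, the identifiers `doc-1` and `doc-2`, and the substring matching are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Record:
    identifier: str
    description: str  # dc:description
    subject: str      # dc:subject (category of law)


# The two documents of the slide; the identifiers are invented.
RECORDS = [
    Record("doc-1", "Evidence", "Criminal Procedure"),
    Record("doc-2", "Evidence", "Civil Procedure"),
]


def search(records, keyword, subject=None):
    """Keyword match on dc:description, optionally restricted by dc:subject."""
    hits = [r for r in records if keyword.lower() in r.description.lower()]
    if subject is not None:
        hits = [r for r in hits if r.subject == subject]
    return [r.identifier for r in hits]


print(search(RECORDS, "evidence"))                        # ['doc-1', 'doc-2']
print(search(RECORDS, "evidence", "Criminal Procedure"))  # ['doc-1']
```

The keyword alone returns both documents; adding the category of law discards the civil procedure one.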
11. LEGAL VOCABULARY AND RETRIEVAL RECALL
A controlled legal vocabulary mainly enhances the recall of retrieval.
[Figure: query "Evidence" in the category "Criminal Procedure"; a document with dc:description "witnesses" and dc:subject "Criminal Procedure" gives NO DOC for the literal query; related terms from a controlled vocabulary: Proof, Criminal Invest., Witness.]
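The recall effect can be sketched as a query expansion step. The `RELATED_TERMS` table below holds a subset of the related terms named on this slide; the document identifier and the substring matching are illustrative assumptions.

```python
# Related terms for "evidence", a subset of those shown on the slide.
RELATED_TERMS = {"evidence": ["proof", "witness"]}

# One document, described only by the word "witnesses" (identifier invented).
DESCRIPTIONS = {"doc-3": "witnesses"}


def expand(keyword):
    """Return the keyword followed by its related terms, all lower-cased."""
    key = keyword.lower()
    return [key] + RELATED_TERMS.get(key, [])


def search(descriptions, keyword, use_vocabulary=False):
    terms = expand(keyword) if use_vocabulary else [keyword.lower()]
    return [doc_id for doc_id, text in descriptions.items()
            if any(term in text.lower() for term in terms)]


print(search(DESCRIPTIONS, "Evidence"))                       # []
print(search(DESCRIPTIONS, "Evidence", use_vocabulary=True))  # ['doc-3']
```

The literal query finds nothing, while the expanded query reaches the document through the related term "witness".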
12. THE 4th PHASE: LINKING LEGAL SOURCES
- URL: the subset of URIs that identify resources via a representation of their primary access mechanism (e.g., their network location). No longer adequate
- URN: an unambiguous identifier, allowing references to be expressed in a stable way, independently of the document's physical location
Goal: a system of references based on a resolution mechanism (RDS, Resolver Discovery Service) able to retrieve the corresponding object regardless of the resource's physical location.
13. THE URN-NIR STANDARD FOR LEGISLATION
- The URN-NIR is defined as a combination of elements, according to a specific grammar
- Examples:
- - Decree of the Ministry of Finance of 20.12.99: urn:nir:ministero.finanze:decreto:1999-12-20
- - Communications and Health Inter-ministerial Regulation, 9 September 1998: urn:nir:ministero.comunicazioni+ministero.salute:regolamento:1998-09-09
- Parser: available on-line, automatically detects references within laws
- Resolution service: resolves URNs into URLs
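As a rough illustration of how such an identifier can be assembled, the sketch below builds a URN from an issuing authority, a document type and a date. It covers only the three elements visible in the examples and assumes the layout `urn:nir:<authority>:<type>:<date>`, with `+` joining several authorities; the function name is hypothetical and the full grammar has more elements than shown here.

```python
from datetime import date


def build_urn_nir(authorities, doc_type, issued):
    """Assemble a URN from issuing authorities, document type and issue date.

    Assumed layout: urn:nir:<authority>[+<authority>...]:<type>:<yyyy-mm-dd>
    """
    if not authorities:
        raise ValueError("at least one issuing authority is required")
    authority = "+".join(authorities)
    return f"urn:nir:{authority}:{doc_type}:{issued.isoformat()}"


print(build_urn_nir(["ministero.finanze"], "decreto", date(1999, 12, 20)))
# urn:nir:ministero.finanze:decreto:1999-12-20
```

Because the identifier is derived from properties of the act itself rather than from its location, the same URN stays valid when the document moves, which is what the resolution service relies on.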
14. DEVELOPMENT OF A STANDARD FOR LEGAL LITERATURE
- Many identifiers are available (ISBN, ISSN, BICI, SICI, DOI)
- But an intelligent, unique, self-explanatory standard for bibliographic materials has not been developed so far
- Approach
- - Agreement on a common document identification technique and a standard data scheme
- - A cooperative layer for the availability of common services and shared information
15. GRAMMAR OF THE STANDARD FOR IDENTIFYING LEGAL LITERATURE
The grammar of URN-Legal Doctrine reflects the features of legal doctrine material.
Example: Sorge, C. and Bergfelder, M. (2004), "Signatures by Electronic Agents: A Legal Perspective", in Cevenini, C. (editor), The Law and Electronic Agents: Proceedings of the LEA workshop, Bologna, June 2004, pp. 141-153
urnlldsorgebergfeldersignatures.by.electronic.agents.a.legal.perspective200406the.law.and.electronic.agents.proceeding.of141_at_inproceedings
The URN can be created automatically from the bibliographic citation and be easily adopted by all providers of legal doctrine.
16. SERVICES OF AN OPENURL RESOLVER
- Search authors in other databases
- Check how many citations
- Look for the author's email
- Which library owns the printed review?
- Appropriate full text
- If the full text is not available, document delivery service?
17. WHAT IS NEEDED
- From the users' point of view, there is a need for:
- - Knowledge-based tools for supporting semantic searching
- - Persistent identifiers for reliable, context-sensitive access to digital objects
- - Navigation across legislation, case law and literature materials
18. FEDERATED DATA
- Data of the prototype can be divided into two classes:
- - Structured data
- - Web documents
19. Metadata Approach
- Different data sources are described using a common metadata scheme (DC metadata, XML encoded, as target bridging format)
- - Integration and a uniform view of different data sources
- - Implementation of search facilities based on those standards
- Metadata for structured resources
- - Proprietary metadata mapped to Dublin Core simple and DCMI Cite (journal articles)
- Metadata for web documents
- - A specific module generates a subset of DC metadata, without requiring data providers to adopt a predefined format
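The mapping of proprietary metadata towards Dublin Core can be pictured as a simple crosswalk table. In the sketch below the source field names (`titolo`, `autore`, `anno`, `classificazione`) are hypothetical; only the Dublin Core element names on the right-hand side are standard.

```python
# Hypothetical proprietary field names mapped to Dublin Core elements.
CROSSWALK = {
    "titolo": "dc:title",
    "autore": "dc:creator",
    "anno": "dc:date",
    "classificazione": "dc:subject",
}


def to_dublin_core(record):
    """Map a proprietary record (dict) onto repeatable Dublin Core elements."""
    dc = {}
    for field, value in record.items():
        target = CROSSWALK.get(field)
        if target is None:
            continue  # fields without a DC equivalent are dropped
        dc.setdefault(target, []).append(value)
    return dc


source = {"titolo": "La prova nel processo penale", "anno": "1999", "scaffale": "B12"}
print(to_dublin_core(source))
# {'dc:title': ['La prova nel processo penale'], 'dc:date': ['1999']}
```

Values are kept in lists because Dublin Core elements are repeatable; a real crosswalk would also need rules for splitting and normalising values.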
20. The architecture of the federation system
[Figure: architecture of the federation system. Data Providers: structured data repositories (publisher catalog, library catalog, DOGI) and WWW documents (HTML pages on the web). Service Provider: a harvester selects metadata records from structured data providers (OAI-PMH) and applies a DC mapping; a focused crawler selects documents of interest from web sites and passes them to an automatic metadata generator. The resulting metadata (DC-Qualified, DC-XML) go through the indexer into the index of the Portal.]
21. Structured Resource Management: the Open Archives Initiative (OAI)
- Interoperability among structured resources through metadata exchange
- OAI Protocol for Metadata Harvesting (OAI-PMH)
- - Data Providers are repositories that expose structured metadata through OAI-PMH
- - The Service Provider uses OAI-PMH to harvest metadata
- - OAI-PMH is a set of six verbs that are invoked within HTTP
23. OAI-PMH verbs
- Identify: http://localhost:8080/OaiDataProvider/servlet?verb=Identify
- ListMetadataFormats: http://localhost:8080/OaiDataProvider/servlet?verb=ListMetadataFormats&metadataPrefix=oai_dc
- ListSets: http://localhost:8080/OaiDataProvider/servlet?verb=ListSets
- ListIdentifiers: http://localhost:8080/OaiDataProvider/servlet?verb=ListIdentifiers&metadataPrefix=oai_dc
- ListRecords: http://localhost:8080/OaiDataProvider/servlet?verb=ListRecords&metadataPrefix=oai_dc
- GetRecord: http://localhost:8080/OaiDataProvider/servlet?verb=GetRecord&identifier=oai:dogi.ittig.cnr.it/1999Z0529&metadataPrefix=oai_dc
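A short sketch of how a service provider might compose these requests. It assumes the base URL `http://localhost:8080/OaiDataProvider/servlet` from the examples on this slide; the helper name `oai_request` is illustrative, and the sketch only builds URLs without contacting any server.

```python
from urllib.parse import urlencode

# Base URL of the data provider, as used in the slide's examples.
BASE_URL = "http://localhost:8080/OaiDataProvider/servlet"

# The six verbs defined by OAI-PMH.
VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def oai_request(verb, **arguments):
    """Build the HTTP GET URL for an OAI-PMH request."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    # urlencode percent-encodes reserved characters such as ':' and '/'.
    query = urlencode({"verb": verb, **arguments})
    return f"{BASE_URL}?{query}"


print(oai_request("ListRecords", metadataPrefix="oai_dc"))
# http://localhost:8080/OaiDataProvider/servlet?verb=ListRecords&metadataPrefix=oai_dc
```

The verb and its arguments travel as ordinary query parameters, so a harvester needs nothing beyond an HTTP client and an XML parser for the responses.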
24. Selection and Harvesting of Structured Resources Metadata
- Selection
- - Bibliographic repositories in libraries
- - UDC (Universal Decimal Classification)
- - DDC (Dewey Decimal Classification)
- - 34X notations plus a subset of additional classes
- Harvesting
- - Structured repositories are adapted as OAI Data Providers
- - OAI Protocol for Metadata Harvesting (OAI-PMH)
26. A Focused Crawler to Select Legal Web Documents
- The exploration of the web starts from a subset of sites of interest, selected by a group of legal experts
- The crawler then follows a policy of pursuing the hyperlinks with a high probability of leading to documents of interest
- This probability is estimated by an automatic classifier applied to a set of words in the vicinity of the hyperlink
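The crawling policy can be sketched as a best-first traversal. Everything below is a toy under stated assumptions: the web is an in-memory dictionary (`PAGES`), and `link_score` is a keyword-share heuristic standing in for the automatic classifier mentioned above, not the classifier actually used in the project.

```python
import heapq

# Toy in-memory "web": page URL -> list of (target URL, text near the hyperlink).
PAGES = {
    "seed": [("a", "criminal law review articles"), ("b", "football results today")],
    "a": [("c", "case law commentary")],
    "b": [("d", "weather forecast")],
    "c": [],
    "d": [],
}

LEGAL_WORDS = {"law", "legal", "criminal", "case", "court", "commentary"}


def link_score(context):
    """Stand-in for the classifier: share of nearby words that look legal."""
    words = context.lower().split()
    return sum(w in LEGAL_WORDS for w in words) / len(words) if words else 0.0


def focused_crawl(seeds, threshold=0.3, limit=10):
    """Visit pages best-first, following only links scored above threshold."""
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated scores
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for target, context in PAGES.get(url, []):
            score = link_score(context)
            if target not in seen and score >= threshold:
                seen.add(target)
                heapq.heappush(frontier, (-score, target))
    return visited


print(focused_crawl(["seed"]))  # ['seed', 'a', 'c']
```

Pages `b` and `d` are never fetched because the words around the link to `b` score zero, which is the saving a focused crawler aims for compared with an exhaustive crawl.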
27. Metadata for Web Documents
- Implementing metadata for web documents is a key issue for improving search facilities
- Different approaches have been proposed in the literature:
- - Reliable mapping of metadata originally included in web documents, combined with the collaboration of the authors
- - Automatic metadata generation on the basis of keywords supplied by the authors
28. Automatic DC Metadata Generation for Web Documents
- Our approach
- - automatic generation of a DC metadata subset
- - it supports the intellectual activity of organizing web documents
- Main criteria used for DC metadata generation
- - properties of the documents (URL as dc:identifier)
- - mapping between particular tags (<title> HTML tag for dc:title)
- - HTML meta tags (such as meta description for dc:description)
- Particular attention to document classification (dc:subject). We have tested two machine learning approaches for document classification:
- - Naive Bayes
- - Multiclass Support Vector Machine (Vapnik; Crammer and Singer)
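The three mapping criteria listed above can be sketched with the standard library's HTML parser. The function and class names are illustrative, the example URL is invented, and the classification step that fills dc:subject is left out.

```python
from html.parser import HTMLParser


class _HeadParser(HTMLParser):
    """Collect the <title> text and the content of <meta name="description">."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attributes.get("name") or "").lower() == "description":
            self.description = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def generate_dc(url, html):
    """Derive a small Dublin Core subset from a web page."""
    parser = _HeadParser()
    parser.feed(html)
    return {
        "dc:identifier": url,                            # property of the document
        "dc:title": parser.title.strip(),                # from the <title> tag
        "dc:description": parser.description.strip(),    # from the meta description
    }


page = ('<html><head><title>Diritto penale</title>'
        '<meta name="description" content="Notes on criminal procedure">'
        '</head><body></body></html>')
print(generate_dc("http://example.org/doc.html", page))
```

Pages without a title or a meta description simply yield empty strings, so the data provider is not required to adopt any particular format.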
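29. Naive Bayes Classifier
[Figure: a web document $d_j$ is scored against the classes $C_0, C_1, \ldots, C_{n-1}$; the classifier outputs $P(c_0 \mid d_j), P(c_1 \mid d_j), \ldots, P(c_{n-1} \mid d_j)$, which give the class ranking.]
Bag of words: $d_j = (w_{0j}, \ldots, w_{kj}, \ldots)$, where $w_{kj}$ is a function of the frequency of the $k$-th word in document $d_j$
Naive hypothesis: words in a document occur independently of each other given the class
Bayes Theorem: $P(c_i \mid d_j) = \frac{P(d_j \mid c_i)\,P(c_i)}{P(d_j)}$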
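A compact sketch of class ranking with Naive Bayes over a bag of words. The multinomial word model and the add-one (Laplace) smoothing are assumptions, since the slide does not say which variant was used; the training examples are invented.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (list_of_words, class_label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in examples:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def rank_classes(model, words):
    """Return the class labels sorted by decreasing log P(c) + sum log P(w | c)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior P(c)
        for w in words:
            if w not in vocabulary:
                continue  # unseen words carry no information
            # Naive hypothesis: word likelihoods multiply (add in log space);
            # add-one smoothing keeps every probability positive.
            score += math.log((word_counts[label][w] + 1)
                              / (total_words + len(vocabulary)))
        scores[label] = score
    return sorted(scores, key=scores.get, reverse=True)


model = train([
    (["tax", "income"], "Taxation law"),
    (["crime", "penalty"], "Criminal law"),
])
print(rank_classes(model, ["tax"]))  # ['Taxation law', 'Criminal law']
```

The denominator of Bayes' theorem is the same for every class, so it is dropped when only the ranking matters.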
30. Support Vector Machine (SVM)
- Given a training set of positive (S+) and negative (S-) examples, SVM determines the surface si which divides S+ from S- with the widest distance
- Linear case
[Figure: linear case, with the examples of S+ and S- separated by the surface si.]
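The notion of "widest distance" in the linear case can be made concrete by measuring the margin of a candidate hyperplane. The sketch below does not train an SVM; it only compares two hand-picked separating surfaces on invented points to show which one an SVM would prefer.

```python
import math


def geometric_margin(w, b, positives, negatives):
    """Smallest distance from any example to the hyperplane w.x + b = 0.

    The result is negative if some example lies on the wrong side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))

    def signed_distance(x):
        return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

    closest_positive = min(signed_distance(x) for x in positives)
    closest_negative = min(-signed_distance(x) for x in negatives)
    return min(closest_positive, closest_negative)


positives = [(2.0, 2.0), (3.0, 3.0)]   # invented examples of S+
negatives = [(0.0, 0.0), (-1.0, 0.0)]  # invented examples of S-

diagonal = geometric_margin((1.0, 1.0), -2.0, positives, negatives)
vertical = geometric_margin((1.0, 0.0), -1.0, positives, negatives)
print(diagonal > vertical)  # True
```

Both surfaces separate the two sets, but the diagonal one leaves more room on each side; training an SVM amounts to searching for the separating surface that maximises this quantity.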
31. Multiclass Support Vector Machine (MSVM)
32. Document Representation
- A document is represented by a vector of term weights $d_j = (w_1, \ldots, w_T)$; three different types of weights have been tested:
- - Binary weights (presence/absence)
- - Term frequency weight (tf)
- - TF-IDF weight (tfidf, penalizes terms occurring in many different documents, as they are less discriminative)
- Pre-processing to increase the statistical quality of terms:
- - Stemming (reduction of terms to their morphological root)
- - Stopwords (deletion of very frequent terms)
- - Digits and non-alphanumeric characters represented by a single special character
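The three weighting schemes and part of the pre-processing can be sketched as follows. The stopword list, the `#` placeholder and the `log(N / df)` form of the inverse document frequency are assumptions chosen for illustration; stemming is omitted because it needs a language-specific stemmer.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "of", "a", "in", "and"}  # illustrative list only
SPECIAL = "#"  # single placeholder for digits and non-alphanumeric characters


def tokenize(text):
    """Lower-case, replace non-letter runs by SPECIAL, drop stopwords."""
    tokens = []
    for chunk in text.lower().split():
        for piece in re.findall(r"[^\W\d_]+|[\W\d_]+", chunk):
            token = piece if piece.isalpha() else SPECIAL
            if token not in STOPWORDS:
                tokens.append(token)
    return tokens


def term_weights(docs, scheme="tfidf"):
    """Return (vocabulary, vectors) under the binary, tf or tfidf scheme."""
    tokenized = [tokenize(d) for d in docs]
    vocabulary = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        if scheme == "binary":
            vector = [1.0 if tf[t] else 0.0 for t in vocabulary]
        elif scheme == "tf":
            vector = [float(tf[t]) for t in vocabulary]
        elif scheme == "tfidf":
            vector = [tf[t] * math.log(n_docs / df[t]) for t in vocabulary]
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        vectors.append(vector)
    return vocabulary, vectors


docs = ["criminal law law", "civil law"]
print(term_weights(docs, "tf"))
# (['civil', 'criminal', 'law'], [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0]])
```

Under tf-idf the term "law" gets weight zero in both documents because it occurs in every document, which is the penalisation described above.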
33. The Experiments
- Data set of 2478 documents from web sites of interest, belonging to 11 classes (categories of law):
- - c0: Environmental law
- - c1: Administrative law
- - c2: Constitutional law
- - c3: Ecclesiastic law
- - c4: European law
- - c5: Computer law
- - c6: International law
- - c7: Labour law
- - c8: Criminal law
- - c9: Private law
- - c10: Taxation law
34. Train and Test of the Classifier
- Training
- - All 2478 examples have been used to train the classifier
- Test
- - Two experiments, calculating the train accuracy and the LOO (Leave-One-Out) accuracy (a test of the classifier's generalization capability)
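The difference between the two measures can be sketched with a generic evaluation loop. The 1-nearest-neighbour classifier and the five one-dimensional examples are stand-ins chosen to keep the example short; they are not the classifiers or data of the experiments.

```python
def train_accuracy(examples, train, predict):
    """Accuracy on the same examples the model was trained on."""
    model = train(examples)
    return sum(predict(model, x) == y for x, y in examples) / len(examples)


def leave_one_out_accuracy(examples, train, predict):
    """Train on all examples but one, test on the held-out one, and repeat."""
    correct = 0
    for i, (features, label) in enumerate(examples):
        model = train(examples[:i] + examples[i + 1:])
        if predict(model, features) == label:
            correct += 1
    return correct / len(examples)


# Stand-in classifier: 1-nearest neighbour on numeric feature vectors.
def train_1nn(examples):
    return list(examples)


def predict_1nn(model, features):
    nearest = min(model, key=lambda ex: sum((a - b) ** 2
                                            for a, b in zip(ex[0], features)))
    return nearest[1]


data = [((0.0,), "A"), ((0.1,), "A"), ((1.0,), "B"), ((1.1,), "B"), ((0.45,), "B")]
print(train_accuracy(data, train_1nn, predict_1nn))          # 1.0
print(leave_one_out_accuracy(data, train_1nn, predict_1nn))  # 0.8
```

The train accuracy is perfect because every example finds itself, while leave-one-out exposes the one example the model cannot predict from the others; this is why LOO is the better indicator of generalization.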
35. Train Accuracy
36. Test of the MSVM classifier using the LOO approach
LOO accuracy: 74.7%
38. User Access Modalities
- Advanced search: Metadata-Based Document Querying (MBDQ)
- Simple search: Keyword-Based (KBDQ) and Category-Based (CBDQ) Document Querying
- Both query modalities can be implemented using a Legal Vocabulary
- Semantics and Legal Vocabulary are exploited to enhance precision and recall in retrieval
39. Conclusions and Future Developments
- The Portal to Italian Legal Literature is the result of a federative architecture for integrating structured repositories and web documents in a single point of access
- A Dublin Core metadata approach has been used to give a uniform view of different data
- The federation system combines
- - the harvesting of structured data using OAI-PMH
- - the gathering and automatic qualification of web documents using a machine learning approach
- CLIR (cross-language information retrieval) facilities are expected to be studied, for developing user interface facilities based on multilingual ontological support
40. Announcement
- On the basis of the experiences of legislative standard projects (NIR, Metalex, Akoma Ntoso, LexDania, eLaw)
- Workshop on Legislative XML
- - Within the Jurix 2007 Conference (http://www.jurix2007.org)
- - December 15th, 2007, Leiden University, The Netherlands
- - Deadline for submission of position papers: 23rd November 2007
- Topics
- - Unique identification of (parts of) sources of law, URI and URN
- - Lowest grain size of identifiable elements in sources of law
- - What set of metadata should be part of an interchange format
- - Ontologies for legislation
41. Thank you for your attention!
peruginelli@ittig.cnr.it, francesconi@ittig.cnr.it