L'INFORMAZIONE GIURIDICA

1
A SEMANTIC MODELING APPROACH FOR THE RETRIEVAL OF
LEGISLATION AND DOCTRINE: AN ITALIAN EXPERIENCE
Ginevra Peruginelli, Enrico Francesconi
Institute of Legal Information Theory and Techniques
(ITTIG CNR), Florence, Italy
8th Int. Conf. Law Via The Internet, 26 October 2007
2
Contents

The Project of a Portal to Legal Literature
Phases
Standards for legal document identification
Software architecture of the Portal to Legal
Literature
Standard implementation
Metadata generation

3
LEGAL INFORMATION CONTENT

Legislation
Case law
Administrative acts
Legal literature

Links between these legal sources are strongly
needed
4
PROJECT FOR A PORTAL TO LEGAL LITERATURE
Developed as a feasibility study by the Institute
of Legal Information Theory and Techniques

Phase 1: Develop integrated tools for generating and
capturing metadata for structured and unstructured
web documents

Phase 2: Provide multilingual access to foreign legal
resources

Phase 3: Enhance the existing solutions oriented to
semantic searching, and implement specific facilities
to support legal users of the Portal in semantic querying

Phase 4: Link legal sources in a shared web environment,
with a resolution mechanism for legal doctrine
5
THE 1st PHASE: A FEDERATION SYSTEM

Data of the Portal can be divided into two
classes:
- structured data (repositories in libraries,
using specific metadata schemes)
- web documents (usually without any particular
metadata scheme), selected by a group of ITTIG
legal experts. They are used to
  - perform a selective exploration of the web
  - train the modules aimed at selecting and
    classifying web documents

The different data sources are retrieved by
exploiting a common metadata scheme (DC, the Dublin
Core Metadata Scheme) and harvested according to
the OAI-PMH protocol
6
THE 2nd PHASE: MULTILINGUAL ACCESS
Aim: to foster and facilitate worldwide
communication in the legal academic world, in the
legal professional sector, in the business world and
in public administration services to citizens
Obstacles:
1) Complexity and richness of each legal language
2) Differences between legal concepts inherent to
the diverse national legal systems
System-bound nature of legal terminology
Example:
- In England a mortgagee becomes a conditional
owner of the property mortgaged to him, but not
its possessor
- In Spain and in France the hypothécaire gains
neither ownership nor possession of the mortgaged
property unless he enforces the mortgage.
7
CROSS-LANGUAGE RETRIEVAL OF LEGAL INFORMATION

Querying and retrieving multi-language documents
involves problems of query translation
Especially in the legal domain, a word in the native
query language can be ambiguous
A word can have different translations in a
target language, each corresponding to a legal
category in the target legal system

8
QUERY EXAMPLE

Italian user query:
"Give me back all the documents related to dolo"

Ambiguous word: dolo
- fraud (private law)
- malice (criminal law)

Italian system: documents related to dolo
English system: documents related to fraud,
documents related to malice

Query contextualization is a key issue for
focused multi-language document retrieval.
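
The contextualization step above can be sketched as a lookup keyed by the term and the category of law. This is only an illustration: the lexicon structure and the function name are assumptions, and only the two translations of "dolo" come from the slide.

```python
# Hypothetical bilingual lexicon: (Italian term, category of law) -> English term.
# Only the two "dolo" entries come from the slide; the structure is assumed.
LEXICON = {
    ("dolo", "private law"): "fraud",
    ("dolo", "criminal law"): "malice",
}


def translate_query(term, category=None):
    """Return candidate English translations of an Italian legal term.

    Without a category every translation is returned (ambiguous query);
    with a category only the translation valid in that branch of law is kept.
    """
    candidates = {
        cat: target for (source, cat), target in LEXICON.items() if source == term
    }
    if category is None:
        return sorted(candidates.values())
    return [candidates[category]] if category in candidates else []


print(translate_query("dolo"))                  # ambiguous: both translations
print(translate_query("dolo", "criminal law"))  # contextualized: one translation
```

Without a category the query fans out to every candidate translation; with one, the English system is queried for a single term.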
9
THE 3rd PHASE: SEMANTIC QUERYING

Assigning high quality metadata (dc:subject =
category of law)
Adoption of DC metadata
Query contextualization according to a
category of law
Exploitation of a controlled legal vocabulary,
allowing query interpretation based on approximation

10
DC METADATA AND RETRIEVAL PRECISION
DC metadata (in particular dc:subject) mainly
enhance precision in retrieval
[Figure: a query for "Evidence" within "Criminal Procedure", matched against two documents that both have dc:description "Evidence" but dc:subject "Criminal Procedure" and "Civil Procedure" respectively]
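
A minimal sketch of how a dc:subject filter can raise precision, assuming two toy records built from the field values on this slide. The record ids and the search function are made up for illustration.

```python
# Two toy records; field values are taken from the slide, the ids are made up.
RECORDS = [
    {"id": "doc-1", "dc:description": "Evidence", "dc:subject": "Criminal Procedure"},
    {"id": "doc-2", "dc:description": "Evidence", "dc:subject": "Civil Procedure"},
]


def search(records, keyword, subject=None):
    """Match the keyword against dc:description, optionally filtering on dc:subject."""
    hits = [r for r in records if keyword.lower() in r["dc:description"].lower()]
    if subject is not None:
        hits = [r for r in hits if r["dc:subject"] == subject]
    return [r["id"] for r in hits]


print(search(RECORDS, "Evidence"))                        # keyword only
print(search(RECORDS, "Evidence", "Criminal Procedure"))  # keyword + category of law
```

The keyword alone returns both documents; adding the category of law discards the civil procedure one.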
11
LEGAL VOCABULARY AND RETRIEVAL RECALL
A controlled legal vocabulary mainly enhances
recall in retrieval
[Figure: a query for "Evidence" within "Criminal Procedure" returns NO DOC for a document with dc:description "witnesses" and dc:subject "Criminal Procedure"; related terms from a controlled vocabulary (Proof, Criminal Invest., Witness) allow the document to be matched]
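
A minimal sketch of query expansion with related terms, assuming a toy vocabulary built from the terms on this slide. The data structure, the substring matching and the expansion of "Criminal Invest." to "criminal investigation" are assumptions.

```python
# Related terms as listed on the slide ("Criminal Invest." is spelled out here
# as "criminal investigation", which is an assumption).
RELATED_TERMS = {"evidence": ["proof", "criminal investigation", "witness"]}

# Toy record with the field values shown on the slide; the id is made up.
RECORDS = [
    {"id": "doc-1", "dc:description": "witnesses", "dc:subject": "Criminal Procedure"},
]


def search(records, keyword, subject, vocabulary=None):
    """Search dc:description within a category of law, optionally expanding the keyword."""
    terms = [keyword.lower()]
    if vocabulary:
        terms += vocabulary.get(keyword.lower(), [])
    return [
        r["id"]
        for r in records
        if r["dc:subject"] == subject
        and any(term in r["dc:description"].lower() for term in terms)
    ]


print(search(RECORDS, "Evidence", "Criminal Procedure"))                 # no document
print(search(RECORDS, "Evidence", "Criminal Procedure", RELATED_TERMS))  # found via "witness"
```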
12
THE 4th PHASE: LINKING LEGAL SOURCES

URL: the subset of URIs that identify resources via a
representation of their primary access mechanism
(e.g., their network location). Not anymore.

URN: an unambiguous identifier, allowing
references to be expressed in a stable way,
independently of the document's physical location

Goal: a system of references based on a
resolution mechanism (RDS, Resolver Discovery
Service) able to retrieve the corresponding object
regardless of the resource's physical location.
13
THE URN-NIR STANDARD FOR LEGISLATION

The URN-NIR is defined as a combination of
elements, according to a specific grammar
Examples:
Decree of the Ministry of Finance of 20.12.99
urn:nir:ministero.finanze:decreto:1999-12-20
Communications and Health Inter-ministerial
Regulation, 9 September 1998
urn:nir:ministero.comunicazioniministero.salute:regolamento:1998-09-09

- Parser: available on-line, automatically
detects references within laws
- Resolution service: resolves URNs into URLs
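
A minimal sketch of parsing and resolving such identifiers. It assumes a simplified, colon-separated form urn:nir:<authority>:<measure>:<date>, which covers only the single-authority example above and not the full URN-NIR grammar. The resolution table and its URL are placeholders, not the real resolution service.

```python
from datetime import date


def parse_urn_nir(urn):
    """Split a simplified URN-NIR of the form urn:nir:<authority>:<measure>:<date>."""
    parts = urn.split(":")
    if len(parts) != 5 or parts[0] != "urn" or parts[1] != "nir":
        raise ValueError(f"not a simplified URN-NIR: {urn!r}")
    authority, measure, issued = parts[2:]
    return {
        "authority": authority,
        "measure": measure,
        "date": date.fromisoformat(issued),
    }


# Hypothetical resolution table: the URL is a placeholder for illustration only.
RESOLUTION_TABLE = {
    "urn:nir:ministero.finanze:decreto:1999-12-20":
        "http://example.org/acts/decreto-1999-12-20.html",
}


def resolve(urn):
    """Return the URL currently registered for a URN, or None if it is unknown."""
    parse_urn_nir(urn)  # reject malformed identifiers before the lookup
    return RESOLUTION_TABLE.get(urn)


print(parse_urn_nir("urn:nir:ministero.finanze:decreto:1999-12-20"))
print(resolve("urn:nir:ministero.finanze:decreto:1999-12-20"))
```

Because references carry only the URN, moving a document means updating one entry in the resolution table rather than every citing document.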
14
DEVELOPMENT OF A STANDARD FOR LEGAL LITERATURE

Many identifiers are available (ISBN, ISSN, BICI,
SICI, DOI)
But an intelligent, unique, self-explanatory standard
for bibliographic materials has not been
developed so far.

Approach
- Agreement on a common document identification
technique and a standard data scheme
- A cooperative layer for the availability of
common services and shared information

15
GRAMMAR OF THE STANDARD FOR IDENTIFYING LEGAL
LITERATURE
The grammar of URN-Legal Doctrine reflects the
features of legal doctrine material
Example: Sorge, C. and Bergfelder, M. (2004),
"Signatures by Electronic Agents: A Legal
Perspective", in Cevenini, C. (editor), The Law
and Electronic Agents: Proceedings of the LEA
workshop, Bologna, June 2004, pp. 141-153
urnlldsorgebergfeldersignatures.by.electronic.agents.a.legal.perspective200406the.law.and.electronic.agents.proceeding.of141_at_inproceedings
The URN can be created automatically from the
bibliographic citation and be easily adopted by
all providers of legal doctrine.
16
SERVICES OF OPENURL RESOLVER

Search authors in other databases
Check how many citations
Look for authors' email

Which library owns the printed review?
Appropriate full text
If full text is not available, document delivery
service?
17
WHAT IS NEEDED

From the users' point of view there is a need
for:
- Knowledge-based tools for supporting semantic
searching
- Persistent identifiers for reliable access to
digital objects in a context-sensitive way
- Navigation across legislation, case law and
literature materials

18
FEDERATED DATA

Data of the prototype can be divided into two
classes
Structured data
Web documents

19
Metadata Approach

Different data sources are described using a common
metadata scheme (DC metadata, XML encoded, as
target bridging format)
- Integration and uniform view of different data
sources
- Implementation of search facilities based on
those standards
Metadata for structured resources
- Proprietary metadata are mapped to
Dublin Core simple and
DCMI Cite (journal articles)
Metadata for web documents
- A specific module generates a subset of DC
metadata, without requiring data providers to
adopt a predefined format
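
A minimal sketch of mapping a proprietary record to XML-encoded Dublin Core simple elements. The proprietary field names, the mapping table and the wrapping "record" element are assumptions for illustration; the namespace URI is that of the Dublin Core element set 1.1.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical proprietary field names mapped to Dublin Core simple elements.
FIELD_MAP = {"titolo": "title", "autore": "creator", "materia": "subject", "anno": "date"}


def to_dc_xml(record):
    """Serialize the mapped fields of a proprietary record as DC elements."""
    root = ET.Element("record")
    for field, value in record.items():
        dc_name = FIELD_MAP.get(field)
        if dc_name is None:
            continue  # unmapped proprietary fields are dropped
        ET.SubElement(root, f"{{{DC_NS}}}{dc_name}").text = str(value)
    return ET.tostring(root, encoding="unicode")


catalog_record = {
    "titolo": "Signatures by Electronic Agents",
    "autore": "Sorge, C.",
    "anno": 2004,
    "collocazione": "A-12",  # shelf mark: no DC counterpart in this mapping
}
print(to_dc_xml(catalog_record))
```

Each data provider would need its own mapping table, while the indexer only ever sees the common DC view.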

20
The architecture of the federation system
[Diagram. Data Providers: structured data repositories (publisher catalog, library catalog, DOGI) and WWW documents. Service Provider: a harvester selects metadata records from the structured data providers (OAI-PMH) with DC mapping; a focused crawler selects documents of interest (HTML pages) from web sites and an automatic metadata generator describes them. Metadata (DC-Qualified, DC-XML) are passed to the Indexer, which feeds the Index used by the Portal.]
21
Structured Resource Management: the Open
Archives Initiative (OAI)

Interoperability among structured resources
through metadata exchange
OAI Protocol for Metadata Harvesting (OAI-PMH)
- Data Providers are repositories that expose
structured metadata through OAI-PMH
- The Service Provider uses OAI-PMH to harvest
metadata
- OAI-PMH is a set of six verbs that are invoked
within HTTP

23
OAI-PMH verbs

Identify: http://localhost:8080/OaiDataProvider/servlet?verb=Identify
ListMetadataFormats: http://localhost:8080/OaiDataProvider/servlet?verb=ListMetadataFormats&metadataPrefix=oai_dc
ListSets: http://localhost:8080/OaiDataProvider/servlet?verb=ListSets
ListIdentifiers: http://localhost:8080/OaiDataProvider/servlet?verb=ListIdentifiers&metadataPrefix=oai_dc
ListRecords: http://localhost:8080/OaiDataProvider/servlet?verb=ListRecords&metadataPrefix=oai_dc
GetRecord: http://localhost:8080/OaiDataProvider/servlet?verb=GetRecord&identifier=oai:dogi.ittig.cnr.it/1999Z0529&metadataPrefix=oai_dc
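
A minimal sketch of how a service provider might build these request URLs. The base URL is the local servlet address shown on the slide; the helper name and the required-argument table are assumptions, and resumption tokens and optional arguments are ignored. The record identifier is written with the "oai:" prefix, which is assumed from the usual OAI identifier form.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8080/OaiDataProvider/servlet"

# Arguments each of the six verbs needs in a first request
# (resumptionToken and optional arguments are left out of this sketch).
REQUIRED_ARGS = {
    "Identify": set(),
    "ListMetadataFormats": set(),
    "ListSets": set(),
    "ListIdentifiers": {"metadataPrefix"},
    "ListRecords": {"metadataPrefix"},
    "GetRecord": {"identifier", "metadataPrefix"},
}


def oai_request(verb, **args):
    """Build the HTTP GET URL for an OAI-PMH verb, checking required arguments."""
    if verb not in REQUIRED_ARGS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    missing = REQUIRED_ARGS[verb] - set(args)
    if missing:
        raise ValueError(f"{verb} is missing arguments: {sorted(missing)}")
    return f"{BASE_URL}?{urlencode({'verb': verb, **args})}"


print(oai_request("ListRecords", metadataPrefix="oai_dc"))
print(oai_request("GetRecord",
                  identifier="oai:dogi.ittig.cnr.it/1999Z0529",
                  metadataPrefix="oai_dc"))
```

urlencode percent-encodes the ":" and "/" of the identifier, which is the safe form for a query string.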

24
Selection and Harvesting of Structured Resources
Metadata

Selection
- Bibliographic repositories in libraries
- UDC (Universal Decimal Classification)
- DDC (Dewey Decimal Classification)
- 34X notations plus a subset of additional
classes
Harvesting
- Structured repositories are adapted as OAI Data
Providers
- OAI Protocol for Metadata Harvesting (OAI-PMH)

25
The architecture of the federation system
[Architecture diagram, as on slide 20]
26
A Focused Crawler to Select Legal Web Documents

The exploration of the web starts from a subset
of sites of interest, selected by a group of
legal experts
The crawler then follows a policy of pursuing
the hyperlinks with a high probability
of leading to documents of interest
This probability is obtained by applying an automatic
classifier to a set of words in the vicinity of
the hyperlink.
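
The crawling policy can be sketched as a best-first search over scored hyperlinks. Everything below is a toy stand-in: the in-memory "web", the word list and the scoring function replace the real pages and the automatic classifier, whose details are not given in the slides.

```python
import heapq

# Toy in-memory "web": page -> list of (link target, words near the hyperlink).
WEB = {
    "seed": [("a", "journal of criminal law"), ("b", "summer travel offers")],
    "a": [("c", "legal doctrine and court decisions")],
    "b": [("d", "hotels and flights")],
    "c": [],
    "d": [],
}

# Stand-in for the classifier: share of context words found in a legal word list.
LEGAL_WORDS = {"law", "legal", "court", "doctrine", "criminal", "journal"}


def link_score(context):
    words = context.lower().split()
    return sum(w in LEGAL_WORDS for w in words) / len(words) if words else 0.0


def focused_crawl(seeds, threshold=0.3, max_pages=10):
    """Visit pages best-first, following only links scored at or above the threshold."""
    frontier = [(-1.0, seed) for seed in seeds]  # negated scores: heapq is a min-heap
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        for target, context in WEB.get(page, []):
            score = link_score(context)
            if target not in seen and score >= threshold:
                seen.add(target)
                heapq.heappush(frontier, (-score, target))
    return visited


print(focused_crawl(["seed"]))
```

Pages reachable only through low-scoring links (here "b" and "d") are never fetched, which is what keeps the exploration selective.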

27
Metadata for Web Documents

Implementing metadata for web documents is a key
issue in improving search facilities
Different approaches have been proposed in the
literature:
- Reliable mapping of metadata originally included
in web documents, combined with the collaboration
of the authors
- Automatic metadata generation on the basis of
keywords supplied by the authors

28
Automatic DC Metadata Generation for Web
Documents

Our approach:
- automatic generation of a DC metadata subset
- it supports the intellectual activity of
organizing web documents
Main criteria used for DC metadata generation:
- properties of the documents (URL as
dc:identifier)
- mapping of particular tags (the <title> HTML tag
for dc:title)
- HTML meta tags (such as meta description for
dc:description)
Particular attention is paid to document classification
(dc:subject). We have tested two machine learning
approaches for document classification:
- Naive Bayes
- Multiclass Support Vector Machine (Vapnik,
Crammer and Singer)
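
The three generation criteria can be sketched with the standard library HTML parser. The class and function names and the sample page are assumptions; dc:subject is left out because it comes from the classifier rather than from the markup.

```python
from html.parser import HTMLParser


class DCExtractor(HTMLParser):
    """Collect dc:title from <title> and dc:description from the description meta tag."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.metadata["dc:description"] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["dc:title"] = self.metadata.get("dc:title", "") + data


def generate_dc(url, html):
    """Build a DC metadata subset from a page: the URL becomes dc:identifier."""
    parser = DCExtractor()
    parser.feed(html)
    parser.close()
    metadata = {"dc:identifier": url}
    metadata.update(parser.metadata)
    if "dc:title" in metadata:
        metadata["dc:title"] = metadata["dc:title"].strip()
    return metadata


page = (
    "<html><head><title>Criminal law notes</title>"
    '<meta name="description" content="Notes on evidence">'
    "</head><body></body></html>"
)
print(generate_dc("http://example.org/doc.html", page))
```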

29
Naive Bayes Classifier
Web document dj = (w0j, ..., wkj, ...)
Bag of words: wkj is a function of the k-th word
frequency in document dj
Classes: c0, c1, ..., cn-1
Class ranking: P(c0 | dj), P(c1 | dj), ..., P(cn-1 | dj)
Bayes theorem:
$P(c_i \mid d_j) = \dfrac{P(d_j \mid c_i)\, P(c_i)}{P(d_j)}$
Naive hypothesis: words in a document
occur independently of each other given the
class, so that
$P(d_j \mid c_i) = \prod_k P(w_{kj} \mid c_i)$
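
A minimal sketch of a Naive Bayes class ranking over bags of words. It assumes a multinomial word model with Laplace smoothing, which the slides do not specify, and the training documents are made up. Scores are computed as log P(c) plus the sum of log P(w | c), so the ranking matches the posterior without computing P(dj).

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents):
    """documents: list of (list of words, class label) pairs."""
    class_counts = Counter(label for _, label in documents)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in documents:
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def rank_classes(model, words):
    """Return the classes sorted by decreasing log P(c) + sum of log P(w | c)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for w in words:
            if w not in vocabulary:
                continue  # words never seen in training carry no information
            # Laplace smoothing keeps unseen (word, class) pairs from zeroing the product
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return sorted(scores, key=scores.get, reverse=True)


# Made-up training documents for two of the categories of law.
TRAINING = [
    (["evidence", "trial", "witness"], "criminal law"),
    (["offence", "trial"], "criminal law"),
    (["income", "tax", "return"], "taxation law"),
    (["tax", "deduction"], "taxation law"),
]
model = train_naive_bayes(TRAINING)
print(rank_classes(model, ["tax", "income"]))
print(rank_classes(model, ["witness", "trial"]))
```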
30
Support Vector Machine (SVM)

Given a training set of positive (S+) and
negative (S-) examples, SVM determines the
surface si which separates S+ from S- with the
widest margin
[Figure: linear case, with the examples of S+ and S- on either side of si]
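
To make the "widest margin" criterion concrete, the sketch below computes the margin of two hand-picked separating lines on made-up 2-D points and keeps the wider one. It is not an SVM trainer: a real SVM solves an optimization problem over all separating surfaces, whereas here only two candidates are compared.

```python
import math


def margin(w, b, examples):
    """Smallest signed distance y * (w.x + b) / ||w|| over (x, y) examples, y in {+1, -1}.

    The value is positive only if the line w.x + b = 0 separates the two sets.
    """
    norm = math.hypot(*w)
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in examples
    )


# Made-up linearly separable examples: +1 for S+, -1 for S-.
EXAMPLES = [
    ((2.0, 2.0), +1), ((3.0, 3.0), +1),
    ((0.0, 0.0), -1), ((1.0, 0.0), -1),
]

# Two candidate separating lines, each given as (w, b).
CANDIDATES = {
    "x1 + x2 = 2.5": ((1.0, 1.0), -2.5),
    "x2 = 1": ((0.0, 1.0), -1.0),
}

best = max(CANDIDATES, key=lambda name: margin(*CANDIDATES[name], EXAMPLES))
for name, (w, b) in CANDIDATES.items():
    print(name, round(margin(w, b, EXAMPLES), 4))
print("widest margin:", best)
```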
31
Multiclass Support Vector Machine (MSVM)
32
Document Representation

A document is represented by a vector of term
weights dj = (w1, ..., wT). Three different
types of weights have been tested:
- Binary weights (presence/absence)
- Term frequency weight (tf)
- TF-IDF weight (tfidf, which penalizes terms
occurring in many different documents, as they are
less discriminative)
Pre-processing to increase the statistical quality
of terms:
- Stemming (reduction of terms to their
morphological root)
- Stopwords (deletion of very frequent terms)
- Digits and non-alphanumeric characters
represented by a unique special character
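
A minimal sketch of the three weighting schemes and the pre-processing steps. The stopword list, the crude plural-stripping rule used in place of a real stemmer, the "#" special character and the tf * log(N / df) form of TF-IDF are all assumptions; the slides do not say which variants were used.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "of", "a", "and", "in", "to"}  # tiny illustrative list


def preprocess(text):
    """Lowercase, drop stopwords, crudely stem, and map non-letter runs to '#'."""
    tokens = []
    for tok in re.findall(r"[a-z]+|[^a-z\s]+", text.lower()):
        if not "a" <= tok[0] <= "z":
            tokens.append("#")  # digits and other characters share one symbol
        elif tok not in STOPWORDS:
            # crude stand-in for stemming: strip a final "s" from longer words
            tokens.append(tok[:-1] if tok.endswith("s") and len(tok) > 3 else tok)
    return tokens


def weight_vectors(texts, scheme="tfidf"):
    """Return (vocabulary, vectors) using binary, tf or tfidf weights."""
    docs = [Counter(preprocess(t)) for t in texts]
    vocabulary = sorted(set().union(*docs))
    n_docs = len(docs)
    df = {term: sum(term in d for d in docs) for term in vocabulary}
    vectors = []
    for d in docs:
        if scheme == "binary":
            vec = [1.0 if d[t] else 0.0 for t in vocabulary]
        elif scheme == "tf":
            vec = [float(d[t]) for t in vocabulary]
        else:  # tfidf: terms present in every document get weight 0
            vec = [d[t] * math.log(n_docs / df[t]) for t in vocabulary]
        vectors.append(vec)
    return vocabulary, vectors


texts = ["The law of evidence", "Tax law in 2007"]
vocabulary, vectors = weight_vectors(texts)
print(vocabulary)
print(vectors)
```

In this toy corpus "law" occurs in both documents, so its TF-IDF weight is zero, while its binary and tf weights are not.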

33
The Experiments

Data set of 2478 documents from web sites of
interest, belonging to 11 classes (categories of
law):
c0  Environmental law      c6   International law
c1  Administrative law     c7   Labour law
c2  Constitutional law     c8   Criminal law
c3  Ecclesiastical law     c9   Private law
c4  European law           c10  Taxation law
c5  Computer law

34
Train and Test of the Classifier

Training
All 2478 examples have been used to train the
classifier
Test
Two experiments, to calculate:
- Train accuracy
- LOO (Leave-One-Out) accuracy
(test of the classifier's generalization
capability)
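
The two measures can be sketched as follows. A 1-nearest-neighbour classifier on made-up one-dimensional points stands in for the MSVM, only to keep the example short; the point is the evaluation procedure, not the classifier.

```python
def train_1nn(examples):
    """A 1-nearest-neighbour 'model' is simply the stored training examples."""
    return list(examples)


def predict_1nn(model, x):
    nearest = min(model, key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)))
    return nearest[1]


def train_accuracy(examples, train, predict):
    """Accuracy on the same examples the classifier was trained on."""
    model = train(examples)
    return sum(predict(model, x) == y for x, y in examples) / len(examples)


def leave_one_out_accuracy(examples, train, predict):
    """Train on all examples but one, test on the held-out one, and repeat for each."""
    correct = 0
    for i, (x, y) in enumerate(examples):
        model = train(examples[:i] + examples[i + 1:])
        correct += predict(model, x) == y
    return correct / len(examples)


# Made-up labelled points; the last one lies closer to class "a" than to its own class.
EXAMPLES = [((0.0,), "a"), ((0.2,), "a"), ((1.0,), "b"), ((1.2,), "b"), ((0.55,), "b")]

print(train_accuracy(EXAMPLES, train_1nn, predict_1nn))
print(leave_one_out_accuracy(EXAMPLES, train_1nn, predict_1nn))
```

Train accuracy is optimistic because every example has been seen during training; LOO accuracy estimates performance on unseen documents at the cost of one training run per example.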

35
Train Accuracy
36
Test of the MSVM classifier using the LOO approach
LOO accuracy: 74.7%
37
The architecture of the federation system
[Architecture diagram, as on slide 20]
38
User Access Modalities

Advanced search: Metadata-Based Document Querying
(MBDQ)
Simple search: Keyword-Based (KBDQ) and Category-Based
(CBDQ) Document Querying

Both query modalities can be implemented using a
Legal Vocabulary
Semantics and the Legal Vocabulary are exploited to
enhance precision and recall in retrieval

39
Conclusions and Future Developments

The Portal to Italian Legal Literature is the
result of a federative architecture for
integrating structured repositories and web
documents in a single point of access
A Dublin Core metadata approach has been used to
give a uniform view of different data
The federation system combines
- the harvesting of structured data using OAI-PMH
- the gathering and automatic qualification of web
documents using a machine learning approach
CLIR (cross-language information retrieval) facilities
are expected to be studied, in order to develop user
interface facilities based on multilingual ontological
support.

40
Announcement

On the basis of the experiences of legislative
standard projects (NIR, MetaLex, Akoma Ntoso,
LexDania, eLaw):
Workshop on Legislative XML
Within the Jurix 2007 Conference (http://www.jurix2007.org)
December 15th, 2007, Leiden University, The
Netherlands
Deadline for submission of position papers: 23rd
November 2007
Topics:
- Unique identification of (parts of) sources of
law, URI and URN
- Lowest grain size of identifiable elements in
sources of law
- What set of metadata should be part of an
interchange format
- Ontologies for legislation