Improving scalability in ILP incremental systems - PowerPoint PPT Presentation

1 / 21

About This Presentation

Title:

Improving scalability in ILP incremental systems

Description:

BerkeleyDB is an Open Source project under the supervision of Sleepycat. ... All these operators call C language API functions of Berkeley DB ... – PowerPoint PPT presentation

Number of Views:55

Avg rating:3.0/5.0

Slides: 22

Provided by: geni

Category:

more less

Transcript and Presenter's Notes

Title: Improving scalability in ILP incremental systems

1
Improving scalabilityin ILP incremental systems

Marenglen Biba, Teresa Maria Altomare Basile,
Stefano Ferilli, Floriana Esposito

2
Inductive Logic Programming Why do that?

Can learn sets of rules if-then that contain
variables called First-Order Horn clauses.
Ancestor(x,y) ? Parent(x,y)
Ancestor(x,y) ? Parent(x,z) ? Ancestor(z,y)
thus
Complex structural descriptions can be learned
such as
Molecular 2D and 3D structure
Document structure as a set of logic components
Protein fold description
and
Any other set of descriptions that need to model
complex relations among objects

3
ILP Formal setting (Muggleton 1994)

Given set of examples E E ? E-
Background knowledge B
Learn a logic theory T such as
T ? B - E Completeness
T ?B ? E- Consistency
Under the conditions
B E Necessity
B ? T ? E- Strong Consistency
B ?T Weak Consistency
B ? T - E Sufficiency
Neither Sufficiency nor Strong Consistency are
required for systems that deal with noise.
How to check Completeness and Consistency
Theorem prover

4
Example Describing protein folds
A positive and a negative example of the protein
fold 4-helical-up-and-down-bundle.

fold(Four-helical up-and-down bundle,P) -
helix(P,H1),
length(H1,hi),
position(P,H1,Pos),
interval(1ltPoslt3),
adjacent(P,H1,H2),
helix(P,H2).
The protein P has fold class Four-helical
up-and-down bundle if it contains a long helix
H1 at a secondary structure position between 1
and 3, and H1 is followed by a second helix H2.

3-D arrangement of secondary structure
units shown for -helices (cylinders) and -sheets
(arrows).
5
Learning logic programs proving theorems

T, B, and E are each logic programs.
A logic program is a set of definite clauses each
having the form h ? b1,..,bn where h is an atom
and b1,..,bn are atoms. E and E- consist of
ground clauses.
Procedure Tuning (E example T theory B
background knowledge M historical memory)
If E is a positive example AND ? covers(T,E)
then
Generalize(T,E,B,M-)
else
if E is a negative example AND covers(T,E) then
Specialize(T,E,B,M)
end if
end if
M M U E.
What we need is nothing more than a Prolog
interpreter as theorem prover

6
Learning logic programs doing it incrementally

How to learn? Batch or Incremental?
Batch Learning starts always from an empty
theory and the refinement of the theory continues
until all the examples are explained. Whenever a
problematic example appears, everything starts
again from the beginning without using the
previously generated theory. Since everything
must be learned again the computational effort is
relevant.
Incremental Why learn from scratch? What if we
use the theory learned till now and try to refine
it? We could save computational resources
learning in an incremental fashion.
INTHELEX (INcremental THEory Learner from
EXamples) is an ILP system that adopts
incremental refinement of the theories.
INTHELEX adopts multistrategic learning, can
start from scratch, learns simultanously various
concepts. Its a close loop learning system.

7
ILP applications What has successfully been done

Relational Data Mining and Knowledge Discovery
for
Drug Design/Toxicology,
SAR (Structure Activity Relationship),
Protein Structure Prediction,
Functional Genomics,
Systems Biology
Learning rules for early diagnosis of rheumatic
diseases
Learning qualitative models of dynamic systems
Applications in finite mesh design.
Applications in Natural Language Processing
Document Classification and Understanding

8
ILP future applications one of the challenges

Computational complexity remains a major
challenge for practical applications of ILP, for
example
In Systems Biology the scientific knowledge
discovery is made possible by the potentiality of
ILP, but millions of records that contain 3-D
molecule descriptions must be elaborated leading
therefore to extremely expensive computional
processes.
The huge space of hypotheses to search makes this
tasks NP-Hard and thus the need for scalable ILP
system that are able to process in reasonable
limits of time great quantities of data.

9
ILP scalability issue what has been done

Subsampling methods focus on a smaller set of
examples for processing very large datasets
(Srinivasan 99) .
Local coverage test within a learning from
interpretations setting where independence
assumptions among examples are reasonable and
coverage of each example can be tested without
considering the entire background knowledge. (De
Raedt 99)
Ad-hoc theta-subsumption techniques to speed-up
the coverage test have been also investigated (Di
Mauro 2003)
However the NP-Hard nature of problems with a
huge space of hypotheses to search motivates for
further efforts in implementing scalable ILP
system that are able to process very large
datasets.
How large large datasets are? Do we need
scalability?
Immagine millions of 3-D sequences of proteins
folds to be learned

10
ILP systems their drawbacks

Almost all the ILP systems are internal memory
systems in that they imply Prolog interpreters
for proving theorems using a main memory.
Unfortunately these interpreters are not
efficient in that they load all the examples in
main memory and whenever the theory must be
refined, they try to prove goals scanning the
entire main module where thousands of examples
may be present.
Prolog interpreters do not maintain any
information useful during the theorem proving
process. What does it mean? It means no clause
knows what examples it covers and whenever it is
refined an overall check of the entire memory
must be done.
What if every clause knows what examples to check
when a cover test should be done?
Prolog interpreters do not maintain any
information useful for understanding better the
learning process and maybe try other directions
in the search of hypotheses.
What if we maintain the various versions of the
clauses with relative statistics and use these
whenever the search process must be redirected?

11
Improving scalability First Step

Basic idea Eliminate main memory handling of
terms. Instead of loading them in main memory,
use an external storage that can provide fast
access and can
eliminate the sequential nature of the Prolog
module.
Make it faster, what to choose? Integrate
commercial RDBMS or implement a tailored solution
for external terms handling?
BerkeleyDB is an Open Source project under the
supervision of Sleepycat. It is a database
library that provides a powerfull API for data
management. It is an embedded data engine, in
that the libraries are executed in the space of
addresses of the hosting application, so no
communication between processes is required as an
external RDBMS would do (MySql or SQL Server).
The API are available in various languages C,
C, Java, Python, Prolog etc.
The underlying structures are hash structures,
btrees, record-number-based storage and queues.
Portable 16,32,64-bit machines, Windows, Unix,
Linux and any other OS. The abstraction layer is
about 2500 lines of C code, thus any OS is
possible.
No query language, only function-calls for data
retrieval thus no parsing of queries.
Indexing, Page cache management, Transactions and
logging, Locking and Concurrent Access Methods
are provided
Adopted by HP, SUN, Google, Motorola, Amazon.com
for their embedded systems. (The library is only
300KB but can handle 256 Terabytes of data)

12
Improving scalability First step

The interface we exploit uses the Concurrent
Access Methods product of Berkeley DB. This means
that multiple processes can open the same
database, but transactions and disaster recovery
are not supported. The environment and the
database files are ordinary Berkeley DB entities
that use a custom hash function.
We developed a tailored embedded solution using
the Sicstus Prolog interface to the Berkeley DB
toolset for supporting persistent storage of
Prolog terms. The idea is to obtain a behavior
similar to the built-in Prolog predicates such as
assert/1, retract/1 and clause/2.
DB-Specification
speclist spec1, . . . , specM
spec functor(argspec1, . . . , argspecN)
argspec -
where functor is a Prolog atom. A spec
F(argspec1, . . . ,argspecN) is applicable to any
ground term with principal functor F/N.

13
Improving scalability First step

The functors and the indexing specifications of
the terms to be stored have to
be given when the database is created.
The indexing is specified when the database is
created. It is possible to index
on other parts of the term than just the functor
and first argument.
Changes affect the database immediately.
The database can store variables (Other DBMSs do
not).
Example Storing and fetching examples in Prolog
db-spec examples(,-,-,-,-). Indexed on the
argument signed with
db_create(examples(,-,-,-,-), MyDb)
db_store(MyDb,examples(KeyExample,Concect,Head,Bo
dy,Constraints,Symbolic,Numeric)
db_fetch(MyDb,examples(KeyExample,_,_,_,_,_)
db_iterator(MyDb, examples(KeyExample,_,_,_,_,_)
db_iterator_next(My_Db) db_open(My_Db)
db_close(My_Db)
All these operators call C language API
functions of Berkeley DB

14
Improving scalability Second step

Basic idea having eliminated main memory of
Prolog we can link clauses to examples.
Implement a relational schema that keeps links
between clauses and examples they cover.
What is this useful for? When a clause is
specialized only the examples previously covered
by the clause should be checked. Thus picking up
all the examples and all the coverage tests that
in a Prolog fashion correspond to proving all
these goals, are replaced by an external fast
check of only the examples related to the clause.
When a clause is generalized it is useless to
check all the past positive examples. The
positive examples previously covered will
continue to be covered since the clause has been
generalized. Through our schema only negative
examples of the same concept (clauses and
examples indexed also on their concept) are
checked saving unnecessary checks that Prolog
interpreter would do by checking again the entire
main memory.
The expensive cover test is done only the first
time an example enters the database. Sucessively
it is sufficient to check a hash table with the
keys of the clauses and examples.

15
Improving scalability Is it all?

What if we maintain versions of the theory step
after step? At a certain moment of the search in
the hypotheses space, if we would want to
backtrack, how can we do it if we dont know what
has happened to the theory (the Prolog case) and
how this has evolved in time?
What if we could know which example caused a
certain refinement? In a Prolog setting this is
not possible. Instead with our solution we can
adopt partial memory techniques discriminating
the examples on the basis of what kind of
modifications they have been able to do to the
theory during the learning process.
To understand better the learning process we can
maintain statistics about the learning task, such
as number of refinements caused by an example,
number of refinement of a clause, times of
refinement and other precious information that in
a Prolog setting are not possible.

16
Relational schema Link it
17
Experiments 1
Document understanding domain learn theories to
classify documents and recognize the logical
components in them such as absctract, title, page
number, affiliation. Experiments between two
scalable versions of INTHELEX developed.
18
Experiments 2
Document understanding domain learn theories to
classify documents and recognize the logical
components in them such as absctract, title, page
number, affiliation. Experiments between
initial version and the final version of INTHELEX
19
Results interpretation

External storage of terms and fast-access through
function-call techniques resulted in important
savings in times of elaboration.
Linking clauses and examples improved
significanlty the learning process in terms of
efficiency.

20
Future work

Investigate the use of the information preserved
in the database such as the versions of the
theory for backtracking strategies.
Investigate the statistics about the examples for
implementing partial memory learning methods.