1
Improving scalability in ILP incremental systems
  • Marenglen Biba, Teresa Maria Altomare Basile,
    Stefano Ferilli, Floriana Esposito

2
Inductive Logic Programming: Why do that?
  • Can learn sets of if-then rules that contain
    variables, called first-order Horn clauses
    (see the Prolog sketch after this list):
  • Ancestor(x,y) ← Parent(x,y)
  • Ancestor(x,y) ← Parent(x,z) ∧ Ancestor(z,y)
  • Thus complex structural descriptions can be
    learned, such as
  • Molecular 2D and 3D structure
  • Document structure as a set of logic components
  • Protein fold description
  • and any other set of descriptions that needs to
    model complex relations among objects.
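  The two Ancestor rules above, written as a runnable
  Prolog program (a minimal sketch; the parent/2 facts
  are illustrative, not from the slides):

    % First-order Horn clauses for the ancestor relation.
    ancestor(X, Y) :- parent(X, Y).
    ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).

    % Illustrative background facts.
    parent(ann, bob).
    parent(bob, carl).

    % ?- ancestor(ann, carl).
    % yes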

3
ILP: Formal setting (Muggleton 1994)
  • Given: a set of examples E = E⁺ ∪ E⁻ and
    background knowledge B
  • Learn a logic theory T such that
  • T ∧ B ⊨ E⁺   (Completeness)
  • T ∧ B ⊭ E⁻   (Consistency)
  • Under the conditions
  • B ⊭ E⁺   (Necessity)
  • B ∧ T ∧ E⁻ ⊭ □   (Strong Consistency)
  • B ∧ T ⊭ □   (Weak Consistency)
  • B ∧ T ⊨ E⁺   (Sufficiency)
  • Neither Sufficiency nor Strong Consistency is
    required for systems that deal with noise.
  • How to check Completeness and Consistency?
    With a theorem prover.

4
Example: Describing protein folds
A positive and a negative example of the protein
fold "Four-helical up-and-down bundle".
  • fold('Four-helical up-and-down bundle',P) :-
        helix(P,H1),
        length(H1,high),
        position(P,H1,Pos),
        interval(1 < Pos < 3),
        adjacent(P,H1,H2),
        helix(P,H2).
  • The protein P has fold class "Four-helical
    up-and-down bundle" if it contains a long helix
    H1 at a secondary structure position between 1
    and 3, and H1 is followed by a second helix H2.

3-D arrangement of secondary structure
units shown for α-helices (cylinders) and β-sheets
(arrows).
5
Learning logic programs: proving theorems
  • T, B, and E are each logic programs.
  • A logic program is a set of definite clauses, each
    having the form h ← b1,..,bn where h is an atom
    and b1,..,bn are atoms. E⁺ and E⁻ consist of
    ground clauses.
  • Procedure Tuning(E: example, T: theory, B:
    background knowledge, M: historical memory)
  • if E is a positive example AND ¬covers(T,E)
    then
  •   Generalize(T,E,B,M⁻)
  • else
  •   if E is a negative example AND covers(T,E) then
  •     Specialize(T,E,B,M⁺)
  •   end if
  • end if
  • M := M ∪ {E}
  • What we need is nothing more than a Prolog
    interpreter as theorem prover (a minimal sketch
    of the cover test follows).
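  A minimal sketch of the cover test covers(T,E) as a
  plain Prolog meta-interpreter (assumptions: the theory
  is a list of (Head :- Body) terms, the example is a
  ground goal, and background knowledge consists of
  ordinary Prolog predicates; INTHELEX's actual encoding
  may differ):

    :- use_module(library(lists)).   % member/2

    % covers(+Theory, +Example): Theory plus background proves Example.
    covers(Theory, Example) :-
        prove(Theory, Example).

    prove(_, true) :- !.
    prove(Theory, (A, B)) :- !,
        prove(Theory, A),
        prove(Theory, B).
    prove(Theory, Goal) :-               % resolve against a theory clause
        member((Goal :- Body), Theory),
        prove(Theory, Body).
    prove(_, Goal) :-                    % fall back on background knowledge
        catch(call(Goal), _, fail).

    % e.g. ?- covers([(ancestor(X,Y) :- parent(X,Y))], ancestor(ann,bob)).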

6
Learning logic programs: doing it incrementally
  • How to learn? Batch or incremental?
  • Batch: learning always starts from an empty
    theory, and refinement of the theory continues
    until all the examples are explained. Whenever a
    problematic example appears, everything starts
    again from the beginning without using the
    previously generated theory. Since everything
    must be learned again, the computational effort
    is considerable.
  • Incremental: why learn from scratch? What if we
    use the theory learned so far and try to refine
    it? We could save computational resources by
    learning in an incremental fashion.
  • INTHELEX (INcremental THEory Learner from
    EXamples) is an ILP system that adopts
    incremental refinement of theories.
  • INTHELEX adopts multistrategy learning, can
    start from scratch, and learns various concepts
    simultaneously. It is a closed-loop learning
    system.

7
ILP applications: what has successfully been done
  • Relational Data Mining and Knowledge Discovery
    for
  • Drug Design/Toxicology,
  • SAR (Structure-Activity Relationship),
  • Protein Structure Prediction,
  • Functional Genomics,
  • Systems Biology
  • Learning rules for early diagnosis of rheumatic
    diseases
  • Learning qualitative models of dynamic systems
  • Applications in finite element mesh design
  • Applications in Natural Language Processing
  • Document Classification and Understanding

8
ILP future applications: one of the challenges
  • Computational complexity remains a major
    challenge for practical applications of ILP. For
    example:
  • In Systems Biology, scientific knowledge
    discovery is made possible by the potential of
    ILP, but millions of records containing 3-D
    molecule descriptions must be processed, leading
    to extremely expensive computational processes.
  • The huge hypothesis space to search makes these
    tasks NP-hard, hence the need for scalable ILP
    systems able to process great quantities of data
    within reasonable time limits.

9
ILP scalability issue: what has been done
  • Subsampling methods focus on a smaller set of
    examples for processing very large datasets
    (Srinivasan 99).
  • Local coverage tests within a learning-from-
    interpretations setting, where independence
    assumptions among examples are reasonable and
    coverage of each example can be tested without
    considering the entire background knowledge (De
    Raedt 99).
  • Ad-hoc theta-subsumption techniques to speed up
    the coverage test have also been investigated (Di
    Mauro 2003).
  • However, the NP-hard nature of problems with a
    huge hypothesis space to search motivates
    further efforts in implementing scalable ILP
    systems able to process very large datasets.
  • How large are large datasets? Do we need
    scalability?
  • Imagine millions of 3-D descriptions of protein
    folds to be learned.

10
ILP systems: their drawbacks
  • Almost all ILP systems are internal-memory
    systems, in that they rely on Prolog interpreters
    that prove theorems in main memory.
    Unfortunately, these interpreters are not
    efficient: they load all the examples into main
    memory and, whenever the theory must be refined,
    they try to prove goals by scanning the entire
    main module, where thousands of examples may be
    present.
  • Prolog interpreters do not maintain any
    information useful during the theorem-proving
    process. What does this mean? No clause knows
    which examples it covers, so whenever a clause is
    refined, an overall check of the entire memory
    must be done.
  • What if every clause knew which examples to check
    when a cover test must be done?
  • Prolog interpreters do not maintain any
    information useful for better understanding the
    learning process, or for trying other directions
    in the search for hypotheses.
  • What if we maintained the various versions of the
    clauses with their statistics, and used these
    whenever the search process must be redirected?

11
Improving scalability: First step
  • Basic idea: eliminate main-memory handling of
    terms. Instead of loading them into main memory,
    use an external storage that provides fast access
    and eliminates the sequential nature of the
    Prolog module.
  • To make it faster, what to choose? Integrate a
    commercial RDBMS or implement a tailored solution
    for external term handling?
  • Berkeley DB is an Open Source project under the
    supervision of Sleepycat. It is a database
    library that provides a powerful API for data
    management. It is an embedded data engine: the
    libraries execute in the address space of the
    hosting application, so no inter-process
    communication is required, as there would be with
    an external RDBMS (e.g., MySQL or SQL Server).
  • The API is available in various languages: C,
    C++, Java, Python, Prolog, etc.
  • The underlying structures are hash tables,
    B-trees, record-number-based storage, and queues.
  • Portable: 16-, 32-, and 64-bit machines; Windows,
    Unix, Linux, and any other OS. The abstraction
    layer is about 2500 lines of C code, so any OS is
    possible.
  • No query language, only function calls for data
    retrieval, hence no parsing of queries.
  • Indexing, page-cache management, transactions and
    logging, locking, and Concurrent Access Methods
    are provided.
  • Adopted by HP, Sun, Google, Motorola, and
    Amazon.com for their embedded systems. (The
    library is only 300KB but can handle 256
    terabytes of data.)

12
Improving scalability First step
  • The interface we exploit uses the Concurrent
    Access Methods product of Berkeley DB. This means
    that multiple processes can open the same
    database, but transactions and disaster recovery
    are not supported. The environment and the
    database files are ordinary Berkeley DB entities
    that use a custom hash function.
  • We developed a tailored embedded solution using
    the Sicstus Prolog interface to the Berkeley DB
    toolset for supporting persistent storage of
    Prolog terms. The idea is to obtain a behavior
    similar to the built-in Prolog predicates such as
    assert/1, retract/1 and clause/2.
  • DB-Specification
  • speclist spec1, . . . , specM
  • spec functor(argspec1, . . . , argspecN)
  • argspec -
  • where functor is a Prolog atom. A spec
    F(argspec1, . . . ,argspecN) is applicable to any
    ground term with principal functor F/N.

13
Improving scalability: First step
  • The functors and the indexing specifications of
    the terms to be stored have to be given when the
    database is created.
  • The indexing is specified when the database is
    created. It is possible to index on parts of the
    term other than just the functor and first
    argument.
  • Changes affect the database immediately.
  • The database can store variables (other DBMSs do
    not).
  • Example: storing and fetching examples in Prolog
    (see the runnable sketch below)
  • db-spec: examples(+,-,-,-,-,-,-). Indexed on the
    argument signed with +.
  • db_create(examples(+,-,-,-,-,-,-), MyDb)
  • db_store(MyDb, examples(KeyExample, Concept, Head,
    Body, Constraints, Symbolic, Numeric))
  • db_fetch(MyDb, examples(KeyExample,_,_,_,_,_,_))
  • db_iterator(MyDb, examples(KeyExample,_,_,_,_,_,_))
  • db_iterator_next(MyDb), db_open(MyDb),
    db_close(MyDb)
  • All these operators call C-language API
    functions of Berkeley DB.
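  A minimal runnable sketch of the pattern above, using
  the SICStus Prolog Berkeley DB interface, library(bdb)
  (a sketch under assumptions: db_open/4, db_store/3,
  db_fetch/3 and db_close/1 as documented in the SICStus
  manual; the database name and stored values are
  illustrative, the examples/7 spec is from this slide):

    :- use_module(library(bdb)).

    demo_store_fetch :-
        % Open/create the database with its indexing spec.
        db_open(exampledb, update, [examples(+,-,-,-,-,-,-)], DBRef),
        % Store one example term; the + argument is indexed.
        db_store(DBRef,
                 examples(e1, fold, head1, body1, [], sym1, num1),
                 _TermRef),
        % Fetch stored terms whose indexed argument unifies with e1.
        db_fetch(DBRef, examples(e1, C, H, B, Cs, S, N), _Ref),
        write(examples(e1, C, H, B, Cs, S, N)), nl,
        db_close(DBRef).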

14
Improving scalability Second step
  • Basic idea having eliminated main memory of
    Prolog we can link clauses to examples.
  • Implement a relational schema that keeps links
    between clauses and examples they cover.
  • What is this useful for? When a clause is
    specialized only the examples previously covered
    by the clause should be checked. Thus picking up
    all the examples and all the coverage tests that
    in a Prolog fashion correspond to proving all
    these goals, are replaced by an external fast
    check of only the examples related to the clause.
  • When a clause is generalized it is useless to
    check all the past positive examples. The
    positive examples previously covered will
    continue to be covered since the clause has been
    generalized. Through our schema only negative
    examples of the same concept (clauses and
    examples indexed also on their concept) are
    checked saving unnecessary checks that Prolog
    interpreter would do by checking again the entire
    main memory.
  • The expensive cover test is done only the first
    time an example enters the database. Sucessively
    it is sufficient to check a hash table with the
    keys of the clauses and examples.
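  A minimal Prolog sketch of the linking idea (the
  predicates covers_link/2, record_coverage/2 and
  recheck_on_specialize/2 are hypothetical names for
  illustration; in the system the links live in Berkeley
  DB tables, not in dynamic predicates):

    :- dynamic covers_link/2.   % covers_link(ClauseKey, ExampleKey)

    % Record the outcome of the first (expensive) cover test.
    record_coverage(ClauseKey, ExampleKey) :-
        (   covers_link(ClauseKey, ExampleKey) -> true
        ;   assertz(covers_link(ClauseKey, ExampleKey))
        ).

    % On specialization, only the examples previously covered
    % by the clause need to be re-tested.
    recheck_on_specialize(ClauseKey, ExampleKeys) :-
        findall(E, covers_link(ClauseKey, E), ExampleKeys).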

15
Improving scalability Is it all?
  • What if we maintain versions of the theory step
    after step? At a certain moment of the search in
    the hypotheses space, if we would want to
    backtrack, how can we do it if we dont know what
    has happened to the theory (the Prolog case) and
    how this has evolved in time?
  • What if we could know which example caused a
    certain refinement? In a Prolog setting this is
    not possible. Instead with our solution we can
    adopt partial memory techniques discriminating
    the examples on the basis of what kind of
    modifications they have been able to do to the
    theory during the learning process.
  • To understand better the learning process we can
    maintain statistics about the learning task, such
    as number of refinements caused by an example,
    number of refinement of a clause, times of
    refinement and other precious information that in
    a Prolog setting are not possible.
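  One hypothetical way to keep this bookkeeping as stored
  Prolog terms (the revision/5 schema below is invented
  for illustration; the slides do not specify the actual
  one):

    :- use_module(library(lists)).   % length/2, member/2

    % revision(Seq, ClauseKey, ExampleKey, Kind, NewClause):
    % one record per refinement; Kind is gen or spec.
    :- dynamic revision/5.

    log_refinement(Seq, CK, EK, Kind, NewClause) :-
        assertz(revision(Seq, CK, EK, Kind, NewClause)).

    % Number of refinements caused by an example.
    refinements_caused_by(EK, N) :-
        findall(S, revision(S, _, EK, _, _), Ss),
        length(Ss, N).

    % Versions of a clause, in refinement order (for backtracking).
    clause_history(CK, Clauses) :-
        findall(Seq-Cl, revision(Seq, CK, _, _, Cl), Pairs),
        keysort(Pairs, Sorted),
        findall(Cl, member(_-Cl, Sorted), Clauses).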

16
Relational schema: Link it
17
Experiments 1
Document understanding domain: learn theories to
classify documents and to recognize their logical
components, such as abstract, title, page number,
and affiliation. Experiments compare the two
scalable versions of INTHELEX developed.
18
Experiments 2
Document understanding domain: learn theories to
classify documents and to recognize their logical
components, such as abstract, title, page number,
and affiliation. Experiments compare the initial
version and the final version of INTHELEX.
19
Results: interpretation
  • External storage of terms and fast access through
    function-call techniques resulted in important
    savings in processing time.
  • Linking clauses and examples significantly
    improved the efficiency of the learning process.

20
Future work
  • Investigate the use of the information preserved
    in the database, such as the versions of the
    theory, for backtracking strategies.
  • Investigate the statistics about the examples for
    implementing partial-memory learning methods.

21
  • Thank you for your attention
  • Marenglen Biba, PhD Student
  • biba@di.uniba.it