Datcracker Open datamining platform connecting Rseslib and WEKA - PowerPoint PPT Presentation

1
Datcracker: Open data-mining platform connecting
Rseslib and WEKA
Marcin Wojnarski
Warsaw University, Poland
2
Outline
  • Datcracker is
  • Motivation
  • What is available in version 0.5
  • HOWTO
  • Architecture
  • Future releases

3
Datcracker is
  • an open-source extensible data-mining platform
    which provides common architecture for data
    processing algorithms of various types. The
    algorithms can be combined together to build data
    processing schemes of large complexity.

4
Main characteristics
  • Extensibility of algorithm pool through
    well-defined API
  • Extensibility of types of data that algorithms
    operate on
  • Stream-based data processing, for efficient
    handling of large volumes of data and for freedom
    of designing complex experiments
  • Language: Java
  • Licence: GPL
  • Download: www.datcracker.org

5
Motivation
To enable independent research groups to exchange
and combine their algorithms
To simplify implementation of new algorithms
6
Available in version 0.5
  • Rseslib algorithms
      • classifiers (20 algorithms)
  • Weka algorithms
      • ARFF reader
      • classifiers (60)
      • filters (47)
  • Datcracker algorithms
      • TrainTest evaluation scheme
  • Data types
      • vectors of numeric and/or symbolic features

7
HOWTO Read ARFF file
    Cell arff = new ArffReaderCell();
    arff.set("filename", "data/iris.arff");
    arff.set("labelIndex", "last");
    arff.open();
    System.out.println(arff.next());   // first sample
    System.out.println(arff.next());   // second sample
    arff.close();

Output:
    data: 5.1 3.5 1.4 0.2   label: Iris-setosa
    data: 4.9 3.0 1.4 0.2   label: Iris-setosa
8
HOWTO Train classifier (Rseslib)
    Cell learner = new RseslibClassifier("C45");
    learner.set("pruning", "true");
    learner.setSource(arff);        // training data
    learner.build();                // train the classifier
    learner.setSource(arff_test);   // test data
    learner.open();
    System.out.println(learner.next());
    learner.close();

9
HOWTO Train classifier (Weka)
    Cell learner = new WekaClassifier("J48");
    learner.set("minNumObj", "2");
    learner.setSource(arff);
    learner.build();

10
HOWTO Apply Weka filter
    Cell filter = new WekaFilter("attribute.Remove");
    filter.set("attributeIndices", "3-6");
    filter.setSource(arff);
    filter.open();
    System.out.println(filter.next());
    System.out.println(filter.next());
    filter.close();

11
HOWTO Set parameters
    arff.set("filename", "data/iris.arff");
    arff.set("labelIndex", "last");
    ...

OR

    Parameters par = new Parameters();
    par.set("filename", "data/iris.arff");
    par.set("labelIndex", "last");
    ...
    arff.setParameters(par);      // set all parameters at once
    par = arff.getParameters();   // read back the parameters that are set
12
HOWTO Train Test
    Cell learner = new RseslibClassifier("C45");
    learner.set("pruning", "true");
    TrainAndTest tt = new TrainAndTest(learner);
    tt.set("trainPercent", "70");
    tt.set("repetitions", "10");
    tt.setSource(source);
    tt.build();
    System.out.println(tt.report());

13
Data Processing Chain
Cell.setSource(sourceCell)
14
Architecture
15
Outline
  • Cell
      • interfaces
      • state
      • how to override
  • Data
  • MetaData

16
Cell
  • Main class of Datcracker architecture
  • Base class for all data-processing algorithms:
      • classifiers
      • clusterers
      • filters
      • data loaders
      • data generators
  • Cells can be connected in a Data Processing Chain
  • Data transfer between cells has the form of a
    stream of samples
  • A receiving cell may immediately consume incoming
    samples, so large volumes of data are processed
    efficiently

17
Cell's interface
  • A cell can be:
      • a data source
      • a data receiver
      • buildable
      • parameterized

18
Cell as a data source
  • Cell's interface for data transfer:
      • open() : MetaSample - opens a communication session
      • next() : Sample - retrieves the next sample of data
      • close() - closes the communication session
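The open/next/close session described above can be sketched as a pull-based loop. This is a minimal illustration, not the Datcracker API: `SampleSource`, `ListSource` and `SourceDemo` are hypothetical names, samples are plain strings, and returning `null` from `next()` at the end of the stream is an assumed convention that the slides do not specify.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a cell acting as a data source (not the Datcracker API).
interface SampleSource {
    String open();   // starts a session, returns a description of the stream
    String next();   // returns the next sample, or null when exhausted (assumed)
    void close();    // ends the session
}

// Serves samples from an in-memory list, one at a time.
class ListSource implements SampleSource {
    private final List<String> samples;
    private Iterator<String> cursor;

    ListSource(List<String> samples) {
        this.samples = samples;
    }

    public String open() {
        cursor = samples.iterator();
        return "string samples";
    }

    public String next() {
        return cursor.hasNext() ? cursor.next() : null;
    }

    public void close() {
        cursor = null;
    }
}

class SourceDemo {
    // Consumes one session sample by sample; only one sample is held at a time.
    static int countSamples(SampleSource source) {
        source.open();
        int count = 0;
        while (source.next() != null) {
            count++;
        }
        source.close();
        return count;
    }

    public static void main(String[] args) {
        SampleSource src = new ListSource(Arrays.asList("a", "b", "c"));
        System.out.println(countSamples(src));
    }
}
```

Because the consumer pulls one sample at a time, the source never needs to materialise the whole data set, which is the point of the stream-based design.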

19
Cell as a data receiver
  • Cell's interface for receiving data:
      • setSource(Cell) - sets the source cell

20
Buildable cells
  • Some cells may be buildable: they have to be
    built before use
  • Building a cell is implemented by subclasses and
    may mean different things:
      • training a decision system
      • running an evaluation scheme (TT, CV, ...)
      • buffering input data
  • Cell's interface for building:
      • build() - builds the cell
      • erase() - erases the cell; it can be built again
        afterwards

21
Fixed cells
  • Cells that are not buildable are called fixed.
    They are usable just after construction or
    setting parameters. Examples:
      • file reader
      • WEKA filter

22
Parameterized cells
  • Cell's interface for parameterization:
      • set(String name, String value) - sets a parameter
      • setParameters(Parameters) - sets all parameters at
        once
      • getParameters() : Parameters - returns all
        parameters that are set

23
State of the cell
  • EMPTY: cell has no content, cannot be used
  • CLOSED: content has been built, cell ready to use
  • OPEN: cell is being used now (generating samples
    of data)

24
State of the cell: motivation
  • To check against access violations when the cell
    is accessed. Examples:
      • two cells try to retrieve data from a given cell
        at the same time
      • someone tries to use an empty cell
      • someone tries to reconnect cells during their
        activity
  • To simplify implementation of subclasses (new
    algorithms): they may safely assume that access
    is correct (build() before open(), open() before
    next(), ...)
  • To detect bugs early: important in a heterogeneous
    system!
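The state checks listed above can be sketched as a small state machine. This is an illustration under assumptions, not Datcracker's implementation: `CellState` and `StatefulCell` are hypothetical names, the exact set of allowed transitions (for example, `erase()` only from CLOSED) is assumed, and `next()` simply hands out two dummy samples.

```java
// Hypothetical sketch of state checking; names and transitions are assumptions.
enum CellState { EMPTY, CLOSED, OPEN }

class StatefulCell {
    private CellState state = CellState.EMPTY;
    private int remaining;

    CellState state() {
        return state;
    }

    void build() {
        require(CellState.EMPTY, "build");
        state = CellState.CLOSED;
    }

    void erase() {
        require(CellState.CLOSED, "erase");
        state = CellState.EMPTY;
    }

    void open() {
        // A second open() while a session is running fails here, which is how
        // two consumers reading the same cell at once would be detected.
        require(CellState.CLOSED, "open");
        remaining = 2;
        state = CellState.OPEN;
    }

    String next() {
        require(CellState.OPEN, "next");
        if (remaining == 0) {
            return null;   // end of stream (assumed convention)
        }
        remaining--;
        return "sample";
    }

    void close() {
        require(CellState.OPEN, "close");
        state = CellState.CLOSED;
    }

    // Rejects any call made in the wrong state, so bugs surface immediately.
    private void require(CellState expected, String operation) {
        if (state != expected) {
            throw new IllegalStateException(
                operation + "() called in state " + state + ", expected " + expected);
        }
    }
}
```

With checks like these in the base class, an algorithm author can write `next()` logic without defending against an unbuilt or unopened cell.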

25
How to override Cell
  • Methods to override:
      • onBuild()
      • onErase()
      • onOpen()
      • onNext()
      • onClose()
  • Public methods build(), ... can't be
    overridden. They perform state checking and then
    call the corresponding on...() method
  • Like event handlers in event-driven programming
  • You do not have to override all of them! (e.g. a
    cell for reading data will not be buildable)
  • You can provide additional interface in your
    subclass
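The override scheme above is essentially the template-method pattern: final public methods do the checking and delegate to overridable handlers. The sketch below is a hedged illustration only. `TemplateCell`, `CounterCell` and `DoublingCell` are hypothetical classes, samples are plain `Integer` values, building is left out, and `null` marks the end of the stream by assumption.

```java
// Hypothetical base class illustrating the pattern; not the real Cell class.
abstract class TemplateCell {
    private boolean open = false;
    protected TemplateCell source;

    final void setSource(TemplateCell source) {
        if (open) throw new IllegalStateException("cannot reconnect while open");
        this.source = source;
    }

    final void open() {
        if (open) throw new IllegalStateException("already open");
        onOpen();
        open = true;
    }

    final Integer next() {
        if (!open) throw new IllegalStateException("next() before open()");
        return onNext();
    }

    final void close() {
        if (!open) throw new IllegalStateException("close() before open()");
        onClose();
        open = false;
    }

    // Handlers: subclasses override only the ones they need.
    protected void onOpen() {}
    protected abstract Integer onNext();
    protected void onClose() {}
}

// A data generator: emits 1..limit, then null.
class CounterCell extends TemplateCell {
    private final int limit;
    private int current;

    CounterCell(int limit) {
        this.limit = limit;
    }

    @Override protected void onOpen() {
        current = 0;
    }

    @Override protected Integer onNext() {
        if (current >= limit) return null;
        current++;
        return current;
    }
}

// A filter: pulls from its source and doubles each value.
class DoublingCell extends TemplateCell {
    @Override protected void onOpen() {
        source.open();
    }

    @Override protected Integer onNext() {
        Integer value = source.next();
        if (value == null) return null;
        return value * 2;
    }

    @Override protected void onClose() {
        source.close();
    }
}
```

Connecting the two with `doubling.setSource(counter)` gives a two-stage chain in which neither subclass contains any state-checking code of its own.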

26
Data representation
  • Data set split into samples
  • Sample:
      • data : Data - input data
      • label : Data - associated decision label
  • Separation of data and label:
      • useful for complex types of data/labels, e.g. in
        image processing (like segmentation)
      • useful for meta-learning algorithms, which operate
        on labels alone
      • labelled / unlabelled / partially labelled samples
        handled in the same way
  • Data: abstract base class, downcast by cells to
    the subclass they expect
  • Currently available subclasses:
      • NumericFeature, SymbolicFeature, DataVector
  • In the future: time series, images, special
    types of labels, ...
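The sample layout above can be sketched as follows. All class names here (`SketchData`, `SketchNumeric`, `SketchSymbolic`, `SketchSample`, `LabelStats`) are hypothetical stand-ins rather than Datcracker classes, and representing an unlabelled sample by a `null` label is an assumption made for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical abstract base type for both input data and labels.
abstract class SketchData {}

final class SketchNumeric extends SketchData {
    final double value;
    SketchNumeric(double value) { this.value = value; }
}

final class SketchSymbolic extends SketchData {
    final String value;
    SketchSymbolic(String value) { this.value = value; }
}

// A sample keeps input data and decision label as two separate objects.
final class SketchSample {
    final SketchData data;
    final SketchData label;   // null means "unlabelled" (assumption)

    SketchSample(SketchData data, SketchData label) {
        this.data = data;
        this.label = label;
    }
}

class LabelStats {
    // Works on labels alone: downcasts each label to the type it expects
    // and silently skips unlabelled samples.
    static int count(List<SketchSample> samples, String wanted) {
        int matches = 0;
        for (SketchSample s : samples) {
            if (s.label instanceof SketchSymbolic
                    && ((SketchSymbolic) s.label).value.equals(wanted)) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<SketchSample> samples = Arrays.asList(
            new SketchSample(new SketchNumeric(5.1), new SketchSymbolic("Iris-setosa")),
            new SketchSample(new SketchNumeric(4.9), null));
        System.out.println(count(samples, "Iris-setosa"));
    }
}
```

Keeping the label as its own object is what lets a routine like `count` ignore the input data entirely, as a meta-learning algorithm would.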

27
Immutability
  • Data objects are immutable: they cannot be
    modified after creation (like String class)
  • They can be freely shared among cells without
    risk of accidental modification:
      • safety
      • simplicity
      • efficiency:
          • no need to copy data between cells
          • no need for synchronization in multi-threaded
            execution
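A minimal sketch of what an immutable data object can look like in Java is given below. `ImmutableVector` is a hypothetical class written for illustration and is not Datcracker's DataVector; the technique shown is a final class with a private final array, a defensive copy in the constructor, and operations that return new objects instead of modifying the existing one.

```java
import java.util.Arrays;

// Hypothetical immutable vector; not the Datcracker DataVector class.
final class ImmutableVector {
    private final double[] values;

    ImmutableVector(double[] values) {
        // Defensive copy: later changes to the caller's array cannot leak in.
        this.values = values.clone();
    }

    int size() {
        return values.length;
    }

    double get(int index) {
        return values[index];
    }

    // "Modification" produces a new object; the original stays untouched.
    ImmutableVector scaled(double factor) {
        double[] result = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            result[i] = values[i] * factor;
        }
        return new ImmutableVector(result);
    }

    @Override
    public String toString() {
        return Arrays.toString(values);
    }
}
```

Since no method exposes or alters the internal array, the same instance can be handed to several cells, or to several threads, without copying or locking.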

28
Metadata
  • Many algorithms have to know the type of input data
    in advance, before processing of data starts:
    hence metadata
  • Separation of data and metadata: base class
    MetaData
  • Describes common properties of all Data objects
    generated in a given session:
      • number and types of features in a DataVector
      • dictionary of possible values of a
        SymbolicFeature
  • Each Data subclass has an associated MetaData
    subclass
  • Immutable!

29
Future releases
  • Architecture
      • Multi-input and multi-output cells
      • Composite cells (e.g. meta-learning)
      • Serialization and copying
      • Progress info and suspension of cell building
  • Algorithms
      • cross-validation
      • data buffering
  • Data types
      • time series

30
Home
www.datcracker.org
31
Thank You