Title: Datcracker: Open data-mining platform connecting Rseslib and WEKA
1 Datcracker: Open data-mining platform connecting Rseslib and WEKA
Marcin Wojnarski
Warsaw University, Poland
2 Outline
- Datcracker is
- Motivation
- What is available in version 0.5
- HOWTO
- Architecture
- Future releases
3 Datcracker is
- an open-source, extensible data-mining platform that provides a common
architecture for data-processing algorithms of various types. The
algorithms can be combined to build data-processing schemes of high
complexity.
4 Main characteristics
- Extensibility of the algorithm pool through a well-defined API
- Extensibility of the types of data that algorithms operate on
- Stream-based data processing, for efficient handling of large volumes of data and for freedom in designing complex experiments
- Language: Java
- Licence: GPL
- Download: www.datcracker.org
5 Motivation
- To enable independent research groups to exchange and combine their algorithms
- To simplify the implementation of new algorithms
6 Available in version 0.5
- Rseslib algorithms
  - classifiers (20 algorithms)
- Weka algorithms
  - ARFF reader
  - classifiers (60)
  - filters (47)
- Datcracker algorithms
  - Train-and-test evaluation scheme
- Data types
  - vectors of numeric and/or symbolic features
7 HOWTO: Read ARFF file
    Cell arff = new ArffReaderCell();
    arff.set("filename", "data/iris.arff");
    arff.set("labelIndex", "last");
    arff.open();
    System.out.println(arff.next());   // first sample
    System.out.println(arff.next());   // second sample
    arff.close();
Output:
    data: 5.1 3.5 1.4 0.2  label: Iris-setosa
    data: 4.9 3.0 1.4 0.2  label: Iris-setosa
8 HOWTO: Train classifier (Rseslib)
    Cell learner = new RseslibClassifier("C45");
    learner.set("pruning", "true");
    learner.setSource(arff);           // training data
    learner.build();                   // train the classifier
    learner.setSource(arff_test);      // arff_test: a second reader cell with test data, set up like arff
    learner.open();
    System.out.println(learner.next());
    learner.close();
9 HOWTO: Train classifier (Weka)
    Cell learner = new WekaClassifier("J48");
    learner.set("minNumObj", "2");
    learner.setSource(arff);
    learner.build();
10 HOWTO: Apply Weka filter
    Cell filter = new WekaFilter("attribute.Remove");
    filter.set("attributeIndices", "3-6");
    filter.setSource(arff);
    filter.open();
    System.out.println(filter.next());
    System.out.println(filter.next());
    filter.close();
11 HOWTO: Set parameters
    arff.set("filename", "data/iris.arff");
    arff.set("labelIndex", "last");
    ...
OR
    Parameters par = new Parameters();
    par.set("filename", "data/iris.arff");
    par.set("labelIndex", "last");
    ...
    arff.setParameters(par);
    par = arff.getParameters();
12 HOWTO: Train and test
    Cell learner = new RseslibClassifier("C45");
    learner.set("pruning", "true");
    TrainAndTest tt = new TrainAndTest(learner);
    tt.set("trainPercent", "70");
    tt.set("repetitions", "10");
    tt.setSource(source);
    tt.build();
    System.out.println(tt.report());
13 Data Processing Chain
Cell.setSource(sourceCell)
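The chaining call above can be illustrated with a small self-contained sketch. The classes `ChainCell`, `ListSource` and `Doubler` are hypothetical stand-ins written for this example, not Datcracker's own classes; only the method names `setSource`, `open`, `next` and `close` come from the slides. Using `null` to signal the end of the stream is also an assumption.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a cell; NOT the real Datcracker Cell class.
abstract class ChainCell {
    protected ChainCell source;               // upstream cell, if any

    void setSource(ChainCell source) { this.source = source; }

    abstract void open();
    abstract Double next();                   // assumption: null marks end of stream
    abstract void close();
}

// A source cell that streams numbers from an in-memory list.
class ListSource extends ChainCell {
    private final List<Double> values;
    private Iterator<Double> it;

    ListSource(List<Double> values) { this.values = values; }

    void open() { it = values.iterator(); }
    Double next() { return it.hasNext() ? it.next() : null; }
    void close() { it = null; }
}

// A filter cell that pulls one sample at a time from its source and doubles it.
class Doubler extends ChainCell {
    void open() { source.open(); }
    Double next() {
        Double x = source.next();
        return x == null ? null : 2 * x;
    }
    void close() { source.close(); }
}

class ChainDemo {
    // Builds the chain ListSource -> Doubler -> Doubler and sums its output.
    static double run() {
        ChainCell src = new ListSource(Arrays.asList(1.0, 2.0, 3.0));
        ChainCell first = new Doubler();
        ChainCell second = new Doubler();
        first.setSource(src);
        second.setSource(first);

        double sum = 0;
        second.open();
        for (Double x = second.next(); x != null; x = second.next()) {
            sum += x;                         // each sample is consumed as it arrives
        }
        second.close();
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

No cell in this chain buffers the whole data set: a sample is requested from the last cell and pulled through the chain one step at a time.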
14 Architecture
15 Outline
- Cell
  - interfaces
  - state
  - how to override
- Data
- MetaData
16 Cell
- Main class of the Datcracker architecture
- Base class for all data-processing algorithms
  - classifiers
  - clusterers
  - filters
  - data loaders
  - data generators
  - ...
- Cells can be connected in a Data Processing Chain
- Data transfer between cells takes the form of a stream of samples
- A receiving cell may consume incoming samples immediately -> large volumes of data are processed efficiently
17 Cell's interface
- A cell can be
  - a data source
  - a data receiver
  - buildable
  - parameterized
18 Cell as a data source
- Cell's interface for data transfer
  - open() : MetaSample - opens a communication session
  - next() : Sample - retrieves the next sample of data
  - close() - closes the communication session
19 Cell as a data receiver
- Cell's interface for receiving data
  - setSource(Cell) - sets the source cell
20 Buildable cells
- Some cells are buildable: they have to be built before use
- Building a cell is implemented by subclasses and may mean different things
  - training a decision system
  - running an evaluation scheme (TT, CV, ...)
  - buffering input data
  - ...
- Cell's interface for building
  - build() - builds the cell
  - erase() - erases the cell; it can be built again afterwards
21 Fixed cells
- Cells that are not buildable are called fixed. They are usable right after construction or after setting parameters
  - file reader
  - WEKA filter
  - ...
22 Parameterized cells
- Cell's interface for parameterization
  - set(String name, String value) - sets a parameter
  - setParameters(Parameters) - sets all parameters at once
  - getParameters() : Parameters - returns all parameters that are set
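A minimal sketch of how such a string-based parameter interface could look is given below. `SketchParameters` and `ParameterizedSketch` are hypothetical names invented for this illustration and do not reproduce Datcracker's `Parameters` class. Parsing the string value only when the cell needs it, and the default of 50, are assumptions too.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in; NOT the real Datcracker Parameters class.
class SketchParameters {
    private final Map<String, String> values = new LinkedHashMap<>();

    void set(String name, String value) { values.put(name, value); }

    String get(String name) { return values.get(name); }   // null if not set
}

// A cell that accepts parameters one at a time or all at once.
class ParameterizedSketch {
    private SketchParameters params = new SketchParameters();

    void set(String name, String value) { params.set(name, value); }

    void setParameters(SketchParameters p) { params = p; }

    SketchParameters getParameters() { return params; }

    // String values are parsed at the point of use (an assumption of this sketch).
    int trainPercent() {
        String raw = params.get("trainPercent");
        return raw == null ? 50 : Integer.parseInt(raw);    // 50 is an arbitrary default
    }
}

class ParamDemo {
    public static void main(String[] args) {
        ParameterizedSketch cell = new ParameterizedSketch();
        cell.set("trainPercent", "70");
        System.out.println(cell.trainPercent());

        SketchParameters all = new SketchParameters();
        all.set("trainPercent", "80");
        cell.setParameters(all);
        System.out.println(cell.trainPercent());
    }
}
```

Keeping every value a string lets one `set(name, value)` signature serve all cells, at the cost of deferring type errors until the value is parsed.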
23 State of the cell
- EMPTY: cell has no content and cannot be used
- CLOSED: content has been built, cell is ready to use
- OPEN: cell is being used now (generating samples of data)
24 State of the cell: motivation
- To check against access violations when the cell is accessed. Examples:
  - two cells try to retrieve data from a given cell at the same time
  - someone tries to use an empty cell
  - someone tries to reconnect cells during their activity
- To simplify implementation of subclasses (new algorithms): they may safely assume that access is correct (build() before open(), open() before next(), ...)
- To detect bugs early, which is important in a heterogeneous system!
25 How to override Cell
- Methods to override
  - onBuild()
  - onErase()
  - onOpen()
  - onNext()
  - onClose()
- Public methods build(), ... can't be overridden. They perform state checking and then call the corresponding on...() method
- Like event handlers in event-driven programming
- You do not have to override all of them! (e.g. a cell for reading data will not be buildable)
- You can provide an additional interface in your subclass
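The combination of state checking and overridable `on...()` hooks can be sketched as a template-method pattern. The code below is an illustration under stated assumptions, not Datcracker's `Cell`: the class names `StatefulCell` and `Counter` are hypothetical, `next()` returns a `String` instead of a `Sample`, an `IllegalStateException` is assumed as the error signal, and for brevity every cell here starts EMPTY and must be built (a fixed cell would start in CLOSED).

```java
// Hypothetical sketch of the pattern; NOT the real Datcracker Cell class.
abstract class StatefulCell {
    enum State { EMPTY, CLOSED, OPEN }

    private State state = State.EMPTY;

    State state() { return state; }

    // Public methods are final: they check the state, then call the hook.
    public final void build() {
        require(State.EMPTY, "build");
        onBuild();
        state = State.CLOSED;
    }

    public final void erase() {
        require(State.CLOSED, "erase");
        onErase();
        state = State.EMPTY;
    }

    public final void open() {
        require(State.CLOSED, "open");
        onOpen();
        state = State.OPEN;
    }

    public final String next() {
        require(State.OPEN, "next");
        return onNext();
    }

    public final void close() {
        require(State.OPEN, "close");
        onClose();
        state = State.CLOSED;
    }

    private void require(State expected, String method) {
        if (state != expected) {
            throw new IllegalStateException(
                method + "() called in state " + state + ", expected " + expected);
        }
    }

    // Hooks with empty defaults, so subclasses override only what they need.
    protected void onBuild() {}
    protected void onErase() {}
    protected void onOpen() {}
    protected void onClose() {}
    protected abstract String onNext();
}

// A subclass that implements only the hooks it cares about.
class Counter extends StatefulCell {
    private int i;
    @Override protected void onOpen() { i = 0; }
    @Override protected String onNext() { return "sample " + (i++); }
}

class StateDemo {
    // Returns true if next() on a built but unopened cell is rejected.
    static boolean rejectsNextBeforeOpen() {
        StatefulCell cell = new Counter();
        cell.build();
        try {
            cell.next();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        StatefulCell cell = new Counter();
        cell.build();
        cell.open();
        System.out.println(cell.next());
        System.out.println(cell.next());
        cell.close();
        System.out.println(rejectsNextBeforeOpen());
    }
}
```

Because the checks live in the final public methods, `Counter` never has to verify that `open()` preceded `next()`.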
26 Data representation
- Data set split into samples
- Sample
  - data : Data - input data
  - label : Data - associated decision label
- Separation of data and label
  - useful for complex types of data/labels, e.g. in image processing (like segmentation)
  - useful for meta-learning algorithms, which operate on labels alone
  - labelled / unlabelled / partially labelled samples handled in the same way
- Data: abstract base class. Downcast by cells to what they expect
- Currently available subclasses
  - NumericFeature, SymbolicFeature, DataVector
- In the future: time series, images, special types of labels, ...
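The data/label separation and the downcasting step can be sketched as follows. All class names here (`SketchData`, `Numeric`, `Symbolic`, `SketchSample`) are hypothetical stand-ins rather than the Datcracker classes listed above, and representing an unlabelled sample by a `null` label is an assumption of this sketch.

```java
// Hypothetical stand-ins; NOT the real Datcracker Data classes.
abstract class SketchData {}

final class Numeric extends SketchData {
    final double value;
    Numeric(double value) { this.value = value; }
}

final class Symbolic extends SketchData {
    final String value;
    Symbolic(String value) { this.value = value; }
}

// A sample pairs input data with an optional decision label.
final class SketchSample {
    final SketchData data;
    final SketchData label;                   // assumption: null means "unlabelled"

    SketchSample(SketchData data, SketchData label) {
        this.data = data;
        this.label = label;
    }

    boolean isLabelled() { return label != null; }
}

class SampleDemo {
    // A cell that expects numeric input downcasts the abstract reference.
    static double numericValue(SketchSample s) {
        if (!(s.data instanceof Numeric)) {
            throw new IllegalArgumentException("expected numeric data");
        }
        return ((Numeric) s.data).value;
    }

    public static void main(String[] args) {
        SketchSample labelled =
            new SketchSample(new Numeric(5.1), new Symbolic("Iris-setosa"));
        SketchSample unlabelled = new SketchSample(new Numeric(4.9), null);
        System.out.println(numericValue(labelled) + " " + labelled.isLabelled());
        System.out.println(numericValue(unlabelled) + " " + unlabelled.isLabelled());
    }
}
```

Labelled and unlabelled samples travel through the same `SketchSample` type, so a cell that ignores labels needs no special case.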
27 Immutability
- Data objects are immutable: they cannot be modified after creation (like the String class)
- They can be freely shared among cells without risk of accidental modification
  - safety
  - simplicity
  - efficiency
    - no need to copy data between cells
    - no need for synchronization in multi-threaded execution
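One common way to obtain this kind of immutability in Java is a final class with private final fields and defensive copies. The sketch below assumes that approach; `ImmutableVector` and its `with` method are hypothetical and are not taken from Datcracker's `DataVector`.

```java
import java.util.Arrays;

// Hypothetical sketch of an immutable vector; NOT the real DataVector class.
final class ImmutableVector {
    private final double[] values;

    ImmutableVector(double... values) {
        this.values = values.clone();         // defensive copy on the way in
    }

    int size() { return values.length; }

    double get(int i) { return values[i]; }

    // "Modification" returns a new object and leaves this one untouched.
    ImmutableVector with(int i, double v) {
        double[] copy = values.clone();
        copy[i] = v;
        return new ImmutableVector(copy);
    }

    @Override public String toString() { return Arrays.toString(values); }
}

class ImmutabilityDemo {
    public static void main(String[] args) {
        double[] raw = {5.1, 3.5, 1.4, 0.2};
        ImmutableVector a = new ImmutableVector(raw);
        raw[0] = -1;                          // does not affect a
        ImmutableVector b = a.with(0, 4.9);   // a is unchanged
        System.out.println(a);
        System.out.println(b);
    }
}
```

Since no method can change an existing `ImmutableVector`, the same instance can be handed to several cells or threads without copying or locking.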
28 Metadata
- Many algorithms have to know the type of input data in advance, before processing of data starts -> metadata
- Separation of data and metadata -> base class MetaData
- Describes common properties of all Data objects generated in a given session
  - number and types of features in a DataVector
  - dictionary of possible values of a SymbolicFeature
  - ...
- Each Data subclass has an associated MetaData subclass
- Immutable!
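To illustrate why metadata is useful before any sample arrives, the sketch below lets a receiver size its per-class counters from the dictionary of a symbolic feature alone. `SymbolicMeta` and `MetaDemo` are hypothetical names made up for this example and do not mirror Datcracker's `MetaData` hierarchy.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; NOT the real Datcracker MetaData classes.
final class SymbolicMeta {
    private final List<String> dictionary;    // every value the feature may take

    SymbolicMeta(String... values) {
        // copy the array so later changes by the caller cannot leak in
        this.dictionary =
            Collections.unmodifiableList(Arrays.asList(values.clone()));
    }

    int cardinality() { return dictionary.size(); }

    // Position of a symbol in the dictionary; unknown symbols are rejected.
    int indexOf(String symbol) {
        int i = dictionary.indexOf(symbol);
        if (i < 0) {
            throw new IllegalArgumentException("unknown symbol: " + symbol);
        }
        return i;
    }
}

class MetaDemo {
    // Counts how often each label occurs in a stream of labels.
    static int[] countLabels(SymbolicMeta labelMeta, String[] labels) {
        int[] counts = new int[labelMeta.cardinality()];   // sized from metadata alone
        for (String label : labels) {
            counts[labelMeta.indexOf(label)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        SymbolicMeta meta =
            new SymbolicMeta("Iris-setosa", "Iris-versicolor", "Iris-virginica");
        String[] stream = {"Iris-setosa", "Iris-virginica", "Iris-setosa"};
        System.out.println(Arrays.toString(countLabels(meta, stream)));
    }
}
```

Without the dictionary, the receiver would have to scan or buffer the whole stream first to discover how many distinct labels exist.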
29 Future releases
- Architecture
  - Multi-input and multi-output cells
  - Composite cells (e.g. meta-learning)
  - Serialization and copying
  - Progress info and suspension of cell building
- Algorithms
  - cross-validation
  - data buffering
  - ...
- Data types
  - time series
  - ...
30 Home
www.datcracker.org
31 Thank You