TextGarden Software Suite Quick Overview

1
Text-Garden Software Suite Quick Overview

Marko Grobelnik
Jozef Stefan Institute
Ljubljana, Slovenia

2
Outline

What is Text-Garden?
How is Text-Garden being built?
Major functionalities
Technical aspects
Future developments
Availability
Text-Garden SMART

3
What is Text-Garden?

Text-Garden is a software library and a collection of software tools for solving large-scale tasks dealing with structured, semi-structured and unstructured data
in particular, the emphasis of its functionality is on dealing with text
It can be used in various ways, covering research and applied scenarios
It is being used by several institutions such as BT, MSR, CMU, ...

4
Some history

The work started in 1996 as a set of C++ classes for dealing with text and performing text-learning tasks (two people working on it)
until 2002 it developed slowly, following the academic tasks on our agenda
From 2003 on, Text-Garden became the central software platform JSI uses on many research and applied projects (10 people contributing)
all the solutions and results JSI is working on eventually become part of the Text-Garden environment

5
[Diagram relating the JSI team, Text-Garden and the projects SMART (STREP), PASCAL (NoE), SEKT (IP) and NEON (IP); label: local JSI development of Text-Garden]

6
Major functionalities

7
Functionality blocks

Lexical text processing (tokenization, stop-words, stemming, n-grams, Wordnet)
Named Entity Extraction (capitalization based)
Unsupervised learning (KMeans, Hierarchical-KMeans, OneClassSVM)
Cross Correlation (KCCA, matching text with other data)
Semi-Supervised learning (Uncertainty sampling)
Keyword Extraction (contrast, centroid, taxonomy based)
Supervised learning (SVM, Winnow, Perceptron, NBayes)
Large Taxonomies (dealing with DMoz, Medline)
Dimensionality reduction (LSI, PCA)
Crawling Web and Search Eng. (for large scale data acquisition)
Visualization (Graph based, Tiling, Density based, ...)
Scalable Search (inverted index)

8
Lexical processing

Includes transformation from various formats into a bag-of-words representation
text/html, many custom formats (Svm-light, Reuters, ...)
will support most text encodings
Lexical processing includes
Tokenization
Stop-words removal
Stemming (Porter stemmer for English; for other languages we have ML for learning stemmers from lexicons)
Frequent N-Gram features (consecutive words co-appearing)
Proximity features (words co-appearing within a window)
Wordnet integration (words co-appearing in synsets or in e.g. hypernym relations)
Output of this level is a .BOW (Bag-Of-Words) file with a dictionary and sparse vectors for documents
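
The dictionary-plus-sparse-vectors idea can be illustrated with a minimal Java sketch. It is not the Text-Garden implementation or its .BOW file format: the class name BowSketch, the tiny stop-word list and the tokenization rule (lower-case, split on non-letters) are all assumptions made for illustration, and stemming and n-grams are left out.

```java
import java.util.*;

public class BowSketch {
    // Hypothetical stop-word list; a real list would be much longer.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "a", "of", "and", "is", "to"));

    // Dictionary: word -> integer id, assigned in order of first appearance.
    public final Map<String, Integer> dictionary = new LinkedHashMap<>();

    // Tokenization: lower-case, split on non-letters, drop stop-words.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) tokens.add(t);
        }
        return tokens;
    }

    // Converts one document into a sparse vector: word id -> term count.
    public Map<Integer, Integer> addDocument(String text) {
        Map<Integer, Integer> vector = new TreeMap<>();
        for (String token : tokenize(text)) {
            Integer id = dictionary.get(token);
            if (id == null) {
                id = dictionary.size();   // next free id
                dictionary.put(token, id);
            }
            vector.merge(id, 1, Integer::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        BowSketch bow = new BowSketch();
        System.out.println(bow.addDocument("The scarcity problem is a problem of economics"));
        System.out.println(bow.dictionary);
    }
}
```

Only words that actually occur in a document are stored in its vector, which is what keeps the representation compact for large dictionaries.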

9
Unsupervised learning

Algorithms
K-means clustering
clustering into flat clusters
Hierarchical-K-Means
creating hierarchy of clusters
One-Class-SVM
learning from positive class only
result is .BowPart (BOW Partition) file
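
As an illustration of flat clustering, here is a minimal Java sketch of the standard K-means (Lloyd) iteration. It is not Text-Garden code: the class name KMeansSketch is hypothetical, vectors are dense arrays rather than sparse BOW vectors, and the initial centroids are simply the first k points, an assumption chosen to keep the example deterministic.

```java
import java.util.Arrays;

public class KMeansSketch {
    // Squared Euclidean distance between two dense vectors of equal length.
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    // Returns the cluster index of every point. Requires points.length >= k.
    public static int[] cluster(double[][] points, int k, int maxIter) {
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = points[c].clone();
        int[] assign = new int[points.length];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Assignment step: attach each point to its nearest centroid.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (dist2(points[p], centroids[c]) < dist2(points[p], centroids[best])) best = c;
                }
                if (assign[p] != best) {
                    assign[p] = best;
                    changed = true;
                }
            }
            if (!changed && iter > 0) break;   // converged
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assign[p]]++;
                for (int i = 0; i < dim; i++) sums[assign[p]][i] += points[p][i];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue;  // keep an empty cluster's centroid
                for (int i = 0; i < dim; i++) centroids[c][i] = sums[c][i] / counts[c];
            }
        }
        return assign;
    }

    public static void main(String[] args) {
        double[][] pts = {{0, 0}, {10, 10}, {0, 1}, {10, 11}};
        System.out.println(Arrays.toString(cluster(pts, 2, 10)));
    }
}
```

A hierarchical variant can be obtained by applying the same routine recursively to the points of each resulting cluster.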

10
Supervised learning

The following algorithms are implemented
SVM (two class, regression)
Winnow
Perceptron
NaiveBayes
K-Nearest-Neighbor (KNN)
The result is a .BowMd (BOW Model) file, which is used for further classification
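
To make one of the listed algorithms concrete, the following Java sketch shows the textbook perceptron update on sparse document vectors. It is an illustration only and says nothing about how Text-Garden implements it or what a .BowMd file contains; the class name PerceptronSketch, the +1/-1 label encoding and the unit learning rate are assumptions.

```java
import java.util.*;

public class PerceptronSketch {
    public final double[] weights;   // one weight per dictionary word
    public double bias = 0.0;

    public PerceptronSketch(int dictionarySize) {
        weights = new double[dictionarySize];
    }

    // Linear score of a sparse document vector (word id -> weight).
    public double score(Map<Integer, Double> doc) {
        double s = bias;
        for (Map.Entry<Integer, Double> e : doc.entrySet()) {
            s += weights[e.getKey()] * e.getValue();
        }
        return s;
    }

    public int predict(Map<Integer, Double> doc) {
        return score(doc) > 0 ? 1 : -1;
    }

    // Labels are +1 or -1. On each mistake the weights move towards the example.
    public void train(List<Map<Integer, Double>> docs, int[] labels, int epochs) {
        for (int epoch = 0; epoch < epochs; epoch++) {
            int mistakes = 0;
            for (int i = 0; i < docs.size(); i++) {
                if (predict(docs.get(i)) == labels[i]) continue;
                for (Map.Entry<Integer, Double> e : docs.get(i).entrySet()) {
                    weights[e.getKey()] += labels[i] * e.getValue();
                }
                bias += labels[i];
                mistakes++;
            }
            if (mistakes == 0) break;   // training data separated
        }
    }
}
```

Because only the non-zero entries of a document are visited, each update costs time proportional to the document length rather than to the dictionary size.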

11
Dimensionality reduction

Transform the original space into a low-dimensional one and project the original data
Two classical methods
Latent Semantic Indexing (LSI)
efficient implementation, working with sparse matrices
Principal Component Analysis (PCA)
The result is a .SemSpace (Semantic Space) file, which is used for projecting the data
We can e.g. project the original .BOW file into a transformed lower-dimensional .BOW file

12
Named Entity Extraction

Simple and efficient NEE
it is based on word capitalization
Candidate NEs (words and sequences of words) need to be capitalized
Heuristic rule: capitalized candidates must appear within the text at least once
we handle exceptions separately
Works well with some user interaction
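
A capitalization-based candidate extractor could look roughly like the Java sketch below. The slide does not give the actual rules, so everything here is an assumption: the class name NeeSketch, the regular expression for "capitalized word sequences", and the minCount frequency threshold. Sentence-initial words and other exceptions are deliberately not handled.

```java
import java.util.*;
import java.util.regex.*;

public class NeeSketch {
    // One or more consecutive capitalized words separated by single spaces,
    // e.g. "Oscar Wilde" or "London".
    private static final Pattern CANDIDATE =
        Pattern.compile("\\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\\b");

    // Returns candidate -> number of occurrences, keeping only candidates
    // that occur at least minCount times in the text.
    public static Map<String, Integer> candidates(String text, int minCount) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            counts.merge(m.group(), 1, Integer::sum);
        }
        counts.values().removeIf(c -> c < minCount);
        return counts;
    }

    public static void main(String[] args) {
        String text = "Oscar Wilde was an Irish playwright in London. "
                    + "Oscar Wilde lived in London.";
        System.out.println(candidates(text, 2));
    }
}
```

In an interactive setting, the returned candidates would be shown to the user, who confirms or rejects them.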

13
Crawler & Search Engine

Crawler
The Slovenian internet archive is being crawled by the Text-Garden Crawler
Highly scalable indexing & search of text documents
E.g. indexing of 10M documents in several hours
14
Support for selected external sources

Text-Garden has special support for the following databases and services
Google Search (Web/News/Scholar)
DMoz/Open Directory Project
Medline
WordNet
Yahoo! Finance
CIA World-Fact-Book
Cordis project database
Cyc (OpenCyc/ResearchCyc)
Reuters datasets (old, new), ACM TechNews, ...
In preparation:
Wikipedia, EuroVoc, AgroVoc (FAO)
MSN Search, Yahoo Search

15
Technical aspects

Text-Garden is almost entirely written in portable C++
it compiles under Windows (Microsoft Visual C++, Borland C++) and Unix/Linux (GNU C++)
it runs on 32-bit and 64-bit platforms
it consists of 200,000 relatively compact lines of code

16
How to use Text-Garden functionality?

Text-Garden functionality can be accessed in a number of ways
As plain C++ classes: complete functionality
As a DLL library of 250 functions: simplified extract of major functionality
As command line utilities: 60 command line utilities that can be connected in a pipeline
Through GUI tools (e.g. DocAtlas, OntoGen, ...)
Through interfaces to several platforms (Java, Matlab, ...), see next slide

17
Multiplatform Text-Garden

Text-Garden has the following interfaces with the same API
C/C++ - through simplified DLL, native C++
Java - through JNI
.NET - e.g. accessible through C#, VB, ...
Matlab - through standard Matlab interface
Python - through standard Python interface
Mathematica, Prolog, R - in preparation
The API has 40 classes and 250 functions
interfaces to all the above platforms are generated automatically from the master Text-Garden header file
the next slides include some examples in Matlab and Java

18
Text Parsing to TFIDF (Matlab)

% Create a bag-of-words document base and add two HTML documents
BowDocBsId = NewBowDocBs('en523', 'porter');
DId(1) = AddBowDocBs_HtmlDocStr(BowDocBsId, 'Economics', ...
    ['There are several basic and incomplete questions that must be ' ...
     'answered in order to resolve the problems of economics ' ...
     'satisfactorily. The scarcity problem, for example, requires ' ...
     'answers to basic questions, such as what to produce, how to ' ...
     'produce it, and who gets what is produced. An economic system is ' ...
     'a way of answering these basic questions. Different economic ' ...
     'systems answer them differently.'], '', 1);
DId(2) = AddBowDocBs_HtmlDocStr(BowDocBsId, 'Oscar Wilde', ...
    ['Oscar Fingal O''Flahertie Wills Wilde (October 16, 1854 -- ' ...
     'November 30, 1900) was an Irish playwright, novelist, poet, short ' ...
     'story writer and Freemason. One of the most successful ' ...
     'playwrights of late Victorian London, and one of the greatest ' ...
     'celebrities of his day, known for his barbed and clever wit, he ' ...
     'suffered a dramatic downfall and was imprisoned after being ' ...
     'convicted in a famous trial for gross indecency (homosexual acts).'], ...
    '', 1);
% Generate the weighted document base
BowDocWgtBsId = GenBowDocWgtBs(BowDocBsId);
% Print every word of the first document together with its weight
WIds = GetBowDocwgtBs_DocWIds(BowDocWgtBsId, DId(1));
for WIdN = 0:(WIds-1)
    WId = GetBowDocWgtBs_DocWId(BowDocWgtBsId, DId(1), WIdN);
    WordStr = GetBowDocBs_WordStr(BowDocBsId, WId);
    WordWgt = GetBowWgtDocBs_DocWWgt(BowDocWgtBsId, DId(1), WId);
    sprintf('%s:%.5f', WordStr, WordWgt)
end
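
The slide does not say which TFIDF variant Text-Garden computes. As a hedged illustration, the Java sketch below uses one common form, weight = term count * ln(N / document frequency), on sparse count vectors; the class name TfIdfSketch and the choice of formula are assumptions, and the numbers it produces need not match Text-Garden's output.

```java
import java.util.*;

public class TfIdfSketch {
    // docs: one sparse vector per document (word id -> term count).
    // Returns one sparse vector per document (word id -> TFIDF weight).
    public static List<Map<Integer, Double>> weigh(List<Map<Integer, Integer>> docs) {
        // Document frequency: in how many documents each word occurs.
        Map<Integer, Integer> docFreq = new HashMap<>();
        for (Map<Integer, Integer> doc : docs) {
            for (Integer wordId : doc.keySet()) {
                docFreq.merge(wordId, 1, Integer::sum);
            }
        }
        List<Map<Integer, Double>> weighted = new ArrayList<>();
        for (Map<Integer, Integer> doc : docs) {
            Map<Integer, Double> w = new TreeMap<>();
            for (Map.Entry<Integer, Integer> e : doc.entrySet()) {
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                w.put(e.getKey(), e.getValue() * idf);
            }
            weighted.add(w);
        }
        return weighted;
    }
}
```

Under this formula a word that occurs in every document gets weight zero, so the weights emphasize words that distinguish one document from the others.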

19
SVM Classification (Java)

import si.ijs.jtextgarden.*;

public class SVM {
    public static void main(String[] args) {
        System.out.println("Loading JTextGardenLib...");
        JTextGardenLib tg = new JTextGardenLib();
        // Load a bag-of-words file and save its statistics
        System.out.println("Loading bow...");
        int BowDocBsId = tg.LoadBowDocBs("./data/topic50k.bow");
        tg.SaveBowDocBsStat(BowDocBsId, "./res/BowDocBsStat.txt");
        // Train a binary classifier for the "ECAT" category
        System.out.println("Training linear SVM binary classifier...");
        int ECatId = tg.GetBowDocBs_CId(BowDocBsId, "ECAT");
        int BinSVMBowMdId = tg.GetBinSVMBowMd(BowDocBsId, ECatId);
        System.out.println("Testing model...");
        String Doc1 = "There are several basic and incomplete questions that "
        ...

20
Google Web/News Querying (Java)

import si.ijs.jtextgarden.*;

public class Google {
    public static void main(String[] args) {
        System.out.println("Loading JTextGardenLib...");
        JTextGardenLib tg = new JTextGardenLib();
        // Web search: print the titles of the first 10 hits
        System.out.println("Web Search...");
        int WebRSetId = tg.GoogleWebSearch("slovenia", 50);
        System.out.println("Number of hits " + tg.GetRSet_Hits(WebRSetId));
        for (int HitN = 0; HitN < 10; HitN++) {
            System.out.println("Hit " + (HitN + 1) + " " + tg.GetRSet_HitTitleStr(WebRSetId, HitN));
        }
        // News search
        System.out.println("News Search...");
        int NewsRSetId = tg.GoogleNewsSearch("slovenia basketball");
        ...

21
Active Learning (Java)

import si.ijs.jtextgarden.*;
import java.io.*;

public class ActiveLearning {
    public static void main(String[] args) throws IOException {
        System.out.println("Loading JTextGardenLib...");
        JTextGardenLib tg = new JTextGardenLib();
        System.out.println("Loading bow...");
        int BowDocBsId = tg.LoadBowDocBs("./data/CiaWFB.partly.bow");
        // Set up active learning for the "Europe" category
        String CatNm = "Europe";
        int CatId = tg.GetBowDocBs_CId(BowDocBsId, CatNm);
        int BowALId = tg.NewBowAL(BowDocBsId, CatId);
        System.out.println("Starting Active Learning loop...");
        ...

22
Future developments

A textbook on text mining is being prepared around Text-Garden
Text-Garden will serve as the software covering most of the topics in the book
Text-Garden is being extended with other sets of functionality
Graph-Garden: dealing with graphs and networks
Media-Garden: dealing with images and videos
Semantic-Garden: dealing with Semantic-Web issues
Stream-Garden: dealing with streams of data

23
Availability

Text-Garden is under the LGPL licence
It is available from www.textmining.net
a new complete release will be uploaded soon

24
Text-Garden SMART

Ideally, Text-Garden can serve as an integrating platform for the tools and data developed and used within the project
especially the academic part of the project could benefit largely, and it can make integration into commercial platforms easier
the evaluation part of the project would benefit largely, since it needs commonly agreed procedures