Title: Text-Garden Software Suite Quick Overview
1 Text-Garden Software Suite Quick Overview
- Marko Grobelnik
- Jozef Stefan Institute
- Ljubljana, Slovenia
2 Outline
- What is Text-Garden?
- How is Text-Garden being built?
- Major functionalities
- Technical aspects
- Future developments
- Availability
- Text-Garden and SMART
3 What is Text-Garden?
- Text-Garden is a software library and collection of software tools for solving large-scale tasks dealing with structured, semi-structured and unstructured data
  - in particular, the emphasis of its functionality is on dealing with text
- It can be used in various ways, covering research and applied scenarios
- It is used by several institutions, including BT, MSR, and CMU
4 Some history
- The work started in 1996 as a set of C++ classes for dealing with text and performing text-learning tasks (two people working on it)
  - until 2002 it developed slowly, following the academic tasks on our agenda
- From 2003 on, Text-Garden became the central software platform JSI uses on many research and applied projects (10 people contributing)
  - all the solutions and results JSI works on eventually become part of the Text-Garden environment
5 Local JSI development of Text-Garden
[Diagram labels: Projects (SMART STREP, PASCAL NoE, SEKT IP, NEON IP), JSI team, Text-Garden]
6 Major functionalities
7 Functionality blocks
- Lexical text processing (tokenization, stop-words, stemming, n-grams, Wordnet)
- Named Entity Extraction (capitalization based)
- Unsupervised learning (KMeans, Hierarchical-KMeans, OneClassSVM)
- Cross Correlation (KCCA, matching text with other data)
- Semi-Supervised learning (Uncertainty sampling)
- Keyword Extraction (contrast, centroid, taxonomy based)
- Supervised learning (SVM, Winnow, Perceptron, NBayes)
- Large Taxonomies (dealing with DMoz, Medline)
- Dimensionality reduction (LSI, PCA)
- Crawling Web and Search Eng. (for large scale data acquisition)
- Visualization (Graph based, Tiling, Density based, ...)
- Scalable Search (inverted index)
8 Lexical processing
- Includes transformation from various formats into bag-of-words representation
  - text/html, many custom formats (Svm-light, Reuters, ...)
  - will support most text encodings
- Lexical processing includes
  - Tokenization
  - Stop-words removal
  - Stemming (Porter stemmer for English; we have ML for learning stemmers from lexicons for other languages)
  - Frequent N-Gram features (consecutive words co-appearing)
  - Proximity features (words co-appearing within a window)
  - Wordnet integration (words co-appearing in synsets or in e.g. hypernym relations)
- Output of this level is a .BOW (Bag-Of-Words) file
  - with dictionary and sparse vectors for documents
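The lexical steps listed on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Text-Garden code: the names `tokenize`, `crude_stem`, `features` and `build_bow` are hypothetical, the stop-word list is a toy one, the stemmer is a placeholder rather than the Porter stemmer, and all n-grams are kept instead of only the frequent ones.

```python
import re
from collections import Counter

# Toy stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Placeholder suffix stripper; NOT the Porter stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def features(text, max_n=2):
    """Return unigram and n-gram features after stop-word removal and stemming."""
    stems = [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]
    feats = list(stems)
    for n in range(2, max_n + 1):
        feats += [" ".join(stems[i:i + n]) for i in range(len(stems) - n + 1)]
    return feats


def build_bow(docs, max_n=2):
    """Build a dictionary (feature -> id) and one sparse {id: count} vector per document."""
    dictionary, vectors = {}, []
    for doc in docs:
        vec = {}
        for feat, count in Counter(features(doc, max_n)).items():
            feature_id = dictionary.setdefault(feat, len(dictionary))
            vec[feature_id] = count
        vectors.append(vec)
    return dictionary, vectors
```

The dictionary plus the list of sparse vectors mirrors, in spirit, what the slide describes as the content of a .BOW file; the actual file format is not shown on the slide.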
9 Unsupervised learning
- Algorithms
  - K-means clustering
    - clustering into flat clusters
  - Hierarchical-K-Means
    - creating hierarchy of clusters
  - One-Class-SVM
    - learning from positive class only
- result is .BowPart (BOW Partition) file
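As an illustration of the flat clustering named above, here is a minimal sketch of the standard k-means (Lloyd's) iteration in Python. It is not the Text-Garden implementation: it assumes dense vectors, squared Euclidean distance, and a deterministic initialisation from the first k points, whereas text clustering often uses cosine similarity on sparse vectors. The function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (assignments, centroids).

    Centroids start as the first k points so that the sketch is deterministic.
    """
    centroids = [list(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return assignments, centroids
```

A hierarchical variant could apply the same routine recursively to each resulting cluster, which is one common reading of "Hierarchical-K-Means"; the slide does not say how Text-Garden does it.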
10 Supervised learning
- The following algorithms are implemented
  - SVM (two class, regression)
  - Winnow
  - Perceptron
  - NaiveBayes
  - K-Nearest-Neighbor (KNN)
  - ...
- The result is a .BowMd (BOW Model) file, which is used for further classification
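To make one of the listed learners concrete, the following is a minimal sketch of the classic perceptron on sparse bag-of-words vectors. It assumes labels in {-1, +1} and vectors stored as `{feature_id: value}` dictionaries; the function names are hypothetical and the code does not reflect how Text-Garden stores its .BowMd models.

```python
def train_perceptron(examples, epochs=10):
    """Train on a list of (sparse_vector, label) pairs, label in {-1, +1}.

    Returns (weights, bias), where weights is a {feature_id: weight} dict.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        mistakes = 0
        for vec, label in examples:
            score = bias + sum(weights.get(f, 0.0) * v for f, v in vec.items())
            if label * score <= 0:  # misclassified or on the boundary: update
                for f, v in vec.items():
                    weights[f] = weights.get(f, 0.0) + label * v
                bias += label
                mistakes += 1
        if mistakes == 0:
            break  # every training example is classified correctly
    return weights, bias


def classify(model, vec):
    """Return +1 or -1 for a sparse vector under a trained (weights, bias) model."""
    weights, bias = model
    score = bias + sum(weights.get(f, 0.0) * v for f, v in vec.items())
    return 1 if score > 0 else -1
```

The returned `(weights, bias)` pair plays the role of the model that is later applied to new documents.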
11 Dimensionality reduction
- Transform original space into a low-dimensional one and project original data
- Two classical methods
  - Latent Semantic Indexing (LSI)
    - efficient implementation, working with sparse matrices
  - Principal Component Analysis (PCA)
- result is .SemSpace (Semantic Space) file which is used for projecting the data
- We can e.g. project original .BOW file into transformed lower dimensional .BOW file
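The idea of learning a projection and then applying it to the data can be sketched with power iteration for the first principal component. This is a simplified illustration under stated assumptions (dense rows, a single component, hypothetical function names), not the sparse LSI implementation mentioned on the slide.

```python
import math


def top_component(rows, iterations=200):
    """Estimate the first principal direction of the rows by power iteration.

    Returns (means, component). The component has unit length unless the
    data has no variance along the start vector.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    v = [1.0] * d  # arbitrary non-zero start vector
    for _ in range(iterations):
        # Multiply by X^T X without forming the d x d matrix: X v, then X^T (X v).
        xv = [sum(a * b for a, b in zip(row, v)) for row in centered]
        w = [sum(centered[i][j] * xv[i] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return means, v


def project(row, means, component):
    """Coordinate of a row along the learned component."""
    return sum((x - m) * c for x, m, c in zip(row, means, component))
```

Here the `(means, component)` pair stands in for the learned semantic space, and `project` for the step that maps an original vector into the lower-dimensional one.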
12 Named Entity Extraction
- Simple and efficient NEE
  - it is based on word capitalization
- Candidate NEs (words and sequences of words) need to be capitalized
- Heuristic rule: capitalized candidates must appear within text at least once
  - we handle exceptions separately
- Works well with some user interaction
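A capitalization-based extractor of this kind might look like the sketch below. It rests on one explicit assumption: "within text" is read here as "somewhere other than the start of a sentence", where capitalization carries no information. The function name is hypothetical, and the exception handling and user interaction mentioned on the slide are not modelled.

```python
import re


def candidate_entities(text):
    """Collect runs of capitalized words seen at least once away from a sentence start."""
    candidates = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"[A-Za-z]+", sentence)
        run, run_start = [], 0
        for i, word in enumerate(words + [""]):  # "" is a sentinel that flushes the last run
            if word[:1].isupper():
                if not run:
                    run_start = i
                run.append(word)
            else:
                # Keep the run only if it does not begin at the sentence start.
                if run and run_start > 0:
                    candidates.add(" ".join(run))
                run = []
    return candidates
```

For example, `candidate_entities("The writer Oscar Wilde lived in London. London was busy.")` yields the two candidates `Oscar Wilde` and `London`, while the sentence-initial `The` is discarded.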
13 Crawler and Search Engine
- Crawler
  - Slovenian internet archive being crawled by Text-Garden Crawler
- Highly scalable indexing and search of text documents
  - E.g. indexing of 10M documents in several hours
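The scalable search mentioned here is described elsewhere in the deck as being based on an inverted index. A minimal in-memory sketch of that data structure follows; the class name, the tokenization, and the AND query semantics are assumptions made for illustration, and nothing here reflects the scale or the on-disk layout of the real system.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Maps each term to the set of ids of the documents that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Index one document under the given id."""
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return sorted ids of documents containing every query term."""
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return []
        # Intersect starting from the rarest term to keep intermediate sets small.
        sets = sorted((self.postings.get(t, set()) for t in terms), key=len)
        result = set(sets[0])
        for other in sets[1:]:
            result &= other
        return sorted(result)
```

Lookup cost depends on the length of the posting lists for the query terms rather than on the total number of documents, which is what makes the structure attractive for large collections.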
14 Support for selected external sources
- Text-Garden has special support for the following databases and services
  - Google Search (Web/News/Scholar)
  - DMoz/Open Directory Project
  - Medline
  - WordNet
  - Yahoo! Finance
  - CIA World-Fact-Book
  - Cordis project database
  - Cyc (OpenCyc/ResearchCyc)
  - Reuters datasets (old, new), ACM TechNews, ...
- In preparation
  - Wikipedia, EuroVoc, AgroVoc (FAO)
  - MSN Search, Yahoo Search
15 Technical aspects
- Text-Garden is almost entirely written in portable C++
  - it compiles under Windows (Microsoft Visual C++, Borland C++) and Unix/Linux (GNU C++)
  - it runs on 32-bit and 64-bit platforms
  - it consists of 200,000 relatively compact lines of code
16 How to use Text-Garden functionality?
- Text-Garden functionality can be accessed in a number of ways
  - As plain C++ classes
    - Complete functionality
  - As a DLL library of 250 functions
    - Simplified extract of major functionality
  - As command line utilities
    - 60 command line utilities that can be connected in a pipeline
  - Through GUI tools
    - (e.g. DocAtlas, OntoGen, ...)
  - Through interfaces to several platforms
    - (Java, Matlab, ...), see next slide
17 Multiplatform Text-Garden
- Text-Garden has the following interfaces with the same API
  - C/C++ through simplified DLL, native C++
  - Java through JNI
  - .NET, e.g. accessible through C#, VB, ...
  - Matlab through standard Matlab interface
  - Python through standard Python interface
  - Mathematica, Prolog, R in preparation
- API has 40 classes and 250 functions
  - interfaces to all of the above platforms are generated automatically from the master Text-Garden header file
- next slides include some examples in Matlab and Java
18 Text Parsing to TFIDF (Matlab)
    % Create a document base with an English stop-word list and Porter stemming
    BowDocBsId = NewBowDocBs('en523', 'porter');
    % Add two documents given as (HTML) strings
    DId(1) = AddBowDocBs_HtmlDocStr(BowDocBsId, 'Economics', 'There are several basic and incomplete questions that must be answered in order to resolve the problems of economics satisfactorily. The scarcity problem, for example, requires answers to basic questions, such as what to produce, how to produce it, and who gets what is produced. An economic system is a way of answering these basic questions. Different economic systems answer them differently.', '', 1);
    DId(2) = AddBowDocBs_HtmlDocStr(BowDocBsId, 'Oscar Wilde', 'Oscar Fingal OFlahertie Wills Wilde (October 16, 1854 -- November 30, 1900) was an Irish playwright, novelist, poet, short story writer and Freemason. One of the most successful playwrights of late Victorian London, and one of the greatest celebrities of his day, known for his barbed and clever wit, he suffered a dramatic downfall and was imprisoned after being convicted in a famous trial for gross indecency (homosexual acts).', '', 1);
    % Generate weighted (TFIDF) vectors for the document base
    BowDocWgtBsId = GenBowDocWgtBs(BowDocBsId);
    % Print every word of the first document together with its weight
    WIds = GetBowDocwgtBs_DocWIds(BowDocWgtBsId, DId(1));
    for WIdN = 0:(WIds-1)
        WId = GetBowDocWgtBs_DocWId(BowDocWgtBsId, DId(1), WIdN);
        WordStr = GetBowDocBs_WordStr(BowDocBsId, WId);
        WordWgt = GetBowWgtDocBs_DocWWgt(BowDocWgtBsId, DId(1), WId);
        sprintf('%s %.5f', WordStr, WordWgt)
    end
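For readers unfamiliar with the weighting that the listing above produces, here is a small Python sketch of one common TFIDF variant: raw term frequency times log(N / document frequency), followed by L2 normalisation. The slide does not state which exact formula Text-Garden uses, so this choice and the function name `tfidf` are assumptions.

```python
import math
from collections import Counter


def tfidf(docs):
    """Compute one {term: weight} dict per document from a list of token lists.

    Weight = term frequency * log(N / document frequency), then L2-normalised.
    """
    n = len(docs)
    document_frequency = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_frequency = Counter(doc)
        vec = {
            term: count * math.log(n / document_frequency[term])
            for term, count in term_frequency.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {term: w / norm for term, w in vec.items()}
        vectors.append(vec)
    return vectors
```

Terms that occur in every document receive weight zero under this variant, which is why frequent but uninformative words contribute little to document similarity.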
19 SVM Classification (Java)
    import si.ijs.jtextgarden.*;

    public class SVM {
        public static void main(String[] args) {
            System.out.println("Loading JTextGardenLib...");
            JTextGardenLib tg = new JTextGardenLib();

            // Load a bag-of-words file and save its statistics
            System.out.println("Loading bow...");
            int BowDocBsId = tg.LoadBowDocBs("./data/topic50k.bow");
            tg.SaveBowDocBsStat(BowDocBsId, "./res/BowDocBsStat.txt");

            // Train a binary linear SVM for the category "ECAT"
            System.out.println("Training linear SVM binary classifier...");
            int ECatId = tg.GetBowDocBs_CId(BowDocBsId, "ECAT");
            int BinSVMBowMdId = tg.GetBinSVMBowMd(BowDocBsId, ECatId);

            System.out.println("Testing model...");
            String Doc1 = "There are several basic and incomplete questions that "
            // ... (the listing is cut off on the slide)
        }
    }
20 Google Web/News Querying (Java)
    import si.ijs.jtextgarden.*;

    public class Google {
        public static void main(String[] args) {
            System.out.println("Loading JTextGardenLib...");
            JTextGardenLib tg = new JTextGardenLib();

            // Web search: request 50 results and print the first 10 titles
            System.out.println("Web Search...");
            int WebRSetId = tg.GoogleWebSearch("slovenia", 50);
            System.out.println("Number of hits " + tg.GetRSet_Hits(WebRSetId));
            for (int HitN = 0; HitN < 10; HitN++) {
                System.out.println("Hit " + (HitN + 1) + " " + tg.GetRSet_HitTitleStr(WebRSetId, HitN));
            }

            // News search
            System.out.println("News Search...");
            int NewsRSetId = tg.GoogleNewsSearch("slovenia basketball");
            // ... (the listing is cut off on the slide)
        }
    }
21 Active Learning (Java)
    import si.ijs.jtextgarden.*;
    import java.io.*;

    public class ActiveLearning {
        public static void main(String[] args) throws IOException {
            System.out.println("Loading JTextGardenLib...");
            JTextGardenLib tg = new JTextGardenLib();

            // Load a bag-of-words file and pick the target category
            System.out.println("Loading bow...");
            int BowDocBsId = tg.LoadBowDocBs("./data/CiaWFB.partly.bow");
            String CatNm = "Europe";
            int CatId = tg.GetBowDocBs_CId(BowDocBsId, CatNm);

            // Create an active-learning session for that category
            int BowALId = tg.NewBowAL(BowDocBsId, CatId);
            System.out.println("Starting Active Learning loop...");
            // ... (the listing is cut off on the slide)
        }
    }
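Since the listing above stops just before the loop itself, the general shape of an uncertainty-sampling loop (the strategy the deck names under semi-supervised learning) is sketched below in Python. Everything in it is hypothetical: the function names, the `oracle` standing in for the human annotator, and the toy `threshold_trainer` are illustrations, not the Text-Garden API.

```python
def most_uncertain(unlabeled, score):
    """Pick the item whose score is closest to the decision boundary (score 0)."""
    return min(unlabeled, key=lambda item: abs(score(item)))


def active_learning(pool, oracle, train, rounds):
    """Generic uncertainty-sampling loop.

    pool   - list of unlabeled items
    oracle - function item -> label in {-1, +1} (stands in for the annotator)
    train  - function list of (item, label) -> scoring function item -> float
    Returns (labeled pairs in query order, final scoring function).
    """
    labeled, unlabeled = [], list(pool)

    def score(item):
        return 0.0  # before any training, every item is equally uncertain

    for _ in range(rounds):
        if not unlabeled:
            break
        query = most_uncertain(unlabeled, score)
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))
        score = train(labeled)  # retrain on everything labeled so far
    return labeled, score


def threshold_trainer(labeled):
    """Toy learner for numbers: threshold halfway between the two classes."""
    positives = [x for x, y in labeled if y == 1]
    negatives = [x for x, y in labeled if y == -1]
    if not positives or not negatives:
        return lambda x: 0.0
    threshold = (min(positives) + max(negatives)) / 2
    return lambda x: x - threshold
```

With a pool of numbers and an oracle that labels values of 5 or more as positive, the loop quickly concentrates its queries around the class boundary instead of labeling items far from it.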
22 Future developments
- A textbook on text mining is being prepared around Text-Garden
  - Text-Garden will serve as the software covering most of the topics within the book
- Text-Garden is being extended with other sets of functionalities
  - Graph-Garden, dealing with graphs and networks
  - Media-Garden, dealing with images and videos
  - Semantic-Garden, dealing with Semantic-Web issues
  - Stream-Garden, dealing with streams of data
23 Availability
- Text-Garden is released under the LGPL licence
- It is available from www.textmining.net
  - a new complete release will be uploaded soon
24 Text-Garden and SMART
- Ideally, Text-Garden can serve as an integrating platform for the tools and data developed and used within the project
  - especially the academic part of the project could benefit greatly, and integration with commercial platforms could become easier
  - the evaluation part of the project would benefit greatly, since it needs commonly agreed procedures
25 Some images from Text-Garden
[Screenshots: Document-Atlas, Content-Land, SEKTbar, Contexter, Semantic-Graphs, Onto-Genesis]