TextGarden Software Suite Quick Overview


1
Text-Garden Software Suite Quick Overview
  • Marko Grobelnik
  • Jozef Stefan Institute
  • Ljubljana, Slovenia

2
Outline
  • What is Text-Garden?
  • How is Text-Garden being built?
  • Major functionalities
  • Technical aspects
  • Future developments
  • Availability
  • Text-Garden and SMART

3
What is Text-Garden?
  • Text-Garden is a software library and collection
    of software tools for solving large-scale tasks
    dealing with structured, semi-structured and
    unstructured data
      • in particular, the functionality emphasizes
        dealing with text
  • It can be used in various ways, covering research
    and applied scenarios
  • It is used by several institutions, such as BT,
    MSR, CMU, ...

4
Some history
  • The work started in 1996 as a set of C++ classes
    for dealing with text and performing
    text-learning tasks (two people working on it)
  • Until 2002 it developed slowly, following the
    academic tasks on our agenda
  • From 2003 on, Text-Garden became the central
    software platform that JSI uses in many research
    and applied projects (10 people contributing)
  • All the solutions and results JSI works on
    eventually become part of the Text-Garden
    environment

5
[Diagram relating the JSI team (local JSI development of Text-Garden) and the projects SMART STREP, PASCAL NoE, SEKT IP and NEON IP to Text-Garden]

6
Major functionalities

7
Functionality blocks
  • Lexical text processing (tokenization,
    stop-words, stemming, n-grams, WordNet)
  • Named Entity Extraction (capitalization based)
  • Unsupervised learning (KMeans,
    Hierarchical-KMeans, OneClassSVM)
  • Cross Correlation (KCCA, matching text with other
    data)
  • Semi-Supervised learning (Uncertainty sampling)
  • Keyword Extraction (contrast, centroid, taxonomy
    based)
  • Supervised learning (SVM, Winnow, Perceptron,
    NBayes)
  • Large Taxonomies (dealing with DMoz, Medline)
  • Dimensionality reduction (LSI, PCA)
  • Crawling Web and Search Eng. (for large scale
    data acquisition)
  • Visualization (Graph based, Tiling, Density
    based, ...)
  • Scalable Search (inverted index)

8
Lexical processing
  • Includes transformation from various formats into
    a bag-of-words representation
      • text/html, many custom formats (Svm-light,
        Reuters, ...)
      • will support most text encodings
  • Lexical processing includes
      • Tokenization
      • Stop-word removal
      • Stemming (Porter stemmer for English; machine
        learning is used to learn stemmers from
        lexicons for other languages)
      • Frequent n-gram features (consecutive words
        co-appearing)
      • Proximity features (words co-appearing within
        a window)
      • WordNet integration (words co-appearing in
        synsets or in e.g. hypernym relations)
  • The output of this level is a .BOW (Bag-Of-Words)
    file
      • with a dictionary and sparse vectors for
        documents
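
A minimal Java sketch of the steps listed above: tokenization, stop-word removal, consecutive-word n-grams and a sparse bag-of-words vector. The class name `BowSketch`, the tiny stop-word list and the in-memory dictionary are illustrative assumptions; this is not Text-Garden's code and does not reproduce its .BOW file format. Stemming is left out.

```java
import java.util.*;

final class BowSketch {
    // Illustrative stop-word list (an assumption; real lists are much longer).
    static final Set<String> STOP = Set.of("the", "a", "an", "of", "is", "and", "to");

    // Lower-case the text, split on non-letters and drop stop-words.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty() && !STOP.contains(t)) out.add(t);
        }
        return out;
    }

    // Unigrams plus n-grams of adjacent tokens up to length maxN.
    // Adjacency is measured after stop-word removal.
    static List<String> ngrams(List<String> tokens, int maxN) {
        List<String> out = new ArrayList<>();
        for (int n = 1; n <= maxN; n++)
            for (int i = 0; i + n <= tokens.size(); i++)
                out.add(String.join("_", tokens.subList(i, i + n)));
        return out;
    }

    // Map each feature to a dictionary id and count occurrences (sparse vector).
    static Map<Integer, Integer> toSparse(List<String> feats, Map<String, Integer> dict) {
        Map<Integer, Integer> vec = new TreeMap<>();
        for (String f : feats) {
            Integer id = dict.get(f);
            if (id == null) { id = dict.size(); dict.put(f, id); }
            vec.merge(id, 1, Integer::sum);
        }
        return vec;
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = new HashMap<>();
        List<String> feats = ngrams(tokenize("The scarcity problem is the basic problem"), 2);
        System.out.println(feats);
        System.out.println(toSparse(feats, dict));
    }
}
```

The dictionary is shared across documents, so feature ids stay consistent from one sparse vector to the next.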

9
Unsupervised learning
  • Algorithms
      • K-means clustering: clustering into flat
        clusters
      • Hierarchical-K-Means: creating a hierarchy of
        clusters
      • One-Class-SVM: learning from the positive class
        only
  • The result is a .BowPart (BOW Partition) file
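
As an illustration of flat K-means clustering, here is a small Java sketch of Lloyd's algorithm on dense vectors. The class name `KMeansSketch` and the choice of seeding the centroids with the first k points are assumptions made for a deterministic example (it requires k to be at most the number of points); Text-Garden's own implementation and its .BowPart output are not shown on the slides.

```java
import java.util.*;

final class KMeansSketch {
    // Squared Euclidean distance between two vectors of equal length.
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    // Returns the cluster index assigned to each input vector.
    static int[] cluster(double[][] x, int k, int iters) {
        int dim = x[0].length;
        double[][] c = new double[k][];
        for (int j = 0; j < k; j++) c[j] = x[j].clone();   // seed with the first k points
        int[] assign = new int[x.length];
        for (int it = 0; it < iters; it++) {
            // Assignment step: nearest centroid for every point.
            for (int i = 0; i < x.length; i++) {
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (dist2(x[i], c[j]) < dist2(x[i], c[best])) best = j;
                assign[i] = best;
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sum = new double[k][dim];
            int[] n = new int[k];
            for (int i = 0; i < x.length; i++) {
                n[assign[i]]++;
                for (int d = 0; d < dim; d++) sum[assign[i]][d] += x[i][d];
            }
            for (int j = 0; j < k; j++)
                if (n[j] > 0)
                    for (int d = 0; d < dim; d++) c[j][d] = sum[j][d] / n[j];
        }
        return assign;
    }

    public static void main(String[] args) {
        double[][] docs = {{1, 0}, {0, 1}, {0.9, 0.1}, {0.1, 0.9}};
        System.out.println(Arrays.toString(cluster(docs, 2, 10)));
    }
}
```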

10
Supervised learning
  • The following algorithms are implemented
      • SVM (two-class, regression)
      • Winnow
      • Perceptron
      • NaiveBayes
      • K-Nearest-Neighbor (KNN)
  • The result is a .BowMd (BOW Model) file, which is
    used for further classification
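
Of the algorithms listed, the perceptron is the shortest to write down. The following Java sketch trains a binary perceptron on dense vectors with labels in {-1, +1}. The class name `PerceptronSketch` and the dense representation are assumptions for illustration; it does not read or write .BowMd files and is not Text-Garden's implementation.

```java
final class PerceptronSketch {
    // Linear score w.x + b, with the bias stored in the last entry of w.
    static double score(double[] w, double[] x) {
        double s = w[w.length - 1];
        for (int d = 0; d < x.length; d++) s += w[d] * x[d];
        return s;
    }

    // Classic perceptron rule: on every mistake, add y * x to the weights.
    static double[] train(double[][] x, int[] y, int epochs) {
        double[] w = new double[x[0].length + 1];
        for (int e = 0; e < epochs; e++)
            for (int i = 0; i < x.length; i++)
                if (y[i] * score(w, x[i]) <= 0) {          // misclassified or on the boundary
                    for (int d = 0; d < x[i].length; d++) w[d] += y[i] * x[i][d];
                    w[w.length - 1] += y[i];
                }
        return w;
    }

    static int predict(double[] w, double[] x) {
        return score(w, x) > 0 ? 1 : -1;
    }

    public static void main(String[] args) {
        double[][] x = {{2, 0}, {1, 0}, {0, 2}, {0, 1}};
        int[] y = {1, 1, -1, -1};
        double[] w = train(x, y, 5);
        System.out.println(predict(w, new double[]{3, 0}));
        System.out.println(predict(w, new double[]{0, 3}));
    }
}
```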

11
Dimensionality reduction
  • Transform the original space into a
    low-dimensional one and project the original data
  • Two classical methods
      • Latent Semantic Indexing (LSI): efficient
        implementation, working with sparse matrices
      • Principal Component Analysis (PCA)
  • The result is a .SemSpace (Semantic Space) file,
    which is used for projecting the data
  • We can e.g. project an original .BOW file into a
    transformed lower-dimensional .BOW file
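
To make the "transform, then project" idea concrete, the Java sketch below finds the first principal component of a small dense data set by power iteration and projects a vector onto it. The class name `PcaSketch` and the method choice are assumptions for illustration only; the slides do not say how Text-Garden computes LSI or PCA. The sketch centers the data as PCA does, whereas LSI works on the uncentered term-document matrix, which keeps it sparse.

```java
final class PcaSketch {
    // First principal component via power iteration on the centered data.
    // Starts from the all-ones vector, so it fails if that start is
    // orthogonal to the leading component.
    static double[] topComponent(double[][] x, int iters) {
        int n = x.length, dim = x[0].length;
        double[] mean = new double[dim];
        for (double[] row : x)
            for (int d = 0; d < dim; d++) mean[d] += row[d] / n;
        double[] v = new double[dim];
        java.util.Arrays.fill(v, 1.0);
        for (int it = 0; it < iters; it++) {
            double[] next = new double[dim];
            for (double[] row : x) {
                double proj = 0;
                for (int d = 0; d < dim; d++) proj += (row[d] - mean[d]) * v[d];
                for (int d = 0; d < dim; d++) next[d] += proj * (row[d] - mean[d]);
            }
            double norm = 0;
            for (double c : next) norm += c * c;
            norm = Math.sqrt(norm);
            if (norm == 0) break;                          // no variance along v
            for (int d = 0; d < dim; d++) v[d] = next[d] / norm;
        }
        return v;
    }

    // Coordinate of x along the unit direction v.
    static double project(double[] v, double[] x) {
        double s = 0;
        for (int d = 0; d < x.length; d++) s += v[d] * x[d];
        return s;
    }

    public static void main(String[] args) {
        double[][] data = {{1, 2}, {2, 4}, {3, 6}};
        double[] v = topComponent(data, 20);
        System.out.println(v[0] + " " + v[1]);
        System.out.println(project(v, new double[]{1, 2}));
    }
}
```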

12
Named Entity Extraction
  • Simple and efficient NEE
      • it is based on word capitalization
  • Candidate NEs (words and sequences of words) need
    to be capitalized
  • Heuristic rule: capitalized candidates must
    appear within the text at least once
      • we handle exceptions separately
  • Works well with some user interaction
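
A rough Java sketch of capitalization-based candidate extraction: every maximal run of capitalized words becomes a candidate, and a caller-supplied exception set removes unwanted ones (for example sentence-initial words). The class name `NeeSketch`, the regular expression and the exception mechanism are assumptions for illustration; the slide does not specify Text-Garden's actual rules.

```java
import java.util.*;
import java.util.regex.*;

final class NeeSketch {
    // One or more consecutive capitalized words, e.g. "Oscar Wilde".
    static final Pattern CAND =
        Pattern.compile("\\b[A-Z][a-z]+(?:\\s+[A-Z][a-z]+)*");

    // Returns each candidate with the number of times it appears in the text.
    static Map<String, Integer> candidates(String text, Set<String> exceptions) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = CAND.matcher(text);
        while (m.find()) {
            String c = m.group().replaceAll("\\s+", " ");
            if (!exceptions.contains(c)) counts.merge(c, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "Oscar Wilde was an Irish playwright. One of the most "
                    + "successful playwrights of late Victorian London, Wilde was famous.";
        System.out.println(candidates(text, Set.of("One")));
    }
}
```

The exception set is where user interaction would come in: a reviewer can add false positives to it and rerun the extraction.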

13
Crawler and Search Engine
  • Crawler
      • The Slovenian internet archive is being crawled
        by the Text-Garden Crawler
  • Highly scalable indexing and search of text
    documents
      • E.g. indexing of 10M documents in several hours
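
The functionality overview names an inverted index as the basis of scalable search. A minimal in-memory Java sketch of that data structure follows; the class name `InvertedIndexSketch` and the AND-query semantics are assumptions for illustration, and nothing here reflects the scale or on-disk layout of the Text-Garden indexer.

```java
import java.util.*;

final class InvertedIndexSketch {
    // term -> sorted set of ids of the documents that contain it
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    void add(int docId, String text) {
        for (String t : text.toLowerCase().split("\\W+"))
            if (!t.isEmpty())
                postings.computeIfAbsent(t, k -> new TreeSet<>()).add(docId);
    }

    // AND query: ids of the documents that contain every query term.
    SortedSet<Integer> search(String query) {
        SortedSet<Integer> result = null;
        for (String t : query.toLowerCase().split("\\W+")) {
            if (t.isEmpty()) continue;
            SortedSet<Integer> p = postings.getOrDefault(t, new TreeSet<>());
            if (result == null) result = new TreeSet<>(p);
            else result.retainAll(p);
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.add(1, "Slovenia basketball news");
        idx.add(2, "Slovenia economy");
        idx.add(3, "basketball scores");
        System.out.println(idx.search("slovenia basketball"));
    }
}
```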

14
Support for selected external sources
  • Text-Garden has special support for the following
    databases and services
      • Google Search (Web/News/Scholar)
      • DMoz/Open Directory Project
      • Medline
      • WordNet
      • Yahoo! Finance
      • CIA World-Fact-Book
      • Cordis project database
      • Cyc (OpenCyc/ResearchCyc)
      • Reuters datasets (old, new), ACM TechNews, ...
  • In preparation
      • Wikipedia, EuroVoc, AgroVoc (FAO)
      • MSN Search, Yahoo Search

15
Technical aspects
  • Text-Garden is almost entirely written in
    portable C++
      • it compiles under Windows (Microsoft Visual
        C++, Borland C++) and Unix/Linux (GNU C++)
      • it runs on 32-bit and 64-bit platforms
      • it consists of 200,000 relatively compact
        lines of code

16
How to use Text-Garden functionality?
  • Text-Garden functionality can be accessed in a
    number of ways
      • As plain C++ classes: complete functionality
      • As a DLL library of 250 functions: a simplified
        extract of the major functionality
      • As command line utilities: 60 command line
        utilities that can be connected in a pipeline
      • Through GUI tools (e.g. DocAtlas, OntoGen, ...)
      • Through interfaces to several platforms (Java,
        Matlab, ...); see the next slide

17
Multiplatform Text-Garden
  • Text-Garden has the following interfaces with the
    same API
      • C/C++: through the simplified DLL and native C++
      • Java: through JNI
      • .NET: e.g. accessible through C#, VB, ...
      • Matlab: through the standard Matlab interface
      • Python: through the standard Python interface
      • Mathematica, Prolog, R: in preparation
  • The API has 40 classes and 250 functions
      • interfaces to all the above platforms are
        generated automatically from the master
        Text-Garden header file
  • The next slides include some examples in Matlab
    and Java

18
Text Parsing to TFIDF (Matlab)
    % Create a new bag-of-words document base
    BowDocBsId = NewBowDocBs('en523', 'porter');
    % Add two documents to it
    DId(1) = AddBowDocBs_HtmlDocStr(BowDocBsId, 'Economics', 'There are several basic and incomplete questions that must be answered in order to resolve the problems of economics satisfactorily. The scarcity problem, for example, requires answers to basic questions, such as what to produce, how to produce it, and who gets what is produced. An economic system is a way of answering these basic questions. Different economic systems answer them differently.', '', 1);
    DId(2) = AddBowDocBs_HtmlDocStr(BowDocBsId, 'Oscar Wilde', 'Oscar Fingal OFlahertie Wills Wilde (October 16, 1854 -- November 30, 1900) was an Irish playwright, novelist, poet, short story writer and Freemason. One of the most successful playwrights of late Victorian London, and one of the greatest celebrities of his day, known for his barbed and clever wit, he suffered a dramatic downfall and was imprisoned after being convicted in a famous trial for gross indecency (homosexual acts).', '', 1);
    % Generate the weighted document vectors
    BowDocWgtBsId = GenBowDocWgtBs(BowDocBsId);
    % Print every word of the first document with its weight
    WIds = GetBowDocwgtBs_DocWIds(BowDocWgtBsId, DId(1));
    for WIdN = 0:(WIds-1)
      WId = GetBowDocWgtBs_DocWId(BowDocWgtBsId, DId(1), WIdN);
      WordStr = GetBowDocBs_WordStr(BowDocBsId, WId);
      WordWgt = GetBowWgtDocBs_DocWWgt(BowDocWgtBsId, DId(1), WId);
      sprintf('%s:%.5f', WordStr, WordWgt)
    end
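
The Matlab listing prints each word of a document with its TFIDF weight, but the slide does not state which weighting formula Text-Garden uses. The Java sketch below shows one common variant as an assumption: term frequency times ln(N / document frequency), followed by L2 normalization. The class name `TfIdfSketch` is hypothetical and the input is already tokenized.

```java
import java.util.*;

final class TfIdfSketch {
    // Returns one term -> weight map per document.
    static List<Map<String, Double>> weigh(List<List<String>> docs) {
        // Document frequency: in how many documents each term occurs.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> d : docs)
            for (String t : new HashSet<>(d)) df.merge(t, 1, Integer::sum);

        List<Map<String, Double>> out = new ArrayList<>();
        for (List<String> d : docs) {
            Map<String, Double> w = new TreeMap<>();
            for (String t : d) w.merge(t, 1.0, Double::sum);      // raw term frequency
            double norm = 0;
            for (Map.Entry<String, Double> e : w.entrySet()) {
                double idf = Math.log((double) docs.size() / df.get(e.getKey()));
                double v = e.getValue() * idf;
                e.setValue(v);
                norm += v * v;
            }
            norm = Math.sqrt(norm);
            if (norm > 0)                                          // scale to unit length
                for (Map.Entry<String, Double> e : w.entrySet())
                    e.setValue(e.getValue() / norm);
            out.add(w);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("economy", "economy", "system"),
            List.of("system", "poet"));
        for (Map.Entry<String, Double> e : weigh(docs).get(0).entrySet())
            System.out.println(String.format("%s:%.5f", e.getKey(), e.getValue()));
    }
}
```

A term that occurs in every document gets weight zero under this variant, which is why "system" contributes nothing in the example.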

19
SVM Classification (Java)
    import si.ijs.jtextgarden.*;

    public class SVM {
      public static void main(String[] args) {
        System.out.println("Loading JTextGardenLib...");
        JTextGardenLib tg = new JTextGardenLib();
        // Load a bag-of-words file and save its statistics
        System.out.println("Loading bow...");
        int BowDocBsId = tg.LoadBowDocBs("./data/topic50k.bow");
        tg.SaveBowDocBsStat(BowDocBsId, "./res/BowDocBsStat.txt");
        // Train a binary model for the category "ECAT"
        System.out.println("Training linear SVM binary classifier...");
        int ECatId = tg.GetBowDocBs_CId(BowDocBsId, "ECAT");
        int BinSVMBowMdId = tg.GetBinSVMBowMd(BowDocBsId, ECatId);
        System.out.println("Testing model...");
        String Doc1 = "There are several basic and incomplete questions that "
        ...

20
Google Web and News Querying (Java)
    import si.ijs.jtextgarden.*;

    public class Google {
      public static void main(String[] args) {
        System.out.println("Loading JTextGardenLib...");
        JTextGardenLib tg = new JTextGardenLib();
        // Web search, then print the titles of the first 10 hits
        System.out.println("Web Search...");
        int WebRSetId = tg.GoogleWebSearch("slovenia", 50);
        System.out.println("Number of hits " + tg.GetRSet_Hits(WebRSetId));
        for (int HitN = 0; HitN < 10; HitN++) {
          System.out.println("Hit " + (HitN+1) + " " + tg.GetRSet_HitTitleStr(WebRSetId, HitN));
        }
        // News search
        System.out.println("News Search...");
        int NewsRSetId = tg.GoogleNewsSearch("slovenia basketball");
        ...

21
Active Learning (Java)
    import si.ijs.jtextgarden.*;
    import java.io.*;

    public class ActiveLearning {
      public static void main(String[] args) throws IOException {
        System.out.println("Loading JTextGardenLib...");
        JTextGardenLib tg = new JTextGardenLib();
        System.out.println("Loading bow...");
        int BowDocBsId = tg.LoadBowDocBs("./data/CiaWFB.partly.bow");
        // Set up active learning for the category "Europe"
        String CatNm = "Europe";
        int CatId = tg.GetBowDocBs_CId(BowDocBsId, CatNm);
        int BowALId = tg.NewBowAL(BowDocBsId, CatId);
        System.out.println("Starting Active Learning loop...");
        ...
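
The listing stops where the active learning loop begins. The functionality overview associates semi-supervised learning with uncertainty sampling, so the Java sketch below shows the core selection step of that technique: ask the user about the unlabeled document whose classifier score is closest to the decision boundary. The class name `UncertaintySamplingSketch` and the score array are assumptions for illustration; this is not the Text-Garden API.

```java
final class UncertaintySamplingSketch {
    // Index of the unlabeled example with the score closest to 0,
    // or -1 when every example is already labeled.
    static int nextQuery(double[] scores, boolean[] labeled) {
        int best = -1;
        for (int i = 0; i < scores.length; i++)
            if (!labeled[i]
                    && (best < 0 || Math.abs(scores[i]) < Math.abs(scores[best])))
                best = i;
        return best;
    }

    public static void main(String[] args) {
        double[] scores = {2.0, -0.1, 0.5, -3.0};   // hypothetical classifier outputs
        boolean[] labeled = new boolean[scores.length];
        int q;
        while ((q = nextQuery(scores, labeled)) >= 0) {
            System.out.println("Ask the user about document " + q);
            labeled[q] = true;   // a real loop would retrain and rescore here
        }
    }
}
```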

22
Future developments
  • A textbook on text mining is being prepared
    around Text-Garden
      • Text-Garden will serve as the software covering
        most of the topics within the book
  • Text-Garden is being extended with other sets of
    functionality
      • Graph-Garden: dealing with graphs and networks
      • Media-Garden: dealing with images and videos
      • Semantic-Garden: dealing with Semantic-Web
        issues
      • Stream-Garden: dealing with streams of data

23
Availability
  • Text-Garden is released under the LGPL licence
  • It is available from www.textmining.net
      • a new complete release will be uploaded soon

24
Text-Garden and SMART
  • Ideally, Text-Garden can serve as an integrating
    platform for the tools and data developed and
    used within the project
      • especially the academic part of the project
        could benefit greatly, and it can make
        integration with commercial platforms easier
      • the evaluation part of the project would also
        benefit greatly, since it needs commonly agreed
        procedures

25
Some images from Text-Garden
[Screenshots: Document-Atlas, Content-Land, SEKTbar, Contexter, Semantic-Graphs, Onto-Genesis]