Title: Machine Learning and the Semantic Web
1Machine Learning and the Semantic Web
- Hendrik Blockeel
- Katholieke Universiteit Leuven
- Department of Computer Science
- Thanks: Raymond Kosala, Nico Jacobs
2Overview
- Machine learning and data mining
- Relationship with semantic web
- Synergy between both
- Some concrete examples
- Document classification
- Information integration
- Conclusions
3Machine Learning & Data Mining
- Related technology, different focus
- Machine learning
- Programs that improve their performance on certain tasks
- Focus on adaptive behaviour
- Data mining
- Discovering implicit knowledge (regularities) in large amounts of data
- Focus on handling large amounts of data
- Very useful technology in the context of the Web
4Learning Agents
- Programs that
- Learn the user's preferences
- Make life for the user as simple as possible
- E.g., intelligent mail reader
- E.g., adaptive web pages
- Move links, create direct links, ...
- Index page synthesis (Perkowitz & Etzioni, IJCAI 1999)
- Learn how to find reliable information
- E.g., learn which other people have similar preferences to this user, use their opinions to make suggestions
- (Other applications: learning to play games, ...)
6Mining the Web
- Analyze data that are available on the Web
- Distinguish 3 types
- Web content mining
- Look in contents of documents (text, ...)
- Web structure mining
- Look at links between documents
- Web usage mining
- Look at user logs (e.g., who accessed a web page, which links are often used, ...)
7Web Content Mining
- Relies on information extraction
- E.g., in a text find keywords, ...
- Techniques from machine learning, statistics, ... used to guess from context
- what a word means
- what its function in the text is
- ...
- Fill a schema with specific slots, based on analysis of text
- Even more complicated: recognise objects in pictures, ...
- Information extraction (IE) is a complex matter
8Mining for Genes
- Jenssen et al. (2001), Nature Genetics 28, "A literature network of human genes"
- Mining MEDLINE database of abstracts
- Find names of genes occurring together
- Construct similarity graph
- Construct a database with this information
- Database contains knowledge no single individual has, or could obtain without data mining
- Similar techniques could be used on the web
- One extra problem: uncertainty about reliability
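The co-occurrence step above can be sketched in a few lines. This is only an illustration under stated assumptions: the toy `abstracts` list stands in for gene names already extracted from abstracts, and `cooccurrence_graph` and `min_count` are hypothetical names, not taken from the cited paper.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each "abstract" is reduced to the set of gene names found in it.
abstracts = [
    {"BRCA1", "TP53"},
    {"BRCA1", "TP53", "EGFR"},
    {"EGFR", "KRAS"},
    {"BRCA1", "TP53"},
]


def cooccurrence_graph(docs, min_count=2):
    """Return {(gene_a, gene_b): count} for pairs seen together in at least min_count docs."""
    counts = Counter()
    for genes in docs:
        # Sort so that each unordered pair is always counted under the same key.
        for pair in combinations(sorted(genes), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


graph = cooccurrence_graph(abstracts)
print(graph)
```

The edges that survive the threshold form the similarity graph; on Web data one would probably also want to weight each document by how reliable its source is.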
9Web Structure Mining
- Analyse structure of the web
- Which sites have many incoming / outgoing links?
- Identify hubs
- Find clusters of sites that are strongly interconnected
- Web communities
- ...
- E.g., Google
- Identifies important pages based on links that point to them (rather than contents of the page itself)
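To make "importance based on incoming links" concrete, here is a minimal sketch of a PageRank-style power iteration. It is a textbook simplification, not a description of what Google actually runs; the function name `link_scores`, the damping value and the four-page toy graph are all assumptions.

```python
def link_scores(links, damping=0.85, iterations=100):
    """Score pages by the links pointing to them (simplified PageRank-style iteration)."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split evenly over its outgoing links.
                share = damping * score[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page without outgoing links spreads its score over all pages.
                for t in pages:
                    new[t] += damping * score[page] / n
        score = new
    return score


# Hypothetical toy web: three pages link to "c", and "c" links back to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = link_scores(links)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Page "c" comes out on top purely because of who links to it; its contents are never inspected.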
10Web Usage Mining
- Log user behaviour
- Which links are often followed, in which order, how long a page is looked at, ...
- Possible at several levels
- General usage statistics
- User-specific statistics
- Relating behaviour to properties of user, insofar as available
- E.g., adaptive web sites
- Adaplix project
- automatic index page creation
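A minimal sketch of the "which links are often followed" statistic, assuming a time-ordered log of (user, page) visits. The log format, the page paths and the function name `transition_counts` are hypothetical and not taken from the Adaplix project.

```python
from collections import Counter

# Hypothetical access log, ordered by time: (user, page visited).
log = [
    ("u1", "/"),
    ("u2", "/"),
    ("u1", "/courses"),
    ("u2", "/courses"),
    ("u1", "/courses/cs101"),
    ("u2", "/staff"),
]


def transition_counts(entries):
    """Count how often each page-to-page step is taken, tracked per user."""
    last_page = {}
    counts = Counter()
    for user, page in entries:
        if user in last_page:
            counts[(last_page[user], page)] += 1
        last_page[user] = page
    return counts


counts = transition_counts(log)
print(counts.most_common(1))
```

Steps that are taken very often are candidates for more prominent links; an adaptive site could use such counts when deciding which links to move or add.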
11Web Mining As It Currently Is
- Machine learning / data mining strongly rely on
- Data quantity
- Data quality
- Quantity is usually not a problem on the Web
- Quality is!
- Much data not in easily processable format
- E.g., inside text documents: need information extraction
- Unstructured, poorly structured, heterogeneously structured
- Lots of noise
- ...
12How Is All This Related to the Semantic Web?
- There can be a synergy
- Machine learning can help with building the Semantic Web
- The Semantic Web will help mining the Web, making Web interfaces and agents more intelligent
13What Machine Learning Can Do for the Semantic Web
- Upgrading the current web to a semantic web involves a lot of work
- Can partially be automated!
- Examples
- Learning ontologies
- Automatic document classification
- Information integration
- ...
14Learning Ontologies
- Maedche & Staab (2001), Ontology learning for the semantic web
- View
- Manual creation of ontologies is very labour-intensive
- Fully automating the creation of ontologies is not feasible
- Hence: develop a tool that helps with building ontologies
- Basic components
- Good graphical interface (man-machine interaction)
- Powerful underlying machine learning techniques
15Text-To-Onto
- Framework
- Import / reuse existing ontologies
- Extract ontology from documents
- Identify new terms, map onto existing concepts or define new ones
- Identify relationships between concepts
- ...
- Many opportunities for general machine learning techniques
- Prune ontology
- Refine ontology
16Some Useful Techniques for Learning Ontologies
- Term extraction from texts
- Identification of concepts
- Hierarchical Clustering
- Clustering: finding groups of similar things
- Hierarchical clustering: clusters of clusters
- Taxonomy can be constructed through hierarchical clustering of concepts
- Association rules
- Find sets of terms that often occur together
- May indicate important relations
- E.g., events in texts often co-occur with locations
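The association-rule idea can be illustrated with a small sketch that scores "if a document mentions A, it also mentions B" by support and confidence. The toy `documents`, the thresholds and the function name are assumptions made for illustration; this is not the Text-To-Onto implementation.

```python
from itertools import permutations

# Hypothetical toy corpus: each document is reduced to the kinds of terms it mentions.
documents = [
    {"event", "location", "date"},
    {"event", "location"},
    {"person", "organisation"},
    {"event", "location", "person"},
    {"event", "date"},
]


def association_rules(docs, min_support=0.4, min_confidence=0.7):
    """Return rules (a, b, support, confidence), read as "a implies b"."""
    n = len(docs)
    terms = sorted(set().union(*docs))
    rules = []
    for a, b in permutations(terms, 2):
        has_a = sum(1 for d in docs if a in d)
        both = sum(1 for d in docs if a in d and b in d)
        support = both / n                           # fraction of documents containing both terms
        confidence = both / has_a if has_a else 0.0  # fraction of a-documents that also contain b
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules


rules = association_rules(documents)
for a, b, support, confidence in rules:
    print(f"{a} -> {b}  support={support:.2f}  confidence={confidence:.2f}")
```

On this toy corpus the rule linking events to locations passes both thresholds, which is the kind of hint an ontology engineer might turn into a relation between the two concepts.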
17Information Integration
- Doan, Domingos & Halevy, "Reconciling Schemas of Disparate Data Sources", ACM SIGMOD 2001
- Context
- Given databases with different schemas
- Find similarities in schemas, guess how concepts map onto each other
- Integrate the schemas
- Essentially the same as mapping ontologies onto each other
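As a rough illustration of "guess how concepts map onto each other", the sketch below matches field names by plain string similarity using Python's standard `difflib`. This is a deliberately naive stand-in and not the method of the cited paper; the field names, the threshold and the function name `match_schemas` are hypothetical.

```python
from difflib import SequenceMatcher


def match_schemas(source_fields, target_fields, threshold=0.5):
    """Map each source field to the most similar target field name, or None if none is close."""
    mapping = {}
    for s in source_fields:
        best, best_score = None, 0.0
        for t in target_fields:
            score = SequenceMatcher(None, s.lower(), t.lower()).ratio()
            if score > best_score:
                best, best_score = t, score
        mapping[s] = best if best_score >= threshold else None
    return mapping


# Hypothetical schemas of two customer databases.
mapping = match_schemas(
    ["cust_name", "phone_no", "zip"],
    ["customer_name", "phone_number", "postal_code"],
)
print(mapping)
```

Name similarity alone misses the pair "zip" and "postal_code", which is exactly where learning from the data values, or from an ontology, would be expected to help.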
18Automated Document Classification
- Mitchell et al.
- Based on examples of web pages, each labelled with what kind of page it is (course page, student page, ...), learn to classify new pages
- Can be based on contents of page, links pointing to page, typical structure of certain kinds of web sites (e.g. universities), ...
- Note: helps to relate objects to ontology
- Problem: how to get labelled examples
- Unlimited amount of unlabelled pages available
- But labelling them manually is labour intensive!
19Exploiting Unlabelled Data
- A solution: co-training (Blum & Mitchell 1998)
- Learn separate (imperfect) classifiers from disjoint sets of sufficient information
- E.g., learn to classify pages from
- Content of page ("Home page of CS 101")
- Links pointing to page ("CS 101")
- Take classifications that classifier A is most certain of, add these labels to training set for B (and vice versa)
- Repeat multiple times (kind of bootstrapping process)
- Co-training makes it possible to exploit large amounts of unlabelled data!
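The loop described above can be sketched as follows. This is a minimal illustration, assuming two word-list views per page (page content and link text) and a tiny naive Bayes learner as a stand-in; the class and function names, the toy pages and the parameters `rounds` and `per_round` are all hypothetical, and the exchange scheme follows the description on this slide rather than any particular published implementation.

```python
import math
from collections import Counter, defaultdict


class WordCountClassifier:
    """Tiny naive Bayes over word lists with add-one smoothing (a stand-in learner)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for words, label in zip(texts, labels):
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict_proba(self, words):
        """Return {label: probability} for one example given as a sequence of words."""
        total_docs = sum(self.class_counts.values())
        log_scores = {}
        for label, n_docs in self.class_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab) + 1
            score = math.log(n_docs / total_docs)
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        unnorm = {l: math.exp(s - top) for l, s in log_scores.items()}
        z = sum(unnorm.values())
        return {l: v / z for l, v in unnorm.items()}


def fit_views(train):
    """Train one classifier per view from its own list of (words, label) pairs."""
    return [WordCountClassifier().fit(*zip(*pairs)) for pairs in train]


def most_confident(clf, pool, view, k):
    """Return the k pool examples whose predicted label has the highest probability."""
    scored = []
    for example in pool:
        proba = clf.predict_proba(example[view])
        label = max(proba, key=proba.get)
        scored.append((proba[label], label, example))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


def co_train(labelled, unlabelled, rounds=2, per_round=2):
    # train[v] is the training set of the classifier that only sees view v.
    train = [[(ex[v], y) for ex, y in labelled] for v in (0, 1)]
    # pools[v] holds the unlabelled examples classifier v has not yet passed on.
    pools = [list(unlabelled), list(unlabelled)]
    for _ in range(rounds):
        clfs = fit_views(train)
        for v in (0, 1):
            other = 1 - v
            for _conf, label, example in most_confident(clfs[v], pools[v], v, per_round):
                # Classifier v labels the example; the other view's classifier learns from it.
                train[other].append((example[other], label))
                pools[v].remove(example)
    return fit_views(train)


# Each page is (words on the page, words in links pointing to it); all data is made up.
labelled = [
    ((("course", "homework", "syllabus"), ("cs", "101")), "course"),
    ((("student", "hobbies", "cv"), ("my", "homepage")), "student"),
]
unlabelled = [
    (("homework", "exam", "lecture"), ("cs", "202")),
    (("hobbies", "photos", "cv"), ("homepage", "of", "anna")),
    (("lecture", "exam"), ("course", "notes")),
]
clf_content, clf_links = co_train(labelled, unlabelled)
link_proba = clf_links.predict_proba(("course", "notes"))
print(max(link_proba, key=link_proba.get))
```

With only the two labelled pages, the link-text classifier has never seen the words "course" or "notes"; after exchanging confident labels with the content classifier it has, which is the bootstrapping effect the slide refers to.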
20What the Semantic Web Can Do for Machine Learning
- Will make mining the web much easier
- Reason 1: removal of ambiguity
- More precise knowledge of what is meant by certain terms
- Reason 2: structured vs. unstructured data
- Learning from structured data is much easier than from unstructured data
- Reason 3: availability of background knowledge
- Can be used to make better decisions when learning
21Removal of Ambiguity
- Example: text document classification
- E.g., given a text, tell in which newsgroups it belongs
- Typical approach: bag of words
- Look only at which words occur in the text, and how often
- Each time a word occurs that occurs mainly in one particular class, increase probability for that class
- But words are ambiguous!
- Increased classification accuracy can be expected by removing ambiguity
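A small sketch of the bag-of-words idea and of what ambiguity does to it. It assumes smoothed log word frequencies per class with equal class priors; the toy training texts and the sense tags such as `java#language` are hypothetical stand-ins for the kind of annotation a semantic web could supply.

```python
import math
from collections import Counter, defaultdict


def train_bag_of_words(examples):
    """Count how often each word occurs in each class; word order is ignored."""
    counts = defaultdict(Counter)
    for text, label in examples:
        counts[label].update(text.split())
    return counts


def class_scores(counts, text):
    """Sum smoothed log word frequencies per class (equal class priors assumed)."""
    vocab = {w for words in counts.values() for w in words}
    scores = {}
    for label, words in counts.items():
        total = sum(words.values())
        scores[label] = sum(
            math.log((words[w] + 1) / (total + len(vocab))) for w in text.split()
        )
    return scores


# Plain words: "java" is equally typical of both classes.
plain = train_bag_of_words([
    ("java compiler bug", "programming"),
    ("python java code", "programming"),
    ("java island beach", "travel"),
    ("bali java flight", "travel"),
])
# Same texts with the ambiguous word replaced by a (hypothetical) sense tag.
tagged = train_bag_of_words([
    ("java#language compiler bug", "programming"),
    ("python java#language code", "programming"),
    ("java#island island beach", "travel"),
    ("bali java#island flight", "travel"),
])

plain_scores = class_scores(plain, "java")
tagged_scores = class_scores(tagged, "java#language")
print(plain_scores)
print(tagged_scores)
```

The untagged word gives both classes the same score, while the sense-tagged token clearly favours one of them.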
22Mining From (Un)structured Data
- Mining data = intensively querying data
- Answering a query is
- Easy in structured data
- Relational database, XML, ...
- Harder in semi-structured data (e.g., HTML)
- Hard in unstructured data
- Information extraction needed
- Could do this by learning a wrapper
- This involves one extra layer of learning
- Relating this to our text example: taking into account function of words in text
23Availability of Background Knowledge
- Learning = finding relevant patterns in behaviour
- Important to have the right context to describe these patterns
- Example
- Making interesting offers to clients
- "People who bought this book also bought ..."
- Instance-based learning
- Estimate profile of user
- Find users with similar profile
- Look at behaviour of those users to help current user
24Availability of Background Knowledge
- Can work better if more background knowledge is available, e.g., type of book, author, ...
- For instance, for books
- "similar profile" = users that up till now bought the same books as this user
- May not be many people
- "similar" = often bought books by same author
- Probably many more people, allows for more reasonable guess
- "similar" = often bought books of same genre (fiction, ...)
- May work even better
- Ontologies (among others) provide such background knowledge
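The effect of the three notions of "similar" can be sketched with a toy instance-based recommender. Everything here is an assumption made for illustration: the `catalogue` (standing in for background knowledge such as an ontology), the purchase data, the Jaccard similarity and the function names.

```python
# Hypothetical background knowledge: book -> (author, genre).
catalogue = {
    "book_a": ("author_1", "fiction"),
    "book_b": ("author_1", "fiction"),
    "book_c": ("author_2", "fiction"),
    "book_d": ("author_3", "science"),
}
# Hypothetical purchase history per user.
purchases = {
    "ann": {"book_a"},
    "bob": {"book_b", "book_c"},
    "eve": {"book_d"},
}


def profile(user, level):
    """Describe a user by the books, authors or genres they bought."""
    books = purchases[user]
    if level == "book":
        return set(books)
    index = 0 if level == "author" else 1
    return {catalogue[b][index] for b in books}


def jaccard(x, y):
    """Overlap of two sets, between 0 (disjoint) and 1 (identical)."""
    return len(x & y) / len(x | y) if x | y else 0.0


def recommend(user, level):
    """Suggest the unseen books of the most similar other user at the given level."""
    own = profile(user, level)
    neighbours = [(jaccard(own, profile(o, level)), o) for o in purchases if o != user]
    score, best = max(neighbours)
    if score == 0:
        return set()  # nobody is similar at this level of description
    return purchases[best] - purchases[user]


for level in ("book", "author", "genre"):
    print(level, sorted(recommend("ann", level)))
```

At the level of exact books nobody resembles "ann", so no suggestion is possible; once authors or genres are known, a neighbour appears and suggestions follow.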
25Web Mining Revisited
- Semantic Web will change
- Content mining
- Clearer view on contents and meaning of documents
- Structure mining
- More relevant structure
- Usage mining
- More relevant information on actions of user
- Will in general improve intelligence of systems
- E.g., mail filter gets a better view of contents of mails
26Promising Learning Techniques
- Many different learning techniques exist
- Neural networks, support vector machines, instance-based learning, Bayesian learning, association rules, ...
- Not all equally suitable for every task
- E.g., SVMs work well for document classification
- E.g., instance-based learning: find other users with same profile as this user to make predictions
- Intelligent agents will use a mix of them
- Relational learners seem interesting
- Can handle explicit information on objects and relations between them
- Classic example: inductive logic programming
27Inductive Logic Programming
- Induces rules in first-order logic from examples or other rules
- Such rules can be used to reason with
- The reasoning can be explained
- Cf. example of mail program
- Can use existing background knowledge
- "Knowledge-intensive" learning
- Currently good background knowledge has to be engineered manually
- Will become more easily available with semantic web
- Example: mining in chemical domains
29Mining in chemical domains
- Example problem: relate activity of molecule to its properties
- Useful for, e.g., drug development
- Which properties are important?
- Chemically relevant properties: functional groups, 3D structure, ...?
- Has to be encoded manually
- Ideally: get relevant information from some trustworthy data source as and when needed
- Intelligent agents will exploit (tap) the "common intelligence" of the Web
30Conclusions
- Machine learning is a promising tool for the Semantic Web
- For building it
- For exploiting it
- Clear synergy between Semantic Web efforts and Machine Learning efforts
31Some References
- Maedche, A Machine Learning Perspective for the Semantic Web, position paper, www.semanticweb.org/SWWS/program/position/soi-maedche.pdf
- Maedche & Staab (2001), Ontology Learning for the Semantic Web, IEEE Intelligent Systems 16(2)
- Jenssen et al. (2001), Nature Genetics 28
- Doan et al. (2001), ACM SIGMOD conf.
- Kosala & Blockeel (2000), SIGKDD Explorations 2(1)
- Mitchell (1996), Machine Learning