Transcript and Presenter's Notes

Title: Contents


1
Contents
  • Introduction
  • Knowledge discovery from text & links
  • Knowledge discovery from usage data
  • Important open issues

2
WWW: the new face of the Net
Once upon a time, the Internet was a forum for
exchanging information. Then came the Web.
The Web introduced new capabilities and attracted
many more people, increasing commercial interest
and turning the Net into a real forum.
3
Information overload
As more people started using the Web ...
the quantity of information on the Web increased ...
increasing the quantity of online information
further ...
attracting even more people ...
and leading to information overload for the users ...
4
WWW: an expanding forum
  • The Web is large and volatile:
  • More than 600,000,000 users online.
  • More than 800,000 sign up every day.
  • More than 9,000,000 Web sites.
  • More than 300,000,000,000 pages online.
  • Less than 50% of Web sites will be there next
    year.
  • ... leading to the abundance problem:
  • 99% of online information is of no interest
    to 99% of the people.

5
Information access services
  • A number of services aim to help the user gain
    access to online information and products ...

but can they really cope?
6
New requirements
  • Current indexing does not allow for wide
    coverage: less than 5% of the Web is covered by
    search engines.
  • What I want is hardly ever ranked high enough.
  • Product information in catalogues is often biased
    towards specific suppliers, and outdated.
  • Product descriptions are incomplete and
    insufficient for comparison purposes.
  • The E in E-commerce stands for English: more
    than 70% of the Web is in English.
  • ... and many more problems lead to the conclusion
  • that more intelligent solutions are needed!

7
A new generation of services
  • Some have already made their way to the market

many more are being developed as I speak
8
Approaches to Web mining
  • Primary data (Web content):
  • Mainly text,
  • with some multimedia content (increasing),
  • and mark-up commands, including hyperlinks.
  • Underlying databases (not directly accessible).
  • Knowledge discovery from text and links:
  • Pattern discovery in unstructured textual data.
  • Pattern discovery in the Web graph / hypertext.

9
Approaches to Web mining
  • Secondary data (Web usage):
  • Access logs collected by servers,
  • potentially using cookies,
  • and a variety of navigational information
    collected by Web clients (mainly JavaScript
    agents).
  • Knowledge discovery from usage data:
  • Discovery of interesting usage patterns, mainly
    from server logs.
  • Web personalization and Web intelligence.

10
Contents
  • Introduction
  • Knowledge discovery from text & links
  • Introduction
  • Information filtering and retrieval
  • Ontology learning
  • Knowledge discovery from usage data
  • Important open issues

11
Information access
  • Goals:
  • Organize documents into categories.
  • Assign new documents to the categories.
  • Retrieve information that matches a user query.
  • Dominating statistical idea:
  • TF-IDF: term frequency × inverse document
    frequency (sketched below).
  • Problems on the Web:
  • Huge scale and high volatility demand automation.
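
A minimal sketch of the TF-IDF idea in plain Python
(the toy corpus and function name are illustrative,
not from the tutorial):

    import math
    from collections import Counter

    def tfidf(docs):
        # tf(t, d): frequency of term t in document d
        # idf(t):   log(N / df(t)), where df(t) is the number
        #           of documents containing t, N the corpus size
        n_docs = len(docs)
        df = Counter()
        for doc in docs:
            df.update(set(doc))      # count each term once per document
        weights = []
        for doc in docs:
            tf = Counter(doc)
            weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
        return weights

    docs = [["web", "mining", "text"], ["web", "usage", "logs"]]
    print(tfidf(docs))               # terms in every document get weight 0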

12
Text mining
  • Knowledge (pattern) discovery in textual data.
  • Clarifying common misconceptions:
  • Text mining is NOT about assigning documents to
    thematic categories, but about learning document
    classifiers.
  • Text mining is NOT about extracting information
    from text, but about learning information
    extraction patterns.
  • Difficulty: the unstructured format of textual
    data.

13
Approaches to text mining
  • Combination of language engineering (LE),
    machine learning (ML) and statistical methods

[Figure: a spectrum of systems combining ML/statistics
and LE components in different proportions]
14
Hyperlink information is useful
  • Information access can be improved by
    identifying authoritative pages (authorities)
    and resource index pages (hubs).
  • Linked pages often contain complementary
    information (e.g. product offers).
  • Thematically related pages are often linked,
    either directly or indirectly.

15
Document category modelling
Training documents (pre-classified)
→ Pre-processing: stopword removal (and, the, etc.),
  stemming (played → play), bag-of-words coding.
→ Dimensionality reduction: statistical selection /
  combination of characteristic terms (MI, PCA).
→ Machine learning: supervised classifier learning.
→ Category models (classifiers)
16
Document category modelling
  • Example: filtering spam email.
  • Task: classify incoming email as spam or
    legitimate (2 document categories).
  • Simple blacklist and keyword-based methods have
    failed.
  • More intelligent, adaptive approaches are needed
    (e.g. naive Bayesian category modelling).

17
Document category modelling
  • Step 1 (linguistic pre-processing): tokenization,
    removal of stopwords, stemming/lemmatization.
  • Step 2 (vector representation): bag-of-words or
    n-gram modelling (n = 2, 3).
  • Step 3 (feature selection): information gain
    evaluation.
  • Step 4 (machine learning): Bayesian modelling,
    using word/n-gram frequencies (a sketch follows
    this list).
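
A minimal sketch of steps 2 and 4 (feature selection
omitted), assuming unigram features and Laplace
smoothing; the toy messages are invented:

    import math
    from collections import Counter

    def train_nb(docs, labels, alpha=1.0):
        # step 2: bag-of-words counts per class; step 4: Bayes model
        classes = set(labels)
        prior = {c: labels.count(c) / len(labels) for c in classes}
        counts = {c: Counter() for c in classes}
        for doc, y in zip(docs, labels):
            counts[y].update(doc)
        vocab = {w for doc in docs for w in doc}

        def log_prob(doc, c):
            total = sum(counts[c].values()) + alpha * len(vocab)
            return math.log(prior[c]) + sum(
                math.log((counts[c][w] + alpha) / total) for w in doc)

        return lambda doc: max(classes, key=lambda c: log_prob(doc, c))

    classify = train_nb(
        [["cheap", "pills"], ["meeting", "agenda"], ["cheap", "offer"]],
        ["spam", "legit", "spam"])
    print(classify(["cheap", "meeting"]))    # -> spam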

18
Link structure analysis
  • Improve information retrieval by scoring Web
    pages according to their importance in the Web or
    a thematic sub-domain of it.
  • Nodes with large fan-in (authorities) provide
    high quality information.
  • Nodes with large fan-out (hubs) are good starting
    points.

19
Link structure analysis
  • The HITS algorithm [Kleinberg, Journal of the
    ACM, 1999]:
  • Given a set of Web pages, e.g. as generated by a
    query,
  • expand the base set by including pages that are
    linked to by the ones in the initial set, or link
    to them,
  • assign a hub and an authority weight to each
    page, initialised to 1,
  • update the authority weight of page p according
    to the hub weights of the pages that link to it,
  • update the hub weight of page p according to the
    authority weights of the pages that it links to,
  • repeat the weight update a given number of
    times,
  • and return a list of the pages ranked by their
    weights (a minimal sketch follows).
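
The update loop is short enough to sketch directly;
the toy link graph is invented, and the per-iteration
normalisation (so weights do not grow unboundedly) is
an assumption of this sketch:

    import math

    def hits(links, iterations=20):
        # links maps each page to the set of pages it links to
        pages = set(links) | {q for targets in links.values() for q in targets}
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            # authority of p: sum of hub weights of pages linking to p
            auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                    for p in pages}
            # hub of p: sum of authority weights of the pages p links to
            hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
            for w in (auth, hub):        # L2-normalise both weight vectors
                norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
                for p in w:
                    w[p] /= norm
        return hub, auth

    links = {"a": {"b", "c"}, "b": {"c"}, "d": {"c"}}
    hub, auth = hits(links)
    print(sorted(auth, key=auth.get, reverse=True))   # "c" ranks first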

20
Link structure analysis
  • Interesting issues:
  • Does the social network hypothesis hold, i.e.,
    authorities are highly cited? This may be
    unrealistic in competitive commercial domains.
  • What happens if link structure adapts to the
    method, e.g. unrelated pages link to each other
    to increase their rating?
  • What about interesting new pages? How will people
    get to them?

21
Focused crawling & spidering
  • Crawling/spidering: automatic navigation through
    the Web by robots, with the aim of indexing the
    Web.
  • Crawling v. spidering (subjective): inter-site v.
    intra-site navigation.
  • Focused crawling/spidering: efficient, thematic
    indexing of relevant Web pages, e.g. maintenance
    of a thematic portal.
  • Underlying assumption, similar to HITS:
    thematically similar pages are linked.

22
Focused crawling
  • Focused crawling [Chakrabarti et al., WWW 1999]:
  • Given an initial set of Web pages about a topic,
    e.g. as found in a Web directory,
  • use document category modelling to build a topic
    classifier,
  • extract the hyperlinks within the initial set of
    pages and add them to a queue of pages to be
    visited,
  • retrieve pages from the queue,
  • use the classifier to assess the relevance of
    retrieved pages,
  • use a variant of HITS to assign a hub score to
    pages and to the hyperlinks in the queue,
  • re-sort the links in the queue according to their
    hub score,
  • and continue the retrieval of new pages,
    periodically updating the score of the hyperlinks
    in the queue (a skeleton of this loop follows).
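
A skeleton of that loop, with the fetcher, the topic
classifier and the link scorer passed in as functions
(all three, and the 0.5 relevance threshold, are
assumptions of this sketch; periodic re-scoring of
already-queued links is omitted for brevity):

    import heapq

    def focused_crawl(seeds, fetch, relevance, link_score, budget=100):
        # fetch(url) -> (text, outlinks); relevance(text) -> [0, 1]
        queue = [(-link_score(u), u) for u in seeds]
        heapq.heapify(queue)                 # best-scored link first
        seen, relevant = set(seeds), []
        while queue and len(relevant) < budget:
            _, url = heapq.heappop(queue)
            text, outlinks = fetch(url)
            if relevance(text) > 0.5:        # keep and expand relevant pages
                relevant.append(url)
                for link in outlinks:
                    if link not in seen:
                        seen.add(link)
                        heapq.heappush(queue, (-link_score(link), link))
        return relevant

    # toy in-memory "web" standing in for real HTTP access
    web = {"s": ("sports news", ["a", "b"]),
           "a": ("sports scores", []), "b": ("cooking", [])}
    print(focused_crawl(["s"], lambda u: web.get(u, ("", [])),
                        lambda t: float("sports" in t), lambda u: 1.0))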

23
Focused crawling & spidering
  • Domain-specific spidering:
  • Goal: retrieve the interesting pages, without
    traversing the whole site.
  • Differences from crawling:
  • The site is much more restricted in size and
    thematic diversity than the whole of the Web.
  • Social network analysis is less relevant within a
    site (no hubs and authorities).
  • Requirement: link scoring using local features,
    e.g. the anchor text and its textual context.

24
Information extraction
  • Goals:
  • Identify interesting events in unstructured
    text.
  • Extract information related to the events and
    store it in structured templates.
  • Typical application:
  • Information extraction from newsfeeds.
  • Difficulties:
  • Deals with unstructured or semi-structured text.
  • Identification of entities and relations.
  • Usually requires some understanding of the text.

25
A typical extraction system
Unstructured text and database schema (event
templates)
→ Morphology: sentence and word separation,
  lemmatization (said → say), part-of-speech
  tagging, etc.
→ Syntax: shallow syntactic parsing.
→ Semantics: named-entity recognition, co-reference
  resolution, sense disambiguation.
→ Discourse: pattern matching.
→ Structured data (filled templates)
26
Wrappers/fact extraction
  • Simplified information extraction:
  • Extract interesting facts from Web documents.
  • Assumes structure in the documents (usually
    dynamically generated from databases).
  • Reduced demand for pre-processing and LE.
  • Typical application:
  • Product comparison services (price, availability,
    …).
  • Difficulties:
  • Semi-structured data.
  • Different underlying database schemata and
    presentation formats.

27
Wrappers/fact extraction
Example:

<HTML><TITLE> Some Country Codes </TITLE>
<BODY><B> Some Country Codes </B><P>
<B> Congo </B> <I> 242 </I>
<B> Egypt </B> <I> 20 </I>
<B> Greece </B> <I> 30 </I>
<B> Spain </B> <I> 34 </I>
<HR><B> End </B></BODY></HTML>

Wrapper(page P):
  Skip past the first occurrence of <P> in P.
  While (next <B> is before next <HR> in P):
    For each (l, r) ∈ {(<B>, </B>), (<I>, </I>)}:
      Extract the text between l and r.
  Return the extracted <country, code> pairs.

Country  Code
Congo    242
Egypt    20
Greece   30
Spain    34
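
The same wrapper as a runnable Python sketch, with
regular expressions standing in for the cursor-based
pseudocode (the page string is the example above):

    import re

    PAGE = """<HTML><TITLE> Some Country Codes </TITLE>
    <BODY><B> Some Country Codes </B><P>
    <B> Congo </B> <I> 242 </I>
    <B> Egypt </B> <I> 20 </I>
    <B> Greece </B> <I> 30 </I>
    <B> Spain </B> <I> 34 </I>
    <HR><B> End </B></BODY></HTML>"""

    def extract_pairs(page):
        # skip past the first <P>, stop at <HR>, collect <B>/<I> pairs
        body = page.split("<P>", 1)[1].split("<HR>", 1)[0]
        return re.findall(r"<B>\s*(.*?)\s*</B>\s*<I>\s*(.*?)\s*</I>", body)

    print(extract_pairs(PAGE))
    # [('Congo', '242'), ('Egypt', '20'), ('Greece', '30'), ('Spain', '34')]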
28
Wrapper induction
Training documents (semi-structured) and database
schema (interesting facts)
→ Data pre-processing: abstraction of the mark-up
  structure (often omitted).
→ Machine learning: structural/sequence learning.
→ Fact extraction patterns (wrapper)
29
Ontology learning
Training documents (unclassified)
→ Pre-processing: stopword removal (and, the, etc.),
  stemming (played → play), syntactic/semantic
  analysis, bag-of-words coding.
→ Dimensionality reduction: hand-made thesauri
  (WordNet), term co-occurrence (LSI).
→ Machine learning: unsupervised learning
  (clustering and association discovery).
→ Ontologies
30
Ontology learning
  • Hierarchical clustering is the most suitable:
  • Agglomerative clustering (a minimal sketch
    follows this list).
  • Conceptual clustering (COBWEB).
  • Model-based clustering (EM-type MCLUST).
  • but flat clustering can also be adapted:
  • K-means and its variants.
  • Bayesian clustering (Autoclass).
  • Neural networks (self-organizing maps).
  • Association discovery (e.g. Apriori) for
    non-taxonomic relations.
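
To make the agglomerative option concrete: a
single-link clustering sketch over term co-occurrence
vectors (the toy vectors, the cosine measure and the
0.5 threshold are all assumptions):

    def cosine(u, v):
        dot = sum(u.get(k, 0) * v.get(k, 0) for k in set(u) | set(v))
        nu = sum(x * x for x in u.values()) ** 0.5
        nv = sum(x * x for x in v.values()) ** 0.5
        return dot / (nu * nv) if nu and nv else 0.0

    def agglomerate(terms, vectors, threshold=0.5):
        # repeatedly merge the closest pair of clusters (single link)
        # until no pair is more similar than the threshold
        clusters = [[t] for t in terms]
        while len(clusters) > 1:
            best, pair = threshold, None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    s = max(cosine(vectors[x], vectors[y])
                            for x in clusters[i] for y in clusters[j])
                    if s >= best:
                        best, pair = s, (i, j)
            if pair is None:
                break
            i, j = pair
            clusters[i] += clusters.pop(j)
        return clusters

    vectors = {"hotel": {"room": 2, "night": 1},
               "inn": {"room": 1, "night": 2},
               "flight": {"airport": 3}}
    print(agglomerate(list(vectors), vectors))  # [['hotel', 'inn'], ['flight']]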

31
Ontology learning
  • Example: acquisition of an ontology for tourist
    information [based on Maedche & Staab, ECAI 2000].

32
Ontology learning
  • Source data: Web pages of tourist sites.
  • Background knowledge: generic and domain-specific
    ontologies.
  • Target users: tourist directories, large travel
    agencies.
  • Goals:
  • Identify types of page (e.g. room descriptions)
    and terms/entities inside pages (e.g. hotel
    addresses).
  • Identify taxonomic relations between concepts
    (e.g. accommodation → hotel).
  • Identify non-taxonomic relations between concepts
    (e.g. accommodation - area).

33
Ontology learning
  • Heavy linguistic pre-processing:
  • Syntactic analysis, e.g. verb subcategorization
    frames: verb(arrive) → prep(at),
    dir_obj(Torino).
  • Semantic analysis, e.g. named-entity recognition:
    Via Lagrange → street name; special dependency
    relations: Hotel Concord in Torino.

34
Contents
  • Introduction
  • Knowledge discovery from text & links
  • Knowledge discovery from usage data
  • Personalization on the Web
  • Data collection and preparation issues
  • Personalized assistants
  • Discovering generic user models
  • Sequential pattern discovery
  • Knowledge discovery in action
  • Important open issues

35
Personalized information access
[Figure: content flows from sources through a
personalization server to receivers]
36
Personalization v. intelligence
  • Better service for the user:
  • Reduction of information overload.
  • More accurate information retrieval and
    extraction.
  • Recommendation and guidance.

37
Personalized assistants
  • Personalized crawling [Lieberman et al.,
    Communications of the ACM, 2001]:
  • The system knows the user (log-in).
  • It uses heuristics to extract important terms
    from the Web pages that the user visits and add
    them to thematic profiles.
  • Each time the user views a page, the system
  • searches the Web for related pages,
  • filters them according to the relevant thematic
    profile,
  • and constructs a list of recommended links for
    the user.
  • The Letizia version of the system searches the
    Web locally, following outgoing links from the
    current page.
  • The Powerscout version uses a search engine to
    explore the Web.

38
Personalized assistants
  • Adaptive Web interfaces [Jörding, UM 1999]:
  • The TELLIM system collects user information
    (e.g. the selection of a link) using a Java
    applet.
  • User information is used as training data in
    order to create generic models reflecting the
    users' interest in different products.
  • The system creates short-term personal models
    using the generic models and the current user's
    behavior.
  • Web pages containing more detailed information
    about these products, together with multimedia
    content and VRML presentations, are created
    dynamically and presented to the users.

39
User modelling
  • Basic elements:
  • Constructing models that can be used to adapt the
    system to the user's requirements.
  • Different types of requirement: interests (sports
    and finance news), knowledge level (novice or
    expert), preferences (no-frame GUI), etc.
  • Different types of model: personal v. generic.
  • Knowledge discovery facilitates the acquisition
    of user models from data.

40
User Models
  • User model (type A), PERSONAL:
    User x → sports, stock market
  • User model (type B), PERSONAL:
    User x, Age 26, Male → sports, stock market
  • User community, GENERIC:
    Users x, y, z → sports, stock market
  • User stereotype, GENERIC:
    Users x, y, z, Age 20..30, Male → sports,
    stock market

41
Generic user models
  • Stereotypes: models that represent a type of
    user, associating personal characteristics with
    parameters of the system,
  • e.g. male users of age 20-30 are interested in
    sports and politics.
  • Communities: models that represent a group of
    users with common preferences,
  • e.g. users that are interested in sports and
    politics.

42
Learning user models
43
Knowledge discovery process
Data collection: usage data collected by the server
and the client.
→ Data pre-processing: data cleaning, user
  identification, session identification.
→ Pattern discovery: construction of user models.
→ Knowledge post-processing: report generation,
  visualization, personalization module.
44
Pre-processing usage data
  • Cleaning:
  • Log entries that correspond to error responses.
  • Trails of robots.
  • Pages that have not been requested explicitly by
    the user (mainly image files, loaded
    automatically); the rules should be
    domain-specific.
  • User identification:
  • Identification by log-in.
  • Cookies and JavaScript.
  • Extended Log Format (browser and OS version).
  • A user-specific URL given to the user to
    bookmark.
  • Various other heuristics.

45
Pre-processing usage data
  • User session / transaction identification in log
    files:
  • Time-based methods, e.g. a 30-minute silence
    interval (sketched after this list). Problems
    with caching; partial solutions: special HTTP
    headers, Java agents.
  • Context-based methods, e.g. separate pages into
    navigational and content pages, and impose
    heuristics on the type of page that a user
    session may consist of.
  • User sessions can be subdivided into smaller
    transaction sequences, e.g. by identifying a
    backward reference in the sequence of requests.
  • Encoding of training data:
  • Bag-of-pages representation of
    sessions/transactions.
  • Transition-based representation of
    sessions/transactions.
  • Manually determined features of interest.
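
A minimal sketch of the time-based method, for one
user's requests (the tuple format and the 30-minute
default are assumptions):

    from datetime import datetime, timedelta

    def sessionize(requests, gap=timedelta(minutes=30)):
        # requests: (timestamp, url) pairs sorted by time; start a
        # new session whenever the silence interval exceeds `gap`
        sessions = []
        for ts, url in requests:
            if not sessions or ts - sessions[-1][-1][0] > gap:
                sessions.append([])
            sessions[-1].append((ts, url))
        return sessions

    reqs = [(datetime(2002, 1, 1, 10, 0), "/index"),
            (datetime(2002, 1, 1, 10, 5), "/sports"),
            (datetime(2002, 1, 1, 11, 0), "/finance")]
    print(len(sessionize(reqs)))   # 2 sessions: the 55-minute gap splits them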

46
Collaborative filtering
  • Information filtering according to the choices of
    similar users.
  • Avoids semantic content analysis.
  • Cold-start problem with new users.
  • Approaches:
  • memory-based learning,
  • model-based clustering,
  • item-based recommendation.

47
Memory-based learning
  • Nearest-neighbour approach
  • Construct a model for each user. Often use
    explicit user ratings for each item.
  • Index the user in the space of system parameters,
    e.g. item ratings.
  • For each new user,
  • index the user in the same space, and
  • find the k closest neighbours.
  • Simple metrics to measure the similarity between
    users, e.g. Pearson correlation.
  • Recommend the items that the new user has not
    seen and are popular among the neighbours.
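
A minimal sketch of the neighbourhood method (the
ratings dictionaries and item names are toy data;
real systems add mean-centring and weighting
refinements):

    def pearson(a, b):
        # correlation over the items both users rated
        common = set(a) & set(b)
        n = len(common)
        if n < 2:
            return 0.0
        ma = sum(a[i] for i in common) / n
        mb = sum(b[i] for i in common) / n
        cov = sum((a[i] - ma) * (b[i] - mb) for i in common)
        va = sum((a[i] - ma) ** 2 for i in common) ** 0.5
        vb = sum((b[i] - mb) ** 2 for i in common) ** 0.5
        return cov / (va * vb) if va and vb else 0.0

    def recommend(target, others, k=2):
        # score unseen items by the ratings of the k closest neighbours
        neighbours = sorted(others, key=lambda u: pearson(target, others[u]),
                            reverse=True)[:k]
        scores = {}
        for u in neighbours:
            for item, r in others[u].items():
                if item not in target:
                    scores[item] = scores.get(item, 0) + r
        return sorted(scores, key=scores.get, reverse=True)

    alice = {"matrix": 5, "titanic": 1}
    others = {"bob": {"matrix": 4, "titanic": 2, "blade": 5},
              "carol": {"matrix": 1, "titanic": 5, "notebook": 4}}
    print(recommend(alice, others, k=1))   # ['blade']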

48
Model-based clustering
  • Clustering users into communities.
  • Methods used:
  • Conceptual clustering (COBWEB).
  • Graph-based clustering (cluster mining).
  • Statistical clustering (Autoclass).
  • Neural networks (Self-Organising Maps).
  • Model-based clustering (EM-type).
  • BIRCH.
  • Community models: the cluster descriptions.

49
Model-based clustering
[Figure: weighted graph of users; edge weights
(0.9, 0.8, 0.5, 0.4, 0.1) link users with similar
preferences]
50
Item-based recommendation
  • Focus on item usage in the profiles, instead of
    the users themselves.
  • Practically useful in e-commerce, e.g. cross-sell
    recommendations.
  • Simple modification to the clique-based
    clustering method: a graph of items instead of a
    graph of users (see the sketch below).
  • Related to frequent itemset discovery in
    association rule mining.
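
A minimal sketch of the idea, with co-occurrence
counts standing in for the edge weights of the item
graph (the profiles are toy data):

    from collections import Counter
    from itertools import combinations

    def item_cooccurrence(profiles):
        # weight of edge (a, b): number of profiles containing both
        co = Counter()
        for items in profiles:
            for a, b in combinations(sorted(set(items)), 2):
                co[(a, b)] += 1
        return co

    def recommend_items(basket, co, top=3):
        # score unseen items by their links to items already chosen
        scores = Counter()
        for (a, b), w in co.items():
            if a in basket and b not in basket:
                scores[b] += w
            elif b in basket and a not in basket:
                scores[a] += w
        return [i for i, _ in scores.most_common(top)]

    profiles = [{"sports", "politics"}, {"sports", "finance"},
                {"politics", "finance"}, {"sports", "politics"}]
    print(recommend_items({"sports"}, item_cooccurrence(profiles)))
    # ['politics', 'finance']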

51
Item-based recommendation
[Figure: weighted graph of items (Politics, Sports,
World, Finance); edge weights 0.1-0.9 link items
chosen by the same users]
52
Contents
  • Introduction
  • Knowledge discovery from text & links
  • Knowledge discovery from usage data
  • Personalization on the Web
  • Data collection and preparation issues
  • Personalized assistants
  • Discovering generic user models
  • Sequential pattern discovery
  • Knowledge discovery in action
  • Important open issues

53
Sequential pattern discovery
  • Identifying navigational patterns, rather than
    bag-of-pages models.
  • Methods:
  • Clustering transitions between pages.
  • First-order Markov models (sketched after this
    list).
  • Probabilistic grammar induction.
  • Association-rule sequence mining.
  • Path traversal through graphs.
  • Personal and community navigation models.
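
A minimal sketch of the first-order Markov option:
estimate P(next page | current page) from observed
sessions (the sessions below are invented):

    from collections import Counter, defaultdict

    def transition_model(sessions):
        counts = defaultdict(Counter)
        for s in sessions:
            for cur, nxt in zip(s, s[1:]):   # consecutive page pairs
                counts[cur][nxt] += 1
        # normalise the counts into conditional probabilities
        return {cur: {nxt: c / sum(nxts.values())
                      for nxt, c in nxts.items()}
                for cur, nxts in counts.items()}

    sessions = [["index", "sports", "politics"],
                ["index", "finance"],
                ["index", "sports", "finance"]]
    print(transition_model(sessions)["index"])
    # {'sports': 0.666..., 'finance': 0.333...}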

54
Sequential pattern discovery
  • Clique-based transition clustering: a small
    modification of the model-based item clustering
    approach, where an item is a transition between
    pages.

[Figure: weighted graph of page transitions
(Sports→Politics, Finance→Politics, Sports→Finance,
Finance→Sports); edge weights 0.1-0.9]
55
References
  • J. Borges and M. Levene, Data mining of user
    navigation patterns. Proceedings of the Workshop
    on Web Usage Analysis and User Profiling
    (WEBKDD), in conjunction with the ACM SIGKDD
    International Conference on Knowledge Discovery
    and Data Mining, San Diego, CA, 1999, pp. 31-36.
  • S. Chakrabarti, M. H. van den Berg and B. E. Dom,
    Focused Crawling: a new approach to
    topic-specific Web resource discovery.
    Proceedings of the Eighth International World
    Wide Web Conference (WWW), Toronto, Canada, May
    1999.
  • T. Jörding, A Temporary User Modeling Approach
    for Adaptive Shopping on the Web. Proceedings of
    the 2nd Workshop on Adaptive Systems and User
    Modeling on the WWW, UM'99, Banff, Canada, 1999.
  • J. Kleinberg. Authoritative sources in a
    hyperlinked environment. Journal of the ACM, v.
    46, 1999.
  • H. Lieberman, C. Fry and L. Weitzman. Exploring
    the Web with Reconnaissance Agents,
    Communications of the ACM, August 2001, pp.
    69-75.
  • A. Maedche and S. Staab, Discovering Conceptual
    Relations from Text. In W. Horn (ed.), ECAI 2000:
    Proceedings of the 14th European Conference on
    Artificial Intelligence (ECAI), Berlin, August
    21-25, 2000.
  • A. McCallum, D. Freitag and F. Pereira, Maximum
    Entropy Markov Models for Information Extraction
    and Segmentation, Proceedings of the
    International Conference on Machine Learning
    (ICML), Stanford, CA, 2000, pp. 591-598.
  • I. Muslea, S. Minton and C. Knoblock, STALKER:
    Learning extraction rules for semistructured
    Web-based information sources. Proceedings of the
    National Conference on Artificial Intelligence
    (AAAI), Madison, Wisconsin, 1998.
  • C. Nédellec, Corpus-based learning of semantic
    relations by the ILP system Asium. In Learning
    Language in Logic, Cussens J. and Dzeroski S.
    (Eds.), Springer Verlag, September 2000.
  • J. Rennie and A. McCallum. Efficient Web
    Spidering with Reinforcement Learning.
    Proceedings of the International Conference on
    Machine Learning (ICML), 1999.
  • E. I. Schwartz, Webonomics. New York: Broadway
    Books, 1997.
  • E. Schwarzkopf, An adaptive Web site for the
    UM2001 conference. Proceedings of the Workshop on
    Machine Learning for User Modeling, in
    conjunction with the International Conference on
    User modelling (UM), pp 77-86, 2001.