Title: No slide title
1Semantic Web Usage Mining: An International Perspective
Bettina Berendt
Humboldt University Berlin, Institute of
Information Systems www.wiwi.hu-berlin.de/berendt
Talk at Universitat Pompeu Fabra, Barcelona,
28 February 2005
2Acknowledgements
- Elke Brenstein
- Martin Eisend
- Jorge Gonzalez
- Sebastian Hinz
- Andreas Hotho
- Anett Kralisch
- Ernestina Menasalvas
- Bamshad Mobasher
- Daniel Oberle
- Myra Spiliopoulou
- Gerd Stumme
- Max Teltzrow
- Bert Wendland
Note: all publications cited in these slides can either be found at
http://www.wiwi.hu-berlin.de/berendt or obtained from me by email
3Goals and top-level questions
- Make the world's knowledge available to the world
- How do people discover knowledge on the Web?
- How can more knowledge sources contribute to the
Web?
4Approaches to the current Web's biggest challenges: lots of data, human-understandable
Web Mining extracts implicit knowledge
The Semantic Web makes knowledge machine-understandable
Berendt, Hotho, Stumme, Proc. ISWC 2002 --
(Eds.), Proc. WS Semantic Web Mining at ECML/PKDD
2001 and 2002 -- Mladenic, van Someren A
Roadmap for Web Mining ... 2004
5Agenda
6Semantics of requests, Step 1: Domain ontology
- community portal ka2portal.aifb.uni-karlsruhe.de
- ontology-based
- Knowledge base in F-Logic
- Static pages: annotations
- Dynamic pages: generated from queries
- Queries also in F-Logic
- Logs contain these queries
Oberle, Berendt, Hotho, Gonzalez, Proc. AWIC 2003
7Semantics of requests, Step 2: Modelling requests and sessions-as-sets
- RESEARCHER
- PERSON
- PROJECT
- PUBLICATION
- RESEARCHTOPIC
- EVENT
- ORGANIZATION
- RESEARCHINTEREST
- LASTNAME
- TITLE
- ISABOUT
- EVENTS
- EVENTTITLE
- WORKSATPROJECT
- AUTHOR
- AFFILIATION
- ISWORKEDONBY
- PROGRAMCOMMITTEE
- EMPLOYS
An example query with concepts and relations:
FORALL N,PEOPLE <- PEOPLE:Employee[affiliation->>"http://www.anInstitute.org"] and PEOPLE:Person[lastName->>N].
Query = feature vector of concepts & relations → Session = feature vector of
concepts & relations, summed over all queries in the session
Clustering, Association rules, Classification, ...
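The mapping from queries to feature vectors and from queries to a session vector can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the feature list is a small hypothetical subset of the ontology, and each query is assumed to be already reduced to the list of concepts and relations it mentions.

```python
from collections import Counter

# Hypothetical subset of concepts and relations (illustrative only; the real
# ontology on the slide is larger).
FEATURES = ["PERSON", "EMPLOYEE", "PROJECT", "AFFILIATION", "LASTNAME"]

def query_vector(entities):
    """Count how often each concept/relation occurs in one query."""
    counts = Counter(e.upper() for e in entities)
    return [counts[f] for f in FEATURES]

def session_vector(queries):
    """Sum the query vectors of all queries in a session, feature by feature."""
    vectors = [query_vector(q) for q in queries]
    if not vectors:
        return [0] * len(FEATURES)
    return [sum(column) for column in zip(*vectors)]

# The example query mentions Employee, affiliation, Person and lastName;
# the second query is an assumed follow-up query in the same session.
q1 = ["Employee", "affiliation", "Person", "lastName"]
q2 = ["Person", "lastName"]
session = session_vector([q1, q2])
print(dict(zip(FEATURES, session)))
```

Session vectors of this form can then be passed to any standard clustering, association-rule, or classification routine.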
8Semantics of sequences, Step 3: Strategy pattern discovery
- An ontology of navigation strategies
- Define strategy templates as regular expressions
- of requests (mapped to ontological entities)
- of transitions (between ontological entities)
- Ex. .search . individual
- Discover strategies by learning a strategy trie
affiliationSearch, 629
topicSearch, 312
...
...
repetition, 402
refinement, 113
...
individual, 112
repetition, 295
...
Berendt & Spiliopoulou, VLDB Journal, 2000; Berendt, Data Mining and Knowledge
Discovery, 2002
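A rough sketch of the two ideas on this slide: a strategy template written as a regular expression over ontological entities, and a trie that counts how many sessions pass through each step. The step labels are borrowed from the trie figure, but the template itself and the sample sessions are assumptions made for illustration; the slide's own example template did not survive extraction, and the published method is richer than this.

```python
import re

# Hypothetical template: one or more search-type steps, then an individual page.
STEP = r"(?:\w*Search|refinement|repetition)"
TEMPLATE = re.compile(rf"^{STEP}(?: {STEP})* individual$")

def matches_template(session):
    """Check a session (list of entity labels) against the strategy template."""
    return TEMPLATE.match(" ".join(session)) is not None

def build_trie(sessions):
    """Insert each session path into a trie; each node counts passing sessions."""
    root = {"count": 0, "children": {}}
    for session in sessions:
        node = root
        node["count"] += 1
        for step in session:
            node = node["children"].setdefault(step, {"count": 0, "children": {}})
            node["count"] += 1
    return root

# Made-up sessions, already mapped to ontological entities.
sessions = [
    ["affiliationSearch", "refinement", "individual"],
    ["affiliationSearch", "repetition"],
    ["topicSearch", "individual"],
    ["affiliationSearch", "refinement", "individual"],
]
matching = [s for s in sessions if matches_template(s)]
trie = build_trie(sessions)
print(len(matching), trie["children"]["affiliationSearch"]["count"])
```

The node counts play the same role as the numbers shown next to each step in the trie figure.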
9Semantics of sequences, Step 4: Strategy pattern evaluation
- Use strategy patterns' statistics to
- Derive descriptive measures of patterns
- support, confidence
- popularity, effectiveness, efficiency
- Apply inferential statistics to compare patterns
Berendt, Data Mining and Knowledge Discovery, 2002
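As an illustration of the two descriptive measures named first, the sketch below uses the usual sequence-mining definitions, which are assumed here rather than taken from the slide: support is the number of sessions that contain a pattern in order (gaps allowed), and confidence is the support of the full pattern divided by the support of its prefix. The sessions are made up.

```python
def contains_in_order(session, pattern):
    """True if all pattern steps occur in the session in order (gaps allowed)."""
    remaining = iter(session)
    return all(step in remaining for step in pattern)

def support(sessions, pattern):
    """Number of sessions that contain the pattern."""
    return sum(contains_in_order(s, pattern) for s in sessions)

def confidence(sessions, prefix, pattern):
    """Share of sessions containing the prefix that also contain the full pattern."""
    base = support(sessions, prefix)
    return support(sessions, pattern) / base if base else 0.0

# Hypothetical sessions over abstract step labels.
sessions = [
    ["search", "refine", "individual"],
    ["search", "individual"],
    ["search", "refine"],
    ["home", "individual"],
]
sup = support(sessions, ["search", "individual"])
conf = confidence(sessions, ["search"], ["search", "individual"])
print(sup, round(conf, 2))
```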
10Communication: Visual data mining. Step 5: Mapping an ontological relation over concepts to a linear order and to visual variables
Figure: axes Concreteness and Time. Labels: Goal Individual page; Reach goal; Refine search; More constraints on search; First search page; Remain unspecific; Abandon search.
12Communication: Visual data mining. Step 5: Example
Berendt, Data Mining and Knowledge Discovery, 2002; Berendt, Postproc. WebKDD 2001
13Communication: Visual data mining. Step 6: Visual abstraction → new semantic patterns
Closeness to product
Shopping for cameras
Shopping for jackets
Data: see Berendt, Günther, Spiekermann, Communications of the ACM, in press
15Agenda
Web Mining
(Semantic) Web
16Mining → Web: Using results for site improvement
Name
Name
17Mining → Web: Using results for site improvement
Name
Name
- Path analysis metrics and χ² analysis showed:
- All search criteria were approx. equally effective
- Location-based search was most popular
- City-based search was most efficient ... but least popular
- → Modify site design to make efficient search more popular
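To make the χ² comparison concrete, here is a small sketch that computes the Pearson χ² statistic for a contingency table of search criterion by outcome. The counts are invented for illustration and are not the study's data; the critical value 5.991 is the standard 5% threshold for 2 degrees of freedom.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical counts: rows = three search criteria,
# columns = (goal reached, goal not reached).
observed = [
    [60, 40],
    [58, 42],
    [62, 38],
]
stat, df = chi_square(observed)
CRITICAL_5_PERCENT_DF2 = 5.991
print(round(stat, 3), df, stat > CRITICAL_5_PERCENT_DF2)
```

With these made-up counts the statistic stays far below the critical value, which is the shape of result one would expect when all criteria are about equally effective.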
18Mining → Web: Using results for site improvement
Name
City
Name
19Agenda
20Step 7: Theory-driven usage mining, here: Internationalization
- The Web offers unprecedented opportunities for world-wide access to information resources.
- But how should content & interface be internationalized / localized? → investment decision!
- Case study: eHealth portal for worldwide audiences
- same content, same interface in all languages & all countries.
- → Research goals: determine the impact of
- users' cultural backgrounds
- users' linguistic backgrounds
- users' domain knowledge
- on their navigational behaviour.
see Kralisch & Berendt, Proc. CATAC 2004; Kralisch & Berendt, Proc. GOR 2005; Kralisch,
Köppen, Berendt, Proc. ECIS 2005
21The investigated site
22Which diagnosis is that?
Request frequency for a specific diagnosis in the
investigated eHealth portal, depending on time
and request language
Yihune, 2003
23Measuring Culture: Cultural Dimensions
Hofstede: Power Distance, Collectivism, Uncertainty Avoidance, Long-Term Orientation, Masculinity
Hall: Monochronic vs. Polychronic, Context specificity, ...
... other authors ... more dimensions
Kralisch & Berendt, Proc. IWIPS 2004; Kralisch, Eisend, Berendt, Proc. HCI International 2005
24Measuring Culture: Cultural Dimensions
Hofstede: Power Distance, Collectivism, Uncertainty Avoidance, Long-Term Orientation, Masculinity
Hall: Mono- vs. Polychronic, Context specificity, ...
How people deal with
... other authors ... more dimensions
Kralisch & Berendt, Proc. IWIPS 2004; Kralisch, Eisend, Berendt, Proc. HCI International 2005
251. Amount of INFORMATION
- A) Uncertainty Avoidance (Hofstede)
- High Uncertainty Avoidance (e.g. Belgium)
- worried about the unknown
- Low Uncertainty Avoidance (e.g. UK)
- open to new things and changes
281. Amount of INFORMATION
B) Context Specificity (Hall)
- High Context (e.g. Japan)
- Most of the meaning is in the context, very little in the message.
- Low Context (e.g. USA)
- If the information is not explicitly stated, the meaning is distorted.
292. TIME perception
C) Long-Term Orientation (Hofstede)
- Short-Term Orientation (e.g. Canada)
- focussed on past and present
- Long-Term Orientation (e.g. China)
- focussed on future rewards
302. TIME perception
D) Monochronic vs. Polychronic (Hall)
- Polychronic (e.g. Greece)
- Plans are constantly shifted
- several things at a time, synchronic
- people do not adhere rigidly to appointment schedules
- Monochronic (e.g. Germany)
- schedules
- one thing at a time, sequential
- staying on schedule is a must
313. SPACE perception
E) Power Distance (Hofstede)
... refers to the extent to which less powerful individuals expect and accept that power is
distributed unequally.
- High Power Distance (e.g. France)
- relationships between supervisors and subordinates strictly ruled
- large hierarchies
- Low Power Distance (e.g. Sweden)
- supervisors and subordinates work together and consult one another
- flat hierarchies
35Derivation of hypotheses: search preferences
                               LOW                       HIGH
INFO   Uncertainty Avoidance   Limited choices           Maximal choices
       Context specificity     Context insensitive       Context desired
TIME   Long-Term Orientation   Desire immediate results  Patient
SPACE  Power Distance          Hierarchies uncommon      Hierarchies common
36Hypotheses: search preferences
Search option | Characteristics | Presumably preferred by
Search engine | little context; fast information access; no hierarchies | Low context; Low Uncertainty Avoidance; Short-Term oriented; Low Power Distance
Alphabetically organized links | large hierarchies | High Power Distance
Content-organized links | highest amount of (context) information; more time-consuming information access; large hierarchies | High context; High Uncertainty Avoidance; Long-Term oriented; High Power Distance
392. Navigation Patterns
Hypotheses: navigation behaviour
                             LOW                           HIGH
Long-Term Orientation        Desire for immediate results  Patience
Uncertainty Avoidance                                      Worried about the unknown
Monochronic vs. Polychronic  Sequential                    Synchronic
- Lots of time spent per page request
- Little time spent per page request
- Extensive information collection
- More diagnoses visited
40Data preprocessing (1): Semantic enrichment by country and culture information
Web server log
- 200.x4.xx.xx - - [09/Apr/2002:22:28:35 +0200] "GET /cgi-bin/ivw/CP/doia/image.asp.ivw?zugr=d&lang=e&cd=14&nr=87&diagnr=757370 HTTP/1.0" 200 735 "http://www.dermis.net/doia/image.asp?zugr=d&lang=e&cd=14&nr=87&diagnr=757370" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"
- 200.x4.xx.xx → IP address
- doia/ ... diagnr=757370 → requested page (also search mode)
- etc.
IP address → Localization → Culture
IP_ADDRE        COUNTRY  CITY     LATITUDE  LONGITUD  TIMEZONE  CERTAINT
8x.7x.6x.xxx    Albania  Tirane   41.3330   19.8330   0100      90
1x3.1x4.xx.xxx  Algeria  Algiers  36.7630   3.0510    0100      80
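The enrichment step can be sketched as follows: parse one log line in the common "combined" format, look up the visitor's country from the IP address, and attach culture information for that country. Everything specific here is an assumption for illustration: the lookup tables are tiny placeholders (a real study would use a geolocation database and published cultural-dimension scores), the IP address comes from a documentation-only range, and the host name is a placeholder.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Regular expression for the "combined" access log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Placeholder lookup tables (hypothetical values, not real data).
GEO_BY_PREFIX = {"192.0.2.": "CountryA", "198.51.100.": "CountryB"}
CULTURE = {
    "CountryA": {"UA": "high", "PD": "low"},
    "CountryB": {"UA": "low", "PD": "high"},
}

def enrich(line):
    """Parse a log line and add country, culture and request language."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    country = next(
        (c for prefix, c in GEO_BY_PREFIX.items() if record["ip"].startswith(prefix)),
        None,
    )
    record["country"] = country
    record["culture"] = CULTURE.get(country)
    params = parse_qs(urlsplit(record["url"]).query)
    record["lang"] = params.get("lang", [None])[0]
    return record

LINE = (
    '192.0.2.10 - - [09/Apr/2002:22:28:35 +0200] '
    '"GET /doia/image.asp?zugr=d&lang=e&diagnr=757370 HTTP/1.0" 200 735 '
    '"http://www.example.org/doia/image.asp" '
    '"Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"'
)
record = enrich(LINE)
print(record["country"], record["lang"], record["culture"])
```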
41Data preprocessing (2): Semantic enrichment by content category
- HOME www\.dermis\.net\/
- HOME dermis\.multimedica\.de
- DOIA \/doia/mainmenu\.asp\?zugrdlangdesp
- PEDOIA \/doia/mainmenu\.asp\?zugrplangdesp
- D_ALPH1 \/doia/abrowser\.asp\?zugrdlangdesp
- D_ALPH2 \/doia/abrowser\.asp\?zugrdlangdespbeginswithA-Z
- D_ALPH2 \/doia/abrowser\.asp\?zugrdlangdespbeginswithA-Zsize0-9
- D_LOKAL1 \/doia/dbrowser\.asp\?zugrdlangdespbenrA-Z
- D_LOKAL2 \/doia/dbrowser\.asp\?zugrdlangdespbenrA-Z_0-9
- D_LOKAL3 \/doia/dbrowser\.asp\?zugrdlangdespbenrA-Z_0-9_0-9
- D_LOKAL4 \/doia/dbrowser\.asp\?zugrdlangdespbenrA-Z_0-9_0-9_0-9
- SEARCH \/doia/abrowser\.asp\?zugrdlangdespbeginswithA-Za-zA-Za-zA-Za-ztypesearch.
- SEARCH \/doia/abrowser\.asp\?zugrplangdespbeginswithA-Za-zA-Za-zA-Za-ztypesearch.
- SEARCH \/doia/diagalphabrowser\.asp\?zugrdplangdesptypesearchbeginswith.
- D_DIAGNOSE \/doia/diagnose\.asp\?zugrdlangdeps.diagnr.
- P_DIAGNOSE \/doia/diagnose\.asp\?zugrplangdeps.diagnr.
- P_DIAGNOSE \/doia/diagnose\.asp\?langdepszugrpdiagnr.
(and so on)
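The rule list above pairs a content category with a URL pattern. A minimal sketch of applying such rules is an ordered list of compiled regular expressions in which the first match wins, so more specific rules must come first. The patterns below are simplified stand-ins written for this example (the original expressions lost their punctuation in extraction), and the `key=value` query-string layout is an assumption.

```python
import re

# Ordered (category, pattern) rules; hypothetical simplifications of the slide's rules.
RULES = [
    ("D_DIAGNOSE", re.compile(r"/doia/diagnose\.asp\?.*zugr=d.*diagnr=\d+")),
    ("P_DIAGNOSE", re.compile(r"/doia/diagnose\.asp\?.*zugr=p.*diagnr=\d+")),
    ("SEARCH", re.compile(r"/doia/abrowser\.asp\?.*type=search")),
    ("D_ALPH2", re.compile(r"/doia/abrowser\.asp\?.*beginswith=[A-Z]")),
    ("D_ALPH1", re.compile(r"/doia/abrowser\.asp\?")),
    ("DOIA", re.compile(r"/doia/mainmenu\.asp\?.*zugr=d")),
]

def categorize(url):
    """Return the first category whose pattern matches the URL, else OTHER."""
    for category, pattern in RULES:
        if pattern.search(url):
            return category
    return "OTHER"

def categorize_session(urls):
    """Map a session's URLs to a sequence of content categories."""
    return [categorize(u) for u in urls]

session = [
    "/doia/mainmenu.asp?zugr=d&lang=e",
    "/doia/abrowser.asp?zugr=d&lang=e",
    "/doia/abrowser.asp?zugr=d&lang=e&beginswith=A",
    "/doia/diagnose.asp?zugr=d&lang=e&diagnr=757370",
]
print(categorize_session(session))
```

Mining then runs on the category sequences instead of the raw URLs, which is what makes patterns comparable across sessions.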
42Data preprocessing (1): Semantic enrichment by country and culture information
Web server log
- 200.x4.xx.xx - - [09/Apr/2002:22:28:35 +0200] "GET /cgi-bin/ivw/CP/doia/image.asp.ivw?zugr=d&lang=e&cd=14&nr=87&diagnr=757370 HTTP/1.0" 200 735 "http://www.theEHealthPortal.net/doia/image.asp?zugr=d&lang=e&cd=14&nr=87&diagnr=757370" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"
- 200.x4.xx.xx → IP address
- doia/ ... diagnr=757370 → requested page (also search mode)
- etc.
IP address → Localization → Culture
IP_ADDRE        COUNTRY  CITY     LATITUDE  LONGITUD  TIMEZONE  CERTAINT
8x.7x.6x.xxx    Albania  Tirane   41.3330   19.8330   0100      90
1x3.1x4.xx.xxx  Algeria  Algiers  36.7630   3.0510    0100      80
Data: > 3 million page requests, > 200,000 sessions
43Data preprocessing (2): Semantic enrichment by content category
- HOME www\.theEHealthPortal\.net\/
- HOME theEHealthPorta\.anotherEntryPage\.de
- DOIA \/doia/mainmenu\.asp\?zugrdlangdesp
- PEDOIA \/doia/mainmenu\.asp\?zugrplangdesp
- D_ALPH1 \/doia/abrowser\.asp\?zugrdlangdesp
- D_ALPH2 \/doia/abrowser\.asp\?zugrdlangdespbeginswithA-Z
- D_ALPH2 \/doia/abrowser\.asp\?zugrdlangdespbeginswithA-Zsize0-9
- D_LOKAL1 \/doia/dbrowser\.asp\?zugrdlangdespbenrA-Z
- D_LOKAL2 \/doia/dbrowser\.asp\?zugrdlangdespbenrA-Z_0-9
- D_LOKAL3 \/doia/dbrowser\.asp\?zugrdlangdespbenrA-Z_0-9_0-9
- D_LOKAL4 \/doia/dbrowser\.asp\?zugrdlangdespbenrA-Z_0-9_0-9_0-9
- SEARCH \/doia/abrowser\.asp\?zugrdlangdespbeginswithA-Za-zA-Za-zA-Za-ztypesearch.
- SEARCH \/doia/abrowser\.asp\?zugrplangdespbeginswithA-Za-zA-Za-zA-Za-ztypesearch.
- SEARCH \/doia/diagalphabrowser\.asp\?zugrdplangdesptypesearchbeginswith.
- D_DIAGNOSE \/doia/diagnose\.asp\?zugrdlangdeps.diagnr.
- P_DIAGNOSE \/doia/diagnose\.asp\?zugrplangdeps.diagnr.
- P_DIAGNOSE \/doia/diagnose\.asp\?langdepszugrpdiagnr.
(and so on)
44Sequence analysis: Presence (or not) of linear navigation patterns
- Express these patterns as path templates in the WUM query language
- Use sequence mining to detect support and confidence in the log partitions
45A linear pattern (semantically enriched)
In the original data, this is just a chain
46linear
Another linear pattern (semantically enriched)
47Unzooming the same pattern
marks diagnoses
48Ditto, in the original
The same pattern (not semantically enriched)
49Non-linear
A non-linear pattern (semantically enriched)
This pattern is just a chain in the
non-semantically enriched version!
50Search behaviour: sample results
UA = Uncertainty Avoidance, Cont = Context Specificity, LTO = Long-Term Orientation, PD = Power Distance
- Which search options were used?
- all results significant (p < 0.001)
Figure: use of content-organized links and of the search engine by groups high (H) and low (L) on UA, Cont, LTO and PD; expected and unexpected results are marked.
51Search behaviour: summary
Columns: SEARCH ENGINE (UA, CONT, LTO, PD) | CONTENT-ORGANIZED LINKS (UA, CONT, LTO, PD)
Our Hypothesis: L L L L | H H H H
1. Which search options were used? H L L L | L H H H
2. In which combination? H L L L | L H H H
3. In which order? H L H H
4. Number of page requests prior to access? H H H H
5. Frequency of use? H L L H
6. Relative frequency of use? L H H H
Legend: expected results; unexpected results; difference between the groups is not significant
52Navigation behaviour: results
- all results are significant
- all results show a positive correlation between cultural dimension and navigation behaviour, as predicted
53How to explain the unexpected results of the Uncertainty-Avoidance dimension (which point in
the reverse direction)?
High Uncertainty Avoidance is characterized by:
- Preference for limited choices
- High UA users of search engines: lower use of content-organized links, higher use of search engines (obtained, but unexpected result)
- Need for extensive collection of information
- High UA users of all search options: higher number of page requests
54Agenda
Web Mining
(Semantic) Web
55Using results for personalization
Kralisch, Eisend, Berendt, Proc. HCI
International, 2005
57New results (questionnaires & log file analysis) suggest that a different form of country grouping
may be a better predictor
Classical approach: CULTURAL DIMENSIONS → INFORMATION SEEKING BEHAVIOUR (here: no significant correlations)
Our new (explorative) approach: Analyses based on GEOGRAPHICAL REGIONS
Kralisch & Berendt, Proc. GOR, 2005
58Using the results for evaluation, site improvement, and personalization
- Mining for the evaluation of sites and services
- Not-for-profit sites
- Multi-channel user contact
- Privacy attitudes and behaviour
- Differences in cognitive styles and abilities
- Internationalisation / localisation
- Behavioural patterns, user groups → recommend (& evaluate) changes in
- page design
- navigation design
- domain ontology
Spiliopoulou & Berendt, Handb. Data Mining im Marketing, 2000; Teltzrow & Berendt, Proc.
WebKDD 2003; Berendt, Günther, Spiekermann, Communications of the ACM, in press; Berendt &
Brenstein, Behavior Research Methods, Instruments, & Computers, 2001; Berendt &
Spiliopoulou, VLDB Journal, 2000; Kralisch, Eisend, Berendt, Proc. HCI International,
2005; Mobasher, Anand, Berendt, Hotho (Eds.), Proc. AAAI WS Semantic Web Personalization 2004
59Agenda
Web Mining
- ...
- <BIBLIOGRAPHY><FLOAT><PAGENUMBER>136</PAGENUMBER></FLOAT>
- <HEAD>Literaturverzeichnis</HEAD>
- <CITATION WORKTYPE="journal" PUBLISHED="PUBLISHED"><CUT ID="bib-15-">1 </CUT><WORKAUTHOR>Agarwal, R. Krueger, B. P. Scholes, G. D. Yang, M. Yom, J. Mets, L. Fleming, G. R.</WORKAUTHOR>U<ARTICLETITLE>ltrafast energy transfer in LHC-II revealed by three-pulse photon echo peak shift measurements</ARTICLETITLE>, <WORKTITLE>J. Phys. Chem. B</WORKTITLE>, <PUBDATE>2000</PUBDATE>, <NUMBER>104</NUMBER>, <PAGES>2908</PAGES>,
- </CITATION>
- ...
Semantic Web
60Authoring support for document servers
- Surveys & Web usage mining analysis of a digital publishing service showed:
- Metadata creation is one of the main barriers to contribution.
- Reasons include deficiencies in
- information flow
- understanding and use of structured search
- education in structured writing
- HCI aspects
→ Marketing
Berendt, Brenstein, Li, Wendland, Proc. ETD 2003; Berendt, Proc. AAAI Spring Symposium KCVC,
2005
61Authoring support for document servers
Berendt, Proc. AAAI Spring Symposium KCVC, 2005
63Authoring support for document servers
Web service
further text, link, usage mining
Berendt, Proc. AAAI Spring Symposium KCVC, 2005
64Outlook
- Combine mining & experimental methodology
- Educational portal: Berendt & Brenstein, BRMIC 2001
- eHealth portal: Kralisch & Berendt, GOR 2005
- Learning / supplying / integrating behavioural
- ontologies
- knowledge bases
- Patterns over time: pattern monitoring, streams, Web dynamics
- Authoring support
- Personalisation
- Mining and metacognition / reflexivity
65Thank you for your attention!
66Textual representation of sequence and repetition
information
WUM
67Mining → Web, Approach 2: Using results for personalization
Kralisch, Eisend, Berendt, Proc. HCI International, 2005
682. Semantics of sequences / complex application events. Approach 1: Sequences of atomic
application events
- select t
- from node a b, template a b as t
- where a.url startswith "SEITE1-"
- and a.occurrence = 1
- and b.url contains "1SCHULE"
- and b.occurrence = 1
- and (b.support / a.support) > 0.2
Berendt & Spiliopoulou, VLDB Journal, 2000
69Framework of the STRATDYN tool
- Web usage analysis tools can be classified along 3 dimensions:
- Single sessions vs. mass data
- Content vs. structure of navigation
- Requests viewed as set / sequence / graph
- Goal: Integrated analysis along these dimensions
- → Visual DM → relationships visual variables / perceived semantics
70Sessionisation-heuristics evaluation shows:
Simple temporal heuristics are best (most of the time)
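One simple temporal heuristic of the kind referred to here splits a visitor's requests into sessions whenever the gap between two consecutive requests exceeds a threshold. The sketch below assumes a 30-minute threshold (a commonly used value, not one stated on the slide) and identifies visitors by IP address only; the request data are made up.

```python
from collections import defaultdict

def sessionize(requests, max_gap_seconds=1800):
    """Split requests into sessions per visitor using a maximum page-stay gap.

    requests: iterable of (ip, timestamp_in_seconds, url) tuples in any order.
    Returns a list of (ip, [urls]) sessions.
    """
    by_visitor = defaultdict(list)
    for ip, timestamp, url in requests:
        by_visitor[ip].append((timestamp, url))

    sessions = []
    for ip, hits in by_visitor.items():
        hits.sort()
        current = [hits[0]]
        for previous, hit in zip(hits, hits[1:]):
            if hit[0] - previous[0] > max_gap_seconds:
                # Gap too long: close the current session and start a new one.
                sessions.append((ip, [url for _, url in current]))
                current = []
            current.append(hit)
        sessions.append((ip, [url for _, url in current]))
    return sessions

# Hypothetical requests: visitor "a" returns after a long pause.
requests = [
    ("a", 0, "/x"),
    ("a", 100, "/y"),
    ("b", 50, "/x"),
    ("a", 5000, "/z"),
]
sessions = sessionize(requests)
print(sessions)
```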
71Web metrics and privacy: The SIMT framework
72Further research and activities
- Evaluation of mining algorithms
- Data preparation; Web metrics & privacy
- Visuo-spatial cognition, educational software
- Director of the educational portal www.schulweb.de (1999-2001)
- Projects with
- the educational portal www.eduserver.de
- the digital publishing portal edoc.hu-berlin.de
- the dermatological portal www.dermis.net
- EU 5FP Network of Excellence KDNet (2002-2004)
- Web mining workshops and tutorials (since 2001)
- ECML/PKDD, AAAI, KDD, IJCAI, ...
Spiliopoulou, Mobasher, Berendt, Nakagawa, INFORMS Journal on Computing, 2003; Teltzrow,
Preibusch, Berendt, Proc. IEEE Conf. E-Commerce, 2004; Berendt & Brenstein, Behavior
Research Methods, Instruments, & Computers, 2001; Jansen-Osmann & Berendt, Environment and
Behavior, 2002; Jansen-Osmann & Berendt, Quarterly Journal of Experimental Psychology, in
press; Ritter, Berendt, Fischer, Richter, Preim, Proc. Mensch & Computer, 2002
73Simulation of single sessions
STRATDYN
75Zooming → from structure to content
ISM
76The ISM tool
- sessionize log
- calculate session measures for within-session or between-sample differentiation
- different (un)zooming options
77Cultural Dimensions
Hofstede: Power Distance, Collectivism, Uncertainty Avoidance, Long-Term Orientation, Masculinity
Hall: Mono- vs. Polychronic, Context specificity, ...
How people deal with
... other authors ... more dimensions
78Cultural dimensions: an example