Title: Kein Folientitel


1
Semantic Web Usage Mining: Overview and Case
Studies
Bettina Berendt
Humboldt University Berlin, Institute of
Information Systems
www.wiwi.hu-berlin.de/berendt
2
Goals and top-level questions
  • Make the world's knowledge available to the world
  • How do people discover knowledge on the Web?
  • How can more knowledge sources contribute to the
    Web?

3
Approaches to the current Web's biggest
challenges: lots of data, human-understandable
Web Mining extracts implicit knowledge
The Semantic Web makes knowledge
machine-understandable
Berendt, Hotho, Stumme, Proc. ISWC 2002
Berendt, Mladenic, et al. (Eds.), From Web
to Semantic Web, Springer LNAI 2004
Berendt, Grobelnik, Mladenic et al. (Eds.), Semantics,
Web, and Mining, Springer LNAI 2006
4
Agenda
Web Mining
Why?
5
1. What should I buy?
6
2. Where do I find relevant information on ...?
7
3. What do people do there?
Name
8
4. How can a site be made usable for a
worldwide audience?
9
5a. Why go to a shop ...
  • ... if everything is available on the Internet?

10
5b. What is my site worth for my business?
11
6. How to help people become active members of
the knowledge society: how to help them contribute
content?
12
Agenda
Web Mining
How?
13
Web Mining
  • Knowledge discovery (aka data mining):
  • "the non-trivial process of identifying valid,
    novel, potentially useful, and ultimately
    understandable patterns in data." [1]
  • Web Mining:
  • the application of data mining techniques on the
    content, (hyperlink) structure, and usage of Web
    resources.

Web mining areas: Web content mining
[1] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P.,
Uthurusamy, R. (Eds.) (1996). Advances in
Knowledge Discovery and Data Mining. Boston, MA:
AAAI/MIT Press.
14
Data analysis: the textbook version
  • The meaning of attributes is clear
  • The meaning of attribute values is clear
  • → Data modelling can be applied directly (e.g.,
    regression, classification, clustering,
    association-rule discovery)

(A simplified extract from the adult dataset in
the UCI machine learning repository)
15
Data analysis: the reality → data mining /
knowledge discovery process
  • ...
  • p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:03:51 +0100] "GET /search.html?t=jane%20austen&SID=023785&ord=asc HTTP/1.0" 200 1759
  • p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:05:06 +0100] "GET /search.html?t=jane%20austen&m=video&SID=023785&ord=desc HTTP/1.0" 200 8450
  • p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:06:41 +0100] "GET /view.asp?id=3456&SID=023785 HTTP/1.0" 200 3478
  • ...
  • What is the meaning of the attributes?
  • What is the meaning of the attribute values?
  • → Data modelling is only one part!

CRISP-DM
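To make the gap between raw log lines and "attributes" concrete, here is a minimal Python sketch that splits one request into fields. It assumes the log is in Common Log Format and that t, SID and ord are ordinary query parameters; the slide's punctuation was lost in extraction, so both points are assumptions. The function name parse_request is made up for this illustration.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Common Log Format: host ident user [timestamp] "request" status bytes
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_request(line):
    """Return a dict of attributes for one log line, or None if it does not match."""
    match = CLF.match(line)
    if match is None:
        return None
    record = match.groupdict()
    parts = urlsplit(record["url"])
    record["path"] = parts.path
    # parse_qs decodes %20 and returns a list per parameter; keep the first value
    record["params"] = {k: v[0] for k, v in parse_qs(parts.query).items()}
    return record

# Assumed reconstruction of the first request shown on the slide
line = ('p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:03:51 +0100] '
        '"GET /search.html?t=jane%20austen&SID=023785&ord=asc HTTP/1.0" 200 1759')
rec = parse_request(line)
print(rec["path"], rec["params"])
```

Even after parsing, the fields are only syntax: that t is a search term, SID a session identifier and ord a sort order still has to come from knowledge about the site.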
16
Where does semantics come in?
Semantics
17
Agenda
Semantic Web
How?
18
What is an ontology?
  • Definition: Core ontology with axioms
  • a structure $O := (C, \leq_C, R, \sigma, \leq_R, A)$
    consisting of
  • two disjoint sets $C$ (concept identifiers) and $R$
    (relation identifiers)
  • a partial order $\leq_C$ on $C$ (concept hierarchy or
    taxonomy)
  • a function $\sigma: R \to C^+$ (signature), where $C^+$ is
    the set of all finite tuples of elements in $C$
  • a partial order $\leq_R$ on $R$ (relation hierarchy),
    where
  • $r_1 \leq_R r_2$ implies $|\sigma(r_1)| = |\sigma(r_2)|$ and
  • $\pi_i(\sigma(r_1)) \leq_C \pi_i(\sigma(r_2))$ for all $1 \leq i \leq |\sigma(r_1)|$,
    with $\pi_i$ the projection on the $i$-th component
  • a set $A$ of axioms in a logical language $L$

"an explicit specification of a shared
conceptualisation" (Gruber, 1993)
Stumme, Hotho, Berendt, Journal of Web
Semantics, 2006, and sources there
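The definition can be read as a data structure plus a consistency check. The following Python sketch is one possible encoding, not the authors' implementation. The class and method names are invented here, the axiom set A is omitted, and the orders are given as direct edges without verifying that they form partial orders. The concept names come from the portal ontology in this talk; the relation INVOLVEDIN and the link RESEARCHER ≤ PERSON are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CoreOntology:
    concepts: set                 # C: concept identifiers
    relations: set                # R: relation identifiers
    concept_edges: set            # pairs (c1, c2) with c1 <=_C c2 (direct links)
    signature: dict               # sigma: relation -> tuple of concepts
    relation_edges: set = field(default_factory=set)  # pairs (r1, r2), r1 <=_R r2

    def leq_c(self, c1, c2):
        """True if c1 <=_C c2 in the reflexive-transitive closure of concept_edges."""
        seen, stack = {c1}, [c1]
        while stack:
            current = stack.pop()
            if current == c2:
                return True
            for lower, upper in self.concept_edges:
                if lower == current and upper not in seen:
                    seen.add(upper)
                    stack.append(upper)
        return False

    def is_consistent(self):
        """Check disjointness, signatures, and the relation-hierarchy condition."""
        if self.concepts & self.relations:
            return False
        if set(self.signature) != self.relations:
            return False
        if any(c not in self.concepts
               for sig in self.signature.values() for c in sig):
            return False
        for r1, r2 in self.relation_edges:
            s1, s2 = self.signature[r1], self.signature[r2]
            if len(s1) != len(s2):          # |sigma(r1)| = |sigma(r2)|
                return False
            if not all(self.leq_c(a, b) for a, b in zip(s1, s2)):
                return False                # pi_i(sigma(r1)) <=_C pi_i(sigma(r2))
        return True


onto = CoreOntology(
    concepts={"PERSON", "RESEARCHER", "PROJECT"},
    relations={"WORKSATPROJECT", "INVOLVEDIN"},
    concept_edges={("RESEARCHER", "PERSON")},
    signature={
        "WORKSATPROJECT": ("RESEARCHER", "PROJECT"),
        "INVOLVEDIN": ("PERSON", "PROJECT"),
    },
    relation_edges={("WORKSATPROJECT", "INVOLVEDIN")},
)
print(onto.is_consistent())
```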
19
Agenda
20
Semantics of requests, Step 1: Domain ontology
  • community portal ka2portal.aifb.uni-karlsruhe.de
  • ontology-based
  • Knowledge base in F-Logic
  • Static pages: annotations
  • Dynamic pages: generated from queries
  • Queries also in F-Logic
  • Logs contain these queries

Oberle, Berendt, Hotho, Gonzalez, Proc. AWIC
2003
21
Semantics of requests, Step 2: Modelling
requests and sessions-as-sets
  • RESEARCHER
  • PERSON
  • PROJECT
  • PUBLICATION
  • RESEARCHTOPIC
  • EVENT
  • ORGANIZATION
  • RESEARCHINTEREST
  • LASTNAME
  • TITLE
  • ISABOUT
  • EVENTS
  • EVENTTITLE
  • WORKSATPROJECT
  • AUTHOR
  • AFFILIATION
  • ISWORKEDONBY
  • PROGRAMCOMMITTEE
  • EMPLOYS

An example query with concepts and relations:
FORALL N,PEOPLE <- PEOPLE:Employee[affiliation->>"http://www.anInstitute.org"] and PEOPLE:Person[lastName->>N].
Query: feature vector of concepts and
relations → Session: feature vector of
concepts and relations, summed over all queries in
the session
Clustering, association rules, classification, ...
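The step from queries to session vectors can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used in the cited study: the vocabulary is a small hand-picked subset, the tokenizer is deliberately crude (it strips string literals and matches names case-insensitively), and the second query is invented.

```python
import re
from collections import Counter

# Assumed, fixed order of ontology entities that spans the feature space
VOCABULARY = ("PERSON", "EMPLOYEE", "PROJECT", "PUBLICATION",
              "AFFILIATION", "LASTNAME", "TITLE")

def query_vector(query):
    """Count occurrences of known concepts and relations in one query."""
    text = re.sub(r'"[^"]*"', " ", query)        # drop string literals such as URLs
    tokens = (t.upper() for t in re.findall(r"[A-Za-z]+", text))
    return Counter(t for t in tokens if t in VOCABULARY)

def session_vector(queries):
    """Sum the query vectors of a session and return them in vocabulary order."""
    total = Counter()
    for query in queries:
        total.update(query_vector(query))
    return [total[name] for name in VOCABULARY]

session = [
    'FORALL N,PEOPLE <- PEOPLE:Employee[affiliation->>"http://www.anInstitute.org"] '
    'and PEOPLE:Person[lastName->>N].',
    'FORALL T <- P:Publication[title->>T].',     # hypothetical second query
]
print(session_vector(session))
```

Vectors of this kind can then be fed to any standard clustering, association-rule or classification method.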
22
Semantics of sequences, Step 3: Strategy pattern
discovery
  • An ontology of navigation strategies
  • Define strategy templates as regular expressions
  • Of requests (mapped to ontological entities)
  • Of transitions (between ontological entities)
  • Ex. .search . individual
  • Discover strategies by learning a strategy trie

affiliationSearch, 629
topicSearch, 312
...
...
repetition, 402
refinement, 113
...
individual, 112
repetition, 295
...
Berendt & Spiliopoulou, VLDB Journal, 2000
Berendt, Data Mining and Knowledge
Discovery, 2002
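A rough Python sketch of the two ideas on this slide: a strategy template written as a regular expression over requests that have already been mapped to ontological entities, and a trie that counts how often each path prefix occurs. The labels are taken from the slide; the sessions, the template and the resulting counts are made up, and the code is not the WUM implementation.

```python
import re

def matches(template, session):
    """Test a regular-expression template against a session of entity labels."""
    return re.fullmatch(template, " ".join(session)) is not None

class TrieNode:
    def __init__(self):
        self.count = 0        # number of sessions passing through this node
        self.children = {}    # label -> TrieNode

def build_trie(sessions):
    """Aggregate sessions into a prefix tree with visit counts."""
    root = TrieNode()
    for session in sessions:
        node = root
        for label in session:
            node = node.children.setdefault(label, TrieNode())
            node.count += 1
    return root

# Hypothetical sessions, already mapped from URLs to entity labels
sessions = [
    ["topicSearch", "refinement", "individual"],
    ["topicSearch", "repetition"],
    ["affiliationSearch", "repetition", "individual"],
    ["topicSearch", "refinement"],
]

# Template: any requests, a topic search, any requests, ending at an individual page
template = r"(\S+ )*topicSearch (\S+ )*individual"
print([matches(template, s) for s in sessions])

trie = build_trie(sessions)
print(trie.children["topicSearch"].count)
```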
23
NB: For more exploratory analyses: the Web Usage
Miner (WUM)
  • select t
  • from node a b, template a b as t
  • where a.url startswith "SEITE1-"
  • and a.occurrence = 1
  • and b.url contains "1SCHULE"
  • and b.occurrence = 1
  • and (b.support / a.support) > 0.2

Spiliopoulou, 1999
Berendt & Spiliopoulou, VLDB Journal, 2000
24
Semantics of sequences, Step 4: Strategy pattern
evaluation
  • Use the statistics of strategy patterns to
  • Derive descriptive measures of patterns:
  • support, confidence
  • popularity, effectiveness, efficiency
  • Apply inferential statistics to compare patterns

Berendt, Data Mining and Knowledge Discovery,
2002
25
Communication: Visual data mining, Step 5:
Mapping an ontological relation over concepts
to a linear order and to visual variables
Concreteness
Goal Individual page
Reach goal
Refine search
More constraints on search
First search page
Remain unspecific
Abandon search
Time
26
Ad Q.3: What do people do there?
27
Communication: Visual data mining, Step 5:
Example
Berendt, Data Mining and Knowledge Discovery,
2002; Berendt, Postproc. WebKDD 2001
28
An online shop with a difference
Berendt, Günther, Spiekermann, Communications
of the ACM, 2005
29
Communication: Visual data mining, Step 6: Visual
abstraction → new semantic patterns
Closeness to product
Shopping for cameras
Shopping for jackets
Berendt, Data Mining and Knowledge Discovery,
2002; Berendt, Postproc. WebKDD 2002
30
Ad Q.4: Worldwide usability
31
The impact of language and domain knowledge on
search option choice
  • 2 studies on the use of search options in the
    eHealth site
  • Webserver log: 3,928,235 requests / 277,809
    sessions from 188 countries
  • 83.2% first-language users, 16.8%
    second-language users
  • Webserver log + questionnaire: 165 (106) people
    from 34 countries
  • 84.9% first-language users, 15.1% second-language
    users
  • 10.4% physicians, 89.6% patients
  • Results:
  • Search engine, alphabetical search: in particular
    first-language users, physicians
  • Content-organized search: in particular
    second-language patients
  • → Domain knowledge compensates for limited language
    knowledge.

Kralisch & Berendt, New Review of Hypermedia and
Multimedia, 2005
32
Semantics: Service ontology
33
Results on frequent search patterns
Alphabetical search: hub-and-spoke → only
linguistic relations (6.4)
Diagnoses are "hubs" for navigation (5.3, 4)
Localization search: linear / depth-first →
search refinement, medical knowledge (5)
Berendt, Postproc. WebKDD 2005
34
Mining with ISOVIS: Semantic drill-down,
visualizing detail and context
Berendt, Postproc. WebKDD 2005
35
Ad Q.5: Shopping behaviour and Web site value
36
5. What is my site worth for my business?
  • A site is often only a part of a distribution
    strategy / one channel to reach customers.
  • What are the conversion rates (how many visitors
    become buyers etc.)?
  • What are the cross-channel effects?

Internet market shares (BCG 2002)
37
Semantics: The buying process as a service
ontology
38
Mining (example): Association rules for
investigating preferences in the buying process
  • A study based on 100K sessions and 13K transactions
    from 2002 at a leading European retailer of
    consumer electronics showed, among other things:
  • Online payment → Direct delivery (s=0.27, c=0.97):
    < 1/3 traditional online users!
  • Online payment → In-store pickup (s=0.02, c=0.03)
  • Cash on delivery → Direct delivery (s=0.02,
    c=0.03)
  • In-store payment → In-store pickup (s=0.69,
    c=0.94)
  • → Site is primarily used for information search.
  • → Key performance indicators (Web metrics),
    e.g.
  • conversion efficiency
  • offline conversion
  • effectiveness and efficiency of search options

Berendt & Spiliopoulou, VLDB Journal, 2000
Berendt, Data Mining and Knowl. Discovery, 2002
Teltzrow & Berendt, Proc. WebKDD 2003
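The s and c values above are the support and confidence of association rules. As a reminder of how they are computed, here is a small Python sketch on invented toy transactions. The item names and numbers are illustrative only and do not reproduce the retailer's data.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = share of all transactions containing antecedent and consequent
    confidence = share of transactions with the antecedent that also
                 contain the consequent
    """
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_ab = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_ab / n if n else 0.0
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence

# Ten hypothetical transactions, each a set of chosen options
transactions = (
    [{"online_payment", "direct_delivery"}] * 3
    + [{"in_store_payment", "in_store_pickup"}] * 6
    + [{"in_store_payment", "direct_delivery"}]
)

s, c = rule_stats(transactions, {"online_payment"}, {"direct_delivery"})
print(f"s={s:.2f}, c={c:.2f}")
```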
39
Agenda
Web Mining
(Semantic) Web
40
Step 6: Deployment of results. Example 1: Using
results for site improvement
Name
City
Name
  • Path analysis, metrics, and χ² analysis showed:
  • All search criteria were approx. equally
    effective
  • Location-based search was most popular
  • City-based search was most efficient ... but
    least popular
  • → Modify site design to make efficient search
    more popular

Berendt & Spiliopoulou, VLDB Journal, 2000
Berendt, Data Mining and Knowl. Discovery, 2002
Spiliopoulou & Pohle, DMKD, 2001
41
Step 6: Deployment of results. Example 2: Using
results for personalization
Kralisch, Eisend, Berendt, Proc. HCI
International, 2005
42
Step 6: Deployment of results. Example 3: A
privacy-preserving Web-metrics analysis service
Teltzrow, Preibusch, Berendt, IEEE EC Conf.
2004
43
Agenda
Web Mining
... <BIBLIOGRAPHY><FLOAT><PAGENUMBER>136</PAGENUMBER></FLOAT> <HEAD>Literaturverzeichnis</HEAD>
<CITATION WORKTYPE="journal" PUBLISHED="PUBLISHED"><CUT ID="bib-15-">1 </CUT><WORKAUTHOR>Agarwal,
R. Krueger, B. P. Scholes, G. D. Yang, M.
Yom, J. Mets, L. Fleming, G. R.</WORKAUTHOR>U<ARTICLETITLE>ltrafast energy transfer in LHC-II
revealed by three-pulse photon echo peak shift
measurements</ARTICLETITLE>, <WORKTITLE>J. Phys.
Chem. B</WORKTITLE>, <PUBDATE>2000</PUBDATE>,
<NUMBER>104</NUMBER>, <PAGES>2908</PAGES>,
</CITATION> ...
Semantic Web
44
Data and metadata in the Digital Library EDOC
  • <BIBLIOGRAPHY><FLOAT><PAGENUMBER>136</PAGENUMBER></FLOAT>
  • <HEAD>Literaturverzeichnis</HEAD>
  • ...
  • <CITATION WORKTYPE="journal" PUBLISHED="PUBLISHED">
  • <CUT ID="bib-45-">2 </CUT><WORKAUTHOR>Albrecht,
    T. F. Bott, K. Meier, T. Schulze, A. Koch,
    M. Cundiff, S. T. Feldmann, J. Stolz, W.
    Thomas, P. Koch, S. W. Göbel E.
    O.</WORKAUTHOR> <ARTICLETITLE>Disorder mediated
    biexcitonic beats in semiconductor quantum
    wells</ARTICLETITLE>, <WORKTITLE>Phys. Rev.
    B</WORKTITLE>, <PUBDATE>1996</PUBDATE>,
    <NUMBER>54</NUMBER>, <PAGES>4436</PAGES>,
  • </CITATION> ...
  • (http://edoc.hu-berlin.de/diml/dtd/xdiml.dtd)
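Once references are marked up this way, the metadata can be read mechanically. A minimal Python sketch using only the standard library follows. The sample is a shortened, well-formed fragment modelled on the slide (the author list is abbreviated here), and the function name citations is an assumption of this example rather than part of any EDOC tool.

```python
import xml.etree.ElementTree as ET

# Shortened fragment modelled on the slide; author list abbreviated for the example
SAMPLE = """<BIBLIOGRAPHY>
  <HEAD>Literaturverzeichnis</HEAD>
  <CITATION WORKTYPE="journal" PUBLISHED="PUBLISHED">
    <CUT ID="bib-45-">2 </CUT>
    <WORKAUTHOR>Albrecht, T. F. et al.</WORKAUTHOR>
    <ARTICLETITLE>Disorder mediated biexcitonic beats in semiconductor quantum wells</ARTICLETITLE>,
    <WORKTITLE>Phys. Rev. B</WORKTITLE>, <PUBDATE>1996</PUBDATE>,
    <NUMBER>54</NUMBER>, <PAGES>4436</PAGES>,
  </CITATION>
</BIBLIOGRAPHY>"""

FIELDS = ("WORKAUTHOR", "ARTICLETITLE", "WORKTITLE", "PUBDATE", "NUMBER", "PAGES")

def citations(xml_text):
    """Return one dict per CITATION element with its type and text fields."""
    root = ET.fromstring(xml_text)
    records = []
    for cit in root.iter("CITATION"):
        record = {"type": cit.get("WORKTYPE")}
        for tag in FIELDS:
            # findtext returns None when a child element is missing
            record[tag.lower()] = (cit.findtext(tag) or "").strip()
        records.append(record)
    return records

refs = citations(SAMPLE)
print(refs[0]["worktitle"], refs[0]["pubdate"])
```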

45
Authoring support for document servers
  • Surveys and a Web usage mining analysis of a digital
    publishing service showed:
  • Metadata creation is one of the main barriers to
    contribution.
  • Reasons include deficiencies in
  • information flow
  • understanding and use of structured search
  • education in structured writing
  • HCI aspects

→ Marketing
→ Education
Berendt, Brenstein, Li, Wendland, Proc. ETD 2003
Berendt, Proc. AAAI Spring Symposium KCVC,
2005
46
... and this has consequences (problems of the
fully manual approach)
  • <BIBLIOGRAPHY><FLOAT><PAGENUMBER>136</PAGENUMBER></FLOAT>
  • <HEAD>Literaturverzeichnis</HEAD>
  • <CITATION WORKTYPE="journal" PUBLISHED="PUBLISHED">
  • <CUT ID="bib-15-">1 </CUT><WORKAUTHOR>Agarwal,
    R. Krueger, B. P. Scholes, G. D. Yang, M.
    Yom, J. Mets, L. Fleming, G. R.</WORKAUTHOR>U<ARTICLETITLE>ltrafast energy transfer in LHC-II
    revealed by three-pulse photon echo peak shift
    measurements</ARTICLETITLE>, <WORKTITLE>J. Phys.
    Chem. B</WORKTITLE>, <PUBDATE>2000</PUBDATE>,
    <NUMBER>104</NUMBER>, <PAGES>2908</PAGES>,
  • </CITATION>
  • ...

47
The fully automatic approach
48
Why is this a problem?
Cardona & Marx, Physik Journal, 2004
Berendt, in Neues Handbuch Hochschullehre, 2003
49
  • Build a tool that is
  • user-friendly
  • intelligent
  • modular and extensible

50
Berendt, Dingel, Hanser, Proc. ECDL 2006
51
IR-THESIS System architecture
Text mining / Information Extraction tools
Web services
Databases (local a/o mirrored)
Web services
VBA macro
other WS and info. sources
52
(No Transcript)
53
Search and retrieval
54
(No Transcript)
55
(No Transcript)
56
Organisation of the literature / bibliography
construction
57
(No Transcript)
58
Discussion
59
(No Transcript)
60
Writing
61
Conclusions and outlook
  • Semantics are often necessary to do mining at all
  • Semantics often allow the analyst to make more
    sense of the results
  • Semantic Web Mining is semi-automatic →
    interactive tools!
  • Standardisation can make the mining process more
    automatic
  • Mining can help to generate semantics
  • To what extent is further user and context
    modelling useful a/o necessary for valid
    conclusions (intentions, goals, constraints, ...)?
  • How can we encourage standards?
  • When are explicit (formal) semantics better, when
    implicit semantics?
  • How can we move beyond the Web (ubiquitous
    environments)?
  • How can privacy be protected in a data-rich and
    mining-rich world? (Are privacy semantics à la
    P3P a solution?)
  • What do users want? What about other
    stakeholders? Whom and what and how to ask?

62
Thank you for your attention!
63
Discussion point 1: Is reference markup
ontological / Semantic Web?
  • DiML (Dissertation Markup Language), used in the
    case study above, is structured approximately
    like BibTeX (with the difference that the type of
    publication is an attribute, so there is only one
    top-level concept, "citation"). This makes it
    comparable also to Dublin Core. The system in
    its latest versions also contains mappings to DC
    and other commonly used schemata.
  • This makes it indeed an extremely primitive
    ontology (essentially, a concept hierarchy with
    one concept, "publication", with attributes with
    literals as value range: author, title, etc.).
  • Extensions to make this really semantic include
    (some are part of our current work):
  • Author, affiliation, etc. as concepts with
    instances, as in Repec.org → introduces relations
    like is-author-of
  • Unique identifiers of publications that allow the
    detection of duplicates, as in Citeseer
  • Links to libraries, as in OpenURL
  • Versioning and other interesting relations
    between different publications (cf. the Dublin
    Core element "relation")

64
Discussion point 2: Can folksonomies be used
instead of ontologies? (1)
  • This is a difficult question, not least because
    it is still unclear what exactly tags are:
  • an object-level summary and thus more content, or
  • a truly meta-level classification which comes
    from a set of labels that is categorically
    different from just more content words?
  • In the following, I use the second
    interpretation. I refer to folksonomy tags as
    "concepts" because a folksonomy can formally be
    regarded as an extremely simple ontology: a set
    of concepts with no hierarchical or other
    relations between them.
  • The answer to the question in the title of this
    slide depends on the aspect of folksonomies one
    is most interested in, and on how important one
    considers certain properties of ontologies to be.

65
Discussion point 2: Can folksonomies be used
instead of ontologies? (2)
  • The answer tends to be YES when one focuses on
  • WHO DEFINES THE CONCEPTS
  • All ontologies used in the case studies shown
    were based on or extended popular models and/or
    ontologies in the domain of investigation:
  • search in the educational portal: models of
    information search from information science
  • shopping: models of the customer buying process
    from marketing
  • shopping with bot assistance: the same, plus our
    design of questions, developed in conjunction
    with a major German retailer
  • search in the medical portal: like search in the
    educational portal, plus the medical ICD-9, the
    International Classification of Diseases
    DiML/DC).
  • But in fact, none of the ontologies used in the
    case studies here was a "standard" in the sense
    that many people agree on it and many
    applications use it. There are precious
    few such standard behaviour models!
  • In that sense, the ontologies used here are, like
    much of the Semantic Web work, just one
    possibility proposed by a number of people (the
    research group and application partners), instead
    of the result of a standardisation effort.
  • IN FOLKSONOMY-STYLE TAGGING, A RESOURCE USUALLY
    HAS MORE THAN ONE TAG
  • Any set of concepts that a group agrees on can be
    used.
  • In SWUM (Semantic Web Usage Mining), Web pages
    are mapped to single concepts (ex. slides 22ff.)
    or sets of concepts (ex. slide 21). This set of
    concepts could also be a tag set as in
    del.icio.us.

66
Discussion point 2: Can folksonomies be used
instead of ontologies? (3)
  • The answer tends to be MAYBE when one focuses on
  • DYNAMICS: these introduce a non-stability of the
    mapping, which means that the patterns would
    change "depending on how you look at them",
    which may or may not be desirable
  • My opinion: This quickly becomes intractable,
    thus an ontology-based treatment of different
    viewpoints and dynamics (→ ontology evolution)
    appears to be the better choice.
  • The answer tends to be NO when one focuses on
  • FORMAL PROPERTIES
  • HIERARCHIES: generalization is an important
    feature of many mining algorithms (unless you
    abstract, you may not find any pattern).
  • (Non-hierarchical) RELATIONS
  • In folksonomies, there are no relations on
    concepts. Therefore, meaningful visualizations
    become harder to produce (note that the
    stratograms shown on slides 27 and 29 require
    relations that induce a linear order on
    concepts).
  • Also, all other inference possibilities are lost.
  • COMPARABILITY: The results of SWUM can only be
    compared (e.g., conversion rates in one site with
    those in another site) if stable and uniform
    ontologies are used.
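The point about hierarchies can be illustrated with a toy Python sketch: at the level of individual pages no item is frequent, but after one step of generalization along a concept hierarchy a pattern appears. The hierarchy, the sessions and the threshold are invented for this example.

```python
from collections import Counter

# Hypothetical concept hierarchy: child -> parent
PARENT = {
    "camera_A": "camera", "camera_B": "camera",
    "jacket_A": "jacket",
    "camera": "product", "jacket": "product",
}

def generalize(item, levels=1):
    """Replace an item by its ancestor `levels` steps up (roots stay as they are)."""
    for _ in range(levels):
        item = PARENT.get(item, item)
    return item

def frequent_items(sessions, min_support, levels=0):
    """Items (after generalization) whose share of sessions is >= min_support."""
    counts = Counter()
    for session in sessions:
        counts.update({generalize(i, levels) for i in session})  # count once per session
    n = len(sessions)
    return {i: c / n for i, c in counts.items() if c / n >= min_support}

sessions = [{"camera_A"}, {"camera_B"}, {"jacket_A"}, {"camera_B", "jacket_A"}]

print(frequent_items(sessions, 0.75, levels=0))   # page level: nothing is frequent
print(frequent_items(sessions, 0.75, levels=1))   # concept level: "camera" appears
```

A flat tag set offers no PARENT mapping, so this abstraction step is not available.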

67
Discussion point 3: Which of the techniques shown
in this talk are being used in industry and other
real-world sites? (1)
  • Pre-remark 1: The content of this talk was
    (recent) research, thus it would be surprising to
    see it already incorporated into industrial
    practice. However, given that Web usage mining
    has been around for a number of years, the
    question is valid.
  • Pre-remark 2: Web usage mining is used on a large
    scale by search engines. Google says it, Yahoo!
    says it. Both say they rely on
    latent-semantic-indexing-style semantics rather
    than on Semantic-Web-style semantics (but they do
    use lexica and other helpers); the boundaries are
    fluid. Anyway, they don't say too much about the
    details of their algorithms. After all, mining is
    their business model ...
  • Anyway, we believe that SWUM is applicable to
    analysing search when the focus is on what
    services of a site('s interface) are used, not
    when the content of searches is investigated (cf.
    content vs. service conceptual hierarchies in
    Berendt & Spiliopoulou, VLDB Journal 2000). Thus,
    the intended application areas of our techniques
    are not search engines, but retail, information,
    e-Government, etc. sites.
  • The question should therefore be rephrased as 3
    questions:
  • Do off-the-shelf software packages (used by
    end-user companies either on-site or in ASP mode,
    i.e. without external consultants to do the
    analyses) support Web usage mining, and
    specifically Web usage mining with semantics?
  • The answer is: Very partly.
  • Do consultants offer SWUM analyses?
  • The answer is: Partly.
  • What are the likely reasons?
  • A tentative answer is: Perception problems and
    lack of incentives.

68
Discussion point 3 (2): Support in off-the-shelf
software: basic forms of analysis
  • Pageview counts and simple OLAP-type analyses
    (hits by country, by language, etc.) are pretty
    standard and supported even by most of the
    simplest freeware products (e.g., Analog). Their
    usage is very common in industry.
  • State-of-the-art commercial analysis software
    like Webtrends allows a certain degree of
    programming for extracting more attributes that
    can be subjected to OLAP-type analyses (see below
    for an example).
  • State-of-the-art software often also supports the
    extraction of more information transferred via
    JavaScript. An example is Google Analytics.
  • Syntax is generally the only basis. Semantics
    usually comes in only insofar as the Content
    Management System used by most sites today
    provides a certain frame of reference and
    meaning.

69
Discussion point 3 (3): Support in off-the-shelf
software: Conversion rates
  • Software generally also supports the definition
    of simple templates from which conversion rates
    can be computed automatically (e.g., a click on
    page X with referrer Y, or after a sequence of
    pages that started with referrer Y, is a
    converted customer brought to us from the banner
    shown on affiliated site S).
  • Conversion rates are not only extremely simple
    (divide the number of sessions that reached X and
    then Y by the number of sessions that reached X),
    but also quite powerful: Every success measure
    that can be defined via reachability can be
    cast as a conversion rate.
  • The 3-click rule (every page must be reachable
    with 3 clicks) is a related and equally
    simple-to-compute measure. That a page is
    reachable in 3 clicks can be computed from the
    site graph; that it is reached can be computed
    from frequent sequences. This only requires that
    the tool can compute frequent contiguous
    sequences, which is algorithmically simple and
    requires little thinking on the part of the
    analyst.
  • For conversion-rate computations, semantics
    occurs in the simple sequence templates offered
    by the tools; the mapping is gathered from the
    users via Web forms or scripts.
  • Conversion rates are also related to pricing
    models such as GoogleAds.
  • For a survey of software, see
    http://www.kdnuggets.com/solutions/web-mining.html
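Both measures described on this slide fit in a few lines. The Python sketch below follows the slide's definition of a conversion rate (sessions that reached X and then Y, divided by sessions that reached X) and checks the 3-click rule with a breadth-first search on the site graph. The page names, sessions and link structure are hypothetical.

```python
from collections import deque

def conversion_rate(sessions, x, y):
    """Share of the sessions reaching page x that reach page y afterwards."""
    reached_x = 0
    converted = 0
    for session in sessions:
        if x in session:
            reached_x += 1
            # look for y only after the first visit to x
            if y in session[session.index(x) + 1:]:
                converted += 1
    return converted / reached_x if reached_x else 0.0

def reachable_within(graph, start, max_clicks=3):
    """Pages reachable from start with at most max_clicks links (breadth-first)."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if distance[page] == max_clicks:
            continue                      # do not expand beyond the click limit
        for target in graph.get(page, ()):
            if target not in distance:
                distance[target] = distance[page] + 1
                queue.append(target)
    return set(distance)

sessions = [
    ["home", "search", "product", "order"],
    ["home", "search"],
    ["order", "search"],                  # order happened before the search
    ["home", "product"],
]
site_graph = {
    "home": ["search", "about"],
    "search": ["results"],
    "results": ["product"],
    "product": ["order"],
}

print(conversion_rate(sessions, "search", "order"))
print(sorted(reachable_within(site_graph, "home")))
```

In this toy site the order page sits four clicks from the home page, so it violates the 3-click rule even though one session converts.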

70
Discussion point 3 (4): Support in off-the-shelf
software: possibilities and limitations /
example: country and language
  • Language
  • is usually defined as either the presentation
    language (in a site with dynamic pages generated
    by a content management system, this can easily
    be extracted)
  • or the language (assumed to be) preferred by the
    user (the browser setting, which in most cases is
    likely to be the default with which the browser
    is shipped).
  • Country is inferred from the IP address and an IP
    → geo-coordinates mapping. Such mappings are
    provided by software like MaxMind. This is
    relatively reliable according to the producers
    and according to a test we did (publication in
    preparation).
  • To obtain the user's native language, we inferred
    it from the Geo-IP mapping and official data on
    official languages in countries around the world.
    In a small experimental sample in which we asked
    users to specify their native language, we
    obtained quite high accuracy (Kralisch & Berendt,
    NRHM 2005).
  • I do not know of data on the accuracy of the
    browser setting → native language mapping, or of
    data comparing it to the Geo-IP approach we used.
  • But only the combination of presentation language
    and user's native language gives information about
    whether a user accesses content in his/her native
    language or in a foreign language, and this
    knowledge may be much more important for
    personalization than presentation language or
    preferred language alone (see Kralisch, Ph.D.
    dissertation 2006,
    http://edoc.hu-berlin.de/docviews/abstract.php?id=27410)
  • Nonetheless, even the semantics of presentation
    language / user language are, to my knowledge,
    not utilized in off-the-shelf software. One
    reason is that awareness of the importance of
    language in Internet design has only begun to
    develop.
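The combination of presentation language and inferred native language can be sketched as follows in Python. The sketch assumes that a country code has already been obtained from a Geo-IP lookup, which is not shown. The small table of official languages and the function names are illustrative only, not the data or code used in the cited study.

```python
# Illustrative excerpt of a country -> official languages table (ISO codes)
OFFICIAL_LANGUAGES = {
    "DE": {"de"},
    "AT": {"de"},
    "FR": {"fr"},
    "CH": {"de", "fr", "it", "rm"},
}

def inferred_native_languages(country_code):
    """Approximate a user's native language(s) by the country's official ones."""
    return OFFICIAL_LANGUAGES.get(country_code, set())

def access_type(country_code, presentation_language):
    """Classify a request as first-language or second-language access."""
    languages = inferred_native_languages(country_code)
    if not languages:
        return "unknown"                  # country missing from the table
    if presentation_language in languages:
        return "first-language"
    return "second-language"

print(access_type("CH", "fr"))
print(access_type("FR", "de"))
```

The heuristic is coarse by construction: it cannot distinguish individual speakers within a multilingual country, which is why the slide reports a validation against self-reported native languages.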

71
Discussion point 3 (5): Consultancy companies
  • More advanced forms of conversion-rate analysis,
    which rely on (some) semantics, have been
    introduced or popularized by consultancy
    companies.
  • Examples:
  • NetGenesis (Cutler & Sterne), E-Metrics White
    Paper, 2000,
    http://www.emetrics.org/articles/whitepaper.html
  • The funnel metrics introduced there are now also
    offered, for example, by Google Analytics
  • http://www.google.com/analytics/feature_funnel.html
  • Accenture (R. Ghani), Mining the Web to add
    semantics to retail data mining, in Berendt et
    al., Web Mining: From Web to Semantic Web (2004).
  • Survey by Anand et al., On the deployment of Web
    usage mining, ibid.
  • Unfortunately, publicly available data on Web
    usage are usually at a very high level of
    aggregation and (also for this reason) build on
    essentially non-semantic analysis types, e.g.
  • http://www.nielsen-netratings.com/resources.jsp?section=pr_netv&nav=1

72
Discussion point 3 (6): Likely reasons
  • One major problem is a divergence between the
    (current or definitional?) nature of data mining
    / knowledge discovery on the one hand, and
    business expectations on the other:
  • KD is still more an art than an engineering
    process, with few standards even for the process.
  • Business often expects data mining to be a set of
    fully automatic, pre-packaged black-box
    solutions.
  • The CRISP-DM process model shown on slide 16,
    for example, is a very high-level attempt at
    standardisation which leaves many details open.
  • In fact, it can be (and often is) argued that
    the search for interesting and novel patterns
    through exploratory data analysis by definition
    involves hand-crafting. Going back to the
    original definition of data mining (see slide
    13), one could argue that looking for the values
    of pre-defined pattern templates (e.g.,
    conversion rates) is the antithesis of novel
    patterns and thus by definition not data mining.
  • On the other hand, Web usage mining is
    essentially market research: a study of user /
    consumer behaviour. Market research is an
    established discipline in which it is quite
    accepted that methods involve human intervention
    and interpretation rather than the automatic
    application of pre-packaged procedures (one
    example is the focus-group method).

73
Discussion point 3 (7): Likely reasons, contd.
  • Maybe this is a perception problem: While it is
    clear that consumer opinions bear a strong
    qualitative element (such that focus groups
    cannot be prepared, administered and interpreted
    by a machine only), data mining carries the image
    of number crunching (implying that computers are
    the main actors here).
  • In line with this, the responsible people often
    have disjoint qualifications: The market research
    people have a strong background in the relevant
    social-science methods; the IT people (who are
    expected to do the data mining on the side) can
    use tools, but usually have limited knowledge
    about empirical methods in general or data mining
    in particular.
  • This point was discussed at a panel at the WebKDD
    workshop at SIGKDD 2005; one result was that the
    job description "Chief Data Officer" (a
    senior-management person with resources who knows
    about data mining in the sense of data analysis
    AND computers) was a really recent invention. In
    the meantime, data-mining consultancies filled
    the gap (but had to convince companies they were
    worth it).
  • Or it is a problem of lacking standards (once we
    have behaviour models of retail sites, of
    education sites, etc., we can pre-package these
    behaviour ontologies and even compare sites).
  • Standards (in behaviour modelling) require that
    there is an interest in what the behaviour models
    say, and an interest in being comparable to other
    sites. Encouraging developments in this direction
    can currently be observed in the digital
    libraries community.
  • ... to be continued ...

74
Thank you for your questions!