Automatische Bewerking en Verwerking van Informatie - PowerPoint PPT Presentation

Slides: 20
Provided by: liamvand
Transcript and Presenter's Notes

1
Automatische Bewerking en Verwerking van
Informatie
  • Week 4
  • Web mining, Web-KB,
  • Semantic web
  • 25 February 2003

2
About the course structure
  • Part I: information technology
  • week 2: text mining and IE
  • week 3: question answering
  • week 4: web mining
  • week 5: document classification and IR
  • week 6: automatic summarization

3
The Web
  • Over 2 billion indexed HTML pages, 30 terabytes,
    18% of total
  • Wealth of information
  • Bookstores, restaurants, travel, malls,
    dictionaries, news, stock quotes, yellow & white
    pages, maps, markets, ...
  • Diverse media types: text, images, audio, video
  • Heterogeneous formats: HTML, XML, PostScript,
    PDF, JPEG, MPEG, MP3, PPT
  • Highly Dynamic
  • 1 million new pages each day
  • Average page changes in a few weeks
  • Graph structure with links between pages
  • Average page has 7-10 links
  • Hundreds of millions of queries per day

4
Why is Web IR Important?
  • According to most predictions, the majority of
    human information will be available on the Web in
    ten years
  • Effective information retrieval can aid in:
  • Research
  • Health/Medicine
  • Travel
  • Business
  • Entertainment
  • Arts

5
Web IR Model
[Diagram: Web IR model. Components: Web Server, Crawler, Storage Server, Repository, Clustering/Classification, Indexer, Inverted Index, Topic Hierarchy, Search Query.
Example documents in repository: "The jaguar has a 4 liter engine"; "The jaguar, a cat, can run at speeds reaching 50 mph".
Inverted index terms: engine, jaguar, cat.
Topic hierarchy: Root; Business, News, Science; Computers, Automobiles, Plants, Animals.
Search query: jaguar]

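The inverted index in the model above can be illustrated with a minimal Python sketch. It uses the two example "jaguar" documents from the slide; the function name `build_inverted_index`, the document ids, and the simple whitespace tokenization are assumptions made for illustration, not part of the original slides.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            # Strip surrounding punctuation so "jaguar," and "jaguar" match.
            word = token.strip(".,;:!?")
            if word:
                index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# The two example documents from the slide; the ids 1 and 2 are hypothetical.
docs = {
    1: "The jaguar has a 4 liter engine",
    2: "The jaguar, a cat, can run at speeds reaching 50 mph",
}
index = build_inverted_index(docs)
print(index["jaguar"], index["engine"], index["cat"])
```

A query for "jaguar" matches both documents, which is why the model also needs clustering or a topic hierarchy (Automobiles versus Animals) to separate the two senses.
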
6
Why is Web IR Difficult?
  • The Abundance Problem (99% of information of no
    interest to 99% of people)
  • Limited Coverage of the Web (Internet sources
    hidden behind search interfaces)
  • The Web is extremely dynamic
  • Limited query interface based on keyword-oriented
    search
  • Limited customization to individual users

7
Mining hyperlinks
  • Use hyperlink structure to identify authoritative
    Web sources for broad-topic information discovery
  • Premise: sufficiently broad topics contain
    communities consisting of two types of
    hyperlinked pages:
  • Authorities: highly-referenced pages on a topic
  • Hubs: pages that point to authorities
  • A good authority is pointed to by many good hubs;
    a good hub points to many good authorities
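
The mutual reinforcement between hubs and authorities described above can be sketched as an iterative computation. This is a minimal illustration, assuming a tiny hand-made link graph; the function name `hits`, the page names, and the fixed iteration count are hypothetical choices, not taken from the slides.

```python
import math


def hits(links, iterations=50):
    """Compute hub and authority scores.

    links maps each page to the list of pages it points to.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority score is the sum of the hub scores pointing to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # A page's hub score is the sum of the authority scores it points to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical graph: three hub-like pages pointing to two authority-like pages.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(links)
best_authority = max(auth, key=auth.get)
```

In this graph `a1` ends up as the strongest authority because all three hubs point to it, and `h1` scores higher as a hub than `h3` because it points to both authorities.
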

8
Mining hyperlinks: Google
  • Search engine that uses link structure to
    calculate a quality ranking (PageRank) for each
    page
  • PageRank
  • Intuition: PageRank is the probability that a
    random surfer visits a page
  • Parameter p is probability that the surfer gets
    bored and starts on a new random page
  • (1-p) is the probability that the random surfer
    follows a link on current page
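
The random-surfer intuition above can be sketched as a simple iteration over a link graph. This is an illustrative sketch only, not Google's implementation: the function name `pagerank`, the default value p = 0.15, the example graph, and the handling of pages without outgoing links (the surfer jumps to a random page) are all assumptions.

```python
def pagerank(links, p=0.15, iterations=100):
    """Estimate the probability that a random surfer is on each page.

    links maps each page to the list of pages it links to.
    p is the probability that the surfer gets bored and jumps to a random page.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # With probability p the surfer starts over on a uniformly random page.
        new_rank = {page: p / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # With probability (1 - p) the surfer follows one of the links.
                share = (1 - p) * rank[page] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Assumption: from a page without links, jump to a random page.
                share = (1 - p) * rank[page] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical three-page graph.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = pagerank(links)
```

The scores sum to 1, so they can be read as probabilities; in this graph page C ranks highest because both A and B link to it.
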

9
Web mining general
  • The WWW is a huge, widely distributed, global
    information service center for:
  • content: news, education, culture,
    advertisements, government, commercial, etc.
  • hyperlink information
  • access and usage information
  • all this information provides rich sources for
    data mining

10
Web content mining
  • text mining
  • document collection is a selected fragment of the
    Web
  • for instance, a list of query results
  • also
  • methods to distinguish (e.g.) personal home pages
    from other pages
  • shopping robots learn to look for product
    information (e.g. prices) within web pages
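
As a rough illustration of the shopping-robot idea, the sketch below pulls price-like strings out of a page. Note the substitution: the slide speaks of robots that learn where product information sits, whereas this example uses a hand-written regular expression. The pattern, the function name `extract_prices`, and the sample HTML are hypothetical.

```python
import re

# Hand-written pattern (an assumption): a currency marker followed by an amount.
PRICE_PATTERN = re.compile(r"(?:\$|€|EUR\s?|USD\s?)(\d+(?:[.,]\d{2})?)")


def extract_prices(html):
    """Return the amounts of all price-like strings found in the page text."""
    # Crude tag removal; a real robot would use a proper HTML parser.
    text = re.sub(r"<[^>]+>", " ", html)
    return [match.group(1) for match in PRICE_PATTERN.finditer(text)]


# Hypothetical product page fragment.
page = (
    '<li><b>Web Mining Handbook</b> <span class="price">$39.95</span></li>'
    "<li>Postage: EUR 4,50</li>"
)
prices = extract_prices(page)
print(prices)
```

A learned extractor would replace the fixed pattern with rules induced from labeled example pages, but the input and output would look much the same.
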

11
A Brain for Humanity
  • The World Wide Web is a big and impressive
    success story, both in terms:
  • of the amount of available information and
  • of the growth rate of human users.
  • It is starting to penetrate most areas of our
    daily life and business.
  • This success is based on its simplicity.

1. The Vision A Brain for Humanity
12
A Brain for Humanity (2)
  • However, this simplicity may hamper the further
    development of the Web.
  • In other words: what we currently see is the
    very first version of the Web, and the next
    version will probably be even bigger and much
    more powerful than what we have now.

13
A Brain for Humanity (3)
  • Tim Berners-Lee has a vision of a semantic web
    which:
  • has machine-understandable semantics of
    information, and
  • trillions of small specialized reasoning services
    that provide support in automated task
    achievement based on the accessible information.

14
A Brain for Humanity (4)
  • Knowledge acquisition/engineering deals with the
    bottleneck of acquiring and modeling knowledge
    (human-oriented problem).
  • Knowledge representation deals with the
    bottleneck of representing knowledge and
    reasoning about it (computer-oriented problem).
  • Never took off.

15
A Brain for Humanity (6)
  • Imagine a web that contains large bodies of the
    overall human knowledge and trillions of
    specialized reasoning services that make use of
    it.
  • Compared to the potential of the knowledge web
    the original AI visions look like a small and
    old-fashioned idea of the 19th century.

16
RDF
  • The Resource Description Framework (RDF) provides
    a standard data model for representing
    machine-processable semantics of information.
  • The basic construction in RDF is an
    object-attribute-value triple.
  • RDF Schema defines a set of modeling primitives
    for defining structured vocabularies that provide
    machine-processable semantics of information.
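
The object-attribute-value triple can be illustrated with a small in-memory model. This is a sketch in plain Python, not the RDF syntax or any RDF library; the `Triple` type, the `find` helper, and the example.org resources and attribute names are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """One statement: an object (resource), an attribute, and a value."""
    obj: str
    attribute: str
    value: str


def find(
    triples: List[Triple],
    obj: Optional[str] = None,
    attribute: Optional[str] = None,
    value: Optional[str] = None,
) -> List[Triple]:
    """Return all triples matching the given fields; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (obj is None or t.obj == obj)
        and (attribute is None or t.attribute == attribute)
        and (value is None or t.value == value)
    ]


# Hypothetical statements about two resources.
triples = [
    Triple("http://example.org/slides/week4", "creator", "Example Lecturer"),
    Triple("http://example.org/slides/week4", "topic", "web mining"),
    Triple("http://example.org/slides/week5", "topic", "document classification"),
]
week4 = find(triples, obj="http://example.org/slides/week4")
about_web_mining = find(triples, attribute="topic", value="web mining")
```

Because every statement has the same three-part shape, a program can answer "what is known about this resource?" or "which resources have this topic?" without understanding the text of the pages themselves.
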

3. Techniques
17
OIL
  • OIL extends RDF Schema to a full-fledged ontology
    representation language. It has been developed in
    the IST project On-to-knowledge in cooperation
    with many external partners. It provides
    intelligent access to static information
    sources. http://www.ontoknowledge.org

18
UPML
  • UPML extends RDF Schema to a language for
    specifying reasoning components (i.e., active
    information sources) on the Web. It has been
    developed in the IST project IBROW. The IBROW
    broker enables intelligent access to dynamic
    information sources.
    http://www.swi.psy.uva.nl/projects/ibrow/home.html

19
Ontologies
  • XML fixes the syntax.
  • RDF, OIL, and UPML provide predefined modeling
    primitives and semantics.
  • Ontologies are consensual formal specifications
    of conceptualizations.
  • They provide a shared and common understanding
    of a domain that can be communicated across
    people and application systems.
