Title: Untitled slide
1(No Transcript)
2(No Transcript)
3(No Transcript)
4Limitations of current IRSs (and search engines!)
- They behave as a black box: the same query
always yields the same answer
- They are mainly based on static models (systems
do not adapt, or adapt only minimally, their
behaviour by learning users' real
information needs)
- They do not account for the vagueness intrinsic in
the process of verifying whether information
items are informative, i.e. relevant to
specific needs. They account for uncertainty
(probabilistic) only to a limited extent
- Query languages are usually based on selection
criteria specified by terms (keywords): no
possibility to express vagueness or uncertainty
- Simple visualization techniques for retrieval
results
CONSEQUENCE: subjectivity is modeled only to a
shallow extent, while IRSs should adapt to users'
needs!
5(No Transcript)
6(No Transcript)
7(No Transcript)
8(No Transcript)
9The weights specify soft constraints on the
weighted document representations. The RSV of a
document expresses the degree of constraint
satisfaction.
10(No Transcript)
11- Given a weighted query, the Retrieval Status
Value computed for a given document for that
query expresses the degree of constraint
satisfaction.
12Query weights semantics
- 1) THRESHOLD
- (Radecki, 1979; Buell & Kraft, 1981)
- 2) RELATIVE IMPORTANCE
- (Bookstein, 1981; Yager, 1987; Sanchez, 1989)
- 3) SIMILARITY (IDEAL IMPORTANCE VALUES)
- (Cater & Kraft, 1989; Bordogna, Carrara & Pasi,
1991)
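The three weight semantics listed on this slide can be contrasted with a small Python sketch. The formulas are simplified illustrative choices, not the exact definitions from the cited papers; F stands for the index term weight F(d,t) in [0,1], w for the query weight, and all function names are hypothetical.

```python
def threshold_rsv(F, w):
    """Threshold semantics: w is the minimum acceptable index weight.
    Assumed form: keep F when it reaches w, otherwise penalise it."""
    if w == 0 or F >= w:
        return F
    return F * (F / w)  # under-threshold documents get a reduced score


def importance_rsv(F, w):
    """Relative importance semantics: w scales the contribution of the term.
    Assumed form: a plain product."""
    return w * F


def ideal_rsv(F, w):
    """Similarity semantics: w is the ideal index weight.
    Assumed form: the closer F is to w, the higher the score."""
    return 1.0 - abs(F - w)


# The same pair (F, w) gives a different RSV under each semantics.
F, w = 0.9, 0.5
scores = {
    "threshold": threshold_rsv(F, w),
    "importance": importance_rsv(F, w),
    "ideal": ideal_rsv(F, w),
}
```

With F = 0.9 and w = 0.5, the threshold reading rewards the document for exceeding the requested weight, while the ideal-value reading penalises it for being far from 0.5.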
13(No Transcript)
14(No Transcript)
15Linguistic query weights
Boolean expressions on pairs <t, lw>, with lw ∈ {very
important, important, not very important, ...}
- Each value lw has an associated function μ_lw,
which evaluates the compatibility between the
constraint lw and the numeric values F(d,t)
∈ [0,1]
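As a minimal sketch of how a linguistic weight can be evaluated, the Python below associates each label with a compatibility function over F(d,t). The ramp breakpoints (0.3 and 0.7), the use of squaring for "very", and the min operator for AND are assumptions made for illustration, not definitions taken from the slides.

```python
def ramp_up(x, a, b):
    """0 below a, 1 above b, linear in between."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


# Hypothetical compatibility functions, one per linguistic weight.
MU = {
    "important": lambda x: ramp_up(x, 0.3, 0.7),
    "very important": lambda x: ramp_up(x, 0.3, 0.7) ** 2,
    "not very important": lambda x: 1.0 - ramp_up(x, 0.3, 0.7) ** 2,
}


def evaluate_pair(F_dt, lw):
    """Degree to which the index weight F(d,t) satisfies the constraint lw."""
    return MU[lw](F_dt)


# Query: <fuzzy, very important> AND <retrieval, important>,
# with AND interpreted as min (an assumption).
index_weights = {"fuzzy": 0.5, "retrieval": 0.9}
rsv = min(
    evaluate_pair(index_weights["fuzzy"], "very important"),
    evaluate_pair(index_weights["retrieval"], "important"),
)
```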
16(No Transcript)
17(No Transcript)
18(No Transcript)
19(No Transcript)
20(No Transcript)
21(No Transcript)
22(No Transcript)
23(No Transcript)
24(No Transcript)
25(No Transcript)
26(No Transcript)
27Personalized indexing of semi-structured documents
- Limitations of the usual indexing procedure:
- the weighted representation of documents does not
take into account that a term can play a
different role within a text, according to the
location and distribution of its occurrences
- usual indexing procedures produce the same
document representation for all users
Need for personalized indexing procedures
28A model for personalized indexing of structured
documents
(Bordogna & Pasi, International Journal of
Approximate Reasoning, 1995) (Bordogna & Pasi,
Information Retrieval, 2005)
29Hierarchical document structure
30A model for personalized indexing of structured
documents
- The model is composed of
- a static component that extracts the index terms
and computes for each of them and each document
the significance degrees in the document sections
- an adaptive component activated by a user query
that computes the overall significance degree
F(d,t) for each query term t and document d.
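The two components can be sketched in Python as follows. This is only an illustration under stated assumptions: the per-section significance is taken to be a normalised term frequency, and the overall degree F(d,t) is an importance-weighted average over sections. The cited model may use different measures and aggregation operators, and all names are hypothetical.

```python
from collections import Counter


def section_significance(sections):
    """Static component: significance of each term in each section.
    Assumed measure: term frequency divided by the highest frequency
    found in the same section."""
    result = {}
    for name, text in sections.items():
        counts = Counter(text.lower().split())
        top = max(counts.values()) if counts else 1
        result[name] = {term: n / top for term, n in counts.items()}
    return result


def overall_significance(sig, term, importance):
    """Adaptive component: F(d,t) as an average of the section degrees,
    weighted by the importance the user assigns to each section."""
    total = sum(importance.values())
    if total == 0:
        return 0.0
    weighted = sum(w * sig.get(s, {}).get(term, 0.0)
                   for s, w in importance.items())
    return weighted / total


doc = {
    "title": "fuzzy retrieval",
    "body": "retrieval of documents with retrieval models",
}
sig = section_significance(doc)  # computed once, independent of the user

# Two users with different views on which section matters.
user_a = {"title": 1.0, "body": 0.2}
user_b = {"title": 0.2, "body": 1.0}
f_a = overall_significance(sig, "fuzzy", user_a)
f_b = overall_significance(sig, "fuzzy", user_b)
```

The same static index yields a higher F(d, "fuzzy") for the user who considers the title more important, which is the sense in which the representation is personalized.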
31(No Transcript)
32(No Transcript)
33(No Transcript)
34A model for personalized indexing of structured
documents
The structured indexing model can take into
account hierarchical structures. The index term
weight in a section is computed by aggregating
index term weights at the lowest level.
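A bottom-up aggregation over a hierarchical structure can be sketched recursively. The slide does not state which aggregation operator is used, so the Python below takes it as a parameter; the tree layout (dictionaries with "weights" on leaves and "children" on inner nodes) and the function names are assumptions for illustration.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def aggregate(node, term, combine=mean):
    """Weight of a term in a node: taken directly from a leaf,
    otherwise combined from the weights of the child nodes."""
    if "children" not in node:
        return node["weights"].get(term, 0.0)
    return combine(aggregate(child, term, combine)
                   for child in node["children"])


# Hypothetical document: a body with two sections made of paragraphs.
body = {
    "children": [
        {"children": [{"weights": {"fuzzy": 0.8}},
                      {"weights": {"fuzzy": 0.4}}]},
        {"children": [{"weights": {}}]},
    ]
}
by_mean = aggregate(body, "fuzzy")               # average of the sections
by_max = aggregate(body, "fuzzy", combine=max)   # best paragraph wins
```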
35(No Transcript)
36(No Transcript)
37XPath
- XPath is the standard language to write tree
traversal expressions that extract XML fragments
- XPath selection is also used within the
fully-fledged XML Query language (XQuery)
- The main features of XPath are
- a rich set of available built-in expressions
- elementary data-types: Boolean, Number, String,
Node set
- specification of selection conditions to be
satisfied by the relevant nodes
- Path-based selection: the user knows enough of
the target schema so as to be able to formulate a
search path to be matched against the structure
of the target XML documents
- root_node/ /parent_node/target_node
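Path-based selection and selection conditions can be tried out with Python's standard library. Note that `xml.etree.ElementTree` supports only a subset of XPath, and the sample document below is invented for illustration.

```python
import xml.etree.ElementTree as ET

XML = """
<articles>
  <article year="2005"><title>Fuzzy IR</title></article>
  <section>
    <article year="1995"><title>Indexing</title></article>
  </section>
</articles>
""".strip()

root = ET.fromstring(XML)  # root is the <articles> element


def titles(elements):
    return [e.find("title").text for e in elements]


# Path-based selection: article elements that are direct children.
direct = titles(root.findall("./article"))

# Descendant selection: article elements at any depth.
anywhere = titles(root.findall(".//article"))

# Selection condition: a predicate on an attribute value.
from_2005 = titles(root.findall(".//article[@year='2005']"))
```

Each expression is crisp: an element either matches the path and the predicate or it does not.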
38Flexibility in XPath
- Some flexibility can already be achieved in XPath
(by means of wildcards)
39Vague predicates
- Traditional query languages (in the database
context) allow for data selection based on binary
predicates (crisp).
- Relevance w.r.t. a query is therefore modeled as
a binary concept: either an information item is
relevant or not
- On the other hand, a vague predicate, represented
by a fuzzy subset, expresses a soft condition,
whose evaluation produces a numeric value in
[0,1], with the consequence that the results can
be ranked
- The membership functions are defined in
accordance with the semantics of the linguistic
labels employed for the vague predicates
(expensive, recent, ...)
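The contrast between a crisp and a vague predicate can be sketched in a few lines of Python. The membership function for "recent" is a hypothetical linear ramp between two assumed years (2000 and 2010); in practice its shape would be chosen to fit the intended meaning of the label.

```python
def recent_crisp(year):
    """Crisp predicate: an item is either recent or not."""
    return year >= 2005


def recent_fuzzy(year, oldest=2000, newest=2010):
    """Vague predicate: degree in [0,1] to which an item is recent.
    Assumed shape: 0 up to `oldest`, 1 from `newest`, linear in between."""
    if year <= oldest:
        return 0.0
    if year >= newest:
        return 1.0
    return (year - oldest) / (newest - oldest)


items = [
    {"title": "A", "year": 1998},
    {"title": "B", "year": 2010},
    {"title": "C", "year": 2005},
]

# Crisp selection returns an unordered set of matches.
selected = [it["title"] for it in items if recent_crisp(it["year"])]

# Vague selection returns every item with a degree, so results can be ranked.
ranked = sorted(items, key=lambda it: recent_fuzzy(it["year"]), reverse=True)
ranking = [(it["title"], recent_fuzzy(it["year"])) for it in ranked]
```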
40Flexible selection
- Since atomic information items are clustered in a
hierarchical structure, we can state that the
nearer two items are, the more likely they are
semantically related.
- XPath provides two crisp constructs that can
express topology constraints, as follows
- /articles/*/article. The * wildcard matches any tag
in a specified position, disregarding its name
(the position is known, the name is unknown)
- /articles//article. The // axis matches any path
that descends the containment hierarchy, and all
the article elements are matched, independently
of their distance from the articles element
they are contained in (the name is known, the
position is unknown).
New proposal: a new XPath axis NEAR can be
defined, e.g. /articles/NEAR/article. The
result set is ranked by increasing
number of steps to be descended.
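NEAR is a proposed extension and is not part of standard XPath, so it cannot be run through an XPath engine directly. The Python sketch below only emulates the described behaviour on an invented document: it collects every descendant with the requested name and ranks the matches by the number of steps descended. The function name `near` is hypothetical.

```python
import xml.etree.ElementTree as ET


def near(context, tag):
    """Return (element, steps) pairs for all descendants named `tag`,
    ordered by increasing number of steps below `context`."""
    found = []

    def walk(node, depth):
        for child in node:
            if child.tag == tag:
                found.append((child, depth))
            walk(child, depth + 1)

    walk(context, 1)
    # sorted() is stable, so equally distant matches keep document order.
    return sorted(found, key=lambda pair: pair[1])


XML = """
<articles>
  <volume><issue><article id="c"/></issue></volume>
  <issue><article id="b"/></issue>
  <article id="a"/>
</articles>
""".strip()

articles = ET.fromstring(XML)
ranking = [(el.get("id"), steps) for el, steps in near(articles, "article")]
```

The closest article comes first even though it appears last in the document, which is the difference from the crisp // construct.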
41The PENG project
- The PENG project was a STREP project aimed at
defining a flexible, personalised system for the
gathering, filtering, retrieval and presentation
of multimedia news for news professionals (e.g.
journalists and editors), with a view to making
the system also available to general users.
42Characteristics of the Clustering algorithm
- Categories of news are not known a-priori
- Unsupervised clustering
- Cluster content summarization
- Need to deal with categorization ambiguity
- Probabilistic or fuzzy clustering
- Need to identify categories of news with
distinct granularity, organized in general topics
and specific topics
- Hierarchical fuzzy clustering
43Characteristics of the Clustering Algorithm
- The algorithm generates a hierarchy of Fuzzy
clusters by recursively applying the Fuzzy
C-Means algorithm
- Extensions of FCMs
- Cosine similarity measure
- The algorithm automatically identifies the number
of clusters to generate
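To make the building block concrete, here is a minimal pure-Python sketch of a single, flat Fuzzy C-Means pass that uses cosine dissimilarity in place of the squared Euclidean distance. It does not reproduce the project's algorithm: the recursive construction of the hierarchy and the automatic choice of the number of clusters are left out, and the initial centres are passed in explicitly as an assumption.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


def memberships(points, centers, m=2.0, eps=1e-12):
    """Membership of each point in each cluster; each row sums to 1."""
    U = []
    for p in points:
        d = [max(cosine_distance(p, c), eps) for c in centers]
        inv = [(1.0 / di) ** (1.0 / (m - 1.0)) for di in d]
        total = sum(inv)
        U.append([v / total for v in inv])
    return U


def fuzzy_c_means(points, init_centers, m=2.0, iters=30):
    centers = [list(c) for c in init_centers]
    dims = len(points[0])
    for _ in range(iters):
        U = memberships(points, centers, m)
        for j in range(len(centers)):
            w = [U[i][j] ** m for i in range(len(points))]
            total = sum(w)
            # New centre: mean of the points weighted by membership^m.
            centers[j] = [
                sum(w[i] * points[i][k] for i in range(len(points))) / total
                for k in range(dims)
            ]
    return centers, memberships(points, centers, m)


# Toy term vectors for four news items on two topics.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centers, U = fuzzy_c_means(points, [points[0], points[2]])
```

Each row of U gives the degrees to which one news item belongs to every cluster, which is how categorization ambiguity is represented.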
44(No Transcript)
45Do SE and commercial systems apply Fuzzy Set
Theory?
- Try the following query: <search engine fuzzy>
- Lots of answers!
- The answer to the previous question: some, yes
- Mainly to deal with the problem of misspelling
→ Levenshtein distance.
- Others talk about fuzzy matching mechanisms
- HAKIA, VERITY ..
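For reference, the Levenshtein distance mentioned above counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. A standard dynamic-programming version in Python is shown below; the small vocabulary and the `closest` lookup are invented for illustration and do not describe how any particular search engine works.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Misspelling correction: pick the vocabulary term closest to the typed word.
vocabulary = ["retrieval", "relevance", "revival"]
typed = "retreival"
closest = min(vocabulary, key=lambda term: levenshtein(typed, term))
```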
46Thank you for your attention!