Title: Diversity in search: what, how, and what for?
1 Diversity in search: what, how, and what for?
- Bettina Berendt
- Dept. Computer Science,
- KU Leuven
2 Thanks to
- Sebastian Kolbe-Nusser
- Anett Kralisch
- Siegfried Nijssen
- Ilija Subašic
- Mathias Verbeke
- Hugo Zaragoza
- ...
3 Diversity in natural language
- diverse (s2), various
- distinctly dissimilar or unlike
- ..., diversity (s1), ..., variety
- noticeable heterogeneity
- (WordNet)
- the fact that members of a set are different from one another
4 Why is diversity interesting for search?
- People like to see a range of different, non-redundant things/views/etc.
- Different people search differently.
- → How?
- → When / under what conditions?
- → (What) can we do?
5 What is diverse?
- Documents
- the relevance of a document must be determined considering the documents appearing before it (Goffman, 1964)
- E.g. MMR (Carbonell & Goldstein, 1998)
- Many further developments, e.g. for images
- Presentation choices, e.g. re-ranking or clustering?
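The MMR idea mentioned above can be sketched as a greedy re-ranking loop. This is a minimal illustration, not the code behind the talk: the function name `mmr_rerank`, the trade-off parameter `lam`, and the toy relevance and similarity values are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional, Sequence


def mmr_rerank(
    candidates: Sequence[str],
    relevance: Dict[str, float],
    similarity: Callable[[str, str], float],
    lam: float = 0.7,
    k: Optional[int] = None,
) -> List[str]:
    """Greedy MMR: repeatedly pick the document that maximises
    lam * relevance - (1 - lam) * (max similarity to documents already picked)."""
    remaining = list(candidates)
    selected: List[str] = []
    limit = len(remaining) if k is None else min(k, len(remaining))

    def score(doc: str) -> float:
        # Redundancy is 0 while nothing has been selected yet.
        redundancy = max((similarity(doc, s) for s in selected), default=0.0)
        return lam * relevance[doc] - (1.0 - lam) * redundancy

    while len(selected) < limit:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (made up): "a" and "b" are near-duplicates, "c" is different.
TOY_RELEVANCE = {"a": 0.9, "b": 0.85, "c": 0.5}
TOY_PAIR_SIM = {frozenset("ab"): 0.95, frozenset("ac"): 0.1, frozenset("bc"): 0.1}


def toy_similarity(x: str, y: str) -> float:
    return TOY_PAIR_SIM[frozenset((x, y))]


print(mmr_rerank(["a", "b", "c"], TOY_RELEVANCE, toy_similarity, lam=0.5))
# ['a', 'c', 'b']
```

With `lam=1.0` the ranking falls back to pure relevance order; lower values push the near-duplicate "b" further down.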
6 What is diverse?
- Documents
- People
- The term diversity is a form of euphemistic shorthand to describe differences in racial or ethnic classifications, age, gender, religion, philosophy, physical abilities, socioeconomic background, sexual orientation, gender identity, intelligence, mental health, physical health, genetic attributes, behavior, attractiveness, place of origin, cultural values, or political view as well as other identifying features.
- http://en.wikipedia.org/wiki/Diversity_(politics)
7 What is diverse?
- Documents
- People
- Knowledge and its articulations
- (documents in a wider sense?!)
- Knowledge and its articulations are strongly influenced by diversity in, e.g., cultural backgrounds, schools of thought, geographical contexts.
- LivingKnowledge will study the effect of diversity and time on opinions and bias.
- The goal is to improve navigation and search in very large multimodal datasets (e.g., the Web itself).
8 How we got here
The impact of language and culture on Web usage behaviour
Diversity of users
9 How we got here
The impact of language and culture on Web usage behaviour
Tools for sense-making in literature search
Diversity of users
Diversity of documents
10 How we got here
The impact of language and culture on Web usage behaviour
Tools for sense-making in literature search
PORPOISE, STORIES: tools for graphical news summarization and understanding
Diversity of users
Diversity of documents
11 How we got here
The impact of language and culture on Web usage behaviour
Collaborative re-use of literature search results
Tools for sense-making in literature search
PORPOISE, STORIES: tools for graphical news summarization and understanding
Diversity of users
Diversity of diversity?
Diversity of documents
12 Why this talk?
The impact of language and culture on Web usage behaviour
Collaborative re-use of literature search results
Tools for sense-making in literature search
PORPOISE, STORIES: tools for graphical news summarization and understanding
Diversity of users
Diversity of diversity?
Diversity of documents
13 Why this talk?
The impact of language and culture on Web usage behaviour
Collaborative re-use of literature search results
Tools for sense-making in literature search
PORPOISE, STORIES: tools for graphical news summarization and understanding
e.g. Information Retrieval J. 2009; Proceedings Living Web WS @ ISWC 2009; Inf. Processing & Management 2010
e.g. Knowledge and Information Systems J. 2009
Towards an integrated understanding of diversity
14 The impact of linguistic diversity on Web usage and thereby on the Web
- Or:
- Why are non-English languages under-represented on the Web?
- A web-analysis approach asking for underlying cognitive-linguistic, behavioural, and attitude factors
15 A simple expectation of how much content exists in which language
16 But: Dynamics of content creation, link setting, link following, attitudes, and use
17 But: Dynamics of content creation, link setting, link following, attitudes, and use
People create less content
People link less to content
People use links less
People think the content is bad ... and use it less
18 But: Dynamics of content creation, link setting, link following, attitudes, and use
→ Under-representation!
19 Underlying data and methods
- Database of countries and official languages
- Distribution comparisons between
- worldwide proportions of native speakers of different languages
- worldwide distribution of servers registered by country
- crawler analysis of links to a multilingual site S
- log analysis assigning each session a native language
- log analysis of
- (user native language) (S-entry-page language)
- Questionnaire/TAM analysis of native and non-native users of S
- usability, ease of use, competence in English, beliefs about availability of content in native language
20 Some questions
- Does one find such dynamics also in search engines?
- What factors stop or reverse such language-marginalisation trends?
- Critical mass?
- Laws?
- Volunteers?
- Did / can Web 2.0/3.0 change this?
- (When) is it better to work without pre-defined labels for users?
21 → Part 2: An approach that ...
22 Motivation (1): Diversity of people is ...
- Speaking different languages (etc.) → localisation / internationalisation
- Having different abilities → accessibility
- Liking different things → collaborative filtering
- Structuring the world in different ways → ?
23 Motivation (2): Diversity-aware applications ...
- Must have a (formal) notion of diversity
- Can follow a
- personalization approach
- → adapt to the user's value on the diversity variable(s)
- → transparently? Is this paternalistic?
- customization approach
- → show the space of diversity
- → allow choice / raise awareness / semi-automatic!
24 Measuring grouping diversity
- Diversity = 1 − similarity = 1 − normalized mutual information (NMI)
By colour
NMI = 0
NMI = 0.35
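As a sketch of how this measure could be computed for two groupings of the same documents: the slide does not say which NMI normalisation is used, so the geometric-mean form, MI / sqrt(H(A) · H(B)), is an assumption here, as are the function names and the toy groupings.

```python
from collections import Counter
from math import log, sqrt
from typing import Hashable, List, Mapping

Grouping = Mapping[Hashable, Hashable]  # document id -> group label


def entropy(labels: List[Hashable]) -> float:
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def mutual_information(a: List[Hashable], b: List[Hashable]) -> float:
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        c / n * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a: List[Hashable], b: List[Hashable]) -> float:
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:
        # A one-group partition carries no information; two such
        # partitions are treated as identical (a convention, not from the talk).
        return 1.0 if h_a == h_b else 0.0
    return mutual_information(a, b) / sqrt(h_a * h_b)  # assumed normalisation


def grouping_diversity(g_a: Grouping, g_b: Grouping) -> float:
    """Diversity = 1 - NMI, computed over the documents both groupings cover."""
    shared = g_a.keys() & g_b.keys()
    if not shared:
        raise ValueError("groupings share no documents")
    pairs = [(g_a[d], g_b[d]) for d in shared]
    return 1.0 - nmi([p[0] for p in pairs], [p[1] for p in pairs])


# Toy groupings (made up): same partition with other labels vs. an independent one.
BY_COLOUR = {1: "red", 2: "red", 3: "blue", 4: "blue"}
BY_COLOUR_RENAMED = {1: "x", 2: "x", 3: "y", 4: "y"}
BY_SHAPE = {1: "square", 2: "circle", 3: "square", 4: "circle"}
```

Group labels do not matter, only which documents end up together: `BY_COLOUR` and `BY_COLOUR_RENAMED` have diversity 0, while `BY_COLOUR` and `BY_SHAPE` are statistically independent and have diversity 1.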
25 Measuring user diversity
- How similarly do two users group documents?
- For each query q, consider their groupings gr
- For various queries: aggregate
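The aggregation step could look like the sketch below. The slide does not state how per-query values are combined, so taking the mean over the queries both users have grouped is an assumption; `aggregate_gdiv` and the stand-in measure `toy_gdiv_q` are hypothetical names for illustration only.

```python
from statistics import mean
from typing import Callable, Dict, Hashable, Mapping

Grouping = Mapping[Hashable, Hashable]  # document id -> group label


def aggregate_gdiv(
    groupings_a: Mapping[str, Grouping],
    groupings_b: Mapping[str, Grouping],
    gdiv_q: Callable[[Grouping, Grouping], float],
) -> float:
    """Average a per-query grouping diversity over the queries both users handled."""
    shared_queries = sorted(set(groupings_a) & set(groupings_b))
    if not shared_queries:
        raise ValueError("the two users share no queries")
    return mean(gdiv_q(groupings_a[q], groupings_b[q]) for q in shared_queries)


def toy_gdiv_q(g_a: Grouping, g_b: Grouping) -> float:
    # Stand-in for a real measure such as 1 - NMI: 0 if identical, else 1.
    return 0.0 if dict(g_a) == dict(g_b) else 1.0


# Toy data (made up): the users agree on q1, disagree on q2; q3 is not shared.
USER_A: Dict[str, Grouping] = {
    "q1": {1: "g1", 2: "g2"},
    "q2": {1: "g1", 2: "g1"},
}
USER_B: Dict[str, Grouping] = {
    "q1": {1: "g1", 2: "g2"},
    "q2": {1: "g1", 2: "g2"},
    "q3": {1: "g1"},
}

print(aggregate_gdiv(USER_A, USER_B, toy_gdiv_q))  # 0.5
```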
26 ... and now the application domain
... that's only the 1st step!
27 Workflow
- Query
- Automatic clustering
- Manual regrouping
- Re-use
- Learn present way(s) of grouping
- Transfer the constructed concepts
28 Concepts
- Extension
- the instances in a group
- Intension
- Ideally: squares vs. circles
- Pragmatically: defined via a classifier
29 Step 1: Retrieve
- CiteseerX via OAI
- Output: set of
- document IDs,
- document details
- their texts
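A minimal sketch of turning an OAI-PMH `ListRecords` response with Dublin Core metadata into the ID/details/text triples listed above. The XML namespaces are those of the OAI-PMH 2.0 and Dublin Core standards; the sample record, the function name, and the use of `dc:description` as a stand-in for the document text are assumptions, and no real CiteseerX endpoint is contacted.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical response body, shaped like an OAI-PMH ListRecords answer.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:0001</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical paper on result diversification</dc:title>
          <dc:creator>Doe, J.</dc:creator>
          <dc:description>Made-up abstract text.</dc:description>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def parse_list_records(xml_text: str) -> List[Dict[str, object]]:
    """Extract id, details (title, creators) and text from each record."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for rec in root.findall(".//oai:record", NS):
        records.append({
            "id": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "creators": [e.text for e in rec.findall(".//dc:creator", NS)],
            "text": rec.findtext(".//dc:description", default="", namespaces=NS),
        })
    return records


for record in parse_list_records(SAMPLE_RESPONSE):
    print(record["id"], "|", record["title"])
```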
30 Step 2: Cluster
- the classic bibliometric solution
- CiteseerCluster
- Similarity measure: co-citation, bibliographic coupling, word or LSA similarity, combinations
- Clustering algorithm: k-means, hierarchical
- Damilicious: phrases → Lingo
- How to choose the best?
- Experiments: Lingo better than k-means at reconstruction and extension-over-time
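Two of the similarity measures named above can be illustrated on a toy citation graph. This is a sketch under assumptions: normalising the overlap with the Jaccard coefficient and combining the two measures by a weighted sum are choices made for the example, not necessarily what CiteseerCluster does, and the citation data is made up.

```python
from typing import Dict, Set

# Toy citation data (made up): paper -> set of papers it cites.
TOY_REFS: Dict[str, Set[str]] = {
    "p1": {"r1", "r2"},
    "p2": {"r2", "r3"},
    "p3": {"p1", "p2"},
    "p4": {"p1"},
}


def jaccard(x: Set[str], y: Set[str]) -> float:
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def bibliographic_coupling(refs: Dict[str, Set[str]], a: str, b: str) -> float:
    """Two papers are similar if they cite the same references."""
    return jaccard(refs.get(a, set()), refs.get(b, set()))


def citers(refs: Dict[str, Set[str]], paper: str) -> Set[str]:
    return {p for p, cited in refs.items() if paper in cited}


def co_citation(refs: Dict[str, Set[str]], a: str, b: str) -> float:
    """Two papers are similar if the same later papers cite both."""
    return jaccard(citers(refs, a), citers(refs, b))


def combined_similarity(
    refs: Dict[str, Set[str]], a: str, b: str, weight: float = 0.5
) -> float:
    # Assumed combination: a simple weighted sum of the two measures.
    return weight * co_citation(refs, a, b) + (1.0 - weight) * bibliographic_coupling(refs, a, b)


print(bibliographic_coupling(TOY_REFS, "p1", "p2"))  # shared reference r2 out of r1, r2, r3
print(co_citation(TOY_REFS, "p1", "p2"))             # shared citer p3 out of p3, p4
```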
31 Step 3 (a): Re-organise & work on document groups
32 Step 3 (b): Visualising document groups
33 Steps 4 & 5: Re-use
- Basic idea:
- learn a classifier from the final grouping (Lingo phrases)
- apply the classifier to a new search result
- → re-use semantics
- Whose grouping?
- One's own
- Somebody else's
- Which search result?
- the same (same query, structuring by somebody else)
- More of the same (same query, later time → more docs)
- related (... Measured how? ...)
- arbitrary
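The "learn a classifier, then apply it to a new result" idea can be sketched with a very simple stand-in. The talk does not specify the classifier, so the nearest-centroid bag-of-words model, the similarity threshold, and all names and texts below are assumptions; only the remainder group name "Other topics" comes from the slides.

```python
import re
from collections import Counter
from math import sqrt
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z]+", text.lower())


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def learn_group_profiles(grouping: Dict[str, List[str]]) -> Dict[str, Counter]:
    """One term-count profile per group, built from the texts a user placed in it."""
    return {
        label: Counter(tok for text in texts for tok in tokenize(text))
        for label, texts in grouping.items()
    }


def classify(
    text: str,
    profiles: Dict[str, Counter],
    min_similarity: float = 0.1,
    fallback: str = "Other topics",
) -> str:
    """Assign a new document to the most similar group, or to the remainder group."""
    doc = Counter(tokenize(text))
    best_label, best_sim = fallback, 0.0
    for label, profile in profiles.items():
        sim = cosine(doc, profile)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label if best_sim >= min_similarity else fallback


# Toy final grouping (made up) and two documents from a "new" search result.
FINAL_GROUPING = {
    "web usage mining": ["mining web usage logs", "web log analysis of user sessions"],
    "text clustering": ["clustering text documents by phrases", "document clustering with k-means"],
}
PROFILES = learn_group_profiles(FINAL_GROUPING)

print(classify("session analysis from web server logs", PROFILES))  # web usage mining
print(classify("protein folding simulation", PROFILES))             # Other topics
```

Re-using somebody else's grouping would then amount to loading their profiles instead of one's own before classifying the new result.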
34 Visualising user diversity (1)
- Simulated users with different strategies
- U0 did not change anything (System)
- U1 tried to produce a better fit of the document groups to the cluster intensions: 5 regroupings
- U2 attempted to move everything that did not fit well into the remainder group "Other topics", better fit: 10 regroupings
- U3 attempted to move everything from "Other topics" into matching real groups: 5 regroupings
- U4 regrouping by author and institution: 5 regroupings
- → 5×5 matrix of diversities gdiv(A,B,q)
- → multidimensional scaling
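To illustrate the last step, here is a small pure-Python sketch of classical (Torgerson) multidimensional scaling, which places items in the plane so that distances approximate a given dissimilarity matrix. Which MDS variant the talk used is not stated, so classical MDS is an assumption, as are the function names; the 3×3 matrix is a made-up example, not the 5×5 matrix of the simulated users.

```python
from math import sqrt
from typing import List, Tuple

Matrix = List[List[float]]


def dominant_eigenpair(m: Matrix, iterations: int) -> Tuple[float, List[float]]:
    """Power iteration from a fixed start vector (assumed not orthogonal
    to the dominant eigenvector, which holds for generic data)."""
    n = len(m)
    v = [float(i + 1) for i in range(n)]
    norm = sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # matrix is numerically zero: nothing left to extract
            return 0.0, v
        v = [x / norm for x in w]
    mv = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * mv[i] for i in range(n)), v  # Rayleigh quotient keeps the sign


def classical_mds(dist: Matrix, dims: int = 2, iterations: int = 500) -> Matrix:
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centring of squared distances (dist is assumed symmetric).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    filled = 0
    for _ in range(n):
        if filled == dims:
            break
        eigenvalue, v = dominant_eigenpair(b, iterations)
        for i in range(n):  # deflate so the next round finds the next eigenpair
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
        if eigenvalue > 1e-9:  # negative eigenvalues (non-Euclidean input) are skipped
            for i in range(n):
                coords[i][filled] = sqrt(eigenvalue) * v[i]
            filled += 1
    return coords


def embedded_distance(coords: Matrix, i: int, j: int) -> float:
    return sqrt(sum((a - b) ** 2 for a, b in zip(coords[i], coords[j])))


# Made-up dissimilarities forming a 3-4-5 triangle.
TOY_DIST = [[0.0, 3.0, 4.0], [3.0, 0.0, 5.0], [4.0, 5.0, 0.0]]
for point in classical_mds(TOY_DIST):
    print([round(c, 3) for c in point])
```

Dissimilarities such as 1 − NMI need not be Euclidean, in which case the 2-D picture is only an approximation of the matrix.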
35 Visualising user diversity (2)
Web mining
- aggregated
- using gdiv(A,B)
36 Evaluating the application
- Clustering only: Does it generate meaningful document groups?
- yes (tradition in bibliometrics), but data?
- Small expert evaluation of CiteseerCluster
- Clustering & regrouping
- End-user experiment with CiteseerCluster
- 5-person formative user study of Damilicious
37 The Damilicious tool: Summary and (some) open questions
- A tool that helps users in sense-making, exploring diversity, and re-using semantics
- Diversity measures when queries and result sets are different?
- How to best present diversity?
- How to integrate into an environment supporting user and community contexts?
- Incentives to use the functionalities?
- How to find the best balance between similarity and diversity?
- Which measures of grouping diversity are most meaningful?
- Extensional?
- Intensional? Structure-based? Hybrid? (cf. ontology matching)
- Which other sources of user diversity?
- Diversity and relevance: can we learn from user-dependent relevance judgements?
38 Some lessons learned (or questions raised?)
- We need to embrace diversity.
- We need to take into account
- The diversity of documents / knowledge
- The diversity of people
- The diversity of diversity.
- We need to be clear about what we mean.
- We need to ask whether / when striving for diversity is in itself A Good Thing.
- We need to ask whether / when raising awareness of diversity is in itself A Good Thing.
Thanks!