Title: Mining unstructured data for meaning
Slide 1: Mining unstructured data for meaning
Toby Mostyn
Slide 2:
- The concept of mining unstructured data (the web!) for meaning
- Collecting, managing and accessing web data
- Polecat's (ongoing) solution
- Information mining techniques employed
Slide 3:
- I know what I want
- I don't know anything
Slide 5:
- 1 billion computer chips on the Internet
- 2 million emails per second
- 1 million IM messages per second
- 8 terabytes per second of traffic
- 65 billion phone calls per year
- 255 exabytes of magnetic storage
- 600 billion RFID tags in use
Slide 8: Context / Relevance
Slide 9:
Energy regulator Ofgem today warned Britons may not be able to afford to heat their homes in the years ahead unless there is radical overhaul of the country's energy supplies. The regulator warned the country's current system may not be sufficient to ensure "secure and sustainable" power across the country beyond 2015.
The same story means different things to different readers:
- British Government: "Help, we need a plan!" Context: policy
- Solar panel manufacturers: "People will now be more willing to buy our products." Context: sales
- Duvet makers: "People may want warmer duvets in the future." Context: product line
- British Gas: "We know we have a supply problem already. Now everyone else knows." Context: public relations
Slide 11: Data Sources
External:
- Web pages
- Microblogs
- Blog postings
- RSS feeds
Internal:
- Intranet pages
- Emails
- Documents
Metadata fields:
- Web pages (<meta>): link, title, language, category
- Emails: to, from, subject, date sent
- RSS feeds (<item>): title, link, author, category, publication date, source
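As an illustration of how the RSS fields listed on this slide might be pulled into uniform records, here is a minimal Python sketch using only the standard library. The sample feed, the `FIELDS` mapping and the `parse_items` helper are assumptions made for this example; they are not taken from Polecat's system.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, invented for illustration only.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Example feed</title>
  <item>
    <title>Energy regulator warns on supply</title>
    <link>http://example.com/energy</link>
    <author>editor@example.com</author>
    <category>Energy</category>
    <pubDate>Fri, 09 Oct 2009 08:00:00 GMT</pubDate>
    <source url="http://example.com/rss">Example News</source>
  </item>
</channel></rss>"""

# RSS 2.0 element name -> field name as written on the slide.
FIELDS = {
    "title": "title",
    "link": "link",
    "author": "author",
    "category": "category",
    "pubDate": "publication date",
    "source": "source",
}

def parse_items(rss_text):
    """Return one dict per <item>, holding the fields listed on the slide."""
    root = ET.fromstring(rss_text)
    records = []
    for item in root.iter("item"):
        record = {}
        for tag, name in FIELDS.items():
            # findtext returns None when the element is absent.
            value = item.findtext(tag)
            record[name] = value.strip() if value else None
        records.append(record)
    return records

records = parse_items(SAMPLE_RSS)
```

Missing elements come back as `None`, so feeds that omit optional fields such as author or source still produce a complete record.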
Slide 12: Polecat Data Store
- Blogs: 5,000,000 per day (1,825,000,000 per year)
- Microblogs: 50,000 per day (18,250,000 per year)
- Articles and web pages: ???
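Slide 14: Requirements: flexibility, speed of access, power of access
[Diagram: entity-relationship schema with four entities (Record Type, Attribute, Record, Attribute Value) joined by relationships with cardinalities 1 and 0...; key fields shown: record_type_id, attribute_id, record_id]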
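The four entities in this diagram resemble an entity-attribute-value layout, in which a new source type is added as rows instead of new tables. The sketch below is one possible reading of the diagram, not the actual Polecat schema: the placement of the keys, the `name` and `value` columns and the `load_record` helper are all assumptions.

```python
import sqlite3

# Assumed reading of the diagram: an entity-attribute-value layout.
SCHEMA = """
CREATE TABLE record_type (
    record_type_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE attribute (
    attribute_id INTEGER PRIMARY KEY,
    record_type_id INTEGER NOT NULL REFERENCES record_type(record_type_id),
    name TEXT NOT NULL
);
CREATE TABLE record (
    record_id INTEGER PRIMARY KEY,
    record_type_id INTEGER NOT NULL REFERENCES record_type(record_type_id)
);
CREATE TABLE attribute_value (
    attribute_id INTEGER NOT NULL REFERENCES attribute(attribute_id),
    record_id INTEGER NOT NULL REFERENCES record(record_id),
    value TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Describe an "email" record type using rows only (hypothetical data).
conn.execute("INSERT INTO record_type VALUES (1, 'email')")
conn.executemany(
    "INSERT INTO attribute VALUES (?, 1, ?)",
    [(1, "to"), (2, "from"), (3, "subject"), (4, "date sent")],
)
conn.execute("INSERT INTO record VALUES (1, 1)")
conn.executemany(
    "INSERT INTO attribute_value VALUES (?, 1, ?)",
    [(1, "a@example.com"), (2, "b@example.com"), (3, "Hello")],
)

def load_record(conn, record_id):
    """Reassemble one record as a dict of attribute name -> value."""
    rows = conn.execute(
        "SELECT a.name, v.value FROM attribute_value v "
        "JOIN attribute a ON a.attribute_id = v.attribute_id "
        "WHERE v.record_id = ?",
        (record_id,),
    )
    return dict(rows.fetchall())

email = load_record(conn, 1)
```

A layout like this trades query simplicity for flexibility: every read needs a join, but adding microblogs or documents requires no schema change.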
Slide 17: 1. Information Retrieval: finding linguistic fingerprints beyond keywords
a. Query extraction
b. Query expansion
c. Search results ordered by relevance
Slide 18: Query extraction
Find words and queries particular to these documents:
- Identify key words/phrases
- Find key POS combinations
- Compare to other docs (tf-idf)
Example: "global warming big fat lie" (POS tags shown on the slide: A, A, N, N)
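The "compare to other docs" step can be sketched with a plain tf-idf ranking. This is a minimal standard-library example under stated assumptions: the toy corpus is invented, the `tfidf_keywords` helper is hypothetical, and the POS-pattern step is left out because it would need a tagger.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_keywords(target, background, top_n=5):
    """Rank the terms of `target` by tf-idf against a background corpus."""
    docs = [tokenize(target)] + [tokenize(d) for d in background]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    tf = Counter(docs[0])
    total = len(docs[0])
    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]

# Hypothetical toy corpus, invented for illustration.
target = "global warming is a big fat lie say global warming sceptics"
background = [
    "the council is planning a big new road",
    "sceptics say the budget is a fat chance",
]
keywords = tfidf_keywords(target, background, top_n=2)
```

Terms that appear in every document (here "is" and "a") score zero, which is what pushes the ranking towards words particular to the target document.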
Slide 19: Query expansion
From the initial keywords, find related terms:
- Local document analysis
- Similarity matrices
- Semantic vector
- Relevance-based expansion (Rocchio)
Subject, not synonym: a large expansion is necessary to indicate relevance
Differing linguistic fingerprint for different source types:
- National vs regional press
- Articles vs blogs
- Microblogging???
Iterative approach: build hierarchical taxonomies
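Of the techniques listed, Rocchio expansion is the easiest to show compactly. It moves the query vector towards the average of the relevant documents and away from the average of the non-relevant ones: q' = alpha * q + beta * mean(relevant) - gamma * mean(non-relevant). The sketch below is a generic illustration, not Polecat's implementation; the function name, the parameter values and the sample term weights are assumptions.

```python
from collections import Counter

def rocchio_expand(query, relevant, non_relevant,
                   alpha=1.0, beta=0.75, gamma=0.15, top_n=5):
    """Return the top_n terms of the Rocchio-updated query vector.

    query: dict of term -> weight
    relevant, non_relevant: lists of dicts of term -> weight
    """
    updated = Counter()
    for term, weight in query.items():
        updated[term] += alpha * weight
    for doc in relevant:
        for term, weight in doc.items():
            updated[term] += beta * weight / len(relevant)
    for doc in non_relevant:
        for term, weight in doc.items():
            updated[term] -= gamma * weight / len(non_relevant)
    # Terms pushed to zero or below are dropped from the expanded query.
    positive = {t: w for t, w in updated.items() if w > 0}
    return sorted(positive, key=lambda t: (-positive[t], t))[:top_n]

# Hypothetical term weights, invented for illustration.
query = {"energy": 1.0}
relevant = [
    {"energy": 1.0, "ofgem": 1.0, "supply": 1.0},
    {"energy": 1.0, "supply": 1.0, "heating": 1.0},
]
non_relevant = [{"energy": 1.0, "drink": 1.0}]

expanded = rocchio_expand(query, relevant, non_relevant, top_n=4)
```

The expanded query picks up subject terms such as "supply" and "ofgem" that are not synonyms of "energy", while "drink" is excluded because it only occurs in the non-relevant document.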
Slide 20: Searching for relevance
Hierarchical taxonomies
[Diagram: a tree with "original term" at the root, weighted sub terms (0.9, 0.5, 0.8) beneath it, and further sub terms below those]
Relationship types: BT, NT, RT, SYN, PT etc.
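One way a weighted taxonomy like this could be used to score a document is sketched below. The slide does not say how the weights are combined, so multiplying weights down the tree and summing the weights of matched terms are assumptions, as are the sample terms and the `flatten` and `relevance` helpers.

```python
# Hypothetical taxonomy; only the weights 0.9, 0.5 and 0.8 come from the slide.
TAXONOMY = {
    "term": "energy supply",
    "weight": 1.0,
    "children": [
        {"term": "power cuts", "weight": 0.9, "children": []},
        {"term": "solar panels", "weight": 0.5, "children": []},
        {"term": "heating bills", "weight": 0.8, "children": [
            {"term": "fuel poverty", "weight": 0.5, "children": []},
        ]},
    ],
}

def flatten(node, inherited=1.0):
    """Yield (term, effective weight), multiplying weights down the tree."""
    weight = inherited * node["weight"]
    yield node["term"], weight
    for child in node["children"]:
        yield from flatten(child, weight)

def relevance(text, taxonomy):
    """Sum the effective weights of the taxonomy terms that occur in text."""
    lowered = text.lower()
    return sum(w for term, w in flatten(taxonomy) if term in lowered)

score = relevance("Ofgem warns of power cuts and rising heating bills", TAXONOMY)
```

Substring matching is deliberately crude here; a real system would match on tokens or phrases and would probably normalise the score.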
Slide 23: Press release tracking
What is the probability that a given document was written based on a particular press release?
- Query extraction (key phrases: the longer the better)
- Term similarity matrix based on the original press release
- Document size is less important
- Not a measure of document similarity
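The points above suggest an asymmetric score: how much of the press release reappears in the document, with longer phrases counting for more. The following sketch is one simple way to express that idea, assuming word n-grams as key phrases and phrase length as the weight. It returns a score between 0 and 1, not a calibrated probability, and the sample texts are invented.

```python
import re

def ngrams(text, n):
    """Return the set of word n-grams in text."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def release_coverage(release, document, sizes=(2, 3, 4)):
    """Weighted share of the release's phrases that reappear in document.

    Longer phrases get a larger weight (here simply n). The score is
    normalised by the release alone, so extra text in the document does
    not lower it.
    """
    matched = 0.0
    total = 0.0
    for n in sizes:
        release_phrases = ngrams(release, n)
        document_phrases = ngrams(document, n)
        total += n * len(release_phrases)
        matched += n * len(release_phrases & document_phrases)
    return matched / total if total else 0.0

# Hypothetical texts, invented for illustration.
release = "Acme launches a new solar panel range"
copied = ("Yesterday Acme launches a new solar panel range, reports say, "
          "followed by a long stretch of unrelated commentary")
unrelated = "The council is planning a big road through the town"

copied_score = release_coverage(release, copied)
unrelated_score = release_coverage(release, unrelated)
```

Because the score is normalised by the release only, a long article that quotes the whole release scores as high as a short one, which matches the point that document size is less important.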
Slide 24: The future
- Semantic Web
- Phrases and metaphors
- Real-time search and analysis
- A dictionary of taxonomies
- Language drift / language evolution