Mining unstructured data for meaning - PowerPoint PPT Presentation

Description:

Mining unstructured data for meaning Toby Mostyn – PowerPoint PPT presentation

Transcript and Presenter's Notes

Title: Mining unstructured data for meaning


1
Mining unstructured data for meaning
Toby Mostyn
2
  • The concept of mining unstructured data (the web!) for meaning
  • Collecting, managing and accessing web data
  • Polecat's (ongoing) solution
  • Information mining techniques employed
3
I know what I want
I don't know anything
4
(No Transcript)
5
  • 1 Billion Computer Chips on the Internet
  • 2 Million emails per second
  • 1 Million IM messages per second
  • 8 Terabytes per second Traffic
  • 65 billion phone calls per year
  • 255 Exabytes magnetic storage
  • 600 billion RFID tags in use

6
(No Transcript)
7
(No Transcript)
8
Context / Relevance
9
Energy regulator Ofgem today warned Britons may not be able to afford to heat their homes in the years ahead unless there is a radical overhaul of the country's energy supplies. The regulator warned the country's current system may not be sufficient to ensure "secure and sustainable" power across the country beyond 2015.

How different readers interpret the same story:
  • British Government: "Help, we need a plan!" Context: policy
  • Solar panel manufacturers: "People will now be more willing to buy our products." Context: sales
  • Duvet makers: "People may want warmer duvets in the future." Context: product line
  • British Gas: "We know we have a supply problem already. Now everyone else knows." Context: public relations
10
  • The concept of mining unstructured data (the web!) for meaning
  • Collecting, managing and accessing web data
  • Polecat's (ongoing) solution
  • Information mining techniques employed
11
Data Sources
External:
  • Web Pages
  • Microblogs
  • Blog postings
  • RSS Feeds
Internal:
  • Intranet Pages
  • Emails
  • Documents
Metadata available per source:
  • <meta>: link, title, language, category
  • Emails: to, from, subject, date sent
  • <item>: title, link, author, category, publication date, source
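The <item> fields listed above correspond to standard RSS 2.0 elements, so collecting them can be sketched with the Python standard library. This is a minimal sketch, not Polecat's collector: the sample feed is made up, and the function name parse_rss_items and the output layout are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, invented for illustration only.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Ofgem warns on energy supplies</title>
      <link>http://example.com/ofgem</link>
      <author>editor@example.com</author>
      <category>energy</category>
      <pubDate>Fri, 01 Jan 2010 00:00:00 GMT</pubDate>
      <source url="http://example.com/rss">Example News</source>
    </item>
  </channel>
</rss>"""

# RSS 2.0 element name -> field name used on the slide.
FIELDS = {
    "title": "title",
    "link": "link",
    "author": "author",
    "category": "category",
    "pubDate": "publication date",
    "source": "source",
}

def parse_rss_items(xml_text):
    """Return one dict per <item>, with None for missing elements."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        record = {}
        for tag, field in FIELDS.items():
            record[field] = item.findtext(tag)  # None if the tag is absent
        items.append(record)
    return items

for record in parse_rss_items(SAMPLE_FEED):
    print(record)
```

Missing elements come back as None rather than raising, which matters for real feeds, where author, category and source are optional.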
12
Polecat Data Store
  • Blogs: 5,000,000 per day (1,825,000,000 per year)
  • Microblogs: 50,000 per day (18,250,000 per year)
  • Articles, Web Pages: ???
13
(No Transcript)
14
Requirements: flexibility, speed of access, power of access
[Diagram: four entities (Record Type, Attribute, Record, Attribute Value) joined by one-to-many (1 to 0...) relationships. Keys shown: record_type_id (twice), attribute_id, record_id]
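The Record Type, Attribute, Record and Attribute Value entities on this slide read like an entity-attribute-value layout. The sketch below shows one way such a store could be laid out in SQLite. Only the key names record_type_id, attribute_id and record_id come from the slide; which table carries which key, the name and value columns, and the load_record helper are all assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE record_type (
    record_type_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL               -- assumed column
);
CREATE TABLE attribute (
    attribute_id INTEGER PRIMARY KEY,
    record_type_id INTEGER NOT NULL REFERENCES record_type(record_type_id),
    name TEXT NOT NULL               -- assumed column
);
CREATE TABLE record (
    record_id INTEGER PRIMARY KEY,
    record_type_id INTEGER NOT NULL REFERENCES record_type(record_type_id)
);
CREATE TABLE attribute_value (
    attribute_id INTEGER NOT NULL REFERENCES attribute(attribute_id),
    record_id INTEGER NOT NULL REFERENCES record(record_id),
    value TEXT,                      -- assumed column
    PRIMARY KEY (attribute_id, record_id)
);
"""

def load_record(conn, record_id):
    """Reassemble one record as an {attribute name: value} dict."""
    rows = conn.execute(
        "SELECT a.name, v.value FROM attribute_value v "
        "JOIN attribute a ON a.attribute_id = v.attribute_id "
        "WHERE v.record_id = ?",
        (record_id,),
    )
    return dict(rows.fetchall())

# Hypothetical data: one record type with two attributes and one record.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO record_type VALUES (1, 'blog posting')")
conn.executemany(
    "INSERT INTO attribute VALUES (?, 1, ?)", [(1, "title"), (2, "link")]
)
conn.execute("INSERT INTO record VALUES (1, 1)")
conn.executemany(
    "INSERT INTO attribute_value VALUES (?, 1, ?)",
    [(1, "Energy supply warning"), (2, "http://example.com/post")],
)
print(load_record(conn, 1))
```

In a layout like this, a new source type or a new field is a row insert rather than a schema change, which serves the flexibility requirement. The price is a join on every read, so speed of access has to be weighed against it.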
15
  • The concept of mining unstructured data (the web!) for meaning
  • Collecting, managing and accessing web data
  • Polecat's (ongoing) solution
  • Information mining techniques employed
16
  • The concept of mining unstructured data (the web!) for meaning
  • Collecting, managing and accessing web data
  • Polecat's (ongoing) solution
  • Information mining techniques employed
17
1. Information Retrieval: finding linguistic fingerprints beyond keywords
   a. Query extraction
   b. Query expansion
   c. Search results ordered by relevance
18
Query extraction
Find words and queries particular to these documents:
  • Identify key words/phrases
  • Find key POS combinations
  • Compare to other docs (tf-idf)
Example POS patterns (A = adjective, N = noun): "global warming" (A N), "big fat lie" (A A N)
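The "compare to other docs (tf-idf)" step can be sketched in a few lines. This is a minimal sketch under stated assumptions: tokenisation is a simple regular expression, idf is the unsmoothed log(N / df), and the three documents are invented. The POS filtering step is left out because it would need a tagger.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def tfidf_keywords(target, others, top_n=3):
    """Rank terms in `target` by tf-idf against the whole collection."""
    docs = [tokenize(target)] + [tokenize(d) for d in others]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    tf = Counter(docs[0])
    scores = {
        term: (count / len(docs[0])) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]

# Hypothetical documents, for illustration only.
target = "global warming is a big fat lie say global warming sceptics"
others = [
    "the regulator says energy supply is a risk",
    "a big storm is coming say forecasters",
]
print(tfidf_keywords(target, others, top_n=2))
```

Words that occur in every document, such as "is" and "a" here, get an idf of zero and drop to the bottom of the ranking, which is the sense in which the surviving terms are particular to the target document.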
19
Query expansion
From the initial keywords, find related terms:
  • Local document analysis
  • Similarity matrices
  • Semantic vector
  • Relevance-based expansion (Rocchio)
Subject, not synonym. Large expansion is necessary to indicate relevance.
Differing linguistic fingerprint for different source types:
  • National vs regional press
  • Articles vs blogs
  • Microblogging???
Iterative approach. Build hierarchical taxonomies.
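Rocchio expansion, named above, moves the query towards documents judged relevant and away from those judged non-relevant: q' = alpha*q + beta*centroid(relevant) - gamma*centroid(non-relevant). The sketch below uses raw term counts and the conventional textbook weights (1.0, 0.75, 0.15); neither choice comes from the slides, and the example documents are invented.

```python
from collections import Counter, defaultdict

def centroid(docs):
    """Mean term-count vector of a list of token lists."""
    total = defaultdict(float)
    for doc in docs:
        for term, count in Counter(doc).items():
            total[term] += count / len(docs)
    return total

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the expanded query as a {term: weight} dict.

    q' = alpha*q + beta*centroid(relevant) - gamma*centroid(non_relevant);
    terms whose weight is not positive are dropped.
    """
    weights = defaultdict(float)
    for term, count in Counter(query).items():
        weights[term] += alpha * count
    for term, value in centroid(relevant).items():
        weights[term] += beta * value
    for term, value in centroid(non_relevant).items():
        weights[term] -= gamma * value
    return {t: w for t, w in weights.items() if w > 0}

# Hypothetical feedback documents, already tokenised.
query = ["energy"]
relevant = [["energy", "supply", "ofgem"], ["energy", "regulator", "supply"]]
non_relevant = [["energy", "drink", "sugar"]]

expanded = rocchio(query, relevant, non_relevant)
for term in sorted(expanded, key=expanded.get, reverse=True):
    print(term, round(expanded[term], 3))
```

The added terms ("supply", "ofgem", "regulator") are about the same subject as the query rather than synonyms of it, which matches the "subject, not synonym" point. Feeding the expanded query back in and repeating gives the iterative approach.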
20
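Searching for relevance
Hierarchical taxonomies
[Diagram: a tree with the original term at the root, three weighted sub terms (0.9, 0.5, 0.8) and four further sub terms]
Relationship types: BT (broader term), NT (narrower term), RT (related term), SYN (synonym), PT (preferred term) etc.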
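One way a weighted taxonomy like this could feed a relevance score is sketched below. The scoring rule is an assumption, not something stated on the slide: weights multiply down the tree, and a document scores the sum of the path weights of the terms it contains. The terms are hypothetical; only the weights 0.9, 0.5 and 0.8 echo the slide.

```python
from dataclasses import dataclass, field

@dataclass
class Term:
    label: str
    weight: float = 1.0          # strength of the link to the parent term
    children: list = field(default_factory=list)

def flatten(term, inherited=1.0):
    """Yield (label, path weight), multiplying weights down the tree."""
    path_weight = inherited * term.weight
    yield term.label, path_weight
    for child in term.children:
        yield from flatten(child, path_weight)

def relevance(text, taxonomy):
    """Sum the path weights of every taxonomy term found in the text.

    Matching is a plain substring test, which is crude but keeps the
    sketch short.
    """
    lowered = text.lower()
    return sum(w for label, w in flatten(taxonomy) if label in lowered)

# Hypothetical taxonomy for illustration.
taxonomy = Term("energy", children=[
    Term("energy supply", 0.9, [Term("power cut", 0.5)]),
    Term("solar panel", 0.5),
    Term("heating", 0.8),
])

print(relevance("Ofgem warns on energy supply and heating costs", taxonomy))
print(relevance("Duvet sale", taxonomy))
```

Multiplying weights along the path means a deep, weakly linked term contributes less than a direct sub term, so results can be ordered by how close they sit to the original term.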
21
(No Transcript)
22
(No Transcript)
23
Press release tracking
What is the probability that a given document is written based on a particular press release?
  • Query extraction (key phrases: the longer the better)
  • Term similarity matrix based on the original press release
  • Document size less important
  • Not a measure of document similarity
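The points above can be illustrated with a simpler stand-in for the term similarity matrix: word n-gram overlap. The sketch scores how much of a press release's phrasing is reused in a document, weighting longer phrases more heavily and normalising by the press release alone, so document size barely matters. The phrase lengths, the weighting and the example texts are assumptions, and the result is a score between 0 and 1, not a calibrated probability.

```python
import re

def ngrams(text, n):
    """Set of n-word phrases in the text, lowercased."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def release_score(press_release, document, sizes=(2, 3, 4)):
    """Weighted share of press-release phrases reused in the document.

    Longer phrases count for more (weight = phrase length in words).
    Normalising by the press release only keeps long documents from
    being penalised for their extra text.
    """
    matched = total = 0
    for n in sizes:
        release_phrases = ngrams(press_release, n)
        doc_phrases = ngrams(document, n)
        total += n * len(release_phrases)
        matched += n * len(release_phrases & doc_phrases)
    return matched / total if total else 0.0

# Hypothetical texts for illustration.
release = "Ofgem warns of radical overhaul of energy supplies"
article = ("In a statement today Ofgem warns of radical overhaul, "
           "analysts said, adding that prices may rise")
unrelated = "Duvet makers report strong winter sales"

print(release_score(release, article))
print(release_score(release, unrelated))
```

Because the score is asymmetric, a long article that quotes a short release can still score highly, which is why this is not a measure of document similarity.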
24
The future
  • Semantic Web
  • Phrases and metaphors
  • Real-time search and analysis
  • A dictionary of taxonomies
  • Language drift / language evolution