Title: Inferring Demographic Attributes of Anonymous Internet Users
1Topic
Inferring Demographic Attributes of Anonymous
Internet Users
Web Mining Seminar Qian, Jun
2Structure
- 1 Abstract
- 2 Introduction
- 3 Approach
- 4 Conclusion
3Abstract
- Anonymous Internet users
- Demographic attributes for targeted advertising
- Usage information
- Latent Semantic Analysis
- Neural Model
5The Problem
- Web advertisers want to target customers with
certain demographic attributes
- Most Internet users are anonymous
6The Solution
- 1 Post the ad on relevant websites
- 2 Wait for users to enter search terms
- 3 Conduct surveys
- Anyway...
7Target Of This Research
- Build a high-quality database and establish that
up to 6 demographic attributes can be inferred
for users whose demographic information is
not otherwise available
8Methodology
- Collect usage information
- Prepare usage information with Latent Semantic Analysis (LSA)
- Create a neural model
9LSA Overview
- Information retrieval technique that represents text as vectors
- Used here to create a single vector representing an
Internet user of interest
- The user vector is a combination of several vectors
10Vector-space Information Retrieval
- Documents are vectors of terms: d = (t1, t2, ..., tn)
- A query is a vector of terms as well: q = (t1, t2, ..., tn)
- Term-by-document matrix: rows are terms, columns are documents
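A minimal sketch of the vector-space idea, using only the Python standard library. The three toy documents, the raw term counts as weights, and the cosine ranking are assumptions made for illustration; the slides do not say which weighting or similarity measure was used.

```python
import math

# Toy document collection (hypothetical page texts, for illustration only).
docs = {
    "d1": "football scores football news",
    "d2": "stock market news",
    "d3": "football tickets",
}

# Vocabulary: one row of the matrix per distinct term.
terms = sorted({w for text in docs.values() for w in text.split()})
doc_ids = sorted(docs)

# Term-by-document matrix: matrix[i][j] = count of terms[i] in doc_ids[j].
matrix = [[docs[d].split().count(t) for d in doc_ids] for t in terms]

def column(j):
    """Return document j as a vector over the vocabulary."""
    return [row[j] for row in matrix]

def query_vector(text):
    """Represent a query in the same term space as the documents."""
    words = text.split()
    return [words.count(t) for t in terms]

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

q = query_vector("football news")
scores = {d: cosine(q, column(j)) for j, d in enumerate(doc_ids)}
best = max(scores, key=scores.get)
```

Here `best` is the document whose term vector points most nearly in the same direction as the query vector.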
11The Singular Value Decomposition (SVD)
- Decompose the t x d term-by-document matrix A,
A ≈ T S D^T, into
- a t x k matrix T of term vectors
- the transpose of a d x k matrix D of document vectors
- a k x k diagonal matrix S of singular values,
with 100 < k < 300
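The shapes in this decomposition can be written out as follows. This is the standard rank-k truncated SVD; the ordering of the singular values and the orthonormality conditions are general properties of the SVD rather than details taken from the slides. The approximation sign reflects that keeping only k singular values generally does not reproduce A exactly.

```latex
A_{t \times d} \;\approx\; T_{t \times k}\, S_{k \times k}\, \bigl(D_{d \times k}\bigr)^{\mathsf{T}},
\qquad
S = \operatorname{diag}(\sigma_1, \dots, \sigma_k), \quad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k > 0,
\qquad
T^{\mathsf{T}} T = D^{\mathsf{T}} D = I_k .
```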
13Collect Background Information
- Target: a collection of documents consisting
of popular web pages accessed by Internet users
- Procedure: a web crawler was used; web pages with
less than 4k bytes in size were accessed
14Create A Term By Document Matrix
- Target: create a term-by-document matrix using
the document collection as input
- Procedure: SMART software from Cornell University
15Perform A SVD On The Term-document Matrix
- Target: an LSA vector representing all the
usage data associated with each Internet user of
interest
- Procedure: (1) compute the sum of the term vectors in the
matrix T; (2) scale the resulting vector by the
inverse of the matrix S, which gives a pseudo-document
vector; (3) add the document vectors representing the
web pages accessed by the Internet user to that
pseudo-document vector
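The procedure above can be sketched in a few lines of plain Python. All numbers, the term and page names, and the helper `user_vector` are hypothetical. The sketch also assumes that the vectors summed from T are those of the terms in one user's usage data (for example, search terms), which the slide does not state explicitly.

```python
# Hypothetical LSA factors with k = 2 (made-up numbers, not from the study).
T = {                      # term vectors: one row of T per term
    "football": [0.8, 0.1],
    "news":     [0.4, 0.5],
    "stock":    [0.1, 0.9],
}
S = [2.0, 0.5]             # diagonal of the k x k matrix S
D = {                      # document vectors: one row of D per web page
    "page_a": [0.6, 0.2],
    "page_b": [0.1, 0.7],
}

def add(u, v):
    """Element-wise sum of two vectors of equal length."""
    return [a + b for a, b in zip(u, v)]

def user_vector(search_terms, pages_visited, k=2):
    # Step 1: sum the term vectors of the user's terms.
    total = [0.0] * k
    for term in search_terms:
        if term in T:      # terms outside the vocabulary are skipped
            total = add(total, T[term])
    # Step 2: scale by the inverse of S (divide by each singular value).
    pseudo_doc = [x / s for x, s in zip(total, S)]
    # Step 3: add the document vectors of the pages the user accessed.
    for page in pages_visited:
        pseudo_doc = add(pseudo_doc, D[page])
    return pseudo_doc

vec = user_vector(["football", "news", "unknown"], ["page_a"])
```

Because S is diagonal, multiplying by its inverse reduces to dividing each component by the matching singular value.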
16Create A Neural Model To Test The Hypothesis
- Model: 3-layer neural model
- Training: independent and dependent variables
- Number: 40000 observations for training, 20000
observations for validation
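A minimal sketch of a 3-layer model, read here as input, hidden and output layers. The slides give no layer sizes, activation function, loss or training algorithm, so the sigmoid units, cross-entropy loss, stochastic gradient descent, layer sizes and toy data below are all assumptions chosen to keep the example small.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

N_IN, N_HID = 2, 3  # hypothetical sizes; the slides do not give them

# Weights: input -> hidden (with bias) and hidden -> output (with bias).
w1 = [[random.uniform(-0.5, 0.5) for _ in range(N_IN + 1)] for _ in range(N_HID)]
w2 = [random.uniform(-0.5, 0.5) for _ in range(N_HID + 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x):
    """Return hidden activations and the output probability."""
    h = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w1]
    y = sigmoid(sum(w * v for w, v in zip(w2, h + [1.0])))
    return h, y

def loss(data):
    """Mean cross-entropy over (x, target) pairs."""
    total = 0.0
    for x, t in data:
        y = forward(x)[1]
        total -= t * math.log(y) + (1 - t) * math.log(1 - y)
    return total / len(data)

def train(data, epochs=500, lr=0.5):
    """Plain stochastic gradient descent with backpropagation."""
    for _ in range(epochs):
        for x, t in data:
            h, y = forward(x)
            d_out = y - t  # gradient of the loss at the output pre-activation
            # Hidden deltas use the output weights before they are updated.
            d_hid = [d_out * w2[j] * h[j] * (1 - h[j]) for j in range(N_HID)]
            for j, v in enumerate(h + [1.0]):
                w2[j] -= lr * d_out * v
            for j in range(N_HID):
                for i, v in enumerate(x + [1.0]):
                    w1[j][i] -= lr * d_hid[j] * v

# Toy stand-in data: x is a 2-d "user vector", t is a binary attribute.
toy = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 0), ([0.1, 0.9], 0)]
loss_before = loss(toy)
train(toy)
loss_after = loss(toy)
```

In the setting of the slides, the input would be a user's LSA vector and the output one binary demographic attribute, with one such model per attribute being one possible arrangement.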
17Variables
Variable                 Possible Values
Gender                   male, female
Age Under 18             true, false
Age 55                   true, false
Income Under 50000       true, false
Marital Status           single, married
Some College Education   true, false
Children in the Home     true, false
18Training
- Training data contain equal proportions of the
values of the dependent variable under
consideration
- Validation data contain the true proportions of the
values of the dependent variable under
consideration
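One way to build two such sets, sketched with the standard library only. The slides do not say how the equal proportions were obtained; downsampling every value to the size of the rarest one is an assumption, as are the function name `split` and the toy population.

```python
import random
from collections import Counter, defaultdict

def split(observations, label_of, n_valid, seed=0):
    """Return (training, validation) lists of observations."""
    rng = random.Random(seed)
    shuffled = observations[:]
    rng.shuffle(shuffled)
    # Validation: a plain random sample keeps roughly the true proportions.
    validation, rest = shuffled[:n_valid], shuffled[n_valid:]
    # Training: downsample every label value to the size of the rarest one.
    by_label = defaultdict(list)
    for obs in rest:
        by_label[label_of(obs)].append(obs)
    n = min(len(group) for group in by_label.values())
    training = [obs for group in by_label.values() for obs in group[:n]]
    rng.shuffle(training)
    return training, validation

# Hypothetical population: 80% "married", 20% "single".
people = [
    {"id": i, "marital": "single" if i % 5 == 0 else "married"}
    for i in range(1000)
]
train_set, valid_set = split(people, lambda p: p["marital"], n_valid=300)
train_counts = Counter(p["marital"] for p in train_set)
```

Balancing the training data keeps the model from simply predicting the majority value, while the unbalanced validation set shows how it would behave on a realistic population.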
20Conclusion
- It is possible to make demographic
inferences about Internet users for whom
information is not otherwise available
- Privacy concerns