Title: Data Mining
1. Data Mining
The collection, organization, and storage of data.
The Good Old Days: When the size of the data was
small and access was very localized, data mining
was the picture of simplicity.
Today's Reality: Not so much...
2. Sample Data Mining Applications
- Purchasing Patterns
  - Determining credit and loan eligibility
  - Targeted marketing (sales, catalogs, coupons)
  - Product placement within store, website, etc.
- Medical Records
  - Identifying disease outbreaks
  - Calculating insurance rates
  - Marketing new pharmaceuticals
- Airline Flights
  - Planning for peak dates and times
  - Pinpointing passengers posing security threats
  - Monitoring weather conditions
3. The Three Stages
Exploration! Clean the data by selecting
specific features to focus on. Otherwise, the
sheer volume of information may be too complex to
analyze, you knucklehead!
Validation! Next, apply various mathematical
models to sample data until you can choose one
that fits the data's behavior and is thus
considered predictive. Nyuk, nyuk, nyuk!
Deployment! Use the selected model on any new
data to predict future outcomes.
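The three stages can be sketched in a few lines of
Python. This is a minimal illustration under stated
assumptions, not a prescribed method: the records,
the field names (visits, spend, noise), and the two
candidate models are all hypothetical, and
validation here simply means choosing the candidate
with the lowest mean squared error on held-out
sample data.

```python
from statistics import mean

# Hypothetical raw records; 'noise' is an irrelevant field dropped during exploration.
RAW = [
    {"visits": 1, "spend": 12.0, "noise": "a"},
    {"visits": 2, "spend": 21.0, "noise": "b"},
    {"visits": 3, "spend": 33.0, "noise": "c"},
    {"visits": 4, "spend": 39.0, "noise": "d"},
    {"visits": 5, "spend": 52.0, "noise": "e"},
    {"visits": 6, "spend": 58.0, "noise": "f"},
]

def explore(records):
    """Exploration: keep only the features chosen for analysis."""
    return [(r["visits"], r["spend"]) for r in records]

def fit_constant(train):
    """Candidate model 1: always predict the mean spend."""
    avg = mean(y for _, y in train)
    return lambda x: avg

def fit_line(train):
    """Candidate model 2: least-squares line through the training points."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in train)
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept

def mse(model, data):
    """Mean squared prediction error of a model on (x, y) pairs."""
    return mean((model(x) - y) ** 2 for x, y in data)

def validate(data):
    """Validation: fit each candidate on part of the sample, score it on the rest."""
    train, held_out = data[::2], data[1::2]
    candidates = {"constant": fit_constant(train), "line": fit_line(train)}
    best = min(candidates, key=lambda name: mse(candidates[name], held_out))
    return best, candidates[best]

def deploy(model, new_visits):
    """Deployment: apply the selected model to new data."""
    return [model(v) for v in new_visits]

data = explore(RAW)
best_name, best_model = validate(data)
predictions = deploy(best_model, [7, 8])
```

On this toy sample the straight line fits the
held-out points far better than the constant, so it
is the model that gets deployed.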
4. Exploiting The Data
Databases are often mined for information for
which they were not originally intended.
The Centers for Disease Control wishes to mine
insurance company records on disease incidents,
seriousness, patient background, etc., in order
to identify disease outbreaks.
Corporate telephone directories are designed to
facilitate contacting individuals, but can be
used by competitors to determine, for instance,
department sizes and thus company direction.
Banks identify high-risk neighborhoods for home
loans, but such redlining frequently defines
minority neighborhoods and results in racial
discrimination.
5. Identifying The Anonymous
Latanya Sweeney's 2001 MIT study.
87% of the U.S. population can be uniquely
identified based on three pieces of information:
- 5-digit zip code
- gender
- date of birth
Using a free Massachusetts voter list and a $20
insurance company database for Cambridge, Sweeney
found:
- Only 6 people with the governor's DOB
- Only 3 of those were men
- Only 1 of those had the gov's zip code
So, for $20, she obtained the governor's complete
medical history!
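The linkage described above amounts to a join on
the three identifying fields. The sketch below uses
entirely made-up records and hypothetical field
names (name, zip, gender, dob, diagnosis); it
illustrates the general technique only and is not
Sweeney's actual data or code.

```python
# Hypothetical public voter list: names plus the three identifying fields.
voters = [
    {"name": "A. Smith", "zip": "02138", "gender": "M", "dob": "1950-01-15"},
    {"name": "B. Jones", "zip": "02138", "gender": "F", "dob": "1950-01-15"},
    {"name": "C. Brown", "zip": "02139", "gender": "M", "dob": "1950-01-15"},
    {"name": "D. White", "zip": "02138", "gender": "M", "dob": "1962-07-04"},
]

# Hypothetical "anonymized" medical records: no names, same three fields.
medical = [
    {"zip": "02138", "gender": "M", "dob": "1950-01-15", "diagnosis": "condition X"},
    {"zip": "02139", "gender": "M", "dob": "1950-01-15", "diagnosis": "condition Y"},
]

QUASI_IDENTIFIERS = ("zip", "gender", "dob")

def key(record):
    """Combine the three identifying fields into one lookup key."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)

def reidentify(voters, medical):
    """Attach a name to each medical record whose key matches exactly one voter."""
    by_key = {}
    for v in voters:
        by_key.setdefault(key(v), []).append(v["name"])
    matches = []
    for m in medical:
        names = by_key.get(key(m), [])
        if len(names) == 1:  # a unique match means the record is no longer anonymous
            matches.append((names[0], m["diagnosis"]))
    return matches

matches = reidentify(voters, medical)
```

Each field alone matches several voters in this toy
list; only the combination of all three narrows
every medical record down to one name.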
6. Privacy Preservation: Data Obfuscation
Hide protected data by modifying some of it.
Example: U.S. Census Bureau Public Use Microdata
- Summarize data by census block (min. 300 people)
- Use ranges of values instead of particular values
- Eliminate sparse values (i.e., top/bottom coding)
- Randomly swap values among similar individuals
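Three of these techniques (ranges, top coding, and
swapping) are simple enough to sketch. The code
below is an illustration under assumptions, not the
Census Bureau's procedure: the records, the field
names (age, income), the range width, and the
income cap are all hypothetical.

```python
import random

def to_range(value, width=10):
    """Replace an exact value with the range it falls in, e.g. 37 -> '30-39'."""
    low = (value // width) * width
    return f"{low}-{low + width - 1}"

def top_code(value, cap):
    """Top coding: report any value above the cap as the cap itself."""
    return min(value, cap)

def swap_field(records, field, rng):
    """Randomly permute one field's values across a group of similar records."""
    values = [r[field] for r in records]
    rng.shuffle(values)
    return [{**r, field: v} for r, v in zip(records, values)]

# Hypothetical group of similar individuals.
people = [
    {"age": 37, "income": 52000},
    {"age": 41, "income": 1250000},
    {"age": 45, "income": 61000},
]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
released = [
    {"age": to_range(p["age"]), "income": top_code(p["income"], cap=200000)}
    for p in swap_field(people, "income", rng)
]
```

The released records keep the group's overall shape
(the same age ranges and the same set of capped
incomes) while no longer tying an exact age or an
extreme income to a particular person.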
7. Privacy Preservation: Summarization
Make only innocuous data summaries available.
Example: Statistical Queries
- Users query protected data via statistical operators.
- A group's total income doesn't reveal an individual's income.
- Problem: Multiple Queries
  - Query 1: X = Total Salary For All Company Employees
  - Query 2: Y = Total Salary For All Employees Except Boss
  - So Salary of Boss Equals X - Y
- The Trick: Perturb the data and/or output by
introducing noise without compromising the
data's statistical integrity
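The two-query problem and the noise trick can be
sketched as follows. The salary figures, the
function names, and the choice of Laplace noise
with a scale of 5000 are assumptions made for
illustration; the slides do not say which kind of
noise to use.

```python
import random

# Hypothetical protected data.
salaries = {"boss": 250000, "ann": 60000, "bob": 55000, "cy": 70000}

def total_salary(exclude=()):
    """Exact statistical query: total salary, optionally excluding some people."""
    return sum(pay for name, pay in salaries.items() if name not in exclude)

def noisy_total_salary(rng, exclude=(), scale=5000.0):
    """Same query with zero-mean Laplace noise added to the output."""
    # The difference of two exponential draws with mean `scale` is Laplace(0, scale).
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return total_salary(exclude) + noise

# Differencing attack on exact answers: X - Y is exactly the boss's salary.
x = total_salary()
y = total_salary(exclude=("boss",))
leaked = x - y

# The same attack on noisy answers yields only an estimate.
rng = random.Random(42)
noisy_leak = noisy_total_salary(rng) - noisy_total_salary(rng, exclude=("boss",))
```

Because the noise has zero mean, totals stay
accurate on average, which is the sense in which
statistical integrity is kept. The flip side is
that averaging many repeats of the same noisy query
wears the protection down, so a real system would
also need to limit repeated queries.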
8. Privacy Preservation: Data Separation
Allow only trusted parties to see the data.
Example: Patient Medical Records
For instance, a medical study might need data
from various providers in order to correlate
complaints/procedures and unrelated drugs.
Providers: Insurance Company, Hospital, Pharmacy,
Doctor
Sample finding: a correlation between, say, male
performance-enhancement drugs and rheumatoid
arthritis
Each provider must agree not to release a
patient's data without the patient's consent.
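A consent-gated release of this kind can be
sketched as below. Everything here is hypothetical:
the shared patient IDs, the two providers' records,
and the consent set are invented for illustration,
and a real study would need far more safeguards
than a simple filter.

```python
# Hypothetical data held separately by two providers, keyed by a shared patient ID.
pharmacy = {"p1": "drug A", "p2": "drug A", "p3": "drug B"}
doctor = {"p1": "arthritis", "p2": "arthritis", "p3": "asthma"}

# Patients who agreed to have their data used in the study.
consented = {"p1", "p3"}

def release(records, consented):
    """A provider releases only the records of consenting patients."""
    return {pid: value for pid, value in records.items() if pid in consented}

def correlate(drugs, diagnoses):
    """The study pairs each patient's drug with that patient's diagnosis."""
    shared = sorted(drugs.keys() & diagnoses.keys())
    return [(drugs[pid], diagnoses[pid]) for pid in shared]

pairs = correlate(release(pharmacy, consented), release(doctor, consented))
```

Patient p2 never consented, so that record never
leaves either provider and does not appear in the
study's pairs.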
9. Identity Theft
The Identity Theft and Assumption Deterrence Act
- 1998 federal law
- Federal crime when someone "...transfers or uses,
without lawful authority, a means of
identification of another person with the intent
to commit, or to aid or abet, any unlawful
activity..."
- Means of identification: name, SSN, credit card
number, cellular telephone electronic serial
number, etc.
- Maximum penalty: 15 years imprisonment, a fine,
and forfeiture of any personal property used or
intended to be used to commit the crime.
10. Phishing Expedition
Phishing is a high-tech scam that uses spam or
pop-up messages to deceive Web users into
disclosing credit card numbers, bank account
information, Social Security numbers, passwords,
or other sensitive information.
11. The Nigerian Scam
Claiming to be Nigerian officials,
businesspeople, or the surviving spouses of
former government honchos, con artists offer to
transfer millions of dollars to your bank account
for a small fee:
- LAGOS, NIGERIA.
- ATTENTION: THE PRESIDENT/CEO
- DEAR SIR,
- CONFIDENTIAL BUSINESS PROPOSAL
- HAVING CONSULTED WITH MY COLLEAGUES AND BASED ON
THE INFORMATION GATHERED FROM THE NIGERIAN
CHAMBERS OF COMMERCE AND INDUSTRY, I HAVE THE
PRIVILEGE TO REQUEST FOR YOUR ASSISTANCE TO
TRANSFER THE SUM OF $47,500,000.00 (FORTY SEVEN
MILLION, FIVE HUNDRED THOUSAND UNITED STATES
DOLLARS) INTO YOUR ACCOUNTS. THE ABOVE SUM
RESULTED FROM AN OVER-INVOICED CONTRACT, EXECUTED
COMMISSIONED AND PAID FOR ABOUT FIVE YEARS (5)
AGO BY A FOREIGN CONTRACTOR. THIS ACTION WAS
HOWEVER INTENTIONAL AND SINCE THEN THE FUND HAS
BEEN IN A SUSPENSE ACCOUNT AT THE CENTRAL BANK OF
NIGERIA APEX BANK.
- WE ARE NOW READY TO TRANSFER THE FUND OVERSEAS
AND THAT IS WHERE YOU COME IN. IT IS IMPORTANT TO
INFORM YOU THAT AS CIVIL SERVANTS, WE ARE
FORBIDDEN TO OPERATE A FOREIGN ACCOUNT THAT IS
WHY WE REQUIRE YOUR ASSISTANCE. THE TOTAL SUM
WILL BE SHARED AS FOLLOWS: 70% FOR US, 25% FOR
YOU AND 5% FOR LOCAL AND INTERNATIONAL EXPENSES
INCIDENT TO THE TRANSFER.
- THE TRANSFER IS RISK FREE ON BOTH SIDES. I AM AN
ACCOUNTANT WITH THE NIGERIAN NATIONAL PETROLEUM
CORPORATION (NNPC). IF YOU FIND THIS PROPOSAL
ACCEPTABLE, WE SHALL REQUIRE THE FOLLOWING
DOCUMENTS:
  - YOUR BANKER'S NAME, TELEPHONE, ACCOUNT AND FAX NUMBERS.
  - YOUR PRIVATE TELEPHONE AND FAX NUMBERS -- FOR CONFIDENTIALITY AND EASY COMMUNICATION.
  - YOUR LETTER-HEADED PAPER STAMPED AND SIGNED.
- ALTERNATIVELY WE WILL FURNISH YOU WITH THE TEXT
OF WHAT TO TYPE INTO YOUR LETTER-HEADED PAPER,
ALONG WITH A BREAKDOWN EXPLAINING,
COMPREHENSIVELY WHAT WE REQUIRE OF YOU. THE
BUSINESS WILL TAKE US THIRTY (30) WORKING DAYS TO
ACCOMPLISH.
- PLEASE REPLY URGENTLY.
- BEST REGARDS
12. Identity Theft Victims By State
Per 100,000 Population, 1/1/2004-12/31/2004