Title: Secondary Data and Sources
1. Chapter 4
- Secondary Data and Sources
2. Secondary Data Defined
- Secondary data is information that has been
collected previously for a purpose other than the
need at hand
3. Reasons for Secondary Research
- Secondary research may solve the problem
- Secondary information costs less
- Supplementary uses
- Defining the research problem
- Planning collection of primary data
- Defining the population and selecting the sample
4. Weighing the Evidence
- There are dangers in relying entirely on secondary data
- Relevant: the information pertains to the problem at hand
- Accurate: the information reflects reality
- Current: the information isn't dated; data is perishable
- Sufficient: enough information and detail
- Available: the information is easy to find
- Measurement units: are they the same as in your analysis (sales, profit, employees, sales per sq. ft., sq. ft.)?
- Classification categories: 50-64,999, 65-79,999, 80,000 and over. What if you want 150,000 and over?
- Knowing how information was collected helps determine how credible the data is.
5. Credibility of Secondary Data
- What was the purpose of the study?
- Is it relevant? Is it biased? Political and civic groups may overstate information to make positions look more or less attractive.
- Who collected the information?
- Is the organization competent to do quality research?
- What information was collected?
- How was the information collected?
- Are there errors in research design, sampling, analysis, non-response, etc.? Assess reliability and validity.
- Are the findings consistent with other studies/information?
- Is it inexpensive?
- How does the cost compare to the cost of primary data collection?
6. Internal Sources
- Internal secondary information is available within the company
- Prior research reports
- Documents and databases (sales reports, warranty information)
- Key is knowing the information exists and how to access it
7. Is Secondary Data Appropriate?
- Define the Purpose
- What do you want to know? Be as specific as possible.
- Industry Analysis
- Analyze the structure of the market. Who are the major players?
- Literature search (library/Internet)
- This is the starting point for data collection.
- Who has the information you need most?
- Start with the most likely sources and work to the least likely.
- Share and discuss your information with other team members.
8. External Sources
- External data comes from outside sources
- Huge amounts are available
- Typically cover non-controllable (environmental) factors
- Market size
- Competitive information
- Market characteristics
9. Common External Sources
- Government
- Trade associations and trade press
- Periodicals and professional journals
- Institutions (universities)
- Commercial services
- Government data is the largest source
- Experienced researchers rely on government and
trade sources
10. Consumer and Economic Data
- Market and Consumer Information
- U.S. Bureau of the Census
- Bureau of Labor Statistics
- County and City Databook (Census Bureau)
- Demographics USA Reference Book
- Lifestyle Market Analyst
- Woods and Poole MSA Profile Reference Book
- MediaMark Reporter
- Sales and Marketing Management Survey of Buying Power
- General Economic Information
- Survey of Current Business
- Federal Reserve Bulletin
- Statistical Abstract of the U.S.
11. Government Publications
- Statistical Abstract of the United States
- Demographic data from census reports
- The State and Metropolitan Area Data Book
- Same as above, broken down by state, county, and metropolitan area
- Census of Population and Census of Housing
- Conducted every ten years
- Available online
- www.firstgov.gov (official government portal)
- www.census.gov
12. Other Sources
- Encyclopedia of Associations
- Listings of trade groups, contact information, and publications
- Online databases
- Periodical indexes like EBSCO and InfoTrac
- Subject-specific resources, like LexisNexis
- The books are available in academic, public, and corporate libraries; the online databases typically require a subscription and are often available through university and public libraries
13. Online Sources
- LEXIS-NEXIS
- Business News contains magazine and newspaper articles and broadcast transcripts.
- Compare Companies shows which companies fit certain size characteristics.
- Patent searches for product information can be found by selecting Business, then Patents.
- Journalist Express provides a complete listing of
- Wire services
- News services
- Search engines
- International news
14. Financial Information
- Hoover's
- Financial information for publicly traded companies.
- Company capsules, key competitors, links to SEC 10-K and 10-Q reports, links to articles on companies, and tracking of insider trades, including who purchased/sold what
- Public Register's Annual Report Service
- Annual report information for over two thousand companies.
15. Syndicated Services
- Specialist research companies that generate updated reports on a regular, ongoing basis and provide them to a number of clients
- Examples include IDC, Mediamark, and ACNielsen
16. Database Marketing/Data Mining
- Data mining is the process of analyzing vast computer databases looking for patterns and relationships between variables that will help marketing efforts
- Data explosion problem
- Automated data collection tools and mature database technology have produced huge amounts of data stored in databases and data warehouses
- The data contain potentially valuable information
- Solution: data warehousing and data mining
- Data warehousing and on-line analytical processing
- Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
17. Data Mining History
- 1960s
- Data collection, database creation, DBMS
- 1970s
- Relational database management systems
- 1980s
- Advanced RDBMS models (extended-relational, deductive) and application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s-2000s
- Data mining and data warehousing, multimedia databases, and Web databases
18. Data Mining
- What is a database?
- Internal database: a database developed from data within an organization.
- Where does the data come from?
- Sales Invoices
- Salespersons' Call Reports
- Warranty Cards
- Customer Registration/Sign-in
19. Data Mining of Internal Secondary Data
- Database mining/marketing: micromarketing to customers based on customers' and potential customers' profiles and purchasing patterns
- Internal database mining of secondary data enables firms to (see the code sketch after this list)
- evaluate sales territories
- identify the most and least profitable customers
- identify potential market segments
- identify which products, services, and segments need the most marketing support
- evaluate opportunities for offering new products or services
- identify the most and least profitable products and services
- evaluate existing marketing programs
20. Database Applications
- Database analysis and decision support
- Market analysis and management
- target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation
- Customer acquisition
- discover attributes that predict customer responses to marketing programs
- Customer retention
- target customers who are on the verge of switching to a competitor
- Customer abandonment
- are some customers too costly to maintain?
- Risk analysis and management
- Forecasting, customer retention, quality control, competitive analysis
- Fraud detection and management
- Other applications
- Text mining (newsgroups, email, documents) and Web analysis
- Intelligent query answering
21. Market Analysis Examples
- Where are the data sources for analysis?
- Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
- Find clusters of "model" customers who share the same characteristics: interests, income level, spending habits, etc.
- Determine customer purchasing patterns over time
- Conversion of a single to a joint bank account: marriage, etc.
- Cross-market analysis
- Associations/correlations between product sales
- Prediction based on the association information
22. Market Analysis Examples
- Customer profiling
- data mining can tell you what types of customers buy what products (clustering or classification)
- Identifying customer requirements
- identifying the best products for different customers
- using prediction to find what factors will attract new customers
- Provides summary information
- various multidimensional summary reports
- statistical summary information (data central tendency and variation)
23. Finance/Risk Examples
- Financial planning and asset evaluation
- cash flow analysis and prediction
- contingent claim analysis to evaluate assets
- cross-sectional and time-series analysis (financial ratios, trend analysis, etc.)
- Resource planning
- summarize and compare resources and spending
- Competition
- monitor competitors and market directions
- group customers into classes and use a class-based pricing procedure
- set pricing strategy in a highly competitive market
24. Fraud Detection Examples
- Applications
- widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
- Approach
- use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
- Examples
- auto insurance: detect groups of people who stage accidents to collect on insurance
- money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
- medical insurance: detect "professional" patients and rings of doctors and rings of references
25. Other Application Examples
- Sports
- IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat
- Astronomy
- JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
- Internet Web Surf-Aid
- IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyze the effectiveness of Web marketing, improve Web site organization, etc.
26. Case Example: BellSouth
- BellSouth used data mining to eliminate the prospects least likely to purchase
- Used the attributes of existing customers to identify, model, and predict potential customers
27. The Data Mining Process
- Process flow (sketched in code below): Databases -> Data Integration -> Data Cleaning -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation -> Insight and Knowledge
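A minimal, assumed sketch of the same flow in code: two hypothetical source tables are integrated, cleaned, filtered down to task-relevant data, "mined" for a trivial frequency pattern, and the pattern is evaluated. None of the table or column names come from the slides.

```python
import pandas as pd

# Databases: two hypothetical source tables
orders = pd.DataFrame({"cust": ["C1", "C2", "C2", "C3", "C3", None],
                       "item": ["milk", "milk", "bread", "milk", "eggs", "eggs"]})
profiles = pd.DataFrame({"cust": ["C1", "C2", "C3"],
                         "segment": ["family", "student", "student"]})

# Data integration -> data warehouse
warehouse = orders.merge(profiles, on="cust", how="left")

# Data cleaning: drop records with a missing customer id
warehouse = warehouse.dropna(subset=["cust"])

# Selection: task-relevant data, e.g., only the "student" segment
task_data = warehouse[warehouse["segment"] == "student"]

# Data mining: a trivial pattern, item purchase frequency
pattern = task_data["item"].value_counts()

# Pattern evaluation: keep only items bought more than once
print(pattern[pattern > 1])
```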
28. Data Mining of What Kind of Data?
- Relational databases
- Data warehouses
- Transactional databases
- Advanced DB and information repositories
- Object-oriented and object-relational databases
- Spatial databases
- Time-series data and temporal data
- Text databases and multimedia databases
- Heterogeneous and legacy databases
- WWW
29. Data Mining Analysis
- Concept description: characterization and discrimination
- Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
- Association (correlation and causality); see the sketch below
- Multi-dimensional vs. single-dimensional association
- age(X, "20..29") and income(X, "20..29K") → buys(X, "PC") [support = 2%, confidence = 60%]
- contains(T, "computer") → contains(T, "software") [support = 1%, confidence = 75%]
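To make the support/confidence notation above concrete, here is a minimal sketch that computes both measures for a rule like contains(T, "computer") → contains(T, "software"). The transactions are made up for illustration.

```python
# Hypothetical market-basket transactions
transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "printer"},
    {"printer", "paper"},
]

antecedent, consequent = {"computer"}, {"software"}
n_total = len(transactions)
n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
n_antecedent = sum(1 for t in transactions if antecedent <= t)

support = n_both / n_total          # share of all transactions containing both itemsets
confidence = n_both / n_antecedent  # share of antecedent transactions that also contain the consequent
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```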
30. Data Mining Analysis
- Classification and prediction
- Finding models (functions) that describe and distinguish classes or concepts for future prediction
- E.g., classify countries based on climate, or classify cars based on gas mileage
- Presentation: decision tree, classification rules, neural network
- Prediction: predict some unknown or missing numerical values
- Cluster analysis (see the sketch below)
- Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
- Clustering is based on the principle of maximizing intra-class similarity and minimizing inter-class similarity
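As an illustration of the clustering idea (unknown class labels, similar items grouped together), here is a minimal k-means sketch with scikit-learn. The customer attributes and cluster count are assumptions, not from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (age, annual spend) pairs for six customers
X = np.array([
    [22, 300], [25, 350], [23, 320],     # younger, lower spend
    [45, 2200], [48, 2500], [44, 2100],  # older, higher spend
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # discovered class for each customer
print(kmeans.cluster_centers_)  # average profile of each discovered class
```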
31. Data Mining Analysis
- Outlier analysis (see the sketch below)
- Outlier: a data object that does not comply with the general behavior of the data
- It can be treated as noise or an exception, but it is quite useful in fraud detection and rare-events analysis
- Trend and evolution analysis
- Trend and deviation: regression analysis
- Sequential pattern mining, periodicity analysis
- Similarity-based analysis
- Other pattern-directed or statistical analyses
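One simple way to flag the kind of outlier described above is a z-score rule. This sketch uses made-up transaction amounts and an arbitrary threshold; it marks points that lie far from the mean.

```python
import numpy as np

amounts = np.array([52.0, 48.5, 51.2, 49.8, 50.3, 47.9, 250.0])  # one suspicious value
z_scores = (amounts - amounts.mean()) / amounts.std()

outliers = amounts[np.abs(z_scores) > 2.0]  # threshold is a judgment call
print(outliers)                             # -> [250.]
```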
32. Discovering Interesting Patterns
- A data mining system/query may generate thousands of patterns; not all of them are interesting
- Suggested approach: human-centered, query-based, focused mining
- Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates a hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures
- Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
- Subjective: based on the user's beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.
33. Can We Find Interesting Patterns?
- Find all the interesting patterns: completeness
- Can a data mining system find all the interesting patterns?
- Association vs. classification vs. clustering
- Search for only interesting patterns: optimization
- Can a data mining system find only the interesting patterns?
- Approaches
- First generate all the patterns and then filter out the uninteresting ones
- Generate only the interesting patterns: mining query optimization
34. Major Issues in Data Mining
- Mining methodology and user interaction
- Mining different kinds of knowledge in databases
- Interactive mining of knowledge at multiple levels of abstraction
- Incorporation of background knowledge
- Data mining query languages and ad-hoc data mining
- Expression and visualization of data mining results
- Handling noise and incomplete data
- Pattern evaluation: the interestingness problem
- Performance and scalability
- Efficiency and scalability of data mining algorithms
- Parallel, distributed, and incremental mining methods
35. Major Issues in Data Mining
- Issues relating to the diversity of data types
- Handling relational and complex types of data
- Mining information from heterogeneous databases and global information systems (WWW)
- Issues related to applications and social impacts
- Application of discovered knowledge
- Domain-specific data mining tools
- Intelligent query answering
- Process control and decision making
- Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem
- Protection of data security, integrity, and privacy
36. Meta-Analysis
- 1952: Hans J. Eysenck concluded that there were no favorable effects of psychotherapy, starting a raging debate
- 20 years of evaluation research and hundreds of studies failed to resolve the debate
- 1978: To prove Eysenck wrong, Gene V. Glass statistically aggregated the findings of 375 psychotherapy outcome studies
- Glass (and colleague Smith) concluded that psychotherapy did indeed work
- Glass called his method "meta-analysis"
37. The Emergence of Meta-Analysis
- Ideas behind meta-analysis predate Glass's work by several decades
- R. A. Fisher (1944)
- "When a number of quite independent tests of significance have been made, it sometimes happens that although few or none can be claimed individually as significant, yet the aggregate gives an impression that the probabilities are on the whole lower than would often have been obtained by chance" (p. 99)
- Source of the idea of cumulating probability values
- W. G. Cochran (1953)
- Discusses a method of averaging means across independent studies
- Laid out much of the statistical foundation that modern meta-analysis is built upon (e.g., inverse variance weighting and homogeneity testing)
38. The Logic of Meta-Analysis
- Traditional methods of review focus on statistical significance testing
- Significance testing is not well suited to this task
- it is highly dependent on sample size
- a null finding does not carry the same weight as a significant finding
- Meta-analysis changes the focus to the direction and magnitude of the effects across studies
- Isn't this what we are interested in anyway?
- Direction and magnitude are represented by the effect size
39. When Can You Do Meta-Analysis?
- Meta-analysis is applicable to collections of research that
- are empirical, rather than theoretical
- produce quantitative results, rather than qualitative findings
- examine the same constructs and relationships
- have findings that can be configured in a comparable statistical form (e.g., as effect sizes, correlation coefficients, odds ratios, etc.)
- are comparable given the question at hand
40. Effect Size: The Key to Meta-Analysis
- Central Tendency Research
- prevalence rates
- Pre-Post Contrasts
- growth rates
- Group Contrasts (see the sketch below)
- experimentally created groups
- comparison of outcomes between treatment and comparison groups
- naturally occurring groups
- comparison of spatial abilities between boys and girls
- Association Between Variables
- measurement research
- validity generalization
- individual differences research
- correlation between personality constructs
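For the treatment-vs-comparison group contrasts listed above, a common effect size is the standardized mean difference (Cohen's d). A minimal sketch with made-up outcome scores:

```python
import numpy as np

treatment = np.array([12.0, 15.0, 14.0, 16.0, 13.0])   # hypothetical outcome scores
comparison = np.array([10.0, 11.0, 12.0, 9.0, 11.0])

n1, n2 = len(treatment), len(comparison)
s1, s2 = treatment.std(ddof=1), comparison.std(ddof=1)

# Pooled standard deviation
sp = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

d = (treatment.mean() - comparison.mean()) / sp
print(f"Cohen's d = {d:.2f}")   # captures both direction and magnitude of the effect
```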
41. The Replication Continuum
You must be able to argue that the collection of studies you are meta-analyzing examines the same relationship. This may be at a broad level of abstraction, such as the relationship between criminal justice interventions and recidivism, or between school-based prevention programs and problem behavior. Alternatively, it may be at a narrow level of abstraction and represent pure replications. The closer your collection of studies is to pure replications, the easier it is to argue comparability.
42. Which Studies to Include?
- It is critical to have explicit inclusion and exclusion criteria; the broader the research domain, the more detailed they tend to become
- developed iteratively as you interact with the literature
- To include or exclude low-quality studies?
- the findings of all studies are potentially in error (methodological quality is a continuum, not a dichotomy)
- being too restrictive may limit your ability to generalize
- being too inclusive may weaken the confidence that can be placed in the findings
- you must strike a balance that is appropriate to your research question
43. Searching for Studies to Include
- Argument: "We only included published studies because they have been peer-reviewed." But significant findings are more likely to be published than non-significant findings, so limiting the search to published work invites bias
- It is critical to try to identify and retrieve all studies that meet your eligibility criteria
- Potential sources for identification of documents
- computerized bibliographic databases
- authors working in the research domain
- conference programs
- dissertations
- review articles
- hand-searching relevant journals
- government reports, bibliographies, clearinghouses
44. Strengths of Meta-Analysis
- Imposes a discipline on the process of summing up research findings
- Represents findings in a more differentiated and sophisticated manner than conventional reviews
- Capable of finding relationships across studies that are obscured in other approaches
- Protects against over-interpreting differences across studies
- Can handle a large number of studies (which would overwhelm traditional approaches to review)
45. Weaknesses of Meta-Analysis
- Requires a good deal of effort
- Mechanical aspects don't lend themselves to capturing more qualitative distinctions between studies
- "Apples and oranges": the comparability of studies is often in the eye of the beholder
- Most meta-analyses include "blemished" studies
- Selection bias poses a continual threat
- negative- and null-finding studies that you were unable to find
- outcomes for which there were negative or null findings that were not reported
- Analysis of between-study differences is fundamentally correlational
46. Steps in Meta-Analysis
- Define the research question and specific hypotheses
- Define the criteria for including and excluding studies
- Study designs (randomized vs. observational)
- Publication and date thereof (vs. unpublished)
- Language of publication
- Multiple publications from the same sample
- Sample size (large vs. small)
- Method and length of follow-up/ascertainment
- Population characteristics (high vs. low risk)
- Treatment or exposure (drug name, dose)
- Missing information about key effect sizes
- Locate research studies
- Determine which studies are eligible for inclusion
- Maintain a log of reasons for ineligibility
- Independent review by 2 or more abstractors
- Blind abstractors to the results of the study (and authors, if possible)
47. Steps in Meta-Analysis
- Classify and code important study characteristics (e.g., sample size, length of follow-up, definition of outcome, drug brand and dose)
- Develop and pilot test the abstraction form
- Develop abstracting instructions and rules
- Train abstractors and monitor their reliability
- Consider using a quality rating system
- Select or translate results from each study using a common metric
- Intention to treat vs. treatment received
- Adjusted vs. unadjusted
- Entire sample vs. subgroup
- Truncate follow-up time if necessary
48. Steps in Meta-Analysis
- Aggregate findings across studies, generating weighted pooled estimates of effect size (see the sketch below)
- Fixed effects: Did the treatment produce benefit, on average, in the studies reported to date?
- Random effects: Will the treatment produce benefit, on average?
- (Assumes that the reported studies are a sample of some hypothetical population of studies)
of some hypothetical population of studies) - Select or translate results from each study using
a common metric - Intention to treat vs. treatment received
- Adjusted vs. unadjusted
- Entire sample vs. subgroup
- Truncate follow-up time if necessary
- Evaluate the statistical homogeneity of the pooled studies
- Use stratification or modeling (meta-regression) techniques to explain variation in findings across studies
- Perform sensitivity analyses to assess the impact of excluding or down-weighting unpublished studies, studies of lower quality, out-of-date studies, etc.
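To make the pooling step concrete, here is a minimal fixed-effect sketch: effect sizes are combined with inverse-variance weights, and Cochran's Q gives a simple homogeneity check. The effect sizes and variances are invented for illustration.

```python
import numpy as np

effects   = np.array([0.30, 0.45, 0.20, 0.55])   # effect size reported by each study
variances = np.array([0.02, 0.05, 0.03, 0.04])   # sampling variance of each effect size

weights = 1.0 / variances                         # inverse-variance weights
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

# Cochran's Q: compare to a chi-square with (number of studies - 1) degrees of freedom
q = np.sum(weights * (effects - pooled) ** 2)

print(f"pooled effect = {pooled:.3f} (SE = {pooled_se:.3f}), Q = {q:.2f}")
```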