Data Analysis - PowerPoint PPT Presentation

About This Presentation
Title:

Data Analysis

Slides: 27
Provided by: srip1

Transcript and Presenter's Notes

Title: Data Analysis


1
Data Analysis
  • Yaji Sripada

2
In this lecture you learn
  • The objectives of data analysis
  • Fitting models to the input data according to the
    end-user requirements
  • Data Analysis
  • Tasks and
  • Methods
  • Knowledge acquisition (KA) techniques to
    understand the required data analysis tasks and
    requirements due to HCI (Human Computer
    Interaction)
  • An iterative process for designing data analysis
    methods
  • With multiple KA and evaluation studies
  • Issues with the reuse of data analysis algorithms
    developed in other fields by matching
  • Requirements due to HCI
  • Features of data analysis methods

3
Introduction
  • High level architecture of our systems
  • Data Analysis (DA)
  • Compute patterns or models (in general
    abstractions) from raw input data
  • Information Visualization (infovis)
  • Present the relevant abstractions (patterns or
    models) in a form suitable to the end user
  • Support user interaction (will see examples
    later)
  • Integrating Data Analysis and InfoVis is the main
    focus of this course
  • Two Options for integration
  • Option 1 - Loose Coupling
  • Option 2 - Theory Driven

4
Introduction (2)
  • Loose Coupling
  • Two libraries of data analysis and infovis are
    offered to the user
  • User is given freedom in exploiting the available
    methods to understand data
  • Certain constraints may be defined in linking a
    specific data analysis method with a specific
    infovis method
  • Already available in many existing tools such as
    R, Excel
  • Theory Driven
  • InfoVis module defines an HCI theory that guides
    the user's access to data analysis methods and
    to the visualizations
  • DA works under two contexts
  • Domain Context
  • HCI (Human Computer Interaction) Context
  • We lump together all HCI related issues here but
    study them later
  • Impact of HCI context on DA under investigation

5
Introduction (3)
  • Objective of IE
  • Make sense of input data
  • Making sense of data involves fitting a known
    model to the data
  • If the fit is successful we say we understand the
    data
  • Because we can derive or infer new information
    using the model
  • Example: Pressure-volume data
  • Fitting a linear trend line
  • Model: Linear model
  • Linear models such as this are easy to
    communicate
  • Text: There is an inverse relationship between
    pressure and volume of an ideal gas (Boyle's law)
  • Graph: as shown on the side
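A minimal sketch of the model-fitting idea on this slide, assuming made-up pressure-volume readings (the values and the helper name fit_line are illustrative, not from the lecture). Because pressure is inversely proportional to volume, the sketch fits the straight line against 1/V and then uses the fitted model to infer a value that was not in the data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up ideal-gas readings with P * V = 100 (illustrative, not measured data).
volumes = [1.0, 2.0, 4.0, 5.0, 10.0]
pressures = [100.0 / v for v in volumes]

# Boyle's law: P is proportional to 1/V, so P is linear in 1/V.
inverse_volumes = [1.0 / v for v in volumes]
slope, intercept = fit_line(inverse_volumes, pressures)

# A successful fit lets us infer new information from the model.
predicted_pressure = slope * (1.0 / 8.0) + intercept
print(f"Fitted slope: {slope:.2f}, intercept: {intercept:.2f}")
print(f"Predicted pressure at V = 8: {predicted_pressure:.2f}")
```

The single fitted slope is the whole model, which is why such a result is easy to communicate as one sentence or one line on a graph.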

6
Data Analysis
  • Data Analysis
  • Compute meaningful abstractions from raw input
    data
  • May involve a strategic application of several
    individual analysis methods
  • Integrate elementary observations to identify
    high level abstractions
  • Results of data analysis are communicated to the
    end-user using infovis
  • This means the data analysis module needs to be
    controlled by the end-user
  • Several iterations of data analysis might be
    performed by the end-user to develop insights into
    the whole underlying data
  • Separation of Data analysis tasks and data
    analysis methods
  • Data analysis tasks are achieved by data analysis
    methods
  • Data analysis is defined by specifying its
  • input and
  • output

7
Input Data
  • Sensor Data
  • Measurement of something in the real world
  • E.g. dive computer data (obtained from a
    pressure sensor installed on the dive computer)
    and BabyTalk data
  • Data collected from any form of measurements
    belongs to this class
  • Always involves context in which the data can be
    interpreted
  • Simulation data
  • Data generated by a computer simulation
  • Weather data, pollen data, etc.
  • Does not involve context for interpretation
  • Quality of input data determines the quality of
    the output and also determines the effort
    required for data analysis
  • Clean data is easier to process and produces high
    quality output

8
Output Models and Patterns
  • In general, outputs of data analysis involve
    representations of abstractions
  • Models are abstractions that span the whole data
    set
  • Global abstractions
  • Do not model the data generation process which
    produced the input data
  • Patterns are abstractions that span portions of a
    data set
  • Local abstractions
  • Output abstractions should be
  • Simple - such as the linear abstraction computed
    from the pressure volume data (slide 5)
  • Global - such as the linear abstraction computed
    from the pressure volume data (slide 5)
  • Local abstractions are acceptable in contexts
    where the user already has a global view of the
    information
  • We may not always succeed in fitting global
    models
  • We may have to fit models piecewise
  • Fitting models to subsets (portions) of the input
    data set
  • Easy to map to graphical elements in information
    visualizations (and also to words/phrases in the
    text sometimes)
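A hedged sketch of piecewise fitting, assuming a made-up series and a breakpoint chosen by hand (choosing breakpoints automatically is a separate problem not covered here). It compares one global line with two local lines fitted to portions of the data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def squared_error(xs, ys, slope, intercept):
    """Sum of squared residuals of the line on the given points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def fit_piecewise(xs, ys, breakpoints):
    """Fit one line per segment; breakpoints are indices where a new segment starts."""
    bounds = [0] + list(breakpoints) + [len(xs)]
    segments = []
    for start, end in zip(bounds, bounds[1:]):
        slope, intercept = fit_line(xs[start:end], ys[start:end])
        error = squared_error(xs[start:end], ys[start:end], slope, intercept)
        segments.append({"from": xs[start], "to": xs[end - 1],
                         "slope": slope, "intercept": intercept, "error": error})
    return segments


# Made-up series that rises and then falls, so no single line fits it well.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 2, 4, 6, 8, 6, 4, 2]

global_slope, global_intercept = fit_line(xs, ys)
global_error = squared_error(xs, ys, global_slope, global_intercept)

segments = fit_piecewise(xs, ys, breakpoints=[5])  # hand-picked breakpoint
piecewise_error = sum(s["error"] for s in segments)
print(f"Global error: {global_error:.2f}, piecewise error: {piecewise_error:.2f}")
```

Each segment is a small record (range, slope, intercept) that could be mapped to one graphical element or one phrase.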

9
Knowledge Acquisition (KA)
  • In a conventional exploratory data analysis
    context, data analysis is a bottom-up
    (data-driven) process
  • In our case, data analysis is a top-down
    (goal-driven) process
  • Knowledge acquisition (KA) studies (discussed
    next) identify required data analysis tasks
  • For designing the data analysis modules we need
  • Knowledge about the application domain and
  • Knowledge of the user tasks and users'
    informational requirements
  • KA studies to be performed before the design
    phase
  • With experts
  • With users
  • Case studies
  • Exploratory data analysis (EDA)
  • Prototype development

10
KA Techniques
  • Techniques developed in the expert system
    community
  • Think aloud sessions
  • Direct interviews
  • Studying examples or case studies
  • Exploratory Data Analysis (EDA)
  • To understand the data set using data analysis
    methods from descriptive statistics
  • Analytical methods
  • Graphical methods
  • You used EDA in practical 1

11
Identification of data analysis tasks
  • KA studies normally produce a list of queries
    the user wants to ask the system, such as
  • What is the typical value in the data set?
  • What are the outliers?
  • What are the relationships among the various data
    items?
  • What are the portions of the data that fit a
    given pattern?
  • What is the model that describes the data?
  • Queries suggest the required data analysis tasks
  • System response to queries (individual or
    grouped) can be viewed as messages about the
    underlying data
  • Please note messages can be realized either using
    graphics or using text
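One possible way to picture the link between queries, tasks and messages, as a sketch only. The readings are made up, the median stands in for "typical value", and the 1.5 x interquartile-range rule is an assumed definition of "outlier"; the lecture does not prescribe either choice.

```python
import statistics


def typical_value(values):
    """Task for the query about the typical value (median chosen here)."""
    return statistics.median(values)


def find_outliers(values):
    """Task for the query about outliers, using the 1.5 * IQR rule (an assumption)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]


# Each user query suggests one data analysis task.
TASKS = {
    "What is the typical value in the data set?": typical_value,
    "What are the outliers?": find_outliers,
}


def answer(query, values):
    """Run the task for a query and wrap the result as a message."""
    return {"query": query, "result": TASKS[query](values)}


readings = [12, 14, 15, 15, 16, 17, 18, 95]  # made-up data
for query in TASKS:
    print(answer(query, readings))
```

The returned message is deliberately neutral about presentation, so it could later be realized as text or as a graphic.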

12
Simple Example - Analysis of exam marks data
  • Simple questions to be answered in this case
  • What are the maximum and minimum marks?
  • What is the class average or standard deviation?
  • Frequency counts
  • How many failed the exam?
  • How many got first class?
  • On which questions did students perform
    well or poorly?
  • And so on
  • Answers to the above questions are the different
    messages in this application
  • In this case, the different data analysis tasks
    are
  • Compute maximum
  • Compute minimum
  • Compute average
  • Compute other statistics
  • We can also work out questions users ask of a
    system helping them to understand the world of
    digital cameras (example from the introduction
    lecture)
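The exam marks tasks above can be sketched with the Python standard library. The marks, the pass mark of 40 and the first-class mark of 70 are assumptions made for illustration only.

```python
import statistics

PASS_MARK = 40          # assumed threshold, not from the lecture
FIRST_CLASS_MARK = 70   # assumed threshold, not from the lecture

marks = [35, 42, 58, 61, 67, 70, 74, 88]  # made-up marks

# Each entry is one message answering one of the simple questions.
messages = {
    "maximum": max(marks),
    "minimum": min(marks),
    "average": statistics.mean(marks),
    "standard deviation": statistics.pstdev(marks),
    "failed": sum(1 for m in marks if m < PASS_MARK),
    "first class": sum(1 for m in marks if m >= FIRST_CLASS_MARK),
}

for label, value in messages.items():
    print(f"{label}: {value}")
```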

13
Design of Data Analysis Module
  • Main Steps in the design process
  • Perform KA studies
  • Identify the HCI (Human Computer Interaction)
    features of the user's interaction with the full
    system
  • Single view of output
  • Interactive views of output
  • Identify required data analysis tasks from KA
    studies
  • For each of the tasks design a data analysis
    method
  • Decide about how these methods are controlled
  • Pipeline or
  • More sophisticated architectures
  • If the user wants to interact with the system
    freely (Loose coupling)
  • If the user wants to interact with the system
    according to an HCI theory (Theory driven)

14
Design of Data Analysis Module (II)
  • Consider the contextual effects of other tasks
  • Each method works in the context of other methods
    related to other tasks
  • Unknown territory; more studies required here
  • Optionally design
  • A pre-processing method for preparing the raw
    input data for data analysis
  • a post-processing method that organizes the
    results of data analysis as required by the
    infovis module
  • Cycle through the above design steps many times
  • Evaluating the design at the end of every cycle
  • The above procedure relies a lot on the
    information from KA studies and evaluation
    studies
  • Quality of KA and evaluations is important
  • KA and evaluations are the hardest tasks of the
    system building activity

15
Evaluation
  • Independent evaluation of the data analysis
    module
  • Using known metrics such as precision and recall
  • Evaluation of the data analysis module in the
    context of the whole system
  • New metrics required to measure the goodness of
    the system as a whole
  • Metrics may vary with improvements in technology
  • Task (user) based evaluations
  • Studied later during the course
  • Evaluations are costly
  • Multiple cheap evaluations often better than one
    expensive evaluation
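For the independent evaluation, precision and recall can be computed by comparing the patterns a method detects with those marked by an expert. A sketch, assuming patterns are identified by made-up integer ids:

```python
def precision_recall(detected, relevant):
    """Precision: fraction of detected items that are relevant.
    Recall: fraction of relevant items that were detected."""
    true_positives = len(detected & relevant)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up ids of patterns found by the method and patterns marked by an expert.
detected = {2, 5, 9, 14}
relevant = {2, 5, 7, 9, 11}

precision, recall = precision_recall(detected, relevant)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```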

16
Design of Data Analysis Methods
  • For each identified data analysis task we need a
    data analysis method
  • The actual procedure or algorithm that achieves
    the task
  • Data analysis methods are developed in many
    fields
  • Statistics
  • Data Mining
  • Pattern Recognition
  • Machine Learning
  • We reuse methods developed in the above fields
  • Sometimes, we assume a library of data analysis
    methods
  • Such as R/MatLab

17
Statistics
  • Time tested techniques for primary data analysis
  • Most of the data analysis tasks in our exam marks
    example can be achieved by statistical techniques
  • Two types of techniques
  • Numerical
  • Compute statistics such as mean and standard
    deviation
  • Good at computing objective and precise
    descriptions of data
  • Graphical
  • Create histograms, stem and leaf displays and box
    plots
  • Good at presenting (communicating) the data to
    humans
  • Note: Statisticians exploited the power of
    combining numerical techniques (data analysis)
    and graphical techniques (infovis)
  • Work well for analysing smaller data sets
    (hundreds or thousands of data items, not millions
    or billions)
  • Need for data analysis techniques that process
    large data sets (millions or billions of data
    items)
  • Algorithmic implementations of many statistical
    procedures are available in the form of libraries
    (for example R)
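A small sketch combining a numerical technique (mean and standard deviation) with a graphical one (a stem and leaf display), assuming made-up marks that are non-negative integers. Stems with no values are simply skipped in this simplified version.

```python
import statistics
from collections import defaultdict


def stem_and_leaf(values):
    """Return the lines of a stem and leaf display (tens digit | units digits)."""
    stems = defaultdict(list)
    for value in sorted(values):
        stems[value // 10].append(value % 10)
    return [
        f"{stem:>2} | {''.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(stems.items())
    ]


marks = [35, 42, 58, 61, 67, 70, 74, 88, 61]  # made-up marks

# Numerical description: objective and precise.
mean_mark = statistics.mean(marks)
spread = statistics.stdev(marks)
print(f"mean = {mean_mark:.1f}, standard deviation = {spread:.1f}")

# Graphical description: easier for a human to take in at a glance.
display = stem_and_leaf(marks)
for line in display:
    print(line)
```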

18
Data Mining
  • Data-driven techniques for discovering
    unsuspected and useful patterns or models from
    very large data sets (megabytes and gigabytes)
  • Mainly used for secondary analysis of data, often
    without any specific goal
  • Pure statisticians might call this data fishing
  • Largely made up of existing statistical ideas
    scaled up!
  • Data Mining does not replace humans
  • Data mining offers tools to perform data analysis
  • Like all tools, the quality of the results of data
    mining depends on the skill of the user
  • Increases the productivity of the user
  • User should be good at
  • Statistics
  • Computer Science and
  • Domain knowledge

19
Pattern Recognition
  • Techniques for solving perceptual problems
  • Image Processing
  • Speech Processing
  • In this course we are concerned with simple
    patterns such as rapid ascent in a scuba dive
    profile
  • We will design our own simple pattern detection
    methods
  • But in general pattern recognition methods are
    part of our technology
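A sketch of a simple pattern detector for rapid ascents, not the method used in the course. The depth samples, the 20 second sampling interval and the 10 m/min limit are assumptions chosen for illustration.

```python
def find_rapid_ascents(depths, interval_s, max_rate_m_per_min):
    """Return (start_index, end_index) spans where the ascent rate
    between consecutive samples exceeds the given limit."""
    spans = []
    start = None
    for i in range(1, len(depths)):
        # Depth decreases while ascending, so the rate is positive on ascent.
        rate = (depths[i - 1] - depths[i]) / interval_s * 60.0
        if rate > max_rate_m_per_min:
            if start is None:
                start = i - 1
        elif start is not None:
            spans.append((start, i - 1))
            start = None
    if start is not None:
        spans.append((start, len(depths) - 1))
    return spans


# Made-up dive profile: depth in metres, sampled every 20 seconds.
depths = [0, 5, 10, 15, 18, 18, 12, 6, 5, 4, 2, 0]
rapid = find_rapid_ascents(depths, interval_s=20, max_rate_m_per_min=10.0)
print(rapid)
```

Each span is a local abstraction: it covers only a portion of the profile and could be highlighted on a depth-time graph.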

20
Machine Learning
  • Data analysis techniques for automated learning
  • Usually the output of learning is used by
    machines, not humans
  • Not studied here (covered as part of CS5565 Data
    Mining)

21
Reusing data analysis methods
  • Data analysis methods are normally designed in an
    idealized mathematical context
  • The user of these methods (such as R/MatLab
    methods) is expected to know how to map
    information from real contexts to this idealized
    mathematical context
  • As a result, while reusing data analysis methods
    we need to map information from our context to
    the idealized context and map the results back
    from the idealized context to our context
  • Note that we use data analysis in an HCI context
  • This also means designing a data analysis module
    involves
  • a search for a method in a library of methods
    (such as R/MatLab) with a good match between
  • requirements due to HCI context and
  • known features of the data analysis methods
  • Adapt an existing method to suit the user
    requirements

22
Requirements due to HCI
  • Interactivity: communication using information
    visualizations (studied later in the course) is
    essentially interactive
  • Here, communicating the system's internal context
    to the user is important
  • Multi-modality: based on the abilities or
    disabilities of users
  • Gaps in communication: depending on the output
    modality, certain abstractions might be hard to
    communicate
  • Users' informational requirements
  • level of expertise or prior knowledge
  • Output size restrictions
  • limited screen size etc.

23
Features of Data Analysis Methods
  • Configurability
  • Data analysis methods use parameters that allow
    users to configure their runtime behaviour
  • These parameters may not be suitable from the
    communication perspective
  • Users may not always be able to specify these
    parameters accurately
  • When an ideal fit of parameters is not available
    we modify these methods with the
    parameterisation suitable to our contexts
  • Or look for approximate fits
  • Level of Abstraction
  • Data analysis methods abstract the raw input data
    as stated earlier into either global models or
    local patterns
  • The level of abstraction achieved has important
    consequences for the infovis module
  • Because the level of abstraction determines the
    level of detail in the final output
  • The level of abstraction should be determined by
    the end-user's tasks and informational
    requirements
  • Size of the final output
  • One of the major constraints on the design of
    data analysis module is the size of the output
    produced by the whole system
  • Users tend to dislike
  • Large complicated graphics or
  • Large volumes of text
  • Again, the user's output size requirements should
    determine how much information is computed by the
    data analysis method
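One way to picture the three features together, as a sketch under stated assumptions: a user-facing detail level (hypothetical names "overview", "standard", "detailed") is translated into the method's numeric parameter, which in turn fixes the level of abstraction and the size of the output.

```python
# Hypothetical mapping from a user-friendly setting to the method's parameter.
DETAIL_LEVELS = {"overview": 2, "standard": 4, "detailed": 8}


def summarise(values, detail):
    """Abstract a series into bucket means. Fewer buckets means a higher
    level of abstraction and a smaller output."""
    buckets = min(DETAIL_LEVELS[detail], len(values))
    means = []
    for b in range(buckets):
        # Integer arithmetic keeps every bucket non-empty.
        start = b * len(values) // buckets
        end = (b + 1) * len(values) // buckets
        chunk = values[start:end]
        means.append(sum(chunk) / len(chunk))
    return means


depths = [0, 5, 10, 15, 18, 18, 12, 6]  # made-up series

overview = summarise(depths, "overview")
standard = summarise(depths, "standard")
print(overview)
print(standard)
```

The user never sees the bucket count; they only choose a level that matches their task, which sidesteps the problem of users being unable to specify method parameters accurately.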

24
Many alternative methods
  • Data Mining community develops multiple methods
    for achieving the same data analysis task
  • When several data analysis methods are available,
    a method that generates abstractions which
    satisfy user requirements should be preferred
  • When an ideal fit is not available we choose the
    method that achieves the best result and make
    alternatives available for exploration
  • Making the exploration of multiple methods user
    friendly is challenging
  • For complex methods users need a black-box view
  • For simpler methods users require a glass-box
    view
  • We could also implement an adapted version of an
    existing method
  • Users could be offered features to control the
    adaptation process
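A hedged sketch of choosing among alternative methods by matching requirements due to HCI against features of the methods. The feature names, the candidate methods and the simple count-based score are all hypothetical; a real system would need requirements gathered from KA studies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MethodFeatures:
    name: str
    output_items: int   # size of the output the method produces
    abstraction: str    # "global" or "local"
    parameters: int     # number of parameters the user must set


@dataclass(frozen=True)
class HciRequirements:
    max_output_items: int
    abstraction: str
    max_parameters: int


def match_score(method, req):
    """Count how many requirements the method satisfies (0 to 3)."""
    return sum([
        method.output_items <= req.max_output_items,
        method.abstraction == req.abstraction,
        method.parameters <= req.max_parameters,
    ])


def rank_methods(methods, req):
    """Best match first; the rest stay available for exploration."""
    return sorted(methods, key=lambda m: match_score(m, req), reverse=True)


candidates = [
    MethodFeatures("piecewise segmentation", output_items=6, abstraction="local", parameters=1),
    MethodFeatures("linear trend", output_items=1, abstraction="global", parameters=0),
    MethodFeatures("clustering", output_items=3, abstraction="local", parameters=3),
]
requirements = HciRequirements(max_output_items=5, abstraction="global", max_parameters=1)

ranked = rank_methods(candidates, requirements)
for method in ranked:
    print(method.name, match_score(method, requirements))
```

When no candidate reaches the full score, the top of the ranking plays the role of the best available fit and the remaining entries are the alternatives offered to the user.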

25
Data types
  • Input data can be of different types
  • Single variable
  • Multi-variable
  • Time series or
  • Spatial
  • And others
  • Data Analysis methods depend upon the type of the
    input data
  • In this course we focus on analysis of
  • Time series data and
  • E.g. Scuba dive profile data
  • Spatial data
  • E.g. georeferenced census data

26
Summary
  • The data analysis module needs to be integrated
    with the infovis module
  • Data analysis methods need to be controlled by
    the user
  • The success of user control depends upon the
    success of KA
  • Better understanding of overall system
    requirements and its operational context leads to
    better design
  • Designing a data analysis module involves finding
    the best possible match between
  • Requirements due to HCI context and
  • Features of data analysis methods
  • In this course we focus on analysis of
  • Time series data and
  • Spatial data