Title: Educational data mining overview
1Educational data mining overview: Introduction to Exploratory Data Analysis with DataShop
- Ken Koedinger, CMU Director of PSLC
- Professor of Human-Computer Interaction & Psychology
- Carnegie Mellon University
2Overview
- DataShop Overview
- Logging model
- DataShop Features
- Quantitative models of learning curves
- Power law, logistic regression
- Contrasting KC models
- Exploratory Data Analysis Exercise (start)
- Knowledge Component Model Editing
3Logging & Storage Models
- Education technologies are instrumented to produce log data
- We encourage a standard log format
- XML format generalized from Ritter & Koedinger (1995)
- Also convert log data from other formats
4Relational Database -- complex!
5Example activity generating click stream data
- Geometry Cognitive Tutor "Making Cans" problem
- Find the area of scrap metal left over after removing a circular area (the end of a can) from a metal square.
- Student enters values in worksheet
- Tutor provides feedback & instruction
- Records student's actions & tutor responses
- Logs stored in files on school server or database at Carnegie Learning
- Later imported into DataShop
6DataShop logging model
- Main constructs
- Context message: the student, problem, and session with the tutor
- Tool message: represents an action in the tool performed by a student or tutor
- Tutor message: represents a tutor's response to a student action
7DataShop XML format: Context message
<context_message context_message_id="C2badca9c5c-7fe5" name="START_PROBLEM">
  <dataset>
    <name>Geometry Hampton 2005-2006</name>
    <level type="Lesson">
      <name>PACT-AREA</name>
      <level type="Section">
        <name>PACT-AREA-6</name>
        <problem>
          <name>MAKING-CANS</name>
        </problem>
      </level>
    </level>
  </dataset>
</context_message>
Slide callouts: Dataset name, Course unit, Course section, Problem
8DataShop XML format: Tool & Tutor Messages
<tool_message context_message_id="C2badca9c5c-7fe5">
  <semantic_event transaction_id="T2a9c5c-7fe7" name="ATTEMPT" />
  <event_descriptor>
    <selection>(POG-AREA QUESTION2)</selection>
    <action>INPUT-CELL-VALUE</action>
    <input>200.96</input>
  </event_descriptor>
</tool_message>
<tutor_message context_message_id="C2badca9c5c-7fe5">
  <semantic_event transaction_id="T2a9c5c-7fe7" name="RESULT" />
  <event_descriptor> as above </event_descriptor>
  <action_evaluation>CORRECT</action_evaluation>
</tutor_message>
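A tool message and the tutor message that answers it share a transaction_id, so the two can be joined into one transaction record. The following is a minimal sketch of that join using Python's standard xml.etree.ElementTree. The <log> wrapper element and the function name pair_transactions are assumptions made for illustration (an XML document needs a single root), and the tutor message's event descriptor is omitted here because the slide elides it.

```python
import xml.etree.ElementTree as ET

# Hypothetical <log> wrapper: added only so the two messages share one root.
LOG_XML = """
<log>
  <tool_message context_message_id="C2badca9c5c-7fe5">
    <semantic_event transaction_id="T2a9c5c-7fe7" name="ATTEMPT" />
    <event_descriptor>
      <selection>(POG-AREA QUESTION2)</selection>
      <action>INPUT-CELL-VALUE</action>
      <input>200.96</input>
    </event_descriptor>
  </tool_message>
  <tutor_message context_message_id="C2badca9c5c-7fe5">
    <semantic_event transaction_id="T2a9c5c-7fe7" name="RESULT" />
    <action_evaluation>CORRECT</action_evaluation>
  </tutor_message>
</log>
"""


def pair_transactions(xml_text):
    """Join each tool message to the tutor message with the same transaction_id."""
    root = ET.fromstring(xml_text)

    # Index tutor evaluations by transaction id.
    evaluations = {}
    for msg in root.iter("tutor_message"):
        tx_id = msg.find("semantic_event").get("transaction_id")
        evaluations[tx_id] = msg.findtext("action_evaluation")

    rows = []
    for msg in root.iter("tool_message"):
        tx_id = msg.find("semantic_event").get("transaction_id")
        descriptor = msg.find("event_descriptor")
        rows.append({
            "transaction_id": tx_id,
            "selection": descriptor.findtext("selection"),
            "action": descriptor.findtext("action"),
            "input": descriptor.findtext("input"),
            # None when no tutor response was logged (e.g., an untutored tool).
            "evaluation": evaluations.get(tx_id),
        })
    return rows


transactions = pair_transactions(LOG_XML)
```

Using evaluations.get rather than direct indexing keeps the sketch usable for on-line tools that log tool messages without tutor responses.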
9Example Stored Transactions
- Student interactions (or transactions) are stored in a relational database and can be exported as a table
- Example: Student S01 on Making-Cans problem
10Transactions
- Info for each transaction
- student(s), session, time, problem, problem step, attempt number, student action
- tutor response, number of hints, knowledge component code
- Logging of on-line tools (e.g., a virtual lab) does not include tutor response
11Step & Transaction Definitions
- A problem-solving activity typically involves many tool & tutor messages.
- "Steps" represent completion of possible subgoals or pieces of a problem solution
- "Transactions" are attempts at a step or requests for instructional help
12Example data aggregated by student-step
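Rolling transactions up to one row per student and step can be sketched as below. This is an illustration under stated assumptions, not DataShop's own aggregation code: the step names, the "INCORRECT" and "HINT" outcome labels, the column names, and the function name step_rollup are all hypothetical.

```python
# Hypothetical transactions in time order: (student, step, outcome).
TRANSACTIONS = [
    ("S01", "POG-AREA Q1", "INCORRECT"),
    ("S01", "POG-AREA Q1", "HINT"),
    ("S01", "POG-AREA Q1", "CORRECT"),
    ("S01", "SCRAP-AREA Q1", "CORRECT"),
]


def step_rollup(transactions):
    """Aggregate time-ordered transactions into one row per (student, step)."""
    rows = {}
    for student, step, outcome in transactions:
        # The first transaction seen for a step fixes its first-attempt outcome.
        row = rows.setdefault(
            (student, step),
            {"first_attempt": outcome, "attempts": 0, "incorrects": 0, "hints": 0},
        )
        row["attempts"] += 1
        if outcome == "INCORRECT":
            row["incorrects"] += 1
        elif outcome == "HINT":
            row["hints"] += 1
    return rows


step_table = step_rollup(TRANSACTIONS)
```

Keeping the first-attempt outcome separate from the counts matters because learning curves are usually drawn from first attempts only.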
14DataShop Analysis Tools
- Dataset Info
- Performance Profiler
- Learning Curve
- Error Report
- Export
- Sample Selector
15Dataset Info
- Metadata for a given dataset
- PIs get edit privileges, others must request it
Papers and Files storage
16Performance Profiler
Multipurpose tool to help identify areas that are
too hard or easy
- View measures of
- Error Rate
- Assistance Score
- Avg Hints
- Avg Incorrect
- Residual Error Rate
- Aggregate by
- Step
- Problem
- KC
- Dataset Level
17Learning Curve
Visualizes changes in student performance over
time
View by KC or Student, Assistance Score or Error
Rate
Time is represented on the x-axis as "opportunity", or the number of times a student (or students) had an opportunity to demonstrate a KC
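To make the "opportunity" axis concrete, here is a small sketch that counts opportunities per student and KC and averages the error rate at each opportunity number. The observations, the KC name "circle-area", and the function name are hypothetical; the sketch assumes the rows are already in time order and record only first attempts.

```python
from collections import defaultdict

# Hypothetical first-attempt observations in time order:
# (student, KC, was the first attempt correct?)
OBSERVATIONS = [
    ("S01", "circle-area", False),
    ("S02", "circle-area", False),
    ("S01", "circle-area", True),
    ("S02", "circle-area", False),
    ("S01", "circle-area", True),
    ("S02", "circle-area", True),
]


def error_rate_by_opportunity(observations):
    """Return {opportunity number: error rate across students and KCs}."""
    seen = defaultdict(int)    # (student, kc) -> opportunities so far
    errors = defaultdict(int)  # opportunity -> number of incorrect first attempts
    totals = defaultdict(int)  # opportunity -> number of observations
    for student, kc, correct in observations:
        seen[(student, kc)] += 1
        opportunity = seen[(student, kc)]
        totals[opportunity] += 1
        if not correct:
            errors[opportunity] += 1
    return {opp: errors[opp] / totals[opp] for opp in sorted(totals)}


curve = error_rate_by_opportunity(OBSERVATIONS)
```

Plotting the returned values against their keys gives an error-rate learning curve for this toy data.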
18Error Report
- Provides a breakdown of problem information (by step) for fine-grained analysis of problem-solving behavior
- Attempts are categorized by student
View by Problem or KC
19Sample Selector
Easily create a sample/filter to view a smaller
subset of data
- Filter by
- Condition
- Dataset Level
- Problem
- School
- Student
- Tutor Transaction
Shared (only owner can edit) and private samples
20Export
You can also export the Problem Breakdown table
and LFA values!
- Two types of export available
- By Transaction
- By Step
- Anonymous, tab-delimited file
- Easy to import into Excel!
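A tab-delimited export can also be read outside Excel. The sketch below uses Python's standard csv module; the column names and the single data row are hypothetical stand-ins, since the actual export columns are not listed here.

```python
import csv
import io

# Hypothetical export excerpt; real column names may differ.
EXPORT_TEXT = (
    "Student\tProblem\tStep\tOutcome\n"
    "S01\tMAKING-CANS\t(POG-AREA QUESTION2)\tCORRECT\n"
)


def read_export(text):
    """Parse tab-delimited export text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


rows = read_export(EXPORT_TEXT)
```

For a file on disk, the same DictReader call works on a file object opened with newline="".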
21Help/Documentation
- Extensive documentation with examples
- Contextual by tool/report
- http://learnlab.web.cmu.edu/datashop/help
Glossary of common terms, tied in with PSLC
Theory wiki
22New Features
- Manage Knowledge Component models
- Create, Modify & Delete KC models within DataShop
- Addition of Latency Curves to Learning Curve Reporting
- Time to Correct
- Assistance Time
- Problem Rollup Export
- Enhanced Contextual Help
24Recall learning curve story
Without decomposition, using just a single "Geometry" KC, there is no smooth learning curve.
But with decomposition, 12 KCs for area concepts, there is a smooth learning curve.
Upshot: A decomposed KC model fits learning & transfer data better than a "faculty theory" of mind.
25Learning curve analysis
- The Power Law of Learning (Newell & Rosenbloom, 1993)
- Y = a X^b
- Y = error rate
- X = opportunities to practice a skill
- a = error rate on 1st opportunity
- b = learning rate
- After the log transformation, log Y = log a + b log X
- log a is the intercept or starting point of the learning curve
- b is the slope or steepness of the learning curve
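Because the log transformation turns the power law into a straight line, its two parameters can be estimated with ordinary least squares on log X and log Y. The sketch below shows this under simplifying assumptions: the data are synthetic, generated from illustrative values a = 0.5 and b = -0.4, and every error rate must be strictly positive for the logarithm to exist. The function name fit_power_law is hypothetical.

```python
import math


def fit_power_law(opportunities, error_rates):
    """Least-squares fit of log Y = log a + b log X; returns (a, b)."""
    xs = [math.log(x) for x in opportunities]
    ys = [math.log(y) for y in error_rates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                         # slope in log-log space
    a = math.exp(mean_y - b * mean_x)     # back-transform the intercept
    return a, b


# Synthetic error rates generated from Y = 0.5 * X^-0.4 (illustrative only).
opportunities = [1, 2, 3, 4, 5, 6, 7, 8]
error_rates = [0.5 * x ** -0.4 for x in opportunities]

a, b = fit_power_law(opportunities, error_rates)
```

With noise-free synthetic data the fit recovers the generating values; a negative b corresponds to an error rate that falls with practice.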
26More sophisticated learning curve model
- Generalized Power Law to fit learning curves
- Logistic regression (Draney, Wilson, & Pirolli, 1995)
- Assumptions
- Different students may initially know more or less
- => use an intercept parameter for each student
- Students learn at the same rate
- => no slope parameters for each student
- Some productions may be more known than others
- => use an intercept parameter for each production
- Some productions are easier to learn than others
- => use a slope parameter for each production
- These assumptions are reflected in detailed math model
27More sophisticated learning curve model
$\ln\frac{p}{1-p} = \sum_i \alpha_i X_i + \sum_j \beta_j Y_j + \sum_j \gamma_j Y_j T_j$
- The log odds of getting a step correct (p is the probability of a correct step) are the sum of:
- if student i performed this step ($X_i = 1$), add the overall "smarts" of that student, $\alpha_i$
- if skill j is needed for this step ($Y_j = 1$), add the easiness of that skill, $\beta_j$
- add the product of the number of opportunities to learn, $T_j$, and the amount gained for each opportunity, $\gamma_j$
Use logistic regression because the response is discrete (correct or not). The probability (p) is transformed by log odds, "stretched out" with an "s curve" so that it does not bump up against 0 or 1. (Related to Item Response Theory, behind standardized tests.)
28Different representation, same model
- Predicts whether student is correct depending on knowledge & practice
- Additive Factor Model (Draney et al., 1995; Cen, Koedinger, & Junker, 2006)
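The model described above adds a student term to, for each skill the step needs, an easiness term and a per-opportunity gain, then maps that sum through the logistic function. A minimal sketch of the prediction step follows. The parameter values, the KC name "circle-area", and the function name afm_probability are hypothetical, and fitting the parameters from data is not shown.

```python
import math


def afm_probability(student_intercept, kc_easiness, kc_gain, opportunities):
    """P(correct) for one student on one step under an additive factor model.

    kc_easiness, kc_gain, opportunities: dicts keyed by the KCs the step needs.
    """
    logit = student_intercept
    for kc in kc_easiness:
        # Easiness of the skill plus gain per prior opportunity to practice it.
        logit += kc_easiness[kc] + kc_gain[kc] * opportunities[kc]
    # Logistic function: converts log odds back to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-logit))


# Illustrative values only: log odds = 0.5 + (-1.0) + 0.25 * 2 = 0.0
p = afm_probability(
    student_intercept=0.5,
    kc_easiness={"circle-area": -1.0},
    kc_gain={"circle-area": 0.25},
    opportunities={"circle-area": 2},
)

# The same student after two more practice opportunities.
p_later = afm_probability(
    0.5, {"circle-area": -1.0}, {"circle-area": 0.25}, {"circle-area": 4}
)
```

Because there is no per-student slope, every student gains the same amount per opportunity on a given KC, which matches the "students learn at the same rate" assumption.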
29The Q Matrix
- How to represent relationship between knowledge components and student tasks?
- Tasks also called items, questions, problems, or steps (in problems)
- Q-Matrix (Tatsuoka, 1983)
- 2 * 8 is a single-KC item
- 2 * 8 - 3 is a conjunctive-KC item, involves two KCs
Item \ KC    Add   Sub   Mul   Div
2 * 8        0     0     1     0
2 * 8 - 3    0     1     1     0
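The Q-matrix can be held as a plain mapping from item to a 0/1 row over the KCs. The sketch below reads the two items in the table as 2 * 8 (multiplication only) and 2 * 8 - 3 (subtraction and multiplication); the helper names are hypothetical.

```python
KCS = ["Add", "Sub", "Mul", "Div"]

# One row per item; a 1 marks a KC the item requires.
Q_MATRIX = {
    "2 * 8": [0, 0, 1, 0],
    "2 * 8 - 3": [0, 1, 1, 0],
}


def kcs_for_item(item):
    """List the KCs an item requires according to the Q-matrix."""
    return [kc for kc, used in zip(KCS, Q_MATRIX[item]) if used]


def is_conjunctive(item):
    """An item is conjunctive when it requires more than one KC."""
    return len(kcs_for_item(item)) > 1


single = kcs_for_item("2 * 8")
conjunctive = kcs_for_item("2 * 8 - 3")
```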
30Model Evaluation
- How to compare cognitive models?
- A good model minimizes prediction risk by balancing fit with data & complexity (Wasserman 2005)
- Compare BIC for the cognitive models
- BIC is the Bayesian Information Criterion
- BIC = -2 * log-likelihood + numPar * log(numOb)
- Better (lower) BIC = better prediction of data that haven't been seen
- Mimics cross validation, but is faster to compute
31- Data: the Geometry Area Unit
- 24 students, 230 items, 15 KCs
Model Title   LL       BIC     numPar
G             -2,175   4,566   26
Original      -1,911   4,271   54
Item          -1,720   5,554   254
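The trade-off in this table can be checked with a few lines of arithmetic. The sketch below defines BIC as -2 * log-likelihood + numPar * log(numOb), assuming the natural logarithm, and then uses only the reported LL and BIC values to recover each model's complexity penalty (BIC + 2 * LL). The number of observations is not given on the slide, so it is not assumed here.

```python
import math


def bic(log_likelihood, num_par, num_obs):
    """Bayesian Information Criterion; lower is better."""
    return -2.0 * log_likelihood + num_par * math.log(num_obs)


# (LL, BIC, numPar) as reported in the table above.
MODELS = {
    "G": (-2175, 4566, 26),
    "Original": (-1911, 4271, 54),
    "Item": (-1720, 5554, 254),
}

# Complexity penalty implied by the reported values:
# BIC + 2 * LL = numPar * log(numOb).
penalties = {name: b + 2 * ll for name, (ll, b, _) in MODELS.items()}

# Model with the lowest reported BIC.
best = min(MODELS, key=lambda name: MODELS[name][1])
```

The Item model has the best log-likelihood but by far the largest penalty, which is why the Original model ends up with the lowest BIC.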
32Learning curve contrast in Physics dataset
33Not a smooth learning curve -> this knowledge component model is wrong. Does not capture genuine student difficulties.
34More detailed cognitive model yields smoother learning curve. Better tracks nature of student difficulties & transfer. (Few observations after 10 opportunities yield noisy data.)
35Best BIC (parsimonious fit) for Default
(original) KC model
37Exploratory Data Analysis Exercise
- Goals: 1) Get familiar with data 2) Learn/practice Excel skills
- Tasks: 1) create a step table 2) graph learning curves
38TWO_CIRCLES_IN_SQUARE problem: Initial screen
39TWO_CIRCLES_IN_SQUARE problem: An error a few steps later
40TWO_CIRCLES_IN_SQUARE problem: Student follows hint & completes problem
41Exported File Loaded into Excel
42See handout of exercise. Do some of it in next session.
44DataShop Demo
- Examples of exercise
- KC model editing
45END