Title: Spreadsheet structure inspection using low level access and visualisation
1Spreadsheet structure inspection using low level
access and visualisation
- Daniel Ballinger, Robert Biddle and James Noble
- School of Mathematical and Computing Sciences
- Email db, robert, kjx_at_mcs.vuw.ac.nz
- http//www.mcs.vuw.ac.nz/db/honours.html
2Motivation
- Spreadsheets are a common form of end-user
programming. - Unfamiliar spreadsheets can contain daunting
amounts of information in the layout and
inter-cell dependencies. - Methods for studying these structures are usually
limited to what the application provides. - It is difficult to get a global sense of the
structure of an individual formula that may have
dependencies spread out all over the spreadsheet
table. Users have to track down individual cell
dependencies one by one, tacking back and fourth
all over the spreadsheet. - - Bonnie Nardi, A Small Matter of Programming
(1993)
3Our Proposed Solution
- Working outside the spreadsheet application
allows for greater flexibility in addressing
issues. - The flexibility allows for visualisations to aid
in end-user understanding of spreadsheets beyond
what the application is capable of. - A set of visualisations as interface enhancements
that allow the user to progress from an abstract
levels towards actual details present. - The greater flexibility is a trade-off with
direct interaction. - We focused on Microsoft Excel due to its large
market share.
4Spreadsheet elements requiring extraction
- Artefacts of interest are derived from low-level
structures. - The basic unit of interest is any occupied cell.
- Each occupied cell will have a value, and
optionally a formula. - Formula should be in the same format shown to the
user (not RPN). - Building our own BIFF reader would be a sizeable
project in itself. Hence we use third party
software.
5IBM alphaWorks ExcelAccessor
- A Java Bean to access and modify to contents of
spreadsheets using a Windows DLL. - Requires using Windows native code and Excel to
be installed. Limiting portability. - Excellent ability to extract all details
correctly. - Prone to irregular crashes on larger corpus tasks.
6Andy Khans JExcelAPI
- Pure Java integrates better with toolkit.
- Open source allows for easier expansion and bug
fixes. - Some element types produce problems with
extraction. E.g. Array functions, intersections,
absolute references becoming relative, and
earlier BIFF formats.
7Spreadsheet Application Toolkit
Query
Gobbler
Google
- Find and store spreadsheets from the Internet.
- Extract low level structures. E.g. Cell values
and formulas. - Analyse spreadsheet structures. Either individual
or corpus. - Conveying the findings through visualisation.
URLs
Fetcher
Web Servers
XLS files
Extractor
BIFF Reader
Toolkit Files
Algorithms
Analyser
Processed Data
Metrics
Visualisation Tools
Image
8Aspects of spreadsheet structure and use
- The spreadsheet paradigm has two main
characteristics - The spatial relationships between cells
- The logical relationships created by formula
- These characteristics are not always disjoint.
- User problems are mapped onto a 2D table that
shields them from low-level details of
programming and allows for more natural
expression of many problems.
9Spreadsheet layout Real-estate Utilisation 2D
- Understanding layout is an important first step
in learning about a new spreadsheet. - Actual values and formulas are only shown as
occupied cells. - The visualisation layout mimics that of Excel,
with columns along the top of the x-axis and rows
running down the y-axis. - Cells with a higher occupancy level are coloured
towards the red end of the colour spectrum.
10Spreadsheet layout Real-estate Utilisation 3D
- Occupancy data is projected into 3D to create a
surface map. - Discrete to continuous data transformation helps
smooth the effects of spikes. - Coloured to give a Topographical terrain effect.
- Full benefit is seen with user interaction.
11Formula Inspection Data Flow
- Visualising formula components that are
extracted. - Fully trace worksheets in one view.
- Single Cell, Range, Union, and Intersection.
Basic Referencing Components
12Formula in a summation example
- Visualising formula in a fairly common summation
example - The relative complexity of the bottom-right
formula is clear from the larger circle.
Igarashis Static Global View
13Formula Inspection - Dependency Types
Row Absolute
- Excel allows for combinations of relative and
absolute positioning. - Understanding the referencing type is important
when replicating formula and identifying regular
patterns.
Fully Absolute
Column Absolute
Relative (default)
14Related Work
- Takeo Igarashi
- Spreadsheets augment a visible tabular layout
with invisible formulas. - Created visualisations to help reveal the hidden
dataflow graphs and superficial tabular layouts
of spreadsheets. - Markus Clermont
- Most end-users are not trained programmers.
- Many spreadsheets exist beyond being simple
scratch pads. - Raymond Panko
- Studies of empirical data into spreadsheet
errors. - Found error rates can be disturbingly high.
- Errors attributed to over confidence and lack of
formal checking. - Margaret Burnett
- The importance of scalability in visualisations.
- Forms/3 and an embedded testing methodology.
15Summary and Future Work
- We created a Java toolkit to extract artefacts
from spreadsheets and then convert the basic
information into visualisations. - These visualisations are used to augment the
information provided by Excel in helping users
understand spreadsheets. - Future work will include detailed user studies
and corpus analysis to find larger patterns. - We must also address visualisation scalability
for larger, more complex, spreadsheets.
http//www.mcs.vuw.ac.nz/db/honours.html