Evolution in Open Source Software: A Case Study

1
Evolution in Open Source Software: A Case Study

Michael W. Godfrey
Qiang Tu
paper in ICSM 2000
Software Architecture Group
University of Waterloo

2
What is software evolution?

Evolution is what happens
while you're busy
making other plans.
Usually, we consider evolution to begin once the
first version has been delivered
Maintenance is the planned set of tasks to
effect changes.
Evolution is what actually happens to the
software.

3
Previous research

Lehman's laws
Parnas on software geriatrics
Eick et al. on code decay (10 MLOC telecom)
Gall et al. (10 MLOC telecom)
Munro, Burd et al. (2 MLOC gcc)

4
Lehman's Laws in a nutshell

Observations
(Most) useful software must evolve or die.
As a software system gets bigger, its resulting
complexity tends to limit its ability to grow.
Development progress/effort is (more or less)
constant; growth is at best constant.
Advice
Need to manage complexity.
Do periodic redesigns.
Treat software and its development process as a
feedback system (and not as a passive theorem).

5
Lehman's examples
6
A case study in evolution: The Linux OS kernel
7
A case study in evolution: The Linux OS kernel

It's Linux!
Large system, very stable, many releases over
several years, many developers
Growing mainstream adoption
Open source development model
Interesting phenomenon in itself
Easy to track, can publish results, many experts
Not much previous study

8
Methodology

Examined 96 versions of Linux kernel
34 of the 67 stable releases
62 of the 369 development releases
All measures considered only .c/.h files
contained in the tarball
Counted LOC using wc -l and an awk script that
ignored comments and blank lines
Counted # of fcns/vars/macros using ctags
Architectural model (SSs hierarchy) based on
default directory structure
We plotted growth against calendar time
Lehman suggests plotting growth against release
number
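The LOC counting step can be sketched as follows. This is a minimal Python stand-in for the wc -l plus awk approach named on the slide, not the authors' actual script. It assumes comments are C block comments or // line comments and does not handle comment markers that appear inside string literals; count_loc and sample are hypothetical names.

```python
import re

# Matches /* ... */ block comments and, as an assumption, // line comments.
_COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)

def count_loc(source: str) -> tuple[int, int]:
    """Return (raw_lines, uncommented_loc) for the text of one .c/.h file."""
    raw = len(source.splitlines())            # comparable to wc -l
    stripped = _COMMENT.sub("", source)       # drop comment text
    loc = sum(1 for line in stripped.splitlines() if line.strip())
    return raw, loc

# Made-up sample input, used only to exercise the function.
sample = """/* header
   comment */
#include <stdio.h>

int main(void) {
    return 0; /* done */
}
"""
raw_lines, loc = count_loc(sample)
print(raw_lines, loc)
```

Summing the two counts over every .c and .h file in a release tarball would give one data point per version for the growth plots.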

9
Growth of # of source files
10
Growth of # of global fcns, variables, and macros
11
Growth of Lines of Code (LOC)
12
Average/median .c file size
13
Average/median .h file size
14
Growth of major SSs (dev. releases)
15
SS LOC as percentage of total system
16
SS LOC as percentage of total system (ignoring
drivers)
17
Growth of small core SSs
18
Growth of arch SSs
19
Growth of drivers SSs
20
Observations and hypotheses

Growth along devel. path is super-linear
y = 0.21x^2 + 252x + 90,055 (r^2 = 0.997)
y = size in LOC
x = days since v1.0
r^2 is the coefficient of determination using least
squares
Lehman/Turski's model: y' = y + E/y^2 ≈ (3Ex)^(1/3)
Linux's strong growth is continuing.
This is stronger growth at MLOC level than
observed by others (Lehman, Gall), even for other
OSs.
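The two growth models can be compared numerically. The sketch below evaluates the quadratic fit with the coefficients from the slide (0.21, 252, 90,055) and iterates the inverse-square recurrence y_next = y + E/y^2. Treating that recurrence as dy/dx = E/y^2 and integrating gives y^3 = 3Ex, which is where the cube-root form comes from. The values of E and the starting size are hypothetical, chosen only for illustration.

```python
def quadratic_loc(days: float) -> float:
    """Fitted size in LOC; days = days since v1.0 (coefficients from the slide)."""
    return 0.21 * days**2 + 252 * days + 90_055

def turski_sizes(y0: float, effort: float, steps: int) -> list[float]:
    """Iterate the inverse-square recurrence y_next = y + E / y**2."""
    sizes = [y0]
    for _ in range(steps):
        y = sizes[-1]
        sizes.append(y + effort / y**2)
    return sizes

def turski_closed_form(effort: float, x: float) -> float:
    """Continuous approximation y = (3 * E * x) ** (1/3)."""
    return (3 * effort * x) ** (1 / 3)

E = 1e6  # hypothetical effort constant, not taken from the paper
sizes = turski_sizes(y0=100.0, effort=E, steps=1000)
approx = turski_closed_form(E, 1000)
print(sizes[-1], approx)
print(quadratic_loc(0), quadratic_loc(1000))
```

The quadratic keeps accelerating while the cube-root curve flattens, which is the contrast the slide draws between Linux and the Lehman/Turski prediction.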

21
Why has Linux been able to continue its geometric
growth?

Core code quality is carefully maintained
Architecture/problem domain
It's largely drivers
Much of the code is parallel
It's not as big as you might think
Vanilla configuration used only 15% of files
Development model (OSD) and its sociology
Popularity and visibility have encouraged
outsiders (both hackers and industry) to
contribute

22
Growth of pine (email client)
23
Growth of gcc/g++/egcs
24
Growth of vim (text editor)
25
vim avg comments and blank lines per file
26
vim avg/median file size
27
vim's architecture
28
Hypotheses

Factors affecting evolution include
Size and age of system
Use of traditional sw. eng. principles during
development
PLUS
Problem domain
Problem complexity, multi-platform,
multi-features
Software architecture
Process model
Sociology, market forces, and acts-of-God

29
Software evolution research: What next?

So far, we have examined only growth.
More case studies needed
Qualitative and quantitative
Industrial and open source systems
Different problem domains, architectures
Supporting tools to aid analysing, visualizing,
and querying program evolution
More than just RCS and perl
Support for architecture repair
Codified knowledge: Why and how does software
change?
Build catalogue of change patterns and
evolutionary narratives

30
Codified knowledge

Mature engineering disciplines codify knowledge
and experience.
Arguably, this is lacking in software
engineering.
Software architecture styles [Shaw]
Design patterns [GoF]
Codified knowledge of how and why programs
evolve
Evolutionary narratives [Godfrey]
Long term, coarse granularity
Change patterns
Short term, fine granularity

31
Change patterns and evolutionary narratives

Cathedral style [Raymond]
careful control and management
debugging done before committing code
evolution is slow, planned, rarely undone
Bazaar style (OSD)
lots of low-level changes, frequent fixes
lots of building around rather than wholesale
changing, occasional redesigns
creeping feature-itis, complete dependency
graph

32
Change patterns and evolutionary narratives

Band-aid evolution (just add a layer)
quick and dirty way to add new functionality, esp.
if system is not well understood
e.g., Y2K fixing, adding portability, new
features
Vestigial features
design artifact persists after rationale dies
e.g., whale fin bone structure resembles hand

33
Change patterns and evolutionary narratives

Phenomena observed in Linux evolution
Bandwagon effect
Contributed third party code
Mostly parallel; enables sustained growth
Clone and hack
Careful control of core code; more flexibility on
contributed drivers, experimental features

34
Defining, Transforming, and Exchanging High-Level
Schemas: A guided journey through the outback

Presented by Michael W. Godfrey
Software Architecture Group (SWAG)
Dept of Comp Sci, Univ of Waterloo
This presentation is available from
http://plg.uwaterloo.ca/~migod/papers/

35
What is a High-Level Schema?

My answer
Any schema above the statement level
I see two distinct levels of abstraction
Programming language entity level
Entities are (shared) fcns, vars, types, classes, ...
Architectural level
Entities are modules, subsystems, classes,
interfaces, ...

36
Previous Work

Lots of
motivational work
ad hoc extractor snarfing
experimental translation mechanisms
Examples (many others exist)
CORUM I and II
GRAX
TAXForm (TA eXchange FORMat) using Acacia,
Rigiparse
Rigi using VisualAge C++
Dali using Sniff

37
My (selfish) goals

I would like to be able to use other extractors
Want to perform architectural analyses of systems
written in languages other than C
Want to implement BEAGLE
(a tool for exploring software evolution)
but extractors differ in languages modelled,
level of detail, robustness, bugs, data format, ...
I want to be able to convert data between tools.
Need agreement (awareness) from tool creators

38
TAXForm Utopia
39
Transforming Between Schemas
40
TAXForm Procedural schema
41
TAXForm High level schema
42
Back to my (selfish) goals

Would like to concentrate on procedural and OO
languages.
Others are interested in COBOL, JCL etc.
I am interested in high-level info (f calls g)
but not in ASGs, code-level metrics
Need to agree on
Syntax
Level of granularity and detail
What to do in case of X (e.g., X =
missing files)

43
My schema wish list

influenced by Acacia's C and C++ data models
Top-level programming language entities
functions, variables, constants, type definitions
(procedural languages)
methods, class member data, static methods and
member data (object-oriented languages)
Entity containers
files, modules, classes, packages

44
My schema wish list

Entity attributes
Name, unique identifier (UID -- see next section)
UID of container, UID of containing file (if
container is not a file)
Signature/data type
Line number information (see below)
Declared scope/visibility, static or not, final
or not
Definition or declaration (see below)
Entity container attributes
name, UID
relative path (if a file)
version identifier (if provided)
UID of container (if not a file), UID of cont.
file (if not a file)

45
My schema wish list

Relationships
Function calls, variable uses
Line number information (see below)
Container use/inclusion (by other containers)
Inheritance (various kinds)
Friendship, various template relationships
Relationship attributes
Line number information (see below)
Scope/permission of inheritance
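The wish list on the last three slides can be read as a small data model. The sketch below is one hypothetical rendering of it as Python dataclasses. The field names, defaults, and sample values are assumptions made for illustration; they are not the TAXForm schema or any tool's actual format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Container:
    """A file, module, class, or package that holds entities."""
    uid: str
    name: str
    kind: str
    relative_path: Optional[str] = None   # only meaningful for files
    version: Optional[str] = None         # version identifier, if provided
    container_uid: Optional[str] = None   # enclosing container, if not a file
    file_uid: Optional[str] = None        # containing file, if not a file

@dataclass(frozen=True)
class Entity:
    """A function, variable, constant, type definition, method, and so on."""
    uid: str
    name: str
    kind: str
    container_uid: str
    file_uid: str
    signature: str
    line: int
    visibility: str = "public"
    is_static: bool = False
    is_final: bool = False
    is_definition: bool = True            # False means declaration only

@dataclass(frozen=True)
class Relationship:
    """A call, use, inclusion, inheritance, or similar link between UIDs."""
    kind: str
    source_uid: str
    target_uid: str
    line: Optional[int] = None
    scope: Optional[str] = None           # e.g. permission of inheritance

# Made-up facts: function f (line 10 of entry.c) calls g on line 12.
entry_c = Container(uid="c1", name="entry.c", kind="file", relative_path="entry.c")
f = Entity("e1", "f", "function", "c1", "c1", "void f(void)", line=10)
g = Entity("e2", "g", "function", "c1", "c1", "int g(int)", line=30)
call = Relationship("calls", f.uid, g.uid, line=12)
print(call)
```

Keeping relationships keyed on UIDs rather than names is what makes the UID generation question on the next slide matter: overloaded or file-static names would otherwise collide.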

46
Problems

Some technical problems
UID generation? (name-mangling?)
Line numbering (ranges)?
Incomplete information?
ill-formed code, gcc/K&R-isms
missing header files
resolving entity use to dfn/dcl
(esp. with polymorphism, overloading)
Pre or post preprocessing?

47
Problems

We've had these conversations before
Getting academics to agree on anything is like
herding cats.

48
Example Extractors/Systems

Others
Rigi UVictoria
SPOOL UMontréal
Datrix Bell Canada
MOOSE UBern
SHORE SDM
Neuhold UVienna
VisualAge C++ IBM
many others

Included here
PBS UWloo
Acacia AT&T
cxref, ctags, cscope
TA UOttawa
BAUHAUS UStuttgart
GUPRO UKoblenz

49
Dimensions of Variation

Intended use
Level of schema (entity level,
architectural level, or mixed)
Amount of detail
Languages modelled
Multi-lingual
Common super schemas
Explicit model cross-overs (e.g., JCL,
embedded SQL)
Hidden assumptions
Known limitations
Notation/approach to store factbase
Support for translations and transformations
What's particularly novel and noteworthy

50
PBS [Holt et al. @ UWaterloo]

Portable Bookshelf is a reverse engineering tool
for creating software architecture models of
large systems
Guinea pigs: Mozilla, Linux, Apache, VIM, Mitel,
TOBEY, ...
Consists of fact extractor, fact manipulation
engine (grok), and visualization tool
(landscape)

[Figure: PBS pipeline. source code -> cfx -> entity-level facts -> grok -> architectural facts -> landscape viewer]
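The step from entity-level facts to architectural facts can be illustrated with a small relational lift. The sketch below is plain Python, not grok syntax, and the subsystem and function names are made up. It assumes each entity has exactly one container and drops calls that stay inside a single container.

```python
def lift(calls, contain):
    """Lift (caller, callee) facts to (container, container) dependencies.

    calls:   set of (caller_entity, callee_entity) pairs
    contain: set of (container, entity) pairs
    """
    parent = {entity: container for container, entity in contain}
    lifted = set()
    for caller, callee in calls:
        src, dst = parent.get(caller), parent.get(callee)
        # Keep only calls that cross a container boundary.
        if src is not None and dst is not None and src != dst:
            lifted.add((src, dst))
    return lifted

# Hypothetical facts: two functions in "fs", one in "mm".
contain = {("fs", "open_file"), ("fs", "read_block"), ("mm", "alloc_page")}
calls = {("open_file", "read_block"), ("read_block", "alloc_page")}
deps = lift(calls, contain)
print(deps)
```

Applying the same lift again with a subsystem-contains-file relation would move the facts one more level up the hierarchy.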
51
PBS C Language E/R View
52
PBS Architectural Schema
53
Acacia [Chen, Gansner et al. @ AT&T]

History
CIA → CIAO → Acacia
Consists of
C and C++ extractors
SQL-like query engine
visualization with auto-layout

54
Acacia C/C++ Schemas

Entity attributes
Hex UID, name, kind (file, function, type, var,
macro), filename, datatype (string), typeclass
(enum, struct, etc.), linenum info for def/dec,
def/dec/undef, param list, template info, scope,
storage spec (static, const, inline, inline
virtual, etc.), signature
Relationship attributes
Linenum info, rel. kind (refers, contains,
inherits, instantiates, typedef, etc.),
relationship scope

55
Acacia Queries

SQL-like queries for entities and relationships
produces delimited textual output
ksh cdef -u fu closeTagFile
26f53ececloseTagFilefunctionentry.hvoidregular83083dec00000000(const boolean)extern
76e7ae31closeTagFilefunctionentry.cvoidregular551553563def00000000(const boolean)extern
ksh cref u - - m - file2osdeps.h
<all entity1 attrs> <all entity2 attrs> <rel attrs>

56
ctags, cxref, cscope

These are open source Unix tools that perform
extractions
ctags extracts only entity info
e.g., file, name, line num, kind, etc
works with C, C++, Eiffel, Fortran, and Java.
Used for fast context switching while editing
source code with vim/emacs
cxref generates cross-reference table for C
systems.
Often used for webifying source code (e.g.,
Linux, Mozilla).
cscope used for program comprehension of C
systems (e.g., who calls f, who uses v)
Older commercial Unix tool, recently open sourced.
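As an illustration of the entity info that ctags provides, the sketch below counts tags by kind from the text of a tags file. It assumes the extended tags format, in which each line is name, file, and search address separated by tabs, followed by ;" and a kind field (for C: f for functions, v for variables, d for macros). The sample lines are made up, and count_kinds is a hypothetical helper.

```python
from collections import Counter

def count_kinds(tags_text: str) -> Counter:
    """Count tags per kind letter in the text of an extended-format tags file."""
    counts = Counter()
    for line in tags_text.splitlines():
        if not line or line.startswith("!_TAG_"):
            continue                      # skip blank lines and pseudo-tags
        _head, sep, ext = line.partition(';"\t')
        if not sep:
            continue                      # no extension fields, no kind info
        kind = ext.split("\t")[0]
        if kind.startswith("kind:"):      # long form of the kind field
            kind = kind[len("kind:"):]
        counts[kind] += 1
    return counts

# Made-up tags file content for illustration.
sample_tags = (
    '!_TAG_FILE_SORTED\t1\t/0=unsorted, 1=sorted/\n'
    'BUFSZ\tentry.h\t/^#define BUFSZ 512$/;"\td\n'
    'closeTagFile\tentry.c\t/^extern void closeTagFile (const boolean resize)$/;"\tf\n'
    'tagCount\tentry.c\t/^static int tagCount;$/;"\tv\tfile:\n'
)
kinds = count_kinds(sample_tags)
print(kinds)
```

Counts like these, taken per release, are the kind of data behind a growth plot of functions, variables, and macros.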

57
TA [Lethbridge et al. @ UOttawa]

TKSee aids programming comprehension
i.e., what programmers do all day
TA is the data modelling language
Want full story from the source code
Want pre-preprocessing view of code for all
platforms and environments (text editor's view)
but most extractors use a compiler front end
and preprocess toward a particular target and
option set
Some extractors keep some macro info

58
TA Combined E/R Model
59
BAUHAUS [Koschke et al. @ UStuttgart]

Software architecture recovery system
Parse code, look for hidden/decayed abstractions,
then redesign
Uses various heuristics to perform clustering
Works both at entity level and subsystem level
Built from many tools
including Rigi viewer and a customized C
parser/extractor that (optionally) dumps RSF
Example: WoSEF problem
Cannot derive full includes hierarchy from
Bauhaus extracted facts; this was a design
decision, as the researchers were not interested
in this information
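Includes information of the kind missing here can be approximated with a purely textual scan. The sketch below is a hypothetical regex-based extractor, not part of Bauhaus or cxref. It gives a pre-preprocessing view: it ignores conditional compilation and would also match an #include that sits at the start of a line inside a block comment.

```python
import re

# Matches lines such as:  #include <stdio.h>   or   #  include "osdeps.h"
_INCLUDE = re.compile(
    r'^[ \t]*#[ \t]*include[ \t]*([<"])([^>"]+)[>"]', re.MULTILINE
)

def include_facts(filename: str, source: str) -> list[tuple[str, str, str]]:
    """Return (includer, included, 'system' | 'local') facts for one file."""
    facts = []
    for match in _INCLUDE.finditer(source):
        style = "system" if match.group(1) == "<" else "local"
        facts.append((filename, match.group(2), style))
    return facts

# Made-up source text for illustration.
src = '#include <stdio.h>\n#  include "osdeps.h"\nint x;\n'
facts = include_facts("entry.c", src)
print(facts)
```

Running this over every file and merging the results with an extractor's other facts is one way to fill the gap without changing the extractor itself.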

60
BAUHAUS Entities
61
BAUHAUS Relationships
62
BAUHAUS Combined E/R
63
GUPRO [Ebert, Kullbach, Winter et al. @ UKoblenz]

GUPRO supports simultaneous modelling of
inter-related systems written in different
programming languages
In particular, concerned with the COBOL/MVS/JCL
mainframe world
GUPRO is notable because
Simultaneously multilingual
Explicitly models boundary crossings (!)
Looks at (very real) problems of the mainframe
world
COBOL, JCL, database migration

64
GUPRO

Candidate system is modelled in an object-based
repository using a graph-based approach
EER (modelling language)
GRAL (constraint language)
GReQL mechanism supports structured queries on
the repository via restricted first-order logic

65
GUPRO

COBOL schema

JCL schema

66
GUPRO

Integrated schemas for JCL and COBOL

67
GUPRO Multi-Language Model
68
Summary: High-Level Schemas

Lots of sticky issues at the prog. lang. level
To pre- or not to pre-process
Entity resolution often not done (e.g., Datrix)
What is a function def, dec, polymorphism,
overloading, templates, ...
How to deal with missing libraries, incremental
extractions, versioned extractions,
non-ANSI-isms, ...
Conceptual gaps
COBOL/JCL world very different from C/C++/Java
world
"I didn't know you wanted full includes info"

69
Summary: Good News

Many of us seem to be doing similar kinds of
extractions. It seems likely that
Many extractors can be used within other tools
Some form of common interchange format is
feasible, tho it may not please everyone.
Challenges
May want to use multiple tools together
I have been working on a standalone cxref-based
hack to add full includes information to a
BAUHAUS converter
Can we take advantage of the web to set up some
sort of distributed fact extraction/conversion
factory? [Holt]
