Title: Evolution in Open Source Software: A Case Study
1 Evolution in Open Source Software: A Case Study
- Michael W. Godfrey
- Qiang Tu
- paper in ICSM 2000
- Software Architecture Group
- University of Waterloo
2 What is software evolution?
- Evolution is what happens while you're busy making other plans.
- Usually, we consider evolution to begin once the first version has been delivered
- Maintenance is the planned set of tasks to effect changes.
- Evolution is what actually happens to the software.
3 Previous research
- Lehman's laws
- Parnas on software geriatrics
- Eick et al. on code decay (10 MLOC telecom)
- Gall et al. (10 MLOC telecom)
- Munro, Burd et al. (2 MLOC gcc)
4 Lehman's Laws in a nutshell
- Observations
- (Most) useful software must evolve or die.
- As a software system gets bigger, its resulting complexity tends to limit its ability to grow.
- Development progress/effort is (more or less) constant; growth is at best constant.
- Advice
- Need to manage complexity.
- Do periodic redesigns.
- Treat software and its development process as a feedback system (and not as a passive theorem).
5 Lehman's examples
6 A case study in evolution: The Linux OS kernel
7 A case study in evolution: The Linux OS kernel
- It's Linux!
- Large system, very stable, many releases over several years, many developers
- Growing mainstream adoption
- Open source development model
- Interesting phenomenon in itself
- Easy to track, can publish results, many experts
- Not much previous study
8 Methodology
- Examined 96 versions of Linux kernel
- 34 of the 67 stable releases
- 62 of the 369 development releases
- All measures considered only .c/.h files contained in the tarball
- Counted LOC using wc -l and an awk script that ignored comments and blank lines
- Counted # of fcns/vars/macros using ctags
- Architectural model (SSs hierarchy) based on default directory structure
- We plotted growth against calendar time
- Lehman suggests plotting growth against release number
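The LOC measure above (non-blank, non-comment lines of .c/.h files) can be sketched in a few lines of Python. This is an illustrative stand-in, not the awk script used in the study: the function name `count_loc` and the sample input are hypothetical, and the sketch assumes comment markers never occur inside string literals.

```python
def count_loc(source: str) -> int:
    """Count lines holding something other than whitespace or C comments.

    Simplifying assumption: "/*" and "//" never appear inside string literals.
    """
    loc = 0
    in_block = False  # True while inside a /* ... */ comment
    for line in source.splitlines():
        code = []
        i = 0
        while i < len(line):
            if in_block:
                end = line.find("*/", i)
                if end == -1:
                    i = len(line)      # comment continues on the next line
                else:
                    in_block = False
                    i = end + 2
            elif line.startswith("/*", i):
                in_block = True
                i += 2
            elif line.startswith("//", i):
                break                  # rest of the line is a comment
            else:
                code.append(line[i])
                i += 1
        if "".join(code).strip():
            loc += 1
    return loc


sample = """\
/* header comment */
#include <stdio.h>

int main(void) {   /* entry point */
    /* multi-line
       comment */
    return 0;  // done
}
"""
print(count_loc(sample))
```

Summing `count_loc` over every .c and .h file in a release tarball would give one data point per kernel version.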
9 Growth of # of source files
10 Growth of # of global fcns, variables, and macros
11 Growth of Lines of Code (LOC)
12 Average/median .c file size
13 Average/median .h file size
14 Growth of major SSs (dev. releases)
15 SS LOC as percentage of total system
16 SS LOC as percentage of total system (ignoring drivers)
17 Growth of small core SSs
18 Growth of arch SSs
19 Growth of drivers SSs
20 Observations and hypotheses
- Growth along devel. path is super-linear
- y = .21x^2 + 252x + 90,055, r^2 = .997
- y = size in LOC
- x = days since v1.0
- r^2 is coefficient of determination using least squares
- Lehman/Turski's model: y' = y + E/y^2 ≈ (3Ex)^(1/3)
- Linux's strong growth is continuing.
- This is stronger growth at MLOC level than observed by others (Lehman, Gall), even for other OSs.
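The two growth models on this slide can be compared numerically. The sketch below evaluates the quadratic fit, computes a coefficient of determination, and iterates the inverse-square recurrence next to its cube-root approximation. All function names are hypothetical, the values of E and the starting size are arbitrary illustrations, and x in the Lehman/Turski model is taken to be the release sequence number, which is an assumption.

```python
def linux_loc(days):
    """Quadratic fit quoted on the slide: y = .21x^2 + 252x + 90,055."""
    return 0.21 * days ** 2 + 252 * days + 90_055


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def inverse_square_sizes(y0, effort, releases):
    """Iterate y' = y + E / y^2 for the given number of releases."""
    sizes = [y0]
    for _ in range(releases):
        y = sizes[-1]
        sizes.append(y + effort / y ** 2)
    return sizes


def cube_root_estimate(effort, x):
    """Closed-form approximation (3Ex)^(1/3) of the recurrence."""
    return (3 * effort * x) ** (1 / 3)


# Quadratic model: predicted LOC roughly every year after v1.0.
for days in (0, 365, 730, 1095):
    print(days, round(linux_loc(days)))

# Inverse-square model with illustrative (made-up) parameters.
sizes = inverse_square_sizes(y0=100.0, effort=1e6, releases=1000)
print(sizes[-1], cube_root_estimate(1e6, 1000))
```

The contrast is the point of the slide: the inverse-square model grows like the cube root of x and flattens out, while the quadratic fit keeps accelerating.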
21 Why has Linux been able to continue its geometric growth?
- Core code quality is carefully maintained
- Architecture/problem domain
- It's largely drivers
- Much of the code is parallel
- It's not as big as you might think
- Vanilla configuration used only 15% of files
- Development model (OSD) and its sociology
- Popularity and visibility have encouraged outsiders (both hackers and industry) to contribute
22 Growth of pine (email client)
23 Growth of gcc/g++/egcs
24 Growth of vim (text editor)
25 vim: avg comments and blank lines per file
26 vim: avg/median file size
27 vim's architecture
28 Hypotheses
- Factors affecting evolution include
- Size and age of system
- Use of traditional sw. eng. principles during development
- PLUS
- Problem domain
- Problem complexity, multi-platform, multi-features
- Software architecture
- Process model
- Sociology, market forces, and acts-of-God
29 Software evolution research: What next?
- So far, we have examined only growth.
- More case studies needed
- Qualitative and quantitative
- Industrial and open source systems
- Different problem domains, architectures
- Supporting tools to aid analysing, visualizing, and querying program evolution
- More than just RCS and perl
- Support for architecture repair
- Codified knowledge: Why and how does software change?
- Build catalogue of change patterns and evolutionary narratives
30 Codified knowledge
- Mature engineering disciplines codify knowledge and experience.
- Arguably, this is lacking in software engineering.
- Software architecture styles [Shaw]
- Design patterns [GoF]
- Codified knowledge of how and why programs evolve
- Evolutionary narratives [Godfrey]
- Long term, coarse granularity
- Change patterns
- Short term, fine granularity
31 Change patterns and evolutionary narratives
- Cathedral style [Raymond]
- careful control and management
- debugging done before committing code
- evolution is slow, planned, rarely undone
- Bazaar style (OSD)
- lots of low-level changes, frequent fixes
- lots of building around rather than wholesale changing, occasional redesigns
- creeping feature-itis, complete dependency graph
32 Change patterns and evolutionary narratives
- Band-aid evolution (just add a layer)
- quick and dirty way to add new functionality, esp. if system is not well understood
- e.g., Y2K fixing, adding portability, new features
- Vestigial features
- design artifact persists after rationale dies
- e.g., whale fin bone structure resembles hand
33 Change patterns and evolutionary narratives
- Phenomena observed in Linux evolution
- Bandwagon effect
- Contributed third party code
- "Mostly parallel" enables sustained growth
- Clone and hack
- Careful control of core code; more flexibility on contributed drivers, experimental features
34 Defining, Transforming, and Exchanging High-Level Schemas: A guided journey through the outback
- Presented by Michael W. Godfrey
- Software Architecture Group (SWAG)
- Dept of Comp Sci, Univ of Waterloo
- This presentation is available from
- http://plg.uwaterloo.ca/~migod/papers/
35 What is a High-Level Schema?
- My answer
- Any schema above the statement level
- I see two distinct levels of abstraction
- Programming language entity level
- Entities are (shared) fcns, vars, types, classes, ...
- Architectural level
- Entities are modules, subsystems, classes, interfaces, ...
36 Previous Work
- Lots of
- motivational work
- ad hoc extractor snarfing
- experimental translation mechanisms
- Examples (many others exist)
- CORUM I and II
- GRAX
- TAXForm (TA eXchange FORMat) using Acacia, Rigiparse
- Rigi using VisualAge C++
- Dali using Sniff+
37 My (selfish) goals
- I would like to be able to use other extractors
- Want to perform architectural analyses of systems written in languages other than C
- Want to implement BEAGLE
- (a tool for exploring software evolution)
- but extractors differ in languages modelled, level of detail, robustness, bugs, data format, ...
- I want to be able to convert data between tools.
- Need agreement (awareness) from tool creators
38 TAXForm Utopia
39 Transforming Between Schemas
40 TAXForm: Procedural schema
41 TAXForm: High level schema
42 Back to my (selfish) goals
- Would like to concentrate on procedural and OO languages.
- Others are interested in COBOL, JCL etc.
- I am interested in high-level info (f calls g)
- but not in ASGs, code-level metrics
- Need to agree on
- Syntax
- Level of granularity and detail
- What to do in case of X, e.g., X = missing files
43 My schema wish list
- influenced by Acacia's C and C++ data models
- Top-level programming language entities
- functions, variables, constants, type definitions (procedural languages)
- methods, class member data, static methods and member data (object-oriented languages)
- Entity containers
- files, modules, classes, packages
44 My schema wish list
- Entity attributes
- Name, unique identifier (UID -- see next section)
- UID of container, UID of containing file (if container is not a file)
- Signature/data type
- Line number information (see below)
- Declared scope/visibility, static or not, final or not
- Definition or declaration (see below)
- Entity container attributes
- name, UID
- relative path (if a file)
- version identifier (if provided)
- UID of container (if not a file), UID of containing file (if not a file)
45 My schema wish list
- Relationships
- Function calls, variable uses
- Line number information (see below)
- Container use/inclusion (by other containers)
- Inheritance (various kinds)
- Friendship, various template relationships
- Relationship attributes
- Line number information (see below)
- Scope/permission of inheritance
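One way to read the wish list on the last three slides is as three record types: entity containers, entities, and relationships. The Python sketch below is only an illustration of that reading. Every class name, field name, and sample value is hypothetical and does not come from TAXForm, Acacia, or any other tool mentioned in the talk.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Container:
    uid: str
    name: str
    kind: str                             # "file", "module", "class", "package"
    relative_path: Optional[str] = None   # only if the container is a file
    version: Optional[str] = None         # only if provided
    container_uid: Optional[str] = None   # enclosing container, if not a file
    file_uid: Optional[str] = None        # containing file, if not a file


@dataclass(frozen=True)
class Entity:
    uid: str
    name: str
    kind: str                             # "function", "variable", "method", ...
    container_uid: str
    signature: str                        # signature or data type
    lines: Tuple[int, int]                # (first line, last line)
    file_uid: Optional[str] = None        # set if the container is not a file
    visibility: str = "extern"
    is_static: bool = False
    is_final: bool = False
    is_definition: bool = True            # False for a bare declaration


@dataclass(frozen=True)
class Relationship:
    kind: str                             # "calls", "uses", "includes", "inherits"
    source_uid: str
    target_uid: str
    lines: Tuple[int, int]
    inheritance_scope: Optional[str] = None  # e.g. "public", for "inherits" only


# The "f calls g" example from the talk, with made-up UIDs and line numbers.
main_c = Container(uid="c1", name="main.c", kind="file", relative_path="src/main.c")
f = Entity(uid="e1", name="f", kind="function", container_uid="c1",
           signature="void f(void)", lines=(10, 20))
g = Entity(uid="e2", name="g", kind="function", container_uid="c1",
           signature="int g(int)", lines=(22, 30))
f_calls_g = Relationship(kind="calls", source_uid=f.uid, target_uid=g.uid,
                         lines=(15, 15))
print(f_calls_g)
```

Keeping relationships as UID pairs rather than object references mirrors the flat, tuple-style factbases that the extractors in the talk emit, which makes conversion between tools mostly a matter of renaming fields.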
46 Problems
- Some technical problems
- UID generation? (name-mangling?)
- Line numbering (ranges)?
- Incomplete information?
- ill-formed code, gcc/K&R-isms
- missing header files
- resolving entity use to dfn/dcl
- (esp. with polymorphism, overloading)
- Pre or post preprocessing?
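As one possible answer to the UID question, an identifier can be derived by hashing the attributes that distinguish an entity, in the spirit of name-mangling. The sketch below is an assumption for illustration only: the function `make_uid` and its choice of key fields are hypothetical, and this is not how Acacia or any other extractor named here actually computes its hex UIDs.

```python
import zlib


def make_uid(kind: str, qualified_name: str, signature: str, relative_path: str) -> str:
    """Derive an 8-digit hex UID from the attributes that identify an entity.

    Including the signature keeps overloaded functions apart; including the
    path keeps same-named static functions in different files apart.
    """
    key = "\x1f".join((kind, qualified_name, signature, relative_path))
    return format(zlib.crc32(key.encode("utf-8")), "08x")


# Two overloads of a hypothetical function get different UIDs.
uid_char = make_uid("function", "Buffer::put", "(char)", "src/buffer.cc")
uid_long = make_uid("function", "Buffer::put", "(long)", "src/buffer.cc")
print(uid_char, uid_long)
```

A 32-bit checksum is far too small to rule out collisions on a multi-million-line system; a real tool would want a longer hash or an explicit collision check. The sketch also leaves open the harder question on the slide, namely whether a declaration and its definition should share one UID.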
47 Problems
- We've had these conversations before
- Getting academics to agree on anything is like herding cats.
48 Example Extractors/Systems
- Others
- Rigi [UVictoria]
- SPOOL [UMontréal]
- Datrix [Bell Canada]
- MOOSE [UBern]
- SHORE [SDM]
- Neuhold [UVienna]
- VisualAge C++ [IBM]
- many others
- Included here
- PBS [UWloo]
- Acacia [AT&T]
- cxref, ctags, cscope
- TA++ [UOttawa]
- BAUHAUS [UStuttgart]
- GUPRO [UKoblenz]
49 Dimensions of Variation
- Intended use
- Level of schema (entity level, architectural level, or mixed)
- Amount of detail
- Languages modelled
- Multi-lingual
- Common super schemas
- Explicit model cross-overs (e.g., JCL, embedded SQL)
- Hidden assumptions
- Known limitations
- Notation/approach to store factbase
- Support for translations and transformations
- What's particularly novel and noteworthy
50 PBS [Holt et al. @ UWaterloo]
- Portable Bookshelf is a reverse engineering tool for creating software architecture models of large systems
- Guinea pigs: Mozilla, Linux, Apache, VIM, Mitel, TOBEY, ...
- Consists of fact extractor, fact manipulation engine (grok), and visualization tool (landscape)
[Diagram: source code -> cfx -> entity-level facts -> grok -> architectural facts -> landscape viewer]
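The step from entity-level facts to architectural facts is essentially a "lifting" of relations: a call between two functions becomes a dependency between the subsystems that contain them. The sketch below shows that idea in plain Python. It is not grok syntax or the PBS implementation, and the function name `lift` and all sample facts are hypothetical.

```python
def lift(calls, contains):
    """Lift function-level call facts to subsystem-level dependency facts.

    calls:    set of (caller, callee) pairs at the entity level
    contains: dict mapping each entity name to its enclosing subsystem
    Returns the set of (subsystem, subsystem) pairs, excluding self-edges.
    """
    lifted = set()
    for caller, callee in calls:
        src = contains.get(caller)
        dst = contains.get(callee)
        if src is None or dst is None:
            continue  # entity not assigned to any subsystem (incomplete facts)
        if src != dst:
            lifted.add((src, dst))
    return lifted


entity_facts = {
    ("main", "parse"),
    ("parse", "read_file"),
    ("read_file", "open_log"),
    ("helper", "parse"),
}
containment = {
    "main": "app",
    "parse": "parser",
    "helper": "parser",
    "read_file": "io",
    "open_log": "io",
}
print(sorted(lift(entity_facts, containment)))
```

In the methodology described earlier in the talk, the containment map would come from the default directory structure, so the lifted edges approximate which source directories depend on which.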
51 PBS C Language E/R View
52 PBS Architectural Schema
53 Acacia [Chen, Gansner et al. @ AT&T]
- History
- CIA -> CIAO -> Acacia
- Consists of
- C and C++ extractors
- SQL-like query engine
- visualization with auto-layout
54 Acacia C/C++ Schemas
- Entity attributes
- Hex UID, name, kind (file, function, type, var, macro), filename, datatype (string), typeclass (enum, struct, etc.), linenum info for def/dec, def/dec/undef, param list, template info, scope, storage spec (static, const, inline, inline virtual, etc.), signature
- Relationship attributes
- Linenum info, rel. kind (refers, contains, inherits, instantiates, typedef, etc.), relationship scope
55 Acacia Queries
- SQL-like queries for entities and relationships produce delimited textual output
- ksh cdef -u fu closeTagFile
- 26f53ececloseTagFilefunctionentry.hvoidregular83083dec00000000(const boolean)extern
- 76e7ae31closeTagFilefunctionentry.cvoidregular551553563def00000000(const boolean)extern
- ksh cref u - - m - file2osdeps.h
- <all entity1 attrs> <all entity2 attrs> <rel attrs>
56 ctags, cxref, cscope
- These are open source Unix tools that perform extractions
- ctags extracts only entity info
- e.g., file, name, line num, kind, etc.
- works with C, C++, Eiffel, Fortran, and Java.
- Used for fast context switching while editing source code with vim/emacs
- cxref generates cross-reference table for C systems.
- Often used for webifying source code (e.g., Linux, Mozilla).
- cscope used for program comprehension of C systems (e.g., who calls f, who uses v)
- Older commercial Unix tool, recently open sourced.
57 TA++ [Lethbridge et al. @ UOttawa]
- TKSee aids programming comprehension
- i.e., what programmers do all day
- TA++ is the data modelling language
- Want full story from the source code
- Want pre-preprocessing view of code for all platforms and environments (text editor's view)
- but most extractors use a compiler front end and preprocess toward a particular target and option set
- Some extractors keep some macro info
58 TA++ Combined E/R Model
59 BAUHAUS [Koschke et al. @ UStuttgart]
- Software architecture recovery system
- Parse code, look for hidden/decayed abstractions, then redesign
- Uses various heuristics to perform clustering
- Works both at entity level and subsystem level
- Built from many tools
- including Rigi viewer and a customized C parser/extractor that (optionally) dumps RSF
- Example: WoSEF problem
- Cannot derive full includes hierarchy from Bauhaus extracted facts; this was a design decision, as the researchers were not interested in this information
60 BAUHAUS Entities
61 BAUHAUS Relationships
62 BAUHAUS Combined E/R
63 GUPRO [Ebert, Kullbach, Winter et al. @ UKoblenz]
- GUPRO supports simultaneous modelling of inter-related systems written in different programming languages
- In particular, concerned with the COBOL/MVS/JCL mainframe world
- GUPRO is notable because
- Simultaneously multilingual
- Explicitly models boundary crossings (!)
- Looks at (very real) problems of the mainframe world
- COBOL, JCL, database migration
64 GUPRO
- Candidate system is modelled in an object-based repository using a graph-based approach
- EER (modelling language)
- GRAL (constraint language)
- GReQL mechanism supports structured queries on the repository via restricted first-order logic
65 GUPRO
66 GUPRO
- Integrated schemas for JCL and COBOL
67 GUPRO Multi-Language Model
68 Summary: High-Level Schemas
- Lots of sticky issues at the prog. lang. level
- To pre- or not to pre-process
- Entity resolution often not done (e.g., Datrix)
- What is a function def, dec, polymorphism, overloading, templates, ...
- How to deal with missing libraries, incremental extractions, versioned extractions, non-ANSI-isms, ...
- Conceptual gaps
- COBOL/JCL world very different from C/C++/Java world
- "I didn't know you wanted full includes info"
69 Summary: Good News
- Many of us seem to be doing similar kinds of extractions. It seems like that
- Many extractors can be used within other tools
- Some form of common interchange format is feasible, tho it may not please everyone.
- Challenges
- May want to use multiple tools together
- I have been working on a standalone cxref-based hack to add full includes information to a BAUHAUS converter
- Can we take advantage of the web to set up some sort of distributed fact extraction/conversion factory? [Holt]