Evolution in Open Source Software: A Case Study

1
Evolution in Open Source Software A Case Study
  • Michael W. Godfrey
  • Qiang Tu
  • paper in ICSM 2000
  • Software Architecture Group
  • University of Waterloo

2
What is software evolution?
  • Evolution is what happens
  • while you're busy
  • making other plans.
  • Usually, we consider evolution to begin once the
    first version has been delivered
  • Maintenance is the planned set of tasks to
    effect changes.
  • Evolution is what actually happens to the
    software.

3
Previous research
  • Lehman's laws
  • Parnas on software geriatrics
  • Eick et al. on code decay (10 MLOC telecom)
  • Gall et al. (10 MLOC telecom)
  • Munro, Burd et al. (2 MLOC gcc)

4
Lehman's Laws in a nutshell
  • Observations
  • (Most) useful software must evolve or die.
  • As a software system gets bigger, its resulting
    complexity tends to limit its ability to grow.
  • Development progress/effort is (more or less)
    constant; growth is at best constant.
  • Advice
  • Need to manage complexity.
  • Do periodic redesigns.
  • Treat software and its development process as a
    feedback system (and not as a passive theorem).

5
Lehman's examples
6
A case study in evolution: The Linux OS kernel
7
A case study in evolution: The Linux OS kernel
  • It's Linux!
  • Large system, very stable, many releases over
    several years, many developers
  • Growing mainstream adoption
  • Open source development model
  • Interesting phenomenon in itself
  • Easy to track, can publish results, many experts
  • Not much previous study

8
Methodology
  • Examined 96 versions of Linux kernel
  • 34 of the 67 stable releases
  • 62 of the 369 development releases
  • All measures considered only .c/.h files
    contained in the tarball
  • Counted LOC using wc -l and an awk script that
    ignored comments and blank lines
  • Counted # of fcns/vars/macros using ctags
  • Architectural model (SSs hierarchy) based on
    default directory structure
  • We plotted growth against calendar time
  • Lehman suggests plotting growth against release
    number
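
The counting step above can be sketched in Python. This is not the authors' awk script, only a rough equivalent under stated assumptions: comments are C block or line comments, only .c/.h files are counted, and a file's subsystem is its top-level directory. The function names (strip_comments, count_loc, subsystem_of, loc_by_subsystem) are invented for illustration.

```python
import re

# Matches C comments and string/char literals. Literals are matched so that
# comment markers appearing inside them are left untouched.
_TOKEN = re.compile(
    r'/\*.*?\*/'                # block comment
    r'|//[^\n]*'                # line comment
    r'|"(?:\\.|[^"\\\n])*"'     # string literal
    r"|'(?:\\.|[^'\\\n])*'",    # character literal
    re.DOTALL,
)


def strip_comments(source):
    """Remove comments from C source while preserving line breaks."""
    def replace(match):
        text = match.group(0)
        if text.startswith("/*"):
            # Keep the newlines so code before and after stays on its own line.
            return "\n" * text.count("\n") or " "
        if text.startswith("//"):
            return ""
        return text  # string or character literal: keep unchanged
    return _TOKEN.sub(replace, source)


def count_loc(source):
    """Count lines that are neither blank nor comment-only."""
    return sum(1 for line in strip_comments(source).splitlines() if line.strip())


def subsystem_of(path):
    """Assumption: the subsystem is the first directory of the relative path."""
    parts = path.split("/")
    return parts[0] if len(parts) > 1 else "(top level)"


def loc_by_subsystem(files):
    """files maps relative path -> source text; only .c/.h files are counted."""
    totals = {}
    for path, source in files.items():
        if path.endswith((".c", ".h")):
            ss = subsystem_of(path)
            totals[ss] = totals.get(ss, 0) + count_loc(source)
    return totals
```

Running loc_by_subsystem over the unpacked contents of each release tarball would give one data point per subsystem per release, which can then be plotted against the release date.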

9
Growth of # of source files
10
Growth of # of global fcns, variables, and macros
11
Growth of Lines of Code (LOC)
12
Average/median .c file size
13
Average/median .h file size
14
Growth of major SSs (dev. releases)
15
SS LOC as percentage of total system
16
SS LOC as percentage of total system (ignoring
drivers)
17
Growth of small core SSs
18
Growth of arch SSs
19
Growth of drivers SSs
20
Observations and hypotheses
  • Growth along devel. path is super-linear
  • $y = 0.21x^2 + 252x + 90{,}055$, with $r^2 = 0.997$
  • $y$ = size in LOC
  • $x$ = days since v1.0
  • $r^2$ is the coefficient of determination using least
    squares
  • Lehman/Turski's model: $y' = y + E/y^2 \Rightarrow y \approx (3Ex)^{1/3}$
  • Linux's strong growth is continuing.
  • This is stronger growth at MLOC level than
    observed by others (Lehman, Gall), even for other
    OSs.
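
A least-squares quadratic fit and its coefficient of determination can be sketched as follows. This is a minimal illustration, not the authors' analysis: it solves the normal equations in exact rational arithmetic to sidestep conditioning problems with large day counts, and the function names are invented here. Only the coefficients in slide_model come from the slide; no measured Linux data is included.

```python
from fractions import Fraction


def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c; returns (a, b, c) as floats.

    Needs at least three distinct x values, otherwise the system is singular.
    """
    xs = [Fraction(x) for x in xs]
    ys = [Fraction(y) for y in ys]
    # Power sums s[k] = sum(x^k) and moment sums t[k] = sum(y * x^k).
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented normal equations for the unknowns (a, b, c).
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    n = 3
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution.
    coeffs = [Fraction(0)] * n
    for row in range(n - 1, -1, -1):
        acc = m[row][n] - sum(m[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = acc / m[row][row]
    return tuple(float(v) for v in coeffs)


def r_squared(xs, ys, coeffs):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    a, b, c = coeffs
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a * x * x + b * x + c)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def slide_model(days):
    """The fitted curve reported on the slide: LOC as a function of days since v1.0."""
    return 0.21 * days ** 2 + 252 * days + 90055
```

Given (days since v1.0, LOC) pairs for the development releases, fit_quadratic would return the three coefficients and r_squared would report how much of the variance the curve explains.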

21
Why has Linux been able to continue its geometric
growth?
  • Core code quality is carefully maintained
  • Architecture/problem domain
  • It's largely drivers
  • Much of the code is parallel
  • It's not as big as you might think
  • Vanilla configuration used only 15% of files
  • Development model (OSD) and its sociology
  • Popularity and visibility have encouraged
    outsiders (both hackers and industry) to
    contribute

22
Growth of pine (email client)
23
Growth of gcc/g++/egcs
24
Growth of vim (text editor)
25
vim avg comments and blank lines per file
26
vim avg/median file size
27
vim's architecture
28
Hypotheses
  • Factors affecting evolution include
  • Size and age of system
  • Use of traditional sw. eng. principles during
    development
  • PLUS
  • Problem domain
  • Problem complexity, multi-platform,
    multi-features
  • Software architecture
  • Process model
  • Sociology, market forces, and acts-of-God

29
Software evolution research: What next?
  • So far, we have examined only growth.
  • More case studies needed
  • Qualitative and quantitative
  • Industrial and open source systems
  • Different problem domains, architectures
  • Supporting tools to aid analysing, visualizing,
    and querying program evolution
  • More than just RCS and perl
  • Support for architecture repair
  • Codified knowledge: Why and how does software
    change?
  • Build catalogue of change patterns and
    evolutionary narratives

30
Codified knowledge
  • Mature engineering disciplines codify knowledge
    and experience.
  • Arguably, this is lacking in software
    engineering.
  • Software architecture styles [Shaw]
  • Design patterns [GoF]
  • Codified knowledge of how and why programs
    evolve
  • Evolutionary narratives [Godfrey]
  • Long term, coarse granularity
  • Change patterns
  • Short term, fine granularity

31
Change patterns and evolutionary narratives
  • Cathedral style [Raymond]
  • careful control and management
  • debugging done before committing code
  • evolution is slow, planned, rarely undone
  • Bazaar style (OSD)
  • lots of low-level changes, frequent fixes
  • lots of building around rather than wholesale
    changing, occasional redesigns
  • creeping feature-itis, complete dependency
    graph

32
Change patterns and evolutionary narratives
  • Band-aid evolution (just add a layer)
  • quick and dirty way to add new functionality, esp.
    if system is not well understood
  • e.g., Y2K fixing, adding portability, new
    features
  • Vestigial features
  • design artifact persists after rationale dies
  • e.g., whale fin bone structure resembles hand

33
Change patterns and evolutionary narratives
  • Phenomena observed in Linux evolution
  • Bandwagon effect
  • Contributed third party code
  • Mostly parallel enables sustained growth
  • Clone and hack
  • Careful control of core code; more flexibility on
    contributed drivers, experimental features

34
Defining, Transforming, and Exchanging High-Level
Schemas: A guided journey through the outback
  • Presented by Michael W. Godfrey
  • Software Architecture Group (SWAG)
  • Dept of Comp Sci, Univ of Waterloo
  • This presentation is available from
  • http://plg.uwaterloo.ca/~migod/papers/

35
What is a High-Level Schema?
  • My answer
  • Any schema above the statement level
  • I see two distinct levels of abstraction
  • Programming language entity level
  • Entities are (shared) fcns, vars, types, classes, ...
  • Architectural level
  • Entities are modules, subsystems, classes,
    interfaces, ...

36
Previous Work
  • Lots of
  • motivational work
  • ad hoc extractor snarfing
  • experimental translation mechanisms
  • Examples (many others exist)
  • CORUM I and II
  • GRAX
  • TAXForm (TA eXchange FORMat) using Acacia,
    Rigiparse
  • Rigi using VisualAge C++
  • Dali using Sniff

37
My (selfish) goals
  • I would like to be able to use other extractors
  • Want to perform architectural analyses of systems
    written in languages other than C
  • Want to implement BEAGLE
  • (a tool for exploring software evolution)
  • but extractors differ in languages modelled,
    level of detail, robustness, bugs, data format, ...
  • I want to be able to convert data between tools.
  • Need agreement (awareness) from tool creators

38
TAXForm Utopia
39
Transforming Between Schemas
40
TAXForm Procedural schema
41
TAXForm High level schema
42
Back to my (selfish) goals
  • Would like to concentrate on procedural and OO
    languages.
  • Others are interested in COBOL, JCL etc.
  • I am interested in high-level info (f calls g)
  • but not in ASGs, code-level metrics
  • Need to agree on
  • Syntax
  • Level of granularity and detail
  • What to do in case of X (e.g., X = missing files)

43
My schema wish list
  • influenced by Acacia's C and C++ data models
  • Top-level programming language entities
  • functions, variables, constants, type definitions
    (procedural languages)
  • methods, class member data, static methods and
    member data (object-oriented languages)
  • Entity containers
  • files, modules, classes, packages

44
My schema wish list
  • Entity attributes
  • Name, unique identifier (UID -- see next section)
  • UID of container, UID of containing file (if
    container is not a file)
  • Signature/data type
  • Line number information (see below)
  • Declared scope/visibility, static or not, final
    or not
  • Definition or declaration (see below)
  • Entity container attributes
  • name, UID
  • relative path (if a file)
  • version identifier (if provided)
  • UID of container (if not a file), UID of cont.
    file (if not a file)

45
My schema wish list
  • Relationships
  • Function calls, variable uses
  • Line number information (see below)
  • Container use/inclusion (by other containers)
  • Inheritance (various kinds)
  • Friendship, various template relationships
  • Relationship attributes
  • Line number information (see below)
  • Scope/permission of inheritance
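
The wish list on the last three slides can be read as a small data model. The sketch below is one possible encoding in Python, not an agreed schema: every class name, field name, and kind string is an assumption chosen to mirror the bullets, and callers_of is a hypothetical helper showing the kind of query such facts support.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Container:
    """A file, module, class, or package that holds entities."""
    uid: str
    name: str
    kind: str                             # e.g. "file", "module", "class", "package"
    relative_path: Optional[str] = None   # set only when the container is a file
    version: Optional[str] = None         # version identifier, if provided
    container_uid: Optional[str] = None   # enclosing container, if not a file
    file_uid: Optional[str] = None        # containing file, if not a file


@dataclass
class Entity:
    """A top-level programming language entity (function, variable, method, ...)."""
    uid: str
    name: str
    kind: str
    container_uid: str
    signature: str                        # signature or data type
    lines: Tuple[int, int]                # (first line, last line)
    is_definition: bool                   # True for a definition, False for a declaration
    file_uid: Optional[str] = None        # only needed if the container is not a file
    visibility: Optional[str] = None      # declared scope/visibility
    is_static: bool = False
    is_final: bool = False


@dataclass
class Relationship:
    """A relationship between two entities or containers, identified by UID."""
    kind: str                             # e.g. "calls", "uses", "includes", "inherits"
    source_uid: str
    target_uid: str
    line: Optional[int] = None            # where the relationship occurs
    inheritance_scope: Optional[str] = None


def callers_of(target_uid, relationships):
    """Answer 'who calls f?': UIDs of entities that call target_uid."""
    return sorted({r.source_uid for r in relationships
                   if r.kind == "calls" and r.target_uid == target_uid})
```

Keeping relationships as plain (kind, source, target) records with optional attributes is close in spirit to the tuple-style fact formats discussed later in the talk.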

46
Problems
  • Some technical problems
  • UID generation? (name-mangling?)
  • Line numbering (ranges)?
  • Incomplete information?
  • ill-formed code, gcc/K&R-isms
  • missing header files
  • resolving entity use to dfn/dcl
  • (esp. with polymorphism, overloading)
  • Pre or post preprocessing?

47
Problems
  • We've had these conversations before
  • Getting academics to agree on anything is like
    herding cats.

48
Example Extractors/Systems
  • Others
  • Rigi [UVictoria]
  • SPOOL [UMontréal]
  • Datrix [Bell Canada]
  • MOOSE [UBern]
  • SHORE [SDM]
  • Neuhold [UVienna]
  • VisualAge C++ [IBM]
  • many others
  • Included here
  • PBS [UWloo]
  • Acacia [AT&T]
  • cxref, ctags, cscope
  • TA [UOttawa]
  • BAUHAUS [UStuttgart]
  • GUPRO [UKoblenz]

49
Dimensions of Variation
  • Intended use
  • Level of schema (entity level,
    architectural level, or mixed)
  • Amount of detail
  • Languages modelled
  • Multi-lingual
  • Common super schemas
  • Explicit model cross-overs (e.g., JCL,
    embedded SQL)
  • Hidden assumptions
  • Known limitations
  • Notation/approach to store factbase
  • Support for translations and transformations
  • What's particularly novel and noteworthy

50
PBS [Holt et al. @ UWaterloo]
  • Portable Bookshelf is a reverse engineering tool
    for creating software architecture models of
    large systems
  • Guinea pigs: Mozilla, Linux, Apache, VIM, Mitel,
    TOBEY, ...
  • Consists of fact extractor, fact manipulation
    engine (grok), and visualization tool
    (landscape)

[Figure: PBS pipeline: source code -> cfx -> entity-level facts -> grok -> architectural facts -> landscape viewer]
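
The step from entity-level facts to architectural facts can be sketched as lifting "f calls g" facts to the subsystems that contain f and g. The Python below illustrates the idea only; it is not grok's notation or implementation, and the function name lift and all fact names in the usage note are made up.

```python
def lift(calls, contain):
    """Lift function-level call facts to subsystem-level dependencies.

    calls:   set of (caller, callee) function-name pairs
    contain: dict mapping each function name to its subsystem
    Returns the set of (subsystem, subsystem) pairs joined by at least one
    call that crosses a subsystem boundary.
    """
    lifted = set()
    for caller, callee in calls:
        src = contain.get(caller)
        dst = contain.get(callee)
        # Skip calls whose endpoints are unknown or inside one subsystem.
        if src is not None and dst is not None and src != dst:
            lifted.add((src, dst))
    return lifted
```

For example, with hypothetical facts calls = {("f", "g"), ("f", "h")} and contain = {"f": "A", "g": "A", "h": "B"}, the lifted result is {("A", "B")}: the call from f to g stays inside subsystem A and disappears at the architectural level.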
51
PBS C Language E/R View
52
PBS Architectural Schema
53
Acacia [Chen, Gansner et al. @ AT&T]
  • History
  • CIA → CIAO → Acacia
  • Consists of
  • C and C++ extractors
  • SQL-like query engine
  • visualization with auto-layout

54
Acacia C/C++ Schemas
  • Entity attributes
  • Hex UID, name, kind (file, function, type, var,
    macro), filename, datatype (string), typeclass
    (enum, struct, etc.), linenum info for def/dec,
    def/dec/undef, param list, template info, scope,
    storage spec (static, const, inline, inline
    virtual, etc.), signature
  • Relationship attributes
  • Linenum info, rel. kind (refers, contains,
    inherits, instantiates, typedef, etc.),
    relationship scope

55
Acacia Queries
  • SQL-like queries for entities and relationships
    produce delimited textual output
  • ksh cdef -u fu closeTagFile
  • 26f53ececloseTagFilefunctionentry.hvoidregula
    r83083dec00000000(const boolean)extern
  • 76e7ae31closeTagFilefunctionentry.cvoidregula
    r551553563def00000000(const
    boolean)extern
  • ksh cref u - - m - file2osdeps.h
  • <all entity1 attrs> <all entity2 attrs> <rel attrs>

56
ctags, cxref, cscope
  • These are open source Unix tools that perform
    extractions
  • ctags extracts only entity info
  • e.g., file, name, line num, kind, etc
  • works with C, C++, Eiffel, Fortran, and Java.
  • Used for fast context switching while editing
    source code with vim/emacs
  • cxref generates cross-reference table for C
    systems.
  • Often used for webifying source code (e.g.,
    Linux, Mozilla).
  • cscope used for program comprehension of C
    systems (e.g., who calls f, who uses v)
  • Older commercial Unix tool, recently open sourced.
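
The entity info that ctags produces (name, file, line number, kind) can be read back with a few lines of Python. This sketch assumes the tab-separated tags file layout written by Exuberant ctags with numeric addresses: tag name, file, address followed by `;"`, then optional extension fields. The function name parse_tags is invented here.

```python
def parse_tags(text):
    """Parse ctags output into a list of entity dicts (name, file, line, kind)."""
    entities = []
    for raw in text.splitlines():
        # Skip blank lines and the "!_TAG_..." pseudo-tags in the header.
        if not raw or raw.startswith("!_TAG_"):
            continue
        fields = raw.split("\t")
        if len(fields) < 3:
            continue
        name, path, address = fields[0], fields[1], fields[2]
        if address.endswith(';"'):
            address = address[:-2]
        # Assumption: the address is a line number; a search pattern gives None.
        line = int(address) if address.isdigit() else None
        kind = None
        for ext in fields[3:]:
            key, sep, value = ext.partition(":")
            if not sep:
                kind = ext            # bare kind letter, e.g. "f" for function
            elif key == "kind":
                kind = value
            elif key == "line" and value.isdigit():
                line = int(value)
        entities.append({"name": name, "file": path, "line": line, "kind": kind})
    return entities
```

Because only entities are recorded, relationships such as "who calls f" still need a tool like cxref or cscope, as noted above.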

57
TA [Lethbridge et al. @ UOttawa]
  • TKSee aids program comprehension
  • i.e., what programmers do all day
  • TA is the data modelling language
  • Want full story from the source code
  • Want pre-preprocessing view of code for all
    platforms and environments (text editor's view)
  • but most extractors use a compiler front end
    and preprocess toward a particular target and
    option set
  • Some extractors keep some macro info

58
TA Combined E/R Model
59
BAUHAUS [Koschke et al. @ UStuttgart]
  • Software architecture recovery system
  • Parse code, look for hidden/decayed abstractions,
    then redesign
  • Uses various heuristics to perform clustering
  • Works both at entity level and subsystem level
  • Built from many tools
  • including Rigi viewer and a customized C
    parser/extractor that (optionally) dumps RSF
  • Example WoSEF problem:
  • Cannot derive full includes hierarchy from
    Bauhaus extracted facts; this was a design
    decision, as the researchers were not interested
    in this information

60
BAUHAUS Entities
61
BAUHAUS Relationships
62
BAUHAUS Combined E/R
63
GUPRO [Ebert, Kullbach, Winter et al. @ UKoblenz]
  • GUPRO supports simultaneous modelling of
    inter-related systems written in different
    programming languages
  • In particular, concerned with the COBOL/MVS/JCL
    mainframe world
  • GUPRO is notable because
  • Simultaneously multilingual
  • Explicitly models boundary crossings (!)
  • Looks at (very real) problems of the mainframe
    world
  • COBOL, JCL, database migration

64
GUPRO
  • Candidate system is modelled in an object-based
    repository using a graph-based approach
  • EER (modelling language)
  • GRAL (constraint language)
  • GReQL mechanism supports structured queries on
    the repository via restricted first-order logic

65
GUPRO
  • COBOL schema
  • JCL schema

66
GUPRO
  • Integrated schemas for JCL and COBOL

67
GUPRO Multi-Language Model
68
Summary: High-Level Schemas
  • Lots of sticky issues at the prog. lang. level
  • To pre- or not to pre-process
  • Entity resolution often not done (e.g., Datrix)
  • What is a function def, dec, polymorphism,
    overloading, templates, ...
  • How to deal with missing libraries, incremental
    extractions, versioned extractions,
    non-ANSI-isms, ...
  • Conceptual gaps
  • COBOL/JCL world very different from C/C++/Java
    world
  • I didn't know you wanted full includes info

69
Summary: Good News
  • Many of us seem to be doing similar kinds of
    extractions. It seems likely that:
  • Many extractors can be used within other tools
  • Some form of common interchange format is
    feasible, tho it may not please everyone.
  • Challenges
  • May want to use multiple tools together
  • I have been working on a standalone cxref-based
    hack to add full includes information to a
    BAUHAUS converter
  • Can we take advantage of the web to set up some
    sort of distributed fact extraction/conversion
    factory? [Holt]
