Title: CoxR: Open Source Development History Search System
1. CoxR: Open Source Development History Search System
Makoto Matsushita, Kei Sasaki, and Katsuro Inoue
Osaka University
2. Contents
- Background
- Open-source software development
- Repository analysis system CoxR
- Supporting Dynamic Communication System
- Future research interests
3. Open Source Software Development
- Open and parallel software development
- Anybody can join the project at any time
- Developers live all over the world
[Figure: developers share source code and manuals through CVS, exchange requests and fixes by email (kept in email archives), and receive submitted bug reports and feature-enhancement requests through GNATS.]
4. Reusing repositories
- System repositories hold valuable information, such as the evolution history of products and information on each developer
  - processes applied to products
  - knowledge on requirements and design
- Analyzing and reusing these contents may help reduce the time and effort of software development as a whole
  - reuse the ways bugs were fixed
  - understand a project that one is going to join
  - reuse (a part of) products/components
- However, there are some difficulties in reusing these contents
5. Problem 1: little relationship between systems
- Where can I find what I want?
User: "It seems that the bktr driver has a bug, so I'd like to fix it."
[Figure: the information the user needs is scattered across CVS, GNATS, and the email archive: source code fixes, other files that also need to be changed, a proposed fix for the bktr driver, and discussions on the bktr driver.]
6. Problem 2: interests may vary
- Even for the same problem, a solution used in the past does not suit everyone
  - knowledge and processes vary among developers
  - information needs vary over time
Problem: there is a bug in the bktr driver. Different users want different things:
- "Maybe similar bugs appeared in other drivers, so I'll search for them."
- "I'd like to find the authorities on graphics drivers."
- "I'd like to have a new version of the bktr driver."
7. Objective
- Analyze past processes/histories kept in existing systems, to help developers search, understand, and reuse such processes
  - Model the information in these systems as a development community, using CVS, email, and GNATS
  - Propose an approach for extracting information from the development community
  - Build a prototype of the proposed approach
8. Topics
- Step 1: Modeling information
- Step 2: Information extraction algorithm
- Step 3: System implementation
9. Model elements
- People: developers registered in the CVS, email archive, and GNATS databases
- Knowledge: contents of CVS, email, and GNATS
[Figure: CVS, email archives, and GNATS are combined into an integrated model.]
10. Extracting people/knowledge
- CVS
  - Knowledge: file path, revision, tag, date, source code, comments
  - People: developer, contributor
- E-mail
  - Knowledge: Subject, body, Message-Id, Date
  - People: From, To, Cc
- GNATS
  - Knowledge: file path, PR, date, last modified, fix, audit-trail, status, category, bug class, description (with base and modification information)
  - People: Originator, Responsible
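As a minimal sketch of the E-mail fields listed for this slide, assuming each archived message is available as plain RFC 5322 text, Python's standard `email` package can split a message into People (From, To, Cc) and Knowledge (Subject, body, Message-Id, Date). The function name `extract_email_elements`, the returned dictionary layout, and the sample message are hypothetical and are not taken from CoxR.

```python
from email import message_from_string
from email.utils import getaddresses


def extract_email_elements(raw_message: str) -> dict:
    """Split one e-mail into People and Knowledge fields (illustrative layout)."""
    msg = message_from_string(raw_message)
    # People: every address found in the From, To, and Cc headers.
    headers = msg.get_all("From", []) + msg.get_all("To", []) + msg.get_all("Cc", [])
    people = sorted({addr for _name, addr in getaddresses(headers) if addr})
    # Knowledge: the remaining fields listed for E-mail.
    knowledge = {
        "subject": msg.get("Subject", ""),
        "message_id": msg.get("Message-Id", ""),
        "date": msg.get("Date", ""),
        "body": msg.get_payload(),
    }
    return {"people": people, "knowledge": knowledge}


# Hypothetical sample message, not taken from the FreeBSD archives.
SAMPLE = """\
From: Alice <alice@example.org>
To: hackers@example.org
Cc: Bob <bob@example.org>
Subject: bktr register error
Message-Id: <1@example.org>
Date: Mon, 31 Jan 2005 10:00:00 +0000

The driver reports a register error.
"""

elements = extract_email_elements(SAMPLE)
```

CVS and GNATS entries would need their own parsers; only the e-mail case is shown because the standard library covers it directly.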
11. People/Knowledge network
- We assume that the network has 3 types of edges
  - People-Knowledge
  - People-People
  - Knowledge-Knowledge
[Figure: the development community drawn as this network.]
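One way to picture the three edge types is a small typed graph. This is only a sketch: the `Node` and `Network` classes are illustrative assumptions rather than CoxR's actual data model, and the sample elements are borrowed from the bktr example used later in the slides.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    kind: str  # "people" or "knowledge"
    name: str


class Network:
    """Undirected graph whose edges are typed by the kinds of their endpoints."""

    def __init__(self) -> None:
        self.neighbors = defaultdict(set)

    def add_edge(self, a: Node, b: Node) -> str:
        self.neighbors[a].add(b)
        self.neighbors[b].add(a)
        return self.edge_type(a, b)

    @staticmethod
    def edge_type(a: Node, b: Node) -> str:
        # Reverse alphabetical order puts "people" before "knowledge",
        # so a mixed edge is always named People-Knowledge.
        kinds = sorted((a.kind, b.kind), reverse=True)
        return "-".join(k.capitalize() for k in kinds)


fjoe = Node("people", "fjoe")
core = Node("knowledge", "bktr_core.c")
card = Node("knowledge", "bktr_card.c")

net = Network()
types = [net.add_edge(fjoe, core), net.add_edge(core, card)]
```

Deriving the edge type from the endpoint kinds, instead of storing it, keeps the three types consistent by construction.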
12. Extracting network edges (1/2)
- People-Knowledge edge
  - People and Knowledge elements that appear in the same CVS, email, or GNATS entry
- People-People edge
  - People who appear in the same CVS, email, or GNATS entry
  - People subscribed to the same mailing lists
  - People working on the same directory
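The co-occurrence rules on this slide can be sketched as follows, assuming each CVS, email, or GNATS entry has already been reduced to a record that lists the people and knowledge elements it mentions. The `Record` class, the `people_edges` function, and the sample records are hypothetical. The mailing-list and directory rules are left out for brevity.

```python
from dataclasses import dataclass
from itertools import combinations, product


@dataclass
class Record:
    source: str     # "CVS", "E-mail", or "GNATS"
    people: list
    knowledge: list


def people_edges(records):
    """Return (people_knowledge, people_people) edge sets from co-occurrence."""
    pk, pp = set(), set()
    for rec in records:
        # People-Knowledge: each person paired with each knowledge element of the record.
        pk.update(product(rec.people, rec.knowledge))
        # People-People: each unordered pair of distinct people in the record.
        pp.update(combinations(sorted(set(rec.people)), 2))
    return pk, pp


# Hypothetical records for illustration only.
records = [
    Record("CVS", ["fjoe"], ["bktr_core.c 1.20"]),
    Record("E-mail", ["phk", "roger"], ["Subject: bktr"]),
]
pk, pp = people_edges(records)
```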
13. Extracting network edges (2/2)
- Knowledge-Knowledge edge
  - Directly connected
    - Revision histories of the same file
    - Files in the same directory
    - Modified at the same time
    - Email threads
    - Email/PR IDs
  - Similar knowledge
    - Source code
    - Keywords
    - Base/modification information in GNATS
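Two of the direct connections listed here, files in the same directory and files modified at the same time, can be sketched by grouping revisions. The tuple layout, the function name `knowledge_edges`, and the sample paths and timestamps are assumptions made for illustration. Treating "the same time" as exact timestamp equality is a simplification.

```python
import posixpath
from collections import defaultdict
from itertools import combinations


def knowledge_edges(revisions):
    """revisions: iterable of (file_path, author, timestamp) tuples.

    Returns pairs of files linked because they share a directory or
    were modified at the same timestamp.
    """
    by_dir, by_time = defaultdict(set), defaultdict(set)
    for path, _author, timestamp in revisions:
        by_dir[posixpath.dirname(path)].add(path)
        by_time[timestamp].add(path)
    edges = set()
    for group in list(by_dir.values()) + list(by_time.values()):
        # Sorting makes each unordered pair appear in one canonical order.
        edges.update(combinations(sorted(group), 2))
    return edges


# Hypothetical revisions; paths and times are not from the FreeBSD repository.
revisions = [
    ("dev/bktr/bktr_core.c", "fjoe", "2003-01-01 10:00"),
    ("dev/bktr/bktr_card.c", "fjoe", "2003-01-01 10:00"),
    ("sys/conf/files", "fjoe", "2003-01-01 10:00"),
]
edges = knowledge_edges(revisions)
```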
14. Topics
- Step 1: Modeling information
- Step 2: Information extraction algorithm
  - Finding a small network that matches the user's input
- Step 3: System implementation
15. Topic community
- Topic: a reusable process and its information
- Elements related to a topic can be defined as a sub-network of the development community
- The topic community may vary for each user
[Figure: a topic community inside the development community, containing elements such as experts on this area and patches.]
16. Topic community extraction (1/6)
- Select the initial knowledge elements
  - Assume that a topic is given by a user
  - Extract knowledge matched to the topic
  - Select an initial knowledge element
User: "I found that there is a register error in the bktr driver while watching TV with the fxtv program."
Usable search inputs: code fragments, directory/file names, mailing list names, bug class/description, keywords, date
Keyword: bktr
Search results:
- CVS: bktr_core.c 1.20, Comment: fix register error
- E-mail: Subject: bktr module unloding (2002)
- GNATS: Description: fix bktr option error (2000)
17. Topic community extraction (2/6)
- Select the initial knowledge elements
  - Assume that a topic is given by a user
  - Extract knowledge matched to the topic
  - Select an initial knowledge element
Search results:
- CVS: bktr_core.c 1.20, Comment: fix register error
- E-mail: Subject: bktr module unloding (2002)
- GNATS: Description: fix bktr option error (2000)
User: "It seems that bktr_core.c rev. 1.20 is good." (Select bktr_core.c)
18. Topic community extraction (3/6)
- Show related people/knowledge using the network
- User selects appropriate elements again
User: "I'd like to know the people working on bktr_core.c." (Search related elements)
Search results for bktr_core.c:
- developer: fjoe
- contributor: phk
- contributor: roger
19. Topic community extraction (4/6)
- Show related people/knowledge using the network
- User selects appropriate elements again
Search results for bktr_core.c:
- developer: fjoe
- contributor: phk
- contributor: roger
User: "Hmm, fjoe is the actual developer, so I want to know more about him." (Select fjoe)
20. Topic community extraction (5/6)
- Search and select elements repeatedly
User: "OK, are there any other elements that changed when fjoe changed bktr_core.c?" (Search related elements)
Search results for developer fjoe and bktr_core.c:
- Variables changed in yuv422_pro()
- Changed at the same time: bktr_card.c
21. Topic community extraction (6/6)
- Search and select elements repeatedly
User: "Track the GNATS elements that talk about bktr_card.c." (Search related elements)
Search results:
- GNATS: PR41437 (closed), Description: Problems, bktr_card.c: yuv422_pro()
- Email commenting on the change: PR41437 causes a register error
[Figure: the topic community now contains developer fjoe, bktr_core.c, the variables changed in yuv422_pro(), bktr_card.c (changed at the same time), PR41437, and the email.]
The user finally gets information about the changes to bktr_card.c, which helps to fix the register error.
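The six steps can be read as a keyword search followed by repeated neighbor expansion. The sketch below assumes the development community is available as adjacency sets. The toy network is loosely modeled on the bktr walk-through, the `person:` prefix and all function names are hypothetical, and the interactive user choices are replaced by a fixed list of selections.

```python
def search_initial(network, keyword):
    """Initial step: knowledge elements whose name contains the topic keyword."""
    return sorted(n for n in network if keyword in n and not n.startswith("person:"))


def related(network, element):
    """Later steps: elements directly connected to the selected one."""
    return sorted(network[element])


def topic_community(network, selections):
    """Collect the selected elements plus the edges that run among them."""
    chosen = set(selections)
    edges = {(a, b) for a in chosen for b in network[a] if b in chosen and a < b}
    return chosen, edges


# Toy network (adjacency sets), a stand-in for the development community.
network = {
    "bktr_core.c": {"person:fjoe", "person:phk", "person:roger", "bktr_card.c"},
    "bktr_card.c": {"bktr_core.c", "person:fjoe", "PR41437"},
    "PR41437": {"bktr_card.c"},
    "person:fjoe": {"bktr_core.c", "bktr_card.c"},
    "person:phk": {"bktr_core.c"},
    "person:roger": {"bktr_core.c"},
}

initial = search_initial(network, "bktr")
people = related(network, "bktr_core.c")
chosen, edges = topic_community(
    network, ["bktr_core.c", "person:fjoe", "bktr_card.c", "PR41437"]
)
```

Because the community is built only from what the user selected, two users starting from the same keyword can end up with different sub-networks.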
22. Topics
- Step 1: Modeling information
- Step 2: Information extraction algorithm
- Step 3: System implementation
  - CoxR: a web-based system, using FreeBSD data
23. CoxR implementation
- Using FreeBSD development data, from 1994 to 2004
  - CVS: FreeBSD CVS repository (Total 57822 files, 618186 revisions)
  - E-mail: Committed changes mailing lists (Total 213723)
  - BTS: FreeBSD GNATS PRs (Total 82350)
- System development environment
  - CPU: Pentium4 1.5GHz
  - RAM: 512MB (SDRAM)
  - OS: Debian GNU/Linux
- System size: about 10000 LOC
24. System overview
[Figure: system overview, labeled CoxR-C. The user sends topic words and selections to the web server and receives search results. Behind the web server, system control and information extraction return the matched People/Knowledge together with their Knowledge-Knowledge, People-Knowledge, and People-People relations, drawing on a history DB and a relation DB (each holding Knowledge and People). Relation extraction builds these from CVS, e-mail, and GNATS.]
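The slides do not describe how the relation DB is stored. As a rough sketch, assuming a relational layout, an in-memory SQLite database can hold elements and typed relations and answer the "related elements" query. The table names, column names, and rows below are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE element (id INTEGER PRIMARY KEY, kind TEXT, source TEXT, label TEXT);
CREATE TABLE relation (src INTEGER, dst INTEGER, type TEXT);
""")
# Invented rows, used only to exercise the query.
conn.executemany(
    "INSERT INTO element VALUES (?, ?, ?, ?)",
    [
        (1, "knowledge", "CVS", "bktr_card.c"),
        (2, "people", "CVS", "fjoe"),
        (3, "knowledge", "GNATS", "PR41437"),
    ],
)
conn.executemany(
    "INSERT INTO relation VALUES (?, ?, ?)",
    [(2, 1, "People-Knowledge"), (1, 3, "Knowledge-Knowledge")],
)


def related_elements(conn, element_id):
    """Return (label, relation type) for every element linked to element_id."""
    rows = conn.execute(
        """
        SELECT e.label, r.type
        FROM relation AS r
        JOIN element AS e
          ON e.id = CASE WHEN r.src = ? THEN r.dst ELSE r.src END
        WHERE r.src = ? OR r.dst = ?
        ORDER BY e.label
        """,
        (element_id, element_id, element_id),
    )
    return rows.fetchall()


result = related_elements(conn, 1)
```

Storing each relation once and resolving the "other end" in the query keeps the edges undirected without duplicating rows.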
25. System evaluation
- Purpose
  - Confirm that CoxR provides useful information to developers with appropriate search results
- Process
  - Announce CoxR on the freebsd-hackers and freebsd-current mailing lists, which are mainly for FreeBSD developers
  - Trace users' behavior with the web server's log
- Evaluation period: Jan/31/2005-Feb/21/2005
- Total users: 79 (31 unique users)
26. Initial knowledge selection
- Unfortunately, not all users selected knowledge from the topic search results
  - Maybe they were just trying out the CoxR search, or the search results were not good for them
- 18 out of 31 users selected initial knowledge
- Type of information selected
  - CVS: 12
  - E-mail: 4
  - GNATS: 2
- Number of selections: 4 per topic on average (min 1, max 9)
27. Topic community search
- Users actually searched the topic community
  - 12 out of 18
  - They tended to search for related people and knowledge within the same subsystem
- Average network traversal: 2 times
  - People-People: 1
  - People-Knowledge: 8
  - Knowledge-Knowledge: 13
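The ratios behind these counts can be checked with a few lines of arithmetic. The variable names are illustrative, and reading the three edge counts as parts of one total of traversals is an assumption about how the numbers on the slide were tallied.

```python
unique_users = 31
selected_initial = 18    # users who selected initial knowledge
searched_community = 12  # of those, users who went on to search a topic community
traversals = {"People-People": 1, "People-Knowledge": 8, "Knowledge-Knowledge": 13}

initial_rate = selected_initial / unique_users
community_rate = searched_community / selected_initial
total = sum(traversals.values())
shares = {edge: count / total for edge, count in traversals.items()}

print(f"selected initial knowledge: {initial_rate:.0%}")
print(f"searched topic community:   {community_rate:.0%}")
for edge, share in shares.items():
    print(f"{edge}: {share:.0%}")
```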
28. Discussions
- Initial knowledge selections
  - 56% of search results would lead to valuable information
  - Searching by keyword, then by developer name and/or date, is a typical search pattern
- Topic community selection
  - 67% of the users who found initial knowledge elements successfully found their own topic community
  - They tended to trace the Knowledge-Knowledge and People-Knowledge edges of the development network
29. Conclusion
- CoxR, a search system for open-source software development
  - CVS, email, and GNATS
  - Development network, topic community
  - Evaluation with the help of real developers
- Keywords may have their own information costs
  - Easy to find important keywords
  - Links between similar keywords
- Developer roles
  - Easy to find people by their roles
- Reuse topic communities found by others
  - They can serve as suggestions for finding a topic community
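31. CoxR
[Figure: CoxR architecture. A CoxR user accesses CoxR (Web Server), whose CGI-Main works with the Data Display Record System, the token compare tool, and the lexical analysis tool over the CVS info DB, E-mail info DB, Fusion info DB, and Code DB. The DB create tool, with its CVS info, E-mail info, and Fusion info create tools, builds these databases from the CVS repository and the e-mail archive. CoDS and SPxR also appear as labels.]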
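The lexical analysis tool and token compare tool named in this architecture suggest a token-based comparison of code. The slides do not give the algorithm, so the sketch below swaps in a generic stand-in: a rough regular-expression lexer plus `difflib.SequenceMatcher` from the Python standard library. The code fragments being compared are invented.

```python
import re
from difflib import SequenceMatcher

# Identifiers, integers, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code: str) -> list:
    """Very rough lexer for C-like code (no strings or comments handled)."""
    return TOKEN.findall(code)


def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] of matching tokens between two code fragments."""
    return SequenceMatcher(None, tokenize(a), tokenize(b)).ratio()


# Hypothetical fragments for illustration.
old = "send_packet(buf, len);"
new = "send_packet(buf, padded_len);"
other = "return 0;"

score_close = similarity(old, new)
score_far = similarity(old, other)
```

Comparing token sequences rather than raw text makes the score insensitive to whitespace and layout changes between revisions.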
32. Example case
Sending a password: needs improvement
33. Searching the repositories
Identify similar code
34. Searching similar code
There is evidence of an improvement, but it is hard to understand what was actually changed
35. Searching related information
36. Search by revision histories
37. Search by development time
38. Search by keyword: openssh
Combining search results will make it easy to find what we need
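Combining searches can be viewed as applying several criteria to the same set of revisions. The sketch below assumes revisions are available as simple records. The `Revision` fields, the `search` function, the file paths, and the comments are hypothetical. Only the keyword openssh, the developer name green, and the date 2001/03/20 echo the example in the slides.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Revision:
    path: str
    developer: str
    day: date
    comment: str


def search(revisions, keyword=None, developer=None, start=None, end=None):
    """Keep revisions that satisfy every criterion that was supplied."""
    hits = []
    for rev in revisions:
        if keyword is not None and keyword not in rev.comment and keyword not in rev.path:
            continue
        if developer is not None and rev.developer != developer:
            continue
        if start is not None and rev.day < start:
            continue
        if end is not None and rev.day > end:
            continue
        hits.append(rev)
    return hits


# Invented sample data.
revisions = [
    Revision("crypto/openssh/packet.c", "green", date(2001, 3, 20), "openssh update"),
    Revision("crypto/openssh/sshd.c", "green", date(2001, 5, 4), "openssh update"),
    Revision("sys/dev/bktr/bktr_core.c", "fjoe", date(2003, 1, 1), "fix register error"),
]
hits = search(revisions, keyword="openssh", developer="green",
              start=date(2001, 3, 1), end=date(2001, 3, 31))
```

Each added criterion can only narrow the result, which is why a keyword search followed by developer and time filters converges quickly.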
39. Search similar information
Files committed at the same time (2001/03/20 02:06:40) by the same developer (green)
CoxR finds the actual source code that shows how to hide the password packet length
40. Solutions
Search how to fix
41. Discussions
- Searching similar code shows the actual changes
- Searching related information helps in understanding how to fix the security hole
- It is easy to detect what we need, since any kind of information, including keywords, time, developer name, and code fragments, can be used
- Search results are easy to understand: finding related information easily helps to grasp not only what changed, but also why the change happened
42. Concluding Remarks
- Implemented CoxR, a search system for both CVS revisions and email archives
- Using actual open-source development data, CoxR provides an easy and quick way to search for useful information on software development
- Broader experimentation
- Improvements to the search method (multiple searches at one time)
- Information scoring (define the importance/relation level of each piece of information)