Tutorial "Mining Programs and Processes": Assignments
Arrangements
- Send your solutions electronically (see instructions below)
to Andreas Zeller <zeller@cs.uni-sb.de>
- This e-mail address also can be used for questions.
- Please turn in assignments by July 15, 2008.
Assignment 1: Simplifying Input
In this project, you write a simple tool that simplifies
failure-inducing input. The idea is to use delta
debugging to find a minimal input that causes an XML parser
to fail.
This project requires you to program in Python. Learning Python
is not hard, though; and even if you include the time it takes you
to learn Python, you will still complete your work faster than in
most other languages.
Read this first
The failure
XMLProc is a small XML parser written in Python. It has
a small defect. To exhibit the defect, follow these steps:
- Download the XMLProc
parser. It should unpack into a directory called
xmlproc.
- In the xmlproc dictionary, the file
xpcmd.py is a command-line interface to the XML
parser. To have the parser parse the XML file
demo/nstest1.xml, type
$ cd xmlproc
$ python xpcmd.py demo/nstest1.xml
xmlproc version 0.70
Parsing 'demo/nstest1.xml'
Parse complete, 0 error(s) and 0 warning(s)
$
- The input file demo/urls.xml causes the
parser to fail:
$ python xpcmd.py demo/urls.xml
xmlproc version 0.70
Traceback (most recent call last):
File "xpcmd.py", line 112, in ?
p.parse_resource(sysid) [...]
File "xmlproc/xml/parsers/xmlproc/xmlproc.py", line 401, in parse_charref
if digs==None: return
UnboundLocalError: local variable 'digs' referenced before assignment
$
Your task
Use Delta Debugging to simplify the failure-inducing input.
Proceed in eight steps:
Step 1: Write a testing function.
Start with a Python program that invokes the parser, as described
above, and assesses the result. To avoid complications, you may
wish to conduct this work in the xmlproc dictionary.
To invoke the parser, you have two options:
- Invoke the parser as a separate program. Use Python's os
module
to execute the parser.
You can use fork(), execv(), and
wait() functions, as in Figure 5.11; however, for
this particular task, system() or popen()
are far easier to use. Python's built-in File
Objects are useful in communicating with the parser via
files.
- Invoke the parser directly from Python. Take a look at the
xpcmd.py source to see how this works. The
essential lines are
from xml.parsers.xmlproc import xmlproc
app = xmlproc.Application()
p = xmlproc.XMLProcessor()
p.parse_resource(file)
In order to catch the exception thrown by the parser, be sure
to read and understand exception
handling in Python.
Take care to differentiate three outcomes:
- The parser completely parses the file (PASS), without
errors or warnings;
- The parser fails as in the original failure (FAIL)
- The parser has another outcome, in particular syntax
errors (UNRESOLVED)
Step 2: Write a splitting function.
You can start with the split() function as
described in the book (Figure 5.9).
In the long term, you may want to split the input along token
delimiters, or even use syntactic simplification (Section 5.8.3).
Step 3: Attach Delta Debugging.
To complete Delta Debugging, you need an implementation of the
listminus() function as described in the book (Figure 5.10). The
listsets module gives you exactly this.
Finally, you need the ddmin() function from Figure 5.7. This ddmin module even comes with a self-contained
test function.
Step 4: Choose a representation.
How do you represent the configuration that is to be minimized?
- If you want to simplify input, each delta might be a single
character—thus, the string "defect" becomes the
list ['d', 'e', 'f', 'e', 'c', 't']. However, this will
cause trouble as you cannot distinguish between the same character
at different locations. For instance,
listminus(['d', 'e', 'f', 'e', 'c', 't'], ['e']) = ['d',
'f', 'c', 't']—which is not what you want.
- A better solution is to store characters with their
indices, as in [('d', 1), ('e', 2), ('f', 3), ('e',
4), ('c', 5), ('t', 6)]—this way, each circumstance
can be uniquely identified.
- For a more general solution, you may wish to set up
Circumstance or Delta classes and to store lists
of these objects.
Step 5: Run it.
What is the simplified failure-inducing input in
demo/urls.xml? Record the number of tests that Delta
Debugging takes.
Step 6: Document it.
Be sure to have docstrings
for every function which describe its purpose.
Have a README file that describes how to invoke the simplification
program.
Step 7 (optional): Make it more general.
Convert your testing function into Python's unittest
framework.
Generalize the delta debugging function such that it can be
parameterized with an arbitrary unit test and an
arbitrary input—that is, it becomes independent
from XMLProc and can easily be adapted to other input minimization
tasks. (As examples of such tasks, see the paper Simplifying
and Isolating Failure-Inducing Input.) Eventually, you will
have a general stand-alone framework; to have it
solve the XMLProc problem, instantiate it with the
XMLProc unit test and the failing input.
Step 8 (optional): Improve efficiency.
In the book, Section 5.8.3 sketches how to implement syntactic
simplification. Using a Python XML parser such as Expat,
split the input along the XML structure to narrow down the failure
cause more rapidly. Validate the success by counting the number
of tests.
Assignment 2: Mine a Bug Database
In this project, you write a simple tool that summarizes defect
data as collected from the AspectJ project. The idea is
to identify those components which have had the most defects in
the past — such that quality assurance efforts would be
redirected towards these components.
You do not interact with version and bug databases directly;
instead, you operate on the iBugs
data set which contains all the necessary data in XML format.
Read this first
- Tutorial Slides
- Chapter 4 "Predicting Bugs from History" of the "Software
Evolution" book (PDF)
- The iBugs project (www.ibugs.org) and the papers
referred therein.
- Learn how to process XML data in your favorite programming language.
The data
Download this archive. It contains
- The AspectJ source code (in
aspectj.org
)
- Defect data from iBugs (in
repository.xml
)
The repository.xml
file has an entry for every
single bug (enclosed in <bug>...</bug>
).
Each bug, among other characteristics, lists the file(s) where the
bug was fixed, enclosed in
<fixedFiles>...</fixedFiles>
. Each
<file>
has a name
attribute
listing the file that was fixed.
A property name="severity"
attribute lists the
severity of the bug, from enhancement
over
minor
, normal
, and major
to
critical
.
Task 1: Find the most defect-prone files.
By parsing repository.xml
, create a top 5 list of
the most defect-prone files in AspectJ — that is, those five
files in which the most defects were fixed. Ignore all
enhancement
bugs.
A common observation in software systems is Pareto's Law:
80% of the defects can be found in 20% of the locations. Is this
true for AspectJ, too?
Task 2: Find the most defect-prone directories.
Every file is part of a directory, which represents a higher-level
component of a system. By parsing "repository.xml",
create a top 5 list of the most defect-prone directories in
AspectJ — that is, those directories in which the most
defects were fixed. Again, ignore all
enhancement
bugs.
Note: Be sure to count every bug only once per directory. That
is, a single bug whose fix affects foo/a.c
and
foo/b.c
should be counted only once for
foo/
.
Task 3: Hypothesize.
In your own words, give an assessment of what makes these files or
directories the most defect-prone. Use the supplied source code and bug
reports as additional information.
Please note: As the supplied AspectJ source code is just a
snapshot, it does not contain every file referred in
repository.xml
, which records a multi-year version
history.
Task 4 (optional): Where are the most critical bugs?
Repeat Task 1 and 2, above, but with using only bugs that are of
critical
severity. Do the results differ? Why?