Tutorial "Mining Programs and Processes": Assignments


Assignment 1: Simplifying Input

In this project, you write a simple tool that simplifies failure-inducing input. The idea is to use delta debugging to find a minimal input that causes an XML parser to fail.

This project requires you to program in Python. Learning Python is not hard, though; and even if you include the time it takes you to learn Python, you will still complete your work faster than in most other languages.

Read this first

The failure

XMLProc is a small XML parser written in Python. It has a small defect. To exhibit the defect, follow these steps:
  1. Download the XMLProc parser. It should unpack into a directory called xmlproc.
  2. In the xmlproc dictionary, the file xpcmd.py is a command-line interface to the XML parser. To have the parser parse the XML file demo/nstest1.xml, type
    $ cd xmlproc
    $ python xpcmd.py demo/nstest1.xml
    xmlproc version 0.70
    Parsing 'demo/nstest1.xml'
    Parse complete, 0 error(s) and 0 warning(s)
    $ _
  3. The input file demo/urls.xml causes the parser to fail:
    $ python xpcmd.py demo/urls.xml
    xmlproc version 0.70
    Traceback (most recent call last):
      File "xpcmd.py", line 112, in ?
        p.parse_resource(sysid) [...]
      File "xmlproc/xml/parsers/xmlproc/xmlproc.py", line 401, in parse_charref
        if digs==None: return
    UnboundLocalError: local variable 'digs' referenced before assignment
    $ _

Your task

Use Delta Debugging to simplify the failure-inducing input. Proceed in eight steps:

Step 1: Write a testing function.

Start with a Python program that invokes the parser, as described above, and assesses the result. To avoid complications, you may wish to conduct this work in the xmlproc dictionary.

To invoke the parser, you have two options:

Take care to differentiate three outcomes:

Step 2: Write a splitting function.

You can start with the split() function as described in the book (Figure 5.9).

In the long term, you may want to split the input along token delimiters, or even use syntactic simplification (Section 5.8.3).

Step 3: Attach Delta Debugging.

To complete Delta Debugging, you need an implementation of the listminus() function as described in the book (Figure 5.10). The listsets module gives you exactly this.

Finally, you need the ddmin() function from Figure 5.7. This ddmin module even comes with a self-contained test function.

Step 4: Choose a representation.

How do you represent the configuration that is to be minimized?

Step 5: Run it.

What is the simplified failure-inducing input in demo/urls.xml? Record the number of tests that Delta Debugging takes.

Step 6: Document it.

Be sure to have docstrings for every function which describe its purpose.

Have a README file that describes how to invoke the simplification program.

Step 7 (optional): Make it more general.

Convert your testing function into Python's unittest framework.

Generalize the delta debugging function such that it can be parameterized with an arbitrary unit test and an arbitrary input—that is, it becomes independent from XMLProc and can easily be adapted to other input minimization tasks. (As examples of such tasks, see the paper Simplifying and Isolating Failure-Inducing Input.) Eventually, you will have a general stand-alone framework; to have it solve the XMLProc problem, instantiate it with the XMLProc unit test and the failing input.

Step 8 (optional): Improve efficiency.

In the book, Section 5.8.3 sketches how to implement syntactic simplification. Using a Python XML parser such as Expat, split the input along the XML structure to narrow down the failure cause more rapidly. Validate the success by counting the number of tests.

Assignment 2: Mine a Bug Database

In this project, you write a simple tool that summarizes defect data as collected from the AspectJ project. The idea is to identify those components which have had the most defects in the past — such that quality assurance efforts would be redirected towards these components.

You do not interact with version and bug databases directly; instead, you operate on the iBugs data set which contains all the necessary data in XML format.

Read this first

The data

Download this archive. It contains The repository.xml file has an entry for every single bug (enclosed in <bug>...</bug>).

Each bug, among other characteristics, lists the file(s) where the bug was fixed, enclosed in <fixedFiles>...</fixedFiles>. Each <file> has a name attribute listing the file that was fixed.

A property name="severity" attribute lists the severity of the bug, from enhancement over minor, normal, and major to critical.

Task 1: Find the most defect-prone files.

By parsing repository.xml, create a top 5 list of the most defect-prone files in AspectJ — that is, those five files in which the most defects were fixed. Ignore all enhancement bugs.

A common observation in software systems is Pareto's Law: 80% of the defects can be found in 20% of the locations. Is this true for AspectJ, too?

Task 2: Find the most defect-prone directories.

Every file is part of a directory, which represents a higher-level component of a system. By parsing "repository.xml", create a top 5 list of the most defect-prone directories in AspectJ — that is, those directories in which the most defects were fixed. Again, ignore all enhancement bugs.

Note: Be sure to count every bug only once per directory. That is, a single bug whose fix affects foo/a.c and foo/b.c should be counted only once for foo/.

Task 3: Hypothesize.

In your own words, give an assessment of what makes these files or directories the most defect-prone. Use the supplied source code and bug reports as additional information.

Please note: As the supplied AspectJ source code is just a snapshot, it does not contain every file referred in repository.xml, which records a multi-year version history.

Task 4 (optional): Where are the most critical bugs?

Repeat Task 1 and 2, above, but with using only bugs that are of critical severity. Do the results differ? Why?