Tutorial "Mining Programs and Processes": Assignments
Arrangements
-  Send your solutions electronically (see instructions below)
     to Andreas Zeller <zeller@cs.uni-sb.de>
 -  This e-mail address also can be used for questions.
 -  Please turn in assignments by July 15, 2008.
 
Assignment 1: Simplifying Input
    In this project, you write a simple tool that simplifies
    failure-inducing input.  The idea is to use delta
    debugging to find a minimal input that causes an XML parser
    to fail.
    
    This project requires you to program in Python.  Learning Python
    is not hard, though; and even if you include the time it takes you
    to learn Python, you will still complete your work faster than in
    most other languages.
    
Read this first
    
    The failure
    XMLProc is a small XML parser written in Python.  It has
    a small defect.  To exhibit the defect, follow these steps:
    
	-  Download the XMLProc
	parser.  It should unpack into a directory called
	xmlproc.
	
 - In the xmlproc dictionary, the file
	xpcmd.py is a command-line interface to the XML
	parser.  To have the parser parse the XML file
	demo/nstest1.xml, type
	
$ cd xmlproc
$ python xpcmd.py demo/nstest1.xml
xmlproc version 0.70
Parsing 'demo/nstest1.xml'
Parse complete, 0 error(s) and 0 warning(s)
$ 
	 - The input file demo/urls.xml causes the
	parser to fail:
	
$ python xpcmd.py demo/urls.xml
xmlproc version 0.70
Traceback (most recent call last):
  File "xpcmd.py", line 112, in ?
    p.parse_resource(sysid) [...]
  File "xmlproc/xml/parsers/xmlproc/xmlproc.py", line 401, in parse_charref
    if digs==None: return
UnboundLocalError: local variable 'digs' referenced before assignment
$ 
     
    Your task
    Use Delta Debugging to simplify the failure-inducing input.
    Proceed in eight steps:
    Step 1: Write a testing function.
    Start with a Python program that invokes the parser, as described
    above, and assesses the result.  To avoid complications, you may
    wish to conduct this work in the xmlproc dictionary.
    
    To invoke the parser, you have two options:
    
      -  Invoke the parser as a separate program.  Use Python's os
								     module
	to execute the parser.
	You can use fork(), execv(), and
	wait() functions, as in Figure 5.11; however, for
	this particular task, system() or popen()
	are far easier to use. Python's built-in File
	  Objects are useful in communicating with the parser via
	files.
      
 -  Invoke the parser directly from Python.  Take a look at the
      xpcmd.py source to see how this works.  The
	essential lines are
	
	  from xml.parsers.xmlproc import xmlproc
	
app = xmlproc.Application()
p = xmlproc.XMLProcessor()
p.parse_resource(file)
	In order to catch the exception thrown by the parser, be sure
	to read and understand exception
	handling in Python.
     
    Take care to differentiate three outcomes:
    
      -  The parser completely parses the file (PASS), without
	errors or warnings;
      
 - The parser fails as in the original failure (FAIL)
      
 - The parser has another outcome, in particular syntax
	errors (UNRESOLVED)
    
 
    Step 2: Write a splitting function.
    You can start with the split() function as
    described in the book (Figure 5.9).
    
    In the long term, you may want to split the input along token
    delimiters, or even use syntactic simplification (Section 5.8.3).
    
Step 3: Attach Delta Debugging.
    To complete Delta Debugging, you need an implementation of the
    listminus() function as described in the book (Figure 5.10).  The
    listsets module gives you exactly this.
    
    Finally, you need the ddmin() function from Figure 5.7.  This ddmin module even comes with a self-contained
    test function.
    
    
Step 4: Choose a representation.
    How do you represent the configuration that is to be minimized?
    
      -  If you want to simplify input, each delta might be a single
    character—thus, the string "defect" becomes the
    list ['d', 'e', 'f', 'e', 'c', 't'].  However, this will
    cause trouble as you cannot distinguish between the same character
    at different locations.  For instance,
	listminus(['d', 'e', 'f', 'e', 'c', 't'], ['e']) = ['d',
	'f', 'c', 't']—which is not what you want.
      
 -  A better solution is to store characters with their
      indices, as in [('d', 1), ('e', 2), ('f', 3), ('e',
      4), ('c', 5), ('t', 6)]—this way, each circumstance
      can be uniquely identified.
   
 -  For a more general solution, you may wish to set up
   Circumstance or Delta classes and to store lists
   of these objects.
    
 
    Step 5: Run it.
    What is the simplified failure-inducing input in
    demo/urls.xml?  Record the number of tests that Delta
    Debugging takes.
    Step 6: Document it.
    Be sure to have docstrings
    for every function which describe its purpose.
    
    Have a README file that describes how to invoke the simplification
    program.
    
Step 7 (optional): Make it more general.
    Convert your testing function into Python's unittest
    framework.  
    
    Generalize the delta debugging function such that it can be
    parameterized with an arbitrary unit test and an
    arbitrary input—that is, it becomes independent
    from XMLProc and can easily be adapted to other input minimization
    tasks.  (As examples of such tasks, see the paper Simplifying
      and Isolating Failure-Inducing Input.)  Eventually, you will
      have a general stand-alone framework; to have it
    solve the XMLProc problem, instantiate it with the
    XMLProc unit test and the failing input.
    
Step 8 (optional): Improve efficiency.
    In the book, Section 5.8.3 sketches how to implement syntactic
    simplification.  Using a Python XML parser such as Expat,
    split the input along the XML structure to narrow down the failure
    cause more rapidly.  Validate the success by counting the number
    of tests.
Assignment 2: Mine a Bug Database
    In this project, you write a simple tool that summarizes defect
    data as collected from the AspectJ project.  The idea is
    to identify those components which have had the most defects in
    the past — such that quality assurance efforts would be
    redirected towards these components.
    You do not interact with version and bug databases directly;
    instead, you operate on the iBugs 
    data set which contains all the necessary data in XML format.
    
Read this first
    
      -  Tutorial Slides
      
 - Chapter 4 "Predicting Bugs from History" of the "Software
      Evolution" book (PDF)
      
 - The iBugs project (www.ibugs.org) and the papers
      referred therein.
      
 - Learn how to process XML data in your favorite programming language.
    
 
    The data
    Download this archive.  It contains
    
    -  The AspectJ source code (in 
aspectj.org)
     -  Defect data from iBugs (in 
repository.xml)
     
    The repository.xml file has an entry for every 
    single bug (enclosed in <bug>...</bug>).
    
    Each bug, among other characteristics, lists the file(s) where the
    bug was fixed, enclosed in
    <fixedFiles>...</fixedFiles>.  Each
    <file> has a name attribute
    listing the file that was fixed.  
    
    A property name="severity" attribute lists the
    severity of the bug, from enhancement over
    minor, normal, and major to
    critical.
    
Task 1: Find the most defect-prone files.
    By parsing repository.xml, create a top 5 list of
    the most defect-prone files in AspectJ — that is, those five
    files in which the most defects were fixed.  Ignore all
    enhancement bugs.
    
    A common observation in software systems is Pareto's Law:
    80% of the defects can be found in 20% of the locations.  Is this
    true for AspectJ, too?
    
Task 2: Find the most defect-prone directories.
    Every file is part of a directory, which represents a higher-level
    component of a system.  By parsing "repository.xml",
    create a top 5 list of the most defect-prone directories in
    AspectJ — that is, those directories in which the most
    defects were fixed.  Again, ignore all
    enhancement bugs.
    
    Note: Be sure to count every bug only once per directory.  That
    is, a single bug whose fix affects foo/a.c and
    foo/b.c should be counted only once for
    foo/.
    
Task 3: Hypothesize.
    In your own words, give an assessment of what makes these files or
    directories the most defect-prone.  Use the supplied source code and bug
    reports as additional information.
    
    Please note: As the supplied AspectJ source code is just a
    snapshot, it does not contain every file referred in
    repository.xml, which records a multi-year version
    history.
    
Task 4 (optional): Where are the most critical bugs?
    Repeat Task 1 and 2, above, but with using only bugs that are of
    critical severity.  Do the results differ?  Why?