In this tutorial we discuss the general problem of program understanding and show how relational techniques can help to formalize software facts and to reason about them. The presentation will explain basic concepts and will give examples of the application of relations to solve real program understanding problems. See www.meta-environment.org for further documentation and software.
Creating and maintaining software systems is a multi-disciplinary activity, whereby a number of artifacts – the source code, the design documents, the tests, etc. – need to be created and maintained synchronously. In this study we investigate whether production code and test code co-evolve. We argue that co-evolution is an indicator of the test health of the software development process, as it provides clues as to how easy the future evolution of both production and test code will be. In our study we mine version control systems (VCSs), e.g., CVS, Subversion, ClearCase and Visual SourceSafe. VCSs are designed to support concurrent development of software systems by multiple developers, but they also create visibility into the development process by keeping track of the development history. We use this historical information to investigate how production code units and their corresponding unit tests co-evolve. We introduce three views that allow us to study the past development activities of both production and test code. Our first view, the change history view, focuses on the actual commits that the developers make to the VCS, giving a good view of when co-evolution is (not) taking place. To get a better idea of the impact of changes, which we cannot determine from the change history view, we visualize size-related metrics for both production and test code in the growth history view. Finally, our third view, the test writing efficiency view, looks at the (relative) amount of test code that is needed to obtain a certain level of test coverage. We have applied our approach to three case studies, two of which are open source software systems and one of which is an industrial software project. We noted significant differences in development processes between these projects, as well as differences in their testing processes.
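To make the basic data extraction behind the change history view concrete, the following Java sketch classifies the files touched by one commit into production and test code using a simple naming heuristic; the class name CommitClassifier and the classification rules are illustrative assumptions, not the tooling used in the study.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical helper: splits the files of one commit into test and production code,
    // using the common convention that unit tests live in "test" directories or *Test.java files.
    public class CommitClassifier {

        static boolean isTestFile(String path) {
            String p = path.toLowerCase();
            return p.contains("/test/") || p.endsWith("test.java");
        }

        public static void main(String[] args) {
            List<String> commit = List.of(
                "src/main/java/bank/Account.java",
                "src/test/java/bank/AccountTest.java");

            List<String> production = new ArrayList<>();
            List<String> tests = new ArrayList<>();
            for (String file : commit) {
                (isTestFile(file) ? tests : production).add(file);
            }
            // A commit touching both lists is a (coarse) signal of co-evolution.
            System.out.println("production: " + production + ", tests: " + tests);
        }
    }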
Over the last decades, coding standards, along with the tools to enforce them, have become an increasingly popular means to improve software quality. One important way in which they are assumed to help is by preventing the introduction of faults into the software. However, in spite of the widespread use of coding standards, there is little empirical evidence supporting this intuition. Not only can compliance with a set of rules having little relation to actual faults be considered wasted effort, but it may even result in an increase in faults, as any modification has a non-zero probability of introducing a fault.
Therefore, it is important to gain empirical knowledge on the relation between coding standard violations and faults. This will give us a more viable basis for selecting rules to be enforced in the context of fault reduction. In this talk, we will discuss a number of approaches to extract the relation between a coding standard and observed faults from the version history of a software project. We apply these approaches to a medium-sized industrial project, and discuss results as well as lessons learned.
Software evolution visualization, a young branch of software visualization, aims at facilitating the maintenance phase of large software projects, by revealing how a system came into being. The main question it tries to answer is:
The intended audience of software evolution visualization consists of the management team and software engineers involved in the maintenance phase of large software projects. These professionals usually face software in the late stages of its development process, and need to get an understanding of it, often with no other support than the source code itself. For all these people, software evolution visualization tries to answer a number of questions, following the visual analytics mantra:
These questions range from concrete, specific queries about a certain well-defined aspect or component of a software system, to more vague concerns about the evolution of the system as a whole. Typical questions are:
This presentation gives an overview of the software evolution visualization techniques implemented in the Visual Code Navigator toolset developed at TU/e. The emphasis is on the practical application of the techniques for assessing the evolution of real-world software systems.
For about 4 years now I have been sporadically talking about API migration as a scientific and technical challenge as well as a potential business idea. It all started with some background in grammar engineering, formal semantics, and re-engineering which made me see that APIs are tricky software languages that account for a good part of the intricacies in automated software evolution. I was fired up when I acquired a handle on Microsoft's XML APIs during my recent work on the XML team at Microsoft. At this point I gained confidence in the business value and the eventual feasibility of API migration. While API migration is arguably the top subject on my research agenda, I haven't made substantial progress over the last few years. Meanwhile others have made both expected and unexpected contributions to the subject, which I would like to put into perspective. However, I contend that the hard problem is still to be attacked. In this talk, I am going to reexamine the landscape of API migration, and submit a mid-term call-to-arms to cross API migration's Rubicon. I will also report on some experiments and experiences with API usage analysis, API protocol specification, API-related flow analysis of programs, and transformation technology for API migration.
Data-intensive applications, also called database programs, generally comprise a database (sometimes in the form of a set of files) and a collection of application programs in strong interaction with the former. They constitute critical assets in most enterprises, since they support business activities in all production and management domains. Data-intensive applications form most of the so-called legacy systems: they typically are one or more decades old, they are very large, heterogeneous and highly complex. Many of them "significantly resist modifications and change" [Brodie and Stonebraker, 1995] due to the use of ageing technologies and to inflexible architectures.
Data-intensive systems exhibit an interesting symmetrical property due to the mutual dependency of the database and the programs. When no useful documentation is available, it appears that (1) understanding the database schema is necessary to understand the procedural code and, conversely, (2) understanding what the procedural code is doing on the data considerably helps in understanding the properties of the data. In the first part of the talk, we will particularly focus on property (2). We will demonstrate how both static and dynamic program analysis techniques can be used in the context of database reverse engineering.
A second important, non-trivial problem is the coupled evolution of databases and programs. When the database evolves, the application programs often have to be adapted in order to preserve global consistency. In the second part of the talk, we will address this problem by considering two distinct database evolution scenarios: database migration and database schema refactoring. In both cases, we will describe automated program conversion techniques used to propagate database evolutions to the program level.
Joint work with R. Adnan, B. Graaf, and J. Zonneveld
For large software systems, interface structure has an important impact on their maintainability and build performance. For example, for complex systems written in C, recompilation due to a change in one central header file can run into hours. In this paper, we explore how automated cluster analysis can be used to refactor interfaces, in order to reduce the number of dependencies and to improve encapsulation, thus improving build performance and maintainability. We implemented our approach in a tool called "Interface Regroup Wizard", which we applied to several interfaces of a large industrial embedded system. From this, we not only learned that automated cluster analysis works surprisingly well to improve the design of interfaces, but also which of the refactoring steps are best done manually by an architect.
Software systems are bound to suffer from a phenomenon known as crosscutting concerns. Crosscutting concerns are concerns (pieces of functionality) that cannot be fitted neatly into a system's design. As a result, code scattering and tangling can occur. An example of a commonly occurring crosscutting concern is error (or exception) handling. Typically, error handling code is scattered across the system because each module must deal with exceptional situations. Furthermore, exception handling code tends to be tangled with normal (i.e., non-exceptional) code because normal and exceptional control flow are often correlated.
The problems caused by crosscutting concerns can be observed in legacy software systems in particular, either because legacy systems cannot easily be changed due to their design, or because old programming languages lack the required features (for instance, exception handling is absent in C). Programmers are then forced to use an idiom to implement such a concern at every applicable location in the source code. An idiom is a prescribed coding solution to a commonly recurring problem. Idiomatic implementation then consists of a manual coding step, in which a programmer produces code by applying the idiom to a particular case.
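To illustrate what an idiomatic implementation looks like, here is a small sketch (written in Java for readability, although the systems discussed are C) of a return-code error-handling idiom: every caller must repeat the same check-and-propagate boilerplate, which is exactly the scattering and tangling described above. The function names and error codes are invented.

    // A minimal sketch of a return-code idiom, as used in languages without exceptions.
    public class ReturnCodeIdiom {
        static final int OK = 0;
        static final int ERR_IO = 1;

        // Each operation reports a status code instead of throwing.
        static int readConfig(StringBuilder out) {
            // ... read the configuration; pretend it fails:
            return ERR_IO;
        }

        static int startServer() {
            StringBuilder config = new StringBuilder();
            int status = readConfig(config);        // idiom: call ...
            if (status != OK) {                     // ... check ...
                System.err.println("readConfig failed: " + status);
                return status;                      // ... and propagate
            }
            // normal (non-exceptional) code continues here, tangled with the checks above
            return OK;
        }

        public static void main(String[] args) {
            System.out.println("startServer returned " + startServer());
        }
    }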
The hypothesis of my PhD thesis is that idiomatic implementation of crosscutting concerns is a source of quality problems in legacy systems. Consequently, idiomatic implementations should be replaced by suitable language technology. In my talk I will zoom in on the following research questions:
The SWERL Reconstructor project aims at raising the state of the art in software architecture reconstruction. One of the issues addressed by Reconstructor is interactive visualization, i.e., “How can users be enabled to understand the large amounts of information produced using visual representations?”. To answer this question, we employ techniques from the areas of information visualization, scientific visualization, and computer graphics, and we develop new techniques for visualizing software systems based on the insights gained.
To aid a user in navigating and understanding software execution traces, we propose to visualize such traces using two linked views. One is a structure interaction view that shows part of an execution trace using Hierarchical Edge Bundles. The other is a Massive Sequence View that offers more detailed information regarding the temporal location and interleaving of calls.
We furthermore provide a novel visualization method for the comparison of hierarchically organized data, e.g., different versions of hierarchically organized software systems. Our method depicts how the hierarchies are related by explicitly visualizing the relations between matching subhierarchies. Elements that are unique to each hierarchy are shown, as well as the way in which elements are relocated, split or joined. The relations are again visualized using Hierarchical Edge Bundles.
In this talk we report on a number of maintainability assessments carried out at LaQuSo. The core methodology behind the assessments is visual software analytics. Visual analytics is an emerging branch of computer science studying analytical reasoning facilitated by interactive visual interfaces. Visual software analytics investigates visual analytics approaches relevant for studying software systems and related processes.
Maintainability assessment aims at identifying weak spots in the code for possible improvement, or at providing a system overview that supports business decisions, e.g., on the feasibility of migrating a legacy system to a new platform. Our approach supports maintainability assessment by combining quality analysis techniques, such as metrics calculation or dependency analysis, with interactive visualization of the analysis results.
Combining measurement and visualization is not novel per se. Successful combination of analysis and visualization is, however, challenged by the following issues:
The novelty of our approach, therefore, lies in addressing these challenges by combining state-of-the-art visualization techniques, e.g., the one presented in the previous talk, with state-of-the-art quality assessment methods and tools, based on a generic framework implemented as a plug-in architecture. The choice of a plug-in architecture makes it easy to integrate additional analysis and visualization tools.
Our approach has been implemented in SQuAVisiT, a flexible tool for visual software analytics. SQuAVisiT has been successfully applied in a number of case studies, ranging from hundreds to thousands of KLOC, from homogeneous to heterogeneous systems, and from traditional procedural programming languages to object-oriented ones.
In this talk we start by introducing our approach and explaining how it has been implemented in SQuAVisiT. The lion’s share of the presentation will be dedicated to the experimental evaluation of the proposed approach and tooling by means of a series of industrial case studies.
Joint work with Ruurd Kuiper (TU/e)
In OO, one's thinking often moves between the static description in classes and the dynamics of the execution in objects. Thinking at the class level is quite well supported (by the language facilities), thinking at the object level less so.
Visualization of what-happens-in-the-computer when a program is executed helps to quickly establish a coherent model, facilitating understanding of and communication about designs and programs.
Such visualization involves two separate decisions: what should be visualized, and how should the selected elements be visualized.
We investigate these questions from two perspectives: that of the beginning programmer and that of the developer of relatively large programs, and we present two visualization tools.
In the tool for the novice programmer, we visualize all dynamic aspects of an executing program on the abstraction level of objects, variables, and control locations in the source code. In the tool for developers, the emphasis is on large object spaces, therefore visualized on the higher abstraction level of classes and method calls.
Analysis of legacy code is often focussed on extracting either metrics or relations, e.g. call relations or structure relations. For object-oriented programs, e.g. Java or C++ code, such relations are commonly represented as UML diagrams: for instance, Columbus can extract UML class diagrams from C++ code, while Cpp2XMI can extract UML class, sequence and activity diagrams.
New challenges in UML diagram extraction arise when
We start by observing that, in their simplest form, UML state machines contain nothing but states and transitions connecting states, where transitions are associated with events and guards. At each moment in time the system can be in one and only one of the states. When an event occurs the system should check whether the guard is satisfied and, should this be the case, move to the subsequent state. Observe that implementing state machine behaviour therefore involves a three-phase decision process:
Based on this simple observation, our approach consists in looking for nested-choice patterns, such as an "if" within an "if" or a "switch" within a "switch". As guards can be omitted, we only require the nesting depth to be at least two. Since we do not aim to discover all possible state machines present in the code, validation of the approach consists in applying it in the case study and comparing the state machines detected with the results expected by the domain experts.
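The following minimal Java sketch shows the kind of nested-choice code the analysis looks for (the states, events and guard are invented; the case-study code itself is C): the outer switch selects on the current state, the inner switch on the event, and an optional if implements the guard.

    public class ConveyorControl {
        enum State { IDLE, RUNNING, BLOCKED }
        enum Event { START, STOP, SENSOR_TRIPPED, SENSOR_CLEARED }

        private State state = State.IDLE;

        void handle(Event event, boolean motorHealthy) {
            switch (state) {                      // phase 1: the current state
                case IDLE:
                    switch (event) {              // phase 2: the event
                        case START:
                            if (motorHealthy) {   // phase 3: the (optional) guard
                                state = State.RUNNING;
                            }
                            break;
                        default: break;
                    }
                    break;
                case RUNNING:
                    switch (event) {
                        case STOP:           state = State.IDLE;    break;
                        case SENSOR_TRIPPED: state = State.BLOCKED; break;
                        default:             break;
                    }
                    break;
                case BLOCKED:
                    switch (event) {
                        case SENSOR_CLEARED: state = State.RUNNING; break;
                        case STOP:           state = State.IDLE;    break;
                        default:             break;
                    }
                    break;
            }
        }
    }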
We have chosen to implement the approach based on the Cpp2XMI tool set. Since Cpp2XMI was designed for reverse engineering C++, we first had to adapt the tool for C. Second, we added a number of new filters to detect the nested-choice patterns in the abstract syntax trees. Finally, we had to extend the visualisation component to provide for state machine visualisation. As the case study we consider an approximately ten-year-old system, developed for controlling material handling components, such as conveyor belts, sensors, sorters, etc. Up-to-date documentation is missing and the code is reportedly hard to maintain. While a re-implementation of the system is being considered by the company, understanding the current functionality is still a necessity.
It turned out that the original software engineers had quite consistently used switch-statements within switch-statements to model the state machines. Therefore, already the first version of the implementation, based solely on the "switch within switch" pattern, produced a number of relevant state machines.
At the moment more than forty state machines have been extracted from the code, ranging in size from 4 to 25 states. All the extracted machines were presented to the (software) engineers of the company, who confirmed both their correctness and their importance.
We will present an ongoing effort on extracting UML state machines from legacy non-object-oriented code. We have observed that UML state machines are useful for the developers and maintainers, and that they can be derived automatically even from non-object-oriented code. The approach proved to be very successful in the case study and is, in general, promising. As future work we consider
As software technology evolves, older technologies become outdated. Consequently, enterprise software frequently needs to be renewed, even when the functional requirements have not changed fundamentally. By migration, legacy software can be transformed into modernized software that meets renewed technical demands or can operate in a new technical domain, yet provides similar functionality. However, code resulting from automated migration is typically poorly structured and hard to understand or maintain. In a case study on migrating a business-critical insurance enterprise application from a desktop application model to a web application model, we made reverse engineering techniques more powerful and robust by using partial evaluation to rewrite complex, branched code fragments into simpler imperative code blocks. The use of partial evaluation can break the constraints imposed by the legacy architecture and ensure a target system that is better tailored to the target domain.
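A tiny, invented Java illustration of the kind of rewriting partial evaluation enables (much simpler than the insurance code in the case study): once a configuration value is known to be fixed in the target environment, the branch on it can be folded away, leaving a plain imperative residual block.

    public class PartialEvalDemo {
        static final int DESKTOP = 0;
        static final int WEB = 1;

        // Before: legacy code that branches on a platform flag at run time.
        static int render(int platform, int amount) {
            if (platform == DESKTOP) {
                return amount * 2;   // desktop-specific path
            } else {
                return amount + 10;  // web-specific path
            }
        }

        // After partial evaluation with platform == WEB known statically:
        // the branch disappears and only the web path remains.
        static int renderWeb(int amount) {
            return amount + 10;
        }

        public static void main(String[] args) {
            System.out.println(render(WEB, 5) == renderWeb(5)); // true: behaviour preserved
        }
    }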
Program comprehension is an important activity in software maintenance, as software must be sufficiently understood before it can be properly modified. The study of a program's execution, known as dynamic analysis, has become a common technique in this respect and has received substantial attention by the research community.
This talk presents an introduction into the use of dynamic analysis for program comprehension. We describe how an extensive literature survey that we performed has motivated our research on this topic. We then report on our results thus far, which include two types of visualizations and the various manners in which they are evaluated.
Ajax-based web 2.0 applications rely heavily on stateful asynchronous client/server communication, and client-side run-time manipulation of the DOM tree. This not only makes them fundamentally different from traditional web applications, but also more error-prone and harder to test.
Static analysis techniques have limitations in revealing faults which are due to the complex run-time behavior of modern rich web applications. It is this dynamic run-time interaction that is believed to make testing such applications a challenging task. On the other hand, when applying dynamic analysis on this new domain of web, the main difficulty lies in detecting the various doorways to different dynamic states, providing proper interface mechanisms for input values, and specifying oracles for test output comparison.
We propose a dynamic analysis method for testing Ajax applications automatically, based on a crawler that infers a state-flow graph for all (client-side) user interface states. We identify faults that can occur in such states (related to DOM validity, error messages, discoverability, back-button compatibility, etc.) as well as DOM-tree invariants that can serve as oracles to detect such faults. Our approach is implemented in ATUSA, a tool offering generic invariant checking components, a plugin mechanism to add application-specific state validators, and generation of a test suite covering the paths obtained during crawling.
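As a hedged illustration of what a DOM-tree invariant may look like (a deliberately simple one, not necessarily an invariant shipped with ATUSA), the following Java sketch checks that no two elements in a DOM state share the same id attribute.

    import javax.xml.parsers.DocumentBuilderFactory;
    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.HashSet;
    import java.util.Set;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class DomInvariantCheck {

        // Invariant: every id attribute in the DOM is unique.
        static boolean uniqueIds(Document dom) {
            Set<String> seen = new HashSet<>();
            NodeList elements = dom.getElementsByTagName("*");
            for (int i = 0; i < elements.getLength(); i++) {
                Element e = (Element) elements.item(i);
                String id = e.getAttribute("id");
                if (!id.isEmpty() && !seen.add(id)) {
                    return false; // duplicate id found: invariant violated
                }
            }
            return true;
        }

        public static void main(String[] args) throws Exception {
            String state = "<div><span id='msg'>hi</span><span id='msg'>bye</span></div>";
            Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(state.getBytes(StandardCharsets.UTF_8)));
            System.out.println("unique ids: " + uniqueIds(dom)); // false for this state
        }
    }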
Ad-hoc reuse through copy-and-paste occurs frequently in practice affecting the evolvability of software. This talk summarizes the state of the art in software clone management (detection, tracking, presentation, assessment, removal, change of software clones) and the empirical knowledge we have gained so far. In the course of the summary, the talk identifies further research potential.
Joint work with Rui Abreu and Arjan van Gemund (TUD)
Computer-aided diagnosis techniques help to localize the faults that are the root cause of discrepancies between expected and observed behavior of systems. As such, these techniques are a natural companion to testing efforts, which aim at exposing such discrepancies. In software development, automated diagnosis can reduce the effort spent on manual debugging, which shortens the test-diagnose-repair cycle, and can hence be expected to lead to more reliable systems, and a shorter time-to-market. Outside the software development cycle, automated diagnosis is also used in maintenance, and can serve as the basis for (possibly automated) recovery strategies.
In this session we will give an overview of our research in the area of software fault diagnosis. We start from a high-level view of the diagnosis problem, which allows us to describe and relate two specific approaches: model-based diagnosis (MBD) and spectrum-based fault localization (SFL). The former technique, which originated in the artificial intelligence domain, infers a diagnosis from a compositional, behavioral model combined with real-world observations, and has successfully been applied to digital circuits and complex mechanical systems. The latter technique, SFL, is based on discovering statistical coincidences between failures and the activity of system components, and overcomes the dependence of MBD on suitable behavioral models. As such, SFL applies naturally to software, for which these models are generally not available. We will discuss our recent results on the diagnostic accuracy of SFL as a function of several properties of the underlying tests, and give an account of our experience with applying the technique in the context of embedded software development in an industrial environment. In addition, we discuss our efforts to combine MBD and SFL into a novel, dynamic modeling approach to software fault localization.
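To make the SFL idea concrete, the sketch below computes suspiciousness scores with the Ochiai similarity coefficient, one coefficient commonly used in spectrum-based fault localization; the spectra, test outcomes and component numbering are invented for illustration.

    public class SpectrumFaultLocalization {

        // Ochiai suspiciousness for one component:
        //   n11 = failing tests that involve the component
        //   n10 = passing tests that involve the component
        //   totalFail = total number of failing tests
        static double ochiai(int n11, int n10, int totalFail) {
            double denom = Math.sqrt((double) totalFail * (n11 + n10));
            return denom == 0 ? 0 : n11 / denom;
        }

        public static void main(String[] args) {
            // Rows are test runs, columns are components; true = component was active.
            boolean[][] spectrum = {
                { true,  true,  false },  // test 1
                { true,  false, true  },  // test 2
                { false, true,  true  },  // test 3
            };
            boolean[] failed = { true, true, false }; // outcomes of the three tests

            int totalFail = 0;
            for (boolean f : failed) if (f) totalFail++;

            for (int c = 0; c < spectrum[0].length; c++) {
                int n11 = 0, n10 = 0;
                for (int t = 0; t < spectrum.length; t++) {
                    if (spectrum[t][c]) {
                        if (failed[t]) n11++; else n10++;
                    }
                }
                // Component 0 is active in exactly the failing runs and scores highest.
                System.out.printf("component %d: suspiciousness %.2f%n",
                        c, ochiai(n11, n10, totalFail));
            }
        }
    }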
Correct enforcement of authorization policies is a difficult task, especially for multi-threaded software. Even in carefully reviewed code, unauthorized access may be possible in subtle corner cases. We introduce Transactional Memory Introspection (TMI), a novel reference monitor architecture that builds on Software Transactional Memory, a new and attractive alternative for writing correct multi-threaded software.
TMI facilitates correct security enforcement by simplifying how the reference monitor integrates with software functionality. TMI can ensure complete mediation of security-relevant operations, eliminate race conditions related to security checks, and simplify handling of authorization failures. We present the design and implementation of a TMI-based reference monitor and experiment with its use in enforcing authorization policies on four significant servers. Our experiments confirm the benefits of the TMI architecture and show that it imposes an acceptable runtime overhead.
Joint work with Erik Poll (RU)
This talk is about the use of a formal specification language (JML) to specify security policies for Java mobile phone applications (MIDlets), which can then be verified using an automated program verification tool (ESC/Java2). The Java Modelling Language (JML) allows the programmer to annotate Java source code with formal specifications. The level of detail of these specifications can range from simple annotations specifying whether references can be null or not, to complex functional specifications. The annotated program can then be statically or dynamically checked by a tool to establish program correctness, i.e., whether the Java code satisfies the specification. To check relatively simple specifications one can use an automatic, lightweight tool like ESC/Java2 (http://kindsoftware.com/documents/tutorials/). For more complex specifications, a more elaborate, interactive checker might be necessary, like KeY (http://www.key-project.org) or Jack (http://www-sop.inria.fr/everest/soft/Jack/jack.html). In this talk we will show how JML and ESC/Java2 can be used to establish adherence of a Java mobile phone application (MIDlet) to a security policy given by a navigation graph. A MIDlet navigation graph defines the user's interaction space with the MIDlet: a set of screens for the applet and limitations on the possible transitions between the screens. More importantly, it also defines when sensitive API calls are allowed (for example, sending an SMS on a mobile phone). We will show how to translate such a navigation graph into JML annotations and how to verify the applet code with ESC/Java2. We will also accompany the talk with a simple demo of ESC/Java2, KeY, and an example MIDlet application.
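The following sketch gives a flavour of such annotations (the ghost variable, screen constants and method names are illustrative and do not reflect the actual translation scheme used in the talk): a JML ghost field tracks the current screen, and a precondition on the sensitive API call admits it only from the screen allowed by the navigation graph.

    public class PaymentMidlet {
        public static final int MAIN_SCREEN = 0;
        public static final int CONFIRM_SCREEN = 1;

        // Ghost state: the screen the MIDlet is currently showing, updated at transitions.
        //@ public ghost int currentScreen = MAIN_SCREEN;

        //@ assignable currentScreen;
        //@ ensures currentScreen == CONFIRM_SCREEN;
        public void showConfirmScreen() {
            //@ set currentScreen = CONFIRM_SCREEN;
            // ... display the confirmation form ...
        }

        // The navigation graph only allows sending an SMS from the confirmation screen.
        //@ requires currentScreen == CONFIRM_SCREEN;
        public void sendSms(String number, String text) {
            // ... invoke the sensitive messaging API here ...
        }
    }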
This work is done in the context of the EU IST project Mobius: Mobility, Ubiquity and Security -- Enabling proof-carrying code for Java on mobile devices, http://mobius.inria.fr
The goal of this work is to develop a methodology for improving the quality of industrial software by combining model checking with theorem proving, such that the advantages of both methods are fully exploited. Additionally, we enhance existing techniques, or develop new ones, to support the various translation and conversion steps that are needed in this process. The proposed methodology resembles what in software engineering is called code refactoring. However, the latter replaces existing code by code with the same functionality: it does not fix bugs, it rather improves the understandability of the code by changing its internal structure.
The methodology consists of the following steps.
As a case study, we took an implementation of the classic readers-writers problem. Readers-writers has been extensively studied. This holds to a lesser degree for the reentrant version, where it is allowed to nest locking actions. Such nesting is useful when a library is created with various procedures that each start and end with a lock. Allowing nesting makes it possible for these procedures to call each other.
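The following small Java sketch illustrates why reentrant nesting matters (it uses java.util.concurrent's ReentrantReadWriteLock for illustration, not the Qt implementation discussed below): two library procedures each acquire the read lock, and one calls the other while still holding it.

    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class ReentrantReaders {
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private int balance = 100;
        private int bonus = 5;

        // Library procedure 1: starts and ends with a read lock.
        int getBalance() {
            lock.readLock().lock();
            try {
                return balance;
            } finally {
                lock.readLock().unlock();
            }
        }

        // Library procedure 2: also read-locks, and calls procedure 1 while holding the lock.
        // With a non-reentrant, writer-preferring lock, the inner acquisition can deadlock
        // if a writer arrives in between; a reentrant lock allows this nesting.
        int getBalanceWithBonus() {
            lock.readLock().lock();
            try {
                return getBalance() + bonus;   // nested read-lock acquisition by the same thread
            } finally {
                lock.readLock().unlock();
            }
        }

        public static void main(String[] args) {
            System.out.println(new ReentrantReaders().getBalanceWithBonus()); // 105
        }
    }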
We considered an existing, widely used industrial implementation (from Trolltech's Qt library) of the reentrant readers-writers problem. We modeled it using a model checker, revealing a serious error: a possible deadlock situation. The model was improved and checked satisfactorily for a fixed number of processes. To achieve a correctness result for an arbitrary number of processes, the model was converted to a theorem prover, with which the result was proven.
Test coverage is an important indicator of unit test quality. Tools such as Clover or Emma compute test coverage by first instrumenting the code with logging functionality and then logging which parts are executed during unit test runs. Since the computation of test coverage is a dynamic analysis, it presupposes appropriate hardware and a working installation of the software.
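The following hand-written Java miniature illustrates the instrumentation idea (tools such as Clover and Emma insert such probes automatically, at the source or bytecode level and at much finer granularity): each probe records that its location was reached, and coverage is the fraction of probes hit during the test run.

    import java.util.HashSet;
    import java.util.Set;

    public class CoverageDemo {
        // Set of probe points reached during execution (the "log" the tools write).
        static final Set<String> hit = new HashSet<>();
        static final int PROBES = 3;

        static void probe(String id) { hit.add(id); }

        static int abs(int x) {
            probe("abs:entry");
            if (x < 0) {
                probe("abs:negative-branch");
                return -x;
            }
            probe("abs:positive-branch");
            return x;
        }

        public static void main(String[] args) {
            // A tiny "unit test" that only exercises the positive branch.
            int result = abs(3);
            if (result != 3) throw new AssertionError("test failed");
            System.out.printf("coverage: %d of %d probes (%.0f%%)%n",
                    hit.size(), PROBES, 100.0 * hit.size() / PROBES);
        }
    }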
In the context of software quality assessment by an independent third party, a working installation is often not available. The evaluator may not have access to the required software libraries or hardware platform. The installation procedure may not be automated or documented.
The presentation will guide the audience through the process of answering two questions:
The Meta-Environment is an actively supported and growing collection of open source tools for software analysis and transformation. It has a wide range of applications. By design the Meta-Environment promotes a very rigorous separation of concerns by introducing the following stages:
We will conclude with a general discussion on the quality of the analysis tools created and the effectiveness of using a system that enforces such strict staging.
Drawing on well-established concepts and techniques from the field of type systems, type-based program analysis is about specifying and implementing static program analyses in terms of nonstandard type-assignment systems. Typically, such a nonstandard type-assignment system takes the form of a so-called annotated type system in which the syntax of types is enriched with annotations that describe the properties of interest for a given analysis. In this talk, we illustrate the basics of annotated type systems by considering a small typed functional programming language and extending its type system to a nonstandard, annotated type system for control-flow analysis. Specific topics include subeffecting and the derivation of an inference algorithm.
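For readers new to the notation, the following schematic subeffecting rule (written in generic effect-system style; the exact rules presented in the talk may differ) shows the shape of such annotated judgements: an expression e has type τ with annotation φ collecting the analysis information of interest, and the annotation may always be weakened to a larger one.

    \[
      \frac{\Gamma \vdash e : \tau \mathrel{\&} \varphi \qquad \varphi \subseteq \varphi'}
           {\Gamma \vdash e : \tau \mathrel{\&} \varphi'}
      \quad (\mathit{sub})
    \]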
Inclusion of subeffecting is generally a sine qua non for making annotated type systems applicable to all programs of interest. Naive uses of subeffecting, however, typically introduce the need to analyse complete programs at once and, moreover, subject analyses to a phenomenon that is known as "poisoning" and that renders analyses very imprecise. In this talk, we discuss subtyping and polyvariance: two techniques that can be used to increase modularity and to circumvent poisoning. Applying these techniques results in precise, context-sensitive program analyses that naturally scale to real-world programming languages with support for separate compilation.
Joint work with Marko van Eekelen and Alejandro Tamalet (RU).
The aim of the AHA project is to develop a foundational framework for checking and inference of nonlinear bounds on heap consumption. The results are to be used in the future for practical applications, for instance, in Java program verification.
Amortised heap space analysis was first explored by the German researchers Martin Hofmann and Steffen Jost, for linear heap space bounds on functional programs. We have been extending their results to nonlinear bounds. A necessary step for such an extension is studying output-on-input size dependencies.
We started with a size-annotated type system for a first-order language over lists. We explored checking and inference of strict size dependencies, such as, for instance, the dependency for typical implementations of "append: List_n(A) x List_m(A) --> List_(n+m)(A)" or for the Cartesian product of two lists. We are able to check and infer dependencies for a large subclass of such programs. We have developed an interpolation-based semi-algorithm that terminates on well-typed definitions and outputs annotations for them. These results were published at Typed Lambda Calculi and Applications (TLCA) 2007 and at the Workshop on Functional and (Constraint) Logic Programming (WFLP) 2007. We also designed a demonstrator of the polynomial size analysis. The results on an extension of the analysis to other algebraic data types, such as trees, have been published in Trends in Functional Programming (TFP) 2008.
Extending the results to functions that have lower and upper bounds on output sizes is a topic of our current research. We have been studying non-deterministic rewriting systems for size dependencies and approximations of their solutions by indexed families of polynomials. This approach seems to be applicable to the majority of the Haskell list library. For instance, we can infer and check the type "insert: (A x A --> Bool) x A x List_n(A) --> List_(n+i, 0 <= i <= 1)(A)". Moreover, in many cases we can infer lower and upper polynomial bounds using interpolation. After carefully exploring size dependencies, we believe that using them for an amortisation-aware type system will be a matter of technical routine.
Our future challenge is to use these results for Java applications. For instance, one may use a similar approach for inference and checking of integer loop variants in avionic Java applications, using integer program (and ghost) variables instead of size variables in the size annotations.
Coding guidelines for Java-like programming languages recommend favouring immutable data structures. Reasons for this advice include that immutable data can safely be accessed concurrently without synchronization, and that immutable data can safely be passed to untrusted program components without the danger of being tampered with. The Java language provides limited support for statically checked immutability in the form of "final" fields, which can only be set inside object constructors. However, final fields are insufficient for many applications, and researchers have proposed various more flexible notions of immutability. In this lecture, we will discuss several kinds of immutability as well as type-based techniques for specifying and statically verifying them. We may also discuss JSR 308, a very useful proposal for an extended Java annotation syntax (put forward by the program analysis team at MIT) that facilitates the implementation and use of pluggable type systems for Java.
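A minimal Java sketch of why final fields alone are insufficient: the field reference cannot be reassigned, but the object it points to can still be mutated, both through the getter and through an alias captured at construction time. The Person example is invented.

    import java.util.ArrayList;
    import java.util.List;

    final class Person {
        private final String name;          // final: set once in the constructor
        private final List<String> phones;  // final reference, but the list itself is mutable

        Person(String name, List<String> phones) {
            this.name = name;
            this.phones = phones;
        }

        List<String> phones() { return phones; }
    }

    public class FinalDemo {
        public static void main(String[] args) {
            List<String> numbers = new ArrayList<>();
            numbers.add("123");
            Person p = new Person("Alice", numbers);
            // Despite the final fields, the object's state can still change:
            p.phones().add("456");          // mutation through the returned reference
            numbers.add("789");             // mutation through the captured alias
            System.out.println(p.phones()); // prints [123, 456, 789]
        }
    }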