Stanford Software Seminar

The Stanford Software Seminar is usually held on Mondays, 3-4 pm, in Gates 104. To subscribe to the seminar mailing list, send email to majordomo@lists.stanford.edu with the body:

subscribe software-research [user@host.domain]

Upcoming Talks

Date 2:30-3:30 pm, Tuesday, October 31, 2006
Place Gates 104
Speaker Guy L. Steele Jr., Sun Microsystems Laboratories
Title Parallel Programming and Code Selection in Fortress
Abstract As part of the DARPA program for High Productivity Computing Systems, the Programming Language Research Group at Sun Microsystems Laboratories is developing Fortress, a language intended to support large-scale scientific computation with the same level of portability that the Java programming language provided for multithreaded commercial applications. One of the design principles of Fortress is that parallelism be encouraged everywhere; for example, it is intentionally just a little bit harder to write a sequential loop than a parallel loop. Another is to have rich mechanisms for encapsulation and abstraction; the idea is to have a fairly complicated language for library writers that enables them to write libraries that present a relatively simple set of interfaces to the application programmer. Thus Fortress is as much a framework for language developers as it is a language for coding scientific applications. We will discuss ideas for using a rich parameterized polymorphic type system to organize multithreading and data distribution on large parallel machines. The net result is similar in some ways to data distribution facilities in other languages such as HPF and Chapel, but more open-ended, because in Fortress the facilities are defined by user-replaceable and -extendable libraries rather than wired into the compiler. A sufficiently rich type system can take the place of certain kinds of flow analysis to guide certain kinds of code selection and optimization, again moving policymaking out of the compiler and into libraries coded in the Fortress source language.

Date (time: TBD), Monday, November 6, 2006
Place Gates 104
Speaker Ranjit Jhala, University of California, San Diego
Title TBD
Abstract TBD

Date (time: TBD), Thursday, November 30, 2006
Place Gates 104
Speaker Rupak Majumdar, University of California, Los Angeles
Title TBD
Abstract TBD

Date (time: TBD), Monday, December 4, 2006
Place Gates 104
Speaker Koushik Sen, University of California, Berkeley
Title TBD
Abstract TBD

Previous Talks

Date 3:00-4:00 pm, Monday, October 16, 2006
Place Gates 104
Speaker Zhendong Su, University of California, Davis
Title Scalable and Accurate Tree-based Detection of Code Clones
Abstract Detecting code clones has many software engineering applications. Existing approaches either do not scale to large code bases or are not robust against code modifications. In this talk, I will present an efficient algorithm for identifying similar subtrees. The algorithm is based on a novel characterization of subtrees with numerical vectors in the Euclidean space R^n and an efficient algorithm to cluster these vectors with respect to the Euclidean distance metric. Subtrees with vectors in one cluster are considered similar. We have implemented our tree similarity algorithm as a clone detection tool called Deckard and evaluated it on large code bases written in C and Java including the Linux kernel and JDK. Our experiments show that Deckard is both scalable and accurate. It is also language independent, applicable to any language with a formally specified grammar.

Joint work with Lingxiao Jiang, Ghassan Misherghi, and Stephane Glondu.
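The abstract's idea of characterizing subtrees as numerical vectors and clustering by Euclidean distance can be illustrated with a toy Python sketch. This is not Deckard itself: the node kinds, threshold, and pairwise comparison below are invented simplifications (the real tool uses the grammar's full node set and efficient vector clustering rather than all-pairs comparison).

```python
import ast
import math
from collections import Counter

# A handful of AST node kinds to count; a real tool would use every
# node type in the language's grammar.
KINDS = ["Name", "Constant", "BinOp", "Call", "Assign", "Return", "If", "For"]

def char_vector(node):
    """Characteristic vector: occurrence counts of selected node kinds."""
    counts = Counter(type(n).__name__ for n in ast.walk(node))
    return [counts.get(k, 0) for k in KINDS]

def distance(u, v):
    """Euclidean distance between two characteristic vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def clone_pairs(functions, threshold=1.5):
    """Pairs of functions whose vectors lie within the distance threshold."""
    vecs = [(f.name, char_vector(f)) for f in functions]
    return [(a, b) for i, (a, u) in enumerate(vecs)
                   for (b, v) in vecs[i + 1:]
                   if distance(u, v) <= threshold]

src = """
def sum_list(xs):
    total = 0
    for x in xs:
        total = total + x
    return total

def sum_items(items):
    acc = 0
    for it in items:
        acc = acc + it
    return acc

def greet(name):
    return "hello " + name
"""
funcs = [n for n in ast.parse(src).body if isinstance(n, ast.FunctionDef)]
print(clone_pairs(funcs))  # → [('sum_list', 'sum_items')]
```

The two summation loops differ only in identifier names, so their vectors coincide and they cluster together, while `greet` stays far away: this is the sense in which the characterization is robust against code modifications.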

Date 3:00-4:00 pm, Thursday, June 8, 2006
Place Gates 104
Speaker Bowen Alpern, IBM T. J. Watson Research Center
Title Early Ruminations on Anticipating Demand for Software Delivered as a Stream
Abstract This is not, at least not primarily, a talk about the economic viability of a particular model of software distribution. Rather, it reports on some preliminary investigations into whether predictive caching techniques might be able to mitigate the first-time performance penalty incurred when software executes as it is being delivered. The context for this work is the PDS (Progressive Deployment System) project at IBM Research, which uses application virtualization to deliver an application and its non-operating-system dependencies as a stream in order to avoid dependency problems that can occur when multiple applications are installed on the same machine. This is joint work with the PDS group.

Date 2:00-3:00 pm, Thursday, May 18, 2006
Place Gates 104
Speaker Danny Dig
Title Automated Detection of Refactorings in Evolving Components
Abstract One of the costs of reusing software components is upgrading applications to use the new version of the components. Upgrading an application can be error-prone, tedious, and disruptive of the development process. An important kind of change in OO components is a refactoring. Refactorings are program transformations that improve the internal design without changing the observable behavior (e.g., renamings, moving methods between classes, splitting/merging of classes). Our previous study showed that more than 80% of the disruptive changes in five different components were caused by refactorings. If the refactorings that happened between two versions of a component could be automatically detected, a refactoring tool could replay them on applications.

I will present an algorithm that detects refactorings performed during component evolution. Our algorithm uses a combination of a fast syntactic analysis to detect refactoring candidates and a more expensive semantic analysis to refine the results. The experiments on components ranging from 17 KLOC to 352 KLOC show that our algorithm detects refactorings in real-world components with accuracy over 85%.

Joint work with Can Comertoglu, Darko Marinov, and Ralph Johnson (UIUC).
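The two-stage structure the abstract describes can be sketched with a toy Python example of detecting a method rename between two component versions. The token-overlap similarity and threshold here are invented; they stand in for the paper's fast syntactic candidate generation, and the more expensive semantic refinement (e.g., checking how references change) is omitted.

```python
from collections import Counter

def tokens(body):
    """A cheap syntactic fingerprint: the multiset of body tokens."""
    return Counter(body.split())

def similarity(a, b):
    """Multiset-overlap ratio of two method bodies, in [0, 1]."""
    inter = sum((tokens(a) & tokens(b)).values())
    total = sum((tokens(a) | tokens(b)).values())
    return inter / total if total else 0.0

def detect_renames(old, new, threshold=0.8):
    """Syntactic stage only: a method that vanished, paired with an added
    method whose body is nearly identical, is a rename candidate."""
    removed = set(old) - set(new)
    added = set(new) - set(old)
    return [(r, a) for r in removed for a in added
            if similarity(old[r], new[a]) >= threshold]

# Two versions of a (hypothetical) component, as name -> tokenized body.
v1 = {"getSize": "return this . count ;",
      "clear":   "this . count = 0 ;"}
v2 = {"size":    "return this . count ;",   # renamed from getSize
      "clear":   "this . count = 0 ;"}
print(detect_renames(v1, v2))  # → [('getSize', 'size')]
```

A refactoring tool that trusted these candidates could then replay the rename `getSize -> size` on every application that uses the component.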

Date 3:00-4:00 pm, Monday, May 15, 2006
Place Gates 104
Speaker Stephen Freund, Williams College
Title Practical Hybrid Type Checking
Abstract Software systems typically contain large APIs that are only informally and imprecisely specified and hence easily misused. Practical mechanisms for documenting and verifying precise specifications would significantly improve software reliability.

The Sage programming language is designed to provide high-coverage checking of expressive specifications. The Sage type language is a synthesis of (unrestricted) refinement types and Pure Type Systems. Since type checking for this language is not statically decidable, Sage uses hybrid type checking, which extends static type checking with dynamic contract checking, automatic theorem proving, and a database of refuted subtype judgments.

In this talk, I present the key ideas behind hybrid type checking, the Sage language, and preliminary experimental results suggesting that hybrid type checking of precise specifications is a promising approach for the development of reliable software. I will also discuss more recent work on extending Sage to include mutable objects.

This is joint work with Cormac Flanagan, Jessica Gronski, Kenn Knowles, and Aaron Tomb at University of California, Santa Cruz.
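The core move of hybrid type checking, falling back to a dynamic contract check where static checking is undecidable, can be sketched in Python. This is only an illustration of the idea, not Sage's actual mechanism: the predicate, its "source-level" spelling, and the scenario of an undecided judgment are all invented here.

```python
def checked_cast(value, pred, label):
    """Dynamic residue of a subtyping judgment the static phase could not
    decide: verify the refinement at run time instead."""
    if not pred(value):
        raise TypeError(f"value {value!r} violates refinement {label}")
    return value

# A refinement "type" as a predicate plus its source-level spelling.
Even = (lambda x: isinstance(x, int) and x % 2 == 0, "{x:Int | x mod 2 = 0}")

def half(n):
    # Imagine the theorem prover could neither prove nor refute that the
    # argument is Even here, so the compiler inserted this cast; judgments
    # it does settle statically need no run-time check at all.
    n = checked_cast(n, *Even)
    return n // 2

print(half(8))   # → 4
# half(7) raises TypeError: the dynamic check catches what escaped static analysis
```

The payoff is high coverage: well-typed uses pay nothing extra once proven, while uses the prover cannot settle still fail precisely at the violated specification rather than with an arbitrary downstream error.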

Date 3:00-4:00 pm, Monday, April 24, 2006
Place Gates 104
Speaker Kathy Yelick, UC Berkeley and Lawrence Berkeley National Lab
Title Compilation Technology for Computational Science
Abstract The emergence of multicore processors marks the end of an era in computing: whereas hardware developers were largely responsible for exponential performance gains in the past decade, software developers will now be equally responsible if these gains are to continue. The notoriously difficult problem of writing parallel software will now be commonplace. Much of the experience using parallelism for performance resides in the scientific computing community, and while that group has focused on numerical simulations and very large scale parallelism, surely some lessons can be learned. In this talk, I will describe the class of partitioned global address space languages, which have recently received support from the scientific computing community as a possible alternative to message passing and threads. One of these languages, Titanium, is a modest extension of Java with domain-specific extensions for scientific computing and large-scale parallelism.

Titanium has proven to be significantly more expressive than message passing and has been used for significant scientific problems, including a parallel simulation of blood flow in the heart and an elliptic solver based on adaptive mesh refinement. Several interesting computer science challenges have arisen from this work, including the need for program analysis techniques specialized to parallel languages, machine-independent optimization strategies, and runtime support for latency hiding. I will describe some recent work in compilation of parallel languages, including thread-aware pointer analysis, sequential consistency enforcement, and model-driven communication optimizations, and give an overview of some open questions in the field.

Date 3:00-4:00 pm, Monday, April 17, 2006
Place Gates 104
Speaker Brad Chamberlain, Cray Inc.
Title Chapel: Cray Cascade's High Productivity Language
Abstract In 2002, DARPA launched the High Productivity Computing Systems (HPCS) program, with the goal of improving user productivity on High-End Computing systems for the year 2010. As part of Cray's research efforts in this program, we have been developing a new parallel language named Chapel, designed to:
1. support a global view of parallel programming with the ability to tune for locality,
2. support general parallel programming including data- and task-parallel codes, as well as nested parallelism, and
3. help narrow the gulf between mainstream and parallel languages.
In this talk I will introduce the motivations and foundations for Chapel, describe several core language concepts, and show some sample computations written in Chapel.

Date 2:15-3:15 pm, Friday, March 10, 2006
Place Gates B8
Speaker Stephen Fink, IBM T. J. Watson Research Center
Title Effective Typestate Verification in the Presence of Aliasing
Abstract We describe a novel framework for verification of typestate properties, including several new techniques to precisely treat aliases without undue performance costs. In particular, we present a flow-sensitive, context-sensitive, integrated verifier that utilizes a parametric abstract domain combining typestate and aliasing information. To scale to real programs without compromising precision, we present a staged verification system in which faster verifiers run as early stages which reduce the workload for later, more precise, stages.

We have evaluated our framework on a number of real Java programs, checking correct API usage for various Java standard libraries. The results show that our approach scales to hundreds of thousands of lines of code, and verifies correctness for over 95% of the potential points of failure.
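A typestate property of the kind being verified can be sketched as a small automaton over API calls. The resource, states, and calls below are invented for illustration; the actual verifier checks all program paths abstractly, with the alias and staging machinery the abstract describes, rather than one concrete trace.

```python
# Typestate automaton for a file-like resource: which API calls are
# legal in which state, and which state each call leads to.
TYPESTATE = {
    ("closed", "open"):  "open",
    ("open",   "read"):  "open",
    ("open",   "close"): "closed",
}

def verify(trace, start="closed"):
    """Run a call trace through the automaton; report the first call
    that is illegal in the current typestate, if any."""
    state = start
    for i, call in enumerate(trace):
        nxt = TYPESTATE.get((state, call))
        if nxt is None:
            return f"error at call {i}: {call}() illegal in state {state!r}"
        state = nxt
    return "ok"

print(verify(["open", "read", "close"]))   # → ok
print(verify(["open", "close", "read"]))   # read-after-close is flagged
```

Aliasing is what makes the real problem hard: when two variables may denote the same object, a call through one must advance the typestate seen through the other, which is why the verifier combines typestate with aliasing information in one abstract domain.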

Date 4:00-5:15 pm, Thursday, February 23, 2006
Place Gates 104
Speaker Vivek Sarkar, IBM T. J. Watson Research Center
Title X10: An Object-Oriented Approach to Non-Uniform Cluster Computing
Abstract It is now well established that the device scaling predicted by Moore's Law is no longer a viable option for increasing the clock frequency of future uniprocessor systems at the rate that had been sustained during the last two decades. As a result, future systems are rapidly moving from uniprocessor to multiprocessor configurations, so as to use parallelism instead of frequency scaling as the foundation for increased compute capacity. The dominant emerging multiprocessor structure for the future is a Non-Uniform Cluster Computing (NUCC) system with nodes that are built out of multi-core SMP chips with non-uniform memory hierarchies, and interconnected in horizontally scalable cluster configurations such as blade servers. Unlike previous generations of hardware evolution, this shift will have a major impact on existing software. Current OO language facilities for concurrent and distributed programming are inadequate for addressing the needs of NUCC systems because they do not support the notions of non-uniform data access within a node, or of tight coupling of distributed nodes.

We have designed a modern object-oriented programming language, X10, for high performance, high productivity programming of NUCC systems. A member of the partitioned global address space family of languages, X10 highlights the explicit reification of locality in the form of places; lightweight activities embodied in async, future, foreach, and ateach constructs; a construct for termination detection (finish); the use of lock-free synchronization (atomic blocks); and the manipulation of cluster-wide global data structures. We present an overview of the X10 programming model and language, experience with our reference implementation, and results from some initial productivity comparisons between the X10 and Java(TM) languages.

This is joint work with other members of the X10 core team --- Vijay Saraswat, Raj Barik, Philippe Charles, Christopher Donawa, Christian Grothoff, Allan Kielstra, Igor Peshansky, and Christoph von Praun.
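The `finish`/`async` pairing mentioned in the abstract can be approximated in Python with threads, purely as a sketch of the semantics: the class and method names below are invented, and real X10 activities are far lighter-weight than OS threads.

```python
import threading

class Finish:
    """Approximation of X10's finish: the block does not complete until
    every async spawned inside it has terminated."""
    def __init__(self):
        self.threads = []

    def async_(self, fn, *args):
        """Spawn an activity, like X10's `async fn(args)`."""
        t = threading.Thread(target=fn, args=args)
        self.threads.append(t)
        t.start()

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        for t in self.threads:   # termination detection: join all activities
            t.join()
        return False

results = [0] * 4
def work(i):
    results[i] = i * i

with Finish() as f:              # roughly: finish { for (i) async work(i); }
    for i in range(4):
        f.async_(work, i)
print(results)  # → [0, 1, 4, 9]: all asyncs are guaranteed done here
```

What this sketch cannot show is the rest of the model: places reifying locality, `ateach` distributing iterations across places, and atomic blocks for lock-free synchronization.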

Date 3:00-4:00 pm, Monday, February 6, 2006
Place Gates 104
Speaker Kathleen Fisher
Title PADS: Processing Arbitrary Data Sources
Abstract Many high-volume data sources exist that can be mined very profitably, for example: call detail records, web server logs, network packets, network configuration and log files, provisioning records, credit card records, stock market data, etc. Unfortunately, many such data sources are in formats over which data consumers have no control. A significant effort is required to understand such a data source and write a parser for the data, a process that is both tedious and error-prone. Often, the hard-won understanding of the data ends up embedded in parsing code, making both sharing the understanding and maintaining the parser difficult. Typically, such parsers are incomplete, failing to specify how to handle situations where the data does not conform to the expected format.

In this talk, I will describe the PADS project, which aims to provide languages and tools for simplifying the analysis of ad hoc data. We have designed a declarative data-description language, PADS/C, expressive enough to describe the data sources we see in practice at AT&T, including ASCII, binary, EBCDIC (Cobol), and mixed formats. From PADS/C we generate a C library with functions for parsing, manipulating, summarizing, querying, and writing the data.

This work is joint with Bob Gruber, Mary Fernandez, David Walker, Yitzhak Mandelbaum, and Mark Daly.
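The spirit of a declarative data description from which a parser is generated can be sketched in Python. The record layout, field names, and spec format below are invented, and PADS/C's actual description language and generated C library are far richer; the point is only the shape of the idea: describe the data once, derive the parser, and report rather than die on nonconforming lines.

```python
import re

# A miniature declarative description: each field names a regex
# and a converter; the "generated" parser is derived from the spec.
CALL_RECORD = [
    ("caller",   r"\d{10}", str),
    ("callee",   r"\d{10}", str),
    ("duration", r"\d+",    int),
]

def make_parser(spec, sep="|"):
    """Derive a line parser from a field specification."""
    pattern = re.compile(
        re.escape(sep).join(f"({rx})" for _, rx, _ in spec) + r"$")
    def parse(line):
        m = pattern.match(line)
        if m is None:
            # Nonconforming data is reported, not fatal: real sources
            # rarely match their expected format completely.
            return {"error": line}
        return {name: conv(g)
                for (name, _, conv), g in zip(spec, m.groups())}
    return parse

parse = make_parser(CALL_RECORD)
print(parse("6505551212|2125559876|347"))  # duration parsed as int 347
print(parse("badly formed line"))          # reported as an error record
```

Keeping the format knowledge in the declarative spec, rather than buried in hand-written parsing code, is what makes it shareable and maintainable.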

Date 4:15-5:30 pm, Tuesday, November 29, 2005
Place Gates 104
Speaker Sanjit Seshia, University of California, Berkeley
Title SAT-Based Decision Procedures and Malware Detection
Abstract SAT-based decision procedures operate by performing a satisfiability-preserving encoding of their input to a Boolean satisfiability (SAT) problem, on which a SAT solver is invoked. In this talk I will present UCLID, a verification tool based on SAT-based decision procedures, and describe an application to detecting malware (e.g., viruses and worms). UCLID's SAT-based decision procedures are for quantifier-free first-order logics involving arithmetic. These have been used within a malware detector that shows greater resilience to malware obfuscations than commercial tools. I will describe the notion of a "semantic signature," the detection algorithm, and experimental results.
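The satisfiability-preserving encoding the abstract describes can be sketched at toy scale: map each theory atom to a Boolean variable, add clauses for the theory facts relating the atoms, and hand the result to a SAT solver. The miniature DPLL solver and the equality example below are invented illustrations, not UCLID's algorithms, and a real encoding for equality must handle symmetry and all transitivity instances.

```python
def dpll(clauses, assignment=()):
    """Tiny DPLL SAT solver. Clauses are tuples of nonzero ints (negative
    means negated); returns a satisfying assignment or None."""
    # Drop satisfied clauses; remove falsified literals from the rest.
    clauses = [c for c in clauses if not any(l in assignment for l in c)]
    clauses = [tuple(l for l in c if -l not in assignment) for c in clauses]
    if any(not c for c in clauses):
        return None                       # empty clause: conflict
    if not clauses:
        return set(assignment)            # everything satisfied
    lit = clauses[0][0]                   # branch on some literal
    return (dpll(clauses, assignment + (lit,)) or
            dpll(clauses, assignment + (-lit,)))

# Encode a toy equality-logic query: is (x=y) and (y=z) and not (x=z)
# satisfiable? One Boolean per atom (1: x=y, 2: y=z, 3: x=z), plus the
# transitivity lemma the theory demands: 1 and 2 imply 3.
clauses = [(1,), (2,), (-3,), (-1, -2, 3)]
print(dpll(clauses))  # → None: unsatisfiable, as the theory requires
```

Because the encoding preserves satisfiability, the SAT solver's verdict on the Boolean formula is also the verdict on the original first-order query.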

Date 3:30-4:30 pm, Tuesday, November 15, 2005
Place Gates 104
Speaker Norman Ramsey, Harvard University
Title A Low-level Approach to Reuse for Programming-Language Infrastructure
Abstract New ideas in programming languages are best evaluated experimentally. But experimental evaluation is helpful only if there is an implementation that is efficient enough to encourage programmers to use the new features. Ideally, language researchers would build efficient implementations by reusing existing infrastructure, but existing infrastructures do not serve researchers well: in high-level infrastructures, many high-level features are built in and can't be changed, and in low-level infrastructures, it is hard to support important *run-time* services such as garbage collection, exceptions, and so on.

I am proposing a different approach: for reuse with many languages, an infrastructure needs *two* low-level interfaces: a compile-time interface and a *run-time* interface. If both interfaces provide appropriate mechanisms, the mechanisms can be composed to build many high-level abstractions, leaving the semantics and cost model up to the client.

In this talk, I will illustrate these ideas with examples drawn from two parts of the C-- language infrastructure: exception dispatch and procedure calls. I will focus on the mechanisms that make it possible for you to choose the semantics and cost model you want. For exceptions, these mechanisms are drawn from both compile-time and run-time interfaces, and together they enable you to duplicate all the established techniques for implementing exceptions. For procedure calls, the mechanisms are quite different; rather than provide low-level mechanisms that combine to form different kinds of procedure calls, I have found it necessary to extend the compile-time interface to enable direct control of the semantics and cost of procedure calls. I will also sketch some important unsolved problems regarding mechanisms to support advanced control features such as threads and first-class continuations.

Date 4:15-5:30 pm, Tuesday, November 8, 2005
Place Gates 104
Speaker Terence Parr, University of San Francisco
Title ANTLR and Computer Language Implementation
Abstract While parsing has been well understood for decades and a number of decent parser generators exist, anyone who has built an interpreter, translator, or compiler of consequence will tell you that the overall problem of supporting language development has not been adequately solved. Yes, parsing has been solved in theory but many of the strongest parsing strategies and systems are cumbersome in practice. Moreover, parsing is only one component of language implementation and the other pieces have either been studied purely from a compiler point of view or have resulted in powerful but inaccessible solutions that programmers in the trenches are unable or unwilling to use.

My goal in this talk is twofold: (1) to convince you that there is still work to be done and interesting problems to solve in the realm of language tools such as IDEs tailored to language development, grammar reuse strategies, automatic grammar construction from sample inputs, and transformation systems that are accessible to the average programmer; (2) to demonstrate a few items from my ANTLR research program such as the LL(*) parsing algorithm, tree grammars, grammar rewrite rules, and ANTLRWorks grammar development environment.

Date 4:15-5:30 pm, Tuesday, November 1, 2005
Place Gates 104
Speaker George Necula, University of California, Berkeley
Title Data Structure Specifications via Local Equality Axioms
Abstract We describe a program verification methodology for specifying global shape properties of data structures by means of axioms involving arbitrary predicates on scalar fields and pointer equalities in the neighborhood of a memory cell. We show that such local invariants are both natural and sufficient for describing a large class of data structures. We describe a complete decision procedure for such a class of axioms. The decision procedure is not only simpler and faster than in other similar systems, but has the advantage that it can be extended easily with reasoning for any decidable theory of scalar fields.

Date 4:15-5:30 pm, Tuesday, October 18, 2005
Place Gates 104
Speaker Ulfar Erlingsson, Microsoft Research
Title Principles and Applications of Software Control-Flow Integrity
Abstract Current software attacks often build on exploits that subvert machine-code execution. The enforcement of a basic safety property, Control-Flow Integrity (CFI), can prevent such attacks from arbitrarily controlling program behavior. CFI enforcement is simple, and its guarantees can be established formally, even with respect to powerful adversaries. Moreover, CFI enforcement is practical: it is compatible with existing software and can be done efficiently using software rewriting in commodity systems. Finally, CFI provides a useful foundation for enforcing further security policies.

This talk describes CFI and an x86 implementation of CFI enforcement, assesses its security benefits against real-world attacks, and shows how the CFI guarantees can enable efficient software implementations of a protected shadow call stack and of access control for memory regions.

This is joint work with Martin Abadi, Mihai Budiu, and Jay Ligatti. More information about the work can be found at http://research.microsoft.com/research/sv/gleipnir/
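The essence of the inserted CFI check can be sketched in Python: legitimate indirect-control-transfer targets carry a label, and instrumented call sites verify the label before transferring control. The label value and function names below are invented; the real enforcement rewrites x86 machine code and embeds the labels at the binary level.

```python
# Sketch of CFI enforcement: every sanctioned indirect-call target is
# tagged with a label; rewritten call sites check it before the transfer.
CALL_LABEL = 0xABCD

def cfi_label(fn):
    """Mark a function as a legitimate indirect-call target."""
    fn.cfi = CALL_LABEL
    return fn

def indirect_call(target, *args):
    """The check the rewriter inserts before an indirect call instruction."""
    if getattr(target, "cfi", None) != CALL_LABEL:
        raise RuntimeError("CFI violation: transfer to invalid target")
    return target(*args)

@cfi_label
def on_event(x):
    return x + 1

def attacker_gadget(x):   # reachable code, but not a sanctioned target
    return "owned"

print(indirect_call(on_event, 41))   # → 42: labeled target, call proceeds
# indirect_call(attacker_gadget, 0) raises RuntimeError: label check fails
```

Even an adversary who corrupts a function pointer can only redirect control among labeled targets, which is the foundation that makes the shadow call stack and memory access control mentioned above enforceable.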

Date 2:30-3:45 pm, Monday, October 10, 2005
Place Gates 104
Speaker Dan Grossman, University of Washington
Title Strong Atomicity for Today's Programming Languages
Abstract The data races and deadlocks that riddle threaded applications are an ever-greater impediment to reliable and responsive desktop applications. The lock-based approach to shared-memory concurrency (i.e., the approach taken in current programming languages such as Java) has software-engineering shortcomings compared to atomicity. An atomic construct is a concurrency primitive that executes code as though no other thread has interleaved execution. To ensure correctness, fair scheduling, and reasonable performance, we advocate a logging-and-rollback approach to implementing atomic. Moreover, we believe we can implement atomic well enough on today's commodity hardware to utilize atomicity and further investigate its usefulness.

This talk will describe ongoing work designing and implementing languages with atomicity. After describing the advantages of atomicity, we will describe our experience with two prototypes: (1) AtomCaml is a working prototype that extends the mostly-functional language OCaml with atomicity. OCaml does not support true parallelism (it essentially assumes a uniprocessor), which lets us perform key optimizations for this common case. In particular, non-atomic code can run unchanged. (2) AtomJava is a Java extension currently under development. It implements atomicity in terms of locks (which are not visible to programmers) without potential deadlock. The AtomJava compiler produces Java source code that can run on any Java implementation.
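The logging-and-rollback idea can be sketched in Python for a single thread: writes inside an atomic block are recorded in an undo log, and if the block aborts, the log is replayed so no partial update is visible. Everything here is an invented illustration, not AtomCaml or AtomJava; in particular, a real implementation must also detect conflicts between threads and re-execute blocks fairly.

```python
class Heap:
    """A toy shared heap supporting atomic blocks via an undo log."""
    def __init__(self, **fields):
        self.data = dict(fields)
        self.log = None                        # undo log, live inside atomic

    def write(self, key, value):
        if self.log is not None and key not in self.log:
            self.log[key] = self.data[key]     # first write: record old value
        self.data[key] = value

    def atomic(self, block):
        """Run block as if no other thread interleaved; on failure,
        replay the undo log so no partial update survives."""
        self.log = {}
        try:
            block()
        except Exception:
            self.data.update(self.log)         # roll back logged writes
            raise
        finally:
            self.log = None

h = Heap(checking=100, savings=0)
def transfer():
    h.write("checking", h.data["checking"] - 150)
    if h.data["checking"] < 0:
        raise ValueError("insufficient funds")  # abort mid-block
    h.write("savings", h.data["savings"] + 150)

try:
    h.atomic(transfer)
except ValueError:
    pass
print(h.data)  # → {'checking': 100, 'savings': 0}: the partial write was undone
```

Contrast this with locks: with atomic, the programmer states only that `transfer` must appear indivisible, and the implementation, not the programmer, is responsible for ensuring no torn state escapes.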