Computer Science Division
University of California, Berkeley
Technical Progress Report (08/01/97 - 07/31/98)
The Computer Science Division of the Electrical Engineering and Computer Sciences department is recognized as a world leader, rated #1 along with MIT and Stanford in recent national evaluations. The division consists of 30 faculty, roughly 200 graduate students, and a large undergraduate population in both the College of Engineering and the College of Letters and Science. The department has had a major impact on the technological world with developments such as the BSD UNIX operating system, computer-aided design tools for integrated circuits (such as Spice and Magic), relational database systems, IEEE floating point, and pioneering work in RISC computer architectures and RAID storage systems. It is also renowned for its theoretical work, such as the theory of NP-completeness, and several of its faculty have received the Turing Award.
In the late 80's the Division began a major effort to construct a new Computer Science building. Because of its strong impact on the computing industry, it was able to fund and develop a state-of-the-art facility, Soda Hall. The Titan project grew out of the process of designing the new building and the efforts by the faculty to envision the dominant directions of computing as we enter the next century. We wanted to create an environment in which to experience and investigate the salient issues of computer systems as they "will be" and in which we could rapidly incorporate advancing technology, especially networking, high-performance computing, and interactive multimedia.
The Computer Science faculty at the University of California proposed to NSF to develop as its computing and communication infrastructure a new type of computing system, called Titan, which would harness breakthrough communications technology to integrate a large collection of commodity computers into a powerful resource pool that can be accessed directly through its constituent nodes or through inexpensive media stations. The vision was to treat the building as an integrated computing system, with a core computing component providing vast amounts of computing power and storage, connected to media stations and other advanced devices. A software architecture for the global operating system and programming language would be developed and the system design would be driven by a set of advanced applications with demanding computational, I/O, and graphics requirements. Funding for the project is shared between the National Science Foundation and the University, with individual research groups adding value to this infrastructure through their research personnel and equipment supported through other sources. NSF requested, in response to the original proposal and its addendum, that the project directly incorporate a significant experimental systems research component, along the lines described in a separate UCB proposal: "NOW: Design and Implementation of a Distributed Supercomputer as a Cost-effective Extension to a Network of Workstations." The Titan project comprises a core computing component, a multimedia component, an advanced networking component, and a set of driving applications.
In this report we outline the progress on Titan during its fourth year. The report organization follows the primary segmentation of the project:
We deployed a new switched 100 Mb/s network to support the new Intel PC media stations. Initially the available products from Bay allowed us to construct only a network with limited bisection bandwidth (200 Mb/s). Recently we were able to increase this by an order of magnitude with a new switched Ethernet product. This switched network is heavily used for multimedia traffic, as well as file traffic.
We have received gigabit Ethernet switches and adapters and are currently running tests in advance of replacing the ATM cloud and 10 Mb/s external network of NOW with a switched gigabit core and 100 Mb/s links to the NOW nodes.
Surprisingly, across the platforms the main limitation to attaining peak I/O performance is the CPU, due to lack of data locality. Increasing processor performance (especially with improved block operation performance) will be of great aid for these workloads in the future. The cluster design requires more memory bandwidth per processor, but the bandwidth available in current designs is adequate and, by design, scales with the number of processors. For a cluster workstation, the I/O bus is a major system bottleneck, because of the increased load placed on it from network communication. A well-balanced cluster workstation should have copious I/O bus bandwidth, perhaps via multiple I/O busses. The SMP suffers from poor memory-system performance; even when there is true parallelism in the benchmark, contention in the shared-memory system leads to reduced performance. As a result, the clustered workstations provide higher absolute performance for streaming I/O workloads.
As part of this work we have rebuilt the "Lanai" firmware to use a single context, making it simpler and better suited to a wide range of network interface cards, especially those emerging for gigabit Ethernet. We have ported the AM II layer, the Solaris virtual network driver, Split-C, and MPI to PentiumPro platforms using the PCI-based Lanai.
The virtual network driver has been ported to operate on SMP nodes with multiple concurrent driver threads and multiple network interface cards per node. Multi-NIC operation has been demonstrated on a cluster of four 8-way Sun Enterprise 5000 SMPs with three NICs per SMP. We have built an initial multi-protocol AM II layer for Clusters of SMPs (Clumps) and demonstrated its operation through benchmarks and applications on our cluster. This work has raised key issues for such layers in the design of concurrent communication objects, adaptive polling algorithms, and lock-free queue management algorithms.
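To illustrate the last of these issues, the following is a minimal sketch of a single-producer/single-consumer lock-free message queue of the general kind such a layer needs. It is written with C11 atomics and is our illustration, not the AM II code; names and sizes are ours.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define QUEUE_SLOTS 256              /* must be a power of two */

    typedef struct { uint64_t payload; } am_msg_t;

    typedef struct {
        am_msg_t slots[QUEUE_SLOTS];
        _Atomic uint32_t head;           /* next slot to dequeue */
        _Atomic uint32_t tail;           /* next slot to enqueue */
    } am_queue_t;

    /* Producer side: returns false if the queue is full. */
    static bool am_enqueue(am_queue_t *q, am_msg_t m)
    {
        uint32_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
        uint32_t head = atomic_load_explicit(&q->head, memory_order_acquire);
        if (tail - head == QUEUE_SLOTS)
            return false;                /* full */
        q->slots[tail & (QUEUE_SLOTS - 1)] = m;
        /* Release: the consumer must see the payload before the new tail. */
        atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
        return true;
    }

    /* Consumer side: returns false if the queue is empty. */
    static bool am_dequeue(am_queue_t *q, am_msg_t *out)
    {
        uint32_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
        uint32_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
        if (head == tail)
            return false;                /* empty */
        *out = q->slots[head & (QUEUE_SLOTS - 1)];
        atomic_store_explicit(&q->head, head + 1, memory_order_release);
        return true;
    }

Neither side ever takes a lock, so a driver thread and a NIC-polling thread can exchange messages without blocking each other.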
The AM II layer is completely integrated with an automatic network mapper (demonstrated and proven correct in a SPAA 97 paper). A map of the current NOW (updated every minute) can be found at http://www.cs.berkeley.edu/~alanm/map.html. The network mapper is now running on all nodes by default. It takes a fraction of a second to map a network of 43 switches, 93 hosts, and 192 cables. The automatic mapping algorithm has also been substantially revised to make it faster and more robust and to yield better route selection. In its current form, all nodes independently map the network; improvements to the algorithm and time-out mechanism allow this to be very fast. By allowing each node to choose its own routes, consistent with an up*-down* ordering of the network graph, a better spread of load is obtained over the physical links. We have experimented with integrating the remapping operation into the error handling loop of the AM layer.
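The up*-down* constraint that this independent route selection must respect is easy to state: a legal route may ascend the spanning-tree ordering any number of times, but once it takes a down link it may never go up again, which is what guarantees deadlock freedom. A small illustrative check (our code, not the mapper's):

    #include <stdbool.h>

    typedef enum { LINK_UP, LINK_DOWN } link_dir_t;

    /* A route is a legal up*-down* route iff no up hop follows a down hop. */
    static bool route_is_up_down(const link_dir_t *route, int nhops)
    {
        bool descending = false;
        for (int i = 0; i < nhops; i++) {
            if (route[i] == LINK_DOWN) {
                descending = true;
            } else if (descending) {
                return false;   /* an up hop after a down hop: illegal */
            }
        }
        return true;
    }

Any spread of traffic over routes satisfying this predicate remains deadlock-free, which is what lets each node pick routes independently.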
We have conducted an extensive performance evaluation study of the NAS Parallel Benchmarks on NOW by building an in situ measurement facility into our MPI layer. This study demonstrates the power of the direct execution approach; a single run through the Class A benchmark suite is a trillion instructions, and the study involved hundreds of such runs. Class B, which is several times larger, has also been run. The study showed NOW scalability to be substantially better than that of the SP-2 and as good as that of the Cray T3D. We developed a number of tools to isolate the performance factors, including instruction breakdowns, scaling of computational work, MPI send, receive, and wait time, and cache traces for parallel applications. The study reveals the extensive change in architectural interactions under constant-problem-size scaling, including changes in communication characteristics and memory load. Although the cost of communication increases and extra work is performed, we obtain perfect scaling on several NPB applications because of improvements in computational efficiency: the large number of per-node caches holds the working set. We have been working on the NAS Parallel Benchmark follow-up work on SGI Origin 2000 machines. The basic conclusion from this work is that the SGI machine depends heavily on cache effects to gain superlinear speedup. Node performance has a first-order impact on the overall scalability of the benchmarks, on both the NOW cluster and the SGI Origin 2000.
We built a kernel-to-kernel AM layer and test apparatus to study the sensitivity to communication performance of applications that use the Internet protocols (e.g., TCP, UDP, RPC). The apparatus, currently running on a 4-node cluster, allows independent variation of network overhead, inter-packet gap, latency, and bandwidth. We have constructed a controlled environment for studying sensitivity on the SPEC SFS benchmark of NFS. Using the apparatus, we have run a large number of experiments on the sensitivity of NFS to network performance. We used the SPEC SFS load generators and evaluation metrics, which are industry standards. The current results suggest strong sensitivity to overhead and weak sensitivity to bandwidth. Sensitivity to latency and per-message gap is very low for the performance ranges of interest in local area networks.
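For concreteness, here is a sketch of the kind of per-packet cost rule such an apparatus can apply on the send path. The parameter names and the exact charging rule are our illustrative assumptions, not the kernel implementation:

    #include <stdint.h>

    typedef struct {
        double overhead_us;   /* per-message host CPU overhead       */
        double gap_us;        /* minimum gap between packets         */
        double latency_us;    /* added one-way wire latency          */
        double bw_mb_per_s;   /* emulated link bandwidth             */
    } emu_params_t;

    /* Sender-side cost of one packet, in microseconds. */
    static double emu_send_cost(const emu_params_t *p, uint32_t bytes)
    {
        /* 1 MB/s is roughly 1 byte/us, so bytes / MB-per-s is in us. */
        double size_us = bytes / p->bw_mb_per_s;
        double cost = p->overhead_us + size_us;
        return (cost > p->gap_us) ? cost : p->gap_us;
    }

    /* The receiver delivers the packet latency_us later; delaying each
     * side independently is what lets the four parameters be varied
     * one at a time. */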
We have studied in detail how to minimize the latency of a message through a network that consists of a number of store-and-forward stages, especially for the page-size chunks transported within cluster file systems. This research is especially important for today's low-overhead communication subsystems, which employ dedicated processing elements for protocol processing. We have developed an abstract pipeline model that reveals a crucial performance tradeoff. We exploit this tradeoff in fragmentation algorithms designed to minimize message latency. By applying this rather formal methodology to the Myrinet-GAM system, we have improved its latency by up to 51%. A paper describing this work can be found at http://www.cs.berkeley.edu/~rywang/papers/pipeline
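One simple instance of such a pipeline model (our notation; the paper's model may differ in detail) makes the tradeoff explicit. Suppose a message of $n$ bytes crosses $k$ store-and-forward stages in fragments of $f$ bytes, and each stage spends a fixed cost $\alpha$ plus $\beta$ per byte on each fragment. Then the delivery time and its minimizer are

\[ T(f) \;=\; \Bigl(\frac{n}{f} + k - 1\Bigr)\,(\alpha + \beta f), \qquad \frac{dT}{df} = 0 \;\Rightarrow\; f^{*} = \sqrt{\frac{n\,\alpha}{(k-1)\,\beta}} . \]

Small fragments fill the pipeline quickly but pay the per-fragment cost $\alpha$ many times; large fragments amortize $\alpha$ but serialize the stages. The optimal fragment size $f^{*}$ balances the two, and this is the kind of tradeoff the fragmentation algorithms exploit.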
A prototype was developed that demonstrates user customization of virtual network interfaces through safe language extensions. A Java Virtual Machine was built for the Myrinet board's Lanai processor, and a class library was developed to provide the basic communication subprimitives that are used within Lanai control programs. The architecture permits applications to safely specify code to be executed within the NI on message transmission and reception. The design is based on the Cornell U-Net implementation and achieves impressive performance. A draft of our U-Net/SLE whitepaper is at http://www.cs.berkeley.edu/~mdw/proj/unet-sle/unet-sle.ps
We have completed a reference implementation of the Virtual Interface Architecture (VIA). Based on the VIA specification version 1.0 published jointly by Microsoft, Compaq, and Intel, the Berkeley VIA successfully achieves a networking interface between the user process and networking hardware that requires no kernel transitions to accomplish data transmission/receipt. When a virtual interface is created, the host allocates pinned memory from which the network interface can perform DMA. The user process writes/reads data and transfer descriptors to this memory region and initiates a transfer by writing a special doorbell token to a memory page that is mapped into the network interface card. By using this paged doorbell approach, system calls are eliminated while assuring some form of protection to the user process. Presently, our VIA supports the Sun Ultrasparc platform running Sun Solaris 2.6 and Myricom's Myrinet network. Analysis of the architecture on this platform yields single-packet latencies as small as 24 microseconds. Measured bandwidth for a 2KB packet is approximately 196 Mbits/sec. We intend to continue the development of the Berkeley VIA by expanding its functionality and porting it to different host/network platforms such as Intel-based PCs running Windows NT and Gigabit Ethernet.
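The sketch below shows the shape of this user-level send path. The structure layouts, field names, and doorbell encoding are illustrative assumptions, not details of the VIA 1.0 specification or of our implementation:

    #include <stdint.h>

    typedef struct {
        uint64_t buf_addr;        /* virtual address in pinned region */
        uint32_t length;          /* bytes to transmit                */
        uint32_t control;         /* e.g. completion/interrupt flags  */
    } via_desc_t;

    typedef struct {
        via_desc_t        *send_ring;    /* in pinned, DMA-able memory */
        uint32_t           send_slot;
        uint32_t           ring_size;
        volatile uint32_t *doorbell;     /* mapped NIC doorbell page   */
    } via_vi_t;

    static void via_post_send(via_vi_t *vi, uint64_t addr, uint32_t len)
    {
        via_desc_t *d = &vi->send_ring[vi->send_slot % vi->ring_size];
        d->buf_addr = addr;
        d->length   = len;
        d->control  = 0;
        /* The doorbell write is an ordinary user-space store; the page
         * mapping itself provides protection, so no system call is
         * needed to hand the descriptor to the NIC. */
        *vi->doorbell = vi->send_slot++;
    }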
We have implemented the ideas in our implicit scheduling work and have conducted an extensive empirical investigation of that theory. The implementation process revealed a number of subtle issues in the AM II layer and within the Solaris scheduler. The adaptive two-phase algorithms have been shown to work extremely well in practice, both on the synthetic workloads of the original simulation study and on collections of real programs. However, it is critical that no layer below the run-time library silently block. The critical issue for implicit scheduling is reacting to the response time of remote operations; this turns out to be much more important than the actual message arrival. We have also been able to demonstrate simple extensions that provide fairness.
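The flavor of the two-phase decision can be sketched as follows. The primitives named here are placeholders, not the actual AM II or Solaris interfaces, and the spin bound is an illustrative choice:

    #include <stdbool.h>

    /* Placeholder primitives standing in for the runtime layer. */
    extern bool   am_reply_arrived(int handle);
    extern void   am_poll(void);
    extern void   block_until_reply(int handle);   /* yields the CPU */
    extern double now_us(void);

    static void wait_for_reply(int handle, double expected_us)
    {
        double start = now_us();
        /* Phase 1: spin, keeping the processor, while a response within
         * the expected remote service time is still plausible. */
        while (now_us() - start < 2.0 * expected_us) {
            am_poll();
            if (am_reply_arrived(handle))
                return;
        }
        /* Phase 2: give up the processor.  This must be the only place
         * the thread blocks; if any lower layer blocks silently, the
         * scheduler loses the information it needs. */
        block_until_reply(handle);
    }

Keying the spin bound to the observed response time of remote operations, rather than to raw message arrival, is exactly the adaptation the study found to matter most.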
We have been working on a system for accessing NOW resources across the wide area. The goal is to allow authorized users anywhere on the Internet to utilize unique computational resources such as the NOW. Our work has resulted in the design and implementation of a new authentication and access control system, called CRISIS. A goal of CRISIS is to explore the systematic application of a number of design principles to building highly secure systems, including: redundancy to eliminate single points of attack, caching to improve performance and availability over slow and unreliable wide area networks, fine-grained capabilities and roles to enable lightweight control of privilege, and complete local logging of all evidence used to make each access control decision (e.g., no implicit reasoning about transfer of rights). Measurements of a prototype CRISIS-enabled wide area file system show that CRISIS adds only marginal overhead relative to unprotected wide area accesses. A paper entitled "The CRISIS Wide Area Security Architecture," by Eshwar Belani, Amin Vahdat, Thomas Anderson, and Michael Dahlin, has been accepted to the USENIX Security Symposium and is available at http://www.cs.berkeley.edu/~vahdat/uss.ps
We have begun design and development of an adaptive I/O system for parallel programs on NOW, called River, that draws on the lessons of our high-performance I/O studies. The key problem with applications that perform parallel I/O (e.g., NOW-Sort) is that a small perturbation on a single workstation leads to a large-scale performance hit. We call this performance regime "meta-stable," much like a marble balanced on top of a hill. What is needed is an I/O environment that provides more "stable" performance, where small perturbations lead to small (or no) performance decreases. The River mantra is "move the I/O to the computation," and the design draws on work from both the parallel I/O and task queue literature. By defining a higher-level, more flexible interface, and a dynamic, "perturbance-aware" system underneath, we plan to provide robust parallel I/O to a range of interesting applications (decision support, scientific, etc.).
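A toy sketch of the underlying task-queue idea, in which consumers pull the next available block from a shared counter so a perturbed node simply takes fewer blocks (all names here are illustrative, not the River interface):

    #include <stdatomic.h>
    #include <stdint.h>

    #define NBLOCKS 1024

    static _Atomic uint32_t next_block;      /* shared work pointer */

    extern void process_block(uint32_t id);  /* placeholder consumer */

    static void consumer_loop(void)
    {
        for (;;) {
            uint32_t b = atomic_fetch_add(&next_block, 1);
            if (b >= NBLOCKS)
                break;                        /* no blocks left */
            process_block(b);                 /* fast nodes simply pull more */
        }
    }

With a static assignment, one slow node delays the whole job by its full share; with the pull discipline, the slowdown is limited to the blocks that node actually claims.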
Our Large-Scale Trace Analysis facility is essentially complete. Unfortunately, there is a problem with one set of traces, so we will need to reprocess them from scratch and recalculate all results for that trace. Analyses completed include: a re-examination of the Sprite measurements to observe changes over time, use of the long-term traces to study file lifetimes and disk accesses beyond the Sprite study, and a study of post-cache file system behavior with an analysis of disk seeks. It appears that a good case can be made for disk reorganization to reduce read cost; this analysis could not have been done with short-term traces. The traces were also used to examine long-term self-similar behavior, described in a paper to appear in SIGMETRICS 98. Because the traces are long-term, we were able to see at what granularity self-similar behavior holds. The camera-ready SIGMETRICS paper on this topic has been turned in. We may have a follow-up paper examining the causes of burstiness and post-cache behavior. We are studying file system backup policies (again, only long-term traces were useful for this project) and have done initial work on read cost for different layout policies. Initial results show that FFS has slightly lower read cost.
We are running an experiment to see whether the failure rate of disk sectors is affected by workload. Several groups of disks are continuously subjected to different load levels and read/write patterns; the load patterns are based on advice from people at IBM. The experiment has been running for about three weeks. We have detected some new bad sectors, but it is too early to say anything definite.
The basic goal in Self-Maintaining Storage is to limit maintenance by a system administrator to regular intervals. Doing so requires a combination of monitoring and fault tolerance. We are attempting to solve this problem for our application (and others like it that are web-accessible and read-mostly). So far we have built some fault tolerance into our web server (failover for the front end, and mirroring to recover from failed servers). We are porting NOW monitoring software to BSD and adding other modules to monitor additional devices such as disk enclosures.
[Ch*98] Virtual Network Transport Protocols for Myrinet, Brent Chun, Alan Mainwaring, and David Culler, IEEE Micro (Special Issue on Hot Interconnects), Jan/Feb 1998.
[Gr*98] Self-Similarity in File Systems, Steven D. Gribble, Gurmeet Singh Manku, Drew Roselli, Eric A. Brewer, Timothy J. Gibson, and Ethan L. Miller, Proceedings of ACM SIGMETRICS 1998, Madison, Wisconsin, June 1998.
[LuCu98] Managing Concurrent Access for Shared Memory Active Messages, Steven S. Lumetta, David E. Culler, IPPS/SPDP 98, Orlando, FL, March 1998.
[Arp*98] The Architectural Costs of Streaming I/O: A Comparison of Workstations, Clusters, and SMPs, Remzi Arpaci-Dusseau, Andrea Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, David A. Patterson, HPCA 4, Las Vegas, February 1998.
[Lu*97] Multi-Protocol Active Messages on a Cluster of SMPs, Steven S. Lumetta, Alan M. Mainwaring, David E. Culler, SC'97, San Jose, California, November 1997.
[Cu*97] Parallel Computing on the Berkeley NOW, David E. Culler, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Brent Chun, Steven Lumetta, Alan Mainwaring, Richard Martin, Chad Yoshikawa, Frederick Wong, JSPP'97 (9th Joint Symposium on Parallel Processing), Kobe, Japan, May 1997.
[Ma*97] Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture, Richard P. Martin, Amin M. Vahdat, David E. Culler, Thomas E. Anderson, ISCA 24, Denver, CO, June 1997.
[AD*97] High-Performance Sorting on Networks of Workstations, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, David A. Patterson, SIGMOD '97, Tucson, Arizona, May 1997.
[Gho*98] GLUnix: A Global Layer Unix for a Network of Workstations, Douglas P. Ghormley, David Petrou, Steven H. Rodrigues, Amin M. Vahdat, Thomas E. Anderson, to appear in Software Practice and Experience.
[Nee*97] Improving the Performance of Log-Structured File Systems with Adaptive Methods, Jeanna Neefe Matthews, Drew Roselli, Adam M. Costello, Randy Wang, Tom Anderson, SOSP 16, St. Malo, France, October 5-8, 1997.
[Cha*97] Experience with a Language for Writing Coherence Protocols, Satish Chandra, Michael Dahlin, Bradley Richards, Randolph Wang, Thomas E. Anderson, James R. Larus, USENIX Conference on Domain-Specific Languages (USENIX/DSL), Santa Barbara, California, October 15-17, 1997.
[Dus*97] Extending Proportional-Share Scheduling to a Network of Workstations, Andrea C. Arpaci-Dusseau, David E. Culler, International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'97), Las Vegas, Nevada, June 1997.
[Wo*98] Architecture Requirements and Scalability of the NAS Parallel Benchmarks, Frederick Wong, Richard Martin, Remzi Arpaci-Dusseau, and David Culler, submitted for publication.
[Mai*98] Design and Implementation of Virtual Networks, Alan Mainwaring and David Culler, submitted for publication.
Doug Ghormley received his PhD. Steven Lumetta, Andrea Dusseau, Amin Vahdat, Randy Wang, and Richard Martin will be finishing soon and have obtained faculty or post-doc positions. Several students passed quals and/or received MS degrees.
In terms of networking research, the group has focused on methods for improving the performance of TCP over cellular wireless networks. We have implemented a TCP-aware link layer protocol called the Snoop Protocol that isolates wired senders from the lossy characteristics of a wireless link. In our latest work, we have employed a novel protocol based on Explicit Loss Notification (ELN) to improve transport performance. This technique is particularly well suited for packet radio networks, in which the lossy link need not be limited to the final network hop. We have obtained extensive packet traces of wireless errors from a production multi-hop wireless network (i.e., the Reinas Remote Sensing network deployed in the Monterey Bay by researchers at UC Santa Cruz) and have derived an empirical model of channel errors based on this data. We have used this to evaluate the performance of "standard" TCP Reno, TCP Selective Acknowledgments, and our Snoop protocol for Web workloads to mobile hosts. Furthermore, we have extensively studied the scaling behavior of the Snoop protocol to understand how it performs under load. This analysis leads to general insights about efficient protocol design for reliable wireless transport.
H. Balakrishnan, M. Stemm, S. Seshan, R. H. Katz, "Analyzing Stability in Wide-Area Network Performance," ACM Sigmetrics Conference, Seattle, WA, (June 1997).
B. Noble, M. Satyanarayanan, G. Nguyen, R. H. Katz, "Trace Based Mobile Network Emulation," ACM SIGCOMM Conference, Cannes, France, (September 1997).
T. Hodes, R. H. Katz, E. Servan-Schreiber, L. A. Rowe, "Composable Ad-Hoc Mobile Services for Universal Interaction," Third ACM Mobicom Conference, Budapest, Hungary, (September 1997). Best Paper Award.
H. Balakrishnan, V. Padmanabhan, R. H. Katz, "The Effects of Asymmetry on TCP Performance over Wide-Area Wireless Networks," Third ACM Mobicom Conference, Budapest, Hungary, (September 1997).
T. Henderson, R. Katz, "Satellite Transport Protocol (STP)--An SSCOP-based Transport Protocol for Datagram Satellite Networks," Second Workshop on Satellite-Based Information Systems (WOSBIS-97), Budapest, Hungary, (October 1997).
M. Stemm, S. Seshan, R. H. Katz, "SPAND: Shared Passive Network Performance Discovery," USENIX Symposium on Internet Technologies and Systems, Monterey, CA, (December 1997).
H. Balakrishnan, M. Stemm, S. Seshan, V. Padmanabhan, R. H. Katz, "TCP Behavior of a Busy Internet Server: Analysis and Solutions," IEEE Infocomm Conference, San Francisco, CA, (March 1998).
M. Stemm, H. Balakrishnan, S. Seshan, V. Padmanabhan, R. H. Katz, "TCP Improvements for Heterogeneous Networks: The Daedalus Approach," Proceedings Allerton Conference, Urbana, IL, (September 1997). Invited Paper.
R. H. Katz, "Beyond Third Generation Telecommunications Infrastructures," ACM Sigmobile Newsletter, V. 2, N. 2, (April 1998), pp. 1-5. Invited Paper based on ACM Mobicom Keynote Address, September 1997.
D. Goodman, N. Abramson, E. Cacciamani, J. Engel, M. Epstein, B. Fette, D. Fields, B. Gavish, A. Goldsmith, R. H. Katz, E. Kelley, K. Pahlavan, C. Perkins, T. Rappaport, J. Russell, The Evolution of Untethered Communications, National Research Council Press, 1997.
R. H. Katz, W. L. Scherlis, S. L. Squires, "The National Information Infrastructure: A High Performance Computing and Communications Perspective," in White Papers, The Unpredictable Certainty: Information Infrastructure Through 2000, National Research Council Press, Washington, 1998, pp. 315-334.
The uniprocessor and SMP versions of Titanium have been used in the graduate parallel computing class (CS267) and by undergraduate researchers for writing parallel algorithms. The group now has several benchmarks that are designed for these platforms: an electromagnetics model using an unstructured mesh, a particle-in-cell code, matrix multiplication, Cholesky and LU decompositions (without pivoting), a multigrid solver for structured meshes, a linear systems solver for tridiagonal systems, a parallel sorting algorithm, and a simple n-body simulation. In addition, a major application is under development with Luigi Semenzato and Phillip Colella at NERSC/LBNL. The application is a Poisson solver that uses Adaptive Mesh Refinement techniques, and it runs on the SMPs (both Intel and Sun) and the NOW.
The group is currently working on improved parallel code generation for distributed memory machines like the NOW and tuning some of the applications for distributed data layout on these machines. There have been several results in the area of program analysis and optimization of explicitly parallel code, such as communication optimization, synchronization analysis, cache optimization for grid-based computation, and analysis and optimization of dynamic memory management.
References:
Titanium: A High-Performance Java Dialect. ACM 1998 Workshop on Java for High-Performance Network Computing. To appear in Concurrency: Practice and Experience.
Analyses and Optimizations for Shared Address Space Programs. A. Krishnamurthy and K. Yelick, Journal of Parallel and Distributed Computation, 1996.
Evaluation of Architectural Support for Global Address-Based Communication in Large Scale Parallel Machines. A. Krishnamurthy, K. Schauser, C. Scheiman, R. Wang, D. Culler, and K. Yelick, Architectural Support for Programming Languages and Operating Systems, November, 1996.
Empirical Evaluation of Global Memory Support on the Cray-T3D and Cray-T3E. A. Krishnamurthy, D. Culler, and K. Yelick, UCB//CSD-98-991.
Alex Aiken and David Gay. Barrier Inference. Proceedings of the Twenty-Fifth Annual ACM Sigplan Symposium on Principles of Programming Languages, San Diego, California, January, 1998.
David Gay and Alex Aiken. Memory Management with Explicit Regions. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, to appear, Montreal, Canada, June 1998.
Boris Weissman delivered a paper prepared with Juergen Quittek on efficient synchronization in Sather at the 1997 International Scientific Computing in Object-Oriented Parallel Environments Conference. Another aspect of synchronization in Sather, fairness, has been addressed in a paper delivered by a former ICSI postdoc, Michael Philippsen, at the IASTED International Conference on Parallel and Distributed Computing and Systems. Matthias Anlauff, a German postdoc, gave a talk about his work on a system for formal specification of programming language semantics at the International Workshop on the Theory and Practice of Algebraic Specifications in Amsterdam. Efficiency aspects of thread migration on emerging hardware platforms such as networks of SMPs (CLUMPs) are investigated in a paper to appear at the 12th International Parallel Processing Symposium and 9th IEEE Symposium on Parallel and Distributed Processing (IPPS/SPDP 1998). Based on our experience with pSather over the past few years, we have formulated a new programming model for a safe high-performance programming language. This resulted in a paper prepared in cooperation with researchers from Karlsruhe, Germany, and submitted to the European Conference on Object-Oriented Programming (ECOOP 98).
David Stoutamire completed his doctoral dissertation on a new ``Zones'' model for improving locality in compiling high-level languages. Although the developments are based on Sather and fully implemented in the compiler, they are broadly applicable. David has moved to JavaSoft, where he joins Robert Griesemer, another ICSI alumnus. The Sather project has already had some influence on Java development and promises to have more.
Claudio Fleiner finished his Ph.D. thesis on parallel optimizations in Sather and passed the defense at the University of Fribourg, Switzerland. The thesis was done during Fleiner's stay at ICSI. Claudio moved on to accept a research position with IBM Labs in Zurich.
Ben Gomes completed his doctoral dissertation on reusable parallel frameworks for mapping connectionist networks onto parallel machines using pSather as an implementation platform. He also continued his Sather library work. He will also join the core language group at JavaSoft. Michael Holtkamp, a student from Hamburg, has finished his thesis on thread migration with Active Threads on the CLUMPs.
Serial Sather remains fairly stable. Our current efforts are mostly concerned with further work on high-performance parallel runtimes including thread scheduling for locality.
\bibitem{Quittek} Quittek, J. \& Boris Weissman, ``Efficient Extensible Synchronization in Sather,'' The 1997 International Scientific Computing in Object-Oriented Parallel Environments Conference.
\bibitem{Fleiner-1} Fleiner, C. \& Philippsen M. ``Fair Multi-Branch Locking of Several Locks,'' Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems, Washington D.C., October 1997.
\bibitem{Weissman-1} Weissman, B., Gomes, B., Quittek J.W., Holtkamp M., ``Efficient Fine-Grain Thread Migration with Active Threads,'' 12th International Parallel Processing Symposium and 9th IEEE Symposium on Parallel and Distributed Processing (IPPS/SPDP 1998), Orlando, March 1998, to appear.
\bibitem{Gomes-1} Gomes, B., Lowe W., Quittek J.W., Weissman B., ``Safe Sharing of Objects in a High-Performance Parallel Language,'' Submitted to the 12th European Conference on Object-Oriented Programming (ECOOP 98).
\bibitem{Anlauff} Anlauff, M., Kutter, P.W., Pierantonio A., ``Formal Aspects of and Development Environments for Montages,'' 2nd International Workshop on the Theory and Practice of Algebraic Specifications, Amsterdam 1997
\bibitem{Weissman-2} Weissman, B. ``Active Threads: an Extensible and Portable Light-Weight Thread System,'' TR-97-036, ICSI, November 1997.
\bibitem{Weissman-3} Weissman, B., Gomes, B., Quittek J.W., Holtkamp M., ``A Performance Evaluation of Fine-Grain Thread Migration with Active Threads,'' TR-97-054, ICSI, December 1997.
\bibitem{Gomes-2} Gomes, B., Stoutamire, D., Weissman, B., Klawitter, H., ``Sather 1.1: A Language Manual,'' currently available on-line at http://www.icsi.berkeley.edu/~sather/Documentation/LanguageDescription/webmaker/index.html; upcoming TR, ICSI.
\bibitem{Fleiner-2} Fleiner, C., ``Advanced Constructs and Compiler Optimizations for a Parallel, Object Oriented, Shared Memory Language running on a Distributed System'', Ph.D. Thesis, #1148, University of Fribourg, Institute of Informatics, Switzerland, April 1997.
\bibitem{Stoutamire} Stoutamire, D., ``Portable, Modular Expression of Locality,'' Ph.D. Thesis, University of California at Berkeley, 1997.
\bibitem{Gomes-3} Gomes, B., ``Mapping Connectionist Networks onto Parallel
Machines: A Library Approach,'' Ph.D. Thesis, University of California
at Berkeley, 1997.
The most realistic application we have implemented to date analyzes RLL (Relay Ladder Logic) programs. RLL is an embedded control language used in most US manufacturing facilities. Bugs in RLL programs cost thousands of dollars *per minute* to fix when factory controllers crash. It is not uncommon for a single RLL bug to cost hundreds of thousands of dollars to repair.
In consultation with Rockwell we focused on the problem of detecting "relay races" in RLL programs. Relay races are very difficult to detect using standard testing techniques and are a common source of bugs in practice. Our RLL analysis was very successful at finding these bugs in very large, production RLL programs, including one known bug that had originally required four hours of factory down time to repair. It is fair to characterize this as a very positive result---very few other tools are able to process realistic-size software systems and glean useful information. A paper on this work received the Best Paper Award at the European Association of Programming Languages and Systems federated conference.
Publications
Partial Online Cycle Elimination in Inclusion Constraint Graphs. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, to appear, Montreal, Canada, June 1998 (with M. Faehndrich, J. Foster, and Z. Su).
Detecting Races in Relay Ladder Logic Programs. In Proceedings of the 1st International Conference on Tools and Algorithms for the Construction and Analysis of Systems, Lisbon, Portugal, pages 184-200, April, 1998 (with M. Faehndrich and Z. Su).
A Toolkit for Constructing Type- and Constraint-Based Program Analyses (invited paper). In Proceedings of the 2nd International Workshop on Types in Compilation, Kyoto, Japan, pages 165-169, March 1998 (with M. Faehndrich, J. Foster, and Z. Su).
Program Analysis Using Mixed Term and Set Constraints. Proceedings of the 4th International Static Analysis Symposium, Paris, France, September, 1997 (with M. Faehndrich).
Optimal Representations of Polymorphic Types with Subtyping (Extended Abstract). Theoretical Aspects of Computer Software (TACS), September, 1997 (with E. Wimmers and J. Palsberg).
Reference:
James Shin Young, Josh MacDonald, Michael Shilman, Abdallah Tabbara,
Paul Hilfinger, and A. Richard Newton, "Design and Specification of Embedded
Systems in Java Using Successive, Formal Refinement", Proceedings of Design
Automation Conference (DAC), 1998, to appear.
Eigenvectors have proved to be an invaluable computational tool in many diverse applications. To the quantum chemist they may signify wave functions; statisticians compute eigenvectors of a covariance matrix to find directions of maximum variance in the data (Principal Components Analysis), while computer scientists have lately used eigenvectors to partition graphs, segment images, and retrieve textual information (Latent Semantic Indexing). Dhillon's thesis focuses on the computation of the eigenvectors of a symmetric tridiagonal matrix T, which is an important phase in finding the eigenvectors of any symmetric matrix. Previous practical algorithms to find all n eigenvectors of T take O(n^3) time in the worst case, due to the need for Gram-Schmidt (or similar) orthogonalization when eigenvalues are close. The thesis presents a new O(n^2), embarrassingly parallel algorithm that avoids this need by: 1. finding multiple representations of T and its translates that determine the locally small eigenvalues to high relative accuracy, 2. techniques for computing such small eigenvalues to full accuracy, and 3. procedures to compute associated eigenvectors that have guaranteed tiny residual norms. An interesting facet of this work is that high accuracy in intermediate computations leads to a much faster overall algorithm.
Our ideas are well illustrated on a problem arising from joint work with computational quantum chemists at the Pacific Northwest National Laboratory (PNNL). In a problem arising in the modeling of a biphenyl molecule, our new eigensolver takes 2 seconds as opposed to the 2 minutes taken by the previous LAPACK inverse iteration algorithm. We observe considerable speedups on a variety of other test matrices. Software based on this new algorithm will soon be available as part of the LAPACK and ScaLAPACK public-domain libraries. An earlier version of this software (for distributed-memory machines) is already available in PNNL's PeIGS software library.
The execution time of a symmetric eigendecomposition depends upon the application, the algorithm, the implementation, and the computer. Symmetric eigensolvers are used in a variety of applications, and no two applications solve exactly the same eigenproblem. Many different algorithms can be used to perform a symmetric eigendecomposition, each with differing computational properties, and different implementations of the same algorithm have different computational properties as well. The computer on which the eigensolver is run not only affects execution time but may favor certain algorithms and implementations over others. Stanley's thesis explains the performance of the ScaLAPACK symmetric eigensolver, the algorithms that it uses, and other important algorithms for solving the symmetric eigenproblem on today's fastest computers.
The performance of conjugate gradient schemes for minimizing unconstrained energy functionals in the context of electronic structure calculations is studied. The unconstrained functionals allow a straightforward application of conjugate gradients by removing the explicit orthonormality constraints on the quantum-mechanical wave functions. However, the removal of the constraints can lead to slow convergence, in particular when preconditioning is used. The convergence properties of two previously suggested energy functionals are analyzed in Pfrommer's MS thesis, and a new functional is proposed which unifies some of the advantages of the other functionals. A numerical example confirms the analysis.
Blackston's thesis describes the design of several portable and efficient parallel implementations of adaptive N-body methods, including the adaptive Fast Multipole Method, the adaptive version of Anderson's method, and the Barnes-Hut algorithm. Our codes are based on a communication and work-partitioning scheme that allows an efficient implementation of adaptive multipole methods even on high-latency systems. Our test runs demonstrate high performance and speed-up on several parallel architectures, including traditional MPPs, shared-memory machines, and networks of workstations.
The parallel construction of maximal independent sets is a useful building block for many algorithms in the computational sciences, including graph coloring and multigrid coarse grid creation on unstructured meshes. We present an efficient asynchronous maximal independent set algorithm for parallel computers, intended for the ``well partitioned'' graphs that arise from finite element (FE) models. For appropriately partitioned bounded-degree graphs, it is shown that the running time of our algorithm under the CREW PRAM computational model is O(1), which is an improvement over the previous best PRAM complexity for this class of graphs. Adams presents numerical experiments on an IBM SP that confirm the PRAM complexity model is indicative of the performance one can expect with practical partitions on graphs from FE problems.
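A sequential rendering of the standard random-weight selection rule behind such algorithms (our illustration, not Adams's parallel code): each vertex carries a distinct random weight, a vertex enters the MIS once every undecided neighbor has a smaller weight, and neighbors of MIS vertices drop out. Repeating rounds until nothing changes yields a maximal independent set.

    #include <stdbool.h>

    typedef enum { UNDECIDED, IN_MIS, OUT } mis_state_t;

    typedef struct {
        int         nadj;
        const int  *adj;      /* indices of neighboring vertices */
        int         weight;   /* distinct random tie-break key   */
        mis_state_t state;
    } vertex_t;

    /* One round over all vertices; returns true if any state changed. */
    static bool mis_round(vertex_t *v, int n)
    {
        bool changed = false;
        for (int i = 0; i < n; i++) {
            if (v[i].state != UNDECIDED)
                continue;
            bool local_max = true;
            for (int j = 0; j < v[i].nadj; j++) {
                vertex_t *u = &v[v[i].adj[j]];
                if (u->state == IN_MIS) {      /* a neighbor won: drop out */
                    v[i].state = OUT;
                    local_max = false;
                    changed = true;
                    break;
                }
                if (u->state == UNDECIDED && u->weight > v[i].weight)
                    local_max = false;         /* must wait for neighbor */
            }
            if (local_max && v[i].state == UNDECIDED) {
                v[i].state = IN_MIS;
                changed = true;
            }
        }
        return changed;
    }

In the parallel setting each processor runs this rule on its own partition and only exchanges state for boundary vertices, which is why good FE partitions give constant-round behavior.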
Publications and Theses:
Tzu-Yi Chen, MS, "Balancing Sparse Matrices for Computing Eigenvalues", 1998
Inderjit Dhillon, PhD, "A New O(n^2) Algorithm for the Symmetric Tridiagonal Eigenvalue/Eigenvector Problem," 1997
Ken Stanley, PhD, "Execution time of Symmetric Eigensolvers", 1997
Bernd Pfrommer, MS, "Minimizing unconstrained electronic structure energy functionals with conjugate gradients on parallel computers", 1997
Bernd Pfrommer, H. Simon, J. Demmel, "Unconstrained Energy Functionals for Electronic Structure Calculations", submitted to J. Comp. Physics
David Blackston and Torsten Suel, "Highly Portable and Efficient Implementations of Parallel Adaptive N-Body Methods", Supercomputing 98
Mark Adams, "A maximal independent set algorithm", Fifth Copper Mountain
Conference, 1998 (Best Student Paper Award)
BLAS3 matrix-matrix operations usually have great potential for aggressive optimization. Unfortunately, they usually need to be hand-coded for a specific machine and/or compiler to achieve near-peak performance. We have developed a methodology whereby near-peak performance on such routines can be achieved automatically. First, rather than code by hand, we produce parameterized code generators whose parameters are germane to the resulting machine performance. Second, the generated code follows the PHiPAC (Portable High Performance Ansi C) coding suggestions, which include manual loop unrolling, explicit removal of unnecessary dependencies in code blocks (if not removed, C semantics would prohibit many optimizations), and use of machine-sympathetic C constructs. Third, we develop search scripts that, for a given code generator, find the best set of parameters for a given architecture/compiler. We have developed a BLAS GEMM-compatible, multi-level cache-blocked matrix-matrix multiply code generator that has achieved performance around 90% of peak on the Sparcstation-20/61, IBM RS/6000-590, HP 712/80i, SGI Power Challenge R8k, and SGI Octane R10k, and 80% on the SGI Indigo R4k. On the IBM, HP, SGI R4k, and the Sun Ultra-170, the resulting DGEMM is, in fact, faster than the vendor-optimized BLAS GEMM. Other generators, search scripts, and performance results are under development.
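To give the flavor of such generated code, here is a hand-written example of the kind of register-blocked inner kernel a generator of this sort emits: a 2x2 block of C += A*B with the accumulators held in locals so the compiler can keep them in registers. The blocking factor is a generator parameter chosen by the search scripts; this particular fragment is our illustration, not actual PHiPAC output.

    /* C (2x2, leading dim ldc) += A (2xK, leading dim lda, row-major)
     *                           * B (Kx2, leading dim ldb, row-major) */
    static void mm_kernel_2x2(int K, const double *A, int lda,
                              const double *B, int ldb,
                              double *C, int ldc)
    {
        double c00 = C[0],   c01 = C[1];
        double c10 = C[ldc], c11 = C[ldc + 1];
        for (int k = 0; k < K; k++) {
            const double a0 = A[k],       a1 = A[lda + k];
            const double b0 = B[k * ldb], b1 = B[k * ldb + 1];
            c00 += a0 * b0;  c01 += a0 * b1;
            c10 += a1 * b0;  c11 += a1 * b1;
        }
        C[0]   = c00;  C[1]       = c01;
        C[ldc] = c10;  C[ldc + 1] = c11;
    }

Holding the four accumulators in scalar locals removes the aliasing dependencies that C semantics would otherwise force the compiler to honor, which is exactly the point of the PHiPAC coding style.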
We are working toward a toolbox for generating high-performance sparse matrix kernels for uniprocessors and SMPs. In particular, sparse matrix-vector multiplication is a core routine of many applications, including large-scale iterative and eigenvalue solvers. As the initial phase of this research, we have studied several optimizations for sparse matrix-vector multiplication, including register blocking, cache blocking, and reordering. The benefits of these techniques vary widely depending on the matrix structure and machine characteristics. Register blocking appears to be the most useful of the three for finite element problems and other matrices that arise in physical simulations, because small dense sub-blocks often arise naturally in these matrices. Cache blocking has so far proven useful only on a matrix from a web search application, which has a nearly random structure. Reordering has shown some benefit across problem domains, but the improvement is small, and it is often worse than leaving the matrix in its natural order if the reordering destroys the natural sub-block structure.
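As an illustration of register blocking, the sketch below multiplies a sparse matrix stored in 2x2 block-CSR form by a vector, keeping each block row's two output accumulators in registers. The format and field names are our illustration, not the toolbox's interface.

    /* 2x2 block compressed sparse row: each stored block holds 4
     * values (row-major), and block row ib covers scalar rows
     * 2*ib and 2*ib+1. */
    typedef struct {
        int     nbrows;      /* number of block rows                  */
        int    *browptr;     /* size nbrows+1: start of each block row */
        int    *bcolidx;     /* block column index per 2x2 block      */
        double *values;      /* 4 entries per block, row-major        */
    } bcsr22_t;

    static void spmv_bcsr22(const bcsr22_t *A, const double *x, double *y)
    {
        for (int ib = 0; ib < A->nbrows; ib++) {
            double y0 = 0.0, y1 = 0.0;          /* register accumulators */
            for (int b = A->browptr[ib]; b < A->browptr[ib + 1]; b++) {
                const double *v  = &A->values[4 * b];
                const double *xx = &x[2 * A->bcolidx[b]];
                y0 += v[0] * xx[0] + v[1] * xx[1];
                y1 += v[2] * xx[0] + v[3] * xx[1];
            }
            y[2 * ib]     = y0;
            y[2 * ib + 1] = y1;
        }
    }

Each block needs only one column index for four values, and the inner loop reuses the two x entries across two output rows, which is where the win on naturally blocked FE matrices comes from.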
The Sparsity group is building a toolbox that will be provided as a
web service. It will collect information from the user about the matrix
structure and machine type, using a combination of questions and answers,
supplying example matrices, and downloading code to the user that measures
various parameters.
Media Gateways are software agents that bridge two or more conferencing sessions and process the data streams between the sessions. Examples of such processing include transcoding between two formats, rate limiting, and application of encryption or decryption. One of the primary uses of such gateways is accommodating the heterogeneity inherent in Internet conferences by applying transcoding and rate limiting on the "well connected" portion of the network. Furthermore, a gateway can provide user-level tunnels for bridging multicast-capable islands over a non-multicast-capable link. The Media Gateway (MeGa) architecture is an experimental deployment of an architecture that automatically deploys gateways on the Titan infrastructure on behalf of end users across slow-speed links. The architecture attempts to make the use of the gateways as transparent and seamless as possible. As such, the architecture incorporates the use of the conventional MBone tools: vat, wb, sdr, and vic. Vat and wb can be used in their unmodified versions, while sdr and vic are modified to work within the architecture. (See http://www.cs.berkeley.edu/~elan/mega/)
E. Amir, S. McCanne, R. H. Katz, "Receiver Driven Bandwidth Allocation for Light Weight Session," ACM Multimedia '97 Conference, Seattle, WA (November 1997). Best Paper Award.
S. McCanne, E. Brewer, R. Katz, L. Rowe, E. Amir, Y. Chawathe, A. Coopersmith, K. Mayer-Patel, S. Raman, A. Schuett, D. Simpson, A. Swan, T-L Tung, D. Wu, B. Smith, "Toward a Common Infrastructure for Multimedia-Networking Middleware," Seventh International Workshop on Network and Operating System Support for Digital Audio and Video, St. Louis, (May 1997). Invited Paper.
The recent advent of the Internet Multicast service has enabled a number of successful real-time multimedia applications, yet the scalability of these applications remains challenged by the inherent heterogeneity of the underlying Internet. One promising approach for taming this heterogeneity is to encode each media flow as a layered signal that is striped across multiple multicast groups, thereby allowing a receiver to tune its individual reception rate by modulating its subscription to multicast groups. Though significant progress has been made on media transport protocols and congestion control strategies for adjusting multicast groups in this fashion, comparatively little work has been devoted to extending the session directory service and address allocation architecture to meet the needs and requirements of layered media. Moreover, the large-scale deployment of layered media formats is hindered by the lack of support for layered formats in existing session directory tools. To overcome these limitations, we propose a new architecture for session advertisement and caching that exploits multicast ``administrative scope'' through protocol proxies to admit layered media formats and reduce the start-up latency of a directory-service client by an order of magnitude or more. Our architecture is fully compatible with the existing directory service, allowing our implementation, which is split across a new session directory tool and network proxy, to be incrementally deployed within the current Internet multimedia conferencing architecture.
Internet video is emerging as an important multimedia application area. Although development and use of video applications is increasing, the ability to manipulate and process video is missing within this application area. Current video effects processing solutions are not well matched for the Internet video environment. A software-only solution, however, provides enough flexibility to match the constraints and needs of a particular video application. The key to a software solution is exploiting parallelism. Mayer-Patel's papers present the design of a parallel software-only video effects processing system. Preliminary experimental results exploring the use of temporal parallelism are presented. In Wong's paper, we describe the design and implementation of a software video production switcher, vps, that improves the quality of MBone broadcasts. vps is modeled after the broadcast television industry's studio production switcher. It provides special effects processing to incorporate audience discussions, add titles and other information, and integrate stored videos into the presentation. vps is structured to work with other MBone conferencing tools. The ultimate goal is to automate the production of MBone broadcasts.
Andrew Swan, Steven McCanne, and Lawrence A. Rowe, Layered Transmission and Caching for the Multicast Session Directory Service, to appear in Proceedings of the Sixth Annual ACM International Multimedia Conference, September 1998. Best Paper Award. (students: A. Swan MS 12/97)
Ketan Mayer-Patel and Lawrence A. Rowe, Exploiting Temporal Parallelism for Software-only Video Effects Processing, to appear in Proceedings of the Sixth Annual ACM International Multimedia Conference, September 1998. (students: K. Mayer-Patel MS 12/97)
T. Wong, K. Mayer-Patel, D. Simpson, and L.A. Rowe, A Software-Only Video Production Switcher for the Internet MBone, Multimedia Computing and Networking 1998, Proc. IS&T/SPIE Symposium on Electronic Imaging: Science & Technology, San Jose, CA, January 1998. (students: T. Wong MS 12/97, D. Simpson MS (forthcoming))
K. Mayer-Patel and L.A. Rowe, "Design and Performance of the Berkeley Continuous Media Toolkit," in Multimedia Computing and Networking 1997, Martin Freeman, Paul Jardetzky, Harrick M. Vin, Editors, Proc. SPIE 3020, pp 194-206 (1997). (students: K. Mayer-Patel)
R. Malpani and L.A. Rowe, "Floor Control for Large-Scale MBone Seminars," Proceedings of the Fifth Annual ACM International Multimedia Conference, Seattle, WA, November 1997, pp 155-163. (students: Malpani MS 5/97)
Swan and Mayer-Patel are PhD students. Mayer-Patel has passed quals and will finish 6/99. Swan will take quals in fall 98 and expects to finish his PhD 6/00.
The project focuses on extending the expressiveness of probabilistic models, extending the scope of learning algorithms, and developing substantial scientific applications. In the last year, we have obtained the following results requiring substantial computational resources: \begin{itemize} \item {\it Methods for automatic creation and modification of model structures.}\\ A new algorithm, Structural EM (Friedman, 1997), represents a significant development in computational statistical methods. SEM generalizes the well-known EM to incorporate structural as well as parametric learning, while retaining the convergence guarantees of EM. We have applied SEM to learn substantial models, including DBN models of speech generation that substantially outperform hidden Markov models for recognition (Zweig \& Russell, 1998). \item {\it Methods for combined model learning and reinforcement learning in partially observable environments.}\\ Andre et al.~(1997) have shown how DBN models can be combined with reinforcement learning, providing a powerful method for adaptive control of Markov processes. Dearden et al.~(1998) developed a Bayesian formulation of reinforcement learning and derived improved exploration algorithms, with application to several hard problems from the literature. \item {\it Automated extraction of human driver models from videotapes.}\\ We have developed simple DBN models of human drivers and trained them directly from vehicle tracking data (Oza \& Russell, submitted). \item {\it Adaptive hierarchical control for large Markov processes.}\\ For very large Markov processes, tractable control policies must be hierarchically structured. In (Parr \& Russell, 1997), a language is proposed for describing partially specified hierarchical policies. Algorithms are given for efficient online learning of optimal policies consistent with the prior specifications. These results may make practical a theoretically rigorous approach to the control of very large systems. Solution of systems with several thousand states has been demonstrated. \item {\it Object identification under uncertainty.}\\ When observing over time a system consisting of multiple objects, state estimation and model learning require computing the probability that one observed object is in fact the same as another. We developed appropriate probabilistic models and inference algorithms for this problem, and applied them to estimate freeway travel times using video data streams from widely separated camera sites. The resulting paper (Huang \& Russell, 1998) received a Distinguished Paper Award at IJCAI 97 (out of 816 submissions) and will appear as an invited paper in AIJ. This research required both massive computational resources and massive storage facilities for video sequence data. \end{itemize}
Selected Relevant Publications (out of 24 papers total)
\begin{enumerate} \item D. Andre, N. Friedman, and R. Parr, ``Generalized Prioritized Sweeping.'' In {\em NIPS '97}. \item J. Binder, D. Koller, S. Russell, K. Kanazawa, ``Adaptive Probabilistic Networks with Hidden Variables.'' {\it Machine Learning}, {\bf 29}, 213--244, 1997a. \item N. Friedman, ``Learning belief networks in the presence of missing values and hidden variables.'' In {\em ICML-97}. \item G. Zweig and S. Russell, ``Speech Recognition with Dynamic Bayesian Networks.'' In {\em AAAI-98}.
\item John Binder, Kevin Murphy, Stuart Russell, ``Space-Efficient Inference in Dynamic Probabilistic Networks.'' In {\em Proc.~Fifteenth International Joint Conference on Artificial Intelligence}, Nagoya, Japan, 1997b.
\item Sanjoy Dasgupta, ``The sample complexity of learning fixed-structure Bayesian networks.'' {\it Machine Learning}, {\bf 29}, 165--180, 1997.
\item Jeffrey Forbes, Nikunj Oza, Ronald Parr, and Stuart Russell. ``Feasibility Study of Fully Automated Traffic Using Decision-Theoretic Control.'' California PATH Research Report UCB-ITS-PRR-97-18, Institute of Transportation Studies, University of California, Berkeley. April 1997.
\item N. Friedman, D. Geiger, and M. Goldszmidt, ``Bayesian Network Classifiers.'' {\it Machine Learning}, {\bf 29}, 131--164, 1997.
\item N. Friedman and M. Goldszmidt, ``Sequential update of Bayesian network structure.'' In {\it Proc. Thirteenth Conference on Uncertainty in Artificial Intelligence (UAI)}, Providence, RI, 1997.
\item N. Friedman and M. Goldszmidt, ``Learning Bayesian Networks with Local Structure.'' To appear in M. I. Jordan (Ed.) {\it Learning and Inference in Graphical Models}, 1997.
\item Nir Friedman, Stuart Russell, ``Image Segmentation in Video Sequences.'' In {\it Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence}, Providence, Rhode Island: Morgan Kaufmann, 1997.
\item Daishi Harada, ``Reinforcement learning in time.'' In {\em Proc.~AAAI-97}, Providence, RI, 1997.
\item Timothy Huang and Stuart Russell, ``Object Identification in a Bayesian Context.'' {\it Artificial Intelligence}, to appear (invited paper).
\item Timothy Huang and Stuart Russell, ``Object identification in a Bayesian context.'' Distinguished Paper Prize, in {\em Proc.~Fifteenth International Joint Conference on Artificial Intelligence}, Nagoya, Japan, 1997.
\item Kevin Murphy, ``Inference and Learning in Hybrid Bayesian Networks.'' Technical Report, Computer Science Division, University of California, Berkeley. January 1998.
\item Ron Parr and Stuart Russell, ``Reinforcement Learning with Hierarchies of Machines.'' In {\em NIPS '97: Neural Information Processing Systems}, Denver, 1997.
\item Stuart Russell, Lewis Stiller, and Othar Hansson, ``PNPACK: Computing with Probabilities in Java.'' {\it Concurrency: Practice and Experience}, {\bf 9}, 1333--1339, 1997.
\item Prasad Tadepalli and Stuart Russell, ``Learning from Examples and Membership Queries with Structured Determinations.'' {\it Machine Learning}, to appear.
\item Richard Dearden, Nir Friedman, and Stuart Russell, ``Bayesian Q-Learning.'' To appear in {\it AAAI-98}.
\item Nir Friedman, Kevin Murphy, and Stuart Russell, ``Learning the
Structure of Dynamic Probabilistic Networks.'' To appear in {\it UAI-98}.
Students graduated: Nikunj Oza (MS); Geoff Zweig (PhD); Tim Huang (PhD); Ron Parr (PhD); Othar Hansson (PhD).
Our approach is a coordinated attack on the elements needed to demonstrate a combined processor and reconfigurable array: the design of a configurable array architecture that includes features making it more efficient for tight coupling with a processing core; a programming system that can take advantage of the processing core and the reconfigurable resources; a prototype chip implementation of the combined device to verify its practicality; and a demonstration of the efficiency of the device on a set of applications. We have used thousands of CPU hours on the Titan/NOW infrastructure. We have specified and begun detailed design and layout work on an advanced high-speed reconfigurable array; specified a prototype chip design that merges DRAM and a reconfigurable array; completed a compilation path from C to our Garp chip; developed and demonstrated a library element generator system in the Java programming language; provided interface specifications to the Ptolemy group for use as an intermediate target when mapping designs to FPGA/RC devices; developed a working strategy for adding placement directives; and worked out a scheme for architecture-specific library inheritance to ease portability. A paper on the generator system has been written and accepted to FCCM'98. Project web page: http://www.cs.berkeley.edu/projects/brass/
T. Callahan, P. Chong, A. DeHon, J. Wawrzynek, "Fast Module Mapping and Placement for Datapaths in FPGAs'' Published in Proceedings of the 1998 ACM/SIGDA Sixth International Symposium on Field Programmable Gate Arrays (FPGA '98, February 22-24, 1998).
T. Callahan, J. Wawrzynek, "Datapath-oriented FPGA Mapping and Placement for Configurable Computing," presented at FCCM'97, Napa Valley, CA (April 1997).
J. Hauser, and J. Wawrzynek, "Garp: A MIPS Processor with a Reconfigurable Coprocessor," published in Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'97, April 16-18, 1997), pp. 24-33.
The IRAM design has been specified in terms of block diagrams, pins, and functionality. Current work includes the development of a Verilog model for the design and initial circuit synthesis. Full custom circuits for the low-swing interconnect schemes have also been sized. Project web page: http://iram.cs.berkeley.edu/ .
C. Kozyrakis, S. Perissakis, D. Patterson, T. Anderson, K. Asanovic, N. Cardwell, R. Fromm, J. Golbus, B. Gribstad, K. Keeton, R. Thomas, N. Treuhaft, K. Yelick, "Scalable Processors in the Billion-Transistor Era: IRAM," IEEE Computer, special issue: Future Microprocessors - How to Use a Billion Transistors, September 1997.
D. Patterson, K. Asanovic, A. Brown, R. Fromm, J. Golbus, B. Gribstad, K. Keeton, C. Kozyrakis, D. Martin, S. Perissakis, R. Thomas, N. Treuhaft, K. Yelick, "Intelligent RAM (IRAM): the Industrial Setting, Applications, and Architecture," Proceedings of ICCD '97, the International Conference on Computer Design, Austin, Texas, 10-12 October 1997.
D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, K. Yelick, "Intelligent RAM (IRAM): Chips that Remember and Compute," 1997 IEEE International Solid-State Circuits Conference, San Francisco CA, 6-8 February 1997.
D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, K. Yelick, "A Case for Intelligent DRAM: IRAM," IEEE Micro, April 1997.
D. Patterson, T. Anderson, N. Cardwell, R. Fromm, C. Kozyrakis, B. McGaughy, S. Perissakis, K. Yelick, "The Energy Efficiency of IRAM Architectures," ISCA '97: The 24th Annual International Symposium on Computer Architecture, Denver, CO, 2-4 June 1997.
N. Bowman, N. Cardwell, C. Kozyrakis, C. Romer, H. Wang, "Evaluation of Existing Architectures in IRAM Systems," Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, at ISCA '97, Denver, CO, 1 June 1997.
D. Patterson, R. Arpaci-Dusseau, K. Keeton, "IRAM and SmartSIMM: Overcoming the I/O Bus Bottleneck," Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, at ISCA '97, Denver, CO, 1 June 1997.
[1] ``Searching for the Sorting Record: Experience with NOW-Sort'' Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, David A. Patterson. To appear in 2nd SIGMETRICS Symposium on Parallel and Distributed Tools, August 1998.
[2] ``The Architectural Costs of Streaming I/O: A Comparison of Workstations, Clusters, and SMPs'' Remzi H. Arpaci-Dusseau, Andrea C. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, David A. Patterson. High Performance Computer Architecture, Feb. 1998.
[3] ``High-Performance Sorting on Networks of Workstations'' Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, David A. Patterson. ACM SIGMOD Conference on Management of Data, May 1997.
The simulator was used to predict the stable pose distribution of real industrial parts. The results are described in \cite{MZG96}. This paper includes experimental data on stable poses for a number of different parts; the simulation results correlate very well with the experimental data, supporting the realism of IMPULSE for simulations involving many rigid-body collisions.
Another paper described the use of IMPULSE as a design tool for vibratory bowl feeders. Vibratory bowls use strategically placed slots and fences to deflect parts that are not properly oriented. They are among the most widely used devices in manufacturing, but their design is currently an entirely manual exercise. Some design principles have been suggested by Boothroyd, but these assume that the statistical effects of the slots and fences on a given part are already known.
The paper \cite{Berk96} describes a simulation-only approach to feeder design. A geometric model of the feeder track is built first, with one or more of the design parameters left variable. A supervisory program systematically varies those parameters and performs a simulation on each resulting geometry. Each simulation includes parts in several different initial poses. For each geometry, the success rate is computed for each feature and initial pose and used to construct a state-transition diagram for the feeder. A follow-up paper \cite{BC97} includes some improvements in the design tool and compares the results with trials on a real feeder.
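The supervisory loop itself is simple; the Java sketch below illustrates its structure. The simulateFeeder() call is a hypothetical stand-in for a full IMPULSE rigid-body simulation, and the fence angle is an arbitrary example parameter.

\begin{verbatim}
// Minimal sketch of the supervisory parameter sweep described above.
// simulateFeeder() stands in for an IMPULSE run and is hypothetical;
// in the real tool each call is a full rigid-body simulation.
import java.util.Random;

class FeederSweep {
    static final Random rng = new Random(42);

    // Placeholder for an IMPULSE simulation: returns true if a part
    // starting in the given pose leaves the fence correctly oriented.
    static boolean simulateFeeder(double fenceAngleDeg, int initialPose) {
        return rng.nextDouble() < 0.5;  // stand-in result only
    }

    public static void main(String[] args) {
        int poses = 4, trialsPerPose = 25;
        // Systematically vary one design parameter (the fence angle).
        for (double angle = 20.0; angle <= 60.0; angle += 10.0) {
            for (int pose = 0; pose < poses; pose++) {
                int ok = 0;
                for (int t = 0; t < trialsPerPose; t++)
                    if (simulateFeeder(angle, pose)) ok++;
                System.out.printf("angle=%4.1f pose=%d success=%.2f%n",
                                  angle, pose, (double) ok / trialsPerPose);
            }
        }
    }
}
\end{verbatim}

The per-geometry, per-pose success rates printed by such a sweep are exactly the quantities needed to fill in the feeder's state-transition diagram.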
Simulation was also used to study a new design of planar motion arrays using MEMS. The paper \cite{RBC96} describes our design efforts for a particular MEMS array using IMPULSE. We studied several design-parameter changes and were able to propose a solution to the binding problem that these arrays exhibit: we found in simulation that parts being fed would stick at the boundaries between feeder rows, and later we found that this happens in the real arrays as well.
\bibitem[BC96]{Berk96} Dina Berkowitz and John Canny. \newblock Designing part feeders with dynamic simulation. \newblock In {\em IEEE Conference on Robotics and Automation}, pages 1127--1132. IEEE, 1996.
\bibitem[BC97]{BC97} Dina~R. Berkowitz and John Canny. \newblock A comparison of real and simulated designs for vibratory parts feeding. \newblock In {\em Proceedings of the IEEE Conference on Robotics and Automation}, 1997.
\bibitem[MC95]{MC95:dynsim} Brian Mirtich and John Canny. \newblock Impulse-based simulation of rigid bodies. \newblock In {\em Symp. on Interactive 3D Graphics}, 1995. \newblock Monterey, CA.
\bibitem[Mir95]{Mir95} Brian Mirtich. \newblock Hybrid simulation: Combining constraints and impulses. \newblock In {\em Proceedings of First Workshop on Simulation and Interaction
in Virtual Environments}, 1995.
\bibitem[MZG{\etalchar{+}}96]{MZG96} Brian Mirtich, Yan Zhuang, Ken Goldberg, John Craig, Rob Zanutta, Brian Carlisle, and John Canny. \newblock Estimating pose statistics for robotic part feeders. \newblock In {\em IEEE International Conference on Robotics and Automation}, May 1996. \newblock Minneapolis.
\bibitem[RBC96]{RBC96} D.~Reznik, S.~Brown, and J.~Canny. \newblock Dynamic simulation as a design tool for a microactuator array. \newblock In {\em IEEE Conf. on Robotics and Automation (ICRA)}, 1996. \newblock Albuquerque, NM.
GRADS: Brian Mirtich completed his Ph.D. in Spring 1996 and is now a researcher at MERL in Cambridge, Mass.
Our main web server has moved from the older HP equipment to our new Sun Ultra Enterprise 3000 server, a dual-processor model with 512MB of memory and 42GB of disk. We have since expanded the system with 50GB of additional disk space, 2 F/W SCSI-2 controllers, and a Sony SDX-300 AIT tape drive for backups. This server now hosts all of our on-line web pages and data for our image, document, and geographic data collections (higher-resolution versions are stored on tertiary storage). The system also operates as a database server and compute server for OCR, image processing, and other supporting functions for the project. On the software side, we have migrated from the NCSA web server to the Apache web server.
As our collections continue to grow, we have also acquired disk space on the Sun Sonoma file servers provided through the Titan extension and major Sun donations. This facility uses RAID-5 for increased reliability and has relieved the space pressure that we were experiencing on our production server. All of the document data and indexing storage has now been moved to the Sonomas. Some of the freed space has been used to extend overfull partitions and to consolidate our image collection; further consolidation and expansion is currently underway.
HP has donated 2 new C160 workstations, 3 new Vectra PCs, and 5 HP 5P color scanners. The new Vectra PCs help support our Java development environment, and the C160 workstations have replaced older desktop workstations within the group. We plan to use the scanners to explore further ways to share information and are looking to incorporate them into our daily work. For example, we have used them to scan handwritten meeting notes and post them on the network, and to scan and OCR a printed paper in order to digitize its references.
In the summer of 1996, Intel donated 11 200-MHz Pentium Pro PCs to support our project efforts. Each PC came fully equipped with 64MB of RAM, an 8x CD-ROM drive, a Fast EtherLink PCI 10/100Base-T network interface card, an Adaptec Ultra-Wide SCSI adapter card, and a Matrox MGA Millennium graphics card with 2MB of VRAM for 64-bit graphics. In addition, Microsoft has supplied us with nearly every piece of PC software it produces, on each of our 11 Windows NT machines.
Prof. Fateman has used Titan to implement a web server providing symbolic integration table lookup (http://http.cs.berkeley.edu/~fateman/htest.html). Titan was also used for a heavy computation (about a week of CPU time): a high-order Taylor series expansion describing the energy dissipation in a classic three-dimensional vortex problem. The program was written in Macsyma and recompiled under Allegro Common Lisp.
References:
Gary Kopec. ``Multilevel character templates for document image decoding'', in Document Recognition IV, L. Vincent and J. Hull, editors, Proc. SPIE vol. 3027, 1997.
Serge Belongie, Chad Carson, Hayit Greenspan, and Jitendra Malik, ``Color- and Texture-based Image Segmentation Using EM and Its Application to Content-Based Image Retrieval.'' International Conference on Computer Vision, Jan. 4-7, 1998, Bombay, India.
Serge Belongie and Jitendra Malik. ``Finding Boundaries in Natural Images: A New Method Using Point Descriptors and Area Completion.'' Submitted to the European Conference on Computer Vision, 1998, Freiburg, Germany.
Michael Buckland. ``What is a Document?'' Journal of the American Society for Information Science. 48(9), pp. 804-809, 1997.
Michael Buckland and Christian Plaunt. ``Selecting Libraries, Selecting Documents, Selecting Data''. International Symposium on Research, Development and Practice in Digital Libraries, ISDL 97, pp. 85-91. Nov. 18-21, 1997, University of Library and Information Science, Tsukuba City, Japan.
Michael Buckland, Youngin Kim and Barbara Norgard. ``Search Support for Unfamiliar Metadata Vocabularies.'' Unpublished manuscript.
Chad Carson, Serge Belongie, Hayit Greenspan, and Jitendra Malik. ``Region-Based Image Querying.'' Workshop on Content-Based Access of Image and Video Libraries, associated with the Conference on Computer Vision and Pattern Recognition, June 20, 1997, San Juan, Puerto Rico.
Chad Carson, Serge Belongie, Hayit Greenspan, and Jitendra Malik. ``Color- and Texture-Based Image Segmentation Using EM and Its Application to Image Querying and Classification.'' Submitted to Pattern Analysis and Machine Intelligence.
Richard J. Fateman. ``More Versatile Scientific Documents,'' Proceedings of the Fourth International Conference on Document Analysis and Recognition, Ulm, Germany, 18-20 Aug. 1997, IEEE Computer Society, 1997, vol. 2, pp. 1107-1110.
Richard Fateman. ``How to Find Mathematics on a Scanned Page.'' Unpublished manuscript.
D.A. Forsyth and M.M. Fleck. ``Finding People and Animals by Guided Assembly'', Proc. International Conference on Image Processing, Santa Barbara, 1997.
David Forsyth, Jitendra Malik, and Robert Wilensky. ``Searching for Digital Pictures.'' Scientific American, June 1997.
Gary Kopec. ``An EM Algorithm for Character Template Estimation.'' Submitted to IEEE Trans. PAMI.
Ray R. Larson and Jerome McDonough. ``Cheshire II at TREC 6: Interactive Probabilistic Retrieval.'' In: The Sixth Text REtrieval Conference, D.K. Harman and E.M. Voorhees, eds. (in press)
Thomas Leung and Jitendra Malik. ``Contour Continuity in Region Based Image Segmentation.'' Submitted to the European Conference on Computer Vision, 1998, Freiburg, Germany.
Ginger Ogle. California Native Plant Society newsletter. Oct. 1997.
Thomas A. Phelps and Robert Wilensky. ``Multivalent Annotations.'' In the Proceedings of the First European Conference on Research and Advanced Technology for Digital Libraries, September 1-3, 1997 Pisa, Italy.
Christian Plaunt and Barbara Norgard. ``An Association Based Method for Automatic Indexing with a Controlled Vocabulary.'' To appear in the Journal of the American Society for Information Science.
Lisa R. Schiff, Nancy A. Van House, and Mark H. Butler. ``Understanding Complex Information Environments: a Social Analysis of Watershed Planning.'' Digital Libraries '97: Proceedings of the ACM Digital Libraries Conference, Philadelphia, PA, July, 1997. pp. 161-186.
Jianbo Shi and Jitendra Malik. ``Normalized Cuts and Image Segmentation.'' International Conference on Computer Vision, Jan. 4-7, 1998, Bombay, India.
Nancy A. Van House, Mark H. Butler, and Lisa R. Schiff. ``The Situated Nature of Information: Practices and Artifacts.'' Submitted to the Journal of the American Society for Information Science.
Robert Wilensky and Isaac Cheng. ``An Experiment in Enhancing Information Access by Natural Language Processing.'' UC Berkeley Computer Science Technical Report UCB/CSD-97-963, June 1997.
Richard Fateman: ``Symbolic Computation of Turbulence and Energy Dissipation in the Taylor Vortex Model,'' International Journal of Modern Physics C, Vol. 9, No. 3 (May 1998).
In [R], a novel approach to the classical problem of approximating the permanent of a positive real matrix is investigated. The method substitutes for each matrix entry a suitably chosen random element of a Clifford algebra. Surprisingly, this yields an algorithm that is more efficient in the worst case than any known competitor. In developing the algorithm, large symbolic computations were performed in order to prove certain algebraic identities; these identities were crucial to both the design and the analysis of the algorithm.
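To convey the flavor of the method, the sketch below implements its simplest ancestor, the Godsil-Gutman estimator, which substitutes independent random signs rather than Clifford algebra elements: taking $B_{ij} = r_{ij}\sqrt{A_{ij}}$ with the $r_{ij}$ independent and uniform on $\{-1,+1\}$ gives $E[\det(B)^2] = \mathrm{per}(A)$, so averaging $\det(B)^2$ over many samples yields an unbiased estimate. The Clifford-algebra substitution of [R], which reduces the variance of this estimator, is not reproduced here.

\begin{verbatim}
// Godsil-Gutman Monte Carlo permanent estimator (the simple +/-1
// variant, not the Clifford-algebra method of [R]).
import java.util.Random;

class PermanentEstimator {
    static final Random rng = new Random();

    // Determinant by Gaussian elimination with partial pivoting.
    static double det(double[][] m) {
        int n = m.length;
        double d = 1.0;
        for (int c = 0; c < n; c++) {
            int p = c;
            for (int r = c + 1; r < n; r++)
                if (Math.abs(m[r][c]) > Math.abs(m[p][c])) p = r;
            if (m[p][c] == 0.0) return 0.0;
            if (p != c) { double[] t = m[p]; m[p] = m[c]; m[c] = t; d = -d; }
            d *= m[c][c];
            for (int r = c + 1; r < n; r++) {
                double f = m[r][c] / m[c][c];
                for (int k = c; k < n; k++) m[r][k] -= f * m[c][k];
            }
        }
        return d;
    }

    // Average det(B)^2 over many random sign substitutions B.
    static double estimatePermanent(double[][] a, int samples) {
        int n = a.length;
        double sum = 0.0;
        for (int s = 0; s < samples; s++) {
            double[][] b = new double[n][n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    b[i][j] = (rng.nextBoolean() ? 1 : -1) * Math.sqrt(a[i][j]);
            double d = det(b);
            sum += d * d;
        }
        return sum / samples;
    }

    public static void main(String[] args) {
        double[][] ones = {{1,1,1},{1,1,1},{1,1,1}};  // per = 3! = 6
        System.out.println(estimatePermanent(ones, 200000));
    }
}
\end{verbatim}

For the $3\times 3$ all-ones matrix the true permanent is $3! = 6$; the sample mean converges to it only slowly because the plain sign estimator has high variance, and reducing that variance is precisely the point of the Clifford-algebra approach.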
References
[RSW] Yuval Rabani, Alistair Sinclair, and Rolf Wanka, ``Local divergence of Markov chains and the analysis of iterative load-balancing schemes,'' submitted to the IEEE Symposium on Foundations of Computer Science, 1998.
[R] Lars Rasmussen, ``New results in approximate counting,'' Ph.D. thesis, Computer Science Division, UC Berkeley, to be filed July 1998.
Such seamless integration of high-performance physical simulations requires large quantities of computing power and the ability to distribute information dynamically between simulators and visualization clients. To that end, we are investigating methods for handling the problems of real-time distributed simulation-visualization data management. The Berkeley Architectural Walkthru has already addressed some of the problems of distributed visualization and of the interaction between the user and the virtual world. In our recent work, we have shown that the basic virtual environment structure used in the Walkthru, a spatial subdivision of the world into densely occluded cells with connecting portals, can be put to good use for simulation data management. In addition to optimizing the visualization task, it is also useful for optimizing bandwidth requirements between a visualizer and simulator running on networked workstations, both for the purpose of communicating conditions to the simulator and communicating simulated states back to the visualizer. Using this structure, we can optimize bandwidth requirements for arbitrarily large visualizations and simulations, and relieve the visualization and simulation designers of the complexity of the data management problem. We are currently extending this solution to multiple distributed visualizers and simulators operating on one virtual world, using networked Windows NT computers and Silicon Graphics workstations to create dynamic, physically realistic, multiuser distributed virtual worlds.
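As a minimal illustration of how a cell-and-portal subdivision bounds communication, the Java sketch below computes the set of cells whose simulation state a visualization client must subscribe to. The fixed portal depth is a crude stand-in for the Walkthru's actual cell-to-cell visibility computation, and the classes here are illustrative, not our implementation.

\begin{verbatim}
// Illustrative sketch (not the Walkthru implementation) of using a
// cell-and-portal subdivision for simulation data management: the
// visualizer subscribes only to updates for cells reachable from the
// viewer's cell through portals, bounding network bandwidth.
import java.util.*;

class CellPortalCulling {
    // Adjacency via portals: cell id -> neighboring cell ids.
    static Map<Integer, List<Integer>> portals = new HashMap<>();

    // Cells whose simulation state this client needs, out to a fixed
    // portal depth (a crude stand-in for actual visibility tests).
    static Set<Integer> cellsOfInterest(int viewerCell, int maxDepth) {
        Set<Integer> seen = new HashSet<>(List.of(viewerCell));
        Deque<int[]> queue = new ArrayDeque<>();
        queue.add(new int[]{viewerCell, 0});
        while (!queue.isEmpty()) {
            int[] cur = queue.poll();
            if (cur[1] == maxDepth) continue;
            for (int next : portals.getOrDefault(cur[0], List.of()))
                if (seen.add(next)) queue.add(new int[]{next, cur[1] + 1});
        }
        return seen;
    }

    public static void main(String[] args) {
        portals.put(0, List.of(1)); portals.put(1, List.of(0, 2));
        portals.put(2, List.of(1, 3)); portals.put(3, List.of(2));
        // Viewer in cell 0, subscribing two portals deep: {0, 1, 2}.
        System.out.println(cellsOfInterest(0, 2));
    }
}
\end{verbatim}

Because the subscription set depends only on the local cell structure around the viewer, the bandwidth between visualizer and simulator stays bounded no matter how large the overall model grows.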
Recent results were reported at the following conferences:
Bukowski, R.W. and Sequin, C.H. Performance Evaluation in a Virtual Environment, Part III: Understanding Performance Through an Interactive Environment. To appear in Proceedings of the Second International Conference on Performance-Based Codes and Fire Safety Design Methods (Maui, Hawaii, May 1998).
Bukowski, R.W. and Sequin, C.H. Interactive Simulation of Fire in Virtual Building Environments. Proceedings of SIGGRAPH 97 (Los Angeles, CA, August 1997).
Bukowski, R.W. and Sequin, C.H. The FireWalk System: Fire Modeling in Interactive Virtual Environments. Proceedings of the 2nd International Conference on Fire Research and Engineering (Gaithersburg, MD, August 1997).
In this research we are developing a simple and clean language for use as a digital interface for rapid prototyping of mechanical parts using Solid Free-Form Fabrication (SFF) or a special machining approach called CyberCut, in which the part to be fabricated is encapsulated and rigidly held in place with a special plastic material that can later be removed easily. The role of this "Solid Interchange Format" (SIF) is to describe the desired solid part in an unambiguous, fabrication-process-independent way, so that the interaction between designers and fabricators can be simplified and streamlined. As in the current proposal, a key issue is first to understand the interactions between designers and fabricators and to capture the semantics of the information that needs to flow across this interface. A robust yet efficient language must then be developed to serve this purpose.
During the first year of our contract period we defined the language and began to use it in research as well as in classroom settings. We also had some parts fabricated with a new interactive CAD tool that describes its output in SIF_DSG, a special dialect developed for the CyberCut machining environment.
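Purely as an illustration of the kind of description SIF is intended to support, and emphatically not the actual SIF grammar, the Java sketch below emits an s-expression-style, process-independent description of a simple rectangular part:

\begin{verbatim}
// Illustrative emitter for an s-expression-style solid description;
// this is NOT the actual SIF grammar, only a sketch of an unambiguous,
// fabrication-process-independent part description.
class SifSketch {
    // Describe a dx x dy x dz rectangular block as an extruded rectangle.
    static String block(int dx, int dy, int dz) {
        return String.format(
            "(extrusion (polygon (0 0) (%d 0) (%d %d) (0 %d)) %d)",
            dx, dx, dy, dy, dz);
    }

    public static void main(String[] args) {
        // One description of the part, usable by any fabricator,
        // whether SFF or CyberCut machining.
        System.out.println("(sif (units mm) " + block(40, 20, 10) + ")");
    }
}
\end{verbatim}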
S. McMains, C.H. Sequin, "SIF: The Emerging Solids Interchange Format", Fifth SIAM Conference on Geometric Design, Nov 3-6, 1997, Nashville, TN.
Most of the final projects are related to the students' research interests,
and many are likely to be continued as part of their thesis research. The
projects fall roughly into two categories: parallel applications and system
support for parallel machines. Among the first category, there is
a parallel stiff ODE integrator, a comparison of parallel direct solver
software for finite element methods, a radiative transfer algorithm based
on Monte Carlo simulation, and a solution of the transport-of-intensity
equation. There are also two applications that come from problem
domains outside of scientific computing, one of which is a parallelization
of a database join. Finally, there is a project to build support for
interactive parallel visualization.
Among the second category of systems developments, there is a program to
simulate a multiprocessor with a large number of processors on a system
with a smaller number of processors; this will allow researchers to investigate
scaling issues in algorithms and systems. There are three projects related
to thread support: one adds threads to the SPMD model in the Titanium
language, another studies the performance problems of threads and caching
on SMPs in pSather, and the last looks at the problem of building a thread-safe
library from an unsafe one, with tools to aid in this conversion.
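Since several of these projects build on the SPMD model, the Java sketch below illustrates that model in its simplest form: a fixed set of threads run the same program, each distinguished by an id, synchronizing at barriers. This is only an illustration; Titanium's actual SPMD and barrier constructs differ.

\begin{verbatim}
// A minimal Java sketch of the SPMD execution model that Titanium
// extends: P threads all run the same program, each with its own id,
// synchronizing at barriers. Illustrative only.
import java.util.concurrent.CyclicBarrier;

class SpmdSketch {
    public static void main(String[] args) throws Exception {
        final int P = 4;                          // number of SPMD threads
        final CyclicBarrier barrier = new CyclicBarrier(P);
        Thread[] threads = new Thread[P];
        for (int id = 0; id < P; id++) {
            final int myId = id;
            threads[id] = new Thread(() -> {
                try {
                    // Phase 1: every thread does its share of the work.
                    System.out.println("thread " + myId + ": local phase");
                    barrier.await();              // all threads synchronize
                    // Phase 2 begins only after all threads reach the barrier.
                    if (myId == 0) System.out.println("all threads past barrier");
                } catch (Exception e) { throw new RuntimeException(e); }
            });
            threads[id].start();
        }
        for (Thread t : threads) t.join();
    }
}
\end{verbatim}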