Research
I earned my Ph.D. in Applied Mathematics from Brown University and am currently a joint appointee between the University of Colorado at Boulder (UCB) and the National Center for Atmospheric Research (NCAR), where I hold appointments as a tenured Associate Professor in the Department of Computer Science at UCB and as a Scientist III at NCAR. As a computational scientist I have made contributions to scientific application development, high-performance computing, parallel algorithms and architectures, high-productivity computing, numerical methods, and scientific visualization. I have published extensively in a broad spectrum of journals and peer-reviewed conferences (69 publications to date), been PI or Co-I on over $10M in grants at UCB, taught a variety of courses, and successfully mentored 38 scientists, software engineers, postdoctoral researchers, and Ph.D., M.S., and undergraduate students. My work has been recognized with Gordon Bell prizes in 1999 and 2000 for demonstrated excellence in high-performance and large-scale parallel computing. In my capacity as Faculty Director of Research Computing at UCB and Section Head of the Computer Science Section at NCAR, I have administrative experience in managing computational facilities and personnel, including responsibility for NCAR’s TeraGrid facilities, the expansion of the Blue Gene/L system to four racks last summer, and the deployment of a new 183.9 teraflop/s system at CU this past fall. The annual budget for my research group of 15 people is approximately $3M.
Since coming to Boulder I have built a computational science program that spans both institutions and have worked to lay its foundations in terms of people, organization, funding, computational resources, and research collaborations. Leading the Computer Science Section at NCAR enables me to couple research activities between the two organizations, to involve UCB students in NCAR research, and to pursue a unique opportunity to modernize the computational science capabilities of a geophysical modeling center. To provide a focus for my UCB group’s efforts, I founded the Computational Science Center (CSC - http://csc.cs.colorado.edu), a multidisciplinary research center with initiatives in high-performance, parallel, cluster, and grid-computing systems, emerging parallel architectures, and scientific application development; CSC also architects and manages UCB supercomputing facilities.
My activities have expanded the traditional NCAR collaborator base beyond the atmospheric and geoscience communities to include computer and computational scientists, and have developed collaborations outside of the traditional scope of each organization. Of particular note is my leadership in bringing high-performance computing resources to the university community, usually with the support of a multi-institutional consortium. For example, in 2005, I led a proposal to acquire an IBM BlueGene/L (BG/L) system located at NCAR but serving a consortium of researchers at NCAR, UCB, and the University of Colorado at Denver / Health Sciences (UCD-HS). This experimental, massively parallel system was a test bed for systems design research, but continues to support computational science among UCB, NCAR, and UCD-HS scientists; the research performed on the BG/L resulted in 96 peer-reviewed publications. In 2008, I led a proposal involving 62 senior personnel to acquire a new supercomputer system for UCB, leading to the development of a campus-wide resource and one of the top university supercomputers in the world.
With the advent of high-performance computing systems and their general availability to the scientific community, simulation has joined theory and experimentation as the third pillar of science. Computational Science is an emerging field that encompasses traditional simulation science as well as all of the components, both physical and virtual, required to employ simulation as a means of scientific investigation. By its very nature, computational science requires a multidisciplinary approach that includes computer science, numerical algorithms, and domain expertise. Accordingly, my most significant scientific accomplishments have been achieved as a pivotal member or leader in multidisciplinary teams. The following section concentrates on those accomplishments that I feel best illustrate my contributions to the field of Computational Science.
One of my primary research interests is in the implementation of high-order numerical methods such as spectral element and discontinuous Galerkin methods. These techniques combine the accuracy of spectral methods with the flexibility of element-based methods and have proven to be highly efficient, scalable, and most importantly, applicable to a wide range of problems that are of interest to the scientific community.
As a graduate student and postdoctoral researcher, I worked with Dr. Paul Fischer (Argonne National Laboratory - ANL) to develop NEK5000 (http://nek5000.mcs.anl.gov), a flow code that simulates unsteady incompressible flows using the spectral element method. NEK5000 supports general two- and three-dimensional geometries using meshes composed of (deformed) quadrilateral or hexahedral elements. My contribution was to develop the domain-decomposition software, design the communication algorithms and associated software to enforce continuity across element interfaces, develop a novel coarse-grid solver (see section (b) for details) as part of a multilevel overlapping Schwarz scheme, and write the immersive visualization component of the postprocessor (see section (h) for details). Moreover, I employed NEK5000 to investigate several challenging flow and heat transfer problems. One example is my work with Dr. Miles Greiner (Univ. of Nevada, Reno) on a series of simulation experiments to determine how to maximize heat transfer for a given pumping power using transverse grooves to enhance mixing. This is increasingly important in electronic devices, where miniaturization of electronic components has increased circuit junction density and the associated heat loads that must be removed to maintain reliable operation. Another example is my work with Dr. Fischer to numerically simulate the interaction of a flat-plate boundary layer with an isolated hemispherical roughness element. Of principal interest in these studies is the creation of hairpin vortices that form an interlacing pattern in the wake of the hemisphere, lift away from the wall, and are stretched by the shearing action of the boundary layer, since the tails remain in the low-speed (near-wall) region of the flow while their heads are entrained in the high-speed region.
These hairpins are of interest because they have frequently been observed, in whole or in part, in transitional and turbulent boundary layers and thus may be a critical component in the production of turbulence. Employing the immersive visualization system I developed, I explored a series of immersive visualizations of these simulations across a broad spectrum of Reynolds numbers (450–1500). The hairpin images compared favorably to those captured by experiments, and I identified and studied several key features in the evolution of this complex vortex topology not previously observed in other visualization formats. Movies can be viewed on my web site.
Since coming to Boulder, I have been collaborating with several scientists at NCAR to investigate the use of high-order methods for climate modeling. The primary framework for this research is the High-Order Method Modeling Environment (HOMME, http://www.homme.ucar.edu), which is the Computer Science Section’s environment for building and evaluating dynamical cores. HOMME is concerned with the development of a class of high-order scalable conservative atmospheric models for climate and general atmospheric modeling applications. The spatial discretizations are derived from the spectral element (SE) and discontinuous Galerkin (DG) methods. HOMME employs a cubed-sphere geometry exhibiting none of the singularities present in conventional latitude-longitude spherical geometries. The element-based formulation enables the use of general curvilinear geometries and adaptive conforming or non-conforming meshes. HOMME is currently capable of solving the hydrostatic primitive equations with SE and inherently conservative DG methods. In partnership with DOE, the SE option in the HOMME core was integrated into CAM, the atmospheric component of the Community Climate System Model (CCSM), and included in the CAM 4.0 release as an unsupported alternate core. Integration of the DG option into CAM was recently completed.
My contributions to HOMME are in four areas. The first two, developing scalable solvers and demonstrating that HOMME can scale to tens-of-thousands of processors with the necessary algorithmic modifications, are discussed in sections (b) and (d) respectively. The third was to assist Dr. John Dennis to build an adaptive mesh refinement package for HOMME using my GS package discussed in section (g). The final contribution is to lead several DOE efforts (http://csc.cs.colorado.edu/SciDAC-CCPP) with Drs. Ram Nair, Peter Lauritzen, Phil Rasch, Joe Tribbia, John Dennis, and Amik St-Cyr (all from NCAR). My specific contributions were: (1) to integrate the modal and nodal versions of the DG method into HOMME with Dr. Nair; (2) to develop a high-order accurate spectral finite-volume method with Dr. Nair and Dr. Vani Cheruvu; (3) to develop a 3D hydrostatic model in HOMME using the DG option combined with a new conservative vertical discretization scheme based on the 1D Lagrangian coordinates developed by Starr; (4) to integrate the DG-based hydrostatic model into CAM with Jose Garcia (NCAR); and (5) to assist with the validation efforts of the 3D hydrostatic model with Dr. Saroj Misra and Dr. Nair using the baroclinic wave instability test of Jablonowski and Williamson and the aqua-planet experiment (http://www-pcmdi.llnl.gov/projects/amip/ape).
Incompressible flow problems are characterized by the need to solve elliptic problems and thus require scalable linear solvers. I developed an overlapping Schwarz preconditioner that achieves scalability through the use of a multilevel solver that includes a coarse-grid solver, to rapidly transfer low-wavenumber information throughout the domain, and a local tensor-product-based solver. For large numbers of processors the coarse-grid operation is communication dominated and potentially the rate-limiting step. I developed a novel parallel direct solver that has proven to significantly outperform competing methods on thousands of processors. The approach is based on creating a sparse A-conjugate basis using a procedure developed by Wilkinson and, for appropriately ordered FEM or FD meshes, requires significantly less computation than competing methods and is communication minimal. To support application scientists, I extended the coarse-grid solver to handle non-symmetric problems and multiple right-hand sides. Finally, in collaboration with Dr. Steve Thomas (NCAR), I extended the local tensor-product-based solver to the cubed-sphere geometry. Having scalable solvers that operate on the cubed sphere is essential to achieving scientifically useful integration rates in HOMME at AFES (AGCM for Earth Simulator) resolutions.
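The core of the A-conjugate-basis idea can be sketched in a few lines: construct a matrix X whose columns are A-conjugate (so that X^T A X = I), after which A^{-1} = X X^T and the coarse-grid solve reduces to two matrix-vector products. The dense toy below is illustrative only, assuming a small SPD matrix; in the actual solver the mesh is ordered so that X remains sparse and the products are communication minimal.

```python
import numpy as np

def xxt_factor(A):
    """Build X with X^T A X = I by A-orthogonalizing the unit vectors
    (classical Gram-Schmidt in the A-inner product), so A^{-1} = X X^T.
    Dense illustration of the sparse XXT idea -- not the production code."""
    n = A.shape[0]
    X = np.zeros((n, n))
    for i in range(n):
        w = np.zeros(n)
        w[i] = 1.0                            # start from e_i
        coeffs = X[:, :i].T @ (A @ w)         # projections onto prior columns
        w -= X[:, :i] @ coeffs                # A-orthogonalize
        w /= np.sqrt(w @ (A @ w))             # normalize in the A-inner product
        X[:, i] = w
    return X

# usage: solve A x = b as x = X (X^T b) -- two mat-vecs at solve time
A = np.array([[4., 1., 0.], [1., 3., 1.], [0., 1., 2.]])
X = xxt_factor(A)
b = np.array([1., 2., 3.])
x = X @ (X.T @ b)
```

The factorization cost is paid once at setup; every subsequent coarse-grid solve is just the two products, which is what makes the method attractive at scale.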
My experience researching and developing software applications and algorithms to solve scientific problems on massively parallel computer systems led me to examine more closely the implications of the underlying computer itself on performance. Full application and algorithm performance can be obtained only when a supercomputer’s underlying computational, memory, interconnect, storage, and accelerator components are selected and combined to form an overall balanced and efficiently programmable system. Thus, one of the primary ongoing activities of my Research Systems Evaluation Team (ReSET) at NCAR and research group at UCB is conducting research into architectural and empirical computer performance for scientific applications. One highlight of this thrust is my experimental systems research for Linux clusters. Working with my group, I evaluated various aspects of cluster design, including parallel filesystems, diskless compute nodes, scalable management systems, and cluster-interconnect solutions. One result of this research is a better understanding of how to deploy cost-effective computational clusters that can accommodate the NCAR scientific workload while minimizing ongoing management overhead and complexity. Notably, these research projects often produce new operational technologies, such as modifying the Linux boot process to support using Lustre as a root file system and deploying it on UCB clusters years before such a feature was offered by Lustre.
In December 1999, IBM Research launched a multi-year and multi-disciplinary project called Blue Gene to develop the next generation of petascale high-performance computing architectures. I led a collaborative effort in 2005 between UCB, NCAR, and UCD-HS to acquire a one-rack BG/L supercomputer (2048 processors, 5.73 TF peak). (In 2009, we acquired an additional three racks from the San Diego Supercomputing Center and integrated them into a four-rack system.) This effort brought together twelve researchers from the three institutions to investigate and address the technical obstacles to achieving practical petascale computing in geoscience, aerospace engineering, and mathematical applications. To date, this collaboration has produced 96 publications. This experimental system still serves as a test bed for my efforts to develop scalable solvers and build new dynamical cores that can efficiently use massive parallelism.
In 2008, I expanded the collaborative effort between UCB, NCAR, and UCD-HS to create the Front Range Computing Consortium and acquire a new supercomputing system serving the research computing needs of these organizations. In partnership with Dell, as well as Michael Oberg and Dr. Matthew Woitaszek from my section at NCAR, I architected the system and its co-designed facility. Designing and deploying a system of this magnitude – 16,384 processor cores requiring 0.5 MW of electrical power, producing 184 TF peak and thus ranking as the 31st most powerful computer at the time of its installation – required several years of computer systems research to identify the appropriate processing, networking, and storage technologies, and the appropriate balance of each to best serve the community’s needs. In the years prior to preparing the grant, and during the following 18-month design revision window, ReSET and UCB students used in-house and remote platforms to evaluate each of the system’s components and finally design and test several small-scale system prototypes to ensure a successful integration. Although this work differs from my computational science background, publications describing its unique aspects are in preparation for systems research conferences and journals (e.g., USENIX).
In 2005, through a partnership with IBM, the Computer Science Section had access to a thirty-two rack IBM BG/L system (65,536 processors, 183.5 TF peak) in Rochester, Minnesota. Working with researchers at the IBM T.J. Watson Research Center and at NCAR, graduate student Theron Voran, under my supervision, tuned and benchmarked HOMME using moist Held-Suarez and aqua-planet simulations configured to match the resolution of the 2002 Gordon Bell AFES run. We were able to sustain performance of 8.0 TF on 32,768 processors for the moist Held-Suarez test problem in coprocessor mode and 11.3 TF on 32,768 processors for the aqua-planet test problem in virtual node mode. This work was later extended by Dr. Mark Taylor at Sandia National Laboratories to demonstrate scaling to O(100K) cores on the BG/L system at Lawrence Livermore National Laboratory. My primary contribution here was to work with Dr. John Dennis (NCAR) to determine a mapping between the HOMME computational grid (effectively a stacked set of 2D grids) and the BG/L 3D torus that minimized communication overhead. This was accomplished in two phases: (1) map the 2D array of blocks (or elements) into a 1D ordering using (weighted) space-filling curves (SFCs); and (2) map the resulting 1D ordering into the 3D torus by employing a “snake” mapping, which relies on a small 2D Morton SFC to modify the default lexicographical ordering to operate on 2x2 blocks. This significantly increased performance at large core counts (e.g., greater than 2x for 32,768 cores).
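The first phase of such a mapping, ordering the 2D element grid along a space-filling curve, can be sketched with a plain Morton curve; this is a simplified stand-in for the weighted SFCs actually used, but it shows the essential property that elements adjacent on the curve are spatially close, so neighbors land near each other on the torus.

```python
def morton_index(x, y, bits=16):
    """Interleave the bits of (x, y) to get the cell's position
    along a 2D Morton (Z-order) space-filling curve."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x bits -> even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # y bits -> odd positions
    return z

def sfc_ordering(nx, ny):
    """Order the nx x ny element grid along the Morton curve; in the real
    mapping this 1D ordering is then snaked through the 3D torus."""
    cells = [(x, y) for y in range(ny) for x in range(nx)]
    return sorted(cells, key=lambda c: morton_index(*c))
```

Because the Morton curve visits each 2x2 block before moving on, consecutive entries in the 1D ordering share mesh edges far more often than a lexicographical sweep would, which is what reduces off-node communication.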
In 2000, I worked with members of the FLASH team to redesign FLASH, the simulation code developed by the Center for Astrophysical Thermonuclear Flashes at the University of Chicago and used to study reactive flows encountered in astrophysical environments. This effort resulted in a fourfold improvement in performance and led to the team being awarded a Gordon Bell prize in the Special Category for achieving 0.238 TF on 3210 nodes of the ASCI Red system based on a simulation of a carbon detonation in a Type Ia supernova. One of the primary features of the FLASH code is the use of adaptive mesh refinement (AMR) to control placement of resolution; at that time, it was thought that obtaining this level of efficiency and scalability in an AMR code was unachievable. My contribution was to analyze the performance of the code, identify the bottlenecks inhibiting scalability and efficient processor utilization, and work with the team to address these inefficiencies. Approximately half of the improvement gains came through elimination of copies and reordering of operations to make better use of the processor's cache. The remaining improvement was obtained by removing global synchronization points, in particular the global sort in the AMR ordering routines. Over the years FLASH has matured into an astrophysics community code that is used daily by hundreds of researchers all across the world on a diverse range of applications.
In 1999, Dr. Paul Fischer and I achieved sustained terascale performance (0.376 TF on 2048 nodes of ASCI Red) simulating hairpin vortex generation with NEK5000. This effort was also recognized by a Gordon Bell award in the Special Category. This unprecedented level of scalability for an elliptic problem was a direct result of the coarse-grid solver discussed in sections (b) and (g). I made additional contributions by fully exploiting the tensor-product formulation so as to recast all local operations as matrix-matrix products and by optimizing the matrix-matrix product routines with Dr. Greg Henry of Intel. The latter is of paramount importance, as these routines account for over 90% of the floating-point operations in a typical simulation. Finally, I developed a hybrid MPI/OpenMP version of the code that obtained comparable performance at scale.
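The tensor-product recasting works because a 2D operator of the form D ⊗ I (or I ⊗ D) applied to the N x N nodal values of one element is just a matrix-matrix product, reducing the per-element cost from O(N^4) for the assembled operator to O(N^3). A minimal NumPy sketch of the idea (D stands for a 1D differentiation matrix, U for the element's nodal values):

```python
import numpy as np

def apply_tensor_gradient(D, U):
    """Apply the x- and y-derivative operators on an N x N spectral element
    via matrix-matrix products instead of forming the N^2 x N^2 Kronecker
    operators (D kron I) and (I kron D) explicitly."""
    Ux = D @ U      # equivalent to (D kron I) acting on vec(U), row-major
    Uy = U @ D.T    # equivalent to (I kron D) acting on vec(U), row-major
    return Ux, Uy
```

In NEK5000 these products are exactly the kernels that dominate the flop count, which is why tuning the matrix-matrix product routines paid off so directly.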
Grid computing addresses the general concern in the computational science community that the integration of computing infrastructure needs to encompass more than just hardware configurations. For example, in a rapidly evolving computational and information environment scientific workflows are becoming too complicated for manual (or semi-manual) implementation. To address this problem in the context of the geosciences, I have expanded the notion of infrastructure beyond facilities and hardware to encompass software development and integration, and have worked to provide a set of seamless discipline-specific services to the community that encompass all of the components, both physical and virtual, required to produce an end-to-end solution.
One example of this research is the Grid-BGC project (http://www.gridbgc.ucar.edu). This collaborative project with Dr. Peter Thornton (Oak Ridge National Laboratory) had the objective of creating a cost-effective end-to-end solution for terrestrial ecosystem modeling. Grid-BGC is built on the Globus Toolkit and allows scientists to easily configure and run high-resolution terrestrial carbon cycle simulations without having to worry about the individual components of the simulation or the underlying computational and data storage systems, which are often distributed. Core development components of this system were the development of a science portal and the design and implementation of a service oriented architecture (SOA) for the geosciences. This work, which was started in the early 2000s, predates the now common approach of constructing web-based “science gateways” to provide interfaces and automation for scientific workflows. I supervised graduate student Jason Cope and Dr. Woitaszek to design and implement the SOA, which is the backend to the Grid-BGC science portal that allows scientists to interactively set up global carbon cycle experiments. Grid-BGC is the first working computational grid for carbon cycle simulations and its companion SOA provides a seamless set of services from which the geoscience and other communities can build other end-to-end solutions. This work led to Dr. Woitaszek’s research examining the use of lightweight technologies and frameworks for rapidly developing science gateway solutions, as well as four of my own complementary research areas: fault-tolerant archival storage, urgent computing, high-throughput computing, and cloud computing.
One interesting aspect of the project was that it generated a large volume of compute tasks. I worked with graduate student Paul Marshall and Dr. Woitaszek to develop and implement a new ensemble-based task specification strategy that delays the enumeration of tasks until execution, eliminating the need to store tasks and their status in a persistent data structure. When integrated with a complementary task dispatch method, this feature made BG/L a viable platform for many-task workloads in addition to massively parallel MPI programs.
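The idea of delayed enumeration can be sketched with a Python generator: rather than materializing the full cross-product of ensemble parameters up front, tasks are produced lazily as the dispatcher requests them, so no persistent task list is ever stored. The parameter names below are purely illustrative, not taken from Grid-BGC.

```python
import itertools

def ensemble_tasks(param_grid):
    """Lazily enumerate ensemble members from a dict of parameter lists.
    Each task is yielded on demand at dispatch time instead of being
    stored in a persistent task database."""
    names = list(param_grid)
    for values in itertools.product(*(param_grid[n] for n in names)):
        yield dict(zip(names, values))

# usage: hand tasks to the dispatcher one at a time as workers free up
grid = {"co2_ppm": [280, 560], "resolution_km": [100, 50, 25]}
dispatcher = ensemble_tasks(grid)
first = next(dispatcher)   # only the requested task exists in memory
```

Because the generator is deterministic, a failed task can be re-derived from its index alone, which is what removes the need for persistent per-task status records.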
While Grid-BGC executed a workflow for the domain scientists’ convenience, my examination of climate and meteorology use cases led to my exploration of urgent computing – the use of workflow automation to perform time-sensitive simulations for critical real-world situations. As complex software systems, wildfire models and hurricane models can take engineers and scientists a substantial amount of time to set up and run; the only way to use models in natural disaster situations is to have the models ready to execute at the touch of a button. The technology developed for Grid-BGC was a precursor to my work with graduate student Jason Cope to design a simulator to model data transfers and data placement of urgent computing workflows. Using this simulator, we demonstrated that dynamic adjustments to application configurations and robust resource allocation heuristics could significantly improve the execution time of time-critical workflows.
Finally, I am working with Dr. Rob Knight’s group (UCB), Dr. Kate Keahey’s group (ANL), and graduate student Paul Marshall to transfer the lessons learned in developing Grid-BGC to build a bioinformatics knowledge environment (BiKE) that allows comparison of microbial communities at several levels and relates information about sequences to structures and vice versa. The back-end to this environment is a distributed service-oriented framework that integrates a variety of cloud resources to execute the cyber-experiments.
One essential infrastructure component required by the high-performance computing community is archival storage, such as NCAR’s 10PB+ Mass Store System. Although magnetic tape is a well-established archival storage medium, tape systems have limited read and write throughput, require tape retrieval queue time, and risk storing all of the important data in one place. To address these limitations, I worked with Dr. Woitaszek to build a reliable and high-performance file system for archival storage using low-density parity check codes (LDPC), in particular those based on irregular graphs (e.g., Tornado codes). The advantage of moving to a LDPC-based scheme based on an open software infrastructure is that it allows one to leverage emerging storage solutions. We demonstrated that Tornado encoding schemes could be designed so that they are significantly more fault tolerant than either RAID or mirrored systems. Moreover, by using cooperatively selected Tornado Code graphs to build a geographically distributed data stewarding system, we showed that one can obtain overall systems fault tolerance exceeding that of its constituent storage sites or site replication strategies. Though the use of LDPC codes to build file systems is not new, using them to enhance throughput and reliability of archival systems is a unique contribution to the computer science literature.
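Tornado codes use irregular bipartite graphs of XOR constraints, but the underlying erasure-coding principle can be illustrated with a single XOR parity block; this is a deliberately simplified sketch of the idea, not the scheme deployed in the system. In the LDPC case, each parity block is the XOR of a pseudo-randomly chosen subset of data blocks, which is what allows recovery from multiple simultaneous losses.

```python
def make_parity(blocks):
    """XOR parity across equal-length data blocks: a toy one-constraint
    stand-in for the many XOR constraints of an LDPC (Tornado) code."""
    parity = bytes(len(blocks[0]))
    for b in blocks:
        parity = bytes(x ^ y for x, y in zip(parity, b))
    return parity

def recover(blocks, parity, missing):
    """Rebuild the single missing block by XOR-ing the parity block
    with the surviving data blocks."""
    out = parity
    for i, b in enumerate(blocks):
        if i != missing:
            out = bytes(x ^ y for x, y in zip(out, b))
    return out
```

Distributing such XOR constraints across geographically separate sites is what lets the aggregate system tolerate the loss of an entire site, as described above.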
XXT and XYT Solver Packages: I efficiently implemented both the symmetric and nonsymmetric direct solver methods discussed in section (b) and created an easy-to-use interface that is well suited for developing scalable domain-decomposition-based applications. This code is distributed worldwide via the Portable, Extensible Toolkit for Scientific Computation (PETSc) from ANL and is freely available from http://www-unix.mcs.anl.gov/petsc/petsc-as/.
GS Gather-Scatter Package: Gather-scatter operations are central in all domain-decomposition approaches to parallel solution of PDEs. I developed a general-purpose, stand-alone gather-scatter routine that has a lightweight user interface and supports a number of associative/commutative operations, including user-defined functions. It has low setup overhead that is appropriate for dynamically adaptive meshing. Currently, numerous researchers in the United States and Europe use this package and it is being employed by members of the Computer Science Section to develop a version of HOMME that employs adaptive mesh refinement to place resolution where required.
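In serial form, a gather-scatter operation amounts to combining all local values that share a global degree-of-freedom id using an associative/commutative operation, then scattering the combined result back to each local copy. The sketch below mimics this behavior in a single process (the actual GS package performs it across processors with optimized communication):

```python
def gather_scatter(values, global_ids, op=lambda a, b: a + b):
    """Combine local values sharing a global DOF id with an associative,
    commutative op (default: sum), then scatter the result back so every
    local copy of a shared DOF holds the same combined value."""
    gathered = {}
    for v, g in zip(values, global_ids):
        gathered[g] = op(gathered[g], v) if g in gathered else v
    return [gathered[g] for g in global_ids]
```

With the default sum, values on an element-interface DOF are accumulated and redistributed, which is exactly how continuity is enforced across element boundaries in element-based methods.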
Working with Mike Papka (ANL), I enhanced the NEK5000 postprocessor to map spectral-element data to unstructured hexahedral meshes of arbitrary density. This data is then processed by a package designed for immersive visualization of general mesh data, whose components are built using VTK (an open-source software system for visualization) and the CAVE library, which enables projection and exploration of immersive stereo images. For immersive visualization, the processed output is loaded into a VTK-based application built on top of the CAVE library that allows the user to rapidly explore data from a number of different viewing angles.