March 24-27, 2014 | San Jose, California
Slidecasts of GTC sessions are available now for conference registrants; PDFs of presentation slides will be available by mid-April. Registrants must sign in to view slidecasts and PDFs. For non-registrants, this GTC content will be available at the end of April on GTC On Demand.

GPU Technology Conference Schedule Planner



S4171 - Efficient GPU-Friendly Pre-Conditioners for Large-Scale Finite Element Analysis

Krishnan Suresh ( Associate Professor, University of Wisconsin )
Krishnan Suresh is currently an Associate Professor in the Department of Mechanical Engineering, University of Wisconsin-Madison. He received his Ph.D. in Mechanical Engineering from Cornell in 1998 and later served as an Engineering Manager at Kulicke and Soffa Industries (1998-2002). His research interests are in optimization and high-performance computing. He has co-authored over 35 journal papers and several conference papers, two of which have received best-paper awards from ASME.

The goal of this session is to introduce a new GPU-friendly pre-conditioner, specifically for finite-element applications. The pre-conditioner is assembly-free in that neither the finite-element stiffness matrix nor the pre-conditioner is assembled (ever!). The memory footprint is therefore extremely small, and the GPU implementation is, in most cases, compute-bound. A CUDA implementation will be discussed, followed by examples of finite element problems with tens of millions of degrees of freedom. It is assumed that registrants are already familiar with finite element techniques.
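
To make the assembly-free idea concrete, the sketch below shows a matrix-free stiffness matrix-vector product, the kernel at the heart of such a solver: each thread regenerates one element stiffness matrix on the fly and scatters its contribution, so the global matrix is never stored. The element routine and data layout are assumptions for illustration, not the speaker's implementation.

    // Assembly-free y = K*x for 8-node hexahedral elements (sketch).
    // Neither K nor the pre-conditioner is ever assembled; only the mesh
    // (connectivity and coordinates) and the vectors live in GPU memory.
    __global__ void assembly_free_matvec(const double *x, double *y,
                                         const int *conn,      // 8 node ids per element
                                         const double *coords, // nodal coordinates
                                         int num_elements)
    {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e >= num_elements) return;

        double ke[24][24];  // 8 nodes x 3 dof; recomputed, never stored globally
        compute_element_stiffness(ke, &conn[8 * e], coords); // assumed device routine

        for (int i = 0; i < 24; ++i) {
            double yi = 0.0;
            for (int j = 0; j < 24; ++j)
                yi += ke[i][j] * x[3 * conn[8 * e + j / 3] + j % 3];
            // scatter-add replaces assembly; note that double-precision
            // atomicAdd needs a CAS-based workaround on pre-Pascal GPUs
            atomicAdd(&y[3 * conn[8 * e + i / 3] + i % 3], yi);
        }
    }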

Session Level: Intermediate
Session Type: Talk
Tags: Numerical Algorithms & Libraries; Computational Structural Mechanics; Computer Aided Design

Day: Tuesday, 03/25
Time: 13:00 - 13:25
Location: Room LL21D

S4327 - Hierarchical Algorithms on Heterogeneous Architectures: Adaptive Multigrid Solvers for LQCD on GPUs

M Clark ( HPC Compute Engineer , NVIDIA )
Dr. Clark's background is in high energy physics. He completed doctoral research in Monte Carlo algorithms for lattice quantum chromodynamics in 2005, graduating from the University of Edinburgh. After moving to Boston University, Dr. Clark focused on adaptive multi-grid algorithms and symplectic integrators. It was during this time that research was initiated into harnessing GPUs for lattice QCD computation; this research has since evolved into the QUDA library. Dr. Clark spent 2009-2011 at Harvard University, continuing to work on algorithms for GPUs and many-core processors, with a focus on signal processing. Dr. Clark moved to NVIDIA in 2011, and continues to work at the interface between applications, algorithms and parallel computation.

Learn how advanced multigrid-solver algorithms on GPUs are revolutionizing the study of sub-nuclear physics. Lattice quantum chromodynamics (LQCD) is the study of quarks and gluons, the constituent particles that make up protons and neutrons. Owing to the computationally demanding nature of these calculations, GPUs are an increasingly popular platform for deployment, where a single calculation can require thousands of GPUs working in tandem for months. There has been much progress to date in developing scalable sparse linear solver algorithms, utilizing well-known mathematical methods such as mixed precision, domain decomposition and pipelining to improve performance, allowing efficient use of large GPU installations such as Blue Waters and Titan. However, there has been less focus on deploying 'mathematically optimal' linear solvers that have optimal O(N) complexity. In this work we utilize the QUDA framework to deploy adaptive multigrid solvers on GPUs; in particular, we describe the architecture abstractions that allow for deployment on heterogeneous systems, utilizing both GPUs and CPUs. We discuss in general the suitability of heterogeneous architectures for hierarchical algorithms, and compare performance against a highly optimized CPU implementation.

Session Level: Intermediate
Session Type: Talk
Tags: Computational Physics; Numerical Algorithms & Libraries

Day: Tuesday, 03/25
Time: 13:00 - 13:50
Location: Room 212A

S4571 - Applications of GPU Computing to Mission Design and Satellite Operations at NASA's Goddard Space Flight Center

Abel Brown ( Principal Systems Engineer, A.I. Solutions )
Abel holds degrees in Mathematics and Physics as well as a PhD in the field of Geodesy & Geophysics from The Ohio State University. For the past eight years Abel has been developing distributed software frameworks and administering high performance computing clusters. He has deployed and managed many sensor networks around the world in Antarctica, South America, and Greenland. Abel is dually appointed to the Magnetospheric Multiscale (MMS) Ground System and Conjunction Assessment development teams and manages numerous research projects at a.i. solutions on GPU computing, image analytics, and advanced satellite perturbation techniques. As co-author, Abel's recent work contributed to the PNAS publication "Greenland Rising", which was featured in WIRED Magazine's "Best Scientific Figures of 2012".

The computational intensity required for modern-day space missions is quickly outgrowing existing CPU capabilities. The Magnetospheric Multiscale (MMS) mission is the first NASA mission to fly four satellites in formation and thus has uniquely challenging design and operational requirements, namely, mitigation of collision scenarios involving space debris and/or the formation with itself. By design, no more than 1 in 1000 unsafe close approaches may go undetected, while operationally no more than 1 in 20 alarms raised may be false, so as to minimize science interruptions. The confidence intervals required to satisfy such requirements pose daunting computational demands which, operationally, cannot be met using traditional CPU solutions. Here it is demonstrated how GPU-accelerated solutions are being deployed, for the first time, at the NASA Goddard Space Flight Center (GSFC) to meet operational MMS mission requirements. Additional applications to Space Situational Awareness and mission design are discussed.

Session Level: All
Session Type: Talk
Tags: Defense; Numerical Algorithms & Libraries; Supercomputing; Scientific Visualization; Recommended for All Press

Day: Tuesday, 03/25
Time: 13:00 - 13:25
Location: Room 210D

S4228 - Fast N-body Methods as a Compute-Bound Preconditioner for Sparse Solvers on GPUs

Rio Yokota ( Research Scientist, KAUST )
Rio Yokota is a Research Scientist in the Strategic Initiative for Extreme Computing at KAUST. He currently works on fast multipole methods (FMM), and their implementation on large scale heterogeneous architectures. During his PhD, he worked on the implementation of fast multipole methods on special purpose machines such as MDGRAPE-3, and then on GPUs after CUDA was released. During his post-doc he continued to work on FMM, and was part of the team that won the Gordon Bell prize for price/performance in 2009 using 760 GPUs. During his postdoc with Lorena Barba at Boston University he developed an open source parallel FMM code -- ExaFMM. He is now running this code on full nodes of the TSUBAME 2.0 and K computer in Japan, and also on Mira, Titan, and Stampede in the US.

Learn how to unleash the full power of GPUs on one of the more difficult problems -- preconditioning in sparse solvers -- by using fast N-body methods as a preconditioner. Fast N-body methods have been able to achieve a high percentage of peak performance since the early days of GPU computing. However, their successful applications have been limited to astrophysics and molecular dynamics, where the physics itself is naturally described by a collection of discrete points. Mathematically, there is nothing that prevents the use of fast N-body methods as a solver for a more general class of PDEs. This would not have been a good idea back when Flops were expensive, since it essentially turns the sparse matrix into a dense matrix of the same size, before hierarchically grouping the off-diagonal blocks. But now that Flops are becoming comparatively cheap, the notion of a "compute-bound preconditioner" sounds more attractive than ever. We will demonstrate how competitive such a preconditioner actually is on Kepler.

Session Level: Intermediate
Session Type: Talk
Tags: Numerical Algorithms & Libraries; Computational Physics

Day: Tuesday, 03/25
Time: 13:30 - 13:55
Location: Room LL21D

S4632 - Exploring the Earth in 3D: Multiple GPUs for Accelerating Inverse Imaging

Chris Leader ( Graduate Student, Department of Geophysics, Stanford )
Chris Leader is currently a 5th year student in Stanford Earth Sciences working on both acquisition and HPC applications for seismic exploration. He has also worked on these topics at Shell International in Houston and Total in Pau, France. He holds an MSc from Imperial College London and a BA from Oxford University.

Discover how we can harness the power of multiple GPUs to explore the Earth with seismic data. A wave-equation-based inversion process is used to turn these data into a high-fidelity image; however, for contemporary datasets this requires around 10^18 operations, if not more. GPUs can ease this computational bottleneck, but they create two further limiting factors: exacerbated disk accesses and global memory limitations. These can be addressed by manipulating the domain boundaries and by decomposing our problem across multiple GPUs. We will show you how we can create detailed seismic images without these traditional restrictions.

Session Level: Beginner
Session Type: Talk
Tags: Energy Exploration; Computational Physics; Scientific Visualization; Numerical Algorithms & Libraries

Day: Tuesday, 03/25
Time: 13:30 - 13:55
Location: Room LL20B

S4213 - Kokkos: A Manycore Device Performance Portability Library for C++ HPC Applications

H. Carter Edwards ( Principal Member of Technical Staff, Sandia National Laboratories )
Highly-Rated Speaker
H. Carter Edwards earned a BS (1982) and MS (1993) in Aerospace Engineering and a PhD (1997) in Computational and Applied Mathematics, all from the University of Texas at Austin. From 1982 to 1991 he worked at LinCom Corporation as a contractor at Johnson Space Center, Houston, TX. He has been at Sandia National Laboratories since 1998.

Discover how the Kokkos library enables you to develop HPC scientific applications that are performance portable across disparate manycore devices such as NVIDIA Kepler and Intel Xeon Phi. Portable programming models such as OpenMP, OpenACC, OpenCL, and Thrust focus on parallel execution but fail to address memory access patterns critical for achieving best performance. Thus codes must be extensively re-written to meet device-specific memory access pattern requirements; e.g., data structures and loops transformed from array-of-structures patterns to structure-of-arrays patterns. We address this issue by integrating compile-time polymorphic data layout with parallel execution. We will present manycore performance portability of the LAMMPS molecular dynamics code and Trilinos/Tpetra linear solvers implemented with MPI+Kokkos, and run on clusters with Intel Xeon Phi and NVIDIA Kepler devices.
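
The layout issue the abstract describes can be captured in a few lines. The sketch below illustrates the idea of compile-time polymorphic layout (it is an illustration written for this listing, not the actual Kokkos API): the kernel body is written once against a 2D view, and the index mapping is a template parameter chosen per device.

    // Column-major layout: for fixed j, consecutive i are adjacent in memory,
    // so consecutive GPU threads indexed by i get coalesced accesses.
    struct LayoutLeft {
        __host__ __device__ static size_t map(int i, int j, int n0, int n1)
        { return i + (size_t)j * n0; }
    };
    // Row-major layout: contiguous rows, the usual choice for CPU caches.
    struct LayoutRight {
        __host__ __device__ static size_t map(int i, int j, int n0, int n1)
        { return (size_t)i * n1 + j; }
    };

    template <class Layout>
    struct View2D {
        double *data; int n0, n1;
        __host__ __device__ double &operator()(int i, int j) const
        { return data[Layout::map(i, j, n0, n1)]; }
    };

    // The same body serves both devices; only the Layout parameter changes.
    __global__ void scale_rows(View2D<LayoutLeft> a, const double *w)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x; // one thread per row
        if (i >= a.n0) return;
        for (int j = 0; j < a.n1; ++j)
            a(i, j) *= w[i]; // LayoutLeft: threads i, i+1 touch adjacent words
    }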

Session Level: Intermediate
Session Type: Talk
Tags: Supercomputing; Numerical Algorithms & Libraries; Programming Languages & Compilers

Day: Tuesday, 03/25
Time: 14:00 - 14:50
Location: Room LL21A

S4287 - High Performance Numerical Algorithms for Seismic and Reservoir Simulations

Hatem Ltaief ( Research Scientist, KAUST )
Hatem received his PhD in Computer Science from the University of Houston in 2007. He was a Research Scientist in the Innovative Computing Laboratory in the EECS Department at the University of Tennessee, Knoxville until 2010. He joined KAUST, Saudi Arabia, in January 2011 and is currently a Computational Scientist at the Supercomputing Laboratory. He was the recipient of an NVIDIA CUDA Research Center Award in 2012. He is part of the European Exascale Software Initiative (EESI) to build a European vision and roadmap to address the challenges of the new generation of massively parallel systems. His research interests include parallel numerical algorithms, fault tolerant algorithms, parallel programming models and performance optimizations for multicore architectures and hardware accelerators.
Rio Yokota ( Research Scientist, KAUST )
Rio Yokota is a Research Scientist in the Strategic Initiative for Extreme Computing at KAUST. He currently works on fast multipole methods (FMM), and their implementation on large scale heterogeneous architectures. During his PhD, he worked on the implementation of fast multipole methods on special purpose machines such as MDGRAPE-3, and then on GPUs after CUDA was released. During his post-doc he continued to work on FMM, and was part of the team that won the Gordon Bell prize for price/performance in 2009 using 760 GPUs. During his postdoc with Lorena Barba at Boston University he developed an open source parallel FMM code -- ExaFMM. He is now running this code on full nodes of the TSUBAME 2.0 and K computer in Japan, and also on Mira, Titan, and Stampede in the US.

Learn how to leverage current numerical algorithms for solving challenging reservoir and seismic simulation problems on GPUs using: 1) a novel preconditioner technique based on massively parallel, compute-intensive fast N-body methods; 2) an optimized implementation of the sparse matrix-vector multiplication used during the iterative solver phase, which exploits the existing structure of the sparse matrix; and 3) a synchronization-reducing algorithm for stencil-based computation during explicit time integration.

Session Level: Intermediate
Session Type: Talk
Tags: Energy Exploration; Numerical Algorithms & Libraries

Day: Tuesday, 03/25
Time: 14:00 - 14:25
Location: Room LL20B

S4299 - Fast Solvers for Linear Systems on the GPU

Cornelis Vuik ( Professor, Delft University of Technology )
Cornelis received his Master's in Applied Mathematics from TU Delft and his Ph.D. from Utrecht University. Since 2010, Cornelis has served as Associate Editor of the SIAM Journal on Scientific Computing (SISC). Cornelis has authored more than 130 ISI papers, is the co-investigator of an Erasmus Mundus Master program, Director of the Delft Centre for Computational Science and Engineering, and Scientific Director of the 3TU.AMI Applied Mathematics Institute.

This session presents examples of solving large linear systems arising from practical and industrial applications. The methods are based on preconditioned Krylov subspace methods. Most building blocks are easily implemented on the GPU; the most involved operation is the preconditioner. In this talk three variants are discussed: (1) Neumann series, (2) deflation techniques, and (3) recursive red-black ordering. The methods are applied to multi-phase flow and a ship simulator application, and show speedups of a factor of 30-40.

Session Level: Intermediate
Session Type: Talk
Tags: Numerical Algorithms & Libraries; Computational Fluid Dynamics; Supercomputing; Manufacturing

Day: Tuesday, 03/25
Time: 14:00 - 14:25
Location: Room LL21D

S4320 - Packet-based Network Traffic Monitoring & Analysis with GPUs

Wenji Wu ( Network Researcher, Fermilab )
Wenji Wu received his doctorate in computer engineering in 2003 from the University of Arizona, Tucson. He is currently a network researcher at Fermi National Accelerator Laboratory. His research interests include high performance networking, operating systems, and distributed systems.

In high-speed networks, network traffic monitoring and analysis applications may require enormous raw compute power and high I/O throughput, especially when traffic scrutiny on a per-packet basis is needed. Under those conditions, the applications face tremendous performance and scalability challenges. The GPU architecture fits well with the features of packet-based network monitoring and analysis applications. At Fermilab, we have prototyped a GPU-assisted network traffic monitoring & analysis system, which analyzes network traffic on a per-packet basis. We implemented a GPU-accelerated library for network traffic capturing, monitoring, and analysis. The library consists of a set of CUDA kernels that can be combined in various ways to perform monitoring and analysis tasks. In this talk, we will describe our architectural approach in developing a generic GPU-assisted network traffic monitoring and analysis capability. Multiple examples will be given to demonstrate how to use GPUs to analyze network traffic.

Session Level: All
Session Type: Talk
Tags: Big Data Analytics & Data Algorithms; Computational Physics; Supercomputing; Numerical Algorithms & Libraries; Recommended Press Session – HPC-Science

Day: Tuesday, 03/25
Time: 14:00 - 14:50
Location: Room 210B

S4693 - Accelerating Low-Lying Eigenmode Deflation for Lattice QCD Fermion Inverters on GPUs

Alexei Strelchenko ( Computational Physics Developer, Fermi National Accelerator Laboratory )
Alexei Strelchenko is a computational physics developer at the Fermi National Accelerator Laboratory. He received a Ph.D. degree from Leipzig University, Germany. From 2009 to 2012 he worked on the Lattice QCD on GPU Architectures project at the Computation-based Science and Technology Research Centre (CaSToRC). He also participated in the European PRACE-1IP (Partnership for Advanced Computing in Europe) and LinkSCEEM-2 projects. His main research interests are computational physics, general-purpose computing on graphics processing units (GPGPU), and lattice QCD.

Learn how to leverage the power of GPUs to accelerate the solution of large sparse linear systems with multiple right-hand sides by means of the incremental eigCG algorithm. For a given hermitian system with multiple right-hand sides, this algorithm allows one (1) to compute incrementally a number of small-magnitude eigenvalues and corresponding eigenvectors while solving the first few systems with standard Conjugate Gradient (CG), and then (2) to reuse the computed eigenvectors to deflate the CG solver for the remaining systems. In this session we will discuss implementation aspects of the technique and analyse its efficiency using the example of lattice QCD fermion matrix inversions.
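
The deflation step itself is compact. A minimal host-side sketch of the idea (not the speaker's implementation) builds a deflated initial guess from the eigenpairs (lambda_k, v_k) accumulated while solving the earlier systems, removing the slowly converging low modes before CG starts:

    // x0 = sum_k (v_k^H b / lambda_k) v_k  -- the deflated initial guess
    #include <complex>
    #include <vector>
    using cplx = std::complex<double>;

    std::vector<cplx> deflated_guess(const std::vector<std::vector<cplx>> &V,
                                     const std::vector<double> &lambda,
                                     const std::vector<cplx> &b)
    {
        std::vector<cplx> x0(b.size(), 0.0);
        for (size_t k = 0; k < V.size(); ++k) {
            cplx proj = 0.0;                           // v_k^H b
            for (size_t i = 0; i < b.size(); ++i)
                proj += std::conj(V[k][i]) * b[i];
            proj /= lambda[k];                         // scale by 1/lambda_k
            for (size_t i = 0; i < b.size(); ++i)
                x0[i] += proj * V[k][i];
        }
        return x0;                                     // hand this to CG
    }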

Session Level: Intermediate
Session Type: Talk
Tags: Computational Physics; Numerical Algorithms & Libraries; Supercomputing

Day: Tuesday, 03/25
Time: 14:00 - 14:25
Location: Room 212A

S4204 - Multifrontal Sparse QR Factorization on the GPU

Tim Davis ( Professor, University of Florida )
Tim Davis is a professor in Computer and Information Science and Engineering at the University of Florida. He is a Fellow of the Society for Industrial and Applied Mathematics (SIAM), in recognition of his work on sparse matrix algorithms. His software for sparse direct methods appears in hundreds of applications in industry, academia, and government labs, including MATLAB (x=A\b), Mathematica, NASTRAN, Cadence, Mentor Graphics, Google Ceres (StreetView, PhotoTours), IBM, Berkeley Design Automation, Xyce, and many others. For a full CV, see http://www.cise.ufl.edu/~davis/background.html .

Sparse matrix factorization involves a mix of regular and irregular computation, which is a particular challenge when trying to obtain high performance on the highly parallel general-purpose computing cores available on graphics processing units (GPUs). We present a sparse multifrontal QR factorization method that meets this challenge, and is up to eleven times faster than a highly optimized method on a multicore CPU. Our method is unique compared with prior methods, since it factorizes many frontal matrices in parallel and keeps all the data transmitted between frontal matrices on the GPU. A novel bucket scheduler algorithm extends the communication-avoiding QR factorization for dense matrices, by exploiting more parallelism and by exploiting the staircase form present in the frontal matrices of a sparse multifrontal method. Peak performance is over 80 Gflops on a Fermi Tesla C2070, in double precision. This is joint work with Nuri Yeralan and Sanjay Ranka.

Session Level: Intermediate
Session Type: Talk
Tags: Numerical Algorithms & Libraries

Day: Tuesday, 03/25
Time: 14:30 - 15:20
Location: Room LL21D

S4453 - GPU-Based Lattice QCD Simulations as Thermometer for Heavy-Ion Collisions

Mathias Wagner ( Postdoc, Bielefeld University & Indiana University )
Theoretical physicist Dr. Mathias Wagner is currently working in the physics department at Indiana University. After receiving his PhD in 2009 at Technical University Darmstadt, he moved to Bielefeld University in 2010, where he focused on CUDA implementations of lattice QCD simulations. He is a member of the team administrating the Bielefeld GPU cluster and is the PI of the CUDA Research Center in Bielefeld. At Indiana University he continues working on high-performance lattice QCD simulations on GPUs, collaborating intensively with researchers from the National Center for Supercomputing Applications at the University of Illinois and the developers of the QUDA library.

See how advances in GPU computing enable us to simulate quantum chromodynamics and learn about fundamental properties of strongly interacting matter, i.e., quarks and gluons, at finite temperatures. With the advances in hardware and algorithms, these simulations have reached a level that allows for a quantitative comparison with experimental data from heavy-ion colliders. Discover how the Kepler architecture helps us boost the performance of the simulations and reach a new level of precision. I will discuss selected optimizations for the Kepler K20 cards and modifications to prepare the code for the Titan supercomputer. Furthermore, I compare and discuss the pros and cons of our in-house code in comparison to available libraries such as QUDA.

Session Level: Intermediate
Session Type: Talk
Tags: Computational Physics; Supercomputing; Numerical Algorithms & Libraries

Day: Tuesday, 03/25
Time: 14:30 - 14:55
Location: Room 212A

S4250 - PyFR: Bringing Next Generation Computational Fluid Dynamics to GPU Platforms

Peter Vincent ( Lecturer, Imperial College London )
Peter Vincent is a Lecturer and EPSRC Early Career Fellow in the department of Aeronautics at Imperial College London, working at the interface between mathematics, computing, fluid dynamics, and aeronautical engineering. He holds a 1st class undergraduate degree from the department of Physics at Imperial College (graduating top of the year), and a PhD from the department of Aeronautics at Imperial College in the field of CFD. Prior to his appointment as a Lecturer, Peter served as a Postdoctoral Scholar in the department of Aeronautics and Astronautics at Stanford University, where he developed novel high-order numerical methods for CFD, and implemented them for massively-parallel many-core Graphical Processing Units (GPUs).

Discover how GPUs are being used to accelerate high-fidelity computational fluid dynamics (CFD) simulations on unstructured grids. In this talk I will (i) introduce the flux reconstruction approach to high-order methods; a discretization that is particularly well-suited to many-core architectures, (ii) introduce our massively parallel implementation, PyFR, which through a combination of symbolic manipulation and run-time code generation is able to easily target NVIDIA GPU hardware and, (iii) showcase some of the high-fidelity, unsteady, simulations undertaken using PyFR on both desktop and HPC systems.

Session Level: Intermediate
Session Type: Talk
Tags: Computational Fluid Dynamics; Numerical Algorithms & Libraries; Supercomputing

Day: Tuesday, 03/25
Time: 15:00 - 15:25
Location: Room 210A

S4145 - High Frequency Elastic Seismic Modeling on GPUs Without Domain Decomposition

Thor Johnsen ( HPC Expert, Chevron )
Thor has 20 years of experience as a software developer in the oil and gas industry. He worked for Schlumberger for 17 years in various roles on projects targeting all forms of seismics, EM data and borehole logs, and was an HPC expert for WesternGeco from 2009 until 2011. He currently works as an HPC expert for Chevron ETC, and loves the challenge of parallel designs of complicated algorithms.

What if you want to do FDTD modeling on a dataset that cannot possibly fit into GPU memory? This session explores design patterns that take advantage of two levels in the GPU memory hierarchy that are often overlooked, host memory and disk, thereby greatly expanding the size of problem that can be handled. Two seismic modeling kernels were implemented: acoustic TTI with variable density, and elastic triclinic. We show that these GPU kernels can handle extremely large datasets without domain decomposition (tens of billions of cells) while also taking full advantage of the computational throughput of 16 Kepler GPUs, achieving 20-30x better throughput than highly optimized CPU code running on a dual-socket Sandy Bridge server. We also show that this design pattern can be applied to other numerical methods that have a concept of timestepping and exhibit good spatial locality, such as Lattice Boltzmann methods for fluid flow modeling.

Session Level: Advanced
Session Type: Talk
Tags: Energy Exploration; Numerical Algorithms & Libraries

Day: Tuesday, 03/25
Time: 15:30 - 16:20
Location: Room LL20B

S4311 - Lanczos Algorithm Using CUDA for Lattice QCD

Hyung-Jin Kim ( Research Associate, Brookhaven National Laboratory )
Hyung-Jin Kim received his Ph.D. from Seoul National University, Republic of Korea, in 2012.

Computing a set of eigenvalues and eigenvectors of the matrix to be inverted is the key to accelerating the matrix inversion, and the Lanczos algorithm is one of the best-known methods for this problem. But this routine is heavily dominated by data-access I/O, so it can become another bottleneck in the whole sequence. Even though the flops-to-bandwidth ratio of the GPU is not ideal, the GPU still has an advantage in memory bandwidth compared with the CPU. We are implementing the Lanczos algorithm in CUDA and will show preliminary performance results on multi-GPU clusters.

Session Level: Advanced
Session Type: Talk
Tags: Computational Physics; Numerical Algorithms & Libraries; Supercomputing

Day: Tuesday, 03/25
Time: 15:30 - 15:55
Location: Room 212A

S4318 - Performance Impact of Dynamic Parallelism on Clustering Algorithms on GPUs

Michela Taufer ( Associate Professor, University of Delaware )
Michela Taufer
Dr. Taufer is the David L. and Beverly J.C. Mills Chair of Computer and Information Sciences and an associate professor in the same department at the University of Delaware. She earned her master's degree in Computer Engineering from the University of Padova (Italy) and her doctoral degree in Computer Science from the Swiss Federal Institute of Technology (Switzerland). From 2003 to 2004 she was a La Jolla Interfaces in Science Training Program (LJIS) Postdoctoral Fellow at the University of California San Diego (UCSD) and The Scripps Research Institute (TSRI), where she worked on interdisciplinary projects in computer systems and computational chemistry. From 2005 to 2007, she was an Assistant Professor in the Computer Science Department of the University of Texas at El Paso (UTEP). She joined the University of Delaware in 2007 as an Assistant Professor and was promoted to Associate Professor with tenure in 2012.

Discover and quantify the performance gains of dynamic parallelism for clustering algorithms on GPUs. Dynamic parallelism effectively eliminates the superfluous back and forth communication between the GPU and CPU through nested kernel computations. The change in performance is measured using two well-known clustering algorithms that exhibit data dependencies: the K-means clustering and the hierarchical clustering. K-means has a sequential data dependence wherein iterations occur in a linear fashion, while the hierarchical clustering has a tree-like dependence that produces split tasks. Analyzing the performance of these data-dependent algorithms gives us a better understanding of the benefits or potential drawbacks of CUDA 5's new dynamic parallelism feature.
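
The mechanism being measured can be shown in a few lines. Below is a minimal sketch of dynamic parallelism (hypothetical kernels, not the study's code; requires compute capability 3.5+ and compilation with -rdc=true): a parent kernel launches its own child grids, so the iteration loop never returns to the CPU.

    // Child: one step of some clustering-style update (stand-in body).
    __global__ void refine_step(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 0.99f;  // placeholder for the real update
    }

    // Parent: launched as cluster_driver<<<1, 1>>>(...). The nested launches
    // replace the usual CPU-side loop of kernel launches, eliminating the
    // back-and-forth communication the abstract refers to.
    __global__ void cluster_driver(float *data, int n, int iters)
    {
        for (int it = 0; it < iters; ++it) {
            refine_step<<<(n + 255) / 256, 256>>>(data, n);
            cudaDeviceSynchronize(); // device-side wait for the child grid
        }
    }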

Session Level: Beginner
Session Type: Talk
Tags: Numerical Algorithms & Libraries; Supercomputing

Day: Tuesday, 03/25
Time: 15:30 - 15:55
Location: Room LL21D

S4392 - From CPU to GPU: Optimizing 3D Depthmap and Filtering

Tim Droz ( VP & GM North America, SoftKinetic )
Highly-Rated Speaker
Tim heads the SoftKinetic U.S. organization delivering 3D time-of-flight (TOF) and gesture solutions to international customers such as Intel and Texas Instruments. Prior to SoftKinetic, Tim was VP Platform Engineering and head of the Entertainment Solutions Business Unit (ESBU) at Canesta (acquired by Microsoft), where he worked on the Kinect 2 TOF sensor. Tim's pioneering work extends into all aspects of the gesture and 3D ecosystem including 3D sensors, gesture-based middleware and applications.

The advent of 3D technologies has created a particular strain on processing resources for embedded platforms such as Tegra. 3D depthmap generation and filtering have traditionally been processed on the CPU, but by offloading these capabilities to the much more robust GPU, up to a quarter of the bottleneck created in processing 3D images can be eliminated. In this session, learn how utilizing the GPU to free up processing resources allows for a leaner, faster developer experience. We will also discuss how to manage and overcome the most difficult part of GPU processing: synchronization.

Session Level: All
Session Type: Talk
Tags: Mobile Summit; Computer Vision; Numerical Algorithms & Libraries; Programming Languages & Compilers; Recommended Press Session – Mobile

Day: Tuesday, 03/25
Time: 15:30 - 15:55
Location: Room 210E

S4570 - Accelerating 3D Reconstruction from Range Images with a Novel Cyclic Scheme

Christopher Schroers ( Research Assistant, Saarland University )
Christopher Schroers received an M.Sc. degree in Visual Computing from Saarland University, Saarbrücken, Germany in 2011. Since then he has been a research assistant in the Mathematical Image Analysis Group, Saarland University.

Attend this session to get a deep understanding of variational range image integration methods. Such approaches are able to deal with a substantial amount of noise and outliers, while regularizing and thus creating smooth 3D reconstructions. See that incorporating a new direction-dependent smoothing behavior yields better control of the smoothing with respect to the local structure of the unknown surface, and thus state-of-the-art results. Also, learn how the integration can be accelerated with a novel and generic cyclic scheme named Fast Jacobi. Fast Jacobi is essentially a modified Jacobi over-relaxation (JOR) method, where the relaxation parameter is not fixed but varied in a cyclic way. Because of this, Fast Jacobi is much more efficient than JOR but still as simple to implement and perfectly suited for parallelization. Furthermore, the Fast Jacobi scheme is also applicable to a large range of other PDE-based image analysis problems.
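
For readers who want the scheme in code, here is a minimal sketch of one JOR sweep with a cyclically varied relaxation parameter (dense matrix for brevity; the cycle values are placeholders, not the talk's parameters):

    // One Jacobi over-relaxation sweep:
    //   x_new[i] = (1 - w) * x[i] + (w / a_ii) * (b[i] - sum_{j != i} a_ij * x[j])
    __global__ void jor_sweep(const double *A, const double *b,
                              const double *x_old, double *x_new,
                              int n, double w)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        double sigma = 0.0;
        for (int j = 0; j < n; ++j)
            if (j != i) sigma += A[(size_t)i * n + j] * x_old[j];
        x_new[i] = (1.0 - w) * x_old[i] + w * (b[i] - sigma) / A[(size_t)i * n + i];
    }

    // Host loop: Fast Jacobi's point is that w follows a fixed cycle
    // (placeholder values below) instead of staying constant.
    // double omega[4] = {0.6, 1.2, 1.8, 0.9};
    // for (int k = 0; k < iters; ++k) {
    //     jor_sweep<<<grid, block>>>(A, b, x, x_next, n, omega[k % 4]);
    //     std::swap(x, x_next);   // ping-pong the solution buffers
    // }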

Session Level: Beginner
Session Type: Talk
Tags: Computer Vision; Numerical Algorithms & Libraries

Day: Tuesday, 03/25
Time: 15:30 - 15:55
Location: Room 212B

S4142 - Easy Multi-GPU Programming with CUDArrays

Javier Cabezas ( Ph.D. Student, Barcelona Supercomputing Center )
Javier Cabezas is a Ph.D. student in the Computer Architecture department at Universitat Politècnica de Catalunya (UPC) and works as a researcher at the Barcelona Supercomputing Center (BSC). He received his B.S. and M.S. degrees in computer science from UPC in 2006 and 2008, respectively. He has extensive experience in GPU programming and has participated in the development of other products that ease development for GPUs, such as GMAC.

Learn how to boost your productivity with CUDArrays. CUDArrays is a user-level library that eases the development of CUDA programs by offering a multi-dimensional array data type that can be used both in host and device code. This data type relieves programmers of the burden of managing multi-dimensional arrays through flat "C"-style memory allocations. Moreover, in systems with several GPUs and P2P memory access support, CUDArrays transparently distributes the computation across several GPUs. Using data access pattern information provided by the compiler, the runtime automatically determines how to partition (or replicate) the arrays to minimize the number of accesses to other GPUs' memories. Results show that linear speedups can be achieved in most cases. Examples will be provided for different types of scientific computations.
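
The flavor of such an array type can be suggested with a small sketch (a hypothetical stand-in written for this listing, not the actual CUDArrays interface): indexing goes through operator() on both host and device, instead of flat pointer arithmetic.

    // Minimal 2D device-array wrapper in the spirit of the library.
    struct Array2D {
        float *data;
        int ny, nx;
        __host__ __device__ float &operator()(int i, int j) const
        { return data[(size_t)i * nx + j]; }
    };

    __global__ void jacobi_step(Array2D in, Array2D out)
    {
        int i = blockIdx.y * blockDim.y + threadIdx.y;
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < 1 || j < 1 || i >= in.ny - 1 || j >= in.nx - 1) return;
        // (i, j) indexing replaces error-prone in[i * nx + j] arithmetic;
        // a library owning this type is also free to change or distribute
        // the underlying storage across GPUs without touching the kernel.
        out(i, j) = 0.25f * (in(i - 1, j) + in(i + 1, j) +
                             in(i, j - 1) + in(i, j + 1));
    }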

Session Level: Beginner
Session Type: Talk
Tags: Numerical Algorithms & Libraries

Day: Tuesday, 03/25
Time: 16:00 - 16:25
Location: Room LL21D

S4260 - Crash, Boom, Bang! Leveraging Game Physics and Graphics APIs for Scientific Computing

Peter Messmer ( Senior HPC DevTech Engineer, NVIDIA )
Highly-Rated Speaker
Peter Messmer
Peter Messmer joined NVIDIA in 2011 after spending more than 15 years developing HPC and GPU-accelerated applications for industry and government clients, mainly in the area of plasma and EM simulations, data analysis and visualization. In his role as senior devtech engineer at NVIDIA, Peter is working with HPC users around the globe, supporting them in accelerating their scientific discovery process by taking advantage of GPUs in their applications. Peter holds an MSc and PhD in Physics from ETH Zurich, Switzerland, with specialization in kinetic plasma physics and nonlinear optics.

In this talk, you will learn how to use the game and visualization wizard's tool chest to accelerate your scientific computing applications. NVIDIA's game physics engine PhysX and the ray tracing framework OptiX offer a wealth of functionality often needed in scientific computing applications. However, due to the different target audiences, these frameworks are generally not very well known to the scientific computing communities. High-frequency electromagnetic simulations, particle simulations in complex geometries, or discrete element simulations are all examples of applications that could immediately benefit from these frameworks. Based on examples, we will talk about the basic concepts of these frameworks, introduce their strengths and their approximations, and show how to take advantage of them from within a scientific application.

Session Level: Intermediate
Session Type: Talk
Tags: Computational Physics; Supercomputing; Numerical Algorithms & Libraries; Recommended Press Session – HPC-Science

Day: Tuesday, 03/25
Time: 16:00 - 16:50
Location: Room 212A

S4656 - Machine Learning with GPUs: Fast Support Vector Machines without the Coding Headaches

Stephen Tyree ( Graduate Student, Washington University in St. Louis )
Stephen Tyree is a PhD student in the Department of Computer Science and Engineering at Washington University in St. Louis. He holds a Bachelors degree in computer science and mathematics and a Masters degree in computer science, both from the University of Tulsa. His research focuses on parallel and approximate methods for fast and scalable machine learning.

Speeding up machine learning algorithms has often meant tedious, bug-ridden programs tuned to specific architectures, all written by parallel programming amateurs. But machine learning experts can leverage libraries such as cuBLAS to greatly ease the burden of development and make fast code widely available. We present a case study in parallelizing kernel Support Vector Machines, powerful machine-learned classifiers which are very slow to train on large data. In contrast to previous work which relied on hand-coded exact methods, we demonstrate that a recent approximate method can be compelling for its remarkably simple implementation, portability, and unprecedented speedup on GPUs.
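
As an illustration of the "leverage cuBLAS" point (a sketch under assumptions, not the speakers' code): the expensive core of kernel-SVM training is the Gram matrix, and with the n x d data matrix X stored row-major on the device, all n^2 inner products come from a single SGEMM; an RBF kernel then needs only an elementwise pass over the result, K_ij = exp(-gamma * (G_ii + G_jj - 2 G_ij)).

    #include <cublas_v2.h>

    // G = X * X^T (n x n) for row-major X (n samples, d features).
    // cuBLAS is column-major, so row-major X is seen as a d x n matrix A
    // whose columns are the samples; G = A^T * A gives the inner products.
    void gram_matrix(cublasHandle_t h, const float *X, float *G, int n, int d)
    {
        const float one = 1.0f, zero = 0.0f;
        cublasSgemm(h, CUBLAS_OP_T, CUBLAS_OP_N,
                    n, n, d,
                    &one, X, d,   // A^T: n x d
                          X, d,   // A:   d x n
                    &zero, G, n); // G:   n x n
    }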

Session Level: Intermediate
Session Type: Talk
Tags: Machine Learning & AI; Big Data Analytics & Data Algorithms; Numerical Algorithms & Libraries

Day: Tuesday, 03/25
Time: 16:00 - 16:25
Location: Room LL21B

S4307 - GPU Accelerated Parallel Simulated Annealing for Fitting Molecular Dynamics Potentials

Pierre-Yves Taunay ( Research Programmer, The Pennsylvania State University )
Pierre-Yves Taunay obtained his Master of Science in Aerospace Engineering from The Pennsylvania State University in 2012, along with a "General Engineer" degree from the Ecole Centrale de Nantes, France. He has since been working at the Research Computing and Cyberinfrastructure unit at The Pennsylvania State University as a Research Programmer. His current research focuses on high performance computing for large-scale engineering and scientific applications such as molecular dynamics, fluid dynamics, and plasma physics, using graphics processing units (GPUs) and the Message Passing Interface (MPI) standard.

This work presents a parallel simulated annealing implementation for fitting molecular dynamics potentials. In our implementation, each GPU is given a random set of Lennard-Jones parameters sigma and epsilon, and performs separately a molecular dynamics simulation. A derived quantity, the structure factor, is then compared to experimental data and determines the quality of the fitting parameters. Information about the best fit is exchanged across GPUs at a fixed number of iterations. The choice of random parameters is then restarted in the vicinity of the best parameter set. Using GPUs, a larger parameter set can be explored in a given time as molecular dynamics simulations benefit greatly from GPU acceleration.

Session Level: Beginner
Session Type: Talk
Tags: Molecular Dynamics; Numerical Algorithms & Libraries; Supercomputing

Day: Tuesday, 03/25
Time: 16:30 - 16:55
Location: Room LL21E

S4586 - Reasoning About Memory Performance Using Index-Digit Notation

Brandon Lloyd ( Software Engineer, NVIDIA )
Brandon got his Ph.D. at the University of North Carolina at Chapel Hill for his work on shadows. He worked at Microsoft Research for several years doing GPGPU work, including contributions to the DirectX 11 FFT library. He now works in the OptiX group at NVIDIA on GPU ray tracing.

Achieving good memory performance in CUDA for algorithms on arrays with non-trivial access patterns, such as transpose or FFT, requires careful attention to shared memory bank conflicts, global memory coalescing, and on older GPUs, partition camping. Thinking about memory performance issues in the native multi-dimensional problem domain can sometimes be challenging. Index-digit notation provides an abstract representation of memory access patterns that can make reasoning about solutions to memory performance issues easier. In this session learn how to resolve bank conflicts, coalescing, and partition camping by performing simple transformations in index-digit notation. Applications to transpose and FFT will be discussed.
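
A standard example of the kind of fix the notation describes (the usual shared-memory tile transpose, included here as background rather than material from the talk): padding the tile by one element re-maps which bank each row's elements fall into, removing the conflicts on the transposed read.

    #define TILE 32

    __global__ void transpose(const float *in, float *out, int width, int height)
    {
        __shared__ float tile[TILE][TILE + 1]; // +1 pad: conflict-free columns

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read
        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;  // transposed block origin
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
    }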

Session Level: Intermediate
Session Type: Talk
Tags: Numerical Algorithms & Libraries

Day: Tuesday, 03/25
Time: 16:30 - 16:55
Location: Room LL21D

S4155 - Portability and Performance: A Functional Language for Stencil Operations

Gerhard Zumbusch ( Professor, Friedrich-Schiller Universität Jena )
Gerhard is a professor of scientific computing at the mathematics department of the University of Jena. He works on efficient numerical algorithms, parallel computing and the solution of partial differential equations. He graduated from TU München, received his PhD from FU Berlin and his habilitation from the University of Bonn.

A new programming language designed for stencil operations in explicit finite difference and image processing applications is introduced. Learn to use a small domain-specific functional language. It allows for a short and portable way to express numerical schemes. Objects are immutable functions without storage type and side effects. The results are independent of the order of instructions and of decisions to redundantly re-compute partial results. The scheduling of instructions, the storage layout, the partition into GPU kernels and the memory management are all left to the compiler. Learn about the parallel patterns used by the compiler to create high-performance implementations of the numerical scheme for a specific problem size and hardware configuration. These include data layout for effective vectorization, strategies to re-compute or cache intermediate results, sliding-window and space-time tiling of the iteration space, and list-scheduling to create code blocks for off-loading: strategies which are useful in general.

Session Level: Intermediate
Session Type: Talk
Tags: Programming Languages & Compilers; Numerical Algorithms & Libraries

Day: Tuesday, 03/25
Time: 17:00 - 17:25
Location: Room LL20D

S4163 - Heterogeneous CPU+GPU Molecular Dynamics Engine in CHARMM

Antti-Pekka Hynninen ( Research Scientist, National Renewable Energy Laboratory )
Antti-Pekka Hynninen is a research scientist at the National Renewable Energy Laboratory where his main focus is the software development of the CHARMM Molecular Dynamics (MD) engine. Antti-Pekka recently rewrote the CHARMM MD engine core functions to be 2x faster than the previous version and he implemented a modern domain decomposition algorithm in CHARMM that enables parallel simulations on hundreds of CPU cores. Prior to joining NREL, Antti-Pekka was a post-doctoral researcher at Princeton University where he did research on Monte Carlo simulations of charged colloids. Antti-Pekka holds a PhD in physics from Utrecht University, The Netherlands.

This presentation provides a first glimpse of a heterogeneous CPU+GPU Molecular Dynamics (MD) engine in CHARMM. In the MD engine, the GPU is used for the calculation of the direct part of the non-bonded force calculation, while the CPU takes care of the rest of the work (reciprocal force calculation, bonded force calculation, integration, etc.). The MD engine is built around the CHARMM domain decomposition code enabling massively parallel MD simulations on multiple CPU+GPU nodes. The new MD engine outperforms the CPU code by a factor of 8 or more.

Session Level: Beginner
Session Type: Talk
Tags: Molecular Dynamics; Supercomputing; Computational Physics; Numerical Algorithms & Libraries

Day: Tuesday, 03/25
Time: 17:00 - 17:25
Location: Room LL21E

S4664 - In-Place Array Transposition and Fast Array of Structure Accesses

Bryan Catanzaro ( Senior Research Scientist, NVIDIA )
Bryan holds a Ph.D. from UC Berkeley and works in the Programming Systems and Applications Research group at NVIDIA, where he focuses on machine learning algorithms for GPUs.

We'll present a new algorithm for in-place array transposition. The algorithm is useful for in-place transposes of large matrices, as well as in-place conversions between Arrays of Structures and Structures of Arrays. The simple structure of this algorithm enables full memory bandwidth accesses to Arrays of Structures. We'll discuss the algorithm, as well as several implementations on GPUs and CPUs.
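
For background on why the conversion matters (an illustrative sketch, not the presented in-place algorithm): accessing one field of an array of structures is strided, while the structure-of-arrays form is fully coalesced, and an out-of-place conversion is just a 3 x n transpose. Doing this without the second buffer is what the new algorithm provides.

    struct Particle { float x, y, z; };   // array of structures (AoS)

    __global__ void scale_x_aos(Particle *p, int n, float s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i].x *= s;   // stride-12-byte access pattern
    }

    __global__ void scale_x_soa(float *x, int n, float s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= s;     // consecutive threads, consecutive words
    }

    // Out-of-place AoS -> SoA: element i of the flat AoS buffer belongs to
    // structure i/3, field i%3; SoA stores field f contiguously at offset f*n.
    __global__ void aos_to_soa(const float *aos, float *soa, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < 3 * n) soa[(i % 3) * n + i / 3] = aos[i];
    }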

Session Level: Intermediate
Session Type: Talk
Tags: Numerical Algorithms & Libraries

Day: Tuesday, 03/25
Time: 17:00 - 17:25
Location: Room LL21D

S4188 - How to Avoid Global Synchronization by Domino Scheme

Lung-Sheng Chien ( Software Engineer, NVIDIA )
Lung-Sheng Chien is a software engineer at NVIDIA, working on the cuBLAS and cuSPARSE libraries. Prior to NVIDIA, he was a Ph.D. student in the Department of Mathematics at National Tsing Hua University. He received his B.S. and M.S. from the Department of Computer Science at National Tsing Hua University in 2003 and 2005, respectively.

Learn how to trace a data dependence graph without global synchronization. Such a dependence graph can be built from a sparse triangular solve, incomplete Cholesky factorization or incomplete LU factorization. We will address several issues, including: 1) how to reproduce the result without atomic operations; 2) how to use one kernel to track the data dependence graph; 3) how to keep the working space small, because the GPU has limited device memory; and 4) the penalty of warp size on this latency-sensitive application.
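
The flavor of such a scheme can be sketched for a sparse lower-triangular solve (a deliberately simplified illustration, not the presented implementation): each row spins on per-row completion flags instead of waiting at a global barrier between dependence levels. Guaranteeing that a row's producers are actually resident and make progress, so that the spin cannot deadlock, is exactly the kind of issue a production scheme has to address.

    // CSR storage, entries in each row ordered by column, diagonal last.
    __global__ void lower_tri_solve(const int *rowptr, const int *colidx,
                                    const double *val, const double *b,
                                    double *x, volatile int *done, int n)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n) return;
        double sum = b[row];
        for (int p = rowptr[row]; p < rowptr[row + 1] - 1; ++p) {
            int col = colidx[p];
            while (done[col] == 0) ;        // spin until x[col] is final
            sum -= val[p] * x[col];
        }
        x[row] = sum / val[rowptr[row + 1] - 1];
        __threadfence();                    // publish x[row] before the flag
        done[row] = 1;
    }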

Session Level: Intermediate
Session Type: Talk
Tags: Numerical Algorithms & Libraries

Day: Tuesday, 03/25
Time: 17:30 - 17:55
Location: Room LL21D

S4488 - GooFit: Massively Parallel Likelihood Fitting Using GPUs

Rolf Andreassen ( Postdoctoral Fellow, University of Cincinnati )
Dr. Andreassen's research interests lie mainly in the creation of tools for physics analysis. He began developing software for HEP purposes as a summer student at CERN, writing a C++ API for the ISAJET event-generation package. As a Master's and Ph.D. student he wrote custom modules for analysing BABAR data, and later a fast simulation of the DIRC component of the SuperB experiment. Dr. Andreassen is the designer and lead developer of the GooFit fitting package, and has given courses at the University of Cincinnati and at CERN in CUDA programming and use of GooFit. He is involved with the QuarkNet outreach program, bringing high-school students and teachers to the university to gain experience with HEP theory and research.

We present the GooFit maximum likelihood fit framework, which has been developed to run effectively on general-purpose graphics processing units (GPUs) to enable next-generation experimental high energy physics (HEP) research. Most analyses of data from HEP experiments use maximum likelihood fits. Some of today's analyses use fits which require more than 24 hours on traditional multi-core systems. The next generation of experiments will require computing power two orders of magnitude greater for analyses which are sensitive to New Physics. Our GooFit framework, which has been demonstrated to run on NVIDIA GPU devices ranging from high-end Teslas to laptop GeForce GTs, uses CUDA and the Thrust library to massively parallelize the per-event probability calculation. For realistic physics fits we achieve speedups of several hundred relative to executing the same algorithm on a single CPU.

Session Level: Intermediate
Session Type: Talk
Tags: Computational Physics; Numerical Algorithms & Libraries; Big Data Analytics & Data Algorithms

Day: Tuesday, 03/25
Time: 17:30 - 17:55
Location: Room 212A

S4243 - A GPU Sparse Direct Solver for Ax=b

Jonathan Hogg ( Researcher, Science and Technology Facilities Council (STFC) )
Following completion of his Ph.D. at Edinburgh in 2009, Jonathan joined STFC, where he works with the Numerical Analysis Group of the Scientific Computing Department. He is the author of several high-performance software packages in sparse and dense linear algebra, with a focus on desktop HPC.

The solution of Ax=b for sparse A is one of the core computational kernels ("dwarves") used in scientific computing. While there are many GPU iterative methods libraries available, these can only tackle a limited range of problems due to preconditioning requirements. On the CPU, black-box direct solvers are often the first port of call for more challenging problems; however, little GPU support is present in existing libraries. We present a new direct solver library capable of performing factorization and solve entirely on the GPU for symmetric problems. The talk will cover our solutions to a number of the challenges involved in making this a reality, and present results across a number of application areas including FEM and optimization.

Session Level: Intermediate
Session Type: Talk
Tags: Numerical Algorithms & Libraries

Day: Wednesday, 03/26
Time: 09:00 - 09:25
Location: Room LL20D

S4259 - Optimization of a CUDA-based Monte Carlo Code for Radiation Therapy

Nick Henderson ( Postdoctoral Scholar, Stanford University, Institute for Computational and Mathematical Engineering )
Nick Henderson is a Postdoctoral scholar with the CUDA Center of Excellence in the Institute for Computational and Mathematical Engineering at Stanford University.

Learn about optimization efforts in G4CU, a CUDA Monte Carlo code for radiation therapy. G4CU is based on the core algorithm and physics processes in Geant4, a toolkit for simulating particles traveling through and interacting with matter. The techniques covered will include the use of texture references for look-up tables, device configuration for different simulation components, and scheduling of work for different particle types.
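
As a taste of the first technique, here is a look-up table read through the (era-appropriate) texture-reference path; an illustrative sketch with assumed names, not the G4CU code. Texture reads go through a dedicated cache, which suits the scattered table accesses of a Monte Carlo particle loop.

    // Cross-section look-up table bound to a 1D texture reference.
    texture<float, 1, cudaReadModeElementType> xs_table;

    __global__ void sample_cross_sections(const int *energy_bin, float *xs, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            xs[i] = tex1Dfetch(xs_table, energy_bin[i]); // cached gather
    }

    // Host side, before the launch:
    //   cudaBindTexture(0, xs_table, d_table, n_bins * sizeof(float));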

Session Level: Intermediate
Session Type: Talk
Tags: Computational Physics; Medical Imaging & Visualization; Numerical Algorithms & Libraries

Day: Wednesday, 03/26
Time: 09:00 - 09:25
Location: Room 212A

S4316 - Simulating Generation, Retention and Expulsion of Hydrocarbons on GPUs

Massimo Bernaschi ( Director of Technology, National Research Council of Italy )
Massimo Bernaschi is with CNR, the National Research Council of Italy, as Chief Technology Officer of the Institute for Applied Computing. He is also an Adjunct Professor of Systems Programming at "Sapienza" University in Rome, a trainer in Digital Forensics at "Sapienza" and Modena Universities, and an adviser of the Consortium for Supercomputing Applications (CASPUR). Before joining CNR in 1998, Massimo worked ten years at the IBM European Center for Scientific and Engineering Computing, where he developed the IBM PVMe product and received two Outstanding Technical Achievement Awards. His main scientific interests are parallel computing; modelling of complex systems (finance and biology); systems and network security; and high performance computing. He is the author of about 150 papers in peer-reviewed journals and international conferences. Massimo started working with CUDA back in 2008. He developed the first CUDA implementation of the Lattice Boltzmann method for irregular geometries and has been a pioneer in multi-GPU programming. In 2011 he received an Honorable Mention in the Gordon Bell Award for MUPHY, a multi-physics CUDA code for the simulation of bio-fluids that achieves excellent scalability up to 4,000 GPUs. In 2013 he was again a finalist for the Gordon Bell Award, with a simulation of protein crowding that achieves 20 petaflops on 18,000 GPUs. He has also developed CUDA codes for spin-system simulations, dictionary and brute-force attacks on cryptosystems, signal processing and simulation of soft matter. In 2012 Massimo was named a "CUDA Fellow".

Learn how to use GPUs as batch processors to simulate thousands of independent systems having complex dynamics but relatively limited computing requirements. By using an apparently naive approach, with a single CUDA thread simulating an entire system, it is possible to obtain excellent global performance and minimize, at the same time, the differences in the results with respect to the original, serial implementation of the same application. Crucial to the success of the porting is a proper choice of the data structures, which need to be designed so that the global memory of the GPU can be accessed effectively even if threads work on distinct problems. The application we present simulates products of primary migration and the expulsion of hydrocarbons from source rock, but the idea can be applied to other fields. The final result in our case is a highly scalable code that runs transparently on multiple GPUs and that can be more easily updated when the underlying model changes.
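
The pattern in miniature (an illustrative sketch with an assumed state layout, not the actual simulator): one thread owns one system, and the state array is stored parameter-major so that threads working on different systems still issue coalesced loads and stores, which is the data-structure point made above.

    // state is laid out as [n_params][n_systems]: for a fixed parameter p,
    // the values for all systems are contiguous in memory.
    __global__ void simulate_batch(double *state, int n_systems,
                                   int n_params, int n_steps)
    {
        int s = blockIdx.x * blockDim.x + threadIdx.x; // one thread = one system
        if (s >= n_systems) return;
        for (int t = 0; t < n_steps; ++t)
            for (int p = 0; p < n_params; ++p) {
                // consecutive threads touch consecutive addresses
                double v = state[(size_t)p * n_systems + s];
                v += 1e-3 * (1.0 - v);   // stand-in for the real dynamics
                state[(size_t)p * n_systems + s] = v;
            }
    }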

Session Level: Intermediate
Session Type: Talk
Tags: Energy Exploration; Numerical Algorithms & Libraries; Computational Physics

Day: Wednesday, 03/26
Time: 09:30 - 09:55
Location: Room LL20B

S4524 - Sparse LU Factorization on GPUs for Accelerating SPICE Simulation

Xiaoming Chen ( PhD candidate, Tsinghua University )
Xiaoming Chen received his BS degree from the Department of Electronic Engineering, Tsinghua University, in 2009, where he is currently working towards a Ph.D. His research interests include parallel EDA algorithms on CPUs and GPUs, and low-power and reliability design methodologies for VLSI. He has published 17 papers and received 3 best-paper nominations.

Simulation Program with Integrated Circuit Emphasis (SPICE) simulators are widely used for transistor-level simulation in IC design and verification. The time cost of SPICE simulators is dominated by two parts: MOSFET model evaluation and the sparse linear solver. This session will talk about our work on GPU-based sparse LU factorization which is specially designed for SPICE simulation. In particular, we will introduce the challenges of mapping a sparse solver onto a GPU, our parallelization strategies of sparse LU factorization, and performance optimization approaches. Experimental results will be presented and discussed as well.

Session Level: Intermediate
Session Type: Talk
Tags: Numerical Algorithms & Libraries; Electronic Design Automation

Day: Wednesday, 03/26
Time: 09:30 - 09:55
Location: Room LL20D

S4373 - Enhanced Oil Recovery Simulation Performances on New Hybrid Architectures

Thomas Guignon ( Research Engineer, IFP Energies Nouvelles )
Thomas Guignon has been a research engineer in computer science at IFP Energies Nouvelles since 2005. He received his Ph.D. in computer science in 2003 while working at an HPC company on software and hardware management of Linux clusters for HPC. His research work at IFP Energies Nouvelles focuses on improving simulation software performance, parallel linear solvers and GPU computing.

The goal of this session is to show that GPU linear solvers with highly parallel preconditioners can compete with the most advanced ones (CPR-AMG) using a classical MPI-based programming model in the context of reservoir simulation.

Session Level: Intermediate
Session Type: Talk
Tags: Energy Exploration; Numerical Algorithms & Libraries

Day: Wednesday, 03/26
Time: 10:00 - 10:25
Location: Room LL20B

S4486 - Raising the Roofline on GPU Applications with Stacked Memory

Lorena Barba ( Associate Professor of Mechanical and Aerospace Engineering, George Washington University )
Highly-Rated Speaker
Lorena A. Barba is Associate Professor of Mechanical and Aerospace Engineering at the George Washington University, in Washington DC. She has MSc and PhD degrees in Aeronautics from the California Institute of Technology and BSc and PEng degrees in Mechanical Engineering from Universidad Técnica Federico Santa María in Chile. Prior to joining GW, she was Assistant Professor of Mechanical Engineering at Boston University (2008–2013) and Lecturer/Senior Lecturer of Applied Mathematics at University of Bristol, UK (2004–2008). Prof. Barba is an Amelia Earhart Fellow of the Zonta Foundation (1999), a recipient of the EPSRC First Grant program (UK, 2007), an NVIDIA Academic Partner award recipient (2011), and a recipient of the NSF Faculty Early CAREER award (2012). She was appointed CUDA Fellow by NVIDIA Corporation (2012) and is an internationally recognized leader in computational science and engineering.

GPU applications face three potential bottlenecks: instruction throughput, memory throughput and latency. Sometimes we can refactor the algorithm to improve performance after profiling. Another approach is to use the roofline model to analyze computational kernels and identify performance limitations on specific hardware. Such analysis characterizes many important scientific algorithms as memory-bound when running on GPUs. But as we look forward to new generations endowed with stacked DRAM, we see the roof magically lifting due to reduced latencies and higher bandwidths, leading to unprecedented speed-up factors in memory-bound algorithms. With my co-author Manuel Ujaldon, NVIDIA CUDA Fellow and Professor of Computer Architecture at the University of Malaga (Spain), we are looking at how scientific algorithms may benefit from the stacked DRAM of future GPU generations. In this talk, I will present how we characterize GPU application performance via the roofline model and analyze the contribution of stacked DRAM to anticipate its impact in raising performance ceilings in future GPUs like Volta.

Session Level: Intermediate
Session Type: Talk
Tags: Numerical Algorithms & Libraries; Supercomputing

Day: Wednesday, 03/26
Time: 10:00 - 10:25
Location: Room LL20D

S4173 - Fast Evaluation of the Inverse Poisson Cumulative Distribution Function

Mike Giles ( Professor of Scientific Computing, University of Oxford )
Mike Giles
Prior to joining the faculty of Oxford University, Mike was an Assistant/Associate Professor at MIT. At Oxford, Mike is also a CUDA Fellow and Director of the Oxford University CUDA Center of Excellence.

The inverse of the Poisson cumulative distribution function maps uniformly-distributed random numbers to Poisson random variates. This talk describes a fast implementation for GPUs which is based on some novel approximations of the inverse of the closely-related incomplete gamma function for the case of large Poisson rates. Both single-precision and double-precision versions have been developed, and in each case the computational cost is not much more than the cost of the corresponding function for inverting the Normal cumulative distribution function. The software is freely available as open source from http://people.maths.ox.ac.uk/gilesm/poissinv/
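For orientation (the talk's fast approximations are the whole point and are not reproduced here), the textbook inversion that such methods replace simply accumulates the Poisson probability mass function until the CDF exceeds the uniform input; a minimal CUDA sketch, practical only for modest rates:

    // Textbook inverse-CDF map from a uniform u in (0,1) to a
    // Poisson(lambda) variate, summing p_k = exp(-lambda) lambda^k / k!
    // via the recurrence p_k = p_{k-1} * lambda / k. The O(lambda) loop
    // is exactly what fast approximations avoid for large rates.
    __device__ int poisson_inv_naive(double u, double lambda)
    {
        double p = exp(-lambda);    // p_0
        double cdf = p;
        int k = 0;
        while (u > cdf) {
            ++k;
            p *= lambda / k;
            cdf += p;
        }
        return k;
    }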

Session Level: Intermediate
Session Type: Talk
Tags: Numerical Algorithms & Libraries

Day: Wednesday, 03/26
Time: 10:30 - 10:55
Location: Room LL20D

S4201 - GPU Acceleration of Sparse Matrix Factorization in CHOLMOD

Steven Rennich ( Senior HPC Developer Technology Engineer, NVIDIA )
Highly-Rated Speaker
Steven Rennich
Steven Rennich is a Sr. NVIDIA HPC Developer Technology Engineer. His primary activities include promoting the use of GPUs in computational structural mechanics and the development and optimization of parallel algorithms for direct and iterative solvers for sparse linear systems. Steve holds a Ph.D. in Aeronautics and Astronautics from Stanford University where his research involved computational fluid mechanics and vortex system instabilities. Prior to joining Nvidia, Steve spent many years parallelizing structural analysis and rigid body dynamics codes.
Tim Davis ( Professor, University of Florida )
Tim Davis
Tim Davis is a professor in Computer and Information Science and Engineering at the University of Florida. He is a Fellow of the Society of Industrial and Applied Mathematics (SIAM), in recognition of his work on sparse matrix algorithms. His software for sparse direct methods appears in hundreds of applications in industry, academia, and government labs, including MATLAB (x=A\b), Mathematica, NASTRAN, Cadence, Mentor Graphics, Google Ceres (StreetView, PhotoTours), IBM, Berkeley Design Automation, Xyce, and many others. For a full CV, see http://www.cise.ufl.edu/~davis/background.html .

Sparse direct solvers, and their requisite factorization step, are a critical component of computational engineering and science codes. High performance is typically achieved by reducing the sparse problem to dense sub-problems and applying dense math kernels. However, achieving high performance on a GPU is complicated due to the range of sizes of the dense sub-problems, irregular memory access patterns, and the limited communication bandwidth between the host system and the GPU. This talk will describe the high factorization performance achieved in CHOLMOD using the GPU and discuss in detail key techniques used to achieve this performance including minimizing communication and maximizing concurrency.
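The "dense sub-problems" strategy typically amounts to pushing supernodal updates through dense BLAS calls; a hedged sketch of that pattern using cuBLAS (illustrative of the general approach, not CHOLMOD's actual code; the block arrays and sizes are hypothetical):

    #include <cublas_v2.h>

    // Illustrative: apply a batch of supernodal updates C_i -= A_i * B_i^T
    // as dense GEMMs, rotating over a few CUDA streams so that small
    // updates overlap on the device.
    void supernode_updates(cublasHandle_t handle, cudaStream_t streams[4],
                           double* A[], double* B[], double* C[],
                           const int m[], const int n[], const int k[],
                           int batch)
    {
        const double alpha = -1.0, beta = 1.0;
        for (int i = 0; i < batch; ++i) {
            cublasSetStream(handle, streams[i % 4]);
            cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                        m[i], n[i], k[i], &alpha,
                        A[i], m[i],   // A_i is m x k
                        B[i], n[i],   // B_i is n x k, used transposed
                        &beta, C[i], m[i]);
        }
    }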

Session Level: Intermediate
Session Type: Talk
Tags: Numerical Algorithms & Libraries; Big Data Analytics & Data Algorithms; Computational Structural Mechanics

Day: Wednesday, 03/26
Time: 14:00 - 14:50
Location: Room LL20D

S4599 - An Adventure in Porting: Adding GPU-Acceleration to Open-Source 3D Elastic Wave Modeling

Robin Weiss ( Research Programmer, The University of Chicago )
Robin Weiss
Robin has a strong background in a wide range of applied computer science topics, including swarm intelligence and evolutionary algorithms, inverse problems, and scientific visualization. He currently works at the University of Chicago's Research Computing Center, providing technical consulting services to researchers. Robin has 5 years of experience with CUDA and heterogeneous high-performance computing.

In this session we will describe our experience porting finite-difference time-domain (FDTD) algorithms for solving 3D anisotropic elastic wave equations to GPU, and extending the implementation to support clusters of GPU-equipped compute nodes. These implementations have been integrated with the open-source Madagascar seismic processing software package to allow for accelerated computation of 3D anisotropic elastic wave models. In our work we adopt a straightforward porting strategy that leads to a transparent yet high-performance implementation suitable for mid-sized computational grids. The approach is based on a stress-stiffness formulation on a non-staggered grid and achieves significant speedup compared to a parallel CPU-based implementation allowing for computation of seismic data at lower hardware cost and in less time than was previously possible. We also report details of our implementation strategy as well as performance evaluations in varied heterogeneous compute environments with a number of different GPU architectures.
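To give a flavor of what a "straightforward porting strategy" looks like for an FDTD update, here is a minimal explicit time step for a scalar wave equation; a toy stand-in for the talk's anisotropic elastic solver, with illustrative names throughout:

    // One FDTD step: u_next = 2u - u_prev + c2 * laplacian(u) on an
    // nx x ny x nz grid, one thread per interior grid point.
    __global__ void fdtd_step(const float* u, const float* u_prev,
                              float* u_next, int nx, int ny, int nz, float c2)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int k = blockIdx.z * blockDim.z + threadIdx.z;
        if (i < 1 || j < 1 || k < 1 || i >= nx - 1 || j >= ny - 1 || k >= nz - 1)
            return;

        int idx = (k * ny + j) * nx + i;
        float lap = u[idx - 1] + u[idx + 1]
                  + u[idx - nx] + u[idx + nx]
                  + u[idx - nx * ny] + u[idx + nx * ny]
                  - 6.0f * u[idx];
        u_next[idx] = 2.0f * u[idx] - u_prev[idx] + c2 * lap;
    }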

Session Level: Intermediate
Session Type: Talk
Tags: Energy Exploration; Computational Physics; Scientific Visualization; Numerical Algorithms & Libraries

Day: Wednesday, 03/26
Time: 14:00 - 14:25
Location: Room LL20B

S4500 - Accelerating Particle-Mesh Interaction for Particle-in-Cell Simulation

Alberto Madonna ( Graduate Student, University of Padova )
Alberto Madonna
Alberto Madonna obtained his Bachelor's Degree in Aerospace Engineering at the University of Padova, with the thesis "Implementation of a Sparse Bundle Adjustment Algorithm on an NVIDIA CUDA GPU", under the supervision of Professor Stefano Debei. Currently, Alberto attends the laboratory course in Aerospace Propulsion at the University of Padova, where he focuses on numerical simulation of plasma propulsion. He is expected to obtain his Master's degree in Aerospace Engineering at the University of Padova in October 2013, with the thesis "Development of Software Modules for Simulating on GPU the Dynamics of Plasma for Electrical Spacecraft Propulsion", under the supervision of Professor Daniele Pavarin and Dr. Marco Manente. He is proficient in MATLAB, C/C++ and CUDA C, and has been programming GPUs since 2009.

We present an innovative GPU implementation of a Particle-in-Cell code for plasma dynamics simulation on 3-D unstructured grids. Starting from a proven codebase, we integrate solutions and ideas coming from a thorough study of the state of the art in parallel plasma simulation and other fields, adding original contributions in areas such as workload management, particle ordering and domain decomposition. The result is a novel, flexible simulation pipeline, capable of performing more than an order of magnitude faster than the CPU implementation it originates from, while still presenting exciting opportunities for future developments. Moreover, all the concepts presented are applicable not only to Particle-in-Cell simulation, but in general to any simulation relying on the interaction between Lagrangian particles and a spatial grid.
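The particle-mesh interaction at the heart of any PIC code is the deposition step, where many particles scatter contributions onto shared grid nodes. One common GPU formulation (shown only as background; the speakers describe their own ordering and decomposition schemes) resolves the write conflicts with atomics:

    // Nearest-grid-point charge deposition: one thread per particle,
    // atomically accumulating its charge into the containing cell.
    // Sorting particles by cell first reduces atomic contention, which
    // is one reason particle ordering matters so much in PIC codes.
    __global__ void deposit_charge(const float3* pos, const float* q,
                                   float* rho, int n, float cell,
                                   int nx, int ny)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= n) return;
        int i = (int)(pos[p].x / cell);
        int j = (int)(pos[p].y / cell);
        int k = (int)(pos[p].z / cell);
        atomicAdd(&rho[(k * ny + j) * nx + i], q[p]);
    }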

Session Level: Intermediate
Session Type: Talk
Tags: Computational Physics; Numerical Algorithms & Libraries

Day: Wednesday, 03/26
Time: 14:30 - 14:55
Location: Room 212A

S4510 - A Parallel GPU Solution to the Maximal Clique Enumeration Problem for CBIR

Christopher Henry ( Assistant Professor, University of Winnipeg )
Christopher Henry
Christopher received his Ph.D. from the Department of Electrical and Computer Engineering, University of Manitoba, in 2011. He currently holds a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant.

The focus of this talk is on a parallel GPU solution to the Maximal Clique Enumeration (MCE) problem, which is commonly solved with a depth-first search method referred to as the backtracking paradigm. The solution to this problem is an outgrowth of work investigating an efficient method for finding all tolerance classes on a set of objects. Recently, the problem of finding tolerance classes has been shown to be the same as the MCE problem. Tolerance classes are sets in which every pair of objects satisfies the tolerance relation and the set is maximal with respect to inclusion. Finding such classes is a computationally complex problem with many application areas (e.g., genomics and social media). In particular, this talk will focus on content-based image retrieval (CBIR) involving sets of objects with similar features. In the proposed application to CBIR, classes in image covers determined by a tolerance relation provide the content used for CBIR.
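For reference, the standard definition (assuming feature vectors φ, a Euclidean norm, and a threshold ε) is

    x \,\tau_\varepsilon\, y \;\Longleftrightarrow\; \|\phi(x) - \phi(y)\|_2 \le \varepsilon,

so a tolerance class is a set C in which every pair satisfies τ_ε and which is maximal with respect to inclusion. Viewing objects as vertices and τ_ε as edges, tolerance classes are precisely the maximal cliques, which is the equivalence the talk builds on.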

Session Level: Intermediate
Session Type: Talk
Tags: Video & Image Processing; Numerical Algorithms & Libraries

Day: Wednesday, 03/26
Time: 14:30 - 14:55
Location: Room LL21A

S4289 - Efficient Solution of Multiple Scalar and Block-Tridiagonal Equations

Endre László ( Ph.D. student, University of Oxford, Oxford e-Research Center )
Endre László is a visiting Ph.D. student at the University of Oxford, Oxford e-Research Center, under the supervision of Prof. Michael B. Giles. He finished his MSc in Electrical and Computer Engineering at Pazmany Peter Catholic University (PPCU-FIT) in Budapest, Hungary, in 2010, then worked for a financial consultancy company and for the Institute for Technical Physics and Materials Science, Hungarian Academy of Sciences. He started his PhD in Parallel Computing at PPCU-FIT in 2011.

Many numerical methods require the solution of multiple independent tridiagonal systems. This talk will describe optimized methods for solving such systems, considering both the case where the tridiagonal elements are scalar, and the case where they are composed of square blocks of dimension D, typically 3-8. For the scalar case very good performance is achieved using a combination of the Thomas algorithm and parallel cyclic reduction. In the block case it is shown that good performance can be achieved by using D cooperating threads, all within the same warp.
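As background for the scalar case, the Thomas algorithm is a forward-elimination/back-substitution sweep over the three diagonals; a minimal sketch, assuming one thread per independent system (the talk combines this with parallel cyclic reduction and tuned memory layouts):

    // Solve one tridiagonal system: a = sub-, b = main-, c = super-
    // diagonal, d = right-hand side (overwritten with the solution).
    __device__ void thomas_solve(float* a, float* b, float* c,
                                 float* d, int n)
    {
        // Forward elimination.
        for (int i = 1; i < n; ++i) {
            float w = a[i] / b[i - 1];
            b[i] -= w * c[i - 1];
            d[i] -= w * d[i - 1];
        }
        // Back substitution.
        d[n - 1] /= b[n - 1];
        for (int i = n - 2; i >= 0; --i)
            d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
    }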

Session Level: Advanced
Session Type: Talk
Tags: Numerical Algorithms & Libraries; Finance; Computational Physics

Day: Wednesday, 03/26
Time: 15:00 - 15:25
Location: Room LL20D

S4347 - Conquering the Titan Supercomputer: A Star-by-Star Simulation of the Milky Way Galaxy

Evghenii Gaburov ( HPC Advisor, SURFsara )
Evghenii Gaburov
Evghenii Gaburov received an MPhys with Astrophysics from the University of Leicester. He continued with PhD research at the University of Amsterdam, working on stellar dynamics, stellar collisions and parallel processing on GPUs. Afterwards he spent two years at Leiden Observatory investigating the impact of strong magnetic fields on accretion disks around supermassive black holes. He continued this research at Northwestern University on the prestigious CIERA and NASA Hubble postdoctoral fellowships. He then joined SURFsara, the Dutch national supercomputing centre, to help researchers take advantage of the compute power that modern processors offer.
Jeroen Bédorf ( Ph.D. Student, Leiden Observatory )
Jeroen Bedorf is a PhD student at the Leiden Observatory in the Netherlands. He obtained his Master of Science in Grid Computing at the University of Amsterdam; the topic of his thesis was high-performance N-body simulations on Graphics Processing Units. After his Master's he started a Ph.D. to continue the work on high-performance N-body codes and apply them to problems in computational astrophysics. The N-body codes are accelerated using GPUs. He has developed multiple N-body codes using either direct N-body or hierarchical tree-code algorithms (for example the Bonsai tree-code). These tree-code methods are used to simulate the formation of galaxies and the effect galaxy mergers have on their evolution.

In this session we demonstrate how we leverage the massive parallelism of thousands of GPUs inside the Titan supercomputer to simulate the past and future of the Milky Way Galaxy on a star-by-star basis in less than 10 days. The audience will learn what it takes to parallelize an advanced hierarchical GPU tree-code to run efficiently on the Titan supercomputer. A gravitational N-body problem is by definition an all-to-all problem, and it is of utmost importance for scalability to hide data communication behind computation. This turned out to be a major challenge on the Titan supercomputer because Bonsai's GPU kernels are ~3x faster on Kepler than on Fermi, which reduced compute time and as a result hampered scalability. We solved this by redesigning the communication strategy to take full advantage of each of the 16 CPU cores while the GPUs were busy computing gravitational forces. This allowed Bonsai to scale to more than 8192 GPUs.

Session Level: Intermediate
Session Type: Talk
Tags: Astronomy & Astrophysics; Supercomputing; Computational Physics; Numerical Algorithms & Libraries; Recommended Press Session – HPC-Science

Day: Wednesday, 03/26
Time: 15:00 - 15:50
Location: Room LL21F

S4585 - FastFlow: Combining Pattern-Level Abstraction and Efficiency in GPGPUs

Marco Aldinucci ( Researcher, Computer Science Department, University of Torino )
Marco Aldinucci has been an assistant professor at the Computer Science Department of the University of Torino since 2008. Previously, he was a researcher at the University of Pisa and the Italian National Research Agency. He is the author of over a hundred papers in international journals and conference proceedings (Google Scholar h-index 21). He has participated in over 20 national and international research projects concerning parallel and autonomic computing. He is the recipient of the HPC Advisory Council University Award 2011 and an NVIDIA Academic Research Program award (2013). He has led the "Low-Level Virtualization and Platform-Specific Deployment" work package within the EU-STREP FP7 ParaPhrase (Parallel Patterns for Adaptive Heterogeneous Multicore Systems) project and the GPGPU work package within the IMPACT project (Innovative Methods for Particle Colliders at the Terascale), and he is the contact person for the University of Torino in the European Network of Excellence on High Performance and Embedded Architecture and Compilation. In the last year (March 2012 – March 2013) he delivered 5 invited talks at international workshops. He co-designed, together with Massimo Torquati, the FastFlow programming framework and several other programming frameworks and libraries for parallel computing. His research is focused on parallel and distributed computing.

Learn how FastFlow's parallel patterns can be used to design parallel applications for execution on both CPUs and GPGPUs while avoiding most of the complex low-level coding otherwise needed to make them efficient, portable and rapid to prototype. As a use case, we will show the design and effectiveness of a novel universal image filtering template based on the variational approach.

Session Level: Beginner
Session Type: Talk
Tags: Video & Image Processing; Numerical Algorithms & Libraries; Programming Languages & Compilers

Day: Wednesday, 03/26
Time: 15:00 - 15:50
Location: Room LL21A

S4655 - Efficient Lifetime Portfolio Sensitivities: AAD Versus Early-Start Longstaff-Schwartz Compression

Chris Kenyon ( Director, Quantitative Research – CVA / FVA, Lloyds Banking Group )
Chris Kenyon
Chris Kenyon is a Director in the Quantitative Research – CVA / FVA team at Lloyds Bank. Previously he was head quant for counterparty credit risk globally at Credit Suisse, and at DEPFA Bank PLC he was Head of Structured Credit Valuation (post-crisis), working on pricing model development and validation. He has also held positions at IBM Research, and at Schlumberger, where he applied real-options pricing to everything from offshore rig lease extension options to variable-volume outsourcing contracts. Chris holds a PhD in Applied Mathematics from Cambridge University, where he was a Research Fellow (Computer Modeling), and has an MSc in Operations Research from the University of Texas at Austin.
Andrew Green ( Head of Quantitative Research - CVA / FVA, Lloyds Banking Group )
Andrew Green (Head of Quantitative Research – CVA / FVA) has headed the Quantitative Research – CVA / FVA team at Lloyds Bank for the last five years and is responsible for the development of models for credit and funding valuation adjustments. Prior to joining Lloyds, he headed the DCRM Quant team at Barclays Capital with responsibility for CVA model development. During his career as a quantitative analyst, Andrew has worked on fixed income, equity, credit and hybrid derivative models. Andrew holds a DPhil and BA in Physics from Oxford University and the Certificate in Advanced Study in Mathematics (Part III) from Cambridge University.

Developments in financial regulation (Dodd-Frank, Basel III) emphasize capital adequacy. Efficient lifetime portfolio VaR-based capital calculations for trading decisions are highly computationally challenging. The major impediment to widespread GPU adoption has been the need for multiple code-bases; the Xcelerit middleware solves this. We give a single-source CPU/GPU approach for highly efficient lifetime portfolio sensitivity calculations. This talk introduces Early-Start Longstaff-Schwartz Compression (ES-LSC), which replaces (Automatic) Algorithmic Differentiation (AAD), a method we demonstrate is technically unsuitable after t=0. Longstaff-Schwartz is a state-space method for pricing, which we also apply to non-American derivatives as a compression technique. Early-Start means simulations (GPU/CPU) start from the past, so the state space is available both at t=0 and at all later times for VaR calculations for capital pricing (or IM). State-space regressions provide sensitivities either analytically or by finite difference.

Session Level: Intermediate
Session Type: Talk
Tags: Finance; Numerical Algorithms & Libraries

Day: Wednesday, 03/26
Time: 15:00 - 15:50
Location: Room 210C

S4170 - GPU Neutron Transport: Simulating Nuclear Reactions One Neutron at a Time

Tony Scudiero ( Compute Devtech, NVIDIA )
Tony Scudiero
Tony Scudiero is a Developer Technology software engineer, or devtech, at NVIDIA focusing on compute performance for HPC applications. He works very closely with developers to identify performance bottlenecks and optimization strategies to improve overall performance in their applications. Tony has a long history of using GPUs to accelerate science which predates the introduction of CUDA. Prior to working at NVIDIA, Tony worked at Cray as a high-performance compiler engineer and science library optimization engineer. He has also spent time in other industry roles including computational finance and algorithm design for medical and defense applications. Tony holds an M.S. in Computer Science, a B.S. in Computer Science, and a B.S. in Mathematics from the University of Minnesota.

Monte Carlo neutron transport is an approach to simulating radiation transport and nuclear reaction physics by tracking the individual lifespans of many millions of unbound neutrons. OpenMC is a recently developed Monte Carlo neutron transport application intended to allow future reactor designers to leverage extremely low-level simulation of new reactors years before they are built. The presenter, Tony Scudiero, has adapted OpenMC from its original incarnation as 27k lines of single-threaded Fortran 90 to a parallel CUDA C/C++ implementation optimized for the GPU. This talk covers computational considerations of Monte Carlo neutron transport, the design and process of porting OpenMC to CUDA, and the results and lessons learned in the process. Along with OpenMC, its miniapp benchmark XSBench will be discussed.
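The computational heart of such codes (and of the XSBench miniapp) is the cross-section lookup: a binary search into a nuclide's energy grid followed by interpolation, repeated enormously often. A simplified sketch with illustrative names and data layout:

    // Locate the energy-grid interval containing E by binary search,
    // then linearly interpolate the tabulated cross section.
    __device__ float xs_lookup(const float* egrid, const float* xs,
                               int n, float E)
    {
        int lo = 0, hi = n - 1;
        while (hi - lo > 1) {
            int mid = (lo + hi) / 2;
            if (egrid[mid] > E) hi = mid; else lo = mid;
        }
        float f = (E - egrid[lo]) / (egrid[hi] - egrid[lo]);
        return xs[lo] + f * (xs[hi] - xs[lo]);
    }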

Session Level: Advanced
Session Type: Talk
Tags: Computational Physics; Supercomputing; Numerical Algorithms & Libraries

Day: Wednesday, 03/26
Time: 15:30 - 16:20
Location: Room 212A

S4190 - Finite Difference Simulations on GPU Clusters: How Far Can You Push 1D Domain Decomposition?

Pierre Wahl ( Ph.D .Student, Brussels Photonics Team/ Vrije Universiteit Brussel )
Pierre Wahl
Pierre Wahl received his B.S. degree in electrical engineering and the Erasmus Mundus M.S. degree in photonics from Vrije Universiteit Brussel, Brussels, Belgium, in 2007 and 2010, respectively. He wrote his Master's thesis at Interuniversity Microelectronics Center, Leuven, Belgium on high-frequency electrical voltage-controlled oscillators. Pierre joined the Miller Group, Stanford University, Stanford, Calif., as a Visiting Researcher from 2010 to July 2011. He is currently pursuing a PhD degree in electrical engineering at Vrije Universiteit Brussel (Brussels Photonics Team) on low-energy optical interconnects. His current research interests include optical interconnects and advanced simulation and optimization methods in nanophotonics.

To fully utilize a GPU cluster, both the single-GPU code and the inter-GPU communication need to be efficient. In this session the FDTD code B-CALM is introduced and used as a case study to explain by example how both targets can be met. We explain how the memory-bound kernels of B-CALM have been optimized for Fermi and Kepler and how efficient inter-GPU communication was enabled using CUDA-aware MPI. We explain in detail how this was done and present two performance models we have developed to estimate single-GPU performance as well as the scaling limits. To validate the models, performance results from different systems are presented, including an InfiniBand cluster with GPUDirect RDMA.
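With a CUDA-aware MPI implementation, device pointers can be handed directly to MPI calls, so a 1D halo exchange needs no explicit staging through host memory; a minimal sketch under an assumed slab layout and names (with GPUDirect RDMA the transfer can bypass the host entirely):

    #include <mpi.h>

    // Exchange one-slab halos with both neighbours in a periodic 1D
    // decomposition. Layout: [ghost_low | m interior slabs | ghost_high],
    // each slab holding 'slab' floats; d_field is a device pointer.
    void halo_exchange(float* d_field, int slab, int m,
                       int rank, int nranks, MPI_Comm comm)
    {
        int up   = (rank + 1) % nranks;
        int down = (rank - 1 + nranks) % nranks;
        MPI_Request req[4];
        MPI_Irecv(d_field,                  slab, MPI_FLOAT, down, 0, comm, &req[0]);
        MPI_Irecv(d_field + (m + 1) * slab, slab, MPI_FLOAT, up,   1, comm, &req[1]);
        MPI_Isend(d_field + m * slab,       slab, MPI_FLOAT, up,   0, comm, &req[2]);
        MPI_Isend(d_field + slab,           slab, MPI_FLOAT, down, 1, comm, &req[3]);
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    }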

Session Level: Beginner
Session Type: Talk
Tags: Numerical Algorithms & Libraries; Clusters & GPU Management; Supercomputing; Computational Physics

Day: Wednesday, 03/26
Time: 15:30 - 15:55
Location: Room LL20D

S4304 - Batch QR Decomposition of Small Matrices on the GPU Using Givens Rotations

Pierre-Yves Taunay ( Research Programmer, The Pennsylvania State University )
Pierre-Yves Taunay
Pierre-Yves Taunay obtained his Master of Science in Aerospace Engineering from The Pennsylvania State University in 2012, along with a "General Engineer" degree from the Ecole Centrale de Nantes, France. He has been working since then at the Research Computing and Cyberinfrastructure unit at The Pennsylvania State University as a Research Programmer. His current research focuses on high-performance computing for large-scale engineering and scientific applications, such as molecular dynamics, fluid dynamics, or plasma physics, using Graphics Processing Units (GPUs) and the Message Passing Interface (MPI) standard.

This work details several GPU implementations of the QR decomposition algorithm using Givens rotations, with a particular focus on large batches of small matrices, displaying performance improvements over similar CPU routines. Each approach essentially consists of successive operations on the input matrix in order to transform it to the upper triangular matrix R, while accumulating operations in the matrix Q. Each GPU block operates on one or more matrices, with care taken to avoid thread divergence and large memory transfers.
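Each Givens rotation annihilates one entry by rotating two rows; the coefficients and their row update are compact enough to sketch (illustrative, not the speaker's kernel):

    // Coefficients c, s such that [c s; -s c] applied to (a, b) zeroes b.
    __device__ void givens(float a, float b, float* c, float* s)
    {
        float r = hypotf(a, b);
        if (r == 0.0f) { *c = 1.0f; *s = 0.0f; }
        else           { *c = a / r; *s = b / r; }
    }

    // Apply the rotation to rows i and k of a row-major matrix with n
    // columns; one thread per column keeps the accesses coalesced.
    __global__ void apply_givens(float* A, int n, int i, int k,
                                 float c, float s)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j >= n) return;
        float ai = A[i * n + j], ak = A[k * n + j];
        A[i * n + j] =  c * ai + s * ak;
        A[k * n + j] = -s * ai + c * ak;
    }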

Session Level: Intermediate
Session Type: Talk
Tags: Defense; Numerical Algorithms & Libraries; Supercomputing

Day: Wednesday, 03/26
Time: 15:30 - 15:55
Location: Room 210D

S4371 - AMR Based on a Space-Filling Curve for Stencil Applications

Takayuki Aoki ( Professor, Tokyo Institute of Technology )
Takayuki Aoki received a BSc in Applied Physics, an MSc in Energy Science and a Dr.Sci (1989) from Tokyo Institute of Technology. He has been a professor at Tokyo Institute of Technology since 2001 and the deputy director of the Global Scientific Information and Computing Center since 2009. He received the Minister's award of the Ministry of Education, Culture, Sports, Science & Technology in Japan and many other awards and honors in GPU computing, scientific visualization, and other areas. He led the team that won the Gordon Bell Prize in 2011 and was recognized as a CUDA Fellow by NVIDIA in 2012.

AMR (Adaptive Mesh Refinement) is an efficient method capable of assigning a mesh of appropriate resolution to any local area. It has great advantages in computational cost and memory usage for practical stencil applications such as computational fluid dynamics. In the octree data structure, the refinement process is recursive and the computation is carried out on the leaf meshes. By using bigger leaves than on the CPU, we can assign a CUDA block to each leaf with a sufficient number of threads. We show a GPU implementation in which the leaves are connected by the Hilbert space-filling curve and discuss the overhead of the data management.
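The talk uses the Hilbert curve; as a shorter illustration of the same linearization idea (neighbours in space tend to stay neighbours in memory), here is the related Morton/Z-order key, swapped in here because its bit interleaving fits in a few lines. The bit-expansion trick is standard in GPU tree construction:

    // Interleave the bits of 10-bit x, y, z coordinates into a 30-bit
    // Morton key; sorting octree leaves by this key yields a Z-order
    // traversal analogous to the Hilbert ordering used in the talk.
    __host__ __device__ unsigned int expand_bits(unsigned int v)
    {
        v = (v * 0x00010001u) & 0xFF0000FFu;
        v = (v * 0x00000101u) & 0x0F00F00Fu;
        v = (v * 0x00000011u) & 0xC30C30C3u;
        v = (v * 0x00000005u) & 0x49249249u;
        return v;
    }

    __host__ __device__ unsigned int morton3d(unsigned int x,
                                              unsigned int y,
                                              unsigned int z)
    {
        return (expand_bits(x) << 2) | (expand_bits(y) << 1) | expand_bits(z);
    }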

Session Level: Intermediate
Session Type: Talk
Tags: Numerical Algorithms & Libraries; Computational Fluid Dynamics; Supercomputing

Day: Wednesday, 03/26
Time: 16:00 - 16:25
Location: Room LL20D

S4536 - An Approach to Parallel Processing of Big Data in Finance for Alpha Generation and Risk Management

Yigal Jhirad ( Head of Quantitative Strategies and Risk Management , Cohen & Steers )
Yigal  Jhirad
Yigal D. Jhirad, Senior Vice President, is Director of Quantitative Strategies and a Portfolio Manager for Cohen & Steers’ options and real assets strategies. Mr. Jhirad heads the firm’s Investment Risk Committee. He has 26 years of experience. Prior to joining the firm in 2007, Mr. Jhirad was an executive director in the institutional equities division of Morgan Stanley, where he headed the company’s portfolio and derivatives strategies effort. He was responsible for developing, implementing and marketing quantitative and derivatives products to a broad array of institutional clients, including hedge funds, active and passive funds, pension funds and endowments. Mr. Jhirad holds a BS from the Wharton School. He is a Financial Risk Manager (FRM), as certified by the Global Association of Risk Professionals. He is based in New York.
Blay Tarnoff ( Senior Software Engineer, Cohen & Steers )
Blay Tarnoff
Blay Tarnoff is a senior applications developer and database architect. He specializes in array programming and database design and development. He has developed equity and derivatives applications for program trading, proprietary trading, quantitative strategy, and risk management. He is currently a consultant at Cohen & Steers and was previously at Morgan Stanley.

This session discusses the convergence of parallel processing and big data in finance as the next step in the evolution of risk management and trading systems. We advocate that risk management in finance should evolve from traditional inter-day, top-down metrics to an intra-day, bottom-up approach using signal generation and pattern recognition. We have also found that parallel processing is a key tool for absorbing greater insights into market patterns, providing a "trading DNA" and more effective tools to manage risk in real time.

Session Level: All
Session Type: Talk
Tags: Finance; Big Data Analytics & Data Algorithms; Numerical Algorithms & Libraries

Day: Wednesday, 03/26
Time: 16:00 - 16:50
Location: Room 210C

S4365 - RAMSES on the GPU: An OpenACC-Based Approach

Claudio Gheller ( Computational Scientist, ETH-CSCS )
Highly-Rated Speaker
Claudio Gheller earned a Ph.D. in Astrophysics at SISSA-ISAS (Trieste). He is currently working as a computational scientist at the Swiss National Supercomputing Center (CSCS), which is part of ETH. Among his current duties, he is responsible for WP8 ("Scientific codes enabling on new HPC architectures") in the EU-funded PRACE project and for the Physics network in the Swiss-funded PASC project, and he is involved in a number of research projects on code development, HPC simulations, data analysis in astrophysics, and visualization of scientific data.

We present the work accomplished to enable the numerical code RAMSES on the GPU, in order to efficiently exploit hybrid accelerated HPC architectures. RAMSES is a code designed for the study of astrophysical problems on different scales (e.g. star formation, galaxy dynamics, large-scale structure of the universe), treating at the same time various components (dark energy, dark matter, baryonic matter, photons) and including a variety of physical processes (gravity, magneto-hydrodynamics, chemical reactions, star formation, supernova and AGN feedback, etc.). It is implemented in Fortran 90 and adopts the OpenACC paradigm to offload some of the most computationally demanding algorithms to the GPU. Two different strategies have been pursued for code refactoring, in order to explore complementary solutions and select the most effective approach. The resulting algorithms are presented together with the results of tests, benchmarks and scientific use cases.

Session Level: Advanced
Session Type: Talk
Tags: Astronomy & Astrophysics; Supercomputing; Numerical Algorithms & Libraries; Computational Physics

Day: Wednesday, 03/26
Time: 16:30 - 16:55
Location: Room LL21F

S4497 - Parallelizing a Real-Time 3D Finite Element Algorithm using CUDA: Limitations, Challenges and Opportunities

Vukasin Strbac ( PhD student, KULeuven University, Leuven )
Vukasin Strbac is a PhD student at KULeuven University, Leuven, Belgium. He is a member of the Biomechanics section, within the Department of Mechanical Engineering. He is also a member of the Robotics Assisted Surgery group specializing in the Finite Element Method and parallel computing for the intraoperative setting.

Learn about the challenges of parallelizing a finite element problem using the Total Lagrangian Explicit Dynamic formulation. We examine the algorithm and perform a detailed analysis of the factors that limit the performance of a CUDA parallelization. Potential optimization benefits are elucidated in terms of register-usage thresholds and other factors affecting performance. Results of a larger usability study on a simple problem are presented, examining the single/double-precision tradeoff on a wide range of GPUs and problem sizes. Discover the impact that real-time FE can bring to the intraoperative surgical setting, with in-the-loop computation facilitating surgical robotics.

Session Level: Intermediate
Session Type: Talk
Tags: Numerical Algorithms & Libraries; Computational Structural Mechanics; Computational Physics

Day: Wednesday, 03/26
Time: 16:30 - 16:55
Location: Room LL20D

S4518 - Accelerating Dissipative Particle Dynamics Simulation on Kepler: Algorithm, Numerics and Application

Yu-Hang Tang ( Ph.D. Student, Brown University )
Yu-Hang Tang
Yu-Hang Tang is a Ph.D. student in the Division of Applied Mathematics at Brown University. He got his bachelor's degree in Polymer Science at Zhejiang University, China. Following one year of study at the Center for Biophysics and Computational Biology at University of Illinois at Urbana-Champaign, he started his Ph.D. research in applied mathematics at Brown University. His current interests are various particle-based simulation techniques including molecular dynamics, dissipative particle dynamics and smooth particle hydrodynamics. He is also devoted to the development of massively parallel algorithms.

The talk focuses on the implementation of a highly optimized dissipative particle dynamics (DPD) simulation code in CUDA, which achieves a 20-times speedup on a single Kepler GPU over 12 Ivy Bridge cores. We will introduce a new pair-searching algorithm that is parallel, deterministic, atomics-free, and capable of generating a strictly ordered neighbor list. Such a neighbor list leads to optimal memory efficiency when combined with proper particle reordering schemes. We also propose an in-situ generation scheme for Gaussian random numbers that performs better without losing quality. In addition, details will be given on how to design custom transcendental functions that fit specifically our DPD functional form. The code is scalable and can run on over a thousand nodes of the Titan supercomputer. Demonstrations of large-scale DPD simulations of vesicle assembly and red blood cell suspension hydrodynamics using our code will be given.
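For context, the "DPD functional form" referred to above is the standard pairwise force with conservative, dissipative and random parts (textbook DPD, not anything specific to this code):

    F_{ij} = F^{C}_{ij} + F^{D}_{ij} + F^{R}_{ij}, \quad
    F^{C}_{ij} = a_{ij}\,(1 - r_{ij}/r_c)\,\hat{r}_{ij}, \quad
    F^{D}_{ij} = -\gamma\, w^{D}(r_{ij})\,(\hat{r}_{ij}\cdot \mathbf{v}_{ij})\,\hat{r}_{ij}, \quad
    F^{R}_{ij} = \sigma\, w^{R}(r_{ij})\,\xi_{ij}\,\hat{r}_{ij},

with the fluctuation-dissipation constraints w^D = (w^R)² and σ² = 2γk_BT; the ξ_ij are the Gaussian random numbers whose in-situ generation the talk addresses.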

Session Level: Intermediate
Session Type: Talk
Tags: Molecular Dynamics; Computational Fluid Dynamics; Supercomputing; Numerical Algorithms & Libraries

Day: Wednesday, 03/26
Time: 16:30 - 16:55
Location: Room LL21E

S4283 - Subdivide, Preprocess and Conquer: Micromagnetism FEM/BEM-Simulations on Single-Node/Multi-GPU Systems

Elmar Westphal ( Scientific Programmer, Forschungszentrum Jülich GmbH )
Elmar Westphal has been working as a programmer and cluster architect at Forschungszentrum Juelich for more than 15 years. In recent years he has ported simulation programs from different fields of computational physics to single- and/or multi-GPU systems and developed CUDA-based building blocks, libraries and applications, mostly for molecular dynamics and micromagnetism simulations.

See how subdividing and preprocessing the static parts of your simulation system beyond the obvious can significantly increase performance. As an example we use our micromagnetism simulator TetraMag, whose solvers for systems of linear equations and field calculations rely heavily on sparse matrix-vector multiplications. The sizes of the matrices involved in large-scale simulations often outgrow the memory capacity of a single GPU. In our case, these matrices are constant over a program run, which can mean millions of iterations. This talk will show how analyzing, reordering and splitting our original matrices in a checkerboard style enables us to reduce expensive data transfers between GPUs and helps to reduce transfer overhead through fine-grained streaming.
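The building block being optimized is sparse matrix-vector multiplication; for reference, the simplest CUDA formulation over a CSR matrix (one thread per row, shown before any of the reordering and splitting described above) is:

    // y = A * x for an n-row CSR matrix: row_ptr has n+1 entries,
    // col/val hold the nonzeros. The checkerboard splitting aims to
    // keep each GPU's rows, and the x entries they touch, local.
    __global__ void spmv_csr(const int* row_ptr, const int* col,
                             const float* val, const float* x,
                             float* y, int n)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n) return;
        float sum = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += val[j] * x[col[j]];
        y[row] = sum;
    }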

Session Level: Intermediate
Session Type: Talk
Tags: Computational Physics; Numerical Algorithms & Libraries

Day: Wednesday, 03/26
Time: 17:00 - 17:25
Location: Room 212A

S4593 - Chrono::Flex – A Flexible Multibody Dynamics Framework on the GPU

Daniel Melanz ( Research Assistant, University of Wisconsin - Madison )
Daniel Melanz
Daniel has a Master's Degree in Mechanical Engineering from the University of Wisconsin - Madison. His technical area of focus is modeling and simulation using high-performance computing with an emphasis on terramechanics and multiphysics. He is currently working toward his PhD in Mechanical Engineering with a minor in Computer Science.

In this work, we investigate the performance gains that the Spike::GPU methodology offers over alternative solutions based on other linear solvers, such as Pardiso. We present results for problems of sizes that are relevant in engineering applications; for example, a net simulation composed of approximately one million beam elements.

Session Level: Intermediate
Session Type: Talk
Tags: Numerical Algorithms & Libraries; Computational Physics; Computational Structural Mechanics; Combined Simulation & Real-Time Visualization

Day: Wednesday, 03/26
Time: 17:00 - 17:25
Location: Room LL20D

S4522 - Explore Computational Power of GPU in Electromagnetics and Micromagnetics

Sidi Fu ( Research Assistant, UCSD )
Sidi Fu
Sidi Fu is currently a PhD student at the University of California, San Diego, working as a research assistant in the Computational Electromagnetics and Micromagnetics Group advised by Prof. Vitaliy Lomakin. His research interests are parallel algorithms and their GPU implementations to accelerate electromagnetic and micromagnetic simulations.

This session presents how GPUs are utilized to parallelize the computation in realistic, large-scale electromagnetic and micromagnetic simulators. Two important algorithms are discussed: Non-uniform Fast Fourier Transform (NUFFT) and Sparse Matrix-Vector Multiplication (SpMV). Methods used to overcome the bottlenecks related to communication between threads and irregular data access patterns are presented. We also outline a scheme to distribute the computations between multiple GPUs for further acceleration. We then demonstrate how these GPU-accelerated methods are used in electromagnetic and micromagnetic solvers for modeling the magnetization dynamics in ultra-complex magnetic nanostructured materials and devices.

Session Level: Intermediate
Session Type: Talk
Tags: Numerical Algorithms & Libraries; Computational Physics

Day: Wednesday, 03/26
Time: 17:30 - 17:55
Location: Room LL20D

S4934 - Signal Processing Libraries for High Performance Embedded Computing (Presented by GE)

David Tetley ( Senior Technology Software Manager, GE Intelligent Platforms )
David is the Engineering Manager for GE Intelligent Platforms' High Performance Embedded Computing centre of excellence. His team works closely with customers to develop integrated CPU and GPU platforms for signal and image processing applications for the military and aerospace market. He is also responsible for GE's AXIS software libraries and graphical tools. He graduated from the University of Bath in the UK with a B.Eng in Electronic and Communication Engineering.

High Performance Embedded Computing (HPEC) is bringing hardware found in the Top500 supercomputers into the embedded market space. This is leading to Linux clusters consisting of a mixture of CPUs and GPUs being deployed to tackle signal and image processing applications such as those found on Intelligence, Surveillance and Reconnaissance (ISR) platforms. Developers, whilst wanting to take advantage of the potential performance of GPUs, want to keep their code architecture-agnostic so it can be ported to other hardware platforms without significant re-design. Whilst CUDA and OpenCL are emerging to offer this capability at a lower programming level, the industry-standard Vector Signal and Image Processing API provides a higher level of abstraction, with over 600 signal processing and vector math functions to choose from. This enables developers to build portable signal processing algorithms that can be targeted at either the CPU or GPU with no source code changes. This session provides an overview of the VSIPL standard and demonstrates the portability between CPU and GPU platforms.

Session Level: All
Session Type: Talk
Tags: Signal & Audio Processing; Numerical Algorithms & Libraries; Performance Optimization; Big Data Analytics & Data Algorithms

Day: Wednesday, 03/26
Time: 17:30 - 17:55
Location: Room 212B

S4349 - Multi-GPU Iterative Solvers Made Easy Using ArrayFire

Pavan Yalamanchili ( Product Lead, ArrayFire )
Pavan Yalamanchili has been working on high-performance computing for nearly 6 years. For the past 4 years he has been at ArrayFire, where he has been a key contributor and is now the product lead. Prior to joining ArrayFire, he graduated from Clemson University with a Master's degree in Electrical Engineering. At Clemson, he worked on accelerating Bayesian cortical models on clusters of Cell processors, which eventually led to a Phase 2 grant from the DoD.

Learn how to control the location of your data while relieving the burden of managing the communication between GPUs using ArrayFire. ArrayFire is a scientific library with fast implementations of hundreds of algorithms, including dense and sparse linear algebra and numerical methods. ArrayFire's easy-to-use array notation, coupled with fast, out-of-core implementations of commonly used algorithms, helps users easily implement traditional and customized iterative solvers.
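As a taste of that array notation, a single Jacobi sweep for A x = b fits in one line of ArrayFire C++ (a sketch: af::matmul is part of ArrayFire's documented API, while the splitting of A into its diagonal d and remainder R is assumed to be set up by the caller):

    #include <arrayfire.h>

    // One Jacobi iteration: x_{k+1} = (b - R x_k) / d, with R = A - diag(d).
    // ArrayFire runs the matmul and the element-wise ops on the GPU.
    af::array jacobi_step(const af::array& R, const af::array& d,
                          const af::array& b, const af::array& x)
    {
        return (b - af::matmul(R, x)) / d;
    }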

Session Level: Intermediate
Session Type: Talk
Tags: Numerical Algorithms & Libraries

Day: Thursday, 03/27
Time: 09:00 - 09:25
Location: Room LL21D

S4370 - GPU Floating Point Accuracy: Theory and Practice

Lars Nyland ( Compute Architect, NVIDIA )
Highly-Rated Speaker
Lars Nyland
Lars Nyland is a member of the Compute Architecture team at NVIDIA. The goal of the team is to design and test features on the GPU that support high-performance computations. Nyland's efforts have resulted in better memory performance, atomic memory operations, and instructions that reduce computation time. Prior to joining NVIDIA, he was a professor of Computer Science at the Colorado School of Mines and the University of North Carolina (where he maintains an adjunct faculty position).
Dale Southard ( Sr Solution Architect, NVIDIA )
Highly-Rated Speaker
Dale is a Senior Solution Architect with NVIDIA covering extreme scale HPC.
Alex Fit-Florea ( Manager SW Systems, NVIDIA )
Alex is a Manager with NVIDIA. His responsibilities include several of the CUDA mathematical libraries.

Issues related to floating point accuracy and IEEE754 compliance are often confusing on both CPUs and GPUs. Different systems can follow the standard, but produce non-identical results. We will provide real world examples of differences even when the math is the same, and how this leads to incorrect conclusions. The basics of IEEE754 will be covered, and we'll explain how NVIDIA GPUs comply, and what to expect from them. Compute capabilities, controlling modes with flags, and good practices will be reviewed. As a real-life example, we examine the accumulation of round-off errors in the n-body application. While results can vary depending on the order of operations, our solution tracks the accumulated errors and results in a dramatic reduction in round-off error. Typical results are the floating-point value nearest to the mathematical answer. Furthermore, the performance impact of tracking the errors is small, even on numerically-intense algorithms such as n-body.
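One classic way to track accumulated round-off, in the spirit of (though not necessarily identical to) the n-body solution described above, is compensated (Kahan) summation, which carries the rounding error of each addition into the next:

    // Kahan compensated summation: 'comp' holds the low-order bits lost
    // in each addition, keeping the running sum near the correctly
    // rounded result. Note that aggressive floating-point reassociation
    // by the compiler would defeat the algebraic cancellation below.
    __device__ float kahan_sum(const float* v, int n)
    {
        float sum = 0.0f, comp = 0.0f;
        for (int i = 0; i < n; ++i) {
            float y = v[i] - comp;   // compensated input
            float t = sum + y;       // low bits of y may be lost here...
            comp = (t - sum) - y;    // ...and are recovered here
            sum = t;
        }
        return sum;
    }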

Session Level: Intermediate
Session Type: Talk
Tags: Numerical Algorithms & Libraries; Programming Languages & Compilers; Supercomputing

Day: Thursday, 03/27
Time: 09:00 - 10:20
Location: Room LL21B

S4719 - Delivering Performance in Scientific Simulations: Present and Future Role of GPUs in Supercomputing

Thomas Schulthess ( Professor of Computational Physics, ETH Zurich / CSCS )
Thomas Schulthess
Thomas Schulthess received his Ph.D. in physics from ETH Zurich in 1994. He is a professor for computational physics at ETH Zurich and Director of the Swiss National Supercomputing Center in Lugano, Switzerland. Thomas holds a visiting distinguished professor appointment at ORNL, where he was group leader and researcher in computational materials science for over a decade before moving to ETH Zurich in 2008. His current research interests are in development of efficient and scalable algorithms for the study of strongly correlated quantum systems, as well as electronic structure methods in general. He is also engaged in the development of efficient tools and simulations systems for other domain areas, such as meteorology/climate and geophysics.

GPU-based supercomputers are the most energy efficient and among the most powerful computing systems in use today. We show with examples from computational physics and climate simulations how this performance is delivered today to solve real-world problems. You will see how application software has been structured in order to port seamlessly across hardware platforms, what aspects of current hybrid CPU-GPU platforms matter, and how such architectures should best develop so that applications continue to benefit from exponential performance increases in the future.

Session Level: All
Session Type: Talk
Tags: Climate, Weather, Ocean Modeling; Supercomputing; Computational Physics; Numerical Algorithms & Libraries; Recommended Press Session – HPC-Science

Day: Thursday, 03/27
Time: 09:00 - 09:25
Location: Room 212B

S4235 - Breaking Computational Barriers: Multi-GPU High-Order RBF Kernel Problems with Millions of Points

Peter Zaspel ( Research Assistant, University of Bonn )
Peter Zaspel is a research assistant at the Institute for Numerical Simulation of the University of Bonn, Germany. He studied computer science and is now working on his PhD in mathematics. In 2010, he was an intern in the NVIDIA research group. His research topics are computational fluid dynamics, uncertainty quantification, multi-GPU parallelization and visualization.

Join our presentation to get insight into our latest developments on meshless numerical methods with millions of degrees of freedom. So-called radial basis function (RBF) kernel methods allow us to attack numerical problems such as interpolation, quadrature or the solution of partial differential equations with provable high-order convergence. Important applications include knowledge extraction from extreme-size data sets and, for example, fluid dynamics. However, kernel methods usually involve solving dense linear systems (with O(N³) complexity), which makes them unattractive for large-scale problems. We are now able to overcome this complexity bottleneck with an appropriately preconditioned iterative approach, achieving O(N²) or even O(N log(N)) complexity. The preconditioner fully decouples the problem into many small subproblems and thus fits perfectly to multi-GPU and, later on, exascale systems. Overall, the method allows for almost perfect scaling on hundreds of GPUs for RBF kernel problems with millions of unknowns.
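To make the complexity claim concrete: an RBF interpolant of data (x_j, f_j) takes the form

    s(x) = \sum_{j=1}^{N} \lambda_j\, \varphi(\|x - x_j\|),

and the coefficients λ solve the dense N×N system Aλ = f with A_{ij} = φ(‖x_i − x_j‖). A direct solve costs O(N³), which is precisely the bottleneck the preconditioned iterative approach reduces to O(N²) or O(N log N).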

Session Level: Advanced
Session Type: Talk
Tags: Numerical Algorithms & Libraries; Supercomputing; Computational Physics; Machine Learning & AI

Day: Thursday, 03/27
Time: 09:30 - 09:55
Location: Room LL21D

S4483 - Recursive Interaction Probability: A New Paradigm in Parallel Data Processing

Richard Heyns ( Founder and CEO, brytlyt )
Richard Heyns
Richard is the CEO and founder of brytlyt limited. Richard's background is a BSc in Electro-Mechanical Engineering from the University of Cape Town (1995). He has since worked mainly in the Business Intelligence space, most recently on Big Data solutions for large retailers like Kroger in the USA and Tesco in the UK. Richard's passion is the interaction of exotic hardware, cool software and ingenious algorithms.

This session will describe Recursive Interaction Probability (RIP) and why it is a pretty cool algorithm. Time will be spent on benchmark analysis against other algorithms as well as performance within an operational database. The presentation will end with how RIP was implemented on an NVIDIA Kepler K20c, the design choices involved, and how these affect performance. Use cases that play to the strengths of RIP, as well as use cases that reveal its weaknesses, will also be shared.

Session Level: Beginner
Session Type: Talk
Tags: Big Data Analytics & Data Algorithms; Numerical Algorithms & Libraries; Clusters & GPU Management

Day: Thursday, 03/27
Time: 09:30 - 09:55
Location: Room 210B

S4649 - PyFR: Technical Challenges of Bringing Next Generation Computational Fluid Dynamics to GPU Platforms

Freddie Witherden ( Ph.D. Student, Imperial College London )
Freddie Witherden
Freddie Witherden studied Physics with Theoretical Physics at Imperial College London between 2008–2012, earning an MSci degree with first class honours. His Master's thesis was on the development of a parallel Barnes-Hut type treecode for simulating laser-plasma interactions. Currently, he is a PhD candidate in the Department of Aeronautics at Imperial College London under the supervision of Dr Peter Vincent. Outside of academia, Freddie is the chief technology officer of the news analytics firm Newsflo Ltd. He also has a keen interest in digital forensics, being the primary author of the libforensic1394 library.

Learn how to develop efficient highly-scalable GPU codes faster through use of the Python programming language. In this talk I will describe our accelerated massively parallel computational fluid dynamics (CFD) code, PyFR, and outline some of the techniques employed to reduce development time and enhance performance. Specifically, it will be shown how even complex algorithms – such as those employed for performing CFD on unstructured grids – can be constructed in terms of efficient matrix-matrix multiplications. Moreover, general advice will be given on how best to integrate CUDA and MPI. Furthermore, I will demonstrate how Python can be used both to simplify development and bring techniques such as run-time kernel generation to the mainstream. Examples of these techniques, as utilized in PyFR, will be given throughout.

Session Level: Intermediate
Session Type: Talk
Tags: Computational Fluid Dynamics; Numerical Algorithms & Libraries; Supercomputing

Day: Thursday, 03/27
Time: 09:30 - 09:55
Location: Room LL20B

S4207 - PARALUTION: A Library for Iterative Sparse Methods on Multi-core CPUs and GPUs

Dimitar Lukarski ( Post-Doctoral Researcher, Uppsala University, Sweden )
Highly-Rated Speaker
Dimitar Lukarski
Dimitar Lukarski holds a post-doc research position at the Department of Information Technology, Uppsala Universitet, Sweden. He works on interdisciplinary topics in the area of parallel numerical methods and emerging hardware such as GPUs and multi-core CPUs. His research focus is mainly on robust and fine-grained parallel sparse solvers and preconditioners. Dimitar received his Bachelor's degree from Technical University of Sofia / Bulgaria, Master's degree from Technical University of Karlsruhe / Germany, and doctoral degree from Karlsruhe Institute of Technology (KIT) / Germany in 2006, 2008 and 2012, respectively.

Dive deep into sparse iterative solvers on GPUs without touching CUDA, with advanced preconditioning techniques and full portability of your program towards CPUs! Learn how the PARALUTION library handles these features! The library provides various Krylov subspace and algebraic/geometric multigrid solvers, including ILU and approximate-inverse type preconditioners/smoothers. You will investigate the design of the library in detail, learn about its key techniques for fine-grained parallelism, and additionally take note of the latest performance benchmarks on multi-core CPU, GPU and Xeon Phi. Source code examples will be presented to show the ease of use. Finally, the talk will give insight on how to directly integrate PARALUTION into your application using the C++ API or the supplied plug-ins for FORTRAN, Deal.II, OpenFOAM, Elmer and Agros2D.
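As a flavor of the C++ API, a conjugate-gradient solve follows the pattern of the library's published examples; treat the exact names and signatures below as assumptions that may differ between PARALUTION versions:

    #include <paralution.hpp>
    using namespace paralution;

    int main() {
        init_paralution();                 // select backend (GPU if available)

        LocalMatrix<double> A;
        LocalVector<double> x, b;
        A.ReadFileMTX("matrix.mtx");       // hypothetical input file
        x.Allocate("x", A.GetN());
        b.Allocate("b", A.GetM());
        x.Zeros();
        b.Ones();

        A.MoveToAccelerator();             // run the solve on the GPU
        x.MoveToAccelerator();
        b.MoveToAccelerator();

        CG<LocalMatrix<double>, LocalVector<double>, double> cg;
        cg.SetOperator(A);
        cg.Build();
        cg.Solve(b, &x);

        stop_paralution();
        return 0;
    }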

Session Level: Beginner
Session Type: Talk
Tags: Numerical Algorithms & Libraries; Supercomputing; Computational Fluid Dynamics

Day: Thursday, 03/27
Time: 10:00 - 10:25
Location: Room LL21D

S4389 - GRID-Based Methods for the Analysis of the Wave Function in Quantum Chemistry Accelerated by GPUs

Jorge Garza ( Full professor, Universidad Autónoma Metropolitana-Iztapalapa )
Jorge Garza obtained his Ph.D. at the Universidad Autónoma Metropolitana-Iztapalapa (UAMI) in Mexico City, studying confinement effects on the electronic structure of atoms within the context of density functional theory. After his Ph.D., Jorge Garza gained experience in parallel programming techniques at the Pacific Northwest National Laboratory, working with the quantum chemistry code suite NWChem. Dr. Garza now holds a position as a full professor at UAMI and has published around 70 scientific papers related to quantum chemistry supported by parallel computing. In 2008 he was responsible for the installation of the fastest supercomputer in Latin America, and he continues to apply parallel programming techniques to quantum chemistry problems.

Learn how to distribute across GPUs the scalar and vector fields defined in quantum chemistry. In this talk we analyze the wave function obtained by Hartree-Fock, density functional theory or many-body perturbation theory to second order, using the atoms-in-molecules approach. The gradient and Laplacian of the electron density are used as examples of fields that can be evaluated easily on GPUs. The performance of our algorithms is contrasted with that of algorithms not accelerated by GPUs.
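The central scalar field here is the electron density built from the occupied orbitals, together with its first and second derivatives (standard definitions):

    \rho(\mathbf{r}) = \sum_i n_i\, |\psi_i(\mathbf{r})|^2, \qquad \nabla\rho(\mathbf{r}), \qquad \nabla^2\rho(\mathbf{r}),

where the n_i are occupation numbers. Each grid point's ρ, ∇ρ and ∇²ρ can be evaluated independently of every other point, which is what makes these fields map so naturally onto GPU threads.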

Session Level: Intermediate
Session Type: Talk
Tags: Quantum Chemistry; Supercomputing; Numerical Algorithms & Libraries

Day: Thursday, 03/27
Time: 10:00 - 10:25
Location: Room LL21E

S4493 - GPU-Accelerated Modeling of Coherent Processes in Magnetic Nano-Structures

Aleksey Demenev ( Director of the Research-Education Center “Parallel and Distributed Computing", Perm State University )
Aleksey Demenev was born in 1967 in Perm, Russia. He completed degrees in Physics (Specialist) and Theoretical Physics (Major) in 1991, followed by postgraduate study in the Application of Computers, Mathematical Modeling and Mathematical Methods in Scientific Research (1991-1994). He received the 1st Prize of the Regional Youth Research Competition (Perm Region, Russia) for basic research in physical and mathematical sciences in 1995, and a Kandidat Nauk (Ph.D.) in Physical & Mathematical Sciences in 1999. He has been an Associate Professor in Informatics and Computer Science since 2002. At Perm State University (Russia), he has been Director of the Research-Education Center "Parallel and Distributed Computing" since 2009 and Associate Professor (part-time) in the Applied Mathematics and Informatics Department since 2006.

Multi-scale molecular dynamics of systems of nanomagnets is investigated by numerical simulation using parallel algorithms. The Fortran code Magnetodynamics-F supports the following types of research: study of the possibility of regulating the switching time of the magnetic moment of the nanostructure; estimation of the role of nanocrystal geometry in the super-radiance of 1-, 2- and 3-dimensional objects; study of the magnetodynamics of nanodots inductively coupled to a passive resonator; and study of the dependence of the solution on the initial orientation of the magnetic moment, in order to find the configurations for which super-radiance and radiative damping are maximal. The parallel programs were created using the OpenMP and OpenACC application programming interfaces. Estimates of the speedup and efficiency of the implemented algorithms in comparison with sequential algorithms have been obtained. It is shown that the use of NVIDIA Tesla GPUs accelerates simulations for the study of the magnetic dynamics of systems that include thousands of magnetic nanoparticles.

Session Level: Intermediate
Session Type: Talk
Tags: Computational Physics; Molecular Dynamics; Numerical Algorithms & Libraries

Day: Thursday, 03/27
Time: 10:00 - 10:25
Location: Room 212A

S4566 - CUB: "Collective" Software Primitives for CUDA Kernel Development

Duane Merrill ( Research Scientist, NVIDIA )
Duane Merrill
Duane Merrill joined NVIDIA Research in 2011. His current research interests include algorithms and programming models for constructing sparse and irregular parallelizations, focusing on software composition and performance-portability. He received his B.S., M.C.S., and Ph. D. from the University of Virginia.

Learn how to use the CUB library of "collective" SIMT primitives to simplify CUDA kernel development, maintenance, and tuning. Constructing, tuning, and maintaining kernel code is perhaps the most challenging, time-consuming aspect of CUDA programming. CUDA kernel software is where the complexity of parallelism is expressed. Programmers must reason about deadlock, livelock, synchronization, race conditions, shared memory layout, plurality of state, granularity, throughput, latency, memory bottlenecks, etc. However, with the exception of CUB, there are few (if any) software libraries of reusable kernel primitives; in the CUDA ecosystem, CUB is unique in this regard. CUB provides state-of-the-art, reusable software components for every layer of the CUDA programming model: device-wide primitives (sort, prefix scan, reduction, histogram, etc.); block-wide "collective" primitives (I/O, sort, prefix scan, reduction, histogram, etc.); and warp-wide "collective" primitives (prefix scan, reduction, etc.).
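As a taste of the block-wide layer, cub::BlockReduce sums one value per thread across a thread block; this mirrors the usage pattern in CUB's documentation:

    #include <cub/cub.cuh>

    // Block-wide sum with CUB: every thread contributes one value and
    // thread 0 of each block receives the block's total.
    __global__ void block_sum(const int* in, int* block_totals)
    {
        typedef cub::BlockReduce<int, 128> BlockReduce;  // 128 threads/block
        __shared__ typename BlockReduce::TempStorage temp;

        int v = in[blockIdx.x * blockDim.x + threadIdx.x];
        int total = BlockReduce(temp).Sum(v);            // collective call
        if (threadIdx.x == 0) block_totals[blockIdx.x] = total;
    }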

Session Level: Intermediate
Session Type: Talk
Tags: Numerical Algorithms & Libraries; Programming Languages & Compilers

Day: Thursday, 03/27
Time: 14:00 - 14:50
Location: Room LL21D

S4544 - Harnessing GPUs to Overcome Conventional Fluid-Particle Interaction Simulation Limitations

Adam Sierakowski ( Ph.D. Student, The Johns Hopkins University )
Adam Sierakowski
I graduated with honors from The Johns Hopkins University in 2010 with a B.S. in Mechanical Engineering, a concentration in Aerospace Engineering, and a minor in Mathematics. I learned about scientific computing and GPUs through a series of summer internships at The Johns Hopkins University Applied Physics Laboratory during my undergraduate studies. I am currently working on my Ph.D. under Professor Andrea Prosperetti in the Mechanical Engineering Department at The Johns Hopkins University, focusing on computationally simulating large-scale fluid-particle interactions. In my time away from science, I am a nationally-ranked triathlete and enjoy teaching others how to swim, cycle, and run.

Are you interested in decreasing the runtime of your 24-hour flow simulation to nine minutes? This is the story of how GPUs achieved a 150-fold speedup and made Physalis a viable computational tool for investigating the behavior of large fluid-particle systems. The Physalis method is the only known means of applying near-perfect boundary conditions to spherical particles in a coarse Cartesian finite-difference flow solver, but it suffers from a debilitating computational requirement. GPU technology enables us to overcome this limitation so we can investigate the underlying physics behind natural phenomena like dust storms and energy-generation technologies such as fluidized bed reactors. We will discuss concepts in the design of a GPU finite-difference incompressible Navier-Stokes flow solver, introduce the algorithm behind the Physalis method, and evaluate the current and future capabilities of this GPU fluid-particle interaction code.

Session Level: Beginner
Session Type: Talk
Tags: Computational Fluid Dynamics; Numerical Algorithms & Libraries

Day: Thursday, 03/27
Time: 14:30 - 14:55
Location: Room LL20B

S4154 - Pricing American Options with Least Square Monte Carlo simulations on GPUs

Massimiliano Fatica ( Manager, Tesla Performance Group, NVIDIA )
Massimiliano Fatica
Massimiliano Fatica is a manager of the Tesla Performance Group at NVIDIA where he works in the area of GPU computing (high-performance computing and clusters). He holds a laurea in Aeronautical Engineering and a Ph.D. in Theoretical and Applied Mechanics from the University of Rome "La Sapienza". Prior to joining NVIDIA, he was a research staff member at Stanford University where he worked at the Center for Turbulence Research and Center for Integrated Turbulent Simulations on applications for the Stanford Streaming Supercomputer.

This talk will present a CUDA implementation of the Least Square Monte Carlo method by Longstaff and Schwartz to price American options on GPUs. We will examine all the details of the implementation, from random number and path generation to the least-squares estimation of the continuation value. The implementation can price a put option with 200,000 paths and 50 time steps in less than 10 ms on a Tesla K20X.
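
As a hint of the first stage, here is a sketch of geometric Brownian motion path generation with cuRAND's device API (parameter names are illustrative; the regression stage is omitted):

    #include <curand_kernel.h>

    // One thread simulates one price path under risk-neutral GBM dynamics.
    // Paths are stored time-major so that threads write coalesced.
    __global__ void simulatePaths(float *paths, int nPaths, int nSteps,
                                  float S0, float r, float sigma, float dt,
                                  unsigned long long seed)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= nPaths) return;
        curandState state;
        curand_init(seed, tid, 0, &state);
        float S = S0;
        for (int t = 0; t < nSteps; ++t) {
            float z = curand_normal(&state);
            S *= expf((r - 0.5f * sigma * sigma) * dt + sigma * sqrtf(dt) * z);
            paths[t * nPaths + tid] = S;
        }
    }

The least-squares stage then works backwards in time, regressing discounted payoffs on basis functions of S at each exercise date to estimate the continuation value.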

Session Level: Intermediate
Session Type: Talk
Tags: Finance; Numerical Algorithms & Libraries

Day: Thursday, 03/27
Time: 15:00 - 15:25
Location: Room 210C

S4541 - MAGMA: Development of High-Performance Linear Algebra for GPUs

Stan Tomov ( Research Director, University of Tennessee, Knoxville )
Stanimire (Stan) Tomov is a Research Director in the Innovative Computing Laboratory (ICL) and Adjunct Assistant Professor in the Electrical Engineering and Computer Science Department at the University of Tennessee, Knoxville. Tomov's research interests are in parallel algorithms, numerical analysis, and high-performance scientific computing (HPC). He has been involved in the development of numerical algorithms and software tools in a variety of fields ranging from scientific visualization and data mining to accurate and efficient numerical solution of PDEs. Currently, his work is concentrated on the development of numerical linear algebra libraries for emerging HPC architectures, such as heterogeneous multicore processors, graphics processing units (GPUs), and Many Integrated Core (MIC) architectures. In particular, he is leading the development of the Matrix Algebra on GPU and Multicore Architectures (MAGMA) libraries, which aim to provide LAPACK/ScaLAPACK functionality on next-generation architectures. Tomov is also Principal Investigator of the CUDA Center of Excellence (CCOE) at UTK.

In this session you will learn about the newest developments in high-performance numerical linear algebra for heterogeneous GPU-based systems. We will show a number of novel algorithms for solving linear systems and eigenvalue problems. Besides the algorithmic developments, we will present the methodology for their implementation on multi-GPU platforms. Ease of development is achieved through a programming model that lets algorithms be expressed as sequential code, which is then executed in parallel by a run-time system that schedules work across GPUs and multicore CPUs and seamlessly moves data between them when needed. The implementations are open source, available through the MAGMA library - a next generation of Sca/LAPACK for heterogeneous architectures. Besides the Sca/LAPACK functionality for dense linear algebra problems, we will present a new MAGMA component for sparse linear algebra problems as well.
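
For orientation, here is a minimal sketch of calling MAGMA's hybrid LU solver through its LAPACK-style interface (assuming the MAGMA 1.x magma_dgesv driver and its pinned-memory helpers):

    #include <magma.h>

    // Solve A*X = B with MAGMA's hybrid CPU+GPU LU factorization.
    int solve(magma_int_t n)
    {
        magma_init();
        magma_int_t nrhs = 1, info = 0;
        double *A, *B;
        magma_int_t *ipiv;
        magma_dmalloc_pinned(&A, n * n);     // pinned host memory speeds transfers
        magma_dmalloc_pinned(&B, n * nrhs);
        magma_imalloc_cpu(&ipiv, n);
        /* ... fill A (column-major) and B ... */
        magma_dgesv(n, nrhs, A, n, ipiv, B, n, &info);  // B is overwritten with X
        magma_free_pinned(A); magma_free_pinned(B); magma_free_cpu(ipiv);
        magma_finalize();
        return (int)info;
    }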

Session Level: Beginner
Session Type: Talk
Tags: Numerical Algorithms & Libraries; Supercomputing

Day: Thursday, 03/27
Time: 15:00 - 15:50
Location: Room LL21D

S4682 - Developing a System For Real-Time Numerical Simulation During Physical Experiments in a Wave Propagation Laboratory

Darren Schmidt ( Numerical Computing Specialist, National Instruments )
Darren Schmidt
Darren Schmidt has worked for National Instruments in Austin, TX for almost two decades serving as a computation expert on a wide array of products and authoring several patents across multiple computational math domains. Currently, he works in NI's Scientific Research and Lead User Group defining, developing and deploying cutting edge systems for big analog data and large physics applications. These real-world applications demand use of a broad range of (co-)processor technologies in time-constrained situations for which he has amassed a great deal of intuition and experience.

ETH-Zurich is proposing a new concept for wave propagation laboratories in which the physical experiment is linked with a numerical simulation in real time. Adding live experimental data to a larger numerical simulation domain creates a virtual lab environment never before realized, enabling the study of frequencies inherent in important real-world seismological and acoustic scenarios. The resulting environment is made possible by a real-time computing system under development. This system must perform computations typically reserved for traditional (offline) HPC applications but produce results in a matter of microseconds. To do so, National Instruments is using the LabVIEW platform to leverage NI's fastest data acquisition and FPGA hardware with NVIDIA's most powerful GPU processors to build a real-time heterogeneous simulator.

Session Level: Intermediate
Session Type: Talk
Tags: Climate, Weather, Ocean Modeling; Big Data Analytics & Data Algorithms; Signal & Audio Processing; Numerical Algorithms & Libraries

Day: Thursday, 03/27
Time: 15:00 - 15:25
Location: Room 212B

S4391 - Great Performance for Tiny Problems: Batched Products of Small Matrices

Nikolay Markovskiy ( Compute Devtech Engineer, NVIDIA )
Nikolay Markovskiy
Nikolay is an HPC engineer with experience in scientific research and software development focusing on computational techniques related to physics, chemistry, and biology.

Learn how to get great performance on Kepler GPUs for small dense matrix products. Dense linear algebra operations are generally best performed in cuBLAS, but for batches of very small matrices, it may be possible to exploit extra knowledge of your particular application to improve performance. After an analysis of an initial implementation, we will look into different algorithmic improvements (tiling, prefetching), use special features of the Kepler architecture, and finally investigate autotuning to select the best implementation for a given problem size.
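
As a baseline before any custom tuning, cuBLAS already exposes a batched interface for exactly this case; a sketch (n-by-n matrices, column-major, with the pointer arrays resident on the device):

    #include <cublas_v2.h>

    // Multiply batchCount independent n x n matrix pairs: C[i] = A[i] * B[i].
    void batchedProduct(cublasHandle_t handle, int n, int batchCount,
                        const float **d_Aarray, const float **d_Barray,
                        float **d_Carray)
    {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                           n, n, n,
                           &alpha, d_Aarray, n, d_Barray, n,
                           &beta, d_Carray, n, batchCount);
    }

The session's custom kernels aim to beat this baseline for very small n by exploiting application-specific knowledge.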

Session Level: Intermediate
Session Type: Talk
Tags: Quantum Chemistry; Numerical Algorithms & Libraries

Day: Thursday, 03/27
Time: 15:30 - 15:55
Location: Room LL21E

S4557 - Accelerating Option Risk Analytics in R Using GPUs

Matthew Dixon ( Term Assistant Professor of Analytics, University of San Francisco )
Matthew Dixon
Matthew is a term assistant professor of analytics in the School of Management and Department of Computer Science at the University of San Francisco. He is also a consulting director of risk for HedgeFacts, LLP, a portfolio analytics and fund administration platform for hedge funds. In addition to holding academic appointments as Krener Assistant Professor at UC Davis and postdoctoral researcher at Stanford University, Matthew has worked and consulted for various investment banks and the Bank for International Settlements on quantitative risk methodology. He serves on the Global Association of Risk Professionals’ San Francisco chapter committee and co-chairs the workshop on high performance computational finance at SC, the International Conference for High Performance Computing, Networking, Storage and Analysis.

Learn how to combine the convenience of the R Statistical Software Package with the computational resources provided by GPUs to accelerate computationally intensive financial computations exhibiting high degrees of parallelism. In this talk, we describe ongoing work towards the development of an R library providing GPU-optimized implementations of the computationally intensive kernels that frequently appear in option risk analytics applications. Such kernels are bottlenecks in a workflow that is often highly dependent on a rich set of numerical and statistical functionality native to R, functionality that may be difficult to replicate outside of R. We demonstrate the utility of our approach on the intra-day calibration of the Bates stochastic volatility jump-diffusion model, often used for risk analysis of equity derivatives. The combined performance gain from rewriting the error function in C++ and deploying the computations on an NVIDIA Tesla K20c (Kepler architecture) is approximately 760x. Detailed results will be presented during the talk.

Session Level: Intermediate
Session Type: Talk
Tags: Finance; Numerical Algorithms & Libraries

Day: Thursday, 03/27
Time: 15:30 - 15:55
Location: Room 210C

S4117 - Fast Fixed-Radius Nearest Neighbor Search on the GPU: Interactive Million-Particle Fluids

Rama Hoetzlein ( Graphics Devtech, NVIDIA )
Rama Hoetzlein
Rama Hoetzlein is a graphics research scientist working in the areas of physical simulation, procedural animation, and scientific visualization, focusing on methods that utilize GPU-based computation. In January 2013, he started at NVIDIA as a Graphics Devtech.

Nearest neighbor search is the key to efficient simulation of many discrete physical models. This talk focuses on a novel, efficient fixed-radius NNS based on counting sort accelerated with atomic GPU operations, which requires only two kernel calls. As a sample application, fluid simulations based on smoothed particle hydrodynamics (SPH) use NNS to determine interacting fluid particles. The counting-sort NNS method achieves a performance gain of 3-5x over previous radix-sort NNS, which allows for interactive SPH fluids of 4 million particles at 4 fps on current hardware. The technique presented is generic and easily adapted to other domains, such as molecular interactions or point cloud reconstruction.
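
To sketch the counting phase (a generic illustration, not the speaker's exact code): each particle atomically increments the population count of its spatial cell; a prefix scan of these counts then yields each cell's start offset, after which a second pass scatters particles into cell order.

    // Phase 1 of counting sort: histogram particles into uniform grid cells.
    __global__ void countParticles(const float3 *pos, int n, int *cellCounts,
                                   float cellSize, int3 gridRes)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int cx = (int)(pos[i].x / cellSize);
        int cy = (int)(pos[i].y / cellSize);
        int cz = (int)(pos[i].z / cellSize);
        int cell = (cz * gridRes.y + cy) * gridRes.x + cx;
        atomicAdd(&cellCounts[cell], 1);  // fast on Kepler-class hardware
    }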

Session Level: Advanced
Session Type: Talk
Tags: Computational Fluid Dynamics; Molecular Dynamics; Numerical Algorithms & Libraries; Performance Optimization

Day: Thursday, 03/27
Time: 16:00 - 16:25
Location: Room LL20B

S4542 - GAMPACK: A Scalable GPU-Accelerated Algebraic Multigrid Package

Yongpeng Zhang ( Computational Scientist, Stone Ridge Technology )
Yongpeng Zhang
Yongpeng Zhang is a computational scientist at Stone Ridge Technology. He received his BS and MS from Beihang University and Drexel University in Electrical Engineering. He earned his Ph.D. in Computer Science from North Carolina University in 2012. His research interests include data mining, programming models, and compiler technologies for large-scale heterogeneous systems. His recent work at Stone Ridge is developing the fastest algebraic multigrid (AMG) solver for sparse linear systems.

We present our latest development work for GAMPACK, a fully GPU-accelerated Algebraic Multigrid PACKage. GAMPACK is used to solve elliptic PDEs found in various applications including reservoir simulation, CFD and structural mechanics. We compare classical and aggregation-based AMG algorithms on GPUs and demonstrate substantial acceleration of both the setup and solve phases over CPU-only implementations. We discuss how we achieve good scaling for large problems by utilizing all computing resources (including multi-GPU, multi-core CPU and clusters), by overlapping communication and computation and by optimally distributing the workload across available hardware resources. Finally, we describe how accelerated AMG can benefit engineering and scientific applications by significantly reducing the time to solution.

Session Level: Intermediate
Session Type: Talk
Tags: Numerical Algorithms & Libraries; Clusters & GPU Management

Day: Thursday, 03/27
Time: 16:00 - 16:25
Location: Room LL21D

S4357 - Driving the Next Generation of Extremely Large Telescopes Using Adaptive Optics with GPUs

Damien Gratadour ( Associate professor, LESIA - Observatoire de Paris )
Damien Gratadour is an associate professor at Paris-Diderot University and astronomer at Observatoire de Paris. His research focuses on the development of high angular resolution instrumentation for the largest telescopes on the planet as well as the study of active galactic nuclei and super stellar clusters with these telescopes. Before obtaining his tenure position at Paris Observatory, he worked for 3 years as an adaptive optics scientist at the Gemini Observatory in Hawaii and Chile.

The European Southern Observatory is leading the construction of the European Extremely Large Telescope (E-ELT), a 39m diameter telescope, to provide Europe with the biggest eye on the Universe ever built, with first light foreseen in 2022. The E-ELT will be the first telescope to depend entirely, for routine operations, on adaptive optics (AO), an instrumental technique for correcting dynamically evolving aberrations in an optical system, used on astronomical telescopes to compensate in real time for the effect of atmospheric turbulence. In this session, we will show how GPUs can provide the throughput required both to simulate at high frame rate and to drive in real time these AO systems, which provide tens of thousands of degrees of freedom actuated several hundred times per second.

Session Level: All
Session Type: Talk
Tags: Astronomy & Astrophysics; Supercomputing; Numerical Algorithms & Libraries; Recommended Press Session – HPC-Science; Recommended for All Press

Day: Thursday, 03/27
Time: 16:30 - 16:55
Location: Room LL21F

S4407 - Incremental Risk Charge With cuFFT: A Case Study of Enabling Multi Dimensional Gain with Few GPUs

Amit Kalele ( Associate Consultant, Tata Consultancy Services Limited )
Amit Kalele is presently working as an Associate Consultant with TCS at the Center of Excellence for Optimization and Parallelization. Prior to TCS, Amit worked as a scientist at Computational Research Laboratory. His research areas are HPC, parallel computing, computational aspects of finance, cryptography, and cryptanalysis. Amit Kalele has a PhD in Electrical Engineering from the Indian Institute of Technology Bombay, Mumbai, India.
Manoj Nambiar ( Principal Scientist, Tata Consultancy Services Limited )
Manoj Nambiar is currently working with TCS as a Principal Scientist, heading the Performance Engineering Research Center (PERC). He also leads the Parallelization and Optimization Centre of excellence as a part of the company’s HPC Initiative. Until 2011, Manoj has been working as a research lead in High Performance Messaging, Networking and Operating Systems in PERC. Prior to this has executed several consulting assignments in the performance engineering area specializing in network and systems performance. Manoj has a B.E from the University of Bombay, and a post graduate diploma in VLSI design from C-DAC, India.

GPUs are well suited to massively parallel problems, but users often hesitate to adopt them because of the limited memory bandwidth between host and device. The problem of Incremental Risk Charge calculation was posed to us by one of our customers. This proof of concept demonstrates that GPUs with the cuFFT library and multi-stream computation not only deliver fast performance but also achieve a substantial reduction in hardware footprint and energy consumption. These gains cannot be overlooked by any business unit. This study is also helpful in making an informed decision when choosing the right technology for business use.
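
A minimal sketch of the multi-stream pattern (sizes and names illustrative): each plan is bound to its own stream via cufftSetStream, so transfers and transforms from different streams can overlap and hide the host-device bandwidth limit.

    #include <cufft.h>

    // Queue a batched C2C FFT asynchronously on one stream. In production the
    // plan would be created once and reused rather than rebuilt per call.
    void fftOnStream(cufftComplex *d_data, const cufftComplex *h_data,
                     size_t bytes, int N, int batch, cudaStream_t stream)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, N, CUFFT_C2C, batch);
        cufftSetStream(plan, stream);   // all work below is queued on 'stream'
        cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
        /* ... further per-stream processing and D2H copy ... */
        cufftDestroy(plan);
    }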

Session Level: Beginner
Session Type: Talk
Tags: Finance; Numerical Algorithms & Libraries

Day: Thursday, 03/27
Time: 16:30 - 16:55
Location: Room 210C

S4418 - Faster Kinetics: Accelerate Your Finite-Rate Combustion Simulation with GPUs

Christopher Stone ( Owner, Computational Science and Engineering, LLC )
Dr. Christopher Stone received his PhD from Georgia Tech in 2003 and has been the owner of Computational Science and Engineering, LLC since 2006. His professional research and development interests include combustion modeling, computational fluid dynamics (CFD), iterative methods for sparse linear systems, and numerical integration methods. He has been developing parallel GPU algorithms in CUDA since 2008.

Explore the latest techniques for accelerating combustion simulations with finite-rate chemical kinetics using GPUs. In this session we will compare the performance of different numerical methods for solving stiff and non-stiff ODEs and discuss the compromises that must be made between parallel throughput and numerical efficiency. Learn techniques used to (1) manage variable integration costs across the concurrent ODEs and (2) reduce thread divergence caused by non-linear iterative solvers.
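
The basic batching pattern, one thread per independent ODE system, can be sketched as follows (a generic illustration using an explicit RK4 step on a scalar model problem; the session's stiff kinetics require implicit methods on real mechanisms):

    // Each thread integrates its own independent ODE dy/dt = -k*y with RK4.
    // When 'steps' varies per system, threads within a warp diverge -- one of
    // the costs the session discusses managing.
    __device__ float rhs(float y, float k) { return -k * y; }

    __global__ void rk4Batch(float *y, const float *k, int n,
                             float dt, int steps)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float yi = y[i], ki = k[i];
        for (int s = 0; s < steps; ++s) {
            float k1 = rhs(yi, ki);
            float k2 = rhs(yi + 0.5f * dt * k1, ki);
            float k3 = rhs(yi + 0.5f * dt * k2, ki);
            float k4 = rhs(yi + dt * k3, ki);
            yi += dt * (k1 + 2.0f * k2 + 2.0f * k3 + k4) / 6.0f;
        }
        y[i] = yi;
    }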

Session Level: Advanced
Session Type: Talk
Tags: Computational Fluid Dynamics; Numerical Algorithms & Libraries; Supercomputing; Computational Physics

Day: Thursday, 03/27
Time: 16:30 - 16:55
Location: Room LL20B

S4476 - GPU-Based Bayesian Phylogenetic Inference Beyond Extreme Scale

Mitchel Horton ( Research Scientist , Georgia Institute of Technology )
Mitchel Horton
Mitch Horton works with the Keeneland project doing advanced user support and application development. Previously, Mitch worked in the Innovative Computing Laboratory (ICL) on the Matrix Algebra on GPU and Multicore Architectures (MAGMA) project.

See how researchers can, for the first time, infer phylogenetic trees of unlimited size using the bonanza of biological sequence data available to them today. We will present a phylogenetic inference approach that combines an existing GPU-based Bayesian phylogenetic reconstruction application (BEAST/BEAGLE) with the notion of performing an independent Markov chain Monte Carlo (MCMC) run on any number of GPUs, on any number of nodes, of any size HPC GPU cluster. The approach will be shown to scale indefinitely for sufficiently large problems. In addition, we will present a new batched matrix-matrix product CUDA kernel used for the matrix exponentiation at the heart of the phylogenetic inference algorithm.

Session Level: Intermediate
Session Type: Talk
Tags: Bioinformatics & Genomics; Numerical Algorithms & Libraries; Supercomputing

Day: Thursday, 03/27
Time: 16:30 - 16:55
Location: Room 210H

S4668 - General Transformations for GPU Execution of Tree Traversals

Milind Kulkarni ( Assistant Professor, Purdue University )
Milind Kulkarni
Milind Kulkarni is an assistant professor in the School of Electrical and Computer Engineering at Purdue University. His research focuses on developing languages, compilers and systems that can efficiently and effectively exploit locality and parallelism in complex applications on complex computation platforms. Before joining Purdue, he was a postdoctoral research associate at the University of Texas at Austin from May 2008 to August 2009. He received his Ph.D. in Computer Science from Cornell University in 2008. Prior to that, he received his M.S. in Computer Science from Cornell University in 2005, and BS degrees in Computer Science and Computer Engineering from North Carolina State University in 2002. He has served on numerous program committees, and was the organizer of the 2013 Workshop on Languages and Abstractions for High Performance Computing, co-located with PPoPP. He is a 2012 recipient of the NSF CAREER award and a 2013 recipient of the Department of Energy Early Career Research Award. He is a member of the ACM and the IEEE Computer Society.

We present general-purpose techniques for implementing irregular algorithms on GPUs that exploit similarities in algorithmic structure rather than application-specific knowledge. We demonstrate these techniques on several tree traversal algorithms, achieving speedups of up to 38x over 32-thread CPU versions.

Session Level: Beginner
Session Type: Talk
Tags: Numerical Algorithms & Libraries; Programming Languages & Compilers; Performance Optimization

Day: Thursday, 03/27
Time: 16:30 - 16:55
Location: Room LL21D


TUTORIAL

Presentation
Details

S4710 - Session 1: Introduction to Productive GPU Programming (Presented by ArrayFire)

Umar Arshad ( Senior Software Engineer, CUDA Training Specialist, ArrayFire )
Umar Arshad is an engineer at ArrayFire where he primarily works on improving concurrency in ArrayFire and in applications using ArrayFire. He also created the CUDA and OpenCL optimization training material and regularly gives tutorials throughout the country. Before joining ArrayFire, Umar was a developer at Inovalon, where he was involved with improving performance and designing large-scale applications. Umar graduated from Georgia State University with a Master's degree in Computer Science. At GSU, he studied parallel programming and served as Program Chair of the university's ACM chapter.

Excited to get started with GPU computing? Learn about the best practices and tools to quickly get started with GPUs. We will introduce you to the latest advancements available in the CUDA ecosystem and describe how to use them efficiently. You will walk away knowing the right tools for getting started productively and the cutting-edge libraries available to accelerate your applications using GPUs. Libraries discussed will include cuBLAS, cuFFT, ArrayFire, and Thrust.
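
To give a taste of the productivity argument, a complete Thrust program needs no hand-written kernels at all (a minimal sketch; the SAXPY functor is illustrative):

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/reduce.h>

    // y = a*x + y, expressed as a reusable functor.
    struct saxpy
    {
        float a;
        saxpy(float a_) : a(a_) {}
        __host__ __device__ float operator()(float x, float y) const
        { return a * x + y; }
    };

    int main()
    {
        thrust::device_vector<float> x(1 << 20, 1.0f);
        thrust::device_vector<float> y(1 << 20, 2.0f);
        thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy(3.0f));
        float sum = thrust::reduce(y.begin(), y.end());  // runs on the GPU
        return sum > 0.0f ? 0 : 1;
    }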

Session Level: Beginner
Session Type: Tutorial
Tags: Programming Languages & Compilers; Debugging Tools & Techniques; Performance Optimization; Numerical Algorithms & Libraries

Day: Monday, 03/24
Time: 09:00 - 10:20
Location: Room 210B

S4711 - Session 2: Fast, Parallel Algorithms for Computer Vision and Machine Learning with GPUs (Presented by ArrayFire)

Umar Arshad ( Senior Software Engineer, CUDA Training Specialist, ArrayFire )
Umar Arshad is an engineer at ArrayFire where he primarily works on improving concurrency in ArrayFire and in applications using ArrayFire. He also created the CUDA and OpenCL optimization training material and regularly gives tutorials throughout the country. Before joining ArrayFire, Umar was a developer at Inovalon, where he was involved with improving performance and designing large-scale applications. Umar graduated from Georgia State University with a Master's degree in Computer Science. At GSU, he studied parallel programming and served as Program Chair of the university's ACM chapter.

Working on image processing, computer vision, or machine learning? Learn best practices for implementing parallel versions of popular algorithms on GPUs. Instead of reinventing the wheel, you will learn where to find and how to use excellent versions of these algorithms already available in CUDA and ArrayFire libraries. You will walk away equipped with the best tools and knowledge for implementing accelerated image processing and machine learning. This session will also include information about programming CUDA on Tegra mobile devices for computer vision applications.

Session Level: Beginner
Session Type: Tutorial
Tags: Computer Vision; Machine Learning & AI; Video & Image Processing; Numerical Algorithms & Libraries

Day: Monday, 03/24
Time: 10:30 - 11:50
Location: Room 210B

S4712 - Session 3: Advanced CUDA Optimizations (Presented by ArrayFire)

Umar Arshad ( Senior Software Engineer, CUDA Training Specialist, ArrayFire )
Umar Arshad is an engineer at ArrayFire where he primarily works on improving concurrency in ArrayFire and in applications using ArrayFire. He also created the CUDA and OpenCL optimization training material and regularly gives tutorials throughout the country. Before joining ArrayFire, Umar was a developer at Inovalon, where he was involved with improving performance and designing large-scale applications. Umar graduated from Georgia State University with a Master's degree in Computer Science. At GSU, he studied parallel programming and served as Program Chair of the university's ACM chapter.

In this session, we will examine instruction-level parallelism (ILP) and Kepler-specific optimizations, including shuffle instructions and dynamic parallelism. We will also equip you with knowledge of important profiling and debugging tools to improve GPU utilization and kernel performance.
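
For example, Kepler's shuffle instructions let a warp reduce a value without touching shared memory (a minimal sketch using the documented __shfl_down intrinsic, compute capability 3.0 and up):

    // Warp-level sum: each step halves the number of active lanes.
    __inline__ __device__ float warpReduceSum(float val)
    {
        for (int offset = warpSize / 2; offset > 0; offset /= 2)
            val += __shfl_down(val, offset);
        return val;  // lane 0 holds the warp's total
    }

Compared with a shared-memory reduction, this removes __syncthreads() calls and shared-memory traffic, and exposes more ILP inside the loop.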

Session Level: Advanced
Session Type: Tutorial
Tags: Performance Optimization; Programming Languages & Compilers; Debugging Tools & Techniques; Numerical Algorithms & Libraries

Day: Monday, 03/24
Time: 13:00 - 14:20
Location: Room 220B

S4713 - Session 4: Deploying Your CUDA Applications Into The Wild (Presented by ArrayFire)

Umar Arshad ( Senior Software Engineer, CUDA Training Specialist, ArrayFire )
Umar Arshad is an engineer at ArrayFire where he primarily works on improving concurrency in ArrayFire and in applications using ArrayFire. He also created the CUDA and OpenCL optimization training material and regularly gives tutorials throughout the country. Before joining ArrayFire, Umar was a developer at Inovalon, where he was involved with improving performance and designing large-scale applications. Umar graduated from Georgia State University with a Master's degree in Computer Science. At GSU, he studied parallel programming and served as Program Chair of the university's ACM chapter.

Excited about CUDA but concerned about deployment? In this session, you will learn best practices for deploying your CUDA application and about how to resolve issues that commonly arise in the process. You will learn about scaling your application to multiple GPUs to handle large amounts of data (such as streams and/or files on disk). You will also learn about deploying your CUDA based applications in the cloud using Node.js, containers via Docker, etc.
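
One common scaling pattern the session touches on is round-robin distribution of independent work units across all visible GPUs (a generic sketch; the per-chunk transfers and launches are elided):

    #include <vector>
    #include <cuda_runtime.h>

    // Give each GPU its own stream, then deal out chunks in round-robin order.
    void distributeWork(int numChunks)
    {
        int ngpu = 0;
        cudaGetDeviceCount(&ngpu);
        std::vector<cudaStream_t> streams(ngpu);
        for (int d = 0; d < ngpu; ++d) {
            cudaSetDevice(d);
            cudaStreamCreate(&streams[d]);
        }
        for (int c = 0; c < numChunks; ++c) {
            int d = c % ngpu;
            cudaSetDevice(d);
            /* ... async H2D copy, kernel launch, D2H copy on streams[d] ... */
        }
        for (int d = 0; d < ngpu; ++d) {
            cudaSetDevice(d);
            cudaStreamSynchronize(streams[d]);
            cudaStreamDestroy(streams[d]);
        }
    }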

Session Level: Intermediate
Session Type: Tutorial
Tags: Numerical Algorithms & Libraries; Clusters & GPU Management; Big Data Analytics & Data Algorithms; Mobile Applications

Day: Monday, 03/24
Time: 14:30 - 15:50
Location: Room 210B


HANDS-ON LAB

Presentation
Details

S4868 - Hands-on Lab: Signal Processing with cuFFT

Jason Cohen ( Software Engineer, Developer Tools, NVIDIA )
Jason Cohen
Jason Cohen develops performance analysis tools for GPU programming. Currently the primary developer of the CUDA profiler in Nsight Visual Studio, he contributes to all layers of software from drivers to user interfaces, and has developed such features in the tools as the NVTX annotation library and kernel-replay profiling. Jason holds a B.S. in Computer Science and a B.S. and M.S. in Electrical and Computer Engineering from Carnegie Mellon University.

This lab will provide a guided example of developing applications using GPU-accelerated FFTs in C/C++. The process begins with prototyping an algorithm in MATLAB. Next, the algorithm is ported directly to C/C++, using the cuFFTW drop-in interface first for convenience, and then cuFFT for production-quality performance. Finally, optimization techniques for maximizing GPU usage will be explored. Emphasis will be placed on using CUDA profiling tools to monitor GPU usage, take accurate measurements, and empirically verify all claims about performance at each step. Be prepared for this hands-on lab by installing the suggested software at bit.ly/gtc14labs on your system.
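
The cuFFTW step works because the library mirrors the FFTW3 API; code like the following can be retargeted to the GPU by switching the header and linking against cuFFTW instead of FFTW (a minimal sketch):

    #include <cufftw.h>   // drop-in replacement for <fftw3.h>

    int main()
    {
        int N = 1024;
        fftwf_complex *in  = (fftwf_complex *)fftwf_malloc(sizeof(fftwf_complex) * N);
        fftwf_complex *out = (fftwf_complex *)fftwf_malloc(sizeof(fftwf_complex) * N);
        /* ... fill 'in' with samples ... */
        fftwf_plan p = fftwf_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
        fftwf_execute(p);               // the transform runs on the GPU
        fftwf_destroy_plan(p);
        fftwf_free(in); fftwf_free(out);
        return 0;
    }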

Session Level: Intermediate
Session Type: Hands-on Lab
Tags: Numerical Algorithms & Libraries

Day: Monday, 03/24
Time: 14:30 - 15:50
Location: Room 230A

S4788 - Hands-on Lab: Rapid Multi-GPU Programming with CUDA Libraries

Nikolay Markovskiy ( Compute DevTech Engineer, NVIDIA )
Nikolay is an HPC engineer with experience in scientific research and software development focusing on computational techniques related to physics, chemistry, and biology.

Learn how to use CUDA libraries for quick, high-level programming on multiple GPUs. We will accelerate Octave, using NVBLAS to provide drop-in acceleration on the GPU. We will walk through configuration of the library to run on multiple GPUs. We will then move on to use the extended (XT) library interfaces in cuBLAS and cuFFT, specifically using large matrices support in cuBLAS-XT and single & batch transforms across multiple GPUs using cuFFT-XT. Be prepared for this hands-on lab by installing the suggested software at bit.ly/gtc14labs on your system.
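
For a sense of the XT interfaces, here is a sketch of a large SGEMM spread across two GPUs with cuBLAS-XT (host pointers are passed directly; the library tiles the matrices and schedules the transfers):

    #include <cublasXt.h>

    // C = A * B across GPUs 0 and 1; A, B, C are ordinary host arrays
    // in column-major order.
    void xtGemm(size_t m, size_t n, size_t k,
                const float *A, const float *B, float *C)
    {
        cublasXtHandle_t handle;
        cublasXtCreate(&handle);
        int devices[2] = {0, 1};
        cublasXtDeviceSelect(handle, 2, devices);
        const float alpha = 1.0f, beta = 0.0f;
        cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                      &alpha, A, m, B, k, &beta, C, m);
        cublasXtDestroy(handle);
    }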

Session Level: Beginner
Session Type: Hands-on Lab
Tags: Numerical Algorithms & Libraries

Day: Tuesday, 03/25
Time: 13:00 - 14:20
Location: Room 230A

S4790 - Hands-on Lab: Numerical Integration in CUDA

Carl Ponder ( DevTech Engineer, NVIDIA )
Highly-Rated Speaker
Carl Ponder
Carl is a DevTech Engineer at NVIDIA where he focuses on CUDA application tuning and performance. Carl received his Ph.D. in Computer Science from the University of California, Berkeley.

Evaluating integrals is an important part of modelling physical systems. For sufficiently complex systems, integrals as closed-form expressions are difficult to derive or do not exist, so numerical approximation is the method of choice. In this session we will survey methods of numerical integration -- tiling, Monte Carlo, and transforms -- and discuss their efficiencies and the characteristics of their approximation error. We will work through some simple hands-on exercises of integrating the Gaussian function, estimating Pi, and measuring the volume of a multidimensional polytope. You will gain some practice writing simple CUDA code and using the cuRAND library to generate high-quality random numbers in parallel, which are also applicable to other areas such as randomized simulation. Be prepared for this hands-on lab by installing the suggested software at bit.ly/gtc14labs on your system.
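
As a preview of the Monte Carlo exercise, here is a sketch of the classic Pi estimate with cuRAND's device API: each thread samples points in the unit square and counts hits inside the quarter circle.

    #include <curand_kernel.h>

    __global__ void estimatePi(unsigned long long seed, int samplesPerThread,
                               unsigned int *hits)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        curandState state;
        curand_init(seed, tid, 0, &state);
        unsigned int local = 0;
        for (int i = 0; i < samplesPerThread; ++i) {
            float x = curand_uniform(&state);
            float y = curand_uniform(&state);
            if (x * x + y * y <= 1.0f) ++local;
        }
        atomicAdd(hits, local);
        // Host side: pi is approximately 4.0 * hits / totalSamples.
    }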

Session Level: Intermediate
Session Type: Hands-on Lab
Tags: Numerical Algorithms & Libraries; Finance

Day: Tuesday, 03/25
Time: 16:00 - 17:20
Location: Room 230A

S4791 - Hands-on Lab: Building a Sparse Linear Solver using CUDA Libraries

Sharan Chetlur ( CUDA Software Engineer, NVIDIA )

In this hands-on session, we will construct a sparse iterative solver using CUDA library routines. We will use the standard cuBLAS and cuSPARSE libraries to construct a simple yet performant solver without writing any custom CUDA kernels. We will walk through an example of how to set up and use various cuBLAS and cuSPARSE APIs to implement the SSOR (Symmetric Successive Over-Relaxation) algorithm. Be prepared for this hands-on lab by installing the suggested software at bit.ly/gtc14labs on your system.
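
To preview the flavor of this kernel-free composition, here is a sketch of one building block such a solver chains together: the residual r = b - A*x, computed entirely with cuSPARSE and cuBLAS calls (CSR storage assumed; all pointers are device memory; the legacy csrmv interface current as of CUDA 5/6 is used):

    #include <cublas_v2.h>
    #include <cusparse_v2.h>

    void residual(cusparseHandle_t sp, cublasHandle_t bl,
                  cusparseMatDescr_t descr, int m, int nnz,
                  const float *val, const int *rowPtr, const int *colInd,
                  const float *x, const float *b, float *r)
    {
        const float one = 1.0f, negOne = -1.0f, zero = 0.0f;
        // r = A * x
        cusparseScsrmv(sp, CUSPARSE_OPERATION_NON_TRANSPOSE, m, m, nnz,
                       &one, descr, val, rowPtr, colInd, x, &zero, r);
        // r = b - r  (scale by -1, then add b)
        cublasSscal(bl, m, &negOne, r, 1);
        cublasSaxpy(bl, m, &one, b, 1, r, 1);
    }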

Session Level: Intermediate
Session Type: Hands-on Lab
Tags: Computational Fluid Dynamics; Computational Physics; Numerical Algorithms & Libraries; Manufacturing

Day: Wednesday, 03/26
Time: 14:00 - 15:20
Location: Room 230A


SPECIAL EVENT

Presentation
Details

S4908 - Hangout: GPU Accelerated Libraries

Lung Shen Chien ( Software Engineer, NVIDIA )
Yang Song ( Senior Software Engineer, NVIDIA )
Philippe Vandermersch ( Senior Software Engineer, NVIDIA )
Sharan Chetlur ( CUDA Software Engineer, NVIDIA )

Connect with NVIDIA engineers, devtechs and invited experts and get answers to all your burning questions.

Session Level: All
Session Type: Special Event
Tags: Numerical Algorithms & Libraries

Day: Monday, 03/24
Time: 12:00 - 13:50
Location: Concourse Pod C

S4891 - Birds of a Feather Lunch: Computational Photography, Libraries

Join peers who share common interests over lunch and participate in discussions without any pre-planned agenda. We've organized this year's BoFs around core topic areas and have assigned discussion leaders for each topic. Topics will be listed on each BoF table. Check out the cool stuff in the Exhibit Hall then pick up your lunchbox and head on over!

Session Level: All
Session Type: Special Event
Tags: NVIDIA - Special Event; Computational Photography; Numerical Algorithms & Libraries

Day: Tuesday, 03/25
Time: 13:00 - 13:50
Location: Room 220A

S4927 - Hangout: GPU-Accelerated Libraries

Lung Shen Chien ( Software Engineer, NVIDIA )
Philippe Vandermersch ( Senior Software Engineer, NVIDIA )

Connect with NVIDIA engineers, devtechs and invited experts and get answers to all your burning questions.

Session Level: All
Session Type: Special Event
Tags: Numerical Algorithms & Libraries

Day: Wednesday, 03/26
Time: 14:00 - 15:50
Location: Concourse Pod C

S4945 - Hangout: GPU-Accelerated Libraries

Philippe Vandermersch ( Senior Software Engineer, NVIDIA )
Lung Shen Chien ( Software Engineer, NVIDIA )
Sharan Chetlur ( CUDA Software Engineer, NVIDIA )

Connect with NVIDIA engineers, devtechs and invited experts and get answers to all your burning questions.

Session Level: All
Session Type: Special Event
Tags: Numerical Algorithms & Libraries

Day: Thursday, 03/27
Time: 09:00 - 10:50
Location: Concourse Pod C

S4893 - Birds of a Feather Lunch: Quantum Chemistry & Libraries

Join peers who share common interests over lunch and participate in discussions without any pre-planned agenda. We've organized this year's BoFs around core topic areas and have assigned discussion leaders for each topic. Topics will be listed on each BoF table. Check out the cool stuff in the Exhibit Hall then pick up your lunchbox and head on over!

Session Level: All
Session Type: Special Event
Tags: NVIDIA - Special Event; Numerical Algorithms & Libraries; Quantum Chemistry

Day: Thursday, 03/27
Time: 13:00 - 13:50
Location: Room 220A
