GPU Technology Conference
April 4-7, 2016 | Silicon Valley

S6645 - Scientific Visualization in HPC

Peter Messmer Principal Software Engineer, NVIDIA
Highly-Rated Speaker
Peter Messmer is a senior software engineer in NVIDIA's Developer Technology organization, working with clients to accelerate their scientific discovery process with GPUs. One area of his current research is to investigate how to utilize the GPUs in high performance computing systems for data analysis and visualization. Prior to joining NVIDIA, Peter spent more than 15 years developing HPC- and GPU-accelerated applications for industry and government clients, ranging from simulating next-generation particle accelerators or electromagnetic problems to modeling the behavior of dust on the surface of the Moon. Peter holds an M.S. and Ph.D. in physics from ETH Zurich, Switzerland, with specialization in kinetic plasma physics and nonlinear optics.

Learn how to leverage the graphics power in your GPU-accelerated supercomputer to turn your simulation data into insight. Starting from simulation data distributed across the nodes of a remote supercomputer, we'll cover various techniques and tools to convert this data into insightful visualizations at your workstation, leading to an end-to-end GPU accelerated visualization pipeline.

Level: Beginner
Type: Talk
Tags: In-Situ and Scientific Visualization; Supercomputing & HPC

Day: Monday, 04/04
Time: 09:00 - 09:50
Location: Room 212A

S6590 - HPC Visualization Using NVIDIA IndeX™

Tom-Michael Thamm Director, Software Product Management, NVIDIA
Tom-Michael Thamm is director of software product management at the NVIDIA Advanced Rendering Center (ARC) in Berlin, Germany, where he is responsible for all commercial software products, such as NVIDIA mental ray, NVIDIA Iray, and NVIDIA IndeX. With his team, he manages and coordinates customer support as well as general product definition and positioning. Tom-Michael has worked for NVIDIA ARC, and before that for mental images, for over 25 years. He has led several key software projects and products, such as NVIDIA IndeX for large-volume visualization. He studied mathematics.
Christopher Lux Senior Graphics Software Engineer, NVIDIA
Christopher Lux is a senior graphics software engineer at the NVIDIA Advanced Rendering Center. He received his Ph.D. in computer science in 2013 from the Bauhaus-Universitat Weimar, Germany. Through his interest in real-time computer graphics and scientific visualization, he focused his work early on the interactive visualization of large-scale datasets from the geo-scientific and medical domains.
Marc Nienhaus Product Technology Lead, NVIDIA IndeX, NVIDIA
Marc Nienhaus is the product technology lead for the NVIDIA IndeX commercial software at NVIDIA. He manages the NVIDIA IndeX software engineering team and is responsible for the product architecture and for applications of NVIDIA IndeX in various domains. Before joining mental images' R&D rendering department and NVIDIA ARC, Marc was a postdoctoral researcher at Northwestern University in Illinois and led research projects at the University of Potsdam. His research interests include parallel and distributed rendering and computing, scientific visualization, GPU-based rendering, and photorealistic and non-photorealistic expressive depictions. Marc holds an M.S. in mathematics with a minor in computer science from the University of Muenster, and a Ph.D. in computer science from the Hasso Plattner Institute at the University of Potsdam. He has published various papers on GPU-based real-time and non-photorealistic rendering techniques.

We'll give a technical overview of the NVIDIA IndeX architecture, which enables instant visualization of simulation and compute data, along with details on its interface design and use. We'll also demonstrate NVIDIA IndeX's capabilities with real-world solutions, including real-time weather prediction and a seismic wave-propagation algorithm.

Level: All
Type: Talk
Tags: In-Situ and Scientific Visualization; Large Scale and Multi-Display Visualization

Day: Monday, 04/04
Time: 10:00 - 10:50
Location: Room 212A

S6738 - Computational Displays for Virtual and Augmented Reality

David Luebke Senior Director of Research, NVIDIA
Highly-Rated Speaker
David Luebke helped found NVIDIA Research in 2006 after eight years teaching computer science on the faculty of the University of Virginia. David is currently Senior Director of Research at NVIDIA, where he leads the computer graphics research team. His personal research interests include virtual and augmented reality, display technology, ray tracing, and graphics architecture. His honors include the NVIDIA Distinguished Inventor award, the NSF CAREER and DOE Early Career PI awards, and the ACM Symposium on Interactive 3D Graphics "Test of Time Award". David has co-authored a book, a SIGGRAPH Electronic Theater piece, a major museum exhibit visited by over 110,000 people, and dozens of papers, articles, chapters, and patents on computer graphics and GPU computing.

We'll describe work by NVIDIA Research and our partners on challenges common to all wearable VR and AR displays: (1) FOCUS: how do we put a display as close to the eye as a pair of eyeglasses, where we cannot bring it into focus? (2) FIELD OF VIEW: how do we fill the user's entire vision with displayed content? (3) RESOLUTION: how do we fill that wide field of view with enough pixels? A "brute force" display would require 10,000x8,000 pixels per eye! (4) BULK: displays should be vanishingly unobtrusive, as light and forgettable as a pair of sunglasses, but the laws of optics dictate that most VR displays today are bulky boxes bigger than ski goggles. We'll discuss several "computational display" prototypes that sidestep these challenges by co-designing the optics, display, and rendering algorithm.

Level: All
Type: Talk
Tags: Virtual Reality & Augmented Reality

Day: Monday, 04/04
Time: 13:00 - 13:50
Location: Room LL20C

S6767 - VisionWorks™, a CUDA-Accelerated Computer Vision Library

Elif Albuz Computer Vision Software Lead, NVIDIA
Elif Albuz is the technical lead for the VisionWorks toolkit at NVIDIA, driving features and optimizations with CUDA acceleration on Tegra GPUs. Before joining the Computer Vision group, she led the CUDA FFT library; designed new algorithms for motion estimation, super-resolution, and frame-rate up-conversion and accelerated them on NVIDIA GPUs; designed architecture for error concealment and adaptive quantization for video codec hardware; and implemented low-level code for H.264 and MPEG-2 codecs. Prior to joining NVIDIA, she worked at Sony Electronics, where she led the DVD decoder firmware stack used in DVD players and PlayStation 2, implemented a real-time OS for multi-processor systems, and accelerated H.264 using SIMD in the Multimedia Research Labs. Elif holds a dual degree in electrical engineering and computer science, with a focus on artificial intelligence and robotics, and a master's degree in electrical engineering, where she researched content-based image retrieval and parallel architectures and algorithms.

In this talk, we will introduce the NVIDIA VisionWorks™ toolkit, a software development package for computer vision (CV) and image processing. VisionWorks implements and extends the Khronos OpenVX standard, and it is optimized for CUDA-capable GPUs and SoCs, enabling computer vision applications on a scalable and flexible platform. VisionWorks implements a thread-safe API and framework for seamlessly adding user-defined primitives. The talk will give an overview of the VisionWorks toolkit, the OpenVX API and framework, VisionWorks-plus modules (including the VisionWorks Structure From Motion and Object Tracker modules), and computer vision pipeline samples showing integration of the library API into a computer vision pipeline on Tegra platforms.
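
To give a feel for the programming model, here is a minimal sketch using the standard OpenVX C API, which VisionWorks implements and extends; the image sizes and the Gaussian-then-Sobel pipeline are illustrative assumptions, not taken from the talk.

#include <VX/vx.h>

int main() {
    vx_context ctx = vxCreateContext();
    vx_image input = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);
    vx_image gradX = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_S16);
    vx_image gradY = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_S16);

    vx_graph graph = vxCreateGraph(ctx);
    // The intermediate image is "virtual": the runtime may keep it on the GPU.
    vx_image blurred = vxCreateVirtualImage(graph, 0, 0, VX_DF_IMAGE_VIRT);
    vxGaussian3x3Node(graph, input, blurred);      // node 1: denoise
    vxSobel3x3Node(graph, blurred, gradX, gradY);  // node 2: gradients

    if (vxVerifyGraph(graph) == VX_SUCCESS)  // validate once...
        vxProcessGraph(graph);               // ...then execute the whole graph
    vxReleaseContext(&ctx);                  // releases all child objects
    return 0;
}

Building the pipeline as a graph, rather than as individual function calls, is what lets an OpenVX implementation fuse nodes and keep intermediates in device memory.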

Level: All
Type: Talk
Tags: Computer Vision & Machine Vision; Embedded; Self-Driving Cars & Automotive

Day: Monday, 04/04
Time: 13:00 - 13:50
Location: Room LL20A

S6131 - Nvpro-Pipeline: Handling Massive Transform Updates in a SceneGraph

Markus Tavenrath Senior Developer Technology Engineer, NVIDIA
Markus finished his studies in computer science with a focus on computer graphics in 2008. He was one of the first to use ray tracing on CUDA, for his diploma thesis, which brought him straight to NVIDIA. There he primarily worked on GPU ray tracing for SceniX, NVIDIA's scene graph technology, which was showcased at SIGGRAPH 2008. Afterwards, he applied his experience to implement parts of OptiX, improve SceniX, and develop several ray tracing demos. In close cooperation with external partners, he improved rendering performance and scene graph usability as a developer technology engineer. Now he is using this knowledge to experiment with future rendering technologies that bring high interactivity to complex scenes. This work includes both CPU and GPU strategies for typical scene graph operations related to rendering.

This session will walk through the new transform hierarchy module in nvpro-pipeline, which computes the world matrices for each node in a transform hierarchy in a massively parallel fashion instead of the traditional serial traversal. Running the algorithm on a high-end GPU like a Quadro M6000 yields a speedup of over 20x compared with computing the hierarchy on the CPU. In addition, the data that has to be transferred between the CPU and GPU is minimized, which gives another performance boost.
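
As a rough illustration of the parallel approach, here is a minimal sketch of a level-by-level update; the data layout (nodes grouped by depth, a levelStart/levelCount table on the host) and the Mat4 type are assumptions for the example, not nvpro-pipeline's actual structures.

struct Mat4 { float m[16]; };

__device__ Mat4 mul(const Mat4& a, const Mat4& b) {
    Mat4 r;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float s = 0.0f;
            for (int k = 0; k < 4; ++k) s += a.m[i*4+k] * b.m[k*4+j];
            r.m[i*4+j] = s;
        }
    return r;
}

// One thread per node of the current depth level. All parents already hold
// valid world matrices, so every node within a level is independent.
__global__ void updateLevel(const int* parent, const Mat4* local,
                            Mat4* world, int first, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;
    int node = first + i;
    int p = parent[node];
    world[node] = (p < 0) ? local[node] : mul(world[p], local[node]);
}

// Host side: one launch per level, e.g.
//   for (int L = 0; L < numLevels; ++L)
//       updateLevel<<<blocks(levelCount[L]), 256>>>(parent, local, world,
//                                                   levelStart[L], levelCount[L]);

The serial dependency chain shrinks from "every node after its parent" to "one kernel launch per hierarchy depth," which is where the parallel speedup comes from.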

Level: Intermediate
Type: Talk
Tags: Real-Time Graphics

Day: Monday, 04/04
Time: 13:30 - 13:55
Location: Room 211B

S6132 - Nvpro-Pipeline: Using LLVM to Accelerate Your OpenGL Rendering Core

Markus Tavenrath Senior Developer Technology Engineer, NVIDIA
Markus finished his studies in computer science with a focus on computer graphics in 2008. He was one of the first to use raytracing on CUDA, for his diploma thesis, which brought him straight to NVIDIA. There he primarily worked on GPU raytracing for SceniX, NVIDIA's scenegraph technology, which was showcased at SIGGRAPH 2008. Afterwards, he applied his experience to implement parts of OptiX, improve SceniX, and develop several raytracing demos. In close cooperation with external partners, he improved rendering performance and scenegraph usability as a developer technology engineer. Now he is using this knowledge to experiment with future rendering technologies that bring high interactivity to complex scenes. This work includes both CPU and GPU strategies for typical scenegraph operations related to rendering.

This session will demonstrate how LLVM has been used to accelerate the inner render loops of the OpenGL RiX backend. It'll show how to unroll the render loop using a JIT compiler, why doing so is faster, and how to reduce the cost of calling into OpenGL from the JITted code.
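
To make the payoff concrete, here is a hypothetical before/after sketch of what render-loop specialization effectively produces; the DrawCall layout and the concrete GL handles are made up, and a real implementation would emit the specialized sequence through LLVM rather than write it by hand.

#include <GL/glew.h>  // assumes an OpenGL loader provides the entry points

struct DrawCall { GLuint program; GLuint vao; GLsizei indexCount; };

// Generic interpreted loop: re-reads state and branches every iteration.
void renderGeneric(const DrawCall* calls, size_t n, GLuint boundProgram) {
    for (size_t i = 0; i < n; ++i) {
        if (calls[i].program != boundProgram) {
            glUseProgram(calls[i].program);
            boundProgram = calls[i].program;
        }
        glBindVertexArray(calls[i].vao);
        glDrawElements(GL_TRIANGLES, calls[i].indexCount, GL_UNSIGNED_INT, 0);
    }
}

// What the JIT can effectively emit for one concrete scene: loads, branches,
// and redundant state changes resolved at compile time into straight-line code.
void renderSpecializedScene() {
    glUseProgram(5);
    glBindVertexArray(12); glDrawElements(GL_TRIANGLES, 3600, GL_UNSIGNED_INT, 0);
    glBindVertexArray(13); glDrawElements(GL_TRIANGLES, 900, GL_UNSIGNED_INT, 0);
}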

Level: Intermediate
Type: Talk
Tags: Real-Time Graphics

Day: Monday, 04/04
Time: 14:00 - 14:25
Location: Room 211B

S6504 - A Data-Driven Methodology for NVIDIA GRID™ vGPU™ Sizing

Jeremy Main Senior Solution Architect, NVIDIA
Jeremy is the Senior Solution Architect for NVIDIA's GRID enterprise graphics virtualization in Japan. He works to architect solutions for organizations to deliver high-fidelity GPU-accelerated desktops and applications. Before joining NVIDIA, Jeremy led the development of several remote graphics products as well as 3D CAD software development. Jeremy received his Bachelor of Science from the University of Utah.
Milan Diebel Senior Product Manager, NVIDIA
Milan is the Senior Product Manager for the NVIDIA GRID product family. He has been working in the technology sector for 15 years in a variety of roles. Milan holds a PhD in Physics from the University of Washington as well as an MBA from Cornell University.

GRID vGPU sizing is often viewed as more of an art form than a science. One of the challenges is that synthetic performance benchmarks are not a good representation of actual user workloads in virtualized environments. Informed by many customer interactions, we'll introduce a systematic way of producing sizing information that utilizes both real user workloads and synthetic performance benchmarks.

Level: Intermediate
Type: Talk
Tags: Graphics Virtualization

Day: Monday, 04/04
Time: 14:30 - 15:20
Location: Room 210G

S6117 - Parallelization and Performance of the NIM Weather Model on CPU, GPU and MIC Architectures

Mark Govett Chief, Advanced Computing Section, NOAA Earth System Research Laboratory
Highly-Rated Speaker
Mark manages the High Performance Computing Section, a software group that both supports model development, parallelization, and porting to high performance computers, and explores advanced computing technologies for the National Oceanic and Atmospheric Administration (NOAA). Mark has worked in high performance computing, code parallelization and compiler development for over 20 years. He has developed two Fortran compilers, the Scalable Modeling System (SMS) for MPI based parallelization, and the F2C-ACC GPU compiler. He also parallelized two weather models using the F2C-ACC compiler and has been collaborating with Cray and PGI to improve the capabilities and performance of their commercial GPU compilers.

In an era defined by increasing diversity in computing architectures, performance portability is a key requirement for weather and climate applications that require massive computing resources. In this talk, you'll learn how we developed the NIM weather model and achieved performance on CPU, GPU, and MIC architectures using industry-standard OpenACC and OpenMP directives. Performance results will be shown for a number of device, node, and multi-node system configurations. Further, communications optimizations will be highlighted that yield a more than 40% improvement in runtime when scaling to thousands of GPUs.

Level: Intermediate
Type: Talk
Tags: Earth System Modelling; Performance Optimization; Supercomputing & HPC; Programming Languages; OpenACC

Day: Tuesday, 04/05
Time: 13:00 - 13:25
Location: Room 211A

S6148 - Enabling Smart Cities with GPU-Accelerated Infrastructure

Pradeep Gupta Senior Solutions Architect - APJ, NVIDIA
Pradeep Gupta is a lead deep learning solutions architect at NVIDIA, where he supports customers and developers across the Asia Pacific, Japan, and India regions in deep learning and HPC application development. He also works to enable the GPU computing ecosystem in universities and research labs across these regions, and is responsible for running and managing R&D projects at the NVIDIA Technology Centre in Singapore. He works on smart city enablement as part of NVIDIA's GPU computing initiative. Before joining NVIDIA, Pradeep worked with various technologies in high performance computing domains. He received an M.S. by research from the Indian Institute of Science (IISc), Bangalore, where his research focused on developing compute-efficient algorithms. He has numerous publications in IEEE, SPIE, and other reputed conferences.

Smart cities are getting a lot of attention, and both academia and industry are focusing on and investing in next-generation technologies to make them a reality. We'll present a case study on how GPU-based IT infrastructure can enable the different components and use cases of a smart city platform. Smart city IT infrastructure will need massive computational power and visualization of extremely rich visual content within a given energy budget; GPU-accelerated data centers can provide a unified IT infrastructure and software platform to achieve that. This case study takes Singapore's smart nation initiative as a reference and will also present different initiatives and projects using the GPU platform.

Level: All
Type: Talk
Tags: Intelligent Video Analytics (IVA); Self-Driving Cars & Automotive ; Big Data Analytics; Deep Learning & Artificial Intelligence; IoT

Day: Tuesday, 04/05
Time: 13:00 - 13:50
Location: Room LL20D

S6164 - Accelerating Gene Set Enrichment Analysis on CUDA-Enabled GPUs

Bertil Schmidt Professor, JGU Mainz
Bertil Schmidt is a tenured full professor and chair for Parallel and Distributed Architectures at the University of Mainz, Germany. Prior to that he was a faculty member at Nanyang Technological University (Singapore) and at University of New South Wales. Bertil's research group has designed a variety of algorithms and tools for computational science and bioinformatics, mainly focusing on the analysis of large-scale sequence and short read datasets. For his research work, he has received a CUDA Research Centre award, a CUDA Academic Partnership award, a CUDA Professor Partnership award, and Best Paper Awards at IEEE ASAP 2009 and IEEE ASAP 2015. Bertil serves as the champion for bioinformatics and computational biology on gpucomputing.net.

Learn how to efficiently parallelize gene set enrichment analysis (GSEA) using CUDA. GSEA is an important bioinformatics method that determines whether given sets of genes are statistically overrepresented between two phenotypes. The GSEA software from the Broad Institute is the most popular tool to perform such studies with several thousand users. NGS technologies are gradually replacing microarrays for high-throughput gene expression studies. Size and availability of input data sets are increasing, leading to high runtimes of the desktop GSEA application. We present an efficient CUDA parallelization of the core GSEA algorithm. By using a combination of parallelization techniques, we achieve speed-ups of around two orders of magnitude on a single GPU.
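
As a sketch of the kind of parallelization involved, the following hypothetical kernel computes the weighted Kolmogorov-Smirnov-style enrichment score with one thread per permutation; the data layout (perm[p*n + i] giving the gene at rank i under permutation p) is an assumption for the example, not the implementation presented in the talk.

// es[p] = enrichment score of the gene set under permutation p.
__global__ void enrichmentScores(const float* weight, const char* inSet,
                                 const int* perm, int n, int nPerm,
                                 float* es) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nPerm) return;
    const int* order = perm + (size_t)p * n;

    float sumW = 0.0f; int hits = 0;
    for (int i = 0; i < n; ++i) {          // pass 1: normalization constants
        int g = order[i];
        if (inSet[g]) { sumW += weight[g]; ++hits; }
    }
    float miss = 1.0f / (n - hits);
    float run = 0.0f, best = 0.0f;
    for (int i = 0; i < n; ++i) {          // pass 2: running-sum statistic
        int g = order[i];
        run += inSet[g] ? weight[g] / sumW : -miss;
        if (fabsf(run) > fabsf(best)) best = run;
    }
    es[p] = best;
}

Because GSEA estimates significance from thousands of permutations, and each permutation's score is independent, permutations map naturally onto GPU threads.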

Level: Intermediate
Type: Talk
Tags: Computational Biology

Day: Tuesday, 04/05
Time: 13:00 - 13:25
Location: Marriott Salon 5

S6227 - Distributed Deep Learning at Scale

Soumith Chintala Research Engineer, Facebook AI Research
Soumith Chintala is a research engineer at Facebook AI Research. Prior to joining Facebook in August 2014, Soumith worked at MuseAmi, where he built deep learning models for music and vision targeted at mobile devices. In the past, Soumith worked on state-of-the-art deep learning models for pedestrian detection, natural image OCR, and depth images, among others, driving his research heavily with CUDA and multiple GPUs.

This talk provides a brief overview of deep learning research and the challenges involved in scaling it up across multi-GPU and multi-machine clusters while keeping software flexible enough for research settings. We discuss the clear trends that are emerging in deep learning from an HPC perspective and present several examples from our work at Facebook AI Research.

Level: All
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Computer Vision & Machine Vision

Day: Tuesday, 04/05
Time: 13:00 - 13:50
Location: Room 210D

S6253 - VMD: Petascale Molecular Visualization and Analysis with Remote Video Streaming

John Stone Senior Research Programmer, University of Illinois at Urbana-Champaign
Highly-Rated Speaker
John Stone is a senior research programmer in the Theoretical and Computational Biophysics Group at the Beckman Institute for Advanced Science and Technology, and associate director of the NVIDIA CUDA Center of Excellence at the University of Illinois. John is the lead developer of VMD, a high-performance molecular visualization tool used by researchers all over the world. His research interests include molecular visualization, GPU computing, parallel processing, ray tracing, haptics, and virtual environments. John was named an NVIDIA CUDA Fellow in 2010. In 2015, he joined the Khronos Group Advisory Panel for the Vulkan graphics API. John also provides consulting services for projects involving computer graphics, GPU computing, and high performance computing.

We'll showcase recent successes in the use of GPUs to accelerate challenging molecular visualization and analysis tasks on hardware platforms ranging from commodity desktop computers to the latest GPU-accelerated petascale supercomputers from Cray and IBM. We'll highlight the use of in-situ ray tracing and rasterization combined with GPU-accelerated video streaming for high-interactivity remote visualization, and of CUDA just-in-time compilation to increase the performance of data-driven visualization and analysis algorithms, and we'll describe new GPU-accelerated MD trajectory clustering algorithms.
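
For readers unfamiliar with CUDA just-in-time compilation, here is a minimal sketch using NVRTC and the CUDA driver API; the tiny kernel source is a made-up stand-in for the data-driven analysis kernels the talk describes, and all error checking is omitted for brevity.

#include <nvrtc.h>
#include <cuda.h>
#include <vector>

const char* src =
    "extern \"C\" __global__ void scale(float* d, int n, float f) {\n"
    "  int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
    "  if (i < n) d[i] *= f;\n"
    "}\n";

int main() {
    // Compile CUDA C++ source to PTX at run time.
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "scale.cu", 0, NULL, NULL);
    nvrtcCompileProgram(prog, 0, NULL);
    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::vector<char> ptx(ptxSize);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    // Load the PTX and launch the freshly built kernel via the driver API.
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
    CUmodule mod;   cuModuleLoadData(&mod, ptx.data());
    CUfunction fn;  cuModuleGetFunction(&fn, mod, "scale");

    int n = 1 << 20; float f = 2.0f;
    CUdeviceptr d;  cuMemAlloc(&d, n * sizeof(float));
    void* args[] = { &d, &n, &f };
    cuLaunchKernel(fn, (n + 255) / 256, 1, 1, 256, 1, 1, 0, 0, args, 0);
    cuCtxSynchronize();
    cuMemFree(d); cuCtxDestroy(ctx);
    return 0;
}

Generating the kernel at run time lets an application bake user-selected parameters and data-dependent logic directly into the compiled code, which is the performance benefit the talk refers to.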

Level: Intermediate
Type: Talk
Tags: In-Situ and Scientific Visualization; Computational Chemistry; Rendering & Ray Tracing

Day: Tuesday, 04/05
Time: 13:00 - 13:50
Location: Room LL21D

S6275 - Restore, Customize and Revamp an Iconic Motorbike with NVIDIA Iray® and Substance Designer

Matt Gueller Sr. Surface Designer, Harley-Davidson
With a B.S. in industrial design from the Milwaukee Institute of Art and Design, Matt Gueller has been working for over 18 years in the design field, as a Class A surfacer, digital model maker, and visualization expert. Much of his career has focused on the use of digital tools in the design process and how non-traditional techniques can impact the manufacturing workflow.
Jérôme Derel Chief Product Officer, Allegorithmic
Engineer and product designer Jerome Derel joined Allegorithmic in 2014 as chief product officer. Jerome previously worked for seven years at Dassault Systemes as a visualization expert in the Design Studio and CATIA Design teams, leading projects that produce high-quality virtual materials.

Leveraging Substance Designer, Iray, and Rhino 3D, a Harley-Davidson Knucklehead, the iconic chopper of the late 1960s, can be restored from rust and dust to its original glory or even turned into a custom race bike, taking design iteration and study to the next level.

Level: All
Type: Talk
Tags: Product & Building Design; Rendering & Ray Tracing

Day: Tuesday, 04/05
Time: 13:00 - 13:50
Location: Room LL21A

S6391 - Bootstrapping Labels for One Hundred Million Images

Jimmy Whitaker Software Engineer, Digital Reasoning
Jimmy Whitaker is a software engineer at Digital Reasoning, a cognitive computing company focused on enabling humans to leverage big data to make decisions, where he has been pioneering computer vision efforts. Prior to joining Digital Reasoning, Jimmy completed his M.S. in computer science at the University of Oxford, where he achieved a distinction for his research in the field of steganalysis -- detecting hidden information in images.

We'll describe how we created an iterative labeling process to perform data science on 100 million+ images using a GPU-powered workflow with convolutional neural networks. Recently, deep learning techniques such as deep convolutional neural networks (ConvNets) have achieved state-of-the-art results in many computer vision tasks. The data-driven nature of deep learning normally requires a large number of labeled examples to achieve high accuracies. Unfortunately, much of the publicly available data on the web is not labeled, thus requiring human labelers for large datasets or unsupervised machine learning techniques. Our labeling process allows weak labels and a small number of strong labels to be used to create classifiers for very large datasets.

Level: Beginner
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Big Data Analytics; Computer Vision & Machine Vision

Day: Tuesday, 04/05
Time: 13:00 - 13:25
Location: Room 210G

S6422 - Enhancing Visual Realism of Mixed Reality Applications with Stereo Vision

Edwin Azzam CTO, Stereolabs
Edwin Azzam co-founded STEREOLABS in 2010. As STEREOLABS's chief technical officer, Edwin is responsible for leading the company's product development and technology strategy in stereo vision. Prior to founding STEREOLABS, Edwin was a project manager at Astrium Space Transportation, Paris. Edwin holds a Master's degree in optics and image processing from Institut d'Optique, France, as well as a Master's degree in management from ESSEC Business School. He is a Ph.D. supervisor and a national technical expert for the ANR (National Research Agency), where he uses his technical and market expertise to assess national research projects in the field of computer vision and 3D image processing.

Discover how stereo vision and 3D depth sensing on GPU enable the development of mixed reality applications, which merge virtual information into a live 3D video stream of the real world. We will discuss the various stages of a real-time mixed reality processing pipeline, and how NVIDIA's GPU acceleration is integral to every step of the pipeline. We will also show demonstrations of how stereo depth sensing can be used to create 3D virtual playgrounds and real-time augmentation of the environment.
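
The core of merging virtual content into a live 3D video stream is a per-pixel occlusion test between real and virtual depth. Here is a minimal sketch of that compositing step in CUDA; it assumes registered color and depth buffers for both the camera and the rendered scene are already in device memory.

__global__ void compositeMR(const uchar4* camRGB, const float* camDepth,
                            const uchar4* virtRGB, const float* virtDepth,
                            uchar4* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    int i = y * w + x;
    // The nearer surface wins, so virtual objects can pass behind real ones.
    bool virtVisible = isfinite(virtDepth[i]) && virtDepth[i] < camDepth[i];
    out[i] = virtVisible ? virtRGB[i] : camRGB[i];
}

A production pipeline adds alpha blending at boundaries and filtering of depth noise, but the depth comparison above is what makes the augmentation respect real-world occlusions.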

Level: All
Type: Talk
Tags: Computer Vision & Machine Vision; Virtual Reality & Augmented Reality; Video & Image Processing; Embedded

Day: Tuesday, 04/05
Time: 13:00 - 13:25
Location: Room 210F

S6423 - Accelerating Approximate Weighted Matching on GPUs

Antonino Tumeo Research Scientist, Pacific Northwest National Laboratory
Highly-Rated Speaker
Dr. Antonino Tumeo has been a research scientist in PNNL's High Performance Computing group since February 2011. Antonino received an M.S. degree in informatics engineering in 2005, and a Ph.D. in computer engineering in 2009, from Politecnico di Milano in Italy. He joined PNNL in 2009 as a post-doctoral research associate; previously, he was a post-doctoral researcher at Politecnico di Milano. His research interests are modeling and simulation of high-performance architectures, hardware-software codesign, FPGA prototyping, and GPGPU computing.

Matching is a fundamental graph problem with numerous applications in science and engineering. This talk discusses the efficient implementation of half-approximate weighted matching on GPUs. We start by describing the Suitor algorithm, currently considered the best algorithm for this problem, and identifying its key implementation challenges. In its basic formulation, the Suitor algorithm appears poorly suited to GPUs, due to its irregular memory accesses and use of locks. We proceed by introducing four variants of the algorithm that progressively address these challenges by exploiting Kepler's hardware features. We demonstrate that the final implementation outperforms the previous best matching algorithms for GPUs, and the Suitor algorithm on CPUs, by several times.
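
For intuition, here is a minimal sketch of the simpler pointer-based half-approximate matching round that the Suitor algorithm refines: each vertex points at its heaviest available neighbor, and mutual pointers become matches. This classic simplified variant is shown for illustration only, not the lock-based Suitor formulation or the talk's optimized versions; a CSR graph layout is assumed.

__global__ void pointToHeaviest(const int* rowPtr, const int* colIdx,
                                const float* w, const int* matched,
                                int* candidate, int n) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n || matched[v]) return;
    int best = -1; float bestW = -1.0f;
    for (int e = rowPtr[v]; e < rowPtr[v+1]; ++e) {
        int u = colIdx[e];
        // Tie-break on vertex id so equal weights still produce mutual pairs.
        if (!matched[u] && (w[e] > bestW || (w[e] == bestW && u > best))) {
            bestW = w[e]; best = u;
        }
    }
    candidate[v] = best;
}

__global__ void matchMutual(const int* candidate, int* matched,
                            int* mate, int n) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n) return;
    int u = candidate[v];
    if (u >= 0 && candidate[u] == v && v < u) {   // each pair recorded once
        mate[v] = u; mate[u] = v;
        matched[v] = 1; matched[u] = 1;
    }
}

The host alternates these two kernels until a round produces no new matches. The structure makes the irregular memory accesses mentioned in the abstract easy to see: every vertex scans an arbitrary adjacency range, which is exactly what the talk's variants optimize.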

Level: Intermediate
Type: Talk
Tags: Algorithms; Big Data Analytics; Aerospace & Defense

Day: Tuesday, 04/05
Time: 13:00 - 13:25
Location: Room 210E

S6452 - Run-Time Scene-Graph Construction from Geographic Source Data

Tim Woodard Chief Technology Officer, Diamond Visionics
Highly-Rated Speaker
Tim Woodard is the chief technology officer at Diamond Visionics, with over 18 years of experience specializing in the design and development of software architectures for real-time, PC-based image generation using Agile development processes, advanced C++, and modern OpenGL techniques. Tim has received patents for the real-time simulator database generation technology that forms the basis of Diamond Visionics' GenesisRTX worldwide database generation system. GenesisRTX provides high-fidelity generation, visualization, and manipulation of visual databases at run-time directly from source data on low-cost PC-based platforms, eliminating the need for traditionally labor-intensive off-line database production processes. He has served as the director of engineering, director of research and development, and principal investigator for a number of Phase I, II, and III U.S. Government Small Business Innovative Research Grants. Tim has also published and presented papers at I/ITSEC, IMAGE, NVIDIA's GPU Technology Conference, ASQ, and ITEC.

In modern computing hardware, the gaps in performance between GPUs, CPUs, RAM, and storage continue to widen. When visualizing large and dense geographic datasets (e.g., imagery, elevation, vectors, features), balancing the workload effectively between these resources, and accounting for the bottlenecks between them, is crucial; conventional wisdom about optimal performance from just 10 years ago may no longer apply. In this talk, we demonstrate that by exploiting parallelism on the CPU and especially the GPU, much greater throughput can be achieved. Furthermore, by utilizing modern OpenGL techniques (e.g., NV_command_list), an order-of-magnitude increase in performance can be achieved compared with previously available rendering methods.

Level: Intermediate
Type: Talk
Tags: Real-Time Graphics; Aerospace & Defense; Performance Optimization

Day: Tuesday, 04/05
Time: 13:00 - 13:25
Location: Marriott Salon 2

S6514 - CUDA Optimization Tips, Tricks and Techniques

Stephen Jones Senior Software Engineer, SpaceX
Stephen Jones leads the Simulation and Analytics group at SpaceX, where he works on various projects including large-scale simulation of combustion processes in rocket engines. He previously worked at NVIDIA, where he was the architect for the CUDA language and worked closely with NVIDIA's hardware designers to develop new GPU features in support of parallel programming. His background is in computational fluid mechanics and plasma physics, but he has worked in diverse industries, including networking, CAD/CAM, and scientific computing.

Optimizing your code can be one of the most challenging tasks in GPU programming, but also one of the most rewarding: the performance difference between an initial version and well-tuned code can be a factor of ten or more. Some optimizations can be quite straightforward, while others require care and a deep understanding of how the code is executing. A particular focus will be on optimizing the CPU part of your code, which is frequently overlooked even though it is often easier to tune and just as effective. Sometimes the biggest obstacle is simply knowing what to look for, so this talk will cover a range of techniques that everyone from beginners to CUDA ninjas may not have thought of before.
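
One classic example of a CPU-side optimization in this spirit is using pinned host memory and streams so transfers genuinely overlap with kernel execution. A minimal sketch, with chunk sizes and the toy kernel as illustrative assumptions:

#include <cuda_runtime.h>

__global__ void process(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;   // stand-in for real work
}

int main() {
    const int N = 1 << 24, CHUNK = 1 << 20, NSTREAMS = 4;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));  // pinned => truly async copies
    cudaMalloc(&d, N * sizeof(float));
    cudaStream_t s[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i) cudaStreamCreate(&s[i]);

    // Pipeline: copy-in, compute, and copy-out of different chunks overlap.
    for (int off = 0, i = 0; off < N; off += CHUNK, ++i) {
        cudaStream_t st = s[i % NSTREAMS];
        cudaMemcpyAsync(d + off, h + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        process<<<CHUNK / 256, 256, 0, st>>>(d + off, CHUNK);
        cudaMemcpyAsync(h + off, d + off, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < NSTREAMS; ++i) cudaStreamDestroy(s[i]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}

With pageable (malloc'd) host memory the copies would serialize; the pinned allocation is the CPU-side change that unlocks the overlap.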

Level: Intermediate
Type: Talk
Tags: Performance Optimization; Algorithms; Tools & Libraries

Day: Tuesday, 04/05
Time: 13:00 - 13:50
Location: Room 212A

S6616 - NCCL: Accelerated Collective Communications for GPUs

Nathan Luehr Senior Devtech Engineer, NVIDIA
Nathan Luehr is a senior developer technology engineer for compute applications at NVIDIA. He earned a Ph.D. in theoretical chemistry from Stanford University in June 2015.

We present NCCL, a library of multi-GPU communication collectives (e.g., broadcast, all-reduce, all-gather). NCCL enables applications to harness the computational throughput of multiple GPUs with minimal developer effort by providing optimized, topology-aware, asynchronous collectives with a familiar API.
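
As a taste of the API, here is a minimal single-process, multi-GPU all-reduce sketch; it follows the NCCL 1.x-era interface as we understand it, with all error checking omitted.

#include <nccl.h>
#include <cuda_runtime.h>

int main() {
    const int nGPUs = 4;
    const int N = 1 << 20;
    int devs[nGPUs] = {0, 1, 2, 3};
    ncclComm_t comms[nGPUs];
    ncclCommInitAll(comms, nGPUs, devs);   // one communicator per GPU

    float* buf[nGPUs];
    cudaStream_t st[nGPUs];
    for (int i = 0; i < nGPUs; ++i) {
        cudaSetDevice(devs[i]);
        cudaMalloc(&buf[i], N * sizeof(float));
        cudaStreamCreate(&st[i]);
    }

    // Element-wise sum across all GPUs, asynchronous on per-device streams.
    for (int i = 0; i < nGPUs; ++i) {
        cudaSetDevice(devs[i]);
        ncclAllReduce(buf[i], buf[i], N, ncclFloat, ncclSum, comms[i], st[i]);
    }
    for (int i = 0; i < nGPUs; ++i) {
        cudaSetDevice(devs[i]);
        cudaStreamSynchronize(st[i]);
    }
    for (int i = 0; i < nGPUs; ++i) ncclCommDestroy(comms[i]);
    return 0;
}

The application never writes peer-to-peer copy logic itself; the library chooses a topology-aware schedule under the hood, which is what the abstract means by minimal developer effort.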

Level: All
Type: Talk
Tags: Tools & Libraries; Supercomputing & HPC

Day: Tuesday, 04/05
Time: 13:00 - 13:25
Location: Room 211B

S6650 - Optimizing In-Field Processing Using GPUs

Tarik Saidani Senior Software Engineer, PGS
Tarik is a senior software engineer at PGS, specializing in parallel programming and software optimization. He has worked in the oil and gas industry for the last five years, helping research geophysicists commercialize their applications using parallel programming and software optimization techniques. He holds a Ph.D. in parallel computing from Paris-Sud University.

Learn how GPU accelerators help marine seismic acquisition to efficiently perform one of the fundamental steps in the on-board processing flow. GPUs not only allow unprecedented data processing throughput, but also reduce hardware footprint, power consumption and heat dissipation of the in-field compute system.

Level: Intermediate
Type: Talk
Tags: Energy Exploration; Performance Optimization; Signal & Audio Processing

Day: Tuesday, 04/05
Time: 13:00 - 13:25
Location: Marriott Salon 1

S6689 - Creating CONSTRUCT: A GPU-Rendered Short Film

Kevin Margo Director / VFX Supervisor, Blur Studio
Highly-Rated Speaker
Kevin is director of the hit sci-fi short film "Grounded". He joined Blur Studio in 2003 as a scene assembly, lighting, and compositing artist and has since moved into the studio's VFX/CG supervisor role. Recent work includes the prologue for Thor 2: The Dark World and the David Fincher-produced "Halo 4: Scanned" cinematic trailer.

We will describe how Chaos V-Ray RT and NVIDIA GPUs were used throughout production on the groundbreaking short film CONSTRUCT, rendered entirely on GPUs. Go here (http://constructfilm.com/) to see more of the project and here (https://www.youtube.com/watch?v=nnaz8q6FLCk) to see how interactive GPU rendering was used on a motion capture stage during production.

Level: All
Type: Talk
Tags: Rendering & Ray Tracing; Media & Entertainment; Real-Time Graphics

Day: Tuesday, 04/05
Time: 13:00 - 13:50
Location: Room LL21B

S6157 - Effective Evaluation of Betweenness Centrality on Multi-GPU Systems

Massimo Bernaschi Director of Technology, National Research Council of Italy
Highly-Rated Speaker
Massimo Bernaschi is with CNR, the National Research Council of Italy, as chief technology officer of the Institute for Applied Computing. He is also an adjunct professor of systems programming at "Sapienza" University in Rome and a trainer in digital forensics at the "Sapienza" and Modena universities. Before joining CNR in 1998, Massimo worked ten years at the IBM European Center for Scientific and Engineering Computing, where he developed the IBM PVMe product and received two Outstanding Technical Achievement Awards. His main scientific interests are parallel computing, modeling of complex systems (finance and biology), systems and network security, and high performance computing. Massimo is the author of over 150 papers in peer-reviewed journals and international conferences.

Learn how to use (multiple) GPUs and CUDA to speed up the process of ranking the importance of each node in a large-scale network. You'll see how to solve an extraordinary challenge, the exact computation of betweenness centrality, by using as building blocks relatively simple algorithms, like breadth-first search (BFS), that have been highly tuned for the latest generation of GPU cards. Our approach is fully scalable and overcomes the limit on the size of graph that can be studied on a single GPU. We'll present results obtained on both synthetic and real-world graphs.
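
Exact betweenness centrality runs one BFS from every source vertex (plus a dependency-accumulation pass), so a fast BFS building block is the heart of the computation. Here is a minimal level-synchronous BFS kernel of the kind such implementations tune heavily, assuming a CSR graph layout:

__global__ void bfsLevel(const int* rowPtr, const int* colIdx,
                         int* dist, int level, int n, int* frontierNonEmpty) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n || dist[v] != level) return;   // v is on the current frontier
    for (int e = rowPtr[v]; e < rowPtr[v+1]; ++e) {
        int u = colIdx[e];
        if (dist[u] == -1) {        // benign race: all writers store level+1
            dist[u] = level + 1;
            *frontierNonEmpty = 1;
        }
    }
}

// Host side: dist[] = -1 everywhere except dist[src] = 0, then launch
// bfsLevel for level = 0, 1, 2, ... until frontierNonEmpty stays 0.

Production implementations replace this vertex-per-thread scan with frontier queues, warp- and block-level gathering, and multi-GPU partitioning, which is what the tuning described in the abstract is about.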

Level: Intermediate
Type: Talk
Tags: Algorithms; Performance Optimization

Day: Tuesday, 04/05
Time: 13:30 - 13:55
Location: Marriott Salon 3

S6322 - PARADIGM: Towards an Efficient Abstraction Framework for Parallel Systems

Bharatkumar Sharma Lead Research Engineer, Siemens
Bharatkumar has more than six years of experience in HPC. He currently works with Siemens Corporate Technology, India, as a lead research engineer, where he helps establish the company's foothold in parallel computing through patents and IPR liaison with internal and external clients, and delivers and supports high performance solutions for the company's business units. He previously worked at NVIDIA, India, strengthening the CUDA ecosystem and research in India as part of the NVIDIA India University Relations program. He has worked on various technologies, including GPGPU computing, CUDA, OpenCL, and data warehouse plugins. He received a master's degree in information technology from the International Institute of Information Technology (IIIT), Bangalore. His areas of interest include HPC, software architecture, design and architectural patterns, and big data platforms.

This talk will introduce you to the extensions made to the Thrust library to provide means and methodologies for developing structured, efficient, and portable parallel applications with minimal loss in performance compared with hand-optimized code. We'll introduce patterns and strategies like the load balancer, two-phase decomposition, and Init-Exec-Deinit that allow the developer to distribute the workload to multiple devices and to minimize the abstraction overhead in a heterogeneous computing environment, and we'll demonstrate the results using sample algorithms from various domains. Our experimental results show that these patterns improve productivity, and not only ensure ease of use but also provide performance on par with native implementations.
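
To ground the idea, here is a minimal sketch of two-phase decomposition written with plain Thrust; it is a hypothetical stand-in for the talk's extensions, splitting a reduction across whatever devices are present (and, for brevity, assuming the input size divides evenly and running the devices sequentially rather than overlapped).

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    std::vector<float> h(1 << 24, 1.0f);   // host data to be reduced
    int nDev = 0;
    cudaGetDeviceCount(&nDev);
    size_t chunk = h.size() / nDev;

    float total = 0.0f;
    for (int d = 0; d < nDev; ++d) {       // decomposition: one chunk per GPU
        cudaSetDevice(d);
        thrust::device_vector<float> part(h.begin() + d * chunk,
                                          h.begin() + (d + 1) * chunk);
        total += thrust::reduce(part.begin(), part.end(), 0.0f);  // execution
    }
    // 'total' now holds the combined result from all devices.
    return 0;
}

A framework layer like the one described in the talk hides exactly this boilerplate (device selection, chunking, recombination) behind a load balancer, so application code stays a single Thrust-style call.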

Level: All
Type: Talk
Tags: Tools & Libraries; Programming Languages

Day: Tuesday, 04/05
Time: 13:30 - 13:55
Location: Room 211B

S6362 - CNN Based Object Detection in Large Video Images

Tao Wang Chief Scientist, iQIYI ltd. Corp.
Dr. Tao Wang is chief scientist of iQIYI ltd. Corp., the biggest video sharing platform in China, where he works on computer vision and multimedia software applications. He received his Ph.D. in computer science from Tsinghua University in 2003. Tao then worked as a senior researcher in Intel Labs China. He has published more than 60 papers in IJCV, CVPR, CIVR, ICME, and ACM multimedia.

Object detection in real-world video is more challenging than on curated image datasets. We'll present CNN-based object detection research on iQIYI's large image and video collections, which is used for content-based ad recommendation.

Level: All
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Computer Vision & Machine Vision

Day: Tuesday, 04/05
Time: 13:30 - 13:55
Location: Room 210G

S6420 - Parallel Silence Coding Algorithms for Seismic Data Compression on GPUs

John Cheng Research Scientist, BGP
John is a research scientist with profound industry experience in high-performance computing. John has developed seismic imaging products with GPUs and many parallel applications on heterogeneous computing platforms. He is the author of several books, including Professional CUDA C Programming (Wiley, 2014). With deep experience in both academic research and industry development, John is gifted at making complex subjects accessible to readers through a concise and illustrative approach. He earned his doctoral degree in computational intelligence from the Tokyo Institute of Technology.

Join industry experts for a discussion of a novel parallel silence coding algorithm on GPUs. Silence coding, combined with Huffman coding to form a lossless scheme, is extensively used in seismic data compression. It is inherently a serial procedure and not easy to parallelize on GPUs. In this session, you'll learn how to convert the sequential computation into a parallel one through prefix-scan operations, a key primitive in many parallel algorithms, and how to quickly implement your kernels with NVIDIA CUB, a library of high-performance parallel primitives and reusable components for every layer of the CUDA programming model. Concepts and performance are illustrated through examples that adjust the alternative algorithmic strategies provided by CUB.
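
The scan-based pattern is worth seeing in miniature. The hedged sketch below compacts non-silent samples: flag them, exclusive-scan the flags with CUB to obtain output positions, then scatter. Real silence coding also records the lengths of the silent runs, which this simplified example omits.

#include <cub/cub.cuh>

__global__ void flagNonSilent(const short* in, int* flag, int n, short thresh) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flag[i] = (abs((int)in[i]) > thresh) ? 1 : 0;
}

__global__ void scatterNonSilent(const short* in, const int* flag,
                                 const int* pos, short* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flag[i]) out[pos[i]] = in[i];   // pos[i] = # kept before i
}

// Host-side scan with CUB (d_flag and d_pos are device int arrays of size n):
//   void* tmp = NULL; size_t bytes = 0;
//   cub::DeviceScan::ExclusiveSum(tmp, bytes, d_flag, d_pos, n); // size query
//   cudaMalloc(&tmp, bytes);
//   cub::DeviceScan::ExclusiveSum(tmp, bytes, d_flag, d_pos, n); // the scan

The exclusive scan is what turns the seemingly serial "where does my output go?" dependency into a data-parallel operation, which is the crux of parallelizing the coder.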

Level: Intermediate
Type: Talk
Tags: Energy Exploration; Performance Optimization

Day: Tuesday, 04/05
Time: 13:30 - 13:55
Location: Marriott Salon 1

S6463 - Continuous Learning for Speech Recognition

Wonkyum Lee Ph.D. Student, Carnegie Mellon University
Wonkyum Lee is a Ph.D. student in electrical and computer engineering at Carnegie Mellon University, advised by Professor Ian Lane. Before he joined CMU, he received M.S. and B.S. degrees in electrical engineering at KAIST and radio communications engineering at Korea University in 2007 and 2009, respectively. Wonkyum's fields of interest are speech recognition, deep learning, and their applications.

We propose continuous learning systems, in which systems are deployed and then learn online from real-world data. In this work, we show that our new learning methodology can effectively update model parameters using data that arrives sequentially, along with feedback from the speech recognition system.

Level: Intermediate
Type: Talk
Tags: Signal & Audio Processing; Deep Learning & Artificial Intelligence; Algorithms

Day: Tuesday, 04/05
Time: 13:30 - 13:55
Location: Marriott Salon 2

S6563 - Where Tegra Meets Titan: Asymmetric Computer Vision for Smartphones and Robotics

Tom Drummond Professor, Monash University
Tom Drummond has been a principal investigator on several EU Framework projects and is a chief investigator in the ARC Centre of Excellence for Robotic Vision. Tom studied mathematics for his B.A. at the University of Cambridge. In 1989, he emigrated to Australia and worked for CSIRO in Melbourne for four years before moving to Perth for his Ph.D. in computer science at Curtin University. In 1998, he returned to Cambridge as a postdoctoral research associate and in 1991 was appointed as a university lecturer and was subsequently promoted to senior university lecturer. In 2010, he returned to Melbourne and took up a professorship at Monash University.

This presentation will argue that battery life and thermal limits will prevent small mobile devices from implementing the next generation of visual processing algorithms without external assistance from high performance computing. Several innovative methods of distributing these problems between lightweight and high-powered nodes will be explored for a number of visual processing applications relevant to smartphones and robotics. We'll illustrate how these problems can be mapped onto the thread model of GPUs and present a couple of CUDA tricks used to maximize efficiency.

Level: All
Type: Talk
Tags: Computer Vision & Machine Vision; Robotics & Autonomous Machines; Virtual Reality & Augmented Reality

Day: Tuesday, 04/05
Time: 13:30 - 13:55
Location: Room 210F

S6577 - Fighting Infections and Antimicrobial Resistance Through GPU-Accelerated In Silico Models

Radu Marculescu Professor of Electrical & Computer Engineering, Carnegie Mellon University
Dr. Radu Marculescu is a professor in the Electrical and Computer Engineering Department at Carnegie Mellon University. He has received several Best Paper Awards in top conferences and journals covering design automation of integrated systems and embedded systems. Radu currently serves as the editor-in-chief of Foundations & Trends of Electronic Design Automation and as an associate editor of Elsevier Journal of Nano Communication Networks. Radu has been involved in organizing many symposia and conferences, and has been guest editor of special issues in archival journals and magazines. His current research focuses on cyber-physical systems, social, and biological systems. He is an IEEE Fellow.

Learn the core principles behind cell-cell communication and understand the use of in silico models and simulation algorithms needed to evaluate the dynamics of heterogeneous microbial populations. Explore the pathogens' inter-cellular network, its dynamics and contribution to biofilm formation. See how the newest GPU-based platforms can enable highly parallel simulations with performance gains of orders of magnitude over existing CPU-only solutions. Understand how the application of social and network sciences to understanding bacterial population dynamics can aid in developing new treatments and better drugs to control the many pathogenic bacteria that use social interactions to cause infections and antimicrobial resistance.

Level: Beginner
Type: Talk
Tags: Computational Biology; Supercomputing & HPC

Day: Tuesday, 04/05
Time: 13:30 - 13:55
Location: Marriott Salon 5

S6583 - WetBrush: GPU-Based 3D Painting Simulation at the Bristle Level

Zhili Chen 3D Graphics Researcher, Adobe Research
Zhili Chen is a 3D graphics researcher at Adobe. He got his Ph.D. in computer science at The Ohio State University in 2015. His research interests include physically based simulation, real-time graphics, 3D reconstruction, and virtual reality.

We built a real-time oil painting system that simulates the physical interactions among brush, paint, and canvas at the bristle level, entirely using CUDA. To simulate sub-pixel paint details with limited computational resources, we define the paint liquid in a hybrid fashion: liquid close to the brush is modeled by particles, and liquid away from the brush is modeled by a density field. Based on this representation, we develop a variety of techniques to ensure the performance and robustness of our simulator under large time steps, including brush and particle simulations in non-inertial frames, a fixed-point method for accelerating Jacobi iterations, and a new Eulerian-Lagrangian approach for simulating detailed liquid effects.

Level: Intermediate
Type: Talk
Tags: Real-Time Graphics; Computational Fluid Dynamics

Day: Tuesday, 04/05
Time: 13:30 - 13:55
Location: Room 210E

S6628 - Co-Designing GPU-Based Systems and Tools for Numerical Weather Predictions

Thomas Schulthess Director, Swiss National Supercomputing Centre
Thomas Schulthess is director of the Swiss National Supercomputing Centre (CSCS) and a professor for computational physics at ETH Zurich. He received his Ph.D. in physics in 1994. Since 2010, he has taken interest in refactoring climate codes to take advantage of novel, energy-efficient computing architectures.
Carlos Osuna Scientific Developer, Meteoswiss, Zurich
Carlos Osuna is a scientific software developer at MeteoSwiss, Zurich. Since 2011, he has been involved in research projects at ETH Zurich and MeteoSwiss, refactoring dynamical cores of weather codes using DSLs to port legacy codes to GPUs and provide performance portable applications. He received his Ph.D. in experimental high energy physics in 2003.

We'll discuss the hardware-software co-design project behind the most cost- and energy-efficient system for numerical weather prediction: an appliance based on the Cray CS-Storm system architecture that is loaded with NVIDIA K80 GPUs and has been operated on behalf of MeteoSwiss by CSCS since October 2015.

Level: Intermediate
Type: Talk
Tags: Earth System Modelling

Day: Tuesday, 04/05
Time: 13:30 - 13:55
Location: Room 211A

S6210 - NVIDIA GRID™ and Dassault Catia from Proof of Concept to Production

Fred Devoir Sr. Architect & Manager of IT Infrastructure, Textron Inc.
Fred Devoir is a senior systems architect and manager of IT infrastructure at Textron Inc. Fred has a wide variety of specialized and business systems experience with particular interests in integration and virtualization projects specifically centered around virtual desktop infrastructure (VDI), graphics acceleration, and high performance computing clusters. His past experience includes 22 years of IT professional work experience as an IT manager, senior systems analyst, engineer, and architect for Fortune 500 companies in the aerospace/defense, engineering, medical, and pharmaceutical industries.

Join us for a technical discussion of NVIDIA GRID-accelerated virtual desktop infrastructure to support Dassault Catia workloads and what it takes to take the infrastructure from proof of concept to production. Learn how to tune Catia for use on virtual desktops, including how to optimize the NVIDIA GRID graphics drivers on Windows to deliver Catia workloads. Learn about frame rate limiters and other performance optimization settings available in the infrastructure, and how persona management can assist with concurrent-user deployments, including what data must be saved with each user's personalization settings to retain their Catia settings.

Level: Intermediate
Type: Talk
Tags: Graphics Virtualization

Day: Tuesday, 04/05
Time: 14:00 - 14:50
Location: Marriott Salon 4

S6214 - GPU-Based Distributed Deep Learning Platform

Xiaoxin Zhang Scientist, JD.com Inc
Xiaoxin Zhang is a research scientist at JD.com Inc., an electronic commerce company headquartered in Beijing and one of the largest B2C online retailers in China by transaction volume. Xiaoxin is responsible for the development of the deep learning platform and the design and implementation of its algorithms. Before joining JD.com, he worked for Google as a software engineer.

We'll focus on the problem of training a deep neural network with hundreds of GPU-attached machines. We've developed a software framework that utilizes hundreds of physical machines, each attached with several GPU cards. The framework is responsible for the system management, such as fault tolerance, data split, model split, task schedule, and system visualization. Algorithm engineers only have to design algorithms logically.

Level: Beginner
Type: Talk
Tags: Deep Learning & Artificial Intelligence

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Room 210G

S6246 - Digital Actors at MPC: Bridging the Uncanny Valley with GPU Technology

Damien Fagnou Global Head of VFX Operations, MPC
Highly-Rated Speaker
Damien Fagnou is the global head of VFX Operations at MPC, where he brings together his expertise in software and production to evolve and refine the creation processes across all feature film VFX work. After finishing university with an M.S. in computer science in France, he worked for an animated series implementing the technology to speed up the motion capture pipeline and rendering. He later accepted a job to help set up the workflow at Attitude studios and then took on the role of Tools and Workflow Programmer at Climax in the U.K. In 2003, he transferred his skills to the film industry and started at leading VFX post-production studio MPC to work on Troy, implementing preview tools and city rendering scripts. In 2005, Damien became R&D lead on Charlie and the Chocolate Factory, 10,000 BC, and Narnia. He then moved closer to production and became MPC's stereographer working on movies, including Pirates of the Caribbean: On Stranger Tides, the Harry Potter films, and Prometheus. After a few years in production, he returned to his software roots and became global head of Software overseeing software development efforts across the company.

Discover the next generation of GPU-enabled facial rigs for digital actors at MPC. Through a mixed approach of linear deformers and non-linear analysis, MPC aims to improve the performance and appearance of its digital actors and improve upon the state of the art in the visual effects industry. You'll learn from industry experts how MPC is using the latest fabric engine technology to ease the transition to GPUs, enabling fast drawing of characters and fast parallel computation of deformers on CUDA.

Level: Intermediate
Type: Talk
Tags: Media & Entertainment; Performance Optimization; Algorithms

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Room LL21C

S6334 - Optimizing OpenGL Shaders for Order Independent Transparency with Modern GPU Architectures

Pyarelal Knowles Student, RMIT University
Pyarelal Knowles is a PhD student at RMIT University, Melbourne, with research interests in real-time computer graphics and physics simulations. He completed his Bachelor of IT (Games and Graphics Programming) in 2008, before a Comp. Sci. (Honours) year in 2009 at RMIT.

We'll discuss techniques for improving the sorting performance of many small lists, enabling much faster order-independent transparency, a problem with elements of both compute and graphics, two domains that are quickly converging. The focus is on the technical issues encountered, such as occupancy and memory-hierarchy performance, a comparison between GLSL shaders and CUDA, and some discussion of GPU evolution.
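
For context, the kernel at the heart of order-independent transparency sorts each pixel's fragment list by depth; insertion sort is a common choice because the lists are short. A minimal sketch in CUDA (the talk also examines GLSL versions; the fragment layout here is an assumption):

struct Fragment { float depth; unsigned int rgba; };

__global__ void sortFragmentLists(Fragment* frags, const int* start,
                                  const int* count, int numPixels) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= numPixels) return;
    Fragment* list = frags + start[p];
    int n = count[p];
    for (int i = 1; i < n; ++i) {        // insertion sort: near-optimal for
        Fragment key = list[i];          // the short lists typical of OIT
        int j = i - 1;
        while (j >= 0 && list[j].depth > key.depth) {
            list[j + 1] = list[j];
            --j;
        }
        list[j + 1] = key;
    }
}

Per-thread sorting like this is exactly where occupancy and memory-hierarchy behavior dominate: register pressure, divergence between lists of different lengths, and where each list lives (registers, shared, or global memory) decide the performance.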

Level: Intermediate
Type: Talk
Tags: Real-Time Graphics; Algorithms; Performance Optimization

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Room 210E

S6337 - GPU-Accelerated Graph Query for Cyber Applications

Jim Carbonaro Senior Software Engineer, SYSTAP, LLC
Jim Carbonaro is a subject matter expert on integrating and scaling Blazegraph solutions with real-time analytic processing frameworks, including Spark, Scala, Storm, Kafka, GraphX, and Redis. He is a lead developer of DASL and of DASL algorithms for large-scale graph analytics, and he led recent work comparing the performance of Apache Spark GraphX with Blazegraph-accelerated graph analytics.

Cyberspace is a critical domain for government and commercial organizations. It is about networks, devices, and how they interact; graphs model nodes and links and how they are connected. Defending the critical networks in cyberspace requires processing and analyzing extremely large quantities of graph data in near-real time, and key cyber analytics and datasets, ranging from topological vulnerability analysis to traffic flow analysis and network attack graphs, are graphs. This session will discuss how Blazegraph GPU meets this challenge by delivering near-real-time performance at very large data scales, using a flexible and updatable graph representation to support complex analytics, and supporting existing graph frameworks (RDF, Tinkerpop) and query languages (SPARQL).

Level: Intermediate
Type: Talk
Tags: Aerospace & Defense; Big Data Analytics; Deep Learning & Artificial Intelligence

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Marriott Salon 2

S6345 - Advances in V-Ray RT GPU

Vladimir Koylazov CTO, Chaos Software Ltd.
Highly-Rated Speaker
Vladimir Koylazov (Vlado) has more than 15 years of software development experience, the majority of which he spent developing and improving the render engine V-Ray. Passionate about 3D graphics and programming, Vlado is the driving force behind Chaos Group's software solutions. Vladimir is CTO of Chaos Software and one of the original creators of the V-Ray renderer.

This talk describes recent advances in the V-Ray RT GPU ray tracer for production rendering. With V-Ray 3.0, we'll be offering a host of new features, optimizations, and improvements that simplify artists' workflow while offering advanced capabilities and great speed improvements.

Level: Advanced
Type: Talk
Tags: Rendering & Ray Tracing; Product & Building Design

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Room LL21B

S6357 - Towards Efficient Communication Methods and Models for Scalable GPU-Centric Computing Systems

Holger Froening Associate Professor, Ruprecht-Karls University of Heidelberg
Holger Froening is an associate professor in the Department of Mathematics and Computer Science at the Ruprecht-Karls University of Heidelberg (Germany), and leads the Computer Engineering Group at the Institute of Computer Engineering. His research interests include parallel computing, computer architecture, interconnection networks, and hardware design, with a recent focus on application-specific heterogeneous computing, data movement optimizations, and associated power and energy aspects.

GPU computing is used pervasively for many reasons, including performance increases and improved energy efficiency, resulting in a strong need to optimize data movement between GPUs at the cluster level in high performance computing. Existing communication models and methods are designed for CPUs, however. We'll point out the limitations of employing traditional techniques with GPUs, and how to overcome them to support fully GPU-centric traffic sourcing and sinking. Our experiments show that such specialized communication models and methods provide substantial advantages in terms of energy and time: besides specialization in computing, we also need specialization in communication for utmost performance and energy efficiency.

Level: Intermediate
Type: Talk
Tags: Supercomputing & HPC; Performance Optimization

Day: Tuesday, 04/05
Time: 14:00 - 14:50
Location: Room 212A

S6384 - NVIDIA CUDA® for Mobile

Yogesh Kini Manager, CUDA System Software, NVIDIA
Yogesh Kini manages the Tegra CUDA driver team at NVIDIA. For the last four years, he has been working on enabling GPU compute software on different Tegra platforms. His team is responsible for the CUDA API and system software on various embedded, automotive, and mobile platforms based on Tegra SOCs. He holds a B.S. from Manipal Institute of Technology, India.

This session covers a few important mobile use cases that can be accelerated using CUDA, including image processing, camera output post-processing, and real-time texture compression in graphics applications. Attendees will learn that: [1] Tegra has a unified memory architecture that applications can utilize to reduce total memory usage and power consumption; the use case presented demonstrates effective use of UVM on Tegra. [2] CUDA provides means to take inputs from a camera via EGLImage and EGLStreams interoperability, which can be used to post-process camera images with CUDA; the example presented demonstrates use of these CUDA APIs. [3] CUDA provides an API for interoperability with OpenGL ES; texture compression in a graphics application is demonstrated as an example.
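
As a flavor of point [1], here is a minimal zero-copy sketch: on Tegra, host and GPU share the same physical memory, so mapped host allocations let a kernel read a camera-sized buffer without any cudaMemcpy. The frame size and toy kernel are illustrative assumptions.

#include <cuda_runtime.h>

__global__ void invert(const unsigned char* in, unsigned char* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 255 - in[i];   // stand-in for real post-processing
}

int main() {
    const int n = 1920 * 1080;         // one grayscale 1080p frame
    unsigned char *hIn, *hOut, *dIn, *dOut;
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void**)&hIn,  n, cudaHostAllocMapped);  // CPU side fills
    cudaHostAlloc((void**)&hOut, n, cudaHostAllocMapped);  // (e.g., a frame)
    cudaHostGetDevicePointer((void**)&dIn,  hIn,  0);  // same physical memory
    cudaHostGetDevicePointer((void**)&dOut, hOut, 0);  // on Tegra: no copies

    invert<<<(n + 255) / 256, 256>>>(dIn, dOut, n);
    cudaDeviceSynchronize();

    cudaFreeHost(hIn); cudaFreeHost(hOut);
    return 0;
}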

Level: Intermediate
Type: Talk
Tags: Computer Vision & Machine Vision; Video & Image Processing; Tools & Libraries

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Room 210F

S6401 - Towards Interactive Visual Exploration of Massively Parallel Programs Using a Domain-Specific Language

Tobias Klein Student, TU Vienna / KAUST
Tobias Klein is an M.S. student at TU Vienna working under the direction of Professor Eduard Groller. He is visiting the Visual Computing Center at KAUST for his M.S. thesis research work on the interactive visualization and analysis of massively parallel GPU programs in the context of the KAUST NVIDIA CUDA Research Center, in collaboration with Dr. Peter Rautek and Professor Markus Hadwiger.

Learn about the world of visual exploration of massively parallel programs. We describe our framework for interactive program visualization to help you understand program run-time behavior and find the causes of possible slowdowns and bugs. Our framework comprises a simple domain-specific language for annotating OpenCL kernel code, automatic just-in-time compilation of the necessary code instrumentation, and interactive visualization capabilities. Our problem-specific code annotation facilitates user-centered analysis. We describe a variety of interactive visualizations using the well-known D3 framework, providing insight into the program's structure, execution, and memory accesses. We describe several use cases that show the program visualization capabilities of our approach in action.

Level: Intermediate
Type: Talk
Tags: Tools & Libraries; Performance Optimization; Programming Languages

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Room 211B

S6414 - Structure-Preserving Smoothing for Seismic Amplitude Data by Anisotropic Diffusion Using GPGPU

Joner Duarte R&D Software Engineer, Tecgraf/PUC-Rio
Joner Duarte is a researcher in the computational geophysics group at Tecgraf/PUC-Rio. He received his M.S. in computer graphics from the Pontifical Catholic University of Rio de Janeiro in 2012. His current research focuses on high performance computing and HCI for virtual reality applied to geophysics.

In this session, we present a new method for attenuating noise in seismic data while preserving structural features. Our approach filters the seismic amplitude data using anisotropic diffusion, which involves solving a large linear system. Moreover, we use seismic attributes that represent horizons and faults as constraints on the diffusion process, and an implicit method for solving the diffusion equation. Using GPGPU to accelerate the linear system solution allows the seismic filtering to run at interactive rates, enabling fine adjustment of the input parameters before processing the whole dataset.
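
For reference, the classic Perona-Malik form of anisotropic diffusion (a standard starting point; the session's constrained variant may differ in detail) is

    \frac{\partial u}{\partial t} = \nabla \cdot \left( c(\lvert \nabla u \rvert)\, \nabla u \right),

where u is the amplitude volume and c is an edge-stopping function that reduces diffusion across structural features. Discretizing this equation implicitly (e.g., backward Euler in time) is what gives rise to the large linear system mentioned above.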

Level: All
Type: Talk
Tags: Energy Exploration; Tools & Libraries

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Marriott Salon 1

S6424 - Exploring Scalable Implementations of Triangle Enumeration in Graphs of Diverse Densities: Apache Spark vs. GPUs

Michela Taufer Associate Professor, University of Delaware
Michela Taufer joined the University of Delaware in 2007, where she was promoted to associate professor with tenure in 2012. She earned her M.S. in computer engineering from the University of Padova and her Ph.D. in computer science from the Swiss Federal Institute of Technology (ETH). She was a post-doctoral researcher supported by the La Jolla Interfaces in Science Training Program (also called LJIS) at UC San Diego and the Scripps Research Institute. Before she joined the University of Delaware, Michela was faculty in computer science at the University of Texas at El Paso.
Travis Johnston Postdoctoral Researcher, University of Delaware
Travis Johnston is a post-doctoral researcher working with Dr. Michela Taufer in the Global Computing Laboratory at the University of Delaware. He is working on several projects centered on big data analytics for scientific computation.

We'll present graphs as powerful tools for analyzing complex relationships between entities. Many structures commonly found in computer science, like social networks, computer networks, and the world wide web, can be modeled as graphs. Since many real graphs are very large and complex, the associated analysis algorithms must be very efficient and highly parallel. We present implementations of a key graph-based analysis, triangle enumeration, for two different parallel paradigms: GPU programming and Apache Spark. We'll reveal how the performance of the two implementations compares as the characteristics of the graph change.
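
As a baseline illustration of the problem (not the speakers' implementation, which would use sparse graph formats at scale), a dense-adjacency CUDA kernel that counts each triangle i < j < k exactly once might look like:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Count triangles in an undirected graph stored as a dense n x n
    // adjacency matrix. Each thread owns one candidate edge (i, j) and
    // scans for common neighbors k > j.
    __global__ void countTriangles(const int* adj, int n,
                                   unsigned long long* total) {
        int i = blockIdx.y;
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n || j >= n || i >= j || !adj[i * n + j]) return;
        unsigned long long local = 0;
        for (int k = j + 1; k < n; ++k)
            if (adj[i * n + k] && adj[j * n + k]) ++local; // triangle (i, j, k)
        atomicAdd(total, local);
    }

    int main() {
        const int n = 4;
        int h_adj[n * n] = {0};
        auto edge = [&](int a, int b) { h_adj[a * n + b] = h_adj[b * n + a] = 1; };
        edge(0, 1); edge(1, 2); edge(0, 2); edge(2, 3); // one triangle: {0,1,2}
        int* d_adj; unsigned long long* d_total;
        cudaMalloc(&d_adj, sizeof(h_adj));
        cudaMalloc(&d_total, sizeof(unsigned long long));
        cudaMemcpy(d_adj, h_adj, sizeof(h_adj), cudaMemcpyHostToDevice);
        cudaMemset(d_total, 0, sizeof(unsigned long long));
        dim3 grid((n + 255) / 256, n);
        countTriangles<<<grid, 256>>>(d_adj, n, d_total);
        unsigned long long total = 0;
        cudaMemcpy(&total, d_total, sizeof(total), cudaMemcpyDeviceToHost);
        printf("triangles: %llu\n", total); // expect 1
        cudaFree(d_adj); cudaFree(d_total);
        return 0;
    }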

Level: Beginner
Type: Talk
Tags: Algorithms; Tools & Libraries; Big Data Analytics

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Marriott Salon 3

S6468 - Democratizing Sequencing with Ion Torrent S5 and S5XL DNA Sequencers Powered by GPUs

Mohit Gupta Staff Software Engineer, Thermo Fisher Scientific
Highly-Rated Speaker
Mohit Gupta is a staff software engineer in the Genetic, Medical and Applied Sciences division of Life Sciences Solutions, part of Thermo Fisher Scientific Inc. In this capacity, he is responsible for speeding up algorithms used in data analysis for the PGM, Proton, S5, and S5XL DNA sequencers, with a particular focus on GPU computing. Previously, Mohit worked as a senior research and development engineer with Mirafra Technologies, Bangalore, India, in the area of electronic design automation, working on compilers for hardware description languages like Verilog. He holds a B.Tech. in electrical engineering from the Indian Institute of Technology, Bombay, India, and an M.S. in computer engineering from the University of California, San Diego. He has published or presented at conferences and workshops including ICCAD, GTC, and DFMY.
Jakob Siegel Staff Software Engineer, Thermo Fisher Scientific
Highly-Rated Speaker
Jakob Siegel is a staff software engineer in the Genetic, Medical and Applied Sciences division of Life Sciences Solutions, part of Thermo Fisher Scientific Inc. His work focuses on high performance computing tasks in the context of DNA sequencing. Jakob graduated as a Dipl.-Ing. in software engineering from the University of Applied Sciences in Esslingen, Germany, after which he also earned an M.S. and Ph.D. in electrical and computer engineering from the University of Delaware. He was involved in software projects in a variety of fields, from pure computer science through the automotive sector, naval communication systems, and atmospheric research, until he joined the Ion Torrent team in January 2012 to work on the software side of DNA sequencing. Jakob has published or presented at multiple computer engineering conferences, workshops, and journals, including Computing Frontiers, ICS, ICPPW, GTC, AACEC, and JACT.

Learn how GPUs have accelerated the pace of research in targeted DNA sequencing by providing quick turnaround in the analysis of terabytes of raw data generated by Ion Torrent DNA sequencers, like the S5XL, in a matter of hours. We'll showcase our complete signal processing pipeline running on GPUs and share our results, along with lessons learned in developing CUDA code for the Kepler and Maxwell architectures. We'll share our experiences using the CUDA Multi-Process Service (MPS). We'll touch upon several examples in the areas of clinical research, drug discovery, and human identification that have received a tremendous boost from the speed of our technology, propelled by GPUs.

Level: All
Type: Talk
Tags: Computational Biology

Day: Tuesday, 04/05
Time: 14:00 - 14:50
Location: Marriott Salon 5

S6528 - Implementing Deep Learning for Video Analytics on Tegra X1

Javier Rodriguez Saeta CEO, Herta Security
Dr. Javier Rodriguez Saeta is CEO of Herta Security, which he founded in 2009. He received his M.S. and Ph.D. in electrical engineering from the Universitat Politecnica de Catalunya in 2000 and 2005, respectively. He received a B.A. in business administration from the Open University of Catalonia, and an MBA from ESADE Business School. In 2000, he worked for Robert Bosch GmbH in Hildesheim, Germany. In 2001, he joined Biometric Technologies in Barcelona, Spain, where he was the R&D manager. He has published more than 20 papers in different magazines and workshops, and holds three patents. His main research interests include all issues related to innovation, security, and biometric systems and applications.
Carles Fernández Tena Director of Research, Herta Security
Carles Fernández Tena received his B.S. in Telecommunication Eng. and M.S. in Language and Speech from the Technical University of Catalonia (UPC) in 2005. He received an M.S. in Computer Vision and AI from the Autonomous University of Barcelona (UAB) in 2008, where he obtained his Ph.D. cum laude in 2010, receiving the 2010 Extraordinary Ph.D. Award. He has published more than 40 scientific articles in international journals and conferences. Currently he leads the research team at Herta Security. His research interests include Biometrics, Computer Vision, and Machine Learning, particularly unconstrained facial analysis in image and video.

The performance of Tegra X1 architecture opens the door to real-time evaluation and deployment of deep neural networks for video analytics applications. This session presents a highly optimized, low-latency pipeline to accelerate demographics estimation based on deep neural networks in videos. The proposed techniques leverage the on-die hardware video decoding engine and Maxwell GPU cores for conducting advanced video analytics such as gender or age estimation. Our results show that Tegra X1 is the right platform for developing embedded video analytics solutions.

Level: All
Type: Talk
Tags: Intelligent Video Analytics (IVA); Embedded; Aerospace & Defense; Deep Learning & Artificial Intelligence; IoT

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Room LL20D

S6536 - Building Immersive NVIDIA SHIELD Android TV Games

Luc Beaulieu CTO, Frima Studio
Luc Beaulieu is chief technology officer at Frima, a 350+ employee company with a passion for digital entertainment. He is currently leading the technical innovation and smart toy groups. Luc has over 20 years of experience in video games, online communities, and digital experiences.
Jean-Philippe Doiron Technology Director, Frima Studio
With Frima since 2009, Jean-Philippe benefits from over 15 years of professional experience in the creation of games and web applications. Before joining Frima, he co-founded an Internet multimedia development company, where he acted as a developer for the better part of six years. He also worked for seven years as a consultant, mostly in Canada, the US, and the UK, developing highly optimized, multithreaded database architectures and reporting systems for transit operators. His current responsibilities as director of technology include aligning R&D efforts with Frima's technological needs, laying down quality standards, and providing the R&D team with support. Jean-Philippe is an expert in software and game development, profiling and optimization, DirectX, C++, and Flash 3D.

We'll show how game experiences on the SHIELD Android TV can be extended beyond the border of the monitor. Sound is an obvious example, but what about lights? Find out how Frima increased immersion in its Chariot game by adding connectivity between SHIELD and the Philips Hue lighting system. We'll show how the game interacts with the lights and what had to change to make it work. With Chariot being a console game first, the audience will also learn about performance comparisons between the SHIELD TV and both new- and old-generation consoles.

Level: All
Type: Talk
Tags: Game Development

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Room 212B

S6594 - Connecting the Dots of Emerging Technologies and Real-World Application

Anthony Cortez Senior Designer | Visualization | Lighting Consultant, Arup
Anthony Cortez is the Arup visualization leader in the Americas Region. A graduate of the Art Institute of Pittsburgh, Anthony has been working as a senior designer for the visualization industry for over 18 years. Anthony has worked on numerous architectural and engineering projects in visualization, lighting design, and motion graphics. Projects include the Fulton Center, the Tappan Zee Bridge, JFK JetBlue Terminal 5, YAS Marina Hotel, Singapore Stadium and the Academy of Arts & Science Theater at the Lighthouse International. In addition to traditional 3D visualization, he integrates with other Arup disciplines to produce validated lighting studies, acoustic and pedestrian/vehicle simulations, fire simulations, as well as real-time simulations using cutting-edge gaming technology.

New technologies are enabling us to bring designs to life in ways never before possible. Real-time graphics engines immerse viewers in a virtual environment, providing a perfect tool for showcasing projects and troubleshooting problems during the design phase. Our engineers and graphics experts collaborate to produce interactive walk-throughs showing options for different spaces and environments. As new synergies continue to emerge in the design field, visualization is key to providing a common basis of understanding among multiple design disciplines and project stakeholders. Arup is helping turn best practice into the next practice by using real-time graphics engines as a presentation tool, greatly improving communication and understanding for project teams and clients alike.

Level: All
Type: Talk
Tags: Product & Building Design; Virtual Reality & Augmented Reality

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Room LL21A

S6621 - Unified CPU+GPU Programming for the ASUCA Production Weather Model

Michel Müller Research Assistant, Tokyo Institute of Technology
Michel Müller entered the Ph.D. graduate course at the Department of Energy Sciences, Tokyo Institute of Technology, in 2015, supervised by Professor Aoki. He graduated with an M.S. in electrical engineering and information technology from ETH Zurich in 2012. From 2009 to 2014, he worked as a consultant and then a systems architect at ATEGRA AG, Switzerland.

Porting applications to GPUs still requires compromises between time-to-solution, GPU performance, and CPU performance. This often leads to major challenges for large, Fortran-based applications like weather and climate models. We'll focus on two of these challenges, whose significance is shown using real-world code examples and performance results: the differing requirements on parallel task granularity and on storage order between the two architectures. Our proposed solution is a flexible preprocessor framework called "Hybrid Fortran," which has been used to port both the dynamics and the physics of ASUCA, one of the Japan Meteorological Agency's current operational weather models. Finally, an even more hands-off approach to GPU portability is proposed in the shape of a black-box solution.

Level: Intermediate
Type: Talk
Tags: Earth System Modelling; Tools & Libraries; Computational Physics; Supercomputing & HPC; OpenACC

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Room 211A

S6640 - Large-Scale Parallel Data Visualization and Analysis on GPU Clusters

Silvio Rizzi Postdoctoral Appointee, Argonne National Laboratory

We'll describe vl3, a GPU-accelerated parallel framework for large-scale scientific visualization and visual analytics. We'll explain its parallel architecture, based on a combination of the message passing interface (MPI) and GLSL shaders. In addition, we'll present applications to interactive visualization of large-scale volumetric and particle-based datasets generated on some of the most powerful supercomputers on the planet. We will also discuss strong and weak scalability experiments on up to 125 NVIDIA K80 GPUs. Finally, we'll cover vl3's capability for remote data visualization and streaming ultra-high-resolution images to remote displays, including a large tiled display driven by a workstation with multiple Quadro M6000 cards and Mosaic technology.

Level: All
Type: Talk
Tags: In-Situ and Scientific Visualization; Large Scale and Multi-Display Visualization

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Room LL21D

S6695 - Generative Adversarial Networks

Ian Goodfellow Senior Research Scientist, Google
Ian is a Senior Research Scientist on the Google Brain team. He is the lead author of the textbook Deep Learning (www.deeplearningbook.org). He studies new methods to improve deep learning. His interests include generative models, machine learning in the adversarial setting, and accelerating the training of neural networks. He has contributed to several open source machine learning libraries that leverage CUDA, including Theano, Pylearn2, and TensorFlow.

Generative adversarial networks (GANs) provide a way to generate realistic images with deep neural networks. Compared to other approaches to generative modeling, GANs are able to learn the cost function. This allows them to learn to capture important details that a fixed, manually designed cost function, such as mean squared error, would ignore. Compared to maximum likelihood estimation (MLE), GANs are specialized for the task of generating realistic samples. Both MLE and GANs are consistent statistical estimators, but have different priorities. MLE aims to assign high likelihood to all of the data, but may also assign high likelihood to other points and thus generate unrealistic samples. GANs aim to always generate realistic samples.
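
For reference, the training objective introduced by Goodfellow et al. (2014) is the minimax game

    \min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))],

where the discriminator D learns to distinguish real data from generated samples and the generator G learns to fool it; D thus plays the role of the learned cost function described above.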

Level: Intermediate
Type: Talk
Tags: Deep Learning & Artificial Intelligence

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Room 210D

S6761 - Advanced System Power Management for Deep Learning and A.I. Machines (Presented by Linear Technology)

Dave Dwelley Office of the CTO, Linear Technology
Dave Dwelley works in the Office of the CTO at Linear Technology. Since joining the company over 25 years ago, Dave has served as an analog chip designer and design manager. He received his BSEE/CD degree from UC Berkeley in 1986, was a member of the original IEEE P802.3af (PoE) task force starting in 2000, and was a founding member of the P802.3at (PoE+) group starting in 2004. He now serves as chairman of the IEEE 802.3 Power over Data Lines study group and participates in the P802.3bp Reduced Twisted Pair Gigabit group. Dave holds 16 patents and spends his free time raising two teenagers and thinking about rebuilding the collection of old cars gathering dust in his garage.

Linear Technology's DC/DC regulator and power management solutions enable designers to increase performance in GPU- and CPU-based systems. Improved electrical, thermal and mechanical properties for core, I/O, and memory rails, combined with expertise and tools for PCB layout, simulation and design verification permit deployment of more efficient, lighter weight, cooler, and more compact systems. This presentation will also focus on methods of controlling, monitoring and debugging power circuits by digitally communicating with the device, reading temperature and load current data while setting voltages and start-up conditions. Future product advancements related to powering automotive electronics will also be discussed.

Level: All
Type: Talk
Tags: Performance Optimization; Deep Learning & Artificial Intelligence; Self-Driving Cars & Automotive

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Marriott Salon 6

S6116 - Towards Building a GPU Cloud Service for Human-Level Quality Image Understanding

Xiaodong He Senior Researcher, Microsoft
Xiaodong He is a senior researcher in the Deep Learning Technology Center, Microsoft Research, Redmond, Wash. He is also an affiliate full professor in the Department of Electrical Engineering at the University of Washington, Seattle, serving on the Ph.D. reading committee. His research interests include deep learning, speech, natural language, vision, information retrieval, and knowledge representation and management. He has published in IEEE TASLP, IEEE SPM, Proc. IEEE, ICASSP, ACL, EMNLP, NAACL, CVPR, SIGIR, WWW, CIKM, ICLR, NIPS, and other venues. He has received several awards, including the Outstanding Paper Award at ACL 2015. He and colleagues developed the MSR-NRC-SRI entry and the MSR entry that won No. 1 in the 2008 NIST Machine Translation Evaluation and the 2011 IWSLT Evaluation (Chinese-to-English), respectively, and the MSR image captioning system that won first prize at the MS COCO Captioning Challenge 2015. He has held editorial positions on several IEEE journals and has served on the organizing and program committees of major speech and language processing conferences. He is a senior member of IEEE and a member of ACL.
Kenneth Tran Senior Research Engineer, Microsoft Research
Kenneth Tran is a senior research engineer in the Deep Learning Technology Center, Microsoft Research. Previously, he was a machine learning scientist in the Cloud Machine Learning group at Microsoft, building a machine learning platform, which now powers the Azure ML. His research interest includes machine learning, optimization, and distributed computing.

Learn the latest deep learning techniques for semantic modeling of image, text, and knowledge graph, all empowered by GPU computing and cloud service. We'll demonstrate how to build deep semantic models across different modalities, and how to apply these models to reach the best results in information retrieval, question answering, and image captioning benchmarks. In particular, facilitated by the recently announced Microsoft Azure GPU compute instances, we'll show how to use GPU clusters to extend the MSR image captioning system, which won first prize in the COCO Captioning Challenge at CVPR 2015, and to build a publicly available, large-scale, deep image understanding service that achieves state-of-the-art performance in generating novel captions for images.

Level: Intermediate
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Computer Vision & Machine Vision; Data Center & Cloud Computing

Day: Tuesday, 04/05
Time: 14:30 - 14:55
Location: Room 210D

S6126 - GPU-Enabled Pavement Defect Detection: Looking for Cracks with Thousands of Eyes

Kristina Doycheva Research Assistant, Ruhr-University Bochum, Germany
Kristina Doycheva is pursuing her Ph.D. at the Ruhr-University and is working as a research assistant at the Chair of Computing in Engineering, Department of Civil Engineering. She received her M.S. in applied informatics at the Ruhr-University Bochum, Germany in 2013. Her research interests include high-performance image processing and machine learning. She is now working on a project related to pavement defect detection.

Learn how to use GPUs for pavement defect detection. In recent years, a variety of vision-based methods for pavement defect detection have been proposed. However, these methods mostly process the images offline, which results in a large amount of data being persistently stored. To enable real-time pavement distress detection, image pre-processing steps, such as correction of nonuniform background illumination and noise removal, as well as pavement defect detection methods based on texture features and the wavelet transform, were implemented using GPUs. The achieved speed-up of the GPU implementation compared to a sequential implementation is approximately 10,000. The execution time allows for processing more than 600 images per second.

Level: Beginner
Type: Talk
Tags: Self-Driving Cars & Automotive ; Computer Vision & Machine Vision; Video & Image Processing

Day: Tuesday, 04/05
Time: 14:30 - 14:55
Location: Room LL21E

S6215 - MBE: A GPU-Based Fast, Robust and Precise Solver for Chemical ODEs

Fan Feng Ph.D. Student, Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
Fan Feng is a Ph.D. student with the Supercomputer Center, Computer Network Information Center, Chinese Academy of Sciences, Beijing.

Explore a GPU-based efficient algorithm for chemical ODEs, the core and most costly part of the atmospheric chemistry model in the CAS-ESM project. Chemical ODE systems are numerically difficult because of their stiffness, nonlinearity, and nonnegativity constraints. Traditional solvers, such as LSODE, are hard to parallelize because of their complicated control flow and coupling. In our experiments, we obtained a 3-5X speedup on the GPU when the same input is set on each node, which eliminates divergence in the kernel, while performance with real input is even worse than the serial code. So we developed a new solver, Modified Backward Euler (MBE). In our numerical experiments, MBE is shown to be faster and more precise than LSODE, and it's easy to parallelize, so we can expect a significant speedup on the GPU.
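
For reference, the standard backward Euler step for a stiff ODE system y' = f(t, y) is the implicit update

    y_{n+1} = y_n + h \, f(t_{n+1}, y_{n+1}),

whose strong stability is what makes implicit schemes attractive for stiff chemistry; MBE is the speakers' modification of this basic scheme (details in the session).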

Level: All
Type: Talk
Tags: Earth System Modelling; Computational Fluid Dynamics; Algorithms

Day: Tuesday, 04/05
Time: 14:30 - 14:55
Location: Room 211A

S6276 - Autonomous Robotic 3D Printing: Real-Time Path Planning with Computer Vision

Daghan Cam Architect, University College London
Daghan Cam is an architect and researcher based in London. He is the director of Daghan Cam Limited, which operates between architecture, technology, and research. He runs a post-graduate research cluster at UCL's Bartlett School of Architecture with Alisa Andrasek and Andy Lomas. He also leads research on GPU computing and is a co-principal investigator of UCL's NVIDIA GPU Research Center. Previously he worked with Zaha Hadid Architects. He taught workshops and gave lectures at AA Visiting Schools in Istanbul, Athens, London, and at Ecole d'architecture in Paris. His work on computational design and large-scale robotic fabrication has been widely exhibited, recently in San Francisco and in Milan Design Week 2015.

Teach your 3D printing robot how to adapt to unpredictable material behavior by using deep learning algorithms. We'll introduce a path planning strategy for iteratively correcting robot target positions in a 3D printing process by using an NVIDIA Jetson card attached to an industrial robotic arm. Initial path generation, visual tracking of material behavior in real-time, evaluation and recomputation of robot trajectories will be explained by code examples and video recordings from the fabrication process.

Level: Beginner
Type: Talk
Tags: Product & Building Design; Robotics & Autonomous Machines; Computer Vision & Machine Vision; Deep Learning & Artificial Intelligence

Day: Tuesday, 04/05
Time: 14:30 - 14:55
Location: Room LL21A

S6288 - Automatically Fusing Hundreds of Stencil Kernels for Better Performance and Productivity

Mohamed Wahib Post-Doctoral Researcher, RIKEN Advanced Institute for Computational Science
Mohamed Wahib is currently a post-doctoral researcher in the HPC Programming Framework Research Team at RIKEN Advanced Institute for Computational Science (RIKEN AICS). He joined RIKEN AICS in 2012 after years at Hokkaido University, Japan, where he received a Ph.D. in computer science in 2012. Prior to his graduate studies, Mohamed worked as a researcher at Texas Instruments R&D for four years.

This talk proposes an end-to-end framework for automatically transforming stencil-based CUDA programs to exploit inter-kernel data locality. The CUDA-to-CUDA transformation collectively replaces the user-written kernels with auto-generated kernels optimized for data reuse. The transformation is based on two basic operations, kernel fusion and fission, and relies on a series of automated steps: gathering metadata, generating graphs expressing dependency and precedence constraints, searching for optimal kernel fissions/fusions, and generating optimized code. We show how the automatic transformations are practical and effective in exploiting exposed data localities for a variety of real-world applications with large codebases that contain dozens of kernels and data arrays.
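
To illustrate the basic idea of kernel fusion (a conceptual sketch, not the framework's generated code), two dependent stencil kernels can be collapsed so the intermediate array never touches global memory:

    // Before: two kernels; the intermediate array b is written to global
    // memory by the first launch and re-read by the second.
    __global__ void stencil(const float* a, float* b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < n - 1) b[i] = a[i - 1] + a[i + 1];
    }
    __global__ void scale(const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < n - 1) c[i] = 0.5f * b[i];
    }

    // After: one fused kernel; the intermediate value stays in a register,
    // exploiting the inter-kernel data locality the framework targets.
    __global__ void stencilScaleFused(const float* a, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < n - 1) c[i] = 0.5f * (a[i - 1] + a[i + 1]);
    }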

Level: Intermediate
Type: Talk
Tags: Tools & Libraries; Supercomputing & HPC; Performance Optimization

Day: Tuesday, 04/05
Time: 14:30 - 14:55
Location: Room 211B

S6389 - Embedded Deep Learning for Object Detection and Classification in Aerial Images

Jon Barker Solution Architect, NVIDIA
Jon Barker is a solution architect with NVIDIA, helping customers and partners develop applications of GPU-accelerated machine learning and data analytics to solve defense and national security problems. He is particularly focused on applications of the rapidly developing field of deep learning. Prior to joining NVIDIA, Jon spent almost a decade as a government research scientist within the U.K. Ministry of Defence and the U.S. Department of Defense R&D communities. While in government service, he led R&D projects in sensor data fusion, big data analytics, and machine learning for multi-modal sensor data to support military situational awareness and aid decision making. He has a Ph.D. and B.S. in pure mathematics from the University of Southampton, U.K.

Learn how deep learning can be applied to object detection, localization, and tracking problems in remote sensing. We'll present a technical case study showing how a convolutional neural network (CNN) trained in the data center using DIGITS can be deployed to an embedded GPU system to carry out low-latency object detection, classification, and tracking in high-resolution aerial imagery. We'll compare different approaches to detection and localization tasks. An example will be given of integrating the Caffe deep learning framework for GPU-accelerated CNN inference with an OpenCV-based image and video processing pipeline. We'll also show how transfer learning can be accomplished using DIGITS to train a CNN when only a small task specific training dataset is available.

Level: Intermediate
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Computer Vision & Machine Vision; Robotics & Autonomous Machines; Aerospace & Defense

Day: Tuesday, 04/05
Time: 14:30 - 14:55
Location: Room 210G

S6397 - Real-Time Non-Rigid Image Registration Engine

Randall Miles Senior Research Scientist, Propulsion Science and Technology
Dr. Randall Miles is a physicist, algorithm developer, and senior research scientist at Propulsion Science and Technology. He is lead designer and developer for model database development activities, and key contributor on a variety of projects, including quantum chemistry calculations and radar cross section modeling of CFD fields.

Non-rigid image registration, i.e., morphing, allows a smaller footprint of seed images to be used to create a smooth and continuously changing series of images. We'll present a new high-speed toolkit for image morphing implemented using NVIDIA GPU technology. Time improvements of ~80% were seen through implementing a succession of CUDA optimizations guided by the Nsight profiler results. Tests were conducted using available simulated rocket plume images to calculate run times and create performance measures.

Level: All
Type: Talk
Tags: Aerospace & Defense; Video & Image Processing; Performance Optimization; Computer Vision & Machine Vision

Day: Tuesday, 04/05
Time: 14:30 - 14:55
Location: Marriott Salon 2

S6421 - Using OpenACC to Parallelize Seismic One-Way-Based Migration

Maxime Hugues HPC Research Scientist, Total E&P Research & Technology USA, LLC
Maxime Hugues has been an HPC research scientist at TOTAL Houston since 2012. Maxime graduated from the French National Engineer School "ISEN-Toulon" in 2007. The same year, he received an M.S. from the University of Science and Technologies of Lille. He was a Ph.D. fellow at the oil and gas company TOTAL, and received his degree in computer science in 2011 from the University of Lille. While doing his Ph.D., he worked as a junior researcher in high performance computing at TOTAL. He continued to work on multi-programming paradigms as a postdoctoral researcher at INRIA and as a visiting scientist at the University of Tsukuba. His research focuses on programming paradigms and innovative hardware for extreme computers.

We'll describe our experience using OpenACC to parallelize one-way-based migration, a seismic application that uses Fourier finite differencing. We describe our approach to optimizing application kernels that involve FFT operations and solving systems of tridiagonal sparse matrices. We talk about the expectations and challenges of using OpenACC, along with potential pitfalls for application users. We highlight the advantages of using OpenACC for high-performance scientific applications and list shortcomings that affect performance.
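
As a generic illustration of the directive-based approach (a minimal sketch, not the actual TOTAL kernels), OpenACC lets a loop nest be offloaded while data stays resident on the device:

    // Illustrative only: a simplified update loop. `field` and `tmp` are
    // hypothetical arrays of length n; the real code operates on wavefields
    // inside an FFT/tridiagonal-solve pipeline.
    #pragma acc data copy(field[0:n]) create(tmp[0:n])
    {
        #pragma acc parallel loop
        for (int i = 1; i < n - 1; ++i)
            tmp[i] = 0.25f * (field[i - 1] + 2.0f * field[i] + field[i + 1]);

        #pragma acc parallel loop
        for (int i = 1; i < n - 1; ++i)
            field[i] = tmp[i];
    }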

Level: All
Type: Talk
Tags: Energy Exploration; Performance Optimization; Supercomputing & HPC; OpenACC

Day: Tuesday, 04/05
Time: 14:30 - 14:55
Location: Marriott Salon 1

S6489 - Testing Chordal Graphs with CUDA®

Agnieszka Lupinska PhD Student, Jagiellonian University
Agnieszka Lupinska is a Ph.D. student in computer science at Jagiellonian University in Cracow, where she teaches CUDA programming. She was a software engineer at Nokia Cracow from June 2014 to June 2015, developing embedded Linux systems for Small Cell, Nokia's product for LTE technology. Her interests include parallel computing, multi-threaded algorithms, low-level language programming, advanced algorithms, and computational complexity.

We'll present a CUDA implementation of an algorithm to test the chordality of graphs, which uses parallel partition refinement with pivots. A graph is chordal if each cycle of size greater than three has a chord, that is, an edge between two non-adjacent vertices on the cycle. In total, the algorithm takes O(N) time on an N-thread grid and performs O(N+M) work for graphs of N vertices and M edges. We'll compare the performance results achieved by the CUDA implementation on an NVIDIA GeForce GTX TITAN X against a sequential implementation on a CPU with four cores (eight threads). We'll present test results for cliques, sparse graphs, dense graphs, and random chordal graphs.

Level: Advanced
Type: Talk
Tags: Algorithms; Big Data Analytics

Day: Tuesday, 04/05
Time: 14:30 - 14:55
Location: Marriott Salon 3

S6638 - Image Compositing on GPU-Accelerated Supercomputers

Pascal Grosset Ph.D. Student, University of Utah
Pascal Grosset is a Ph.D. student in graphics and visualization at the University of Utah. He is working with Dr. Charles Hansen on image compositing algorithms for high performance computing systems targeting both GPUs and many-core CPUs.

Image compositing on GPUs has traditionally been considered impractical because communication between GPUs on different nodes used to be very slow. However, the introduction of GPU Direct RDMA changes that: inter-node GPU communication is now a one-copy operation instead of five. Also, Kepler-class K20 GPU accelerators and above can run both OpenGL and CUDA at the same time. We'll present the workflow that allows us to use OpenGL 4.5 with GLSL for volume rendering, GPU Direct RDMA and CUDA kernels for compositing, and CUDA-OpenGL interop for transferring data between OpenGL and CUDA.
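
A minimal sketch of the CUDA-OpenGL interop step (assumed setup: `pbo` is a GL pixel-buffer object created elsewhere; error checking omitted):

    #include <cuda_gl_interop.h>

    GLuint pbo;                          // assumed: created elsewhere with glGenBuffers
    cudaGraphicsResource* res = nullptr;

    void registerOnce() {
        // Register the GL buffer with CUDA once, up front.
        cudaGraphicsGLRegisterBuffer(&res, pbo, cudaGraphicsRegisterFlagsWriteDiscard);
    }

    void compositeFrame() {
        uchar4* pixels = nullptr;
        size_t bytes = 0;
        cudaGraphicsMapResources(1, &res);   // hand the buffer to CUDA
        cudaGraphicsResourceGetMappedPointer((void**)&pixels, &bytes, res);
        // ... launch a compositing kernel that writes into `pixels` ...
        cudaGraphicsUnmapResources(1, &res); // hand it back to OpenGL for display
    }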

Level: Advanced
Type: Talk
Tags: In-Situ and Scientific Visualization

Day: Tuesday, 04/05
Time: 14:30 - 14:55
Location: Room LL21D

S6693 - Network, Storage, and Workflow Design for GPU Centric Film Productions

Jeff Brue CTO, Open Drives
Jeff is the CTO and founder of Open Drives, a data storage company focused on production IT. Jeff has an extensive background in filmmaking, 3D animation, and storage kernel design, having worked on over 120 feature films as CTO at several post-production facilities, as well as heading up the first commercial uncompressed-data workflow system in Hollywood with the Viper camera. Since founding Open Drives in 2011, he has been the principal architect of the facilities for productions such as Gone Girl for David Fincher, and has designed the in-house technology architecture for the Coen Brothers and for Deadpool for 20th Century Fox. Open Drives' studio data clients include Legendary Pictures, Warner Brothers, 20th Century Fox, and Disney, as well as many others.

This talk provides an overall perspective on GPU-centric workflows for media and entertainment, updating last year's talk on Gone Girl with this year's highlighted film, Deadpool. It focuses particularly on designing storage systems for high-speed GPU workflows in VFX and editing.

Level: All
Type: Talk
Tags: Media & Entertainment; Performance Optimization; General Interest

Day: Tuesday, 04/05
Time: 14:30 - 14:55
Location: Room LL21C

S6750 - The Future of GPU Rendering in 2016 and Beyond

Jules Urbach CEO & Founder, OTOY, Inc.
Jules Urbach is a pioneer in computer graphics, streaming, and 3D rendering with over 25 years of industry experience. He made his first game, Hell Cab (Time Warner Interactive), at age 18; it was one of the first CD-ROM games ever created. Six years after Hell Cab, Jules founded Groove Alliance, which created the first 3D game ever available on Shockwave.com (Real Pool). Currently, Jules is busy working on his two latest ventures, OTOY and LightStage, which aim to revolutionize 3D content capture, creation, and delivery.

You'll learn about OTOY's GPU rendering research and new product releases in the coming year. OTOY's breakthroughs in compression and rendering on NVIDIA GPUs have dramatically reduced the barriers for light field rendering, making it a viable media format that gives content creators everywhere a simple, cost-effective way to bring high quality, interactive 3D content to multiple platforms for the world to enjoy.

Level: Intermediate
Type: Talk
Tags: Rendering & Ray Tracing; Real-Time Graphics; Virtual Reality & Augmented Reality; Media & Entertainment

Day: Tuesday, 04/05
Time: 14:30 - 14:55
Location: Room LL21B

S6115 - Real-Time Free Viewpoint TV System Based on a New Panorama Stitching Framework

Pierre Boulanger Professor, University of Alberta
Pierre Boulanger has more than 30 years of experience in 3D computer vision, rapid product development, and the application of virtual reality systems to medicine and industrial manufacturing. He worked for 18 years at the National Research Council of Canada as a senior research officer, where his primary research interests were 3D computer vision, rapid product development, and virtualized reality systems. He now has a double appointment as a professor in the University of Alberta's Department of Computing Science and in the Department of Radiology and Diagnostic Imaging. He is currently the director of the Advanced Man Machine Interface Laboratory (AMMI) as well as the scientific director of the SERVIER Virtual Cardiac Centre. In 2013, Pierre was awarded the CISCO chair in healthcare solutions, a 10-year investment by CISCO Systems in the development of new IT technologies for healthcare in Canada. His main research topics are the development of new techniques for telemedicine, patient-specific modeling using sensor fusion, and the application of tele-presence technologies to medical training, simulation, and collaborative diagnostics.

With the advance of GPU and vision technologies, free viewpoint TV (FTV) will become a reality in the near future. Traditional videos such as those shown on TV or viewed on the Internet are passive and two-dimensional in nature. Viewers can only passively observe the events captured by a cameraman and have no ability to actively change their viewpoint once the video is recorded. On the contrary, FTV will allow the viewer to select an arbitrary viewpoint and thus enjoy a feeling of immersion into events such as an Olympic competition or a popular theatre show. In this presentation, we will describe a FTV system based on creating a real-time panorama from multiple pixel synchronized cameras using GPU and how to transmit this information using normal IPTV technologies.

Level: Intermediate
Type: Talk
Tags: Computer Vision & Machine Vision

Day: Tuesday, 04/05
Time: 15:00 - 15:25
Location: Room 210F

S6193 - Visualization Toolkit: Improving Rendering and Compute on GPUs

Robert Maynard R&D Engineer, Kitware Inc
Robert Maynard joined Kitware in 2010 as a research and development engineer. He is one of the primary developers of VTK-M and also an active contributor to CMake, SMTK, CMB, ParaView, and VTK.

Learn how the latest changes to VTK, VTK-m, and Catalyst are allowing for better GPU-accelerated rendering and compute. We'll give an overview of the latest changes to VTK's rendering infrastructure, VTK-m's compute capabilities, and Catalyst. Lastly, we'll demonstrate this work by showing the results of an in-situ visualization of a PyFR GPU simulation.

Level: Intermediate
Type: Talk
Tags: In-Situ and Scientific Visualization; Supercomputing & HPC

Day: Tuesday, 04/05
Time: 15:00 - 15:50
Location: Room LL21D

S6194 - Delivering Graphics-Intensive Applications to Computing Labs and BYOD in Education

Michael Goay Executive IT Director, University of Southern California Viterbi School of Engineering
Michael Goay is responsible for the information technology and computer systems that support enterprise goals of Viterbi School of Engineering in University of Southern California. He oversees the school's IT service support, service delivery, digital communication, information systems, and system infrastructure in support of academic, administrative, and research activities. He has a B.S. in electrical engineering from the University of Texas, Austin, and an M.S. in health informatics from the University of Minnesota.

We'll examine some of the current possibilities for delivering graphics-intensive applications in support of engineering, architecture, and design, and show how NVIDIA GRID boards benefit the Horizon View virtualized environment at the University of Southern California Viterbi School of Engineering, empowering students to learn and study with graphics-intensive software on any device, wherever and whenever they feel most productive and inspired.

Level: All
Type: Talk
Tags: Graphics Virtualization; Data Center & Cloud Computing

Day: Tuesday, 04/05
Time: 15:00 - 15:50
Location: Marriott Salon 4

S6240 - High-Level GPU Programming Using OpenMP 4.5 and Clang/LLVM

Arpith Jacob Research Staff Member, IBM T. J. Watson Research Center
Arpith Jacob is a research staff member in the Advanced Compiler Technologies Group at the IBM T. J. Watson Research Center. His interests include parallelizing compilers and special-purpose accelerators. His current research is on building an effective OpenMP compiler for the IBM POWER/NVIDIA GPU CORAL supercomputers.

Learn how to exploit GPUs using high-level directives and our open source Clang/LLVM OpenMP compiler and runtime. In this talk, we describe the OpenMP data and execution model for accelerators and the application directives used to program them. We take the audience behind the covers to reveal how the higher-level abstractions map to CUDA and GPU constructs, which can be exploited for either flexibility or performance. Our runtime transparently manages code and data offloading to multiple GPUs on a node, uses a pool of pinned memory to reduce overhead, and supports asynchronous execution of dependent kernels on GPU SMs with CUDA streams by simply expressing kernel dependencies. We show that the performance of programs written in OpenMP is close to that of native CUDA for relevant benchmarks.
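
To make the directive model concrete, a minimal OpenMP 4.5 offload sketch (generic example assuming arrays a, b, c of length n; not the talk's benchmarks):

    // Map the inputs to the device, run the loop across GPU teams and
    // threads, and map the result back.
    #pragma omp target teams distribute parallel for \
                map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

    // Dependent target regions can run asynchronously; a runtime may
    // lower nowait + depend onto CUDA streams as described above.
    #pragma omp target map(tofrom: c[0:n]) nowait depend(out: c)
    { /* first kernel */ }
    #pragma omp target map(tofrom: c[0:n]) nowait depend(in: c)
    { /* second kernel, ordered after the first */ }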

Level: All
Type: Talk
Tags: Programming Languages; Supercomputing & HPC

Day: Tuesday, 04/05
Time: 15:00 - 15:25
Location: Room 210E

S6310 - Accelerate Your Visual Content Creation with SOLIDWORKS Visualize and Iray®

Brian Hillner SOLIDWORKS Visualize Product Manager, Dassault Systemes
Brian Hillner, product manager for SOLIDWORKS Visualize (formerly Bunkspeed), has played an integral role in managing the product and producing digital assets for multiple clients. With a degree in industrial design, Brian is able to understand and tailor cutting-edge software experiences for specific target audiences. His unique blend of design, photography, and sales has helped him promote SOLIDWORKS's ecosystem of software solutions as a leader in the design community. Brian holds a B.S. in design with honors from DAAP at the University of Cincinnati in Ohio.

Using actual customer data, we'll present the many capabilities and direct benefits of SOLIDWORKS® Visualize (formerly called Bunkspeed), which provides a suite of standalone software tools combining industry-leading rendering capabilities with design-oriented features and workflows. Visualize enables easy and fast creation of visual content for designers, engineers, marketing, and other content creators. We'll showcase its flexibility and adaptability to the 3D workflow.

Level: All
Type: Talk
Tags: Product & Building Design; Rendering & Ray Tracing

Day: Tuesday, 04/05
Time: 15:00 - 15:50
Location: Room LL21A

S6324 - A High-Precision GPU, CPU and Memory Power Model for the Tegra K1 SoC

Kristoffer Robin Stokke PhD Candidate, University of Oslo, Simula Research Laboratory
Kristoffer Robin is currently working toward his Ph.D. at the University of Oslo. Working with the MPG research group at Simula Research Laboratory, his primary research interest is the energy efficiency of heterogeneous multicore architectures, such as the Tegra SoCs, and how software can contribute to low-power operation. His recent research activities focus on high-precision energy modeling for different processors. Kristoffer Robin received his bachelor's degree in electrical engineering from Oslo University College in 2009, and his master's degree in informatics from the University of Oslo in 2012.

Learn how to build a high-precision power model for the Tegra K1 heterogeneous multicore SoC. Our methodology considers actual (measured) rail voltages, which vary with GPU, CPU and RAM operating frequency, power gating, hardware configurations, leakage currents, dynamic power as well as traditional hardware performance counters. Therefore, our model is able to predict power usage of individual units such as the Kepler-based GPU, dual-cluster CPU and memory (RAM) with an estimation accuracy above 98% for several CUDA kernels and software workloads. The main take-away point from our talk is to learn how the Tegra K1 consumes energy under various software workloads, which can be used to optimise code and applications.
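
For context, power models of this kind typically build on the textbook CMOS decomposition

    P \approx \alpha C V^2 f + V I_{\mathrm{leak}},

where V and f are the measured rail voltage and clock frequency, \alpha is the switching activity (estimated from hardware performance counters), C is the effective capacitance, and I_leak is the leakage current; the talk refines this per rail and per hardware unit to reach its reported accuracy.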

Level: Intermediate
Type: Talk
Tags: Embedded; Performance Optimization; IoT

Day: Tuesday, 04/05
Time: 15:00 - 15:50
Location: Room LL20D

S6387 - GPU Acceleration of Cholesky's Factorization in CHOLMOD: Batching, Hybrid and Multi-GPU

Steven Rennich Sr. HPC Developer Technology Engineer, NVIDIA
Highly-Rated Speaker
Steven Rennich has been developing massively parallel algorithms and supporting the use of GPUs in the fields of linear algebra and structural mechanics as part of NVIDIA HPC DevTech for five years. His current research interests include the GPU acceleration of direct sparse solvers and sparse matrix-vector multiplication and the continued optimization of GPU implementations of BLAS and LAPACK library functions. Prior to NVIDIA, Steven spent 10 years performing development and performance optimization for several commercial structural mechanics and rigid body dynamics codes. He obtained a Ph.D. in aeronautics and astronautics from Stanford University, where he developed novel computational fluid mechanics codes and studied vortex dynamics and plant morphogenesis.

Sparse matrix factorization is a fundamental tool in scientific computing and has been shown to be well accelerated using GPUs. Yet applying the full capability of the GPU to the factorization operation remains a challenge. This talk covers the latest GPU optimizations that have been applied to the Cholesky factorization algorithm within the well-known SuiteSparse/CHOLMOD linear solver. These optimizations include new NVIDIA CUDA versions of BLAS and LAPACK routines to accelerate operations on batches of small, non-uniformly sized matrices, hybrid computing enhancements, support for multi-GPU acceleration, and further avoidance of PCIe communication through refinements to the sub-tree algorithm.
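
As an illustration of the batched-BLAS idea (standard cuBLAS batched GEMM shown here; the talk's routines additionally handle non-uniformly sized matrices):

    #include <cublas_v2.h>

    // Multiply many small m x m matrices in one call: C[i] = A[i] * B[i].
    // dA, dB, dC are device arrays of device pointers, one entry per matrix,
    // so a single launch amortizes overhead across the whole batch.
    void batchedGemm(cublasHandle_t handle, int m, int batch,
                     const double** dA, const double** dB, double** dC) {
        const double alpha = 1.0, beta = 0.0;
        cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                           m, m, m, &alpha,
                           dA, m,
                           dB, m,
                           &beta, dC, m,
                           batch);
    }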

Level: Intermediate
Type: Talk
Tags: Algorithms; Performance Optimization; Tools & Libraries

Day: Tuesday, 04/05
Time: 15:00 - 15:25
Location: Marriott Salon 3

S6428 - Molecular Simulations of DNA Loop Extrusion Explain and Predict Human Genome Architecture

Adrian Sanborn Ph.D. Candidate, Department of Computer Science, Stanford University
Adrian Sanborn is a Ph.D. candidate in the department of Computer Science at Stanford University and a researcher at the Center for Genome Architecture in Houston. Previously, he graduated summa cum laude from Harvard University with a degree in mathematics and computer science.

We'll show how the human genome's 3D organization, which is closely linked to important cellular functions, can be explained and predicted by molecular simulations. Our recent high-resolution maps of DNA self-contacts revealed that the genome is organized into loops and domains demarcated by the DNA-binding protein CTCF. We present a model, developed using GPU-accelerated molecular simulations, in which loops and domains form through loop extrusion. Our simulations recapitulate DNA contact maps given only CTCF binding locations. When we alter CTCF binding locations using genome editing, our simulations generate accurate predictions for the edited DNA contact maps. These results significantly advance our understanding of genome folding and open a path towards targeted surgery of 3D genomes.

Level: All
Type: Talk
Tags: Computational Biology; Computational Physics

Day: Tuesday, 04/05
Time: 15:00 - 15:25
Location: Marriott Salon 5

S6470 - A Single Forward Propagation of Neural Network for Image Detection

Minyoung Kim Senior Research Engineer, Panasonic Silicon Valley Laboratory
Minyoung is a senior research engineer at Panasonic Silicon Valley Laboratory. She works on deep learning projects, spending her time training deep neural networks, implementing new algorithms, building optimized systems, and researching new technologies in the deep learning area. She received a master's in computer science, specializing in artificial intelligence, from Stanford University in 2009.

This talk will describe how a single forward propagation of a neural network can give us the locations of objects of interest in an image frame, with no proposal generation step before running the network and no post-processing step after. The speaker will describe a fully neural detection system, implemented by Panasonic's deep learning research teams, that achieves real-time speed and state-of-the-art performance. The talk also includes a live demonstration of the system on a laptop PC with an NVIDIA 970M and a tablet with an NVIDIA Tegra K1 GPU.

Level: Intermediate
Type: Talk
Tags: Self-Driving Cars & Automotive ; Deep Learning & Artificial Intelligence; Embedded

Day: Tuesday, 04/05
Time: 15:00 - 15:25
Location: Room LL21E

S6482 - Video Classification of Live Streams on Twitter's Periscope

Clement Farabet Tech Lead, Twitter Cortex, Twitter
Clement Farabet founded Madbits, which was acquired by Twitter, and now builds the core deep learning stack at Twitter Cortex. He earned a Ph.D. in deep learning under Yann LeCun at New York University.

Periscope has experienced tremendous growth, and with it comes the challenge of discovery and relevance of live streams under strict latency requirements. We'll show you the work the Cortex team at Twitter has done to tackle the challenging Periscope distribution for video classification and discovery, along with live demos.

Level: Intermediate
Type: Talk
Tags: Deep Learning & Artificial Intelligence

Day: Tuesday, 04/05
Time: 15:00 - 15:50
Location: Room 210D

S6484 - G3NA-V: GPU-Enabled Tool for Mining and Aligning Complex Gene Interaction Graphs

Karan Sapra PhD Student, Clemson University
Karan Sapra is a fifth year Ph.D. candidate in electrical and computer engineering at Clemson University under Dr. Melissa Smith. He received a B.S. in computer engineering with mathematical science minor from Clemson in 2011. He recently finished a six-month internship at Oak Ridge National Laboratory in the Technology Integration Group.

The rapid production of new crops in the face of population pressure and climate change is arguably as important to human health in the future as biomedical research is today. We'll demonstrate the utility of biological graph alignment using G3NA-V, an NVIDIA GPU-enabled visualization wrapper and human interaction tool built around G3NA. While our tool is applicable across the life sciences, our target end-user is the HPC-underserved plant breeder, for whom we open high-resolution windows into dynamic crop genetic systems to accelerate the crop development cycle.

Level: All
Type: Talk
Tags: Computational Biology; Big Data Analytics; In-Situ and Scientific Visualization; Supercomputing & HPC

Day: Tuesday, 04/05
Time: 15:00 - 15:25
Location: Marriott Salon 5

S6570 - Deep Learning in Real-World Large-Scale Image Search and Recognition

Xian-Sheng Hua Senior Director/Researcher, Alibaba Group
Xian-Sheng Hua became a researcher and senior director of Alibaba Group in April 2015, leading the multimedia technology team in the Search Division. Before that, he had been a senior researcher at Microsoft Research Redmond since 2013, working on web-scale image and video understanding and search, as well as related applications. He was a principal research and development lead in Multimedia Search for the Microsoft search engine, Bing, until 2011, where he led a team that designed and delivered leading-edge media understanding, indexing, and searching features. He joined Microsoft Research Asia in 2001 as a researcher. Since then, his research interests have been in the areas of multimedia search, advertising, understanding, and mining, as well as pattern recognition and machine learning. He has authored or co-authored more than 250 research papers in these areas and has filed more than 90 patents. He received his B.S. in 1996 and Ph.D. in applied mathematics in 2001 from Peking University, Beijing.

We'll introduce how deep learning helps realize a real-world visual search and recognition system. This topic has been studied for decades and became very hot again in recent years, mainly due to the rapid development of deep learning and large-scale search techniques. Many visual search and recognition preliminary products are available to the public. However, have we solved all the big technical and non-technical challenges? Has ImageNet solved the recognition problem? What are the key factors in realizing a real-world visual recognition/search system? Are semantic gaps still there? Which direction is visual search/recognition going toward? What is still missing? We'll discuss all these questions based on a real-world, deep learning-based visual search and recognition system.

Level: All
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Computer Vision & Machine Vision

Day: Tuesday, 04/05
Time: 15:00 - 15:25
Location: Room 210G

S6617 - Using OpenACC to Accelerate Kirchhoff Depth Migration

Ken Hester Solution Architect, NVIDIA
Ken Hester is a solution architect with NVIDIA.

Learn how to use OpenACC directive-based methods to accelerate legacy seismic imaging apps using the public domain Seismic Unix package from the Colorado School of Mines Center for Wave Phenomena. Several case studies already explain how to accelerate reverse time migration algorithms using CUDA and GPUs; we'll focus instead on accelerating Kirchhoff Depth Migration (KDM), an important part of the seismic processing workflow. While significant performance gains will be reported using industry-standard KDM benchmarks and data, the emphasis will be on how the OpenACC tool chain improves portability, maintainability, and programmer productivity while maximizing performance.
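
To make the directive approach concrete, a schematic Kirchhoff-style accumulation loop (purely illustrative, with stand-in arithmetic in place of the real travel-time computation; not the Seismic Unix code):

    // The acc data region keeps the output image resident on the GPU
    // while traces stream through host I/O one at a time.
    #pragma acc data copy(image[0:nz*nx])
    for (int t = 0; t < ntraces; ++t) {
        read_trace(t, trace, ns);                 /* hypothetical host-side I/O */
        #pragma acc parallel loop collapse(2) copyin(trace[0:ns])
        for (int ix = 0; ix < nx; ++ix)
            for (int iz = 0; iz < nz; ++iz) {
                int it = (ix + iz + t) % ns;      /* stand-in travel-time index */
                image[iz + ix * nz] += trace[it];
            }
    }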

Level: All
Type: Talk
Tags: Energy Exploration; OpenACC

Day: Tuesday, 04/05
Time: 15:00 - 15:25
Location: Marriott Salon 1

S6659 - Perfworks: A Library for GPU Performance Analysis

Avinash Baliga GPU Foundations Profiler Software Architect, NVIDIA
Avinash Baliga is the lead developer of the PerfWorks SDK, and has worked on GPU and game tools for the past 10 years. At prior GTCs, he has presented on the Nsight CUDA Debugger and Analysis tools.

Attendees will learn about PerfWorks, an updated successor to PerfKit that can be used for GPU performance analysis. PerfWorks will support NVIDIA GPUs and SOCs going forward. Developers will get an overview of how PerfWorks gives them access to low-level performance counters in NVIDIA GPUs. We'll provide example use cases so developers can see how it can be added to their applications to instrument specific sections of their code.

Level: Intermediate
Type: Talk
Tags: Tools & Libraries

Day: Tuesday, 04/05
Time: 15:00 - 15:25
Location: Room 211B

S6721 - Redshift: Production-Quality, Final-Frame Rendering on the GPU

Panagiotis Zompolas CTO, Co-founder, Redshift Rendering Technologies
Panagiotis Zompolas is a video game industry veteran driven by a passion for computer graphics and hardware. Panos has worked with GPUs since the days of 3dfx and has closely followed the GPU compute revolution since its inception in the mid-2000s. Panos' career in the video game industry includes work at leading companies like Sony Computer Entertainment Europe and Double Helix Games (now Amazon Games). He has led teams of graphics programmers in the creation of render engines spanning several generations of hardware. This experience, tied with his passion for the industry, is one of the key pillars of Redshift's success.
Robert Slater VP Engineering, Redshift Rendering Technologies
Robert Slater is a seasoned GPU software engineer and video game industry veteran, with a vast amount of experience in and passion for the field of programming. As a programmer, Rob has worked for companies such as Electronic Arts, Acclaim and Double Helix Games (now Amazon Games). During this time, Rob was responsible for the core rendering technology at each studio, driving their creative and technical development. Rob’s graphics engine programming experience and know-how ensures that Redshift is always at the forefront of new trends and advances in the industry.

We'll discuss the latest features of Redshift, the GPU-accelerated renderer running on NVIDIA GPUs that is redefining the industry's perception of GPU final-frame rendering. A few examples of customer work will be demonstrated. This talk will be of interest both to industry professionals who want to learn more about GPU-accelerated production-quality rendering and to software developers interested in GPU-accelerated rendering.

Level: All
Type: Talk
Tags: Media & Entertainment; Rendering & Ray Tracing; Real-Time Graphics

Day: Tuesday, 04/05
Time: 15:00 - 15:25
Location: Room LL21B

S6724 - Acceleration of Weather Forecasting and Meteorological Satellite Data Assimilation, Processing and Applications

Allen Huang CTO, Tempo Quest Inc.
Allen Huang received his Ph.D. in the area of satellite remote sensing from the University of Wisconsin-Madison in 1989. In the same year he joined the Cooperative Institute for Meteorological Satellite Studies at the Space Science and Engineering Center, University of Wisconsin-Madison. He is currently a distinguished scientist of UW-Madison, a Fellow of the International Society for Optical Engineering (SPIE), PI of many NOAA- and NASA-sponsored projects, an adjunct professor at several universities, CEO of Hyper Sensing, LLC, and CTO of Tempo Quest, Inc.

In partnership with scientists from the Space Science and Engineering Center (SSEC), Tempo Quest Inc. is embarking on a quest to complete AceCAST, a proprietary version of the Weather Research and Forecasting Model (WRF), a mesoscale and global model designed for both operational forecasters and atmospheric researchers and widely used by commercial, government, and institutional users. We'll also discuss state-of-the-art acceleration of low-throughput, low-energy-consumption, error-resilient satellite remote sensing data compression suitable for data, image, and video transmission and archiving.

Level: All
Type: Talk
Tags: Earth System Modelling

Day: Tuesday, 04/05
Time: 15:00 - 15:50
Location: Room 211A

S6741 - Nervve High Speed Video Search, Powered by NVIDIA

Thomas Slowe CEO and Co-Founder, Nervve
Thomas Slowe identifies opportunities of scale in the market and designs game-changing machine learning and interactive technologies. As CEO and co-founder of Nervve, Tom uses his technical and business development experience to position Nervve to change the way video and imagery can be used for intelligence, insight, and action. Tom has over 19 years of experience in the area of machine learning as applied to video, imagery, and other two-dimensional raster data. Prior to Nervve, he held a number of technical and executive positions where he was responsible for providing products and services to Fortune 500 companies in retail, advertising, broadcast, and social media, as well as the US Intelligence Community and Department of Defense. Tom received his BSEE from Rutgers University and an M.S. from the MIT Media Laboratory.

Nervve is changing the way the world searches video by enabling industry leaders to target, measure, and monetize their media through visual search and analysis. We currently work in the federal government, media and entertainment, and sports media markets.

Level: Beginner
Type: Talk
Tags: Media & Entertainment; Big Data Analytics; Computer Vision & Machine Vision; Intelligent Video Analytics (IVA)

Day: Tuesday, 04/05
Time: 15:00 - 15:25
Location: Room LL21C

S6335 - Accelerating Reverse Time Migration Application for Seismic Imaging with GPU Architecture

Sergio Orlandini Software Developer, CINECA
Sergio Orlandini is a software developer in the High Performance Computing department of CINECA. He obtained a Ph.D. in physical chemistry at the Sapienza University of Rome in 2010. Sergio's main interests are parallel computing, GPU computing, and code optimization. He is currently working on implementing seismic algorithms on GPUs. Sergio also authors and teaches a GPU programming course at CINECA.

We'll present an efficient GPU implementation of a reverse time migration (RTM) application. After an overview of the application, we'll discuss the use of GPUs to speed up the solution of the wave equation in a finite difference algorithm. We'll show how to exploit concurrency between GPU and CPU computation and how to efficiently overlap computation and communication between GPUs and CPUs on different nodes. To reduce memory allocation on the device, we use a 16-bit fixed-point representation for the velocity fields. In the second part of our talk, we'll show the RTM application's performance, and discuss and analyze that performance on different HPC clusters and different devices.
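
As a rough illustration of the 16-bit trick mentioned above, the NumPy sketch below (not the authors' code; the value range and volume size are invented) encodes a float32 velocity volume into 16-bit fixed point and decodes it again, halving storage at the cost of a bounded quantization error.

```python
import numpy as np

# Synthetic float32 velocity volume (values in a typical m/s range).
v = np.random.uniform(1500.0, 6000.0, size=(64, 64, 64)).astype(np.float32)

vmin, vmax = float(v.min()), float(v.max())
step = (vmax - vmin) / 65535.0                     # one step of the 16-bit grid
q = np.round((v - vmin) / step).astype(np.uint16)  # encode: 2 bytes per sample

v_rec = q.astype(np.float64) * step + vmin         # decode on the fly in a kernel
print("max abs error:", float(np.abs(v - v_rec).max()), "~ step/2 =", step / 2)
```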

Level: Intermediate
Type: Talk
Tags: Energy Exploration; Performance Optimization; Supercomputing & HPC

Day: Tuesday, 04/05
Time: 15:30 - 15:55
Location: Marriott Salon 1

S6371 - Deep Convolutional Neural Networks for Spoken Dialect Classification of Spectrogram Images Using DIGITS

Nigel Cannings Chief Technical Officer, Intelligent Voice Limited
Nigel Cannings founded Docusite as a research and development vehicle for advanced natural language processing and voice recognition technology, gaining prestigious clients, including AXA Investment Managers, before merging with Chase ITS in 2009. He was educated in England at Brentwood School, and in the USA at Milton Academy in Boston. He qualified as a lawyer in 1993, and has worked for some of the world's largest law firms and software companies. He contributes regularly to a number of publications, including the Huffington Post and the Global Legal Post. He gained U.K. government recognition by way of a large grant for high-tech research exploring problems in speech research, such as ultra-high-speed GPU-accelerated speech recognition and emotional analysis of telephone calls.

Deep convolutional neural networks are designed for classification tasks involving static images. We'll outline the novel application of such networks to speech processing tasks such as the identification of a speaker's dialect. Representing speech as spectrogram images, we'll show our recent results from the NIST language recognition competition, and discuss how the network training results can be improved by manipulating the spectrogram images in ways appropriate to the context of speech applications.
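
As a hedged sketch of the input representation (the synthetic signal, window sizes, and normalization below are placeholders, not the speakers' pipeline), this is how an audio clip becomes a spectrogram "image" that a CNN framework such as DIGITS can consume:

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000                                   # 16 kHz speech
t = np.arange(0, 3.0, 1.0 / fs)
audio = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(t.size)

# 25 ms windows with a 10 ms hop, a common speech front-end choice.
f, times, Sxx = spectrogram(audio, fs=fs, nperseg=400, noverlap=240)
img = 10 * np.log10(Sxx + 1e-10)                   # log-power spectrogram
img = (img - img.min()) / (img.max() - img.min())  # normalize to [0, 1]
print(img.shape)                             # (freq bins, time frames) "image"
```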

Level: All
Type: Talk
Tags: Aerospace & Defense; Deep Learning & Artificial Intelligence; Signal & Audio Processing

Day: Tuesday, 04/05
Time: 15:30 - 15:55
Location: Marriott Salon 2

S6383 - High Performance CTC Training for End-to-End Speech Recognition on GPU

Minmin Sun GPU Architecture Engineer, NVIDIA
Minmin Sun has worked at NVIDIA as a GPU architecture engineer since he graduated from Nanjing University in 2012. His interests are GPU architecture and speech recognition.

End-to-end speech recognition systems, which directly transcribe audio data into text without requiring an intermediate phonetic representation, are based on a recurrent neural network (RNN) combined with connectionist temporal classification (CTC). CTC automatically learns the alignments between speech frames and the label sequence of the transcript. In this work, we focus on optimizing CTC training, especially the forward-backward algorithm, on the GPU. First, we quantitatively analyzed opportunities to save computation and memory accesses in the CTC forward-backward algorithm, and exploited them for a speedup of ~1.28x. Second, by reusing data among frames and transferring data between frames through the register file and shared memory, we achieve a speedup of ~1.80x.
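
For readers unfamiliar with the recursion being optimized here, the NumPy sketch below implements the standard CTC forward (alpha) pass for one utterance. It uses plain probabilities for readability; a real implementation, including the GPU one described above, works in log space and parallelizes the per-frame update across the extended label positions. All names are illustrative.

```python
import numpy as np

def ctc_forward(probs, labels, blank=0):
    """probs: (T, K) per-frame label probabilities; labels: target sequence."""
    T = probs.shape[0]
    ext = [blank]                      # extended sequence: blanks around labels
    for l in labels:
        ext += [l, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, blank]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s >= 1:
                a += alpha[t - 1, s - 1]
            # The skip transition is allowed unless the symbol is a blank
            # or repeats the previous non-blank label.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    # A valid path ends in the last label or the trailing blank.
    return alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)

T, K = 50, 20
probs = np.random.dirichlet(np.ones(K), size=T)   # each frame sums to 1
print(ctc_forward(probs, labels=[3, 7, 7, 2]))    # P(labels | input)
```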

Level: All
Type: Talk
Tags: Deep Learning & Artificial Intelligence

Day: Tuesday, 04/05
Time: 15:30 - 15:55
Location: Room 210G

S6413 - GPU Computing with Apache Spark and Python

Stanley Seibert Scientific Software Developer, Continuum Analytics
Dr. Stanley Seibert is a scientific software developer at Continuum Analytics. He received a Ph.D. in experimental high energy physics from the University of Texas at Austin and performed his postdoctoral research at Los Alamos National Laboratory and the University of Pennsylvania. Stan has been evangelizing the use of Python and GPU computing since 2007, and has worked on a number of applications using Python, C++, and CUDA, including maximum likelihood parameter estimation in large data sets, Monte Carlo optical simulations, and information-theoretic approaches to experiment design. Prior to joining Continuum Analytics, Stan was Chief Data Scientist at Mobi.
Siu Kwan Lam Software Engineer, Continuum Analytics
Siu Kwan Lam is a software developer at Continuum Analytics and the lead developer of the Numba open source compiler project. He has B.S. and M.S. degrees in computer engineering from San Jose State University. He taught CUDA at San Jose State University during his senior year and has researched TCP covert channel detection for NSF, STC, and TRUST.

We'll demonstrate how Python and the Numba JIT compiler can be used for GPU programming that easily scales from your workstation to an Apache Spark cluster. Using an example application, we'll show how to write CUDA kernels in Python, compile and call them using the open source Numba JIT compiler, and execute them both locally and remotely with Spark. We'll also describe techniques for managing Python dependencies in a Spark cluster with the tools in the Anaconda Platform. Finally, we'll conclude with some tips and tricks for getting the best performance when doing GPU computing with Spark and Python.
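
A minimal sketch of the pattern (assuming Numba and a CUDA-capable GPU on each executor; the SAXPY kernel and the `sc` setup are illustrative stand-ins, not the speakers' example application):

```python
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    i = cuda.grid(1)
    if i < out.size:
        out[i] = a * x[i] + y[i]

def gpu_map(iterator):
    # Runs on each Spark executor; NumPy arrays are copied to and from
    # the GPU automatically by Numba.
    for x, y in iterator:
        out = np.empty_like(x)
        tpb = 128
        saxpy[(out.size + tpb - 1) // tpb, tpb](2.0, x, y, out)
        yield out

# Locally: list(gpu_map([(np.ones(1000), np.ones(1000))]))
# On a cluster, with a SparkContext `sc` and an RDD of (x, y) pairs:
# results = rdd.mapPartitions(gpu_map).collect()
```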

Level: Intermediate
Type: Talk
Tags: Programming Languages; Tools & Libraries; Big Data Analytics

Day: Tuesday, 04/05
Time: 15:30 - 16:20
Location: Room 211B

S6439 - GPU-Oriented Sparse Multifrontal QR Method

Wissam Sid-Lakhdar Postdoctoral Research Associate, Texas A&M University
Wissam Sid-Lakhdar is a postdoc at Texas A&M University, working with Professor Tim Davis. He did his Ph.D. at ENS Lyon in France, under the supervision of Dr. Jean-Yves L'Excellent, on "Scaling the solution of large sparse linear systems using multifrontal methods on hybrid shared-distributed memory architectures." His current interests concern the solution of sparse linear systems through direct methods on GPUs.

We'll present a sparse direct method: a multifrontal QR factorization designed specifically for GPU accelerators. Our approach relies on a bucket scheduler that exploits irregular parallelism at both a coarse grain, among a set of fronts with different characteristics, and a fine grain, through exploitation of the staircase shape of these fronts. The scheduler then relies on dense GPU kernels whose design and implementation target recent GPU architectures.

Level: Intermediate
Type: Talk
Tags: Algorithms; Performance Optimization; Tools & Libraries

Day: Tuesday, 04/05
Time: 15:30 - 15:55
Location: Marriott Salon 3

S6467 - Training My Car to See: Using Virtual Worlds

Antonio M. López Principal Investigator & Associate Professor, Computer Vision Center & Universitat Autònoma de Barcelona
Antonio Lopez is the head of the Advanced Driver Assistance Systems (ADAS) Group of the Computer Vision Center, and associate professor of the Computer Science Department, both at the Universitat Autonoma de Barcelona (UAB). In 1996, Antonio participated in the foundation of the Computer Vision Center at the UAB, where he has held different institutional responsibilities. Antonio has been principal investigator of numerous public and industrial projects, and is a co-author of a large number of top journal and conference papers. His research interests are vision-based object detection, semantic segmentation, domain adaptation, and computer graphics for training visual models. These topics are seen as key technologies to be applied in ADAS and autonomous driving.

Learn how realistic virtual worlds can be used to train vision-based classifiers that operate in the real world, i.e., avoiding the cumbersome task of collecting ground truth by manual annotation. Many vision-based applications rely on classifiers trained with annotated data. We avoid manual annotation by using realistic computer graphics (e.g. video games). However, the accuracy of the classifiers drops because virtual (training) and real (operation) worlds are different. We overcome the problem using domain adaptation (DA) techniques. In the context of vision-based driver assistance and autonomous driving, we present our DA experiences using classifiers based on both handcrafted features and CNNs. We show how GPUs are used in all the stages of our training and operation paradigm.

Level: Beginner
Type: Talk
Tags: Self-Driving Cars & Automotive ; Computer Vision & Machine Vision; Deep Learning & Artificial Intelligence

Day: Tuesday, 04/05
Time: 15:30 - 15:55
Location: Room LL21E

S6502 - GPU Accelerated Virtual Cell Biology and SIMD Enhanced High Throughput Computational Biology

Narayan Ganesan Assistant Professor, Stevens Institute of Technology
Narayan Ganesan is an assistant professor of electrical and computer engineering at the Stevens Institute of Technology, Hoboken, New Jersey. He received his Ph.D. from Washington University in St. Louis in 2006. He later worked on designing novel compute architectures using massively parallel processors and reconfigurable hardware for scientific computing. His research interests include designing efficient algorithms and computing architectures for handling big-data and big-computation problems in molecular dynamics, computational biology, bioinformatics and healthcare notification systems.
Hanyu Jiang Research Assistant, Stevens Institute of Technology
Hanyu Jiang is a Ph.D. student in computer engineering at the Stevens Institute of Technology, Hoboken, New Jersey. He received his B.S. in control science and engineering from Harbin Institute of Technology, Harbin, China, in 2012, and an M.E. in computer engineering from the Stevens Institute of Technology in 2014. His current research interests include heterogeneous and parallel computing, multi-core processor architecture, bioinformatics, and big data analytics.

We'll present GPU-enabled virtual cell biology (VCB) and explore the advantages of SIMD along with SIMT execution to boost the performance of computational biology applications. We'll first present a GPU-based whole cell simulation framework to study complex biological pathways via a scalable model with millions of agents. This simulates a multitude of intracellular reactions from which the overall cellular function emerges, thus serving as a virtual computational microscope. We'll then explore embedding SIMD instructions as the inner-most tier in a multi-tiered parallel framework to obtain a performance boost. This method was employed to accelerate all stages of the HMMER3 pipeline to gain an order of magnitude increase in performance over highly optimized CPU and non-SIMD-based GPU implementations.

Level: Intermediate
Type: Talk
Tags: Computational Biology; Algorithms

Day: Tuesday, 04/05
Time: 15:30 - 15:55
Location: Marriott Salon 5

S6534 - Exciting Practical Applications of Scalable Deep Learning and Image Recognition in the Cloud

Georgi Kadrev CEO, Imagga Technologies
Georgi Kadrev is co-founder and CEO of Imagga Technologies (http://imagga.com), one of the companies pioneering the image-recognition-as-a-service model, offering a highly scalable cloud API to businesses and developers. Georgi graduated with an M.S. in technology entrepreneurship from Sofia University in 2009 and is currently an assistant professor and Ph.D. student in the Software Engineering department, specializing in practical deep learning for image recognition. While leading Imagga, Georgi has won multiple technology, innovation, and entrepreneurship awards, most recently the best company award in the "Technology For The Big Players" track at South Summit, Madrid, October 2015.

We'll demonstrate how scalable image recognition based on deep-learning can greatly contribute to business cases varying from advertising and user profiling to content management and cloud services. We'll also discuss the technical challenges of providing scalable image recognition capable of handling huge loads of images, instant feedback loops, and customer-specific recognition tasks, and how we've addressed them using GPUs in the cloud. Ultimately, you'll benefit from our experience handling 80+ different practical cases and dive deep into the most exciting ones.

Level: All
Type: Talk
Tags: Computer Vision & Machine Vision; Big Data Analytics; Deep Learning & Artificial Intelligence

Day: Tuesday, 04/05
Time: 15:30 - 15:55
Location: Room 210F

S6647 - Maxwell Render Meets the GPU

Juan Cañada Head of Maxwell Render Technology, Next Limit Technologies
Juan Canada joined Next Limit to work on research projects, later moving to the newly formed Maxwell Render research team. Since then, Juan has held several positions in the team, leading it since 2007. He holds a B.S. in mechanical engineering and a degree in environmental sciences. Outside the office, Juan used to describe himself as an acceptable guitar player, although his skills have deteriorated since the birth of his beautiful daughter. To try to stop himself thinking about rendering all the time, he is an avid scuba diver and underwater photographer, although sometimes, when he looks at how light behaves under the sea, he realizes how much work we have left to do!

We'll take you through the challenges the Next Limit team have overcome to create a GPU version of their highly acclaimed Maxwell Render engine. Maxwell Render was the first unbiased, spectral, physically based render engine on the market (2004) and due to its impeccable quality, serves as a ground truth reference for many. We'll reveal some of the technology behind our new GPU-based renderer, taking a look at both its advantages and limitations -- and you will of course be one of the first to see it running live! We'll show you how the GPU version has maintained the core qualities of the current CPU engine -- enabling users to create images from 3D scenes in a reliable and predictable way but with one key difference: faster than ever.

Level: Intermediate
Type: Talk
Tags: Rendering & Ray Tracing; Real-Time Graphics; Algorithms

Day: Tuesday, 04/05
Time: 15:30 - 15:55
Location: Room LL21B

S6655 - Towards a High Performance Analytics and Computing Platform for Brain Research

Dirk Pleiter Group leader, Forschungszentrum Juelich
Dirk Pleiter is a research group leader at the Julich Supercomputing Centre (JSC) and professor of theoretical physics at the University of Regensburg. At JSC he leads the work on application-oriented technology development. Currently, he is principal investigator of the POWER Acceleration and Design Center, a center jointly run by IBM, JSC, and NVIDIA. He has played a leading role in several projects for developing massively parallel special-purpose computers, including several generations of QPACE. Dirk is author or co-author of more than 170 scientific papers, conference contributions, and book chapters in the areas of theoretical high-energy physics and computer science. Forschungszentrum Julich was one of the first academic institutions to join the OpenPOWER Foundation.

Understanding and modeling the human brain continues to be one of the biggest challenges in research. The Human Brain Project is a European flagship project that is in the process of creating a research infrastructure to facilitate this research. Many research topics in this field require scalable compute resources or the ability to process extreme-scale data volumes (in some cases, both). Examples are approaches to simulating the network of a human brain in its full complexity and efforts to create high-resolution brain atlases. GPUs already play an important role in realizing the necessary computational capabilities. We'll give an overview of the efforts to build a high-performance analytics and computing platform for brain research.

Level: All
Type: Talk
Tags: Big Data Analytics; Supercomputing & HPC

Day: Tuesday, 04/05
Time: 15:30 - 15:55
Location: Room 210E

S6717 - Discovering the New Frontier of Shadertoy

Pol Jeremias Co-Founder, Beautypi
Highly-Rated Speaker
Pol Jeremias is passionate about technology and art. He grew up near Barcelona and moved to California in 2006. Since then, Pol has researched computer graphics and worked on multiple games for companies such as LucasArts and SoMa Play. Today, he helps create movies at Pixar Animation Studios. In his spare time, he co-founded Shadertoy.com and Beautypi. When he is not programming, you will find him running, reading, or watching movies.
Inigo Quilez Co-Founder, Beautypi
Highly-Rated Speaker
Inigo Quilez grew up enjoying the mountains, snow, and sea, but also programming fractals and graphics algorithms. After finishing his Master's degree in electrical engineering, and later working professionally in virtual reality and real-time rendering of massive data sets in Belgium for six years, he moved to the US to work at Pixar Animation Studios. There he spent five years creating procedural vegetation and landscapes for the movies, from research and tools creation to doing the required shot work in production. Inigo joined the Oculus Story Studio at the end of 2014, where he now works on bridging the worlds of real-time rendering, filmmaking, and virtual reality. In his spare time, Inigo co-founded the website Shadertoy, to which he regularly contributes content.

In this session, the Shadertoy.com creators will change the way you think about fragment shaders. During the last three years, the Shadertoy community has been trying to answer one question: What can a fragment shader do? The results have been mind-blowing. People from all over the world collaborated to break the limits of what was possible: procedural content, incredible raymarching tricks, VR or even GPU generated music. This year there is a new challenge: What can multiple fragment shaders do? Imagine fragment shaders could talk to each other and finally build complex algorithms running on the browser using your GPU. Join the Shadertoy creators to discover new ideas and techniques to create fragment shaders that are games, progressive path tracers, sorting algorithms, demos and more.

Level: Intermediate
Type: Talk
Tags: Media & Entertainment; Real-Time Graphics; Virtual Reality & Augmented Reality

Day: Tuesday, 04/05
Time: 15:30 - 16:20
Location: Room LL21C

S6230 - Hierarchical Computations on Manycore Architectures

Hatem Ltaief Senior Research Scientist, Extreme Computing Research Center, KAUST
Highly-Rated Speaker
Hatem Ltaief is a senior research scientist in the Extreme Computing Research Center at KAUST, where he advises several KAUST students in their M.S. and Ph.D. research. Hatem received his engineering degree from Polytech Lyon at the University of Claude Bernard Lyon I, France, an M.S. in applied mathematics at the University of Houston, and a Ph.D. degree in computer science from the University of Houston. From 2008 to 2010, he was a research scientist in the Innovative Computing Laboratory in the Department of Electrical Engineering and Computer Science at the University of Tennessee, Knoxville. He is part of the European Exascale Software Initiative (EESI) to build a European vision and roadmap to address the challenges of the new generation of massively parallel systems. He has various strategic partnerships with industry (Saudi Aramco, Intel, NVIDIA) as well as universities and HPC centers (University of Tennessee, INRIA Bordeaux, L'Observatoire de Paris, Barcelona Supercomputing Center). He is the author or co-author of 40 journal/conference papers and book chapters. His research interests include parallel numerical algorithms, parallel programming models, and performance optimizations for multicore architectures and hardware accelerators.

Learn about a new hierarchical matrix structure for fast linear algebra computations on GPUs! Recursion, tree traversal, hierarchical data layout, and batched kernel executions are some of the ingredients of a new HPC recipe for computing challenging linear algebra operations and solving large scientific problems (e.g., spatial statistics) on GPUs. By exploiting low-rank matrix representations, the original dense matrix of the problem can be approximated, which reduces the memory footprint and the algorithmic complexity while still maintaining adequate solution accuracy. In addition, the talk showcases a new high-performance hierarchical symmetric eigensolver and SVD, juicing the horsepower out of multiple GPUs to the fullest.
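
To make the low-rank idea concrete, here is a small NumPy sketch (sizes, kernel, and tolerance are invented for illustration): a numerically low-rank block is replaced by two thin factors, trading O(mn) storage and work for O(k(m+n)).

```python
import numpy as np

def low_rank_approx(block, tol=1e-6):
    U, s, Vt = np.linalg.svd(block, full_matrices=False)
    k = int(np.sum(s > tol * s[0]))      # numerical rank at relative tolerance
    return U[:, :k] * s[:k], Vt[:k, :]   # store two thin factors, not the block

# A smooth kernel evaluated on two well-separated clusters is numerically
# low rank, the typical structure of hierarchical-matrix off-diagonal blocks.
m, n = 512, 256
x = np.linspace(0.0, 1.0, m)[:, None]
y = np.linspace(10.0, 11.0, n)[None, :]
block = 1.0 / (x - y)

A, B = low_rank_approx(block)
err = np.linalg.norm(block - A @ B) / np.linalg.norm(block)
print(f"rank {A.shape[1]}, rel. error {err:.1e}, {A.size + B.size} vs {block.size} entries")
```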

Level: Intermediate
Type: Talk
Tags: Algorithms; Performance Optimization; Tools & Libraries

Day: Tuesday, 04/05
Time: 16:00 - 16:25
Location: Marriott Salon 3

S6279 - Visual Feature Learning from Web Images and Click Log

Chen Fang Research Scientist, Adobe Systems
Chen Fang is a research scientist at Adobe Research. His interests include image recognition, image retrieval, deep learning, and large-scale machine learning. He obtained his Ph.D. from the computer science department at Dartmouth College.

Visual feature learning is a fundamental problem in computer vision. Existing solutions rely on deep learning and large-scale labeled datasets. However, it is often labor intensive and time consuming to collect such datasets. On the other hand, the internet offers raw visual data, i.e., images and videos, at massive scale, along with the associated user behavior data, e.g., click logs. We'll present a novel framework to learn visual features from such data, which completely forgoes the need for labeled datasets. We apply the proposed framework and its variants to two kinds of web data: images on a social website and their view history, and the search log of a commercial image search engine. High-quality visual features are learned in both cases.

Level: All
Type: Talk
Tags: Computer Vision & Machine Vision; Deep Learning & Artificial Intelligence

Day: Tuesday, 04/05
Time: 16:00 - 16:25
Location: Room 210F

S6281 - Accelerating Science Platforms for Machine Learning, Big Data, and Earth System Science

John Taylor Leader, Computational and Simulation Science, CSIRO
Dr. John Taylor currently leads CSIRO Data61 Science Platforms. John has written more than 140 articles and books on computational and simulation science, climate change, global biogeochemical cycles, air quality, and environmental policy, from the local to the global scale, spanning science, impacts, and policy. His research has been widely cited and attracted significant media attention. John has worked as a computational scientist and group leader both at the Mathematics and Computer Science Division, Argonne National Laboratory and at the Atmospheric Science Division at Lawrence Livermore National Laboratory. John was senior fellow in the Computation Institute at the University of Chicago. He has served on the Advisory Panel of the Scientific Computing Division of U.S. National Center for Atmospheric Research (NCAR) and the U.S. National Energy Research Scientific Computing Center NUGEX Advisory Committee. John currently serves on the board of the National eResearch Collaboration Tools and Resources (NeCTAR), a federal government super science initiative. He is a fellow of the Clean Air Society of Australia and New Zealand.
John Zic Principal Research Scientist, CSIRO
John Zic is a principal research scientist at CSIRO and leads the Dependable Systems research team. He is the Standards Australia Chair of Committee IT-038 "Cloud Computing", which has actively contributed to the new ISO standards 17788 (Cloud Computing: Overview and Vocabulary) and 17789 (Cloud Computing: Reference Architecture). Previously, John has participated in the NSW Government Information Security Working Group and was an expert evaluator for the EU Framework 7 Call 8 Object 10.4 "Trustworthy ICT" in February 2012. He has given many invited presentations and keynotes: the inaugural MIT Kerberos and Internet of Trust (MIT-KIT) conference in 2014; the EU FP7 Program "Building International Co-operation" in Brussels (2011); INCO-TRUST workshop in New York (2010); Kerberos Consortium Conferences at MIT (2010 and 2011); the Vanguard/TTI CyberINsecurity Conference in 2010. He was a member of the 11th Joint EU and Australian Science and Technology Cooperation Committee in 2010. Academically, John has published in the area of trustworthy and dependable systems since 1990.
Oliver Obst Data Mining Team Leader, CSIRO Data61
Oliver Obst is a senior research scientist, leader of the CSIRO Data61 Big Data Platform project, and leader of the Data Mining Research Team at the Commonwealth Scientific and Industrial Research Organisation (CSIRO) in Sydney, Australia. In his work, he solves practical problems around making sense of data in industrial projects—for example, finding patterns, or detecting anomalies, but also selecting informative features or placing sensors for classification, prediction, or to make the best decisions based on past experience.

At CSIRO Data61, we're building the next generation of science platforms that exploit GPU computing to dramatically accelerate the time to discovery and the pace of innovation in science and industry. Scientific applications routinely generate huge amounts of data. In response to these trends, we've developed and deployed a new breed of GPU-accelerated big data technologies, earth system modeling tools, and machine learning capabilities. We'll present examples of our work in big data analytics, earth system modeling, and deep learning that clearly demonstrate the value that GPU computing can deliver to research organisations and industry. CSIRO has been at the forefront of GPU computing since 2009 and was one of the first NVIDIA CUDA Research Centers.

Level: All
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Earth System Modelling; Big Data Analytics; Astronomy & Astrophysics

Day: Tuesday, 04/05
Time: 16:00 - 16:25
Location: Room 210G

S6283 - GPU 201: Use Case Acceleration of a DNA Sequencing Analysis Pipeline

Charles Seberino Principal Software Performance Engineer, Complete Genomics
Charles (Chuck) Seberino works at Complete Genomics, Inc. (CGI), where his primary responsibility is algorithm development and GPU acceleration of high throughput DNA sequencing data. He has actively developed software for graphics, visual simulation, and GPGPU applications for over 20 years. Previously, he worked for government, defense, and robotics companies including Raytheon Missile Systems and Silicon Graphics. Chuck is refocusing his GPU and HPC expertise into the life sciences space, where he is pursuing an M.S. in bioinformatics at Stanford University. He holds a degree in electrical engineering from the University of Arizona.

So you can write a kernel, but can you take your app and get good performance out of GPUs without going to extremes? This talk takes you through a real-world example, starting from a modestly optimized CPU application and accelerating it using CUDA. This use case has a number of clear-cut sections that are easily parallelized (NPP, cuBLAS, Thrust), but it also contains computations less suited for GPUs. Learn how CPU and GPU processing can coexist by leveraging their strengths. The current application scales to over 28 threads spanning four GPUs. Comparisons will be shown for one or more GPUs on Tesla K40 and K80 cards, as well as the GeForce GTX TITAN X. Topics covered include splitting work across multiple threads, streams, and GPUs; examining the use of stream callbacks; and improving GPU optimization.
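
As a hedged sketch of one of those topics, splitting work across host threads and GPUs, here is the one-thread-per-device pattern written in Python with Numba (the toy kernel and sizes are placeholders, not the speaker's pipeline):

```python
import threading
import numpy as np
from numba import cuda

@cuda.jit
def scale(data, factor):
    i = cuda.grid(1)
    if i < data.size:
        data[i] *= factor

def worker(device_id, chunk):
    cuda.select_device(device_id)           # bind this host thread to one GPU
    d_chunk = cuda.to_device(chunk)         # H2D copy on that device
    tpb = 256
    scale[(chunk.size + tpb - 1) // tpb, tpb](d_chunk, 2.0)
    d_chunk.copy_to_host(chunk)             # write results back into the slice

data = np.arange(1 << 20, dtype=np.float64)
chunks = np.array_split(data, len(cuda.gpus))   # one contiguous slice per GPU
workers = [threading.Thread(target=worker, args=(i, c))
           for i, c in enumerate(chunks)]
for w in workers: w.start()
for w in workers: w.join()
```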

Level: Intermediate
Type: Talk
Tags: Performance Optimization; Computational Biology

Day: Tuesday, 04/05
Time: 16:00 - 16:50
Location: Room 212A

S6314 - Java Image Processing: How Runtime Compilation Transforms Memory-Bound into Compute-Bound

Florent Duguet Founder, ALTIMESH
Florent Duguet founded Altimesh in 2008, in an effort to reduce the learning curve of GPU computing for high-level language developers. The outcome is the Hybridizer, which enables many-core computing in high-level programming environments such as .NET and Java. Florent graduated with a Ph.D. in computer graphics in 2005. He has implemented solutions for the financial services and oil and gas industries with a focus on GPGPU since 2007, starting from proofs of concept and leading up to production.

A wide variety of image processing algorithms are naturally parallel. However, depending on the filter size or neighborhood search pattern, memory access is critical for performance. We'll show how loop reordering and memory locality fine-tuning help achieve the best performance. Using Hybridizer to automate the transformation of Java bytecode into CUDA source code, and using the new CUDA runtime compilation feature, we transformed execution from memory-bound to compute-bound. Applying this technique to oil and gas image processing algorithms results in interactive response times on production-size datasets.
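
The memory-locality point generalizes well beyond Java and CUDA; a trivial NumPy timing sketch (the array size is arbitrary) shows how much traversal order alone can matter for a memory-bound reduction:

```python
import time
import numpy as np

img = np.random.rand(4096, 4096)      # row-major (C order) by default

t0 = time.perf_counter()
r = sum(float(img[i].sum()) for i in range(img.shape[0]))     # contiguous rows
t1 = time.perf_counter()
c = sum(float(img[:, j].sum()) for j in range(img.shape[1]))  # strided columns
t2 = time.perf_counter()

print(f"rows {t1 - t0:.3f}s  columns {t2 - t1:.3f}s  equal: {np.isclose(r, c)}")
```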

Level: All
Type: Talk
Tags: Performance Optimization; Energy Exploration; Video & Image Processing

Day: Tuesday, 04/05
Time: 16:00 - 16:25
Location: Marriott Salon 1

S6331 - Running Multiple Workloads on a GPU: A UX Oriented Approach

Yuval Sarna Graphics Software Expert, GameFly Streaming
Yuval Sarna is a graphics software expert at GameFly Streaming, where he focuses on solving game performance issues and developing new algorithms to drive the game streaming engine. Yuval is a graduate of Tel-Aviv University and holds a B.S. in computer science. Before his studies, he developed a real-time 3D engine that was later used during his military service, where he worked on various other 3D projects.

With the increased usage of cloud computing for GPU-demanding workloads, the need to share the GPU between applications increases. We'll present an approach that improves the user experience when running GPU-demanding workloads concurrently. We'll take you through an overview of the existing GPU scheduling schemes used in Windows OS and present a new approach to better utilize the compute resources and achieve maximal cost efficiency.

Level: Intermediate
Type: Talk
Tags: Game Development; Performance Optimization

Day: Tuesday, 04/05
Time: 16:00 - 16:50
Location: Room 212B

S6354 - Unleashing the Performance Potential of GPU for Atmospheric Dynamic Solvers

Haohuan Fu Associate Professor, Tsinghua University
Haohuan Fu is an associate professor at the Center for Earth System Science at Tsinghua University. His research interests include HPC in earth and environmental sciences, computer architectures, performance optimizations, and programming tools in parallel computing. Haohuan has a Ph.D. in computing from Imperial College London. He is a member of IEEE.

We'll demonstrate our efforts in developing highly efficient solvers for atmospheric dynamics on GPU platforms. Besides general optimizations for GPU-based scientific computing applications, we apply optimization strategies specifically customized for atmospheric dynamic solvers. We'll show that by combining both algorithmic and architectural considerations, our optimizations improve computational efficiency from the original 2.24% to around 16% at the peak, with a sustained double-precision performance of 1.04 Tflops within one CPU-GPU node. We think this work demonstrates a huge potential for performing more efficient climate modeling on GPU platforms.

Level: Advanced
Type: Talk
Tags: Earth System Modelling; Performance Optimization; Supercomputing & HPC

Day: Tuesday, 04/05
Time: 16:00 - 16:25
Location: Room 211A

S6446 - A Fully Automated, High Performance Geolocation Improvement Workflow for Problematic Imaging Systems

Devin White Senior Research Scientist, Oak Ridge National Laboratory
Dr. Devin White is a senior research scientist at Oak Ridge National Laboratory and is a subject matter expert in the areas of quantitative social science, modeling complex adaptive systems, social network analysis, high performance computing, tactical airborne and spaceborne geopositioning, uncertainty propagation and analysis, image science, computer vision, multimodal image registration, data fusion, data visualization, imaging spectroscopy, lidar, SAR, and Earth observing systems. Devin is also a joint faculty professor of anthropology at the University of Tennessee, Knoxville. He previously served as a lead scientist for Exelis Visual Information Solutions and a scientist at Integrity Applications Incorporated, supporting large commercial and government customers.
Sophie Voisin Geospatial Software Engineer, Oak Ridge National Laboratory
Dr. Sophie Voisin is an engineer at Oak Ridge National Laboratory developing high performance computing methods for geospatial data analysis for the GIST group. She received her Ph.D. in computer science and image processing from the Universite de Bourgogne (France) in 2008 and joined ORNL in 2010 to work on numerous image processing related projects, successively performing quantitative analysis of neutron 2D and 3D image data; developing new techniques for eye-gaze data analysis, for which she is a co-recipient of an R&D 100 award (2014); and now implementing multidimensional image processing algorithms on GPU platforms for high performance computing analysis of satellite imagery.

Learn how hybrid CPU-GPU parallelization is being used to support rapid improvement of the geolocation accuracy of imagery collected by multiple airborne and spaceborne platforms. A sensor-agnostic, plugin-based framework with CUDA-enabled workflows was built to support photogrammetric and computer-vision processing tasks like image registration and orthorectification. Leveraging the complementary strengths of multicore CPUs and multiple Tesla K80 GPUs on each compute node required significant custom development to achieve optimal performance. We dramatically reduced per-image processing time and can handle multiple data streams simultaneously. The science behind two workflows will be presented, along with their performance metrics while executing on both bare-metal and virtual machines.

Level: Intermediate
Type: Talk
Tags: Video & Image Processing; Supercomputing & HPC; Aerospace & Defense

Day: Tuesday, 04/05
Time: 16:00 - 16:50
Location: Room 210E

S6458 - A GPU-Based Cloud Speech Recognition Server for Dialog Applications

Alexei V. Ivanov CTO, Verbumware Inc.
Alexei Ivanov has a background in engineering and computer science. He received his Ph.D. in theoretical foundations of computer science in 2004 from Belarusian State University of Informatics and Radioelectronics. He holds an M.S. in electrical engineering from Moscow Institute of Physics and Technology (State University). He has worked in both academia (University of Trento, Moscow Institute of Physics and Technology) and industry (Pearson Knowledge Technologies, USA; Speech Technology Center, Russia; Lernout & Hauspie Speech Products NV, Belgium). Alexei has broad experience in speech processing and recognition systems. His current research interests include adaptive conversational machines; web-integration of individual multimedia experiences; speech characterization technology; and integration of para-linguistic knowledge into the process of speech recognition and interpretation.

We'll show that GPUs enable a successful solution for difficult applications such as speech recognition in dialog interactions. Dialog interactions are problematic because of the high variability of spontaneous speech and processing time constraints. Remote interactions impose telecommunication constraints, i.e., narrowband and compressed signal representations and limited spoken context available for adaptation. Mass service requires acoustic model adaptation to regional accents. We conduct our experiments with speech from non-native speakers of English as an extreme case of accented speech. Our GPU-based system exhibits high accuracy with processing speeds faster than the natural speaking pace. The latency of the speech recognizer is below that required for user satisfaction.

Level: All
Type: Talk
Tags: Signal & Audio Processing; Algorithms; Data Center & Cloud Computing; Deep Learning & Artificial Intelligence

Day: Tuesday, 04/05
Time: 16:00 - 16:25
Location: Marriott Salon 2

S6473 - Lift: Primitives for Hybrid CPU/GPU Parallel Programming

Nuno Subtil Principal Software Engineer, Genia
Nuno Subtil is a principal software engineer at Genia and the lead developer for the company's GPU-accelerated pipeline for primary analysis in DNA sequencing. Before joining Genia, Nuno was part of the NVBIO team at NVIDIA, focused on high-performance parallel algorithms for bioinformatics. Prior to that, he worked on low-level graphics system software for mobile and desktop platforms at NVIDIA. He also did research on physically based image synthesis at TU Wien and on accelerating computer vision algorithms at Deutsche Telekom Laboratories during the early days of GPUs. Nuno holds a computer science degree from the University of Coimbra, Portugal.

Lift is a thin abstraction layer that hides some of the complexity of parallel programming. It provides a set of primitives analogous to (and entirely compatible with) NVIDIA's Thrust library, but designed around drastically simpler code, suitable for inclusion in large, complex projects which target NVIDIA GPUs or Intel CPUs. Lift is an open-source project under active development at Genia. It is the foundation for our primary analysis pipeline for DNA sequencing, as well as the foundation for Firepony, an open-source base quality score recalibrator for DNA sequencing data. We'll cover the motivation for Lift and the applications we're developing it for, and then explain how it works, what problems it solves, and what lessons we learned from prior experience with similar libraries.

Level: Intermediate
Type: Talk
Tags: Tools & Libraries; Computational Biology

Day: Tuesday, 04/05
Time: 16:00 - 16:25
Location: Marriott Salon 5

S6494 - Implementing NVIDIA GRID™ Solutions in Unmanned Vehicle Ground Control Systems

Nathan Wincey Systems Architect, Lockheed Martin, Mission Systems and Training
Nathan Wincey is a systems architect at the Lockheed Martin Unmanned Integrated Systems group within the Mission Systems and Training Division. Over the course of his 16 years with Lockheed Martin, he has worked as a systems engineer and architect on a variety of unmanned aerial vehicle (UAV) ground control systems.

This talk will present the journey Lockheed Martin Unmanned Integrated Systems has taken over the past several years in its effort to successfully virtualize the GPU for integration into its Unmanned Vehicle Ground Control Systems. During this effort, Lockheed Martin UIS has utilized direct GPU pass-through, Microsoft's RemoteFX, VMware's vSGA, and now NVIDIA GRID vGPU technologies to bridge this once impassable technology gap. This talk will provide an overview of the challenges faced during this effort, solutions to those challenges, benefits gained by staying open-architecture oriented, and SWaP improvements realized.

Level: All
Type: Talk
Tags: Graphics Virtualization; Aerospace & Defense

Day: Tuesday, 04/05
Time: 16:00 - 16:25
Location: Marriott Salon 4

S6538 - Experiences Using Tegra™ K1 and X1 for Highly Energy Efficient Computing

Andrew Haigh PhD student, Australian National University
Andrew Haigh is a Ph.D. student at Australian National University conducting research in high performance computing. He is interested in research on tools that make it easier to write code for complex, highly parallel modern HPC systems, in particular those incorporating GPU architectures.

We discuss our experiences in attempting to effectively utilize the low-power Tegra K1 and Tegra X1 devices for HPC workloads. We evaluate a number of techniques for concurrently exploiting all of these devices' available processing elements and show they are effective for improving performance and energy efficiency, though this leads to interesting trade-offs. We compare the results to current state-of-the-art high-end HPC systems. Our work leads to some insights into the difficulties of designing algorithms that are suited to both high-end and embedded systems. To this end, we suggest some ideas for performance modelling of such heterogeneous but still highly uniform platforms and how it can be used to write code that gives good performance across many types of systems.

Level: Intermediate
Type: Talk
Tags: Embedded; Supercomputing & HPC; Tools & Libraries; IoT

Day: Tuesday, 04/05
Time: 16:00 - 16:25
Location: Room LL20D

S6572 - A Universal Trajectory Generator for Automated Vehicles

Janek Hudecek Automotive Research Engineer, Forschungsgesellschaft Kraftfahrwesen Aachen mbH
Janek Hudecek is a senior engineer at Forschungsgesellschaft Kraftfahrwesen Aachen mbH. He received his Ph.D. in mechanical engineering at the Institute for Automotive Engineering in 2012.

A universal, real-time-capable trajectory generator for highly automated vehicles, implemented as a nonlinear model predictive controller (NMPC), is presented. Its main target is to serve as the central instance for all high-level ADAS or automated-vehicle functions, thereby abstracting vehicle-dependent kinematics and dynamics. The trajectory planner is capable of the combined optimization of lateral and longitudinal dynamics in urban, rural, and highway scenarios. One of the major challenges, besides a stable system layout, is the fast solution of the embedded optimal control problem. For this, a bespoke GPU-optimized implementation was developed; details about this implementation, beyond the planner itself, will be presented.

Level: All
Type: Talk
Tags: Self-Driving Cars & Automotive

Day: Tuesday, 04/05
Time: 16:00 - 16:25
Location: Room LL21E

S6670 - Toward Bridging the Gap Between High Quality and High Performance for HPC Visualization

Robert Sisneros Technical Program Manager: Data Analysis and Visualization, National Center for Supercomputing Applications
Robert Sisneros manages NCSA's Data Analysis and Visualization Group. This group is tasked with supporting science teams utilizing NSF HPC resources as well as furthering the state of scientific visualization through cutting edge research. As a senior member of the Blue Waters Project, Robert's research interests in I/O and visualization are primarily aligned with issues of particular importance to high performance computing. These include: in situ visualization, data models and representations, parallel analysis algorithms, I/O parameter optimization, and "big data" analytics. Robert earned the degrees of Bachelor of Science in Mathematics and Computer Science from Austin Peay State University and the degrees of Master of Science and Doctor of Philosophy in Computer Science from the University of Tennessee in Knoxville.

I will discuss how the current standard for large-scale science is becoming obsolete, and how this is creating a gap between high-quality graphics and high performance visualization. I'll introduce the recent work of integrating NVIDIA IndeX™ with ParaView, work that I see as directly addressing the aforementioned gap.

Level: All
Type: Talk
Tags: In-Situ and Scientific Visualization; Supercomputing & HPC

Day: Tuesday, 04/05
Time: 16:00 - 16:25
Location: Room LL21D

S6672 - Training and Deploying Deep Neural Networks for Speech Recognition

Bryan Catanzaro Senior Researcher, Baidu Research
Highly-Rated Speaker
Bryan Catanzaro is a research scientist at Baidu's Silicon Valley AI Lab, where he leads the systems team. His research is focused on efficient tools and methodologies for training and deploying large deep neural networks. Before joining Baidu, Bryan was involved in popularizing GPUs for machine learning while working at NVIDIA, including the creation of cuDNN. Bryan received his Ph.D. from the University of California, Berkeley, where he wrote the first support vector machine training library to run on graphics processors, and created Copperhead, a Python-based DSL for parallel programming.

Training and deploying deep neural networks for speech recognition is very computationally intensive. I'll discuss how we have made our training process scale efficiently to many GPUs, as well as how we use GPUs to take our deep neural networks to users at scale through Batch Dispatch.
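
The batching idea can be sketched generically (an illustrative queue-and-batch loop, not Baidu's Batch Dispatch implementation; the stand-in infer function and all names are invented): requests are queued briefly and run through the network as one batch, trading a little latency for much higher GPU throughput.

```python
import queue
import threading
import time
import numpy as np

requests = queue.Queue()              # holds (input array, callback) pairs

def infer(batch):
    # Stand-in for one batched GPU forward pass over all queued inputs.
    return [float(x.sum()) for x in batch]

def dispatcher(max_batch=32, wait=0.01):
    while True:
        batch = [requests.get()]      # block until at least one request
        try:
            while len(batch) < max_batch:
                batch.append(requests.get(timeout=wait))
        except queue.Empty:
            pass                      # dispatch whatever arrived in time
        results = infer([x for x, _ in batch])
        for (_, callback), result in zip(batch, results):
            callback(result)          # hand each caller its own result

threading.Thread(target=dispatcher, daemon=True).start()
requests.put((np.ones(16), print))   # prints 16.0 once the batch is dispatched
time.sleep(0.1)                      # give the daemon thread time to run
```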

Level: Advanced
Type: Talk
Tags: Deep Learning & Artificial Intelligence

Day: Tuesday, 04/05
Time: 16:00 - 16:25
Location: Room 210D

S6129 - Parallel Low Rank and Cholesky Refactorization

Lung-Sheng Chien Software Engineer, NVIDIA
Lung-Sheng Chien is a software engineer at NVIDIA, working on the cuSOLVER and cuSPARSE libraries. Prior to NVIDIA, he was a Ph.D. student in the Department of Mathematics at National Tsing Hua University. He received his B.S. and M.S. from the Department of Computer Science at National Tsing Hua University in 2003 and 2005, respectively.

Attendees can learn how to use a low-rank update in a linear solver during a nonlinear process, for example, in linear programming, structural mechanics, and circuit simulation. A GPU-friendly version is proposed, based mainly on BLAS2 operations. Compared to traditional approaches, with BLAS2 operations we can hide instruction latency well and achieve the full bandwidth of a many-core processor. In this talk, we describe the basic idea of the low-rank update and show up to a 5x speedup from complexity analysis.
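
For intuition, here is the textbook scalar rank-1 Cholesky update in NumPy (the classic LINPACK-style loop): given A = L L^T, it produces the factor of A + x x^T in O(n^2) instead of refactorizing in O(n^3). This shows only the idea the talk builds on, not the BLAS2-based GPU formulation.

```python
import numpy as np

def chol_update(L, x):
    """Return the Cholesky factor of A + x x^T, given L with A = L L^T."""
    L, x = L.copy(), x.astype(float)
    n = x.size
    for k in range(n):
        r = np.hypot(L[k, k], x[k])            # Givens-like rotation per column
        c, s = r / L[k, k], x[k] / L[k, k]
        L[k, k] = r
        L[k+1:, k] = (L[k+1:, k] + s * x[k+1:]) / c
        x[k+1:] = c * x[k+1:] - s * L[k+1:, k]
    return L

A = np.random.rand(6, 6); A = A @ A.T + 6 * np.eye(6)   # SPD test matrix
x = np.random.rand(6)
L1 = chol_update(np.linalg.cholesky(A), x)
print(np.allclose(L1 @ L1.T, A + np.outer(x, x)))        # True
```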

Level: Intermediate
Type: Talk
Tags: Algorithms

Day: Tuesday, 04/05
Time: 16:30 - 16:55
Location: Marriott Salon 3

S6141 - Computer Vision Algorithm Acceleration Using GPGPU and the Tegra Processor's Unified Memory

Aaron Mosher Design and Analysis Engineer, The Boeing Company (Boeing Research & Technology)
Aaron Mosher is a design and analysis engineer at Boeing Research and Technology (BR&T). He is currently a team lead on multiple projects involving sensor processing, algorithms, and software. He has worked on autonomy and sensor processing technologies for both unmanned ground vehicles and unmanned air vehicles (UGV and UAV). He holds three patents on radar obstacle detection technology. His first experience with unmanned vehicles was a joint research project between Boeing, Carnegie Mellon University, and Duke University for the DARPA Grand Challenge unmanned ground vehicle competition in 2004. He has a B.S. in computer engineering from the University of Alabama in Huntsville, and an M.S. in systems engineering from Missouri Science and Technology. He has worked on a variety of projects, including ground vehicles, air vehicles, and communications systems.

We'll explain how the Tegra processor's GPGPU capabilities and Unified Memory enabled us to accelerate computer vision algorithms. Unified Memory has been available in CUDA before, but the Tegra K1 and X1 architectures implement Unified Memory at the physical layer. Our example computer vision algorithm detects objects in a scene and requires a high degree of computation on traditional desktop systems, making it an ideal candidate for GPGPU acceleration. We'll explain the difficulty involved in porting an existing algorithm to the Tegra, the challenges involved in taking advantage of GPGPU capabilities, and the advantages and disadvantages of Unified Memory on both the Tegra K1 and X1.
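
A small Python/Numba sketch of the programming-model difference (assuming a Numba version that provides cuda.managed_array; the threshold kernel is a placeholder): with managed memory, host and device touch the same buffer and no explicit copies appear in the code. On Tegra K1/X1 the sharing is physical, whereas on discrete GPUs the driver migrates pages behind the scenes.

```python
import numpy as np
from numba import cuda

@cuda.jit
def threshold(img, t):
    i = cuda.grid(1)
    if i < img.size:
        img[i] = 1.0 if img[i] > t else 0.0

img = cuda.managed_array(1 << 20, dtype=np.float32)  # visible to CPU and GPU
img[:] = np.random.rand(img.size)                    # host writes directly

tpb = 256
threshold[(img.size + tpb - 1) // tpb, tpb](img, 0.5)  # no explicit transfers
cuda.synchronize()                                # make results host-visible
print(float(img.sum()))                           # host reads directly
```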

Level: Intermediate
Type: Talk
Tags: Embedded; Computer Vision & Machine Vision; Robotics & Autonomous Machines; Aerospace & Defense; IoT

Day: Tuesday, 04/05
Time: 16:30 - 16:55
Location: Room LL20D

S6209 - A Look at Real World Performance Capabilities of NVIDIA GRID™ 2.0

Fred Devoir Senior Architect & Manager of IT Infrastructure, Textron Inc.
Fred is a Sr. Systems Architect and Manager of IT Infrastructure at Textron Inc. Fred has a wide variety of specialized and business systems experience with particular interests in integration and virtualization projects specifically centered around Virtual Desktop Infrastructure (VDI), graphics acceleration, and high performance computing clusters. His past experience includes 22 years of IT professional work experience as an IT Manager, Sr. Systems Analyst, Engineer, and Architect for Fortune 500 companies in the Aerospace/Defense, Engineering, Medical, and Pharmaceutical industries.
Luke Wignall Manager, GRID Performance Engineering, NVIDIA
Highly-Rated Speaker
Luke Wignall came to NVIDIA after working as an owner of an integrator/VAR, as a sales engineer, solution architect, consultant, and system administrator with both VMware and Citrix technologies in both public and private industry. An early evangelist of virtualization, Luke saw the ability to bring GPU to the end user experience as the missing "special sauce" that brings virtual desktops to the next level. Now managing the NVIDIA GRID Performance Engineering Lab, his focus is on performance and scalability to deliver the best value with the highest end user experience across all virtual workloads.

Join us for a technical dive into benchmarks and real workloads. How do real metrics stack up against what a benchmark tells you? Learn about the various performance characteristics, application tuning options, as well as hardware and platform design considerations.

Level: Intermediate
Type: Talk
Tags: Graphics Virtualization; Performance Optimization

Day: Tuesday, 04/05
Time: 16:30 - 16:55
Location: Marriott Salon 4

S6343 - Task-Based Dynamic Scheduling Approach to Accelerating NASA's GEOS-5

Eric Kelmelis CEO, EM Photonics
Highly-Rated Speaker
Eric Kelmelis is the co-founder and CEO of EM Photonics, a company focused on the development and transition of innovative research and technology in the fields of advanced imaging, high-performance computing, and embedded systems. Eric received B.S. and M.S. degrees in electrical engineering from the University of Delaware, has more than 60 publications, and holds two patents. He has also served as conference chair at SPIE's Defense, Security, and Sensing symposium since 2010.

As with many complex scientific computing applications, NASA's GEOS-5 climate modeling tool is computationally intense and can benefit from modern accelerated co-processing hardware. However, the burden of utilizing these new devices and achieving optimal results is placed on the same scientists responsible for developing the core algorithms and applying them to applications of interest. We'll present a task-based programming approach coupled with a dynamic scheduler. This allows the science of the software to be divorced from its implementation, both reducing the burden on the programmer and allowing the code to adapt to changes in hardware architecture. In collaboration with NASA's Goddard Space Flight Center, we show our results in applying this technique to GEOS-5.

Level: All
Type: Talk
Tags: Earth System Modelling; Tools & Libraries; Computational Physics

Day: Tuesday, 04/05
Time: 16:30 - 16:55
Location: Room 211A

S6378 - Simplifying Multi-GPU Communication with NVSHMEM

Nathan Luehr Developer Technology Engineer, NVIDIA
Nathan Luehr is a senior developer technology engineer for compute applications at NVIDIA. He earned a Ph.D. in theoretical chemistry from Stanford University in June 2015.
Sreeram Potluri Senior CUDA Software Engineer, NVIDIA
Sreeram Potluri received his Ph.D. in computer science and engineering from Ohio State University and is a senior software engineer at NVIDIA Corp. His research interests include high-performance interconnects, heterogeneous architectures, parallel programming models, and high-end computing applications.

We'll present an overview of the NVSHMEM multi-GPU programming model. NVSHMEM is an implementation of the OpenSHMEM standard for GPUs. By providing fine-grained communication primitives between GPU threads, NVSHMEM improves communication latencies and can greatly reduce the complexity usually associated with multi-GPU programming. Two application studies illustrate the utility of NVSHMEM: CoMD, a molecular dynamics mini-application, and HPGMG, a geometric multigrid solver.
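
A minimal sketch of the flavor of this model, assuming an OpenSHMEM-style API as in current NVSHMEM releases (nvshmem_init, nvshmem_malloc, nvshmem_float_p); the exact entry points in the version presented may differ. Each GPU thread communicates on its own, with no bulk message assembly on the host.

#include <nvshmem.h>
#include <cuda_runtime.h>

// Each thread writes one element directly into the next PE's copy of the
// symmetric buffer: fine-grained, thread-level communication.
__global__ void ring_shift(float* sym, int mype, int npes) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    nvshmem_float_p(&sym[i], (float)mype, (mype + 1) % npes);
}

int main() {
    nvshmem_init();
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    // Symmetric allocation: the buffer exists at the same address on every PE.
    float* sym = (float*)nvshmem_malloc(256 * sizeof(float));
    ring_shift<<<1, 256>>>(sym, mype, npes);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();   // remote puts are now visible on all PEs
    nvshmem_free(sym);
    nvshmem_finalize();
    return 0;
}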

Level: Intermediate
Type: Talk
Tags: Programming Languages; Tools & Libraries; Supercomputing & HPC

Day: Tuesday, 04/05
Time: 16:30 - 16:55
Location: Room 211B

S6388 - GPGPU Applications for Hydrological and Atmospheric Simulations and Visualizations on the Web

Ibrahim Demir Assistant Research Professor, University of Iowa
Ibrahim Demir develops web-based visualization and communication tools that make it easy to extract information from complex, large-scale geospatial environmental datasets. His work spans crowdfunding stream sensors, citizen science projects for collecting environmental data, adoption of environmental sensors by the public, crowdsourced flood predictions, a SIRI-like knowledge engine for flooding, flood warnings displayed in augmented reality, a virtual flood simulation on a tabletop, and experiments with novel devices like Google Glass, Leap Motion, and Microsoft Kinect for innovative scientific visualization and interaction projects.

Learn about general-purpose GPU computing applications that use web technologies to improve the speed of web-based scientific computing and visualizations. GPGPU is the use of a GPU to perform computation for operations other than graphics processing. WebCL defines a JavaScript binding to the OpenCL standard for parallel computing on the web. The presentation will include background information on scientific computing techniques on the web (e.g., WebGL, WebCL, Web Workers, ASM.js, SIMD.js), and sample applications from the hydrological and atmospheric sciences.

Level: Beginner
Type: Talk
Tags: In-Situ and Scientific Visualization; Big Data Analytics; Algorithms

Day: Tuesday, 04/05
Time: 16:30 - 16:55
Location: Room LL21D

S6490 - Collision Avoidance on NVIDIA Tegra®

Richard Membarth Senior Researcher, DFKI
Richard Membarth is a senior researcher at the German Research Center for Artificial Intelligence (DFKI). He holds a diploma degree in computer science from the University of Erlangen-Nuremberg and a postgraduate diploma in computer and information sciences from Auckland University of Technology. In 2013, he received his Ph.D. (Dr.-Ing.) from the University of Erlangen-Nuremberg for work on automatic code generation for GPU accelerators from a domain-specific language for medical imaging. After his Ph.D., Richard joined the Graphics Chair and the Intel Visual Computing Institute at Saarland University as a postdoctoral researcher. His research interests include parallel computer architectures and programming models, with a focus on automatic code generation.
Christoph Lauer Research Engineer, AUDI AG
Christoph Lauer is a research engineer at AUDI AG, Ingolstadt. He holds a diploma degree in computer science from the University of Erlangen-Nuremberg and a postgraduate diploma in computer and information sciences from Auckland University of Technology. In 2011, he received his Ph.D. (Dr.-Ing.) from the University of Erlangen-Nuremberg for work on model-based design of embedded safety control units.

D(r)ive deep into crash prediction in future automotive systems that allow the tracking of dozens of objects in real time by utilizing the processing power of embedded GPUs. We'll describe (1) the new possibilities for crash prediction systems in embedded systems that are only possible by taking advantage of recent developments of embedded GPUs, and (2) the implementation and optimization of such a system on the Tegra K1 utilizing AnyDSL, a framework for rapid prototyping of domain-specific libraries that targets NVVM and CUDA.

Level: All
Type: Talk
Tags: Self-Driving Cars & Automotive; Embedded; Performance Optimization; Robotics & Autonomous Machines

Day: Tuesday, 04/05
Time: 16:30 - 16:55
Location: Room LL21E

S6515 - Listen, Attend and Spell

William Chan Ph.D. Candidate, Carnegie Mellon University
William Chan is a Ph.D. candidate at Carnegie Mellon University in the Department of Electrical and Computer Engineering. William graduated with an M.S. in electrical and computer engineering from Carnegie Mellon University in 2013, and a B.S. in computer engineering in 2011 from the University of Waterloo. His past industry experience includes internships at Google, Amazon, Intel, NVIDIA, AMD, and TD Securities. His current research crosses the fields of machine learning, deep learning, and speech recognition.

Listen, Attend and Spell (LAS) was recently presented as a way to directly transcribe speech utterances to characters. Unlike traditional DNN-HMM models, these models learn all the components of a speech recognizer jointly. The LAS model has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs. The speller is an attention-based recurrent network decoder that emits characters as outputs. The network produces character sequences without making any independence assumptions between the characters. We'll describe a distributed asynchronous training platform for training such a model on an array of GPUs.

Level: All
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Signal & Audio Processing

Day: Tuesday, 04/05
Time: 16:30 - 16:55
Location: Room 210G

S6516 - Automatic Grading of Eye Diseases through Deep Learning

Apaar Sadhwani Ph.D. Candidate, Stanford University
Apaar Sadhwani is a Ph.D. candidate in Operations Research, Department of Management Science & Engineering at Stanford University. Prior to this, he obtained a B.Tech. in production engineering from Indian Institute of Technology, Delhi, and an M.S. in operations research from Stanford University. He has an extensive background in applied math, with research experience in probability theory. In his doctoral thesis, he pursues applications of mathematical models and deep learning to biometrics, healthcare, and finance. For example, his algorithms for biometrics are now used to authenticate more than 550 million users of the world's largest biometric program in India. He has significant interests in computer vision and has been a teaching assistant for machine learning, optimization, artificial intelligence, and algorithms at Stanford.

We'll outline the development of a state-of-the-art medical imaging system using novel deep architectures that harness GPUs for accelerated training. Trained using data from the Stanford Byers Eye Institute and the Palo Alto VA Hospital, our model grades the severity of eye diseases and localizes lesions to help screen eye patients in primary care. At the heart of this system lies our hybrid approach to deep learning for high-resolution images: a large convnet with millions of parameters trained on downsized images, fused with a net trained on selected tiles of the high-resolution image. This approach uses transfer learning, data augmentation, and multi-GPU systems to identify the small-scale features that are critical to detecting eye diseases.

Level: Intermediate
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Medical Imaging; Supercomputing & HPC

Day: Tuesday, 04/05
Time: 16:30 - 16:55
Location: Room 210D

S6541 - Augmented Reality for In-Vehicle Head-Up Displays

Christian Reinhard Senior Director, HMI, Elektrobit
Christian Reinhard is Senior Director, HMI, at Elektrobit (EB). Since 2012, he's led the department responsible for developing software solutions for industry-leading graphical, touch, and speech user interfaces for infotainment and instrument clusters, as well as industrial and medical applications. Christian studied computer science at Friedrich-Alexander-Universität in Erlangen and joined EB in 2001. During his career, he has held different positions at EB and was responsible for the development of navigation systems for the automotive and consumer markets before joining the HMI department.

Find out why technology like the augmented reality head-up display will be a critical component in autonomous vehicles, providing software-enabled features that will make safe autonomous driving possible. Infotainment systems and graphical interfaces are key differentiators for carmakers in the age of smartphones. Consumer electronics-inspired technologies and multimodal HMIs enriched with app-like content are becoming commonplace in vehicles, and HMI will play an increasingly important role in automated/autonomous driving. Its critical functionality will help transition control from the driver to the car and vice versa while taking into account the driver's status and distraction level. This will require HMI and driver-assistance software to work together more closely, across all screens in the car.

Level: Intermediate
Type: Talk
Tags: Self-Driving Cars & Automotive; Embedded; Virtual Reality & Augmented Reality; Product & Building Design

Day: Tuesday, 04/05
Time: 16:30 - 16:55
Location: Marriott Salon 1

S6566 - Heterogeneous Compute for Real-Time Image Processing Applications

Mark Davey HPC Lead Engineer, The Foundry
Mark Davey joined The Foundry in 2011, where he heads up the HPC group to bring device-agnostic image processing frameworks to a number of key products and plug-ins. Previously, Mark was a technology manager at Grandeye, a leading manufacturer of security solutions, where he helped create innovative, 360-degree cameras complete with sophisticated video analytics. Mark has also worked in fields as diverse as augmented reality surgery, 3D foetal ultrasound, and document analysis and classification. He obtained his physics degree from University College London.

We'll discuss work carried out at The Foundry on a heterogeneous image processing framework that utilizes all available CPU and GPU compute devices within a system. Complex graphs of processing effects can be authored in BLINK, a domain-specific language created in-house. By harnessing data parallelism, knowledge of transfer speeds, and device compute capabilities, we have developed a scheduling system for efficiently deploying workloads across all devices. The talk will give a brief overview of BLINK, how graphs of effects are authored, and the innovative use of our scheduling framework within a hybrid 3D rendering system for virtual production.

Level: Intermediate
Type: Talk
Tags: Media & Entertainment; Supercomputing & HPC

Day: Tuesday, 04/05
Time: 16:30 - 16:55
Location: Room LL21C

S6636 - GEM3: CPU-GPU Heterogeneous DNA Sequence Alignment for Scalable Read Sizes

Alejandro Chacon Ph.D. Student, Autonomous University of Barcelona
Alejandro Chacon is a fourth-year Ph.D. student researching HPC applied to bioinformatics in the department of Computer Architectures and Operating Systems (CAOS) at Universitat Autonoma de Barcelona, Spain. He received a B.E. in computer science and an M.S. in high performance computing and information theory in 2011 and 2012, respectively. He has been working with GPU architectures since his final degree project, and worked with the CUDA library team and research group as an NVIDIA summer intern in 2015. His research interests include bioinformatics, high performance computing, and parallel heterogeneous systems.

Sequence alignment is one of the most computationally intensive steps in current bioinformatics analysis pipelines. Previous attempts to implement it on GPUs have failed to efficiently manage the inherent massive parallelism of the problem. The obvious data-parallel strategy, in which each read sequence is processed independently by a different thread, is very irregular and leads to a very large memory footprint. We'll introduce a CPU-GPU heterogeneous algorithm designed for GEM, a high-quality and widely adopted aligner. It selects and packs regular work generated by the pipeline to be offloaded to multiple GPU devices; meanwhile, CPU cores handle the filtered divergent cases, allowing the read size to scale and improving the quality of results on new sequencing technologies.

Level: Intermediate
Type: Talk
Tags: Computational Biology; Performance Optimization; Supercomputing & HPC

Day: Tuesday, 04/05
Time: 16:30 - 16:55
Location: Marriott Salon 5

S6722 - Flexible Cluster Rendering with Quadro® VCA

Ankit Patel Senior Product Manager, NVIDIA
Ankit Patel is a Senior Product Manager at NVIDIA. Prior to joining NVIDIA in 2011, Ankit worked in the media and entertainment industry for over 10 years. He has held product management positions at Matrox Video and Echolab, which was acquired by Blackmagic Design in 2010. Ankit is passionate about building products that allow creative individuals to realize their dreams, whether that's through creative storytelling or building amazing products. Ankit holds an MBA from Cornell University and a bachelor's degree in computer science from Concordia University in Montreal, Canada.

Learn how to deliver photograph-quality images faster than ever before with NVIDIA® Quadro® VCA. Accelerate design and VFX workflows with NVIDIA Quadro VCA, the fastest way to interact with photorealistic digital 3D models and scenes. This is a powerful network-attached appliance that harnesses the power of the highest-performing NVIDIA GPUs. It's accessible to anyone on the network, easily integrated into design workflows and effortlessly scales to multiple VCAs to minimize the time to noiseless physically-based global illumination.

Level: Beginner
Type: Talk
Tags: Rendering & Ray Tracing; Large Scale and Multi-Display Visualization

Day: Tuesday, 04/05
Time: 16:30 - 16:55
Location: Room LL21B

S6745 - VQA: Visual Question Answering

Aishwarya Agrawal Ph.D. Student, Virginia Tech
Aishwarya Agrawal is a second year Ph.D. student at the Bradley Department of Electrical and Computer Engineering at Virginia Tech. She is a member of the Virginia Tech Machine Learning and Perception Lab and is advised by Dhruv Batra. Her research interests lie at the intersection of machine learning, computer vision and natural language processing with a focus on multi-modal Artificial Intelligence, e.g. Visual Question Answering (VQA).

We'll describe the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image (e.g., "What kind of store is this?", "How many people are waiting in the queue?", "Is it safe to cross the street?"), the machine's task is to automatically produce an accurate natural language answer ("bakery", "5", "Yes"). Answering any possible question about an image is one of the 'holy grails' of AI, requiring integration of vision, language, and reasoning. We have collected and recently released a dataset containing >250,000 images, >750,000 questions, and ~10 million answers (www.visualqa.org). We are also running the VQA challenge (www.visualqa.org/challenge.html), which includes both an open-ended answering task and a multiple-choice task.

Level: Intermediate
Type: Talk
Tags: Computer Vision & Machine Vision; Deep Learning & Artificial Intelligence; Big Data Analytics

Day: Tuesday, 04/05
Time: 16:30 - 16:55
Location: Room 210F

S6683 - "Piz Daint" and "Piz Kesch": From General Purpose GPU-Accelerated Supercomputing to an Appliance for Weather Forecasting

Thomas Schulthess Director, Swiss National Supercomputing Centre
Thomas is Director of the Swiss National Supercomputing Centre (CSCS) and a professor for computational physics at ETH Zurich. He received his PhD in physics in 1994. Since 2010 he has taken interest in refactoring climate codes to take advantage of novel, energy efficient computing architectures.

One of today's biggest challenges for scientific computing is the rapidly developing architectural diversity and heterogeneity in computing systems. Application developers no longer face just concurrency as the major obstacle when scaling simulation codes, but have to adapt software to diverging architecture-specific programming models and heterogeneous memory subsystems, requiring significant efforts in refactoring software and developing new algorithms. In this talk, we will show how CSCS has turned these challenges into opportunities, leading to software development collaborations with HPC centers in Europe, the USA, and Japan, to the deployment of "Piz Daint", a GPU-accelerated supercomputer among the top 10 systems worldwide, and to "Piz Kesch", a GPU-based appliance for weather forecasting.

Level: All
Type: Talk
Tags: Supercomputing & HPC; Earth System Modelling

Day: Tuesday, 04/05
Time: 17:00 - 17:25
Location: Room 211A

S6399 - Accelerating Performance and Scalability with NVIDIA GPUs on HPC Applications

Pak Lui Application Performance Manager, HPC Advisory Council
Pak Lui is the Application Performance Manager for the HPC Advisory Council, where he demonstrates application performance on various open source and commercial applications. His main responsibilities involve characterizing HPC workloads, analyzing MPI profiles to optimize HPC applications, and exploring new technologies and solutions and their effectiveness on real HPC workloads. Pak also works at Mellanox Technologies, where he focuses on optimizing HPC applications on Mellanox products. He has been working in the HPC industry for over 15 years. Prior to joining Mellanox Technologies, Pak worked as a Cluster Engineer at Penguin Computing, responsible for building and testing HPC cluster configurations from different OEMs and ISVs. Pak holds a B.Sc. in Computer Systems Engineering and an M.Sc. in Computer Science from Boston University.

GPU-based clusters are being adopted at a rapid pace in HPC to perform compute-intensive tasks at large scale. One of the main challenges in deploying these GPU clusters is the performance and latency of communication between GPUs across the interconnect fabric. The goal of this session is to highlight interconnect optimizations, guided by MPI communication profiling, that provide higher performance and better utilization and allow GPU clusters to scale. We will also demonstrate, with a selection of HPC applications on an InfiniBand cluster, how to use GPUDirect RDMA to communicate directly in a peer-to-peer fashion, completely bypassing the CPU subsystem and allowing applications to perform and scale.
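
To fix ideas, a minimal sketch of the peer-to-peer pattern: with a CUDA-aware MPI library built for GPUDirect RDMA (MVAPICH2-GDR, for example), device pointers are handed straight to MPI and the interconnect moves GPU memory without staging through the host. The two-rank setup below is illustrative only.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(0);   // one GPU per rank assumed

    const int n = 1 << 20;
    float* d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));

    // With GPUDirect RDMA, the NIC reads/writes GPU memory directly;
    // without it, the MPI runtime stages through host buffers.
    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}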

Level: Intermediate
Type: Talk
Tags: Supercomputing & HPC; Performance Optimization

Day: Tuesday, 04/05
Time: 17:30 - 17:55
Location: Room 211A

S6107 - Robust Model-Based 3D Head Pose Estimation

Shalini Gupta Senior Research Scientist, NVIDIA
Shalini Gupta has been a senior research scientist in the Mobile Visual Computing group of NVIDIA Research since April 2013. From 2011 to 2013, she worked as a senior mobile computer vision engineer at NVIDIA, where she designed and productized computer vision and computational photography solutions for mobile platforms and GPUs. She worked as an imaging and architecture scientist at Texas Instruments, from 2008 to 2010, where she designed algorithms for the image signal processing pipeline of mobile phones, at AT&T Laboratories on their IPTV project, and at Advanced Digital Imaging Research, LLC, where she designed algorithms for 3D human face recognition. Shalini received her M.S. and Ph.D. in electrical and computer engineering from the University of Texas at Austin in 2004 and 2008, respectively. She received a B.S. in electronics and electrical communication engineering from Punjab Engineering College, India, in 2002. She is a recipient of the Summer Research Fellowship 2001, awarded by the Jawaharlal Nehru Center for Advanced Scientific Research, Bangalore, India. Her primary research interests are image/signal processing, computer vision, and machine learning, and their application to scene understanding and interpretation.

Depth cameras have become cheap and ubiquitous. We introduce a computer vision algorithm for accurate, three-dimensional (3D) head pose (rotation and translation) estimation, which runs in near real time in CUDA. It works with different commodity depth sensors with minimal adaptation, handles large head rotations and occlusions gracefully, and does not require cumbersome subject initialization. Our algorithm achieves an angular error of 2 degrees and a translational error of 6 mm, and outperforms all seven competing methods on a benchmark data set. Accurate head pose estimation is a fundamental problem in computer vision. It is a prerequisite for gaze estimation, facial animation capture, face recognition, driver monitoring, and head-coupled 3D perspective displays.

Level: Intermediate
Type: Talk
Tags: Computer Vision & Machine Vision; Video & Image Processing; Intelligent Video Analytics (IVA)

Day: Wednesday, 04/06
Time: 09:00 - 09:25
Location: Room 210F

S6108 - High-Performance Pedestrian Detection on NVIDIA Tegra®

Max Lv GPU Architect, NVIDIA
Max Lv is a GPU architect in the Compute Architecture team at NVIDIA, focusing on computer vision applications on GPU and mobile GPU architecture. Before joining NVIDIA, he was a research assistant at the Parallel Processing Institute in Fudan University.

We'll present an innovative approach to efficiently mapping a popular pedestrian detection algorithm (HOG) onto an NVIDIA Tegra GPU. Attendees will learn new techniques to optimize a real computer vision application on Tegra X1, as well as several new architectural features of the Tegra X1 GPU.
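
As a baseline for what such a mapping starts from, here is a naive sketch of the core HOG step: one thread block accumulates the 9-bin orientation histogram of one 8x8 cell using shared-memory atomics. The optimizations presented in the talk go well beyond this; the kernel is illustrative only and assumes precomputed gradient magnitude and angle images whose dimensions are multiples of 8.

// Launch with grid (width/8, height/8) and 64 threads per block.
// grad_ang holds unsigned gradient angles in degrees, [0, 180).
__global__ void hog_cell_hist(const float* grad_mag, const float* grad_ang,
                              int width, float* hist /* 9 bins per cell */) {
    __shared__ float bins[9];
    if (threadIdx.x < 9) bins[threadIdx.x] = 0.0f;
    __syncthreads();

    int x = blockIdx.x * 8 + threadIdx.x % 8;
    int y = blockIdx.y * 8 + threadIdx.x / 8;
    int idx = y * width + x;

    // 9 bins of 20 degrees each; magnitude-weighted voting.
    int bin = min(8, (int)(grad_ang[idx] / 20.0f));
    atomicAdd(&bins[bin], grad_mag[idx]);
    __syncthreads();

    if (threadIdx.x < 9)
        hist[(blockIdx.y * gridDim.x + blockIdx.x) * 9 + threadIdx.x] =
            bins[threadIdx.x];
}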

Level: Advanced
Type: Talk
Tags: Self-Driving Cars & Automotive; Computer Vision & Machine Vision

Day: Wednesday, 04/06
Time: 09:00 - 09:25
Location: Room LL21E

S6127 - NVIDIA Iray®: Changing the Face of Architecture and Design

Scott DeWoody Firmwide Creative Media Manager, Gensler
Scott DeWoody has always had an affinity for art and technology. After seeing the animation being done through computers, he knew he could combine the two. In 2007, he graduated from The Art Institute of Houston with a B.A. in media arts and animation. There, he focused on lighting and rendering techniques using 3ds Max software. Image quality and workflow are the top priorities in his work. He is constantly studying color theory, composition, and new ways to produce the best possible results. He has worked at Gensler for the past eight years as a visualization artist and manager. He has worked for numerous clients, including NVIDIA Corporation, ExxonMobil, Shell Oil Company, BP, City Center Las Vegas, and many more. He is exploring the new possibilities of architecture in the interactive space with gaming platforms, augmented reality, and virtual reality.
Hao Ko Principal, Gensler
Grounded by the belief that the fundamental role of an architect is to elevate the human spirit, Hao Ko strives to design beautiful places -- ones that inspire people and impact the way they live, work, and play. Always pursuing a high level of conceptual thinking, pushing performance boundaries, and detailed in execution and craft, Hao has just recently completed the Tower at PNC Plaza in Pittsburgh, Pa. Coupling a progressive workplace design to a unique and innovative passive, natural ventilation strategy driven by a breathable double-skin facade and solar chimney, this transformational project is designed to be the world's greenest high-rise building.

NVIDIA's Iray technology was a game changer in the design process of its new corporate campus. Gensler teamed up with developers at NVIDIA to help integrate this technology into the process to accurately simulate how the design of the campus would look in the real world. This process ended up helping everyone understand how light and materials were going to act in the 500,000-square-foot space. Being able to accurately compute how the massive amount of daylight coming into the space would react to changes in the design was incredible feedback for the designers. The data that Iray visualized helped with almost every design decision from start to finish.

Level: All
Type: Talk
Tags: Product & Building Design; Rendering & Ray Tracing

Day: Wednesday, 04/06
Time: 09:00 - 09:50
Location: Room LL21A

S6151 - XMP: An NVIDIA CUDA®-Accelerated Big Integer Library

Justin Luitjens Developer Technologies Engineer, NVIDIA
Justin Luitjens has been a developer technology engineer at NVIDIA for five years. He received his Ph.D. in scientific computing from the University of Utah.

We'll introduce the XMP library, which provides CUDA-accelerated implementations of many large-integer arithmetic operations. These operations are generally used to implement encryption and decryption routines, including RSA, ECC, and Diffie-Hellman key exchange. We'll focus on the library's capabilities and how to use it efficiently.
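
For orientation, the core primitive these routines accelerate is modular exponentiation. The sketch below is plain 64-bit square-and-multiply on the host; it is not the XMP API, which operates on multi-thousand-bit integers and batches many such operations across GPU threads. (__uint128_t is a gcc/clang extension used here for the 128-bit intermediate product.)

#include <cstdint>
#include <cstdio>

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
    return (uint64_t)((__uint128_t)a * b % m);
}

// Square-and-multiply: computes base^exp mod m in O(log exp) steps.
static uint64_t powmod(uint64_t base, uint64_t exp, uint64_t m) {
    uint64_t result = 1 % m;
    base %= m;
    while (exp) {
        if (exp & 1) result = mulmod(result, base, m);
        base = mulmod(base, base, m);
        exp >>= 1;
    }
    return result;
}

int main() {
    printf("%llu\n", (unsigned long long)powmod(5, 117, 19));
    return 0;
}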

Level: All
Type: Talk
Tags: Aerospace & Defense; Tools & Libraries

Day: Wednesday, 04/06
Time: 09:00 - 09:25
Location: Marriott Salon 2

S6249 - How to Deal with Radiation: Evaluation and Mitigation of GPUs Soft-Errors

Paolo Rech Associate Professor, UFRGS
Paolo Rech is an associate professor at the Federal University of Rio Grande do Sul, Porto Alegre, RS, Brazil. Paolo received his M.S. and Ph.D. from Padova University, Padova, Italy, in 2006 and 2009, respectively. His studies included radiation tests and the effect of neutrons, protons, and alpha particles on programmable devices like FPGAs and systems on chip. He was a postdoc at LIRMM, Montpellier, France from 2010 to 2012, working on radiation effects on electronic devices at high altitudes. Recently, he started collaborations with NVIDIA, AMD, Northeastern University, and Los Alamos National Lab to evaluate and mitigate the radiation-induced effects in devices designed for large-scale HPC centers and in heterogeneous systems for automotive and aerospace markets.

We will cover the basics of radiation-induced effects on GPUs and propose effective solutions to mitigate them. The session will start with an exhaustive description of the physical mechanisms by which ionizing particles generate failures. Then, taking advantage of data gathered over four years of GPU neutron beam tests, we evaluate GPU error rates in realistic applications and identify GPUs' weaker resources. Observed errors are also compared with Titan field data and with the reliability constraints of the automotive market. Additionally, mitigation strategies like ECC and software-based hardening solutions are analyzed and experimentally evaluated. Finally, we will advise on how to implement parallel algorithms and distribute threads in the most efficient and reliable way.

Level: Intermediate
Type: Talk
Tags: Supercomputing & HPC; Self-Driving Cars & Automotive

Day: Wednesday, 04/06
Time: 09:00 - 09:25
Location: Room 211A

S6258 - VMD: Interactive Molecular Ray Tracing with NVIDIA OptiX™

John Stone Senior Research Programmer, University of Illinois at Urbana-Champaign
Highly-Rated Speaker
John Stone is a senior research programmer in the Theoretical and Computational Biophysics Group at the Beckman Institute for Advanced Science and Technology, and associate director of the NVIDIA CUDA Center of Excellence at the University of Illinois. John is the lead developer of VMD, a high-performance molecular visualization tool used by researchers all over the world. His research interests include molecular visualization, GPU computing, parallel processing, ray tracing, haptics, and virtual environments. John was named an NVIDIA CUDA Fellow in 2010. In 2015, he joined the Khronos Group Advisory Panel for the Vulkan graphics API. John also provides consulting services for projects involving computer graphics, GPU computing, and high performance computing.

We'll describe the adaptation of the popular molecular graphics program VMD for interactive ray tracing using NVIDIA OptiX, on computers ranging from laptops all the way up to large NVIDIA VCA GPU clusters and petascale supercomputers such as Blue Waters and Titan. We'll highlight the new OptiX 3.8 progressive rendering and remote device APIs, and show how VMD uses them for both local and remote VCA rendering. We'll also highlight the use of OptiX GPU ray tracing for interactive panoramic and omnidirectional projections suited to planetariums, fulldome theaters, and VR headsets (HMDs) such as the Oculus Rift. The session will present the latest VMD+OptiX ray tracing performance data for workstations, VCA GPU clusters, and supercomputers.

Level: Intermediate
Type: Talk
Tags: Rendering & Ray Tracing; In-Situ and Scientific Visualization

Day: Wednesday, 04/06
Time: 09:00 - 09:25
Location: Room LL21B

S6266 - Automated Geophysical Feature Detection with Deep Learning

Chiyuan Zhang PhD Student, MIT
Chiyuan Zhang received his B.S. and M.S. in computer science from Zhejiang University, China, in 2009 and 2012, respectively. He is currently a Ph.D. candidate at the Computer Science and Artificial Intelligence Laboratory at MIT. His research interests include machine learning and computational neuroscience, as well as application to processing/analysis of speech, vision, and other kinds of real-world signals.

We introduce a novel approach to fault localization in oil and gas exploration based on automated feature detection with deep learning algorithms running on GPUs. Faults are key geological structures that can serve as boundaries for hydrocarbon reservoirs. Most current techniques that tackle this problem rely on seismic images, which are the outcome of expensive computing with substantial human intervention. We'll present the latest results from a joint project by MIT and Shell International E&P Inc. on using deep learning to bypass this expensive processing and perform fault detection on the raw seismic traces. We build a system in Julia/Mocha.jl and cuDNN to solve the challenging structured output prediction problem, and show promising preliminary results.

Level: Advanced
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Energy Exploration

Day: Wednesday, 04/06
Time: 09:00 - 09:25
Location: Room 210G

S6295 - Counterparty Credit Risk and IM Computation for CCP on Multicore Systems

Prasad Pawar Developer, Tata Consultancy Services
Prasad Pawar is a developer at TCS in the Parallelization and Optimization group of HPC. Prasad received an M.S. in computer science and engineering from Kolhapur University, Maharashtra, in 2008 and a B.S. in computer science and engineering from Aurangabad University in 2005. He has seven years of experience in the HPC domain. Prasad holds one patent and has published his research at various national and international conferences. His research interests include high performance computing, parallelization and optimization, multicore programming, GPGPUs, and algorithms.
Amit Kalele Consultant, Tata Consultancy Services
Amit Kalele has 10 years of industry experience in high performance computing. His primary area of focus is performance optimization and parallelization of applications on multi core and many core CPUs/GPUs. Amit received his Ph.D. from the Department of Electrical Engineering, IIT Bombay.

We'll present how GPUs, together with performance optimizations using the latest features of Kepler GPUs, enabled a critical risk estimation application in a trading system to achieve near real-time performance, compared to 25 minutes on legacy systems. Counterparty credit risk is the risk that the counterparty to a transaction could default before the final settlement of the transaction's cash flows. A CCP calculates the mark-to-market margin requirement for each member and blocks it from a member's collateral if the margin is not sufficient. Moreover, the GPU-based approach minimizes the risk for the CCP.

Level: All
Type: Talk
Tags: Performance Optimization; Finance

Day: Wednesday, 04/06
Time: 09:00 - 09:50
Location: Room 212A

S6353 - Accelerating a Spectral Algorithm for Plasma Physics with Python/Numba on GPU

Remi Lehe Postdoctoral Researcher, Lawrence Berkeley National Laboratory
Remi Lehe graduated in physics from Ecole normale superieure, Paris, and obtained a Ph.D. from Ecole Polytechnique, France, where he studied plasma-based particle accelerators. His work on these accelerators is largely based on particle-in-cell (PIC) simulations, and in particular he developed an alternative finite-difference Maxwell solver, which is now implemented in several PIC codes used by research teams throughout the world (Osiris, PIConGPU, Warp, Calder). Remi is now a postdoctoral researcher at Lawrence Berkeley Laboratory, where he works on large-scale plasma simulations and advanced spectral algorithms.

Learn how a complex spectral algorithm can be rapidly ported to the GPU while writing only Python code. Overall, our spectral particle-in-cell simulation code runs ~40 times faster on one GPU than on one CPU. This is partly due to the extensive use of FFTs and matrix multiplications in our spectral solver. Those operations are very efficient on a single GPU, while they are relatively slow on a single CPU and difficult to parallelize over several CPUs. The entire code is written in Python, which allowed for fast development and debugging, while the Numba just-in-time compiler enabled high performance on the GPU. In particular, we made use of cuFFT and cuBLAS, of well-controlled memory transfers between GPU and CPU, and of a parallel radix sort to avoid race conditions in critical sections of the code.
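
As an illustration of the spectral core, here is the same pattern written directly against cuFFT in CUDA C (the presenters reach equivalent GPU libraries from Python through Numba and wrappers): a periodic derivative computed as forward FFT, multiplication by ik, and inverse FFT.

#include <cufft.h>
#include <cuda_runtime.h>

// Multiply each Fourier mode by ik and fold in the 1/n normalization
// that cuFFT's unnormalized inverse transform leaves to the caller.
__global__ void mul_ik(cufftComplex* f, int n, float lx) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float k = (i < n / 2 ? i : i - n) * 2.0f * 3.14159265f / lx;
    cufftComplex v = f[i];
    f[i].x = -k * v.y / n;
    f[i].y =  k * v.x / n;
}

// In-place spectral derivative of a periodic signal of length n, domain lx.
void spectral_derivative(cufftComplex* d_f, int n, float lx) {
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);
    cufftExecC2C(plan, d_f, d_f, CUFFT_FORWARD);
    mul_ik<<<(n + 255) / 256, 256>>>(d_f, n, lx);
    cufftExecC2C(plan, d_f, d_f, CUFFT_INVERSE);
    cufftDestroy(plan);
}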

Level: Intermediate
Type: Talk
Tags: Computational Physics; Tools & Libraries

Day: Wednesday, 04/06
Time: 09:00 - 09:25
Location: Marriott Salon 6

S6444 - White Matter Tractography and Human Brain Connections Using GPUs

Moises Hernandez Fernandez Ph.D. Candidate, Oxford Centre for Functional MRI of the Brain (FMRIB), University of Oxford
Moises Hernandez Fernandez is a Ph.D. candidate in clinical neurosciences at the University of Oxford, supervised by Professor Stephen Smith, Professor Mike Giles, Dr. Stamatios Sotiropoulos, and Dr. Istvan Reguly.

We'll present a novel analysis tool for diffusion MRI (dMRI) data using NVIDIA GPUs for mapping connections in the human brain. We'll describe the potential of dMRI and how it allows the study of brain microstructure and long-range brain connections, non-invasively and in-vivo (tractography). Due to the multidimensional nature of the data, modelling can be computationally demanding. We present a parallel framework for analysis of dMRI data that allows accelerations of up to two orders of magnitude when comparing GPU with CPU implementations. We'll highlight the tremendous benefit of these accelerations in very large recent studies such as the Human Connectome Project, where comprehensive maps of brain anatomical connectivity of unprecedented quality are being generated.

Level: Intermediate
Type: Talk
Tags: Medical Imaging

Day: Wednesday, 04/06
Time: 09:00 - 09:25
Location: Room 212B

S6454 - Real-Time Graphics for Film Production at Pixar

Pol Jeremias Graphics Software Engineer, Pixar Animation Studios
Highly-Rated Speaker
Pol Jeremias is passionate about technology and art. He grew up near Barcelona and moved to California in 2006. Since then, Pol has researched computer graphics and worked on multiple games for companies such as LucasArts and SoMa Play. Today, he helps create movies at Pixar Animation Studios. In his spare time, he has co-founded Shadertoy.com and Beautypi. When he is not programming, you will find him running, reading, or watching movies.
Jeremy Cowles Lead Software Engineer, Pixar Animation Studios
Jeremy Cowles is the GPU team lead at Pixar, where he has contributed to Universal Scene Description, OpenSubdiv, and Pixar's animation system for Presto. He is also the co-architect of Hydra, Pixar's real-time render engine for film assets. Jeremy and his team will present Presto's next generation hybrid GL/path traced viewport architecture and the future of Presto real-time workflows, where these technologies come together.

Join the Pixar GPU team for a session that explores how real-time graphics are used at Pixar. We'll cover the unique needs for film production, including loading and run-time management of massive movie sets and complex characters, real-time subdivision surfaces, real-time effects particularly useful for technical directors, and how these assets are rendered using the latest hardware features. Don't miss this great opportunity to learn about graphics, algorithms, and movies!

Level: All
Type: Talk
Tags: Media & Entertainment; Rendering & Ray Tracing; Real-Time Graphics

Day: Wednesday, 04/06
Time: 09:00 - 09:50
Location: Room LL21C

S6481 - Distributed Graph-Based Density Matrix Calculation for Quantum Molecular Dynamics Using GPUs

Susan Mniszewski Scientist, Los Alamos National Laboratory
Susan Mniszewski is a scientist in the Computer, Computational and Statistical Sciences Division at Los Alamos National Laboratory (LANL). Her work in computational co-design includes development of molecular dynamics proxy applications used to explore new programming models and algorithms for emerging hardware and software capabilities and exploration of sparse matrix and graph-based linear scaling approaches for quantum molecular dynamics on multicore, GPU-accelerated, and distributed architectures. This work was performed with C. Negre, M. Cawkwell, and A. Niklasson of LANL.

Quantum molecular dynamics (QMD) simulations are a highly accurate tool to predict material properties, with potential applications in targeted pharmaceuticals, fuel cells, and biomolecular systems. A graph-based second order spectral projection (SP2) approach is presented for calculation of the electronic density matrix from a Hamiltonian matrix. Large systems run distributed using OpenMP/MPI parallelism for the data decomposition, graph partitioning, submatrix extraction, and density matrix assembly. Compute-intensive SP2 calculations take advantage of GPU acceleration using the cuBLAS matrix algebra library. This hybrid parallel methodology is demonstrated for poly(ethylene) and protein structures solvated in water.
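
A dense-matrix sketch of the SP2 iteration with cuBLAS may help fix the idea; the talk applies this per graph-partitioned submatrix, and the convergence test and buffer management are simplified here. X starts as the Hamiltonian mapped so its eigenvalues lie in [0, 1], and each step either squares X or applies 2X - X^2, steering trace(X) toward the occupation number.

#include <cublas_v2.h>
#include <vector>
#include <utility>

// Sum of the diagonal via a strided copy (stride n+1 walks the diagonal
// of a column-major n x n matrix).
static double device_trace(const double* dX, int n) {
    std::vector<double> diag(n);
    cublasGetVector(n, sizeof(double), dX, n + 1, diag.data(), 1);
    double t = 0.0;
    for (double v : diag) t += v;
    return t;
}

// Returns the buffer that holds the purified density matrix.
static double* sp2_purify(cublasHandle_t h, double* dX, double* dTmp,
                          int n, double nocc, int iters) {
    const double one = 1.0, zero = 0.0, two = 2.0, neg = -1.0;
    for (int k = 0; k < iters; ++k) {
        // Tmp = X * X
        cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &one, dX, n, dX, n, &zero, dTmp, n);
        if (device_trace(dX, n) > nocc)
            std::swap(dX, dTmp);          // X <- X^2, lowers the trace
        else
            // X <- 2X - X^2, raises the trace; geam allows C == A in place
            cublasDgeam(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n,
                        &two, dX, n, &neg, dTmp, n, dX, n);
    }
    return dX;
}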

Level: All
Type: Talk
Tags: Computational Chemistry

Day: Wednesday, 04/06
Time: 09:00 - 09:25
Location: Marriott Salon 5

S6491 - Containerizing GPU Applications with Docker for Scaling to the Cloud: Future of Packaging Applications

Maciej Bajkowski COO, Bitfusion.io, Inc
Maciej Bajkowski is a proven innovator with extensive engineering experience in the design of high-speed components, memory systems and storage solutions for leading companies in the computing field, including Intel, Samsung, Freescale and Dell.
Subbu Rama CEO, Bitfusion.io, Inc
Subbu Rama has held engineering and leadership roles in hardware and software divisions, while building CPUs, micro-servers, SoCs, and cloud infrastructures, at companies like Intel and Dell. As a founding member, he built Dell's first cloud infrastructure marketplace.

We'll share different ways of packaging GPU applications as containers versus traditional options, and shed light on performance versus portability. Docker containers wrap up a piece of software in a complete filesystem that contains everything it needs to run: code, runtime, system tools, system libraries -- anything you can install on a server. This guarantees that it will always run the same, regardless of the environment it is running in. We'll provide some background on Linux containers and their applicability to heterogeneous platforms, GPUs in particular, and challenges in adoption, and conclude with a demo of the whole process of containerizing, deploying, and managing GPU applications for the cloud.
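
A minimal sketch of the packaging step, assuming the public nvidia/cuda base image and the nvidia-docker wrapper, which mounts the driver and GPU device nodes at run time; the tag and paths are illustrative.

FROM nvidia/cuda:7.5-devel
COPY app.cu /src/app.cu
RUN nvcc -O3 /src/app.cu -o /usr/local/bin/app
CMD ["app"]

# Build once, then run anywhere a compatible NVIDIA driver exists:
#   docker build -t gpu-app .
#   nvidia-docker run --rm gpu-app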

Level: Beginner
Type: Talk
Tags: Data Center & Cloud Computing; Graphics Virtualization; Supercomputing & HPC

Day: Wednesday, 04/06
Time: 09:00 - 09:50
Location: Room 210E

S6633 - Navigating the In-Situ Visualization Landscape

Tom Fogal Senior Software Engineer, NVIDIA
Thomas Fogal is an NVIDIA engineer specializing in HPC visualization. As a doctoral student, he worked on parallel volume rendering techniques as well as novel approaches to in situ visualization. At the Scientific Computing & Imaging Institute, ORNL, and LLNL, he worked on parallel rendering for large scientific data. Thomas holds a B.S. and an M.S. from the University of New Hampshire, and will soon have a doctorate from the University of Duisburg-Essen in Germany.

You'll learn how to navigate the complex landscape of in-situ visualization. There are a number of technologies and a variety of design challenges to overcome when adding in-situ visualization into simulation software. Should your coupling be tight or loose? Does 'in transit' visualization make sense in your environment? What value do you gain from coupling with VisIt's libsim or ParaView's Catalyst? How can you adjust to low-memory environments? Is high-performance analysis via VTK-m applicable in your workflows? How much temporal resolution does one need on the visualization side? What's the best way to approach CUDA-OpenGL interop to get zero-copy visualization? How should you organize data for visualization?
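
On the CUDA-OpenGL interop question, the zero-copy pattern is to register a GL buffer with CUDA once, then map it around each compute step so the kernel writes directly into the memory the renderer draws. A minimal sketch, assuming the GL buffer and the simulation kernel's payload are defined by the application:

#include <GL/gl.h>
#include <cuda_gl_interop.h>

extern GLuint vbo;                          // GL buffer created by the app
static cudaGraphicsResource* res = nullptr;

__global__ void write_vertices(float4* v) { // placeholder sim kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    v[i] = make_float4(i * 0.01f, 0.0f, 0.0f, 1.0f);
}

void register_once() {
    // One-time registration of the GL buffer with CUDA.
    cudaGraphicsGLRegisterBuffer(&res, vbo, cudaGraphicsMapFlagsWriteDiscard);
}

void update_from_simulation() {
    float4* d_ptr = nullptr;
    size_t bytes = 0;
    cudaGraphicsMapResources(1, &res);
    cudaGraphicsResourceGetMappedPointer((void**)&d_ptr, &bytes, res);
    int n = (int)(bytes / sizeof(float4));
    write_vertices<<<(n + 255) / 256, 256>>>(d_ptr);
    cudaGraphicsUnmapResources(1, &res);    // GL may now draw the buffer
}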

Level: Beginner
Type: Talk
Tags: In-Situ and Scientific Visualization; Programming Languages

Day: Wednesday, 04/06
Time: 09:00 - 09:25
Location: Room LL21D

S6130 - 3D Deep Learning

Jianxiong Xiao Assistant Professor, Princeton University
Jianxiong Xiao is an assistant professor in the Department of Computer Science at Princeton University and the director of the Princeton Vision Group. He received his Ph.D. from the Computer Science and Artificial Intelligence Laboratory (CSAIL) at Massachusetts Institute of Technology (MIT). Jianxiong's research interests are in computer vision. He has been motivated by the goal of building computer systems that automatically understand visual scenes, both inferring the semantics and extracting 3D structure. Jianxiong focuses on 3D deep learning, RGB-D recognition and reconstruction, place-centric 3D context modeling, graphics for vision (synthesis for analysis), deep learning for autonomous driving, large-scale crowd-sourcing, and petascale big data. His work has received the Best Student Paper Award at the European Conference on Computer Vision (ECCV) in 2012 and Google Research Best Papers Award for 2012. Jianxiong was awarded the Google U.S./Canada Fellowship in Computer Vision in 2012, MIT CSW Best Research Award in 2011, and two Google Research Awards in 2014 and in 2015.

We'll discuss some of our research projects on 3D deep learning in computer vision, including our projects to use 3D convolutional neural networks on GPUs to learn 3D descriptors for point features, to model 3D shapes, and to parse 3D scenes. Finally, we'll talk about Marvin, a deep learning software framework for N-dimensional data that we developed for NVIDIA GPUs, which could impact other fields, such as the neural sciences, biology, medical imaging, and healthcare.

Level: All
Type: Talk
Tags: Computer Vision & Machine Vision; Robotics & Autonomous Machines; Deep Learning & Artificial Intelligence

Day: Wednesday, 04/06
Time: 09:30 - 09:55
Location: Room 210F

S6220 - Not Just a Universal Crutch: Other Useful Things to Do with atomicCAS

Elmar Westphal Scientific Programmer, Forschungszentrum Jülich GmbH
Highly-Rated Speaker
Elmar Westphal has been working as a programmer and cluster architect at Forschungszentrum Juelich for more than 15 years. In recent years, he has ported simulation programs from different fields of computational physics to single- and multi-GPU systems and developed CUDA-based building blocks, libraries, and applications, mostly for molecular dynamics and micromagnetism simulations.

There is more to atomicCAS than the double-precision atomicAdd loop from the programming guide, and more than the universal atomic-operation loop it represents. We'll show how to build shared-memory-based hash function loops to solve different counting and grouping problems at warp and block level. Variations of this loop can be used to count unique elements in a block, find threads sharing common data elements, or speed up histogram building for large numbers of bins. With atomic operations on shared memory now implemented natively on Maxwell, these functions can be significantly faster than algorithms optimised for other architectures.
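
For reference, the programming-guide loop the abstract alludes to is reproduced below; the same compare-and-swap skeleton, pointed at a shared-memory hash slot instead of a single accumulator, is the building block this talk generalizes.

__device__ double atomicAddDouble(double* address, double val) {
    unsigned long long* addr = (unsigned long long*)address;
    unsigned long long old = *addr, assumed;
    do {
        assumed = old;
        // Try to swap in the updated value; the swap fails if another
        // thread changed the location since we read it, and we retry.
        old = atomicCAS(addr, assumed,
                        __double_as_longlong(val +
                                             __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}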

Level: Advanced
Type: Talk
Tags: Algorithms; Performance Optimization

Day: Wednesday, 04/06
Time: 09:30 - 09:55
Location: Marriott Salon 3

S6270 - How to Create Photoreal Configurators Using Lightworks Iray+®

Dave Hutchinson Chief Technology and Operating Officer, Lightwork Design Ltd.
Dave Hutchinson converts Lightwork Design's leading technology into new business opportunities through Lightworks Iray+, Iray+ Configurator, Iray+ for 3DSMax, NVIDIA Iray, and associated consultancy and digital services. Dave leads product development, engineering, support, and customer interactions. He is also responsible for driving commercial strategy and directing the marketing and sales teams to achieve new sales and existing-customer satisfaction. Dave has an extensive background in visualization technology and the 3D market.
Dave Coldron Product Director, Lightwork Design Ltd.
As Lightworks Product Director, Dave Coldron has responsibility for the development of the Iray+ ecosystem, including Iray+ for 3DSMax and the new Iray+ Configurator. With over 20 years of experience in developing integrated systems for the computer graphics industry, Dave knows how to create applications that support the design workflow; focusing on the use of compelling digital content, interactive design, and the user experience.

Photoreal configurators powered by GPUs have come of age. We present an industry view on where we are seeing demand for this technology emerge, from product and architectural design review and client presentation through to dealership and online consumer product customisation. We'll use real-world projects to illustrate the demand we are seeing, outline what we and our clients had to do to create a great experience, and describe why the GPU is key to its success. With NVIDIA Iray technology being delivered within key products such as Iray+ for 3DSMax and Iray+ for Siemens NX, the industry is demanding that the power of physically based rendering be extended through configuration in complementary workflows. Join us to find out more about this exciting and growing area.

Level: Beginner
Type: Talk
Tags: Rendering & Ray Tracing; Product & Building Design

Day: Wednesday, 04/06
Time: 09:30 - 09:55
Location: Room LL21B

S6294 - Live, Interactive, In-Situ Visualization of Large-Scale Plasma Simulations

Michael Bussmann Group Leader Computational Radiation Physics, Helmholtz-Zentrum Dresden - Rossendorf
Highly-Rated Speaker
Michael Bussmann leads a team of developers for the plasma simulation code PIConGPU. He is co-founder of the Dresden GPU Computing Center of Excellence. Michael makes use of GPUs for large-scale simulations on systems like Titan at Oak Ridge, high-data rate analysis of scientific images, and in-situ visualization.

In large-scale scientific simulations, I/O has become a bottleneck that can slow down the exploration of unknown physical scenarios. We show that it is vital to view an HPC system not only by its ability to simulate a system but also by its ability to visualize the simulated data. By keeping the simulation data in GPU memory, remote analysis over a Wi-Fi connection can run at frame rates well above 10 fps while latencies remain insignificant, even when spanning continents. This presentation includes a live demo.

Level: All
Type: Talk
Tags: In-Situ and Scientific Visualization; Supercomputing & HPC; Computational Physics

Day: Wednesday, 04/06
Time: 09:30 - 09:55
Location: Room LL21D

S6300 - Big Lasers, Small Particles & GPUs: Our Weapons of Choice to Fight Cancer

Axel Huebl PhD Student, Helmholtz-Zentrum Dresden - Rossendorf
Axel Huebl is one of the main developers of the PIConGPU laser plasma simulation and one of the inventors of the openPMD metadata format for particle-mesh data. He was part of the team that brought PIConGPU to the finals of the 2013 Gordon Bell Prize. Axel is currently working on his master's thesis on laser-driven ion acceleration for cancer therapy and the interaction of X-ray lasers with solid-density plasmas.

We'll present results on our INCITE project "Targeting Cancer with High Power Lasers," which aims to deliver beams of ions for cancer therapy accelerated by high power lasers. With a novel target design in which the target is levitated in a trap to isolate it from its environment, we study the properties of the generated ion beams and their potential for radiation therapy of cancer. In the discussion, we'll also present performance results of our own plasma simulation code PIConGPU on the Titan system, which has been used to study the laser plasma interaction in 3D.

Level: All
Type: Talk
Tags: Computational Physics; Computational Chemistry; Supercomputing & HPC

Day: Wednesday, 04/06
Time: 09:30 - 09:55
Location: Marriott Salon 6

S6304 - Efficient Parallelization of Molecular Dynamics Simulations on Hybrid CPU/GPU Supercomputers

Jaewoon Jung Research Scientist, RIKEN AICS
Jaewoon Jung works as a technical scientist for RIKEN, Japan. He joined RIKEN as a research scientist in 2010. Jaewoon earned his Ph.D. in physics at the Institute of Science and Technology, Korea.
Yuji Sugita Chief Scientist, RIKEN
Yuji Sugita received a Ph.D. in chemistry from Kyoto University in 1998. He joined RIKEN as a postdoctoral fellow in 1998 and since 2012 has served as Chief Scientist at RIKEN's Theoretical Molecular Science Laboratory.

We address an efficient parallelization scheme for molecular dynamics (MD) simulations on hybrid CPU/GPU supercomputer systems. In this scheme, the most time-consuming calculations, the real-space nonbonded interactions and the setup of the pairlist for those interactions, are performed on GPUs, while the rest of the calculations are done on CPUs. In our program, GENESIS (generalized-ensemble simulation system), we introduced a novel domain decomposition scheme, which we call the midpoint cell method, for its good weak scaling on massively parallel (CPU-based) supercomputers. This method is also applicable to hybrid CPU/GPU supercomputer systems for simulating large-scale biological systems. We show the performance of GENESIS on the TSUBAME supercomputer.

Level: Intermediate
Type: Talk
Tags: Computational Chemistry

Day: Wednesday, 04/06
Time: 09:30 - 09:55
Location: Marriott Salon 5

S6315 - Energy-Efficiency and Performance Comparison of a Digital Earth Routine on Tegra K1

Dustin Feld Research Associate, Fraunhofer SCAI
Dustin Feld is a Ph.D. student at the Computer Science Chair of Prof. Dr. Michael Juenger at the University of Cologne. He is a research scientist at the Fraunhofer Institute for Scientific Computing and Algorithms (SCAI), where he is a member of the HPC group. His work focuses on parallel algorithms and their efficient implementation on modern parallel hardware.

We'll investigate the potential of a distributed Tegra K1 system for calculating the aerosol optical depth (AOD), a significant optical property of aerosols for applications such as the atmospheric correction of remotely sensed surface features, and for monitoring volcanic eruptions, forest fires, air quality, and even climate change from satellite data. The achieved performance as well as energy efficiency is analyzed with real-world data from the moderate resolution imaging spectroradiometer (MODIS). Additionally, we compare the potential of this architecture with today's commonly available HPC hardware. Due to their very low energy consumption, such embedded hardware architectures offer a great opportunity for strongly energy-constrained situations, such as use on onboard missions.

Level: Intermediate
Type: Talk
Tags: Embedded; Supercomputing & HPC; Earth System Modelling; IoT

Day: Wednesday, 04/06
Time: 09:30 - 09:55
Location: Room LL20D

S6377 - Building the Fully Digital Audi Virtual Cockpit

Horst Hadler Manager Cluster Instruments and Graphics Framework, e.solutions
Horst Hadler joined e.solutions in 2009 as one of the initial members of the infotainment group, managing the framework team. Since the definition of the virtual cluster, Horst has been responsible for cluster instrument development and the graphics framework. He has a degree from the University of Erlangen, where he specialized in computer graphics. Before his infotainment work, he did visual effects for a motion picture studio and worked on a project simulating heat distribution in high-temperature furnaces at the university's computer graphics chair.

Get an overview of the techniques used for Audi's Tegra 3-powered virtual cockpit, focusing on (1) reduction of start-up time, (2) instrument display at 60 fps, and (3) synchronization with the infotainment main unit. Additionally, get to know the overall software structure and see how the graphical effects were implemented. The virtual cockpit is available in single-display and dual-display configurations. The single-display configuration is used for sport models, like the TT and R8, where the output of the infotainment main unit is integrated into the instrument cluster. In contrast, the dual-display configuration additionally features a "standard" main unit display.

Level: Intermediate
Type: Talk
Tags: Self-Driving Cars & Automotive; Embedded; Real-Time Graphics

Day: Wednesday, 04/06
Time: 09:30 - 09:55
Location: Room LL21E

S6460 - Coupling GPUDirect RDMA and InfiniBand Hardware Multicast Technologies for Streaming Applications

Dhabaleswar K. (DK) Panda Professor and University Distinguished Professor, The Ohio State University
Highly-Rated Speaker
Dhabaleswar K. (DK) Panda is a professor and university distinguished scholar of computer science and engineering at the Ohio State University. He has published over 350 papers in major journals and international conferences. The MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) open-source software package, developed by his research group, is currently being used by more than 2,450 organizations in 76 countries around the world. This software has enabled several InfiniBand clusters to reach the latest TOP500 ranking during the last decade. More than 293,000 downloads of this software have taken place from the project's website alone. He is an IEEE fellow and a member of ACM.

Learn about recent developments in middleware design to boost the performance of GPU-based streaming applications. Several runtimes already support and optimize GPU communication using various CUDA features. Similarly, some runtimes use InfiniBand hardware multicast to boost broadcast performance for host-based communication. We'll focus on the challenges in combining and fully utilizing GPUDirect RDMA and hardware multicast technologies in tandem to support a high-performance broadcast operation for streaming applications. Further, we'll present associated challenges and designs for clusters with multi-HCA and multi-GPU configurations, and a performance evaluation of the proposed designs for MPI_Bcast operations will be presented and analyzed.

Level: Intermediate
Type: Talk
Tags: Supercomputing & HPC; Tools & Libraries; Performance Optimization

Day: Wednesday, 04/06
Time: 09:30 - 09:55
Location: Room 211A

S6525 - Image Classification Using Deep Neural Networks on NVIDIA CUDA® Mobile Platforms

Everett Phillips Senior HPC Software Engineer, NVIDIA
Everett Phillips is a senior engineer in the Tesla Performance group at NVIDIA, where he works on high performance computing applications and benchmarks for GPU-accelerated supercomputers. He holds an M.S. in mechanical and aeronautical engineering from University of California, Davis.

We'll present the steps and methodology used to optimize Google's open source Jetpac Deep Belief image classification framework for the latest CUDA-capable mobile processors: NVIDIA Tegra K1 and Tegra X1. Jetpac uses convolutional neural networks to recognize objects on mobile devices. The original code can classify an image in 300 ms on an iPhone 5s and also runs on the Raspberry Pi using Neon instructions and OpenGL for compute.

Level: Intermediate
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Performance Optimization

Day: Wednesday, 04/06
Time: 09:30 - 09:55
Location: Room 210G

S6533 - Accelerating Neural Engineering: Closing the Loop on Brain Stimulation

Adam Lichtl Founder, Delta Brain Inc.
Adam Lichtl recently founded Delta Brain Inc. with the goal of using cutting-edge engineering and computing to facilitate healthy brain function. Before that, he was director of research at SpaceX, where he built up world-class teams in the areas of combustion simulation, machine learning, and analysis. He received his B.S. in physics from Caltech at the age of 19, followed by an MBA and a Ph.D. in computational physics from Carnegie Mellon University. In addition to his time at Delta Brain Inc. and SpaceX, Adam has held positions as a postdoctoral fellow at Brookhaven National Lab, and as a quant at Morgan Stanley overseeing global base and precious metals strategies.

Brain stimulation has been FDA approved for the modulation of a variety of drug-resistant mental disorders, ranging from extreme depression to Parkinson's, and we are just beginning to understand the neural circuitry involved. Most of the interesting circuitry is deep within the brain, making it difficult not only to stimulate, but also to monitor for improvement. We'll present the latest work in this area, as well as Delta Brain Inc.'s novel system-level approach to the brain: combining brain imaging, GPU-accelerated scientific computing, and targeted stimulation to create an end-to-end treatment protocol to generate healthy brain function in the people who need it most.

Level: All
Type: Talk
Tags: Medical Imaging; In-Situ and Scientific Visualization; Computational Physics

Day: Wednesday, 04/06
Time: 09:30 - 09:55
Location: Room 212B

S6552 - Subgridded FDTD on GPU Allows Rapid Design of Implantable and Wearable Technology

Chris Mason Product Manager, Acceleware
Highly-Rated Speaker
Chris Mason is the product manager in charge of Acceleware's accelerated electromagnetic product line. He is responsible for the development and launch of Acceleware products used by companies worldwide. Chris has 10 years of experience in developing commercial applications for GPUs and multi-core CPUs. His previous experience also includes parallelization of algorithms on digital signal processors for cellular phones and base stations. His specialty is in electromagnetic simulations, medical imaging, signal processing, and linear algebra. Chris has an M.S. in electrical engineering from Stanford University.

Join Acceleware and SPEAG/Zurich MedTech to learn how GPU-enabled subgridding for the finite difference time domain (FDTD) algorithm can substantially reduce runtimes for electromagnetic simulations of human interface technology. We'll focus on real-life examples, including an RF-powered contact lens, a wireless capsule endoscope, and a smart watch. We'll also outline the basics of the subgridding algorithm along with the GPU implementation and the development challenges. Performance results will illustrate the significant reduction in computation times when using a localized subgridded mesh running on an NVIDIA Tesla GPU.
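For context, the standard Yee leapfrog update at the heart of FDTD, shown here in simplified form (the talk's contribution is the subgridding, not this update), is

    H^{n+1/2} = H^{n-1/2} - (\Delta t / \mu) \, \nabla \times E^{n},
    E^{n+1}   = E^{n} + (\Delta t / \varepsilon) \, \nabla \times H^{n+1/2},

with stability governed by the Courant condition, \Delta t \le \Delta x / (c \sqrt{3}) for a uniform 3D grid. A subgridded mesh therefore refines \Delta x (and with it \Delta t) only around small geometric features such as a contact lens antenna, instead of paying for refinement over the whole domain.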

Level: Intermediate
Type: Talk
Tags: Signal & Audio Processing; Product & Building Design; Computational Biology

Day: Wednesday, 04/06
Time: 09:30 - 09:55
Location: Marriott Salon 2

S6589 - Algorithmic Trading Strategy Performance Improvement Using Deep Learning

Masahiko Todoriki Assistant Vice President, Mizuho Securities Co., Ltd.
Masahiko Todoriki is the project manager and lead developer of the AI platform project in Mizuho Securities Co., Ltd.'s Wholesale IT Strategy Department. He works as a quantitative analyst in the Sales Trading Department, providing feedback from research and analysis using the latest technologies. From 2009 to 2014, he worked on the development, team management, and execution performance analysis of algorithmic trading strategies at Mizuho. Prior to Mizuho, he ran a firm consulting on and developing algorithmic trading strategies for FX and commodity futures. Masahiko majored in pure physics at Waseda University.

Learn how we improved stock price prediction accuracy while cutting computation time from two hours to 30 minutes for instruments listed on the Tokyo Stock Exchange. We used GPGPUs to speed up data preprocessing and deep learning. As part of the training dataset, we used ticks (every single trade) and quotes (every single change in the order book) to detect micro changes in the market. As a result, we consistently achieve better accuracy than the historical probability.

Level: Intermediate
Type: Talk
Tags: Finance; Deep Learning & Artificial Intelligence; Big Data Analytics

Day: Wednesday, 04/06
Time: 09:30 - 09:55
Location: Marriott Salon 1

S6123 - Effects of GPU, AAD and XVA on the Future Computing Architecture of Banks

Pierre Spatz Head of Quantitative Research, Murex
Pierre Spatz heads the quantitative analysis team of Murex, a world leader in trading and risk management software. He holds a M.S. in computer engineering and applied mathematics from ENSIMAG in Grenoble, France.

The 2008 crisis fundamentally changed the way banks approach financial computing. While the complexity and diversity of traded products have been reduced, volumes and regulatory computation needs have exploded, budgets have become tight, and we see no relief ahead. Several solutions, including the GPU (a powerful parallel coprocessor), AAD (adjoint algorithmic differentiation), or both together, have been implemented to cope with today's workload. All these methods imply at least a partial rewrite of the code. We'll revisit our experience, see how well each solution fits different test cases on current or future hardware, and extrapolate what the future calculation servers of banks will look like.

Level: All
Type: Talk
Tags: Finance

Day: Wednesday, 04/06
Time: 10:00 - 10:50
Location: Marriott Salon 1

S6175 - Scientific Simulations on Thousands of GPUs with Performance Portability

Alan Gray Research Architect, EPCC, The University of Edinburgh
Alan Gray was awarded the status of NVIDIA CUDA Fellow in 2014. His research career began in the area of theoretical physics: his Ph.D. thesis was awarded the UK-wide Ogden Prize in 2004 for the best thesis in particle physics phenomenology. He continued this work under a university fellowship at The Ohio State University, before moving to EPCC in 2005. His current research focuses on the exploitation of GPUs to the benefit of real scientific and industrial applications: he has a particular interest in the programming of large-scale, GPU-accelerated supercomputers. Alan leads EPCC's GPU-related activities and is involved in management, teaching, and supervision for the EPCC M.S. in high performance computing. Since 2003, he has authored more than 40 publications, many of them in refereed journals, which together have received over 1,500 citations.

"Developing your application for GPUs destroys portability to other platforms." We'll debunk this and other myths as we describe how we have solved the performance-portability challenge, allowing two separate scientific applications (which simulate complex fluids and fundamental particle physics, respectively) to effectively utilize machines such as the world's largest GPU-accelerated supercomputer, Titan at Oak Ridge, while remaining completely portable to multi-core or many-core CPU-based systems when GPUs are unavailable. The key ingredient is a new simplistic abstraction layer called targetDP, which targets data parallel hardware in a platform-agnostic but performance-portable manner.

Level: Beginner
Type: Talk
Tags: Supercomputing & HPC; Computational Physics

Day: Wednesday, 04/06
Time: 10:00 - 10:25
Location: Room 211A

S6198 - The Latest in High Performance Desktops with VMware Horizon and NVIDIA GRID™ vGPU

Pat Lee Sr. Director, Remote Experience, VMware
Pat Lee is the Senior Director, Mobile Experience for VMware Desktop and Application products. The Mobile Experience team is responsible for 3D graphics, remote display protocols, remote device access, desktop clients, thin clients, web clients, and mobile clients. Since joining VMware in 2007, Pat has held multiple roles in product management and product marketing. Prior to VMware, Pat held multiple product management and marketing roles at Dantz Development and EMC. Pat earned a BA in Physics from the University of California, Berkeley.
Luke Wignall Manager, GRID Performance Engineering, NVIDIA
Highly-Rated Speaker
Luke came to NVIDIA after working as an owner of an integrator/VAR, as a sales engineer, solution architect, consultant, and system administrator with both VMware and Citrix technologies in both public and private industry. An early evangelist of virtualization, Luke saw the ability to bring GPU to the end user experience as the missing "special sauce" that brings virtual desktops to the next level. Now managing the NVIDIA GRID Performance Engineering Lab, his focus is on performance and scalability to deliver the best value with the highest end user experience across all virtual workloads.

Hear about the latest advances in 3D desktops with VMware Horizon and NVIDIA GRID. Until now, delivering high-performance graphics workstations remotely was cost prohibitive and complicated to set up and deliver. With VMware Horizon and NVIDIA GRID vGPU, there has never been a better time to deliver high-performance 3D desktops in a cost-effective manner that is simple to set up and deploy, bringing your customers the security, performance, reliability, and collaboration needed to transform their business.

Level: Intermediate
Type: Talk
Tags: Graphics Virtualization; Product & Building Design

Day: Wednesday, 04/06
Time: 10:00 - 10:50
Location: Marriott Salon 4

S6212 - Complex Application Proxy Implementation on the GPU Through Use of Kokkos and Legion

Geoff Womeldorff Scientist, Los Alamos National Laboratory
Geoff Womeldorff is a computational scientist with a background in mathematics and centroidal Voronoi tessellations. He has experience in numerical methods and parallel frameworks for multi-scale ocean models, and in algorithms for communication aggregation for them. His interests also include the coupling between proxy applications, their hosts, and programming models, as well as codesign interactions and parallel frameworks and algorithms in general.

We'll present research on the implementation, performance, and optimization of a complex application kernel, dim3_sweep of SNAP, a neutral particle transport proxy, in CUDA through the use of the Kokkos programming model. Examples will be given of kernel performance measurements and optimization techniques enabled through the use of Kokkos. In addition, we'll discuss efforts to couple the coarse-grained parallelism of SNAP, as implemented in Legion, a task-based programming model, and the fine-grained aspects, as implemented in Kokkos and CUDA, and how that coupling compares and contrasts to the native MPI+OpenMP of SNAP.

Level: Advanced
Type: Talk
Tags: Tools & Libraries; Computational Physics; Performance Optimization

Day: Wednesday, 04/06
Time: 10:00 - 10:25
Location: Room 211B

S6252 - Developing Software Architectures for Autonomous Driving Vehicles

Karsten Hoffmeister Senior Manager, Platform for Autonomous Driving , Elektrobit
Karsten Hoffmeister is a senior manager at Elektrobit's (EB) Automotive Consulting Unit, where he is working in a cross-functional team supporting EB's platform for autonomous driving. He also consults on ADAS projects for German OEMs. He has been with EB since 2002 in various roles, including product manager for EB Tresos, for which he was responsible for AUTOSAR Basic Software. He served as chief engineer and team manager for an engineering service team in Tokyo, consulting for Japanese OEMs and tier one suppliers. Most recently, he led an EB subsidiary as branch office manager in Radolfzell, Germany. He holds two degrees from the University of Applied Science in Wolfenbuettel, an M.S. in vehicle systems and an advanced academic degree in computer engineering.

Modern vehicle functions like advanced driver assistance systems (ADAS) or even fully autonomous driving have a rapidly growing demand for high-performance computing power. To fulfill the fail-operational requirements of autonomous driving functions, the next generation of vehicle infrastructure platforms has to ensure the execution of safety-critical functions with high reliability. In addition, the "always connected" feature needed for autonomous driving should be protected by powerful security mechanisms. We'll show how the requirements of ADAS can be fulfilled efficiently, on both the system and software architecture levels, using the example of automated valet parking from Elektrobit.

Level: Intermediate
Type: Talk
Tags: Self-Driving Cars & Automotive

Day: Wednesday, 04/06
Time: 10:00 - 10:25
Location: Room LL21E

S6272 - Deep Learning Algorithms for Recognising the Features of Facial Ageing

Konstantin Kiselev Data Scientist, Facebrain, Inc
Konstantin Kiselev conducts computer vision and deep learning research at the scientific startup Facebrain (co-founded by Dr. Alex Zhavoronkov), and holds the position of lead data scientist on a big data project for TechnoServ, a top-5 Russian IT company, and Beeline, a top-3 Russian mobile operator. Konstantin holds an M.S. in theoretical physics from Lomonosov Moscow State University. He has broad experience in the development of high-load software systems and extensive knowledge of machine learning and big data. From 2014 to 2015, he was development lead for large IT systems for the Russian government at LANIT, a leading Russian software company. He received additional education in big data and machine learning, took first place in the Microsoft Machine Learning Hackathon (June 2015), and participated in the deep learning team competition organized by MIPT (deephack.me, mipt.ru/en/, July 2015).

We'll discuss DNN applications for determining the main facial skin biomarkers from a face photo. While there are many other factors that make it possible to determine human age with high accuracy, the most obvious is how your face looks. Tracking facial wrinkles lets us follow not only the skin ageing process itself, but also the results and efficiency of any treatment used. By following the dynamics of wrinkle appearance, it is possible to find out which treatment is more suitable for a particular face or skin type, and hence provide recommendations.

Level: Intermediate
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Computer Vision & Machine Vision

Day: Wednesday, 04/06
Time: 10:00 - 10:25
Location: Room 210G

S6274 - Unleash Your Material Render Capabilities with Substance Designer, Substance Painter and NVIDIA Iray®

Jerôme Derel Chief Product Officer, Allegorithmic
Engineer and product designer Jerome Derel joined Allegorithmic in 2014 as chief product officer. Jerome worked for seven years at Dassault Systemes as a visualization expert in the Design Studio and CATIA Design teams, leading projects that produced high-quality virtual materials.
Alexis Khouri VP WW Sales, Allegorithmic
Alexis Khouri joined Allegorithmic in 2007 and has been deeply involved in the product design of Substance, the new generation of texturing tools. He is also supervising business development in North America and Japan and partnerships with other vendors. He was formerly general manager at Playsoft, a mobile game developer, and a senior consultant at Simon-Kucher & Partners. He also wrote on a regular basis for the French business newspaper Les Echos on the video game industry. Alexis graduated from ESSEC Business School in 2003 and has a strong technical background in programming and production pipelines.

Allegorithmic has been integrating the Iray render engine to combine its expertise in procedural texture rendering (a.k.a. substances) with multi-layered MDL materials and Iray. Create your MDL and substance materials from scratch in Substance Designer's node-tree editor, and use Substance Painter to 3D paint any object with multi-mask management. Both materials and masks can then be exported to your preferred 3D software (Maya, 3ds Max, Rhino), enabling infinite capabilities for material rendering.

Level: All
Type: Talk
Tags: Product & Building Design; Rendering & Ray Tracing; Game Development

Day: Wednesday, 04/06
Time: 10:00 - 10:50
Location: Room LL21A

S6363 - Algorithms for Auto-Tuning OpenACC Accelerated Kernels

Saber Feki Computational Scientist, King Abdullah University of Science and Technology
Saber Feki is a computational scientist at the KAUST Supercomputing Laboratory, where he contributed to the procurement of the Shaheen XC40 supercomputer. He received a Ph.D. in computer science from the University of Houston in 2010. He then joined the oil and gas company TOTAL as an HPC research scientist. His research interests include seismic imaging and computational electromagnetic applications using different programming models and automatic performance tuning of MPI communications and OpenACC accelerated applications on GPUs.

We'll present optimization techniques using different machine learning and derivative-free search algorithms, individually and in hybrid combinations, for auto-tuning parameters in OpenACC clauses for a stencil evaluation kernel executed on GPUs. We compare execution time performance of several auto-tuning techniques. These optimization algorithms will be evaluated over a large two-dimensional parameter space not satisfactorily addressed to date by OpenACC compilers, consisting of gang size and vector length. A hybrid of historic learning and Nelder-Mead delivers the best balance of high performance and low tuning effort.
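For concreteness, here is a minimal sketch (assuming a simple five-point stencil; GANG and VECTOR stand in for the two parameters the auto-tuner searches) of how the two-dimensional tuning space appears in OpenACC's num_gangs and vector_length clauses:

    // Candidate tuning point: the auto-tuner varies GANG and VECTOR and
    // measures the kernel's execution time at each point.
    #define GANG   256
    #define VECTOR 128

    void stencil(int nx, int ny, const float *in, float *out) {
    #pragma acc parallel loop gang num_gangs(GANG) vector_length(VECTOR) \
                copyin(in[0:nx*ny]) copyout(out[0:nx*ny])
        for (int j = 1; j < ny - 1; ++j) {
    #pragma acc loop vector
            for (int i = 1; i < nx - 1; ++i)
                out[j*nx + i] = 0.25f * (in[j*nx + i - 1] + in[j*nx + i + 1]
                                       + in[(j-1)*nx + i] + in[(j+1)*nx + i]);
        }
    }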

Level: Intermediate
Type: Talk
Tags: Performance Optimization; Programming Languages; Algorithms; OpenACC

Day: Wednesday, 04/06
Time: 10:00 - 10:25
Location: Room 212A

S6368 - Hardware Architecture Considerations for Building Efficient GPU Cluster for Accelerated CNN Training

Jianxiong YIN Research & Development Engineer, Nanyang Technological University
Jianxiong Yin is a data communication system researcher at Nanyang Technological University (NTU), Singapore, where he researches and optimizes system architecture for improved performance and efficiency. Jianxiong's work in the system architecture domain has been recognized by top-tier conferences and reputable supercomputing competitions: his work won the industry award for the Cloud3DView Project, and his team received the 2015 Data Center Dynamics Asia Pacific Award. Jianxiong is now responsible for the development of deep learning infrastructure in the ROSE lab at NTU, Singapore, and jointly works with the NVIDIA Technology Centre in Singapore on deep learning and HPC application development. Jianxiong received his M.S. from Yonsei University, South Korea, in 2012, and a B.S. from South China University of Technology in 2009.
Pradeep Gupta Senior Solutions Architect, NVIDIA
Pradeep Gupta is a lead deep learning solutions architect at NVIDIA, where he supports customers and developers across the Asia Pacific, Japan, and India regions for deep learning and HPC application development. Pradeep also works to enable the GPU computing ecosystem in universities and research labs across the region. Pradeep is responsible for running and managing R&D projects at the NVIDIA Technology Centre in Singapore. He is working on smart cities enablement with the GPU computing initiative at NVIDIA. Before joining NVIDIA, he worked with various technologies in high performance computing domains. Pradeep received an M.S. in research from the Indian Institute of Science (IISc), Bangalore. His research focused on developing compute-efficient algorithms. He has numerous publications in IEEE, SPIE, and other reputed conferences.

Learn how differences in hardware architecture for training infrastructure affect the CNN training process; the design principles for building an efficient CNN training cluster; the key metrics you should be watching; and how the reference architecture has evolved from traditional IT server architecture to HPC architecture.

Level: Intermediate
Type: Talk
Tags: Data Center & Cloud Computing; Deep Learning & Artificial Intelligence; Supercomputing & HPC

Day: Wednesday, 04/06
Time: 10:00 - 10:50
Location: Room 210E

S6374 - Gunrock: A Fast and Programmable Multi-GPU Graph Processing Library

Yangzihao Wang Graduate Student Researcher, University of California Davis
Yangzihao Wang is a Ph.D. candidate at UC Davis, supervised by Professor John D. Owens, researching GPU graph processing. He felt the pain of coding and optimizing individual graph algorithms on the GPU and wanted a unified framework with both high performance and easy programmability, which became the Gunrock library.
Yuechao Pan Graduate Student Researcher, University of California Davis
Yuechao Pan is a Ph.D. student at UC Davis, also from Professor John D. Owens' group, focusing on multi-GPU graph processing. He designed and implemented the multi-GPU framework of Gunrock, which brought the performance and the flexibility of the library to a new level.

We present Gunrock, a multi-GPU graph processing library that enables easy graph algorithm implementation and extension onto multiple GPUs for scalable performance on large graphs with billions of edges. Attendees can learn how to 1) solve large-scale graph problems with high-performance GPU computing primitives and optimization strategies, using our high-level, data-centric abstraction that focuses on vertex or edge frontier operations, and 2) utilize multi-GPU computing power with just a few algorithm-dependent blocks, using our multi-GPU framework that handles most multi-GPU implementation details and memory allocation. We'll also share our experience of the library's design and implementation choices that help it achieve the best performance among programmable GPU graph libraries.
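For readers new to the data-centric abstraction, the sketch below is illustrative only -- it is not Gunrock's actual API -- showing the kind of per-edge "advance" operation over a CSR graph that such an abstraction lets users express in a few algorithm-dependent blocks:

    // Illustrative CUDA frontier-advance kernel (not Gunrock's API): expand a
    // vertex frontier over a CSR graph, labeling unvisited neighbors. Assumes
    // labels[] is initialized to -1 and frontier_out is sized for the worst case.
    __global__ void advance(const int *row_offsets, const int *col_indices,
                            const int *frontier_in, int frontier_size,
                            int *labels, int *frontier_out, int *out_size,
                            int iteration) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= frontier_size) return;
        int v = frontier_in[t];
        for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
            int u = col_indices[e];
            // Claim unvisited neighbors and append them to the next frontier.
            if (atomicCAS(&labels[u], -1, iteration) == -1)
                frontier_out[atomicAdd(out_size, 1)] = u;
        }
    }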

Level: Intermediate
Type: Talk
Tags: Big Data Analytics; Tools & Libraries; Supercomputing & HPC

Day: Wednesday, 04/06
Time: 10:00 - 10:25
Location: Room 210F

S6435 - Analyzing the Behavior of Memory Accesses on the NVIDIA Jetson TK1

Mohammad Dashti Ph.D. Student, University of British Columbia
Mohammad Dashti is a Ph.D. student at the University of British Columbia. He holds an M.S. in computer science from Simon Fraser University and an M.S. in mobile communications from King's College London. He received his B.S. in computer engineering from Old Dominion University. His research focuses on operating systems, GPGPU, and heterogeneous CPU/GPU systems.

We'll describe our analysis of global memory allocation methods on a CUDA-capable integrated CPU-GPU system (Jetson TK1). We show that even though the memory is physically the same, global memory accesses can heavily impact performance depending on which runtime calls are used to allocate data. Furthermore, we implement a new global allocation method that is superior to the current schemes. Our experiments show that a naive choice of allocation method on the Jetson TK1 can degrade overall performance by more than 9.5x; our proposed method does not suffer this overhead. Finally, we show that utilizing the new method for concurrent CPU-GPU workloads can achieve large performance improvements.
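For reference, here is a sketch of the stock CUDA allocation paths such an analysis compares (this is not the authors' proposed allocator) on an integrated board, where all three ultimately draw on the same physical DRAM:

    #include <cuda_runtime.h>

    __global__ void scale(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main() {
        cudaSetDeviceFlags(cudaDeviceMapHost);   // allow zero-copy mapping
        const int n = 1 << 20;

        // 1) Discrete-style: device allocation plus explicit cudaMemcpy calls.
        float *d;  cudaMalloc(&d, n * sizeof(float));
        // 2) Zero-copy: pinned host memory mapped into the GPU address space.
        float *h;  cudaHostAlloc(&h, n * sizeof(float), cudaHostAllocMapped);
        float *hd; cudaHostGetDevicePointer(&hd, h, 0);
        // 3) Unified memory: one managed pointer migrated by the runtime.
        float *m;  cudaMallocManaged(&m, n * sizeof(float));

        scale<<<(n + 255) / 256, 256>>>(hd, n);  // kernel reads host DRAM directly
        cudaDeviceSynchronize();

        cudaFree(d); cudaFreeHost(h); cudaFree(m);
        return 0;
    }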

Level: Advanced
Type: Talk
Tags: Embedded; Performance Optimization; IoT

Day: Wednesday, 04/06
Time: 10:00 - 10:25
Location: Room LL20D

S6451 - Local Statistical Filtering through Concurrent Domain Dissection for Medical Imaging

Nikos Pitsianis Assistant Professor, Aristotle University of Thessaloniki, Greece
Nikos Pitsianis is an assistant professor at the Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki, Greece, and an adjunct professor with the Department of Computer Science, Duke University, Durham, North Carolina. His research interests include high-performance algorithms and architectures for signal and image processing. He holds a Ph.D. in Computer Science from Cornell University.

We'll present a new parallel scheme for local statistical filtering (LSF), which is indispensable to high-fidelity medical image analysis but still resists efficient solutions due to range-value dependencies and irregular data accesses. The new scheme maintains a high degree of concurrency and makes efficient use of advanced GPU/CUDA features. Experimental results are presented with 4D-CT images and associated deformation fields.

Level: Intermediate
Type: Talk
Tags: Medical Imaging; Algorithms

Day: Wednesday, 04/06
Time: 10:00 - 10:25
Location: Room 212B

S6461 - Advancements in a GPU Monte Carlo Simulator for Radiotherapy

Nick Henderson Research Associate, Stanford University
Highly-Rated Speaker
Nick Henderson is a Research Associate and Instructor at Stanford University. His primary affiliation is with Stanford's Institute for Computational and Mathematical Engineering.

We'll describe several advancements in our efforts to build a high-performance GPU Monte Carlo simulator for radiotherapy. The central idea is an algorithm that reduces thread divergence and run-time memory requirements compared to previous methods. The method presented also enables extensions to other applications, such as the nanoscale interaction of DNA and ionizing radiation, which require a larger number of physics models and particle types. Details of the performance analysis may also be applied to other Monte Carlo methods that rely on process selection as part of the simulation.

Level: Intermediate
Type: Talk
Tags: Computational Physics; Performance Optimization; Medical Imaging

Day: Wednesday, 04/06
Time: 10:00 - 10:25
Location: Marriott Salon 6

S6618 - Trinity: A Novel Visualization and Data Distribution System

Jens Krüger Professor, CoViDAG, University Duisburg-Essen
Since 2013, Jens Kruger has been chair of the high performance computing group at the University of Duisburg-Essen. He also holds an adjunct professorship from the University of Utah and is a principal investigator of multiple projects in the Intel Visual Computing Institute at Saarland University. Jens studied computer science at the Rheinisch-Westfalische Technische Hochschule Aachen, where he received his diploma in 2002. In 2006, he finished his Ph.D. at the Technische Universitat Munchen and, after postdoc positions in Munich and at the Scientific Computing and Imaging Institute, he became a research assistant professor at the University of Utah. In 2009, he joined the Cluster of Excellence Multimodal Computing and Interaction at Saarland University to head the Interactive Visualization and Data Analysis group.
Andrey Krekhov Head of HCI Department, CoViDAG, University Duisburg-Essen
Andrey Krekhov is employed at the HPC Group in Duisburg-Essen and heads the Human-Computer Interaction department of the "Center of Visual Data Analysis and Computer Graphics - CoViDAG". Andrey received his B.S. and M.S., with honors, in computer science from the Saarland University in 2011 and 2012, respectively.

Scalability matters today more than ever, considering all the available computation resources, GPU farms, and cloud solutions. Designing a highly adaptive yet user- and developer-friendly visualization system requires us to rethink existing visualization pipelines and to develop with scalability in mind. This session gives you insight into our novel "Trinity" system, which separates frontend, processing, and data nodes, interconnected by a simple, easy-to-use API. Browse datasets in the cloud, render them on an NVIDIA GPU cluster, and display the result on your phone or on a display wall -- the plethora of application scenarios takes visualization to a whole new level.

Level: Intermediate
Type: Talk
Tags: In-Situ and Scientific Visualization

Day: Wednesday, 04/06
Time: 10:00 - 10:50
Location: Room LL21D

S6623 - Advances in NAMD GPU Performance

Antti-Pekka Hynninen Computational Scientist, Oak Ridge National Laboratory
Antti-Pekka is a computational scientist in biophysics at Oak Ridge National Laboratory (ORNL), where he focuses on software development and INCITE user support for the NAMD biomolecular modeling application. In particular, Antti-Pekka is interested in using GPUs to their fullest potential to enable fast and scalable molecular dynamics. Prior to joining ORNL in 2014, Antti-Pekka worked at the National Renewable Energy Laboratory, where he rewrote much of the CHARMM molecular dynamics engine to be faster and more parallel, and to support GPU acceleration. Antti-Pekka holds a Ph.D. in physics from Utrecht University and did his postdoctoral research at Princeton University on Monte Carlo simulations of charged colloids.

Learn about recent performance improvements in the GPU acceleration of the NAMD biomolecular modeling application. These improvements include performance gains in the non-bonded CUDA kernels and a new GPU-only implementation of the Particle Mesh Ewald (PME) reciprocal computation. We'll describe in detail the changes made in the non-bonded CUDA kernels that give 1.4-1.7 times better performance compared to the previous version, and the new PME reciprocal code, which enables computation on multiple GPUs and is 1.4-1.8 times faster than the previous code.

Level: Intermediate
Type: Talk
Tags: Computational Chemistry; Performance Optimization; Supercomputing & HPC

Day: Wednesday, 04/06
Time: 10:00 - 10:25
Location: Marriott Salon 5

S6635 - Portable Performance for Monte Carlo Simulations of Photon Migration in 3D Turbid Media for Single and Multiple GPUs

Leiming Yu Ph.D. Candidate, Computer Engineering, Northeastern University
Leiming Yu is a Ph.D. candidate in computer engineering at Northeastern University. He belongs to the Northeastern University Computer Architecture Research group (NUCAR) under the supervision of Dr. David Kaeli. He has been involved in general-purpose computing on GPUs, performance optimization and modeling, and high performance computing. He has been published in venues such as the International Workshop on OpenCL 2015, ALLDATA 2015, Proceedings of the Workshop on General Purpose Processing Using GPUs (ACM 2015), the Boston Area Architecture Workshop 2015, and ICPE 2015.
Fanny Nina Paravecino Ph.D. Candidate, Computer Engineering, Northeastern University
Fanny Nina Paravecino is a Ph.D. candidate in computer engineering at Northeastern University. She belongs to the Northeastern University Computer Architecture Research group (NUCAR) under the supervision of Dr. David Kaeli. She received her B.S. summa cum laude in computer engineering from the University of San Antonio Abad of Cusco, Peru, in 2005, and an M.S. in computer engineering from the University of Puerto Rico at Mayaguez in 2011. She achieved the best grade for her undergraduate thesis, "Virtual Framework to Simulate an Industrial Robot," which used OpenGL 3D graphics with C#. Her research interests focus on high-performance optimization of image processing algorithms on parallel architectures. She has been published in venues such as IWOCL, ICPE, BARC, ICCVG, SPIE, and Gordon-CENSSIS, among others. She has also been highlighted in Women & CUDA on the NVIDIA website.

We present a parallel Monte Carlo (MCX) algorithm accelerated by GPUs for modeling time-resolved photon migration in 3D turbid media. We'll present optimizations that benefit execution on a single GPU as well as on multiple GPUs. By leveraging persistent threads, our single-GPU implementation provides a high-performance parallel simulation of MCX when run on an NVIDIA GPU, and it is automatically tuned to leverage persistent threads on different GPU architectures. We achieved improvements of over 25% on the Kepler architecture and 12% on Maxwell compared to a heuristic approach. In addition, we propose a linear programming approach based on predictive modeling to optimize MCX execution across multiple devices.
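As background on the persistent-threads pattern, here is a minimal sketch (simulate_photon() is a hypothetical stand-in for MCX's real photon kernel): rather than launching one thread per photon, a fixed set of resident blocks drains a global work counter.

    #include <cuda_runtime.h>

    __device__ unsigned int next_photon = 0;   // global work counter

    __device__ void simulate_photon(unsigned int id) { /* trace one photon */ }

    __global__ void mcx_persistent(unsigned int total_photons) {
        // Each thread keeps fetching photons until the counter is drained.
        while (true) {
            unsigned int id = atomicAdd(&next_photon, 1u);
            if (id >= total_photons) return;
            simulate_photon(id);
        }
    }

    int main() {
        // Launch only as many blocks as run concurrently on the device; the
        // tuned block count is exactly what varies across Kepler and Maxwell.
        mcx_persistent<<<32, 256>>>(1u << 20);
        cudaDeviceSynchronize();
        return 0;
    }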

Level: Intermediate
Type: Talk
Tags: Algorithms; Performance Optimization; Rendering & Ray Tracing

Day: Wednesday, 04/06
Time: 10:00 - 10:25
Location: Marriott Salon 3

S6641 - HD GP-GPU Systems for HPC Applications

Sergio Tafur Physicist, Naval Research Laboratory
Sergio Tafur is a physicist working in the Computational Science Division of the Naval Research Laboratory, Code 5594. Sergio holds a Ph.D. in physics from the University of Central Florida and was inducted into his alma mater's Order of Pegasus in 2010. He served as an XSEDE Campus Champion from 2011 to 2013, and since 2004 has supported optimizing scientific and engineering computing workflows, implementing and administering HPC/HTC supercomputing systems, establishing research networks, and parallelizing scientific computing algorithms and visualization efforts. Sergio helps NRL's computational science community resolve challenges in traditional and non-traditional supercomputing, high performance, and high throughput computing workflows by leveraging existing and emerging technologies such as MPI, CUDA, and Intel MIC computing environments.

We'll present how we fielded a high-density (HD) GP-GPU system, currently ranked 227th on the TOP500 list, evaluated its performance, and overcame challenges that arose during the testing phases. In addition, we'll touch on using Python to code for, and "glue" together, CPUs and GP-GPUs in such HD GP-GPU systems.

Level: All
Type: Talk
Tags: Aerospace & Defense; Supercomputing & HPC; Algorithms

Day: Wednesday, 04/06
Time: 10:00 - 10:25
Location: Marriott Salon 2

S6759 - CAVE 2.0: The World's Largest Virtual Reality Cluster at PSA Peugeot Citroën

Matthieu Mika Virtual Reality Engineer, PSA Peugeot Citroën
Matthieu Mika has worked in PSA Peugeot's R&D department since 2009, starting as a virtual reality engineer. Since 2013, he has been involved in the areas of real-time rendering for immersive systems and the new CAVE project. He is a graduate of the Polytech school of the University of Paris Sud Orsay with a master's degree in computer science and engineering.
Alain Gonzalez Expert Workstations Graphics Systems & 3D Imagery, PSA Peugeot Citroën
Alain Gonzalez has worked in PSA Peugeot's IT department since 2000, starting as a workstations IT architect. Since 2009, he has been involved in the areas of expert workstations graphics technologies and 3-D imagery. He is a graduate of the University of Paris Sud Orsay with a master's degree in computer science and engineering.
Benoit Bastien Workstation Sales Lead - France, Dell
Benoit Bastien has worked as a workstation expert for several OEMs since 2006. Currently leading the professional workstation business for Dell in France, he is involved in multiple 3D-related projects. He is a graduate of Clermont-Ferrand Business School with a master's degree in IT management.

4K, stereoscopy, 53 million pixels, 400 TFLOPS, a 56-Gbit high-speed network, 10 tons of steel, 5 tons of glass, 2 miles of optical fiber, 3 miles of network cables, and... 70 NVIDIA Quadro M6000 professional GPUs. This presentation will recap how PSA Peugeot Citroën managed a CAVE 2.0 implementation, covering all aspects from end-user requirements to the final setup in order to meet all stakeholder needs: vehicle architecture, process, HMI, car interior and exterior styling, and perceived quality.

Level: All
Type: Talk
Tags: Product & Building Design; Virtual Reality & Augmented Reality; Large Scale and Multi-Display Visualization

Day: Wednesday, 04/06
Time: 10:00 - 10:50
Location: Room LL21A

S6225 - Efficient Utilization of Large-Scale Heterogeneous Systems Using the Uintah Computational Framework

Alan Humphrey Software Developer and Ph.D. Student, Scientific Computing and Imaging Institute, University of Utah
Alan Humphrey is a software developer at the Scientific Computing and Imaging Institute and also a Ph.D. student at the University of Utah, where he works with Dr. Martin Berzins on improving the performance and scalability of the Uintah Computational Framework. Alan has been primarily involved in extending Uintah to run on hybrid CPU/GPU systems with the development of Uintah's prototype CPU-GPU task scheduler and, most recently, Uintah's unified multi-threaded heterogeneous task scheduler and runtime system, which allows Uintah to dynamically dispatch computational tasks to both CPU cores and available GPUs on-node. Much of Alan's past research focused on formal verification of concurrent systems, specifically the Message Passing Interface (MPI) and dynamic verification tools like In-situ Partial Order (University of Utah) and its integration within the Eclipse Parallel Tools Platform (PTP). Alan was also involved with the Eclipse PTP project from 2009 to 2015.

We'll discuss how directed acyclic graph (DAG) approaches provide a powerful abstraction for solving challenging engineering problems and how using this abstraction and DAG approach, computational frameworks such as Uintah can be extended with relative ease to efficiently leverage GPUs, even at scale. Attendees will learn how frameworks like Uintah are able to shield the application developer from the complexities of the deep memory hierarchies and multiple levels of parallelism found in heterogeneous supercomputers. Attendees will be shown how Uintah applications can be made to utilize thousands of GPUs within a single simulation, as shown by recent results for a GPU-based radiation model that achieves excellent strong scaling to 16,384 GPUs on DOE Titan.

Level: All
Type: Talk
Tags: Supercomputing & HPC; Computational Fluid Dynamics

Day: Wednesday, 04/06
Time: 10:30 - 10:55
Location: Room 211A

S6305 - 3D Point Cloud Registration Using GPU-Accelerated Expectation Maximization

Benjamin Eckart Ph.D. Student, Carnegie Mellon University
Benjamin Eckart is a Ph.D. candidate with the Robotics Institute at Carnegie Mellon University and an NVIDIA Graduate Fellow. Ben's research focuses on the creation of parallel algorithms for 3D robotic perception. He is currently exploring ways to use many-core architectures such as the GPU to rapidly create compact models to facilitate and unify common low-level perceptive tasks like segmentation, registration, and classification. Ben holds an M.S. in robotics from Carnegie Mellon University, an M.S. in electrical engineering from Tennessee Tech University, as well as a B.S. in computer science and a B.S. in computer engineering.

We'll discuss how to use GPUs to accelerate a common 3D spatial processing application: point cloud registration. Registration, or finding the relative rigid transform between two point clouds, forms a core component of many 3D vision algorithms, such as object matching and environment reconstruction. We accelerate this process on the GPU using a parallelized form of the Expectation Maximization (EM) algorithm. This novel EM construction can both accelerate registration and provide a natural geometric segmentation of the data, two processes that we show to be highly interrelated at the kernel level when deployed on a GPU. Finally, we discuss how GPU-accelerated registration can be used in the larger context of real-time 3D perception.
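As background, in mixture-model registration one point set parameterizes a Gaussian mixture and the E-step computes soft correspondences (the talk's construction is a parallelized variant of this standard form; T(.; \theta) denotes the rigid transform):

    \gamma_{ij} = \frac{\pi_j \, \mathcal{N}(x_i \mid T(\mu_j; \theta), \Sigma_j)}
                       {\sum_k \pi_k \, \mathcal{N}(x_i \mid T(\mu_k; \theta), \Sigma_k)}

Each of the N x J responsibilities is independent, which is why the E-step maps so naturally onto GPU threads; the M-step then re-estimates \theta from the weighted correspondences.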

Level: Advanced
Type: Talk
Tags: Robotics & Autonomous Machines; Computer Vision & Machine Vision; IoT

Day: Wednesday, 04/06
Time: 10:30 - 10:55
Location: Room LL20D

S6320 - Opticks: Optical Photon Simulation for High Energy Physics with NVIDIA OptiX™

Simon Blyth Postdoctoral Fellow, National Taiwan University
Simon is a high energy physicist and software developer based at National Taiwan University, Taipei, and a member of the Daya Bay and JUNO Collaborations. Simon has a D.Phil. in particle physics from Oxford University. His interests focus on applying techniques from computer science within the high energy physics community. He is currently working on using GPU ray tracing to accelerate optical photon simulation within photomultiplier-based experiments.

Opticks is an open source project that brings NVIDIA OptiX ray tracing to existing Geant4 toolkit-based simulations. We present the advantages of a separate optical photon simulation, the approaches developed to integrate it with the general Geant4 particle simulation, and techniques to minimize the overheads arising from this split. Challenges included bringing complex CSG geometries with wavelength-dependent material and surface properties to the GPU. We describe techniques for visualizing photon propagation, with interactive time scrubbing and history selection, using OpenGL/OptiX/Thrust interop and geometry shaders. Results and demonstrations are shown for the photomultiplier-based Daya Bay and JUNO neutrino detectors.

Level: All
Type: Talk
Tags: Computational Physics; In-Situ and Scientific Visualization; Rendering & Ray Tracing

Day: Wednesday, 04/06
Time: 10:30 - 10:55
Location: Marriott Salon 6

S6350 - State of the Art Real-Time Graphics for Events, Broadcast and Interactive Content

Erik Beaumont COO, Ventuz Technology AG
Highly-Rated Speaker
Former senior product specialist at Softimage and technical director at Animoto, Erik Beaumont joined Ventuz Technology in 2013 and became its chief operating officer in 2014.

While game engines are advancing at an incredible rate in terms of capabilities and quality, adoption of these advancements in non-games or visualization markets has been slow. We'll look at some of the ways we seek to change that and bring cutting-edge graphics techniques and capabilities to these markets. We'll talk about some of the barriers (such as making these technologies approachable for non-expert designers and artists), some of the advantages, and some of the disadvantages. We'll also explore what direction we think these technologies will go.

Level: Beginner
Type: Talk
Tags: Media & Entertainment; Large Scale and Multi-Display Visualization; Real-Time Graphics; Product & Building Design

Day: Wednesday, 04/06
Time: 10:30 - 10:55
Location: Room LL21C

S6360 - Graph Analytics: Using GPU-Accelerated Sparse Linear Algebra Routines

Paul Fox Engineer, EM Photonics, Inc.
Highly-Rated Speaker
Paul Fox has seven years of experience in GPU and heterogeneous computing. Working at EM Photonics, he has contributed to the CULA GPU linear algebra library and the ATCOM image enhancement suite. His work has recently focused on programming methods for high-performance heterogeneous computing, particularly in scientific computing and signal processing application areas. He has an M.S. in electrical engineering from University of Delaware.

Large-scale graph analytics frameworks provide a convenient and highly scalable platform for developing algorithms to analyze large datasets. Although conceptually scalable, these techniques exhibit poor performance on modern computational hardware. We're developing an implementation of the high-level functions supported by these APIs in terms of linear algebra operations, which can execute in parallel over every pair of vertices connected by an edge. This technology can reduce the number of nodes required and maps well to computational accelerators such as GPUs, enabling users to perform more complex analysis with less hardware at lower cost. We'll detail our latest work on this project, including challenges, specifics of our approach, and preliminary results.
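As a concrete instance of the formulation (a sequential C++ reference, not EM Photonics' implementation), breadth-first search frontier expansion is exactly a sparse matrix-vector product over the Boolean semiring, where (+, x) become (OR, AND):

    // One BFS step as y = A^T x over the Boolean semiring, with the adjacency
    // matrix A stored in CSR form; each edge contributes one AND, each vertex
    // accumulates with OR. Parallelizing over edges yields the GPU mapping.
    #include <vector>

    std::vector<bool> spmv_or_and(const std::vector<int> &row_offsets,
                                  const std::vector<int> &col_indices,
                                  const std::vector<bool> &frontier) {
        int n = (int)row_offsets.size() - 1;
        std::vector<bool> next(n, false);
        for (int v = 0; v < n; ++v)
            if (frontier[v])
                for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e)
                    next[col_indices[e]] = true;   // OR-accumulate
        return next;
    }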

Level: Beginner
Type: Talk
Tags: Big Data Analytics; Algorithms

Day: Wednesday, 04/06
Time: 10:30 - 10:55
Location: Room 210F

S6380 - Simulating a Quantum Annealer with GPU-Based Monte Carlo Algorithms

James King Senior Algorithms Researcher, D-Wave Systems
James King is a senior algorithms researcher at D-Wave Systems, where he works as part of the Benchmarking team evaluating D-Wave's quantum annealing processors. James was born and raised in Vancouver, Canada, and studied computer science at Waterloo (B.S.), UBC (M.S.), and McGill University (Ph.D.). His dissertation focused on computational geometry and probabilistic analysis of random data structures. During his postdoctoral research at the University of Oxford, he studied energy landscapes of discrete optimization problems.

Learn how the world's most powerful quantum computers are simulated and benchmarked using GPU-based Monte Carlo algorithms. We'll introduce D-Wave's quantum annealing platform, describe several Monte Carlo algorithms for their simulation, and compare CPU- and GPU-based implementations of these algorithms. In particular, we'll focus on considerations of memory layout and fast mathematical functions to maximize speed. Finally, we'll present benchmarking results, including CPU-based algorithms, GPU-based algorithms, and D-Wave's latest-generation quantum annealers.
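As background, both simulated annealing and path-integral quantum Monte Carlo are built from Metropolis updates of Ising spins; the sketch below shows the simplified classical building block (not D-Wave's benchmarking code; simulated quantum annealing adds a Trotterized replica dimension):

    // One Metropolis update for spin s_i of an Ising model with energy
    // E(s) = -sum_{ij} J_ij s_i s_j - sum_i h_i s_i. The caller precomputes
    // local_field = sum_j J_ij s_j + h_i for the chosen spin.
    #include <cmath>
    #include <random>

    inline void metropolis_update(int &spin, double local_field, double beta,
                                  std::mt19937 &rng) {
        double dE = 2.0 * spin * local_field;    // energy change if we flip
        std::uniform_real_distribution<double> u(0.0, 1.0);
        if (dE <= 0.0 || u(rng) < std::exp(-beta * dE))
            spin = -spin;                        // accept the flip
    }

The memory-layout and fast-math considerations in the talk concern exactly such inner loops, where coalesced access to the couplings J and fast exponentials dominate.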

Level: Beginner
Type: Talk
Tags: Algorithms; Computational Physics; Supercomputing & HPC

Day: Wednesday, 04/06
Time: 10:30 - 10:55
Location: Marriott Salon 3

S6431 - Advanced Thrust Programming with Policies

Steven Dalton Research Scientist, NVIDIA
Steven Dalton joined NVIDIA Research in July 2014. He completed his Ph.D. in computer science at UIUC, where his research focused on mapping irregular operations on sparse matrices, related to algebraic multigrid (AMG) methods, to GPU architectures. He is the primary contributor to the Cusp sparse linear algebra research library. He holds two Bachelor of Science degrees from the Georgia Institute of Technology, in physics and computer science.

We'll discuss advanced Thrust design patterns that help facilitate the construction of complex, high-performance libraries. The focus of our presentation will be the definition and use of execution policies as a means of influencing the performance and execution of Thrust routines. As part of the parallelism technical specification proposed for C++17, execution policies will be an important feature for effectively designing high-performance parallel applications in the future, so it's imperative that developers start understanding and experimenting with the execution-policy design pattern today.
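As a flavor of the pattern, the minimal sketch below (vector sizes and the stream are illustrative) shows the same Thrust algorithm steered by different execution policies, including a policy derived from thrust::cuda::par that targets a user-supplied CUDA stream:

    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/sort.h>
    #include <thrust/execution_policy.h>
    #include <cuda_runtime.h>

    int main() {
        thrust::device_vector<int> d(1 << 20, 1);
        thrust::host_vector<int>   h(1 << 20, 1);

        thrust::sort(thrust::device, d.begin(), d.end()); // explicit device policy
        thrust::sort(thrust::host,   h.begin(), h.end()); // same algorithm, CPU backend

        cudaStream_t s;
        cudaStreamCreate(&s);
        // Derived policy: execute the algorithm on a user-supplied stream.
        thrust::sort(thrust::cuda::par.on(s), d.begin(), d.end());
        cudaStreamSynchronize(s);
        cudaStreamDestroy(s);
        return 0;
    }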

Level: Intermediate
Type: Talk
Tags: Tools & Libraries; Algorithms; Performance Optimization

Day: Wednesday, 04/06
Time: 10:30 - 10:55
Location: Room 211B

S6434 - Missile Defense Radar through Real-Time Electromagnetic Simulation Injection

Ted Selig Director and COO, FishEye Software, Inc.
Ted Selig is focused on technology exploiting the flow and analysis of real-time system data. Ted is director and COO of FishEye Software, a supplier of real-time software and real-time software development, integration, and test for some of the world's most complex civil and national defense systems. His commercial and government experience includes real-time systems, railway information, computer networks, air traffic control, and phase-array radar systems. He holds a B.S. in electrical engineering from the University of Massachusetts and an M.S. in computer systems from Northeastern University. Ted is an inventor on two U.S. patents.

Radars are electromagnetic sensors that encode transmit signals, focus beams, extract targets from noise, and perceive targets and environments. These real-time systems are expensive and risky to build and operate because they are complex and difficult to test. The evolution of the GPU has the potential to disrupt this sensor industry by dramatically reducing the cost of radars, accelerating innovation, and reducing sensor maintenance. The presentation will discuss the processing techniques and data flow architecture required by these sensors, and explore how GPU adoption can not only reduce the costs and risks of sensor development for missile defense, but also enable low-cost applications like the self-driving car, weather sensing, and air traffic management.

Level: All
Type: Talk
Tags: Aerospace & Defense; Embedded; Signal & Audio Processing

Day: Wednesday, 04/06
Time: 10:30 - 10:55
Location: Marriott Salon 2

S6449 - Sustainability and Performance through Kokkos: A Case Study with LAMMPS

Christian Trott Senior Member of Technical Staff, Sandia National Laboratories
Christian Trott is a high performance computing expert with experience in designing and implementing software for GPU and MIC compute clusters. He earned a Ph.D. from the University of Technology Ilmenau in theoretical physics. Christian's prior scientific work focused on computational materials research using ab initio calculations, molecular dynamics simulations, and Monte Carlo methods. As of 2015, Christian is a senior member of technical staff at Sandia National Laboratories. He is a core developer of the Kokkos programming model, with a large role in advising applications on adopting Kokkos to achieve performance portability for next-generation supercomputers.

Learn about strategies to keep codes maintainable and performant in a diverse high performance computing environment. Using the example of LAMMPS, we'll demonstrate how the use of Kokkos can reduce code redundancy compared to reimplementing capabilities in hardware-specific variants, while delivering similar performance. We'll show how new features supported by Kokkos are closing some of the remaining gaps to the native models, with a particular focus on overlapping hybrid execution on GPU and CPU. You'll also learn how the Kokkos model provides built-in instrumentation for an application, which supports kernel-based analysis of applications across diverse architectures. Performance data will be shown for Intel Haswell, ARM, and OpenPower-based systems, with and without GPUs.
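As a flavor of the model, the sketch below (a generic saxpy, not LAMMPS code) shows a single Kokkos kernel that compiles unchanged to CUDA, OpenMP, or serial backends:

    #include <Kokkos_Core.hpp>

    int main(int argc, char *argv[]) {
        Kokkos::initialize(argc, argv);
        {
            const int n = 1 << 20;
            // Views allocate in the memory space of the selected backend.
            Kokkos::View<double *> x("x", n), y("y", n);
            const double a = 2.0;
            Kokkos::parallel_for("saxpy", n, KOKKOS_LAMBDA(const int i) {
                y(i) += a * x(i);   // same body runs on GPU or CPU threads
            });
            Kokkos::fence();
        }
        Kokkos::finalize();
        return 0;
    }

The kernel label ("saxpy") is also what Kokkos's built-in instrumentation reports, enabling the cross-architecture, kernel-based analysis described above.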

Level: Beginner
Type: Talk
Tags: Performance Optimization; Tools & Libraries; Programming Languages

Day: Wednesday, 04/06
Time: 10:30 - 10:55
Location: Room 212A

S6505 - Electron Dynamics on Graphics Processing Units

Xavier Andrade Postdoctoral Researcher, Lawrence Livermore National Laboratory
Xavier Andrade is a postdoctoral researcher with the Quantum Simulation Group at Lawrence Livermore National Laboratory. Xavier obtained a Ph.D. in physics from the University of the Basque Country, and worked as a postdoc at the Department of Chemistry and Chemical Biology at Harvard University. His research is focused on developing new theoretical models and algorithms for the computational simulation of electrons in materials. Xavier is one of the main developers of Octopus, a scientific code used by hundreds of researchers around the world to simulate materials. Currently, his research efforts are dedicated to the application of real-time electron dynamics to model conductivity and other properties of matter under extreme conditions.

Learn how scientists simulate the movement of electrons inside materials, and how GPUs can be used to accelerate these simulations. The dynamics of electrons give rise to important phenomena in materials; for example, they determine how a material interacts with light, or how it conducts heat or electricity. The simulation of electrons, however, is a challenging task, as their behavior is governed by quantum mechanics: we need to represent electrons as "clouds" and model how these clouds evolve in time. Fortunately, those simulations have great potential for parallelization and are ideal for GPUs. Our current efforts are focused on using GPU clusters for large-scale electron dynamics that will allow us to perform simulations of unprecedented predictive power.
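For context, one standard real-time formulation (used by codes such as Octopus; the talk's exact scheme may differ) advances each electron "cloud," an orbital \varphi_j, under the time-dependent Kohn-Sham equation:

    i \hbar \, \partial_t \varphi_j(\mathbf{r}, t) = \hat{H}[n(t)] \, \varphi_j(\mathbf{r}, t),
    \qquad
    \varphi_j(t + \Delta t) \approx e^{-i \hat{H} \Delta t / \hbar} \, \varphi_j(t)

The exponential propagator is typically approximated by a short Taylor or Lanczos expansion, and applying it to many orbitals at once is the dense, regular workload that maps so well onto GPUs.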

Level: Intermediate
Type: Talk
Tags: Computational Chemistry; Computational Physics; Supercomputing & HPC

Day: Wednesday, 04/06
Time: 10:30 - 10:55
Location: Marriott Salon 5

S6511 - Fast Compressed Sensing MRI Reconstruction Using Convolutional Sparse Coding on the GPU

Won-Ki Jeong Associate Professor, Ulsan National Institute of Science and Technology (UNIST)
Won-Ki Jeong is an associate professor in the School of Electrical and Computer Engineering at UNIST. Before joining UNIST, he was a research scientist in the Center for Brain Science at Harvard University. His research interests include scientific visualization, image processing, and general-purpose computing on the graphics processor in the field of biomedical image analysis. He has published 22 peer-reviewed research articles, two book chapters (NVIDIA GPU Gems), and one international patent. He received a Ph.D. in computer science from the University of Utah in 2008, where he was a member of the Scientific Computing and Imaging (SCI) Institute. He received the NVIDIA Graduate Fellowship in 2007 and is currently the principal investigator of the NVIDIA CUDA Research Center at UNIST.
Tran Minh Quan PhD Student, Ulsan National Institute of Science and Technology (UNIST)
Tran Minh Quan is a Ph.D. student in the School of Electrical and Computer Engineering at UNIST, Korea. His research interests are GPU computing and biomedical image processing. He received his B.S. in electrical engineering at KAIST, Korea.

Compared with well-known machine learning methods such as deep neural networks, convolutional sparse coding is a fairly new data-driven approach: it uses regularizers to approximate signals with a superposition of sparse feature maps convolved with a collection of filters. We'll introduce a fast alternating method for reconstructing highly undersampled MRI data. The proposed solution leverages the Fourier convolution theorem to accelerate the process of learning a set of filters, and iteratively revises the MRI reconstruction based on the sparse codes found subsequently. Finally, we'll show that our method is faster with GPU support and outperforms regular CPU implementations of state-of-the-art dictionary-learning-based approaches.
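The acceleration rests on the convolution theorem: convolution in the image domain becomes element-wise multiplication in the Fourier domain,

    \mathcal{F}\{ f * g \} = \mathcal{F}\{ f \} \cdot \mathcal{F}\{ g \},

so each filter update in the sparse-coding iteration can be evaluated with FFTs (on the GPU, for instance via cuFFT) rather than explicit spatial convolutions.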

Level: All
Type: Talk
Tags: Medical Imaging; Video & Image Processing; Deep Learning & Artificial Intelligence

Day: Wednesday, 04/06
Time: 10:30 - 10:55
Location: Room 212B

S6556 - Hollywood Under the Hood: The Mercedes Concept IAA

Vilya Harvey Senior Software Engineer, The Foundry
Vil is a Senior Software Engineer at The Foundry, where he's worn many hats and grown a beard. He's been part of the Nuke team, the Research team and, nowadays, the Future Technologies team where he's designing and building a real-time rendering engine. Before joining The Foundry, Vil was the Head of Systems Development at industry-leading VFX house Framestore - a role he came to after a 5 year stint writing pipeline management software for Oil & Gas. He still wonders whether Framestore understood what type of pipeline his previous experience involved. A native of Perth, Western Australia, Vil moved to England straight after graduating from the computer science programme at Curtin University and has been there ever since. He now lives in Leeds with his wife and despite working with Mercedes has recently bought a VW.

Mercedes-Benz's history is defined by moments of innovation that have dramatically shaped the automotive industry. When the company sought a solution to support the efficient development and delivery of next-generation digital user experiences (UX), it engaged The Foundry to help create that solution. The solution, code named Project Dash, leverages proven 3D content and digital visualization technology from The Foundry, existing Mercedes solutions, and custom software development. Working closely with Mercedes, The Foundry created a fully bespoke solution for real-time UI/UX design. With this solution, Mercedes UX designers can explore, create, and iterate faster, with high-quality content.

Level: All
Type: Talk
Tags: Self-Driving Cars & Automotive; Media & Entertainment

Day: Wednesday, 04/06
Time: 10:30 - 10:55
Location: Room LL21E

S6723 - Which Whale Is It, Anyway? Face Recognition for Right Whales Using Deep Learning

Robert Bogucki Chief Science Officer, deepsense.io
Robert Bogucki is the chief science officer at deepsense.io, where he manages the R&D team and focuses on deep learning. He is also a successful Kaggle competitor. When tackling real-life problems, he particularly enjoys leveraging algorithms and computational power instead of, or in addition to, domain knowledge. His motivation for working in the IT industry is to take theoretical ideas and concepts and put them to good use.

With fewer than 500 North Atlantic right whales left in the world's oceans, knowing the health and status of each whale is integral to the efforts of researchers working to protect the species from extinction. To engage the data science community, NOAA Fisheries organized a competition hosted on Kaggle.com. The challenge was to automate the right whale recognition process, currently a painstaking, lengthy, manual task, using a dataset of aerial photographs of individual whales. In this session, I will outline the winning solution, which is based on deep learning and convolutional neural networks.

Level: Advanced
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Computer Vision & Machine Vision

Day: Wednesday, 04/06
Time: 10:30 - 10:55
Location: Room 210G

S6476 - Distributed Deep Learning for Large Scale Speech Recognition

Zejun Ma Senior Expert, Alibaba Inc.
Zejun Ma works for Alibaba as a senior algorithm expert. His work and interests include deep learning, speech recognition, natural language processing, and distributed machine learning. He received his Ph.D. from the University of Chinese Academy of Sciences (formerly the Graduate University of the Chinese Academy of Sciences).

We'll describe our GPU acceleration of a set of deep models for speech recognition, especially for voice search applications, including DNNs, CNNs, RNNs, and LSTMs. We'll cover a set of key components related to deep learning, such as data augmentation, highly optimized parallel algorithms for data partitioning and communication, and various deep architectures.

Level: Beginner
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Algorithms

Day: Wednesday, 04/06
Time: 13:00 - 13:25
Location: Room 210G

S6405 - DeepSpark: Distributed Deep Learning with CPU-GPU Scheduling

Uri Verner Researcher, Huawei Research
Uri Verner is a member of the GPGPU research team at Huawei Research. He is completing a Ph.D. in computer science at the Technion in Israel. His research interests include heterogeneous computing, GPGPU, communication scheduling, and real-time data processing. Uri interned at NVIDIA in 2014.
Roman Talyansky Researcher, Huawei Technologies
Roman is a member of the heterogeneous computing research team at Huawei Research. He recently completed a Ph.D. in computer science at the Technion in Israel. His research interests include machine learning, deep learning, heterogeneous computing, GPGPU, and distributed computing.

We present DeepSpark, a fault-tolerant framework that distributes large-scale training over a heterogeneous cluster. It adapts the distribution of training data to the processing capabilities of each node. DeepSpark is based on Apache Spark and employs a new node-level task execution engine named AxE. AxE analyzes the neural network and maps the computations to the CPUs and GPUs, aiming to maximize the throughput.

Level: All
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Data Center & Cloud Computing; Big Data Analytics

Day: Wednesday, 04/06
Time: 13:30 - 13:55
Location: Room 210G

S6121 - GPU-Accelerated Computer Vision for Multimedia, Post Production and Surveillance

Hannes Fassold Senior Researcher, JOANNEUM RESEARCH
Hannes Fassold works at Joanneum Research, where he is a senior researcher in the Audiovisual Media Group of DIGITAL, the Institute for Information and Communication Technologies. His main research interests are algorithms for digital film restoration, content-based video quality analysis, and the efficient parallelization of these algorithms on the GPU. He received an M.S. in applied mathematics from Graz University of Technology in 2004. He has published several papers in these fields and is the principal investigator for the CUDA Research Center at DIGITAL.

Computer vision is at the core of many tools used in multimedia, post-production, and surveillance. We'll present some key computer vision algorithms for motion compensation, feature point extraction and tracking, SIFT descriptor extraction, and wavelet transforms. We'll quantify the significant speed-ups we gained from porting these algorithms to the GPU and share lessons learned from the porting process. We'll give insight into how these algorithms are used in several applications, such as real-time video quality analysis (detection of dropouts and noise level), brand visibility monitoring in broadcast content, film and video restoration (dust and dirt removal, noise reduction, etc.), and traffic monitoring for wrong-way driver detection.

Level: All
Type: Talk
Tags: Media & Entertainment; Computer Vision & Machine Vision; Video & Image Processing

Day: Wednesday, 04/06
Time: 14:00 - 14:25
Location: Room LL21C

S6166 - CAD Benchmarking of NVIDIA® Graphics on HP Laptops, Blades, Workstations and VMs

Brian Walrath Senior Multi-Disciplined Engineer, Raytheon Missile Systems
Brian Walrath is a senior multi-disciplined engineer at Raytheon Missile Systems in Tucson, AZ. Brian has a B.S. in technology from Southwest Texas State University.

Raytheon Missile Systems has done its CAD design work almost exclusively on NVIDIA hardware for the better part of the last decade, on desktops, laptops, blade workstations, and now virtual machines. How do all these platforms stack up against each other under 3D CAD and heavy analysis and simulation loads? Focusing on roughly the last five years, Raytheon will share its benchmarking methodology and data, and compare and contrast its 3D CAD experience with NVIDIA GPUs on HP hardware and VMs.

Level: All
Type: Talk
Tags: Graphics Virtualization; Aerospace & Defense

Day: Wednesday, 04/06
Time: 14:00 - 14:50
Location: Marriott Salon 4

S6183 - Raytracing Sound: Geometric Acoustics on the GPU

Tony Scudiero Devtech Software Engineer, NVIDIA
Highly-Rated Speaker
Tony Scudiero is a developer technology engineer at NVIDIA who works with a diverse set of applications ranging from physical simulations in nuclear engineering to ray tracing and audio processing. Tony researched human hearing at the University of Minnesota as a graduate student, though his degrees are in computer science and mathematics. Tony has been using GPUs for general purpose computing for over 10 years.

Geometric acoustics is the application of physically based rendering (ray tracing) techniques to propagation of acoustic energy. It models all of the effects modeled by 3D audio as well as effects often neglected or approximated. We'll provide a general overview of geometric acoustics and its applications. We'll show specific studies using a geometric acoustics engine written using NVIDIA's OptiX framework. The system described in this talk can be experienced at the Interactive Innovation Showcase on the GTC show floor.

Level: All
Type: Talk
Tags: Signal & Audio Processing; Rendering & Ray Tracing

Day: Wednesday, 04/06
Time: 14:00 - 14:25
Location: Marriott Salon 2

S6192 - Brain-in-a-Box: A Unified Perception and Navigation Framework for Mobile Robots, Drones and Cars

Massimiliano Versace CEO, Neurala Inc.
Massimiliano Versace is the CEO of Neurala Inc. and founding director of the Neuromorphics Lab at Boston University. He is a pioneer in researching and bringing to market large-scale, deep learning neural models that allow robots to interact and learn in real time in complex environments. Max has authored approximately 40 journal articles, book chapters, and conference papers, holds several patents, and has been an invited speaker at dozens of academic and business meetings, research and national labs, and companies, including NASA, Los Alamos National Labs, Air Force Research Labs, HP, iRobot, Samsung, LG, Qualcomm, Ericsson, BAE Systems, Mitsubishi, ABB, and Accenture, among others. He is a Fulbright scholar and holds two Ph.D.s: experimental psychology, University of Trieste, Italy, and cognitive and neural systems, Boston University, USA. He obtained his B.S. from the University of Trieste, Italy.

Mobile robots, drones, and self-driving cars need advanced and coordinated capabilities in perception and mobility to co-exist with humans in complex environments. To date, the most effective "machines" built for these tasks come from biology. Max Versace, CEO of Neurala and director of the Boston University Neuromorphics Lab, will explain how mobile robots, drones, and cars can use GPUs coupled with relatively inexpensive sensors, available today in the sensor pack of a common smartphone, to sense their environment and navigate it intelligently. The talk will illustrate a working "mini-brain" that can drive a ground robot to learn, map, and understand the layout of its environment and the objects in it, all while avoiding collisions.

Level: All
Type: Talk
Tags: Robotics & Autonomous Machines; Deep Learning & Artificial Intelligence; Self-Driving Cars & Automotive; IoT

Day: Wednesday, 04/06
Time: 14:00 - 14:50
Location: Room LL20D

S6201 - A New Parallel Prefix Scan Algorithm for GPUs

Sepideh Maleki Graduate Research Assistant, Texas State University
Sepideh is a graduate research assistant in the Efficient Computing Laboratory at Texas State University, where she is currently pursuing her master's degree in computer science. Her research interests include performance optimization, GPGPUs, and parallel programming. She is a student member of IEEE, SWE, and the ACM. She received a Graduate Excellence Award from the Department of Computer Science in 2015.

We present and evaluate a new technique for implementing parallel prefix scans. A number of GPU libraries include this important parallel primitive, but most of those implementations are based on a hierarchical algorithm. In contrast, our algorithm only requires a single stage. We implemented it in portable CUDA code using just one relatively short templated kernel. Our code outperforms prefix scans from popular libraries like CUDPP, MGPU, and THRUST on both Kepler and Maxwell devices. In many cases, it even outperforms the CUB library, which employs architecture-specific assembly code.
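
For orientation, the snippet below is a minimal warp-level inclusive scan built on register shuffles, the standard primitive that hierarchical implementations combine across warps and blocks. It is a generic sketch, not the single-stage algorithm presented in this talk.

    // Inclusive prefix scan within each 32-thread warp via register shuffles.
    // (On pre-CUDA 9 toolkits, __shfl_up replaces __shfl_up_sync.)
    __global__ void warpInclusiveScan(const int* in, int* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x & 31;
        // Load 0 beyond the end so every lane in the warp participates.
        int v = (i < n) ? in[i] : 0;
        for (int d = 1; d < 32; d <<= 1) {
            int up = __shfl_up_sync(0xffffffffu, v, d);
            if (lane >= d) v += up;   // lanes below d keep their value
        }
        if (i < n) out[i] = v;        // prefix sum of this 32-element segment
    }

A device-wide scan must still propagate each 32-element segment's total to later segments; collapsing that step into a single kernel pass is where the talk's approach departs from the hierarchical libraries it is compared against.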

Level: Advanced
Type: Talk
Tags: Performance Optimization

Day: Wednesday, 04/06
Time: 14:00 - 14:25
Location: Room 212A

S6202 - GPUCC: An Open-Source GPGPU Compiler

Jingyue Wu Software Engineer, Google Inc.
Jingyue Wu is a software engineer at Google and an active contributor to the LLVM compiler. He is one of the main contributors of gpucc, Google's open-source CUDA compiler. He completed his Ph.D. in computer science at Columbia University, where he worked with Professor Junfeng Yang on several projects related to software reliability and programming languages. This work has led to over 10 publications in top journals and conferences such as OSDI, SOSP, and PLDI.

We'll present gpucc, an LLVM-based, fully open-source, CUDA-compatible compiler for high performance computing. Its Clang-based front-end supports modern language features such as those in C++11 and C++14. Its compile times are faster than nvcc's. It generates better code than nvcc on key end-to-end internal benchmarks and is on par with nvcc on a variety of open-source benchmarks.
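
Because gpucc's front-end work has been upstreamed into Clang, a kernel like the one below can be built entirely with open-source tools. The compile line in the comment is an assumption about a typical Linux CUDA install; the target architecture and paths will vary.

    // axpy.cu -- built with, e.g.:
    //   clang++ --cuda-gpu-arch=sm_35 axpy.cu -o axpy \
    //           -L/usr/local/cuda/lib64 -lcudart
    // (architecture and paths are illustrative assumptions)
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void axpy(float a, const float* x, float* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1024;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
        axpy<<<(n + 255) / 256, 256>>>(3.0f, x, y, n);
        cudaDeviceSynchronize();
        printf("y[0] = %f\n", y[0]);   // expect 5.0
        cudaFree(x); cudaFree(y);
        return 0;
    }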

Level: Advanced
Type: Talk
Tags: Tools & Libraries; Performance Optimization; Programming Languages

Day: Wednesday, 04/06
Time: 14:00 - 14:25
Location: Room 211B

S6255 - High-Performance CUDA™ Clustering at Cloud Scale: GPUDirect RDMA over 40GBE IWARP

Tom Reu Consulting Application Engineer, Chelsio Communications, Inc.
Tom has had a long career in the computer industry. He started at Concurrent Computer Corporation, a New Jersey-based minicomputer company, and ended his tenure in product development with HP, working in the High Performance Computing Division. After leaving HP, Tom transitioned to his current field application role with Chelsio Communications. Tom has a master's degree in computer science from Monmouth University and a bachelor's degree in electrical engineering from Villanova University.

Learn how to deploy GPU clustering at scale by integrating Chelsio's 40GbE iWARP (RDMA/TCP) into your GPU applications. GPUs have demonstrated paradigm-shifting performance in a wide variety of applications, but network infrastructure challenges remain for GPUs operating at scale, especially in large-scale cloud environments. We present 40GbE iWARP, which leverages existing Ethernet infrastructure and requires no new protocols, interoperability work, or long maturity period, as the no-risk path for Ethernet-based, large-scale GPU clustering. The first part of the session is a technical overview of 40GbE iWARP, including best practices and tuning for GPU applications. The second part summarizes benchmark results showing the benefits of GPUDirect RDMA using 40GbE iWARP.

Level: Intermediate
Type: Talk
Tags: Data Center & Cloud Computing; Performance Optimization; Tools & Libraries

Day: Wednesday, 04/06
Time: 14:00 - 14:50
Location: Room 210E

S6267 - Data Analytics and Machine Learning at Your Finger Tips - No CUDA Required

Bryan Thompson Chief Scientist, Co-Founder, SYSTAP, LLC
Bryan Thompson is chief scientist at SYSTAP. He has 30+ years of experience as a technologist, inventor, and researcher in cloud computing and big data. He is the lead architect for BlazeGraph, an open source, distributed graph database used by Fortune 500 companies, including EMC, Autodesk, and Yahoo!. He leads SYSTAP's research team investigating GPU-accelerated distributed architectures for graph processing, which, together with the SCI Institute, published results in 2014 for executing breadth-first search on a cluster of 64 GPUs at up to 32 billion traversed edges per second.
James Lewis CUDA Researcher, SYSTAP, LLC
James Lewis is a CUDA researcher with SYSTAP and the lead developer for Blazegraph GPU. He wrote the initial version of the software that uses SpMV techniques to implement SPARQL query evaluation on the GPU. He was the lead CUDA developer for integrating MapGraph technology with the Merlin application to accelerate electronic warfare using GPU graph capabilities. In this role, James exposed the GPU graph capabilities via a Java Native Interface (JNI) to enable the integration without the application developer writing any CUDA, C++, or other non-Java code. He studied at the University of Utah's Scientific Computing and Imaging (SCI) Institute, where he received B.S. degrees in both computer science and applied mathematics as well as an M.S. in computing. In his research work, James developed graph topological metrics to evaluate the performance of aggregation methods in the context of multigrid coarsening. He implemented parallel aggregation techniques for multigrid coarsening in C++ and CUDA.

Writing fast, efficient graph and machine learning analytics on GPUs can be hard due to the complexities of CUDA and of achieving effective parallelism. DASL (for graph and machine learning algorithms) and SPARQL (for graph pattern matching) are high-level languages that provide speedups of up to 1,000x over native Spark and up to 300x over leading graph databases when executed on the BlazeGraph platform. These high-level languages are translated into task graphs that expose the available parallelism. The MapGraph runtime evaluates the task graphs and provides a scalable architecture on GPUs and GPU clusters. This presentation discusses the concepts for graph algorithms and queries, the MapGraph architecture, and how algorithms are evaluated on a GPU cluster.

Level: Intermediate
Type: Talk
Tags: Big Data Analytics; Deep Learning & Artificial Intelligence; Aerospace & Defense

Day: Wednesday, 04/06
Time: 14:00 - 14:25
Location: Room 210F

S6293 - DeepFont: Font Recognition and Similarity Based on Deep Learning

Hailin Jin Principal Scientist, Adobe
Dr. Hailin Jin has been a principal scientist with Adobe Research since 2004. He received his B.S. in automation from Tsinghua University, Beijing, in 1998, and his M.S. and Ph.D. in electrical engineering from Washington University in Saint Louis in 2000 and 2003, respectively. In 2003, he was a postdoctoral researcher in the Computer Science Department at the University of California, Los Angeles.

Font is one of the core elements in design. In this talk we will present two technologies: one for recognizing fonts from an image and another for suggesting fonts based on visual similarity. Both technologies are built upon improvements to the state of the art in deep learning. Our recognition system is trained with millions of images on NVIDIA GPUs. It is able to recognize over 7,500 fonts, achieves top-5 accuracy higher than 80%, and produces a good font similarity measure for font selection and suggestion. The technologies presented are the foundation of the new font similarity feature in Adobe Photoshop.

Level: All
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Computer Vision & Machine Vision

Day: Wednesday, 04/06
Time: 14:00 - 14:25
Location: Room 210G

S6301 - Driver Face Analytics & Emotion Recognition Using Deep Learning

Modar Alaoui CEO, Eyeris
Modar Alaoui is a tech entrepreneur and expert in artificially intelligent vision technologies, deep learning, and ambient intelligence. Modar is founder and CEO of Eyeris, the worldwide leader in deep learning-based emotion recognition software. The company's flagship product, EmoVu, reads facial micro-expressions in real time and uses convolutional neural networks as a deep learning architecture to train and deploy its algorithms into a myriad of today's commercial applications. Modar combines a decade of experience in human-machine interaction and audience behavioral measurement. He is a frequent speaker and keynoter on "ambient intelligence" as the next frontier in AI, a winner of several technology and innovation awards, and has been featured in many major publications for his work.

We'll introduce you to ultra-lightweight vision software that reads facial micro-expressions in real time for use in driver monitoring systems in the next generation of vehicles. Using deep learning-based convolutional neural networks (CNNs) powered by GPUs, vision algorithms for embedded systems can now allow vehicles to constantly monitor drivers' inattention, cognitive awareness, and emotional distraction through face analytics and emotion recognition technologies, in a 30th of a second. We'll also reveal the five most common applications of such technology in the automotive space, ranging from invisible reactive support systems to semi-autonomous driving, and present a brief live demo on stage toward the end of the session.

Level: All
Type: Talk
Tags: Self-Driving Cars & Automotive; Computer Vision & Machine Vision; Deep Learning & Artificial Intelligence; Intelligent Video Analytics (IVA)

Day: Wednesday, 04/06
Time: 14:00 - 14:25
Location: Room LL21E

S6376 - Development of a Track Trigger Based on GPUs for the CMS Experiment at CERN

Felice Pantaleo Physicist & PhD Student, CERN
Felice Pantaleo is a high-energy physicist (M.S. at the University of Pisa) working at CERN for the CMS experiment. He has worked with GPUs since 2008, on astrophysical simulations and likelihood maximization for fast fitting in the ROOT framework. For the last four years his work has focused on real-time triggering for the NA62 and CMS experiments. He is a Ph.D. student at CERN and the University of Hamburg.

We'll discuss the CMS experiment at CERN, which is planning a major upgrade to cope with an expected average of 140 overlapping collisions per bunch crossing. A key element of this upgrade will be the introduction of tracker information at the very first stages of the trigger system, for which several possible hardware implementations are under study. In particular, the adoption of GPUs in the first level of the trigger system is currently being investigated.

Level: All
Type: Talk
Tags: Computational Physics; Algorithms; Supercomputing & HPC

Day: Wednesday, 04/06
Time: 14:00 - 14:25
Location: Marriott Salon 6

S6411 - MVAPICH2-GDR: Pushing the Frontier of Designing MPI Libraries Enabling GPUDirect Technologies

Dhabaleswar K. (DK) Panda Professor and University Distinguished Scholar, The Ohio State University
Dhabaleswar Panda is a professor and university distinguished scholar of computer science and engineering at The Ohio State University. He has published over 350 papers in major journals and international conferences. The MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) open-source software package, developed by his research group, is used by more than 2,450 organizations in 76 countries. This software has enabled several InfiniBand clusters to get into the latest TOP500 ranking during the last decade. More than 293,000 downloads of this software have taken place from the project's website alone. He is an IEEE Fellow and a member of ACM.
Khaled Hamidouche Senior Research Associate, The Ohio State University
Khaled Hamidouche is a senior research associate in the Department of Computer Science and Engineering at The Ohio State University. He is a member of the Network-Based Computing Laboratory led by Dr. D. K. Panda. His research interests include high-performance interconnects, parallel programming models, accelerator computing, and high-end computing applications. His current focus is on designing high-performance, unified MPI, PGAS, and hybrid MPI+PGAS runtimes for InfiniBand clusters and their support for accelerators. Khaled is involved in the design and development of the popular MVAPICH2 library and its derivatives MVAPICH2-MIC, MVAPICH2-GDR, and MVAPICH2-X. He has published over 40 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities in academic journals and conferences. He is a member of ACM.

Learn how the MVAPICH2-GDR library enables support for different GPUDirect technologies to simplify the task of porting message passing interface (MPI) applications to supercomputing clusters with NVIDIA GPUs. MVAPICH2-GDR supports MPI communication directly from GPU device memory and optimizes it using various features offered by the CUDA toolkit. These optimizations are integrated transparently under the standard MPI API. Recent advances in MVAPICH2 include support for GDR_Async, MPI-3 RMA using GPUDirect RDMA, use of fast GDRCOPY, non-blocking collectives using GDR and Core-Direct, and much more. Performance results with micro-benchmarks and applications will be presented using MPI and CUDA/OpenACC. The performance impact of application co-design using MVAPICH2-GDR will also be presented.
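
The practical consequence is that MPI calls accept device pointers directly. Below is a minimal two-rank sketch; it assumes a CUDA-aware MPI build such as MVAPICH2-GDR, and whether the transfer travels over GPUDirect RDMA, GDRCOPY, or staged copies depends on how the library is configured.

    // Run with, e.g., mpirun -np 2 ./a.out
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int N = 1 << 20;
        float* dBuf;                       // device memory, no host staging
        cudaMalloc(&dBuf, N * sizeof(float));

        if (rank == 0) {
            cudaMemset(dBuf, 0, N * sizeof(float));
            // A CUDA-aware MPI recognizes the device pointer and moves the
            // data over the best available path (e.g., GPUDirect RDMA).
            MPI_Send(dBuf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(dBuf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        cudaFree(dBuf);
        MPI_Finalize();
        return 0;
    }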

Level: Intermediate
Type: Talk
Tags: Supercomputing & HPC; Tools & Libraries; Performance Optimization; OpenACC

Day: Wednesday, 04/06
Time: 14:00 - 14:50
Location: Room 211A

S6524 - Enabling the Electronic Structure Program Gaussian on GPGPUs Using OpenACC

Roberto Gomperts Principal Engineer, NVIDIA
In 2011, Roberto Gomperts became a principal engineer at NVIDIA, working to enable computational chemistry applications on GPGPUs. After graduating from the University of Nijmegen (The Netherlands), Roberto became a professor of physical chemistry at the University of Zulia in his native Venezuela. He developed his expertise in applying parallelism to computational chemistry programs starting in 1985 at IBM, where he held a postdoctoral position for two years, and later at Alliant Computer Systems. In 1991, he joined Silicon Graphics, Inc., where he served as a computational chemistry specialist, principal scientist, and senior principal scientist, among other roles. Roberto is co-author of KGNMOL, Gaussian 92, Gaussian 92/DFT, Gaussian 94, Gaussian 98, Gaussian 03, and Gaussian 09, and was an active co-developer of early versions of Amber. He has published many technical reports and is co-author of over 20 peer-reviewed scientific publications and one patent.

In 2011, Gaussian, Inc., PGI, and NVIDIA embarked on a long-term project to enable Gaussian on GPGPUs using a directives-based approach. OpenACC has emerged as the de facto standard for porting complex programs to GPU accelerators. We'll discuss how we attacked some of the challenges involved in working with a large-scale, feature-rich application like Gaussian. This includes a number of PGI extensions to the OpenACC 2.0 standard that we believe will have a positive impact on other programs. To conclude, we'll present a sample of GPU-based performance improvements across a variety of theories and methods.
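
For readers new to directives, the toy example below shows the flavor of the approach: the loop stays ordinary C++ and a single OpenACC pragma asks the compiler to offload it. This illustrates OpenACC in general (built with, e.g., a PGI compiler and the -acc flag), not Gaussian source code.

    #include <cstdio>

    int main() {
        const int n = 1 << 20;
        static float x[1 << 20], y[1 << 20];
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        float sum = 0.0f;
        // One directive offloads the loop; data clauses describe movement.
        #pragma acc parallel loop reduction(+:sum) copyin(x[0:n], y[0:n])
        for (int i = 0; i < n; ++i)
            sum += x[i] * y[i];

        printf("dot = %f\n", sum);   // expect 2 * 2^20
        return 0;
    }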

Level: Intermediate
Type: Talk
Tags: Computational Chemistry; Supercomputing & HPC; Tools & Libraries; OpenACC

Day: Wednesday, 04/06
Time: 14:00 - 14:50
Location: Marriott Salon 5

S6526 - Beyond Standards: A New GPU-Aware Image Coding System

Pablo Enfedaque Ph.D. Student, Universitat Autònoma de Barcelona
Pablo Enfedaque is a third-year Ph.D. student with the Department of Information and Communications Engineering (dEIC) and the Department of Computer Architectures and Operating Systems (CAOS) at the Universitat Autònoma de Barcelona, Spain. He received a B.E. in computer science and an M.S. in high performance computing and information theory in 2012 and 2013, respectively. Pablo has been working with GPU architectures since his final degree project. His research interests include image coding, high performance computing, and parallel architectures.

Discover a new image coding system devised to exploit massive parallelism in a GPU. Current standards for the compression of images lack the kind of parallelism required for efficient implementation in GPUs. Although much effort is made to implement such standards in CUDA, most implementations obtain poor results. This session describes the main insights behind the proposed image coding system. Our starting point was JPEG2000. The core mechanisms of the standard were redefined to allow the type of parallelism required in SIMT computing. All the advanced features of the system are preserved, but it is no longer compatible with the standard. Performance results will be given, comparing state-of-the-art CPU and GPU implementations of JPEG2000 with the proposed system.

Level: Intermediate
Type: Talk
Tags: Video & Image Processing; Performance Optimization; Signal & Audio Processing

Day: Wednesday, 04/06
Time: 14:00 - 14:25
Location: Room LL21B

S6527 - Effective Analysis of Large, Multi-Scale Combustion Simulations

Stephen Jones Senior Software Engineer, SpaceX
Stephen Jones leads the Simulation and Analytics group at SpaceX, where he works on various projects including large-scale simulation of combustion processes in rocket engines. He previously worked at NVIDIA, where he was the architect for the CUDA language and worked closely with NVIDIA's hardware designers to develop new GPU features in support of parallel programming. His background is in computational fluid mechanics and plasma physics, but he has worked in diverse industries, including networking, CAD/CAM, and scientific computing.
Andre Kessler Software Engineer, SpaceX
Andre Kessler is a software engineer in the Simulation and Analytics group at SpaceX, where he works on a multi-physics code for simulating turbulent combustion inside a running rocket engine. His background is in mathematics and parallel computing, and he previously worked at NVIDIA on the CUDA math libraries.

SpaceX is creating a multi-physics code with the aim of accurately modeling combustion processes inside a rocket engine. A major challenge is that system-wide effects are driven by processes 6 orders of magnitude smaller, so the simulation must capture all scales simultaneously. One way to handle the enormous size and complexity of the problem is by using wavelet techniques to dynamically adapt the simulation space, but this presents challenges for analysis and visualization of terabytes of irregularly-structured data. We present techniques for operating natively on sparse data representations, allowing us to process and visualize the data in-situ directly from its multi-resolution format.

Level: Intermediate
Type: Talk
Tags: In-Situ and Scientific Visualization; Algorithms; Computational Physics; Aerospace & Defense

Day: Wednesday, 04/06
Time: 14:00 - 14:50
Location: Room LL21D

S6597 - Accuracy Improvement of DBS Electrode Placement by Visualization of Sensor-Fused Data

Jens Krüger Professor, CoViDAG, University Duisburg-Essen
Jens studied computer science at the Rheinisch-Westfälische Technische Hochschule Aachen, where he received his diploma in 2002. In 2006, he finished his Ph.D. at the Technische Universität München, and after postdoc positions in Munich and at the Scientific Computing and Imaging (SCI) Institute, he became a research assistant professor at the University of Utah. In 2009, he joined the Cluster of Excellence on Multimodal Computing and Interaction at Saarland University to head the Interactive Visualization and Data Analysis group. Since 2013, Jens Krüger has been chair of the High Performance Computing group at the University of Duisburg-Essen.
Andre Waschk Researcher, University Duisburg-Essen
Andre received his bachelor's degree in applied computer science from the Universität Duisburg-Essen in 2014 and is currently pursuing his master's degree.

In this talk, attendees will learn how we improve established brain surgery routines for deep brain stimulation by using the latest GPU-based volume ray-casting and electrode signal classification. We will present the techniques used in our software, from GPU ray-casting to GPU-based registration of CT and MRI data and the interpretation and classification of the microelectrode signal using the CUDA toolkit. During our session, we will discuss future improvements of our approach, like supervised learning for the classification of brain tissue based on the provided electrode and volume data.

Level: All
Type: Talk
Tags: Medical Imaging; In-Situ and Scientific Visualization; Rendering & Ray Tracing

Day: Wednesday, 04/06
Time: 14:00 - 14:50
Location: Room 212B

S6661 - Training Recurrent Neural Networks in FP16

Erich Elsen Research Scientist, Baidu USA, Inc.
Erich Elsen joined Baidu's new Silicon Valley AI Lab in the summer of 2014, excited to get involved with deep learning. He teaches a course on parallel algorithms, OpenMP, MPI, and CUDA at Stanford as a consulting associate professor each spring. Erich received his Ph.D. in mechanical engineering from Stanford in 2009. His thesis developed novel parallel algorithms for running fluid dynamics and molecular dynamics computations on two types of then newly released parallel processors: Sony's Cell and GPUs. After graduating, he joined an EDA startup, where he developed GPU-accelerated computational lithography solutions. From there he founded a consulting company, Royal Caliber, to help others take advantage of GPUs. Some projects include moving Shazam's recognition engine to run on GPUs and a high-performance graph algorithm framework, vertexAPI2.

Reducing training time allows us to learn from our experiments more quickly and make new innovations based on what we've learned. Using fewer than the standard 32 bits to represent a number can help reduce training times. We'll talk about how to use 16-bit floating point, which is starting to gain wide hardware support with the release of Pascal. Unfortunately, naively converting all datatypes from 32 to 16 bits doesn't work, as training stability and accuracy are compromised. We'll discuss the reasons for the difficulties and their solutions. Finally, we'll show the performance and scalability improvements gained from using reduced precision.
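
As general background (not necessarily the speaker's exact recipe), one common mitigation is to keep FP16 for storage and bandwidth while accumulating in FP32. A minimal CUDA sketch:

    #include <cuda_fp16.h>

    // FP16 in memory halves bandwidth and footprint; doing the arithmetic
    // in FP32 avoids the worst of the precision loss during accumulation.
    __global__ void axpyHalfStorage(int n, float a,
                                    const __half* x, const __half* y,
                                    __half* out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float r = a * __half2float(x[i]) + __half2float(y[i]);
            out[i] = __float2half(r);
        }
    }

Computing natively in 16 bits, as Pascal's paired-half instructions allow, saves further time but raises exactly the stability and accuracy questions this talk addresses.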

Level: Intermediate
Type: Talk
Tags: Algorithms; Deep Learning & Artificial Intelligence; Performance Optimization

Day: Wednesday, 04/06
Time: 14:00 - 14:25
Location: Marriott Salon 3

S6702 - Automated Creation of Tests from CUDA Kernels

Oleg Rasskazov Vice President, Quantitative Research, JP Morgan Chase
For the last eight years, Oleg has worked in quantitative research at JP Morgan, focusing on high-performance computing for equities, commodities, and FX. He has a Ph.D. in applied mathematics, with a focus on computer-assisted proofs.

JP Morgan has been using GPUs extensively since 2011 to speed up risk calculations and reduce computational costs. The computational library runs a large number of kernels, both hand-written and auto-generated, with a complex data flow. As we upgraded CUDA drivers, runtimes, and hardware, we occasionally saw regressions in performance and in numerical values, and recognized the need for a test suite that would simplify the submission of issue reproducers without sharing the whole proprietary library. This talk presents an automated approach to converting individual kernel launches into standalone test cases, subject to some restrictions on the GPU code structure.

Level: Intermediate
Type: Talk
Tags: Finance; Tools & Libraries

Day: Wednesday, 04/06
Time: 14:00 - 14:25
Location: Marriott Salon 1

S6140 - Optimizing Instruction-Bound Kernels in Dissipative Particle Dynamics

Yu-Hang Tang PhD Candidate, Division of Applied Mathematics, Brown University
Yu-Hang is a PhD candidate with the Division of Applied Mathematics at Brown University. His primary research interests focus on High Performance Computing and concurrent multiscale coupling with applications in modelling soft matter systems and physiological fluids. He is the author of several open-source software packages, including the LAMMPS USERMESO GPU-accelerated package for Dissipative Particle Dynamics (DPD) and Smoothed Particle Hydrodynamics (SPH) simulations, as well as the Multiscale Universal Interface library for coupling standalone solvers to perform multiscale simulations.

In this talk, we report algorithmic and instruction-level optimizations used in uDeviceX, a CUDA particle simulator for biomedical microfluidic devices. First, an FMA-intensive random number generator (RNG) is built by exploiting the chaotic logistic map. This RNG takes advantage of the higher floating-point-to-integer instruction throughput ratio of CUDA GPUs to generate a large number of high-quality random streams in situ. Second, warp votes and shared memory are used to consolidate workload from diverging warps. Last, inline PTX is used to emulate 24-bit integer arithmetic with its floating-point counterpart in order to increase throughput. An implementation using C++ templates ensures that no type-casting overhead is incurred and also guards the technique from unintentional usage.
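
To make the first idea concrete, here is a toy version of a logistic-map generator: each draw is the chaotic update x <- 4x(1-x), which maps to one multiply plus one fused multiply-add and never leaves the floating-point pipe. This sketches the principle only; the generator described in the talk handles seeding and stream quality far more carefully.

    // Chaotic logistic map as a cheap in-register random source.
    // 4x(1-x) = 4x - 4x^2: fmaf(-4x, x, 4x) per draw.
    __global__ void genLogistic(float* out, int n, int warmup) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;
        // Crude per-thread seed in (0,1); a real generator needs
        // decorrelated seeds and a statistical quality analysis.
        float x = ((tid & 1023) + 1) * (1.0f / 1025.0f);
        for (int s = 0; s < warmup; ++s)
            x = fmaf(-4.0f * x, x, 4.0f * x);
        out[tid] = x;
    }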

Level: Intermediate
Type: Talk
Tags: Algorithms; Computational Chemistry; Performance Optimization

Day: Wednesday, 04/06
Time: 14:30 - 14:55
Location: Marriott Salon 3

S6153 - Fast Non-Rigid Registration for Mobile High-Dynamic Range Photography

Orazio Gallo Senior Research Scientist, NVIDIA
Orazio earned an M.S. degree in biomedical engineering from the Politecnico di Milano (Italy). He then joined the Smith-Kettlewell Eye Research Institute, where he developed a novel bio-imaging technique capable of recording micrometric deformations of soft tissues. Subsequently, he joined the University of California at Santa Cruz, where he received a Ph.D. in computer engineering in 2011. During his studies in Santa Cruz, Orazio also interned at Canesta, Inc. (since acquired by Microsoft) and at the Nokia Research Center in Palo Alto. In September 2011, Orazio joined NVIDIA Research, where he currently works on the Mobile Visual Computing team. His interests span several areas of computer vision and computational photography. Orazio regularly serves on the program committees of the top computer vision and computational photography conferences (CVPR, ICCV, ICCP) and is an associate editor of the journal Signal Processing: Image Communication.

We present a method that leverages the computational power of GPUs to create a high-dynamic-range (HDR) photograph in the presence of camera motion and scene changes. Our approach is extremely fast and prevents the artifacts that arise from insufficient registration quality. Previous methods to address this problem are either accurate, but too slow for mobile devices, or fast, but prone to failing. As a comparison, our method runs in under 700ms on an NVIDIA-powered tablet for a pair of 5MP images, whereas previous state-of-the-art methods performing non-rigid registration take over a minute on desktops for a pair of 1MP images.

Level: Intermediate
Type: Talk
Tags: Video & Image Processing; Computer Vision & Machine Vision

Day: Wednesday, 04/06
Time: 14:30 - 14:55
Location: Room LL21B

S6179 - Delivering Personalized Cloud Services to the Car

Albert Jordan VP Products, Cloudcar
Albert Jordan is a co-founder and vice president of products at CloudCar. Previously, he served as vice president of products at Core Mobility, Inc. (acquired by Smith Micro), where he led the launch of Core Mobility cloud-based services on Tier 1 carrier networks. Prior to Core Mobility, Albert was the CEO of Adaptive Telecom, a company that developed software-based intelligent antenna solutions that dramatically increased cellular network capacity. Adaptive Telecom was purchased by Metawave Communications, where he became president of the CDMA business unit. Albert began his career as an ASIC designer at Tandem Computers. He holds five technology patents.

Most of today's IVI solutions try to replicate the smartphone interaction model in the car. Adopting an approach similar to smartphones will not result in differentiated solutions with a sustainable competitive advantage. More importantly, the immersive experiences typical of smartphone interaction are not suitable in a driving environment. CloudCar is proposing a new approach to delivering connected services to the car, one that brings about a new interaction model suited to driving.

Level: All
Type: Talk
Tags: Self-Driving Cars & Automotive

Day: Wednesday, 04/06
Time: 14:30 - 14:55
Location: Room LL21E

S6238 - Realtime Raw Image and Video Processing on GPU

Fyodor Serzhenko CEO, Fastvideo
Fyodor is the CEO of Fastvideo. His research interests include high-speed cameras and software for high-speed imaging, high-performance computing, and GPU image processing for video applications. He graduated from the Moscow Institute of Physics and Technology in 1989 and received a Ph.D. in semiconductor physics in 1993.

This session demonstrates how to achieve real-time image and video processing of RAW data on the GPU. We'll present a detailed analysis of the Fastvideo SDK GPU image processing pipeline: RAW/DNG acquisition, preprocessing, demosaicing, denoising, color correction, tone mapping, resizing, sharpening, OpenGL output, and compression to MJPEG and H.264. All of this can now be done in real time on the GPU for 4K RAW data.

Level: All
Type: Talk
Tags: Media & Entertainment; Video & Image Processing; Computer Vision & Machine Vision

Day: Wednesday, 04/06
Time: 14:30 - 14:55
Location: Room LL21C

S6251 - Leveraging GPU Technology to Visualize Next-Generation Products and Ideas

Michael Wilken Director of 3D, Saatchi & Saatchi
Michael Wilken leads Saatchi & Saatchi LA's growing 3D capabilities. He built the agency's 3D capability from a single role into a team of more than 30 serving some of the world's largest brands. His team has successfully realized an industry-leading integration of 3D production capability and creative collaboration within an advertising agency.

While CAD real-time visualization solutions and 3D content creation software have been available for decades, practical workflow barriers have inhibited their efficient integration into an agency's creative and production process. Using the latest GPU technology from NVIDIA, Saatchi & Saatchi LA is pioneering the breaking of these barriers. 3D artists work with creative directors and clients to rapidly visualize ideas and products. Real-time visualization is integrated seamlessly into the production workflow, making rapid visualization both inspiring and cost-saving. We'll provide a top-level overview of how Saatchi is leveraging NVIDIA GPU technologies, including the NVIDIA VCA, to create powerful virtual creative collaborations.

Level: Intermediate
Type: Talk
Tags: Product & Building Design; Rendering & Ray Tracing; Self-Driving Cars & Automotive

Day: Wednesday, 04/06
Time: 14:30 - 14:55
Location: Room LL21A

S6292 - Gradually Porting an In-Use Sparse Matrix Library to Use CUDA

Mark Hoemmen Senior Member, R&D Staff, Sandia National Laboratories
Mark Hoemmen is a member of the R&D staff at the Center for Computing Research at Sandia National Laboratories. His research interests include numerical linear algebra, fault tolerance, and parallel programming models. He contributes to the Trilinos open-source software library, leads Trilinos' Scalable Linear Algebra capability area, and gives Trilinos tutorials regularly. He has a B.S. in mathematics and computer science from the University of Illinois Urbana-Champaign and a Ph.D. in computer science from the University of California, Berkeley.

Learn how to port an existing parallel library to use CUDA, even while the library is under constant production use by applications. We did this for the Tpetra parallel sparse linear algebra library. Tpetra provides data structures, computational kernels, and MPI data redistribution for Trilinos' sparse linear solvers. We used Kokkos, an abstraction over different shared-memory parallel programming models, to rewrite Tpetra for CUDA. This, along with careful attention to backwards compatibility, unit testing, and frequent application feedback, let us undertake this rewrite gradually. It also gave both applications and Trilinos' sparse linear solver packages that depend on Tpetra a gradual path to embrace MPI + thread parallelism.
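
The appeal of the Kokkos layer is that one loop body serves every backend. The sketch below is generic Kokkos, not Tpetra source: the same code compiles for CUDA, OpenMP, or serial execution depending on how Kokkos is configured.

    #include <Kokkos_Core.hpp>

    int main(int argc, char* argv[]) {
        Kokkos::initialize(argc, argv);
        {
            const int n = 1 << 20;
            Kokkos::View<double*> x("x", n), y("y", n);

            // Fill on whatever device the default execution space targets.
            Kokkos::parallel_for("fill", n, KOKKOS_LAMBDA(const int i) {
                x(i) = 1.0;
                y(i) = 2.0;
            });

            // Dot product as a parallel reduction; same code on all backends.
            double dot = 0.0;
            Kokkos::parallel_reduce("dot", n,
                KOKKOS_LAMBDA(const int i, double& sum) { sum += x(i) * y(i); },
                dot);
        }
        Kokkos::finalize();
        return 0;
    }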

Level: Intermediate
Type: Talk
Tags: Performance Optimization; Supercomputing & HPC

Day: Wednesday, 04/06
Time: 14:30 - 14:55
Location: Room 212A

S6309 - Capitalico - Chart Pattern Matching in Financial Trading Using RNN

Hitoshi Harada CTO, Alpaca
Hitoshi Harada is CTO at Alpaca, a company enabling AI technology to automate professional human tasks. Before Alpaca, Hitoshi worked in the database industry and community for 10 years as a major feature contributor to PostgreSQL, a kernel architect of the MPP database Greenplum, and a contributor to the open-source in-database machine learning library MADlib. He has extensive experience in distributed systems, data science, and machine learning for industrial applications.

Discretionary trading based on technical analysis and momentum strategies in the financial markets has been difficult to automate with quant-style rigid conditional programming, as it involves a lot of fuzziness and the subtleties of human perception. Our application, Capitalico, analyzes financial time-series data and trader behavior to solve this problem using RNN/LSTM networks. In this talk, we'll introduce the problem and our approach, and detail pitfalls and practices, such as how we choose networks and parameters to achieve the best accuracy and performance with deep learning on GPUs. As we borrowed great ideas from past deep learning applications, we'll help you understand how we converted those ideas into our solution and how to apply deep learning to your own problem.

Level: Intermediate
Type: Talk
Tags: Finance; Deep Learning & Artificial Intelligence

Day: Wednesday, 04/06
Time: 14:30 - 14:55
Location: Marriott Salon 1

S6438 - How GPU Can Help High Energy Physics Experiments

Gianluca Lamanna Researcher, INFN
Gianluca Lamanna is a researcher at the National Institute for Nuclear Physics (INFN) in Italy. He received his Ph.D. in particle physics in 2006 from Pisa University. From 2007 to 2010, he was a postdoctoral researcher at Scuola Normale Superiore, working mainly on data analysis and detector design. From 2010 to 2013, he was a research fellow at CERN, working on data acquisition and triggering for the NA62 experiment, and received a Marie Curie Fellowship. Since 2013, he has been the principal investigator of the GAP project, funded by the Italian Ministry of Research, to study the application of GPUs to real-time data acquisition.

We aim to show how online data selection in high-energy physics experiments can benefit from real-time GPU processing. The computing power of GPUs fits the requirement to increase the ability of trigger systems to reduce the data bandwidth. We designed a system for online processing exploiting commercial GPUs for the NA62 experiment at CERN. In particular, we'll show different techniques to reduce and control the latency due to data transfer, in order to obtain a synchronous response from the system. We'll show recent results obtained in a physics run at CERN with a high data rate. Attendees will learn how a high-energy physics trigger system works and how GPUs can increase the discovery potential of high-precision experiments.

Level: Intermediate
Type: Talk
Tags: Computational Physics; Algorithms

Day: Wednesday, 04/06
Time: 14:30 - 14:55
Location: Marriott Salon 6

S6471 - Accelerating Influence Spread Estimation on Social Networks in the Continuous-Time Domain

Zissis Poulos PhD Candidate, Mitacs Accelerate Intern, University of Toronto, Sysomos Inc.
Zissis received his B.E. degree in electrical and computer engineering (ECE) from the National Technical University of Athens in 2011, and his M.A.Sc. degree in ECE from the University of Toronto in 2013. Currently, he is pursuing a Ph.D. in the Department of ECE at the University of Toronto. His research is on formal verification and automated design debugging of digital systems, and extends to applications of data mining for statistical design debugging. He recently joined Sysomos Inc. as a data science intern, where his research focuses on influence maximization, network diffusion models, and GPU acceleration for inference algorithms.

This session showcases how to leverage GPUs to accelerate influence spread estimation in large social networks. Estimating the spread of an opinion or product across members of a graph-modelled social network is a hard problem requiring compute-intensive approximation algorithms. The complexity of the problem further rises in the continuous-time domain, where influence transmission rates on network edges are derived from stochastic distributions. Spread estimation algorithms that operate on stochastic transmission rates, such as naive sampling and neighbourhood size estimation, require a plethora of samples to achieve convergence. By exploiting the inherent independence across multiple sampling iterations of these algorithms we achieve up to 11x improvement in run-time using GPUs.

Level: Intermediate
Type: Talk
Tags: Big Data Analytics; Algorithms

Day: Wednesday, 04/06
Time: 14:30 - 14:55
Location: Room 210F

S6665 - Using Butterfly-Patterned Partial Sums to Draw from Discrete Distributions

Guy Steele Software Architect, Oracle Labs
Guy Steele is a software architect at Oracle Labs, responsible for research in language design and implementation strategies, and architectural and software support for programming languages. He received his B.A. in applied mathematics from Harvard College in 1975, and his M.S. and Ph.D. in computer science and artificial intelligence from M.I.T. in 1977 and 1980, respectively. He was previously an assistant professor of computer science at Carnegie-Mellon University; a member of technical staff at Tartan Laboratories in Pittsburgh, Pa.; and a senior scientist at Thinking Machines Corporation in Cambridge, Mass. He joined Sun Microsystems, acquired by Oracle in 2010, as a distinguished engineer in 1994 and was named a Sun Fellow in 2003.

We describe a SIMD technique for drawing values from multiple discrete distributions, such as sampling from the random variables of a mixture model for machine learning, that avoids computing a complete table of partial sums of the relative probabilities. A table of alternate ("butterfly-patterned") form is faster to compute, making better use of coalesced memory accesses. From this table, complete partial sums are computed on the fly during a binary search. Measurements using an NVIDIA TITAN Black GPU show that for a sufficiently large number of clusters or topics (K > 200), this technique alone more than doubles the speed of a latent Dirichlet allocation (LDA) application already highly tuned for GPU execution.
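
For orientation, the conventional baseline the butterfly-patterned table improves on looks like the device function below: materialize the complete table of partial sums once, then binary-search it for each draw. The sketch assumes the inclusive partial sums are already in memory.

    // Draw an index j with probability weight[j] / total, given the full
    // table of inclusive partial sums (cdf[k-1] holds the total) and a
    // uniform variate u in [0,1). Binary search costs O(log k) per draw.
    __device__ int drawCategory(const float* cdf, int k, float u) {
        float target = u * cdf[k - 1];
        int lo = 0, hi = k - 1;
        while (lo < hi) {
            int mid = (lo + hi) >> 1;
            if (cdf[mid] < target) lo = mid + 1;
            else                   hi = mid;
        }
        return lo;
    }

The expensive part at scale is not the search but building a full partial-sum table for every distribution; the butterfly-patterned table is cheaper to construct and friendlier to coalesced access, with the missing partial sums reconstructed on the fly during the search.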

Level: Advanced
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Algorithms; Performance Optimization

Day: Wednesday, 04/06
Time: 14:30 - 14:55
Location: Room 210G

S6674 - Tips and Tricks for Debugging Multi-GPU and/or Multi-Node GPU Applications

Larry Edelstein Sales Engineer, Rogue Wave Software
Larry Edelstein has been working in software development for over 25 years. From mainframes to desktops to the EC2 cloud, he has seen platforms and practices come and go. Larry has worked with a number of companies throughout his career, from start-ups to well-established industry leaders, creating software, leveraging his systems engineering expertise, and mentoring junior engineers. Larry has a bachelor of science in computer science from Cornell University and lives in the Bay Area.

Applications that use multiple GPUs are becoming more common, and because of that, debugging these types of applications is becoming a more important part of the development process. Whether you are using a single node with multiple GPUs or multiple nodes each with a GPU, there are some tips that can be useful for finding any problems you may be having. This presentation is based on the experience Rogue Wave has gained from supporting customers who debug these types of applications.

Level: All
Type: Talk
Tags: Tools & Libraries; OpenACC

Day: Wednesday, 04/06
Time: 14:30 - 14:55
Location: Room 211B

S6691 - Multi-Source Fusion Using Deep Learning

Andrew Jenkins Senior Data Scientist, Digital Globe
Andrew Jenkins works for DigitalGlobe Inc. as a senior data scientist focused on applying deep learning to multi-source spatial data such as satellite imagery, geo-tagged photos, and videos. Andrew is currently a Ph.D. candidate in the Department of Geography and Geographic Information Science at George Mason University. He holds an M.S. degree in geoinformatics and a B.S. in computer and information science. Andrew previously worked as a government researcher at the US Army Engineer Research and Development Center, and prior to that spent eight years in the military.

DigitalGlobe's satellite constellation collects millions of square kilometers of Earth imagery daily, yielding high-resolution data of our planet. By employing deep learning algorithms and NVIDIA GPUs, DigitalGlobe processes imagery and detects objects at speeds orders of magnitude faster than ever before. Emergency responders require a multitude of information sources to support their mission. DigitalGlobe uses several methodologies for fusing disparate data sets together. Social media, weather, other sensor types (e.g., radar/lidar), and satellite imagery can be fused together to help decision makers answer questions. By combining the data sets based on their location and common categories from the deep learning algorithms, emergency responders and analysts are able to automatically verify objects on the ground.

Level: Intermediate
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Video & Image Processing; Aerospace & Defense

Day: Wednesday, 04/06
Time: 14:30 - 14:55
Location: Marriott Salon 2

S6114 - Attack Graphs: Visualizing 200M Alerts a Day with GPU Clouds and JavaScript

Leo Meyerovich CEO, Graphistry, Inc.
Leo Meyerovich co-founded Graphistry, Inc., in 2014 to scale visual graph analytics. Graphistry builds upon the founding team's work at UC Berkeley on the first parallel web browser and Superconductor, a declarative GPU-accelerated data visualization language. Leo's most referenced work is in language-based security: language design and automatic verification for web apps and access control. His broader work over the past 10 years has been in designing programming languages, receiving awards for his research on the first reactive web language (OOPSLA), automatic parallelization (PLDI), and sociological foundations (OOPSLA, SIGPLAN).
Michael Wendt Data Engineer Principal, Accenture
Michael Wendt is a researcher at Accenture Technology Labs, where he focuses on developing new techniques and architectures to power next-generation data visualizations. As a developer Michael has created many responsive data visualizations for clients (like www.disasterviz.com and www.riskanalyticsviz.com) and believes that there is a fundamental connection between the user's experience and the backend performance. His previous research work on Hadoop, Cassandra, Storm and other technologies has allowed him to build and design big data solutions for clients. Michael has a BS in Computer Engineering from University of Maryland: College Park.
Joshua Patterson Data Science Principal, Accenture
Joshua Patterson is a Principal Data Scientist at Accenture Technology Labs, and a Presidential Innovation Fellow. At Accenture he leads data science research on Cyber Security and Risk, focusing on big data architecture, analytics, and visualization techniques to accelerate fraud and anomaly detection at scale. For the government, he supports the Data Services Initiative at the Department of Commerce. Prior to Accenture, Joshua led advanced analytics projects across several sectors including financial services, state and federal government, and commercial real estate. His current passion is graph analytics, GPUs, and advanced visualization. Joshua also loves storytelling with data, and some of his work can be seen at www.hotshotcharts.com, www.disasterviz.com, & www.riskanalyticsviz.com. Joshua holds a B.A. in Economics from the University of North Carolina Chapel Hill and a M.A. in Economics from the University of South Carolina Moore School of Business.

Enterprises "assume breach": someone, somewhere, already compromised them. Analysts sift through a GB/min (or more!) of attack logs from hundreds of thousands of systems. For every identified incident, they then map out the entire breach by backtracking through months of alerts. This talk shares how Graphistry and Accenture tackled the visual analytics problem: how do we explore big graphs? We'll drill into two of our GPU technologies for visualizing graphs: [1] StreamGL, our distributed real-time renderer for delivering buttery interactions, smart designs, and responsive analytics to standard web devices; [2] Node-OpenCL and our CLJS client: open source JavaScript libraries for server-side GPU scripting.

Level: Beginner
Type: Talk
Tags: Big Data Analytics; Aerospace & Defense; Large Scale and Multi-Display Visualization

Day: Wednesday, 04/06
Time: 15:00 - 15:50
Location: Room 210F

S6190 - Tools and Approaches for Increasing Developer Productivity on GPU-Enabled Systems

Eric Kelmelis CEO, EM Photonics
Highly-Rated Speaker
Eric Kelmelis is the co-founder and CEO of EM Photonics, a company focused on the development and transition of innovative research and technology in the fields of advanced imaging, high-performance computing, and embedded systems. Mr. Kelmelis received B.S. and M.S. degrees in electrical engineering from the University of Delaware, has more than 60 publications, and holds two patents. He has also served as conference chair at SPIE's Defense, Security, and Sensing symposium since 2010.

We are building technologies to allow developers to take advantage of GPUs without requiring deep knowledge of the underlying platform and programming paradigms. We'll provide an overview of our initiatives in the space. Specifically, we'll discuss libraries that encapsulate common and domain-specific functionality and abstract it to a level that allows the use of GPU-accelerated routines by users with no knowledge of the underlying hardware; static and dynamic task scheduling approaches to optimize workflows on mixed device systems; and tools to optimize computational software for either performance or power. We'll provide our perspective on easing the burden on developers and how our work can be applied to new applications or in refactoring existing code.

Level: All
Type: Talk
Tags: Tools & Libraries; Supercomputing & HPC; Programming Languages

Day: Wednesday, 04/06
Time: 15:00 - 15:25
Location: Room 211B

S6197 - Improving High Performance Image Resizing and Rotation: A Case Study of Texture Options

Ismayil Guracar Senior Key Expert, Siemens Medical Solutions, Ultrasound Business Unit
Highly-Rated Speaker
Ismayil Guracar has been working in the ultrasound imaging field for over 29 years. He is a senior key expert with the Innovations Applications Group at Siemens Healthcare, Ultrasound Business Unit in Mountain View, Calif. His interests include ultrasound image formation and high-performance, real-time signal processing, especially using GPUs. He holds 68 U.S. patents, has pioneered new ultrasound technologies in the areas of parametric and molecular imaging, and has contributed to the development of many successful diagnostic medical imaging products.

We present a case study on the use of textures for image resizing and rotation, using conventional bilinear and high-quality cubic interpolation filtering with various texturing options and data widths. Choices including CUDA arrays, pitched 2D arrays, and linear memory each offer benefits and drawbacks that depend on the particular demands and details of the application. We provide performance measurements from the latest Maxwell GPU architecture, which has a number of performance-improving advances over previous generations. We hope to provide information and insight to CUDA developers, and to demonstrate benchmarking and measurement techniques with Nsight, so that informed choices can be made about how best to match texture image processing options to application requirements.
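
As a minimal illustration of one point in this design space (our sketch, not the speaker's benchmark code), the following CUDA fragment stages an image in a CUDA array and resamples it with hardware bilinear filtering through a texture object; the image sizes and scale factors are assumed inputs:

    #include <cuda_runtime.h>

    // Each thread samples the source texture at a scaled coordinate;
    // cudaFilterModeLinear makes tex2D interpolate bilinearly in hardware.
    __global__ void resizeKernel(cudaTextureObject_t tex, float *out,
                                 int outW, int outH, float sx, float sy)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < outW && y < outH)
            out[y * outW + x] = tex2D<float>(tex, (x + 0.5f) * sx, (y + 0.5f) * sy);
    }

    // Host side: copy the source image into a CUDA array and wrap it in a
    // texture object configured for bilinear filtering.
    cudaTextureObject_t makeTexture(const float *src, int w, int h, cudaArray_t *arr)
    {
        cudaChannelFormatDesc fmt = cudaCreateChannelDesc<float>();
        cudaMallocArray(arr, &fmt, w, h);
        cudaMemcpy2DToArray(*arr, 0, 0, src, w * sizeof(float),
                            w * sizeof(float), h, cudaMemcpyHostToDevice);

        cudaResourceDesc res = {};
        res.resType = cudaResourceTypeArray;
        res.res.array.array = *arr;

        cudaTextureDesc td = {};
        td.addressMode[0] = td.addressMode[1] = cudaAddressModeClamp;
        td.filterMode = cudaFilterModeLinear;    // hardware bilinear filtering
        td.readMode = cudaReadModeElementType;

        cudaTextureObject_t tex = 0;
        cudaCreateTextureObject(&tex, &res, &td, nullptr);
        return tex;
    }

Swapping the CUDA array for pitched memory (cudaResourceTypePitch2D) or plain linear memory gives the alternative configurations whose trade-offs the talk measures.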

Level: Intermediate
Type: Talk
Tags: Performance Optimization; Video & Image Processing; Real-Time Graphics

Day: Wednesday, 04/06
Time: 15:00 - 15:25
Location: Room LL21B

S6233 - Intelligent Mobile System for Improving Spatial Design Support and Security Inside Buildings

Janusz Bedkowski Senior Researcher, Institute of Mathematical Machines
Janusz Bedkowski works for the Institute of Mathematical Machines. He received his Ph.D. in automation and robotics in 2010. His main research is related to autonomous mobile mapping systems. He is head of the Laboratory of Image Processing and the NVIDIA GPU Research Centre at the Institute of Mathematical Machines. Janusz is involved in a research project funded by EU FP7 and NCBiR (the Polish National Centre for Research and Development) on a mobile spatial assistance system, as well as a methodology project with the NCN (the Polish National Centre of Science) on building semantic models from mobile robot observations. His main expertise is in mobile robotics, image processing, parallel computing, mobile robotic system design, and artificial intelligence. He works on training and support technologies for operators of multi-robot systems. He publishes actively in international journals such as Elsevier's Automation in Construction, Industrial Robot: An International Journal, and Springer's Optoelectronics Review, and is also a reviewer for these journals.

This talk concerns an intelligent mobile application for the spatial design support and security domains. Mobility has two aspects in our research: the first is the use of mobile robots for 3D mapping of urban areas and for performing specific tasks; the second relates to a novel software-as-a-service system that provides access to robotic functionalities and data over Ethernet. Here we demonstrate the use of the novel NVIDIA GRID technology, which virtualizes the GPU. We introduce the Complex Shape Histogram, a core component of our artificial intelligence engine, used for classifying 3D point clouds with a Support Vector Machine. We use NVIDIA CUDA to accelerate the computations.
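
As a hedged sketch of the general pattern only (the Complex Shape Histogram itself is the speaker's own descriptor), a CUDA kernel can bin a per-point feature of a 3D point cloud with one atomic increment per point before the histogram is handed to the SVM:

    #include <cuda_runtime.h>

    // Toy feature: distance of each point from the origin, binned into a
    // fixed-size histogram with atomicAdd.
    __global__ void shapeHistogram(const float3 *pts, int n,
                                   unsigned int *bins, int nBins, float maxR)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float3 p = pts[i];
        float r = sqrtf(p.x * p.x + p.y * p.y + p.z * p.z);
        int b = min((int)(r / maxR * nBins), nBins - 1);
        atomicAdd(&bins[b], 1u);   // the filled histogram becomes the SVM feature vector
    }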

Level: Intermediate
Type: Talk
Tags: Aerospace & Defense; Data Center & Cloud Computing; Robotics & Autonomous Machines; Graphics Virtualization

Day: Wednesday, 04/06
Time: 15:00 - 15:25
Location: Marriott Salon 2

S6239 - From CVA to the Resolution of a Large Number of Small Random Systems

Lokman Abbas Turki Lecturer, LPMA, Paris 6 University
Lokman Abbas Turki is a lecturer at the Laboratoire de Probabilites et Modeles Aleatoires. Prior to this position, he spent two years as a postdoc at TU Berlin working on probability problems related to market impact and liquidity. Before that, Lokman earned his Ph.D. in probability and worked for a few months at INRIA as an expert in GPU parallelization of financial algorithms. During his Ph.D., he built a strong relationship with financial institutions like Credit Agricole and Pricing Partners.

The credit valuation adjustment (CVA) simulation is a typical example of a problem that can be successfully tackled using GPUs. It also shows how the challenges of a real-world application require advanced computing optimizations. This presentation covers both the algorithmic aspect of using GPUs for CVA and the implementation optimizations that should be performed when solving a large number of small systems. The algorithmic part involves a nested Monte Carlo method, for which we establish a judicious choice relating the number of inner to the number of outer simulated trajectories. The implementation part presents and compares, on a large number of small systems, the LDLt factorization, the Householder reduction, and the divide-and-conquer diagonalization.

Level: Intermediate
Type: Talk
Tags: Finance; Algorithms; Performance Optimization

Day: Wednesday, 04/06
Time: 15:00 - 15:50
Location: Marriott Salon 1

S6352 - Adapting the Visualization Toolkit for Many-Core Processors with the VTK-m Library

Christopher Sewell Staff Scientist, Los Alamos National Laboratory
Christopher Sewell is a staff scientist in the Computer, Computational, and Statistical Sciences Division at Los Alamos National Laboratory. His research interests include large-scale and in-situ visualization and analysis, data-parallelism, and multi-core and many-core technologies. Before joining Los Alamos, Christopher worked in the fields of haptics and medical robotics. He received a B.S. in Computer Science from Texas A&M University and an M.S. and Ph.D. in Computer Science from Stanford University.

To address the need for HPC scientific visualization software to effectively exploit GPUs and other many-core processors, U.S. DOE researchers are building a new library named VTK-m. VTK-m provides a framework for simplifying the design of visualization algorithms on emerging architectures. It provides a flexible data model that can adapt to many scientific data types and operate well on multithreaded devices. Finally, VTK-m serves as a container for algorithms designed in the framework, giving the visualization community a common point to collaborate, contribute, and leverage massively threaded algorithms. In this talk, we'll describe the design of VTK-m and present results on its functionality and performance for a variety of visualization applications.

Level: Intermediate
Type: Talk
Tags: In-Situ and Scientific Visualization; Supercomputing & HPC; Algorithms

Day: Wednesday, 04/06
Time: 15:00 - 15:50
Location: Room LL21D

S6408 - HPC Application Porting to CUDA® at BSC

Pau Farré HPC Software Developer, Barcelona Supercomputing Center
Pau Farre received his B.S. in computer science in February 2014 from the Universitat Politecnica de Catalunya. Since then, he has been working at the Barcelona Supercomputing Center in the Accelerators for HPC group. He is also a member of the UPC/BSC CUDA Center of Excellence.
Marc Jorda Resident Student, Barcelona Supercomputing Center
Marc Jorda received his M.S. in computer science in February 2012 from the Universitat Politecnica de Catalunya. Since then, he has been working at the Barcelona Supercomputing Center in the Accelerators for HPC group while pursuing his Ph.D. He is also a member of the UPC/BSC CUDA Center of Excellence.

In this session, you'll learn about the main challenges we have overcome at BSC to successfully accelerate two large applications using CUDA and NVIDIA GPUs: WARIS (a volcanic ash transport model) and PELE (a drug-molecule interaction simulator). We show that leveraging asynchronous execution is key to achieving high utilization of GPU resources (even for very small problem sizes) and to overlapping CPU and GPU execution. We also explain some techniques for introducing Unified Memory into your data structures for seamless CPU/GPU data sharing. Our results show an execution time improvement in WARIS of 8.6x for a 4-GPU node compared to a 16-core CPU node (using hand-tuned AVX vectorization and MPI). Preliminary experiments with PELE already show a 2x speedup.
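
A minimal sketch of the two ideas (assumed shapes, not BSC's actual code): a kernel launched on a stream so the CPU can keep working while the GPU computes, over data allocated with cudaMallocManaged so both processors address the same structure:

    #include <cuda_runtime.h>

    __global__ void gpuStep(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;   // stand-in for the real computation
    }

    int main()
    {
        const int n = 1 << 20;
        float *data;
        cudaMallocManaged(&data, n * sizeof(float));  // one pointer, visible to CPU and GPU

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        gpuStep<<<(n + 255) / 256, 256, 0, stream>>>(data, n);
        // ... independent CPU work overlaps with the kernel here ...
        cudaStreamSynchronize(stream);  // only now may the CPU touch the managed buffer

        cudaStreamDestroy(stream);
        cudaFree(data);
        return 0;
    }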

Level: All
Type: Talk
Tags: Supercomputing & HPC; Computational Chemistry; Earth System Modelling

Day: Wednesday, 04/06
Time: 15:00 - 15:25
Location: Room 211A

S6480 - GPUAM: Graphics Processing Units for Atoms and Molecules

Jorge Garza Professor, Universidad Autonoma Metropolitana
Jorge Garza is a full-time professor at the Universidad Autonoma Metropolitana-Iztapalapa in Mexico City, and has published around 70 scientific reports related to quantum chemistry supported by parallel computing. He obtained his Ph.D. at UAMI by studying confinement effects on the electronic structure of atoms within the context of density functional theory. After his Ph.D., he gained experience in parallel programming techniques at the Pacific Northwest National Laboratory, working with the quantum chemistry suite NWChem. In 2008, he was responsible for the installation of the fastest supercomputer in Latin America. Now he applies parallel programming techniques to heterogeneous computing.

We'll present the GPUAM code, which uses GPUs to analyze the electron density and molecular orbitals of atoms and molecules. Several quantum chemistry vector and scalar fields are evaluated on 3D grids, which are mapped onto 2D GPU grids. Among the quantum chemistry fields considered by GPUAM are: 1) molecular orbitals (MOs) and the electron density; 2) the gradient and Laplacian of MOs and the electron density; 3) the non-covalent interactions index; 4) the electrostatic potential; and 5) critical-point searches within the atoms-in-molecules approach. We'll present several applications, in particular proteins, or parts of proteins, where hydrogen bonds are relevant. The performance of GPUAM is contrasted with a CPU counterpart, showing the importance of GPUs in quantum chemistry analysis.
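
To make the grid mapping concrete, here is a hedged sketch (a toy Gaussian density, not GPUAM's actual basis-function evaluation) in which each 2D GPU thread owns one (x, y) column and loops over z, so a 3D field maps onto a 2D launch:

    #include <cuda_runtime.h>

    // atoms: (x, y, z, exponent) per atom; h: grid spacing.
    __global__ void densityGrid(float *rho, const float4 *atoms, int nAtoms,
                                int nx, int ny, int nz, float h)
    {
        int ix = blockIdx.x * blockDim.x + threadIdx.x;
        int iy = blockIdx.y * blockDim.y + threadIdx.y;
        if (ix >= nx || iy >= ny) return;
        for (int iz = 0; iz < nz; ++iz) {
            float x = ix * h, y = iy * h, z = iz * h, sum = 0.0f;
            for (int a = 0; a < nAtoms; ++a) {
                float dx = x - atoms[a].x, dy = y - atoms[a].y, dz = z - atoms[a].z;
                sum += expf(-atoms[a].w * (dx * dx + dy * dy + dz * dz));
            }
            rho[(iz * ny + iy) * nx + ix] = sum;   // toy Gaussian density value
        }
    }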

Level: All
Type: Talk
Tags: Computational Chemistry; Computational Physics

Day: Wednesday, 04/06
Time: 15:00 - 15:25
Location: Marriott Salon 5

S6497 - Neural Attention for Object Tracking

Brian Cheung Ph.D. Student, UC Berkeley
Brian Cheung is a Ph.D. student at UC Berkeley working with Professor Bruno Olshausen at the Redwood Center for Theoretical Neuroscience. His research interests lie at the intersection between machine learning and neuroscience.

With differentiable forms of attention being integrated into neural networks, end-to-end training with backpropagation is possible. We adopt the recently proposed attention mechanism in spatial transformer networks (STNs) into a recurrent architecture to perform object tracking. We show that this attention mechanism has significant overlap with the mechanism in deep recurrent attentive writer (DRAW) networks, which have been successfully used to create generative models of images. We present an end-to-end trainable recurrent attention model for tracking a variety of objects in video recorded by cameras mounted on an automobile.

Level: Advanced
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Robotics & Autonomous Machines; Computer Vision & Machine Vision; Self-Driving Cars & Automotive

Day: Wednesday, 04/06
Time: 15:00 - 15:25
Location: Room 210G

S6499 - Overcoming Memory Allocation Bottlenecks with Custom Allocators

Jo Balme Software Engineer, Space Exploration Technologies
Jo is a software engineer at Space Exploration Technologies, working on GPU-based combustion simulations. In a past life she worked on scientific computing initiatives at Microsoft.

Memory allocation is a heavy-weight operation; custom memory sub-allocators give developers the power to optimize for particular allocation patterns. In this talk, we present a data-driven exploration of different allocation techniques we used when running large-scale combustion simulations. We will show data from our memory profiling system, discuss how it guided our allocator designs, and show the results we were able to achieve.
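
As a minimal illustration of the idea (our sketch, not the speaker's allocator), a bump sub-allocator carves aligned blocks out of one large cudaMalloc'd slab, so each allocation costs a pointer increment rather than a driver call:

    #include <cuda_runtime.h>
    #include <cstddef>

    struct BumpAllocator {
        char  *base;    // device slab obtained once from cudaMalloc
        size_t size;    // slab capacity in bytes
        size_t offset;  // next free byte

        void init(size_t bytes) {
            cudaMalloc(&base, bytes);
            size = bytes;
            offset = 0;
        }
        void *alloc(size_t bytes, size_t align = 256) {
            size_t p = (offset + align - 1) & ~(align - 1);  // round up to alignment
            if (p + bytes > size) return nullptr;            // slab exhausted
            offset = p + bytes;
            return base + p;
        }
        void reset()   { offset = 0; }      // releases everything in O(1)
        void destroy() { cudaFree(base); }
    };

A bump allocator suits phase-structured workloads that free everything at once; pool or buddy allocators cover more general patterns, which is exactly the kind of trade-off that profiling data can arbitrate.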

Level: Advanced
Type: Talk
Tags: Performance Optimization; Algorithms

Day: Wednesday, 04/06
Time: 15:00 - 15:25
Location: Room 212A

S6509 - High-Performance Batched Computations for GPUs: Approaches and Applications

Stanimire Tomov Research Director, UTK
Stan Tomov received a Ph.D. in mathematics from Texas A&M University in 2002. He is a research director at the Innovative Computing Laboratory and adjunct assistant professor in EECS at the University of Tennessee, Knoxville (UTK). His research interests are in parallel algorithms, numerical analysis, and HPC. Currently, his work is concentrated on the development of numerical linear algebra software for emerging architectures. Stan is a PI of the GPU Center of Excellence at UTK.
Azzam Haidar Research Scientist, UTK
Azzam Haidar is a research scientist at the Innovative Computing Laboratory at the University of Tennessee, Knoxville. He received a Ph.D. in 2008 from CERFACS, France. His research interests focus on the development and implementation of parallel linear algebra routines for scalable multi-core and GPU architectures, for large-scale dense and sparse problems, as well as approaches that combine direct and iterative algorithms to solve large linear systems and eigenvalue problems.

Learn techniques for efficient batched computations on GPUs, where small, independent computations are grouped and executed together to obtain high performance. These problems occur very frequently in scientific applications like machine learning, data mining, dense and sparse solvers, high-order FEM, astrophysics, and more. We'll consider the development of batched computations for these applications, stressing innovative GPU techniques and algorithms for uniform as well as variable-size batches, tensor contractions, batched BLAS, and more. Batched computations can fill up the GPU with work and remove scheduling overheads and costly CPU-GPU communications, accelerating the computation often by an order of magnitude compared to non-batched approaches.
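
One concrete entry point is cuBLAS's batched GEMM, shown in this hedged sketch (the array-of-pointers interface for uniform batches; MAGMA offers comparable routines, including variable-size batches):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // dA, dB, dC are device arrays of device pointers, one n x n matrix per batch entry.
    void batchedGemm(const float **dA, const float **dB, float **dC, int n, int batch)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);
        const float alpha = 1.0f, beta = 0.0f;
        // One call multiplies all 'batch' matrix pairs, keeping the GPU full of work.
        cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                           n, n, n, &alpha, dA, n, dB, n, &beta, dC, n, batch);
        cublasDestroy(handle);
    }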

Level: Intermediate
Type: Talk
Tags: Algorithms; Tools & Libraries; Performance Optimization

Day: Wednesday, 04/06
Time: 15:00 - 15:25
Location: Marriott Salon 3

S6522 - A Novel Neural Network Architecture for Representing Scene Structure

Eric Weiss Graduate Student, UC Berkeley
Eric Weiss is a Ph.D. student at UC Berkeley, working under Professor Bruno Olshausen at the Redwood Center for Theoretical Neuroscience. His work focuses on computational modeling of cognitive and neural processes using methods from statistics and machine learning.

Early work on deep image processing using recurrent neural networks with selective attention has yielded promising results. However, it is unclear whether standard recurrent network architectures are well suited to representing scene structure. We present a novel memory system that can efficiently store a high-level model of a scene. The proposed approach has several advantages: it is differentiable, easy to analyze, and has constant memory requirements. Additionally, we show how it is relatively straightforward to incorporate it into a selective attention mechanism based on information-theoretic principles, enabling highly efficient image processing. We present results on a toy dataset.

Level: Intermediate
Type: Talk
Tags: Robotics & Autonomous Machines; Computer Vision & Machine Vision; Deep Learning & Artificial Intelligence; Algorithms; IoT

Day: Wednesday, 04/06
Time: 15:00 - 15:25
Location: Room LL20D

S6523 - Chainer: A Powerful, Flexible, and Intuitive Deep Learning Framework

Shohei Hido Chief Research Officer, Preferred Networks America, Inc.
Shohei Hido is chief research officer of Preferred Networks America, Inc. He received his M.S. in informatics from Kyoto University, Japan, in 2006. He then worked at IBM Research in Tokyo for six years as a staff researcher in machine learning and its applications to many industries. After joining Preferred Infrastructure, Inc., in 2012, he led the Jubatus project, an open-source software framework for real-time, streaming machine learning. Currently, he is the product manager of Deep Intelligence in Motion, software for using deep learning in IoT applications. Preferred Networks was established as a spinout from Preferred Infrastructure in 2014.

CUDA-based frameworks are key to applying deep learning technologies. We introduce Chainer, a Python-based, standalone, open-source framework. Attendees will learn how Chainer works and how it enables new kinds of deep learning applications. Thanks to the success of Caffe, Torch, and Theano, the power of deep learning continues to expand beyond traditional pattern recognition tasks such as image recognition. However, the gap is rapidly increasing between the complexity of newly proposed neural network models and the capabilities of existing frameworks, which have mainly been used for convolutional neural networks. Chainer enables users to intuitively implement many other kinds of models, including recurrent neural networks, with great flexibility and comparable GPU performance.

Level: Intermediate
Type: Talk
Tags: Deep Learning & Artificial Intelligence

Day: Wednesday, 04/06
Time: 15:00 - 15:50
Location: Room 210D

S6547 - Segmentation of Medical Volumes Using Convolutional Neural Networks

Fausto Milletari Research Scientist, PhD Candidate, Technical University of Munich
Fausto Milletari has been a Ph.D. candidate at the Technical University of Munich (TUM) since October 2013. After earning his M.S. in informatics, passed with high distinction, he joined the chair for Computer Aided Medical Procedures, directed by Professor Nassir Navab. Fausto's major research topic is segmentation of ultrasound images of the brain. In addition, he works on a variety of other computer vision problems, such as object tracking and detection. His work focuses on pattern recognition and machine learning, and in particular on voting-based approaches using state-of-the-art learning techniques. Several of his contributions have been presented in recent editions of MICCAI, IPCAI, and BMVC. Outside of the lab, Fausto strives to spread scientific knowledge about machine vision to a wider audience. He recently founded the computer vision and medical image analysis meetup group of Munich, which hosts monthly events that bring together academics and industry representatives interested in the field.

Can convolutional neural networks be used effectively for medical tasks? How does the choice of network architecture influence outcomes? How can we cope with the limited amount of annotated training data that is usually available in the medical domain? Is it advantageous to process volumetric data instead of 2D images? We'll seek answers to these questions by showing our recent results on segmentation. A Hough-voting strategy is used in conjunction with CNNs to localize and segment deep brain regions in MRI and ultrasound. We benchmark six different CNN architectures by training them with different amounts of training data and different input dimensionality. Results suggest that the complexity of the task, from a human standpoint, correlates with the required network complexity.

Level: All
Type: Talk
Tags: Medical Imaging; Deep Learning & Artificial Intelligence

Day: Wednesday, 04/06
Time: 15:00 - 15:25
Location: Room 212B

S6588 - GPU-Based Deep Learning in Cloud and Embedded Systems

Frederick Soo Chief Technology Officer, Nauto, Inc.
As chief technology officer and co-founder of Nauto, Inc., Frederick Soo has assembled a team of world-class computer vision and machine learning researchers and engineers and set them to build the core algorithms and hardware for Nauto's commercial products. Prior to joining Nauto, Fred studied the computational neurophysiology of the retina, receiving his Ph.D. in biophysics from Stanford University and completing post-doctoral fellowships at the University of Washington and Princeton University. His work experiences include working at McKinsey and Co., where he collaborated with Nauto co-founder Prof. Stefan Heck, and at Soo Embedded Systems, where he built products from the ground up.

We'll present how Nauto uses deep learning in its distributed, vehicle-based compute and sensor network, and the lessons we've learned to date. Topics will include the performance of deep learning algorithms for computer vision in embedded systems, strategies for distributing compute across networks of embedded systems and in the cloud, and collecting and labeling data to maximize the performance of the system. Nauto's system is a dual-camera, windshield-mounted dashcam with GPS, IMU, wireless/cellular connection, and an SoC capable of running small CNNs in real time.

Level: All
Type: Talk
Tags: Self-Driving Cars & Automotive; Embedded; Deep Learning & Artificial Intelligence; Computer Vision & Machine Vision

Day: Wednesday, 04/06
Time: 15:00 - 15:25
Location: Room LL21E

S6644 - How an Architectural Design Firm Leverages Virtual GPU Technology for Global Collaboration

Andrew Schilling Senior Associate, Information Technology, CannonDesign
Email for bio
Jimmy Rotella Design Application Specialist, CannonDesign
Jimmy Rotella received his B.A. in architecture from the Illinois Institute of Technology and is now a design application specialist at CannonDesign in Chicago. In his 10 years of experience, he has worked for multiple large design firms, implementing Revit, developing project standards, managing software and infrastructure, providing technical support for design applications and computers, and teaching in both corporate and educational environments. His background in both IT and architecture puts him at the forefront of design technology and positions him to share his knowledge of new tools with others to help them build and realize their digital designs.

Learn about the benefits virtualization provides for an architecture and engineering design firm, along with the journey through advancements in virtualization technology it took to finally meet the graphics-intensive needs of our design software. We'll share our actual experiences of how virtualization allows a large company, with over 15 offices and 1,000 people worldwide, to collaborate and work as a single firm. We'll show some cost comparisons for virtualization, along with its management benefits and requirements. We'll also look at the methods we used to set and test metrics specific to our requirements, and follow the results of those metrics through the changes in graphics virtualization technology.

Level: All
Type: Talk
Tags: Graphics Virtualization; Product & Building Design; Data Center & Cloud Computing

Day: Wednesday, 04/06
Time: 15:00 - 15:50
Location: Marriott Salon 4

S6667 - Revolutionizing Lattice QCD Physics with Heterogeneous Multigrid

Kate Clark HPC Engineer, NVIDIA
Kate Clark has been at NVIDIA since 2011, where she works at the interface between applications, algorithms, and parallel computation. Kate's background is in high energy physics: she completed doctoral research in Monte Carlo algorithms for lattice quantum chromodynamics in 2005, graduating from the University of Edinburgh. Upon subsequently moving to Boston University, Kate focused on adaptive multigrid algorithms and symplectic integrators. It was during this time that research was initiated into harnessing GPUs for lattice QCD computation; this research has since evolved into the QUDA library. Kate spent 2009-2011 at Harvard University, continuing to work on algorithms for GPUs and many-core processors, with a focus on signal processing.
Alexei Strelchenko Staff Scientist, Fermilab National Laboratory
Alexei Strelchenko joined the Scientific Computing Division's Lattice Quantum Chromodynamics (LQCD) group at Fermilab in 2013, coming from a postdoc in computational physics in Cyprus. Since 2010, he has been working on linear solver algorithms in the QUDA library, which enables high-efficiency LQCD computations on GPU-based HPC clusters. More recently, he has been applying similar techniques to Xeon Phi coprocessors in preparation for the forthcoming Cray 'Cori' system to be based at NERSC.

Learn how combining GPUs with advanced multigrid solvers is revolutionizing the study of lattice quantum chromodynamics (LQCD). LQCD is a computational tool for probing nuclear and particle physics; however, it can require thousands of GPUs working in tandem for months due to a computationally prohibitive linear solver. Using the QUDA framework, we describe how the solver can be accelerated with an adaptive multigrid method. The optimization techniques employed are fine-grained parallelization, mixed precision, communication-reducing solvers, and a reformulation of the algorithm that allows the CPU and GPU to work in parallel. Using this multitude of algorithmic innovations, we demonstrate that a 5X speedup can be realized over present state-of-the-art methods using GPUs.

Level: Intermediate
Type: Talk
Tags: Computational Physics; Algorithms; Performance Optimization

Day: Wednesday, 04/06
Time: 15:00 - 15:50
Location: Marriott Salon 6

S6681 - Benefits of Remote GPU Virtualization: The rCUDA® Perspective

Federico Silla Associate Professor, Technical University of Valencia
Federico Silla received his M.S. and Ph.D. degrees in computer engineering from the Technical University of Valencia, Spain, in 1995 and 1999, respectively. He is currently an associate professor in the Department of Computer Engineering (DISCA) at that university, where he teaches computer networks as well as high-performance interconnects courses at the Computer Engineering School. He is also an external contributor to the Advanced Computer Architecture research group in the Department of Computer Engineering at the University of Heidelberg. Furthermore, he worked for two years at Intel Corporation, developing on-chip networks. His research addresses high-performance on-chip and off-chip interconnection networks, as well as distributed memory systems and remote GPU virtualization mechanisms. In this regard, he has coordinated the rCUDA remote GPU virtualization project since it began in 2008.

Many applications use GPUs to accelerate their execution. However, using GPUs has several side effects, such as increased acquisition and maintenance costs and space requirements. Moreover, these increased costs may not be easily amortized, because GPUs usually have very low utilization rates. In a similar way to virtual machines, the use of virtual GPUs may address the concerns associated with real GPU devices. The remote GPU virtualization technique allows an application executing in a computer without a GPU to transparently make use of a GPU installed in another node of the cluster. Although the use of remote GPUs may seem a senseless idea, it provides several benefits, described in this talk using the rCUDA (remote CUDA) middleware as a case study.

Level: All
Type: Talk
Tags: Data Center & Cloud Computing; Supercomputing & HPC; Tools & Libraries

Day: Wednesday, 04/06
Time: 15:00 - 15:25
Location: Room 210E

S6698 - How GPUs Help to Power the Next Generation of ArcVideo Video Products and Services

Jin Huang CTO, ArcVideo, Inc
Jin Huang is CTO of ArcVideo, a software company spun off from ArcSoft, Inc., focused on providing video-related solutions and services to Chinese broadcasting and OTT customers. Having worked in multimedia for over ten years, spanning PC/mobile and server/cloud businesses, Jin is responsible for delivering broadcast-grade video solutions with high performance, private/public cloud video SaaS services, and intelligent video content analytics products supporting millions of end users.

ArcVideo, a leading video solution company in China, provides video transcoding, video content analysis, and interactive video solutions running on physical or virtual servers and on private and public clouds for broadcasting and OTT customers. We take full advantage of Tesla and GRID GPU transcoding and generic CUDA capabilities to accelerate the video transcoding and post-processing pipeline, to enable deep learning training for fast video content recognition, and to power private and public cloud video services for content providers. The high performance of GPUs brings ArcVideo's next generation of video experiences, including VR and 4K HEVC broadcasting, and makes possible a real-time, video-based interactive platform supporting millions of users.

Level: Intermediate
Type: Talk
Tags: Media & Entertainment; Video & Image Processing; Data Center & Cloud Computing

Day: Wednesday, 04/06
Time: 15:00 - 15:25
Location: Room LL21C

S6751 - Rendering Lost Historical Buildings with NVIDIA Technology

Andrew Rink Marketing Strategy, NVIDIA
At NVIDIA, Andrew Rink is responsible for global marketing strategy for the manufacturing and AEC industries. With 25 years' international experience in various industries, including CAD and animation software, lasers, and photonic power, Andrew has an extensive understanding of the business challenges faced by companies around the world and expertise in bringing innovative technology to market. Based at NVIDIA's Silicon Valley headquarters, he has travelled to over 80 countries and is fluent in three languages.

Powerful new tools are now available to accelerate and improve architectural visualization. We'll present an overview of how NVIDIA Iray plugins, Iray Server distributed rendering software, Quadro professional GPUs, and the Quadro Visual Computing Appliance are harnessed to drive interactive photorealistic rendering for building design visualization. When the Bank of England was renovated in the 1920s, almost all the neoclassical design contributed by Sir John Soane was lost. Project Soane was launched in 2015 to produce a crowdsourced digital model of Soane's work. Now that model is being used to create photorealistic renders of the building as it stood in the early 1800s. Join this session to hear how history is being rendered with leading-edge technology.

Level: All
Type: Talk
Tags: Product & Building Design

Day: Wednesday, 04/06
Time: 15:00 - 15:25
Location: Room LL21A

S6161 - Single Instruction Multiple Data for Computer Vision

Yen-Te Shih Sr. Compute Architect, NVIDIA
Yen-Te Shih works at NVIDIA on GPU architectures that run computer vision algorithms and applications.

Attendees will learn how to quickly port f32 code to an f16x2 version, predict the performance, analyze the overhead, and design a tool (or follow a standard procedure) to directly translate existing f32 code to f16x2.
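
The core of the transformation looks like the following hedged sketch (kernel names are ours): each __half2 packs two 16-bit values, so a single intrinsic such as __hfma2 does the work of two f32 operations on GPUs with native fp16 arithmetic:

    #include <cuda_fp16.h>

    // Original f32 kernel: one multiply-add per thread per element.
    __global__ void saxpy_f32(float a, const float *x, float *y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    // f16x2 version: n2 = n / 2 packed pairs; one __hfma2 performs two fused
    // multiply-adds at once.
    __global__ void saxpy_f16x2(__half2 a, const __half2 *x, __half2 *y, int n2)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n2) y[i] = __hfma2(a, x[i], y[i]);
    }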

Level: Intermediate
Type: Talk
Tags: Performance Optimization; Computer Vision & Machine Vision; Video & Image Processing

Day: Wednesday, 04/06
Time: 15:30 - 15:55
Location: Room 212A

S6167 - GPU-Accelerated Radiological Knowledge Extraction System at MGH

Synho Do Assistant Medical Director, MGH and Harvard Medical School
Synho Do, Ph.D., is an assistant in physics at Massachusetts General Hospital, where he is a technical committee member of the Webster Center for Advanced Research and Education in Radiation, and an instructor at Harvard Medical School. Synho received his Ph.D. in biomedical engineering from the University of Southern California. He is currently a member of the IEEE Signal Processing Society's Bio-Imaging and Signal Processing (BISP) committee, and the MGH site PI for an NVIDIA CUDA Research Center (CRC). Synho's current research interests include statistical signal and image processing, estimation, detection, and medical signal and image processing, such as computed tomography. He has been a co-investigator on multiple medical imaging projects, and co-PI/PI on medical (e.g., GE, Siemens, and Philips) and security (e.g., DHS and DARPA) image reconstruction projects.

We'll present a novel GPU-accelerated knowledge extraction system that provides decision support for radiologists, helping reduce human error and improve workflow efficiency. The Massachusetts General Hospital (MGH) Picture Archiving and Communication System (PACS) boasts a database of 20 billion radiology images across 13 million studies, but remains limited by a system that is not "intelligent" and is not fully optimized to extract value for patient care. We are developing a powerful GPU computing system using the NVIDIA DIGITS DevBox and GPU clusters to process these vast data sets and extract clinically relevant radiological knowledge that can enhance image interpretation. We'll introduce the architecture of our radiological knowledge extraction system and the results of the training.

Level: Intermediate
Type: Talk
Tags: Medical Imaging; Big Data Analytics; Deep Learning & Artificial Intelligence

Day: Wednesday, 04/06
Time: 15:30 - 15:55
Location: Room 212B

S6204 - Energy Consumption Evaluation for Krylov Methods on Cluster of GPU Accelerators

Serge Petiton Professor, University of Lille 1, Sciences and Technologies
Serge G. Petiton is a professor at the Scientific and Technical University of Lille. He received his Ph.D. in computer science in 1988 and the "Habilitation a diriger des recherches" in 1993 from Pierre and Marie Curie University in Paris, France. Serge was a post-doc, registered at the graduate school, and a junior research scientist at Yale University from 1989 to 1990. He was a researcher at the "Site Experimental en Hyperparallelisme" (supported by CNRS, CEA, and the French DoD) from 1991 to 1994. Serge was also an affiliate research scientist at Yale and a visiting research fellow in several U.S. laboratories, especially NASA-ICASE and the AHPCRC, during the period 1991-1994. Serge leads the "Methodology and Algorithmic Programming" group of the CNRS "Laboratoire d'Informatique Fondamentale de Lille," and he participated in the INRIA Saclay "Grand Large" project. Serge has supervised more than 22 Ph.D. students and has authored more than 100 articles in international journals and conferences. His current research interests are in parallel and distributed computing, post-petascale auto/smart-tuned dense and sparse linear algebra, and languages and programming paradigms for extreme modern scientific computing, targeting especially geoscience and big data applications.

We'll evaluate the energy consumption of several orthogonalization methods computed during Krylov methods for large linear algebra problems on a supercomputer using dozens of GPU accelerators. We analyze that performance with respect to several methods for optimizing the communication between nodes, from incomplete orthogonalization to "communication-avoiding" techniques using a multi-GPU TSQR method we implemented. We'll compare the impact of different algorithms for computing sparse matrix-vector multiplications, using hypergraph techniques in particular, with respect to different sparse matrices. After experimenting on a supercomputer using several dozen GPUs, we conclude that sometimes we have to find a tradeoff between energy consumption and the number of iterations using smart tuning.

Level: All
Type: Talk
Tags: Supercomputing & HPC; Algorithms; Performance Optimization

Day: Wednesday, 04/06
Time: 15:30 - 15:55
Location: Room 211A

S6271 - From Fashion CAD to Marketing Pictures with Optitex and Iray®

Rony Goldenthal CTO, Optitex
Rony Goldenthal has been with Optitex for three years, where he leads the company's R&D effort. He has a Ph.D. in computer science from the Hebrew University of Jerusalem and over 12 years of professional experience in various domains, such as evolutionary design methods and cloth and hair simulation.

Optitex now embeds Iray rendering in its software. Coupled with its 3D cloth-stitching technology, this gives users a complete design-proofing tool that extends all the way to releasing actual marketing assets or point-of-sale experiences. The new release of Optitex will be showcased with actual fashion content to demonstrate the full fashion CAD workflow and how the iterative mockup process can be dramatically shortened.

Level: All
Type: Talk
Tags: Product & Building Design; Rendering & Ray Tracing

Day: Wednesday, 04/06
Time: 15:30 - 16:20
Location: Room LL21A

S6308 - Image Super-Resolution: From Sparse Coding to Deep Network

Zhaowen Wang Research Scientist, Adobe Systems Inc.
Zhaowen Wang is a research scientist at Adobe Systems, Inc. His research areas include image understanding and enhancement through machine learning algorithms, with a special interest in deep learning. Before joining Adobe, Zhaowen obtained his Ph.D. from the University of Illinois at Urbana-Champaign in 2014.

Learn how to combine a conventional signal processing model with a deep neural network to achieve state-of-the-art performance in single-image super-resolution. Representing an image signal with its sparse coefficients has proven an effective prior for many image restoration problems, including super-resolution. We design a deep convolutional network that mimics the sparse coding model while retaining the advantage of end-to-end optimization enjoyed by other deep learning models. By unifying the strengths of a good image prior and a large learning capacity, our method generates much better upscaling results than vanilla sparse coding and neural networks, in both visual and numerical quality. The learned network has a very compact size and can be implemented efficiently on a GPU.

Level: Intermediate
Type: Talk
Tags: Media & Entertainment; Video & Image Processing; Deep Learning & Artificial Intelligence

Day: Wednesday, 04/06
Time: 15:30 - 15:55
Location: Room LL21C

S6342 - Putting Tegra into Drive: Safe, Secure, and Seamless Vehicle Integration

Ulrich Meis Senior Software Engineer, OpenSynergy
(Bio not available at the time of publication.)

A solution for vehicle integration targeting the NVIDIA Tegra Jetson Pro and Drive CX platforms will be presented. Communication with the vehicle via the automotive CAN bus is managed by a system that runs separately from other functions, in its own execution environment and backed by its own real-time operating system -- all based on the industry-standard Automotive Open System Architecture (AUTOSAR). Learn about the various benefits this design offers versus handling CAN directly in systems like Linux, Android, or QNX.

Level: Intermediate
Type: Talk
Tags: Self-Driving Cars & Automotive

Day: Wednesday, 04/06
Time: 15:30 - 15:55
Location: Room LL21E

S6349 - XMP Library Internals: Modular Multiplication on Kepler and Maxwell

Niall Emmart Ph.D. Student, University of Massachusetts
Niall Emmart is working towards his Ph.D. in computer science at the University of Massachusetts, Amherst, where his research focuses on high-performance, multiple-precision arithmetic across recent NVIDIA GPU architectures. Niall received a B.S. in pure mathematics from University of Massachusetts, Amherst, in 1992. From 1992 to 2012 he ran Yrrid Software, a small firm focusing on legacy system integration with the web.

We'll present an overview of the internals of the XMP multiple-precision library, take a detailed look at the low-level algorithms used for modular squaring and modular multiplication on Kepler, and present novel algorithms for Maxwell. Modular multiplication is a performance-critical primitive widely used in cryptographic algorithms, from primality testing and factorization to public-key/private-key algorithms such as RSA, Diffie-Hellman, and digital signatures.

Level: Intermediate
Type: Talk
Tags: Algorithms; Tools & Libraries

Day: Wednesday, 04/06
Time: 15:30 - 15:55
Location: Marriott Salon 3

S6396 - Experience Running an HPC System with up to 16 GPUs per Node

Maxime Boissonneault HPC Specialist, Calcul Québec - Compute Canada
Maxime Boissonneault earned his Ph.D. in quantum physics at the Universite de Sherbrooke and has been working as a high performance computing specialist for Calcul Quebec at Universite Laval since 2013. While he has no academic background in computing, he has learned numerous programming languages starting at the age of 12, going from Java to C++ through Python and C#. He has been giving CUDA training classes since 2014 and manages the Calcul Quebec Helios GPU cluster located at Universite Laval.

We'll describe our experience running an HPC cluster composed of high-density GPU nodes with either eight Tesla K20 accelerators or eight Tesla K80 accelerators (16 logical GPUs). This system is used by a wide variety of researchers who run jobs ranging from 1 to 16 GPUs per node. As this heterogeneous workload requires the sharing of nodes, we'll detail how the system was tuned to achieve the best balance between shareability, stability, and overall performance of the cluster. We'll communicate our experience on the following topics: [1] restricting access to GPU devices; [2] benchmarks to identify ideal workloads per node type; [3] NUMA nodes and the impact of sharing resources on memory management.
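
For the first topic, the standard mechanism is the CUDA_VISIBLE_DEVICES environment variable: a scheduler exports it per job, and the CUDA runtime renumbers the visible devices from zero. A small hedged sketch (our example, not the site's actual tooling):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    int main()
    {
        // e.g., the scheduler ran: export CUDA_VISIBLE_DEVICES=2,3
        const char *visible = getenv("CUDA_VISIBLE_DEVICES");
        printf("CUDA_VISIBLE_DEVICES=%s\n", visible ? visible : "(unset)");

        int count = 0;
        cudaGetDeviceCount(&count);   // counts only the devices the job may see
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);
            printf("device %d: %s\n", d, prop.name);
        }
        return 0;
    }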

Level: Advanced
Type: Talk
Tags: Data Center & Cloud Computing; Supercomputing & HPC

Day: Wednesday, 04/06
Time: 15:30 - 15:55
Location: Room 210E

S6432 - Hand Gesture Recognition with 3D Convolutional Neural Networks

Pavlo Molchanov Research Scientist, NVIDIA
Pavlo Molchanov has been a research scientist at NVIDIA since May 2015. He received B.S. and M.S. degrees, with distinction, in radio technical systems, devices, and complexes from the National Aerospace University, Kharkov, Ukraine, in 2008 and 2010, respectively. He received his Ph.D. from Tampere University of Technology, Tampere, Finland, in the area of signal processing.

This presentation will describe the design of a multi-resolution 3D convolutional neural network for drivers' hand gesture recognition. The talk will include task-specific data augmentation strategies that help to achieve state-of-the-art performance on a publicly available dataset. Several aspects of multi-sensor fusion with deep neural networks will be discussed in detail.

Level: Beginner
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Video & Image Processing

Day: Wednesday, 04/06
Time: 15:30 - 15:55
Location: Room 210G

S6478 - Real-Time Visualization of CUDA® Data Using ArrayFire Forge

Brian Kloppenborg Research Scientist, ArrayFire
Brian Kloppenborg is an astrophysicist turned high performance computing engineer and currently a research scientist at ArrayFire, where he writes massively parallel, high-performance software with applications in physics, astrophysics, and computer vision. Brian is particularly involved with DARPA's MEMEX project, a program that seeks to identify individuals who might be subject to human trafficking. Additionally, Brian is an adjunct professor in the Department of Physics and Astronomy at Georgia State University, where he conducts astrophysical research using spatially resolved optical interferometric imaging of stellar surfaces, eclipsing binary stars, and novae.

We will debut ArrayFire Forge, our new general-purpose data visualization library written specifically for use with GPU-accelerated applications. Through interoperability with OpenGL, Forge enables developers to create real-time, responsive, and stunning visualizations in 2D and 3D. Forge is an open-source project distributed on GitHub.

Level: Intermediate
Type: Talk
Tags: Tools & Libraries; Real-Time Graphics; Graphics Virtualization

Day: Wednesday, 04/06
Time: 15:30 - 15:55
Location: Room 211B

S6492 - If It's Not Reproducible, It's Not Worth It! Deterministic Machine Learning and Molecular Dynamics

Scott LeGrand CTO, AMBER Inc.
Highly-Rated Speaker
Scott LeGrand is a principal engineer at Amazon, working on neural network-based recommendation systems. In college, he developed the first molecular modeling system for home computers, Genesis. In 2000, he released both Folderol, the distributed computing project targeting the protein folding problem, and BattleSphere, a networkable 3D space shooter for the Atari Jaguar. Surprisingly, all three of these efforts shared a common codebase. More recently, he ported the Folding@Home codebase to CUDA, achieving a 5X speedup over previous efforts and accounting for 2.6 petaflops of the project's computational firepower. He is best known for his work porting the AMBER molecular dynamics package to CUDA, attaining record-breaking performance in the process. Scott earned a B.S. in biology from Siena College and a Ph.D. in biochemistry from Pennsylvania State University. He is currently obsessed with deep neural network performance optimization.

Parallel algorithms are hard. Data-parallelizing such algorithms is even harder. Extending such parallelization to multiple GPUs can tempt one to relax the reproducibility of computations in order to simplify reductions and allow for dynamic load-balancing as the distribution of the underlying data shifts throughout a computation. But once done, it's impossible to detect any sort of race condition in your codebase. Race condition effects can range from introducing mostly harmless noise in an approximate algorithm to catastrophic corruption that destroys the validity of computations. This talk will show how to maintain high performance on GPU clusters without losing reproducibility with examples in machine learning, deep neural networks, and molecular dynamics.
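
As a small illustration of the principle (a generic sketch, not the AMBER or neural network code itself), a block reduction that always combines values in the same pairing order yields bitwise-identical sums run after run, unlike accumulating floats with atomicAdd, where arrival order changes the rounding:

    // Launch with 256 threads per block; combine the per-block partials in a
    // fixed order afterwards to keep the final sum reproducible too.
    __global__ void blockSum(const float *in, float *partial, int n)
    {
        __shared__ float s[256];
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        s[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride) s[tid] += s[tid + stride];  // same pairing every run
            __syncthreads();
        }
        if (tid == 0) partial[blockIdx.x] = s[0];
    }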

Level: Intermediate
Type: Talk
Tags: Computational Chemistry; Deep Learning & Artificial Intelligence; Algorithms

Day: Wednesday, 04/06
Time: 15:30 - 16:20
Location: Marriott Salon 5

S6596 - IFM Technologies: Intelligent Flying Machines for Indoor Applications

Marc Gyongyosi Founder , IFM Technologies
Marc Gyongyosi is a junior in computer science at Northwestern University in the McCormick School of Engineering. For the past two years, he has been working closely together with BMW's robotics research department to develop novel robotic systems assisting workers at BMW factories. At BMW, Marc's primary research focus is the implementation and development of cooperative lightweight robots. At Northwestern's The Garage, Marc is involved in two startups: at MDAR Technologies, he works on a novel 3D vision system for self-driving cars and other autonomous vehicles. As the founder of IFM Technologies, he develops novel "Intelligent Flying Machines," i.e., Drones for Decisions. IFM Technologies aims to increase productivity and improve efficiency in everyday manufacturing and logistics processes.

We'll present recent advancements in leveraging the GPU on-board IFM Technologies' "Intelligent Flying Machines." IFM is providing industrial, indoor, flying platforms for data-driven decisions in the manufacturing and logistics industry. IFM provides a complete framework to collect, visualize, and leverage three-dimensional data analysis in indoor environments. Using the onboard GPU, IFM Technologies takes innovative production and logistics technology to a -- quite literally -- new dimension.

Level: Intermediate
Type: Talk
Tags: Robotics & Autonomous Machines; Computer Vision & Machine Vision; IoT

Day: Wednesday, 04/06
Time: 15:30 - 15:55
Location: Room LL20D

S6199 - Raytracing Scientific Data in NVIDIA OptiX™ with GVDB Sparse Volumes

Rama Hoetzlein Graphics Research Engineer, NVIDIA
Rama Hoetzlein's current research at NVIDIA explores data structures for large-scale simulation and volume rendering. Rama completed a dual degree in computer science and fine arts at Cornell in 2001, with research in robotics and imaging. His 2010 dissertation at the University of California, Santa Barbara, focused on tools for creative interaction in procedural modeling for media artists. In 2010, Rama was also co-director and lead scientist of the Transliteracies project in the digital humanities, and a professor of media studies in the Medialogy program in Copenhagen, with a focus on visual effects and animation.
Tom Fogal Senior Software Engineer, NVIDIA
Thomas Fogal is an NVIDIA engineer specializing in HPC visualization. As a doctoral student, he worked on parallel volume rendering techniques as well as novel approaches to in situ visualization. At the Scientific Computing & Imaging Institute, ORNL, and LLNL, he worked on parallel rendering for large scientific data. Thomas holds a B.S. and M.S. from the University of New Hampshire, and will soon have a doctorate from the University of Duisburg-Essen in Germany.

We present a novel technique for visualization of scientific data with compute operators and multi-scatter ray tracing entirely on GPU. Our source data consists of a high-resolution simulation using point-based wavelets, a representation not supported by existing tools. To visualize this data, and consider dynamic time-based rendering, our approach is inspired by OpenVDB from motion pictures, which uses a hierarchy of grids similar to AMR. We develop GVDB, a ground-up implementation with tree traversal, compute, and ray tracing via OptiX all on the GPU. GVDB enables multi-scatter rendering at 200 million rays/sec, and full-volume compute operations in a few milliseconds on datasets up to 4,200^3 entirely in GPU memory.

Level: All
Type: Talk
Tags: In-Situ and Scientific Visualization; Rendering & Ray Tracing; Computational Fluid Dynamics

Day: Wednesday, 04/06
Time: 16:00 - 16:50
Location: Room LL21D

S6242 - Parallel CAFFE Framework Based on GPU Cluster: IB + GPU Cluster + Lustre + MPI

Qing Zhang HPC Application R&D Manager, Inspur Electronic information Indudtry Co.,Ltd
Qing Zhang is currently the HPC application R&D manager at Inspur Group, and chief architect of the Inspur-Intel China Parallel Computing Joint Lab and the Inspur-NVIDIA GPU Joint Lab, where he is in charge of both labs. His research directions include HPC, multi-core CPU parallel computing, and GPU/MIC/FPGA heterogeneous computing, with a focus on HPC applications, deep learning applications, and Internet data center (IDC) applications. His application fields include oil and gas, CFD/CAE, life science, finance, and the Internet. An expert in heterogeneous computing, he leads an HPC optimization team of more than 10 people. He is the author of the book "High Performance Computing on the Intel Xeon Phi," the first MIC technology guide in the world.

The goal of this session is to explain how to design a parallel Caffe framework on a GPU cluster platform (InfiniBand + GPU cluster + Lustre + MPI) to handle big data, including three different kinds of MPI parallel mechanisms, and how to optimize data reading, network communication, and multi-GPU parallel efficiency.
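
A common building block in such designs is data-parallel gradient synchronization; this hedged sketch (one possible scheme, not necessarily Inspur's) has each MPI rank compute gradients on its GPU and then sum them with MPI_Allreduce so every replica applies the same update:

    #include <mpi.h>
    #include <cuda_runtime.h>

    void syncGradients(float *d_grad, float *h_grad, int n, MPI_Comm comm)
    {
        // Stage through host memory; a CUDA-aware MPI could pass d_grad directly.
        cudaMemcpy(h_grad, d_grad, n * sizeof(float), cudaMemcpyDeviceToHost);
        MPI_Allreduce(MPI_IN_PLACE, h_grad, n, MPI_FLOAT, MPI_SUM, comm);
        cudaMemcpy(d_grad, h_grad, n * sizeof(float), cudaMemcpyHostToDevice);
    }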

Level: Intermediate
Type: Talk
Tags: Deep Learning & Artificial Intelligence

Day: Wednesday, 04/06
Time: 16:00 - 16:25
Location: Room 210E

S6262 - A Comparison of Accelerator Architectures for Signal-Processing Algorithms

John Romein Senior Researcher HPC, ASTRON (Netherlands Institute for Radio Astronomy)
John Romein is senior researcher at ASTRON, where he leads several projects on HPC research for radio-astronomical applications. His primary focus is the use of accelerator hardware. He implemented the Blue Gene/P correlator for the LOFAR radio telescope. John received his Ph.D. in computer science on distributed game-tree search at the Vrije Universiteit, Amsterdam, in 2001. As a postdoctoral researcher, he solved the game of Awari using a large computer cluster and did research on parallel algorithms for bioinformatics. His research interests include high-performance computing, parallel algorithms, networks, programming languages, and compiler construction.

We'll compare different accelerator platforms (GPUs from NVIDIA and AMD, the Xeon Phi, a DSP from Texas Instruments, and a regular Xeon CPU as reference platform) for signal-processing algorithms that are used in radio astronomy (e.g., a filter, correlator). We'll show why the architectures are (in)efficient, discuss which architecture-(in)dependent optimizations are necessary, report on energy efficiency, and assess programmability.

Level: Advanced
Type: Talk
Tags: Performance Optimization; Supercomputing & HPC; Signal & Audio Processing

Day: Wednesday, 04/06
Time: 16:00 - 16:25
Location: Room 212A

S6303 - GPU Acceleration of an Iterative Physical Optics Algorithm

Giorgio Urso Researcher, The University of Melbourne
Giorgio Urso received a degree in physics in 1993. After a few years in Italian research centres, he began a professional career as a scientific software developer. Mainly interested in simulation using particle systems, he has also developed CUDA software in several fields, including biological neural networks, computer vision, optimization, and geophysics.

Learn how multiple GPUs and CUDA can help accelerate an iterative physical optics (IPO) algorithm for the analysis of electrically large scatterers. You'll see how we solved an unprecedented challenge for our electromagnetic solver: accurately simulating a 30-meter near-field Cassegrain antenna up to X band, using a fast far field approximation (FaFFA) algorithm that has been highly tuned for the latest generation of GPU cards and efficiently cast onto a multi-GPU architecture. Performance results comparing the GPU and CPU implementations will be presented.

Level: Intermediate
Type: Talk
Tags: Algorithms; Computational Physics; Performance Optimization

Day: Wednesday, 04/06
Time: 16:00 - 16:25
Location: Marriott Salon 3

S6321 - How Deep Learning Works for Automated Customer Service

Chenghua (Kevin) Li Chief Scientist of DNN Lab, JD.COM
Dr. Chenghua Li is the chief scientist in the deep neural network (deep learning) laboratory of JD, in charge of promoting the application of deep learning technologies in JD products. He was previously a data mining expert in the National Key Laboratories of Hisense, in charge of intelligent hardware innovation and data mining development. Chenghua has been researching and working in machine learning, especially neural networks and data mining, for decades. He has published more than 30 papers in leading academic journals such as Expert Systems with Applications, Information Processing and Management, Knowledge-Based Systems, and Neurocomputing, and holds more than 10 patents. He received his Ph.D. in data mining and machine learning at Chonbuk National University and finished his post-doctoral research at St. Francis Xavier University and York University in Canada. He was also a visiting scientist at the MIT Media Lab.

Deep learning research and applications have seen numerous successes in image processing and speech recognition. In natural language processing, however, deep learning is still underutilized. This session will share the relevant technology and the development process of our intelligent customer service robot, as well as related machine learning, deep learning, and natural language processing technologies. We'll also discuss the application of deep learning to natural language processing and automatic question-answering systems, the role it plays in the business, and how it enhances the ability to answer customer questions and boost customer satisfaction.

Level: All
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Robotics & Autonomous Machines; Computer Vision & Machine Vision

Day: Wednesday, 04/06
Time: 16:00 - 16:25
Location: Room LL20D

S6395 - Graph Database and Analytics in a GPU-Accelerated Cloud Offering

Brad Bebee CEO, SYSTAP, LLC
Brad Bebee is the CEO of SYSTAP, leading efforts to deliver graphs at scale with Blazegraph products. An expert in graphs and large-scale analytics, he has a diverse background in software development, telecommunications, and information retrieval.
Dave Driggers Chief Executive and Technical Officer, Cirrascale Corporation
David Driggers is the chief executive officer and original founder of Cirrascale, responsible for the company's strategic direction. David establishes the technology roadmap of the company and provides guidance for hardware technology development. He is directly responsible for the patents surrounding the company's vertical cooling technology and blade-based products.

Blazegraph GPU provides 300X acceleration for SPARQL graph query and graph database management with acceleration for existing RDF/SPARQL and Property Graph (Tinkerpop) applications. Multi-GPU configurations can effectively manage billion+ edge graphs on single-node machines with 4 or 8 K80 GPU accelerators. This is a cost-effective way to deliver high performance for graphs, but many end-users and applications do not have existing multi-GPU systems; current cloud offerings at this scale are not generally available. Cirrascale has developed a cloud-based solution for provisioning multi-GPU Tesla systems using its switch riser technology. This session details the Blazegraph GPU cloud offering on Cirrascale, demonstrates how to quickly deploy it in the cloud, and shows graph benchmarks on cloud systems.

Level: Beginner
Type: Talk
Tags: Big Data Analytics; Data Center & Cloud Computing; Aerospace & Defense

Day: Wednesday, 04/06
Time: 16:00 - 16:25
Location: Room 210F

S6400 - Quants Coding CUDA® in .NET: Pitfalls and Solutions

Benjamin Eimer Quantitative Developer, Chatham Financial
Benjamin Eimer has been a quantitative developer at Chatham Financial for the past four years, where he focuses on model development and performance. Prior to working in finance, Ben worked for the National Institute for Occupational Safety and Health as an aerosol scientist. Ben received his Ph.D. in physics, with an emphasis in computational modeling and materials science, from New Mexico State University in 2006.

We'll cover some of the lessons we've learned in developing a hybrid GPU/CPU linear algebra library in .NET to accelerate the financial risk and derivative pricing models developed by our quant team. The purpose of this library is to allow our team to transition to GPU computing incrementally within our extensive .NET codebase. We'll present some of the difficulties encountered when .NET's automated garbage collection interacts with low-level memory management, and how we addressed them. Solving these problems is essential for running CUDA code as part of a highly available web service architecture.
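
As a hedged illustration of the general pitfall (a sketch under assumptions, not Chatham Financial's library), the native side of such an interop layer can keep host buffers in pinned, unmanaged memory so the garbage collector can neither move nor reclaim them while an asynchronous CUDA transfer is in flight; the function names here are hypothetical:

    // Hypothetical native interop layer: host buffers live outside the
    // managed heap, so the .NET GC cannot move or free them while an
    // asynchronous CUDA copy is still using them.
    #include <cuda_runtime.h>
    #include <stddef.h>

    extern "C" void* alloc_pinned(size_t bytes)
    {
        void* p = NULL;
        // cudaMallocHost returns page-locked memory the GC never sees.
        if (cudaMallocHost(&p, bytes) != cudaSuccess) return NULL;
        return p;  // handed to .NET as an IntPtr, never as a managed array
    }

    extern "C" void free_pinned(void* p)
    {
        // Called from a deterministic Dispose(), not a GC finalizer, and
        // only after synchronizing any stream that still references p.
        cudaFreeHost(p);
    }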

Level: All
Type: Talk
Tags: Finance; Tools & Libraries; Programming Languages

Day: Wednesday, 04/06
Time: 16:00 - 16:25
Location: Marriott Salon 1

S6442 - Recognizing Cancerous Cells in Histology Imagery Using Deep Learning

Ted Hromadka Senior Software Engineer, Integrity Applications Incorporated
Theodore Hromadka is a senior software engineer at Integrity Applications Incorporated, working on the HPCMP Portal project. His research areas include deep learning, high-performance computing, and mobile app development. He is a graduate student in computer science at University of California, San Diego.

We'll present the results of applying deep learning techniques and GPUs to the classification of histology imagery. At the Naval Medical Center, San Diego, pathologists manually inspect biopsy samples to identify cancerous cells amid healthy tissue. This process is time-intensive and susceptible to errors caused by fatigue. Using DIGITS, Caffe, and GPUs, researchers are automating this process.

Level: Beginner
Type: Talk
Tags: Medical Imaging; Deep Learning & Artificial Intelligence

Day: Wednesday, 04/06
Time: 16:00 - 16:25
Location: Room 212B

S6443 - Mining Audio Information on Web Videos and Recordings

Benjamin Elizalde PhD Student, Carnegie Mellon University
Benjamin Elizalde is a Ph.D. student at Carnegie Mellon University under the direction of Professor Ian Lane. He was a staff researcher in the Audio and Multimedia group at the International Computer Science Institute, affiliated with UC Berkeley, from 2012 to 2015. He worked under IARPA's ALADDIN project on video event detection in web videos as a participant in the TRECVID MED evaluations. He also worked on a Livermore Labs project on multimedia content analysis with high performance computing. His research has resulted in over 15 peer-reviewed publications. He received his B.S. and M.S. from Tecnologico de Monterrey in Mexico. During his M.S., he also did research at Carnegie Mellon University, where he began his work on audio-based content analysis.

We are surrounded by sounds that describe the world we live in. This audio information is reflected in the content of videos, providing unique characteristics as well as cues that complement images and text. We'll present our ongoing research on deriving information from environmental sound recordings and audio from city-location web videos using GPU-based recurrent neural networks. These methods and findings can be applied to multimedia content analysis, robotics, and the Internet of Things.

Level: All
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Big Data Analytics

Day: Wednesday, 04/06
Time: 16:00 - 16:25
Location: Room 210G

S6565 - Creating Unique Customer Relationships with Deep Learning in the Cloud and in the Car

Nick Black Chief Product Officer, CloudMade
Nick Black is a product-focused founder and entrepreneur with domain experience in location-based services, connected cars, and automotive infotainment. His experience includes building and leading product planning and product design organizations through full product life cycles: concepting, research, co-creation, business model development, prototyping, user testing, development, and release. At CloudMade, Nick leads product planning and design, where he is focused on designing, specifying, and delivering CloudMade's self-learning car solutions.

The car presents a particular challenge for creators of learning systems -- it is incredibly rich in data and context, its hardware and software environments are heterogeneous and fragmented, and drivers expect incredible precision from its interactions. CloudMade has pioneered an approach to machine learning in the automotive context that leverages the richness of car data, the emerging computational power of the car, and the existing computational power of the cloud to deliver an automotive-grade machine learning toolset. With CloudMade's solutions, automotive OEMs can deliver personalized experiences to customers that together create a self-learning car that anticipates the needs and desires of the user.

Level: All
Type: Talk
Tags: Self-Driving Cars & Automotive ; Deep Learning & Artificial Intelligence; Embedded

Day: Wednesday, 04/06
Time: 16:00 - 16:25
Location: Room LL21E

S6678 - Accelerating Real Applications: Best Practices for Profiling and Debugging Complex Code

Beau Paisley Support Engineer, Allinea
Beau is a computer science and mathematics graduate from the College of William and Mary and performed graduate studies in Electrical Engineering at Purdue University. He has over twenty-five years of experience in development, marketing and sales roles with research, academic, and startup organizations. He has previously held positions with NCAR, Applied Physics Lab, and several startup and early growth technical computing companies.

Real-world codes are often made up of 100,000-line files written by generations of contributors, none of whom fully understood the previous work. As the complexity of a code grows, it goes through a phase change, and the simple techniques learned on simple examples fail. In this talk, we share best practices for understanding and accelerating complex, real-world codes, based on over a decade of working with scientists on some of the world's largest HPC systems and applications. Topics covered include: (1) understanding and navigating complex codes from a performance perspective, (2) practical advice on identifying areas to accelerate, (3) a scientific approach to preventing, identifying, and tracking down errors, and (4) profiling and debugging tools, and when NOT to use them.

Level: Beginner
Type: Talk
Tags: Tools & Libraries; Performance Optimization

Day: Wednesday, 04/06
Time: 16:00 - 16:50
Location: Room 211B

S6704 - The Microscope for 21st Century Discovery

Wojtek James Goscinski Manager, External Collaborations, eResearch Centre, Monash University
Dr. Wojtek James Goscinski is the coordinator of the Multimodal Australian ScienceS Imaging and Visualisation Environment (MASSIVE), a specialist Australian high performance computing facility for imaging and visualization, and the external collaborations manager at the Monash eResearch Centre, a role in which he leads teams to develop effective and creative applications of computing in research. He holds a Ph.D. in Computer Science, a Bachelor of Design (Architecture), and a Bachelor of Computer Science.
Paul Bonnington Director, Monash eResearch Centre, Monash University
Professor Paul Bonnington is the director of the Monash eResearch Centre, Monash University, and a professor in the School of Mathematical Sciences. The Monash eResearch Centre's role is to build collaborations between research disciplines, nurture eResearch developments, and build bridges between researchers and research infrastructure providers. The Centre is an initiative of Monash University's Vice-Provost (Research and Research Infrastructure) to support researchers within Monash and across Australia. The Monash eResearch Centre hosts significant nodes of all of Australia's core eResearch infrastructure funded under NCRIS and EIF Super Science, such as the National Computational Infrastructure Specialist High Performance Computing Facility (MASSIVE), and nodes of the national NeCTAR Research Cloud and Research Data Storage Infrastructure. Additionally, the Monash eResearch Centre leads the development of the Australian National Data Service and NeCTAR Research Tools platforms for the international research community.

World-class environments for research require the orchestration of specialised instruments, data storage and processing facilities, and advanced data visualisation environments. The Clayton Innovation Precinct is now home to a world-unique trifecta to support this vision: (1) World-class scientific instruments located at Monash University, CSIRO, Australian Synchrotron and affiliated medical research institutes; (2) Unique data processing capabilities of the MASSIVE HPC facility; and (3) A world-class immersive visualisation environment for data analysis and collaboration (the CAVE2). The way in which scientists apply these three capabilities in concert will be an archetype of the way research will be performed in the 21st Century.

Level: All
Type: Talk
Tags: Supercomputing & HPC; Data Center & Cloud Computing

Day: Wednesday, 04/06
Time: 16:00 - 16:25
Location: Room 211A

S6705 - Solving the Mysteries of Particle Physics with GPUs

Waseem Kamleh Senior Research Associate, University of Adelaide
Dr. Waseem Kamleh is a leading researcher in computational physics based at the University of Adelaide. He graduated in 1999 with honours in mathematical physics and was awarded the University Medal. He received his doctorate in 2004 and performed post-doctoral studies at Trinity College, Dublin. He returned to the University of Adelaide in 2008 to join the Centre for the Subatomic Structure of Matter (CSSM), where he leads the high performance computational physics component of the CSSM lattice research program. His research interests cover a wide variety of topics within lattice quantum chromodynamics (QCD), ranging from hadron structure to dynamical fermion algorithms. An avid programmer all his life, with over 15 years of experience in high performance computing, Dr. Kamleh has transformed the way the CSSM research group's calculations are performed by exploiting GPU technology in the software he develops, and he is a leading expert in the use of GPUs for studying lattice QCD.

A 50-year old mystery in particle physics is solved with the discovery that an exotic subatomic particle is actually a molecule. Lattice quantum chromodynamics (QCD) uses high performance computing to solve the fundamental equations that describe the interactions of subatomic particles and reveal their internal structure. The anomalously low mass of the Lambda(1405) resonance has puzzled physicists since it was first observed in the 1960s. We show how a recent Lattice QCD calculation conducted by the University of Adelaide demonstrated that the Lambda(1405) is an exotic meson-baryon molecular state. The highly parallel nature of Lattice QCD calculations makes GPUs an ideal hardware platform and we discuss the critical importance of large-scale GPU clusters to this field of research.

Level: All
Type: Talk
Tags: Computational Physics; Supercomputing & HPC

Day: Wednesday, 04/06
Time: 16:00 - 16:25
Location: Marriott Salon 6

S6712 - GPU Powered Solutions in the Second Kaggle Data Science Bowl

Jon Barker Solution Architect, NVIDIA
Jon Barker joined NVIDIA in May 2015 as a solution architect. Since then, he has been helping customers design, implement, and optimize a variety of GPU-accelerated deep learning applications, and he has provided internal and external deep learning training. Jon is particularly focused on the application of deep learning to problems in defense and national security. He graduated from the University of Southampton in the UK in 2007 with a Ph.D. in mathematics. Prior to joining NVIDIA, Jon worked for the UK Ministry of Defence and spent four years on secondment to the US Department of Defense, where he was a research scientist focused on data analytics and machine learning for multi-sensor data streams. To keep learning new data science skills, Jon has been a longtime competitor on Kaggle.

The second annual Data Science Bowl was an online data science contest that took place in early 2016 and was hosted on the Kaggle platform. The objective was to develop an algorithm that could accurately estimate the volume of the left ventricle of a human heart, at the points of maximum and minimum volume, from a time-series of multiple cross-sectional magnetic resonance imaging (MRI) images of the heart. The contest provided thousands of MRI images for training, and the challenge was a natural fit for GPU-accelerated deep learning (DL). We'll hear some of the winning teams describe their approaches, discuss the complexities of working with sometimes messy clinical data, and see how deep learning can be applied to a time-series of 3D images.

Level: Beginner
Type: Talk
Tags: Medical Imaging; Deep Learning & Artificial Intelligence; Video & Image Processing

Day: Wednesday, 04/06
Time: 16:00 - 16:50
Location: Room LL21B

S6143 - Enhanced Blueprint Rendering in OpenGL

Christoph Kubisch Senior Developer Technology Engineer, NVIDIA
Highly-Rated Speaker
Christoph Kubisch is a senior developer technology engineer for NVIDIA, where he focuses on advanced OpenGL and Vulkan real-time rendering techniques suitable for CAD/DCC and scientific applications. He collaborates with external partners and NVIDIA's internal teams to optimize rendering algorithms. Prior to joining NVIDIA, he was a researcher on hardware-accelerated visualization techniques for medical datasets at the Otto-von-Guericke University of Magdeburg. He has also worked as a technical artist creating game art, technology, and tools.

We'll present rendering technology for stylized lines, often found in blueprint drawings in CAD software. The techniques allow higher-quality depiction of lines than classic OpenGL by adding flexibility to stippling patterns and joints, and they improve the quality of lines with arbitrary widths. We'll also present several optimization techniques that the system makes use of.

Level: All
Type: Talk
Tags: Product & Building Design; Real-Time Graphics

Day: Wednesday, 04/06
Time: 16:30 - 16:55
Location: Room LL21A

S6205 - Towards Efficient Option Pricing in Incomplete Markets

Shih-Hau Tan Ph.D. Student, University of Greenwich
Shih-Hau Tan graduated from National Tsing Hua University in Taiwan, finished his M.S. at the University of Nice Sophia Antipolis with an internship at INRIA in France, and is now working in London on the European Union's ITN-STRIKE Marie Curie project on computational finance. His research interests include nonlinear option pricing for incomplete markets, applications in commodity markets, and high performance computing with implementation on GPUs.

Nonlinear option pricing is a new approach for traders, hedge funds, and banks to obtain more accurate option prices and to perform fast model calibration on huge volumes of market data. Numerically, the main problem is solving fully nonlinear PDEs; strategies like Newton's method and the ADI scheme are employed. Batched operations are used as well, solving many different option pricing problems together at the same time. We'll introduce how to use OpenACC and CUDA libraries to accelerate the whole computation. The complexity analysis will be shown first. We obtain around 2X speedup by using OpenACC, and around 5X speedup by using cuSPARSE for solving tridiagonal systems and cuBLAS for level-2 (matrix-vector) operations.
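
As a rough sketch of the batched-library route described above (assuming the CUDA 7.5-era cuSPARSE API; names and sizes are illustrative, not the presenter's code), one ADI sweep's tridiagonal systems, one per option pricing problem, can be solved in a single call:

    #include <cusparse.h>

    // Solve `batch` independent m-point tridiagonal systems in one call.
    // dl/d/du/x are device arrays holding the batched lower, main, and
    // upper diagonals and right-hand sides, laid out m entries apart.
    void solve_adi_sweep(int m, int batch,
                         double* dl, double* d, double* du, double* x)
    {
        cusparseHandle_t h;
        cusparseCreate(&h);
        // CUDA 7.5-era entry point; later toolkits supersede it with
        // the gtsv2StridedBatch variants.
        cusparseDgtsvStridedBatch(h, m, dl, d, du, x, batch, m);
        cusparseDestroy(h);
    }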

Level: All
Type: Talk
Tags: Finance; OpenACC

Day: Wednesday, 04/06
Time: 16:30 - 16:55
Location: Marriott Salon 1

S6260 - Big Geospatial Data + Deep Learning + High Performance Computing = Geospatial Intelligence

Bingcai Zhang Tech Fellow, BAE Systems
Dr. Bingcai Zhang is a technical fellow at BAE Systems, the premier global defense and aerospace company. He joined BAE Systems in September 1995 right out of the University of Wisconsin-Madison, where he earned his Ph.D. in engineering and M.S. in computer science. Bingcai's research interests are geospatial information technology and 3D mapping; robot vision and unmanned systems; and 3D geoweb search. He has held positions as chief architect, chief photogrammetrist, R&D manager, and technical fellow with BAE Systems. Bingcai has three inventions: Embedded Photogrammetry, Next Generation Automatic Terrain Extraction (NGATE), and Automatic 3D Object Extraction.

We present two algorithms that are specifically designed to accurately detect geospatial objects in geospatial images. Combining these two algorithms with deep learning algorithms, we have achieved detection accuracy over 99% for vehicles, positional accuracy within 6 pixels, orientation accuracy of less than 10 degrees, and a false positive error rate of 0.001% with 7.5cm GSD aerial images. In essence, our algorithms bring the learning capability of deep learning into template image matching for geospatial intelligence, and they reduce the false positive error rate by an order of magnitude over a softmax classifier. With over 99% accuracy, we believe this may be a game changer in the geospatial intelligence domain.

Level: All
Type: Talk
Tags: Aerospace & Defense; Big Data Analytics; Deep Learning & Artificial Intelligence

Day: Wednesday, 04/06
Time: 16:30 - 16:55
Location: Marriott Salon 2

S6280 - Accelerating Spark Workloads Using GPUs

Rajesh Bordawekar Research Staff Member, IBM Research
Rajesh Bordawekar is a Research Staff Member at the IBM T. J. Watson Research Center. His current interest is exploring software-hardware co-design of analytics workloads. He works at the intersection of high-performance computing, analytics, and data management domains. He has been investigating how GPUs could be used for accelerating key analytics kernels in text analytics, data management, graph analytics, and deep learning. As part of this work, he collaborates closely with the IBM Power Systems, and various analytics and database product teams. He is currently leading a team that is exploring applications of GPUs for accelerating key Spark workloads.

The Apache Spark engine is being increasingly used for implementing large-scale distributed analytics workloads. These workloads cover a wide array of analytics models, including predictive analytics, optimization, and graph analytics. We'll discuss opportunities for exploiting GPUs to accelerate different Spark components such as MLlib. The talk will first give an overview of the Spark programming and execution model and then describe key issues in integrating GPUs into the Spark infrastructure. We'll then describe our approach for enabling Spark to use multiple GPUs in a distributed manner and provide details of accelerating key MLlib kernels without changing the source Spark program.

Level: All
Type: Talk
Tags: Big Data Analytics; Deep Learning & Artificial Intelligence; Algorithms

Day: Wednesday, 04/06
Time: 16:30 - 16:55
Location: Room 210F

S6287 - PerfMon Redux: Analyzing a CUDA® Application With the Windows Performance Monitor

Richard Wilton Research Scientist, Johns Hopkins University
Highly-Rated Speaker
Richard works on petabyte-scale databases in the Institute for Data Intensive Engineering and Science (IDIES) in the Department of Physics and Astronomy at Johns Hopkins University. He designed and implemented data-transformation workflows for the Pan-STARRS astronomical survey database. He is the lead developer of Arioc, a GPU-based short-read DNA sequence aligner that is a key component in the preparation of data for the NIH-funded Terabase Search Engine project.

Learn how to use the Performance Monitor tool ("PerfMon") in Microsoft Windows to do non-invasive real-time visualization of the performance of a CUDA application. This approach lets you aggregate performance data from the host operating system and hardware along with GPU performance metrics, and makes it possible to examine the interactions between GPU components (CUDA compute and memory activity) and non-GPU components (CPU activity, disk I/O, and host memory) throughout the execution lifetime of a complex CUDA application. Examples will be provided from the performance analysis of a pipelined CUDA application that runs kernels on multiple GPUs and that makes intensive concurrent use of CPU threads and host memory.

Level: Intermediate
Type: Talk
Tags: Performance Optimization

Day: Wednesday, 04/06
Time: 16:30 - 16:55
Location: Room 212A

S6348 - Computational Drug Discovery Using Deep Learning

Olexandr Isayev Research Professor, University of North Carolina at Chapel Hill
Olexandr Isayev is a scientist at UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill. His research interests focus on making sense of chemical data with molecular modeling and machine learning. Before joining UNC in 2013, he was a post-doctoral research fellow at the Case Western Reserve University and scientist at a government research lab. In 2008, he received his Ph.D. in computational chemistry. He received the "Emerging Technology Award" from the American Chemical Society (ACS) and the GPU computing award from NVIDIA in 2014.

Learn how deep learning can address some of the most critical problems of computational drug discovery. Historically, the field has been strongly focused on the development of drugs intended to act against one specific target with high potency and selectivity. It is now recognized that these concepts are too simplistic. At the same time, there has been unprecedented growth of chemical databases incorporating hundreds of billions of useful chemical records. Deep learning is well suited to address both of these challenges, and GPU computing is the central hardware technology that allows deep learning to scale.

Level: Advanced
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Computational Chemistry; Big Data Analytics

Day: Wednesday, 04/06
Time: 16:30 - 16:55
Location: Room 210G

S6417 - FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters

Forrest Iandola CEO, DeepScale
Forrest Iandola will complete his Ph.D. in EECS at UC Berkeley in spring 2016. He has published more than 10 papers on computer vision and has applied computer vision research in industry at companies such as NVIDIA and Microsoft.

One of the largest barriers to industrial adoption of deep learning is the time required to train models; it can take a week or more to train a high-quality deep neural network on a GPU workstation. We present FireCaffe, which trains state-of-the-art deep neural networks on a cluster of 32 GPUs with a 23x speedup over a single GPU.

Level: Intermediate
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Computer Vision & Machine Vision; Big Data Analytics

Day: Wednesday, 04/06
Time: 16:30 - 16:55
Location: Room 210E

S6418 - Bringing NVIDIA GPUs to the PGAS/OpenSHMEM World: Challenges and Solutions

Dhabaleswar K. (DK) Panda Professor and University Distinguished Scholar, The Ohio State University
Highly-Rated Speaker
Dhabaleswar K. (DK) Panda is a professor and university distinguished scholar of computer science and engineering at the Ohio State University. He has published over 350 papers in major journals and international conferences. The MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) open-source software package, developed by his research group, is currently being used by more than 2,450 organizations in 76 countries around the world. This software has enabled several InfiniBand clusters to get into the latest TOP500 ranking during the last decade. More than 293,000 downloads of this software have taken place from the project's website alone. He is an IEEE fellow and a member of ACM.

Learn about techniques and solutions that bring GPU computing to the world of Partitioned Global Address Space (PGAS) models, especially the emerging OpenSHMEM paradigm. PGAS models are gaining attention for providing shared memory abstractions that make it easy to develop parallel applications with dynamic and irregular communication patterns. However, the existing OpenSHMEM standards do not support direct communication on GPU memory. This talk discusses simple extensions to the OpenSHMEM model that address this issue, along with the challenges and solutions in designing CUDA-aware runtimes that support these extensions and optimize data movement using CUDA IPC and GPUDirect RDMA. The impact of these concepts on application performance is demonstrated.
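
The abstract doesn't spell out the proposed extensions, so the sketch below only illustrates the idea: standard OpenSHMEM calls plus one hypothetical allocator (shmemx_malloc_gpu, invented here purely for illustration) that places the symmetric buffer in GPU memory, letting an ordinary one-sided put move data GPU-to-GPU:

    #include <shmem.h>

    int main(void)
    {
        shmem_init();
        int me = shmem_my_pe();
        int np = shmem_n_pes();

        // Hypothetical extension: symmetric allocation in GPU memory
        // (standard shmem_malloc only manages host memory).
        double* buf = (double*) shmemx_malloc_gpu(1024 * sizeof(double));

        // A standard one-sided put can then move data directly between
        // GPUs, e.g., via CUDA IPC on-node or GPUDirect RDMA off-node.
        shmem_putmem(buf, buf, 1024 * sizeof(double), (me + 1) % np);
        shmem_barrier_all();

        shmem_finalize();
        return 0;
    }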

Level: Intermediate
Type: Talk
Tags: Supercomputing & HPC; Programming Languages; Tools & Libraries

Day: Wednesday, 04/06
Time: 16:30 - 16:55
Location: Room 211A

S6425 - High-Performance Wave Modeling in Nanooptics on GPGPU-Based Supercomputers

Andrey Zakirov Researcher, Kintech Lab
Andrey Zakirov graduated from the Moscow Institute of Physics and Technology (MIPT), Faculty of Applied Mathematics, in 2009. In 2012, he received his Ph.D. with a thesis entitled "Application of Local-recursive nonLocal-asynchronous algorithms in full-wave numerical modeling." He was involved in the development of the CFmaxwell program, which allows users to model nanooptical devices and a wide range of other materials. The code is extremely effective and suffers no performance loss at extra-large grids thanks to its use of innovative LRnLA algorithms.

We'll describe our code (DTmaxwell) for full-wave 3D numerical simulation of electromagnetic and elastic wave propagation using the FDTD (finite-difference time-domain) method. The code is based on local recursive non-local asynchronous (LRnLA/DiamondTile) algorithms for GPGPU and achieves a performance rate of about 2 billion cells/sec per GPU device, especially for big-data problems. Almost ideal scalability to thousands of GPUs makes it possible to model wave processes that require extra-large computational grids. We'll also present benchmarks of the code on several supercomputers (Cray Titan, Tsubame 2.5, and Lomonosov-2).
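
For readers new to FDTD, a schematic, textbook-style Yee update for one field component on a 2D slice is sketched below; DTmaxwell's LRnLA/DiamondTile traversal reorders such updates for locality and asynchrony, which this naive kernel does not attempt:

    // Naive FDTD update of Hz from the curl of E on an nx-by-ny grid.
    // One thread per cell; C combines dt, mu, and the grid spacing.
    __global__ void update_hz(float* hz, const float* ex, const float* ey,
                              int nx, int ny, float C)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i >= nx - 1 || j >= ny - 1) return;

        int id = j * nx + i;
        hz[id] -= C * ((ey[id + 1] - ey[id]) - (ex[id + nx] - ex[id]));
    }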

Level: Advanced
Type: Talk
Tags: Computational Physics; Supercomputing & HPC

Day: Wednesday, 04/06
Time: 16:30 - 16:55
Location: Marriott Salon 6

S6506 - Windowed All-kNN Search over Multidimensional Array Data from Medical Imaging

Dimitris Floros Ph.D. Student, EECS Aristotle University of Thessaloniki
Dimitris Floros is a Ph.D. candidate in electrical and computer engineering at Aristotle University of Thessaloniki, Greece. Dimitris received his diploma from the same institution in 2015, with a thesis on the analysis and design of 3D projection mapping systems. His primary research interests lie in the field of high performance numerical computing. He is also involved with an award-winning team that designs and implements interactive web apps and tools that enable casual users to solve practical everyday problems.

We'll present a systematic approach for automatic optimization in searching for the k-nearest neighbors among all elements in a space or space-time domain, by their affinity in a feature space as well as their proximity in the domain. In comparison to its window-free counterpart, windowed kNN search, which respects local properties in images, scales linearly with the data domain size in terms of arithmetic complexity. But the search remains time-consuming, due in part to subtle difficulties in keeping data locality high and computation redundancy low. We resolve the problem by orchestrating dimension permutations and reshapes, depending on the data size, window size, and GPU memory hierarchy, and by utilizing efficient matrix operations, all in high-level CUDA expressions.

Level: Intermediate
Type: Talk
Tags: Medical Imaging; Big Data Analytics; Performance Optimization

Day: Wednesday, 04/06
Time: 16:30 - 16:55
Location: Room 212B

S6540 - Need for Speed: Accelerating High-Accuracy Quantum Chemistry Using OpenACC Directives

Janus Eriksen Postdoctoral Researcher, Aarhus University
Janus Eriksen, Ph.D., is a postdoctoral researcher at the qLEAP Center for Theoretical Chemistry, Aarhus University, Denmark, where his work is concerned in part with new theoretical developments in the area of accurate wave function-based quantum chemistry, and in part with the HPC adaptation of the resulting models. He has authored or co-authored more than 10 publications in international peer-reviewed journals and is one of the key developers of the Divide-Expand-Consolidate (DEC) local correlation module of the massively parallel and linear-scaling LSDalton electronic structure program.

Quantum chemistry (QC), that is, the application of quantum mechanics to molecular systems, has become an integral tool of most, if not all, of the chemical, biological, and general material sciences. In this session, we describe how we achieved speedups of more than 10x by accelerating existing CPU-based implementations of two of the most prominent models of modern wave function-based QC, the RI-MP2 and CCSD(T) models, as well as their local correlation Divide-Expand-Consolidate (DEC) formulations, DEC-RI-MP2 and DEC-CCSD(T). The codes in question have been accelerated in the massively parallel and linear-scaling LSDalton program using the compiler directives of the OpenACC 2.0 standard. Examples illustrating the efficiency of the resulting (portable) OpenACC GPU port will be provided.

Level: Intermediate
Type: Talk
Tags: Computational Chemistry; Supercomputing & HPC; OpenACC

Day: Wednesday, 04/06
Time: 16:30 - 16:55
Location: Marriott Salon 5

S6666 - Fast Splittable Pseudorandom Number Generators

Guy Steele Software Architect, Oracle Labs
Guy Steele is a software architect at Oracle Labs, where he is responsible for research in language design and implementation strategies, and architectural and software support for programming languages. He has also been an assistant professor of computer science at Carnegie-Mellon University; a member of technical staff at Tartan Laboratories in Pittsburgh, Pa.; and a senior scientist at Thinking Machines Corporation in Cambridge, Mass. Guy joined Sun Microsystems (later acquired by Oracle) in 1994 as a distinguished engineer and was named a Sun Fellow in 2003. He received his B.A. in applied mathematics from Harvard College (1975), and his M.S. and Ph.D. in computer science and artificial intelligence from M.I.T. (1977 and 1980).

We describe two new classes of algorithms for a "splittable" pseudorandom number generator (PRNG) that is quite fast: either 9 or 11 64-bit arithmetic/logical operations per 64 bits generated. A splittable PRNG provides a "split" operation that creates a new PRNG that is computationally and statistically independent of its creator and therefore may be used in parallel. Splittable PRNG objects make it easy to organize the use of pseudorandom numbers in multithreaded programs where the number of threads may vary dynamically, but they also have sufficient speed and quality to be useful when the number of threads is fixed. The generator is faster than MRG32k3a and of higher quality than XORWOW. No locking or synchronization is required, and the algorithm is quite suitable for SIMD or GPU implementation.
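
For context, the best-known generator in this family is SplitMix64, from the earlier "Fast Splittable Pseudorandom Number Generators" paper; the two new classes presented here may differ, but its output step shows where the nine-operation count comes from (one add plus a xor-shift-multiply mixer):

    #include <stdint.h>

    // SplitMix64 output step: 9 64-bit arithmetic/logical operations.
    // A "split" gives the child its own state and a fresh odd gamma,
    // making parent and child streams statistically independent.
    uint64_t splitmix64_next(uint64_t* state)
    {
        uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);  // golden-ratio gamma
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
        z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
        return z ^ (z >> 31);
    }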

Level: All
Type: Talk
Tags: Algorithms; Tools & Libraries; Performance Optimization

Day: Wednesday, 04/06
Time: 16:30 - 16:55
Location: Marriott Salon 3

S6726 - Solid State LiDAR for Ubiquitous 3D Sensing

Louay Eldada CEO, Quanergy Systems, Inc.
Dr. Louay Eldada is Founder and Chief Executive Officer of Quanergy Systems, Inc. Prior to founding Quanergy, he founded and sold three photonic IC businesses to Fortune 100 companies. He chaired and organized 160 conferences; delivered 200 keynotes, invited talks and courses; published 270 technical papers, books and book chapters; received 50 technical awards; and holds 65 patents. Dr. Eldada studied business administration at Harvard, MIT and Stanford, and holds a Ph.D. in optoelectronics from Columbia University.

This tutorial covers, for the first time, the technology, operation, and applications of Quanergy's solid state LiDAR, which is making 3D sensing ubiquitous with its low price point, no moving parts, small form factor, light weight, low power consumption, long range, high resolution, high accuracy, long lifetime, and ability to operate in various environmental conditions. GPUs are used to perform, in real time, (1) LiDAR/video data fusion for modeling and recognizing the environment around a vehicle; (2) object detection, classification, identification, and tracking; (3) scenario analysis and path planning based on deep learning; and (4) actuation of vehicle controls.

Level: All
Type: Talk
Tags: Self-Driving Cars & Automotive ; Robotics & Autonomous Machines; Deep Learning & Artificial Intelligence

Day: Wednesday, 04/06
Time: 16:30 - 16:55
Location: Room LL21E

S6758 - Deep Learning Robot

Duy Huynh Founder, Autonomous
Duy Huynh is the founder of Autonomous and a former IBM software architect. He is currently a Ph.D. student at the University of Maryland.

The Deep Learning Robot is built for advanced research in robotics and artificial intelligence (deep learning). It comes pre-installed with Google TensorFlow, Robot Operating System (ROS), Caffe, Torch, Theano, CUDA, and cuDNN. We'll show a live demo of the Deep Learning Robot and how developers and researchers can build applications for it, such as autonomous navigation, object recognition, speech recognition, and natural language processing. We'll walk through a concrete example of building a robot that can navigate autonomously, recognize objects around the room, and understand voice commands from the presenter.

Level: Beginner
Type: Talk
Tags: Robotics & Autonomous Machines

Day: Wednesday, 04/06
Time: 16:30 - 16:55
Location: Room LL20D

S6146 - SculptPrint: Subtractive 3D Printing through GPUs

Tommy Tucker CEO, Tucker Innovations
Tommy Tucker is the CEO and owner of Tucker Innovations. He has a secret clearance and a Ph.D., M.S., and B.S. in mechanical engineering. He has over 15 years of experience writing computationally intensive software applications for engineering, medical, and defense applications. After spending the early part of his career at high-tech startup companies, Tommy founded Tucker Innovations to facilitate his software consulting activities. Through Tucker Innovations, he has aided various organizations in producing software applications from concept to product launch and continuing through multiple release cycles. The Tucker Innovations team includes a blend of U.S.-based employees and offshore contractors. Clients range from small, high-tech startup companies to large organizations such as 3M, 3D Systems, the U.S. Navy, and U.S. Air Force.

We'll describe a new software package that moves GPUs from rendering virtual 3D objects seen with your eyes to making physical 3D objects held in your hands. SculptPrint is a computer-aided manufacturing (CAM) application for producing computer numerical controlled (CNC) machine tool cutting tool paths with a high level of automation and rich 3D feedback to the machinist and manufacturing engineer. The underlying technology uses a fundamentally different geometry representation from traditional CAD/CAM systems and leverages a suite of different GPU parallel processing algorithms to accomplish its workflow. SculptPrint technology is referred to as "Subtractive 3D Printing" to distinguish it from traditional NC CAM.

Level: Intermediate
Type: Talk
Tags: Product & Building Design

Day: Thursday, 04/07
Time: 09:00 - 09:50
Location: Room LL21A

S6147 - Single- and Multi-GPU Acceleration of Learning Using Word Vectors

Rajesh Bordawekar Research Staff Member, IBM T. J. Watson Research Center
Rajesh Bordawekar is a research staff member at the IBM T. J. Watson Research Center. His current interest is exploring software-hardware co-design of analytics workloads. He works at the intersection of high performance computing, analytics, and data management domains. He has been investigating how GPUs could be used for accelerating key analytics kernels in text analytics, data management, graph analytics, and deep learning. As part of this work, he collaborates closely with the IBM Power Systems, and various analytics and database product teams.

Vector representations of words are widely used for identifying relationships between word entities in a data corpus. A word vector computation tool takes a text corpus as input and returns a set of vectors, potentially one for each word entity. The vectors can then be processed to find words with similar meanings and word synonyms. The word-to-vector learning process uses a wide variety of algorithms, such as word2vec or GloVe. This talk describes acceleration of these algorithms on a wide array of datasets, in both single- and multi-GPU environments.
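
As a hedged sketch of that post-processing step (not IBM's implementation), finding the words most similar to a query vector reduces to a cosine similarity against every row of the learned embedding matrix, which maps naturally onto one GPU thread per word:

    // One thread per vocabulary word: cosine similarity between query
    // vector q and row w of the embedding matrix (vocab x dim, row-major).
    __global__ void cosine_sim(const float* vecs, const float* q,
                               float* sim, int vocab, int dim)
    {
        int w = blockIdx.x * blockDim.x + threadIdx.x;
        if (w >= vocab) return;

        float dot = 0.f, nv = 0.f, nq = 0.f;
        for (int k = 0; k < dim; ++k) {
            float v = vecs[w * dim + k];
            dot += v * q[k];
            nv  += v * v;
            nq  += q[k] * q[k];
        }
        sim[w] = dot * rsqrtf(nv * nq + 1e-12f);  // guard divide-by-zero
    }

The top-k entries of sim then give the nearest words; a reduction or sort (for example, with Thrust) finishes the query.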

Level: All
Type: Talk
Tags: Algorithms; Big Data Analytics; Deep Learning & Artificial Intelligence

Day: Thursday, 04/07
Time: 09:00 - 09:25
Location: Marriott Salon 3

S6170 - Real World Experiences from vGPU in the Cloud

Kyle Grossmiller VDI Solutions Architect, Pure Storage
Kyle is a VDI Solutions Architect at Pure Storage, where he focuses on helping customers bring their VDI projects to the next level of success using Pure’s All-Flash Arrays. Prior to joining Pure, Kyle was at Lockheed Martin Space Systems Company for over 12 years where he worked in dual IT roles supporting their CAD, CAM and FAE engineering user base as well as serving as the technical lead for their internal private-cloud VDI. Recently coming from a larger enterprise, Kyle is intimately aware of the challenges faced by those organizations and is extremely optimistic about the transformative positive change that emerging technologies like vGPU and Pure All-Flash Arrays will be able to rapidly bring about.
Scott McIsaac co-CTO, Secure24
Scott is co-CTO at Secure24, a leading midwestern cloud service provider, where he architects large-scale compute pods utilizing best-in-class hardware and software infrastructure. Scott brings 15 years of experience in enterprise cloud computing, infrastructure management and design, and virtualization technologies. His focus has been on purpose-built solutions that support business-critical applications and on using technology to solve business challenges. He is currently leading Secure24's work to provide NVIDIA GRID vGPU as a service to its clients.

As the promise of connecting workstations to the cloud for rich visual graphics becomes reality, visual thinkers (engineers, artists, scientists, and gamers) are beginning to move their models and run their applications in the cloud. While virtual desktops were traditionally regarded as too slow to deliver a good user experience, NVIDIA GRID vGPU technology is a breakthrough in delivering rich-graphics VDI as a service. NVIDIA GRID vGPU, VMware, a modern cloud delivery infrastructure, and a set of best practices combine to deliver rich graphics applications as a service.

Level: All
Type: Talk
Tags: Graphics Virtualization; Data Center & Cloud Computing; Product & Building Design

Day: Thursday, 04/07
Time: 09:00 - 09:50
Location: Room LL21D

S6196 - Graphics Virtualization: Leveraging Microsoft's New GPU Virtualization Technologies

Chris Huybregts Senior Program Manager, Microsoft
Chris Huybregts drives the strategy and vision for Microsoft's Virtual GPU technology stack. By working with teams like HyperV and Azure, he's helping ensure the capabilities of GPUs in different virtualized environments meet the expectations of Microsoft's technology partners. Additionally, Chris owns the story around Microsoft's virtualized, GPU-accelerated visualization technology stack found in Remote Desktop.

Azure's new GPU-enabled N series VMs are leveraging new technology being developed across Microsoft's stack. In this talk we'll go over the roadmap of GPU virtualization advancements, how you can leverage HyperV to configure a deployment in house for testing or production, and what changes are being made to support visualization on this platform. The talk will provide the details needed for solution providers to leverage Microsoft's hypervisor to enable GPU pass-through leveraging discrete device assignment to a variety of VMs. We'll also be going over the additional visualization scenarios this new technology stack enables. This talk is appropriate for those looking to leverage Azure as well as their own deployments.

Level: All
Type: Talk
Tags: Graphics Virtualization; Data Center & Cloud Computing

Day: Thursday, 04/07
Time: 09:00 - 09:25
Location: Marriott Salon 4

S6256 - Particle Simulations with HOOMD-Blue

Joshua Anderson Research Area Specialist Lead, University of Michigan
Joshua Anderson is a Research Area Specialist Lead in the Glotzer Group at the University of Michigan. Dr. Anderson received his Ph.D. in Condensed Matter Physics from Iowa State University and is the creator and lead developer of HOOMD-blue. He is the 2015 winner of the CoMEF Young Investigator Award for Modeling and Simulation.

Come and see how to use HOOMD-blue, a flexible particle simulation tool. HOOMD-blue runs hard particle Monte Carlo, molecular dynamics, DPD, and other types of particle simulations, all on GPUs. It runs on everything from single-GPU workstations up to thousands of GPUs on supercomputers, and uses Python scripts to configure jobs with custom initialization, complex flow control, and in-situ analysis of data. This talk introduces HOOMD-blue features and describes how to use them, focusing on the newest capabilities. It demonstrates job scripts for common usage patterns and shows examples of efficient workflows.

Level: Beginner
Type: Talk
Tags: Computational Chemistry; Computational Physics

Day: Thursday, 04/07
Time: 09:00 - 09:50
Location: Marriott Salon 5

S6347 - Multi GPU, Interactive 3D Simulator for the Lattice Boltzmann Immersed Boundary Method

Bob Zigon Senior Staff Research Engineer, Beckman Coulter
Highly-Rated Speaker
Bob Zigon is a senior staff research engineer and has worked at Beckman Coulter for 13 years. He has degrees in computer science and mathematics from Purdue University. He was the architect of Kaluza, an NVIDIA Tesla-powered analysis application for flow cytometry. He's now researching how machine learning techniques can be applied to laboratory automation. His interests include high performance computing, numerical analysis, machine learning, and software development for life science.

The Lattice Boltzmann Immersed Boundary method is a technique in computational fluid dynamics used to model and simulate fluid-structure-interaction problems. The goal of this session is to demonstrate practical strategies for partitioning the computations across Tesla K40 cards while exploiting the programmable pipeline inside of a Quadro K5000 to visualize the 3D flow fields at interactive rates. GPU results will be compared with an OpenMP implementation. Full source code will be provided.
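
As a schematic fragment of the method (not the presenters' multi-GPU code), the heart of a lattice Boltzmann step is a per-site BGK collision that relaxes each distribution toward equilibrium; partitioning the lattice across GPUs then reduces to exchanging halo layers of these distributions:

    // BGK collision over an n-site D2Q9 lattice (arrays of length 9*n,
    // distribution-major): relax every f toward its equilibrium feq.
    __global__ void bgk_collide(float* f, const float* feq,
                                int n, float tau)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float omega = 1.0f / tau;  // relaxation rate
        for (int k = 0; k < 9; ++k)
            f[k * n + i] -= omega * (f[k * n + i] - feq[k * n + i]);
    }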

Level: All
Type: Talk
Tags: Computational Fluid Dynamics; Computational Physics

Day: Thursday, 04/07
Time: 09:00 - 09:25
Location: Marriott Salon 1

S6356 - Visual Sensemaking with GPU-Driven Machine Learning

Stef van den Elzen Visualization Architect, SynerScope BV
Stef van den Elzen has worked at SynerScope B.V. as a visualization architect since July 2015. Stef obtained his M.S. with honors in 2011. In July 2011, he started as a developer with SynerScope B.V. on a Ph.D. project at the Eindhoven University of Technology. For his research, he developed tools and techniques for the exploration of dynamic multivariate networks. The results of his Ph.D. research, completed in 2015, were published in a number of articles in international conference proceedings and journals. He received three best paper awards: at IEEE PacificVis 2013 for his work on extended massive sequence views, at IEEE InfoVis 2014 for his work on multivariate network exploration and presentation, and at IEEE VAST 2015 for work on dynamic network exploration. Furthermore, he received two Best Visualization awards for his work on the exploration and analysis of massive mobile data at the Data for Development Challenges 2013 and 2015. The work on reordering massive sequence views led to a patent application.

We show how our interactive, integrated analytics solution allows a new class of users to perform machine-assisted visual sensemaking. Until now, machine learning techniques such as predictive analytics and deep learning have mostly been used as part of a complex tool chain that serves as an endpoint in the decision-making process. We combine the strengths of human decision making and GPU-driven machine learning in a multi-coordinated visual analytics solution. This enables the discovery of actionable insights by bridging the gap between data scientist and business user.

Level: Beginner
Type: Talk
Tags: Big Data Analytics; Deep Learning & Artificial Intelligence; Self-Driving Cars & Automotive

Day: Thursday, 04/07
Time: 09:00 - 09:25
Location: Room 210F

S6410 - Comparing OpenACC 2.5 and OpenMP 4.5

Jeff Larkin DevTech Engineer, NVIDIA
Highly-Rated Speaker
Jeff Larkin is a software engineer in NVIDIA's Developer Technology (DevTech) group where he works on porting and optimizing HPC applications. He is also closely involved with the development of both the OpenACC and OpenMP specifications. Prior to joining NVIDIA Jeff worked in Cray's Supercomputing Center of Excellence at Oak Ridge National Laboratory.
James Beyer Senior Runtime Engineer, NVIDIA
James Beyer recently moved to NVIDIA after a 15-year tenure at Cray Inc. He is a longtime member of the OpenMP language committee, and co-chair of the OpenMP subcommittee on accelerator directives. He was one of the founding members of the OpenACC specification and remains an active member of the OpenACC technical committee.

We'll compare the current state of two competing accelerator directive sets: OpenACC 2.5 and OpenMP 4.5. As members of both the OpenACC technical committee and the OpenMP language committee, we'll provide an inside take on the current state of the directives and insight into how to transition between the directive sets.
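
As a flavor of the comparison (a minimal sketch; the clause spellings follow the published 2.x/4.x specifications, and real codes need more tuning), the same saxpy loop offloaded under each directive set looks like this:

    // OpenACC 2.x version of saxpy.
    void saxpy_acc(int n, float a, const float* x, float* y)
    {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    // OpenMP 4.x target-offload version of the same loop.
    void saxpy_omp(int n, float a, const float* x, float* y)
    {
        #pragma omp target teams distribute parallel for \
                    map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }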

Level: All
Type: Talk
Tags: Programming Languages; Performance Optimization; OpenACC

Day: Thursday, 04/07
Time: 09:00 - 09:50
Location: Room 212B

S6466 - High-Speed GPU Parallel Algorithm for the 2-D HOPS Formula and Application to Bio-Sensor Design

Takahiro Sasaki PhD student, University of Minnesota, Twin Cities
Takahiro Sasaki is a Ph.D. candidate in scientific computation at the University of Minnesota, Twin Cities. Takahiro received his B.S. and M.S. in electronic engineering, focusing on optics, from Osaka University in Japan, and gained experience in optical-system design and imaging-performance analysis for optical lithography tools at an optical instruments company in Japan. His research combines optics, numerical methods, and parallel computing.

We'll describe high-speed GPU parallel algorithms for computing rigorous diffracted optical fields from 2-D periodic structures by combining the HOPS (high-order perturbation of surface) formula with high-performance GPUs. This work contributes not only to the acceleration of device development, such as bio-sensors using surface plasmon resonance, but also to a wide range of applications such as computer graphics and inverse problems. After analyzing the parallelizability of the HOPS formula, we'll explore two different algorithms that extract full performance from GPUs for different problem sizes, and we'll investigate an FFT-accelerated algorithm. In experiments, we'll compare execution of the three algorithms on CPUs and GPUs.

Level: Intermediate
Type: Talk
Tags: Computational Physics; Algorithms

Day: Thursday, 04/07
Time: 09:00 - 09:25
Location: Marriott Salon 6

S6485 - High-Performance GPU Programming for Deep Learning

Scott Gray Principal Software Engineer, Nervana Systems
Scott Gray is a principal engineer at Nervana Systems. He leads GPU kernel development providing key components for the deep learning library neon as well as the simulator for Nervana's novel distributed processor architecture. Scott is the author of maxas, a custom assembler for the NVIDIA Maxwell architecture, which he has used to achieve state-of-the-art results in numerical linear algebra. He has a long-standing interest in computational neuroscience and designing artificial algorithms to play games.

This session goes over many of the techniques we use at Nervana in GPU programming to achieve state-of-the-art performance for deep learning networks. The main focus will be on the customization of dense linear algebra kernels: Winograd 3x3 convolution, direct convolution, and small-tile GEMM (matrix multiply). In particular, we'll look at how we achieve high utilization at very small mini-batches, which is important for multi-GPU scaling and inference. In addition, we'll talk about where and how you can effectively leverage lower and mixed precision to further increase performance without loss in accuracy.
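
To make the Winograd piece concrete, here is the textbook minimal filtering algorithm F(2,3) in 1D (the 2D 3x3 kernel nests two of these): two convolution outputs from four multiplies instead of six. This is the standard algorithm, not Nervana's tuned kernel:

    // Winograd F(2,3): two outputs of a 3-tap convolution over d[0..3]
    // using 4 multiplies (direct evaluation needs 6). The g-transform
    // can be precomputed once per filter and reused across tiles.
    void winograd_f2x3(const float d[4], const float g[3], float y[2])
    {
        float m1 = (d[0] - d[2]) * g[0];
        float m2 = (d[1] + d[2]) * 0.5f * (g[0] + g[1] + g[2]);
        float m3 = (d[2] - d[1]) * 0.5f * (g[0] - g[1] + g[2]);
        float m4 = (d[1] - d[3]) * g[2];
        y[0] = m1 + m2 + m3;  // == d0*g0 + d1*g1 + d2*g2
        y[1] = m2 - m3 - m4;  // == d1*g0 + d2*g1 + d3*g2
    }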

Level: Advanced
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Performance Optimization; Algorithms

Day: Thursday, 04/07
Time: 09:00 - 09:50
Location: Room 210G

S6622 - Advances in Remoting Protocol Technology for 3D Graphics

Derek Thorslund Director of Product Management, Citrix
Highly-Rated Speaker
Derek Thorslund is director of product management, driving Citrix's product strategy and roadmap for HDX (high definition experience) multimedia virtualization and remoting protocol technologies across XenApp, XenDesktop, and the Citrix Receiver. He joined Citrix in 2003, where he played a key role in introducing the Citrix Access Suite, forerunner to XenApp/XenDesktop Platinum Edition. Derek's previous roles in the high-tech industry include director of Product Management at Avotus and manager of New Business Applications at Bell-Northern Research (Canada).
Stephen Vilke Sr. Director of Engineering, Citrix
Stephen Vilke is a technologist, entrepreneur, and enterprise executive with 24 years of experience innovating, inventing, and leveraging technology to solve real-world business problems across financial services, technology and government sectors. He led the development of the ground-breaking Framehawk HDX technology introduced to Citrix XenApp/XenDesktop in 2015. Previously, Stephen held leadership positions in technology strategy, architecture, and IT operations at Barclays Global Investors, CIBC, Alteon WebSystems, and Clarify/Nortel. Stephen started his career as a NASA programmer/analyst at the Space Sciences Laboratory at U.C. Berkeley.

Learn about Human UX Protocol design concepts in Citrix's next-gen HDX display remoting technology and hear from XenApp customers pushing the limits with graphics virtualization over high-latency connections halfway around the world. User experience is fundamental to a successful implementation and the remoting protocol used to transmit the virtualized 3D app or desktop from the data center or cloud to the worker is critical. Delivering 3D graphics to demanding users over a long-haul corporate WAN connection without excessive bandwidth consumption requires innovative solutions.

Level: All
Type: Talk
Tags: Graphics Virtualization; Data Center & Cloud Computing

Day: Thursday, 04/07
Time: 09:00 - 09:50
Location: Marriott Salon 2

S6649 - One Size Doesn't Fit All: The Importance of Aligning VR Environments to Workflows

Matt Szymanski Technical Consultant, Mechdyne Corporation
Matt earned his B.S. in bioengineering and M.S. in electrical engineering and computer science from the University of Illinois-Chicago. After working at Argonne National Laboratory as a scientific research software developer, Matt co-founded and served as chief technology officer (CTO) at VRCO, a software company acquired by Mechdyne in 2006. Prior to his current role, Matt served as the VP of products at Mechdyne.

Despite the resurgence of interest in virtual reality due to HMDs like the Oculus Rift and HTC Vive, VR has been available in a variety of forms, from wearable, to desktop, to fully immersive rooms, for several decades. With all of the recent fanfare, you may be wondering: what is the best VR solution for me? The answer: it depends on your goals. This presentation covers VR's complexities, NVIDIA's role in a solution, and what qualifies as VR as opposed to mere 3D. A visual tour of VR technologies will demonstrate the pros and cons of different solutions, and use cases from various industries will highlight how VR technologies improve decision making, problem solving, and learning. We conclude with a checklist of questions you should ask when determining the best VR solution for your needs.

Level: All
Type: Talk
Tags: Large Scale and Multi-Display Visualization; Virtual Reality & Augmented Reality

Day: Thursday, 04/07
Time: 09:00 - 09:50
Location: Room 210E

S6652 - SQLite Running Entirely on the GPU

Sky Morey Chief Architect, DEG
Sky Morey embodies creative, adaptive thinking as made manifest through technology. He operates as a technical architect at the macro level, drawing on decades of experience across multiple projects and environments. Since joining DEG as its first associate, Sky has demonstrated his software development expertise through direct engagement with DEG's engineering staff, niche troubleshooting, and ongoing research and development.

SQLite is a lightweight relational database and the most widely deployed database engine in the world, usually embedded in other applications. We have ported SQLite entirely to the GPU. You'll see a working version running on the GPU with native host file system access, followed by lessons learned and a deeper dive into the platform. The project is open source under the MIT license, and usage examples will be shown. The project is still in early alpha.
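
For readers who haven't embedded SQLite before, the standard C API being ported is small; a host-side usage sketch follows (the GPU-resident entry points of this alpha project may differ):

    #include <stdio.h>
    #include <sqlite3.h>

    // Print each result row; sqlite3_exec invokes this per SELECT row.
    static int print_row(void* ctx, int ncol, char** vals, char** names)
    {
        for (int i = 0; i < ncol; ++i)
            printf("%s=%s ", names[i], vals[i] ? vals[i] : "NULL");
        printf("\n");
        return 0;
    }

    int main(void)
    {
        sqlite3* db;
        sqlite3_open("test.db", &db);
        sqlite3_exec(db,
                     "CREATE TABLE IF NOT EXISTS t(id INTEGER, name TEXT);"
                     "INSERT INTO t VALUES (1, 'gpu');"
                     "SELECT * FROM t;",
                     print_row, NULL, NULL);
        sqlite3_close(db);
        return 0;
    }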

Level: All
Type: Talk
Tags: Tools & Libraries; Embedded; Performance Optimization

Day: Thursday, 04/07
Time: 09:00 - 09:50
Location: Room 211B

S6679 - GPU in Surgery and Anatomy Simulation, Image Processing and Augmented Reality Projects

Boris Yaremin Director, High Performance Computing And Big Data Lab, Samara State Medical University
Boris Yaremin was born in 1977 in western Ukraine and graduated from Samara State Medical University in 2000. He is an associate professor in the Chair of Operative Surgery and Clinical Anatomy and a transplant surgeon at the Samara Center of Organ and Tissue Transplantation. As a computational scientist, he has served as team leader of the Scientific Educational Center "Virtual Technology in Medicine" and as product manager of the virtual anatomy atlas "Deep Anatomy" and of an endovascular surgery simulator. He is now director of the HPC unit of Samara State Medical University and director of the Regional Transplant Service of the Samara Region Health Ministry.

This talk covers the technical aspects of surgery and anatomy simulation at different levels. We'll discuss GPU use in a virtual clinic environment, surgery simulation on an endovascular trainer and a laparoscopic training system, and the creation of a virtual dissection table. The final part of the talk covers augmented reality in surgical practice.

Level: All
Type: Talk
Tags: Medical Imaging; Virtual Reality & Augmented Reality; Education & Training

Day: Thursday, 04/07
Time: 09:00 - 09:25
Location: Room LL21B

S6692 - Embedded Supercomputing: Radio Astronomy at the Limit

Simon Ratcliffe Technical Lead: Scientific Computing, SKA South Africa
Simon is the lead for scientific computing at SKA South Africa, responsible for the overall design and architecture of the high performance computing systems on the MeerKAT radio telescope. He is a core member of the international SKA Science Data Processor consortium, which is currently designing the compute backend for the Square Kilometre Array, one of the largest scientific facilities in the world.

This talk will present designs and performance results for a highly parallel Tegra X1-based compute platform being developed as part of a next-generation radio telescope. The MeerKAT radio telescope is currently under construction in the semi-desert Karoo region of Southern Africa, and we'll present ongoing work into developing novel computing technologies that deliver a large-scale computational platform within the strict confines of power, space, and emission in force at this remote site. Using the Tegra X1 as a building block, a rugged, oil-cooled platform has been developed that will power the imager that lies at the heart of the compute challenge. This is a follow-on talk to an initial exploration presented in 2015.

Level: All
Type: Talk
Tags: Astronomy & Astrophysics; Embedded

Day: Thursday, 04/07
Time: 09:00 - 09:25
Location: Room 212A

S6195 - Burning on the GPU: Fast and Accurate Chemical Kinetics

Russell Whitesides Member of Technical Staff, Lawrence Livermore National Laboratory
Dr. Russell Whitesides applies his theoretical and practical knowledge of chemical kinetics and scientific computing platforms to internal combustion engine simulations, with the goal of highly efficient, clean-combustion engines for transportation. Russell has pursued a variety of topics in mechanical engineering R&D over the course of his academic and research career. Since joining Lawrence Livermore National Laboratory, he has worked alongside the Methods Development Group at LLNL to enhance the capabilities and interoperability of scalable structural mechanics codes. His doctoral thesis focused on the atomistic chemical mechanisms of soot particle growth in combustion environments.

Come learn about our latest developments in accelerating combustion kinetics for computational fluid dynamics (CFD). We have extended our previously presented CUDA implementation to improve performance and will present multiple examples of improved solver performance. We will discuss the merits of our approach in comparison to related approaches and provide insight into the lessons we've learned along the way.

Level: Intermediate
Type: Talk
Tags: Computational Fluid Dynamics; Performance Optimization; Computational Physics; Aerospace & Defense

Day: Thursday, 04/07
Time: 09:30 - 09:55
Location: Marriott Salon 1

S6402 - Fast Detection of Neighboring Vectors

Krzysztof Kaczmarski Assistant Professor, Warsaw University of Technology, Faculty of Mathematics and Information Science
Since 2008, Krzysztof Kaczmarski has worked at the Faculty of Mathematics and Information Science at Warsaw University of Technology, in the Department of Computer Science and Numerical Methods, as a coordinator of computer science studies. Before this, he took part in several projects at the Polish-Japanese Institute of Information Technology. In 2007, he received a Ph.D. from the Institute of Computer Science, Polish Academy of Sciences, in the field of modeling distributed object-oriented databases.

We'll present several methods for detecting pairs of vectors that are at Hamming distance 1. This problem is an important part of cell graph construction for motion planning in a space with obstacles. We'll begin with a naive quadratic-time solution that simply compares all pairs of vectors, proceed through dedicated search trees, and move towards an optimal linear-time algorithm. Sequential linear-time algorithms for the problem were already known, but due to high constants hidden in the complexity function, they proved inefficient for real-life data. Our GPU-based massively parallel solution promises acceptable execution times, opening dynamic cell graph construction to real-time applications like robotics and optimal path searching.
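
To make the algorithmic idea concrete, here is a minimal CPU sketch (not the GPU implementation) of the hash-based route to linear expected time: instead of comparing all pairs, each vector looks up its `width` one-bit-flipped neighbors in a hash table. Vector values are assumed distinct.

```python
def hamming_one_pairs(vectors, width):
    """Find all pairs of distinct vectors at Hamming distance exactly 1.

    vectors: list of integers interpreted as `width`-bit vectors (assumed
    distinct). Runs in O(n * width) expected time using a hash table,
    versus the naive O(n^2 * width) all-pairs comparison.
    """
    index = {v: i for i, v in enumerate(vectors)}
    pairs = set()
    for i, v in enumerate(vectors):
        for bit in range(width):
            neighbor = v ^ (1 << bit)          # flip one bit
            j = index.get(neighbor)
            if j is not None and j != i:
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)

print(hamming_one_pairs([0b0000, 0b0001, 0b1001, 0b1111], width=4))
# [(0, 1), (1, 2)]
```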

Level: Intermediate
Type: Talk
Tags: Algorithms; Robotics & Autonomous Machines; Tools & Libraries

Day: Thursday, 04/07
Time: 09:30 - 09:55
Location: Marriott Salon 3

S6412 - Fourier Domain Pulsar Acceleration Searches on GPUs for the Square Kilometre Array

Sofia Dimoudi Research Associate, University of Oxford
Sofia Dimoudi studied for her Ph.D. at Durham University, where she worked on GPU acceleration of atmospheric tomography computational algorithms for real-time control on adaptive optics systems for extremely large telescopes. Currently, she works as a research associate at the OeRC, looking at hardware acceleration of real-time pulsar signal processing algorithms for next-generation radio telescopes.

We'll describe how we can accelerate one of the most demanding computational tasks of the real-time pulsar signal processing pipeline of the world's largest next generation radio telescope, the Square Kilometre Array (SKA). We'll explain the scientific goals and importance of pulsar searches, along with the technical challenges facing pulsar signal processing on the SKA. Pulsar acceleration searches will be introduced, and an overview of a Fourier Domain method for recovering signal power from binary accelerated pulsars will be given. We'll then present our GPU implementation of this method, discuss techniques used for optimisation, show comparative computational performance results, and consider performance projections with future GPU technology.
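
As a toy illustration of the Fourier-domain starting point, the sketch below, with made-up signal parameters, shows how a weak periodic signal hidden in noise appears as a peak in the power spectrum; the acceleration-search step that recovers power smeared across bins by binary motion is omitted here.

```python
import numpy as np

# Toy Fourier-domain pulsar search: a weak periodic signal buried in noise
# shows up as a peak in the power spectrum. The talk's acceleration search
# additionally correlates the spectrum with templates to recover power
# smeared across bins by binary motion; that step is omitted.
fs = 1000.0                                      # assumed 1 kHz sampling rate
t = np.arange(2**16) / fs
pulse_freq = 13.37                               # hypothetical spin frequency (Hz)
series = np.random.randn(t.size) + 0.1 * np.sin(2 * np.pi * pulse_freq * t)

spectrum = np.fft.rfft(series)
power = np.abs(spectrum) ** 2
freqs = np.fft.rfftfreq(t.size, d=1.0 / fs)
print("strongest candidate: %.2f Hz" % freqs[np.argmax(power[1:]) + 1])
```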

Level: All
Type: Talk
Tags: Astronomy & Astrophysics; Algorithms

Day: Thursday, 04/07
Time: 09:30 - 09:55
Location: Room 212A

S6472 - The Promise of GPU Analytics or Why GPU is the New CPU

Todd Mostak CEO, MapD
Todd Mostak is the founder and CEO of MapD. Before MapD, Todd was a researcher focusing on GPU databases at MIT CSAIL. Seeking adventure upon finishing his undergrad, Todd moved to the Middle East, spending two years in Syria and Egypt teaching English, studying Arabic, and eventually working as a translator for an Egyptian newspaper. He then completed his M.A. in Middle East studies at Harvard University, afterwards taking a position as a research fellow at Harvard's Kennedy School of Government, focusing on the analysis of Islamism using forum and social media datasets. His frustration with the inability of existing tools to allow for the interactive exploration of large Twitter datasets motivated him to create MapD.

We'll explain why GPU-powered in-memory databases and analytics platforms are the logical successor to CPU in-memory systems, largely due to recent increases in onboard memory available on GPUs. With sufficient memory, GPUs possess numerous advantages over CPUs, including much greater compute and memory bandwidth and a native graphics pipeline. We'll demo how MapD is able to leverage multiple GPUs per server to extract orders-of-magnitude performance increases over CPU-based systems, bringing interactive querying and visualization to multi-billion row datasets.

Level: All
Type: Talk
Tags: Big Data Analytics; Performance Optimization

Day: Thursday, 04/07
Time: 09:30 - 09:55
Location: Room 210F

S6501 - Deep Correspondence Restricted Boltzmann Machine for Cross-Modal Retrieval

Ruifan Li Assistant Professor , Beijing University of Posts and Telecommunications
Ruifan Li is an assistant professor of computer science at Beijing University of Posts and Telecommunications, and affiliated with the Engineering Research Center of Information Networks, Ministry of Education. Ruifan received B.S. and M.S. degrees in control systems and in circuits and systems from Huazhong University of Science and Technology, China, in 1998 and 2001, respectively. In 2006, he received a Ph.D. in signal and information processing from BUPT and joined the School of Computer Science there. In 2011, he spent one year as a visiting scholar at the Information Sciences Institute, University of Southern California. Ruifan's current research activities include neural information processing, multimedia information processing, and statistical machine learning.

Learn how the correspondence restricted Boltzmann machine (Corr-RBM), a deep learning model built from the classical RBM, is used for large-scale cross-modal retrieval, such as using a text query to retrieve images. We'll first introduce the RBM as one of the building blocks of deep learning, then describe the architecture of the Corr-RBM model and its learning algorithm. We construct two deep neural structures using Corr-RBM for the task of cross-modal retrieval. A number of comparison experiments, with their hardware and software settings, are reported on three public real-world datasets. We report the computational time using the NVIDIA Tesla K20c GPU for the largest dataset used in our experiments.
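
For readers unfamiliar with the building block, here is a minimal NumPy sketch of one contrastive-divergence (CD-1) update for a plain binary RBM; the correspondence term that ties the two modality-specific RBMs together in Corr-RBM is omitted, and all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.1):
    """One CD-1 update for a binary RBM (the building block of Corr-RBM).

    v0: batch of visible vectors, shape (batch, n_visible).
    W:  weights (n_visible, n_hidden); b, c: visible/hidden biases.
    The correspondence term used by Corr-RBM is omitted in this sketch.
    """
    ph0 = sigmoid(v0 @ W + c)                        # P(h=1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float) # sample hidden units
    pv1 = sigmoid(h0 @ W.T + b)                      # reconstruction
    ph1 = sigmoid(pv1 @ W + c)
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    b += lr * (v0 - pv1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)

v = rng.integers(0, 2, size=(16, 6)).astype(float)
W = 0.01 * rng.standard_normal((6, 4))
b, c = np.zeros(6), np.zeros(4)
for _ in range(100):
    cd1_step(v, W, b, c)
print(np.mean((v - sigmoid(sigmoid(v @ W + c) @ W.T + b)) ** 2))  # reconstruction error
```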

Level: Intermediate
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Algorithms; Video & Image Processing

Day: Thursday, 04/07
Time: 09:30 - 09:55
Location: Room 210D

S6543 - IBM SoftLayer Enables 3D Workspaces with NVIDIA GRID™

Jerry Gutierrez Global HPC Leader, SoftLayer, an IBM Company
Jerry Gutierrez, global HPC sales leader at SoftLayer, has worked in the IT industry for over 20 years. His career spans positions at Hewlett Packard and Terremark (formerly Data Return), and he founded Data Direct Business Solutions, an IT managed services and cloud solution provider. In 2012, he joined SoftLayer to help businesses from various industries across the globe successfully utilize the SoftLayer cloud, with a focus on HPC and GPU-accelerated solutions.

Expand virtualization to any user on the network. Deliver better graphics, improve productivity, and grant access to business-critical applications anywhere, all from SoftLayer and the IBM Cloud.

Level: All
Type: Talk
Tags: Graphics Virtualization; Media & Entertainment; Product & Building Design

Day: Thursday, 04/07
Time: 09:30 - 09:55
Location: Marriott Salon 4

S6651 - Using Today's Fastest Chips to Design the Chips of Tomorrow

Mauro Calderara Ph.D. Student, ETH Zurich
Mauro Calderara is a Ph.D. student working on a density functional theory code for simulating nanostructures under the supervision of Professor Mathieu Luisier at the Integrated Systems Laboratory at ETH Zurich. His background is in theoretical physics, which he studied at ETH and UC Berkeley until 2009, receiving an M.S. from ETH. While working for an IT company, he received a Certificate of Advanced Studies in computer science (CAS ETH in computational science and distributed systems) before starting his Ph.D. His interests focus on the algorithmic challenges of ab-initio quantum transport on modern hybrid supercomputers. He developed SplitSolve, a linear solver targeting the sparse linear systems encountered in quantum transport simulations; it runs on accelerators like NVIDIA GPUs or Intel's MICs and typically outperforms traditional sparse solvers such as MUMPS, Pardiso, or SuperLU.

We'll show how one can effectively leverage GPUs to perform ab-initio quantum transport simulations of realistic nanostructures and investigate the behavior of tomorrow's transistors down to the flow of electrons through atomic geometries. As transistors get smaller and smaller, simulations have to account for quantum mechanical effects from first principles. The computational cost incurred has so far limited such investigations to small domains that are of little practical interest from an experimental point of view. We present algorithms specifically geared towards GPUs that are more memory efficient and faster than traditional CPU-based ones. They speed up the calculations by an order of magnitude, allowing for the simulation of realistic devices.

Level: Intermediate
Type: Talk
Tags: Computational Physics; Algorithms; Supercomputing & HPC

Day: Thursday, 04/07
Time: 09:30 - 09:55
Location: Marriott Salon 6

S6657 - GPU-Based Real Time Reconstruction and Visualization of Cardiovascular 3D Ultrasound Images

Erik N. Steen Principal Engineer, GE Healthcare
Erik N. Steen is a principal engineer with GE Vingmed Ultrasound, responsible for technology strategy. He received his M.S. in computer science from the Norwegian Technical University in Trondheim in 1992 and his Ph.D. in the field of 3D medical image processing in 1996. He has worked at GE Vingmed Ultrasound in Norway since 1996. During the last few years, he has been actively involved in the development of a new GPU-based architecture for real-time ultrasound image reconstruction as well as real-time visualization of 3D cardiac images.

We'll cover the clinical and technical benefits of using GPUs for real-time reconstruction and visualization of 3-dimensional and 2-dimensional cardiovascular ultrasound images. The session has three main parts. First, we'll introduce some of the clinical challenges in cardiovascular ultrasound imaging. Second, we'll give an overview of a new image reconstruction architecture called cSound as well as some of the algorithms that have been implemented with this new architecture. The technical and clinical benefits of this architecture also will be discussed. Finally, we'll cover GPU-based real-time visualization of the reconstructed 3D images. Several examples of 2D and 3D ultrasound images will be shown.

Level: All
Type: Talk
Tags: Medical Imaging; Video & Image Processing

Day: Thursday, 04/07
Time: 09:30 - 09:55
Location: Room LL21B

S6165 - Optimizing Training Performance of Recurrent Neural Networks

Jeremy Appleyard Senior Engineer (DevTech Compute), NVIDIA
Jeremy is a senior member of NVIDIA's European Developer Technology team. Based near Oxford, he works with developers accelerating applications on GPUs. He has experience with both deep learning and traditional scientific computing applications. He holds a Ph.D. in computational fluid dynamics from Cranfield University.

By exposing parallelism between operations in a recurrent neural network, it is possible to achieve significant performance improvements when training. In this talk, a case study based on a long short-term memory (LSTM) recurrent network will be used to demonstrate a 5x speedup over a naive implementation for the forward pass of a single layer. A further 2x speedup (totaling 10x) will be shown when considering multiple layers. Results will also be presented for the backward pass.
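
One widely used optimization of this kind is fusing the four per-gate matrix products into a single larger GEMM per time step. The NumPy sketch below shows the idea only; it is not the talk's CUDA implementation, and the concatenated weight layout is an assumption.

```python
import numpy as np

def lstm_step_fused(x, h, cell, Wx, Wh, b):
    """One LSTM time step with the four gate GEMMs fused into one.

    Instead of four separate (input, forget, cell, output) matrix products,
    Wx and Wh are stored as concatenated (dim, 4*hidden) matrices so a
    single larger GEMM computes all gate pre-activations at once -- one of
    the parallelism-exposing tricks this style of optimization relies on.
    """
    gates = x @ Wx + h @ Wh + b                 # one fused GEMM pair, not four
    i, f, g, o = np.split(gates, 4, axis=1)
    i, f, o = (1.0 / (1.0 + np.exp(-z)) for z in (i, f, o))
    g = np.tanh(g)
    cell = f * cell + i * g
    h = o * np.tanh(cell)
    return h, cell

batch, in_dim, hidden = 4, 8, 16                 # arbitrary toy sizes
rng = np.random.default_rng(1)
Wx = 0.1 * rng.standard_normal((in_dim, 4 * hidden))
Wh = 0.1 * rng.standard_normal((hidden, 4 * hidden))
b = np.zeros(4 * hidden)
h = np.zeros((batch, hidden)); cell = np.zeros((batch, hidden))
h, cell = lstm_step_fused(rng.standard_normal((batch, in_dim)), h, cell, Wx, Wh, b)
```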

Level: Intermediate
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Performance Optimization

Day: Thursday, 04/07
Time: 10:00 - 10:25
Location: Room 210G

S6171 - Bringing Dassault Engineering Into The Cloud with NVIDIA GRID™

Christophe Delattre Software Architecture Director, Dassault Systemes
A graduate in computer graphics and information science, Christophe Delattre has spent his entire career at Dassault Systèmes. He led the 3D visualization team for over 10 years, making sure CATIA and other Dassault Systèmes solutions provide the best performance and reliability to their customers, and has also managed the technical partnership between NVIDIA and Dassault Systèmes. Over the past three years, he has led the entire remote graphics architecture and strategy of the Dassault Systèmes platform (aka the 3DEXPERIENCE platform). Supervising both internal development and the deployment over the cloud infrastructure, he ensures the best possible integration into the Dassault Systèmes dashboard, with minimal lag and the best user experience, leveraging NVIDIA GRID technology.
Stefan Schoenefeld Devtech Engineer, NVIDIA
Highly-Rated Speaker
Stefan Schoenefeld is a Senior Developer Technology Engineer at NVIDIA. After working on the Scenix Scene Graph SDK and the Workstation Performance Drivers, he now works on implementing NVIDIA GRID™, improving video encoding and new ideas for remote graphics.

In this session, you will learn how Dassault Systèmes uses NVIDIA GRID™ cards and the NVIDIA GRID™ SDK to bring their applications into the cloud. We will take a look behind the scenes at the technologies used to provide fast, high-quality graphics streaming and will cover implementation and design details of the Dassault remoting application. Furthermore, we will talk about new, exciting ideas for next-generation application remoting jointly developed by Dassault Systèmes and NVIDIA engineers. Finally, we will give demos of the Dassault remoting application as well as some of our next-generation prototypes.

Level: Intermediate
Type: Talk
Tags: Graphics Virtualization; Data Center & Cloud Computing; Product & Building Design

Day: Thursday, 04/07
Time: 10:00 - 10:50
Location: Marriott Salon 4

S6173 - Tuning Performance on Kepler GPUs: An Introduction to Kepler Assembler and Its Usage in CNN optimization

Zhe Jia Development Engineer, Alibaba
Zhe is a development engineer in the domain-specific computing team at Alibaba and has participated in several projects on deep learning code optimization. He specializes in improving kernel performance at every level, from algorithm design to low-level optimization, and is one of the developers of the Kepler assembler used at Alibaba. He graduated from Peking University, China, where he developed a computer model to simulate Jupiter's Great Red Spot and tsunamis, accelerating it with CUDA.

Learn advanced skills for performance optimization on Kepler GPUs. NVIDIA provides many powerful tools to analyze and improve the efficiency of CUDA kernels. However, in many specific cases, developers need to make more detailed adjustments to reach the expected performance. In this session, a native assembler for the Kepler architecture used at Alibaba will be introduced, and tuning experiences with CNN and GEMM implementations using this assembler will be shown as examples. If you are interested in assembly-level optimization and want to use such a tool on the Kepler architecture, you shouldn't miss this session!

Level: Advanced
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Performance Optimization

Day: Thursday, 04/07
Time: 10:00 - 10:25
Location: Room 210D

S6177 - Efficient Imaging in Radio Astronomy Using GPUs

Bram Veenboer Ph.D. Researcher, Astron
Bram Veenboer is a Ph.D. researcher at ASTRON, the Netherlands Institute for Radio Astronomy. Bram works on the Dome project, where he does research towards the biggest radio telescope in the world, the Square Kilometre Array. His research focuses on accelerator platforms and how they can be used to speed up the algorithms that transform observation data into sky images. Before joining ASTRON, he was a computer science student at the Vrije Universiteit in Amsterdam. His M.S. program focused on high performance and distributed computing, and for his final thesis he worked on spiking neural networks on GPUs at the Centrum voor Wiskunde en Informatica in Amsterdam.

We'll present an optimized GPU implementation of a new radio astronomical imaging algorithm that outperforms the current state of the art. In contrast to traditional imaging algorithms, it offers correction for direction-dependent effects at negligible additional computational cost. We'll explain why this algorithm works so well on GPUs and show the optimization techniques that were applied to get there.

Level: Intermediate
Type: Talk
Tags: Astronomy & Astrophysics; Performance Optimization; Algorithms

Day: Thursday, 04/07
Time: 10:00 - 10:25
Location: Room 212A

S6245 - IT-as-a-Service with Visually Intensive VDI: Designing, Testing and Scaling

Tony Foster Principal Technical Marketing Engineer, EMC
Tony has worked in the virtualization industry since 2005, starting at Eagle Technologies, where he implemented and supported virtualization solutions. During his seven years at Eagle, he designed, deployed, and supported hundreds of vSphere implementations across the Midwest. In 2012, Tony was named a VMware vExpert, a title he has continued to hold into 2015. He joined VCE in 2012 as a corporate engineer specializing in end-user computing (EUC) and helped design many large EUC implementations. In 2014, he joined the EMC EUC Solutions team, where he helps create new EUC technologies.

Learn how visually intensive VDI architectures operate within an IT-as-a-Service framework, run alongside more traditional VDI workloads and integrate with private-cloud architectures. Understand how visually intensive solutions need to be designed and deployed balancing cost, performance and usability for both IT and line-of-business users. Take away key design principles and best practices for enabling user self-service, architecting for linear scalability, applying security, monitoring user performance and integrating with existing private clouds.

Level: Intermediate
Type: Talk
Tags: Graphics Virtualization; Data Center & Cloud Computing

Day: Thursday, 04/07
Time: 10:00 - 10:25
Location: Marriott Salon 2

S6268 - GPU Accelerated Markov Decision Process in Crowd Simulation

Benjamin Hernandez Computer Scientist, Oak Ridge National Laboratory
Benjamin Hernandez is a member of the recently created Advanced Data and Workflows group at NCCS in Oak Ridge National Laboratory. His previous appointments include the Barcelona Supercomputing Center and Tecnológico de Monterrey, campus Ciudad de Mexico, where he was principal investigator of the CUDA Teaching Center. His research interests are in the intersection of HPC, GPUs, interactive computer graphics, HCI, crowd and traffic simulation, analysis, and visualization.
Sergio Ruiz Professor, Tecnológico de Monterrey
Sergio Ruiz is a full-time professor in the computer department at Tecnologico de Monterrey, Mexico City campus, where he received his Ph.D. in 2014. His research area is path planning for simulated crowds; his thesis is titled "A hybrid method for macro- and micro-simulation of crowd behavior." In 2013, he spent a research visit working on crowd simulation at the Barcelona Supercomputing Center in Spain.

Markov decision processes (MDPs) have been used in real-world path planning, where environment information is incomplete or dynamic. The problem with the MDP formalism is that its state space grows exponentially with the number of domain variables, and its inference methods grow with the number of available actions. To overcome this issue, we formulate an MDP solver in terms of matrix multiplications, based on the value iteration algorithm; thus we can take advantage of GPUs to interactively produce obstacle-free paths in the form of an optimal policy. We'll present a performance analysis of our technique on Jetson TK1, CPU, and GPU platforms. Our algorithm achieves a 90x speedup on GPUs and a 30x speedup on the Jetson TK1 compared with its multithreaded CPU version.
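
A minimal NumPy sketch of the formulation (the math, not the talk's CUDA implementation): the Bellman backup for all actions becomes one batched matrix product, which is what maps naturally onto GPU GEMM routines.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Value iteration expressed as dense matrix products.

    P: transition tensor, shape (n_actions, n_states, n_states).
    R: reward matrix, shape (n_actions, n_states).
    Writing the Bellman backup as matrix multiplications is what lets the
    solver map onto GPU GEMM routines; this sketch shows the math only.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * (P @ V)              # (n_actions, n_states): one backup per action
        V_new = Q.max(axis=0)                # greedy over actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)   # values and greedy policy
        V = V_new

# Tiny 2-state, 2-action toy problem (made-up numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
V, policy = value_iteration(P, R)
print(V, policy)
```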

Level: All
Type: Talk
Tags: Algorithms; Deep Learning & Artificial Intelligence

Day: Thursday, 04/07
Time: 10:00 - 10:25
Location: Marriott Salon 3

S6302 - Two-Level Parallelization and CPU-GPU Hybrid Large Scale Discrete Element Simulation

Ji Xu Associate Professor, Institute of Process Engineering (IPE), Chinese Academy of Sciences (CAS)
Ji Xu is an associate professor at the Institute of Process Engineering, Chinese Academy of Sciences, where he also received his Ph.D. in chemical engineering. His interests include the mechanisms of complex particle systems studied with discrete simulation methods, such as molecular dynamics and the discrete element method, and large-scale, high-performance algorithm design and software development for hybrid CPU-GPU computing.

Learn how to develop discrete element method algorithms to efficiently simulate large numbers of particles in complex-shaped systems across many GPUs through: (1) a two-level domain decomposition parallel algorithm for multiple GPUs; (2) faster particle-to-particle collision algorithms on a single GPU; and (3) overlapping communication and computation for efficient parallel computing. Scientific and industrial applications are given for single-phase (particle-only) and multi-phase flow systems, such as granular flows, particle mixing, gas-solid fluidization, and liquid-solid flow in stirred tanks.

Level: Intermediate
Type: Talk
Tags: Computational Physics; Computational Fluid Dynamics; Supercomputing & HPC

Day: Thursday, 04/07
Time: 10:00 - 10:25
Location: Marriott Salon 6

S6319 - Revealing Chemical Reactions in Complex Systems with GPU-Enabled ReaxFF Molecular Dynamics

Xiaoxia Li Professor, Institute of Process Engineering, Chinese Academy of Sciences
Xiaoxia Li received a B.E. from the Department of Chemical Engineering, Tsinghua University, Beijing, China, and an M.S. for the design and development of an organic compound physical property data system from the Institute of Chemical Metallurgy, Chinese Academy of Sciences, Beijing, China. Currently a professor at the State Key Laboratory of Multiphase Complex Systems, Institute of Process Engineering, Chinese Academy of Sciences, Li serves as vice chairman of the Committee of Computer Chemistry, Chinese Chemical Society; executive member of the Chinese National Committee for CODATA; trustee of the Chemical Structure Association Trust; and project director of the Asian Chemical Information Network, Federation of Asian Chemical Societies (FACS).

This talk presents the first GPU-enabled code for ReaxFF MD, GMD-Reax, and its applications in simulating large-scale reactive molecular systems for challenging problems in energy applications, including reaction mechanism investigation of coal and biomass pyrolysis and the combustion of jet fuels. GMD-Reax allows for efficient simulations of large models of ~10,000 atoms. Combined with VARxMD, the first code we created for ReaxFF MD reaction analysis, the coal pyrolysis simulations can predict the overall spectral evolution of products and uncover important reaction pathways and radical behavior. What we obtained in simulations of coal and biomass pyrolysis and fuel combustion is hardly accessible experimentally or by other computational approaches.

Level: Intermediate
Type: Talk
Tags: Computational Chemistry; Algorithms

Day: Thursday, 04/07
Time: 10:00 - 10:25
Location: Marriott Salon 5

S6329 - Petascale Computational Fluid Dynamics with Python on GPUs

Freddie Witherden Postdoctoral Research Assistant, Imperial College London
Freddie Witherden is a postdoctoral scholar in the Department of Aeronautics at Imperial College London. He obtained his Ph.D. in high-order methods for GPU-accelerated computational fluid dynamics in 2015 under the supervision of Dr. Peter Vincent. Between 2008-2012, Freddie studied physics with theoretical physics at Imperial College London, earning an M.S. with first-class honors. Outside of work, Freddie has a keen interest in helping academics track their engagement with the mass-media. In 2012, Freddie co-founded the news analytics start-up Newsflo, where he served as chief technology officer. Newsflo was acquired in January 2015 by the academic publisher Elsevier, which has since gone on to employ the core technology in a range of products.

Discover how Python and in-situ visualization are being used to enable petascale computational fluid dynamics simulations of flow over real-world geometries. We'll (1) introduce PyFR, an open-source Python framework for solving the compressible Navier-Stokes equations on unstructured grids, (2) describe how PyFR leverages the capabilities of NVIDIA GPUs to obtain in excess of 50% of peak FLOP/s at scale, (3) outline the challenges, both technical and logistical, faced when scaling such a code to thousands of GPUs, and (4) show how in-situ visualization can be used to remediate many of these issues. Examples of high-fidelity unsteady flow simulations enabled by PyFR and these approaches will be showcased throughout.

Level: Intermediate
Type: Talk
Tags: Computational Fluid Dynamics; Supercomputing & HPC; In-Situ and Scientific Visualization

Day: Thursday, 04/07
Time: 10:00 - 10:25
Location: Marriott Salon 1

S6346 - Easy and High Performance GPU Programming for Java Programmers

Kazuaki Ishizaki Research Staff Member, IBM Research - Tokyo
Kazuaki Ishizaki is a research staff member at IBM Research – Tokyo. He has over 20 years of experience conducting research and development on dynamic compilers for Java and other languages. He is an expert in compiler optimizations, runtime systems, and parallel processing. Recently, his research has focused on how system software can enable programmers to easily program GPUs to accelerate their workloads without increasing their burden in high-level languages and frameworks. Kazuaki is an ACM senior member.

Learn how to program NVIDIA GPUs on an IBM OpenPOWER platform using Java with the parallel streams API. We'll give an overview of how programmers write a pure Java parallel program with the parallel streams API and show how the Java program is compiled for GPUs at runtime with automatic application of data transfer optimizations, read-only cache exploitation, and loop transformations. Applying these optimizations does not require any annotation or directive.

Level: Beginner
Type: Talk
Tags: Programming Languages; Performance Optimization; OpenACC

Day: Thursday, 04/07
Time: 10:00 - 10:25
Location: Room 212B

S6361 - Attacking HIV with Petascale Molecular Dynamics Simulations on Titan and Blue Waters

James Phillips Senior Research Programmer, University of Illinois
Highly-Rated Speaker
James Phillips is a senior research programmer in the Theoretical and Computational Biophysics Group at the Beckman Institute for Advanced Science and Technology at the University of Illinois at Urbana-Champaign. He has a Ph.D. in physics from the University of Illinois. Since 1999, James has been the lead developer of the highly scalable parallel molecular dynamics program NAMD, for which he received a Gordon Bell Award in 2002. His research interests include improving the performance and accuracy of biomolecular simulations through parallelization, optimization, hardware acceleration, better algorithms, and new methods.

Come learn the opportunities and pitfalls of taking GPU computing to the petascale. The highly parallel molecular dynamics code NAMD is used on the GPU-accelerated Cray XK7 Blue Waters and ORNL Titan machines to perform petascale biomolecular simulations, including a 64 million-atom model of the HIV capsid. In 2007, NAMD was one of the first codes to run on a GPU cluster, and it is now being prepared for the 2017 ORNL Summit supercomputer, which will feature IBM POWER9 CPUs, NVIDIA Volta GPUs, and the NVLink CPU-GPU interconnect. We'll discuss the importance of CUDA and Kepler/Maxwell features in combining multicore host processors and GPUs in a legacy message-driven application, and the promise of remote graphics for improving productivity and accessibility in petascale biology.

Level: Intermediate
Type: Talk
Tags: Supercomputing & HPC; Computational Chemistry

Day: Thursday, 04/07
Time: 10:00 - 10:25
Location: Room 211A

S6426 - Production Intelligence: GPU-Databases for Predictive Maintenance and In-Line Controlling in Automobile Manufacturing

Peter Strohm Senior Program Manager R&D, Jedox AG
Peter Strohm works as a project manager in the R&D department at Jedox, with a focus on GPU databases, business intelligence, and big data analysis. He obtained his diploma in computer science from the University of Freiburg, Germany, in 2008. He then joined the Inline Processing Team at the Fraunhofer Institute for Physical Measurement Techniques IPM, Freiburg, as a software developer for parallel real-time applications. Since 2013, Peter has been with Jedox as a GPU developer and manager for research projects.

Learn how in-GPU-memory databases optimize complex manufacturing processes by enabling real-time data input into big datasets, in-line decision making, and predictive maintenance. Manufacturing processes today produce vast amounts of data, e.g., on the process itself, workpieces, machine sensor data, and parts delivered by external vendors. In the Production Intelligence project, our goal is to turn this unspecific data into "smart data" to gain better insight into the manufacturing process, e.g., to prevent machine shutdowns or decrease the number of rejected parts. We'll present our solutions for streaming input data vectors into big datasets, analyzing incoming data in real time, and predicting production or system errors with the help of deep learning algorithms.

Level: All
Type: Talk
Tags: Big Data Analytics; Deep Learning & Artificial Intelligence; Computer Vision & Machine Vision

Day: Thursday, 04/07
Time: 10:00 - 10:25
Location: Room 210F

S6549 - Real-Time Medical Imaging Using GPUs with a Non-Real-Time Operating System

Stefan Schneider Software Engineer, Siemens Healthcare
Stefan Schneider earned an associate engineering degree in computer systems and information in 2001 and an M.S. in computational engineering in 2006, and became a certified CUDA programmer in 2011. He joined Siemens Healthcare in 2001, where he was responsible for 3D reconstruction on GPUs and mobile C-arm devices. In 2010, he became responsible for the 2D image processing pipeline, implemented in CUDA, on many Siemens Healthcare devices.

Is it possible to perform real-time X-ray imaging on a standard PC, even in fraught situations? Learn how Siemens Healthcare managed to remove dedicated image processing hardware (like FPGAs and DSPs) from many of its products and introduced NVIDIA GPUs to design and implement the imaging chain, from the frame-grabbing board to the display, on out-of-the-box hardware. This "harmonized image chain" (which we call "harmonIC") runs on many modalities, like radiography, fluoroscopy, mammography, and mobile C-arm devices used in surgery, where reliability matters most. Additionally, scalability and extensibility are very important in medical imaging and will be covered in this presentation.

Level: All
Type: Talk
Tags: Medical Imaging; Video & Image Processing

Day: Thursday, 04/07
Time: 10:00 - 10:25
Location: Room LL21B

S6567 - Large Scale and Multi-Display Visualization

Rod Sterling Chief Engineer, JVCKenwood
Rod Sterling is chief engineer of JVC Technology Center, JVCKenwood USA Corporation, in Long Beach, Calif. He received his M.S. in electrical engineering and applied physics from the University of California, San Diego. He supports the efforts in ultra-high resolution displays, reference series and visualization series projectors and their applications, with focus on simulation, visualization, home theatre, stereoscopic displays, and high-dynamic range systems. He has over 31 years of experience in the display area and over 23 years in simulation and electronic/digital cinema. He is the author of over 24 journal articles, 12 patents, and two screen credits. He is an active member of SID, IEEE, SPIE, and SMPTE.

As datasets grow, visualization of the projected data is becoming more challenging, and higher pixel performance and utilization are being demanded. Users want more pixels, more bits per pixel, higher dynamic range per pixel, faster pixels, and more colorful pixels. We'll discuss the 8K and HDR LCOS projectors offered by JVC. Using NVIDIA drivers, we are able to utilize JVC e-shift technology to display 8192x4800 images. LCOS is a great technology for improving the spatial resolution of a projector, and it is also inherently suited to delivering very high native dynamic range. This presentation will discuss high-resolution e-shift technology and high dynamic range projectors.

Level: All
Type: Talk
Tags: Large Scale and Multi-Display Visualization

Day: Thursday, 04/07
Time: 10:00 - 10:25
Location: Room 210E

S6609 - Bring Your 3D Interior Design to Life in Breathtaking Realistic Rendering with HomeByMe and NVIDIA Iray®

Florian Lecoq 3DVIA Lead Developer, Dassault Systèmes
Florian Lecoq leads all realistic render options of the 3DVIA platform, for both the real-time engine (DirectX, OpenGL, WebGL) and the ray tracing (NVIDIA Iray IRT in beta), at Dassault Systèmes. He holds an engineering M.S. in computer science from Centrale Paris and a degree in game development from DDJV Montreal (formerly Campus Ubisoft), and completed an internship at Ubisoft R&D in character animation.

We'll explore HomeByMe, a free online 3D space-planning service that lets consumers imagine, design, and share 3D housing projects, and which includes the ability to render your interior realistically with the latest version of NVIDIA Iray.

Level: All
Type: Talk
Tags: Product & Building Design; Rendering & Ray Tracing

Day: Thursday, 04/07
Time: 10:00 - 10:50
Location: Room LL21A

S6697 - Optimizing Application Performance with CUDA® Profiling Tools

Swapna Matwankar Senior Engineer, NVIDIA
Swapna is a senior engineer on NVIDIA's developer tools team, focusing on performance analysis. Earlier, she worked on NVIDIA's graphics driver team. Before joining NVIDIA, Swapna worked in the embedded multimedia domain, where she built and optimized video codecs on ARM- and DSP-based platforms and integrated them into various multimedia frameworks.

This session provides a step-by-step walkthrough of new features added to the NVIDIA Visual Profiler and nvprof. It will show how these profiling tools can be used to identify optimization opportunities at the application, kernel, and source-line levels.

Level: Intermediate
Type: Talk
Tags: Tools & Libraries; Performance Optimization

Day: Thursday, 04/07
Time: 10:00 - 10:50
Location: Room 211B

S6736 - Sensor Processing at Sea: Remoteness Demands High Performance

James Gosling Chief Software Architect, Liquid Robotics
James Gosling received a B.S. in computer science from the University of Calgary, Canada, in 1977 and a Ph.D. in computer science from Carnegie Mellon University in 1983. The title of his thesis was "The Algebraic Manipulation of Constraints." He spent many years as a VP and Fellow at Sun Microsystems. He has built satellite data acquisition systems, a multiprocessor version of Unix, several compilers, mail systems, and window managers. He has also built a WYSIWYG text editor, a constraint-based drawing editor, and a text editor called Emacs for Unix systems. At Sun, his early activity was as lead engineer of the NeWS window system. He did the original design of the Java programming language and implemented its original compiler and virtual machine. He has been a contributor to the Real-Time Specification for Java and a researcher at Sun Labs, where his primary interest was software development tools. He was then the chief technology officer of Sun's Developer Products Group and the CTO of Sun's Client Software Group. He briefly worked for Oracle after its acquisition of Sun. After a year off, he spent some time at Google and is now the chief software architect at Liquid Robotics, where he writes software for the Wave Glider, an autonomous ocean-going robot.

We built an autonomous robot that roams the oceans carrying whatever sensors the customer provides. The ocean is an extreme environment: between salt water, hurricanes, communication challenges, and the need to be at sea for months at a time, the engineering has been an exciting challenge. Key to meeting that challenge has been having significant onboard computing resources. Multiple communication channels must be arbitrated among. Something is always failing and must be coped with. Autonomy can get complex, especially when collision avoidance kicks in. The machine is on its own, many, many miles from the nearest human. At the same time, it is part of an end-to-end application that includes significant cloud processing and fleet operations.

Level: All
Type: Talk
Tags: Robotics & Autonomous Machines; Aerospace & Defense; IoT

Day: Thursday, 04/07
Time: 10:00 - 10:50
Location: Room LL20D

S6113 - BLAZE-DEM: A Discrete Element Simulation Framework for NVIDIA GPUs

Nicolin Govender Senior Research Scientist, Center for High Performance Computing (CSIR)
Nicolin Govender is associated with several institutions. At the University of Johannesburg, he is a member of the ATLAS Collaboration at CERN, and from this base he runs research projects on computing in a high-energy physics environment, spanning computing for data acquisition as well as for analysis. He has also worked on the modeling of nuclear reactors, the area in which he obtained his M.S. with distinction, in a project involving collaboration between the University of Johannesburg and the South African Nuclear Energy Corporation (Necsa). He has a Ph.D. in computational mechanics from the University of Pretoria and has done postdoctoral work at the University of Utah and the École des Mines in France.

Understanding the dynamical behavior of particulate materials is important to many industrial processes, with applications that range from hopper flows in agriculture to tumbling mills in the mining industry. The discrete element method (DEM) has become the de facto standard for simulating particulate materials. DEM is a computationally intensive numerical approach that is typically limited to hundreds of thousands of particles, and the computational architecture plays a significant role in the performance that can be realized. The parallel nature of the GPU allows a large number of independent processes to be executed concurrently, resulting in a significant speedup over conventional implementations utilizing the CPU. In this talk, we present the GPU-based large-scale DEM code BLAZE-DEM.

Level: Beginner
Type: Talk
Tags: Computational Physics; Astronomy & Astrophysics; Computational Fluid Dynamics

Day: Thursday, 04/07
Time: 10:30 - 10:55
Location: Marriott Salon 6

S6118 - In-Place Computing on PostgreSQL: SQL as a Shortcut of GPGPU

Kohei KaiGai Lead of PG-Strom Project, NEC
Kohei KaiGai is the lead of the PG-Strom project. He has 10 years of experience in open source software development on Linux, PostgreSQL, Apache httpd, and more. He has contributed core functionality to PostgreSQL, including SELinux integration and security labels, security barrier views, writable FDW and remote join capability, and the custom-scan/join interface. He currently focuses on integrating GPU acceleration with PostgreSQL to utilize the power of recent advances in semiconductors.

Near-data computing is a recent technology trend: the cost of data movement is never negligible, so people are inclined to run their tasks at the location of the data (e.g., Hadoop). Our PG-Strom technology transparently offloads CPU-intensive SQL workloads to GPU devices using an automatic SQL-to-CUDA code generator. It enables users to describe their mathematical or statistical algorithms in SQL and run that logic very close to the data managed by the PostgreSQL database. Usually, users have to export an entire dataset before they can process it; integrating GPU computing power within the SQL database eliminates these tasks and allows researchers to focus on what they really want to dive into.
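
Because the offload happens inside the PostgreSQL planner, client code stays plain SQL. A hedged sketch using psycopg2 follows; the connection string and the `measurements` table are hypothetical placeholders, and whether a given query actually runs on the GPU depends on PG-Strom's plan selection.

```python
import psycopg2

# Client code against a PG-Strom-enabled PostgreSQL is ordinary SQL; the
# scan/aggregate below may be offloaded to the GPU transparently by the
# planner. Connection parameters and the `measurements` table are
# hypothetical placeholders for this sketch.
conn = psycopg2.connect("dbname=test")
cur = conn.cursor()
cur.execute("""
    SELECT device_id, AVG(value), VARIANCE(value)
    FROM measurements
    WHERE value BETWEEN %s AND %s
    GROUP BY device_id
""", (0.0, 100.0))
for row in cur.fetchall():
    print(row)
cur.close()
conn.close()
```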

Level: Intermediate
Type: Talk
Tags: Big Data Analytics

Day: Thursday, 04/07
Time: 10:30 - 10:55
Location: Room 210F

S6134 - High Performance and Productivity with Unified Memory and OpenACC: A LBM Case Study

Jiri Kraus Compute Devtech Software Engineer, NVIDIA
Highly-Rated Speaker
Jiri Kraus is a senior developer in NVIDIA's European Developer Technology team. As a consultant for GPU HPC applications at the NVIDIA Julich Applications Lab, Jiri collaborates with local developers and scientists at the Julich Supercomputing Centre and the Forschungszentrum Julich. Before joining NVIDIA, he worked on the parallelization and optimization of scientific and technical applications for clusters of multicore CPUs and GPUs at Fraunhofer SCAI in St. Augustin. He holds a diploma in mathematics from the University of Cologne, Germany.

Learn how to use unified memory to improve your productivity in accelerating applications with OpenACC. Using a Lattice Boltzmann CFD solver as an example, we'll explain how a profile-driven approach allows one to incrementally accelerate an application with OpenACC and unified memory. Besides the productivity gain, a primary advantage of this approach is that it is very accessible, including to developers who are new to a project and therefore not familiar with the whole code base.

Level: Intermediate
Type: Talk
Tags: Computational Fluid Dynamics; Supercomputing & HPC; Tools & Libraries; OpenACC; Aerospace & Defense

Day: Thursday, 04/07
Time: 10:30 - 10:55
Location: Marriott Salon 1

S6149 - Understanding the Business Impact the NVIDIA vGPU Technology is making to Citrix XenDesktop UX

Jaymes Davis VP of Innovation, Entisys
Jaymes is one of the industry's top consultants, with over 600 client projects ranging from global enterprise companies of 20,000+ employees to medium-sized regional businesses to small businesses with 5-25 employees. His white papers and articles have been published and recognized by the industry's top virtualization leaders. His projects have ranged from designing and delivering multiple global rollouts of application front ends to assessing business and technical requirements for implementing virtual infrastructures that address disaster recovery. His broad range of experience has helped clients increase their IT process efficiency, address business process needs, and lower total cost of ownership.

A "business first" approach revolves around understanding the value created by technical solutions and asking the questions about how changes can affect the foundation or long term strategy of the organization. As well the quality of the user experience can have very tangible business impact. This session focuses on how to build a User Index for Graphic Virtualization Adoption in an enterprise when looking at Virtual desktops, In order to deliver a solution that fits your users 'needs, it is essential that you keep the focus on user experience paramount.

Level: Advanced
Type: Talk
Tags: Graphics Virtualization

Day: Thursday, 04/07
Time: 10:30 - 10:55
Location: Marriott Salon 2

S6180 - Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures

Adam McLaughlin PhD Candidate, Georgia Institute of Technology
Adam is a PhD candidate in Electrical and Computer Engineering at the Georgia Institute of Technology. He works in the High Performance Computing (HPC) lab as a graduate research assistant for Professor David Bader. In the past he has interned with D.E. Shaw Research, NVIDIA, AMD, and Los Alamos National Laboratory (LANL). His current research focuses on utilizing Graphics Processing Units (GPUs) for fast parallel execution of algorithms that traverse unstructured network data sets such as crawls of the internet or the social structure of Facebook. He received his BSc in Computer Systems Engineering from Boston University in 2011.

Contemporary microprocessors use relaxed memory consistency models to allow for aggressive optimizations in hardware. This enhancement in performance comes at the cost of design complexity and verification effort. In particular, verifying an execution of a program against its system's memory consistency model is an NP-complete problem. This session improves upon existing work by introducing an algorithm that not only reduces the time complexity of the verification process, but also facilitates the development of parallel algorithms for solving these problems. For large tests of interest, our GPU implementation achieves an average application speedup of 26x over existing techniques in use at NVIDIA.

Level: Advanced
Type: Talk
Tags: Algorithms; Big Data Analytics; Tools & Libraries

Day: Thursday, 04/07
Time: 10:30 - 10:55
Location: Marriott Salon 3

S6264 - State of GPUDirect Technologies

Davide Rossetti Software Engineer, NVIDIA
Davide Rossetti has a degree in theoretical physics from Sapienza Rome University and is a senior engineer at NVIDIA. His research focuses on high performance computing (parallel computing, high-speed networking architectures, numerical simulations), while his interests span computer graphics, operating systems, I/O technologies, GPGPU, embedded systems, and real-time systems.

GPUDirect is a family of technologies enabling GPUs to interoperate directly with third-party devices. We'll describe the new capabilities introduced in CUDA 7.5 and later, walking the audience through the different technical skills involved. We'll also provide the latest benchmarking results on recent hardware platforms.

Level: Advanced
Type: Talk
Tags: Supercomputing & HPC; Tools & Libraries

Day: Thursday, 04/07
Time: 10:30 - 10:55
Location: Room 211A

S6265 - A CUDA®-Based 3D Kinetic Model for Space Plasma Physics

Shahab Fatemi Ph.D. Fellow in Space Science, University of California, Berkeley
Dr. Shahab Fatemi specializes in high performance scientific computing and modeling of plasma interactions with airless bodies, including our Moon, the moons of Mars, and the moons of the outer planets. He is interested in parallel algorithm development and in applying it to solve fundamental plasma physics problems. Over the past decade, Shahab has written and utilized several kinetic plasma models, including hybrid, particle-in-cell, and Monte Carlo codes. He is now actively involved in applying GPGPU computing to problems in space plasma physics. He has been a postdoctoral fellow at the University of California at Berkeley since October 2015.
Andrew R. Poppe Assistant Research Scientist, UC Berkeley
Dr. Andrew Poppe specializes in modeling and data analysis of plasma interactions with airless bodies, including our Moon, the moons of Mars, and the moons of the outer planets. He is interested in the fundamental plasma physics inherent in such interactions, as well as implications for planetary exospheric generation and surface weathering. Over the past decade, Andrew has written and utilized several kinetic plasma models, including Monte Carlo, particle-in-cell, and hybrid codes. He is now actively involved in applying GPGPU computing to problems in space plasma physics. He has been an assistant research scientist in the Space Sciences Lab at the University of California at Berkeley since 2013.

We've developed the first three-dimensional, self-consistent kinetic plasma model that runs on NVIDIA GPUs using CUDA. The model self-consistently solves the charged particles' motion and their associated electromagnetic fields. We use this model to explore the microphysics of plasma interactions with solar system objects, to understand fundamental kinetic processes of plasma, and to meet NASA's requirements for planetary and space exploration.
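
The particle-advance step at the heart of any such kinetic model is typically a Boris-type integrator. Below is a generic NumPy sketch of the standard Boris push, shown for illustration only; it is not the authors' CUDA code.

```python
import numpy as np

def boris_push(x, v, E, B, q_over_m, dt):
    """Advance particles one step with the standard Boris rotation.

    x, v: (n, 3) particle positions and velocities; E, B: (n, 3) fields
    sampled at the particles. This is the textbook mover used by most
    kinetic plasma codes, shown generically -- not the authors' CUDA code.
    """
    half = 0.5 * q_over_m * dt
    v_minus = v + half * E                              # first half electric kick
    t = half * B                                        # rotation vector
    s = 2 * t / (1 + np.sum(t * t, axis=1, keepdims=True))
    v_prime = v_minus + np.cross(v_minus, t)
    v_plus = v_minus + np.cross(v_prime, s)             # magnetic rotation
    v_new = v_plus + half * E                           # second half kick
    return x + v_new * dt, v_new

# Single electron gyrating in a uniform magnetic field (normalized units).
x = np.zeros((1, 3)); v = np.array([[1.0, 0.0, 0.0]])
E = np.zeros((1, 3)); B = np.array([[0.0, 0.0, 1.0]])
for _ in range(100):
    x, v = boris_push(x, v, E, B, q_over_m=-1.0, dt=0.01)
print(np.linalg.norm(v))   # the rotation conserves speed (~1.0)
```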

Level: All
Type: Talk
Tags: Astronomy & Astrophysics; Computational Physics; Algorithms

Day: Thursday, 04/07
Time: 10:30 - 10:55
Location: Room 212A

S6278 - Speed Without Compromise: Precision and Methodology Innovation in the AMBER GPU MD Software

Ross Walker Associate Professor, University of California San Diego
Ross Walker is an associate research professor at the San Diego Supercomputer Center, an adjunct associate professor in the Department of Chemistry and Biochemistry at the University of California, San Diego, and an NVIDIA Fellow. He runs the Walker Molecular Dynamics Lab in San Diego, where he leads a team that develops advanced techniques for molecular dynamics simulations supporting work aimed at improved drug and biocatalyst design. His work includes leading the development of new force fields for simulation of lipid membranes, improved quantum mechanical/molecular mechanical models, automated force field parameter refinement techniques, and the development of the world's fastest GPU-accelerated molecular dynamics software released as the AMBER Molecular Dynamics engine PMEMD.

The AMBER molecular dynamics (MD) package is one of the fastest MD packages on commodity hardware and was one of the first widely used packages to exploit GPUs. We'll discuss the history of AMBER on NVIDIA GPUs and then highlight some of the newest advances in MD simulation that feature in the latest version 16 of AMBER. This includes extremely high-throughput thermodynamic integration free energy methods, explicit solvent constant pH simulations, advanced umbrella sampling restraints, multi-dimensional replica exchange methods, and asymmetric boundary conditions. We'll also discuss the development and validation of our latest precision model, SPXP, which is focused on maximizing the performance achievable from Maxwell-generation hardware without sacrificing accuracy.

Level: All
Type: Talk
Tags: Computational Chemistry; Supercomputing & HPC

Day: Thursday, 04/07
Time: 10:30 - 10:55
Location: Marriott Salon 5

S6299 - Looking at Ultrasound Processing on Low-Power GPUs

Anne C Elster Visiting Fellow, University of Texas at Austin
Anne Elster is an associate professor of high performance computing at the Department of Computer & Information Science at the Norwegian University of Science and Technology (NTNU) in Trondheim, Norway, and a research affiliate at ICES/University of Texas at Austin. Anne received her Ph.D. from Cornell University in 1994. Prior to joining NTNU in 2001, she worked for Schlumberger's Austin center (1994-1997) and The University of Texas at Austin, and founded Acenor Inc. in 2000. Her research interests are in parallel and heterogeneous computing, including GPU computing, and tools for program optimization. She is a member of ACM, AGU (life member), IEEE(senior member), SIAM, as well as an original member of the MPI Forum. She is a CUDA book author, and has so far advised several postdocs, Ph.D. students and over 50 M.S. students. She is one of the main principal investigators for the EU H2020 Cloud Lightning project.

HPC-Lab at NTNU has been looking at processing medical imaging algorithms on GPUs since 2007, when one of my students did the wavelet transform in Cg on an early GeForce GTX card. With the introduction of the NVIDIA Tegra TK1, my graduate student Bjorn Tungesvik and I decided to benchmark the device and see how suitable it could be for real-time image processing. We'll show the impressive results we achieved for benchmarks, including Julia set fractals, vector multiplication, and N-body simulations. In collaboration with Professor Bjorn Angelsen's group, we'll also show how it is possible to create a filter based on the simulated Westervelt equation to achieve synthetic dynamic focusing by transversal filtering in real-time using an embedded GPU.

Level: Intermediate
Type: Talk
Tags: Medical Imaging; Signal & Audio Processing; Embedded

Day: Thursday, 04/07
Time: 10:30 - 10:55
Location: Room LL21B

S6561 - Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding

Song Han PhD student, Stanford University
Song Han is a fourth-year Ph.D. student with Professor Bill Dally at Stanford University. His research interests are computer architecture, deep learning, and computer vision, with a focus on improving the efficiency of neural networks so they fit on mobile devices. Before joining Stanford, Song Han did his undergraduate studies at Tsinghua University, Beijing.

Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce "Deep Compression," which reduces the number of connections in deep neural networks by an order of magnitude and the total size of the networks by 35-49x without affecting their accuracy. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35x, from 240MB to 6.9MB, without loss of accuracy, and reduced the size of VGG-16 by 49x, from more than half a gigabyte to 11.3MB, again with no loss of accuracy. This allows the model to fit into on-chip SRAM cache rather than off-chip DRAM memory, and makes it possible to fit DNNs into mobile apps given their size limits.
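
A minimal NumPy sketch of the first two stages, with arbitrary layer size and pruning ratio, may help fix the idea: magnitude pruning followed by k-means weight sharing (the Huffman-coding stage is omitted).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)   # toy weight matrix

# Stage 1 -- pruning: zero out small-magnitude weights, keep a sparse mask.
threshold = np.quantile(np.abs(W), 0.9)                # prune 90% of connections
mask = np.abs(W) >= threshold
W_pruned = W * mask

# Stage 2 -- trained quantization / weight sharing: cluster the surviving
# weights with k-means so each is replaced by one of 2^bits centroids.
bits = 4
survivors = W_pruned[mask]
centroids = np.linspace(survivors.min(), survivors.max(), 2 ** bits)
for _ in range(10):                                    # plain k-means iterations
    assign = np.argmin(np.abs(survivors[:, None] - centroids[None, :]), axis=1)
    for k in range(len(centroids)):
        if np.any(assign == k):
            centroids[k] = survivors[assign == k].mean()

# Stored form: 4-bit indices plus a tiny codebook instead of float weights.
W_shared = np.zeros_like(W_pruned)
W_shared[mask] = centroids[assign]
print("nonzeros: %d / %d, codebook size: %d" % (mask.sum(), W.size, len(centroids)))
```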

Level: Intermediate
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Computer Vision & Machine Vision

Day: Thursday, 04/07
Time: 10:30 - 10:55
Location: Room 210D

S6673 - Optimizing Deep Recurrent Neural Networks With Persistent CUDA Kernels

Gregory Diamos Senior Researcher, Baidu
Gregory Diamos is a Senior Researcher at Baidu's Silicon Valley AI Lab under the direction of Professor Andrew Ng, where he is exploring the design of deep neural network algorithms and their mapping onto high performance computing systems. Before joining Baidu, he contributed to the development of new compilation, processor architecture, and systems software technologies for the Pascal and Volta GPUs at NVIDIA.

Learn a new technique for mapping Deep Recurrent Neural Networks (RNN) efficiently onto GPUs. We show how it is possible to achieve substantially higher computational throughput at low mini-batch sizes than direct implementations of RNNs based on matrix multiplications. The key to our approach is the use of persistent computational kernels that exploit the GPU's inverted memory hierarchy to reuse network weights over time, and communication between thread blocks using a deadlock-free global barrier. Our initial implementation sustains 3 TFLOP/s at a mini-batch size of 4 on an NVIDIA TitanX GPU. This provides a 16x reduction in activation memory footprint, allows us to train models with 12x more parameters on the same hardware, and allows us to strongly scale RNN training to 32 GPUs.
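
The bandwidth arithmetic behind this is easy to sketch. With a small mini-batch, a conventional implementation re-reads the entire recurrent weight matrix from DRAM at every time step; the back-of-envelope below, using assumed layer size and TitanX-class throughput figures, shows why that regime is memory-bound and why keeping weights resident in registers pays off.

```python
# Back-of-envelope arithmetic (hypothetical sizes) for why small-batch RNNs
# are memory-bound without persistence: each time step re-reads the whole
# recurrent weight matrix for very little compute.
hidden, batch, dtype_bytes = 1152, 4, 4
flops_per_step = 2 * hidden * hidden * batch          # one recurrent GEMM
weight_bytes = hidden * hidden * dtype_bytes          # re-read every step

# Rough TitanX-class throughput figures (assumed): ~6e12 FLOP/s, ~3.3e11 B/s.
compute_time = flops_per_step / 6e12
memory_time = weight_bytes / 3.3e11
print("arithmetic intensity: %.1f FLOP/byte" % (flops_per_step / weight_bytes))
print("memory-bound by: %.0fx" % (memory_time / compute_time))
```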

Level: Advanced
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Supercomputing & HPC; Performance Optimization

Day: Thursday, 04/07
Time: 10:30 - 10:55
Location: Room 210G

S6709 - Write Once, Parallel Everywhere: OpenACC for GPUs, x86, OpenPOWER, and Beyond

Michael Wolfe Compiler Engineer, NVIDIA
Highly-Rated Speaker
Michael Wolfe is a PGI compiler engineer at NVIDIA. He has over 35 years of experience developing optimizing compilers for parallel computing.

Performance portability means the ability to write a single program that runs with high performance across a wide range of target systems, including multicore systems, GPU-accelerated systems, and manycore systems, independent of the instruction set. It's not a "myth" or a "dream," as has been claimed recently. It should be demanded by developers and expected from any modern high level parallel programming language. OpenACC was designed five years ago with broad cross-platform performance portability in mind. The current PGI compiler suite delivers on this promise. Come hear about the current capabilities and performance of PGI OpenACC on GPUs, x86 and OpenPOWER, and learn about our plans for new features and even wider platform support.

Level: All
Type: Talk
Tags: Programming Languages; OpenACC; Supercomputing & HPC

Day: Thursday, 04/07
Time: 10:30 - 10:55
Location: Room 212B

S6285 - Bridging the Productivity Gap Between 3D Printing and Subtractive Manufacturing

Mohammad Hossain Graduate Student, Georgia Tech
Mohammad M Hossain is a Ph.D. candidate in computer science at the Georgia Institute of Technology in Atlanta, advised by Professor Richard Vuduc. Mohammad's research focuses on storage-efficient hybrid data structure design for processing extreme-resolution volumes on GPUs. He is developing a prototype GPU-accelerated offsetting algorithm for free-form solids represented in a sparse data representation. The far-reaching goal of his work is a GPU-centric volume processing platform, intertwined with core computer-aided design and manufacturing (CAD/CAM) applications, that enables subtractive manufacturing with the programming ease of 3D printing. Mohammad completed his M.S. in computer science at the Georgia Institute of Technology, and his B.S. in computer science and engineering at the Bangladesh University of Engineering and Technology.

GPU-based software can enable advanced computer-aided manufacturing methods. Such methods are inspired by 3D printing, which is easy to program but limited in the materials that can be used, the finishing quality that can be achieved, and the printing speeds that can be attained. By contrast, CNC milling can address these limitations; the price is the high computational cost of generating collision-free tool paths in extreme-resolution volume models. Our work investigates whether a GPU-based platform can reduce this processing cost. We prototype a GPU-accelerated offsetting algorithm for freeform solids, represented in a sparse data structure, that beats the prior best-performing multicore CPU algorithms by 16x, and we demonstrate it on a real milling machine.

Level: All
Type: Talk
Tags: Product & Building Design

Day: Thursday, 04/07
Time: 13:00 - 13:50
Location: Room LL21A

S6740 - GPU Powered Solutions in the Second Kaggle Data Science Bowl

Jon Barker Solution Architect, NVIDIA
Jon Barker joined NVIDIA in May 2015 as a solution architect. Since then, he has been helping customers design, implement, and optimize a variety of GPU-accelerated deep learning applications, and has provided internal and external deep learning training. Jon is particularly focused on the application of deep learning to problems in defense and national security. He graduated from the University of Southampton in the U.K. in 2007 with a Ph.D. in mathematics. Prior to joining NVIDIA, Jon worked for the UK Ministry of Defence and spent four years on secondment to the US Department of Defense, where he was a research scientist focused on data analytics and machine learning for multi-sensor data streams. To keep learning new data science skills, Jon has been a long-time competitor on Kaggle.

The second annual Data Science Bowl was an online data science contest hosted on the Kaggle platform in early 2016. The objective was to develop an algorithm that could accurately estimate the volume of the left ventricle of a human heart, at the points of maximum and minimum volume, from a time series of cross-sectional magnetic resonance imaging (MRI) images of the heart. The contest provided thousands of MRI images for training. The challenge was a natural fit for GPU-accelerated deep learning. We'll hear some of the winning teams describe their approaches, discuss the complexities of working with sometimes messy clinical data, and hear how deep learning can be applied to a time series of 3D images.

Level: Beginner
Type: Talk
Tags: Medical Imaging; Deep Learning & Artificial Intelligence; Video & Image Processing

Day: Thursday, 04/07
Time: 13:00 - 13:50
Location: Room LL21B

S6136 - High-Performance Deep Neural Net Inference on the GPU

Michael Andersch GPU Architect, NVIDIA
Michael Andersch is a GPU architect at NVIDIA who helps define NVIDIA's next-generation GPU architectures to improve performance and programmability for CUDA and OpenCL applications. GPUs have been Michael's love for a while: at age 16, he had a printed block diagram of the architecture of G80, the GPU driving the legendary GeForce 8800 GTX, on his wall. During his graduate studies at TU Berlin, Michael worked on better cycle-accurate hardware simulators for estimating GPU performance and power, and on the performance analysis of latency-limited GPU programs.

We'll discuss, analyze, and improve the performance of deep neural network inference on GPUs. Unlike neural net training, an offline process in which large batches of images are fed to the GPU to maximize computational throughput, inference focuses on small-batch, low-latency forward propagation through the network. We'll discuss how the different performance requirements for inference affect the way we implement it on GPUs and what performance optimizations are possible, and we'll show how GPUs, from the small Tegra X1 all the way to the powerful TITAN X, excel in performance and energy efficiency when performing deep neural network inference.

Level: Advanced
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Self-Driving Cars & Automotive ; Performance Optimization

Day: Thursday, 04/07
Time: 14:00 - 14:25
Location: Room 210G

S6241 - An Optimized Solver for Unsteady Transonic Aerodynamics and Aeroacoustics Around Wing Profiles

Jean-Marie Le Gouez Research scientist, ONERA
Jean-Marie Le Gouez is with the CFD department at ONERA, the French aeronautics research institute. Jean-Marie graduated from Ecole Polytechnique in Paris in 1977 and received a M.S. in mechanical engineering from Stanford University in 1978. He worked for 11 years at the CEA (French Nuclear Research Energy Institute) in thermal hydraulics of sodium-cooled reactors and safety studies. Jean-Marie joined the independent research company PRINCIPIA in 1990, where he led the CFD department, developing software for unsteady incompressible, free surface flows and fluid-structure interaction, with contracts in naval, car, and satellite industries. In 2007, he joined ONERA, where he was head of the CFD and Aeroacoustics department until 2014. Then he joined the research unit on novel CFD software architectures in this department.

Fluid dynamics software can be extensively optimized for HPC on GPU clusters, in a variety of respects, within the GPUDirect/C/CUDA/Thrust programming paradigms. In particular, our algorithm could be made more modular to adapt to the CUDA register usage limit, the Thrust libraries provide highly efficient solutions for computations on global memory, and warp collaboration through shared memory proves crucial. Large eddy simulation, based on basic principles of fluid mechanics, complements Reynolds-averaged Navier-Stokes models, which lack versatility, despite its very high computing requirements. For aeronautical flows around wing profiles in steady or off-design configurations, our solver delivers efficient solutions on clusters of 128 Tesla GPUs for suitably refined 2-billion-cell grids.
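
As a small, hypothetical example of the kind of global-memory operation Thrust handles well (not the solver's actual code), an L2 residual norm over a large cell array reduces to one library call:

    #include <thrust/device_vector.h>
    #include <thrust/transform_reduce.h>
    #include <thrust/functional.h>
    #include <cmath>

    // One-call global-memory reduction: the L2 norm of a residual field
    // living in device memory.
    struct square
    {
        __host__ __device__ float operator()(float x) const { return x * x; }
    };

    float residual_norm(const thrust::device_vector<float> &r)
    {
        float ss = thrust::transform_reduce(r.begin(), r.end(), square(),
                                            0.0f, thrust::plus<float>());
        return sqrtf(ss);
    }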

Level: Intermediate
Type: Talk
Tags: Computational Fluid Dynamics; Supercomputing & HPC; In-Situ and Scientific Visualization; Aerospace & Defense

Day: Thursday, 04/07
Time: 14:00 - 14:25
Location: Marriott Salon 1

S6257 - Kokkos Implementation of Albany: Towards Performance Portable Finite Element Code

Irina Demeshko Postdoctoral Appointee, Sandia National Laboratories
Irina Demeshko works as a postdoctoral appointee at the Sandia National Laboratories. Irina got her Ph.D. from the Department of Mathematical and Computing Sciences, Tokyo Institute of Technology. Her research work focused on adapting NICAM climate simulation and weather prediction code for NVIDIA GPUs. Her research interests lie in adapting scientific simulation codes for high performance computing (HPC) systems. Her efforts center around creating a performance portable implementation of the Albany Finite Element-based application development environment and on porting HOMME atmospheric dynamical core code from CESM to hybrid HPC systems.

As HPC architectures continuously change, the effort needed to develop, port, and tune software has increased, making performance portability an important issue for scientific codes. We'll present preliminary results from developing a performance-portable implementation of the finite element-based Albany code. The implementation is based on the Kokkos programming model from Trilinos, which uses a library approach to provide performance portability across diverse devices with different memory models. Evaluation experiments show good performance for a single implementation across different multicore/manycore architectures: NVIDIA GPUs, multicore CPUs, Intel Xeon Phi, and POWER.
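
For readers unfamiliar with the model, here is a minimal Kokkos example (our illustration, not Albany code): the same parallel_for compiles unchanged for the CUDA, OpenMP, or Pthreads back-ends, with the View's memory placed wherever the back-end requires.

    #include <Kokkos_Core.hpp>

    // Minimal flavor of the Kokkos library approach: one loop body,
    // many back-ends.
    int main(int argc, char *argv[])
    {
        Kokkos::initialize(argc, argv);
        {
            const int N = 1 << 20;
            Kokkos::View<double *> x("x", N);   // device- or host-resident array
            Kokkos::parallel_for("fill", N, KOKKOS_LAMBDA(const int i) {
                x(i) = 2.0 * i;                 // same body on every back-end
            });
            Kokkos::fence();
        }
        Kokkos::finalize();
        return 0;
    }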

Level: All
Type: Talk
Tags: Supercomputing & HPC; Algorithms

Day: Thursday, 04/07
Time: 14:00 - 14:25
Location: Room 211A

S6340 - Anomaly Detection and Categorisation Using Unsupervised Deep Learning

Stephen McGough Lecturer, Durham University
Stephen McGough is a lecturer in computing sciences at Durham University, U.K. He obtained his Ph.D. in the area of parallel simulation and has worked for many years in the areas of parallel computing and simulation, which has led to over 50 publications, including the NVIDIA best paper award at HiPC 2012. His research focuses on the use of novel computing technologies to solve real-world challenges.

The potential information buried within datasets is immense -- though extracting this information is difficult when the data is large, noisy, unlabeled, and unstructured. We present the use of GPGPU-powered unsupervised deep learning to identify the anomalies within such datasets. Analysis of these anomalies can be performed to determine which are "pertinent" and which are "benign." Once the significance of an anomaly has been determined, this then becomes a label, which is added to the data. Repeating this process will lead to unlabeled data becoming labeled. This newly labeled data can be used to train a supervised deep learning system to identify new instances of that stereotype. We demonstrate how GPGPUs can be used to enable real-time anomaly detection and stereotyping.

Level: All
Type: Talk
Tags: Big Data Analytics; Deep Learning & Artificial Intelligence

Day: Thursday, 04/07
Time: 14:00 - 14:25
Location: Room 210F

S6429 - Transparent Checkpoint and Restart Technology for CUDA® Applications

Akira Nukada Associate Professor, Tokyo Institute of Technology
Akira Nukada is a researcher at Tokyo Institute of Technology. He received his Ph.D. from the University of Tokyo. His research interests include high performance computing, and he has developed several high-performance FFT libraries for CPUs and GPUs.

Checkpoint and restart technology is useful for fault tolerance as well as for aggressive job management on large systems. System-level checkpointing minimizes the effort application developers must invest to use it; however, CUDA applications are incompatible with major checkpoint implementations such as BLCR. In this talk, we'll present our CUDA-capable checkpoint library, CRCUDA, and our ongoing work to reduce its performance overhead.
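
Transparent CUDA checkpointing libraries generally work by interposing on the runtime API and recording enough state to replay it after restart. A toy sketch of that interception idea follows; it assumes an LD_PRELOAD shim and a dynamically linked libcudart, and is not CRCUDA's actual mechanism.

    // Toy LD_PRELOAD interposer for cudaMalloc (illustrative only; CRCUDA's
    // record/replay machinery is far more complete). Build:
    //   g++ -shared -fPIC shim.cpp -ldl -o shim.so
    // then run the application with LD_PRELOAD=./shim.so.
    #include <dlfcn.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    extern "C" cudaError_t cudaMalloc(void **ptr, size_t size)
    {
        typedef cudaError_t (*real_fn)(void **, size_t);
        static real_fn real = (real_fn)dlsym(RTLD_NEXT, "cudaMalloc");
        cudaError_t err = real(ptr, size);
        if (err == cudaSuccess)                       // record for later replay
            fprintf(stderr, "log: cudaMalloc %p %zu\n", *ptr, size);
        return err;
    }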

Level: Advanced
Type: Talk
Tags: Tools & Libraries

Day: Thursday, 04/07
Time: 14:00 - 14:25
Location: Room 211B

S6464 - Quotient Filters: Approximate Membership Queries on the GPU

Afton Geil Ph.D. Student, UC Davis
Afton Geil is a Ph.D. student at the University of California, Davis, working under Professor John D. Owens. Her research explores parallel data structures and problems that utilize them, with the goal of broadening the classes of problems that can be solved on the GPU. She earned a B.S. in engineering science and a B.A. in physics from Trinity University in San Antonio, Texas.

Most GPU data structures must be rebuilt (often on the CPU) any time they are modified. We'll examine the challenges of building and maintaining mutable data structures on the GPU, and will present our solution for one particular data structure: the quotient filter. A quotient filter is used for performing fast database queries, similar to a Bloom filter. We describe our search for an efficient parallelization of construction, insertion, and query operations on the quotient filter data structure. We show that this data structure can outperform a Bloom filter for database lookups and insertions, while also providing much greater flexibility.
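
For background (our illustration, not the authors' kernels): a quotient filter stores, for each element, only the low bits of its hash (the remainder) at a slot addressed by the high bits (the quotient). With hypothetical sizes q = 20 and r = 12:

    // A quotient filter splits a hash h into a q-bit quotient (the home
    // slot index) and an r-bit remainder (the tag stored in the slot).
    // The hard part, which the talk addresses, is resolving slot
    // collisions during parallel construction, insertion, and query.
    __host__ __device__ void qf_split(unsigned int h,
                                      unsigned int *quotient,
                                      unsigned int *remainder)
    {
        const unsigned int r_bits = 12;
        *quotient  = h >> r_bits;                // high bits: slot index
        *remainder = h & ((1u << r_bits) - 1);   // low bits: stored tag
    }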

Level: Intermediate
Type: Talk
Tags: Algorithms; Big Data Analytics

Day: Thursday, 04/07
Time: 14:00 - 14:25
Location: Marriott Salon 3

S6465 - Physics-Based Modeling of Flexible Tires on Deformable Terrain with the GPU

Daniel Melanz Mechanical Engineer/Research Associate, University of Wisconsin - Madison
Highly-Rated Speaker
Daniel Melanz has an M.S. in mechanical engineering from the University of Wisconsin - Madison. His technical area of focus is modeling and simulation using high performance computing with an emphasis on terramechanics and multiphysics. He is working toward his Ph.D. in mechanical engineering with a minor in computer science.

Explore new techniques in the simulation of a flexible tire rolling on discrete terrain comprising over a million elements. Simulations like this have the potential to predict the transient soil behavior exhibited under severe vehicle maneuvering in various Army operations, including a tire spinning in sand and the sinkage caused by large slips. To support off-road heavy vehicle mobility analysis, we are integrating a tire model based on nonlinear finite elements by leveraging GPU-parallel computational modeling, numerical solution, and collision detection. Numerical results for a spectrum of scenarios on deformable terrain will be compared against results obtained with the rigid tire model to gauge the effect of tire flexibility on overall system dynamics.

Level: All
Type: Talk
Tags: Computational Physics; Algorithms; Aerospace & Defense

Day: Thursday, 04/07
Time: 14:00 - 14:25
Location: Marriott Salon 6

S6535 - Real-Time Monte-Carlo Path Tracing of Medical Volume Data

Klaus Engel Principal Key Expert Visualization, Siemens Healthcare GmbH
Klaus Engel joined Siemens Healthcare GmbH in 2003 as the principal key expert in visualization; he is the co-author of the book Real-Time Volume Graphics, released in 2006. Klaus received his M.S. in computer science in 1997 and his Ph.D. in 2002, with a thesis on "Strategies and Algorithms for Interactive Volume Visualization in Digital Documents". He has won many scientific awards, and his scientific interests include visualization on mobile devices, parallel rendering, graphics hardware, network-based visualization systems, volumetric effects and participating media, and stereoscopy.

"Cinematic rendering" creates photo-realistic and hyper-realistic images from scanned patient anatomy. Such photorealistic images provide great value for reporting, patient-doctor communication, anatomical education, marketing, and surgery planning, and may support diagnostic imaging by providing meaningful overview visualizations. The physically based rendering algorithm is based on volumetric Monte-Carlo path tracing, a method that has been too computationally expensive to be used in a product before now. Recent advances in GPUs and algorithms now allow interactive rendering of volume datasets using this method.

Level: All
Type: Talk
Tags: Medical Imaging; Rendering & Ray Tracing; Large Scale and Multi-Display Visualization

Day: Thursday, 04/07
Time: 14:00 - 14:50
Location: Room LL21B

S6584 - Deep Dive into Dynamic Parallelism Performance

Christoph Angerer DevTech Compute, NVIDIA
Christoph Angerer is a developer in NVIDIA's European Developer Technology team. Based in Munich, Germany, he works with developers accelerating applications on GPUs. He holds a Ph.D. in computer science from ETH Zurich in Switzerland.
Shankara Rao Thejaswi Nanditale DevTech Compute, NVIDIA
Thejaswi is a compute devtech engineer at NVIDIA who is interested in GPU programming, machine learning, and bioinformatics.

Dynamic parallelism enables a CUDA kernel to create and synchronize new nested work by launching child kernels from the GPU. Such a nested parallelism programming model maps directly to many real-world programming patterns like adaptive grids or tree-traversal based computations. We'll systematically analyze the performance characteristics of dynamic parallelism by means of real-world application case studies and suggest programming guidelines to get the best performance out of the dynamic parallelism feature. (This talk will be held in collaboration with Thejaswi Rao.)
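
The adaptive-grid pattern mentioned above maps naturally onto dynamic parallelism: a parent kernel inspects the data and launches child grids only where refinement is needed. A minimal sketch follows (our example with hypothetical names; requires compute capability 3.5+ and nvcc -rdc=true).

    // Child kernel: refine one cell of an adaptive grid (placeholder work).
    __global__ void refine_cell(int cell)
    {
        // ... finer-grained computation for this cell ...
    }

    // Parent kernel: each thread inspects its cell and, only where needed,
    // launches a nested child grid from the GPU.
    __global__ void traverse(const float *error, int ncells, float tol)
    {
        int c = blockIdx.x * blockDim.x + threadIdx.x;
        if (c < ncells && error[c] > tol)
            refine_cell<<<1, 64>>>(c);   // device-side launch of nested work
    }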

Level: Intermediate
Type: Talk
Tags: Performance Optimization

Day: Thursday, 04/07
Time: 14:00 - 14:50
Location: Room 212B

S6598 - Real Performance Results with VMWare Horizon and View Planner

Manvender Rawat Performance Engineer, NVIDIA GRID, NVIDIA
Manvender Rawat is part of the NVIDIA GRID performance engineering team and is responsible for measuring and validating the performance and scalability delivered via the GRID platform, and GRID vGPU software running on GRID GPUs on all enterprise virtualized platforms.

We'll showcase test results for various knowledge-worker workloads (Microsoft Word, Adobe Acrobat Reader, etc.) and other 3D workloads using the latest VMware View Planner tool. We'll also give an overview of the deployment and performance measurement tools.

Level: Beginner
Type: Talk
Tags: Graphics Virtualization

Day: Thursday, 04/07
Time: 14:00 - 14:50
Location: Marriott Salon 2

S6627 - Bifrost: High-Throughput CPU/GPU Pipelines Made Easy

Ben Barsdell Developer Technology Engineer, NVIDIA
Highly-Rated Speaker
Ben Barsdell completed his Ph.D. in astronomy at Swinburne University, where he studied the application of GPUs to a number of widely used simulation and data processing algorithms. In 2013, he moved to the U.S. to take up a post-doctoral position at the Harvard-Smithsonian Center for Astrophysics, where he worked on GPU algorithms and digital back-ends for large-N radio telescope arrays. He joined NVIDIA in 2015, where he works on GPU-accelerated astronomy and deep learning applications.

We'll present Bifrost, a lightweight new framework designed to ease the development and deployment of pipeline applications that demand sustained peak utilization of network, CPU, and GPU resources under soft real-time constraints. Such applications are common in experimental science and computer vision, where processing must keep up with acquisition systems to avoid data loss. Bifrost enables operations to be wrapped in a simple task container with metadata-rich inputs and outputs. By connecting tasks together, complex branching pipelines can be constructed, with asynchronous communication handled by efficient ring buffers in host or device memory. We'll demonstrate Bifrost using a high-performance radio astronomy application that has been deployed as part of the LEDA project.

Level: Intermediate
Type: Talk
Tags: Astronomy & Astrophysics; Tools & Libraries; Signal & Audio Processing

Day: Thursday, 04/07
Time: 14:00 - 14:25
Location: Room 212A

S6634 - Fast and Easy Hyper-Parameter Grid Search for Deep Learning on Rescale

Mark Whitney Senior Software Engineer, Rescale
Mark Whitney is a senior software engineer at Rescale, where he works on high-performance computing and machine learning. Prior to this, he started the Energy Demand Forecasting Group at EnergyHub, providing smart-grid analytics to electric utilities. Mark received his Ph.D. from UC Berkeley in quantum computing.

Training deep neural networks can be time consuming when searching a large hyper-parameter space. Using the Rescale optimization platform, we present a simple interface for parallelized hyper-parameter grid search for deep learning models from a number of different machine learning packages of the user's choice. The offered packages are pre-configured to take advantage of NVIDIA cuDNN-accelerated training, allowing the user to trade off cost vs. training time vs. model performance. We'll demo the Rescale DNN optimization system and give performance results for training on the NVIDIA GPU hardware available on Rescale, benchmarking against the MNIST and CIFAR datasets.

Level: Intermediate
Type: Talk
Tags: Deep Learning & Artificial Intelligence

Day: Thursday, 04/07
Time: 14:00 - 14:25
Location: Room 210D

S6757 - Maximize OpenACC Performance with the PGPROF Profiler

Scott Biersdorff Senior Software Engineer, NVIDIA
Scott Biersdorff is a PGI performance tools engineer at NVIDIA. Before joining NVIDIA, Scott was a researcher at the University of Oregon investigating how software tools can better help High Performance Computing applications adapt to the increasing hardware resources. Scott has contributed to various HPC performance tools including TAU and Score-P.

OpenACC directives are a quick way to find out if your code can benefit from GPU acceleration, and the PGI OpenACC compilers provide a wealth of information to help you along the way. We'll discuss two key technologies that contribute performance data about OpenACC regions and how they can help you transition to the GPU and improve the performance of OpenACC-accelerated code: (1) the OpenACC 2.5 tools interface, which provides a set of events that record what operations are performed for each OpenACC region and how long each one takes; and (2) a profiler that correlates the compiler output with the performance data to convey specific information about each OpenACC region, as well as presenting a timeline that shows when each region executed.

Level: Intermediate
Type: Talk
Tags: OpenACC; Performance Optimization; Tools & Libraries

Day: Thursday, 04/07
Time: 14:00 - 14:25
Location: Room 210E

S6760 - GPUs in the Cloud: How to Harness the Infinite Scale of the Public Cloud for Interactive Applications

Nikola Bozinovic Founder & CEO, Frame
Nikola Bozinovic is the founder and CEO of Frame, a 3-year-old Silicon Valley startup that makes creative software available to users anywhere via the cloud. He started Frame with the goal of democratizing computing and making any software, even the most powerful design and engineering applications, accessible anywhere and on any device. After graduating with a Ph.D. in electrical engineering from Boston University in 2006, Nikola moved to California to help start MotionDSP, where, as the company's chief technologist, he developed software for real-time video analytics and enhancement. With the emergence of the cloud in 2012, Nikola saw another opportunity: to build a platform for easy delivery of applications from the cloud. Frame was launched in 2013, and within days, tens of thousands of people from 190 countries were using it on desktops, tablets, and phones. Since then, Frame has secured $10 million in venture capital funding and is gaining a growing customer base, including SolidWorks, Adobe, Autodesk, and Siemens, drawn to the cloud's technical and financial advantages.

Learn how companies such as SolidWorks, Adobe, Autodesk, and Siemens are using Frame to deliver their applications globally from the public cloud.

Level: Beginner
Type: Talk
Tags: Graphics Virtualization; Data Center & Cloud Computing; Rendering & Ray Tracing

Day: Thursday, 04/07
Time: 14:00 - 14:25
Location: Marriott Salon 4

S6145 - Kokkos Hierarchical Task-Data Parallelism for C++ HPC Applications

H. Carter Edwards Principal Member Technical Staff, Sandia National Laboratories
Highly-Rated Speaker
H. Carter Edwards has over three decades of experience developing software for simulations of a variety of engineering domains. He is an expert in high performance computing (HPC) and is currently focusing on thread-scalable algorithms and data structures for heterogeneous manycore architectures such as NVIDIA GPU, AMD Fusion, and Intel Xeon Phi. He has a B.S. and M.S. in aerospace engineering from the University of Texas at Austin, and worked for 10 years at the Johnson Space Center in the domain of spacecraft guidance, navigation, and control. He has a Ph.D. in computational and applied mathematics, also from the University of Texas at Austin. He has been researching and developing software for HPC algorithms and data structures for the past 16 years at Sandia National Laboratories.

The Kokkos library provides C++ HPC applications with a performance portable programming model for disparate manycore architectures such as NVIDIA Kepler, AMD Fusion, and Intel Xeon Phi. Until this year, Kokkos supported only composition of data parallel patterns (foreach, reduce, and scan) with range and hierarchical team parallel execution policies. Our latest capability enhancement is the addition of hierarchical task-DAG (directed acyclic graph) pattern and policy, where each task supports internal data parallelism. We present our GPU-suitable abstractions and interface for non-blocking task-DAG, and their application to incomplete sparse matrix factorization and graph triangle enumeration.

Level: Intermediate
Type: Talk
Tags: Supercomputing & HPC; Tools & Libraries

Day: Thursday, 04/07
Time: 14:30 - 14:55
Location: Room 211A

S6156 - Unblock Performance Limit of DNN by CUDA® in R

Patric Zhao Senior GPU Architect, NVIDIA
Patric Zhao is a senior GPU architect in the HPC field at NVIDIA. He has seven years of experience developing scientific and engineering applications and is experienced in parallelization, performance modeling, and architecture-specific tuning. Patric is working on molecular dynamics and big data projects. Before joining NVIDIA, he worked on distributed processing and algorithm optimization for EDA software at Cadence.

You'll learn technical solutions for accelerating R with CUDA. DNNs have become a very popular approach in statistical analysis, but even though there are several DNN packages for R, they are rarely used for big data and deep neural networks, because the single-core performance of R is limited and the current design of R's DNN packages is not GPU-friendly. First, we'll introduce how we map specific patterns, such as general matrix multiplication (GEMM), onto DNNs in R; GEMM is a GPU-friendly pattern that can easily be accelerated by cuBLAS. Second, we'll show the tradeoff between performance and memory usage for DNNs in R. Finally, we'll package these CUDA approaches into an R package and publish it to CRAN, so that anyone can install it quickly and get a significant performance improvement from NVIDIA GPUs.
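
The GEMM mapping is the heart of the approach: once a layer's forward pass is phrased as C = alpha*A*B + beta*C, cuBLAS does the heavy lifting. A minimal host-side sketch (our illustration; matrix names and shapes are hypothetical, column-major as cuBLAS expects):

    #include <cublas_v2.h>

    // The core GEMM call a DNN layer is mapped onto. d_A (m x k),
    // d_B (k x n), and d_C (m x n) are device pointers allocated elsewhere.
    void layer_forward(cublasHandle_t handle,
                       const float *d_A, const float *d_B, float *d_C,
                       int m, int n, int k)
    {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n, k,
                    &alpha, d_A, m,    // lda = m
                            d_B, k,    // ldb = k
                    &beta,  d_C, m);   // ldc = m
    }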

Level: Intermediate
Type: Talk
Tags: Big Data Analytics; Deep Learning & Artificial Intelligence; Performance Optimization; Programming Languages

Day: Thursday, 04/07
Time: 14:30 - 14:55
Location: Room 210F

S6236 - Shaping the Light with GPUs

Damien Gratadour Associate Professor, Université Paris Diderot & Observatoire de Paris
Damien Gratadour has been an associate professor at Universite Paris Diderot and a research scientist at LESIA, Observatoire de Paris, since 2008. Damien holds an M.S. in theoretical physics from Universite Paris Diderot and a Ph.D. in observational astronomy from the same university (2005). In the past, Damien was responsible for the last stages of commissioning of the LGS upgrade to the Altair AO system on the Gemini North Telescope in Hawaii (2006) and assumed for almost two years, as an AO scientist, the role of instrument scientist for GeMS, the Gemini MCAO System, a $15 million facility, actively participating in the various acceptance tests and integration of its sub-systems and the first stages of technical tests of the full instrument, most notably the DSP-based RTC. Since 2008, at Observatoire de Paris, Damien has been concentrating on high-performance numerical techniques for astronomy, for modeling, signal processing, and instrumentation, and on the development of observational programs with AO-equipped, large, ground-based telescopes for the study of star formation in merging galaxies. He coordinates several national and international projects aiming at designing and building supercomputers for AO.

Learn how GPUs are used to shape the light on extreme diameter telescopes. By providing the means to process, in real time, large-scale images from wavefront sensors, GPUs are revolutionizing adaptive optics, an instrumental technique used to compensate fast-evolving aberrations in optical systems. We'll show how GPUs are used to power the real-time controllers of these systems to provide millions of commands per second to deformable mirrors so as to stabilize the image quality at the output of a large telescope. The first results of the Green Flash project, a large-scale European initiative aimed at prototyping real-time controllers for the European Extremely Large Telescope, will be presented and illustrated with preliminary data obtained in the lab.

Level: All
Type: Talk
Tags: Astronomy & Astrophysics; Signal & Audio Processing; Supercomputing & HPC

Day: Thursday, 04/07
Time: 14:30 - 14:55
Location: Room 212A

S6339 - Bugs, Benchmarking, and Vulkan

Erika Dignam Bug Triager/Tech PM, NVIDIA
Erika Dignam started off in computer arts before making her way to NVIDIA in 2007. At NVIDIA she started as a QA engineer working with top industry applications, then spent several years as an ISV Program Manager, and finally moved into the OpenGL performance team doing bug triage.
Ross Cunniff Senior Software Engineer, NVIDIA
Ross Cunniff received two degrees, in Computer Science and Mathematics, from New Mexico State University in 1985. Ross has been an NVIDIA employee since 2001. At NVIDIA, he has worked on OpenGL and DirectX graphics drivers as well as on camera image enhancement algorithms. Ross is the inventor or co-inventor of over 15 patents. He is currently one of NVIDIA’s representatives on the SPEC Graphics and Workstation Performance Group committees.

Learn how to get your bugs fixed faster, resolve bugs efficiently, and prevent them in the first place. We'll review internal processes, basic OpenGL bug triage and tools, a little Vulkan debugging, and, finally, bug prevention through benchmarking and test creation.

Level: All
Type: Talk
Tags: Tools & Libraries; Performance Optimization; Real-Time Graphics

Day: Thursday, 04/07
Time: 14:30 - 14:55
Location: Room 211B

S6436 - GPUs in the Cloud: A Software Vendor Perspective

Andrea Rodolico CTO, NICE
Andrea Rodolico has a lifelong passion for ICT and innovation, coupled with 20 years of business experience in the technical ICT market. An entrepreneur since 1994, Andrea co-founded NICE and has since led the company's technology strategies, evolving its grid, cloud, and remote 3D visualization products and solutions to meet the most challenging needs of many Fortune 2000 companies, including oil and gas, aerospace, automotive, financial services, life sciences, and research customers worldwide.

With GPUs now offered by an increasing number of public cloud providers, even the most demanding visual applications can be delivered from the elastic cloud. Besides being a major new capability for their users, software vendors can leverage this new dimension of flexibility to enable new approaches to common needs like demonstrations, evaluations, training, support, and even sales of their applications. We'll cover how leading ISVs have used GPUs in the cloud, how they embraced and leveraged their performance, and how they overcame unexpected limitations the cloud imposes.

Level: All
Type: Talk
Tags: Graphics Virtualization; Data Center & Cloud Computing; In-Situ and Scientific Visualization; Tools & Libraries

Day: Thursday, 04/07
Time: 14:30 - 14:55
Location: Marriott Salon 4

S6529 - Simulation of Rayleigh-Bénard Convection on GPUs

Massimiliano Fatica Senior Manager, Tesla HPC Performance Group, NVIDIA
Massimiliano Fatica is a senior manager at NVIDIA in the Tesla HPC Performance and Benchmark Group, where he works in the area of GPU computing (high performance computing and clusters). Prior to joining NVIDIA, he was a research staff member at Stanford University, where he worked on applications for the Stanford Streaming Supercomputer. He holds a laurea in aeronautical engineering and a Ph.D. in theoretical and applied mechanics from the University of Rome "La Sapienza."

We'll show the steps required to port a finite difference code for the direct numerical simulation of turbulent flow to run on GPUs. The code is the open source AFID project, developed by Twente University, SURFsara, and University of Rome "Tor Vergata." One of the main goals of the porting project was to keep the code as close as possible to the original CPU implementation. The porting was done with CUDA Fortran, using as much as possible the CUF kernel directives. We'll show how to profile the code, the verification process, and the results obtained.

Level: Intermediate
Type: Talk
Tags: Computational Fluid Dynamics; Supercomputing & HPC; Performance Optimization

Day: Thursday, 04/07
Time: 14:30 - 14:55
Location: Marriott Salon 1

S6530 - Simulation of Hypervelocity Impact of a Whipple Shield

Wayne Mindle Director of Sales & Marketing, CertaSIM, LLC
Highly-Rated Speaker
Wayne Mindle is the director of sales and marketing at CertaSIM, LLC, the U.S. and Canadian distributor of the IMPETUS Afea Software Suite. He earned his Ph.D. from Northwestern University in the area of applied mechanics, more specifically finite element analysis as applied to the area of nonlinear explicit transient dynamic problems. He has worked for several major aerospace companies, a consulting company for the FAA, and, prior to his association with CertaSIM, spent 15 years at Livermore Software Technology Corp. as the lead technical sales engineer.

We'll discuss an accurate computational method to efficiently model hypervelocity impact on a single workstation with GPU parallelization. The combination of an SPH (Smooth Particle Hydrodynamics) solver combined with a finite element solver using high-order elements to model the final impacting structure is used to model a Whipple Shield. Results, run times, and comparisons with available test data will be presented.

Level: All
Type: Talk
Tags: Computational Physics; Aerospace & Defense

Day: Thursday, 04/07
Time: 14:30 - 14:55
Location: Marriott Salon 6

S6669 - Demystifying Learning at Scale: From Distributed Mathematics to Effective HPC Infrastructure

Steven Eliuk Project Lead, Samsung Electronics
Steven Eliuk has over 10 years of experience in HPC. He earned his Ph.D. in computer science at the University of Alberta. His focus is on tractability issues in all domains.

We'll demonstrate some of the design choices required to provide a distributed, in-memory, GPU-accelerated, parallel mathematics library, distributed mathematics (dMath). The library considers some of the most common functionality required for effective scaling of deep learning pipelines for a variety of recognition and understanding tasks. The core of the problem is efficient implementations of common basic linear algebra subprograms (BLAS) and specific abstractions for learning at scale.

Level: Intermediate
Type: Talk
Tags: Deep Learning & Artificial Intelligence

Day: Thursday, 04/07
Time: 14:30 - 15:20
Location: Room 210D

S6671 - GPU Accelerated Streaming Algorithms for Halo Finders

Nikita Ivkin Student , Johns Hopkins University
Nikita is a second-year Ph.D. student in the Computer Science Theory Group at Johns Hopkins University, working primarily in the area of streaming algorithms. Prior to starting his Ph.D. program, he received his B.S. and M.S. from the Department of Control and Applied Math at the Moscow Institute of Physics and Technology in 2011 and 2013, respectively.

In this work, we show the connection between two problems: halo finding and heavy hitters. Finding haloes, dense clumps of matter, in the output of a cosmological simulation is crucial for verifying theoretical models against observation. Current algorithms require loading the full dataset into memory, making the computation infeasible on a desktop machine. We reduce the halo-finding problem to the problem of finding the most frequent items (heavy hitters) in streaming data, and apply two algorithms: Pick-and-Drop and Count Sketch. These algorithms can find the top 1,000 largest haloes using only logarithmic memory, but their time performance is poor. GPU acceleration makes it possible to make several passes over the data in reasonable time, helping to find even more haloes in the future.
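
As a hedged illustration of the streaming side (textbook Count Sketch with made-up hash constants and table sizes, not the paper's implementation): each item adds a random sign to one bucket per hash row, and heavy hitters dominate their buckets in expectation.

    #include <cstdint>

    // Textbook Count Sketch update kernel (illustrative only).
    #define ROWS 4
    #define WIDTH 4096

    __device__ uint32_t hash32(uint32_t x, uint32_t seed)
    {
        x ^= seed; x *= 2654435761u; x ^= x >> 16;   // cheap mixing hash
        return x;
    }

    __global__ void count_sketch_update(const uint32_t *items, int n, int *table)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        for (int d = 0; d < ROWS; ++d) {
            uint32_t h = hash32(items[i], 0x9e3779b9u * (d + 1));
            int sign = (hash32(items[i], 0x85ebca6bu * (d + 1)) & 1) ? 1 : -1;
            atomicAdd(&table[d * WIDTH + h % WIDTH], sign);  // concurrent update
        }
    }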

Level: Beginner
Type: Talk
Tags: Algorithms; Astronomy & Astrophysics

Day: Thursday, 04/07
Time: 14:30 - 14:55
Location: Marriott Salon 3

S6206 - Better Than All the Rest: Finding Max-Performance GPU Kernels Using Auto-Tuning

Cedric Nugteren GPU/Supercomputing Expert, SURFsara HPC centre
Cedric Nugteren lives and breathes GPU technology. Cedric ported image processing apps to CUDA during his M.S. studies as early as 2008 and received his Ph.D. in 2014 after publishing 15 peer-reviewed GPU-related articles. He interned at ARM's Mali GPU compiler group in 2012 and NVIDIA's cuFFT team in 2014. After his Ph.D., he started working as a GPU-expert at the Dutch supercomputer center, where he developed an auto-tuner for GPU kernels. His free time is also well spent: he works on a tuned C++11 version of the OpenCL BLAS library.

We'll demonstrate how to use an auto-tuner to squeeze every bit of performance out of your OpenCL/CUDA kernels while achieving performance portability at the same time. This is done by tuning a user-defined set of parameters, such as the thread block size, vector width, or tile size. Because the number of options to explore can be large, the tuner uses search algorithms and machine learning models to traverse the search space intelligently. We'll demonstrate how to use the CLTune open-source auto-tuner to optimize two building blocks for deep learning: GEMM matrix multiplication and 2D convolution. We'll show how to tune for specific GPUs and convolution filters, producing state-of-the-art kernels: we obtained the fastest CUDA/OpenCL GEMM and convolution kernels.
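
Stripped of CLTune's search strategies and machine-learning models, empirical tuning boils down to timing each candidate configuration and keeping the fastest. A self-contained CUDA sketch over hypothetical thread-block sizes:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Stand-in for the kernel being tuned.
    __global__ void copy_kernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    int main()
    {
        const int n = 1 << 24;
        float *in, *out;
        cudaMalloc(&in, n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));

        int candidates[] = {64, 128, 256, 512, 1024};  // thread-block sizes
        float best_ms = 1e30f; int best_bs = 0;
        for (int bs : candidates) {
            cudaEvent_t t0, t1;
            cudaEventCreate(&t0); cudaEventCreate(&t1);
            cudaEventRecord(t0);
            copy_kernel<<<(n + bs - 1) / bs, bs>>>(in, out, n);
            cudaEventRecord(t1);
            cudaEventSynchronize(t1);
            float ms; cudaEventElapsedTime(&ms, t0, t1);
            if (ms < best_ms) { best_ms = ms; best_bs = bs; }  // keep fastest
            cudaEventDestroy(t0); cudaEventDestroy(t1);
        }
        printf("best block size: %d (%.3f ms)\n", best_bs, best_ms);
        cudaFree(in); cudaFree(out);
        return 0;
    }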

Level: Intermediate
Type: Talk
Tags: Performance Optimization; Tools & Libraries; Deep Learning & Artificial Intelligence

Day: Thursday, 04/07
Time: 15:00 - 15:25
Location: Room 211B

S6213 - CoreNeuron Neuronal Network Simulator Optimization Opportunities and Early Experience

Pramod Kumbhar HPC Engineer, Blue Brain Project, EPFL Switzerland
Pramod Kumbhar is an HPC engineer, focusing on the development of the NEURON/CoreNeuron simulator within the Blue Brain Project. Pramod works on parallelisation, performance optimisation and scaling of the simulator on supercomputing architectures like IBM BlueGene-Q, Intel MIC and graphics processing units. As part of the European exascale DEEP project (Dynamic Exascale Entry Platform), he is involved in the porting and optimisation of the simulator on next-generation Intel architectures. Pramod received his M.S. in high performance computing from the Edinburgh Parallel Computer Centre at the University of Edinburgh, Scotland. Before joining the Blue Brain Project, he worked at the Julich Research Centre, Germany.

Learn how large-scale neuronal network simulations are designed, developed, and run using detailed morphologies of neurons. NEURON is a widely used simulation environment, developed over the last 30 years, for modeling individual neurons and networks of neurons with complex branched anatomy. We'll describe the major difficulties in getting maximum performance when optimising a scientific code that includes both clock- and event-driven modeling, along with recipes to overcome these blockers. Building on this analysis, we'll present a comparison of performance results on Intel x86, MIC, IBM Blue Gene/Q, and NVIDIA GPU architectures for driving HPC system co-design.

Level: Intermediate
Type: Talk
Tags: Supercomputing & HPC; Performance Optimization; Computational Chemistry; OpenACC

Day: Thursday, 04/07
Time: 15:00 - 15:25
Location: Room 211A

S6229 - Computational Simulation of World's Biggest Eye on GPUs

Hatem Ltaief Senior Research Scientist, Extreme Computing Research Center, KAUST
Highly-Rated Speaker
Hatem Ltaief is a senior research scientist in the Extreme Computing Research Center at KAUST, where he is advising several KAUST students in their M.S. and Ph.D. research. Hatem received the engineering degree from Polytech Lyon at the University of Claude Bernard Lyon I, France. He earned his M.S. in applied mathematics and his Ph.D. in computer science from the University of Houston. From 2008 to 2010, Hatem was a research scientist in the Innovative Computing Laboratory in the Department of Electrical Engineering and Computer Science at the University of Tennessee, Knoxville. He is part of the European Exascale Software Initiative (EESI) to build a European vision and roadmap to address the challenges of the new generation of massively parallel systems. Hatem has various strategic partnerships with industries (Saudi Aramco, Intel, NVIDIA) as well as universities and HPC centers (University of Tennessee, INRIA Bordeaux, L'Observatoire de Paris, Barcelona Supercomputing Center). Hatem is the co-author of 40+ journal/conference papers and book chapters. His research interests include parallel numerical algorithms, parallel programming models, and performance optimizations for multicore architectures and hardware accelerators.
Damien Gratadour Associate Professor, Université Paris Diderot & Observatoire de Paris
Damien Gratadour has been an associate professor at Universite Paris Diderot and a research scientist at LESIA, Observatoire de Paris, since 2008. Damien holds an M.S. in theoretical physics and a Ph.D. in observational astronomy from Universite Paris Diderot. In the past, Damien was responsible for the last stages of commissioning of the LGS upgrade to the Altair AO system on the Gemini North Telescope in Hawaii (2006). He spent two years as an AO scientist, with the responsibility of instrument scientist for GeMS, the Gemini MCAO System, a $15 million facility, actively participating in the various acceptance tests and integration of its sub-systems and the first stages of technical tests of the full instrument, most notably the DSP-based RTC. Since 2008, at Observatoire de Paris, Damien has been concentrating on high-performance numerical techniques for astronomy, for modeling, signal processing, and instrumentation, and on the development of observational programs, with AO-equipped, large ground-based telescopes, for the study of star formation in merging galaxies. He coordinates several national and international projects aimed at designing and building supercomputers for AO.

Have you heard about the world's biggest eye? Learn how GPUs help design major, multimillion-dollar optical instruments for the European Extremely Large Telescope. Starting from the mathematical model up to the high-performance implementation on distributed-memory systems with hardware accelerators, we'll explain how the resulting dense linear algebra operations associated with an efficient task-based programming model help design the next generation of telescope instruments.

Level: Intermediate
Type: Talk
Tags: Astronomy & Astrophysics; Algorithms; Performance Optimization

Day: Thursday, 04/07
Time: 15:00 - 15:25
Location: Room 212A

S6254 - Performance Analysis and Prediction of Multi-GPU MPI Applications

Michael Frumkin Sr. Computer Architect, NVIDIA
Michael holds an M.S. in mathematics from Moscow State University and a Ph.D. in computer science from the graduate school of the Soviet Academy of Sciences. He is the author of more than 70 scientific papers and holds two patents. At NVIDIA, Michael works as a senior computer architect on performance optimization and analysis of large-scale applications. He previously worked at Google, Intel, and NASA, on traffic management, performance optimization of multicore systems, and benchmarking of large-scale systems.

We'll introduce several MPI CORAL and pure CUDA applications that can harness multi-GPU systems, and describe potential performance issues with these applications. We'll describe tools for tracing mixed MPI and CUDA applications, based on PMPI instrumentation and DepEx interception of CUDA calls. From the traces captured by the tools, we build an application model called the application graph. Using this model, we'll demonstrate visualization of timelines and inter-GPU traffic. We'll then focus on performance issues that can be exposed by analyzing the application graph. Finally, we'll map the application graph onto future multi-GPU systems and present some performance projections for these systems.
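
The PMPI profiling interface behind such tracing tools is simple to illustrate (our minimal example, using the MPI-3 signature): the application's MPI_Send resolves to a wrapper that records the event and forwards to PMPI_Send.

    #include <mpi.h>
    #include <cstdio>

    // Standard PMPI interposition: the linker resolves the application's
    // MPI_Send to this wrapper, which logs the call and forwards it to the
    // real implementation via PMPI_Send.
    extern "C" int MPI_Send(const void *buf, int count, MPI_Datatype type,
                            int dest, int tag, MPI_Comm comm)
    {
        double t0 = MPI_Wtime();
        int err = PMPI_Send(buf, count, type, dest, tag, comm);
        fprintf(stderr, "trace: MPI_Send to rank %d took %.6f s\n",
                dest, MPI_Wtime() - t0);
        return err;
    }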

Level: Intermediate
Type: Talk
Tags: Performance Optimization; Supercomputing & HPC; Tools & Libraries

Day: Thursday, 04/07
Time: 15:00 - 15:25
Location: Room 212B

S6328 - Towards the Industrial Adoption of GPU Accelerated Computational Fluid Dynamics

Peter Vincent Senior Lecturer, Imperial College London
Highly-Rated Speaker
Peter Vincent is a senior lecturer and EPSRC early career fellow in the department of Aeronautics at Imperial College London, working at the interface between mathematics, computing, fluid dynamics, and aeronautical engineering. He holds a first class B.S. from the Department of Physics at Imperial College (graduating top of the year), and a Ph.D. from the Department of Aeronautics at Imperial College in the field of CFD. He has also studied in the U.S., serving as a postdoctoral scholar in the Department of Aeronautics and Astronautics at Stanford University, where he developed novel high-order numerical methods for CFD, and implemented them for massively parallel, many-core GPUs.

We'll detail our experiences translating next-generation, high-order, GPU-accelerated CFD technology from an academic codebase into an industry-ready platform. We'll begin by introducing the flux reconstruction (FR) approach to high-order methods, a discretization that is particularly well suited to many-core architectures. We'll then outline our open-source implementation of FR, called PyFR, and describe the Hyperflux project with Zenotech and CFMS, which aims to translate technology from PyFR into the commercial zCFD software. The talk will touch on various topics, including maintainability, portability, algorithm choice, numerical robustness, and expected performance improvements, all in an industrial context.

Level: All
Type: Talk
Tags: Computational Fluid Dynamics; Supercomputing & HPC; Algorithms; Aerospace & Defense

Day: Thursday, 04/07
Time: 15:00 - 15:25
Location: Marriott Salon 1

S6332 - Training and Simulation in a Secure Cloud Environment

Matt Coppinger Director, End User Computing, VMware
Matt Coppinger is currently Director, Technical Marketing & Enablement, End User Computing at VMware. Matt has worked on desktop virtualization since its inception at VMware in 2007, first in engineering, then as a field consultant and finally within his current role in Technical Marketing & Enablement. He has authored a number of reference architectures for VMware, including virtualising 3D applications, and has spoken on the technical aspects of desktop virtualization at VMworld and other major conferences since 2010.

We'll demonstrate the benefits of moving your training and simulation workloads into a secure cloud environment with VMware and NVIDIA GRID. We'll focus on the performance of these applications when virtualized and demonstrate an application in action. The session covers an example architecture and sizing for deploying Bohemia Interactive Simulations' Virtual Battlespace 3 product. You'll see the application in action, along with benchmarks comparing a physical workstation against a virtual desktop.

Level: Beginner
Type: Talk
Tags: Graphics Virtualization; Aerospace & Defense

Day: Thursday, 04/07
Time: 15:00 - 15:25
Location: Marriott Salon 4

S6427 - Monitoring Solutions for NVIDIA GRID™

Thomas Poppelgaard Technology Evangelist, Poppelgaard.com
Thomas has been a Microsoft Most Valuable Professional (MVP) since 2015, a Citrix Technology Professional (CTP) since 2013, and a RES Software Value Professional since 2013. He has 19 years of experience in IT and currently works as a technology evangelist and independent consultant at his own company, Poppelgaard.com.

Learn about the best tools available for monitoring NVIDIA GRID with Citrix, VMware, and Microsoft, and how to monitor at the GPU, hypervisor, virtual machine, application, protocol, and end-user-experience levels. In this session, CTP and MVP awardee Thomas Poppelgaard will show you how to use monitoring tools such as uberAgent, Goliath Performance Monitor, and Lakeside SysTrack. You'll come away with an overview of possible monitoring stacks and an understanding of what's possible and which limitations exist in these products.

Level: Advanced
Type: Talk
Tags: Graphics Virtualization; Tools & Libraries

Day: Thursday, 04/07
Time: 15:00 - 15:25
Location: Marriott Salon 2

S6457 - Real-Time Multi GPU-Based 3D Gabor-Domain Optical Coherence Microscopy

Anand Santhanam Assistant Professor, University of California, Los Angeles
Anand Santhanam's research is focused on real-time GPU-based computing for radiotherapy dose calculations, deformable image registration, and biomechanical modeling.

Fast, robust, nondestructive 3D imaging is needed to characterize microscopic tissue structures for diseases such as skin and corneal cancer. We'll present a real-time 3D optical coherence microscopy imaging framework enabled solely by a multi-GPU image processing system. A custom microelectromechanical system (MEMS)-based 2D scanner was developed to achieve, together with a multi-level GPU architecture, 55 kHz fast-axis A-scan acquisition in a custom Gabor-domain optical coherence microscopy (GD-OCM) instrument. GD-OCM yields high-definition, micrometer-class volumetric images. Our results show that with multi-GPU processing, imaging at 2 um resolution has been achieved throughout a 1 x 1 x 0.6 mm^3 volume, acquired and processed in less than two minutes.

Level: Intermediate
Type: Talk
Tags: Medical Imaging; Algorithms; Supercomputing & HPC

Day: Thursday, 04/07
Time: 15:00 - 15:25
Location: Room LL21B

S6475 - Simulating Human Aorta Material Behavior Using a GPU Explicit Finite Element Solver

Vukasin Strbac PhD student, Biomechanics Section, KULeuven University
Vukasin Strbac received his M.S. in information science from the Faculty of Organization and Informatics, University of Zagreb, Croatia, in 2009, majoring in low-level programming of computer graphics and rigid body dynamics. Since 2011 he has been a Ph.D. student in the Biomechanics Section of the Department of Mechanical Engineering at KU Leuven. His interests are parallel programming, many-core architectures, optimization, and nonlinear solvers applied to finite element modeling.

We examine the effects of (1) fiber-reinforcement anisotropy, (2) higher-order integration, and (3) near-incompressibility on computation speed using a custom GPU explicit finite element code. A constitutive model of the arterial wall is implemented and tested in full, selective-reduced, and under-integrated regimes to measure the relative performance impact. Accuracy verification and speedups are reported relative to established academic and industry-proven FE codes. Learn about current solution speeds and challenges for clinically relevant hyperelastic materials using GPU explicit finite elements.

Level: Intermediate
Type: Talk
Tags: Computational Physics; Performance Optimization

Day: Thursday, 04/07
Time: 15:00 - 15:25
Location: Marriott Salon 6

S6517 - GPU Multisplit

Saman Ashkiani Ph.D. Student, University of California, Davis
Saman Ashkiani is a fifth year Ph.D. student in the Electrical and Computer Engineering Department of the University of California, Davis. He is working under the supervision of Prof. John Owens on general-purpose GPU programming.

Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets, where the function that categorizes an element into a bucket is provided by the programmer. Due to the lack of an efficient multisplit on GPUs, programmers often use a sort instead. However, sort does more work than necessary to implement multisplit, and is thus inefficient. In this work, we provide a parallel model and multiple implementations for the multisplit problem, with a focus on a small number of buckets. In our implementations, we exploit the computational hierarchy of the GPU to perform most of the work locally, with minimal usage of global operations. We use warp-synchronous programming models as well as hierarchical reordering of input elements to achieve better performance.
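
A flavor of the warp-synchronous building block (our simplified two-bucket fragment; the full multisplit handles many buckets plus a global scan across warps): each lane computes its rank among the same-bucket lanes of its warp with a ballot and a popcount.

    // Warp-synchronous rank computation for a two-bucket multisplit.
    // __ballot is the pre-CUDA 9 intrinsic used in this era; newer
    // toolkits spell it __ballot_sync.
    __device__ int warp_bucket_rank(bool in_bucket1)
    {
        unsigned lane    = threadIdx.x & 31;
        unsigned lt_mask = (1u << lane) - 1;           // lanes below this one
        unsigned votes   = __ballot(in_bucket1);       // bitmap of bucket-1 lanes
        return in_bucket1 ? __popc(votes  & lt_mask)   // rank within bucket 1
                          : __popc(~votes & lt_mask);  // rank within bucket 0
    }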

Level: Intermediate
Type: Talk
Tags: Algorithms

Day: Thursday, 04/07
Time: 15:00 - 15:25
Location: Marriott Salon 3

S6708 - Data Science Applications of GPUs in the R Language

Norm Matloff Professor of Computer Science, University of California, Davis
Dr. Norm Matloff is a professor of computer science at the University of California, Davis. A Los Angeles native, he has a Ph.D. in (pure) mathematics from UCLA, and conducts research in computer science and statistics. His book, Parallel Computing for Data Science, was published by Chapman and Hall last June, and he is currently writing a book on regression, classification and machine learning. He is a recipient of campuswide awards at UCD for teaching and public service.

In this presentation, you will learn about the use of GPUs in data science applications using the R language, as well as a general method, Software Alchemy, for parallelizing statistical applications. The talk will provide an overview of R libraries available for interfacing with GPUs, and discussion of issues involved in writing such libraries, before showing you how to use Software Alchemy (with or without R) to overcome GPU memory limitations in statistical applications.

Level: All
Type: Talk
Tags: Big Data Analytics; Deep Learning & Artificial Intelligence

Day: Thursday, 04/07
Time: 15:00 - 15:25
Location: Room 210F

S6742 - Real-time Person Tracking on Jetson with OpenPTrack

Jeff Burke Assistant Dean, Technology and Innovation, UCLA School of Theater, Film and Television
Jeff Burke is Assistant Dean for Technology and Innovation at the UCLA School of Theater, Film and Television (UCLA TFT). He has produced, managed, programmed and designed experimental performances, short films, new genre art installations and new facility construction internationally for more than 15 years. Jeff has been a faculty member since 2001 and today, in addition to his role developing technology and innovation strategy at TFT, is Co-PI and application team lead for the Named Data Networking project, a multi-campus effort supported by the National Science Foundation (NSF) and an international 25-member consortium to develop a future Internet architecture. In 2004, Burke co-founded UCLA TFT's Center for Research in Engineering, Media and Performance (REMAP), a collaboration with the Henry Samueli School of Engineering and Applied Science, which combines research, artistic production and community engagement. At REMAP, Burke's research has been supported by the NSF and NEA, Intel, Cisco, Trust for Mutual Understanding and the MacArthur Foundation, among others. From 2006-2012, he was area lead for participatory sensing at the NSF Center for Embedded Networked Sensing, helping to define a new application arena for mobile devices. In 2014, Jeff received a three-year Google Focused Award on the "Future of Storytelling," for work that will explore the intersection of storytelling and coding through research and production of original, interdisciplinary digital media works at UCLA TFT.

We'll provide an overview of OpenPTrack, a GPU-enabled, open-source project that enables real-time position tracking of many people using networked 3D imagers, which is now available for the Jetson TK1/TX1 embedded platform. OpenPTrack specifically targets innovative applications in education, arts, and culture, where it aims to meet a need for real-time person tracking that is reliably scalable over large areas, realistically deployable, and low cost. We'll cover the basic technical approach, UCLA REMAP's experience from real-world multi-imager deployments, and the technology roadmap, using Jetson, that aims to bring occlusion-resistant, real-time person tracking into the mainstream of interactive design and experimentation.

Level: Intermediate
Type: Talk
Tags: Computer Vision & Machine Vision; Robotics & Autonomous Machines; Embedded

Day: Thursday, 04/07
Time: 15:00 - 15:50
Location: Room 210G

S6155 - Deep Learning Recommendation of Treatment from Electronic Data

David Ledbetter Data Science Consultant, Children's Hospital Los Angeles
David Ledbetter has an extensive and deep understanding of decision theory. He has experience implementing various decision engines, including convolutional neural networks, random forests, extra trees, and linear discrimination analysis. His particular area of focus is in performance estimation, where he has demonstrated a tremendous ability to accurately predict performance on new data in nonstationary, real-world scenarios. David has worked on a number of real-world detection projects, including detecting circulating tumor cells in blood, automatic target recognition utilizing CNNs from satellite imagery, make/model car classification for the Los Angeles Police Department using CNNs, and acoustic right whale call detection from underwater sonobuoys. Recently, David has been developing a CNN to generate personalized treatment recommendations to optimize patient outcomes using unstructured electronic medical records from 10 years of data collected from the Children's Hospital Los Angeles Pediatric Intensive Care Unit.

Construct a model to generate treatment predictions to optimize patient outcomes by leveraging the information gleaned from over 10,000 patients that passed through the Pediatric Intensive Care Unit at Children's Hospital Los Angeles over more than 10 years. This is accomplished by converting unstructured, non-uniformly sampled patient information into a structured data representation which resembles an image - here referred to as a "patient snapshot." These patient snapshots elegantly enable convolutional neural networks to efficiently generate a basis.
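
A hypothetical sketch of the underlying idea, turning irregularly sampled observations into a fixed, image-like grid; the field names, bin sizes, and averaging rule below are our assumptions, not the actual CHLA encoding:

```cuda
#include <cstdio>
#include <vector>

struct Obs { int variable; float hours; float value; };  // one charted measurement

int main() {
    const int VARS = 4, BINS = 6;              // e.g., 4 vitals over 6 time bins
    const float BIN_HOURS = 1.0f;
    std::vector<Obs> chart = {{0, 0.2f, 98.6f}, {0, 3.7f, 101.2f},
                              {1, 1.1f, 80.0f}, {2, 4.9f, 120.0f}};
    float grid[VARS][BINS] = {};               // the "snapshot"
    int   hits[VARS][BINS] = {};
    for (const Obs &o : chart) {               // average samples landing in a bin
        int b = (int)(o.hours / BIN_HOURS);
        if (b < 0 || b >= BINS) continue;
        grid[o.variable][b] += o.value; hits[o.variable][b]++;
    }
    for (int v = 0; v < VARS; ++v)
        for (int b = 0; b < BINS; ++b)
            if (hits[v][b]) grid[v][b] /= hits[v][b];
    // grid[][] can now be fed to a CNN exactly like a one-channel image.
    printf("var 0, bin 3: %.1f\n", grid[0][3]);
    return 0;
}
```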

Level: Intermediate
Type: Talk
Tags: Medical Imaging; Deep Learning & Artificial Intelligence; Computer Vision & Machine Vision

Day: Thursday, 04/07
Time: 15:30 - 15:55
Location: Room LL21B

S6185 - Fast Parallel Suffix Array on the GPU

Leyuan Wang Ph.D. Student, University of California, Davis
Leyuan Wang is a Ph.D. student in computer science at the University of California, Davis. Leyuan completed her M.S. in electrical and computer engineering at UC Davis in 2014, after having earned her undergraduate degree in electronics science and technology at China's Zhejiang University. With Professor John Owens as her advisor, her research spans general-purpose computing on graphics processing units (GPGPU, also known as GPU computing), along with computer graphics, parallel algorithms programming, and data compression. Most recently, Leyuan has been implementing classic algorithms on GPUs and doing large-scale GPU computing. This summer, she presented one of only two "Distinguished Papers" of the 51 accepted at Euro-Par 2015. She is a main developer and release czar for CUDPP 2.2.

Explore the latest techniques for accelerating suffix array construction algorithms (SACAs) using CUDA. The suffix array (SA) data structure is used in a broad spectrum of applications, including data compression, bioinformatics, and text indexing. The recent explosion in data sizes and the emergence of commodity data-parallel processors motivate efficient parallel implementations of SACAs. Because of the high arithmetic and memory throughput of many-core GPUs and multi-core CPUs, these processors are well-suited for data-intensive computing tasks such as SACAs. We address the parallel SACA problem by designing, implementing, and comparing two different formulations of SACAs on NVIDIA GPUs and achieve significant speedups compared with previous CPU/GPU state-of-the-art implementations.
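
As background, the suffix array itself is easy to define with a deliberately naive construction (our reference sketch, worst-case O(n^2 log n); the talk's parallel SACA formulations, such as skew and prefix-doubling variants, are what make construction fast on GPUs):

```cuda
#include <algorithm>
#include <cstdio>
#include <cstring>
#include <numeric>
#include <vector>

int main() {
    const char *s = "banana";
    int n = (int)strlen(s);
    std::vector<int> sa(n);
    std::iota(sa.begin(), sa.end(), 0);      // suffix start positions 0..n-1
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return strcmp(s + a, s + b) < 0;     // compare suffixes lexicographically
    });
    for (int i : sa) printf("%d ", i);       // "banana" -> 5 3 1 0 4 2
    printf("\n");
    return 0;
}
```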

Level: Intermediate
Type: Talk
Tags: Performance Optimization; Algorithms; Tools & Libraries

Day: Thursday, 04/07
Time: 15:30 - 15:55
Location: Room 211B

S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

Wei Tan Research Staff Member, IBM T. J. Watson Research Center
Wei Tan currently works on big data and distributed computing systems. He is an adjunct professor at Department of Automation, Tsinghua University, China, and an associate editor of IEEE Transactions on Automation Science and Engineering. Wei is interested in accelerating machine learning using scale-out (e.g., Spark) and scale-up (e.g., GPU) approaches. He also works on NoSQL and services computing. Wei's work and code have been incorporated into the IBM patent portfolio and software products such as BigInsights and Cognos.

We present cuMF, a highly optimized matrix factorization system on GPUs. Matrix factorization (MF) is a key algorithm in recommender systems. On a single GPU, we introduce a memory-optimized alternating least squares (ALS) method: it alleviates discontiguous memory access and aggressively uses registers so as to reduce memory latency. On multiple GPUs, we combine data parallelism with model parallelism and introduce a topology-aware parallel reduction method so as to scale ALS across GPUs. Using only one machine with four NVIDIA GPU cards, cuMF can be 6-10 times as fast, and 33-100 times as cost-efficient, as state-of-the-art distributed CPU solutions. Moreover, cuMF can solve the largest matrix factorization problem ever reported.
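
For reference, the standard MF objective and the closed-form per-user ALS update look as follows (textbook formulation; cuMF optimizes how this update is batched and executed on GPUs, and its notation may differ):

```latex
% Regularized matrix factorization objective (Omega = observed ratings):
\min_{X,Y} \sum_{(u,i)\in\Omega} \bigl( r_{ui} - x_u^{\top} y_i \bigr)^2
  + \lambda \Bigl( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \Bigr)

% ALS: with item factors Y fixed, each user vector x_u has a closed-form
% least-squares update (and symmetrically for items once X is fixed):
x_u \leftarrow \Bigl( \sum_{i\in\Omega_u} y_i y_i^{\top} + \lambda I \Bigr)^{-1}
  \sum_{i\in\Omega_u} r_{ui}\, y_i
```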

Level: Intermediate
Type: Talk
Tags: Big Data Analytics; Deep Learning & Artificial Intelligence; Performance Optimization; Algorithms

Day: Thursday, 04/07
Time: 15:30 - 15:55
Location: Room 210F

S6318 - ALYA Multi-Physics System on GPUs: Offloading Large-Scale Computational Mechanics Problems

Vishal Mehta Engineer, Barcelona Supercomputing Center
Vishal Mehta works as an engineer at the Barcelona Supercomputing Center. Vishal has four years of experience working with GPUs in the fields of RADAR processing, fluid dynamics, and eigenproblems. Currently, he is working on porting multi-physics applications to GPUs and on GPU performance optimizations. Vishal's research interests include Hadoop, multi-physics applications, algorithms for GPUs, OpenStack, and virtualization.

Learn to interface CUDA kernels, CUDA library APIs, and driver APIs with existing Fortran applications in HPC. This session covers the Alya multi-physics code developed at the Barcelona Supercomputing Center. The code is based on Fortran 95 and scales across thousands of cores. We describe in depth how to port computationally heavy modules from Fortran to CUDA, and how to use CUDA features like dynamic parallelism, CUDA streams, unified memory, and error handling in Fortran applications with the NVCC compiler. We also discuss future directions using next-generation programming models such as OmpSs for hybrid CPU and GPU computing. The presentation includes various example codes for improving the programming skills of the scientific community.
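
The core interop pattern is small enough to sketch. Below is a minimal example of our own (not Alya code): a CUDA kernel wrapped in an extern "C" launcher that Fortran can call through ISO_C_BINDING; the routine name and arguments are illustrative.

```cuda
// The Fortran side would declare, for example:
//   interface
//     subroutine saxpy_gpu(n, a, x, y) bind(C, name="saxpy_gpu")
//       use iso_c_binding
//       integer(c_int), value :: n
//       real(c_float),  value :: a
//       type(c_ptr),    value :: x, y
//     end subroutine
//   end interface
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

extern "C" void saxpy_gpu(int n, float a, const float *x, float *y) {
    // x and y are device pointers handed over from Fortran (allocated via
    // similar wrappers around cudaMalloc, or via unified memory).
    saxpy<<<(n + 255) / 256, 256>>>(n, a, x, y);
    cudaDeviceSynchronize();   // keeps error handling simple on the Fortran side
}
```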

Level: Intermediate
Type: Talk
Tags: Performance Optimization; Supercomputing & HPC

Day: Thursday, 04/07
Time: 15:30 - 15:55
Location: Room 211A

S6355 - Using AmgX to Accelerate PETSc-Based CFD Codes

Pi-Yueh Chuang Ph.D. Student, George Washington University
Pi-Yueh Chuang is a Ph.D. student in mechanical and aerospace engineering at George Washington University, Washington, D.C. He is a member of Professor Lorena A. Barba's research group. His current research interests are GPU applications in computational fluid dynamics simulations and immersed boundary methods. Prior to his Ph.D. studies, he worked as an engineer at Moldex3D, a company developing moldflow simulation software. He received his M.S. in mechanical engineering from National Taiwan University, with a thesis and papers focusing on Monte Carlo simulation methods and nanoscale energy transport. He has a B.S. in power mechanical engineering from National Tsing Hua University, Taiwan.

Learn to accelerate existing PETSc applications using AmgX, NVIDIA's library of multi-GPU linear solvers and multigrid preconditioners. We developed wrapper code to couple AmgX and PETSc, allowing programmers to use it with fewer than 10 additional lines of code. Using PetIBM, our PETSc-based immersed-boundary CFD solver, we show how AmgX can speed up an application with little programming effort. AmgX can thus bring multi-GPU capability to large-scale 3D CFD simulations, reducing execution time and lowering hardware costs. As an example, we estimate the potential cost savings using Amazon Elastic Compute Cloud (EC2). We also present performance benchmarks of AmgX and tips for optimizing GPU multigrid preconditioners for CFD. This presentation is co-authored with Professor Lorena A. Barba.
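
For orientation, the typical AmgX C API call sequence that such a wrapper drives looks roughly like the standalone sketch below. The config string is illustrative (production runs usually load a JSON config via AMGX_config_create_from_file), and the AmgX reference should be consulted for exact signatures.

```cuda
#include <amgx_c.h>

int main() {
    // 3x3 diagonal test system in CSR: A = diag(1,2,3), b = (1,2,3) -> x = (1,1,1)
    int    row_ptrs[4] = {0, 1, 2, 3}, cols[3] = {0, 1, 2};
    double vals[3] = {1.0, 2.0, 3.0}, rhs[3] = {1.0, 2.0, 3.0}, sol[3] = {0, 0, 0};

    AMGX_initialize();
    AMGX_config_handle cfg;
    AMGX_config_create(&cfg, "config_version=2, solver=PCG, preconditioner=AMG");
    AMGX_resources_handle rsrc;
    AMGX_resources_create_simple(&rsrc, cfg);

    AMGX_matrix_handle A; AMGX_vector_handle b, x; AMGX_solver_handle solver;
    AMGX_matrix_create(&A, rsrc, AMGX_mode_dDDI);  // device, double precision
    AMGX_vector_create(&b, rsrc, AMGX_mode_dDDI);
    AMGX_vector_create(&x, rsrc, AMGX_mode_dDDI);
    AMGX_solver_create(&solver, rsrc, AMGX_mode_dDDI, cfg);

    AMGX_matrix_upload_all(A, 3, 3, 1, 1, row_ptrs, cols, vals, NULL);
    AMGX_vector_upload(b, 3, 1, rhs);
    AMGX_vector_upload(x, 3, 1, sol);

    AMGX_solver_setup(solver, A);                  // build the AMG hierarchy
    AMGX_solver_solve(solver, b, x);               // iterate to convergence
    AMGX_vector_download(x, sol);

    AMGX_solver_destroy(solver); AMGX_vector_destroy(x); AMGX_vector_destroy(b);
    AMGX_matrix_destroy(A); AMGX_resources_destroy(rsrc); AMGX_config_destroy(cfg);
    AMGX_finalize();
    return 0;
}
```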

Level: Intermediate
Type: Talk
Tags: Computational Fluid Dynamics; Supercomputing & HPC; Aerospace & Defense; Tools & Libraries

Day: Thursday, 04/07
Time: 15:30 - 15:55
Location: Marriott Salon 1

S6477 - Overview of Performance Prediction Tools for Better Development and Tuning Support

Rommel Anatoli Quintanilla Cruz Master's Student, Universidade Federal Fluminense
Rommel is an M.S. student in computer science at Universidade Federal Fluminense. He holds a bachelor's degree in computer science from Universidad Nacional de San Agustín, Perú. His research interests include parallel and concurrent computing and GPU computing. He is currently a member of the UFF Medialab under the supervision of Dr. Esteban Walter Gonzalez Clua. In August 2015, his team won the Marathon of Parallel Programming at the Regional School in High Performance Computing of the state of Rio de Janeiro (ERAD/RJ).
Esteban Clua Associate Professor, Universidade Federal Fluminense
Esteban was awarded the title of NVIDIA Fellow in 2015. He graduated in computer science from the Universidade de São Paulo, where he also earned his master's and Ph.D. degrees in computer science. Today Esteban is an associate professor of computer science at Universidade Federal Fluminense, in Rio de Janeiro, and director of UFF Medialab. He is one of the founders of SBGames, the Brazilian Symposium of Digital Entertainment and Video Games (the largest conference on the subject in South America), director of Academia of IGDA Rio, president of the Brazilian Computing Society Game Committee, and a member of the program committees of many video game conferences, such as ACM Siggraph Sandbox, IEEE Gameon, and SBC SBGames. In 2007, he received the award for the personality who most contributed to the growth of the video game industry in Brazil, and in 2009 he received the Young Scientist of the State of Rio de Janeiro prize. Esteban is the coordinator of the first Latin America NVIDIA Center of Excellence.

Explore the latest techniques in performance prediction models. In addition to the kernel analysis tools provided by NVIDIA, several other tools are available that help programmers with performance analysis and optimization of their GPGPU applications. Specifically, they are useful for detecting bottlenecks, scheduling concurrent kernels, auto-tuning, and estimating power consumption, among other tasks. We present the state of the art in performance prediction models, reviewing each of the principal techniques for performance modeling: analytical models, machine learning models, and simulation-based models.
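
As one concrete instance of the analytical-model category, consider the well-known roofline bound (our example for illustration; the survey itself may emphasize different models):

```latex
% Roofline bound: a kernel with arithmetic intensity I (flop/byte) on a device
% with peak compute P (flop/s) and peak memory bandwidth B (byte/s) attains
\mathrm{FLOPS}_{\mathrm{attainable}}(I) \;=\; \min\bigl( P,\; I \cdot B \bigr)
```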

Level: Intermediate
Type: Talk
Tags: Performance Optimization; Tools & Libraries; Supercomputing & HPC

Day: Thursday, 04/07
Time: 15:30 - 15:55
Location: Room 212B

S6513 - GPU Optimization of the Kripke Neutral-Particle Transport Mini-App

David Appelhans Post-Doctoral Researcher, IBM
Dr. Appelhans is a member of the Data Centric Systems group within IBM. He received a Ph.D. in applied math from the University of Colorado, Boulder, where he developed a new PDE solution technique targeted at the next generation of HPC machines. Prior to that, he received an M.S. in applied physics from the Colorado School of Mines, where he used quantum mechanical modeling to predict the first semiconducting allotrope of graphene. His current research interests are OpenMP4 development, optimizing GPU performance, and improving application codes for hybrid GPU/CPU systems.

For Sierra, a pre-exascale CORAL supercomputer arriving at Lawrence Livermore National Lab in 2017, neutral-particle transport codes will be a primary application and ensuring peak performance of these applications on this system (multiple IBM POWER9 CPUs + multiple Volta GPUs per node) is important. In preparation, transport mini-apps, like Kripke, are being optimized on today's hybrid CPU-GPU clusters using different programming models. This talk discusses performance issues encountered by Kripke on these systems and their solutions. Specifically we will focus on: a) a novel implementation of the sweep algorithm; b) techniques useful for modeling physical problems requiring memory footprint exceeding the aggregated GPU memory; and c) porting Kripke using OpenMP4.
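
To make the OpenMP 4 porting model concrete, a generic offload loop (an illustrative stand-in of ours, not Kripke's sweep kernel) looks like this; it requires an OpenMP 4 compiler with NVIDIA target support (e.g., clang with -fopenmp -fopenmp-targets=nvptx64, or IBM XL), and compilers without such support simply ignore the pragma and run the loop on the host.

```cuda
#include <cstdio>

int main() {
    const int n = 1 << 20;
    static float psi[1 << 20], sigma[1 << 20];
    for (int i = 0; i < n; ++i) { psi[i] = 1.0f; sigma[i] = 0.5f; }

    // Map the arrays to the device and offload the loop nest.
    #pragma omp target teams distribute parallel for map(tofrom: psi[0:n]) map(to: sigma[0:n])
    for (int i = 0; i < n; ++i)
        psi[i] *= sigma[i];             // stand-in for real transport work

    printf("psi[0] = %f\n", psi[0]);
    return 0;
}
```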

Level: All
Type: Talk
Tags: Algorithms; Computational Physics; Supercomputing & HPC

Day: Thursday, 04/07
Time: 15:30 - 15:55
Location: Marriott Salon 3

S6608 - Virtualize Linux 3D Applications with Citrix HDX 3D Pro

Vipin Borkar Principal Product Manager, Citrix Systems Inc
Vipin Borkar works as principal product manager for Citrix XenApp and XenDesktop, focused on Citrix Receiver and the Linux Virtual Desktop. His experience in the software industry is diversified and includes virtualization, enterprise mobility, CAD/CAM/CAE design, 3D graphics, Linux, and SoC.

Linux 3D applications? Are you running them on physical machines today, or using solutions that are not enterprise-ready? Citrix HDX 3D Pro for Linux allows you to virtualize your Linux 3D applications and get the benefits of industry-leading Citrix XenApp and XenDesktop technology optimized for NVIDIA GRID GPU acceleration for Linux workloads. You'll learn: (1) the current capabilities and architecture of Citrix Linux Virtual Desktop; (2) how to map the Linux Virtual Desktop to current use cases for knowledge workers and power users with HDX 3D Pro for Linux; and (3) the benefits of virtualizing Linux 3D applications without compromising on the experience.

Level: All
Type: Talk
Tags: Graphics Virtualization; Data Center & Cloud Computing; Energy Exploration

Day: Thursday, 04/07
Time: 15:30 - 15:55
Location: Marriott Salon 4

S6696 - GPU Accelerating and Deployment for Online Deep Learning Application

Songbai Pu System Engineer, Baidu
Songbai Pu is a system engineer at Baidu. He received his M.S. from the Chinese Academy of Sciences. His focus is on GPU optimization for GPU clusters, deep learning training, and online deployment.

Many of Baidu's products now use deep learning algorithms to improve quality of service. Unlike offline training, online serving is primarily concerned with response speed, reliability, and cost. Traditional online deployments are CPU-based, but CPUs are limited in compute performance and power efficiency. FPGAs offer good performance at low power, yet they cannot keep up with rapidly iterating algorithms. The GPU is a balanced solution: since GPUs are already used for deep learning training, they extend naturally to online deployment and can be conveniently re-optimized as algorithms change. Compared with an FPGA, the software development cycle is roughly 2-4 times shorter, and performance is 2-5 times higher.

Level: Intermediate
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Supercomputing & HPC

Day: Thursday, 04/07
Time: 15:30 - 15:55
Location: Room 210D

S6277 - VMware and NVIDIA 3D Virtual Desktops Reference Architecture

Matt Coppinger Director, End User Computing, VMware
Matt Coppinger is currently Director, Technical Marketing & Enablement, End User Computing at VMware. Matt has worked on desktop virtualization since its inception at VMware in 2007, first in engineering, then as a field consultant, and finally within his current role in Technical Marketing & Enablement. He has authored a number of reference architectures for VMware, including virtualizing 3D applications, and has spoken on the technical aspects of desktop virtualization at VMworld and other major conferences since 2010.

In this session we will walk you through a reference architecture for deploying 3D enabled virtual desktops using VMware Horizon and NVIDIA GRID. The session provides architecture guidance on how to properly size and design your deployment to handle your 3D workload. The presentation also covers basic benchmarks to ensure your deployment performs as expected.

Level: Intermediate
Type: Talk
Tags: Graphics Virtualization; Data Center & Cloud Computing

Day: Thursday, 04/07
Time: 16:00 - 16:50
Location: Marriott Salon 2

S6286 - Towards a Unified CPU/GPU Codebase for Linear Scaling FMM Coulomb Solver

Alberto Garcia-Garcia Computer Science Master Student, University of Alicante
Alberto Garcia is an M.S. student in automation and robotics at the University of Alicante, where he received his B.S. in computer engineering in July 2015. Alberto is a research intern at the Department of Computer Technology under the direction of Jose Garcia-Rodriguez. His research interests include 3D computer vision, parallel computing on GPUs, and high performance computing. His B.S. thesis involved the acceleration of a real-time 3D object recognition system on GPUs (NVIDIA Jetson TK1) using CUDA. Alberto is a computer graphics enthusiast and loves physics, especially how high performance computing can be applied to solve challenging physics problems.

We'll focus on how to achieve a performance-portable C++ implementation using only a single codebase that is easily executable on both CPU and GPU. For that purpose, we'll present our core algorithm, the Fast Multipole Method, embedded in a stack of abstraction layers, allowing us to achieve portability without maintaining separate kernels for each architecture. In addition, we'll review common implementation pitfalls that might help other developers aiming at a unified codebase. In particular, we investigate memory allocation, memory access, and the abstraction of SIMD/SIMT for complex user-defined data structures. Finally, we'll present results and comparisons of our linear-scaling Coulomb solver on different CPU/GPU platforms.
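
A condensed illustration of the single-codebase pattern (ours, not the authors' FMM code): write the per-element operation once as a __host__ __device__ functor, then dispatch it from either a plain CPU loop or a CUDA kernel. The talk's abstraction layers generalize this idea to allocation, access patterns, and SIMD/SIMT.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

struct Scale {                                    // the "kernel" written once
    float a;
    __host__ __device__ float operator()(float x) const { return a * x; }
};

template <typename F>
__global__ void gpu_map(F f, const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = f(in[i]);
}

template <typename F>
void cpu_map(F f, const float *in, float *out, int n) {
    for (int i = 0; i < n; ++i) out[i] = f(in[i]);  // same functor, CPU path
}

int main() {
    const int n = 1024;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    Scale f{2.0f};
    cpu_map(f, in, out, n);                           // CPU execution
    gpu_map<<<(n + 255) / 256, 256>>>(f, in, out, n); // GPU execution, same code
    cudaDeviceSynchronize();
    printf("out[10] = %f\n", out[10]);
    return 0;
}
```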

Level: Intermediate
Type: Talk
Tags: Supercomputing & HPC; Computational Physics; Algorithms

Day: Thursday, 04/07
Time: 16:00 - 16:25
Location: Room 210G

S6372 - Dominoes: Exploratory Data Analysis of Software Repositories Through GPU Processing

Jose Ricardo da Silva Junior Ph.D. Candidate, Universidade Federal Fluminense
Jose Ricardo has been a Ph.D. candidate at Universidade Federal Fluminense since 2010 under the supervision of Dr. Esteban Walter Gonzalez Clua and Leonardo Murta in the MediaLab-UFF group. He received his M.S. in computer science from the Universidade Federal Fluminense (UFF) in 2010 and a B.S. in computer science from the Universidade Estacio de Sa in 2005. Jose is also a researcher at Media Lab at UFF, which is an NVIDIA Center of Excellence. Jose has worked with real-time fluid simulation using multiple GPUs and load balancing between GPU and CPU. During his fellowship at the University of Nebraska, he started to develop Dominoes with Prof. Anita Sarma. Jose has experience in computer science with emphasis on GPGPUs, HPC, simulation, and optimization.
Esteban Clua Associate Professor, Universidade Federal Fluminense
Esteban Clua was awarded the title of NVIDIA Fellow in 2015, and is an associate professor of computer science at Universidade Federal Fluminense, in Rio de Janeiro, and director of UFF Medialab. He earned M.S. and Ph.D. degrees in computer science at the Universidade de Sao Paulo. Esteban is one of the founders of SBGames, the Brazilian Symposium of Digital Entertainment and Video Games, the largest conference on the subject in South America. He is director of Academia of IGDA Rio, president of the Brazilian Computing Society Game Committee, and a member of program committees of many video game conferences, such as ACM Siggraph Sandbox, IEEE Gameon, and SBC SBGames. In 2007, Esteban received the award for the personality who most contributed to the growth of the video game industry in Brazil, and in 2009 he received the Young Scientist of the State of Rio de Janeiro prize. He is the coordinator of the first Latin America NVIDIA Center of Excellence.

Learn how to perform data analysis over software repositories with Dominoes, a GPU-based tool. We'll give an overview and introduction of the tool and its capabilities, which provide a unified view of the underlying computational resources. Dominoes allows anyone to explore large software repositories at any granularity (files, methods, or classes) without using any programming language. Thanks to its highly parallel GPU architecture, results are processed in real time. Attendees will learn the strategy Dominoes uses to process big data on the GPU.

Level: All
Type: Talk
Tags: Big Data Analytics; Tools & Libraries

Day: Thursday, 04/07
Time: 16:00 - 16:25
Location: Room 210F

S6731 - Life and Medical Biology Data Accelerator

Guangming Tan Professor, Chinese Academy of Sciences
Guangming Tan is a professor at the Institute of Computing Technology, Chinese Academy of Sciences. His research interests include parallel programming and algorithms, domain-specific architectures, and bioinformatics. He has published more than 50 papers in venues including SC, PLDI, and TPDS. He serves as an associate editor of IEEE Transactions on Parallel and Distributed Systems.

We'll share our experience accelerating medical image processing pipelines on GPU architectures, covering both image processing algorithms and machine learning algorithms. We've developed a GPU-based platform for medical imaging big data applications.

Level: All
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Medical Imaging; Computational Biology

Day: Thursday, 04/07
Time: 16:00 - 16:25
Location: Room LL21B

S6261 - VMD+NVIDIA OptiX™: Streaming Interactive Ray Tracing from Remote GPU Clusters to Your VR Headset

John Stone Senior Research Programmer, University of Illinois at Urbana-Champaign
Highly-Rated Speaker
John Stone is a senior research programmer in the Theoretical and Computational Biophysics Group at the Beckman Institute for Advanced Science and Technology, and associate director of the NVIDIA CUDA Center of Excellence at the University of Illinois. John is the lead developer of VMD, a high-performance molecular visualization tool used by researchers all over the world. His research interests include molecular visualization, GPU computing, parallel processing, ray tracing, haptics, and virtual environments. John was named an NVIDIA CUDA Fellow in 2010. In 2015, he joined the Khronos Group Advisory Panel for the Vulkan graphics API. John also provides consulting services for projects involving computer graphics, GPU computing, and high performance computing.

Commodity head-mounted displays (HMDs) offer a tremendous opportunity to make immersive molecular visualization techniques broadly available. HMDs offer the promise of intuitive exploration of large molecular complexes and their dynamics, but their requirement for low-latency, high-frame-rate display presents a formidable challenge for high-quality remote ray tracing at distant HPC centers. We'll present a new, interactive ray-tracing system for remote visualization with HMDs, implemented within the popular molecular visualization tool VMD using a combination of interactive OptiX ray tracing, omnidirectional stereoscopic projection, H.264 video streaming, and high performance OpenGL rasterization.

Level: Intermediate
Type: Talk
Tags: Virtual Reality & Augmented Reality; Rendering & Ray Tracing; In-Situ and Scientific Visualization

Day: TBD, TBD
Time: TBD - TBD
Location: TBD

S6367 - Recurrent Models for Auditory Attention in Multi-Microphone Distance Speech Recognition

Suyoun Kim PhD Student, Carnegie Mellon University
Suyoun is currently a Ph.D. student at Carnegie Mellon University in the Department of Electrical and Computer Engineering. She graduated with an M.S. from the Language Technologies Institute, School of Computer Science at Carnegie Mellon University in 2014. Her research interests include speech recognition, deep learning, and machine learning.

In this talk, we introduce a novel 3D attention neural network for distant speech recognition with multiple microphones. The 3D attention mechanism learns which time-frequency components of which channels should receive more focus within a recurrent neural network (RNN)-based acoustic model. We will compare our 3D attention framework with traditional beamforming techniques on a distant speech recognition task.

Level: Beginner
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Signal & Audio Processing

Day: TBD, TBD
Time: TBD - TBD
Location: TBD

S6392 - Transforming The Way You Work: Mobile Enabled Workflows with 3D Virtual Workspaces

Jeff Weiss GRID Solutions Architect Manager, NVIDIA
Jeff is the GRID Solutions Architect Manager for North America, working with the Solution Architecture & Engineering team at NVIDIA. Prior to joining NVIDIA, Jeff's pedigree includes a seven-year stint at VMware as an EUC staff engineer, as well as time at Symantec and Sun Microsystems. Along with his current focus on NVIDIA GRID vGPU-enabled end-user computing, his experience includes datacenter business continuity/disaster recovery solutions, software infrastructure identity management, and email security/archiving tools.
Matt Coppinger Director, Technical Marketing & Enablement, End User Computing, VMware
Matt is currently a Director, Technical Marketing & Enablement & End User Computing at VMware. Matt has worked on desktop virtualization since its inception at VMware in 2007, first in engineering, then as a field consultant and finally within his current role in Technical Marketing & Enablement. He has authored a number of reference architectures for VMware, including virtualizing 3D applications, and has spoken on the technical aspects of desktop virtualization at VMworld and other major conferences since 2010.

A detailed look at how vGPU-enabled desktops improve end-user productivity by moving desktop compute closer to the data, and how cloud-based desktops save time and money while increasing the security of your company's IP.

Level: Intermediate
Type: Talk
Tags: Graphics Virtualization

Day: TBD, TBD
Time: TBD - TBD
Location: TBD

S6404 - Accelerating Transport System Micro-Simulations Using CUDA

Peter Heywood Ph.D. Student, The University of Sheffield
Peter Heywood is a Ph.D. student in the University of Sheffield's Department of Computer Science (an NVIDIA GPU Research Center), working on GPU-accelerated micro-simulation of transport systems; he is named lead researcher on an ongoing collaboration funded by Highways UK. Under the supervision of Dr. Paul Richmond and working closely with the Transport Systems Catapult (the UK's innovation centre for intelligent mobility), Peter is currently developing techniques to extend FLAME GPU (an open-source agent-based modelling environment) for transport systems simulation, enabling simulations of significantly greater scale and complexity than currently possible.

Discover how GPUs are being used to accelerate predictive simulations used in transport system planning and management to help alleviate the global increase in transport demand. We'll discuss the role of predictive, high-performance micro-simulations in transport management and provide insight into the development process and benchmark performance of agent-based transport models developed using FLAME GPU. We'll also describe the lessons learned in creating a virtual reality experience of real-time crowd simulation using an omnidirectional treadmill for urban planning.

Level: All
Type: Talk
Tags: Virtual Reality & Augmented Reality; Real-Time Graphics

Day: TBD, TBD
Time: TBD - TBD
Location: TBD

S6456 - Natural Human Interactivity in the World of AR & VR: A Pipe Dream or Reality?

Soulaiman Itani Founder & CTO, Atheer
Soulaiman Itani has spent his career trying to understand how the world operates and how to leverage that knowledge to improve everyday life. With that core belief he created Atheer, the goal of which is to advance human-centric computing technologies and empower users to have technology work with them in ways never thought possible only a few short years ago. His previous work includes designing cancer tests and treatments as well as creating models for robotics and unmanned aerial vehicles. He received his M.S. and Ph.D. in electrical engineering and computer science from the Massachusetts Institute of Technology.

As augmented and virtual reality get traction in enterprises, smart glasses are graduating from a tiny monocular display to a 3D immersive experience, overlaying rich contextual information right where you need it. Now the questions coming into the spotlight are "What's the optimal interaction model for the enterprise-class workflows powered by smart glasses? Can we combine hand gestures, voice, eye tracking, head motion, and contextualization to build a more intuitive and natural user interaction?" Augmented interactive reality promises to increase productivity and streamline workflows in ways that haven't been seen before.

Level: Beginner
Type: Talk
Tags: Virtual Reality & Augmented Reality; Computer Vision & Machine Vision

Day: TBD, TBD
Time: TBD - TBD
Location: TBD

S6531 - CUDA® Debugging Tools in CUDA 8

Vyas Venkataraman Engineering Manager, NVIDIA
Highly-Rated Speaker
Vyas is a software engineering manager in the Developer Tools group at NVIDIA. His team is responsible for the CUDA Debug API and cuda-memcheck. Vyas has been at NVIDIA since 2010. He received his Ph.D. in computer engineering from Boston University.

This talk will describe new features in debugging tools in the CUDA 8.0 toolkit.

Level: Intermediate
Type: Talk
Tags: Tools & Libraries

Day: TBD, TBD
Time: TBD - TBD
Location: TBD

S6587 - Lessons Learned from VR Navigation in Neurosurgery at UCLA

Moty Avisar Co-Founder and CEO, Surgical Theater
Moty Avisar is president and co-founder of Surgical Theater. He has led the company from inception to maturity, from the development of the idea and the business case through sales and revenue. He has led IP, regulatory, and financial strategies with successful execution, achieving patent approval and FDA 510(k) clearance in record time frames.
Neil Martin Chairman-Department of Neurosurgery-UCLA, UCLA Department of Neurosurgery
Dr. Neil Martin is Chair of Neurosurgery at Ronald Reagan UCLA Medical Center, and serves as Head of the Neurovascular Surgery Section. He is the Director of the UCLA Cerebral Blood Flow Laboratory, Medical Director of the UCLA Neurosurgery Intensive Care Unit and a co-Director of the UCLA Stroke Center. He is internationally known for his research regarding cerebral blood flow and brain metabolism following brain injury, and is recognized as one of the top specialists in the area of surgical treatment of cerebrovascular disease.
Alon Geri EVP Engineering & Co-Founder, Surgical Theater
Alon Geri was born and raised in Israel and currently lives in Cleveland, Ohio. A former Israeli Air Force pilot and R&D officer, he served as a senior software engineer and chief engineer of large-scale flight simulation programs for the Israeli Air Force. He holds a B.Sc. in computer science and mathematics, and is recognized as a creative, out-of-the-box problem solver.

Learn about the real-world experiences of using virtual reality in the operating room from a leading brain surgeon. Dr. Neil Martin, chairman of Neurosurgery at UCLA, will share his insights and vision of how using virtual reality and enhanced 3D imaging is transforming complex surgery. Joining Dr. Martin will be Moty Avisar, co-founder and CEO of Surgical Theater, who will share how his company has transferred the science behind advanced flight simulation into a first-of-its-kind healthcare solution.

Level: Beginner
Type: Talk
Tags: Virtual Reality & Augmented Reality; Medical Imaging

Day: TBD, TBD
Time: TBD - TBD
Location: TBD

S6653 - Bilateral and Trilateral Filters in Stereo Vision

Ryan Beethe Student, Colorado School of Mines
Ryan Beethe is a student at the Colorado School of Mines who loves robots. His recent research focus has been embedded stereo vision, and he was last year's NVIDIA CUDA Vision Challenge winner for his work on GPU-accelerated semi-global block matching.

We'll explain how bilateral and trilateral filters work and how they are used in modern stereo vision algorithms. Bilateral filters are the basis of many of the fastest, most effective local algorithms available today; local algorithms are especially desirable because they are easily parallelizable. Topics covered with OpenCV-based, GPU-accelerated examples will include stereo vision basics, motivations for applying bilateral filters, pre- and post-processing stereo images with bilateral filters, and limitations of bilateral filters in stereo vision. Additional topics will include bilateral-based adaptive support weight (ASW) correspondence searching, trilateral filters, and trilateral-based ASW correspondence searching.
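
As background for the talk, here is a compact bilateral filter kernel for grayscale images (our sketch, not the presenter's OpenCV-based examples): each neighbor's weight falls off with both spatial distance and intensity difference, so noise is smoothed while edges, and hence depth discontinuities, are preserved.

```cuda
#include <cuda_runtime.h>

__global__ void bilateral(const float *in, float *out, int w, int h,
                          int radius, float sigma_s, float sigma_r) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float center = in[y * w + x], sum = 0.0f, norm = 0.0f;
    for (int dy = -radius; dy <= radius; ++dy)
        for (int dx = -radius; dx <= radius; ++dx) {
            int nx = min(max(x + dx, 0), w - 1);   // clamp at image borders
            int ny = min(max(y + dy, 0), h - 1);
            float v  = in[ny * w + nx];
            float ws = expf(-(dx * dx + dy * dy) / (2.0f * sigma_s * sigma_s));
            float wr = expf(-(v - center) * (v - center) / (2.0f * sigma_r * sigma_r));
            sum  += ws * wr * v;                   // spatial weight times range weight
            norm += ws * wr;
        }
    out[y * w + x] = sum / norm;
}

int main() {
    const int w = 64, h = 64;
    float *img, *out;
    cudaMallocManaged(&img, w * h * sizeof(float));
    cudaMallocManaged(&out, w * h * sizeof(float));
    for (int i = 0; i < w * h; ++i) img[i] = (i % w < w / 2) ? 0.f : 1.f; // vertical edge
    dim3 blk(16, 16), grd((w + 15) / 16, (h + 15) / 16);
    bilateral<<<grd, blk>>>(img, out, w, h, 3, 2.0f, 0.1f);
    cudaDeviceSynchronize();
    return 0;
}
```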

Level: Beginner
Type: Talk
Tags: Computer Vision & Machine Vision; Embedded; IoT

Day: TBD, TBD
Time: TBD - TBD
Location: TBD

S6677 - Photogrammetry and Virtual Reality: Transporting Real Sites into VR

David Finsterwalder CTO, realities.io
Working in archeology, David gained profound knowledge of 3D reconstruction through LiDAR scanning and UAV and ground photogrammetry. Driven by the question of how to unlock the full potential of the gathered data for research and public relations alike, he started to look into the use of game engines and VR HMDs for visualization. Amazed by the possibilities of scene reconstruction and VR HMDs, not only for archeology but also for virtual tourism and gaming, he decided to dedicate all his efforts to immersive technology and founded realities.io, which, among other projects, has visualized 3D scans of two archeological excavations and a medieval castle.
Daniel Sproll CXO, realities.io
With a background in cognitive science and virtual reality psychology research, Daniel used the latest VR renaissance to dive into the field of VR user experience design. His design process combines perceptual psychology with rapid prototyping to explore the huge uncharted territory that is VR interaction and UX design. His special interest lies in applications beyond gaming, from interactive storytelling to data visualization. Daniel is a co-founder and CXO at realities.io.

In this session, we will give a detailed overview of the applications of photogrammetry in the new medium of virtual reality and offer insight into the workflow involved. Photogrammetry allows the 1:1 visualization of real-world sites in a virtual environment, but it poses unique challenges both for content creators and for GPU hardware. Photogrammetric reconstruction and texturing are computationally very intensive, and thus profit heavily from the parallel processing provided by software layers like CUDA. Attend this talk to learn about specialized software, workflow optimizations, and the unique applications that photogrammetry has in virtual reality and beyond.

Level: All
Type: Talk
Tags: Virtual Reality & Augmented Reality

Day: TBD, TBD
Time: TBD - TBD
Location: TBD

S6680 - Optimizing Application Performance with CUDA® Profiling Tools

Swapna Matwankar Senior Engineer, NVIDIA
Swapna is a senior engineer on NVIDIA's Developer Tools team focusing on performance analysis. Earlier she worked on NVIDIA's graphics driver team. Before joining NVIDIA, Swapna worked in the embedded multimedia domain, where she built and optimized video codecs on ARM and DSP-based platforms and integrated them into various multimedia frameworks.

This session will provide a step-by-step walkthrough of new features added in the NVIDIA Visual Profiler and nvprof. It will show how these profiling tools can be used to identify optimization opportunities at the application, kernel, and source-line levels.
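
One complementary technique worth knowing (our suggestion; the session itself covers the profilers' built-in features): annotate application phases with NVTX ranges so they appear as named regions on the Visual Profiler timeline.

```cuda
// Build with: nvcc app.cu -lnvToolsExt
#include <cuda_runtime.h>
#include <nvToolsExt.h>

__global__ void busy(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 0.5f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMalloc((void **)&x, n * sizeof(float));

    nvtxRangePushA("compute phase");       // named region in the profiler timeline
    busy<<<(n + 255) / 256, 256>>>(x, n);
    cudaDeviceSynchronize();
    nvtxRangePop();

    cudaFree(x);
    return 0;
}
```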

Level: Intermediate
Type: Talk
Tags: Tools & Libraries; Performance Optimization

Day: TBD, TBD
Time: TBD - TBD
Location: TBD

S6700 - Write Once, Parallel Everywhere: OpenACC for GPUs, x86, OpenPOWER, and Beyond

Michael Wolfe Compiler Engineer, NVIDIA / PGI
Highly-Rated Speaker
Michael Wolfe is a PGI compiler engineer at NVIDIA. He has over 35 years of experience developing optimizing compilers for parallel computing.

Performance portability means the ability to write a single program that runs with high performance across a wide range of target systems, including multicore systems, GPU-accelerated systems, and manycore systems, independent of the instruction set. It's not a "myth" or a "dream," as has been claimed recently. It should be demanded by developers and expected from any modern high level parallel programming language. OpenACC was designed five years ago with broad cross-platform performance portability in mind. The current PGI compiler suite delivers on this promise. Come hear about the current capabilities and performance of PGI OpenACC on GPUs, x86 and OpenPOWER, and learn about our plans for new features and even wider platform support.
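
The flavor of that single-source claim fits in a few lines. This generic example (ours, not from the talk) is a loop that, as we understand the PGI toolchain, can be built for multicore x86 with -ta=multicore or for NVIDIA GPUs with -ta=tesla; compilers without OpenACC support simply ignore the pragma and run the loop sequentially on the host.

```cuda
#include <cstdio>

int main() {
    const int n = 1 << 20;
    static float x[1 << 20], y[1 << 20];
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // One annotated loop, many targets: the compiler picks the parallelization.
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);
    return 0;
}
```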

Level: All
Type: Talk
Tags: Programming Languages; OpenACC; Supercomputing & HPC

Day: TBD, TBD
Time: TBD - TBD
Location: TBD

S6713 - Large Scale Video Processing for Virtual Reality

Arthur van Hoff CTO, Jaunt VR
Arthur van Hoff is a serial entrepreneur and was most recently CTO at Flipboard. He started his career in Silicon Valley at Sun Microsystems where he was an early developer of the Java programming language. Since then he has started several successful companies including Marimba (IPO 1999), Strangeberry (acquired by TiVo), ZING (acquired by Dell), and Ellerdale (acquired by Flipboard). Arthur has expertise in machine learning, big data, mobile applications, 3D printing, and computational photography. He is originally from the Netherlands and has a master's degree in Computer Science from Strathclyde University in Glasgow.

Jaunt VR has developed a GPU-based, large-scale video processing platform to combine multiple HD camera streams in a radial configuration into seamlessly stitched stereoscopic spherical panoramas. The approach uses complex computational photography algorithms that require sharded processing of the data across hundreds of cloud-based GPU instances.

Level: All
Type: Talk
Tags: Virtual Reality & Augmented Reality; Computer Vision & Machine Vision; Video & Image Processing

Day: TBD, TBD
Time: TBD - TBD
Location: TBD

S6720 - Live Video Streaming in VR

Nicolas Burtey CEO, VideoStitch
Nicolas Burtey is Founder and CEO of VideoStitch and has worked in the VR and 360 industry for more than twelve years.

Learn how to stream live VR video from capture to the HMD, and all the processing steps in between.

Level: Beginner
Type: Talk
Tags: Virtual Reality & Augmented Reality; Video & Image Processing

Day: TBD, TBD
Time: TBD - TBD
Location: TBD

S6725 - RealityCapture: A New Software for VR Content Creation

Michal Jancosek Managing Partner, Capturing Reality
Michal Jancosek is a managing partner and co-founder of Capturing Reality. He was a part of several EU research projects, including COSPAL, DIRAC, ProVisG, and PRoViScout. His main research is in 3D reconstruction from images. Michal is the author of the non-commercial 3D reconstruction software CMPMVS. He has a Ph.D. from the Center for Machine Perception, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague.

Creating realistic 3D content for virtual reality worlds is time-consuming and difficult using traditional methods. Our software allows our customers to drastically reduce the time and difficulty. GPU processing is one of the tools that allows our software to push the limits. We'll show results from our customers and describe the basic parts of our pipeline, providing statistics with a main focus on GPUs.

Level: All
Type: Talk
Tags: Virtual Reality & Augmented Reality; Media & Entertainment

Day: TBD, TBD
Time: TBD - TBD
Location: TBD

S6727 - Applications of Eye Tracking in Virtual Reality

Tom Sengelaub Manager Solution Delivery, SensoMotoric Instruments (SMI)
Tom Sengelaub has been developing eye tracking algorithms and solutions for SensoMotoric Instruments (SMI) for seven years. SMI has more than 20 years of history developing scientific-grade eye tracking solutions for medical, research, and applied markets. Tom applied SMI's experience to virtual reality by leading the development of SMI's eye tracking upgrades for both the DK1 and DK2. In his current position as manager of solutions delivery, he is pioneering the application of eye tracking in AR and VR.

Eye tracking is a hot topic in VR, but what is all the fuss about? SMI will present how eye tracking can be used to personalize the 3D experience and how the point of gaze revolutionizes interaction with a virtual world. The power of eye tracking is not limited to interaction alone: in the long run, eye tracking can make logins obsolete and, via foveated rendering, make high-resolution displays possible in HMDs. SMI will outline how this can be done and present both the state of the art and a leap into the future of eye tracking in VR.

Level: All
Type: Talk
Tags: Virtual Reality & Augmented Reality; Performance Optimization

Day: TBD, TBD
Time: TBD - TBD
Location: TBD

S6732 - RealityCapture: A New Software for VR Content Creation

Michal Jancosek Managing Partner, Capturing Reality
Michal Jancosek is a managing partner and co-founder of Capturing Reality. He was a part of several EU research projects, including COSPAL, DIRAC, ProVisG, and PRoViScout. His main research is in 3D reconstruction from images. Michal is the author of the non-commercial 3D reconstruction software CMPMVS. He has a Ph.D. from the Center for Machine Perception, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague.

Creating realistic 3D content for virtual reality worlds is time-consuming and difficult using traditional methods. Our software allows our customers to drastically reduce the time and difficulty. GPU processing is one of the tools that allows our software to push the limits. We'll show results from our customers and describe the basic parts of our pipeline, providing statistics with a main focus on GPUs.

Level: All
Type: Talk
Tags: Virtual Reality & Augmented Reality; Media & Entertainment

Day: TBD, TBD
Time: TBD - TBD
Location: TBD

S6737 - Applications of Eye Tracking in Virtual Reality

Tom Sengelaub Manager Solution Delivery, SensoMotoric Instruments (SMI)
Tom Sengelaub has been developing eye tracking algorithms and solutions for SensoMotoric Instruments (SMI) for seven years. SMI has more than 20 years of history developing scientific-grade eye tracking solutions for medical, research, and applied markets. Tom applied SMI's experience to virtual reality by leading the development of SMI's eye tracking upgrades for both the DK1 and DK2. In his current position as manager of solutions delivery, he is pioneering the application of eye tracking in AR and VR.

Eye tracking is a hot topic in VR, but what is all the fuss about? SMI will present how eye tracking can be used to personalize the 3D experience and how the point of gaze revolutionizes interaction with a virtual world. The power of eye tracking is not limited to interaction alone: in the long run, eye tracking can make logins obsolete and, via foveated rendering, make high-resolution displays possible in HMDs. SMI will outline how this can be done and present both the state of the art and a leap into the future of eye tracking in VR.

Level: All
Type: Talk
Tags: Virtual Reality & Augmented Reality; Performance Optimization

Day: TBD, TBD
Time: TBD - TBD
Location: TBD

S6743 - Large Scale Video Processing for Virtual Reality

Arthur van Hoff CTO, Jaunt VR
Arthur van Hoff is a serial entrepreneur and was most recently CTO at Flipboard. He started his career in Silicon Valley at Sun Microsystems where he was an early developer of the Java programming language. Since then he has started several successful companies including Marimba (IPO 1999), Strangeberry (acquired by TiVo), ZING (acquired by Dell), and Ellerdale (acquired by Flipboard). Arthur has expertise in machine learning, big data, mobile applications, 3D printing, and computational photography. He is originally from the Netherlands and has a master's degree in Computer Science from Strathclyde University in Glasgow.

Jaunt VR has developed a GPU-based, large-scale video processing platform to combine multiple HD camera streams in a radial configuration into seamlessly stitched stereoscopic spherical panoramas. The approach uses complex computational photography algorithms that require sharded processing of the data across hundreds of cloud-based GPU instances.

Level: All
Type: Talk
Tags: Virtual Reality & Augmented Reality; Computer Vision & Machine Vision; Video & Image Processing

Day: TBD, TBD
Time: TBD - TBD
Location: TBD

S6744 - Solving One of the Hardest Challenges in Virtual Reality: Human Eyes

Jeroen Snepvangers Co-Founder, Transfolio, LLC
Jeroen Snepvangers has been a leader in enterprise innovation using Virtual Reality, 3D Visualization and Digital Media for the past 13 years, beginning with founding RTT USA (acquired by Dassault 3DExcite in 2014). He served as North American CEO for 9 years, helping RTT grow from $10M to $100M globally. While with RTT, Jeroen serviced a long list of major automotive clients (GM, Toyota, Lexus, Nissan, Infiniti, VW, Audi, Porsche) as well as established consumer brands (The North Face, Coach, Adidas, Vans, Under Armour, HP, Beats by Dr. Dre). He led award-winning and forward-thinking virtual reality, augmented reality, interactive and mobile campaigns for these brands. Since 2012, Jeroen is an independent consultant, where he advises executives and investors at technology companies, consumer product companies and digital agencies. Jeroen also serves as non-executive board member at Mackevision GmbH, an Emmy award winning 3D VFX and automotive visualization company.

We address the numerous challenges of visualizing human eyes, and we explain why this is so vitally important to the VR community. The potential is massive: if you can capture and visualize anyone's eyes quickly and simply, it opens VR to all types of human interaction applications, which is usually the largest market for any consumer technology. However, digitizing eyes is hard. We will consider several current attempts to solve this problem and their differing approaches. We also take a detailed look at the science and technology behind one of the most promising solutions, a double-projection Moiré profilometer with an accuracy of 300,000 measured points per inch. It raises the question: how long until your next selfie is a fully animated VR avatar?

Level: All
Type: Talk
Tags: Virtual Reality & Augmented Reality; Media & Entertainment

Day: TBD, TBD
Time: TBD - TBD
Location: TBD

S6752 - Sports Training and Virtual Reality: Challenges in Making the Physical, Virtual

Brendan Reilly CEO, EON Sports VR
Brendan Reilly is the CEO and co-founder of EON Sports VR. A former student assistant for Bill Self at the University of Kansas, he went on to work as an administrative assistant for Tim Jankovich at Illinois State's men's basketball team. While on staff at ISU, he became focused on virtual reality. While working out of the proverbial garage for about a year, he refined what the ideal virtual reality training program would look like, and with no formal business or computer science background, he convinced the executives at one of the world's leading providers of virtual and augmented reality, EON Reality, to join forces with him. Brendan has worked in the virtual and augmented reality market since joining EON Reality in 2011 and is now recognized as one of the leaders not only in innovative sports training, but in the virtual reality and augmented reality industry.
Nils Andersson CTO, EON Reality, Inc.
As CTO, Nils Andersson is responsible for overseeing the direction of EON Reality's innovative virtual reality technologies globally. Nils started his career in 1991 at SAAB SPACE, working as an electronics hardware engineer for the European Space Station. Shortly after, he began his 20+ year career as a software engineer at Enera, where he worked on GPS-based tracking of vehicles from 1992 to 1995. Nils became CTO at EON Reality in 1999 and currently leads the company's technology into the future. Using agile development methods, Nils leads the core development team to efficiently develop future products for the interactive 3D market. Nils and his team have developed strong, collaborative relations with technology partners Texas Instruments, NVIDIA, and Vicon to ensure smooth integration and delivery of EON Reality's newest technology. Nils received his master's degree in electrical engineering, specializing in computer engineering and science, at Chalmers University of Technology in Gothenburg, Sweden.

Using Virtual Reality, EON Sports VR has developed applications to bring specialized sports training to athletes, whether professionals or amateurs. The goal is to leverage Virtual Reality technology to provide realistic repetitions at game speed to improve on-field decision making. In doing this, the EON Reality and EON Sports VR development teams encountered specific challenges related to translating the training for American Football, Baseball, and Soccer to a Virtual Environment.

Level: All
Type: Talk
Tags: Virtual Reality & Augmented Reality; Education & Training; General Interest

Day: TBD, TBD
Time: TBD - TBD
Location: TBD

S6753 - How to Draw Living Virtual Reality Worlds

Beck Besecker CEO, Co-Founder, Marxent
Beck Besecker is co-founder and CEO of Marxent (marxent.com), the leader in enterprise Virtual Reality and Augmented Reality solutions for retailers and manufacturers. Prior to founding Marxent, Beck spent 13 years building interactive marketing solutions for Fortune 500 retailers and brands, including Target Stores and Tesco. In 1999, he founded Copient Technologies which enabled large retailers to easily manage personalized promotions in-store and online. Copient was acquired by NCR in 2003. Beck then served as EVP of New Business at Catalina Marketing, the nation's largest in-store promotional network.

Imagine creating and populating an endless series of living, breathing 360-degree worlds with 3D products. Empowered to change features, textures and objects in real time, you can immediately view each freshly designed creation in 3D Virtual Reality on an Oculus Rift or in Augmented Reality on an iPad. Using the VisualCommerce(TM)-powered Lowe's Holoroom as an example, we'll share three tips for taking full advantage of NVIDIA graphics processing power to calculate and render VR-ready 3D graphics to create the ultimate real-time Virtual Reality experience.

Level: Intermediate
Type: Talk
Tags: Virtual Reality & Augmented Reality; Computer Vision & Machine Vision; Graphics Virtualization

Day: TBD, TBD
Time: TBD - TBD
Location: TBD

S6762 - Reimagining Cartography for Navigation

Eric Gundersen CEO, Mapbox
As CEO of Mapbox, Eric Gundersen coordinates product and business development. Eric has been with the team since the start and helped grow Mapbox out of the need for a better mapping platform. The platform now powers maps for NVIDIA, Foursquare, Pinterest, MapQuest, CNN, and thousands more. Mapbox has over 130 people worldwide, with main offices in Washington, DC, and San Francisco.

Attendees will walk away with an appreciation for how modern computing power and GPUs are enabling a whole new world of map design potential for the car. Vector-based maps can render data on the fly at 60 fps, taking in-car map design to a more video game-like state. The driving experience can be seamless across devices and tailored to exactly what a user needs for any specific use case.

Level: Beginner
Type: Talk
Tags: Self-Driving Cars & Automotive ; Embedded; Real-Time Graphics

Day: TBD, TBD
Time: TBD - TBD
Location: TBD

S6763 - Analyzing Videos for Creating Personalized Viewing Experiences

Omar Javed Chief Scientist, ClipMine, Inc.
Omar Javed is ClipMine's chief scientist. His interests include video event understanding, multimedia indexing, object tracking, and online machine learning. Prior to joining ClipMine, he was a principal scientist and technology leader at SRI International. Omar was an associate editor of the Machine Vision and Applications journal from 2008 to 2015 and was an area chair for CVPR 2008. His research article "Object Tracking: A Survey" was ranked #1 on ACM's list of most popular magazine and computing survey articles in 2007. His article "Modeling Inter-Camera Space-Time and Appearance Relationships for Tracking across Non-Overlapping Views" was listed among the top-10 most-cited papers in the Computer Vision and Image Understanding journal for papers published from 2007 to 2011.

Our session will focus on video analysis for rapid media personalization. In recent years, viewership of online video content has increased exponentially. However, current video players still support viewing in much the same way analog videos were viewed on VCRs: play from start to end, with options to pause, fast-forward, and rewind. We have developed an automatic video processing system that rapidly analyzes videos on demand to generate tables of contents (ToCs) and personalized storyboards for video navigation. We will discuss the advantages of using GPUs for video analytics and how GPUs enabled us to generate ToCs of hour-long HD videos in seconds.

Level: All
Type: Talk
Tags: Video & Image Processing

Day: TBD, TBD
Time: TBD - TBD
Location: TBD

S6768 - The Audi VR Experience – A Look into the Future of Digital Retail

Marcus Kuehne Project Lead Audi VR experience, Audi AG
Marcus Kuehne is a progressive mind who has brought several innovations to the car industry during his career. He studied interface design and started out in Audi product marketing. Marcus then moved to electronics development, where he was responsible for the development and market introduction of MMI touch, the first fully integrated touchpad-based car HMI. In 2013, Marcus returned to the marketing and sales department and took over the project lead for the Audi VR Experience. For a real VR enthusiast like him, it is a dream fulfilled to realize one of the most ambitious and complex VR industry applications.
Thomas Zuchtriegel Team Lead Digital Retail Solutions, Audi AG
Thomas Zuchtriegel specializes in leading cross-functional teams to create amazing experiences using disruptive technology, e.g., the world's first digital showroom, Audi City, a groundbreaking digital retail experience. He studied digital filmmaking at Middlesex University in London and has worked as a creative, technical, and managing consultant for many automotive OEMs. Since 2012, Thomas has led the Audi City project team within Audi Digital Retail and is now working with Marcus on the Audi VR Experience.

We'll give an insight into the philosophy behind the Audi VR Experience and share that experience with you, covering the challenges as well as the lessons learned from creating it. We'll explain why Audi is an attractive industry partner for VR technology and content companies, with a special focus on visual performance. Darren Jobling, CEO of project partner Zerolight, will join us to explain how Zerolight achieved the VR visual performance defined by Audi.

Level: All
Type: Talk
Tags: Virtual Reality & Augmented Reality; Self-Driving Cars & Automotive ; Product & Building Design

Day: TBD, TBD
Time: TBD - TBD
Location: TBD

S6769 - Real Time Dual Camera Spectral Imaging Based on NVIDIA Tegra SoC to Assess UAV Missions

Michele Moscaritolo Head of Vision, Aerialtronics
Michele Moscaritolo is head of vision for embedded systems in the research and development department at Aerialtronics, based in the Netherlands. Michele holds a Ph.D. in physics, a master's degree in computer science engineering, and a Ph.D. in biomedical engineering. His interests include physical techniques applied to medicine, computational methods, mechanical engineering, and robotics in a multitude of diverse applications.
Alessandro Della Villa Senior System Architect , Commprove
Alessandro Della Villa works and lives in Florence, Italy. He has a M.Sc. degree in telecommunication engineering and a Ph.D. in applied electromagnetics. He works as a senior systems architect at Commprove.
Giacomo Benelli Senior Product Architect , Commprove
Giacomo Benelli works as a senior product architect at Commprove, handling the company's product roadmap. His interests include C++ and Python programming.

We'll demonstrate that it is possible to process and combine dual spectral information (visible and IR) in real time on a UAV equipped with an NVIDIA Tegra SoC.

Level: All
Type: Talk
Tags: Robotics & Autonomous Machines

Day: TBD, TBD
Time: TBD - TBD
Location: TBD


TUTORIAL

Presentation
Details

S6111 - NVIDIA CUDA® Optimization with NVIDIA Nsight™ Eclipse Edition: A Case Study

Christoph Angerer DevTech Compute, NVIDIA
Christoph Angerer is a developer in NVIDIA's European Developer Technology team. Based in Munich, Germany, he works with developers accelerating applications on GPUs. He holds a Ph.D. in computer science from ETH Zurich in Switzerland.

We'll present a real CUDA application and use NVIDIA Nsight Eclipse Edition on Linux to optimize its performance. Attendees will learn a method for analyzing their code and how to use the tools to put those ideas into practice.

Level: Intermediate
Type: Tutorial
Tags: Performance Optimization; Tools & Libraries

Day: Monday, 04/04
Time: 09:00 - 10:20
Location: Room 211A

S6160 - Best Practices in GPU-Based Video Processing

Thomas True Senior Applied Engineer for Professional Video and Image Processing, NVIDIA
Tom True is a senior applied engineer at NVIDIA, where he works with developers to optimize the design and implementation of GPU-based professional video broadcast, digital post-production, and large-scale multi-GPU, multi-display visualization systems. Tom has a Bachelor of Science degree from the Rochester Institute of Technology and a Master of Science degree from the graphics lab at Brown University. Prior to joining NVIDIA, he was an application engineer at Silicon Graphics.

This session will explore best practices and techniques for developing efficient GPU-based video and image processing applications. Topics include threading models for efficient parallelism, CPU affinity to optimize system memory and GPU locality, image segmentation for overlapped asynchronous transfers, optimal memory usage strategies to reduce expensive data movement, and image format considerations to reduce or eliminate data conversions. Single- and multi-GPU systems for uncompressed real-time 4K video capture, processing, display, and play-out will be considered. Takeaways should prove applicable to developers of video broadcast and digital post-production systems as well as to developers of large-scale visualization systems that require video ingest.
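
The abstract lists these techniques without code. As a minimal sketch of one of them, overlapped asynchronous transfers via image segmentation, the following host code splits a frame into slices and issues each slice's copies on its own CUDA stream, so the transfer of one slice can overlap work on the others. The per-slice processing kernel is elided as a comment, and the frame size, slice count, and pixel format are illustrative assumptions.

```cpp
// Sketch: overlapped asynchronous transfers via image segmentation.
// Each slice of the frame gets its own CUDA stream, so the upload of one
// slice can overlap the (elided) processing and download of another.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int slices = 4;
    const size_t frameBytes = size_t(3840) * 2160 * 4;   // one 4K RGBA8 frame
    const size_t sliceBytes = frameBytes / slices;

    unsigned char *hostFrame = nullptr, *deviceFrame = nullptr;
    cudaMallocHost((void**)&hostFrame, frameBytes);  // pinned memory: required for true async copies
    cudaMalloc((void**)&deviceFrame, frameBytes);

    cudaStream_t streams[slices];
    for (int i = 0; i < slices; ++i) cudaStreamCreate(&streams[i]);

    for (int i = 0; i < slices; ++i) {
        const size_t off = i * sliceBytes;
        cudaMemcpyAsync(deviceFrame + off, hostFrame + off, sliceBytes,
                        cudaMemcpyHostToDevice, streams[i]);  // upload slice i
        // processKernel<<<grid, block, 0, streams[i]>>>(deviceFrame + off, ...);
        cudaMemcpyAsync(hostFrame + off, deviceFrame + off, sliceBytes,
                        cudaMemcpyDeviceToHost, streams[i]);  // download result
    }
    cudaDeviceSynchronize();  // wait for all streams to drain

    for (int i = 0; i < slices; ++i) cudaStreamDestroy(streams[i]);
    cudaFree(deviceFrame);
    cudaFreeHost(hostFrame);
    return 0;
}
```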

Level: Advanced
Type: Tutorial
Tags: Media & Entertainment; Video & Image Processing; Large Scale and Multi-Display Visualization

Day: Monday, 04/04
Time: 09:00 - 10:20
Location: Room 210F

S6571 - Advanced Rendering Solutions from NVIDIA

Phillip Miller Senior Director, Advanced Rendering Products, NVIDIA
Highly-Rated Speaker
Phillip Miller is senior director of NVIDIA's commercial advanced rendering offerings, ranging from the Iray and mental ray renderers shipping within leading design and entertainment products to the IndeX technology used in large-data visualization. Phil has been with NVIDIA for seven years and has led major software products for over 20 years, including the entertainment offerings at Autodesk and the web design product line at Adobe. He holds a Master of Architecture from the University of Illinois and is a registered architect.

Learn about the latest breakthroughs and offerings in NVIDIA's Advanced Rendering Solutions, which scale smoothly from local GPU rendering to remote supercomputer clusters. We'll explore and demonstrate new capabilities and possibilities in NVIDIA® Iray® and mental ray®. Plus, we'll see what's possible with the latest in NVIDIA OptiX™ for accelerating custom ray-tracing development, and in NVIDIA IndeX™ for in-situ large data visualization. Industry trends and production examples will also be explored as both interactive and production rendering possibilities continue to revolutionize workflows.

Level: Beginner
Type: Tutorial
Tags: Rendering & Ray Tracing; Tools & Libraries; In-Situ and Scientific Visualization; Product & Building Design

Day: Monday, 04/04
Time: 09:00 - 09:50
Location: Room 210E

S6595 - Benchmarking Graphics Intensive Application on VMware Horizon 6 Using NVIDIA GRID™ vGPUs

Manvender Rawat Performance Engineer, NVIDIA GRID, NVIDIA
Manvender Rawat is part of the NVIDIA GRID performance engineering team and is responsible for measuring and validating the performance and scalability delivered by the GRID platform (GRID vGPU software running on GRID GPUs) across all enterprise virtualization platforms.
Lan Vu Performance Engineer, VMware Inc.
Lan Vu works on performance engineering at VMware, focusing on optimizing the performance and scalability of virtual desktop infrastructure (VDI) solutions, including 3D graphics, Horizon View, Horizon Air, and DaaS. Prior to joining VMware, she worked at the Parallel Distributed System Labs at the University of Colorado Denver, with a research focus on high-performance methods in data mining. She holds a Ph.D. in computer science and information systems from the University of Colorado Denver.

We'll provide a technical deep dive into performance characteristics, application tuning options, as well as hardware and platform design considerations for administering 3D graphics-intensive applications in VMware Horizon environments. We'll review various application performance benchmark results from NVIDIA and VMware joint testing of NVIDIA GRID vGPUs (K2 and M60). These results will showcase scalability with different hardware and software configurations. We'll also discuss the considerations required for choosing the correct NVIDIA GRID vGPU profile to deliver a great user experience in VMware Horizon environments.

Level: Intermediate
Type: Tutorial
Tags: Graphics Virtualization; Performance Optimization

Day: Monday, 04/04
Time: 09:00 - 10:20
Location: Room 210G

S6312 - Sharing Physically Based Materials Between Renderers with MDL

Jan Jordan Software Product Manager MDL, NVIDIA
Jan Jordan is the product manager for the NVIDIA Material Definition Language. He is a graduate engineer in applied computer science from the Fachhochschule fur Wirtschaft und Technik Berlin, Germany, and holds a B.S. in computer science from the RTC Galway, Ireland. Before joining NVIDIA, his diverse work experience ranged from research on practical VR applications to working as an art director in computer games. He is a long-time member of NVIDIA's Advanced Rendering team, where his focus has been on enabling material workflows across many different applications and renderers.
Lutz Kettner Senior Manager, Advanced Rendering and Materials, NVIDIA
Lutz Kettner leads the design and engineering efforts for the Material Definition Language, MDL, and the Iray renderer from the NVIDIA Advanced Rendering Center. He has been working on leading software products in advanced rendering, language design, API design, and geometry for 19 years. He is known for his influential work on the open source Computational Geometry Algorithms Library CGAL. He holds a Ph.D. in computer science from ETH Zurich, Switzerland, worked as a researcher at the University of North Carolina at Chapel Hill, and led a research group at the Max-Planck-Institute in Saarbrucken, Germany. He has served on the ISO and ECMA standardization committees.

We'll cover the basics of NVIDIA's Material Definition Language (MDL), showing how a single material can define matching appearances across different renderers and rendering techniques. End users will learn how physically based materials can be defined, while developers will learn what is entailed in supporting MDL within their own product or renderer.

Level: All
Type: Tutorial
Tags: Rendering & Ray Tracing; Media & Entertainment; Product & Building Design

Day: Monday, 04/04
Time: 10:00 - 10:50
Location: Room 210E

S6112 - NVIDIA CUDA® Optimization with NVIDIA Nsight™ Visual Studio Edition: A Case Study

Christoph Angerer DevTech Compute, NVIDIA
Christoph Angerer is a developer in NVIDIA's European Developer Technology team. Based in Munich, Germany, he works with developers accelerating applications on GPUs. He holds a Ph.D. in computer science from ETH Zurich in Switzerland.

We'll present a real CUDA application and use NVIDIA Nsight Visual Studio Edition to optimize its performance. Attendees will learn a method for analyzing their code and how to use the tools to put those ideas into practice.

Level: Intermediate
Type: Tutorial
Tags: Performance Optimization; Tools & Libraries

Day: Monday, 04/04
Time: 10:30 - 11:50
Location: Room 211A

S6307 - Get to Know the NVIDIA GRID™ SDK

Shounak Deshpande System Software Engineer, Sr., NVIDIA
Shounak Deshpande started as a developer in the video software group at NVIDIA in 2007. Since then, he has worked on technologies such as NVIDIA Optimus, the NVENC API, and the GRID SDK. In his current position as technical lead for GRID SDK development, he strives to make the software clean, efficient, and user-friendly.

This session will introduce the NVIDIA GRID SDK to developers who want to build high-performance desktop application streaming software. The GRID SDK harnesses the power of NVIDIA GPUs for fast capture and efficient hardware-based video encoding. We'll begin with an introduction to the GRID SDK, provide an overview of its APIs (NVFBC, NVIDIA FrameBuffer Capture, and NVIFR, NVIDIA In-band Frame Readback), and review the GRID SDK roadmap. The remainder of the session will cover common issues faced by developers, including getting the most out of NVFBC and the NVENC hardware encoder for desktop streaming, measuring NVIFR performance, and multithreading with NVIFR.
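
To make the capture-then-encode flow concrete, here is a hypothetical sketch of the streaming loop such an application runs. grabFrame, encodeFrame, and send are illustrative stubs, not GRID SDK entry points; in a real application, NVFBC captures the desktop directly into GPU memory and NVENC encodes it there, so only the compressed bitstream ever crosses back to the CPU.

```cpp
// Hypothetical sketch of the desktop-streaming loop the GRID SDK enables.
// grabFrame(), encodeFrame(), and send() are illustrative stubs, NOT GRID
// SDK entry points; consult the GRID SDK documentation for the real APIs.
#include <cstdint>
#include <cstdio>
#include <vector>

struct Frame  { std::vector<uint8_t> pixels; };  // stand-in for a captured GPU surface
struct Packet { std::vector<uint8_t> bits; };    // stand-in for an encoded bitstream

Frame  grabFrame()               { return Frame{}; }   // placeholder for NVFBC capture
Packet encodeFrame(const Frame&) { return Packet{}; }  // placeholder for NVENC encode
void   send(const Packet& p)     { std::printf("sent %zu bytes\n", p.bits.size()); }

int main() {
    for (int i = 0; i < 60; ++i) {   // one second of frames at 60 fps
        Frame  f = grabFrame();      // capture: desktop -> GPU surface
        Packet p = encodeFrame(f);   // encode:  GPU surface -> H.264 bitstream
        send(p);                     // stream:  bitstream -> remote client
    }
    return 0;
}
```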

Level: All
Type: Tutorial
Tags: Graphics Virtualization; Data Center & Cloud Computing

Day: Monday, 04/04
Time: 10:30 - 11:50
Location: Room 210G

S6248 - High-Order Discretizations for Geometric Multi-Grid Methods

Nikolay Sakharnykh Developer Technology Engineer, NVIDIA
Nikolay Sakharnykh is a senior developer technology engineer at NVIDIA, where he works on accelerating applications on GPUs. He has experience in scientific research and software development focusing on computational techniques related to physics, chemistry, and biology.
Dusan Stosic Masters candidate, Federal University of Pernambuco (UFPE)
Dusan Stosic is an M.S. candidate at the Universidade Federal de Pernambuco, Brazil. He received a B.A. in physics from Boston University in 2014. His current research interests include computational physics, machine learning, and high performance computing.

Join us for a deep-dive analysis of geometric multigrid methods on GPUs. These methods have numerous applications in computer science, including combustion codes based on adaptive mesh refinement techniques. High-order multigrid schemes are actively being explored for production in many linear algebra packages, and may become a commodity within the next few years. We'll discuss performance bottlenecks and key implementation choices for current and future generation GPUs. Our analysis is based on a well-known high-performance multigrid benchmark (HPGMG). Hybrid CPU/GPU implementation using unified memory, high-order stencil optimizations, and multi-GPU scaling will be covered in detail.
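
For readers new to the method, here is a minimal serial 1D Poisson V-cycle in C++ showing the smooth-restrict-recurse-prolong structure the talk analyzes. It is a toy under simplifying assumptions (second-order stencil, weighted Jacobi, Dirichlet boundaries); HPGMG itself is a 3D, high-order benchmark, and the talk's implementations add unified memory and GPU kernels on top of this skeleton.

```cpp
// Minimal serial 1D geometric multigrid V-cycle for -u'' = f with
// Dirichlet boundaries, for intuition only.
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;

// A few weighted-Jacobi sweeps on A u = f, A = tridiag(-1, 2, -1) / h^2.
void smooth(Vec& u, const Vec& f, double h, int sweeps) {
    const double w = 2.0 / 3.0;
    for (int s = 0; s < sweeps; ++s) {
        Vec un = u;
        for (size_t i = 1; i + 1 < u.size(); ++i)
            un[i] = (1 - w) * u[i] + w * 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i]);
        u = un;
    }
}

Vec residual(const Vec& u, const Vec& f, double h) {
    Vec r(u.size(), 0.0);
    for (size_t i = 1; i + 1 < u.size(); ++i)
        r[i] = f[i] - (2 * u[i] - u[i - 1] - u[i + 1]) / (h * h);
    return r;
}

// Full-weighting restriction: fine grid (2m+1 points) -> coarse grid (m+1).
Vec restrict_to_coarse(const Vec& fine) {
    Vec coarse(fine.size() / 2 + 1, 0.0);
    for (size_t i = 1; i + 1 < coarse.size(); ++i)
        coarse[i] = 0.25 * fine[2*i - 1] + 0.5 * fine[2*i] + 0.25 * fine[2*i + 1];
    return coarse;
}

// Linear-interpolation prolongation: coarse grid -> fine grid of n points.
Vec prolong(const Vec& coarse, size_t n) {
    Vec fine(n, 0.0);
    for (size_t i = 0; i < coarse.size(); ++i) fine[2 * i] = coarse[i];
    for (size_t i = 1; i + 1 < n; i += 2) fine[i] = 0.5 * (fine[i - 1] + fine[i + 1]);
    return fine;
}

void vcycle(Vec& u, const Vec& f, double h) {
    smooth(u, f, h, 3);                             // pre-smooth high frequencies
    if (u.size() <= 3) return;                      // coarsest level: smoothing suffices
    Vec rc = restrict_to_coarse(residual(u, f, h)); // restrict the residual
    Vec ec(rc.size(), 0.0);
    vcycle(ec, rc, 2 * h);                          // recursively solve A e = r
    Vec ef = prolong(ec, u.size());                 // interpolate the correction
    for (size_t i = 0; i < u.size(); ++i) u[i] += ef[i];
    smooth(u, f, h, 3);                             // post-smooth
}

int main() {
    const size_t n = 129;                           // 2^7 + 1 grid points
    const double h = 1.0 / (n - 1);
    Vec u(n, 0.0), f(n, 1.0);
    for (int it = 0; it < 10; ++it) vcycle(u, f, h);
    double norm2 = 0.0;
    for (double r : residual(u, f, h)) norm2 += r * r;
    std::printf("residual 2-norm after 10 V-cycles: %.3e\n", std::sqrt(norm2));
    return 0;
}
```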

Level: Intermediate
Type: Tutorial
Tags: Supercomputing & HPC; Computational Physics

Day: Monday, 04/04
Time: 11:00 - 11:50
Location: Room 212A

S6306 - ESI Rendering Innovations with NVIDIA DesignWorks™

Andreas Mank Team Leader Visualization, ESI Software Germany
As leader of the visualization team in the BU Immersive Experience at ESI Software Germany, Andreas Mank is responsible for driving advances in visualization technologies and delivering state-of-the-art, high-performance immersive engineering visualization as well as advanced, high-quality rendering with ESI software products. Andreas studied media computer science at the University of Applied Sciences in Wedel. He has over 10 years of experience in virtual reality-related software development and has spent recent years working as a team leader in R&D.
Markus Tavenrath Senior Developer Technology Engineer, NVIDIA
Markus Tavenrath experiments with future rendering technologies that bring high interactivity to complex scenes. This work includes both CPU and GPU strategies for solving typical scene-graph operations related to rendering. Markus finished his studies in computer science with a focus on computer graphics in 2008. He was one of the first to use ray tracing on CUDA, for his diploma thesis, which brought him straight to NVIDIA. There, he primarily worked on GPU ray tracing for SceniX, NVIDIA's scene-graph technology, first showcased at SIGGRAPH 2008. Later he applied his experience to implement parts of OptiX, improve SceniX, and develop several ray-tracing demos. In close cooperation with external partners, he improved rendering performance and scene-graph usability as a developer technology engineer.

We'll present the evolution of the high-performance renderer from ESI Group, a pioneer and world-leading provider in virtual prototyping leveraging the physics of materials. We'll describe the next steps toward a comprehensive physically based rendering framework, and take a look behind the scenes at NVIDIA's nvpro-pipeline technology as well as the concepts and design details required to integrate it into ESI Software Germany's new rendering framework for immersive environments. Established in more than 40 countries worldwide, ESI Group helps industrial clients shorten their product development cycles by replacing physical prototypes with interactive rendering technology.

Level: Beginner
Type: Tutorial
Tags: Rendering & Ray Tracing; Product & Building Design; Real-Time Graphics

Day: Monday, 04/04
Time: 11:00 - 11:50
Location: Room 210E

S6646 - NVIDIA GPU Educators Program: Prepare Today's Students for Tomorrow's Accelerated Computing Challenges

Joe Bungo GPU Educators Program, NVIDIA
Joe Bungo is NVIDIA's GPU educators program manager. He enables the use of NVIDIA and GPU technologies in universities in a variety of ways, including curriculum and teaching material development, facilitation of academic ecosystems, and hands-on instructor workshops. Previously, Joe managed ARM Inc.'s University Program and was an applications engineer for ARM.

Industry and scientific discovery demand more graduates with a quality parallel and accelerated computing education than universities worldwide currently supply. The ACM-IEEE CS2013 curricular recommendations include a large increase in parallel programming topics, as understanding parallel computing is a requirement for getting the most out of any modern microprocessor, from mobile devices to supercomputers. NVIDIA's new GPU Educators Program lowers the barriers to entry for higher-education instructors who want to teach parallel computing. Co-developed with academia, GPU Teaching Kits are the flagship offering of the program and target a variety of academic disciplines. These comprehensive packages contain everything an instructor needs to teach a full-term curriculum course.

Level: All
Type: Tutorial
Tags: Education & Training

Day: Monday, 04/04
Time: 11:00 - 11:50
Location: Room 210F

S6133 - VKCPP: A C++ Layer on Top of Vulkan

Markus Tavenrath Senior Developer Technology Engineer, NVIDIA
Markus Tavenrath is experimenting with future rendering technologies that bring high interactivity to complex scenes. This work includes both CPU and GPU strategies for solving typical scene-graph operations related to rendering. Markus finished his studies in computer science with a focus on computer graphics in 2008. He was one of the first to use ray tracing on CUDA, for his diploma thesis, which brought him straight to NVIDIA. There he primarily worked on GPU ray tracing for SceniX, NVIDIA's scene-graph technology, which was showcased at SIGGRAPH 2008. Afterwards, he applied his experience to implement parts of OptiX, improve SceniX, and develop several ray-tracing demos. In close cooperation with external partners, he improved rendering performance and scene-graph usability as a developer technology engineer.

Learn how VKCPP, a C++ layer on top of Vulkan, helps you write more productive Vulkan code, and how Vulkan was integrated into RiX and the rest of the pipeline to simplify rendering with Vulkan even further. We'll also compare the performance of Vulkan and OpenGL based on a CAD-style reference model.
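
As a taste of what such a wrapper buys you, the sketch below creates a Vulkan instance through the C++ layer. The class and function names follow Vulkan-Hpp, the open source successor of VKCPP, and may differ from the 2016 VKCPP API shown in the talk; the core benefit, typed constructors that handle sType/pNext bookkeeping and exceptions that replace manual VkResult checks, is the same.

```cpp
// Sketch: Vulkan instance creation through a C++ wrapper layer.
// Names follow Vulkan-Hpp (the open source successor of VKCPP) and may
// differ from the 2016 VKCPP API presented in the talk.
#include <vulkan/vulkan.hpp>

int main() {
    // Typed constructors replace hand-filled C structs and their
    // error-prone sType/pNext bookkeeping.
    vk::ApplicationInfo appInfo("VKCPPDemo", 1, "NoEngine", 1, VK_API_VERSION_1_0);
    vk::InstanceCreateInfo createInfo({}, &appInfo);

    // Failures surface as exceptions instead of manually checked VkResult
    // codes, so the happy path reads straight through.
    vk::Instance instance = vk::createInstance(createInfo);

    instance.destroy();
    return 0;
}
```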

Level: Intermediate
Type: Tutorial
Tags: Real-Time Graphics

Day: Monday, 04/04
Time: 13:00 - 13:50
Location: Room LL21B

S6226 - High-Performance Video Encoding on NVIDIA GPUs

Abhijit Patait Director, Multimedia System Software, NVIDIA
Abhijit Patait manages the GPU Multimedia Software team at NVIDIA, which is responsible for video and audio software, GRID SDK, NVENC SDK, and others. Prior to NVIDIA, he worked at various U.S. organizations in areas such as baseband and audio signal processing, telecommunications software, VoIP, and video/image processing. Abhijit holds an MSEE from Missouri S&T University and MBA from University of California, Berkeley.
Eric Young GRID Developer Relations Manager , NVIDIA
Eric Young is an engineering manager in the Content and Technology group, which focuses on applied research with GRID technologies and video compression. Eric specializes in remote graphics, video compression, and technology for games.

We'll provide an overview of the video encoding technologies available on NVIDIA GPUs. In particular, attendees can expect to learn: (1) an overview of the NVIDIA video encoding SDK (NVENC SDK); (2) new features in NVIDIA video encoding (NVENC) hardware in new GPU chips; (3) changes and new features in NVENC SDK 6.0 and NVENC SDK 7.0; and (4) the differences between the NVENC SDK and the GRID SDK, and how to choose the right SDK for a particular application.

Level: All
Type: Tutorial
Tags: Media & Entertainment; Video & Image Processing; Tools & Libraries

Day: Monday, 04/04
Time: 13:00 - 13:50
Location: Room 210F

S6311 - MDL Materials to GLSL Shaders: Theory and Practice

Andreas Mank Team Leader Visualization, ESI Group
Andreas Mank leads the visualization team in the BU Immersive Experience, where he's responsible for driving advances in visualization technologies and delivering state-of-the-art, high-performance immersive engineering visualization as well as advanced, high-quality rendering with ESI software products. Andreas studied media computer science at the University of Applied Sciences in Wedel, Germany. He has over 10 years of experience in virtual reality-related software development.