GPU Technology Conference

October 4-5, 2016 | Melbourne

HANDS-ON LAB

L6118 - Simple Steps to Speed Up Your C Code Dramatically with GPU

Beatriz Pazmino Guest Researcher, Wesleyan University - NIST
Beatriz Pazmino is a postdoctoral researcher at the Materials Science and Engineering Division, National Institute of Standards and Technology. Her research interests include implementing and developing computational methods to aid the understanding of soft materials, in particular: 1) calculating electromagnetic and hydrodynamic properties of objects having arbitrary shape, including polymers and polymer assemblies in solution; 2) characterizing biometric data (e.g., cell morphology and human activities) based on shape metrologies, using path-integration methods to find "the useful metric" that relates to functionality; 3) unifying theoretical perspectives in glass dynamics and their application in understanding the effects of nanoparticles and confinement in polymer glass-forming materials, and quantifying the interaction strength between polymer matrix and nanoparticle. Beatriz has a Ph.D. in physics from Wesleyan University and a B.S. in electrical engineering.
Fernando Vargas-Lara Guest Researcher, NIST-Wesleyan
Fernando Vargas-Lara is a guest researcher at the Materials Science and Engineering Division. His research interests include transport properties, computer modeling, statistical mechanics, soft matter, polymer physics, DNA, DNA-based materials, and carbon-based nanocomposites. He received his Ph.D. in physics from Wesleyan University, Middletown, CT.

Many scientists have old, trusted codes in C and want to gain the GPU's speed-up without much trouble. To reach this community, we have selected the most common algorithms where the parallel architecture's benefits are greatest and describe the little details encountered in the transition to CUDA programming. We'll cover simple concepts, such as identifying device and host routines, dynamic memory allocation, calling external functions, and how to implement averages and histograms. This lab utilizes GPU resources in the cloud; you are required to bring your own laptop.
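
For a taste of the transition, here is a minimal, illustrative CUDA C sketch of one of the listed topics (a 256-bin histogram with dynamic device allocation and an atomicAdd to resolve concurrent bin updates). It is not taken from the session materials, and all names are hypothetical.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Device routine: each thread classifies one element and updates its bin.
    __global__ void histogram256(const unsigned char* data, int n, unsigned int* bins) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) atomicAdd(&bins[data[i]], 1u);  // atomic: bins are shared by all threads
    }

    int main() {
        const int N = 1 << 20;
        unsigned char* h_data = (unsigned char*)malloc(N);      // host allocation
        for (int i = 0; i < N; i++) h_data[i] = (unsigned char)(i % 256);

        unsigned char* d_data; unsigned int* d_bins;
        cudaMalloc(&d_data, N);                                 // dynamic device allocation
        cudaMalloc(&d_bins, 256 * sizeof(unsigned int));
        cudaMemset(d_bins, 0, 256 * sizeof(unsigned int));
        cudaMemcpy(d_data, h_data, N, cudaMemcpyHostToDevice);  // host routine feeds the device

        histogram256<<<(N + 255) / 256, 256>>>(d_data, N, d_bins);

        unsigned int h_bins[256];
        cudaMemcpy(h_bins, d_bins, sizeof(h_bins), cudaMemcpyDeviceToHost);
        printf("bin[0] = %u\n", h_bins[0]);                     // expect 4096 for this input
        cudaFree(d_data); cudaFree(d_bins); free(h_data);
        return 0;
    }

An average follows the same pattern, with a single accumulator (or a reduction) in place of the bins.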

Level: Beginner technical
Type: Hands-on Lab
Tags: Programming Languages; Algorithms

Day: Monday, 04/04
Time: 09:00 - 10:30
Location: Room 210A

L6128 - OpenACC Bootcamp

Jeff Larkin DevTech Software Engineer, NVIDIA
Highly-Rated Speaker
Jeff is a software engineer in NVIDIA's Developer Technology (DevTech) group, where he works on porting and optimizing HPC applications. He is also closely involved with the development of both the OpenACC and OpenMP specifications. Prior to joining NVIDIA, Jeff worked in Cray's Supercomputing Center of Excellence at Oak Ridge National Laboratory.

In this session, participants will learn OpenACC programming by example. Participants must be comfortable with C/C++ or Fortran programming, but no prior OpenACC or GPU programming experience is required. This lab will demonstrate a four-step process for applying OpenACC to an existing application (identify parallelism, express parallelism, express data movement, optimize loops) and discuss OpenACC best practices. Upon completion, participants will be able to begin accelerating their own applications using OpenACC. This lab utilizes GPU resources in the cloud; you are required to bring your own laptop.
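
The four steps map naturally onto a directive-annotated loop nest. Below is a minimal sketch of the pattern on a Jacobi-style stencil, assuming an OpenACC-capable compiler (e.g., PGI); the function and variable names are illustrative, not the bootcamp's actual code.

    #include <cmath>

    void jacobi(float* A, float* Anew, int n, int m, int iter_max, float tol) {
        float err = tol + 1.0f;
        int iter = 0;
        // Step 3: express data movement -- keep both grids resident on the GPU
        #pragma acc data copy(A[0:n*m]) create(Anew[0:n*m])
        while (err > tol && iter < iter_max) {
            err = 0.0f;
            // Steps 1-2: the independent (i,j) iterations are the parallelism
            // Step 4: collapse(2) and the max reduction are loop-level tuning
            #pragma acc parallel loop reduction(max:err) collapse(2)
            for (int j = 1; j < n - 1; j++)
                for (int i = 1; i < m - 1; i++) {
                    Anew[j*m + i] = 0.25f * (A[j*m + i + 1] + A[j*m + i - 1]
                                           + A[(j-1)*m + i] + A[(j+1)*m + i]);
                    err = fmaxf(err, fabsf(Anew[j*m + i] - A[j*m + i]));
                }
            #pragma acc parallel loop collapse(2)
            for (int j = 1; j < n - 1; j++)
                for (int i = 1; i < m - 1; i++)
                    A[j*m + i] = Anew[j*m + i];
            iter++;
        }
    }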

Level: Beginner technical
Type: Hands-on Lab
Tags: Programming Languages; OpenACC

Day: Monday, 04/04
Time: 09:00 - 12:00
Location: Room 210C

L6146 - IBM Watson Developers Lab

Niyati Parameswaran Data Scientist, IBM Watson Ecosystem, IBM
Niyati works as a data scientist for the Watson Ecosystem team. A dream of being able to provide a machine with intelligence that is unique, that can augment our own distinctive intelligence, and that ultimately hopes to answer Alan Turing's question of 'Can machines think?' motivates her research. She holds a Bachelor's in Computer Science from The Birla Institute of Science & Technology in India, and a Master's in Computer Science with a specialization in Machine Learning and Natural Language Processing from The University of Texas at Austin. She has worked on building deep QA systems, knowledge graphs, and recommendation engines. At Watson, she is known for developing core ML and NLP algorithmic paradigms, conceptualizing cognitive solutions for partners to bring them to scale, and providing data science as a service to the Ecosystem team.

IBM Watson will deliver a unique opportunity for attendees of the NVIDIA GTC Conference and the OpenPOWER Foundation Summit. If you are a developer interested in Machine or Deep Learning, this is not a lab to miss. IBM Watson is a cognitive technology platform that uses Natural Language Processing, Vision, and Machine and Deep Learning to reveal insights from large amounts of unstructured data. The Watson Lab 202 is a hands-on experience with IBM's Watson cognitive platform. Starting with a set of Watson services, attendees of all skill levels will learn how to build apps powered by Watson, gaining experience using cognitive and deep learning technologies. Attendees must bring their own laptops and follow the pre-req instructions.

Pre-Req:

  1. SIGN UP FOR BLUEMIX: Lab attendees should visit https://console.ng.bluemix.net/. Click on the Get Started Free button. Register your information and click on the Create Account button.
  2. INSTALL GIT: Instructions available at https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
  3. INSTALL CLOUD FOUNDRY: Instructions available at https://github.com/cloudfoundry/cli#downloads
  4. Take a look at the lab ahead of time. It's available at https://github.com/biosopher/watson-ipa-web-nodejs. Come prepared with questions, and ideas on how to extend it!

Level: Intermediate technical
Type: Hands-on Lab
Tags: Deep Learning & Artificial Intelligence; Big Data Analytics; Video & Image Processing

Day: Monday, 04/04
Time: 09:00 - 11:00
Location: Room LL21A

TUTORIAL

S6111 - NVIDIA CUDA® Optimization with NVIDIA Nsight™ Eclipse Edition: A Case Study

Christoph Angerer DevTech Compute, NVIDIA
Christoph Angerer is a developer in NVIDIA's European Developer Technology team. Based in Munich, Germany, he works with developers accelerating applications on GPUs. He holds a Ph.D. in computer science from ETH Zurich in Switzerland.
Mathias Wagner Developer Technology Engineer, NVIDIA
Mathias Wagner is part of the European Developer Technology team. He holds a Ph.D. in theoretical physics from TU Darmstadt. Before joining NVIDIA, Mathias worked on lattice quantum chromodynamics simulations on GPUs.

We'll present a real CUDA application and use NVIDIA Nsight Eclipse Edition on Linux to optimize the performance of the code. Attendees will learn a method to analyze their codes and how to use the tools to apply those ideas.

Level: Intermediate technical
Type: Tutorial
Tags: Performance Optimization; Tools & Libraries

Day: Monday, 04/04
Time: 09:00 - 10:20
Location: Room 211A

S6160 - Best Practices in GPU-Based Video Processing

Thomas True Senior Applied Engineer for Professional Video and Image Processing, NVIDIA
Tom True is a senior applied engineer at NVIDIA, where he works with developers to optimize the design and implementation of GPU-based professional video broadcast, digital post production, and large-scale multi-GPU, multi-display visualization systems. Tom has a Bachelor of Science degree from the Rochester Institute of Technology and a Master of Science degree from the graphics lab at Brown University. Prior to joining NVIDIA, Tom was an application engineer at Silicon Graphics.

This session will explore best practices and techniques for the development of efficient GPU-based video and image processing applications. Topics to be discussed include threading models for efficient parallelism, CPU affinity to optimize system memory and GPU locality, image segmentation for overlapped asynchronous transfers, optimal memory usage strategies to reduce expensive data movement, and image format considerations to reduce and eliminate data conversions. Single- and multi-GPU systems for uncompressed real-time 4K video capture, processing, display, and play-out will be considered. Takeaways should prove applicable to developers of video broadcast and digital post production systems as well as to developers of large-scale visualization systems that require video ingest.
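
As one concrete illustration of the overlapped-transfer technique, the sketch below splits a frame into slices and cycles them through CUDA streams so copies in one stream hide kernel work in another. The process kernel and buffer names are hypothetical, and the host buffers must be pinned (cudaHostAlloc) for the asynchronous copies to actually overlap.

    #include <cuda_runtime.h>

    __global__ void process(const unsigned char* in, unsigned char* out, size_t n) {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 255 - in[i];               // stand-in for real video work
    }

    void processFrame(unsigned char* h_in, unsigned char* h_out,   // pinned host memory
                      unsigned char* d_in, unsigned char* d_out, size_t frameBytes) {
        const int S = 4;                               // slices == streams
        cudaStream_t streams[S];
        for (int i = 0; i < S; ++i) cudaStreamCreate(&streams[i]);
        size_t chunk = frameBytes / S;
        for (int i = 0; i < S; ++i) {
            size_t off = (size_t)i * chunk;
            cudaMemcpyAsync(d_in + off, h_in + off, chunk, cudaMemcpyHostToDevice, streams[i]);
            process<<<(unsigned)((chunk + 255) / 256), 256, 0, streams[i]>>>(d_in + off, d_out + off, chunk);
            cudaMemcpyAsync(h_out + off, d_out + off, chunk, cudaMemcpyDeviceToHost, streams[i]);
        }
        cudaDeviceSynchronize();                       // stream i's copies overlap stream j's kernel
        for (int i = 0; i < S; ++i) cudaStreamDestroy(streams[i]);
    }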

Level: Advanced technical
Type: Tutorial
Tags: Media & Entertainment; Video & Image Processing; Large Scale and Multi-Display Visualization

Day: Monday, 04/04
Time: 09:00 - 10:20
Location: Room 210F

S6571 - Advanced Rendering Solutions from NVIDIA

Phillip Miller Senior Director, Advanced Rendering Products, NVIDIA
Highly-Rated Speaker
Phillip Miller is senior director of NVIDIA's commercial advanced rendering offerings, ranging from the Iray and mental ray renderers shipping within leading products in design and entertainment to the IndeX technology used in large data visualization. Phil has been with NVIDIA for 7 years and has led major software products for over 20 years, including the entertainment offerings at Autodesk and the web design product line at Adobe. He holds a Master of Architecture from the University of Illinois and is a registered architect.

Learn about the latest breakthroughs and offerings in NVIDIA's Advanced Rendering Solutions, which scale smoothly from local GPU rendering to remote supercomputer clusters. We'll explore and demonstrate new capabilities and possibilities in NVIDIA® Iray® and mental ray®. Plus, we'll see what's possible with the latest in NVIDIA OptiX™ for accelerating custom ray-tracing development, and in NVIDIA IndeX™ for in-situ large data visualization. Industry trends and production examples will also be explored as both interactive and production rendering possibilities continue to revolutionize workflows.

Level: Beginner technical
Type: Tutorial
Tags: Rendering & Ray Tracing; Tools & Libraries; In-Situ and Scientific Visualization; Product & Building Design

Day: Monday, 04/04
Time: 09:00 - 09:50
Location: Room 210E

S6595 - Benchmarking Graphics Intensive Application on VMware Horizon 6 Using NVIDIA GRID™ vGPUs

Manvender Rawat Performance Engineer, NVIDIA GRID, NVIDIA
Manvender Rawat is part of the NVIDIA GRID performance engineering team and is responsible for measuring and validating the performance and scalability delivered via the GRID platform, GRID vGPU software running on GRID GPUs, on all enterprise virtualized platforms.
Lan Vu Performance Engineer, VMware Inc.
Lan Vu works on performance engineering at VMware, focusing on optimizing the performance and scalability of virtual desktop infrastructure (VDI) solutions, including 3D graphics, Horizon View, Horizon Air, and DaaS. Prior to joining VMware, she worked at the Parallel Distributed Systems Lab at the University of Colorado Denver, with a research focus on high-performance methods in data mining. She holds a Ph.D. in Computer Science & Information Systems from the University of Colorado Denver.
Aravind Bappanadu Performance Engineering Manager, VMware
Aravind's team develops methodologies to measure the performance of VMware's End User Computing products, helping to size, benchmark, performance-tune, and debug VMware's View/Horizon products.

We'll provide a technical deep dive into performance characteristics, application tuning options, as well as hardware and platform design considerations for administering 3D graphics-intensive applications in VMware Horizon environments. We'll review various application performance benchmark results from NVIDIA and VMware joint testing of NVIDIA GRID vGPUs (K2 and M60). These results will showcase scalability with different hardware and software configurations. We'll also discuss the considerations required for choosing the correct NVIDIA GRID vGPU profile to deliver a great user experience in VMware Horizon environments.

Level: Intermediate technical
Type: Tutorial
Tags: Graphics Virtualization; Performance Optimization

Day: Monday, 04/04
Time: 09:00 - 10:20
Location: Room LL20A

TALK

S6645 - Scientific Visualization in HPC

Peter Messmer Principal Software Engineer , NVIDIA
Highly-Rated Speaker
Peter Messmer is a senior software engineer in NVIDIA's Developer Technology organization, working with clients to accelerate their scientific discovery process with GPUs. One area of his current research is to investigate how to utilize the GPUs in high performance computing systems for data analysis and visualization. Prior to joining NVIDIA, Peter spent more than 15 years developing HPC- and GPU-accelerated applications for industry and government clients, ranging from simulating next-generation particle accelerators or electromagnetic problems to modeling the behavior of dust on the surface of the Moon. Peter holds an M.S. and Ph.D. in physics from ETH Zurich, Switzerland, with specialization in kinetic plasma physics and nonlinear optics.

Learn how to leverage the graphics power in your GPU-accelerated supercomputer to turn your simulation data into insight. Starting from simulation data distributed across the nodes of a remote supercomputer, we'll cover various techniques and tools to convert this data into insightful visualizations at your workstation, leading to an end-to-end GPU accelerated visualization pipeline.

Level: Beginner technical
Type: Talk
Tags: In-Situ and Scientific Visualization; Supercomputing & HPC

Day: Monday, 04/04
Time: 09:00 - 09:50
Location: Room 212A

TUTORIAL

S6740 - GPU Powered Solutions in the Second Kaggle Data Science Bowl

Jon Barker Solution Architect, NVIDIA
Jon Barker joined NVIDIA in May 2015 as a Solution Architect. Since then he has been helping customers design, implement, and optimize a variety of GPU-accelerated deep learning applications and has also provided internal and external deep learning training. Jon is particularly focused on the application of deep learning to problems in defense and national security. Jon graduated from the University of Southampton in the UK in 2007 with a Ph.D. in Mathematics. Prior to joining NVIDIA, Jon worked for the UK Ministry of Defence and spent four years on secondment to the US Department of Defense, where he was a research scientist focused on data analytics and machine learning for multi-sensor data streams. In order to learn new data science skills, Jon has been a long-time competitor on Kaggle.

The second annual Data Science Bowl was an online data science contest that took place in early 2016 and was hosted on the Kaggle platform. The objective of the contest was to develop an algorithm that could accurately estimate the volume of the left ventricle of a human heart, at the points of maximum and minimum volume, from a time-series of multiple cross-sectional magnetic resonance imaging (MRI) images of the heart. The contest provided thousands of MRI images to train an algorithm. The challenge was a natural fit for GPU-accelerated deep learning (DL). We'll hear some of the winning teams describe their approaches. The complexities of working with sometimes messy clinical data will be discussed, and we will hear how deep learning can be applied to a time-series of 3D images.

Level: Beginner technical
Type: Tutorial
Tags: Deep Learning & Artificial Intelligence; Medical Imaging; Video & Image Processing

Day: Monday, 04/04
Time: 09:00 - 10:20
Location: Grand Ballroom

S6818 - Vulkan and NVIDIA: The Essentials

Tristan Lorach DevTech, NVIDIA
Early in his career, Tristan discovered the world of computer graphics through his active contribution to the demoscene (Amiga). After graduating in 1995, Tristan developed a series of 3D real-time interactive installations for exhibitions and events. From the creation of custom engines to the conception of new 3D human interfaces for public events, Tristan has always wanted to mix new hardware technology with innovative and creative ideas. Tristan now works at NVIDIA as the manager of the "DevTech ProViz" team (Developer Technology Relations department for Professional Visualization), participating in a variety of projects with NVIDIA partners while contributing to R&D and writing demos and tools for new GPU chips.

NVIDIA is bringing the power of Vulkan to a range of platforms to extend the choice of APIs for developers. This rapid-fire session will cover the essentials of NVIDIA's Vulkan rollout across its product range – with insights to help you judge whether Vulkan is right for your next development project.

Level: All technical
Type: Tutorial
Tags: Game Development; Real-Time Graphics

Day: Monday, 04/04
Time: 09:00 - 09:50
Location: Room LL20C

S6820 - Session 1 of 4: An Introduction to CUDA Programming (Presented by Acceleware)

Chris Mason Director of Product Management, Acceleware LTD
Highly-Rated Speaker
Chris Mason is the Product Manager for Acceleware's GPU-accelerated electromagnetic product line. He is responsible for the successful development and launch of Acceleware products used by companies worldwide. Chris has 10 years of experience in developing commercial applications for the GPU and has delivered numerous CUDA courses to students in a diverse range of industries. His previous experience also includes parallelization of algorithms on digital signal processors (DSPs) for cellular phones and base stations. Chris has a Master's in Electrical Engineering from Stanford University.

Join us for an informative introductory tutorial intended for those new to CUDA; it is the foundation for our following three tutorials. Those with no previous CUDA experience will leave with essential knowledge to start programming in CUDA. For those with previous CUDA experience, this tutorial will refresh key concepts required for subsequent tutorials on CUDA optimization. The tutorial will begin with a brief overview of CUDA and data-parallelism before focusing on the GPU programming model. We will explore the fundamentals of GPU kernels, host and device responsibilities, CUDA syntax, and thread hierarchy. A programming demonstration of a simple CUDA kernel will be delivered. Printed copies of the material will be provided to all attendees for each session – collect all four!
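
The canonical first example such a tutorial builds looks something like the vector addition below: the host owns allocation and transfers, the __global__ kernel runs on the device, and the grid/block/thread hierarchy indexes the data. A hedged sketch, not the tutorial's actual demonstration code.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // thread hierarchy: grid -> block -> thread
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
        for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        float *d_a, *d_b, *d_c;                               // device responsibility: compute
        cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // host responsibility: data + launch
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);   // <<<blocks, threads-per-block>>>

        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %.1f\n", h_c[0]);                      // expect 3.0
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
    }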

Level: Beginner technical
Type: Tutorial
Tags: Programming Languages; Supercomputing & HPC

Day: Monday, 04/04
Time: 09:00 - 10:20
Location: Room LL20D

S6181 - Memory Bandwidth Bootcamp: Collaborative Access Patterns

Tony Scudiero Devtech Software Engineer, NVIDIA
Highly-Rated Speaker
Tony Scudiero is a developer technology engineer at NVIDIA who works with a diverse set of applications ranging from physical simulations in nuclear engineering to ray tracing and audio processing. Tony researched human hearing at the University of Minnesota as a graduate student, though his degrees are in computer science and mathematics. Tony has been using GPUs for general purpose computing for over 10 years.

Performance of many modern applications is limited by the rate at which operands can be delivered from memory to functional units. The GPU Memory Bootcamp series aims to equip developers to optimize these applications. In this installment, collaborative access patterns that make use of coordination within a warp or thread block are explored for their ability to mitigate effects of memory divergence and execution divergence. Specific examples derive from real-world applications where this pattern has been successfully deployed.
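
One representative collaborative pattern is the warp-level reduction sketched below: after one coalesced load per thread, lanes exchange partial sums through registers, so a block issues one atomic per warp rather than one per thread. This is an illustrative sketch, not the session's examples; __shfl_down_sync is the CUDA 9+ spelling of the 2016-era __shfl_down.

    __inline__ __device__ float warpReduceSum(float val) {
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down_sync(0xffffffffu, val, offset); // lane i adds lane i+offset
        return val;                                            // lane 0 ends up with the warp's sum
    }

    __global__ void sumAll(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = (i < n) ? in[i] : 0.0f;                      // one coalesced load per thread
        v = warpReduceSum(v);
        if ((threadIdx.x & 31) == 0)
            atomicAdd(out, v);                                 // one atomic per warp, not per thread
    }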

Level: Intermediate technical
Type: Tutorial
Tags: Performance Optimization; Supercomputing & HPC; Algorithms

Day: Monday, 04/04
Time: 10:00 - 10:50
Location: Room 210G

S6312 - Sharing Physically Based Materials Between Renderers with MDL

Jan Jordan Software Product Manager MDL, NVIDIA
Jan Jordan is the product manager for the NVIDIA Material Definition Language. He is a graduate engineer of applied computer science from the Fachhochschule fur Wirtschaft und Technik Berlin, Germany, and has a B.S. in computer science from the RTC Galway, Ireland. Before joining NVIDIA, his diverse working experience spanned from research work on practical VR applications to working as an art director in computer games. He is a long-time member of NVIDIA's Advanced Rendering team, where his focus has been on enabling material workflows across many different applications and renderers.
Lutz Kettner Senior Manager, Advanced Rendering and Materials, NVIDIA
Lutz Kettner leads the design and engineering efforts for the Material Definition Language, MDL, and the Iray renderer from the NVIDIA Advanced Rendering Center. He has been working on leading software products in advanced rendering, language design, API design, and geometry for 19 years. He is known for his influential work on the open-source Computational Geometry Algorithms Library (CGAL). He holds a Ph.D. in Computer Science from ETH Zurich, Switzerland, worked as a researcher at the University of North Carolina at Chapel Hill, and led a research group at the Max-Planck-Institute in Saarbrucken, Germany. He served on ISO and ECMA standardization committees.

The basics of NVIDIA's Material Definition Language (MDL) will be discussed, showing how a single material can be used to define matching appearances between different renderers and rendering techniques. End users will learn how physically based materials can be defined, while developers will learn what is entailed in supporting MDL within their own product/renderer.

Level: All technical
Type: Tutorial
Tags: Rendering & Ray Tracing; Media & Entertainment; Product & Building Design

Day: Monday, 04/04
Time: 10:00 - 10:50
Location: Room 210E

TALK

S6590 - HPC Visualization Using NVIDIA IndeX™

Tom-Michael Thamm Director, Software Product Management, NVIDIA
Tom-Michael Thamm is director for software product management at the NVIDIA Advanced Rendering Center (ARC) in Berlin, Germany, and is responsible for all commercial software products, such as NVIDIA mental ray, NVIDIA Iray, and NVIDIA IndeX. With his team, he manages and coordinates customer support as well as general product definition and positioning. Tom-Michael has worked for NVIDIA ARC, and before that mental images, for over 25 years. He has led several key software projects and products, such as the NVIDIA IndeX product for large-volume visualization. He studied mathematics.
Christopher Lux Senior Graphics Software Engineer, NVIDIA
Christopher Lux is a senior graphics software engineer at the NVIDIA Advanced Rendering Center. He received his Ph.D. in computer science in 2013 from the Bauhaus-Universitat Weimar, Germany. Through his interest in real-time computer graphics and scientific visualization, he focused his work early on the interactive visualization of large-scale datasets from the geo-scientific and medical domains.
Marc Nienhaus Product Technology Lead, NVIDIA IndeX, NVIDIA
Marc Nienhaus is the product technology lead of the NVIDIA IndeX commercial software at NVIDIA. He manages the NVIDIA IndeX software engineering team and is responsible for the product architecture and applications of NVIDIA IndeX in various application domains. Before joining mental images' R&D rendering department and NVIDIA ARC, Marc was a postdoctoral researcher at Northwestern University in Illinois and led research projects at the University of Potsdam. His research interests include parallel and distributed rendering and computing, scientific visualization, GPU-based rendering, and photorealistic and non-photorealistic expressive depictions. Marc holds an M.S. in mathematics with a minor in computer science from the University of Muenster, and a Ph.D. in computer science from the Hasso Plattner Institute at the University of Potsdam. Marc has published various papers on GPU-based real-time and non-photorealistic rendering techniques.

We'll give a technical overview of the NVIDIA IndeX architecture, which enables instant visualization of simulation and compute data, with details on the interface design and its use. Further, NVIDIA IndeX's capabilities are demonstrated by real-world solutions, including real-time weather prediction and a seismic wave-propagation algorithm.

Level: All technical
Type: Talk
Tags: In-Situ and Scientific Visualization; Large Scale and Multi-Display Visualization; Computer-Aided Engineering

Day: Monday, 04/04
Time: 10:00 - 10:50
Location: Room 212A

TUTORIAL

S6817 - High-Performance, Low-Overhead Rendering with OpenGL and Vulkan

Edward Liu DevTech, NVIDIA
Edward Liu is a developer technology engineer at NVIDIA as well as a huge graphics enthusiast. He was fascinated by graphics at first sight and spent most of his college career at South China University of Technology writing software renderers from scratch. Later he joined the graphics lab at Georgia Institute of Technology as a graduate student, working on fluid simulation and rendering. Now at NVIDIA, he works on exploring cutting-edge rendering techniques that could improve the visual quality and performance of future graphics applications.

As advanced games and applications continue to push the performance envelope, developers look to their 3D APIs for improved predictability, threading, and reduced CPU load. New extensions to OpenGL and a totally new 3D API, Vulkan, are answering these requests directly. This session introduces and details both of these approaches and shows how applications can use OpenGL "AZDO" (Approaching Zero Driver Overhead) extensions like NVIDIA's Command Lists to greatly reduce single-threaded CPU overhead while reusing existing OpenGL code. Going further, the second section introduces the new 3D API from Khronos called Vulkan. Finally, there is a discussion of the tradeoffs of extended OpenGL and AZDO versus Vulkan, and how a developer might choose between them.

Level: Intermediate technical
Type: Tutorial
Tags: Real-Time Graphics ; Game Development

Day: Monday, 04/04
Time: 10:00 - 10:50
Location: Room LL20C

HANDS-ON LAB

L6105 - Developing, Debugging and Optimizing GPU Codes for High Performance Computing with Allinea Forge

Beau Paisley Support Engineer, Allinea Software
Beau Paisley is a support engineer with Allinea Software. He has over 25 years of experience in development, marketing, and sales roles with research, academic, and startup organizations. Beau has previously held positions with NCAR, Applied Physics Lab, and several startup and early growth technical computing companies. He is a computer science and mathematics graduate from the College of William and Mary and performed M.S. studies in electrical engineering at Purdue University.

We'll bring CUDA into a compute-intensive application by learning how to use CUDA-enabled development tools in the process of profiling, optimization, editing, building, and debugging. Using the Allinea Forge development toolkit, we'll cover how to profile an existing application and identify the most compute-intensive code regions. We'll then replace these regions with CUDA implementations and review the results before turning to the task of debugging the GPU-enabled code to fix an error introduced during the exercise. We'll learn debugging techniques for CUDA and debug using Allinea Forge to produce correct, working, high-performance GPU-accelerated code. We'll be using GPUs hosted in the cloud, so simply bring a laptop with a modern browser. This lab utilizes GPU resources in the cloud; you are required to bring your own laptop.

Level: Beginner technical
Type: Hands-on Lab
Tags: Supercomputing & HPC; Tools & Libraries

Day: Monday, 04/04
Time: 10:30 - 12:00
Location: Room 210A

L6127 - Simplified CUDA® Development with C#

Daniel Egloff Managing Director QuantAlea / Partner InCube Group, QuantAlea / InCube Group
Highly-Rated Speaker
Daniel Egloff is a partner of InCube Group and managing director of QuantAlea, a Swiss software engineering company specializing in GPU software development. He studied mathematics, theoretical physics, and computer science and worked for almost 20 years as a quant and software architect.

Because of the availability of GPUs on Azure and in the new Surface, developing GPU-accelerated applications in C# is becoming more important than ever before. Scaling out to GPUs hosted in a public cloud is now simpler and very cost effective. On Azure, C# is the language of choice for many enterprises and software developers. In this session, we use the Alea GPU V3 development stack to program GPU algorithms directly in C# and show how to perform native debugging, profiling, and performance tuning. This lab utilizes GPU resources in the cloud; you are required to bring your own laptop.

Level: Beginner technical
Type: Hands-on Lab
Tags: Programming Languages; Tools & Libraries

Day: Monday, 04/04
Time: 10:30 - 12:00
Location: Room 210B

TUTORIAL

S6112 - NVIDIA CUDA® Optimization with NVIDIA Nsight™ Visual Studio Edition: A Case Study

Christoph Angerer DevTech Compute, NVIDIA
Christoph Angerer is a developer in NVIDIA's European Developer Technology team. Based in Munich, Germany, he works with developers accelerating applications on GPUs. He holds a Ph.D. in computer science from ETH Zurich in Switzerland.
Jakob Progsch DevTech Compute, NVIDIA
Jakob Progsch is a Developer Technology engineer based in NVIDIA's Zürich office. He works with developers of scientific software on porting, profiling, and optimizing code for GPUs.

We'll present a real CUDA application and use NVIDIA Nsight Visual Studio Edition on Windows to optimize the performance of the code. Attendees will learn a method to analyze their codes and how to use the tools to apply those ideas.

Level: Intermediate technical
Type: Tutorial
Tags: Performance Optimization; Tools & Libraries

Day: Monday, 04/04
Time: 10:30 - 11:50
Location: Room 211A

S6307 - Get to Know the NVIDIA GRID™ SDK

Shounak Deshpande System Software Engineer, Sr., NVIDIA
Shounak Deshpande started as a developer in NVIDIA's video software group in 2007. Since then, he has worked on various technologies such as NVIDIA Optimus, the NVENC API, and the GRID SDK. In his current position as the technical lead for GRID SDK development, he strives to make the software clean, efficient, and user-friendly.

This session will introduce the NVIDIA GRID SDK to those who want to build high-performance desktop application streaming software. Using the GRID SDK enables harnessing the power of NVIDIA GPUs for fast capture and efficient hardware-based video encoding. We will begin the session with an introduction to the GRID SDK, provide an overview of the GRID SDK APIs (NVFBC - NVIDIA FrameBuffer Capture, and NVIFR - NVIDIA In-band Frame Readback), and review the GRID SDK roadmap. The remainder of the session will be spent discussing common issues faced by developers, including getting the most out of NVFBC and the NVENC hardware encoder for desktop streaming, measuring NVIFR performance, and multithreading with NVIFR.

Level: All technical
Type: Tutorial
Tags: Graphics Virtualization; Data Center & Cloud Computing

Day: Monday, 04/04
Time: 10:30 - 11:50
Location: Room LL20A

S6821 - Session 2 of 4: An Introduction to the GPU Memory Model (Presented by Acceleware)

Chris Mason Director of Product Management, Acceleware LTD
Highly-Rated Speaker
Chris Mason is the Product Manager for Acceleware's GPU-accelerated electromagnetic product line. He is responsible for the successful development and launch of Acceleware products used by companies worldwide. Chris has 10 years of experience in developing commercial applications for the GPU and has delivered numerous CUDA courses to students in a diverse range of industries. His previous experience also includes parallelization of algorithms on digital signal processors (DSPs) for cellular phones and base stations. Chris has a Master's in Electrical Engineering from Stanford University.

This tutorial is for those with a basic understanding of CUDA who want to learn about the GPU memory model and optimal storage locations. To learn the basics of CUDA programming required for Session 2, attend Session 1 - An Introduction to CUDA Programming. This session begins with an essential overview of the GPU architecture and thread cooperation before focusing on the different memory types available on the GPU. We will define shared, constant, and global memory and discuss the best locations to store your application data for optimized performance. A programming demonstration of shared and constant memory will be delivered. Printed copies of the material will be provided to all attendees for each session – collect all four!
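
The three storage locations contrast roughly as in the sketch below (illustrative, not Acceleware's material): the input and output arrays live in global memory, each block stages a tile plus halo in shared memory, and the read-only filter taps sit in constant memory, which is cached and broadcast-friendly. Launch with BLOCK threads per block and fill c_taps via cudaMemcpyToSymbol.

    #define RADIUS 3
    #define BLOCK  256
    __constant__ float c_taps[2 * RADIUS + 1];        // constant memory: read-only, broadcast

    __global__ void stencil1d(const float* in, float* out, int n) {  // in/out: global memory
        __shared__ float tile[BLOCK + 2 * RADIUS];    // shared memory: fast, block-local staging
        int g = blockIdx.x * blockDim.x + threadIdx.x;
        int l = threadIdx.x + RADIUS;
        tile[l] = (g < n) ? in[g] : 0.0f;
        if (threadIdx.x < RADIUS) {                   // first RADIUS threads also load the halos
            int left = g - RADIUS, right = g + BLOCK;
            tile[l - RADIUS] = (left  >= 0) ? in[left]  : 0.0f;
            tile[l + BLOCK]  = (right <  n) ? in[right] : 0.0f;
        }
        __syncthreads();                              // thread cooperation within the block
        if (g < n) {
            float acc = 0.0f;
            for (int k = -RADIUS; k <= RADIUS; k++)
                acc += c_taps[k + RADIUS] * tile[l + k];
            out[g] = acc;
        }
    }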

Level: Beginner technical
Type: Tutorial
Tags: Programming Languages; Supercomputing & HPC

Day: Monday, 04/04
Time: 10:30 - 11:50
Location: Room LL20D

HANDS-ON LAB

L6147 - IBM Watson Developers Lab

Niyati Parameswaran Data Scientist, IBM Watson Ecosystem, IBM
Niyati works as a data scientist for the Watson Ecosystem team. A dream of being able to provide a machine with intelligence that is unique, that can augment our own distinctive intelligence, and that ultimately hopes to answer Alan Turing's question of 'Can machines think?' motivates her research. She holds a Bachelor's in Computer Science from The Birla Institute of Science & Technology in India, and a Master's in Computer Science with a specialization in Machine Learning and Natural Language Processing from The University of Texas at Austin. She has worked on building deep QA systems, knowledge graphs, and recommendation engines. At Watson, she is known for developing core ML and NLP algorithmic paradigms, conceptualizing cognitive solutions for partners to bring them to scale, and providing data science as a service to the Ecosystem team.

IBM Watson will deliver a unique opportunity for attendees of the NVIDIA GTC Conference and the OpenPOWER Foundation Summit. If you are a developer interested in Machine or Deep Learning, this is not a lab to miss. IBM Watson is a cognitive technology platform that uses Natural Language Processing, Vision, and Machine and Deep Learning to reveal insights from large amounts of unstructured data. The Watson Lab 202 is a hands-on experience with IBM's Watson cognitive platform. Starting with a set of Watson services, attendees of all skill levels will learn how to build apps powered by Watson, gaining experience using cognitive and deep learning technologies. Attendees must bring their own laptops and follow the pre-req instructions.

Pre-Req:

  1. SIGN UP FOR BLUEMIX: Lab attendees should visit https://console.ng.bluemix.net/. Click on the Get Started Free button. Register your information and click on the Create Account button.
  2. INSTALL GIT: Instructions available at https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
  3. INSTALL CLOUD FOUNDRY: Instructions available at https://github.com/cloudfoundry/cli#downloads
  4. Take a look at the lab ahead of time. It's available at https://github.com/biosopher/watson-ipa-web-nodejs. Come prepared with questions, and ideas on how to extend it!

Level: Intermediate technical
Type: Hands-on Lab
Tags: Deep Learning & Artificial Intelligence; Big Data Analytics; Video & Image Processing

Day: Monday, 04/04
Time: 11:00 - 13:00
Location: Room LL21A

TUTORIAL

S6248 - High-Order Discretizations for Geometric Multi-Grid Methods

Nikolay Sakharnykh Developer Technology Engineer, NVIDIA
Nikolay Sakharnykh is a senior developer technology engineer at NVIDIA, where he works on accelerating applications on GPUs. He has experience in scientific research and software development focusing on computational techniques related to physics, chemistry, and biology.
Dusan Stosic Master's candidate, Federal University of Pernambuco (UFPE)
Dusan Stosic is an M.S. candidate at the Universidade Federal de Pernambuco, Brazil. He received a B.A. in physics from Boston University in 2014. His current research interests include computational physics, machine learning, and high performance computing.

Join us for a deep-dive analysis of geometric multigrid methods on GPUs. These methods have numerous applications in computer science, including combustion codes based on adaptive mesh refinement techniques. High-order multigrid schemes are actively being explored for production in many linear algebra packages, and may become a commodity within the next few years. We'll discuss performance bottlenecks and key implementation choices for current and future generation GPUs. Our analysis is based on a well-known high-performance multigrid benchmark (HPGMG). Hybrid CPU/GPU implementation using unified memory, high-order stencil optimizations, and multi-GPU scaling will be covered in detail.
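
Of the implementation choices mentioned, unified memory is the easiest to show in miniature. In the hedged sketch below (names illustrative, not HPGMG code), cudaMallocManaged gives the CPU and GPU the same pointers, so a hybrid multigrid cycle can hand grids back and forth without explicit copies.

    #include <cuda_runtime.h>

    __global__ void smooth(const float* u, float* u_new, const float* f, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // a stand-in Jacobi-style sweep
        if (i > 0 && i < n - 1)
            u_new[i] = 0.5f * (u[i - 1] + u[i + 1] + f[i]);
    }

    int main() {
        const int n = 1 << 20;
        float *u, *u_new, *f;
        cudaMallocManaged(&u,     n * sizeof(float));    // one pointer, visible to CPU and GPU
        cudaMallocManaged(&u_new, n * sizeof(float));
        cudaMallocManaged(&f,     n * sizeof(float));
        for (int i = 0; i < n; i++) { u[i] = 0.0f; u_new[i] = 0.0f; f[i] = 1.0f; }  // CPU writes directly

        smooth<<<(n + 255) / 256, 256>>>(u, u_new, f, n); // GPU reads/writes the same arrays
        cudaDeviceSynchronize();                          // required before the CPU touches them again

        // ...coarser levels of the V-cycle could now be smoothed on the CPU...
        cudaFree(u); cudaFree(u_new); cudaFree(f);
        return 0;
    }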

Level: Intermediate technical
Type: Tutorial
Tags: Supercomputing & HPC; Computational Physics

Day: Monday, 04/04
Time: 11:00 - 11:50
Location: Room 212A

S6311 - MDL Materials to GLSL Shaders: Theory and Practice

Andreas Mank Team Leader Visualization, ESI Group
Andreas Mank leads the visualization team in the BU Immersive Experience, where he's responsible for driving advances in visualization technologies and delivering state-of-the-art, high-performance immersive engineering visualization as well as advanced, high-quality rendering with ESI software products. Andreas studied media computer science at the University of Applied Sciences in Wedel, Germany. He has over 10 years of experience in virtual reality-related software development. In recent years, he has worked as a team leader in research and development.
Andreas Süßenbach Senior Developer Technology Engineer, NVIDIA
Andreas Sussenbach is a senior DevTech engineer in NVIDIA's Professional Solutions Group, where he works to help different ISVs improve their GPU-related implementations. He has more than 15 years of experience in scene graph and rendering technologies, with emphasis on efficient handling of geometries and materials. He has a diploma in mathematics with a focus on numerical mathematics and CAGD.

Learn how you can map arbitrarily complex materials described with NVIDIA's Material Definition Language (MDL) onto sets of material-specific GLSL shaders using the MDL SDK. We use a skeleton of a general-purpose main shader per stage, in which a couple of pre-defined evaluation and sample functions are called. The body of those functions is composed of code snippets selected by the material analyzer. This approach has been adopted by ESI in their new rendering framework to showcase the power and flexibility of MDL. A demo will show the implementation results, with a focus on material re-use and sharing.

Level: Intermediate technical
Type: Tutorial
Tags: Rendering & Ray Tracing; Product & Building Design; Real-Time Graphics

Day: Monday, 04/04
Time: 11:00 - 11:50
Location: Room 210E

S6646 - NVIDIA GPU Educators Program: Prepare Today's Students for Tomorrow's Accelerated Computing Challenges

Joe Bungo GPU Educators Program Manager, NVIDIA
Joe Bungo is NVIDIA's GPU Educators Program Manager. He enables the use of NVIDIA and GPU technologies in universities in a variety of ways, including curriculum and teaching material development, facilitation of academic ecosystems, and hands-on instructor workshops. Previously, Joe managed ARM Inc.'s University Program and was an applications engineer for ARM.

The demand from industry and new scientific discovery for graduates with a quality parallel and accelerated computing education exceeds the global supply. The ACM-IEEE CS2013 curricular recommendations include a large increase in parallel programming topics, as understanding parallel computing is a requirement to get the most out of any modern microprocessor, from mobile devices to supercomputers. NVIDIA's new GPU Educators Program breaks down the global barriers to entry for higher-education instructors teaching parallel computing. Co-developed with academia, GPU Teaching Kits are the flagship offering of the program and target a variety of academic disciplines. These comprehensive packages contain everything an instructor needs to teach a full-term curriculum course.

Level: All technical
Type: Tutorial
Tags: Education & Training

Day: Monday, 04/04
Time: 11:00 - 11:50
Location: Room 210F

TALK

S6796 - Khronos API Ecosystem Update – Including Vulkan, OpenCL, OpenVX and SPIR-V

Neil Trevett Vice President Developer Ecosystem, NVIDIA
Neil Trevett has spent over thirty years in the 3D graphics industry. By day, he drives the advanced apps ecosystem on NVIDIA Tegra mobile and embedded devices. By night, Neil is the elected president of the Khronos Group industry standards consortium, where he initiated the OpenGL ES standard now used by billions worldwide every day, helped catalyze the WebGL project to bring interactive 3D graphics to the Web, chairs the OpenCL working group defining the open standard for heterogeneous parallel computation, and has helped create and launch the new-generation Vulkan API.

Discover how over 100 companies cooperate at the Khronos Group to create open, royalty-free standards that enable developers to access the power of the GPU to accelerate demanding compute, graphics, and vision applications. This session includes the very latest updates, including the newly announced Vulkan, SPIR-V, OpenVX, and OpenCL 2.1 specifications.

Level: Beginner technical
Type: Talk
Tags: Real-Time Graphics ; Computer Vision & Machine Vision; Programming Languages

Day: Monday, 04/04
Time: 11:00 - 11:50
Location: Room LL20C

TUTORIAL

S6847 - Accelerate Deep Learning with NVIDIA's Deep Learning Platform

Stephen Jones Product Manager, Deep Learning, NVIDIA
Focused on customer experiences, Stephen's background in technology and intuitive grasp of developers' needs bring a unique spin to product management. Stephen holds a B.S. in Computer Science from the University of Tennessee and is pursuing an MBA at Colorado State University. Stephen has been fortunate enough to work for great companies that permit him to pursue his hobbies, including cycling the backroads and running the trails around his hometown of Boulder, CO. When not in oxygen debt, Stephen spends time with his two boys and beautiful wife.

Deep Learning is delivering the future today, enabling computers to perform tasks once thought possible only in science fiction. Innovations such as autonomous vehicles, speech recognition and advances in medical imaging will transform the world as we know it. GPUs are at the core of this transformation, providing the engines that power Deep Learning. In this session, we'll discuss the software tools NVIDIA provides to unlock the power of Deep Learning on GPUs. We'll provide an overview of NVIDIA's Deep Learning Software, including cuDNN and DIGITS, and pointers to maximize your experience with Deep Learning at GTC.
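
As a flavor of the cuDNN side of that software stack, the sketch below runs a single accelerated op (a ReLU activation) through the v5-style descriptor API: create a handle, describe the tensor, execute. It is a minimal illustration, not NVIDIA sample code, and error checking is omitted.

    #include <cudnn.h>
    #include <cuda_runtime.h>

    int main() {
        cudnnHandle_t handle;
        cudnnCreate(&handle);                          // one handle per device/thread context

        const int n = 1, c = 3, h = 224, w = 224;      // an NCHW image batch
        cudnnTensorDescriptor_t desc;
        cudnnCreateTensorDescriptor(&desc);
        cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, w);

        float *d_x, *d_y;
        size_t bytes = (size_t)n * c * h * w * sizeof(float);
        cudaMalloc(&d_x, bytes); cudaMalloc(&d_y, bytes);

        cudnnActivationDescriptor_t act;
        cudnnCreateActivationDescriptor(&act);
        cudnnSetActivationDescriptor(act, CUDNN_ACTIVATION_RELU, CUDNN_NOT_PROPAGATE_NAN, 0.0);

        const float alpha = 1.0f, beta = 0.0f;         // y = alpha * relu(x) + beta * y
        cudnnActivationForward(handle, act, &alpha, desc, d_x, &beta, desc, d_y);

        cudnnDestroyActivationDescriptor(act);
        cudnnDestroyTensorDescriptor(desc);
        cudaFree(d_x); cudaFree(d_y);
        cudnnDestroy(handle);
        return 0;
    }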

Level: Beginner technical
Type: Tutorial
Tags: Deep Learning & Artificial Intelligence

Day: Monday, 04/04
Time: 11:00 - 11:50
Location: Grand Ballroom

HANGOUT

H6128 - Hangout: The DIGITS Roadmap

Stephen Jones Product Line Manager, Deep Learning, NVIDIA
Focused on customer experiences, Stephen's background in technology and intuitive grasp of developers' needs bring a unique spin to product management. Stephen holds a B.S. in Computer Science from the University of Tennessee and is pursuing an MBA at Colorado State University. Stephen has been fortunate enough to work for great companies that permit him to pursue his hobbies, including cycling the backroads and running the trails around his hometown of Boulder, CO. When not in oxygen debt, Stephen spends time with his two boys and beautiful wife.

This session is intended for users of and contributors to the NVIDIA Deep Learning GPU Training System (DIGITS). There will be a short presentation on the DIGITS roadmap and then an open discussion regarding the future of this powerful tool for deep learning. Attendees are invited to provide feedback and share stories on how they use this tool in their daily work. Suggestions on ways to improve DIGITS for specific use cases, as well as collaboration ideas, are welcome.

Level: All technical
Type: Hangout
Tags: Deep Learning & Artificial Intelligence

Day: Monday, 04/04
Time: 12:00 - 13:00
Location: Pod A

H6155 - Performance Analysis and Optimization

Justin Luitjens Developer Technologies Engineer, NVIDIA
Highly-Rated Speaker
Justin Luitjens has been a developer technology engineer at NVIDIA for five years. He received his Ph.D. in scientific computing from the University of Utah.
Ross Xie HPC Developer Technology Engineer, NVIDIA
Ross Xie is an HPC Developer Technology Engineer at NVIDIA.

Come ask your GPU code optimization questions to experts in the field. Hangouts provide an opportunity for you to ask topic experts specific questions. Come on in. Find a person wearing an Ask Me button and ask a question!

Level: All technical
Type: Hangout
Tags: Performance Optimization

Day: Monday, 04/04
Time: 12:00 - 13:00
Location: Pod B

SPECIAL EVENT

SE6136 - Networking Lunch

Grab your lunch and spend an hour meeting fellow GPU enthusiasts.

Level: All technical
Type: Special Event
Tags: Public Event

Day: Monday, 04/04
Time: 12:00 - 13:00
Location: Grand Ballroom

HANGOUT

H6111A - Hangout: HPC & In-Situ Visualization

Tom Fogal Senior DevTech Engineer, NVIDIA
Tom is a senior software engineer in the developer technologies group at NVIDIA. He has spent a decade crafting visualization tools to scale to the needs of computational and experimental scientists, helping others to understand diverse problems, from deep brain stimulation to nuclear physics. Tom specializes in in situ visualization as well as out-of-core techniques that take advantage of the highly parallel nature of NVIDIA's hardware. He comes from a long list of academic institutions, including the University of Duisburg-Essen in Germany, the SCI Institute in Utah, Lawrence Livermore and Oak Ridge national laboratories, and the University of New Hampshire. At NVIDIA, Tom continues his work on in situ visualization whilst connecting with everyday developers to help them create new tools to visualize and understand data.
Peter Messmer Senior Software Engineer, NVIDIA
Highly-Rated Speaker
Peter Messmer is a senior software engineer in NVIDIA's Developer Technology organization, working with clients to accelerate their scientific discovery process with GPUs. One area of his current research is to investigate how to utilize the GPUs in high performance computing systems for data analysis and visualization. Prior to joining NVIDIA, Peter spent more than 15 years developing HPC- and GPU-accelerated applications for industry and government clients, ranging from simulating next-generation particle accelerators or electromagnetic problems to modeling the behavior of dust on the surface of the Moon. Peter holds an M.S. and Ph.D. in physics from ETH Zurich, Switzerland, with specialization in kinetic plasma physics and nonlinear optics.

Speak with NVIDIA engineers and get answers to your in situ and post hoc visualization questions. This is the best place to get your queries concerning HPC visualization answered, from user level to developer level. Come learn how to effectively utilize GPUs in ParaView or VisIt, how to use the interop APIs to perform your own custom visualization, or how to improve performance by enabling zero-copy visualization from simulation code to visualization code. Learn how to achieve GPU-accelerated graphics without X servers via EGL, and get your questions answered as to the best methods for image composition on modern supercomputers.

Level: All technical
Type: Hangout
Tags: In-Situ and Scientific Visualization

Day: Monday, 04/04
Time: 13:00 - 14:00
Location: Pod C

H6141 - Hangout: MDL Materials to GLSL Shaders

Andreas Süßenbach Senior Developer Technology Engineer, NVIDIA
Andreas Sussenbach is a senior DevTech engineer in NVIDIA's Professional Solutions Group, where he works to help different ISVs improve their GPU-related implementations. He has more than 15 years of experience in scene graph and rendering technologies, with emphasis on efficient handling of geometries and materials. He has a diploma in mathematics with a focus on numerical mathematics and CAGD.
Andreas Mank Team Leader Visualization, ESI Group
Andreas Mank leads the visualization team in the BU Immersive Experience, where he's responsible for driving advances in visualization technologies and delivering state-of-the-art, high-performance immersive engineering visualization as well as advanced, high-quality rendering with ESI software products. Andreas studied media computer science at the University of Applied Sciences in Wedel, Germany. He has over 10 years of experience in virtual reality-related software development. In recent years, he has worked as a team leader in research and development.

We will talk about how you can map arbitrarily complex materials described with NVIDIA's Material Definition Language (MDL) onto sets of material-specific GLSL shaders using the MDL SDK. Details of the skeleton of a general-purpose main shader per stage, in which a couple of pre-defined evaluation and sample functions are called, can be discussed. The body of those functions is composed of code snippets selected by the material analyzer.

Level: All technical
Type: Hangout
Tags: Real-Time Graphics

Day: Monday, 04/04
Time: 13:00 - 14:00
Location: Pod A

H6156 - Hangout: Algorithms and Numerical Techniques

Shankara Rao Thejaswi Nanditale DevTech Compute, NVIDIA
Coming soon
Nikolay Sakharnykh Developer Technology Engineer, NVIDIA
Nikolay Sakharnykh is a senior developer technology engineer at NVIDIA, where he works on accelerating applications on GPUs. He has experience in scientific research and software development focusing on computational techniques related to physics, chemistry, and biology.

Stop by and have a chat with GPU programming experts about accelerating algorithms and numerical techniques on GPUs. Hangouts provide an opportunity for you to ask topic experts specific questions. Come on in. Find a person wearing an Ask Me button and ask a question!

Level: All technical
Type: Hangout
Tags: Algorithms

Day: Monday, 04/04
Time: 13:00 - 14:00
Location: Pod B

HANDS-ON LAB

L6104 - In-Depth Performance Analysis for OpenACC/CUDA®/OpenCL Applications with Score-P and Vampir

Robert Henschel Manager, Scientific Applications and Performance Tuning, Research Technologies, Pervasive Technology Institute, Indiana University
Robert Henschel runs the Scientific Applications and Performance Tuning group at Indiana University, focused on optimizing scientific applications. He received an M.S. in computer science from Technische Universitat Dresden, Germany.
Guido Juckeland Head of Computational Science Group, Helmholtz-Zentrum Dresden-Rossendorf (HZDR)
Guido Juckeland runs the Computational Science Group at Helmholtz-Zentrum Dresden-Rossendorf (HZDR) and coordinates the work of the GPU Center of Excellence at Dresden. He also represents HZDR on the SPEC High Performance Group and the OpenACC committee. He received his Ph.D. for his work on performance analysis for hardware accelerators.

Participants will work with Score-P/Vampir to learn how to dive into the execution properties of CUDA and OpenACC applications. We'll show how to use Score-P to generate a trace file and how to study it with Vampir. Additionally, we'll use the newly established OpenACC tools interface to present how OpenACC applications can be studied for performance bottlenecks. This lab utilizes GPU resources in the cloud; you are required to bring your own laptop.

Level: Advanced technical
Type: Hands-on Lab
Tags: Performance Optimization; Tools & Libraries

Day: Monday, 04/04
Time: 13:00 - 14:30
Location: Room 210B

L6143 - Train and Deploy Deep Learning for Vision, Natural Language and Speech Using MXNet

Chiyuan Zhang PhD Student, Massachusetts Institute of Technology
Chiyuan Zhang received his B.Sc. and M.S. in computer science from Zhejiang University, China, in 2009 and 2012, respectively. He is currently a Ph.D. candidate at the Computer Science and Artificial Intelligence Laboratory at MIT. His research interests include machine learning and computational neuroscience, as well as applications to the processing and analysis of speech, vision, and other kinds of real-world signals.
Bing Xu Data Scientist, Dato
Bing Xu is currently a final-year M.S. student at the University of Alberta, after which he will join Dato Inc. His research interests include neural network robustness, language understanding, and various applications of deep learning. He is also a top-100 player on Kaggle.
Junyuan (Eric) Xie PhD Student, University of Washington
Junyuan Xie is currently a third-year Ph.D. student at the University of Washington. He is interested in machine learning, deep learning, and their applications to computer vision. He also likes to develop open source software for research and fun.

This hands-on tutorial will work through the pipeline of developing, training, and deploying deep learning applications using MXNet. Multiple applications, including computer vision, natural language processing, and speech recognition, will be covered. You'll learn how to write a deep learning program in a few lines of code in your favorite language, such as Python, Scala, or R, and train it on one or multiple GPUs. You'll also learn how to deploy a deep learning application in the cloud or on mobile phones. This lab utilizes GPU resources in the cloud; you are required to bring your own laptop.

Level: Beginner technical
Type: Hands-on Lab
Tags: Deep Learning & Artificial Intelligence; Big Data Analytics

Day: Monday, 04/04
Time: 13:00 - 14:30
Location: Room 210A

L6148 - IBM Watson Developers Lab

Max Kaufmann NLP Developer, IBM Watson Ecosystem, IBM
Max Kaufmann has a B.S. in Linguistics from Grinnell College and an M.S. in Computational Linguistics from the University of Washington. He has previous experience in academia, where he has published papers on topics such as machine translation, and in industry, where he has several NLP-related patents pending. He is currently a member of the IBM Watson Ecosystem team, where he helps fast-growing companies and entrepreneurial-minded organizations use Watson's NLP capabilities to solve real-world problems.

IBM Watson will deliver a unique opportunity for attendees of the NVIDIA GTC Conference and the OpenPOWER Foundation Summit. If you are a developer interested in Machine or Deep Learning, this is not a lab to miss. IBM Watson is a cognitive technology platform that uses Natural Language Processing, Vision, and Machine and Deep Learning to reveal insights from large amounts of unstructured data. The Watson Lab 202 is a hands-on experience with IBM's Watson cognitive platform. Starting with a set of Watson services, attendees of all skill levels will learn how to build apps powered by Watson, gaining experience using cognitive and deep learning technologies. Attendees must bring their own laptops and follow the pre-req instructions.

Pre-Req:

  1. SIGN UP FOR BLUEMIX: Lab attendees should visit https://console.ng.bluemix.net/. Click on the Get Started Free button. Register your information and click on the Create Account button.
  2. INSTALL GIT: Instructions available at https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
  3. INSTALL CLOUD FOUNDRY: Instructions available at https://github.com/cloudfoundry/cli#downloads
  4. Take a look at the lab ahead of time. It's available at https://github.com/biosopher/watson-ipa-web-nodejs. Come prepared with questions, and ideas on how to extend it!

Level: Intermediate technical
Type: Hands-on Lab
Tags: Deep Learning & Artificial Intelligence; Big Data Analytics; Video & Image Processing

Day: Monday, 04/04
Time: 13:00 - 15:00
Location: Room LL21A

TUTORIAL

S6133 - VKCPP: A C++ Layer on Top of Vulkan

Markus Tavenrath Senior Developer Technology Engineer, NVIDIA
Markus Tavenrath experiments with future rendering technologies that bring high interactivity to complex scenes. This work includes both CPU and GPU strategies to solve typical scene graph operations related to rendering. Markus finished his studies in computer science with a focus on computer graphics in 2008. For his diploma thesis, he was one of the first to use ray tracing on CUDA, which brought him straight to NVIDIA. There he primarily worked on GPU ray tracing for SceniX, NVIDIA's scene graph technology, which was showcased at SIGGRAPH 2008. Afterwards, he applied his experience to implement parts of OptiX, improve SceniX, and develop several ray tracing demos. In close cooperation with external partners, he improved rendering performance and scene graph usability as a developer technology engineer.

Learn how the C++ wrapper around Vulkan helps you write more productive Vulkan code, and how Vulkan was integrated into RiX and the rest of the pipeline to simplify rendering with Vulkan even more. We'll also compare the performance of Vulkan and OpenGL based on a CAD-style reference model.
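
For illustration, here is a minimal sketch of what instance creation and device enumeration look like through a C++ wrapper of this kind, assuming present-day vulkan.hpp-style naming; the exact API covered in the session may differ:

    // Minimal sketch (vulkan.hpp-style naming assumed): structs carry
    // the correct sType automatically, and errors surface as exceptions.
    #include <vulkan/vulkan.hpp>
    #include <iostream>

    int main() {
        vk::ApplicationInfo appInfo("VKCPPDemo", 1, "NoEngine", 1, VK_API_VERSION_1_0);
        vk::InstanceCreateInfo createInfo({}, &appInfo);

        // No manual VkResult checking; throws on failure by default.
        vk::Instance instance = vk::createInstance(createInfo);

        for (vk::PhysicalDevice dev : instance.enumeratePhysicalDevices())
            std::cout << dev.getProperties().deviceName << "\n";

        instance.destroy();
        return 0;
    }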

Level: Intermediate technical
Type: Tutorial
Tags: Real-Time Graphics

Day: Monday, 04/04
Time: 13:00 - 13:50
Location: Room LL21B

S6142 - Multi GPU Programming with MPI

Jiri Kraus Compute DevTech Software Engineer, NVIDIA
Highly-Rated Speaker
Jiri Kraus is a senior developer in NVIDIA's European Developer Technology team. As a consultant for GPU HPC applications at the NVIDIA Julich Applications Lab, Jiri collaborates with local developers and scientists at the Julich Supercomputing Centre and the Forschungszentrum Julich. Before joining NVIDIA, Jiri worked on the parallelization and optimization of scientific and technical applications for clusters of multicore CPUs and GPUs at Fraunhofer SCAI in St. Augustin. He holds a diploma in mathematics from the University of Cologne, Germany.

Learn how to program multi-GPU systems or GPU clusters using the message passing interface (MPI) and OpenACC or NVIDIA CUDA. We'll start with a quick introduction to MPI and how it can be combined with OpenACC or CUDA. Then we'll cover advanced topics like CUDA-aware MPI and how to overlap communication with computation to hide communication times. We'll also cover the latest improvements with CUDA-aware MPI, the multi process service (MPS aka Hyper-Q for MPI), and MPI support in the NVIDIA performance analysis tools.
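
As a flavor of the CUDA-aware MPI topic, here is a minimal exchange sketch; it assumes a CUDA-aware MPI build (e.g., a recent Open MPI or MVAPICH2) and a simple one-GPU-per-rank mapping, so details will differ from the session's own examples:

    // Device pointers are handed straight to MPI; a CUDA-aware MPI
    // library moves the data GPU-to-GPU without host staging.
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        cudaSetDevice(rank % 4);  // assumes up to 4 GPUs per node; use the local rank in practice

        const int N = 1 << 20;
        double *d_send, *d_recv;
        cudaMalloc(&d_send, N * sizeof(double));
        cudaMalloc(&d_recv, N * sizeof(double));

        int right = (rank + 1) % size, left = (rank - 1 + size) % size;
        // No cudaMemcpy staging: device buffers go directly into MPI.
        MPI_Sendrecv(d_send, N, MPI_DOUBLE, right, 0,
                     d_recv, N, MPI_DOUBLE, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(d_send); cudaFree(d_recv);
        MPI_Finalize();
        return 0;
    }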

Level: Intermediate technical
Type: Tutorial
Tags: Supercomputing & HPC; Tools & Libraries; OpenACC

Day: Monday, 04/04
Time: 13:00 - 13:50
Location: Room 212A

S6226 - High-Performance Video Encoding on NVIDIA GPUs

Abhijit Patait Director, Multimedia System Software, NVIDIA
Abhijit Patait manages the GPU Multimedia Software team at NVIDIA, which is responsible for video and audio software, GRID SDK, NVENC SDK, and others. Prior to NVIDIA, he worked at various U.S. organizations in areas such as baseband and audio signal processing, telecommunications software, VoIP, and video/image processing. Abhijit holds an MSEE from Missouri S&T University and MBA from University of California, Berkeley.
Eric Young GRID Developer Relations Manager , NVIDIA
Eric Young is an engineering manager in the Content and Technology group, which focuses on applied research with GRID technologies and video compression. Eric specializes in remote graphics, video compression, and technology for games.

We'll provide an overview of video encoding and decoding technologies available using NVIDIA GPUs. In particular, attendees can expect to learn the following: (1) an overview of the NVIDIA Video Codec SDK; (2) new features in NVIDIA video encoding (NVENC) and decoding (NVDEC) hardware in the latest generation of GPU chips; (3) changes and new features in Video Codec SDK 6.0 and the roadmap; and (4) how to use the APIs exposed by the Video Codec SDK for various applications, such as transcoding, capturing, and streaming.

Level: All technical
Type: Tutorial
Tags: Media & Entertainment; Video & Image Processing; Tools & Libraries

Day: Monday, 04/04
Time: 13:00 - 13:50
Location: Room 210F

S6341 - Performance Optimizations for Automotive Software

Stefan Schoenefeld Devtech Engineer, NVIDIA
Highly-Rated Speaker
Stefan Schoenefeld is a senior developer technology engineer at NVIDIA. After working on the Scenix Scene Graph SDK and the Workstation Performance Drivers, he now works on implementing NVIDIA GRID, improving video encoding and new ideas for remote graphics.
Pradeep Chandrahasshenoy Automotive Solution Architect, NVIDIA
Pradeep Chandrahasshenoy is an Automotive Solution Architect at NVIDIA. After working on several IVI systems and software applications ranging from radio and navigation to HMI, he now works on automotive solutions ranging from the cockpit to future self-driving cars.

Learn how to use NVIDIA performance tools to optimize your scene graph and rendering pipeline for use in automotive software. We'll demonstrate the capabilities of these tools using some simple Qt-based examples and will look at some of the more common mistakes in writing efficient software and how to avoid them.

Level: Intermediate technical
Type: Tutorial
Tags: Self-Driving Cars & Automotive ; Real-Time Graphics ; Performance Optimization

Day: Monday, 04/04
Time: 13:00 - 13:50
Location: Room LL21E

S6385 - What is Cloud and What Can It Do For Your Desktop Workloads

Jeff Weiss GRID Solutions Architect Manager, NVIDIA
Jeff Weiss is the GRID solutions architect manager for North America working with the Solution Architecture & Engineering team at NVIDIA. Previously, Jeff worked for seven years at VMware as an EUC staff engineer, and at Symantec and Sun Microsystems. Along with his current focus on NVIDIA GRID vGPU-enabled end-user computing, his experience includes data center business continuity/disaster recovery solutions, software infrastructure identity management, and email security/archiving tools.
Matt Coppinger Director, Technical Marketing & Enablement, End User Computing, VMware
Matt Coppinger is director of Technical Marketing and Enablement for End User Computing at VMware. Matt has worked on desktop virtualization since its inception at VMware in 2007, first in engineering, then as a field consultant, and now within his current role. He has authored a number of reference architectures for VMware, including virtualizing 3D applications, and has spoken on the technical aspects of desktop virtualization at VMworld and other major conferences since 2010.
Stephane Asselin Senior EUC Architect, Technical Enablement, VMWare
Stephane Asselin is an architect for the End-User Computing business unit at VMware. Recently, he had national responsibility for Canada for EUC planning, designing and implementing virtual infrastructure solutions and all processes involved. At VMware, he has worked on EUC pre-sales activities, internal IP, product development and technical specialist lead on BETA programs. He has done work as a Subject Matter Expert for project Octopus, Horizon, View, vCOps and ThinApp.

This is a technical deep dive into the GPU-enabled data center and building out of an NVIDIA GRID vGPU-enabled workflow on a cloud platform with VMware and NVIDIA.

Level: Intermediate technical
Type: Tutorial
Tags: Graphics Virtualization

Day: Monday, 04/04
Time: 13:00 - 14:20
Location: Room 210G

S6474 - From Workstation to Embedded: Accelerated Deep Learning on NVIDIA Jetson™ TX1

Julie Bernauer Senior Solutions Architect, NVIDIA
Julie is a Senior Solutions Architect for Deep Learning at NVIDIA. She attended ENS Cachan from 2001 to 2004 where she received a degree in Physical Chemistry. She obtained her PhD from Université Paris-Sud in 2006 while performing research in the Yeast Structural Genomics group. Her thesis focused on the use of Voronoi models for modelling protein complexes. After a post-doctoral position at Stanford University with Pr. Michael Levitt, Nobel Prize in Chemistry 2013, she joined Inria, the French National Institute for Computer Science. While Senior Research Scientist at Inria, Adjunct Associate Professor of Computer Science at École Polytechnique and Visiting Research Scientist at SLAC, her work focused on computational methods for structural bioinformatics, specifically scoring functions for macromolecular docking using machine learning, and statistical potentials for molecular simulations. She was the first to successfully introduce machine learning for coarse-grained models in the CAPRI challenge.

Running deep learning inference tasks on embedded platforms often requires deployment of pretrained models. Finding the best hyperparameters and training are usually performed on a workstation or large-scale system to obtain the best model. In this talk, we'll show, through examples using common frameworks, how to train models on a workstation and deploy them on embedded platforms such as the NVIDIA® Jetson™ TX1 or NVIDIA DRIVE™ PX. We'll also show dedicated tools and how to monitor performance and debug issues on embedded platforms for easy demo setup. This talk will include a live demo session.

Level: All technical
Type: Tutorial
Tags: Deep Learning & Artificial Intelligence; Embedded; Aerospace & Defense

Day: Monday, 04/04
Time: 13:00 - 13:50
Location: Grand Ballroom

S6510 - Targeting GPUs with OpenMP 4.5 Device Directives

James Beyer Senior Runtime Engineer, NVIDIA
James Beyer recently moved to NVIDIA after a 15+ year tenure at Cray Inc. He is a longtime member of the OpenMP language committee and co-chair of the OpenMP subcommittee on accelerator directives. He was one of the founding members of the OpenACC specification and remains an active member of the OpenACC technical committee.
Jeff Larkin DevTech, NVIDIA
Highly-Rated Speaker
Jeff Larkin is a software engineer in NVIDIA's developer technology (DevTech) group where he works on porting and optimizing HPC applications. He is also closely involved with the development of both the OpenACC and OpenMP specifications. Prior to joining NVIDIA, Jeff worked in Cray's Supercomputing Center of Excellence at Oak Ridge National Laboratory.

This tutorial will present the best practices for writing OpenMP for GPUs using the Clang compiler. We'll introduce the strengths and weaknesses of the OpenMP target constructs and how to use them effectively on NVIDIA GPUs.
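
As a taste of the target constructs in question, here is a minimal offload sketch; it assumes an offloading-capable Clang build (invoked with something like -fopenmp -fopenmp-targets=nvptx64), and the loop itself is illustrative:

    #include <cstdio>

    int main() {
        const int N = 1 << 20;
        float* x = new float[N];
        float* y = new float[N];
        for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        // "target" offloads the region to the device; "teams distribute
        // parallel for" spreads iterations over thread blocks and threads.
        #pragma omp target teams distribute parallel for \
                map(to: x[0:N]) map(tofrom: y[0:N])
        for (int i = 0; i < N; ++i)
            y[i] += 2.0f * x[i];

        printf("y[0] = %f\n", y[0]);  // expect 4.0
        delete[] x; delete[] y;
        return 0;
    }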

Level: Intermediate technical
Type: Tutorial
Tags: Tools & Libraries; Performance Optimization

Day: Monday, 04/04
Time: 13:00 - 14:20
Location: Room 211A

S6643 - Introduction to the NVIDIA OptiX™ Ray Tracing Engine

Austin Robison Senior Product Manager, NVIDIA
Austin Robison is a product manager, technologist, and futurist with a passion for ray tracing. Austin has worked in computer graphics, wearables, and consumer software for over a decade at startups and Silicon Valley heavyweights NVIDIA and Google. He attended the University of Utah and the University of Chicago, graduating from both with degrees in computer science. He is one of the authors of the first version of OptiX.

Learn about the OptiX ray tracing engine, a sophisticated library for performing GPU ray tracing. We'll provide an overview of the OptiX ray tracing pipeline and the programmable components that allow for the implementation of many algorithms and applications. OptiX can be used in many domains, ranging from rendering to acoustic modeling to scientific visualization. Several case studies will be presented describing the benefits of integrating OptiX into third-party applications.

Level: Intermediate technical
Type: Tutorial
Tags: Rendering & Ray Tracing

Day: Monday, 04/04
Time: 13:00 - 13:50
Location: Room 210E

Tutorial

TALK

Presentation
Details

S6783 - VisionWorks™: A CUDA Accelerated Computer Vision Library

Elif Albuz Computer Vision Software Lead, NVIDIA
Elif Albuz is the technical lead for the VisionWorks Toolkit at NVIDIA, driving features and optimizations with CUDA acceleration on Tegra GPUs. Before joining the Computer Vision group, she led the CUDA FFT library; designed new algorithms for motion estimation, super-resolution, and frame-rate up-conversion and accelerated them on NVIDIA GPUs; designed architecture for error concealment and adaptive quantization for video codec hardware; and implemented low-level code for H.264 and MPEG-2 codecs. Prior to joining NVIDIA, she worked at Sony Electronics in the Multimedia Research Labs, leading the DVD decoder firmware stack used in DVD players and the PlayStation 2, implementing a real-time OS for multiprocessor systems, and accelerating H.264 using SIMD. Elif holds a dual degree in electrical engineering and computer science, where she focused on artificial intelligence and robotics, and a master's degree in electrical engineering, where she did research on content-based image retrieval, parallel architectures, and algorithms.

In this talk, we will introduce the NVIDIA VisionWorks™ toolkit, a software development package for computer vision (CV) and image processing. VisionWorks™ implements and extends the Khronos OpenVX standard, and is optimized for CUDA-capable GPUs and SoCs, enabling computer vision applications on a scalable and flexible platform. VisionWorks implements a thread-safe API and framework for seamlessly adding user-defined primitives. The talk will give an overview of the VisionWorks toolkit, the OpenVX API and framework, VisionWorks-plus modules (including the VisionWorks Structure From Motion and Object Tracker modules), and computer vision pipeline samples showing integration of the library API into a computer vision pipeline on Tegra platforms.

Level: All technical
Type: Talk
Tags: Computer Vision & Machine Vision; Embedded; Self-Driving Cars & Automotive

Day: Monday, 04/04
Time: 13:00 - 13:50
Location: Room LL20A

S6815 - Advanced Rendering with DirectX®

Oleg Kuznetsov DevTech, NVIDIA
Oleg began his professional career at NVIDIA in 2005 at the tender age of 20. Starting as a game tester, Oleg worked his way up to Developer Technology Engineer, optimizing different games from well-known legends like S.T.A.L.K.E.R. and Witcher 2 to the current and upcoming AAA titles. When not analyzing shader code, Oleg enjoys riding his Honda VFR1200 and snowboarding.

This talk focuses on some of the new features that DX12 and DX11.3 introduce, and touches on how to drive DX12 efficiently. Among other things, the slides will shed light on the use of predication, ExecuteIndirect, and explicit MGPU in DX12.

Level: Beginner technical
Type: Talk
Tags: Rendering & Ray Tracing; Game Development

Day: Monday, 04/04
Time: 13:00 - 13:50
Location: Room LL20C

Talk

TUTORIAL

Presentation
Details

S6822 - Session 3 of 4: Asynchronous Operations & Dynamic Parallelism in CUDA (Presented by Acceleware)

Dan Cyca Chief Technology Officer, Acceleware LTD
Highly-Rated Speaker
Regarded as a leading mind in the field of parallel processing, Dan Cyca has extensive experience working with GPUs, clusters and multi-core solutions. Dan joined Acceleware in 2004 as a software developer to build the company's first product. Since then, he has served in many technical and leadership roles in the company. Most recently, as the Director of Engineering, Dan was responsible for managing the software development group. Prior to Acceleware, Dan's experience included developing 'C-to-hardware' compilers, and implementing digital signal processing and encryption algorithms on FPGAs. Dan has an M. Sc. in Electrical Engineering from the University of Calgary.

This tutorial builds on the two previous sessions (An Introduction to GPU Programming and the Introduction to GPU Memory Model) and is intended for those with a basic understanding of CUDA programming. This session dives deep into asynchronous operations and how to maximize throughput on both the CPU and GPU with streams. We will demonstrate how to build a CPU/GPU pipeline and how to design your algorithm to take advantage of asynchronous operations. The second part of the session will focus on dynamic parallelism. A programming demo involving asynchronous operations will be delivered. Printed copies of the material will be provided to all attendees for each session – collect all four!
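
For orientation, here is a minimal sketch of the stream-based CPU/GPU pipeline pattern this session covers: chunked copies and kernels issued into separate streams so transfers and compute overlap. Pinned host memory is assumed (required for truly asynchronous copies), and the sizes are illustrative:

    #include <cuda_runtime.h>

    __global__ void scale(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        const int nStreams = 4, chunk = 1 << 20;
        float *h_data, *d_data;
        cudaMallocHost(&h_data, nStreams * chunk * sizeof(float));  // pinned host memory
        cudaMalloc(&d_data, nStreams * chunk * sizeof(float));

        cudaStream_t streams[nStreams];
        for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

        for (int s = 0; s < nStreams; ++s) {
            size_t off = (size_t)s * chunk;
            // Copy-in, compute, copy-out for chunk s; chunks issued to
            // different streams overlap with one another.
            cudaMemcpyAsync(d_data + off, h_data + off, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, streams[s]);
            scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + off, chunk);
            cudaMemcpyAsync(h_data + off, d_data + off, chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, streams[s]);
        }
        cudaDeviceSynchronize();

        for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
        cudaFree(d_data); cudaFreeHost(h_data);
        return 0;
    }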

Level: All technical
Type: Tutorial
Tags: Programming Languages; Supercomputing & HPC

Day: Monday, 04/04
Time: 13:00 - 14:20
Location: Room LL20D

Tutorial

HANGOUT

Presentation
Details

H6130A - Hangout: Vulkan

Neil Trevett Vice President Developer Ecosystem, NVIDIA
Neil has spent over thirty years in the 3D graphics industry - and by day drives the advanced apps ecosystem on NVIDIA Tegra mobile and embedded devices. By night, Neil is the elected President of the Khronos Group industry standards consortium where he initiated the OpenGL ES standard now used by billions worldwide every day, helped catalyze the WebGL project to bring interactive 3D graphics to the Web, chairs the OpenCL working group defining the open standard for heterogeneous parallel computation and has helped create and launch the new generation Vulkan API.

Vulkan™ is a new-generation, royalty-free, open standard API to accelerate graphics and compute on modern GPUs. This ground-up design, complementing the OpenGL® and OpenGL ES™ 3D APIs, gives applications direct control over GPU acceleration for maximized performance and predictability, with minimized CPU overhead and efficient multi-threaded performance. NVIDIA supports Vulkan across its product families, including Quadro and GeForce with Windows and Linux, and Tegra with Android and Linux. These hangouts give you the chance to interact with NVIDIA personnel who have helped create this exciting new standard and can give you tips and tricks on how to use Vulkan to get the most out of your NVIDIA GPU.

Level: All technical
Type: Hangout
Tags: Real-Time Graphics

Day: Monday, 04/04
Time: 14:00 - 15:00
Location: Pod A

H6139 - Hangout: Maximizing Performance of CUDA and OpenGL Applications

Dennis Sandler Devtech Professional visualization, NVIDIA
Dennis Sandler is a senior developer technology engineer at NVIDIA. Dennis works on NVIDIA middleware and with NVIDIA partners and customers to accelerate their applications on the GPU, focusing on advanced real-time rendering techniques with OpenGL and Vulkan as well as reliable, fast, and scalable GPGPU processing pipelines on CUDA and OpenCL. Dennis works in a wide variety of areas of professional visualization: CAD/CAM, DCC, non-linear video editing and processing, training simulation, engineering computations, computer vision, and scientific visualization. Prior to joining NVIDIA, he worked on a CUDA-based computational fluid dynamics solver visualized in OpenGL at the Department of Thermophysics of Moscow Power Engineering Institute.

Come ask your questions about CUDA, OpenGL and related performance issues or features during this Hangout.

Level: All technical
Type: Hangout
Tags: Performance Optimization

Day: Monday, 04/04
Time: 14:00 - 15:00
Location: Pod C

H6157 - Performance Analysis and Optimization

Maxim Milakov Senior Developer Technology Engineer, NVIDIA
Maxim Milakov joined NVIDIA in January 2012. Since then he has been working on accelerating scientific and industrial applications on GPUs in various fields including machine learning, developing codes and algorithms to maximize performance while also collaborating with internal teams to help co-design future hardware and tools.
Julien Demouth Senior Developer Technology Engineer, NVIDIA
Highly-Rated Speaker
Julien Demouth is a member of the Developer Technology team at NVIDIA where he works on accelerating applications on GPUs. He holds a Ph.D in Computational Geometry from INRIA / Université Nancy 2 in France.

Come ask your GPU code optimization questions to experts in the field. Hangouts provide an opportunity for you to ask topic experts specific questions. Come on in. Find a person wearing an Ask Me button and ask a question!

Level: All technical
Type: Hangout
Tags: Performance Optimization

Day: Monday, 04/04
Time: 14:00 - 15:00
Location: Pod B

Hangout

TUTORIAL

Presentation
Details

S6244 - Implement Physically Based Ray Tracing with NVIDIA OptiX™ and MDL

Detlef Roettger Senior Developer Technology Engineer, NVIDIA
Detlef Roettger has been working on GPU ray tracing at NVIDIA since 2008. He joined the company in 2000 as a senior systems software engineer in the OpenGL driver team with a focus on workstation graphics performance and ISV partner support. Later he joined the Workstation Middleware Software team, which implemented the SceniX scene graph and specialized application drivers. Detlef has worked on many of the early SIGGRAPH and GTC ray tracing demos and helps ISV partners with specific OptiX implementation questions. He's been programming computer graphics for over 33 years, and his main interest in ray tracing is photo-realistic rendering. Prior to joining NVIDIA, Detlef worked as OpenGL driver team manager at ELSA GmbH, Aachen, and helped start the Quadro business with the first workstation products using NVIDIA GPUs. He holds a diploma in computer science from the Technical University Clausthal, Germany.

Learn how to implement a physically based ray tracing renderer with NVIDIA OptiX that supports the Material Definition Language (MDL). The concepts and specific renderer design decisions needed to support the fundamental building blocks of the MDL specification are explained using a global illumination path tracer implemented with OptiX as an example. Special attention has been given to the material description code inside that renderer, which expresses complex material hierarchies via standard C++ mechanisms in a readable manner, with the goal of automatic code generation from MDL files ultimately done via the MDL SDK.

Level: Advanced technical
Type: Tutorial
Tags: Rendering & Ray Tracing; Real-Time Graphics ; Media & Entertainment

Day: Monday, 04/04
Time: 14:00 - 14:50
Location: Room 210E

S6338 - VR Multi GPU Acceleration Featuring Autodesk VRED

Kai Ingo Esser Senior Engineer, Developer Technology, NVIDIA
Ingo Esser is a senior developer technology engineer in NVIDIA's Professional Solutions Group, where he helps ISVs improve their rendering algorithms. These ISVs mostly work in the automotive and oil and gas domains, where either rendering complex surfaces or visualizing large datasets is an issue. He has a diploma in computer science from the chair for Computer Graphics and Multimedia at RWTH Aachen, Germany.
Michael Nikelsky Principal Engineer , Autodesk
Michael Nikelsky is a principal engineer at Autodesk working on the VRED product line. His field of work focuses on rendering and ray-tracing techniques as well as shader development. He has a diploma in computer science from the University of Koblenz-Landau, Germany.

Recently, stereo rendering, in particular HMD VR rendering, has seen a large growth in popularity. We'll present a new VR SLI OpenGL extension that is well suited for rendering stereo workloads by utilizing multi-GPU rendering. We'll explain the extension and how it can be used to benefit various applications. Later, we'll showcase a current real-world use of the extension in Autodesk's VRED and review the gains achieved.

Level: Intermediate technical
Type: Tutorial
Tags: Virtual Reality & Augmented Reality; Real-Time Graphics ; Product & Building Design

Day: Monday, 04/04
Time: 14:00 - 14:50
Location: Room LL20C

S6416 - See the Big Picture: How to Build Large Display Walls Using NVIDIA DesignWorks™ APIs/Tools

Doug Traill Senior Solutions Architect, NVIDIA
Highly-Rated Speaker
Doug Traill is a senior solutions architect at NVIDIA responsible for scalable visualization technologies. In this role, he works with system integrators, developers, and customers to help design and implement complex visualization systems. Prior to NVIDIA, he worked at Silicon Graphics for nine years in various technical roles, including solutions architect and visualization product manager. During his career, Doug has helped design and build some of the world's largest visualization centers. He holds a B.S. in electronic systems and microprocessor engineering from the University of Glasgow, U.K., as well as an M.S. in telecommunications business management from King's College London, U.K.

The need to drive multiple displays, be it for digital signage, a corporate conference room, or even an immersive VR room, is becoming more common. We'll provide an overview of the display management tools and APIs that are part of NVIDIA's DesignWorks SDK. Attendees will learn about NVIDIA MOSAIC; display setup and management using NVAPI + NVWMI; synchronization methods; and warp and blend APIs.

Level: All technical
Type: Tutorial
Tags: Large Scale and Multi-Display Visualization; Virtual Reality & Augmented Reality

Day: Monday, 04/04
Time: 14:00 - 15:20
Location: Room LL21B

S6729 - Teach Robotics with the New Jetson™ GPU Teaching Kit for Educators

Joe Bungo GPU Educators Program Manager, NVIDIA
Joe Bungo is the GPU Educators Program Manager at NVIDIA, where he enables the use of NVIDIA and GPU technologies in universities in a variety of ways, including curriculum and teaching material development, facilitation of academic ecosystems, and hands-on instructor workshops. Previously, he managed the university program at ARM Inc. and worked as an applications engineer there.
John Seng Professor, Cal Poly State University, San Luis Obispo
John Seng is a professor in the Computer Science department at Cal Poly State University, San Luis Obispo. He is also part of the Cal Poly Computer Engineering Program.

As performance and functionality requirements of interdisciplinary robotics applications rise, industry demand for new graduates familiar with GPU-accelerated computer vision, machine learning and other robotics concepts grows. We'll introduce you to a comprehensive set of academic labs and university teaching material targeted at the NVIDIA Tegra-based Jetson embedded computing platform for use in introductory and advanced interdisciplinary robotics courses. The teaching materials start with the basics and focus on programming the Jetson platform, and include advanced topics such as computer vision, machine learning, robot localization and controls.

Level: All technical
Type: Tutorial
Tags: Robotics & Autonomous Machines; Computer Vision & Machine Vision

Day: Monday, 04/04
Time: 14:00 - 14:50
Location: Room 212A

S6733 - NVIDIA's Deep Learning Car Computer - DRIVE PX

Shri Sundaram Senior Product Manager - DRIVE PX, NVIDIA
Shri Sundaram is product manager of DRIVE PX, the world's first AI supercomputer for self-driving cars, at NVIDIA. Previously, he was software product manager for the NVIDIA SHIELD Android TV. He joined the company in 2011 as product manager responsible for building imaging partnerships for NVIDIA's presence in the smartphone market. Prior to NVIDIA, Shri worked for 10 years at Toshiba America as product line manager for a $150 million semiconductor product line. He holds a bachelor's degree in instrumentation engineering from BITS Pilani, India, and an MBA from the Thunderbird School of Global Management, in Arizona. He has also completed an executive program in Marketing Leadership from INSEAD, Singapore.

At CES 2016, NVIDIA launched DRIVE PX 2, the world's first AI supercomputer designed for autonomous vehicles. DRIVE PX is a lot more than that: it is an incredible development platform for writing autonomous car applications, and a reference design for Tier 1s and OEMs to reuse for safety-critical ECUs meant for Level 3/4/5 autonomy (as defined by SAE International). This talk will present the under-the-hood details of what makes it an AI supercomputer, a development platform, and a reference platform for autonomous cars.

Level: All technical
Type: Tutorial
Tags: Self-Driving Cars & Automotive ; Deep Learning & Artificial Intelligence

Day: Monday, 04/04
Time: 14:00 - 14:50
Location: Room LL21E

S6739 - VisionWorks™ Toolkit Programming

Thierry Lepley Senior SW Engineer, NVIDIA
Thierry Lepley is Senior Computer Vision Engineer at NVIDIA and the NVIDIA representative in the OpenVX standardization group. His focus is on the development of optimized computer vision toolkits and libraries for real-time embedded systems. Earlier, Thierry was Principal Engineer at STMicroelectronics, working on many-core acceleration for computer vision, where he developed a compiler that automates the parallelization of image processing pipelines and the management of image tiling.

In this tutorial, we'll dive into programming with the VisionWorks toolkit, an NVIDIA SDK for computer vision (CV) that implements and extends the new OpenVX standard. The tutorial will cover the most important aspects of the API, such as data objects and CV graphs. We'll discuss NVIDIA extensions and how to interoperate with CUDA and OpenCV. Finally, we'll look at different ways of debugging and profiling an application developed with VisionWorks.
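
For orientation, here is a minimal sketch of the graph-based programming model using standard OpenVX 1.0 calls, which VisionWorks implements; the NVIDIA-specific extensions and interop discussed in the tutorial are not shown:

    // Build and run a one-node OpenVX graph (Gaussian blur).
    #include <VX/vx.h>
    #include <stdio.h>

    int main(void) {
        vx_context context = vxCreateContext();

        // Data objects are opaque and managed by the framework.
        vx_image input  = vxCreateImage(context, 640, 480, VX_DF_IMAGE_U8);
        vx_image output = vxCreateImage(context, 640, 480, VX_DF_IMAGE_U8);

        // A graph describes the whole pipeline up front, letting the
        // runtime optimize scheduling and memory across nodes.
        vx_graph graph = vxCreateGraph(context);
        vxGaussian3x3Node(graph, input, output);

        if (vxVerifyGraph(graph) == VX_SUCCESS)
            vxProcessGraph(graph);  // execute the pipeline
        else
            printf("graph verification failed\n");

        vxReleaseGraph(&graph);
        vxReleaseImage(&input);
        vxReleaseImage(&output);
        vxReleaseContext(&context);
        return 0;
    }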

Level: Intermediate technical
Type: Tutorial
Tags: Computer Vision & Machine Vision

Day: Monday, 04/04
Time: 14:00 - 15:20
Location: Room LL20A

S6807 - Deep Dive into Dynamic Parallelism Performance

Christoph Angerer DevTech Compute, NVIDIA
Christoph Angerer is a developer in NVIDIA's European Developer Technology team. Based in Munich, Germany, he works with developers accelerating applications on GPUs. He holds a Ph.D. in computer science from ETH Zurich in Switzerland.
Shankara Rao Thejawsi Nanditale DevTech Compute, NVIDIA
Thejaswi is a Compute Devtech Engineer at NVIDIA who is interested in programming GPUs, machine learning, and bioinformatics.

Dynamic parallelism enables a CUDA kernel to create and synchronize new nested work by launching child kernels from the GPU. Such a nested parallelism programming model maps directly to many real-world programming patterns like adaptive grids or tree-traversal based computations. We'll systematically analyze the performance characteristics of dynamic parallelism by means of real-world application case studies and suggest programming guidelines to get the best performance out of the dynamic parallelism feature. (This talk will be held in collaboration with Thejaswi Rao.)
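
For reference, a minimal sketch of the feature itself: a parent kernel launching a child kernel per work item, the pattern behind adaptive grids and tree traversals. It requires compilation with -rdc=true for a device of compute capability 3.5 or later; the session's case studies are considerably more involved:

    #include <cstdio>

    __global__ void child(int parent, int n) {
        if (threadIdx.x == 0)
            printf("child of parent %d processing %d items\n", parent, n);
    }

    __global__ void parent(const int* workPerItem, int nItems) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nItems) {
            // Each parent thread launches a child sized to its own work.
            child<<<1, 32>>>(i, workPerItem[i]);
        }
    }

    int main() {
        const int nItems = 4;
        int h_work[nItems] = {10, 100, 50, 5};
        int* d_work;
        cudaMalloc(&d_work, nItems * sizeof(int));
        cudaMemcpy(d_work, h_work, nItems * sizeof(int), cudaMemcpyHostToDevice);

        parent<<<1, nItems>>>(d_work, nItems);
        cudaDeviceSynchronize();
        cudaFree(d_work);
        return 0;
    }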

Level: Intermediate technical
Type: Tutorial
Tags: Performance Optimization

Day: Monday, 04/04
Time: 14:00 - 14:50
Location: Room 210F

Tutorial

TALK

Presentation
Details

S6504 - A Data-Driven Methodology for NVIDIA GRID™ vGPU™ Sizing

Jeremy Main Senior Solution Architect, NVIDIA
Jeremy is the Senior Solution Architect for NVIDIA's GRID enterprise graphics virtualization in Japan. He works to architect solutions for organizations to deliver high-fidelity GPU-accelerated desktops and applications. Before joining NVIDIA, Jeremy led the development of several remote graphics products as well as 3D CAD software development. Jeremy received his Bachelor of Science from the University of Utah.
Milan Diebel Senior Product Manager, NVIDIA
Milan is the Senior Product Manager for the NVIDIA GRID product family. He has been working in the technology sector for 15 years in a variety of roles. Milan holds a PhD in Physics from the University of Washington as well as an MBA from Cornell University.

GRID vGPU sizing is often viewed as more of an art form than a science. One of the challenges is that synthetic performance benchmarks are not a good representation of actual user workloads in virtualized environments. Influenced by many customer interactions, we will be introducing a systematic way of producing sizing information utilizing both real user workloads and synthetic performance benchmarks.

Level: Intermediate technical
Type: Talk
Tags: Graphics Virtualization

Day: Monday, 04/04
Time: 14:30 - 15:20
Location: Room 210G

Talk

HANGOUT

Presentation
Details

H6119 - Hangout: GPU-Based Video Processing

Thomas True Senior Applied Engineer for Professional Video and Image Processing, NVIDIA
Thomas True is a Senior Applied Engineer for Professional Video and Image Processing in NVIDIA's Professional Solutions Group, where he focuses on the use of GPUs in broadcast, video, and film applications ranging from pre-visualization to post production and live-to-air. Prior to joining NVIDIA, Tom was an applications engineer at SGI. Thomas has an M.S. degree in computer science from the Graphics Lab at Brown University and a B.S. degree from the Rochester Institute of Technology.

In this hangout, developers can ask questions and talk to experts regarding GPU-based video processing.

Level: All technical
Type: Hangout
Tags: Video & Image Processing

Day: Monday, 04/04
Time: 15:00 - 16:00
Location: Pod C

H6142A - Hangout: Iray® Rendering for Developers

Martin-Karl Lefrancois DevTech Plateform Lead, NVIDIA
Martin-Karl Lefrançois is a senior software engineer and team lead in NVIDIA's Developer Technology organization in Berlin. Martin-Karl works with various NVIDIA rendering and core development teams to bring clients the best rendering experience. Prior to joining NVIDIA, Martin-Karl worked at mental images to deliver automatic GPU support in mental ray. After graduating with a degree in computer science and mathematics from the University of Sherbrooke in Quebec, he worked as a graphic developer for nearly ten years at Softimage in Montreal and Tokyo before leading the core game engine team at A2M.

Discuss how key applications have incorporated NVIDIA® Iray® and how you can do the same. Get an overview of Iray, see examples of Iray integrations, learn what exactly MDL is, and find out how rendering works using the NVIDIA Visual Computing Appliance (VCA). Come discuss and learn how to create a custom rendering experience that fits your application.

Level: All technical
Type: Hangout
Tags: Rendering & Ray Tracing

Day: Monday, 04/04
Time: 15:00 - 16:00
Location: Pod A

H6158 - Hangout: Algorithms and Numerical Techniques

Natalia Gimelshein Senior Parallel Computing Software Engineer, NVIDIA
Natalia Gimelshein joined NVIDIA in 2014. She works on accelerating deep learning applications. She has a master's degree in aerospace engineering from Penn State University.
Simon Layton TBA, NVIDIA
TBA

Come ask your GPU code optimization questions to experts in the field. Hangouts provide an opportunity for you to ask topic experts specific questions. Come on in. Find a person wearing an Ask Me button and ask a question!

Level: All technical
Type: Hangout
Tags: Algorithms

Day: Monday, 04/04
Time: 15:00 - 16:00
Location: Pod B

Hangout

HANDS-ON LAB

Presentation
Details

L6123 - Advanced OpenACC Programming

Jeff Larkin DevTech Software Engineer, NVIDIA
Highly-Rated Speaker
Jeff Larkin is a software engineer in NVIDIA's Developer Technology (DevTech) group, where he works on porting and optimizing HPC applications. He is also closely involved with the development of both the OpenACC and OpenMP specifications. Prior to joining NVIDIA, Jeff worked in Cray's Supercomputing Center of Excellence at Oak Ridge National Laboratory.

This tutorial will teach experienced OpenACC programmers several techniques that will take their code to the next level. Participants will learn via hands-on examples how to pipeline computations to overlap data transfers with computation, how to program multiple GPUs, how to interoperate OpenACC with other technologies, and more. Participants must be comfortable with C/C++ or Fortran programming and already have experience with OpenACC programming. This lab utilizes GPU resources in the cloud; you are required to bring your own laptop.
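
As a flavor of the pipelining technique, here is a minimal sketch using OpenACC async queues so that one block's data transfer overlaps another block's compute; the block count and two-queue assignment are illustrative choices, not the lab's exact exercise:

    #include <stdlib.h>

    #define N (1 << 24)
    #define NBLOCKS 8

    int main(void) {
        float* restrict a = malloc(N * sizeof(float));
        for (int i = 0; i < N; ++i) a[i] = (float)i;

        int bs = N / NBLOCKS;
        #pragma acc data create(a[0:N])
        {
            for (int b = 0; b < NBLOCKS; ++b) {
                int start = b * bs;
                // Stage block b on async queue b%2; work on different
                // queues overlaps, forming a copy/compute pipeline.
                #pragma acc update device(a[start:bs]) async(b % 2)
                #pragma acc parallel loop async(b % 2)
                for (int i = start; i < start + bs; ++i)
                    a[i] = a[i] * 2.0f + 1.0f;
                #pragma acc update self(a[start:bs]) async(b % 2)
            }
            #pragma acc wait
        }
        free(a);
        return 0;
    }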

Level: Advanced technical
Type: Hands-on Lab
Tags: Programming Languages; OpenACC

Day: Monday, 04/04
Time: 15:00 - 16:30
Location: Room 210B

L6139 - A Tutorial on More Ways to Use DIGITS

Allison Gray Solutions Architect, NVIDIA
Allison Gray works at NVIDIA as a Solutions Architect, helping partners use GPUs for machine learning.

DIGITS is a visualization tool designed to help decrease the development time of a trained neural network. Its capabilities allow researchers to easily visualize different aspects of their networks and data sets. In this lab, you'll learn how to create and train a Siamese network and an autoencoder, and how to add different layer types. This lab utilizes GPU resources in the cloud; you are required to bring your own laptop.

Level: Intermediate technical
Type: Hands-on Lab
Tags: Deep Learning & Artificial Intelligence

Day: Monday, 04/04
Time: 15:00 - 16:30
Location: Room 210A

L6152 - IBM Watson Developers Lab

Niyati Parameswaran Data Scientist for the IBM Watson Ecosystem, IBM
Niyati works as a data scientist for the Watson Ecosystem team. A dream of providing a machine with intelligence that is unique, that can augment our own distinctive intelligence, and that ultimately hopes to answer Alan Turing's question of 'Can machines think?' motivates her research. She holds a Bachelor's in Computer Science from the Birla Institute of Science & Technology in India, and a Master's in Computer Science with a specialization in Machine Learning and Natural Language Processing from the University of Texas at Austin. She has worked on building deep QA systems, knowledge graphs, and recommendation engines. At Watson, she is known for developing core ML and NLP algorithmic paradigms, conceptualizing cognitive solutions for partners to bring them to scale, and providing data science as a service to the Ecosystem team.

IBM Watson will deliver a unique opportunity for attendees of the NVIDIA GTC Conference and the OpenPOWER Foundation Summit. If you are a developer interested in machine or deep learning, this is not a lab to miss. IBM Watson is a cognitive technology platform that uses natural language processing, vision, and machine and deep learning to reveal insights from large amounts of unstructured data. The Watson Lab 202 is a hands-on experience with IBM's Watson cognitive platform. Starting with a set of Watson services, attendees of all skill levels will learn how to build apps powered by Watson, gaining experience using cognitive and deep learning technologies. Attendees must bring their own laptops and follow the pre-req instructions.

Pre-Req:

  1. SIGN UP FOR BLUEMIX: Lab attendees should visit https://console.ng.bluemix.net/. Click on the Get Started Free button. Register your information and click on the Create Account button.
  2. INSTALL GIT: Instructions available at https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
  3. INSTALL CLOUD FOUNDRY: Instructions available at https://github.com/cloudfoundry/cli#downloads
  4. Take a look at the lab ahead of time. It's available at https://github.com/biosopher/watson-ipa-web-nodejs. Come prepared with questions, and ideas on how to extend it!

Level: Intermediate technical
Type: Hands-on Lab
Tags: Deep Learning & Artificial Intelligence; Big Data Analytics; Video & Image Processing

Day: Monday, 04/04
Time: 15:00 - 17:00
Location: Room LL21A

Hands-on Lab

TUTORIAL

Presentation
Details

S6306 - ESI Rendering Innovations with NVIDIA DesignWorks™

Andreas Mank Team Leader Visualization, ESI Software Germany
As leader of the visualization team in the BU Immersive Experience at ESI Software Germany, Andreas Mank is responsible for driving advances in visualization technologies and delivering state-of-the-art, high-performance immersive engineering visualization as well as advanced high-quality rendering with ESI software products. Andreas studied media computer science at the University of Applied Sciences in Wedel. He has over 10 years of experience in virtual reality-related software development and in recent years has been working as a team leader in R&D.
Markus Tavenrath Senior Developer Technology Engineer, NVIDIA
Markus Tavenrath experiments with future rendering technologies that bring high interactivity to complex scenes. This work includes both CPU and GPU strategies to solve typical scene-graph operations related to rendering. Markus finished his studies in computer science with a focus on computer graphics in 2008. He was one of the first to use ray-tracing on CUDA for his diploma thesis, which brought him straight to NVIDIA. There, he primarily worked on GPU raytracing for SceniX, NVIDIA's scene-graph technology, first showcased at SIGGRAPH 2008. Later he applied his experience to implement parts of OptiX, improve SceniX, and develop several ray-tracing demos. In close cooperation with external partners, he improved rendering performance and scenegraph usability as developer technology engineer.

We'll present the evolution of the high performance renderer from ESI Group, a pioneer and world-leading provider in virtual prototyping, leveraging the physics of materials. We'll describe the next steps towards a comprehensive physically based rendering framework, and take a look behind the scenes at NVIDIA's nvpro-pipeline technology as well as the concepts and design details required to integrate it into ESI Software Germany's new rendering framework for immersive environments. Established in more than 40 countries worldwide, ESI Group helps industrial clients shorten their product development cycle by eliminating the need for physical prototypes using interactive rendering technology.

Level: Beginner technical
Type: Tutorial
Tags: Rendering & Ray Tracing; Product & Building Design; Real-Time Graphics

Day: Monday, 04/04
Time: 15:00 - 15:50
Location: Room 210E

S6339 - Robust Software Development: Bug Prevention and Isolation

Erika Dignam Bug Triager/Tech PM, NVIDIA
Erika Dignam started off in computer arts before making her way to NVIDIA in 2007. At NVIDIA she started as a QA engineer working with top industry applications, then spent several years as an ISV Program Manager, and finally moved into the OpenGL performance team doing bug triage.
Ross Cunniff Senior Software Engineer, NVIDIA
Ross Cunniff received two degrees, in Computer Science and Mathematics, from New Mexico State University in 1985. Ross has been an NVIDIA employee since 2001. At NVIDIA, he has worked on OpenGL and DirectX graphics drivers as well as on camera image enhancement algorithms. Ross is the inventor or co-inventor of over 15 patents. He is currently one of NVIDIA’s representatives on the SPEC Graphics and Workstation Performance Group committees.

Learn how to get your bugs fixed faster, resolve bugs efficiently, and prevent them in the first place. We will review internal processes, basic OpenGL bug triage and tools, a little Vulkan debugging, and finally bug prevention with benchmarking and test creation.

Level: All technical
Type: Tutorial
Tags: Tools & Libraries; Performance Optimization; Real-Time Graphics

Day: Monday, 04/04
Time: 15:00 - 15:50
Location: Room 211A

S6613 - Rendering Faster and Better with VRWorks™

John Spitzer Vice President of GameWorks Labs at NVIDIA, NVIDIA
John has spent nearly his entire career working in real-time 3D graphics, holding engineering and management positions at IBM, SGI, and NVIDIA, where he is the VP of GameWorks Labs. He founded NVIDIA's branch office in Russia, created the GeForce Experience gaming service, and has over a dozen patents granted or pending.

This talk will introduce developers to NVIDIA's VRWorks, an SDK for VR game, engine, and headset developers that cuts latency and accelerates stereo rendering performance on NVIDIA GPUs. We'll explain the features of this SDK, including VR SLI, multi-resolution shading, context priorities, and direct mode. We'll discuss the motivation for these features, how they work, and how developers can use VRWorks in their renderers to improve the VR experience on Oculus Rift, HTC Vive, and other VR headsets.

Level: Intermediate technical
Type: Tutorial
Tags: Virtual Reality & Augmented Reality; Real-Time Graphics ; Game Development

Day: Monday, 04/04
Time: 15:00 - 15:50
Location: Room LL20C

S6615 - Developer Tools Arsenal for Tegra Platforms

Sebastien Domine Senior Director Software Engineering, Developer Tools, NVIDIA
Highly-Rated Speaker
Sebastien Domine is senior director of software engineering, Developer Tools, at NVIDIA. He runs various software engineering teams and oversees the development of software products dedicated to ease the developer's life and to foster the creation of applications that can take full advantage of the GPU and the SoC. Prior to NVIDIA, he worked on PC games at GameFX/THQ and 3D digital content creation tools at Katrix and Nichimen Graphics. He holds a Diplome d'Ingenieur in computer science from EPITA, Paris, France.

We'll present the complete offering of developer tools for the SHIELD and Jetson platforms. We'll cover compute and graphics tools, as well as platform tools. We'll demonstrate the basic use of the tools with real-life applications, from gaming to embedded systems. The talk will focus on the latest Jetson TX1 devkit release and the latest SHIELD products. The audience will also learn more about what is coming up in our roadmap for future developer tools.

Level: All technical
Type: Tutorial
Tags: Tools & Libraries; Game Development; Embedded

Day: Monday, 04/04
Time: 15:00 - 15:50
Location: Room 210F

Tutorial

TALK

Presentation
Details

S6797 - Top 20 Posters Fast Forward

David Luebke Vice President Graphics Research, NVIDIA
Highly-Rated Speaker
David Luebke helped found NVIDIA Research in 2006 after eight years teaching computer science on the faculty of the University of Virginia. David is currently Vice President of Graphics Research at NVIDIA. His personal research interests include virtual and augmented reality, display technology, ray tracing, and graphics architecture. His honors include the NVIDIA Distinguished Inventor award, the NSF CAREER and DOE Early Career PI awards, and the ACM Symposium on Interactive 3D Graphics "Test of Time Award". David has co-authored a book, a SIGGRAPH Electronic Theater piece, a major museum exhibit visited by over 110,000 people, an online course on parallel computing that has reached over 80,000 students, and dozens of papers, articles, chapters, and patents on computer graphics and GPU computing.

The GTC Fast Forward Poster program is an accelerated poster presentation program that serves as a catalyst for the advancement of an array of innovations that come from universities, research labs, and industry. The GTC Poster Review Committee selected the best 20 posters submitted to GTC 2016. This program gives authors a chance to present their GPU projects in front of top technology developers working in a vast array of industries.

Level: All technical
Type: Talk
Tags: General Interest; Press-Suggested Sessions: General Interest

Day: Monday, 04/04
Time: 15:00 - 15:50
Location: Room 212A

Talk

TUTORIAL

Presentation
Details

S6823 - Session 4 of 4: Essential CUDA Optimization Techniques (Presented by Acceleware)

Dan Cyca Chief Technology Officer, Acceleware LTD
Highly-Rated Speaker
Regarded as a leading mind in the field of parallel processing, Dan Cyca has extensive experience working with GPUs, clusters and multi-core solutions. Dan joined Acceleware in 2004 as a software developer to build the company's first product. Since then, he has served in many technical and leadership roles in the company. Most recently, as the Director of Engineering, Dan was responsible for managing the software development group. Prior to Acceleware, Dan's experience included developing 'C-to-hardware' compilers, and implementing digital signal processing and encryption algorithms on FPGAs. Dan has an M. Sc. in Electrical Engineering from the University of Calgary.

This tutorial is for those with some background in CUDA, including an understanding of the CUDA memory model and streaming multiprocessor. Our earlier tutorials will provide the background information necessary for this session. This informative tutorial will provide an overview of the performance analysis tools and key optimization strategies for compute-, latency-, and memory-bound problems. The session will include techniques for ensuring peak utilization of CUDA cores by choosing the optimal block size. This session will include code examples and a programming demonstration highlighting the optimal global memory access pattern applicable to all GPU architectures. Printed copies of the material will be provided to all attendees for each session – collect all four!
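
For illustration, here is a minimal sketch contrasting coalesced and strided global memory access, using a warp-multiple block size; the specific numbers (256 threads, stride 32) are common starting points rather than the session's prescriptions:

    #include <cuda_runtime.h>

    // Adjacent threads read adjacent addresses: accesses coalesce into
    // a few wide memory transactions.
    __global__ void copyCoalesced(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided addresses scatter each warp's accesses across many
    // transactions, multiplying memory traffic.
    __global__ void copyStrided(const float* in, float* out, int n, int stride) {
        int i = ((blockIdx.x * blockDim.x + threadIdx.x) * stride) % n;
        out[i] = in[i];
    }

    int main() {
        const int n = 1 << 24;
        float *d_in, *d_out;
        cudaMalloc(&d_in, n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));

        // Block sizes that are multiples of the 32-thread warp (128-256
        // is a common starting point) help keep the SMs occupied.
        int block = 256, grid = (n + block - 1) / block;
        copyCoalesced<<<grid, block>>>(d_in, d_out, n);
        copyStrided<<<grid, block>>>(d_in, d_out, n, 32);
        cudaDeviceSynchronize();

        cudaFree(d_in); cudaFree(d_out);
        return 0;
    }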

Level: All technical
Type: Tutorial
Tags: Programming Languages; Supercomputing & HPC

Day: Monday, 04/04
Time: 15:00 - 16:20
Location: Room LL20D

Tutorial

TALK

Presentation
Details

S6853 - MXNet: Flexible Deep Learning Framework from Distributed GPU Clusters to Embedded Systems

Mu Li Ph.D. Student, Carnegie Mellon University
Mu Li is currently a final-year Ph.D. student at Carnegie Mellon University. His research interests lie in algorithms and systems for distributed machine learning and deep learning. In particular, he designs algorithms and systems that scale to petabyte datasets and run over thousands of machines. He has co-authored dozens of top journal and conference papers ranging from learning theory to machine learning, and from data mining to systems. He has served as principal architect at Baidu and co-founded several machine learning startups.
Tianqi Chen Ph.D. Student, University of Washington
Tianqi Chen is a third-year Ph.D. student at the University of Washington, working on large-scale machine learning. He has co-authored many important works in scalable learning systems, statistical sampling theory, and deep learning. He has also designed several widely used scalable learning systems, including XGBoost and MXNet.

This talk will describe how to develop and deploy deep learning applications efficiently and easily using MXNet. MXNet is a new deep learning framework developed by collaborators from over 10 institutes. It is designed for both flexibility and optimized performance, with easy-to-use interfaces currently in seven programming languages, including Python, Scala, and R. We will discuss the technologies used to scale the framework out to distributed clouds ranging from EC2, Azure, and GCE to Spark clusters, as well as memory optimizations to fit it into embedded systems like mobile phones. Finally, we'll demonstrate deep learning applications in computer vision, natural language processing, and speech recognition.

Level: All technical
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Programming Languages; Embedded

Day: Monday, 04/04
Time: 15:00 - 15:50
Location: Grand Ballroom

Talk

HANDS-ON LAB

Presentation
Details

L6129 - VisionWorks Toolkit

Thierry Lepley Senior Software Engineer, NVIDIA
Thierry Lepley is Senior Computer Vision Engineer at NVIDIA and the NVIDIA representative in the OpenVX standardization group. His focus is on the development of optimized computer vision toolkits and libraries for real-time embedded systems. Earlier, Thierry was Principal Engineer at STMicroelectronics, working on many-core acceleration for computer vision, where he developed a compiler that automates the parallelization of image processing pipelines and the management of image tiling.
Colin Tracey Senior System Software Engineer, NVIDIA
Colin Tracey has been with NVIDIA as a Senior System Software Engineer since 2011. He has worked on camera features for mobile devices including panorama, HDR, video stabilization, and object tracking. More recent work has been in ADAS and autonomous driving systems including surround view, obstacle detection, and sensor fusion.

In this hands-on session, we'll get practical experience with the VisionWorks™ toolkit, an NVIDIA SDK for computer vision (CV) that implements and extends the new OpenVX standard. The first step will be to install the VisionWorks toolkit, discover its structure and documentation, and run samples. In a second step, we will experiment with different ways of debugging and profiling an application developed with VisionWorks. Finally, we will do some programming to experience the API in practice.

Level: Intermediate technical
Type: Hands-on Lab
Tags: Computer Vision & Machine Vision

Day: Monday, 04/04
Time: 15:30 - 17:00
Location: Room 210C

Hands-on Lab

HANGOUT

Presentation
Details

H6124 - Hangout: DRIVE PX 2

Shri Sundaram Product Marketing Manager, DRIVE PX, NVIDIA
Shri Sundaram is product manager of DRIVE PX, the world's first AI supercomputer for self-driving cars, at NVIDIA. Previously, he was software product manager for the NVIDIA SHIELD Android TV. He joined the company in 2011 as product manager responsible for building imaging partnerships for NVIDIA's presence in the smartphone market. Prior to NVIDIA, Shri worked for 10 years at Toshiba America as product line manager for a $150 million semiconductor product line. He holds a bachelor's degree in instrumentation engineering from BITS Pilani, India, and an MBA from the Thunderbird School of Global Management, in Arizona. He has also completed an executive program in Marketing Leadership from INSEAD, Singapore.

Hangout with NVIDIA experts and discuss why autonomous and driver assistance technologies powered by deep learning have become a key focus for every car manufacturer, as well as transportation services and technology companies. The car needs to know exactly where it is, recognize the objects around it, and continuously calculate the optimal path for a safe driving experience. This situational and contextual awareness of the car and its surroundings demands a powerful visual computing system that can merge data from cameras and other sensors, plus navigation sources, while also figuring out the safest path – all in real-time. This autonomous driving platform is NVIDIA DRIVE PX.

Level: All technical
Type: Hangout
Tags: Self-Driving Cars & Automotive

Day: Monday, 04/04
Time: 16:00 - 17:00
Location: Pod A

H6126 - Hangout: How to Set Up MOSAIC Video Walls

Doug Traill Senior Solutions Architect, NVIDIA
Highly-Rated Speaker
Doug is a senior solutions architect at NVIDIA

Hangout for customers wanting to ask questions about setting up large-scale video display walls.

Level: All technical
Type: Hangout
Tags: Large Scale and Multi-Display Visualization

Day: Monday, 04/04
Time: 16:00 - 17:00
Location: Pod C

H6161 - Performance Analysis and Optimization

Maxim Milakov Senior Developer Technology Engineer, NVIDIA
Maxim Milakov joined NVIDIA in January 2012. Since then he has been working on accelerating scientific and industrial applications on GPUs in various fields including machine learning, developing codes and algorithms to maximize performance while also collaborating with internal teams to help co-design future hardware and tools.
Julien Demouth Senior Engineer (Devtech), NVIDIA
Highly-Rated Speaker
Julien is a senior engineer (Devtech) at NVIDIA.

Come ask your GPU code optimization questions to experts in the field. Hangouts provide an opportunity for you to ask topic experts specific questions. Come on in. Find a person wearing an Ask Me button and ask a question!

Level: All technical
Type: Hangout
Tags: Performance Optimization

Day: Monday, 04/04
Time: 16:00 - 17:00
Location: Pod B

Hangout

TUTORIAL

Presentation
Details

S6138 - GPU-Driven Rendering in Vulkan and OpenGL

Pierre Boudier Software Architect, NVIDIA
Pierre Boudier works as a software architect for NVIDIA.
Christoph Kubisch Developer Technology Engineer, NVIDIA
Highly-Rated Speaker
Christoph Kubisch is a Senior Developer Technology Engineer for NVIDIA, where he focuses on advanced OpenGL and Vulkan real-time rendering techniques suitable for CAD/DCC and scientific applications. He collaborates with external partners and NVIDIA's internal teams to optimize current and future rendering algorithms. Prior to joining NVIDIA, Christoph was a researcher on hardware accelerated visualization techniques for medical datasets at the Otto-von-Guericke University of Magdeburg. Furthermore, he has worked as technical artist creating game art, technology and tools.

We will present a detailed description of how the hardware implements the graphics pipeline, including the motivation behind both high- and low-level design choices. We'll provide practical usage scenarios for how Vulkan, OpenGL, and GLSL are best used and how to avoid performance pitfalls. Through the use of this information and the latest graphics API features, the GPU can be leveraged very efficiently to do more work autonomously of the CPU and increase overall throughput.
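
As one concrete instance of letting the GPU work autonomously of the CPU, here is a minimal OpenGL sketch using multi-draw indirect, where all per-draw parameters live in a GPU buffer (which a compute shader could equally well have filled). It assumes an existing GL 4.3+ context, and the function and parameter names here are illustrative:

    #include <GL/glew.h>

    // Layout of one indirect draw record, as defined by OpenGL.
    struct DrawElementsIndirectCommand {
        GLuint count, instanceCount, firstIndex, baseVertex, baseInstance;
    };

    void submitScene(GLuint indirectBuffer, GLsizei numDraws) {
        // Per-draw parameters are already in GPU memory.
        glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
        // One call issues numDraws draws: offset 0, tightly packed records.
        glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
                                    nullptr, numDraws, 0);
    }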

Level: Intermediate technical
Type: Tutorial
Tags: Real-Time Graphics ; Performance Optimization

Day: Monday, 04/04
Time: 16:00 - 16:50
Location: Room LL21B

S6382 - Performance Considerations for OpenCL on NVIDIA GPUs

Karthik Raghavan Ravi Engineering Manager, OpenCL, NVIDIA
Karthik Ravi has been with the compute drivers group at NVIDIA for over five years, and presently manages the team that builds and explores ways of improving the efficiency of the NVIDIA OpenCL drivers.

Understand how to get the most out of your OpenCL programs on NVIDIA GPUs. Idiosyncrasies of various architectures and OpenCL implementations mean that what's fast on one platform may not always be the fastest on others. We'll give you insights into the NVIDIA OpenCL implementation and GPU architecture, enabling you to make optimal application choices and extract more juice from your programs on NVIDIA GPUs. Some key things you will learn are: [1] performance paths for shared virtual memory (coarse-grained buffers); [2] efficient interop with OpenGL in single-threaded and (the notorious) multi-threaded cases; [3] the perf knobs available and the architectural basis for how those knobs will impact the performance of your program; and [4] improving throughput by overlapping GPU copy and compute.
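
For item [4], here is a minimal sketch of overlapping copy and compute using two in-order command queues, where each chunk's kernel waits only on its own transfer event; error checking is omitted and the chunking is illustrative:

    #include <CL/cl.h>

    static const char* src =
        "__kernel void scale(__global float* d) {"
        "  size_t i = get_global_id(0); d[i] *= 2.0f; }";

    int main(void) {
        cl_platform_id plat; cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue qCopy = clCreateCommandQueue(ctx, dev, 0, NULL);
        cl_command_queue qComp = clCreateCommandQueue(ctx, dev, 0, NULL);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "scale", NULL);

        enum { CHUNK = 1 << 20, NCHUNKS = 4 };
        static float host[NCHUNKS][CHUNK];
        cl_mem buf[NCHUNKS];
        cl_event copied[NCHUNKS];
        size_t gws = CHUNK;

        for (int c = 0; c < NCHUNKS; ++c) {
            buf[c] = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                    CHUNK * sizeof(float), NULL, NULL);
            // Non-blocking write on the copy queue...
            clEnqueueWriteBuffer(qCopy, buf[c], CL_FALSE, 0,
                                 CHUNK * sizeof(float), host[c], 0, NULL, &copied[c]);
            // ...while the compute queue's kernel waits only on its own
            // chunk's copy, so later copies overlap earlier compute.
            clSetKernelArg(k, 0, sizeof(cl_mem), &buf[c]);
            clEnqueueNDRangeKernel(qComp, k, 1, NULL, &gws, NULL, 1, &copied[c], NULL);
        }
        clFinish(qComp);
        return 0;
    }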

Level: Intermediate technical
Type: Tutorial
Tags: Performance Optimization; Programming Languages

Day: Monday, 04/04
Time: 16:00 - 16:50
Location: Room 211A

Tutorial

TALK

Presentation
Details

S6392 - AEC Project Execution Using GRID vGPU Enhanced Virtualization

Bill Dale Technology Leader, Jacobs
Bill is a technology leader at Jacobs with more than 20 years of experience in the areas of science, technology, and engineering. He specializes in identifying opportunities for improving work process with the practical application of technology and is passionate about information security and intellectual property management. Currently, he advises internal teams, and external entities, on optimizing process execution. He enjoys skiing, fishing, and spending time with his family.
Randall Siggers Solution Architect, NVIDIA
Randall is a Solutions Architect for NVIDIA with more than 20 years of experience in IT. He specializes in researching, analyzing, and implementing emerging technology. His current projects involve, SCCM, Cloud technology, imaging systems and VDI. He enjoys traveling and speaking at various tech conferences and in his spare time, enjoys building custom gaming rigs, vintage BMX, and import tuning.

This session presents an overview of the challenges involved in traditional BIM workflow processes and the benefits of moving to GRID vGPU-enabled Integrated Project Design.

Level: Intermediate technical
Type: Talk
Tags: Graphics Virtualization

Day: Monday, 04/04
Time: 16:00 - 16:25
Location: Room 210G

Talk

TUTORIAL

Presentation
Details

S6512 - Massive Time-lapse Point Cloud Rendering with VR

Innfarn Yoo Software Engineer, NVIDIA
Innfarn Yoo is a software engineer on the OpenGL core and chips team at NVIDIA Corporation. He received his Ph.D. and M.S. degrees from Purdue University, in addition to a B.S. degree from Konkuk University in South Korea. His research interests lie in the fields of animation, point-cloud rendering, and photo-realistic rendering.
Markus Schuetz Software Engineer, NVIDIA
Markus Schuetz is a Visual Computing student at the Vienna University of Technology and responsible for the development of the WebGL point cloud viewer Potree. He is continuing point cloud rendering research and development in his current position as an intern at NVIDIA.

We present novel methods for visualizing massive-scale time-lapse point cloud data and for navigating and handling point clouds in VR. Our method provides new approaches for standard and stereoscopic rendering of 120 GB of time-lapse point cloud data, and we aim to scale it to 2 TB. Time-lapse point clouds raise many problems, including color mismatching, registration, out-of-core design, and memory management. We generate a progressive blue-noise point cloud and apply the sparse buffer extension in OpenGL 4.5, which together reduce the complexity of the out-of-core design and the cost of memory manipulation. In addition, point cloud rendering in VR is an emerging field, so few methods are applicable yet; we are investigating a new method for visualizing and navigating large point cloud data.

Level: Intermediate technical
Type: Tutorial
Tags: Virtual Reality & Augmented Reality; Product & Building Design; In-Situ and Scientific Visualization

Day: Monday, 04/04
Time: 16:00 - 16:50
Location: Room LL20C

S6710 - Developer Tools for Next Generation Graphics APIs

Jeffrey Kiel Manager, Graphics Tools, NVIDIA
Jeff Kiel is the manager of Graphics Tools at NVIDIA. His responsibilities include development and oversight of graphics performance and debugging tools, including Nsight Visual Studio Edition and Tegra Graphics Debugger. Previous projects at NVIDIA include PerfHUD, and ShaderPerf. Before coming to NVIDIA, Jeff worked on PC and console games at Interactive Magic and Sinister Games/Ubisoft. Jeff has given presentations at many GDCs and SIGGRAPHs and contributed articles to graphics related publications. His passion for the art started in the G-Lab at the University of North Carolina at Chapel Hill where he received his B.S. in Mathematical Sciences.

The next generation of graphics APIs provide unprecedented, low level access to the GPU pipeline. With this power comes significantly more responsibility to manage data/resource hazards, threading, residency, etc., things the drivers and runtime implementations used to take care of for you. Come hear about how Nsight Visual Studio Edition and related NVIDIA offerings provide the tools you need to debug and profile your applications in this new world.

Level: Intermediate technical
Type: Tutorial
Tags: Tools & Libraries; Performance Optimization; Game Development

Day: Monday, 04/04
Time: 16:00 - 16:50
Location: Room 210F

Tutorial

TALK

Presentation
Details

S6851 - GPU Server Portfolio Overview (Presented by Supermicro)

Neil Truong Product Manager, Supermicro
SUPERMICRO is the clear industry leader in GPU-accelerated total solutions. As Vice President of Marketing & Worldwide Business Development at SUPERMICRO, Don Clegg leads teams focused on delivering high-performance server, storage, and networking systems leveraging GPU technology. Don brings 30+ years of direct experience in design, marketing, and business development to help Supermicro deploy its industry-leading, first-to-market, scale-out/scale-up platforms. Don began his career as a design engineer, developing multi-node, multi-user x86 servers and workstations. With an emphasis on first-to-market product introductions, Don subsequently held executive positions at several chipset and system companies where he helped them achieve #1 market share. The trend continues at Supermicro. He earned a bachelor's degree in electrical engineering from Brigham Young University, where he graduated with high honors.

Supermicro will be giving an overview on next generation technologies and GPU solutions.

Level: All technical
Type: Talk
Tags: Education & Training

Day: Monday, 04/04
Time: 16:00 - 16:50
Location: Room LL20A

S6868 - Give Life to your 3D Art with MDL and NVIDIA Iray® in Substance Painter

Manuel Kraemer Sr. Developer Technology Engineer, NVIDIA
Manuel Kraemer is a Senior Developer Technology Engineer at NVIDIA. Previously Manuel was a Graphics Software Engineer at Pixar Animation Studios. Prior to that, Manuel worked as a technical director at Disney Feature Animation, Double Negative and the BBC.
Jérémie Noguer Senior Product Manager, Substance Painter
With a game developer background, and after seven years as a Technical Artist for Allegorithmic, Jérémie has been the Senior Product Manager for Substance Painter since 2013.

Allegorithmic and NVIDIA will show how combining Substance, the worldwide reference for procedural textures; MDL, the new standard for defining multi-layer materials; and NVIDIA Iray, a GPU-accelerated unbiased ray tracer, helps solve artists' and developers' PBR material challenges, from editing to final-frame rendering of artistic shots. After explaining MDL basics and the associated material workflow in Substance Designer, we will showcase the latest edition of Substance Painter, the market's most innovative real-time 3D painting software. Now embedding Iray as an alternate viewport, Substance Painter fully leverages the power of MDL and Substance and natively enhances your art with advanced rendering quality at minimal compute time thanks to GPU acceleration.

Level: Intermediate technical
Type: Talk
Tags: Rendering & Ray Tracing; Game Development

Day: Monday, 04/04
Time: 16:00 - 16:50
Location: Room 210E

S6859 - Unveiling the Impact of Time Slicing with NVIDIA GRID™ vGPU for Realistic ROI/TCO Analysis

Erik Bohnhorst GRID Solution Architect, NVIDIA
Erik Bohnhorst is a Senior GRID Solution Architect at NVIDIA based in Stuttgart, Germany. After 7 years working at HP and focusing on Client Virtualization, Erik joined NVIDIA to support the largest Graphics Accelerated Client Virtualization opportunities in central Europe. Erik regularly shares his experience and technical understanding of Client Virtualization opportunities at technical events like BriForum, HP Discover, VMworld, E2EVC and other industry focused events.

A detailed look into why time slicing the various GPU engines allows scalability without compromising the graphics experience, what impact it has on benchmarking, and how it helps generate realistic users-per-host recommendations that directly impact the TCO/ROI model.

Level: Intermediate technical
Type: Talk
Tags: Graphics Virtualization

Day: Monday, 04/04
Time: 16:30 - 16:55
Location: Room 210G

Talk

POSTER

Presentation
Details

P6102 - DR. TED: Deep Learning Recommendation of Treatment from Electronic Data

David Ledbetter Data Science Consultant, Children's Hospital Los Angeles
David Ledbetter has an extensive and deep understanding of decision theory. He has experience implementing various decision engines, including convolutional neural networks, random forests, extra trees, and linear discrimination analysis. His particular area of focus is in performance estimation, where he has demonstrated a tremendous ability to accurately predict performance on new data in nonstationary, real-world scenarios. David has worked on a number of real-world detection projects, including detecting circulating tumor cells in blood, automatic target recognition utilizing CNNs from satellite imagery, make/model car classification for the Los Angeles Police Department using CNNs, and acoustic right whale call detection from underwater sonobuoys. Recently, David has been developing a CNN to generate personalized treatment recommendations to optimize patient outcomes using unstructured electronic medical records from 10 years of data collected from the Children's Hospital Los Angeles Pediatric Intensive Care Unit.
Melissa Aczon Data Science Consultant, Children's Hospital Los Angeles
Melissa Aczon is a Principal Scientist at Areté Associates where she leads a team of scientists to develop algorithms that both improve the detection capability and reduce false alarms of a very complex sensor system. She leverages her deep understanding of mathematics to solve signal processing, detection, classification and estimation problems from a wide array of applications. She has worked with data coming from many different types of sensors including radar, optical and acoustic systems. She is also a Data Science Consultant at Children's Hospital Los Angeles (CHLA), where she is exploiting years of electronic health records collected from the CHLA Pediatric Intensive Care Unit with deep learning methods to create doctor decision aids. Melissa holds a Bachelor's Degree in Mathematics from Harvey Mudd College and a Ph.D. in Scientific Computing and Computational Mathematics from Stanford University.

We construct a model that generates treatment predictions to optimize patient outcomes, using information gleaned from over 10,000 patients who passed through the Pediatric Intensive Care Unit at Children's Hospital Los Angeles over more than 10 years. This is accomplished by converting unstructured, non-uniformly sampled patient information into a structured data representation that resembles an image -- referred to here as a "patient snapshot." These patient snapshots elegantly enable convolutional neural networks to efficiently generate a basis.
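
As a rough illustration of the snapshot idea (the function and layout below are our sketch, not the authors' code), each clinical variable can be forward-filled onto a fixed time grid to form one row of the image-like representation:

#include <cstddef>
#include <vector>
#include <utility>

// Resample one non-uniformly sampled variable onto a uniform time grid,
// carrying the last observation forward, to form one snapshot row.
void fillSnapshotRow(const std::vector<std::pair<float, float>>& samples, // (time, value), time-sorted
                     float t0, float dt, int cols, float* row) {
    std::size_t k = 0;
    float last = samples.empty() ? 0.0f : samples.front().second;
    for (int c = 0; c < cols; ++c) {
        float t = t0 + c * dt;
        while (k < samples.size() && samples[k].first <= t)
            last = samples[k++].second;  // latest sample at or before time t
        row[c] = last;
    }
}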

Level: Intermediate technical
Type: Poster
Tags: Deep Learning & Artificial Intelligence; Medical Imaging

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6104 - A Case Study on Programming Heterogeneous Multi-GPU with StarPU Library.

Joao Gazolla Ph.D. Candidate, Universidade Federal Fluminense
Joao Gazolla has been a Ph.D. candidate at Universidade Federal Fluminense since 2012 under the supervision of Dr. Esteban Walter Gonzalez Clua in the MediaLab-UFF Group. He received his M.S. in computer science from UFF in 2010 and B.S. in computer science from the Federal University of Vicosa in 2009. He is a researcher at Media Lab at UFF, which is an NVIDIA Center of Excellence, and a fellow of the Global Cyber Bridges project of Florida International University. He has experience in computer science with an emphasis on GPGPUs, HPC, simulation, and optimization.
Esteban Clua Associate Professor, Universidade Federal Fluminense
Esteban Walter Gonzalez Clua is an associated professor in the Computer Science department of Universidade Federal Fluminese, in Rio de Janeiro, and director of UFF Medialab. Esteban was awarded the title of NVIDIA Fellow in 2015. He graduated in computer science at Universidade de Sao Paulo and has M.S. and Ph.D. degrees in computer science. Esteban is one of the founders of SBGames - Brazilian Symposium of Digital Entertainment and Video Games, director of Academia of IGDA Rio, president of the Brazilian Computing Society Game Committee, and a member of program committees of many conferences in video games, such as ACM Siggraph Sandbox, IEEE Gameon, and SBC SBGames. He helped establish the first Latin America NVIDIA Center of Excellence.

We present a case study of kernel executions that take advantage of the StarPU library to exploit heterogeneous multi-GPU systems. This third-party library, developed by INRIA in France, provides a unified view of the computational resources, so the programmer can concentrate on programming rather than on handling low-level issues such as data transfers. It uses a task-based model and allows developers to implement custom scheduling policies. In this case study, thousands of matrix multiplications are performed with different scheduling policies and different numbers of processing units, resulting in different execution times, while low-level issues are handled by the StarPU library.

Level: Beginner technical
Type: Poster
Tags: Tools & Libraries; Supercomputing & HPC

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6106 - White Matter Tractography and Human Brain Connections Using GPUs

Moises Hernandez Fernandez Ph.D. Student, Oxford Centre for Functional MRI of the Brain (FMRIB). University of Oxford
Moises Hernandez Fernandez is a Ph.D. candidate at the University of Oxford in clinical neurosciences. His supervisors are Professor Stephen Smith, Professor Mike Giles, Dr. Stamatios Sotiropoulos, and Dr. Istvan Reguly.

We present a novel analysis tool for diffusion MRI (dMRI) data using NVIDIA GPUs for mapping connections in the human brain. We describe the potential of dMRI and how it allows the study of brain microstructure and 3D estimation of long-range brain connections, non-invasively and in-vivo (tractography). Due to the multidimensional nature of the data, modeling can be computationally demanding. We present a parallel framework for analysis of dMRI data that allows accelerations of up to two orders of magnitude when comparing GPU with CPU implementations. We highlight the tremendous benefit of these accelerations in very large recent studies such as the Human Connectome Project, where comprehensive maps of brain anatomical connectivity of unprecedented quality are being generated.

Level: Intermediate technical
Type: Poster
Tags: Medical Imaging

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6107 - User-Defined Drag Models on the GPU

Andrew Larson Software Developer, CPFD Software LLC
As a graduate student with a background in math and computer science, Andrew Larson was hit with the CUDA manual and subsequently instructed, "Make that code go fast." Now, as a software developer at CPFD Software LLC, he applies GPU acceleration through CUDA runtime to Barracuda VR, a computer-aided engineering software that uses the MP-PIC methodology to simulate industrial-scale processing units (e.g., regenerators, cyclones, gasifiers, and chemical looping combustors). Turns out, that code goes fast!

With CUDA, GPU processing is not relegated to matrix operations. Integrating GPU acceleration into commercial products requires both performance and flexibility. Often, the acceleration gained from GPU parallelization is paramount; however, given the flexibility to easily incorporate more generalized usage, large performance gains need not be lost to a limited scope of application. We demonstrate the feasibility of GPU acceleration with end-user support for custom drag models supplied at runtime, greatly improving overall usability without sacrificing much performance.

Level: Beginner technical
Type: Poster
Tags: Computational Fluid Dynamics; Tools & Libraries

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6109 - GPU-Accelerated Batch-ACPF Solution for N-1 Static Security Analysis

Gan Zhou Associate Professor, Southeast University
Gan Zhou is an associate professor with the School of Electrical Engineering, Southeast University. His research interests include CPU+GPU hybrid computing architecture and high performance computing in power systems.

GPUs have been applied successfully in many scientific computing realms and have great potential in power system applications. N-1 static security analysis (SSA) is a promising candidate application, in which massive numbers of alternating current power flow (ACPF) problems need to be solved. However, when existing GPU-accelerated algorithms are applied to the N-1 SSA problem, the degree of parallelism is limited, because existing research has been devoted to accelerating the solution of a single ACPF. This paper proposes a GPU-accelerated solution that creates an additional layer of parallelism among batch ACPFs and consequently achieves a much higher level of parallelism. Compared with its CPU counterpart on a Xeon E5-2620, the GPU method running on a Tesla K20C achieves up to a 57.6X speedup when solving SSA.
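
The batch layout can be pictured with a simplified CUDA sketch in which one thread block owns one contingency case; a Jacobi sweep stands in here for the poster's actual Newton-based ACPF solve, and all names are illustrative:

// One contingency case per block; every case iterates independently.
__global__ void batchJacobi(const float* A, const float* b,
                            float* x, float* xNew, int n, int iters) {
    int c = blockIdx.x;                       // case index
    const float* Ac = A + (size_t)c * n * n;  // this case's n x n matrix
    const float* bc = b + (size_t)c * n;
    float* cur = x + (size_t)c * n;
    float* nxt = xNew + (size_t)c * n;
    for (int it = 0; it < iters; ++it) {      // use an even iters: result ends in x
        for (int i = threadIdx.x; i < n; i += blockDim.x) {
            float s = 0.0f;
            for (int j = 0; j < n; ++j)
                if (j != i) s += Ac[i * n + j] * cur[j];
            nxt[i] = (bc[i] - s) / Ac[i * n + i];
        }
        __syncthreads();                      // finish writes before swapping
        float* t = cur; cur = nxt; nxt = t;   // identical swap in every thread
    }
}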

Level: Intermediate technical
Type: Poster
Tags: Algorithms

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6110 - Fast Sorting in OpenGL Shaders for Order Independent Transparency

Pyarelal Knowles Student, RMIT University
Pyarelal Knowles is a Ph.D. student at RMIT University, Melbourne, with research interests in real-time computer graphics and physics simulations. He completed his B.S. in games and graphics programming in 2008, before earning a computer science degree with honors in 2009 at RMIT.

Sorting is a bottleneck when rendering large scenes with order independent transparency. The problem is to quickly sort millions of small lists of varying sizes, up to hundreds of items, on the GPU. Two techniques are shown to provide a compound performance improvement of over a factor of 10. The first is backwards memory allocation (BMA), which groups similar threads and executes them in batches. This improves GPU occupancy and allows a strategy pattern approach to the sorting algorithm. The second is register-based block sort (RBS), which improves local memory use with careful and explicit use of registers and works well in combination with BMA. Their improvements are shown to increase with GPU generations.
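
For intuition, the register-resident building block behind RBS resembles a small fixed-size insertion sort, invoked per thread (e.g., sortFragments<8>(d)); the CUDA sketch below is ours (the poster targets GLSL), and real register-based sorting needs fully unrolled, compile-time indexing so the array never spills out of registers.

// Sort up to N fragment depths held by one thread (illustrative only).
template <int N>
__device__ void sortFragments(float depth[N]) {
    #pragma unroll
    for (int i = 1; i < N; ++i) {
        float key = depth[i];
        int j = i - 1;
        while (j >= 0 && depth[j] > key) { depth[j + 1] = depth[j]; --j; }
        depth[j + 1] = key;
    }
}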

Level: Intermediate technical
Type: Poster
Tags: Real-Time Graphics ; Performance Optimization

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6114 - GPU Acceleration of Non-Iterative and Iterative Algorithms in Fluorescence Lifetime Imaging Microscopy

Gang Wu Student, University of Sussex
Gang Wu is a Ph.D. student at the University of Sussex and a visiting scholar at the University of Strathclyde. He received his M.S. in system analysis and integration from Zhejiang University in 2013. His current research interests include FLIM analysis, fluorescence-based sensing systems, GPU computing, and biophotonics.

Fluorescence lifetime imaging microscopy (FLIM) plays a significant role in biological sciences, chemistry, and medical research. We propose a GPU-based FLIM analysis tool suitable for high-speed and flexible FLIM applications. With a large number of parallel processors, GPUs can significantly speed up lifetime calculations compared to CPU-OpenMP (parallel computing with multiple CPU cores) based analysis. The implemented algorithms have been tested on both synthesized and experimental FLIM data. The results show that at the same precision the GPU analysis can be up to 24x faster than its CPU-OpenMP counterpart.
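
To picture why such analysis maps well to one-thread-per-pixel parallelism, here is a hedged sketch of a classic non-iterative estimator (two-gate rapid lifetime determination); the poster's actual algorithms may differ.

// Per-pixel lifetime from two time gates: tau = -dt / ln(g2/g1).
__global__ void rldLifetime(const float* g1, const float* g2,
                            float* tau, int nPixels, float dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPixels) return;
    float r = g2[i] / g1[i];                  // later gate over earlier gate
    tau[i] = (r > 0.0f && r < 1.0f) ? -dt / logf(r) : 0.0f;  // guard bad pixels
}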

Level: Intermediate technical
Type: Poster
Tags: Medical Imaging; Algorithms

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6115 - GPU Implementation and Optimization of Video Super-Resolution

Bo Xiao Research Scientist, Baidu Research SVAIL
Bo Xiao received B.E. and M.S. degrees in electrical engineering from Shanghai Jiao Tong University, Shanghai, China, in 2007 and 2010, respectively, and his Ph.D. in computational science and engineering from the Georgia Institute of Technology in 2014. He is currently a research scientist at Baidu's Silicon Valley AI Lab (SVAIL), Sunnyvale, CA. His research interests include high performance computing, large-scale data analysis, and deep learning.

We applied SRCNN, a convolutional neural network for image super-resolution, to FHD-to-4K video super-resolution. Because SRCNN runs very slowly on the CPU, we parallelized the convolutions and implemented SRCNN on the GPU, exploiting the GPU memory hierarchy to optimize the algorithm. These optimizations accelerate the video super-resolution process to 1.2s per frame, almost 300 times faster than the CPU implementation.
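
The parallelized convolution can be pictured as one output pixel per thread; the naive single-channel kernel below is a hedged sketch of the kind of layer SRCNN reduces to (the real network uses multiple channels, and the poster's optimized kernels exploit shared memory).

// Naive 2D convolution: one thread per output pixel, clamped borders.
__global__ void conv2d(const float* in, const float* w, float* out,
                       int H, int W, int K) {    // K odd, e.g. 9 for SRCNN layer 1
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;
    int r = K / 2;
    float acc = 0.0f;
    for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx) {
            int yy = min(max(y + dy, 0), H - 1);
            int xx = min(max(x + dx, 0), W - 1);
            acc += in[yy * W + xx] * w[(dy + r) * K + (dx + r)];
        }
    out[y * W + x] = acc;
}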

Level: Beginner technical
Type: Poster
Tags: Video & Image Processing

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6120 - ABCD Algorithm for Tridiagonal Solver

Erh-Chung Chen MSc Student, National Tsing Hua University
Erh-Chung Chen is pursuing an M.S. at National Tsing Hua University. His first research project used CUDA to accelerate physical models. He is now studying a new approach for solving the Poisson equation quickly in parallel computing environments.

We study and implement the Augmented Block Cimmino Distributed (ABCD) algorithm on the GPU. Because of the special structure of tridiagonal matrices, we investigate a boundary padding technique to eliminate execution branches on the GPU for better performance. In addition, our implementation incorporates various performance optimization techniques, such as memory coalescing, to further enhance performance.
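
The padding idea can be sketched as follows (our illustration, with a Jacobi-style sweep standing in for the ABCD iteration): ghost cells with zero off-diagonal coefficients make the boundary rows use the same update as interior rows, so no thread branches.

// Arrays dl, d, du, b, x carry one ghost cell at each end; dl[1] = du[n] = 0,
// so rows 1 and n need no special-case code.
__global__ void paddedSweep(const float* dl, const float* d, const float* du,
                            const float* b, const float* x, float* xNew, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int k = i + 1;                                  // shift into the padded layout
    float s = dl[k] * x[k - 1] + du[k] * x[k + 1];  // ghosts absorb the edges
    xNew[k] = (b[k] - s) / d[k];
}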

Level: Beginner technical
Type: Poster
Tags: Algorithms; Supercomputing & HPC

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6123 - GPU Accelerated Hausdorff Distance Computation Using Mathematical Morphology

Érick Rodrigues Ph.D. Student, Universidade Federal Fluminense
Erick Rodrigues is a Ph.D. student in visual computing at Universidade Federal Fluminense, Niteroi - Rio de Janeiro, Brazil. Recently, he has been working on fractals and metaheuristics on the GPU. Erick also has an M.S. in visual computing from Universidade Federal Fluminense. His main fields of study are processing, analysis, and synthesis of images; fractals; registration; pattern recognition; artificial intelligence; metaheuristics; medical images; and games and engines.
Esteban Clua Professor, Universidade Federal Fluminense
Esteban Clua is a professor and vice-director of the Computer Science Institute of Universidade Federal Fluminense, director of UFF Medialab, and coordinator of the NVIDIA CCOE at UFF. He has been a CUDA Fellow since 2015.

Hausdorff distance is a widely used metric in visual computing for comparing images, finding patterns, and performing registrations. We propose two parallel algorithms for computing the Hausdorff distance using morphological dilations on the GPU. The algorithms require block synchronization and both the CPU-based and GPU-based block synchronizations were evaluated. Furthermore, we compare the efficiency of the proposed algorithms to implementations on the CPU and GPU regarding distinct programming languages, including C++, Java, and the Aparapi library. Experimental results have shown that the CUDA GPU-based synchronized algorithm provided the best results and was approximately 26 and 6630 times faster than the CPU for large binary and grey images, respectively.
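
The core GPU operation can be sketched as one binary dilation step; counting how many such steps one image needs before it covers the other yields the directed Hausdorff distance. The kernel below is our simplified illustration, not the posters' exact code.

// One 3x3 binary dilation pass: a pixel turns on if any neighbor is on.
__global__ void dilate3x3(const unsigned char* in, unsigned char* out,
                          int H, int W) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;
    unsigned char v = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int yy = y + dy, xx = x + dx;
            if (yy >= 0 && yy < H && xx >= 0 && xx < W && in[yy * W + xx])
                v = 1;
        }
    out[y * W + x] = v;
}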

Level: Intermediate technical
Type: Poster
Tags: Computer Vision & Machine Vision; Performance Optimization

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6124 - Non-Local Lattice Encoding for Bit-Vectorized Cellular Automata GPU Implementations

Jeffrey Kelling PhD Student, Helmholtz-Zentrum Dresden-Rossendorf
Jeffrey Kelling is a Ph.D. student at TU Chemnitz and Helmholtz-Zentrum Dresden-Rossendorf, Germany, working on nanostructure evolution and topics in statistical physics using GPUs for large-scale simulations. Jeffrey studied physics at TU Dresden, Germany. He received his diploma in 2012 from Helmholtz-Zentrum Dresden-Rossendorf, where his thesis was "Kinetic Monte Carlo Simulations on Self-organization of Nanostructures Accelerated by Massive Parallelization."

In many areas, from physics to economics to social sciences, there are problems that can be mapped to stochastic cellular automata (SCA). In combination with machine learning techniques, cellular automata with learned rules can be used to efficiently predict real-world systems. In physics, they are used to study atomistically the size and shape evolution of micro- and nanostructures, providing insights into processes of self-organization crucial to today's nanotechnology. We present an extremely efficient SCA implementation of a surface growth model using bit-vectorization enhanced by non-local encoding on GPU. The employed technique and non-local encoding can be transferred to other applications.

Level: Intermediate technical
Type: Poster
Tags: Algorithms; Computational Physics

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6128 - Fully Parallelized Lossless LZW Decompression for CUDA® Enabled GPUs

Koji Nakano Professor, Hiroshima University
Koji Nakano is a professor at the School of Engineering, Hiroshima University. Koji received his B.E., M.E., and Ph.D. degrees from the Department of Computer Science, Osaka University, Japan in 1987, 1989, and 1992, respectively. From 1992-1995, he was a research scientist at the Advanced Research Laboratory at Hitachi Ltd. In 1995, he joined the Department of Electrical and Computer Engineering at Nagoya Institute of Technology. In 2001, he became an associate professor at the School of Information Science, Japan Advanced Institute of Science and Technology. He has been a full professor at School of Engineering, Hiroshima University since 2003. He has published extensively in journals, conference proceedings, and book chapters. He has served on the editorial board of journals, including IEEE Transactions on Parallel and Distributed Systems, IEICE Transactions on Information and Systems, and International Journal of Foundations on Computer Science.

LZW is a popular lossless compression method used in the UNIX file compression utility "compress" and in the GIF/TIFF image formats. However, it is very hard to parallelize, because it creates a dictionary sequentially by reading the input data one code at a time. The main contribution of this work is to show a fully parallelized LZW decompression, which assigns each thread to an input compressed code and converts it into the corresponding original input string. We have implemented our fully parallelized LZW decompression using CUDA. The experimental results show that our CUDA implementation on a GeForce GTX 980 can attain a 40 times speedup over a sequential implementation on an Intel Core i7-4790. We also show that our LZW decompression is useful for big data and deep learning applications.
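
The one-thread-per-code idea can be sketched as below; we assume the dictionary (parent code and final byte per entry) and per-code output offsets (a prefix sum over decoded lengths) have already been built, which may differ from the authors' exact scheme.

// Each thread expands one compressed code by walking its dictionary chain
// backwards and writing the decoded string back to front.
__global__ void lzwExpand(const int* codes, const int* prefix,
                          const unsigned char* lastChar, const int* outOff,
                          unsigned char* out, int nCodes) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nCodes) return;
    int c = codes[t];
    int end = outOff[t + 1];                  // exclusive end of this string
    while (c >= 0) {                          // prefix[c] == -1 at a root symbol
        out[--end] = lastChar[c];
        c = prefix[c];
    }
}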

Level: Intermediate technical
Type: Poster
Tags: Algorithms; Video & Image Processing

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6130 - Reducing Remote GPU Execution's Overhead with mrCUDA®

Pak Markthub Ph.D. Student, Tokyo Institute of Technology
Pak Markthub is pursuing a Ph.D. in the Department of Mathematical and Computing Science at Tokyo Institute of Technology, Japan, under the MEXT scholarship program. He is a member of Satoshi Matsuoka Research Laboratory, which specializes in high performance computing research. Pak received his B.S. in computer engineering from Kasetsart University, Thailand, in 2011, and his M.S. in science from Tokyo Institute of Technology in 2015. His research interests are remote/virtual GPU execution, cloud computing, and system virtualization.

Remote GPU execution (e.g., rCUDA) has proven useful in many situations, including increasing overall resource utilization in multi-GPU batch-queue systems. However, for applications that communicate intensively with GPUs, remote GPU execution overhead can become large, even when InfiniBand is used as the communication backend. Since using local GPUs is always better in terms of performance, we propose mrCUDA, a middleware for transparently migrating work from remote to local GPUs. We present how mrCUDA works, along with two case studies showing how much mrCUDA can improve LAMMPS's performance and how it solves the scattered idle-GPU problem, compared with continuously using rCUDA.

Level: Intermediate technical
Type: Poster
Tags: Supercomputing & HPC; Tools & Libraries

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6131 - Fast Parallel Skew and Prefix-Doubling Suffix Array Construction on the GPU

Leyuan Wang Ph.D. student, University of California, Davis
Leyuan Wang is a Ph.D. student in computer science at the University of California, Davis. Her advisor is Professor John Owens. Leyuan completed her M.S. in electrical and computer engineering at UC Davis in 2014, after having earned her undergraduate degree in electronics science and technology at China's Zhejiang University. Her research spans general-purpose computing on graphics units -- GPGPU, also known as GPU computing -- along with computer graphics, parallel algorithms programming, and data compression. Most recently, she has been implementing classic algorithms on GPU and large-scale GPU computing. This summer, she presented one of only two "Distinguished Papers" of the 51 accepted at Euro-Par 2015. She is the main developer and release czar for CUDPP 2.2.

This poster presents the latest techniques for accelerating suffix array construction algorithms (SACAs) using CUDA. The suffix array (SA) data structure is used in a broad spectrum of applications, including data compression, bioinformatics, and text indexing. The recent explosion in data sizes and the emergence of commodity data-parallel processors motivate efficient parallel implementations of SACAs. Because of the high arithmetic and memory throughput of many-core GPUs and multi-core CPUs, these processors are well-suited for data-intensive computing tasks such as SACAs. We address the problem by designing, implementing, and comparing two different formulations of SACAs on NVIDIA GPUs and achieve significant speedups compared with previous CPU/GPU state-of-the-art implementations.

Level: Intermediate technical
Type: Poster
Tags: Performance Optimization; Algorithms

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6132 - Fast Sparse Matrix Vector Multiplication with Highly-Compressed Sparse Format

Yusuke Nagasaka Student, Tokyo Institute of Technology
Yusuke Nagasaka is an M.S. student at the Tokyo Institute of Technology, working under Professor Satoshi Matsuoka. He received his B.S. in 2014 from Tokyo Institute of Technology. His research interests include sparse matrix formats for many-core architectures.

We show the acceleration of sparse matrix vector multiplication (SpMV) on the GPU by greatly reducing memory traffic. SpMV is a dominant kernel in many sparse algorithms. Its performance is limited by memory bandwidth, and the low locality of memory accesses to the input vector causes further degradation. We propose a new sparse matrix format that alleviates these memory-bound problems through adaptive multi-level blocking techniques and compression of the matrix indices. Performance evaluations of SpMV for 40 matrix datasets show speedups of up to 2.91X, and 1.81X on average, compared with NVIDIA's cuSPARSE library. We also find that the memory traffic in SpMV can be estimated, and that SpMV performance strongly depends on it.
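
As a reference point, the standard scalar CSR kernel below shows where the index traffic comes from: every nonzero costs a column index load plus a poorly localized gather from x. The proposed format compresses exactly this traffic (the kernel is a textbook baseline, not the poster's code).

// Baseline CSR SpMV: one row per thread.
__global__ void spmvCsr(const int* rowPtr, const int* colIdx, const float* val,
                        const float* x, float* y, int nRows) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nRows) return;
    float sum = 0.0f;
    for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
        sum += val[j] * x[colIdx[j]];         // index load + irregular gather
    y[row] = sum;
}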

Level: Intermediate technical
Type: Poster
Tags: Algorithms; Supercomputing & HPC

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6133 - GPU-Based Parallel Fuzzy C-Mean Clustering Model Through Genetic Algorithm

Yuan Huai Wu M.Sc. Student, Providence University
Yuan-Huai Wu is an M.S. student in the Department of Computer Science and Information Engineering, Providence University, Taiwan. He received his B.S. from the Department of Computer Science and Communication Engineering.
Che Lun Hung Associate Professor, Providence University
Che-Lun Hung has been an associate professor at Providence University since 2013. He received his M.S. from the Department of Information Engineering and Computer Science at Feng Chia University in 2000 and his Ph.D. from the Department of Computer Science at National Tsing Hua University in 2010. In 2010, he joined the Department of Computer Science and Communication Engineering at Providence University as an assistant professor. His research interests are in the areas of parallel and distributed computing, proteomics, genomics, systems biology, next-generation sequencing, and computational chemistry.

Detection of white matter changes in brain tissue using magnetic resonance imaging has been an increasingly active and challenging research area in computational neuroscience. A genetic algorithm based on a fuzzy c-means clustering method (GAFCM) was applied to simulated images to separate foreground spot signal information from the background. However, GAFCM carries a heavy computational load. This study presents a new GPU-based parallel GAFCM algorithm for robust segmentation of brain MRIs. The experimental results indicate that the proposed algorithm achieves acceptable segmentation results and can significantly reduce computational cost.

Level: Intermediate technical
Type: Poster
Tags: Medical Imaging; Embedded

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6134 - Photometry of Fractal Meshes for Applications to Large-Scale Rough Planetary Surfaces

Antonio Gracia Berná Postdoctoral researcher, University of Bern
Antonio Gracia Berna is a postdoctoral fellow with the Swiss National Science Foundation (SNSF), researching at the Physics Institute for Space Research And Planetary Sciences (University of Bern, Switzerland). He received his M.S. in computer graphics, games, and virtual reality from Universidad Rey Juan Carlos, Madrid, Spain, in 2010. He received his M.S. in advanced computing for science and engineering from Universidad Politecnica de Madrid in 2012. In June 2014, he received a Ph.D. in computer science from Universidad Politecnica de Madrid, with international mention, where he worked on two internationally renowned projects: Cajal Blue Brain, the first comprehensive attempt to reverse-engineer the mammalian brain, and the Gaia Mission, the European Space Agency effort to chart a 3D map of the galaxy. Antonio's career focus is on the space industry; he has worked on three different European Space Agency projects: Rosetta, BepiColombo, and ExoMars. His research interests include the large-scale analysis of interplanetary data collected by different instruments (e.g., OSIRIS/BELA) aboard spacecraft during space exploration missions.

The photometry measured by spacecraft during space missions provides important information about planetary surface composition and properties, such as roughness, which influences the photometry. The model by B. Hapke has been one of the most widely used models for fitting photometric data, but it presents drawbacks. We present a GPU-accelerated technique that simulates the photometry produced on large-scale rough surfaces as the interaction of millions of light rays. Reflectance values measured in the laboratory from real samples are used in the simulation. To prove the validity of the approach, a comparison with the Hapke model is proposed. This is a first step toward relating real laboratory measurements to the photometry of solar system surfaces observed by past and future missions.

Level: Beginner technical
Type: Poster
Tags: Astronomy & Astrophysics; Computational Physics

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6138 - Real-Time Face Detection in FHD Images Exploiting both Embedded CPU and GPU

Chanyoung Oh Graduate Student, University of Seoul
Chanyoung Oh is an M.S. student at the University of Seoul, where he received his B.S. in electrical and computer engineering in 2015. His research interests include parallel software design, heterogeneous computing, embedded GPU platforms, computer vision, and medical imaging.
Youngmin Yi Associate Professor, University of Seoul
Youngmin Yi is an associate professor at the University of Seoul. His research interests include algorithm/architecture codesign for heterogeneous manycore platforms, GPU computing, high-performance distributed frameworks using manycore accelerators, and computer vision applications. He received his Ph.D. in electrical engineering and computer science from Seoul National University in 2007.
Saehanseul Yi Research Engineer, University of Seoul
Saehanseul Yi received B.S. and M.S. degrees in electrical & computer engineering from the University of Seoul in 2013 and 2015, respectively. He is currently with Dasan Networks. His research interests include parallel software design, heterogeneous computing, embedded GPU platforms, computer vision, and high-performance distributed frameworks using manycore accelerators.

Face detection, along with face recognition, has become very popular for many reasons. Many detection systems, such as intelligent surveillance systems, require real-time processing of incoming video streams. Recently, the image resolution widely used for surveillance cameras has increased to full high definition (FHD, 1920x1080), which obviously increases the computation time for face detection. At the same time, CPU-GPU heterogeneous systems have become a mainstream platform in both server and embedded domains, with ever increasing demand for powerful accelerators. We present parallelization techniques that exploit both the data and task parallelism of an LBP-based face detection algorithm on an embedded heterogeneous platform.
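
The pixel-level data parallelism is easy to picture: each thread computes one 8-neighbor LBP code. The kernel below is a hedged sketch of that stage only, not the posters' full detector.

// One thread per pixel: threshold the 8 neighbors against the center.
__global__ void lbp8(const unsigned char* img, unsigned char* code,
                     int H, int W) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= W - 1 || y >= H - 1) return;
    unsigned char c = img[y * W + x], v = 0;
    v |= (img[(y - 1) * W + (x - 1)] >= c) << 7;  // clockwise from top-left
    v |= (img[(y - 1) * W +  x     ] >= c) << 6;
    v |= (img[(y - 1) * W + (x + 1)] >= c) << 5;
    v |= (img[ y      * W + (x + 1)] >= c) << 4;
    v |= (img[(y + 1) * W + (x + 1)] >= c) << 3;
    v |= (img[(y + 1) * W +  x     ] >= c) << 2;
    v |= (img[(y + 1) * W + (x - 1)] >= c) << 1;
    v |= (img[ y      * W + (x - 1)] >= c);
    code[y * W + x] = v;
}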

Level: Beginner technical
Type: Poster
Tags: Embedded; Video & Image Processing; IoT

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6139 - Heterogeneous Learning for Multi-task Facial Analysis Using Single Deep Convolutional Network

Takayoshi Yamashita Lecturer, Chubu University
Takayoshi received his Ph.D. from the Department of Computer Science, Chubu University, Japan, in 2011. He worked at OMRON Corporation from 2002 to 2014 and has been a lecturer in the Department of Computer Science, Chubu University, Japan, since 2014. His current research interests include object detection, object tracking, human activity understanding, pattern recognition, and deep learning.
Yuu Kato Mr, Chubu University
Yuu received his bachelor's degree from the Department of Computer Science, Chubu University, Japan, in 2016. His current research interests include facial analysis and deep learning.
Hiroshi Fukui Mr, Chubu University
Hiroshi received his master's degree from the Department of Computer Science, Chubu University, Japan, in 2016. His current research interests include pedestrian detection and deep learning.
Yuji Yamauchi Dr, Chubu University
Yuji received his Ph.D. from the Department of Computer Science, Chubu University, in 2012. From 2012 to 2014 he was a postdoctoral fellow at Chubu University. In 2011, he was a visiting student at the Robotics Institute, Carnegie Mellon University. He held a fellowship from the Japan Society for the Promotion of Science from 2010 to 2012. His research interests include computer vision and pattern recognition.
Hironobu Fujiyoshi Doctor, Chubu University
Hironobu received his Ph.D. in electrical engineering from Chubu University, Japan, in 1997. From 1997 to 2000 he was a postdoctoral fellow at the Robotics Institute of Carnegie Mellon University, Pittsburgh, PA, working on the DARPA Video Surveillance and Monitoring (VSAM) effort and the humanoid vision project for the HONDA Humanoid Robot. He is now a professor in the Department of Computer Science, Chubu University, Japan. His research interests include computer vision, video understanding, and pattern recognition.

Performing multiple tasks like recognition and regression usually requires multiple networks, and the computational cost grows in proportion to the number of tasks. Although heterogeneous learning can perform multiple tasks in a single network, each task performs worse than when trained individually. We propose a new heterogeneous learning method with a weighted loss function and apply it to facial analysis, which contains five heterogeneous tasks (gender estimation, race detection, facial point detection, age estimation, and smile degree estimation). Even with a single network, the performance is comparable to networks trained for a single task each. Computation takes 22 ms on a GTX 980, five times faster than using five single-task networks.
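
The weighted loss is presumably of the usual multi-task form, L(θ) = Σ_t w_t · L_t(θ), where L_t is the loss of task t (e.g., cross-entropy for the classification tasks and squared error for the regressions) and the weights w_t balance how strongly each task drives the shared network.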

Level: Intermediate technical
Type: Poster
Tags: Deep Learning & Artificial Intelligence; Computer Vision & Machine Vision

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6140 - High Performance Hierarchical Matrix-Vector Multiplication using Hardware Accelerators

Hatem Ltaief Senior Research Scientist, KAUST
Highly-Rated Speaker
Hatem Ltaief is a Senior Research Scientist in the Extreme Computing Research Center at KAUST, where he also advises several KAUST students in their M.S. and Ph.D. research. Hatem received an engineering degree from Polytech Lyon at the University of Claude Bernard Lyon I, France, an M.S. in applied mathematics from the University of Houston, and a Ph.D. in computer science from the University of Houston. From 2008 to 2010, he was a research scientist in the Innovative Computing Laboratory in the Department of Electrical Engineering and Computer Science at the University of Tennessee, Knoxville. He is part of the European Exascale Software Initiative (EESI) to build a European vision and roadmap to address the challenges of the new generation of massively parallel systems. He has various strategic partnerships with industry (Saudi Aramco, NVIDIA, Intel) as well as universities and HPC centers (University of Tennessee, INRIA Bordeaux, L'Observatoire de Paris, Barcelona Supercomputing Center, University of Erlangen). He is the (co)author of more than 40 journal and conference papers and book chapters. His research interests include parallel numerical algorithms, fault-tolerant algorithms, parallel programming models, and performance optimizations for multicore architectures and hardware accelerators.

We present a high performance hierarchical matrix vector multiplication using hardware accelerators. By properly mapping the tree structures to the GPU and overlapping the phases of the computation using streams, we greatly outperform the CPU implementations and achieve up to 80% of the sustained bandwidth of the GPU.
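
The overlap can be sketched with two CUDA streams that alternate between levels of the tree; gemvBatch and the per-level buffers below are illustrative names, not the authors' code, and the host buffers must be pinned for the asynchronous copies to actually overlap.

cudaStream_t s[2];
for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);
for (int lvl = 0; lvl < nLevels; ++lvl) {
    cudaStream_t st = s[lvl % 2];             // alternate streams per level
    cudaMemcpyAsync(dBlocks[lvl], hBlocks[lvl], levelBytes[lvl],
                    cudaMemcpyHostToDevice, st);
    // the kernel waits only on its own copy, overlapping the other stream
    gemvBatch<<<blocksAtLevel[lvl], 256, 0, st>>>(dBlocks[lvl], dX, dY, lvl);
}
for (int i = 0; i < 2; ++i) cudaStreamSynchronize(s[i]);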

Level: Intermediate technical
Type: Poster
Tags: Algorithms; Supercomputing & HPC

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6141 - GPU-Accelerated Isosurface Extraction

Michal Kierzynka Researcher, Poznan Supercomputing and Networking Center
Michal Kierzynka is employed at the Poznan Supercomputing and Networking Center. His research interests include bioinformatics, parallel computing, algorithms, new computing architectures, and energy efficiency. Michal received his Ph.D. in computer science from Poznan University of Technology, Poland, where he worked on high-performance DNA de novo assembly (bioinformatics).
Marcin Adamski Researcher, Poznan Supercomputing and Networking Center
Marcin Werla has been working at Poznan Supercomputing and Networking Center since 2002. He started working for the Grid community on the GridLab project, leading the security work package. He has been involved in many other EU-funded projects: InteliGrid, QosCosGrid, ACGT, BeInGrid, BREIN, and AirPROM.

Algorithms for isosurface extraction from volumetric data have become crucial in the petroleum industry, medicine, and many other fields in recent years. They are computationally intensive, especially for large, high-resolution domains. Our GPU implementation of the Marching Tetrahedra algorithm is not only immensely fast but also allows us to split the domain across multiple GPUs. Processing large domains is now a matter of seconds. For smaller domains, the algorithm computes the isosurface in milliseconds, and the resulting model is visualized in real time.

Level: Beginner technical
Type: Poster
Tags: Algorithms; Medical Imaging

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6144 - Non-Equilibrium GPU Simulation of Entangled Polymer Melts by Extending HOOMD-Blue

Ludwig Schneider Ph.D. Student, University of Göttingen -- Institute for Theoretical Physics
Ludwig Schneider is a Ph.D. student at Georg August University of Gottingen, Germany. In 2012, he first encountered general-purpose GPU programming during a research internship at the Max Planck Institute for Dynamics and Self-Organization, Gottingen, Germany. After developing his first CUDA program in C, he ran simulations for his bachelor thesis by customizing HOOMD-blue in C++/CUDA, and has been refining this approach ever since. Recently, Ludwig started to contribute to the HOOMD-blue project itself, providing many implementations and improving them for use in multi-GPU simulations.

Molecular dynamics (MD) simulations for melts of multicomponent polymer systems with an experimental value of the invariant degree of polymerization are not tractable with conventional techniques and CPU implementations. For standard GPU-accelerated MD simulations, the HOOMD-blue software package provides an excellent, well-implemented solution. Simulating highly coarse-grained polymer melts, however, requires specialized techniques to investigate non-equilibrium properties. These techniques (the slip-spring model) are physically motivated, and we discuss their algorithmic characteristics and parallel implementation on GPUs. We evaluate the results with HOOMD-blue benchmarks extended by the new implementations.

Level: Intermediate technical
Type: Poster
Tags: Computational Physics; Computational Fluid Dynamics

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6145 - Real-Time Simulation and Prognosis of Smoke Propagation in Underground Stations: Roadmap

Anne Severt Ph.D. Candidate, Forschungszentrum Jülich GmbH
Anne Severt is a Ph.D. candidate in the Julich Supercomputing Center. In the Civil Security and Traffic division and in collaboration with project ORPHEUS, she develops a real-time simulation and prognosis software for smoke propagation in underground stations. Before joining the Forschungszentrum Julich, she worked as a consultant at Bain & Company, a management consulting firm, with projects in the oil and gas, insurance, and fire protection industries. Anne holds an M.S. in mathematics from the RWTH Aachen University, Germany.

Real-time simulations of smoke propagation during fires in complex geometries are challenging; accuracy sacrificed for the sake of computing time could impact rescue decisions. We present a roadmap for building real-time simulation and prognosis software for smoke propagation in underground stations. First, a fractional step method using finite differences is implemented. Second, we evaluate the Lattice-Boltzmann method for its high parallelization capability and smart handling of complex boundaries and grids. By including live data through sensor coupling, accuracy is maintained, and the prognosis is optimized by ensemble simulation. Further acceleration is accomplished by dynamically extending the computational domain and by multi-GPU support.

Level: Beginner technical
Type: Poster
Tags: Computational Fluid Dynamics; Real-Time Graphics

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6147 - Robust Moving Object Detection on Tegra K1

Cevahir Cigla Senior Design Engineer, Aselsan Inc.
Cevahir Cigla is a design engineer at ASELSAN, the 52nd largest military electronics company worldwide. Cevahir got his Ph.D. in electrical and electronics engineering from Middle East Technical University in 2012. He worked as an algorithm design engineer at a TV company from 2008 till 2012. He developed algorithms for video surveillance on NVIDIA platforms. Recently, his work has focused on the applications through the Tegra K1 mobile platform.
Burak Ozkalayci Senior Design Engineer, Aselsan Inc.
Burak Ozkalayci is a design engineer at ASELSAN. He obtained his Ph.D. in electrical and electronics engineering from the Middle East Technical University in 2014. He develops computer vision and machine learning algorithms for embedded systems like the Tegra K1 mobile platform.

We present a novel background-modeling-based moving object detection and segmentation approach, with a real-time implementation on the recent NVIDIA Tegra K1 mobile GPU platform. The proposed solution introduces pixel-wise adaptive background learning rates as well as reinforced re-learning of the models. In this manner, dynamic backgrounds in particular are modeled robustly: in regions where false alarms arise from irrelevant motion, the learning rate is increased. Detection is followed by shadow removal and a dual background modeling approach to detect abandoned objects with high precision. Each algorithmic step (detection, shadow removal, and abandoned object detection) is implemented on the GPU, and real-time performance is achieved on Jetson TK1 for 720x576 videos.
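
The per-pixel adaptive learning rate can be sketched with a single-Gaussian stand-in for the posters' model; the threshold and adaptation factors below are made up for illustration.

// Pixels that keep disagreeing with the model re-learn faster.
__global__ void updateBackground(const unsigned char* frame, float* bg,
                                 float* alpha, unsigned char* fg,
                                 int nPixels, float thresh) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPixels) return;
    float diff = fabsf((float)frame[i] - bg[i]);
    fg[i] = diff > thresh ? 255 : 0;
    alpha[i] = fg[i] ? fminf(alpha[i] * 1.05f, 0.5f)   // speed up re-learning
                     : fmaxf(alpha[i] * 0.95f, 0.01f); // settle when stable
    bg[i] = (1.0f - alpha[i]) * bg[i] + alpha[i] * (float)frame[i];
}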

Level: Intermediate technical
Type: Poster
Tags: Intelligent Video Analytics (IVA); Video & Image Processing; IoT

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6149 - Strassen's Algorithm in a Heterogeneous CPU-GPU Distributed System

Liliana Ibeth Barbosa Santillán Research Engineer, University of Guadalajara
Dr. Liliana Ibeth Barbosa Santillán holds a Ph.D. in computer science from the Faculty of Computer Sciences at the Technical University of Madrid, Spain, and a Ph.D. in strategic planning and technology direction from UPAEP. She is the chairperson of www.dataminingengineeringgroup.net.

We review the Strassen algorithm for matrix multiplication. The major difficulty in implementing it on hybrid/heterogeneous systems is the number of programming tools required to manage the hardware. Our objective is to take advantage of both task and data parallelism by using a framework that simplifies the implementation while improving performance. The problem is divided into many independent tasks mapped to computing devices, and each task is then executed using data parallelism to harness the advantages of GPUs and multicore CPUs. To verify the advantages of this task/data-parallel approach, we performed several experiments varying the matrix size. Our results show that we can achieve up to a 3.4X speedup using the same application.
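
For reference, the seven Strassen products that create those independent tasks for a 2x2 block partition are:

M1 = (A11 + A22)(B11 + B22)
M2 = (A21 + A22) B11
M3 = A11 (B12 - B22)
M4 = A22 (B21 - B11)
M5 = (A11 + A12) B22
M6 = (A21 - A11)(B11 + B12)
M7 = (A12 - A22)(B21 + B22)

C11 = M1 + M4 - M5 + M7,  C12 = M3 + M5,
C21 = M2 + M4,            C22 = M1 - M2 + M3 + M6

Each Mi is an independent half-size multiplication, so the seven products can be distributed across GPUs and CPU cores and recursed independently.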

Level: Intermediate technical
Type: Poster
Tags: Supercomputing & HPC; Algorithms

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6150 - High Accuracy High Performance 3D Cone Beam Reconstruction

Oleg Konings Senior Software Engineer, Triple Ring Technologies
Oleg Konings has a wide range of experience designing and optimizing high-performance algorithms. His particular interest is GPU Computing and he has worked on several novel medical and security applications. At UCSF he was a member of the Gazzaley Lab where he worked on real-time functional brain imaging. This work resulted in the first GPU implementation of the group lasso convex optimization solver, for both dense and sparse systems. This solver was used in the 'Glass Brain' project to compute the degree of connectivity between regions of the brain in real-time. Currently at Triple Ring he works on GPU implementations of algorithms used for image processing, Monte Carlo simulations, and combinatorial optimization. His CUDA application for generating and evaluating all permutations of an array is considered the fastest known implementation.
Daniel Badali Senior Physicist, Triple Ring Technologies
Daniel Badali has a range of experience in imaging and diffraction techniques, in particular in using ultrafast electron diffraction to study transient dynamics in matter. This involved the reconstruction of 3D molecular structures by processing large amounts of noisy image data and applying iterative algorithms to address the "phase problem" in crystallography. In addition to the development of such complex experimental systems, he has worked extensively in mathematical modeling, laser science, and single-molecule biophysics. Daniel received a B.Sc. (with high distinction) in biological physics and mathematics from the University of Toronto. He was awarded an M.S. in physics and received his Ph.D. in physics in 2015 from the University of Hamburg for work done at the Max Planck Institute for the Structure and Dynamics of Matter.
Tobias Funk Director, X-Ray System R&D, Triple Ring Technologies
Tobias Funk has years of experience in the development of instrumentation for science and medicine. Prior to joining Triple Ring, he worked as a researcher at Lawrence Berkeley National Laboratory and in the Department of Radiology at the University of California, San Francisco. Tobias is an experimental solid state physicist by training and has focused on the utilization of ionizing radiation throughout his career. He developed synchrotron instrumentation for spectroscopy on proteins and ultra-low temperature equipment for nuclear solid state physics. More recently, he worked on SPECT imaging and developed a dedicated cardiac SPECT system that was successfully brought to market. At Triple Ring, Tobias focuses on dose reduction in X-ray systems. His current research is funded by a Phase II SBIR grant to develop a large field-of-view inverse geometry fluoroscopy system; previous efforts have been grant- and partner-funded.
Scott Hsieh Research Scientist, Triple Ring Technologies,Stanford University
Scott Hsieh has experience in image processing, medical devices, and X-ray physics. He specializes in image reconstruction algorithms and has worked on reconstruction projects in both X-ray CT and ultrasound. Prior to working at Triple Ring, he assisted start-up companies developing medical imaging devices with new modalities. His prior work also includes astronomical work, characterization of nanoelectronics, and software engineering. Scott received a B.Sc. in applied physics and business economics from Caltech, and his M.Sc. and Ph.D. degrees in electrical engineering from Stanford University. He has authored 13 peer-reviewed publications and one book chapter and is a named inventor on six granted and pending patents.

We have developed a high-performance, high-accuracy multi-GPU back-projection application that generates 32-bit 3D volumes of size 1024^3 and 2048^3. Reconstructing a large 3D volume enables detailed examination of human anatomy or complex industrial devices with numerous small features. The application has been tested with both real-world CT data and data generated from CAD models used as inputs to X-ray Monte Carlo simulations. The current implementation for 1024^3 was validated and benchmarked using the RabbitCT projection set; our results take second place, with the least reconstruction error (32-bit).
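
A voxel-driven back-projection step can be sketched as below; the 3x4 projection matrix, nearest-neighbor sampling, and FDK-style 1/w^2 weighting are our simplifications, not the validated implementation.

// One thread per voxel: project into the detector and accumulate.
__global__ void backproject(const float* proj, const float* P, float* vol,
                            int N, int U, int V) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z;
    if (x >= N || y >= N || z >= N) return;
    float w = P[8] * x + P[9] * y + P[10] * z + P[11];   // homogeneous depth
    float u = (P[0] * x + P[1] * y + P[2]  * z + P[3]) / w;
    float v = (P[4] * x + P[5] * y + P[6]  * z + P[7]) / w;
    int ui = (int)(u + 0.5f), vi = (int)(v + 0.5f);      // nearest neighbor
    if (ui >= 0 && ui < U && vi >= 0 && vi < V)
        vol[((size_t)z * N + y) * N + x] += proj[vi * U + ui] / (w * w);
}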

Level: Intermediate technical
Type: Poster
Tags: Medical Imaging; Aerospace & Defense

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6160 - A GPU Parallel Solver for 3D Incompressible Navier-Stokes Equations Discretized by the SUPG/PSPG Stabilized Finite Element Formulation

Viet Huynh Quang Huy Research Assistant Professor, Graduate School of Environmental and Life Science, Okayama University, Japan
Viet Huynh Quang Huy is a research assistant professor at the Graduate School of Environmental and Life Science, Okayama University, Japan. His main areas of research interest are numerical simulation, parallel computation, and computer vision.

The discretization of the Navier-Stokes equations by the SUPG/PSPG stabilized finite element formulation leads to a large, sparse, non-symmetric system of linear equations. Such nonsymmetric linear systems are often solved using iterative methods such as the biconjugate gradient stabilized method (Bi-CGStab). Among the variations of the Bi-CGStab algorithm proposed by various researchers, the GPBi-CG algorithm has been proven to have very good convergence behavior. In this poster, we propose an efficient GPU implementation of a parallel solver based on the GPBi-CG algorithm for the 3D Navier-Stokes equations discretized by the SUPG/PSPG stabilized finite element formulation.

Level: Intermediate technical
Type: Poster
Tags: Computational Fluid Dynamics; Computational Physics

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6162 - CNNs in Content Moderation for Online Classifieds: Scalable Image Classification Service

Alexandra Fenster Senior Analyst, Avito
Alexandra Fenster is a senior analyst at Avito, where she is responsible for computer vision tasks. She received a B.S. in computer science from the Novosibirsk State University and an M.S. from Yandex Data School of Analysis and National Research University Higher School of Economics in Moscow. Her main areas of interest are computer vision and deep learning.

Avito.ru is the biggest online classified advertising platform in Russia. The more buyers we attract, the more attractive the site becomes for swindlers uploading prohibited content. With this growth we have met many challenges related to user content and its quality. At a certain point, scaling manual validation of all incoming items became unrealistic. With a tremendous amount of data available, we started implementing machine learning approaches, at first with text models only. We then found that deep learning and GPU computing let us use both image and text models to improve quality.

Level: Intermediate technical
Type: Poster
Tags: Deep Learning & Artificial Intelligence; Computer Vision & Machine Vision

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6163 - Utilization and Expansion of ppOpen-AT for OpenACC

Satoshi Ohshima Assistant Professor, The University of Tokyo
Satoshi Ohshima is an assistant professor at the Information Technology Center at the University of Tokyo. He has researched GPUs since 2004. Initially, he implemented matrix computations using graphics languages and tried to accelerate them through parallel computation on the CPU and GPU. With CUDA, he implemented an OpenMP environment named OMPCUDA. His current research targets are sparse matrix computations and auto-tuning. Today, he is working on the optimization and auto-tuning of OpenACC programs.

OpenACC is attracting attention as an easy and useful GPU programming environment. While OpenACC is not difficult to use, users still have to spend time and energy optimizing OpenACC programs. To address this, we are developing an auto-tuning (AT) language named ppOpen-AT. We have shown that this language is useful for multi- and many-core parallel programming. Here, we investigate the usability of ppOpen-AT for OpenACC programs and propose extensions to ppOpen-AT for further OpenACC optimization. While ppOpen-AT for OpenACC is still in development, we demonstrate its effectiveness, and we believe our next-generation ppOpen-AT will assist in a wide range of OpenACC optimization work.

Level: Intermediate technical
Type: Poster
Tags: Tools & Libraries; Programming Languages

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6164 - Benefits of Remote GPU Virtualization: The rCUDA Perspective

Federico Silla Associate Professor, Technical University of Valencia
Federico Silla is an associate professor at the Technical University of Valencia, Spain, where he received M.S. and Ph.D. degrees in computer engineering in 1995 and 1999, respectively. Federico worked for two years at Intel Corporation, developing on-chip networks. His research addresses high-performance on-chip and off-chip interconnection networks as well as distributed memory systems and remote GPU virtualization mechanisms. He has been the coordinator of the rCUDA remote GPU virtualization project since it began in 2008. He has published more than 80 papers in peer-reviewed conferences and journals, as well as 10 book chapters.

Many applications use GPUs to accelerate their execution. However, using GPUs introduces several side effects, such as increased acquisition and maintenance costs and space requirements. Moreover, these increased costs may not be easily amortized because GPUs usually present very low utilization rates. In a similar way to virtual machines, the use of virtual GPUs may overcome the concerns associated with real GPU devices. The remote GPU virtualization technique allows an application executing on a computer without a GPU to transparently make use of a GPU installed in another node of the cluster. Although the use of remote GPUs may seem counterintuitive, it provides several benefits, as described in this poster using the rCUDA (remote CUDA) middleware.

Level: Intermediate technical
Type: Poster
Tags: Tools & Libraries; Data Center & Cloud Computing

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6167 - Hercules: High-Performance Real-time Architectures for Low-Power Embedded Systems

Paolo Burgio Research Assistant, University of Modena and Reggio Emilia, Italy
Paolo Burgio is an expert in parallel architectures, compilers, and virtual platforms. He has led research on software parallelization, runtime and compiler development, and programming models at the HiPeRT Lab at the University of Modena, Italy, since 2014. He received an M.S. in computer engineering and a Ph.D. in electronics engineering from the University of Bologna, Italy, and the Universite de Bretagne-Sud, France, respectively.

Many-core architectures are the key building block for the next generation of embedded systems, where power consumption will be the primary concern. Platforms such as NVIDIA Tegra X1 with a GPU and a multi-core host provide an unprecedented performance/watt trade-off, but they are not yet widely adopted in domains such as advanced driver assistance systems (ADAS), where safety-critical requirements and a tight interaction with the surrounding environment call for predictable performance. The Hercules project will develop an integrated framework to obtain predictable performance on top of cutting-edge heterogeneous COTS many-core platforms, with the final goal of obtaining an order-of-magnitude improvement in the cost and power consumption of next-generation real-time applications.

Level: Intermediate technical
Type: Poster
Tags: Robotics & Autonomous Machines; Self-Driving Cars & Automotive ; IoT

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6168 - GPU Accelerated Method for Estimation of Light-Sources

Bruno Augusto Dorta Marques Ph.D. Student, Universidade Federal Fluminense
Bruno Augusto Dorta Marques is a Ph.D. student at the Computer Science Institute at Universidade Federal Fluminense. He has an M.S. in computer science, in the area of computer graphics, from Universidade Federal do ABC - SP, Brazil. His main interests and research areas are real-time rendering, animation, virtual reality, augmented reality, games, and digital entertainment.
Esteban Clua Professor, Universidade Federal Fluminense
Esteban Clua is a professor and vice-director of the Computer Science Institute of Universidade Federal Fluminense, director of UFF Medialab, and coordinator of the NVIDIA CCOE at UFF. He has been a CUDA Fellow since 2015.

The estimation of the light source configuration of a real-world environment can benefit a wide range of applications. Intelligent applications can produce different behaviors based on the lighting present in the user's environment by, for example, adjusting the color scheme of the interface, changing the brightness, and controlling the white balance of a display. One area of particular interest is augmented and mixed reality, which requires that both real-world and virtual elements have consistent appearance and lighting conditions. This study proposes a GPU-accelerated method to recognize aspects of the light sources of a real environment. The method is built around a fast evaluation function that fits the highly constrained time budget of a real-time application.

Level: Intermediate technical
Type: Poster
Tags: Virtual Reality & Augmented Reality; Deep Learning & Artificial Intelligence

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6171 - Fourier-Stepping Implementation for Non-Linear Schrödinger Equations Using OCCA

Andreas Mieritz Ph.D. Student, DTU
Andreas Mieritz is a Ph.D. student at the Technical University of Denmark, working in the field of high performance computing.

We present initial results for a GPU-based split-step Fourier solver for the nonlinear Schrödinger equations. For this poster, the OCCA framework was chosen, which shows great promise for unifying the various parallel programming languages available today. These results are preliminary, and we expect to have a substantially improved implementation of the Fourier transform operations before the conference.
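
A split-step Fourier step alternates a pointwise nonlinear phase in physical space with a linear dispersion phase in Fourier space. The sketch below illustrates that structure in CUDA with cuFFT (our code under simplifying assumptions, not the OCCA implementation described in the poster; the plan is assumed created with cufftPlan1d(&plan, n, CUFFT_Z2Z, 1)):

#include <cufft.h>
#include <cuComplex.h>

// Nonlinear half of the 1D NLS i u_t + (1/2) u_xx + |u|^2 u = 0:
// multiply each sample by exp(i |u|^2 dt) in physical space.
__global__ void nonlinear_phase(cuDoubleComplex *u, int n, double dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double I2 = cuCreal(u[i]) * cuCreal(u[i]) + cuCimag(u[i]) * cuCimag(u[i]);
        u[i] = cuCmul(u[i], make_cuDoubleComplex(cos(I2 * dt), sin(I2 * dt)));
    }
}

// Linear half: multiply by exp(-i k^2 dt / 2) in Fourier space;
// the 1/n factor undoes cuFFT's unnormalized inverse transform.
__global__ void linear_phase(cuDoubleComplex *u, const double *k, int n, double dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double th = -0.5 * k[i] * k[i] * dt;
        u[i] = cuCmul(u[i], make_cuDoubleComplex(cos(th) / n, sin(th) / n));
    }
}

void split_step(cufftHandle plan, cuDoubleComplex *d_u, const double *d_k,
                int n, double dt)
{
    nonlinear_phase<<<(n + 255) / 256, 256>>>(d_u, n, dt);
    cufftExecZ2Z(plan, d_u, d_u, CUFFT_FORWARD);
    linear_phase<<<(n + 255) / 256, 256>>>(d_u, d_k, n, dt);
    cufftExecZ2Z(plan, d_u, d_u, CUFFT_INVERSE);
}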

Level: Beginner technical
Type: Poster
Tags: Computational Physics; Programming Languages

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6172 - Quasar: A Programming Framework for Rapid Prototyping

Bart Goossens Professor, Ghent University
Bart Goossens is a postdoctoral fellow of FWO Flanders and a part-time lecturer at Ghent University. For his research, he received the 2006 Barco/FWO master's thesis prize and the 2011 IBM Belgium Scientific Prize for Informatics. Since 2011, he has been developing Quasar, a programming language and framework with an IDE.

We present a new programming language, Quasar, which mitigates the common drawbacks of GPU programming for rapid prototyping. Quasar is an easy-to-learn, high-level programming language that is hardware-independent, ideal for both rapid prototyping and full deployment on heterogeneous hardware. By using Quasar, a researcher can write compact code in a scripting language while getting high performance due to the use of GPU acceleration. In addition to the Quasar language, we present the development tools that facilitate performance optimization, e.g., profile analyzers and automated code feedback.

Level: Beginner technical
Type: Poster
Tags: Programming Languages; Tools & Libraries

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6173 - GPU-Enabled Monte Carlo Simulation Makes Study of Cardiac Arrhythmia Possible

Mohsin Jafri Professor and Chair, George Mason University
Mohsin Jafri uses computational models to understand the cellular and molecular basis of heart disease. He is professor and chair of the Molecular Neuroscience Department at George Mason University. He has published over 50 peer-reviewed journal articles and has been funded by NIH and NSF. He holds an affiliate appointment at the Center for Biomedical Engineering and Technology at the University of Maryland, Baltimore. Mohsin received his B.S. in mathematics from Duke University, M.S. in mathematics from the Courant Institute of Mathematical Sciences, and Ph.D. in biomathematical sciences from the Mount Sinai School of Medicine.

Heart disease is the leading cause of death in the developed world. Many of these deaths occur through fatal arrhythmia. Multi-scale computational models are required to integrate data to understand the complex dynamics of the myriad components that comprise the heart. Stochastic simulations that integrate the function of individual proteins called ion channels that number in the millions in the cardiac muscle cell are needed to understand arrhythmia. Simulations of such computational complexity are now possible due to our patented Ultra Fast Monte Carlo Algorithm and its implementation on GPUs.

Level: Beginner technical
Type: Poster
Tags: Computational Biology; Algorithms

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6175 - Visual Tracking with Deep Networks: A Performance Study

David Concha Ph.D. Student, Universidad Rey Juan Carlos
David Concha is a Ph.D. student and grant holder at Universidad Rey Juan Carlos, where he also received a B.S. in computer science in 2011. His research interests focus on computer vision and GPU computing.
Antonio S. Montemayor Associate Professor, Universidad Rey Juan Carlos
Antonio S. Montemayor is an Associate Professor at Universidad Rey Juan Carlos (Madrid, Spain) and principal investigator of the CAPO research group at URJC. His research interests include soft computing, computer vision, GPU computing, image and video processing and real-time implementations.
Juan José Pantrigo Associate Professor, Universidad Rey Juan Carlos
Juan Jose Pantrigo is an Associate Professor at Universidad Rey Juan Carlos (Spain), where he is a member of the CAPO research group in the Department of Computer Science. His main research interests focus on the interface among computer science, artificial intelligence, computer vision, and operations research.

We propose a tracking system that uses the power of a deep neural network for the object detection stage and guides the classification stage toward relevant zones in future time steps. Moreover, we perform an extensive performance evaluation across different computing platforms, including a Tegra K1 board, a mobile GPU and CPU, and a desktop CPU and GPU.

Level: Intermediate technical
Type: Poster
Tags: Computer Vision & Machine Vision; Deep Learning & Artificial Intelligence

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6176 - GPU-Accelerated Motion Input Analysis for Motion Reconstruction

Esteban Walter Gonzalez Clua Professor, Universidade Federal Fluminense
Esteban Clua is a professor and vice-director of the Computer Science Institute of Universidade Federal Fluminense, director of UFF Medialab, and coordinator of the NVIDIA CCOE at UFF. Esteban has been a CUDA Fellow since 2015.
Rafael Rego Drumond Master's Student, Universidade Federal Fluminense
Rafael Rego Drumond graduated in 2014 in computer science at Universidade Federal do Maranhao (Brazil), with one year abroad at the University of Illinois at Urbana Champaign. Rafael started his M.S. in 2015 at the Universidade Federal Fluminense (Brazil). Rafael is interested in computer games, serious games, and animation.

Games and real-life simulations introduced motion capture as an interactive control feature that allows humans to reproduce the movements they want to see inside the game. However, reconstructing motion can be slow, especially when a huge pre-recorded database must be searched. This work presents a GPU-parallelized version of a previous animation reconstruction method that detects the motion being performed in order to reconstruct it with reduced delay. We use a K-NN approach to compare human motion input with a pre-recorded database, detect what kind of motion is being performed, and select the animation to be reproduced.
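
The reason this maps well onto the GPU is that the K-NN search reduces to massively parallel, independent distance evaluations against the database. A minimal sketch (our naming and layout, not the project's code):

// Squared Euclidean distance from the live pose (a length-d feature
// vector) to every database pose: one thread per database entry.
// The k smallest distances are then selected in a separate pass.
__global__ void knn_distances(const float *query, const float *db,
                              float *dist, int numPoses, int d)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p < numPoses) {
        float sum = 0.0f;
        for (int j = 0; j < d; ++j) {
            float diff = db[p * d + j] - query[j];
            sum += diff * diff;
        }
        dist[p] = sum;
    }
}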

Level: Beginner technical
Type: Poster
Tags: Game Development; Performance Optimization

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6177 - GPU Accelerated Compressive Sensing in MRI

Jonas Vylder Postdoctoral Researcher, Ghent University
Jonas Vylder has been a member of the Quasar team (a collaboration between UGent and iMinds) since the beginning of 2014, where he is responsible for load balance modeling of kernels on heterogeneous hardware. Jonas received his M.S. in formal informatics at Ghent University in 2007. After graduating, he joined the department of telecommunications and information processing (TELIN) at Ghent University. In 2008, he was awarded an IWT research scholarship for his research on image processing of microscopic images of cell nuclei. His Ph.D. focused on model-based image analysis for biological applications. He is the author or co-author of 48 peer-reviewed publications.

We introduce a new approach to accelerating MRI acquisition. By reducing the number of data samples in combination with a new MRI reconstruction method, we're able to reduce the acquisition time by a factor of 20 without introducing disturbing artifacts. To reconstruct the image, we have to iteratively apply the non-uniform fast Fourier transform (NUFFT), which turns out to be a major bottleneck. Therefore, we have accelerated the NUFFT using the GPU. The speedup achieved by GPU acceleration opens up new options for MRI research.

Level: Beginner technical
Type: Poster
Tags: Medical Imaging; Tools & Libraries

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6178 - A Highly Scalable Kernel Based Clustering Algorithm for Molecular Dynamics

Marco Jacopo Ferrarotti Ph.D. Student, Italian Institute of Technology
Marco Ferrarotti is a Ph.D. student in the CONCEPT lab at the Italian Institute of Technology, working on the study and development of highly scalable machine learning techniques in order to analyze molecular dynamics trajectories of bio-molecular processes of interest. Marco graduated cum laude in the physics of complex systems in 2013 with a master thesis on multiple time-step algorithms in the framework of molecular dynamics simulations. Later he was a scientific developer, working on the customization of software tools used in the study of protein-ligand binding.

We present a novel distributed clustering algorithm based on kernel k-means. The algorithm is carefully designed around the specific needs of molecular dynamics applications but is nevertheless general enough to be applied and verified against standard clustering datasets. Scalability and good performance on modern GPU-endowed HPC facilities are the key points of the design discussed in the poster. The computational burden is addressed by devising a smart distribution strategy and through an iterative mini-batch approach. GPU acceleration is introduced to speed up the expensive evaluation of the kernel matrix. A three-stage pipeline to hide PCIe latency is described, along with a producer-consumer pattern that allows further overlap between GPU and CPU computations.
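
The kernel-matrix evaluation that the poster offloads to the GPU is an embarrassingly parallel computation; for a Gaussian (RBF) kernel it might be sketched as follows (illustrative code under our assumptions, not the authors' implementation):

// One tile of the RBF kernel matrix K[i][j] = exp(-||b_i - x_j||^2 / (2 sigma^2))
// between a mini-batch and the full dataset: one thread per matrix entry.
__global__ void rbf_kernel_matrix(const float *batch, const float *data,
                                  float *K, int nBatch, int nData,
                                  int dim, float inv2sigma2)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // mini-batch index
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // dataset index
    if (i < nBatch && j < nData) {
        float d2 = 0.0f;
        for (int k = 0; k < dim; ++k) {
            float diff = batch[i * dim + k] - data[j * dim + k];
            d2 += diff * diff;
        }
        K[i * nData + j] = expf(-d2 * inv2sigma2);
    }
}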

Level: Intermediate technical
Type: Poster
Tags: Computational Chemistry; Supercomputing & HPC

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6181 - GPU-Based Pedestrian Detection for Autonomous Driving

Juan Carlos Moure Ph.D., Universitat Autonoma de Barcelona
Juan C. Moure is an associate professor at the Universitat Autonoma de Barcelona, where he teaches computer architecture and parallel programming. His current research focuses on the use of massively parallel architectures and the application of performance engineering techniques to open research problems in bioinformatics, signal processing, and computer vision.
Victor Campmany B.Sc. Researcher , Computer Vision Center
Víctor Campmany is a B.Sc. student in computer engineering at Universitat Autònoma de Barcelona (UAB). He is also a research assistant at the Computer Vision Center (CVC), where he collaborates with the Advanced Driver Assistance Systems (ADAS) group. His research interests include computer vision, parallel algorithms, and high performance computing.

Pedestrian detection for autonomous driving has gained a lot of prominence during the last few years. Not only is it one of the hardest tasks within computer vision, it also involves huge computational costs. Obtaining acceptable real-time performance, measured in frames per second (fps), for the most advanced algorithms is a difficult challenge. We propose a CUDA implementation of a well-known pedestrian detection system (i.e., Random Forest of Local Experts). It includes LBP and HOG as feature descriptors and SVM and Random Forest as classifiers. We introduce significant algorithmic adjustments and optimizations to adapt the problem to the NVIDIA GPU architecture. The aim is to deploy a real-time system providing reliable results.

Level: Intermediate technical
Type: Poster
Tags: Self-Driving Cars & Automotive ; Computer Vision & Machine Vision

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6182 - Modeling and Simulation of an Atrium Fiber Using GPUs

John Osorio Professor, Universidad Tecnológica de Pereira
John Osorio is a researcher at Universidad Tecnologica de Pereira. His principal interests are computer architecture, high performance computing and CUDA programming. John works as a computer science professor at Universidad Tecnologica de Pereira, teaching computer architecture and high performance computing. He is the principal investigator of the GPU Education Center at Universidad Tecnologica de Pereira in Colombia.

The particle-in-cell (PIC) method is a computational method that allows solving theoretical models such as the kinetic description of plasma. To study plasma, it is necessary to understand the behavior of the particles through differential equations such as the Vlasov-Poisson equations. Simulating methods such as PIC consumes considerable computational resources due to the number of particles involved. This work presents a 2D implementation of the PIC code to simulate two-stream instability in a cyclic, conservative simulation space, introducing a 2D PIC CUDA implementation that achieves a 17x improvement in execution time for a 64x64 grid with 800,000 particles, and up to 13x for a 517x517 grid with 800,000 particles.

Level: Beginner technical
Type: Poster
Tags: Computational Biology; Supercomputing & HPC

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6183 - Implementation of the Vlasov-Poisson Equations in 2D Using PIC on GPUs

John Osorio Professor, Universidad Tecnológica de Pereira
John Osorio is a researcher at Universidad Tecnologica de Pereira. His interests include computer architecture, high performance computing, and CUDA programming. He has worked for the past seven years as a professor of computer science at Universidad Tecnologica de Pereira, teaching computer architecture and high performance computing. He is the principal investigator of the GPU Education Center at Universidad Tecnologica de Pereira in Colombia.

The particle-in-cell (PIC) method is a computational method that allows solving theoretical models such as the kinetic description of plasma. To study plasma, it is necessary to understand the behavior of the particles through differential equations such as the Vlasov-Poisson equations. Simulating methods such as PIC consumes considerable computational resources due to the number of particles involved. This work presents a 2D implementation of the PIC code to simulate two-stream instability in a cyclic, conservative simulation space, introducing a 2D PIC CUDA implementation that achieves a 17x improvement in execution time for a 64x64 grid with 800,000 particles, and up to 13x for a 517x517 grid with 800,000 particles.
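
To make the structure concrete, the per-step hot spot of an electrostatic PIC code is the particle push, which assigns one thread per particle. The sketch below is a deliberately simplified illustration (nearest-grid-point field gather, periodic boundaries, our naming; the charge deposition and Poisson solve, and the authors' actual kernels, are not shown):

// Push all particles one time step: gather E at the particle's cell,
// advance velocity, then advance position with periodic wrap-around.
__global__ void push_particles(float2 *pos, float2 *vel, const float2 *E,
                               int nParticles, int nx, int ny,
                               float qm, float dt, float Lx, float Ly)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nParticles) return;
    int gx = min((int)(pos[i].x / Lx * nx), nx - 1);   // nearest-grid-point gather;
    int gy = min((int)(pos[i].y / Ly * ny), ny - 1);   // real PIC codes interpolate
    float2 e = E[gy * nx + gx];
    vel[i].x += qm * e.x * dt;                         // accelerate
    vel[i].y += qm * e.y * dt;
    pos[i].x = fmodf(pos[i].x + vel[i].x * dt + Lx, Lx);  // move, periodic wrap
    pos[i].y = fmodf(pos[i].y + vel[i].y * dt + Ly, Ly);  // (assumes |v*dt| < L)
}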

Level: Beginner technical
Type: Poster
Tags: Computational Physics; Supercomputing & HPC

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6186 - Image Registration for Real-Time Database Extraction

Randall Miles Senior Research Scientist, Propulsion Science and Technology
Randall Miles is a physicist, algorithm developer, and senior research scientist at Propulsion Science and Technology. He is lead designer and developer for model database development activities and a key contributor on a variety of projects, including quantum chemistry calculations and radar cross-section modeling of CFD fields.

We present GPU-enabled, real-time, high-fidelity image interpolation software implementing a non-rigid image registration method, i.e., morphing. Morphing is a mathematical method that modifies input images to create a smoothly evolving image set with minimal image degradation. Morphing eliminates jitter in extracted database images and can also decrease the database size. Tests using simulated thermal images (128x256 pixels) of high-speed jet flow show image extraction speeds of over 500 Hz (~80X over serial code). Application to hardware-in-the-loop (HWIL) and scene simulation can provide accurate target inputs with a much smaller database footprint.

Level: Beginner technical
Type: Poster
Tags: Aerospace & Defense; Video & Image Processing

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6188 - Improving Spinal Cord Segmentation and Cross-Sectional Area Calculation

Joo-won Kim Postdoctoral Fellow, Icahn School of Medicine at Mount Sinai
Joo-won is a postdoctoral fellow at the Translational and Molecular Imaging Institute, Department of Radiology, Icahn School of Medicine at Mount Sinai, New York. He received his Ph.D. in applied mathematics and statistics from Stony Brook University.

We propose two medical image analysis improvements to robustly quantify the cross-sectional areas of the human spinal cord from high-resolution morphological spinal cord magnetic resonance (MR) images. First, we estimate a spinal cord inner line to improve the segmentation of the spinal cord; second, we introduce a robust method to approximate the tangent vector to the cross section of the spinal cord. Both methods were implemented in CUDA, which greatly improved the computation speed compared to the CPU implementation.

Level: Beginner technical
Type: Poster
Tags: Medical Imaging

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6193 - Using the GPU to Predict Drift in the Ocean

Martin Lilleeng Sætra Scientist, Norwegian Meteorological Institute
Martin Lilleeng Saetra is a scientist with the Norwegian Meteorological Institute. His research interests include numerical simulation, accelerated scientific computing, and scientific visualization. He also holds a part-time associate professor II position at Westerdals Oslo ACT, where he teaches a course on graphics programming.

We describe the implementation of a simple numerical scheme for solving the shallow water equations on a GPU, which will be used in the further development of a massive ensemble prediction system running on GPUs. The numerical scheme has previously been used in operational forecasting, and benchmarks comparing the FORTRAN CPU version with the new GPU version have been performed. The results show that the GPU implementation gives a speedup over the CPU of slightly more than 200X. This is highly promising for running a large number of ensemble members cost-effectively on a single computer, thereby increasing the usefulness of short-term ocean current forecasts and drift trajectory predictions.
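
The GPU-friendliness of such schemes comes from their stencil structure: every cell is updated independently from its neighbors' previous values. As a deliberately simple illustration (a 1D Lax-Friedrichs step under our assumptions, not the operational scheme benchmarked in the poster):

// One Lax-Friedrichs step for the 1D shallow water equations,
// with h = water depth and hu = momentum; dt_dx = dt/dx.
__global__ void sw_step(const float *h, const float *hu,
                        float *h_new, float *hu_new,
                        int n, float dt_dx, float g)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1) {
        // physical fluxes F(q) = (hu, hu^2/h + g h^2/2) at both neighbors
        float fL0 = hu[i-1], fL1 = hu[i-1]*hu[i-1]/h[i-1] + 0.5f*g*h[i-1]*h[i-1];
        float fR0 = hu[i+1], fR1 = hu[i+1]*hu[i+1]/h[i+1] + 0.5f*g*h[i+1]*h[i+1];
        h_new[i]  = 0.5f*(h[i-1]  + h[i+1])  - 0.5f*dt_dx*(fR0 - fL0);
        hu_new[i] = 0.5f*(hu[i-1] + hu[i+1]) - 0.5f*dt_dx*(fR1 - fL1);
    }
}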

Level: Advanced technical
Type: Poster
Tags: Earth System Modelling; Computational Physics

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6194 - CUDA Accelerated Cross Validated Best Subset Selection with XLSTAT

Arnaud Belletoile Statistical Programmer, Addinsoft
Arnaud Belletoile is a senior statistical programmer at Addinsoft. He has developed data analysis tools for the XLSTAT software with a talented and growing team. He is focused on the econometrics and financial toolbox. Arnaud's interests include parallelized computations on GPUs.

We present our implementation of cross-validated best subset selection for linear regression. This algorithm is the latest GPU-enabled feature made available in our statistical solution, XLSTAT. It is based on the binary tree regressions first proposed by Furnival & Wilson and is implemented through a QR factorization and subsequent updates of the R matrix using the cuSolver library. The last step of our model selection is a leave-one-out cross-validation test.

Level: Beginner technical
Type: Poster
Tags: Algorithms

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6196 - Parallel Homotopy Method for the Symmetric Eigenvalue Problem

Peter Reheis Research Assistant, Trinity College
Peter Reheis has a double major in computer science and mathematics from Trinity College.
Peter Yoon Associate Professor of Computer Science , Trinity College
Peter Yoon is an associate professor in the Computer Science Department at Trinity College. He received his Ph.D. in computer science from Pennsylvania State University in 1996, where he developed numerical algorithms for guidance and control systems for underwater signal processing at the Applied Research Laboratory. Prior to joining the faculty of Trinity College in 2000, he taught at Azusa Pacific University for five years. His teaching interests include computer systems and high performance computing. His current research efforts are focused on the development, analysis, and high-performance implementation of novel algorithms for large-scale matrix computations and their applications in image processing using general-purpose GPUs.

The homotopy method can be applied to solve eigenvalue-eigenvector problems for symmetric tridiagonal matrices. Often, only a few eigenvalues of a matrix are of interest. Because the homotopy method possesses the order-preserving property, the algorithm can compute any specific eigenvalue without computing any of the others. This also makes the homotopy method highly parallel. By using CUDA and the cuBLAS and MAGMA libraries, we achieved up to a 27X speedup in computation time. Numerical results show our method is highly efficient, especially for graded matrices.
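
For readers unfamiliar with the approach, the homotopy commonly used for this problem continuously deforms an easy matrix into the target one (standard formulation in our notation, stated here for context):

$$H(t) = (1 - t)\,D + t\,A, \qquad t \in [0, 1],$$

where $A$ is the symmetric tridiagonal matrix of interest and $D$ is a matrix with known spectrum, such as the diagonal of $A$. Each eigenpair of $D$ is traced along $t$ to an eigenpair of $A$; the order-preserving property guarantees that the $k$-th smallest eigenvalue of $H(t)$ flows to the $k$-th smallest eigenvalue of $A$, so each eigenpath can be followed independently and in parallel.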

Level: Intermediate technical
Type: Poster
Tags: Performance Optimization; Algorithms

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6197 - N-Body Simulation of Binary Star Mass Transfer Using NVIDIA GPUs

Taylor Hutyra Student Research Lead, Tarleton State University
Taylor Hutyra is a senior physics major at Tarleton State University who hopes to continue his education with a Ph.D. in computational physics. In the past, he was the lead structure engineer for the Tarleton Aeronautical Team, where he made contributions to software design and electronics. The team placed third overall in NASA's 2012-2013 USLI national competition. He has worked on research in CUDA optimization and visualization of fractal generation and detecting exoplanets with Tarleton's 32-inch telescope. Taylor has been programming with CUDA since 2013 and is currently the lead on the Binary Star Mass Transfer project.
Baylor Fain Student Research Lead, Tarleton State University
Baylor Fain is a B.S. student in physics and math at Tarleton State University. He hopes to continue his education with a Ph.D. in mathematics and would like to study dynamical systems to combine his love of math, physics, and computer science. Baylor has been a physics lab assistant at Tarleton for two years and has tutored physics and math for three years. He has been applying CUDA and parallel processing to N-body problems. He is currently the lead programmer on the Traveling Salesman project as well as a physics and CUDA consultant for the Binary Star Mass Transfer project.
Edward Smith Student Research Lead, Tarleton State University
Edward Smith is a B.S. student in computer science and mathematics at Tarleton State University. He is the lead programmer implementing CUDA for the Binary Star Mass Transfer project and works in the HPC Lab. In the past, he has tutored calculus and has worked as a programmer on other research projects for the university, presenting them at multiple research symposiums. In the future, he hopes to become a software engineer.

Over 70% of the stars in our galaxy are binary systems. Because of their interaction, the masses of these stars can be found using Newton's and Kepler's laws. This allows astronomers to use these systems to study properties and processes of stars and galaxies. Among the many types of binary stars observed, contact systems are the most interesting because they exhibit mass transfer, changing the behavior of both stars. However, due to the lack of precise observational data and the large time scale of this process, the mass transfer is only partially understood. In this work, a model was made to give astronomers a method for gaining deeper knowledge and visual intuition of how the mass transfer between binary stars takes place.
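
The computational core of such a model is the direct-sum gravitational interaction, which parallelizes naturally: one thread accumulates the acceleration on one particle from all others. A minimal sketch (illustrative only, not the project's kernel):

// Direct-sum gravitational acceleration with Plummer softening eps2,
// which avoids the singularity at close approach. pos[j].w holds mass.
__global__ void nbody_accel(const float4 *pos, float3 *acc,
                            int n, float eps2, float G)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float3 a = make_float3(0.0f, 0.0f, 0.0f);
    for (int j = 0; j < n; ++j) {
        float dx = pos[j].x - pos[i].x;
        float dy = pos[j].y - pos[i].y;
        float dz = pos[j].z - pos[i].z;
        float r2 = dx*dx + dy*dy + dz*dz + eps2;
        float s = G * pos[j].w * rsqrtf(r2 * r2 * r2);  // G*m_j / r^3
        a.x += s * dx; a.y += s * dy; a.z += s * dz;
    }
    acc[i] = a;
}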

Level: Intermediate technical
Type: Poster
Tags: Astronomy & Astrophysics; Computational Physics

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6198 - A Hybrid Task Graph Scheduler for High Performance Image Processing Workflows

Timothy Blattner Computer Science Trainee, NIST
Timothy Blattner is a Ph.D. candidate at the University of Maryland, Baltimore County, with Dr. Milton Halem and Dr. Walid Keyrouz as his advisors. His topic, hybrid task graph scheduling, aims at improving programming productivity when developing software to scale algorithms to systems with multiple GPUs. In 2009, he received his B.S. in computer science from Marquette University. In 2013, he received his M.S. in computer science from the University of Maryland, Baltimore County, on "A Hybrid CPU/GPU Pipeline Workflow System."

Designing scalable applications is key to improving performance in hybrid computing. Scheduling code to utilize parallelism is difficult, particularly when dealing with data dependencies, memory management, data motion, and processor occupancy. The Hybrid Task Graph Scheduler (HTGS) increases programmer productivity when implementing hybrid workflows that scale to multi-core and multi-GPU systems. HTGS manages dependencies between tasks, represents CPU and GPU memories independently, overlaps computations with disk I/O and memory transfers, uses multiple GPUs, and uses all available compute resources. We present an implementation of hybrid microscopy image stitching using HTGS that reduces code size by ~25% and shows favorable performance compared to an implementation without HTGS.

Level: Advanced technical
Type: Poster
Tags: Video & Image Processing; Supercomputing & HPC

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6200 - Angular Momentum of Late Lunar Forming Impacts Using NVIDIA GPUs

Jonathan Petz System Admin, HPC Lab, Tarleton State University
Jonathan is a senior computer science major at Tarleton State University. Upon graduation, he plans to form a game development company. He is currently working as system administrator for the university's High Performance Computing Lab and is the lead programmer for the Late Lunar Forming Impacts group. As lead programmer, Jonathan created an application using SDL2 and OpenGL to view and record the group's N-body simulations. Jonathan has worked on several game development projects, including leading a group that developed a video game that took second place at the New Jersey TSA State Conference.
Ty Turner Student Researcher, Tarleton State University
Ty Turner is a junior computer science student at Tarleton State University, where he is doing research in HPC on late lunar forming impacts, which includes developing better and faster parallel code, and continuously running simulations to produce a proper Earth-Moon system from a collision. He is planning on becoming a game developer once he graduates.
William Sumpter Student Researcher, Tarleton State University
William Sumpter is a senior computer science major at Tarleton State University, where he conducts research in the Tarleton High Performance Computing Lab modeling late lunar forming impacts using NVIDIA GPUs. He programs programmable logic controllers at Schreiber Foods. His interests include robotics and programming interface applications.

Our Moon is no ordinary satellite! It is too large to be a captured asteroid. Could it be a twin planet formed alongside Earth as our solar system was being created? Or perhaps a captured rocky planet, forced to light our night and give lovers inspiration? Though this is romantic, the true answer is thought to be much more violent. We believe the Moon was born from a violent encounter between two young proto-planets. This giant impact hypothesis (GIH) is the main theory for the formation of our Moon, but it has been questioned recently because simulations of the GIH leave the Earth-Moon system with excess angular momentum. In this work, we show how to remove the excess angular momentum from giant impact simulations while preserving all the desired results from previous giant impact studies.

Level: Intermediate technical
Type: Poster
Tags: Astronomy & Astrophysics; Computational Physics

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6203 - Accelerated Transport System Simulation Using CUDA

Peter Heywood PhD Student, The University of Sheffield
Peter Heywood is a Ph.D. student at the University of Sheffield's Department of Computer Science (a NVIDIA GPU Research Center), working on GPU-accelerated micro-simulation of transport systems and is named as lead researcher on an ongoing collaboration funded by Highways UK. Under the supervision of Dr. Paul Richmond and working closely with the Transport Systems Catapult (the UK's innovation centre for intelligent mobility), Peter is developing techniques to extend FLAME GPU (an open-source, agent-based modeling environment) specific to transport systems simulation, which will enable simulations of significantly greater scale and complexity than currently possible.

Discover how GPUs are being used to accelerate predictive simulations used in transport system planning and management to alleviate the global increase in transport demand. We highlight the role of predictive, high-performance micro-simulations in transport system management and provide insight into the development process and benchmark performance of agent-based transport models developed using FLAME GPU, including the creation of a populated virtual reality environment using an omnidirectional treadmill.

Level: Beginner technical
Type: Poster
Tags: Virtual Reality & Augmented Reality

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6207 - Astro: A Low Cost Low Power Computing Cluster

Gavin Baker Student, California Polytechnic State University, San Luis Obispo
Gavin Baker is an M.S. student in computer science (B.S. in computer engineering) at California Polytechnic State University, San Luis Obispo, studying scalable HPC software for utilizing multiple coprocessors in future exascale systems.
Sean Sheen Student, California Polytechnic State University, San Luis Obispo
Sean Sheen is an M.S. student in computer science (B.S. in computer engineering) at California Polytechnic State University, San Luis Obispo.
Chris Lupo Associate Professor, California Polytechnic State University, San Luis Obispo
Chris Lupo is a professor at California Polytechnic State University, San Luis Obispo, teaching applied parallel computing and computer architecture.

On the path to exascale computing, the trend toward more expensive, power-hungry high performance computing (HPC) clusters presents a number of challenges to academic institutions with limited funding that wish to contribute to the study of scalable architectures and software resilience. Hybrid CPU-GPU systems have the potential to reduce the cost and power consumption of next-generation computing clusters. This poster demonstrates the efficacy of such systems within commodity computing clusters through our design, implementation, and benchmarking of a hybrid CPU-GPU cluster based on the NVIDIA Jetson TK1 board. Our new computing system has a theoretical peak floating-point performance of ~24 TFLOPS with a power consumption of 1 kW under full load, and was built at a cost of $13,000.

Level: Intermediate technical
Type: Poster
Tags: Supercomputing & HPC; Performance Optimization

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6209 - Pore-Network Simulation of Fluid Flow and Transport in Porous Media on GPUs

Hassan Dashtian Ph.D. Candidate, Research Assistant, University of Southern California
Hassan Dashtian is a Ph.D. student in chemical engineering at University of Southern California. He holds a B.S. and M.S. in petroleum engineering with emphasis on drilling and production, from Petroleum University of Technology and Sharif University of Technology, respectively. He is working on nano- and pore-scale simulation of salt precipitation in porous media using high-performance computation. Hassan's other research interests include analysis of well log and seismic data, fluid flow, and transport in porous media.

Networks of interconnected resistors, springs and beams, or pores are standard models for studying scalar and vector transport processes in heterogeneous materials and media, such as fluid flow in porous media, and conduction, deformation, and electric and dielectric breakdown in heterogeneous solids. We developed an algorithm for using the computational power of GPUs to speed up the calculations with pore and resistor networks; the same algorithm can be used with networks of springs or beams. A mixed-precision algorithm, together with the conjugate-gradient method, has been implemented in a single-GPU solver. We achieve a speedup factor of 60X and are able to simulate very large networks with several million sites.
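
The mixed-precision idea can be illustrated simply: store the network's conductance matrix in single precision to halve memory traffic, while accumulating the matrix-vector products that drive the conjugate-gradient iteration in double precision (a sketch under our assumptions, not the authors' solver):

// CSR matrix-vector product with single-precision matrix values and
// double-precision accumulation, the hot loop of a mixed-precision CG.
__global__ void spmv_mixed(int n, const int *rowPtr, const int *colIdx,
                           const float *val, const double *x, double *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        double sum = 0.0;
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            sum += (double)val[j] * x[colIdx[j]];
        y[row] = sum;
    }
}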

Level: Beginner technical
Type: Poster
Tags: Computational Fluid Dynamics; Energy Exploration

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6210 - enerGyPU for Monitoring Performance and Power Consumption on Multi-GPUs

John Anderson Garcia Henao M.S. Student in Systems Engineering and Informatics, Research Assistant at the High Performance and Scientific Computing Unit, UIS
John Anderson Garcia Henao develops solutions using mathematical optimization techniques applied to machine learning and big data. As an M.S. student, John specialized in numerical algorithms, parallel computing, the use of heterogeneous computer architectures, heterogeneous parallel programming, power management, and performance analysis techniques for experimental design, measurement, simulation, and modeling. He wants to plan and manage the development of information technology in the areas of process and communications optimization and automation, and information management, while ensuring data integrity and a sound technology infrastructure.

enerGyPU is a tool for analyzing multiple tests under different parameter combinations to observe the key factors that determine energy efficiency, in terms of "energy per computation," on multi-GPU clusters. enerGyPU is a monitor that centralizes and automates data capture at runtime, executing in parallel with the scientific application, and displays the results in terms of energy efficiency through sequence plots, statistical tables, and bar graphs.
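
For context, per-GPU power readings of the sort a monitor like enerGyPU aggregates can be sampled through NVIDIA's NVML library. A minimal polling sketch (our code, not enerGyPU's source; error handling omitted; link with -lnvidia-ml):

#include <stdio.h>
#include <nvml.h>

int main(void)
{
    unsigned int count, mw;
    nvmlInit();
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetPowerUsage(dev, &mw);          // instantaneous board power, mW
        printf("GPU %u: %.1f W\n", i, mw / 1000.0);
    }
    nvmlShutdown();
    return 0;
}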

Level: Beginner technical
Type: Poster
Tags: Supercomputing & HPC

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6213 - Interactive Boussinesq-Type Simulation and Visualization of Water Wave Propagation on GPU

Sasan Tavakkol Graduate Research Assistant, University of Southern California
Sasan Tavakkol is a Ph.D. student in the Sonny Astani Department of Civil and Environmental Engineering at USC. He received his B.S. (2010) and M.S. (2013) with honors from Tehran Polytechnic and joined USC in 2013 after receiving the Viterbi Dean's Doctoral Fellowship. He is also a research assistant, partly at the Coastal Engineering Lab of the Civil Engineering Department and partly at the Integrated Media Systems Center of the Computer Science Department. Sasan has published several peer-reviewed journal and conference papers, mostly on fluid simulation.

We developed coastal wave simulation and visualization software based on the Boussinesq equations. Both the simulation and its concurrent visualization are performed on the GPU using the DirectX API. The software provides faster-than-real-time, interactive modeling for the first time in coastal engineering. A model running faster than real time can be handy in disaster forecasting, naval navigation, and any time-sensitive project. Interactivity gives scientists and engineers the option to test different scenarios and see the results on the go.

Level: Intermediate technical
Type: Poster
Tags: Computational Fluid Dynamics; Real-Time Graphics ; Virtual Reality & Augmented Reality

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6218 - Accelerating Protein Sequences and Classification Using GPU-HMMER Search

Mahesh Khadtare Research Student, Pune University
Mahesh Khadtare is a research student highly motivated toward biomedical signal and image processing applications. He has 12 years of work experience, including four years as a scientist at CRL, four years as a team lead at Trinity Convergence, one year as a technology consultant at HP, two years as a team lead at GE Healthcare, and one year as a lead software engineer at Verizon Wireless. Mahesh's work focuses on research in the area of signal processing. He has published papers on various GPU applications at GTC 2010 and GTC 2012.
Pragati Dharmale Graduate Student, SNHU NH
Pragati is a graduate student interested in GPU programming.

This poster presents the results of parallelizing HMMER, a widely used tool for protein sequence homology detection, functional annotation of homologous protein sequences, and protein family classification. The HMMER program is based on a Viterbi algorithm coded in C and is quite time-consuming. We restructure the Viterbi algorithm to port it to the GPGPU.
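
The parallelization idea can be sketched as follows: within one column of the Viterbi dynamic-programming table, every state's score depends only on the previous column, so states can be assigned one per thread. This is a textbook formulation for illustration, not HMMER's Plan7-specific kernel:

// Compute one Viterbi column in log space: each thread scans all
// predecessors of its state and keeps the best-scoring transition.
__global__ void viterbi_column(const float *prev,     // scores at position t-1
                               const float *logTrans, // nStates x nStates
                               const float *logEmit,  // emissions for symbol at t
                               float *curr, int nStates)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s < nStates) {
        float best = -INFINITY;
        for (int p = 0; p < nStates; ++p) {
            float v = prev[p] + logTrans[p * nStates + s];
            if (v > best) best = v;
        }
        curr[s] = best + logEmit[s];
    }
}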

Level: Intermediate technical
Type: Poster
Tags: Computational Biology; Astronomy & Astrophysics

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6221 - GPU Effectiveness for DART

Ye Feng Student, University of Wyoming
Ye Feng is a Ph.D. student in electrical and computer engineering at the University of Wyoming. Primarily, she works on HPC in 3D computer vision and computational science. Ye received an M.S. in adaptive control theory and B.S. in electrical and computer engineering from the University of Wyoming. She joined SIParCS (Summer Internship in Parallel Computational Science) at NCAR (National Center for Atmospheric Research) for summer 2014 and 2015, where she worked on "Evaluating Coprocessor Effectiveness for the Data Assimilation Research Testbed."

Data Assimilation Research Testbed (DART) is a framework that makes it easy for modelers, observational scientists, and geophysicists to explore a variety of data assimilation methods and confront different numerical models with observations. Approximately 25% of the Yellowstone supercomputer resources run DART applications. The scalability and efficiency of the DART system is of paramount concern in cases where forecasting and weather modeling is required. Over the past few years, general-purpose graphics processing units (GPGPUs) have emerged as an inexpensive way to accelerate scientific applications. In these projects, we focused on implementing new parallel versions of the targeted functions with CUDA Fortran on NVIDIA GPUs (K20x), which are available on the Yellowstone supercomputer.

Level: Intermediate technical
Type: Poster
Tags: Supercomputing & HPC

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6224 - GPU Implementation of Protein Morphing Algorithm

Chengbin Hu Graduate Research Assistant, University of South Florida
Chengbin Hu is pursuing his Ph.D. in biomedical science at the University of South Florida. With a broad background in pharmacology, bioinformatics analysis, statistics, and computer programming, he focuses on designing algorithms for in silico drug design. Chengbin's research interest focuses on protein simulation and how to use new technology to accelerate the calculation process.

Computational modeling of ligand-protein binding has become a popular initial approach to designing new drugs. Most current structure-based drug design focuses on the docking of static crystal protein structures. Under natural conditions, however, proteins are dynamic, adopting different conformations to perform vital functions. The purpose of a protein morphing algorithm is to estimate these dynamic protein conformations computationally. In this research, we implement the morphing algorithm by linear interpolation and use massively parallel programming to calculate and adjust the intermediate protein poses. The algorithm generates intermediate morphing poses at very high speed.
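
The interpolation core is simple enough to state directly: every atomic coordinate is blended independently, so the GPU can assign one thread per coordinate. A minimal sketch (our naming; the subsequent pose-adjustment step the authors describe is not shown):

// Linear interpolation between a start and end conformation, flattened
// to nCoords floats (x,y,z per atom); t in [0,1] selects the pose.
__global__ void morph_lerp(const float *start, const float *end,
                           float *pose, int nCoords, float t)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nCoords)
        pose[i] = (1.0f - t) * start[i] + t * end[i];
}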

Level: Beginner technical
Type: Poster
Tags: Computational Biology

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6226 - Client-Side GPGPU Web Application for Catchment Delineation and Watershed Segmentation

Muhammed Yusuf Sermet Research Assistant, IIHR-Hydroscience & Engineering, University of Iowa
Muhammed Yusuf Sermet is a Ph.D. student in the Department of Electrical and Computer Engineering at the University of Iowa and a research assistant at the IIHR--Hydroscience & Engineering and the Iowa Flood Center. He is originally from Turkey, where he obtained his B.S. in computer engineering at Ege University. His research focuses broadly on citizen science, virtual stream sensors, augmented reality, and mapping applications. He is currently working on several projects, including developing a web-based knowledge engine utilizing the comprehensive, information-centric flood ontology that he created, a real-time social media analysis system for flood monitoring and prediction also utilizing flood ontology, and a client-side GPGPU algorithm to analyze high-resolution terrain data for watershed delineation, all to be implemented in the Iowa Flood Information System.

The generation of huge amounts of spatial data has increased demand for applications capable of handling large-scale, high-resolution terrain data. A novel example is the Iowa Flood Information System, a web-based, one-stop platform for accessing flood-related data. One of the most challenging tasks in terrain analysis is the delineation of watersheds. Although traditional methods for watershed analysis give high-accuracy results, they become more burdensome as data resolution increases, and there has been no client-side analysis tool for watershed delineation. In this project, we developed a client-side GPGPU algorithm to analyze high-resolution terrain data for watershed delineation, which allows parallelization using GPUs.

Level: Intermediate technical
Type: Poster
Tags: Earth System Modelling; Big Data Analytics

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6227 - Fourier Domain Pulsar Acceleration Searches on GPUs for the Square Kilometre Array

Sofia Dimoudi Research Associate, University of Oxford
Sofia Dimoudi works as a research associate at the Oxford e-Research Centre looking at hardware acceleration of real-time pulsar signal processing algorithms for next-generation radio telescopes. Sofia studied for her Ph.D. at Durham University, where she worked on GPU acceleration of atmospheric tomography computational algorithms for real-time control on adaptive optics systems for extremely large telescopes.

We describe work done at the Oxford e-Research Centre (OeRC) at Oxford University toward accelerating one of the most demanding computational tasks of the real-time pulsar signal processing pipeline of the world's largest next-generation radio telescope, the Square Kilometre Array (SKA). We introduce the problem of pulsar acceleration searches and a Fourier domain computational method for detecting signals from accelerated pulsars. A GPU implementation and optimization results are presented in the context of the SKA timing requirements. This work is part of Astro-Accelerate, a real-time time-domain data processing library currently under development at the OeRC.

Level: Intermediate technical
Type: Poster
Tags: Algorithms; Astronomy & Astrophysics

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6230 - A Highly Parallel Implementation of the Faddeev-Leverrier Algorithm

Rahul Chandrashekhar Student, Trinity College, Hartford - CT
Rahul Chandrashekhar is a junior at Trinity College, pursuing a B.S. in computer science and mathematics. He comes from India and is a student researcher at the college working on high performance computing. Rahul's academic interests include competitive programming, optimization algorithms, graph theory, and robotics. His other interests are graphic design and app development.
Peter Yoon Associate Professor of Computer Science, Trinity College, Hartford - CT
Peter Yoon is an associate professor at Trinity College. He received his Ph.D. in computer science from Pennsylvania State University in 1995, where he developed numerical algorithms for guidance and control systems for underwater signal processing at the Applied Research Laboratory. Peter then taught at Azusa Pacific University before joining Trinity College. He is interested in developing visualization techniques for time-varying data for signal and image processing.

We present an accelerated implementation of the Faddeev-Leverrier algorithm (FLA) to solve the eigenvalue problem. The algorithm, being recursive in nature, cannot be directly parallelized. Instead, a hybrid model is implemented to harness the combined computing power of the CPU and GPU more effectively.
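
For reference, the recursion in question is the classical Faddeev-Leverrier iteration for the coefficients of the characteristic polynomial (standard formulation, stated here for context):

$$B_0 = I; \qquad c_k = -\frac{1}{k}\,\operatorname{tr}(A B_{k-1}), \quad B_k = A B_{k-1} + c_k I, \qquad k = 1, \dots, n,$$

with $p(\lambda) = \lambda^n + c_1 \lambda^{n-1} + \cdots + c_n$. Each step depends on the previous $B_{k-1}$, which is why the recursion itself cannot be parallelized across steps; however, the dominant cost per step, the dense product $A B_{k-1}$, maps naturally onto the GPU (e.g., via cuBLAS), while the scalar trace update remains on the CPU. This dependency structure is what motivates the hybrid model.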

Level: Intermediate technical
Type: Poster
Tags: Algorithms; Performance Optimization

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6233 - Data Reduction for Cherenkov Gamma-Ray Astronomy on Jetson TK1

Alberto Madonna Software Engineer, Italian National Institute for Astrophysics (INAF)
Alberto Madonna is a software engineer for the Italian National Institute for Astrophysics, developing data analysis software for parallel and low-power architectures, in collaboration with the GPU Research Center at the University of Padova. Alberto earned his B.S. and M.S. in aerospace engineering at the University of Padova (Italy), both with theses regarding scientific software development on GPUs. A multiple-time GTC presenter, he has been programming GPUs with CUDA since 2009.

A mini-array of ASTRI SST-2M Cherenkov telescopes will soon be deployed at a remote site, far from human activity, to achieve optimal observation conditions for gamma-ray astronomy. In such a scenario, the capability of each telescope to process its own data before sending it to a central acquisition system provides a key advantage. We implemented the complete analysis chain required by a single telescope on a Jetson TK1 development board, exceeding the required real-time processing speed by more than a factor of two while staying within a very small power budget.

Level: Intermediate technical
Type: Poster
Tags: Astronomy & Astrophysics; Embedded

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6234 - Fast Parallel GPU Implementation for Clinical Helical CT Using Branchless DD

Ayan Mitra Graduate Research Assistant, Washington University in St. Louis
Ayan Mitra is a third-year Ph.D. student in the Department of Electrical and Systems Engineering at Washington University in St. Louis. He received his B.S. in electrical engineering from Jadavpur University, India, in 2013. His research interests include algorithm development for iterative image reconstruction, parallelization, and the implementation of reconstruction algorithms on GPUs.

We present a multi-GPU-based approach for iterative image reconstruction using branchless distance-driven (DD) projection and back-projection methods. Preliminary results showed that this implementation allowed an approximately 5x decrease in reconstruction time for back projection and 2x for forward projection, using three NVIDIA GeForce TITAN X GPUs in parallel, compared to an OpenMP CPU implementation using 16 threads. We expect better computational efficiency with a higher number of GPUs, a hybrid CPU-GPU method, and ordered subsets, which opens the door to using an iterative reconstruction algorithm in real time in clinical settings.

Level: Intermediate technical
Type: Poster
Tags: Medical Imaging; Algorithms

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6235 - Towards 3D Image Up-Scaling on the GPU

Srinivasan Ramesh Research Assistant, Indian Institute of Science
Srinivasan Ramesh is a research assistant at the Supercomputer Education and Research Center at the Indian Institute of Science, Bangalore. He holds a B.S. in computer science from BITS, Pilani. He is working on optimizing a climate modeling application (CESM) on an Intel Xeon Phi cluster. Srinivasan enjoys parallel programming and hopes to contribute to research that would make parallel programming easier for everyone.

Affordable medical imaging solutions are an important part of making healthcare accessible to everyone, especially in developing nations. High-resolution CT scanners are expensive instruments, and obtaining large high-resolution images may not always be possible. We explore 3D image up-scaling as a possible software solution to this problem, by extending the ICBI 2D image up-scaling algorithm to operate in 3D. Based on the promising results obtained by parallelizing the 2D ICBI image algorithm on the GPU, we parallelize the 3D ICBI algorithm on the GPU and demonstrate the performance benefits of doing so.

Level: Advanced technical
Type: Poster
Tags: Medical Imaging; Video & Image Processing

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6236 - Locality-Aware Memory Association for Pipelining and Multi-Device Worksharing

Thomas Scogland Computer Scientist, Lawrence Livermore National Laboratory
Thomas Scogland is a computer scientist in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. His research focuses on technologies for addressing the ever-growing complexity of high performance computing systems and nodes, including job scheduling, task scheduling, and programming models. He also explores related topics such as green computing, heterogeneous computing, and scalable resource management.

Advances in directive-based programming models have made GPU programming more accessible than ever. Even so, models like OpenMP 4.0 and OpenACC lack worksharing and memory management facilities for multi-GPU environments. We present a memory-association interface for directive-based models that enables multi-device worksharing, automated pipelining for greater support of out-of-core workloads, and NUMA management, all as a single extension. Our implementation, AffinityTSAR, scales well to multiple GPUs, to GPUs and CPUs together, and even shows improvement in CPU-only performance.

Level: Intermediate technical
Type: Poster
Tags: Programming Languages

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6239 - Dense Reconstruction with GPUs

Jason Mak Ph.D. Student, University of California, Davis
Jason Mak is a Ph.D. student at the University of California, Davis. His interests include parallel computing problems, especially those that relate to massively parallel architectures such as GPUs. He is currently working on ways to benefit computer vision applications with GPUs, particularly in the area of 3D reconstruction.

We show how to obtain a dense 3D reconstruction of a scene given an initial sparse 3D reconstruction and images of that scene. GPUs are used to make the method computationally viable. 3D reconstruction is an increasingly important computer vision problem with a variety of applications, and dense reconstructions are more desirable for the level of detail they provide. The dense reconstruction method featured in this talk is simple to implement and relies on geometric and image consistency constraints. It updates a traditional approach with more modern segmentation and feature descriptors to improve accuracy. Details of the method's implementation on GPUs are also explained. The brute-force computation provided by GPUs allows for more dense and accurate reconstructions.

Level: Intermediate technical
Type: Poster
Tags: Computer Vision & Machine Vision; Video & Image Processing

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6241 - GPU Based Fluid Structure Interaction

Christopher Minar Student, Oregon State University
Christopher Minar is an M.S. student in mechanical engineering. His interests include parallel processing and computational fluid dynamics.

Computational fluid dynamics requires solving large systems of equations, a massively parallel problem well suited to GPUs. We present work on a GPU-based computational fluid dynamics solver. The solver uses the immersed boundary method, which allows the Navier-Stokes equations to be solved on a structured, Cartesian grid. This avoids having to remesh or update overset meshes every time an immersed body moves.

Level: Beginner technical
Type: Poster
Tags: Computational Fluid Dynamics

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6242 - Serving Multiple Concurrent Interactive Sessions with a GPU-based Speech Recognition System

Alexei V. Ivanov CTO, Verbumware Inc.
Alexei Ivanov has broad experience in speech processing and recognition systems. He received his Ph.D. in the theoretical foundations of computer science from the Belarusian State University of Informatics and Radioelectronics in 2004. He holds an M.S. in electrical engineering from the Moscow Institute of Physics and Technology. He has experience both in academia (University of Trento, Moscow Institute of Physics and Technology) and industry (Pearson Knowledge Technologies, USA; Speech Technology Center, Russia; Lernout & Hauspie Speech Products NV, Belgium). His research interests include adaptive conversational machines; web-integration of individual multimedia experiences; speech characterization technology; and integration of para-linguistic knowledge into the process of speech recognition and interpretation.

We explore the possibility of using GPUs in a cloud implementation of an interactive automated speech recognition server. Among the specific challenges for this system, we see rapid model adaptation, context switching, and scheduling of small incoming data chunks. Ultimately, we observe significant advantages with our GPU speech recognizer implementation, as it commits the total available computational resources to solving a single task at any given moment, allowing higher computational throughput under moderate computational load. We measure the performance of our system against the open-source reference implementation of Kaldi, using models obtained from the large publicly available LibriSpeech collection.

Level: Intermediate technical
Type: Poster
Tags: Signal & Audio Processing; Algorithms

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6244 - Image-Based Sticker Recommendation Using Deep Learning

Jiwon Kim Senior Research Engineer, Naver Labs
Jiwon Kim has worked as a research engineer at Naver Labs for the past five years with an emphasis on computer vision, image processing, and deep learning. She holds an M.S. in computer science from University of Washington and a B.S. in computer science from Seoul National University.

The LINE mobile messenger sticker shop accepts new stickers designed by independent artists on a daily basis. To assist users in browsing the stickers, a list of similar stickers is recommended by collaborative filtering based on user purchase history. A major drawback of this approach, known as the cold start problem, is that new items cannot be recommended because they have no purchase records. To address this issue, we recommend stickers based on image content by learning the visual similarity between stickers using deep learning on GPUs. We trained a convolutional neural network to learn semantic features, which we use to recommend visually similar stickers. We measure the relevance of different recommendation schemes and verify the effectiveness of the proposed approach.

Level: Intermediate technical
Type: Poster
Tags: Deep Learning & Artificial Intelligence; Computer Vision & Machine Vision

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6245 - Neurophysiological Working Memory Task Classification from Magnetoencephalography Using Deep Learning

Zachary Harper Research Associate, Medical College of Wisconsin
As a biomedical engineering M.S. student with a background in psychology, Zachary Harper aims to bring cutting-edge computational methods into the realm of neurological disorder diagnosis and therapy. He is employed as a Department of Neurology research associate at the Medical College of Wisconsin, where he explores multimodal analysis using physiological signals, medical imaging, and behavioral data. Zachary's long-term interest is in developing methods for traumatic brain injury and post-traumatic stress disorder research. While his current studies involve building accelerated analysis pipelines, GPGPU parallelization has been a powerful catalyst in reaching towards real-time processing for neurofeedback applications.
Charles Welzig Associate Professor, Medical College of Wisconsin
Charles Welzig specializes in computational medicine, autonomic neuroscience, and cardiovascular physiology.

Biological neural networks are complex, plastic, and unique to every individual. These qualities pose great challenges in classifying highly dynamic patterns across neural activity for time-sensitive medical applications. In this project, we use deep learning to identify oscillatory activation markers that can differentiate between two working memory tasks. Training on multiple NVIDIA GeForce GTX TITAN GPUs enables us to overcome computational challenges for use in clinical and medical research applications. This poster presents our first step towards classifying deep-temporal whole-brain neural network activation patterns using GPU-accelerated deep learning systems with convolutional neural networks.

Level: Beginner technical
Type: Poster
Tags: Deep Learning & Artificial Intelligence; Medical Imaging

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6247 - CudaPAD - A Quick on-the-fly PTX/SASS Viewer

Ryan White Programmer, IT Consultant
Ryan White is an IT coordinator, living in Pleasanton, California. He earned his B.S. in computer science at California State University East Bay in 2012. Ryan has been writing lines of code since the age of seven and continues to enjoy programming and writing algorithms in his free time.

CudaPAD is Windows-based software that aids in optimizing and understanding NVIDIA CUDA kernels by displaying an on-the-fly view of the PTX/SASS that makes up the GPU kernel. CudaPAD simply shows the PTX/SASS output; however, it has several visual aids to help understand how minor code tweaks or compiler options affect the PTX/SASS. Just type or paste the kernel in the left panel, and the right panel will display the corresponding disassembly information. Visual aids like CUDA-to-PTX code-matching lines and a built-in WinDiff help identify PTX sections quickly. Other on-the-fly information is also given, like register counts, memory usage, and error information.
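
For readers who want to reproduce the raw inspection by hand, the stock CUDA toolchain emits the same artifacts CudaPAD displays (file names here are placeholders):

    nvcc -ptx kernel.cu -o kernel.ptx            # human-readable PTX
    nvcc -cubin -arch=sm_52 kernel.cu -o kernel.cubin
    cuobjdump -sass kernel.cubin                 # disassembled SASS

CudaPAD automates this cycle and diffs the output as the source changes.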

Level: Intermediate technical
Type: Poster
Tags: Tools & Libraries; Performance Optimization

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6248 - Performance Comparison of a CUDA® Interval Arithmetic Library with Standard Interval Library

Paluri Nataraj Professor, Indian Institute of Technology, Bombay, India
Paluri Nataraj's research interests are in high performance computing (GPU), robust stability and control, nonlinear system analysis and control, and reliable computing. Paluri obtained his Ph.D. from IIT, Madras, India, in process dynamics and control in 1987. He then worked in the CAD center at IIT Bombay for 18 months before joining the faculty at the Systems and Control Engineering Group at IIT Bombay in 1988.

We developed a CUDA-based interval arithmetic library for GPU users. This library is based on the ideas in the C-XSC interval library. We compare the performance of the developed CUDA interval library with that of the C-XSC interval library for different interval arithmetic operations. The CUDA interval library is found to be much faster than the standard C-XSC library. This CUDA interval arithmetic library allows advanced interval techniques, such as interval global optimization, to be performed in comparatively little time on GPUs.
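
As an illustration of the underlying technique (our sketch, not the library's actual API), CUDA's directed-rounding intrinsics let a kernel keep interval bounds conservative:

    // Interval [lo, hi] addition using directed rounding (illustrative).
    struct Interval { double lo, hi; };

    __device__ Interval ivadd(Interval a, Interval b) {
        Interval r;
        r.lo = __dadd_rd(a.lo, b.lo);  // round toward -infinity
        r.hi = __dadd_ru(a.hi, b.hi);  // round toward +infinity
        return r;
    }

    __global__ void add_intervals(const Interval *a, const Interval *b,
                                  Interval *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = ivadd(a[i], b[i]);
    }

Because each thread handles one independent interval, such operations map naturally onto thousands of GPU threads.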

Level: Beginner technical
Type: Poster
Tags: Tools & Libraries; Algorithms

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6252 - G-Storm: GPU-Aware Scheduling in Storm

Cheng-Hao Huang Student, National Tsing Hua University
Cheng-Hao Huang is a student at National Tsing Hua University. His research domain includes parallel computing on GPU, visualization, and development operations.

As we shift toward a data-driven economy, the ability to efficiently analyze huge amounts of data in less time is the key to success. Many systems for big data processing have been developed; Storm is one of them, targeting stream data processing.

Level: Advanced technical
Type: Poster
Tags: Video & Image Processing

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6253 - MRICloud: An MRI Data Platform with Cloud Storage and Cloud Computing Service

Jian-Hua Huang Master student, Chang Gung University
Jian-Hua Huang is pursuing an M.S. at the Department of Computer Science and Information Engineering in Chang Gung University, Taoyuan, Taiwan. His major research interests include WebGL, CUDA, and cloud computing in medical applications.

The importance of brain research using MRI is growing, especially in the fields of neurological disorders, mental illness, and cognitive neuroscience. The aim of this study is to implement the MRICloud platform for MRI data storage, management, and sharing; graph-theoretical analysis based on GPU-accelerated cloud computing; and MRI data visualization through a web interface. In our results, it takes 238.91 seconds to generate the brain structural connectivity matrix on an NVIDIA Tesla C2075 for 990 brain regions and 2.4 million fiber tracts from diffusion MRI, a 39x speedup in comparison with a single CPU (Core i7-4770). Moreover, our platform provides visualization of all results on any device with a WebGL-enabled browser.

Level: Beginner technical
Type: Poster
Tags: Medical Imaging

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6254 - Exploring Dynamic Parallelism for Irregular Applications on GPUs

Jin Wang Ph.D. Student, Georgia Institute of Technology
Jin Wang is a Ph.D. student whose research interests are in architecture and runtime techniques for heterogeneous computing. Her previous work includes optimizations for applications on integrated CPU-GPU systems and the development of an OpenCL runtime frontend within the GPU Ocelot dynamic compilation framework. More recently, she has been constructing a compilation and runtime system for executing JavaScript code on GPUs. Currently she is working on optimization of irregular applications for GPUs. Specifically, she is investigating new execution models for dynamic parallelism and their efficient compiler and microarchitecture support.

We explore techniques for implementing irregular applications with dynamic parallelism on GPUs. We observe that CUDA Dynamic Parallelism (CDP) shows potential benefits for irregular applications by handling the dynamically formed pockets of structured parallelism. These benefits include better control flow and memory behavior. However, the non-trivial kernel launch latency can overshadow the benefits provided by CDP. We therefore propose Dynamic Thread Block Launch (DTBL), a lightweight execution mechanism that supports launching thread blocks (TBs) from the GPU on demand with substantially lower overhead, yielding an average 21% speedup. Based on the DTBL model, we also propose a new locality-aware TB scheduler that increases memory system efficiency, generating another 27% speedup.
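
For context, this is the stock CDP pattern the poster starts from (a minimal sketch of ours, not DTBL itself, which requires microarchitecture support; compile with -rdc=true for compute capability 3.5+): a parent kernel launches child kernels only where dynamically discovered work exists.

    __global__ void child(int *data, int begin, int count) {
        int i = begin + blockIdx.x * blockDim.x + threadIdx.x;
        if (i < begin + count) data[i] *= 2;       // placeholder work
    }

    __global__ void parent(int *data, const int *segment_sizes, int nseg) {
        int s = blockIdx.x * blockDim.x + threadIdx.x;
        if (s < nseg) {
            int begin = s * 1024;                  // hypothetical layout
            int count = segment_sizes[s];
            if (count > 0)                          // launch only where work exists
                child<<<(count + 255) / 256, 256>>>(data, begin, count);
        }
    }

DTBL replaces each of these device-side kernel launches with a much cheaper thread-block launch into the already-running grid.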

Level: Intermediate technical
Type: Poster
Tags: Performance Optimization

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6256 - Fast Parallel Bulk Insertion in GPU MOLAP Databases

Steffen Wittmer Senior Software Developer (Jedox GPU), Jedox AG
Steffen Wittmer obtained his degree in computer science in 2010 and since then has been working on GPU solutions for Jedox.

This work focuses on input processing of big data streams in a GPU-accelerated in-memory OLAP (MOLAP) database by Jedox. We present a solution that supports fast insertion of high data volumes by avoiding the compute-expensive task of multidimensional sorting during the actual insertion phase. The main processing step achieves a significant speedup over the existing CPU-only version.

Level: Intermediate technical
Type: Poster
Tags: Algorithms; Big Data Analytics

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6257 - GPU Parallelization of a Distance Field Solver

Anup Shrestha Graduate Research Assistant, Boise State University
Anup Shrestha joined Boise State University as an M.S. student in August 2014. He obtained his B.S. in computer science and mathematics from Lewis-Clark State College in 2013, and worked as a database analyst at University of Idaho for a year. Originally from Nepal, Anup's research primarily focuses on scientific computing and parallel programming using GPUs.

Propagating interfaces occur in a wide variety of fields, including fluid mechanics and computer graphics. The distance field from an interface can be calculated by solving the Eikonal equation at each node using the Fast Sweeping Method (FSM) [Zhao, 2004]. However, parallelization of FSM is not straightforward. We propose a parallel algorithm using Cuthill-McKee ordering that is suitable for massively threaded architectures. Here, we implement and compare different parallel algorithms for FSM using CUDA, OpenACC, and MPI. The maximum performance is achieved using CUDA and the parallel algorithm of Detrixhe et al., whereas a comparable speedup is achieved using OpenACC with a few directives, substantially shortening the development cycle.
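
For reference, the distance field \(\phi\) solves the Eikonal equation

\[ |\nabla \phi(\mathbf{x})| = 1, \qquad \phi|_{\Gamma} = 0, \]

and the standard 2D Godunov upwind update applied at each node by FSM (grid spacing \(h\)) is

\[
\bar{\phi}_{ij} =
\begin{cases}
\min(a, b) + h, & |a - b| \ge h, \\[4pt]
\dfrac{a + b + \sqrt{2h^2 - (a - b)^2}}{2}, & |a - b| < h,
\end{cases}
\qquad \phi_{ij} \leftarrow \min(\phi_{ij}, \bar{\phi}_{ij}),
\]

where \(a = \min(\phi_{i-1,j}, \phi_{i+1,j})\) and \(b = \min(\phi_{i,j-1}, \phi_{i,j+1})\); FSM repeats this update over alternating sweep orderings, and the parallelization question is which nodes can be updated concurrently.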

Level: Intermediate technical
Type: Poster
Tags: Algorithms

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6258 - GPU Accelerated Computation of the Beta Function of the SU(3) Gauge Theory with Ten Fundamental Fermions

Ting-Wai Chiu Professor, National Taiwan University
Ting-Wai Chiu has been a professor at the Physics Department of National Taiwan University since 1985. His research areas include quantum field theory, lattice QCD, high-energy physics, and computational physics.

Recent experiments at the Large Hadron Collider at CERN have discovered the Higgs scalar at a mass of ~125 GeV. Even though the Higgs scalar is an elementary particle in the Standard Model, there is still a possibility that such a light scalar might arise as a composite particle in non-abelian gauge theories with many fermions, provided that the theory is not too far below the conformal window. This study focuses on the SU(3) gauge theory with 10 massless domain-wall fermions in the fundamental representation, using a GPU cluster at National Taiwan University, which is crucial for completing the dynamical simulations within a few months. Our result for the beta function suggests that this theory is conformal. In this poster, we present our algorithms and strategies for GPU-accelerated computations.

Level: Advanced technical
Type: Poster
Tags: Computational Physics

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6259 - Non-Uniform Diffusion of the Solar Surface Magnetic Field: Code Acceleration Using OpenACC for both GPUs and x86

Ronald Caplan Computational Scientist, Predictive Science Inc.
Ronald Caplan is a computational scientist whose main interests are developing and optimizing numerical methods for simulating physics-based models and their implementations in parallel high performance computing environments. His research focuses on the continued development and optimization of Predictive Science Inc.'s magnetohydrodynamic codes used to study the solar corona and heliosphere, as well as providing computational solutions for additional projects. He has an extensive background in GPU computing, including authoring the NLSEmagic code package, which integrates the nonlinear Schrodinger equation using CUDA codes called from MATLAB.

We show the results of implementing OpenACC into a non-uniform diffusion time integration Fortran code. The code's application is to smooth observation-based radial magnetic field maps of the solar surface for use as inner boundary conditions of global magnetohydrodynamic simulations of the corona and heliosphere. The code uses a RKL2 super-time-stepping algorithm to allow time-steps that far exceed the standard explicit stability limit. The algorithm remains explicit, making the code a prime target for OpenACC acceleration. The OpenACC implementation is discussed and speedup results are shown. The newly released OpenACC x86 feature in the PGI compiler is also tested and shown to produce multicore CPU code from the OpenACC directives that can outperform our OpenMP implementation.
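
For illustration, here is a minimal C sketch of the directive approach (our own example, not the poster's Fortran code; all names are placeholders): a single explicit diffusion update parallelized with one OpenACC directive.

    /* One explicit 1D diffusion step; the same source targets GPU
     * (pgcc -ta=tesla) or multicore x86 (pgcc -ta=multicore). */
    void diffuse_step(int n, double nu, double dt, double dx,
                      const double *restrict u, double *restrict unew)
    {
        #pragma acc parallel loop copyin(u[0:n]) copyout(unew[0:n])
        for (int i = 0; i < n; ++i)
            unew[i] = (i == 0 || i == n - 1) ? u[i]   /* boundaries */
                    : u[i] + nu * dt / (dx * dx)
                           * (u[i+1] - 2.0 * u[i] + u[i-1]);
    }

Because the RKL2 scheme keeps the integration explicit, each stage reduces to stencil loops of exactly this shape, which is what makes the code such a good OpenACC target.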

Level: Intermediate technical
Type: Poster
Tags: Astronomy & Astrophysics; Computational Physics

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6261 - cusFFT: A High-Performance Sparse Fast Fourier Transform Algorithm on GPUs

Cheng Wang Research Assistant, University of Houston
Cheng Wang is a final year Ph.D. candidate in the HPCTools group of the Department of Computer Science at University of Houston. His research interests include languages and compilers for high performance computing and heterogeneous embedded systems. His advisor is Dr. Barbara Chapman.

The Fast Fourier Transform (FFT) is one of the most important numerical tools, widely used in many scientific and engineering applications. The algorithm performs O(N log N) operations on N input data points to calculate only a small number k of large coefficients, while the remaining N-k are zero or negligibly small. The dense algorithm is clearly inefficient when N input points lead to only k << N non-zero coefficients in the transformed domain. The sparse FFT (sFFT) algorithm provides a solution to this problem. In this poster, we present a parallel sFFT algorithm on GPUs using CUDA. Our CUDA-based sFFT, named cusFFT, performs over 10x faster than the state-of-the-art cuFFT library on GPUs and over 28x faster than the parallel FFTW on multicore CPUs.
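
For context, the dense baseline the poster compares against can be invoked through cuFFT roughly as follows (a minimal sketch; cusFFT's own API is not shown here):

    #include <cufft.h>

    /* In-place forward C2C transform of n points already on the device. */
    void dense_fft(cufftComplex *d_data, int n) {
        cufftHandle plan;
        cufftPlan1d(&plan, n, CUFFT_C2C, 1);               /* one batch */
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD); /* in place  */
        cufftDestroy(plan);
    }

The sFFT approach avoids paying this full O(N log N) cost by binning and sub-sampling the signal so that only the k dominant coefficients are recovered.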

Level: Intermediate technical
Type: Poster
Tags: Algorithms

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6262 - NaNet: FPGA-Based NICs for GPU Accelerated Real-Time Systems

Alessandro Lonardo Research Engineer, Istituto Nazionale di Fisica Nucleare (INFN)
Alessandro Lonardo works on the design and development of the APEnet+ 3D-torus network and NaNet family of network interface cards dedicated to GPU-accelerated real-time computing systems. He received his M.S. in physics in 1997 from University "La Sapienza" in Rome, Italy. His thesis work involved the development of a DSL optimizing compiler for the SIMD APEmille supercomputer. He contributed to the design of the apeNEXT SPMD parallel computer, developed its multi-node functional simulator and ported the GCC compiler to the new architecture. He was one of the designers of the distributed network processor, an IP enabling 2D/3D internode communications in embedded multi-tile systems, and developed its TLM-SystemC model.

NaNet is a modular design for a family of FPGA-based PCIe network interface cards specialized for low-latency real-time operations. NaNet features a Network Interface module that implements RDMA-style communications with both host (CPU) and GPU accelerator memories (GPUDirect RDMA), relying on the services of a high-performance PCIe Gen3 x8 core. The NaNet I/O Interface is highly flexible and is designed for low and predictable communication latency: a dedicated stage manages the network stack protocol in the FPGA logic, offloading this task from the host operating system and thus eliminating the associated process jitter. Between these two modules stand the data processing modules, which implement application-dependent processing on data streams.

Level: Intermediate technical
Type: Poster
Tags: Computational Physics

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6263 - EEE Event Reconstruction on GPUs

Orsolya Visnyei Student, Eotvos Lorand University
Orsolya Visnyei is completing her B.S. studies in Hungary at Eotvos Lorand University, writing her thesis on the evaluation of Sudoku algorithms. She works at the university's administration department handling registration of course information and student records. She is a laboratory fellow working at CERN.
Richard Forster Ph.D. Student, Eotvos Lorand University
Richard Forster is a Ph.D. student at CERN working on graph visualization and processing. He received his B.S. at Eotvos Lorand University in Hungary with a thesis on GPU acceleration of computing-intensive algorithms. Richard's M.S. thesis covered parallel algorithm scheduling on GPUs. During this time, he worked at GE Healthcare on GPU optimization of its medical imaging software and was involved in developing new utilization methods. He was involved in multiple projects at CERN while being a World Laboratory Fellow.

The EEE project is a detector array of Multigap Resistive Plate Chambers located at selected sites across Italy. Goals of the project include studying the properties of the local muon flux and its dependence on the planetary and solar environment, detecting high-energy extensive air showers created in the Earth's atmosphere through time and orientation correlations between several telescopes, and searching for possible long-range correlations between distant telescopes. We propose to involve GPUs in the data reconstruction phase to increase the available computing capacity as the volume of data rapidly increases.

Level: Intermediate technical
Type: Poster
Tags: Computational Physics; Astronomy & Astrophysics

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6264 - Radar Signal Processing on GPUs and Performance Comparison with Vector Processors

Peter Joseph Basil Morris Researcher, Defense Research & Development Organisation
Peter Joseph Basil Morris is a researcher with the Electronics & Radar Development Establishment (LRDE), Government of India. He has been actively involved in the design, development, and deployment of signal processing algorithms for a variety of airborne radar systems. He is working in the field of signal processing for the Synthetic Aperture Radar System being developed for the Government of India. His areas of interest include radar signal processing, radar systems design, adaptive radar signal processing, synthetic aperture radars, and advanced hardware architectures for radar signal processing.

We investigate the computing capabilities of GPUs for radar signal processing applications through the realization of a radar signal processor on a GPU, leveraging the inherent parallelization offered by the radar signal processing algorithms and the extensive computing capability of the GPU.

Level: Intermediate technical
Type: Poster
Tags: Algorithms; Signal & Audio Processing

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6265 - One Kernel To Rule Them All: Performance-Portable FMM for CPUs and GPUs

Ivo Kabadshow Scientist, Juelich Supercomputing Centre
Ivo Kabadshow is a scientist at Juelich Supercomputing Centre, Germany, and has worked on fast multipole methods for molecular dynamics since 2004. He started GPU-based parallelization in 2014.

We present a performance-portable C++ implementation of our core scientific algorithm, using a single code base that is easily executable on both CPUs and GPUs. For that purpose, we present the algorithm -- the fast multipole method -- embedded in a stack of abstraction layers, allowing us to achieve portability without maintaining separate kernels for each architecture. In addition, we'll review common implementation pitfalls that might help other developers aiming at a unified code base. In particular, we investigate memory allocation, memory access, and the abstraction of SIMT for complex user-defined data structures. Finally, we present results comparing the performance on a CPU and a GPU.
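
A hedged sketch of the single-source idea (our illustration, not the poster's framework): the compute body is written once as a functor and dispatched either to a plain CPU loop or to a CUDA kernel.

    struct ScaleAdd {                       // stand-in for an FMM operator
        float a; const float *x; float *y;  // host or device pointers
        __host__ __device__ void operator()(int i) const { y[i] += a * x[i]; }
    };

    template <typename Body>
    __global__ void for_each_gpu(Body body, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) body(i);
    }

    template <typename Body>
    void for_each_cpu(Body body, int n) {
        for (int i = 0; i < n; ++i) body(i);
    }

    // Usage: the same body runs on either architecture.
    //   for_each_cpu(ScaleAdd{2.f, h_x, h_y}, n);
    //   for_each_gpu<<<(n + 255) / 256, 256>>>(ScaleAdd{2.f, d_x, d_y}, n);

The real difficulty, as the poster notes, lies below this level: allocating memory, arranging access patterns, and abstracting SIMT for non-trivial data structures.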

Level: Intermediate technical
Type: Poster
Tags: Algorithms; Supercomputing & HPC

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6266 - Parallel Algorithms for Unsteady Navier-Stokes Flow on GPU Architectures

Bahareh Mostafazadeh Davani Ph.D. Student, University of California, Irvine
Bahareh Mostafazadeh Davani is a Ph.D. student in computer engineering at the University of California, Irvine, which she has attended since 2014. Bahareh is a member of the HPC Factory, a scalable parallel computing lab, where she works on design and implementation of new parallel algorithms for the field of computational fluid dynamics. She got her B.S. at Sharif University of Technology in Iran.

We present HiPer, a high-performance algorithm for unsteady Navier-Stokes flow on GPU systems. Compared to other simulations, our approach has two distinct characteristics: (1) we achieve a high percentage of the machine's peak performance, and (2) we can adapt to current and future heterogeneous systems. We present results on 1 million grid points and achieve a 37x speedup compared to prior state-of-the-art software. We designed HiPer as a building block for next-generation CFD.

Level: Beginner technical
Type: Poster
Tags: Computational Fluid Dynamics; Algorithms

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6267 - CUDA-Accelerated Acquisition of Spread Spectrum Signal in Satellite Communication

Ying Liu Associate Professor, University of Chinese Academy of Sciences
Ying Liu is an associate professor in the School of Computer and Control at the University of Chinese Academy of Sciences. She also holds a position with the Key Lab of Big Data Mining and Knowledge Management of the Chinese Academy of Sciences. She received her B.S. from Peking University, China, in 1999, and her M.S. and Ph.D. in computer engineering from Northwestern University, Evanston, Ill., in 2001 and 2005, respectively. Her research interests include data mining, high performance computing, and business intelligence. She has published more than 50 research papers in international conferences and journals. Ying is principal investigator and co-principal investigator of three National Natural Science Foundation of China projects. She has received the NVIDIA Global Professor Partnership and an Agilent Research Project Grant. She was awarded CUDA Education Center and CUDA Research Center status in 2011 and 2012, respectively.

Spread spectrum communication is used in many satellite communication applications, e.g., GPS. Due to its high computational complexity, real-time spread spectrum signal acquisition (a critical step in the processing flow of satellite communication) had not been achieved on a CPU-based system without FPGAs or DSPs. We therefore proposed using CUDA-enabled GPUs to accelerate it. The computational core, sliding correlation, was identified, and an efficient CUDA parallelization scheme was proposed. A CUDA-enabled acquisition algorithm was implemented. Experimental results on data from a real satellite spread spectrum system showed up to a 212x speedup using a Tesla K20 GPU over the execution time on a CPU with Intel IPP. Real-time acquisition was achieved in most cases, and good scalability was observed.
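
A hedged sketch of the sliding-correlation core (our illustration; names and data layout are assumptions): each thread evaluates the correlation of the received signal against the spreading code at one candidate code-phase offset, so the whole search range is covered in a single launch.

    // rx must hold n_offsets + code_len - 1 samples.
    __global__ void sliding_correlate(const float *rx, const float *code,
                                      float *corr, int code_len,
                                      int n_offsets) {
        int tau = blockIdx.x * blockDim.x + threadIdx.x;  // candidate offset
        if (tau < n_offsets) {
            float acc = 0.0f;
            for (int k = 0; k < code_len; ++k)
                acc += rx[tau + k] * code[k];
            corr[tau] = acc;   // acquisition then picks the peak offset
        }
    }

A parallel reduction over corr[] then locates the correlation peak that marks successful acquisition.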

Level: Intermediate technical
Type: Poster
Tags: Signal & Audio Processing; Algorithms

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6268 - Which GPU for How Many NPCs?

Stéphane Cardon Associate Professor, Center of Research, French Military Academy of Saint-Cyr Coëtquidan
Stephane Cardon is an associate professor at the French Military Academy of Saint-Cyr Coetquidan, where he teaches object-oriented programming. He got his Ph.D. in computer science in 2003 from the University of Artois, France. In charge of a future simulation curriculum, he teaches CUDA and game programming. Since 2012, his research has focused on game artificial intelligence and the use of GPUs as the hardware of choice for game AI.

Today's game artificial intelligence makes use of CPUs, not GPUs. With more and more powerful GPUs and cloud gaming, our vision is that GPUs will be the best choice to run game AI, just as they now provide computing power for game physics (e.g., NVIDIA PhysX). In particular, we believe GPUs can be used to implement game AI planning, which computes the behaviors of characters (non-player characters -- NPCs) in games. Our objective is an efficient GPU-based game AI planning component that can be used not only for PC games, but also for cloud gaming. We report here on our recent implementation, which is up to 150x faster than our CPU-based game AI planning component. It handles up to 256 NPCs with plans of at most eight actions.

Level: Intermediate technical
Type: Poster
Tags: Game Development

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6269 - Dynamical Analysis of Connected Neuronal Motifs with OpenAcc and OpenMPI

Krishna Pusuluri Ph.D. Student, Computational Neuroscience, Georgia State University
Krishna Pusuluri is a Ph.D. student in neuroscience at Georgia State University, Atlanta. He was a researcher at the Indian School of Business, Hyderabad, and a software engineer at Yahoo Inc., Bangalore. His interests include solving problems in computational neuroscience by exploring their mathematical foundations, employing scientific computing techniques, applied dynamical systems, and neural networks, and using the latest advancements in parallel processing and grid computing to study large-scale systems, alongside building new computational and mathematical tools. He wants to explore the interdisciplinary realms of neuroscience to better understand the workings of the human brain and to apply those insights to building modern technologies, particularly in machine learning and artificial intelligence.

Large-scale analysis of the dynamical behavior of central pattern generators (CPGs) formed by neuronal networks of even small sizes is computationally intensive and grows exponentially with network size. We have developed a suite of tools to exhaustively study the behavior of such networks on modern GPGPU accelerator clusters using OpenACC and OpenMPI. Directive-based approaches simplify the task of porting serial code onto GPUs without expertise in CUDA or OpenCL. Three-cell neuronal CPGs have been explored previously using various GPGPU tools. As motifs form the building blocks of larger networks, we have employed our framework to study four-cell CPGs and two connected three-cell motifs. We discuss the performance improvements achieved using this framework and present some of our results.

Level: Beginner technical
Type: Poster
Tags: Programming Languages; Deep Learning & Artificial Intelligence

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6270 - Measuring and Modeling the Motion of Volumetrically Digitized Objects

Duane Storti Professor of Mechanical Engineering, University of Washington-Seattle
Duane Storti is a professor of mechanical engineering at the University of Washington in Seattle and has 35 years of experience in teaching and research in the areas of engineering mathematics, dynamics and vibrations, computer-aided design, 3D printing, and applied GPU computing. Together with Mete Yurtoglu, Duane authored the recently released book "CUDA for Engineers: An Introduction to High-Performance Parallel Computing."

We present two results from CUDA-enabled processing of digitized objects derived from volumetric scans. (1) High-resolution, non-invasive measurement of foot bone motion during walking gait: We compute digitally reconstructed radiographs (DRRs) corresponding to projections of digitized bones and register with stereo fluoroscopy to obtain full 3D kinematics. Images shown include the first results obtained from full scans with multiple bones, overlaps in the projected views, and significant background noise. CUDA-powered algorithms play an essential role in speeding the DRR and registration computations to achieve rates that enable multi-patient studies. (2) Design of swept solids using a CUDA-powered image stack modeler: A multi-axis rotational sweep of a digitized talus is illustrated.

Level: Beginner technical
Type: Poster
Tags: Medical Imaging

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6271 - Computational Study of Magnetic Anisotropy Using Heisenberg Model and GPU-Accelerated Monte-Carlo Simulation

Soo Kyung Kim Student, Lawrence Livermore National Laboratory
Soo Kyung Kim is an M.S. student in computer science and a Ph.D. student in scientific computing at the School of Material Science and Engineering at Georgia Institute of Technology. Her advisors are Professor Hamid Garmestani in MSE and Professor Richard Fujimoto in CSE. Originally from Seoul, South Korea, Soo Kyung graduated summa cum laude with a B.S. in electrical and computer engineering and physics from Ewha Womans University. She received her M.S. in applied physics at Columbia University in New York. She continued her research in Pacific Northwest National Laboratory as a research scientist intern for two years, working in the computational mathematics department developing simulation code and analyzing data from DFT calculation for magnetic material for the purpose of developing rare-earth replacement magnetic material. Soo Kyung's research focuses on predicting various novel material properties using statistical and computational methodologies. Her interests include simulation and modeling techniques in scientific computing, such as molecular dynamics and DFT, and also statistical analysis of large data.

Although Monte-Carlo simulation based on the Ising model, combined with DFT calculations, is a widely used method to model the thermal fluctuation and dynamics of magnetic spins, it has two main drawbacks: (1) model accuracy and (2) performance. First, the Ising model is not accurate enough to represent the complicated physics of real materials. Second, Monte-Carlo itself is very slow, which is a critical issue when extending to many-atom systems. Monte-Carlo repeats random sampling at each iteration step, and it can be extremely slow to generate "truly random" numbers to choose the spins to flip in a large atomic system. To resolve these problems, we have implemented a checkerboard algorithm on GPUs within the Heisenberg model, which better mimics real physics.
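
A hedged sketch of a checkerboard sweep (our illustration, not the poster's code; the flip-only trial move is for brevity, where real samplers also propose random reorientations): sites of one parity have no nearest-neighbor dependencies, so all "red" sites can attempt a Metropolis update in parallel, then all "black" sites.

    // s: 3-component Heisenberg spins on an nx x ny periodic lattice.
    // rnd: pre-generated uniforms in [0,1), e.g., from cuRAND.
    __global__ void sweep_color(float3 *s, const float *rnd, int nx, int ny,
                                float J, float beta, int color) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= nx || y >= ny || ((x + y) & 1) != color) return;
        int i = y * nx + x;
        int xm = (x + nx - 1) % nx, xp = (x + 1) % nx;
        int ym = (y + ny - 1) % ny, yp = (y + 1) % ny;
        int nb[4] = { y * nx + xm, y * nx + xp, ym * nx + x, yp * nx + x };
        float3 h = make_float3(0.f, 0.f, 0.f);   // sum of neighbor spins
        for (int k = 0; k < 4; ++k) {
            h.x += s[nb[k]].x; h.y += s[nb[k]].y; h.z += s[nb[k]].z;
        }
        // Trial move: spin flip; for E = -J * sum S_i . S_j,
        // the energy change is dE = 2 * J * (S_i . h).
        float dE = 2.f * J * (s[i].x * h.x + s[i].y * h.y + s[i].z * h.z);
        if (dE <= 0.f || rnd[i] < expf(-beta * dE)) {
            s[i].x = -s[i].x; s[i].y = -s[i].y; s[i].z = -s[i].z;
        }
    }

Alternating the two colors gives a full lattice sweep in which every update within a launch is provably independent of the others.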

Level: Beginner technical
Type: Poster
Tags: Computational Physics; Computational Chemistry

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6272 - A Parallel Floyd-Warshall Algorithm on GPU

Roussian Gaioso Ph.D. Student, Universidade Federal de São Carlos
Roussian Gaioso is pursuing a Ph.D. in computer science at the Federal University of Sao Carlos in the area of distributed systems and parallel computing, which he started in February 2014. Roussian graduated in information systems at the State University of Goias in 2009 and gained his M.S. in computer science from the Federal University of Goias in 2014. He has experience in computer science with emphasis on analysis and development of software and parallel computing. His interests include parallel computing, distributed system, high performance computing, and graph theory.

We propose a new parallel algorithm for solving the all-pairs shortest path (APSP) problem. The algorithm is based on Floyd-Warshall and therefore inherits some of its advantages, such as predictable performance regardless of the underlying graph structure. It was efficiently implemented on a machine with a many-core GPU, which is less expensive than a cluster of computers. The tests were performed on a Tesla C2075 graphics card. The implementation was able to identify the shortest paths among all pairs of vertices of randomly generated graphs (all containing a maximum of 8,192 vertices) in less than 15 seconds, which represents a speedup of 150x over the sequential Floyd-Warshall algorithm.
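
An illustrative CUDA formulation of Floyd-Warshall (our sketch, not the poster's exact implementation): the host loops over the pivot k, and each kernel launch relaxes all (i, j) pairs in parallel, which is safe within a fixed k because row k and column k cannot improve during that step.

    // dist: dense n x n row-major matrix of path lengths.
    __global__ void fw_relax(float *dist, int n, int k) {
        int i = blockIdx.y * blockDim.y + threadIdx.y;
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && j < n) {
            float via_k = dist[i * n + k] + dist[k * n + j];
            if (via_k < dist[i * n + j]) dist[i * n + j] = via_k;
        }
    }

    // Host side:
    //   dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
    //   for (int k = 0; k < n; ++k) fw_relax<<<grid, block>>>(d_dist, n, k);

The O(n^2) work per pivot maps directly onto the GPU grid, which is what gives the algorithm its predictable performance independent of graph structure.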

Level: Intermediate technical
Type: Poster
Tags: Algorithms; Performance Optimization

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6273 - GPU-Accelerated Neighborhood Operators for Permutation-Based Problems

Victor Machado Student, Fluminense Federal University
Victor Machado is an M.S. student at the Institute of Computing of the Fluminense Federal University in the research area of algorithms and optimization. He holds a degree in industrial engineering from Fluminense Federal University (Brazil), with a one-year study period at the University of Porto (Portugal).

This poster presents an efficient GPU implementation of four neighborhood operators that are commonly applied in the local search of many metaheuristics for permutation-based problems, such as the Traveling Salesman Problem and the Single Row Facility Layout Problem. Although many optimization problems have been solved through GPU parallelization in recent years, the authors are not aware of a thorough analysis of the neighborhood moves themselves. Therefore, we evaluate the neighborhood operators rather than analyzing a specific metaheuristic. The parallel approach achieved good results compared to the CPU version, reaching speedups ranging from 14x to 68x.
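
As a hedged illustration of one such operator (a 2-opt-style move in our own notation; the poster evaluates four different operators): each thread scores one candidate move, here the reversal of the tour segment between positions i and j, so the entire neighborhood is evaluated in one launch.

    // tour: permutation of n cities; dist: dense n x n distance matrix.
    __global__ void eval_2opt(const int *tour, const float *dist, int n,
                              float *delta) {
        int i = blockIdx.y * blockDim.y + threadIdx.y + 1;   // 1 .. n-3
        int j = blockIdx.x * blockDim.x + threadIdx.x + 2;   // i+1 .. n-2
        if (i < n - 2 && j < n - 1 && j > i) {
            int a = tour[i - 1], b = tour[i], c = tour[j], e = tour[j + 1];
            // Cost change from replacing edges (a,b),(c,e) by (a,c),(b,e).
            delta[i * n + j] = dist[a * n + c] + dist[b * n + e]
                             - dist[a * n + b] - dist[c * n + e];
        }
    }

A parallel reduction over delta[] then selects the most improving (most negative) move for the local search step.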

Level: Beginner technical
Type: Poster
Tags: Algorithms

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6274 - A Parallel CPU/GPU Algorithm for 3D Pose Estimation Using CUDA and OpenMP

Kenia Picos PhD student, CITEDI-IPN
Kenia Picos is a Ph.D. student at the National Polytechnic Institute in the Center of Research and Development of Digital Technology, Mexico. Her fields of interest include image processing, computer vision, and computer graphics.

Pose recognition is characterized by location and orientation parameters, which introduces high complexity due to the huge number of views a target can present within a scene. An effective algorithm is needed to analyze the physical phenomena involved in the visualization of a moving 3D object. This work presents a proposal for pose recognition with adaptive correlation filters, with a CPU/GPU implementation using OpenMP and CUDA to improve execution performance.

Level: Intermediate technical
Type: Poster
Tags: Computer Vision & Machine Vision; Video & Image Processing

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6276 - GPU Implementation of Splitting-Up Conjugate Gradient Method

Akiyoshi Wakatani Professor, Konan University
Akiyoshi Wakatani is a professor at Konan University in Japan. He received a Ph.D. from the Division of Information Engineering, Faculty of Engineering, Kyoto University in 1996. Akiyoshi was with Matsushita Electric Industrial (currently Panasonic) from 1986 to 2000 as a researcher and a senior researcher. From 1992 to 1994, he was a visiting scholar at the Oregon Graduate Institute of Science and Technology. From 2000 to 2006, he was an associate professor in the Department of Information Science and Systems Engineering, Faculty of Science and Engineering, Konan University, Kobe, Japan. In 2006, he was made a full professor. His research interests include parallel processing and programming education.

We implemented a preconditioned conjugate gradient (CG) method on GPUs (GeForce GTX TITAN, Tesla K80). Our method utilizes the Splitting-Up (SP) preconditioner, which is suitable for parallel processing because all dimensions except one are independent, whereas the well-known incomplete Cholesky preconditioner is hard to parallelize. A simple implementation of SPCG cannot fully utilize coalesced memory accesses. So, to increase the effective memory bandwidth, we carry out a pseudo matrix transposition before and after solving a tridiagonal matrix. With this policy, the speedup of our approach can be enhanced by up to 93%. In addition, the number of transpositions can be reduced by a rotation configuration.
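
For context, the standard shared-memory transpose pattern achieves the coalescing this relies on (an illustrative sketch; the poster's "pseudo" transposition scheme may differ). The +1 padding avoids shared-memory bank conflicts; launch with a (TILE, TILE) block.

    #define TILE 32

    __global__ void transpose(const float *in, float *out, int w, int h) {
        __shared__ float tile[TILE][TILE + 1];
        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < w && y < h)
            tile[threadIdx.y][threadIdx.x] = in[y * w + x];   // coalesced read
        __syncthreads();
        x = blockIdx.y * TILE + threadIdx.x;                  // transposed block
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < h && y < w)
            out[y * h + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
    }

Reordering the data this way lets the tridiagonal solves in every dimension read memory contiguously, which is where the reported bandwidth gain comes from.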

Level: Intermediate technical
Type: Poster
Tags: Performance Optimization; Supercomputing & HPC

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6279 - Intraoperative GPU-Based Surgical Navigation for Needle Steering

Fangde Liu Research Associate, Imperial College London
Fangde Liu is currently a research associate in the Mechatronics in Medicine Group at Imperial College London. He has an interdisciplinary background in computer science, robotics, and medical surgery. He is a member of the British Machine Vision Association and a senior associate of the Royal Society of Medicine. He leads the robotic surgical navigation software system for several European surgical robot projects, including EDEN 2020 (Horizon 2020) and STING (Framework 7) for neurosurgery, and SmartCather (British Heart Foundation) for endovascular surgery. He obtained his Ph.D. from the National Centre for Computer Animation, working on physics-based character animation technology, and worked briefly at ABB Robotics developing robotic manufacturing software.

Newly developed, robotically steered needles allow minimally invasive access and accurate guidance to deep-seated anatomical targets. They promise to improve the efficacy of interventions such as deep brain stimulation and tumor management while reducing patient trauma. By using ultrasound to track both tissue and needle deformation, the optimal insertion trajectory can be updated intraoperatively. The whole navigation process can be accelerated with a GPU implementation, which greatly reduces navigation latency, making surgery safer and more accurate.

Level: Advanced technical
Type: Poster
Tags: Medical Imaging; Robotics & Autonomous Machines

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6281 - Implementation of a Real-Time Polyphase Filter in Radio Astronomy

Karel Adámek Post-Doctoral Research Assistant, University of Oxford
Karel Adamek started as a postdoc at the University of Oxford's e-Research Centre in November 2015. He recently finished a theoretical physics doctoral program at Silesian University in Opava, Czech Republic.

We present our implementation of a polyphase filter for real-time data processing in radio astronomy. The polyphase filter is a standard tool in digital signal processing and, as such, a well-established algorithm. We have implemented the polyphase filter on three generations of NVIDIA GPUs (Fermi, Kepler, Maxwell), as well as on Intel Xeon CPU and Xeon Phi (Knights Corner) platforms. All of our implementations aim to exploit the potential for data reuse that the algorithm offers. Our GPU implementations explore two different methods for achieving this: the first makes use of the L1/texture cache, the second uses shared memory. We present our results in terms of the sample rate that can be processed per second.
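
An illustrative sketch of the filter's front end (our own naming, not the authors' code): each thread computes one channel of one output spectrum as a T-tap weighted sum, striding through the input by the channel count C; a batched C-point FFT (e.g., via cuFFT) then completes the filter bank.

    // x: input samples, (nspectra + T - 1) * C of them.
    // win: T * C window coefficients; y: nspectra * C filtered outputs.
    __global__ void ppf_fir(const float *x, const float *win, float *y,
                            int C, int T, int nspectra) {
        int c = blockIdx.x * blockDim.x + threadIdx.x;   // channel
        int m = blockIdx.y;                               // output spectrum
        if (c < C && m < nspectra) {
            float acc = 0.0f;
            for (int t = 0; t < T; ++t)    // each x sample is reused by T spectra
                acc += x[(m + t) * C + c] * win[t * C + c];
            y[m * C + c] = acc;
        }
    }

The T-fold reuse of every input sample across neighboring spectra is exactly the data-reuse opportunity the cache-based and shared-memory variants exploit in different ways.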

Level: Beginner technical
Type: Poster
Tags: Astronomy & Astrophysics; Signal & Audio Processing

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6282 - Tuning Heterogeneous Computing Architectures Through Integrated Performance Tools

Robert Lim Graduate Research Assistant, University of Oregon
Robert Lim is a second-year Ph.D. student in computer science at the University of Oregon working under the supervision of Professor Allen Malony, and is a member of the Performance Research Laboratory. Previously, he was an M.S. student in computer science at UC Irvine under the supervision of Professor Isaac Scherson. Robert is a recipient of the Department of Defense SMART Fellowship. His interests are in high performance computing and performance monitoring tools, with a minor emphasis on large-scale geospatial applications.

Heterogeneous computing presents challenges in optimizing performance across diverse architectures, high-speed networks, and programming methods in these systems. Observing GPU hardware performance counters, collected either through instrumentation or sampling, elucidates kernel execution, but does not provide a means to correlate dense activity regions with source line information. GPUs specialize in executing SIMD (single instruction, multiple data) in lock-step, where threads that do not satisfy branching conditions are masked out. Control flow graphs represent program control flow and dependencies in a program. In deriving trip counts of the control flow graph, one can determine how an input of size N will perform on a GPU without having to compile or run the application.

Level: Intermediate technical
Type: Poster
Tags: Tools & Libraries; Performance Optimization

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6283 - GPU-Accelerated Sub-Sample Displacement Estimation Method for Real-Time Ultrasound Elastography

Bo Peng Postdoc, Michigan Technological University
Bo Peng is researching ultrasound speckle tracking and ultrasound shearwave elastrography. He received a B.S. from the School of Computer Science at Southwest Petroleum University, Chengdu, China, in 2003. He received his M.S. and Ph.D. in the College of Computer Science from Sichuan University, Chengdu, China, in 2007 and 2014, respectively. His research interests include ultrasound imaging, bio-medical signal processing, and GPU-based parallel computing. He joined Dr. Jiang's Lab at Michigan Technological University as a postdoctoral research fellow in January 2015.

Ultrasound elastography is a promising medical imaging modality that estimates mechanical properties of soft tissues. Because tissue elasticity is inferred from tissue displacements, a highly accurate displacement estimation method is critical. Recently, our group developed an improved sub-sample displacement estimation algorithm in which axial and lateral motion estimates are performed simultaneously to enhance tracking accuracy. The proposed method calculates the sub-sample estimate for each region of interest with no inter-process communication, making it a perfect candidate for GPU-based parallelization. In this study, the proposed method has been implemented in CUDA. It is about 60x faster than the CPU implementation while maintaining its advantages.

Level: Intermediate technical
Type: Poster
Tags: Medical Imaging

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6285 - Parallelization of Graph Algorithms on GPU Using CUDA®

Chetan Pise Assistant Professor, Yeshwantrao Chavan College of Engineering Nagpur, India
Chetan Pise is an assistant professor at the Yeshwantrao Chavan College of Engineering, Nagpur, India. He is enthusiastic about GPU computing, and has attended and organized workshops on GPU CUDA and high performance computing at YCCE.

Graphs play an important role in science and technology, for example in finding shortest distances. Large graphs are common in scientific and engineering applications, involving operations on millions of vertices and edges. For faster execution of such operations, parallel computation is essential. GPUs offer high computational power at a low price, and CUDA has become a mainstream programming approach for GPGPUs. A multithreaded CUDA device allows thousands of threads to run in parallel on the GPU. We demonstrate the comparison between serial and parallel implementations of the BFS and Dijkstra algorithms.
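
For context, one common GPU formulation of BFS is level-synchronous (an illustrative sketch, one of several possible formulations): each thread owns a vertex, frontier vertices relax their neighbors, and the host iterates levels until no distance changes.

    // CSR graph: row_ptr[n+1], col_idx[m]; dist initialized to INT_MAX
    // except dist[source] = 0.
    __global__ void bfs_level(const int *row_ptr, const int *col_idx,
                              int *dist, int level, int n, int *changed) {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v < n && dist[v] == level) {             // v is on the frontier
            for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
                int u = col_idx[e];
                if (dist[u] > level + 1) {           // unvisited neighbor
                    dist[u] = level + 1;
                    *changed = 1;                    // host re-launches next level
                }
            }
        }
    }

Concurrent writes of the same value level + 1 are benign here, which is why the kernel needs no locks.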

Level: Beginner technical
Type: Poster
Tags: Algorithms; Performance Optimization

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6286 - Evolutionary Methodology Framework for GPUs

Mihály Retek PhD student, Corvinus University of Budapest
Mihály Retek has over 10 years of experience in software development. He is a Ph.D. student at the Faculty of Business Informatics of the Corvinus University of Budapest. His research topic is the development of interactive forecast/foresight expert systems that can be used for computing processes of temporal and spatial models. Mihaly obtained an M.S. in software architecture and mathematics, with a specialization in image processing, at the University of Szeged in 2003.

In evolutionary methods, many processes of the same type can be executed in parallel. These processes are connected to different source and target datasets. For this reason, these methods are well suited to SIMD architectures. This poster shows an evolutionary framework in which evolutionary algorithms can be developed for GPUs and CPUs. The "Implemented Method" section of the poster forms the foundation of this methodology and allows for the creation of more advanced forecasting.

Level: Intermediate technical
Type: Poster
Tags: Algorithms; Deep Learning & Artificial Intelligence

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6287 - A Task-based Programming Model Using Industry Standards for Embedded Hardware

Sunita Chandrasekaran Assistant Professor, University of Delaware
Sunita Chandrasekaran is an assistant professor at the University of Delaware with the Department of Computer & Information Sciences. Previously, she worked as a postdoctoral researcher at the University of Houston with Dr. Barbara Chapman at the Department of Computer Science. She received a Ph.D. in computer science engineering from Nanyang Technological University, Singapore, for developing a high-level portable software toolchain for complex embedded reconfigurable hardware. She received a B.S. in electrical and electronics from India. Sunita's research interests include building programming models for HPC and embedded systems, building validation and verification suite for emerging directive-based programing models such as OpenACC, and exploring irregular applications.

The Multicore Association (MCA) is an industry association that defines and promotes open specifications to enable multicore product development. The main goal of MCA is to abstract hardware details and offer a portable software solution stack for embedded systems. One of the MCA APIs is the Multicore Task Management API (MTAPI), which leverages task parallelism on embedded multicore systems comprising symmetric and asymmetric processors. We have developed a runtime library (RTL) based on MTAPI that allows scheduling and mapping of tasks to the heterogeneous cores of a given platform. Our RTL utilizes the Multicore Communication API (MCAPI) to communicate between cores. It is evaluated on the NVIDIA Jetson TK1 embedded processor, which comprises ARM and GPU cores.

Level: Intermediate technical
Type: Poster
Tags: Embedded; Programming Languages; IoT

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6288 - Acceleration of a Pseudo-Bacterial Potential Field Algorithm for Path Planning

Ulises Orozco-Rosas PhD Student, Instituto Politécnico Nacional
Ulises Orozco is working in the Department of Intelligent Systems, Center of Research and Development in Digital Technology (CITEDI). His research activities include the design of intelligent mobile robots, motion planning, and software engineering. He received a B.S. in electronics engineering from the Autonomous University of Baja California, Mexico in 2006, and an M.S. in digital systems from the Instituto Politecnico Nacional in 2014. He is currently pursuing a Ph.D. in digital systems from the Instituto Politecnico Nacional. His research interests are computational intelligence, heterogeneous parallel computing, and intelligent robots.

Path planning of a mobile robot -- determining an optimal path from a universe of possible solutions -- is one of the most computationally intensive tasks and a challenge in dynamically changing environments. Using GPUs, it is possible to process data-intensive tasks efficiently. This work presents the acceleration of a Pseudo-Bacterial Potential Field (PBPF) algorithm for path planning. The Matlab-CUDA implementation of the PBPF algorithm shows how to find an optimal collision-free path for a mobile robot and how to speed up the path planning computation through the use of GPUs. The simulation results demonstrate the efficiency of the PBPF implementation in solving the path planning problem in offline and online modes.
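
For background, the classical artificial potential field that PBPF builds on combines an attractive and a repulsive term, and the robot descends the resulting gradient (notation is ours; in such formulations an evolutionary component typically tunes the gains \(k_a, k_r\)):

\[
U(q) = \tfrac{1}{2} k_a \, \rho_{\mathrm{goal}}^2(q)
+ \begin{cases}
\tfrac{1}{2} k_r \left( \dfrac{1}{\rho(q)} - \dfrac{1}{\rho_0} \right)^2, & \rho(q) \le \rho_0, \\[6pt]
0, & \rho(q) > \rho_0,
\end{cases}
\qquad
F(q) = -\nabla U(q),
\]

where \(\rho_{\mathrm{goal}}(q)\) is the distance to the goal, \(\rho(q)\) the distance to the nearest obstacle, and \(\rho_0\) the obstacle influence range. Evaluating many candidate parameter sets over the workspace is the data-parallel workload that maps well onto the GPU.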

Level: Intermediate technical
Type: Poster
Tags: Robotics & Autonomous Machines; Self-Driving Cars & Automotive ; IoT

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6289 - Real-time 3D Reconstruction for Autonomous Driving through Semi-Global Matching

Antonio Espinosa Postdoc Researcher, Universitat Autònoma de Barcelona
Antonio Espinosa is a postdoctoral researcher in the Computer Architecture and Operating Systems Department at the University Autonoma of Barcelona. His interests include informatics research, parallel computing, bioinformatics, and GPU optimization. He belongs to the HPCA4SE group, where he coordinates GPU research center activities.

Robust and dense computation of depth information from stereo-camera systems is a computationally demanding requirement for real-time autonomous driving. Semi-Global Matching (SGM) [1] approximates the results of computationally heavy global algorithms at lower computational complexity, making it a good candidate for a real-time implementation. SGM minimizes energy along several 1D paths across the image. The aim of this work is to provide a real-time system producing reliable results on energy-efficient hardware. Our design runs on an NVIDIA Titan X GPU at 104.62 FPS and on an NVIDIA Drive PX at 6.7 FPS, promising for real-time platforms.
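
For context, SGM aggregates a pixelwise matching cost \(C(\mathbf{p}, d)\) along each 1D path direction \(\mathbf{r}\) via the standard recurrence

\[
L_r(\mathbf{p}, d) = C(\mathbf{p}, d)
+ \min\Bigl( L_r(\mathbf{p}-\mathbf{r}, d),\;
L_r(\mathbf{p}-\mathbf{r}, d-1) + P_1,\;
L_r(\mathbf{p}-\mathbf{r}, d+1) + P_1,\;
\min_k L_r(\mathbf{p}-\mathbf{r}, k) + P_2 \Bigr)
- \min_k L_r(\mathbf{p}-\mathbf{r}, k),
\]

and the selected disparity at \(\mathbf{p}\) minimizes the summed cost \(S(\mathbf{p}, d) = \sum_r L_r(\mathbf{p}, d)\). Each path is a sequential scan, but all pixels along a path wavefront, all path directions, and all disparities can be processed in parallel, which is what makes SGM amenable to GPU implementation.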

Level: Beginner technical
Type: Poster
Tags: Computer Vision & Machine Vision; Self-Driving Cars & Automotive

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6290 - HPC for Remote Visualization and Interaction with Scientific Applications

Deyberth Riaño Nuñez Student, Universidad Industrial de Santander
Deyberth Riano Nunez is an M.S. student in computer science at the High Performance and Scientific Computing Center of Universidad Industrial de Santander in Bucaramanga, Colombia (SC3-UIS). His research interests include high performance computing, scientific visualization, and interaction in scientific applications.

We present a work in progress, integrating remote visualization, interaction and high performance computing for the development of scientific applications, taking advantage of ultra-high-resolution display walls as a collaborative environment.

Level: Beginner technical
Type: Poster
Tags: Supercomputing & HPC; Graphics Virtualization

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6291 - Profiler Guided Manual Optimization for Accelerating the Cholesky Decomposition

Vinay Ramakrishnaiah Research Assistant/Intern, University of Wyoming/National Center for Atmospheric Research
Vinay Ramakrishnaiah is a Ph.D. candidate and research assistant at the University of Wyoming. The work presented here is a part of research conducted at the National Center for Atmospheric Research during SIParCS 2015.

'fields' is a widely used R package for spatial statistics. At the National Center for Atmospheric Research (NCAR), it is used by the IMAGe group. We used the Matrix Algebra on GPU and Multicore Architectures (MAGMA) library to accelerate the Cholesky Decomposition (CD) in 'fields'. Profiling provided the insight needed to accelerate the CD. We were able to optimize the code and environment to get a speedup greater than 75x for large matrices. We integrated our accelerated C functions with Julia and drew a performance comparison between R and Julia; Julia achieved a speedup of up to 4x for large matrices. We found a potential way to improve MAGMA functions by replacing the intra-node inter-GPU communications with direct device-to-device calls.
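
A minimal sketch of the kind of MAGMA call involved (MAGMA v2-style API; the matrix source and sizes are placeholders, and this is an illustration rather than the poster's code):

    #include <magma_v2.h>

    /* Factor a symmetric positive-definite n-by-n matrix on the GPU. */
    void cholesky_gpu(double *hA, magma_int_t n)
    {
        magma_init();
        magma_queue_t queue;
        magma_queue_create(0, &queue);

        magmaDouble_ptr dA;
        magma_int_t ldda = ((n + 31) / 32) * 32;        /* padded leading dim */
        magma_dmalloc(&dA, (size_t)ldda * n);

        magma_dsetmatrix(n, n, hA, n, dA, ldda, queue);  /* host -> device */

        magma_int_t info;
        magma_dpotrf_gpu(MagmaLower, n, dA, ldda, &info); /* A = L * L^T */
        /* info == 0 on success; info > 0 means A is not positive definite */

        magma_dgetmatrix(n, n, dA, ldda, hA, n, queue);  /* device -> host */
        magma_free(dA);
        magma_queue_destroy(queue);
        magma_finalize();
    }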

Level: Intermediate technical
Type: Poster
Tags: Supercomputing & HPC; Earth System Modelling

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6292 - Agile Condor: Scalable High Performance Embedded Computing Architecture

Mark Barnell Senior Computer Scientist, Air Force Research Laboratory
Mark Barnell is a senior computer scientist with the U.S. Air Force Research Laboratory, high performance computing systems branch (AFRL/RITB). He has over 28 years of experience in computer engineering and advanced computing. He is currently the HPC director for AFRL's Information Directorate Computing and Communications Affiliate Resource Center and Agile High Performance Systems program manager. His areas of expertise include high-performance computers, embedded computing, persistent wide area surveillance, and distributed and next-generation architectures.
Christopher Capraro Senior Systems Engineer, SRC
Christopher Capraro has over 20 years of experience in research, analysis, and software development. As a senior systems engineer at SRC, Inc., he has managed and worked on several research programs in the areas of Space-Time Adaptive Processing (STAP), Synthetic Aperture Radar (SAR), knowledge-aided techniques for improving radar signal processing, bistatic and multistatic radar systems, waveform diversity, high-performance embedded computing systems, system architectures, and software development. In addition, he has published various peer-reviewed papers and book sections.

The Air Force Research Laboratory Information Directorate Advanced Computing and Communications Division is developing a new computing architecture using GPUs, designed to provide a high-performance embedded computing (HPEC) pod solution to meet operational and tactical real-time processing needs for intelligence, surveillance, and reconnaissance (ISR) missions. This newly designed system, Agile Condor, is a scalable HPEC system based on open industry standards that will increase, far beyond the current state of the art, computational capability within the restrictive size, weight, and power constraints of unmanned aircraft systems' external "pod" payloads.

Level: Intermediate technical
Type: Poster
Tags: Aerospace & Defense; Embedded

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6293 - Fast and Robust Feature Matching

Ana Caroline Vargas Student, IC-UFF/Brazil
Ana Vargas is a graduate student in computer science at the Instituto de Computacao, Universidade Federal Fluminense (UFF). She earned a CNPq Junior Scientific Initiation fellowship during high school (2010-2011), the Young Talent for Science scholarship awarded by Capes (2012-2013), and a CNPq scientific initiation scholarship (2013-2015). She has experience in computer science with an emphasis on computer graphics in the following areas: computer vision and image processing.
Cristina Nader Vasconcelos Assistant Professor, IC-UFF/Brazil
Cristina Nader Vasconcelos has been an assistant professor at the Instituto de Computacao of the Universidade Federal Fluminense (UFF) since 2010. Cristina obtained a Ph.D. in computer science from PUC-Rio/Brazil in 2009 for computer vision algorithms on GPU. She got her M.S. in computer graphics in 2005, also from PUC-Rio. She has worked with digital video to extract its shots directly from MPEG compressed stream. Her interests include computer vision, pattern recognition, machine learning, computer graphics and parallel algorithms.

Choosing informative, discriminating, and independent features is crucial for effective computer vision algorithms. The result of a feature-detection procedure over an image is a set of keypoints commonly matched against sets extracted from other images. Traditionally, the matching is obtained using k-Nearest Neighbors (k-NN) modeling. However, such an approach does not enforce matching uniqueness; that is, it allows a keypoint to be matched more than once. We explore a parallel Bipartite Graph Matching (BGM) implemented entirely on the GPU for fast and robust matching and present its comparison against k-NN.
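
As a hedged sketch of the baseline being compared against (illustrative code, not the authors' implementation), a brute-force nearest-neighbor matcher assigns each query keypoint independently, which is exactly what permits the duplicate associations BGM rules out:

    // One thread per query keypoint: find the reference descriptor with
    // minimum squared L2 distance. Nothing stops two queries from
    // selecting the same reference index.
    __global__ void nn_match(const float *query, const float *ref,
                             int nq, int nr, int dim, int *best)
    {
        int q = blockIdx.x * blockDim.x + threadIdx.x;
        if (q >= nq) return;

        float bestDist = 1e30f;
        int bestIdx = -1;
        for (int r = 0; r < nr; ++r) {
            float d = 0.0f;
            for (int k = 0; k < dim; ++k) {
                float diff = query[q * dim + k] - ref[r * dim + k];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; bestIdx = r; }
        }
        best[q] = bestIdx;
    }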

Level: Intermediate technical
Type: Poster
Tags: Computer Vision & Machine Vision; Algorithms

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6294 - A Dictionary Learning Approach in GPU for Image Denoising

Lizeth Joseline Fuentes Pérez Master Student, Federal Fluminense University
Lizeth Joseline Fuentes Perez is an M.S. student in the Computer Science department at Institute of Computing - Federal Fluminense University.
Luciano Arnaldo Romero Calla Master Student, Federal Fluminense University
Luciano Arnaldo is a student in computer science at the Institute of Computing - Federal Fluminense University.

Many image processing problems require image denoising as a preprocessing step. We address the problem of removing white Gaussian noise from images through dictionary learning, a technique that has been shown to fit a signal better than fixed-dictionary approaches. Learning an overcomplete dictionary for sparse representation involves a high computational cost. In this poster, we present how an efficient parallel algorithm on the GPU reduces training time.

Level: Intermediate technical
Type: Poster
Tags: Performance Optimization; Video & Image Processing

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6295 - Automatic Speech Recognition Using Deep Learning

Takehiro Sekine Leader, Yahoo! JAPAN corporation
Takehiro Sekine joined Yahoo! Japan in 2009 as a software engineer and is currently the technical lead engineer and manager for Automatic Speech Recognition Technology Development and Integration for Yahoo! Japan's speech recognition systems. He has led speech recognition development and production for IT services at Yahoo! Japan, and has held lead positions in software and system design and development for image processing and natural language processing systems.

Deep neural networks (DNNs) have become a popular foundation for state-of-the-art automatic speech recognition systems. We have collected and transcribed more than 2,000 hours of speech data and used it to train DNNs for acoustic models and voice activity detector models. We explain implementation techniques for efficient speech DNN training on GPUs and show several evaluation results and training times for different amounts of training data and DNN sizes. The trained DNNs are deployed in our ASR services in Japan.

Level: Intermediate technical
Type: Poster
Tags: Deep Learning & Artificial Intelligence; Signal & Audio Processing

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6296 - GPU-Accelerated Simulations of Evolution for Medical and Population Genetics

David Lawrie Postdoctoral Researcher, N/A
David Lawrie's research involves parallelizing algorithms important in population genetics on the GPU. David graduated from Cornell University in 2007 with B.A. degrees in mathematics and biology. He received his Ph.D. in genetics from Stanford University in 2013 under Dr. Dmitri Petrov and completed postdoctoral research at the University of Southern California under Dr. Sergey Nuzhdin. His overall research focuses on modeling evolutionary processes to detect functional regions of the genome and ascertain their importance to the health and fitness of the organism.

Learn how the analysis of whole-genome SNP data will be revolutionized by accelerating the Wright-Fisher (WF) algorithm on GPUs. The tools of population genetics are crucial for detecting adaptive and disease alleles, as well as tracking population changes over time. The forward WF simulation is powerful in its ability to model complex population histories and selection scenarios, but is limited in its practical applications by its slow execution on the CPU. The presented GPU Optimized WF simulation (GO Fish) keeps the full flexibility of its serial, CPU counterpart while running far faster. As other related, computationally intensive algorithms important in population genetics are likewise parallelizable, GO Fish serves as an exciting template for future research in the field.
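
As a hedged sketch of the computational pattern (not GO Fish itself): each polymorphic site evolves independently, so one thread can advance one site per generation. Here binomial drift is approximated by a normal draw, which is reasonable for large populations; the population size N and the selection model are illustrative assumptions:

    #include <curand_kernel.h>

    __global__ void wf_generation(float *freq, int nsites, int N, float s,
                                  unsigned long long seed)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nsites) return;

        curandState st;
        curand_init(seed, i, 0, &st);

        float p = freq[i];
        // deterministic change from selection, then binomial drift
        float psel = p * (1.0f + s) / (p * (1.0f + s) + (1.0f - p));
        float mean = N * psel;
        float sd   = sqrtf(N * psel * (1.0f - psel));
        float count = rintf(mean + sd * curand_normal(&st));
        freq[i] = fminf(fmaxf(count / N, 0.0f), 1.0f);
    }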

Level: Beginner technical
Type: Poster
Tags: Computational Biology; Algorithms

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6298 - An Off-Load Model for Computing on GPU for a Parallel CFD Solver HiFUN

Balakrishnan Narayanarao Associate Professor, Indian Institute of Science, Bangalore
N. Balakrishnan is an associate professor in the Computational Aerodynamics Lab at the Indian Institute of Science, Bangalore. He got his Ph.D. from the Department of Aerospace Engineering, Indian Institute of Science, Bangalore in 1995. After a stint at ENSAM, Paris, as a post-doctoral fellow and at I.I.T. Kanpur as an assistant professor, he re-joined the Department of Aerospace Engineering, I.I.Sc. as an assistant professor in 1998. His interests are in computational fluid dynamics algorithms and applications. His work on unstructured finite volume solvers has resulted in an industry-standard CFD solver, HiFUN, which is extensively used by the Indian aerospace industry. He is also the founding director of SandI, a venture company specializing in CFD services, started by I.I.Sc.
Munikrishna Nagaram Chief Technology Officer, S & I Engineering Solutions Pvt. Ltd.
N. Munikrishna is chief technology officer at S & I Engineering Solutions Pvt. Ltd., Bangalore, a venture company specializing in computational fluid dynamics services. He obtained his Ph.D. from the Department of Aerospace Engineering, Indian Institute of Science (I.I.Sc.), Bangalore in 2007. He has worked as a postdoctoral fellow at NASA Glenn Research Center, Cleveland, Ohio, in 2008 and 2009. He then rejoined the Department of Aerospace Engineering, I.I.Sc. as a senior research associate until the end of 2013. His interests are in CFD algorithms and applications.
Nikhil Shende Director, S & I Engineering Solutions Pvt. Ltd.
Nikhil Vijay Shende received a Ph.D. from the Department of Aerospace Engineering, Indian Institute of Science, Bangalore in July 2005. He is now director of S & I Engineering Solutions Pvt. Ltd., a technology company spun off by the Indian Institute of Science. Nikhil Vijay's research interests include industrial CFD and parallel algorithms.
Thejaswi Rao Compute Devtech Engineer, NVIDIA
Thejaswi Rao has worked with NVIDIA on architecture and infrastructure since graduating with a B.S. and M.S. in engineering from IIT Kharagpur in 2008. He recently joined the compute devtech team, where his responsibilities are developing, porting, and optimizing algorithms on NVIDIA GPUs. His interests are in machine learning and bioinformatics.

The present study deals with porting the computational fluid dynamics flow solver HiFUN, proprietary software by S & I Engineering Solutions Pvt. Ltd., to a GPU-based accelerator platform using OpenACC directives. HiFUN is already parallelized on distributed memory HPC platforms using MPI and exhibits excellent scalability; in one recent study, scaling over 15,000 processor cores on a Cray XC40 was demonstrated. The challenge at hand is to port the HiFUN solver to accelerator-based HPC clusters without compromising its scalability. The presentation includes details on the use of OpenACC directives, wherein the compute-intensive tasks are transferred to the GPU, and highlights the success of this strategy in realizing the objectives with minimal code change.
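
As a minimal, hedged illustration of the porting pattern (placeholder arrays, not HiFUN source): a compute-intensive face loop annotated so the compiler offloads it to the GPU, with data assumed already resident on the device:

    /* Accumulate face fluxes into cell residuals on the GPU. */
    void accumulate_fluxes(int nfaces, const double *flux,
                           const int *left, const int *right, double *res)
    {
        #pragma acc parallel loop present(flux, left, right, res)
        for (int f = 0; f < nfaces; ++f) {
            #pragma acc atomic update
            res[left[f]]  += flux[f];
            #pragma acc atomic update
            res[right[f]] -= flux[f];
        }
    }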

Level: Beginner technical
Type: Poster
Tags: Computational Fluid Dynamics; Supercomputing & HPC

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6307 - OpenACC Enabled Benchmark Suite on Intel Ivy Bridge

Joel Bricker Graduate Student, University of Delaware
Joel Bricker has a B.S. in computer science from the University of Delaware, where he is currently pursuing his M.S. in computer science. He worked for a supercomputing company developing system software for a system based on a specialized 80-processor chip. After that, he transitioned into the financial industry, where he has worked for the past four years.

We explore the new OpenACC implementation for parallelizing code on a multi-core processor. We use an OpenMP implementation on the same code base and compare the performance results obtained when running the code instrumented with both standards. The results are notable because the OpenMP standard has supported multi-core parallelism for some time, whereas the OpenACC standard only recently started supporting multi-core. As such, it is important for the OpenACC implementation's performance to match, or exceed, that of the existing OpenMP standard.
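
A minimal sketch of the kind of comparison involved (illustrative code, not the benchmark suite; the -ta=multicore flag is a PGI-style way to target CPU cores with OpenACC and is an assumption about the toolchain):

    /* OpenACC version: compile with, e.g., -ta=multicore for CPU cores. */
    void saxpy_acc(int n, float a, const float *x, float *y)
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    /* OpenMP version of the same loop. */
    void saxpy_omp(int n, float a, const float *x, float *y)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }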

Level: Beginner technical
Type: Poster
Tags: Supercomputing & HPC; Tools & Libraries

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6309 - GPU Accelerated Image Recognition Cloud Services—Internet UGC Images and Videos Recognition Overall Solution

Leonard Li CEO, Tupu Technology Co.,Ltd.
Leonard Li founded Tupu in 2014 and helped establish it as a technology pioneer. He is a former Tencent T4 specialist, having joined the Tencent Guangzhou Research Center in 2005. During that time, he led the team that developed the QQ-mail. Leonard later joined WeChat as a member of the founding team.

With the advent of the era of visual media, more and more information has been spread through images and videos. The demand for image analysis and recognition is growing fast. For companies that need to process lots of images but don't have the technology, Tupu has built an open platform to provide a way to censor, search, or mine images automatically and intelligently. Tuputech Cloud Platform is the largest image and video analysis cloud service provider in China and provides highly customizable services for its clients.

Level: Beginner technical
Type: Poster
Tags: Computer Vision & Machine Vision

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6310 - Deep Residual Networks - Ultra-Deep Neural Networks with 150+ layers

Jian Sun Principal Research Manager, Microsoft
Jian Sun was born in Xi'an, China, home of the Terracotta Army. He received a B.S., M.S., and Ph.D. from Xi'an Jiaotong University in 1997, 2000, and 2003, respectively. In 2003, he joined Microsoft Research Asia, and has been working in the fields of computer vision and computer graphics, with particular interest in building real-world working systems. His current primary research interests are computational photography, face recognition, and deep learning based image understanding.

Deeper neural networks are difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won first place in the ILSVRC 2015 classification task.
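
The reformulation at the heart of the approach can be stated compactly (notation from the underlying paper): a building block computes

    y = \mathcal{F}(x, \{W_i\}) + x,

where x and y are the input and output of the block and \mathcal{F} is the residual mapping realized by the stacked layers (e.g., two convolutional layers); the identity shortcut adds neither extra parameters nor computational complexity.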

Level: Intermediate technical
Type: Poster
Tags: Computer Vision & Machine Vision; Deep Learning & Artificial Intelligence

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6311 - QuickEye: Highly Efficient Face Detection & Recognition in Large Video Using Hadoop GPU-cluster

Illo Yoon Graduate Student, University of Seoul
Illo Yoon is working on designing an efficient hybrid scheduler for CPU+GPU map tasks in Hadoop. He designed and implemented QuickEye, a GPU-accelerated face detection and recognition system. Illo joined ParLab (Parallel Software Design Lab, Univ. of Seoul) in 2014 as an undergraduate researcher, and is expected to finish his B.S. and M.S. in August 2016.
Saehanseul Yi Research Engineer, Dasan Networks
Saehanseul Yi received B.S. and M.S. degrees in electrical & computer engineering from the University of Seoul in 2013 and 2015, respectively. He is currently with Dasan Networks. His research interests include parallel software design, heterogeneous computing, embedded GPU platforms, computer vision, and high-performance distributed frameworks using manycore accelerators.
Youngmin Yi Associate Professor, University of Seoul
Youngmin Yi is an associate professor at the University of Seoul. His research interests include algorithm/architecture codesign for heterogeneous manycore platforms, GPU computing, high-performance distributed frameworks using manycore accelerators, and computer vision applications. He received his Ph.D. in electrical engineering and computer science from Seoul National University in 2007.

Setting aside the debate about privacy, we cannot deny that CCTV has made a positive contribution to crime prevention. CCTV is everywhere, and more is coming: more and more video files are created, and they are getting bigger. It is difficult to handle these big video files with a single server, so let's try Hadoop! Hadoop is a big data framework that is easily distributed across a cluster environment. By default, Hadoop cannot utilize GPUs because it runs on the JVM, so we attached GPU code to Hadoop using the JNI (Java Native Interface) and introduce a system called QuickEye. QuickEye decodes large video files and performs face detection and recognition using CUDA.
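
As a hedged sketch of the JNI bridge pattern described (the Java class, method, and CUDA entry point names are hypothetical):

    #include <jni.h>

    /* Implemented in a separate .cu file (hypothetical entry point). */
    void detect_faces_cuda(const unsigned char *frame, int w, int h);

    JNIEXPORT void JNICALL
    Java_quickeye_FaceMapper_detectFaces(JNIEnv *env, jobject self,
                                         jbyteArray frame, jint w, jint h)
    {
        jbyte *buf = (*env)->GetByteArrayElements(env, frame, NULL);
        detect_faces_cuda((const unsigned char *)buf, w, h);
        /* JNI_ABORT: release without copying changes back to the JVM. */
        (*env)->ReleaseByteArrayElements(env, frame, buf, JNI_ABORT);
    }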

Level: Intermediate technical
Type: Poster
Tags: Video & Image Processing

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6312 - Parallel computation of the value set of frequency response for uncertain systems with GPGPU

Paluri Nataraj Professor, Indian Institute of Technology Bombay
Prof. Paluri S.V. Nataraj obtained his Ph.D. from IIT, Madras, India in process dynamics and control in 1987. He then worked in the CAD center at IIT Bombay for about one and a half years before joining the faculty of the systems and control engineering group at IIT Bombay in 1988. His current research interests are in the areas of High Performance computing (GPU), robust stability and control, nonlinear system analysis and control, and reliable computing.

We'll describe an efficient algorithm to compute the value set of any parametric uncertain system that can be modeled as an input-output transfer function. This value set of the frequency response can be obtained by calculating the magnitudes and phases of all possible uncertain transfer functions; in robotics and autonomous machines applications, these value sets are very useful for controller synthesis. The computations are independent of each other and can be executed in parallel. To show the effectiveness of the proposed method, we have chosen a 3-DOF longitudinal aircraft model with a large number of parameters.
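
A hedged sketch of the parallel structure (illustrative, not the authors' kernel): each thread evaluates the magnitude and phase of one sampled uncertain transfer function G(jw) = num(jw)/den(jw) at a fixed frequency, with per-sample polynomial coefficients stored row-wise:

    #include <cuComplex.h>

    __device__ cuFloatComplex poly_eval(const float *c, int order, float w)
    {
        // Horner evaluation of c[0] + c[1]*(jw) + ... + c[order]*(jw)^order
        cuFloatComplex jw  = make_cuFloatComplex(0.0f, w);
        cuFloatComplex acc = make_cuFloatComplex(c[order], 0.0f);
        for (int k = order - 1; k >= 0; --k)
            acc = cuCaddf(cuCmulf(acc, jw), make_cuFloatComplex(c[k], 0.0f));
        return acc;
    }

    __global__ void value_set(const float *num, int nOrd, const float *den,
                              int dOrd, int nSamples, float w,
                              float *mag, float *phase)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nSamples) return;
        cuFloatComplex G = cuCdivf(poly_eval(&num[i * (nOrd + 1)], nOrd, w),
                                   poly_eval(&den[i * (dOrd + 1)], dOrd, w));
        mag[i]   = cuCabsf(G);
        phase[i] = atan2f(cuCimagf(G), cuCrealf(G));
    }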

Level: Intermediate technical
Type: Poster
Tags: Robotics & Autonomous Machines; Algorithms; IoT

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6315 - Applying Deep Learning to Aerospace and Building System Applications at UTC

Vivek Venugopalan Sr. Research Scientist, United Technologies Research Center
Vivek Venugopalan is a Senior Research Scientist with United Technologies Research Center (UTRC). He works in the areas of hardware acceleration and reconfigurable platforms at UTRC for aerospace and building system applications.
Kishore Reddy Senior Research Scientist, United Technologies Research Center
Kishore is a Senior Research Scientist at United Technologies Research Center. His research interests are in Computer Vision and Deep Learning.

Deep Learning is an evolving area of research in machine learning, and it has been adopted by UTC for solving various problems in aerospace and building systems. The use cases highlighted include sensor diagnostics from onboard sensors on aircraft engines, and energy estimation and health monitoring of building systems. GPUs provide the computational horsepower to tackle the huge amount of data generated by these sensors. Existing methods for extracting relevant information have been largely replaced by Deep Learning techniques that map the problem to large neural networks.

Level: Intermediate technical
Type: Poster
Tags: Deep Learning & Artificial Intelligence; Big Data Analytics

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6316 - GPU Acceleration of Computational Electromagnetics Methods

Vivek Venugopalan Sr. Research Scientist, United Technologies Research Center
Vivek Venugopalan is a Senior Research Scientist with United Technologies Research Center (UTRC). He works in the areas of hardware acceleration and reconfigurable platforms at UTRC for aerospace and building system applications.

High fidelity prediction of the link budget between a pair of transmitting and receiving antennas in dense and complex environments is computationally very intensive at high frequencies. Iterative physical optics (IPO) is a scalable solution for electromagnetic (EM) simulations with complex geometry. An efficient and robust solution is presented to predict the link budget between antennas in a dense environment for two use cases: (1) multi-objective path planning during autonomous navigation and (2) modeling the propagation of Wi-Fi signals inside aircraft cabins. Two NVIDIA GPUs with different numbers of cores and amounts of device memory were targeted for benchmarking the performance of the IPO algorithm.

Level: Intermediate technical
Type: Poster
Tags: Robotics & Autonomous Machines; Aerospace & Defense; IoT

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6317 - GPU Implementation for Non-Cartesian Parallel MRI

Hassan Shahzad Ph.D. Student, COMSATS Institute of Information Technology, Islamabad
Hassan Shahzad is a Ph.D. student at the COMSATS Institute of Information Technology, Islamabad. His focus is MRI image reconstruction using the GPU. Hassan's research is to implement parallel MRI reconstruction algorithms on GPUs for Cartesian and non-Cartesian trajectories.

Non-Cartesian trajectories provide faster MRI acquisition and efficient coverage of k-space, but they are computationally more demanding as they require gridding and de-gridding operations. CG-SENSE, proposed by Pruessmann, reconstructs fully sampled MR images from undersampled radial and spiral trajectories. CG-SENSE is iterative, and it performs gridding and de-gridding operations in each iteration, which consumes most of the CPU time; however, these operations contain inherent parallelism. This work focuses on the implementation of the gridding and de-gridding operations on the GPU to reduce CG-SENSE reconstruction time while maintaining overall image quality. The results show that the GPU implementation is approximately 10 times faster than its CPU counterpart.
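
A hedged sketch of the gridding step's parallel structure (not the authors' code): one thread spreads one non-Cartesian sample onto nearby Cartesian cells, with atomic adds resolving collisions; a simple triangular window stands in for the Kaiser-Bessel kernel typically used:

    __global__ void grid_samples(const float2 *kpos, const float2 *kdata,
                                 int nsamp, int gridW, float *gridRe,
                                 float *gridIm, int halfWidth)
    {
        int s = blockIdx.x * blockDim.x + threadIdx.x;
        if (s >= nsamp) return;

        float kx = kpos[s].x, ky = kpos[s].y;  // sample position, grid units
        for (int dy = -halfWidth; dy <= halfWidth; ++dy)
            for (int dx = -halfWidth; dx <= halfWidth; ++dx) {
                int gx = (int)floorf(kx) + dx, gy = (int)floorf(ky) + dy;
                if (gx < 0 || gy < 0 || gx >= gridW || gy >= gridW) continue;
                float wx = fmaxf(0.f, 1.f - fabsf(gx - kx) / (halfWidth + 1.f));
                float wy = fmaxf(0.f, 1.f - fabsf(gy - ky) / (halfWidth + 1.f));
                float w  = wx * wy;
                atomicAdd(&gridRe[gy * gridW + gx], w * kdata[s].x);
                atomicAdd(&gridIm[gy * gridW + gx], w * kdata[s].y);
            }
    }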

Level: Intermediate technical
Type: Poster
Tags: Medical Imaging; Supercomputing & HPC

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6322 - GPU Accelerated Practical Structural Analysis Using the Boundary Element Method

Ahmed Torky Ph.D. Student, Cairo University
Ahmed Torky is a Ph.D. student at Cairo University (Egypt) working with a research group that focuses on civil structures analysis called CUFEBE (http://www.be4e.com/site/cufebe). His group has recently (since 2013) managed to integrate CUDA programming into their work in structure analysis. Ahmed completed his Master's thesis using CUDA. He is currently teaching and guiding 10 undergraduate and postgraduate students in GPU Computing at Cairo University. Ahmed also works at The British University in Egypt.

We present an implementation of GPU computing for a direct boundary element method (BEM) based on Reissner's plate bending theory. The plate bending problem involves several independent operations to obtain boundary and internal values, which are reprogrammed using CUDA to allow parallel processing. An academic and commercial software package (The PLPAK) uses BEM to solve for stress resultants and deflections in plates under bending (slabs, raft foundations, piled-raft foundations, etc.). The PLPAK's new GPU code, written in CUDA Fortran, alters three kernels to transform nested loops into parallel routines: the calculation of influence matrices, the solution of the system of linear equations, and the computation of internal point stress resultants and deflections. Practical examples are shown, and accuracy is maintained.

Level: Beginner technical
Type: Poster
Tags: Product & Building Design; Supercomputing & HPC

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6323 - Improving Automated Diabetic Retinopathy Detection with Deep Convolutional Neural Networks

Meindert Niemeijer CTO, IDx LLC
Meindert Niemeijer, PhD, currently works as CTO at IDx LLC, a medical device company. At IDx he provides technical and research leadership to the engineering and R&D teams. He is an internationally recognized retinal image analysis expert with fifteen years of experience researching and developing software and algorithms for the analysis of images of the eye. He received his PhD in medical image analysis at Utrecht University in the Netherlands in 2006. Niemeijer has published more than 45 papers and a book chapter, and holds several retinal imaging patents. He has been with IDx LLC since 2012.

Diabetic Retinopathy is the leading cause of blindness in the working population of the western world. Blindness due to diabetic retinopathy can be prevented, but many people with diabetes do not receive the necessary annual screening. IDx has developed an EU approved medical device, IDx-DR, that automates diabetic retinopathy screening and can lower both the barriers to high quality screening and the costs of healthcare. Our poster demonstrates the significant device performance gains that we were able to achieve using GPUs and deep learning techniques by comparing the latest, GPU based, version of the device to the previous, CPU based, version on 5,000 diabetic retinopathy screening exams.

Level: Intermediate technical
Type: Poster
Tags: Medical Imaging; Deep Learning & Artificial Intelligence

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6324 - Upscaling with Deep Convolutional Networks and Muxout Layers

Pablo Navarrete Michelini Principal Research Engineer, BOE Technology Group Co., Ltd.
Pablo Navarrete Michelini was born in Santiago, Chile. He received the B.Sc. in Physics (2001), B.Sc. in Electrical Engineering (2001) and the Electrical Engineer Degree (2002), from Universidad de Chile at Santiago. He received the Ph.D. degree in Electrical Engineering from Purdue University at West Lafayette, in 2008. Pablo worked as a research intern in CIMNE at Technical University of Catalonia in 2006, and as a visitor student research collaborator at Princeton University at Princeton, NJ, in 2006-2007. He worked as Assistant Professor in the Department of Electrical Engineering at Universidad de Chile, in 2008-2011. He worked as Senior Software Engineer on video compression algorithms in Yuvad Technologies Co., Ltd. at Beijing, China, in 2011-2014. Pablo has worked as a Principal Research Engineer at BOE Technology Group Co., Ltd. since 2014.

We consider the problem of super-resolution using convolutional networks. Previous work has shown the advantages of using convolutional networks to improve the quality of image upscaling. Unlike previous solutions, our method incorporates the image upsampling within the network structure. To achieve this goal, we propose a so-called Muxout layer that increases the size of image features by combining them in groups. The system structure is motivated by an interpretation of convolutional networks as adaptive filters and by classic interpolation theory. We use this interpretation to propose specialized initialization methods that are convenient for training deep structures. Our tests show state-of-the-art quality, high performance, and the ability for unsupervised learning of text images.

Level: Intermediate technical
Type: Poster
Tags: Deep Learning & Artificial Intelligence; Video & Image Processing

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6325 - Which Whale Is It, Anyway? Face Recognition for Right Whales Using Deep Learning

Robert Bogucki Chief Science Officer, deepsense.io
Robert is Chief Science Officer at deepsense.io, where he manages the R&D team and focuses on deep learning. He is also a successful Kaggle competitor. When tackling real-life problems, he particularly enjoys leveraging algorithms and computational power instead of, or in addition to, domain knowledge. His motivation to work in the IT industry is to take theoretical ideas and concepts and put them to good use.

With fewer than 500 North Atlantic right whales left in the world's oceans, knowing the health and status of each whale is integral to the efforts of researchers working to protect the species from extinction. To interest the data science community, NOAA Fisheries organized a competition hosted on Kaggle.com. The challenge was to automate the right whale recognition process - currently a painstaking, lengthy, manual process - using a dataset of aerial photographs of individual whales. In the poster, we outline the winning solution, which is based on deep learning and convolutional neural networks.

Level: Advanced technical
Type: Poster
Tags: Deep Learning & Artificial Intelligence; Computer Vision & Machine Vision

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6327 - Fine-Tune For A Fortune: Transfer Learning Using DIGITS and GPUs

Valeriu Codreanu HPC consultant, SURFsara
Valeriu Codreanu is currently working as an HPC consultant at SURFsara, the Dutch supercomputing center. Since 2015, Valeriu has been PI for the CUDA Research Center at SURFsara. Before joining the team in Amsterdam, Valeriu was a postdoctoral researcher for three years at both Eindhoven University of Technology and the University of Groningen, working on GPU computing, computer vision, and embedded systems. Valeriu received his PhD in electrical engineering from the Polytechnic University of Bucharest in 2011 with a thesis proposing efficient cooperation between multi-threaded and vector processors. His interests lie in the field of high-performance and energy-efficient machine learning systems.

Deep convolutional neural networks are widely accepted as the state-of-the-art solution for various computer vision problems. They commonly involve a trade-off between network complexity and over-fitting, addressable by increasing the number of training examples, which results in a lengthy training process; moreover, more training examples may not even be available. Recent research suggests that this hurdle can be surmounted by taking pre-trained complex networks and fine-tuning them to fit specific datasets. We show that this approach allows for record-breaking performance on tasks ranging from natural image classification to handwritten character recognition. This is made possible by using high-performance NVIDIA GPUs in conjunction with the NVIDIA DIGITS training system.

Level: Intermediate technical
Type: Poster
Tags: Deep Learning & Artificial Intelligence; Computer Vision & Machine Vision

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6334 - Advanced High-Productivity Framework for Large-Scale GPU/CPU Stencil Computations

Takashi Shimokawabe Assistant Professor, Tokyo Institute of Technology
Takashi Shimokawabe is currently an assistant professor at the Global Scientific Information and Computing Center (GSIC), Tokyo Institute of Technology (Tokyo Tech). Takashi is a member of the Aoki Laboratory in GSIC. His primary research interests are general-purpose computing on graphics processing units (GPGPU), computational fluid dynamics, and high performance computing. His group was awarded the 2011 Gordon Bell Prize Special Achievements in Scalability and Time-to-Solution for peta-scale phase-field simulations (T. Shimokawabe et al.) He received Ph.D. in Energy Science from Tokyo Tech in 2012. Takashi graduated with M.S. in Physics from Tokyo Tech in 2007.

A high-productivity framework for multi-GPU and multi-CPU computation of stencil applications is proposed. Our framework is implemented in C++ and CUDA. It automatically translates user-written stencil functions that update a grid point and generates both GPU and CPU code. Programmers write user code purely in C++, and it can be executed on multiple GPUs with an auto-tuning mechanism and an overlapping method that hides communication cost behind computation. The same code can also be executed on multiple CPUs with OpenMP without any change. In addition, our framework provides a data structure that supports element-wise computations, which allows us to write GPU kernel code inline. This poster presents our proposed framework and its performance evaluation.
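
A hedged sketch of the user-facing idea (the functor interface is invented for illustration, not the framework's actual API): the programmer writes a single point-update in C++ carrying both __host__ and __device__ qualifiers, and the framework instantiates it inside a CUDA kernel for GPUs or an OpenMP loop for CPUs:

    // A 2D diffusion point-update, usable from both CPU and GPU code paths.
    struct Diffusion2D {
        __host__ __device__
        void operator()(const float *in, float *out,
                        int i, int j, int nx, float c) const
        {
            int idx = j * nx + i;
            out[idx] = in[idx]
                     + c * (in[idx - 1] + in[idx + 1]
                          + in[idx - nx] + in[idx + nx] - 4.0f * in[idx]);
        }
    };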

Level: Intermediate technical
Type: Poster
Tags: Supercomputing & HPC; Computational Physics

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6336 - Oblique-View Computed Tomography for 3D IC Package Inspection Using CUDA

Kyung-Chan Jin Principal Researcher, Korea Institute of Industrial Technology
Kyung-Chan Jin is a principal researcher at the Korea Institute of Industrial Technology.

This study focused on a CUDA implementation of an oblique-view CT (computed tomography) technique for non-destructive internal inspection of 3D IC chips. Using 400 images projected from a rotating phantom in an oblique direction, we achieved 16 GUPS in reconstructing a 512x512x512 phantom volume on an NVIDIA Quadro K6000 GPU, showing that the GPU performed 100 times faster than dual CPU processors in the CT reconstruction.

Level: Intermediate technical
Type: Poster
Tags: Computer Vision & Machine Vision; Medical Imaging

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6338 - AceCAST - High Performance CUDA Based Weather Research Forecasting (WRF) Model

Allen Huang CTO, Tempo Quest Inc.
Dr. Allen Huang holds a doctorate in atmospheric science from the University of Wisconsin, where he is now a Distinguished Scientist at the Space Science and Engineering Center (SSEC). Allen has over 25 years of experience writing algorithms for US government satellite weather projects. He is one of the world's experts in accelerating regional weather forecast processing times using NVIDIA's CUDA language and GPU accelerator chip sets, and has authored 74 peer-reviewed papers on this subject. Allen was one of the founding scientists and Chief Information Officer of GeoMetWatch and is currently the CEO of Hyper Sensing LLC and CTO of Tempo Quest Inc.

AceCAST is a proprietary version of WRF, a mesoscale and global weather research and forecasting model designed for both operational forecasters and atmospheric researchers, and widely used by commercial, government, and institutional users in more than 150 countries. WRF is suitable for a broad spectrum of applications across domain scales ranging from meters to hundreds of kilometers. AceCAST's increased computational power enables time-critical, weather-sensitive industry and commerce to achieve (1) high-resolution accuracy and cost performance, (2) the strong scaling they need, and (3) greatly improved profits. AceCAST is already one-third complete, and the first commercial product is only ~12 months away.

Level: Intermediate technical
Type: Poster
Tags: Earth System Modelling; Supercomputing & HPC

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6339 - A GPU-Accelerated Statistical Method to Identify Differential Genetic Dependencies

Gil Speyer Senior Postdoctoral Fellow, The Translational Genomics Research Institute
Gil Speyer is a senior postdoctoral fellow at the Translational Genomics Research Institute. He received his Ph.D. in electrical engineering from Arizona State University.

We have developed a GPU implementation of a statistical method to identify gene sets enriched with condition-specific genetic dependencies. The statistical rigor of the method incurs a substantial computational burden, motivating this effort. Starting with pairwise comparisons between each set of nodes in the network, edges across the distribution of networks are determined in parallel. After network information has been condensed, a unique list of networks is determined, and the computation is then decomposed across all unique network nodes to compute the divergence. Initial implementation showed more than two orders of magnitude acceleration.

Level: Intermediate technical
Type: Poster
Tags: Computational Biology

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6340 - Cholesky Factorization on Batches of Matrices with Fixed and Variable Sizes

Ahmad Abdelfattah Postdoctoral Research Associate, Innovative Computing Laboratory, University of Tennessee
Ahmad Abdelfattah is a postdoctoral research associate at the Innovative Computing Laboratory (ICL), University of Tennessee. He received his Ph.D. in computer science from King Abdullah University of Science and Technology (KAUST) in 2015, where he was a member of the Extreme Computing Research Center (ECRC) under Prof. David Keyes. Ahmad is interested in high performance numerical linear algebra on GPUs and emerging architectures, including both dense and sparse problems. He has been acknowledged by NVIDIA for his contributions to its BLAS library (cuBLAS). Ahmad holds B.S. and M.S. degrees in computer engineering from Ain Shams University, Cairo, Egypt.

This work presents a high performance solution for Cholesky factorization on batches of relatively small matrices. We discuss both fixed-size and variable-size batched problems. In order to handle the irregularity associated with this type of workload, we present new optimization techniques that maintain relatively high performance on such small matrix sizes. The proposed solution outperforms most of the existing state-of-the-art techniques that can solve batched problems.
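
For contrast, a hedged baseline sketch (deliberately naive, not the optimized techniques the poster presents): an unblocked Cholesky with one thread per matrix, workable only for very small fixed-size batches stored contiguously in row-major order:

    __global__ void potrf_batched_naive(double *A, int n, int batch)
    {
        int m = blockIdx.x * blockDim.x + threadIdx.x;
        if (m >= batch) return;
        double *a = A + (size_t)m * n * n;   // this thread's matrix

        for (int k = 0; k < n; ++k) {
            double d = sqrt(a[k * n + k]);
            a[k * n + k] = d;
            for (int i = k + 1; i < n; ++i) a[i * n + k] /= d;
            for (int j = k + 1; j < n; ++j)
                for (int i = j; i < n; ++i)
                    a[i * n + j] -= a[i * n + k] * a[j * n + k];
        }
    }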

Level: Intermediate technical
Type: Poster
Tags: Performance Optimization; Supercomputing & HPC; Algorithms

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6341 - Accelerating Java Applications Using GPGPUs

James Clarkson PhD Student, University of Manchester
James Clarkson is a third-year Ph.D. student at the University of Manchester in the UK. He is a member of the Advanced Processor Technologies (APT) group under the supervision of Dr. Mikel Lujan and is actively researching ways to make hardware accelerators more programmable. Prior to starting his Ph.D., James worked for ARM on the EU-funded Mont Blanc project.

Over the last few years, we have been researching ways to exploit features of managed languages, such as Java, to simplify programming GPGPUs; we'll present our state-of-the-art prototype: Tornado. Tornado is a framework that allows Java programmers to write GPU-accelerated applications in 100% pure Java. It employs a task-based programming model, which makes it simple to compose complex processing pipelines that can execute multiple kernels across multiple GPGPUs. A key outcome of Tornado is that, with minimal refactoring, it is possible to port an application onto a GPGPU. We'll demonstrate a real-time computer vision application, ported from CUDA into Java, that reconstructs a 3D scene from a stream of RGB-D data.

Level: Beginner technical
Type: Poster
Tags: Programming Languages; Computer Vision & Machine Vision

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6342 - A High-Precision Power Model for the Tegra K1 CPU, GPU and RAM

Kristoffer Robin Stokke Ph.D. Candidate, University of Oslo
Kristoffer Robin received his bachelor’s degree in electrical engineering from the Oslo University College in 2009, and his master’s degree in Informatics from the University of Oslo in 2012. Currently working towards the PhD degree at the University of Oslo, he is affiliated with the MPG research group in Simula Research Laboratory. His primary research interest is in energy efficiency of heterogeneous multicore architectures such as the Tegra SoCs and how software can contribute to low-power operation. His recent research activities focus on high-precision energy modelling and optimisation of multimedia workloads for different processors, as well as evaluation of modern energy management in mobile devices.

This poster session accompanies our talk on high-precision power modelling for the Tegra K1's GPU, CPU clusters, and memory. Power modelling is necessary to understand how software consumes energy and to optimise for power. However, existing power models are typically very coarse-grained and can mispredict by up to 70% depending on hardware state. This poster introduces our high-precision power model which, by taking into account operating frequencies, clock- and core-gating, rail voltages, and hardware utilisation, is shown to be over 98% accurate for GPU and CPU video processing workloads. The model is not only able to predict power usage very accurately, but can also be used to analyse and optimise the power usage of applications, for example by utilising non-coherent GPU caches.
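
A generic form of such a model, in textbook CMOS terms for orientation (the poster's actual model and fitted coefficients are more detailed):

    P_{total} = \sum_r \left( P_{static,r}(V_r) + \alpha_r C_r V_r^2 f_r u_r \right),

summed over power rails r: static leakage depends on the rail voltage, while the dynamic term scales with switched capacitance, voltage squared, clock frequency, and utilisation, and clock- or core-gating zeroes out individual terms.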

Level: Intermediate technical
Type: Poster
Tags: Embedded; Energy Exploration; IoT

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6343 - GPU Boosted Deep Learning in Real-time Face Alignment

Binglong Xie Chief Architect, HiScene Information Technology Co,.Ltd
Dr. Binglong Xie is chief architect at HiScene, a leading Chinese image recognition and augmented reality technology provider. Before joining HiScene, he was Senior Staff Engineer at Qualcomm, leading heterogeneous acceleration of computer vision algorithms and applications on Snapdragon mobile platforms. Prior to that, he worked at Siemens Corporate Research on research and development in industrial inspection for Siemens Energy and other computer vision projects for Siemens business units. He received Ph.D. from Lehigh University in Electrical Engineering.

For the task of real-time face alignment, we employ a GPU server cluster to train a convolutional neural network. By taking advantage of both deep learning and GPU computing technologies, our algorithm outperforms all existing algorithms on the popularly tested IBUG benchmark. In a photo editing application, our face alignment algorithm is integrated to locate precise facial key points, which provide the basis for further virtual facial makeup. Details of our algorithm are given in our poster, along with experimental results on the public benchmark.

Level: Intermediate technical
Type: Poster
Tags: Deep Learning & Artificial Intelligence; Computer Vision & Machine Vision; Algorithms

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6344 - Large Time Series Single-Molecule Tracking Including Defocus and Motion Blur Control

Xu Xiaochun Research Associate, National University of Singapore
Xu Xiaochun is currently a Research Associate at the Mechanobiology Institute (MBI), Singapore, one of the Research Centres of Excellence at the National University of Singapore (NUS), whose mission is to develop a new paradigm of biomedical research by focusing on the quantitative and systematic understanding of dynamic functional processes. Dr. Xu received his Bachelor and Master degrees in Biology from Huazhong University of Science and Technology (HUST), China, in 2008 and 2011, respectively. His research interests lie in the fields of bio-photonics, bioinformatics, and biomedical systems, including X-ray cone-beam microtomography, single-molecule tracking, cubic membranes, and ophthalmic devices. He has published several journal papers in these areas.

We'll present an operational tracking implementation for multi-channel microscopy time series from hundreds to tens of thousands of frames, depicting the dim traces of single fluorescent molecules moving over time. The characteristic shape of an optical point source is used to localize and trace thousands of molecules fast, accurately, and reliably over a timespan of several minutes.

Level: Beginner technical
Type: Poster
Tags: Video & Image Processing; Computational Biology

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6345 - A Rasterization Based Line Segment Intersection Algorithm for Urban Mobility Simulations

Benjamin Hernandez Computer Scientist, Oak Ridge National Laboratory
Dr. Benjamin Hernandez is a member of the recently created Advanced Data and Workflows group at NCCS in Oak Ridge National Laboratory. His previous appointments include the Barcelona Supercomputing Center and Tecnológico de Monterrey, campus Ciudad de Mexico, where he was PI of the CUDA Teaching Center. His research interests are in the intersection of HPC, GPUs, interactive computer graphics, HCI, crowd and traffic simulation, analysis, and visualization.

Road network data is an important component used to model city mobility. However, when using volunteered geographic information, such as OpenStreetMap, road intersections are often incomplete or invalid. A line segment intersection algorithm can correct this issue; however, the naive algorithm has O(N^2) complexity, and one of the best solutions, the Bentley-Ottmann algorithm, O((N+K) log N), where K is the number of intersections. We propose a GPGPU alternative that uses OpenGL 4 rasterization, per-pixel linked lists, and almost-zero-driver-overhead functions. Results show our method offers a speedup of 87x over these algorithms.
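
As a hedged sketch of the per-pair primitive underlying all of these methods (not the rasterization pipeline itself), a standard orientation-based test detects a proper crossing of two segments:

    __host__ __device__ float orient(float ax, float ay, float bx, float by,
                                     float cx, float cy)
    {
        // Twice the signed area of triangle (a, b, c).
        return (bx - ax) * (cy - ay) - (by - ay) * (cx - ax);
    }

    __host__ __device__ bool segments_intersect(float ax, float ay,
                                                float bx, float by,
                                                float cx, float cy,
                                                float dx, float dy)
    {
        float o1 = orient(ax, ay, bx, by, cx, cy);
        float o2 = orient(ax, ay, bx, by, dx, dy);
        float o3 = orient(cx, cy, dx, dy, ax, ay);
        float o4 = orient(cx, cy, dx, dy, bx, by);
        // Proper crossings only; collinear or touching cases need extra care.
        return (o1 * o2 < 0.0f) && (o3 * o4 < 0.0f);
    }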

Level: Intermediate technical
Type: Poster
Tags: Algorithms; Performance Optimization

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6346 - Testing Fine-Grained Parallelism for the ADMM on a Factor-Graph

Jose Bento Professor, Boston College
José Bento completed his Ph.D. in Electrical Engineering at Stanford University where he worked with Professor Andrea Montanari on statistical inference and structural learning of graphical models. After his PhD, he moved to Disney Research, Boston lab, where he worked with Dr. Jonathan Yedidia on algorithms for distributed optimization, robotics and computer vision. José is now with the Computer Science department at Boston College. His current research lies at the intersection of distributed algorithms and machine learning. In 2011 he received the SIGWEB DocEng Best paper award and won the RecSys-CAMRa2011 Challenge on context-aware recommendation systems. In 2014 he received a Disney Inventor Award for his work on distributed optimization.

You'll learn how to use the popular ADMM (Alternating Direction Method of Multipliers) to perform optimization and how to accelerate it using GPUs in a way that avoids writing any parallel code. More specifically, you'll learn (1) how the ADMM works and how it reduces an optimization problem to an iterative scheme on a graph, (2) which computations in this scheme can be parallelized, (3) where automatic parallelism enters the picture, (4) when to expect speedups, and (5) speedup values in three different applications: combinatorial optimization, machine learning, and optimal control. Finally, you'll learn about parADMM, a tool we built to let you quickly prototype your own optimization solvers without having to write application-specific parallel code.
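
For orientation, the scaled-form ADMM iterations for minimizing f(x) + g(z) subject to Ax + Bz = c are (standard textbook form, matching point (1) above):

    x^{k+1} = \arg\min_x \; f(x) + (\rho/2)\,\|Ax + Bz^k - c + u^k\|_2^2
    z^{k+1} = \arg\min_z \; g(z) + (\rho/2)\,\|Ax^{k+1} + Bz - c + u^k\|_2^2
    u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c

On a factor graph, the x- and z-updates split into independent per-node subproblems, which is where the parallelism in point (2) comes from.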

Level: Intermediate technical
Type: Poster
Tags: Supercomputing & HPC; Algorithms

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6350 - Using CUDA® to Accelerate an Adaptive Game Controller

Leonardo Torok Ph.D. Student, Federal Fluminense University
Leonardo Torok is a Ph.D. student in computer science at Federal Fluminense University. He is researching adaptive interfaces, CUDA, game controllers, user experience, and multimodal signals.

The adaptive game controller is a novel concept that aims to introduce new ways to interact with video games, allowing developers to design, through a simple API provided by our solution, the joystick used to play. A key component is the K-means algorithm, used to fine-tune the controller interface according to the user's touches during the gameplay session. Currently, our controller is implemented in Java for the Android operating system, and the machine learning routines are executed by the mobile device's CPU, limiting the number of touches that can be considered. With CUDA, we will introduce the next evolutionary step for our controller, increasing the number of points evaluated to include all touches, in real time, using the GPU available on the computer that is running the game.
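
A hedged sketch of the K-means assignment step the abstract refers to (illustrative, not the controller's code): one thread per touch point finds its nearest centroid in 2D screen coordinates:

    __global__ void assign_touches(const float2 *touch, int nTouches,
                                   const float2 *centroid, int k, int *label)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= nTouches) return;

        float best = 1e30f;
        int bestC = 0;
        for (int c = 0; c < k; ++c) {
            float dx = touch[t].x - centroid[c].x;
            float dy = touch[t].y - centroid[c].y;
            float d  = dx * dx + dy * dy;
            if (d < best) { best = d; bestC = c; }
        }
        label[t] = bestC;   // the centroid update follows as a separate step
    }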

Level: Beginner technical
Type: Poster
Tags: Game Development; Deep Learning & Artificial Intelligence

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6351 - GPU-Accelerated Molecular Dynamics Simulations for Systems with Lennard-Jones Type Potential

Jose Maria Zamora Developer, Lufac Computación S.A. de C.V.
Jose Maria Zamora is the manager of research and development projects with Lufac Computacion S.A. de C.V. He graduated as an electronics engineer from the Universidad Autonoma Metropolitana. Later he obtained an M.S. at the Instituto de Investigaciones en Matematicas Aplicadas y en Sistemas (IIMAS-UNAM). He has conducted several research assignments, including one at the University of Notre Dame. In the last 8 years, he has worked as a programmer, a manager, and software designer for different companies and government institutions, highlighting their scientific interests in the area of molecular dynamics.

This work shows an implementation of a basic algorithm to study molecular systems with interactions of the Lennard-Jones type potential. We present a parallelization strategy using CUDA to accelerate the computations on a GPU. Reviewing the results of simulations with a large number of particles (about 1 million), different equilibrium states are observed depending on the initial arrangement of particles. The cause is that the initial arrangements have different values of total energy due to pressure differences (a microscopic system variable) arising from the initial geometric configuration of particles in the cubic simulation box. These differences are accentuated when the number of particles exceeds 10^5.
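
For reference, the pair potential in question has the standard form

    V(r) = 4\epsilon \left[ (\sigma/r)^{12} - (\sigma/r)^{6} \right],

where \epsilon sets the depth of the attractive well and \sigma the distance at which the potential crosses zero; the per-particle accumulation of the resulting forces over all neighbors is the part that parallelizes naturally on the GPU.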

Level: Beginner technical
Type: Poster
Tags: Algorithms; Computational Physics

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6352 - Using CLANG/LLVM Vectorization to Generate Mixed Precision Source Code

Regis Portalez Research Engineer, Altimesh
Regis Portalez studied mathematics in France at the École Polytechnique and Orsay University. He worked for a while at Spacegoo doing 3D web development, then for three years at Microsoft, composing music at microsoft.com. Regis finally joined Altimesh last year to help support C++. His interests include mathematics and writing programs that generate other programs.

At Supercomputing 2015, NVIDIA announced the Jetson TX1. This platform is the first available to natively expose mixed precision instructions. However, this instruction set requires that operations on 16-bit floating-point values be done in pairs, requiring use of the half2 type, which packs two values in a single register. We'll present an approach that uses an existing vectorization tool developed for CPU code to generate CUDA source code that uses half2 intrinsic functions, hence enabling mixed precision hardware usage. Using this approach, we're able to generate efficient CUDA code from a single scalar version of the code. This approach also showed very nice side benefits such as better memory access patterns and instruction-level parallelism.
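
A hedged illustration of the target idiom (hand-written here; the poster's point is generating such code automatically): a saxpy over packed half2 values, processing two fp16 elements per instruction, which requires cuda_fp16.h and sm_53 or newer (e.g., Jetson TX1):

    #include <cuda_fp16.h>

    __global__ void saxpy_half2(int n2, __half2 a,
                                const __half2 *x, __half2 *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n2)
            y[i] = __hfma2(a, x[i], y[i]);  // two fused multiply-adds at once
    }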

Level: Advanced technical
Type: Poster
Tags: Performance Optimization; Programming Languages

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6353 - Patient-Specific Hyperelastic Biomechanical Models for Clinical DIR Confidence Quantification

John Neylon Graduate Student Researcher, UCLA Radiation Oncology
John Neylon's research is focused on advancing and facilitating adaptive radiation therapy based cancer treatments to improve patient outcome and quality of life. He is developing a framework utilizing image registration and predictive biomechanical models for regression tracking, dose estimation, and extrapolation. Accelerating these tasks with GPUs will allow seamless integration into existing clinical workflows and provide physicians with a wealth of information for optimizing treatments to the patient's daily anatomy.

The accuracy of clinical multi-modal deformable image registration (DIR) is difficult to quantify. A framework was previously developed to validate DIR algorithms by generating patient-specific, GPU-based biomechanical models from head-and-neck (HN) patient CT scans and creating clinically realistic ground truth deformations [1]. We now aim to expand the model's applicability to quantify DIR confidence for clinical registrations between the planning CT and daily positioning images.

Level: Beginner technical
Type: Poster
Tags: Medical Imaging

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6354 - Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks

Yu-Hsin Chen PhD Student, MIT
Yu-Hsin Chen is from Taipei, Taiwan. He received the B.S. degree in Electrical Engineering from National Taiwan University in 2009, and the M.S. degree in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology in 2013. He is currently pursuing a Ph.D. with Prof. Vivienne Sze. His research interests include VLSI system design, computer vision, and digital signal processing. His current research focuses on architecture design for deep learning accelerators and machine vision processors.

Eyeriss is an energy-efficient deep convolutional neural network (CNN) accelerator that supports state-of-the-art CNNs, which have many layers, millions of filter weights, and varying shapes (filter sizes, number of filters and channels). The test chip features a spatial array of 168 processing elements (PE) fed by a reconfigurable multicast on-chip network that handles many shapes and minimizes data movement by exploiting data reuse. Data gating and compression are used to reduce energy consumption. The chip has been fully integrated with the Caffe deep learning framework. The chip can run the convolutions in AlexNet at 35 fps with 278 mW power consumption.

Level: Advanced technical
Type: Poster
Tags: Deep Learning & Artificial Intelligence; Embedded

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6355 - Energy-Efficient Fine-Grained DVFS

Ben Keller Graduate Student Researcher, University of California, Berkeley
Ben Keller (S’12) received the B.S. degree in engineering from Harvey Mudd College in 2010, and the M.S. degree in electrical engineering from the University of California, Berkeley in 2015. Since 2012, he has been pursuing the Ph.D. degree at the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley. In 2014, he completed an internship with NVIDIA Research. His research interests include energy-efficient microprocessor design, fine-grained DVFS, and innovative digital hardware design paradigms.

Fine-grained dynamic voltage and frequency scaling is a key technology for energy-efficient processors. This work demonstrates a RISC-V vector microprocessor implemented in 28nm FDSOI with fully-integrated simultaneous-switching switched-capacitor DC-DC converters and adaptive clocking. The converters achieve high efficiency at the system level by switching simultaneously to avoid charge sharing losses and by using an adaptive clock to maximize performance for the resulting voltage ripple. This system pushes the capabilities of dynamic voltage scaling by demonstrating fast transitions, simple packaging, and high energy efficiency.

Level: Advanced technical
Type: Poster
Tags: Performance Optimization

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

P6356 - Neural Attention for Object Tracking

Brian Cheung PhD Student, UC Berkeley
Brian Cheung is a Ph.D. Student at UC Berkeley working with Professor Bruno Olshausen at the Redwood Center for Theoretical Neuroscience. His research interests lie at the intersection between machine learning and neuroscience. Drawing inspiration from these fields, he hopes to create systems which can solve complex vision tasks such as navigation and planning.

With differentiable forms of attention being integrated into neural networks, end-to-end training with backpropagation is possible. We adopt the recently proposed attention mechanism in Spatial Transformer Networks (STNs) into a recurrent architecture to perform object tracking. We also present several issues which arise when such recurrent attention models are scaled up to much larger and more complex images/videos. We present pretraining strategies to resolve some of these training issues.

Level: Intermediate technical
Type: Poster
Tags: Computer Vision & Machine Vision; Deep Learning & Artificial Intelligence

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Grand Ballroom

Poster

SPECIAL EVENT

Presentation
Details

SE6138 - Posters & Beer Reception

See the GTC 2016 posters and meet their presenters over a beer on the concourse. These research posters describe the latest GPU-enabled research, exciting new projects, and encouraging preliminary results.

Level: All technical
Type: Special Event
Tags: Public Event; Press-Suggested Sessions: General Interest

Day: Monday, 04/04
Time: 17:00 - 19:00
Location: Room 210H

SE6130 - Dinner with Strangers

Engage in lively conversation with other GTC attendees over a self-hosted meal in some of the best restaurants in Silicon Valley. GTC organizers have pre-reserved tables for small groups for Monday, Tuesday and Wednesday nights in restaurants across a variety of cuisines and price points. Stop by the sign-up board located on the concourse. Space is available on a first come, first served basis.

Level: All technical
Type: Special Event
Tags: Public Event

Day: Monday, 04/04
Time: 20:00 - 22:00
Location: Concourse

Special Event

KEYNOTE

Presentation
Details

S6699 - Opening Keynote

Jen-Hsun Huang CEO & Co-Founder, NVIDIA
Jen-Hsun Huang co-founded NVIDIA in 1993 and has served since its inception as president, chief executive officer and a member of the board of directors. Under his leadership, NVIDIA invented the graphics processing unit (GPU) in 1999. Since then, it has consistently set new standards in visual computing with breathtaking, interactive graphics available on devices ranging from smartphones and tablets to notebooks and workstations. NVIDIA's expertise in programmable GPUs has led to breakthroughs in parallel processing that make supercomputing inexpensive and widely accessible. The company holds more than 7,000 U.S. patents granted or pending, including ones covering designs and insights fundamental to modern computing. Huang is a recipient of the Dr. Morris Chang Exemplary Leadership Award from the Global Semiconductor Association in recognition of his exceptional contributions to driving the development, innovation, growth and long-term opportunities of the fabless semiconductor industry. He has received the Daniel J. Epstein Engineering Management Award from the University of Southern California and an honorary doctorate from Oregon State University. He was named to the U.S. Immigrant Entrepreneur Hall of Fame when it was established in 2012. Prior to founding NVIDIA, Huang worked at LSI Logic and Advanced Micro Devices. He holds a BSEE degree from Oregon State University and an MSEE degree from Stanford University.

Don't miss GTC's opening keynote address from NVIDIA CEO and co-founder Jen-Hsun Huang.

Level: All technical
Type: Keynote
Tags: Deep Learning & Artificial Intelligence; Self-Driving Cars & Automotive ; Robotics & Autonomous Machines; General Interest; Press-Suggested Sessions: General Interest

Day: Tuesday, 04/05
Time: 09:00 - 11:00
Location: Hall 3

Keynote

SUMMIT TALK

Presentation
Details

OP178 - The OpenPOWER ISV Community is Awesome!

Randall Ross Ubuntu Community Manager, Canonical
Ubuntu Community Manager

Level:
Type: Summit Talk
Tags: OpenPower

Day: Tuesday, 04/05
Time: 09:45 - 10:00
Location: Room 220C

Summit Talk

THEATER TALK

Presentation
Details

OP128 - A Methodology for Ensuring Architecture Compliance with the POWER Architecture

Laurent Fournier Formal Verification Group, IBM
Laurent Fournier joined the IBM Research lab in Haifa in 1990 after completing his M.Sc. in computer sciences at the Technion Institute in Israel. He has worked in hardware verification throughout his career, first as a technical leader and then in various management positions, leading the development of new technologies and tools spanning test generation, functional coverage, floating-point verification, address translation, architecture compliance, and formal verification. At the IBM Haifa Research Lab, Laurent today manages the HW Verification Technologies department of around 50 researchers, which focuses on developing advanced technologies and solutions in the area of hardware verification.

As part of the OpenPOWER ecosystem, there is a need to define architectural compliance criteria to ensure the compliance of any new design built on the Power architecture. Drawing on many years of experience in hardware verification, our group at IBM Research has developed a methodology for testing architectural compliance that attempts to catch every type of potential misinterpretation of the architecture. We will present this methodology, show how it fundamentally differs from a hardware verification methodology, and demonstrate how it leads to the specification of a compliance testing suite for the Power architecture.

Level:
Type: Theater Talk
Tags: OpenPower

Day: Tuesday, 04/05
Time: 11:00 - 11:15
Location: OpenPOWER Booth

Theater Talk

SPECIAL EVENT

Presentation
Details

SE6128 - Exhibits & Networking Lunch

Grab your lunch and network in the exhibit hall where companies and institutions will be demonstrating emerging technologies as well as some of the most innovative solutions available today.

Level: All technical
Type: Special Event
Tags: Public Event; Press-Suggested Sessions: General Interest

Day: Tuesday, 04/05
Time: 11:00 - 13:00
Location: Hall 1-2

Special Event

THEATER TALK

Presentation
Details

OP168 - Power of Open (Source) BMC

Kenneth Wilke Software Developer, IBM
Kenneth Wilke is a software developer focused on Rackspace's Barreleye OpenPOWER server; hacking on CAPI, open source firmware and open source hardware to accelerate cost-performance improvements in the data center beyond the death of Moore's law.

The software stacks powering Baseboard Management Controllers (BMCs) are traditionally pre-packaged closed source projects. In this talk, we share what benefits we've seen from running an open source BMC. We discuss how collaboration with the OpenBMC community makes it easier to catch issues, fix bugs and get new features implemented.

Level:
Type: Theater Talk
Tags: OpenPower

Day: Tuesday, 04/05
Time: 11:15 - 11:30
Location: OpenPOWER Booth

OP131 - Open Your Big Data Solutions to Gain Insights with IBM

Christophe Menichetti Information IT Specialist, IBM
Christophe Menichetti is an Information IT Specialist based at the IBM Customer Center in Montpellier, France. His role is to support IBM sales and business partners in selling and designing the most appropriate IBM Big Data and Business Analytics solutions for worldwide customers. Previously, he was part of the European IBM/Oracle Joint Solutions Center (2005-2010), where five Oracle and five IBM specialists worked closely together across brands and industries; his main activity was helping Oracle and IBM sales teams sell and define the synergy between Oracle applications and IBM infrastructure solutions for joint Oracle/IBM customers around the world (including sizing, architecture design, education, and benchmark support). Before that, he worked on Oracle Hyperion Brio products in the Customer Scheduling Support Center (2004-2005), building and maintaining reusable business performance reporting to manage customer order scheduling across the world. His key domains of expertise include IBM systems, Oracle applications (Siebel CRM, Oracle BI), Oracle Database, business intelligence (Oracle BI, Hyperion), IBM Business Analytics solutions (Cognos, DataStage, Netezza), and the IBM Big Data stack (BigInsights, Streams), plus the competitive landscape. He wrote materials such as the "Siebel 8.1 on AIX 6.1" and "OBI EE 10gR3 on AIX 5.3" cookbooks, led many Oracle/IBM events, including the first worldwide "Siebel with Oracle 11gR2 RAC on IBM Power Systems" education workshop, and has presented at many IBM STG universities. Christophe is an engineer in computer sciences and an IBM certified IT specialist. He is also a university professor teaching ERP/CRM, business intelligence, and Power virtualization.

Big Data and analytics solutions have transformed the meaning and the value of "information" over the years. To provide solid IT infrastructure foundations for this transformation, IBM has designed new POWER8 systems that access and process data faster than any other platform. Furthermore, IBM also works with OpenPOWER higher-level solutions optimized on POWER8 to expand the end-to-end value to the customer. This session will highlight these solutions and show how IBM POWER8 and OpenPOWER change clients' approach to analytics.

Level:
Type: Theater Talk
Tags: OpenPower

Day: Tuesday, 04/05
Time: 11:45 - 12:00
Location: OpenPOWER Booth

Theater Talk

HANGOUT

Presentation
Details

H6132 - Hangout: Dreaming Big: Scaling Up Deep Dream to Operate on Multi-Hundred Megapixel Images

Daniel Ambrosi Artist | Photographer | CEO, Bottled Light Productions LLC
Daniel Ambrosi is based in Half Moon Bay, California. Dan has been experimenting with novel methods of visual presentation since graduating from Cornell University in 1984 (Bachelor of Architecture & Master of Science in 3D Graphics; Cornell National Scholar & Eschweiler Prize Recipient). He began his career in architecture and transitioned to high tech marketing in 1994. He has been an independent marketing consultant since 2003 and a published artist since 2011.
Joseph Smarr Senior Staff Software Engineer, Google
Joseph Smarr is a software engineer at Google, currently working on Google Photos (co-creator of its stories feature). Previously, Joseph was a founding technical lead of Google+, focused on circles and sharing. Before that, he was Plaxo's Chief Technology Officer, where he led their initiative to open up the social web, starting with co-authoring the Bill of Rights for Users of the Social Web in 2007. He has served on the Board of Directors of the OpenID Foundation and OpenSocial Foundation. A frequent speaker and community participant in the social networking and web development communities, Joseph has built web applications for many years. Joseph has a BS and MS from Stanford University in Artificial Intelligence. His website is josephsmarr.com, or just Google him!
Chris Lamb Sr. Director of GPU Computing Software, NVIDIA
Chris is Senior Director of GPU Computing Software at NVIDIA where he's been guiding the ascent of the revolutionary CUDA GPU computing platform since 2007 and the CUDA 1.0 release. During that time he has had the pleasure to work on some of the most exciting areas of modern computing from the world's largest HPC and big data systems and algorithms, new parallel programming models, smart devices and self-driving cars, and the explosion in modern AI. Over his career he's worked in diverse areas beyond parallel computing from many-core computer architecture, compiler engineering, embedded systems, microwave communication, networking, and data-driven website design. Chris has a BS in Computer Engineering from the University of Illinois at Urbana-Champaign.

Learn how two experienced engineers from Google and NVIDIA collaborated with an artist who specializes in computational photography to enable an engaging new art form colloquially termed "dreamscapes." These immersive, vibrant, highly-detailed panoramic landscapes conceal a stunning degree of richly detailed and wholly unexpected content that is only revealed upon close-up viewing. Understand how a modified version of Deep Dream became the fourth--and most compute-intensive--tool in the artist's computational pipeline, performing as many as 90 quadrillion floating point operations in the process of transforming the artist's source images. Examine the compelling results of this fruitful collaboration, which will be on display at the conference as a series of 8' high backlit images.

Level: All technical
Type: Hangout
Tags: Deep Learning & Artificial Intelligence

Day: Tuesday, 04/05
Time: 12:00 - 13:00
Location: Pod A

H6134 - Hangout: Meet the Architects

Jeff Weiss GRID Solution Architect, NVIDIA
Jeff is the GRID Solutions Architect Manager for North America working with the Solution Architecture & Engineering team at NVIDIA. Prior to joining NVIDIA, Jeff's pedigree includes a 7 year stint at VMware as an EUC Staff Engineer, as well as time at Symantec and Sun Microsystems. Along with his current focus on NVIDIA GRID vGPU enabled end user computing, his experience includes datacenter business continuity/disaster recovery solutions, software infrastructure identity management and email security/archiving tools.

Speak with NVIDIA engineers and architects who can answer your datacenter questions. This is the best place to get your visualization queries answered, from user level to developer level. Learn how to achieve GPU-accelerated graphics while maintaining security, and get your questions answered on the best methods of implementing NVIDIA GRID for your enterprise.

Level: All technical
Type: Hangout
Tags: Graphics Virtualization

Day: Tuesday, 04/05
Time: 12:00 - 13:00
Location: Pod B

H6159 - Hangout: Algorithms and Numerical Techniques

Stephane Chauveau Senior Developer Technology Engineer, NVIDIA
Stéphane has been a DevTech Engineer at NVIDIA for the last year, and was an HPC/compilation engineer for the previous 15 years. His main activity at NVIDIA is porting and tuning HPC applications for GPUs. He is also a member of the OpenACC Technical Committee.
Vinay Deshpande TBA, NVIDIA
TBA

Stop by and have a chat with GPU programming experts about GPU-accelerated algorithms and numerical techniques. Hangouts provide an opportunity for you to ask topic experts specific questions. Come on in, find a person wearing an "Ask Me" button, and ask a question!

Level: All technical
Type: Hangout
Tags: Algorithms

Day: Tuesday, 04/05
Time: 12:00 - 13:00
Location: Pod C

Hangout

SUMMIT TALK

Presentation
Details

OP169 - Measuring and Managing Power Consumption

Todd Rosedahl Chief Power Thermal Energy Mgmt Engineer on POWER, IBM
Todd Rosedahl is the Chief Power/Thermal/Energy Management Engineer on POWER. He has worked on power, thermal, and energy management for his entire 23-year career at IBM and holds over 20 related patents.

This presentation will include an overview of the power, thermal, and performance data that can be collected from OpenPOWER servers via various methods, including a newly open sourced profiling tool called AMESTER. The power/performance knobs, such as processor frequency, that are under the control of the On Chip Controller (OCC) will be described and the overall OCC power management functions will be highlighted.

Level:
Type: Summit Talk
Tags: OpenPower

Day: Tuesday, 04/05
Time: 12:15 - 12:30
Location: Room 220C

Summit Talk

THEATER TALK

Presentation
Details

OP134 - Build Synergy in the OpenPOWER Ecosystem Around OpenPOWER Compliance and OpenPOWER Ready

Sandy Woodward Senior Technical Staff Member, IBM Academy of Technology Member, OpenPOWER Foundation Compliance Work Group Chair, IBM
Sandy Woodward is the OpenPOWER Foundation Compliance Work Group Chair. She has over 20 years' experience with the IBM POWER architecture, is a Senior Technical Staff Member at IBM, and is an IBM Academy of Technology member.

The OpenPOWER Compliance Work Group has been working closely with the other OpenPOWER Foundation Work Groups. Each Work Group defines what is required for compliance through its Work Group Specifications, each of which contains a section on conformance to the specification. The Compliance Work Group generates Compliance Test Harness and Test Suite specifications focused on each compliance area. Test suites contributed to the OpenPOWER Foundation must be licensed under the Apache v2 license; compliance may also be measured with test suites not contributed to the Foundation, provided they satisfy the Compliance Test Harness and Test Suite specifications.

Level:
Type: Theater Talk
Tags: OpenPower

Day: Tuesday, 04/05
Time: 12:45 - 13:00
Location: OpenPOWER Booth

Theater Talk

HANGOUT

Presentation
Details

H6113 - Hangout: Accelerating Micro Services / NVIDIA-Docker

Chris Gottbrath Product Manager, NVIDIA
Chris Gottbrath is an Accelerated Computing Product Manager working to ensure that the CUDA math libraries and other software products that NVIDIA provides deliver exceptional value to users. He has more than 15 years' experience in the high performance, scientific, technical, and enterprise computing business, with a strong focus on user productivity, application performance, and correctness. He started exploring CUDA about 10 years ago.
Michael O'Connor Engineering Manager, NVIDIA
TBA

Cloud and SaaS applications are composed of services and microservices. The NVIDIA GPU REST Engine (GRE) allows you to easily create a GPU-accelerated microservice by calling a CUDA library function or a custom-written CUDA kernel. If you are working with REST services, please come to this hangout and meet the NVIDIA developers enabling GPU-accelerated microservices. Containers wrap applications into an isolated virtual environment to simplify data center deployment. NVIDIA Docker allows you to easily create Docker containers that are agnostic to the NVIDIA driver. If you are working with containers, please come to this hangout and meet the developers. This is an area where recent work has created a foundation for many new developments, so please bring suggestions for capabilities or services you would like to see in GRE or NVIDIA Docker.

Level: All technical
Type: Hangout
Tags: Data Center & Cloud Computing

Day: Tuesday, 04/05
Time: 13:00 - 14:00
Location: Pod A

H6145A - Hangout: Video and Image Processing

Eric Young Developer Relation Manager for Video and Remote Graphics, NVIDIA Corporation
Eric Young is the developer relations engineering manager in the Content and Technology team. He is responsible for applied research in remote graphics technologies, video compression, and video processing. He has been working at NVIDIA since 2004 on video processing algorithms, computer vision, high performance computing, and workstation graphics.
Abhijit Patait Director of Multimedia Software, NVIDIA Corporation
Abhijit is the Director of the Multimedia Software Team at NVIDIA, which is responsible for multimedia and gaming technologies, including the Video Codec SDK, GRID SDK, audio, and 3D Vision. He has been associated with NVIDIA video encoding technologies since their inception in 2010. Prior to NVIDIA, Abhijit worked in engineering and management roles at various companies, including Motorola, Ericsson, and Ditech Networks (now Nuance).
Edward Richards Video Datacenter Solution Architect, NVIDIA Corporation
Edward Richards is the Hyperscale Video Architect for Datacenters at NVIDIA. He helps to drive the current and future direction to Video Solutions at NVIDIA, influencing both the hardware and software design.

This is a Hangouts session with NVIDIA engineers. NVIDIA has professional and data center products designed to accelerate Video Encoding, Decoding, and Image Processing. Come talk to the experts to learn more about how these NVIDIA solutions can help you.

Level: All technical
Type: Hangout
Tags: Video & Image Processing

Day: Tuesday, 04/05
Time: 13:00 - 14:00
Location: Pod B

H6147 - Hangout: Vulkan C++ API

Markus Tavenrath Senior Developer Technology Engineer, NVIDIA
Markus Tavenrath experiments with future rendering technologies that bring high interactivity to complex scenes. This work includes both CPU and GPU strategies to solve typical scene-graph operations related to rendering. Markus finished his studies in computer science with a focus on computer graphics in 2008. He was one of the first to use ray-tracing on CUDA for his diploma thesis, which brought him straight to NVIDIA. There, he primarily worked on GPU raytracing for SceniX, NVIDIA's scene-graph technology, first showcased at SIGGRAPH 2008. Later he applied his experience to implement parts of OptiX, improve SceniX, and develop several ray-tracing demos. In close cooperation with external partners, he improved rendering performance and scenegraph usability as developer technology engineer.
Andreas Süßenbach Senior Developer Technology Engineer, NVIDIA
Andreas Sussenbach is a senior DevTech engineer in NVIDIA's Professional Solutions Group, where he works to help different ISVs improve their GPU-related implementations. He has more than 15 years of experience in scene graph and rendering technologies, with emphasis on efficient handling of geometries and materials. He has a diploma in mathematics with a focus on numerical mathematics and CAGD.

Attend this hangout and talk to the developers about the Vulkan C++ API.

Level: All technical
Type: Hangout
Tags: Real-Time Graphics

Day: Tuesday, 04/05
Time: 13:00 - 14:00
Location: Pod C

Hangout

HANDS-ON LAB

Presentation
Details

L6113 - Teach GPU-Accelerated Computing: Hands-on with the NVIDIA Teaching Kit for University Educators

Abdul Dakkak UIUC PhD Candidate, University of Illinois, Urbana-Champaign
TBA
Wen-Mei Hwu Professor of Electrical and Computer Engineering, University of Illinois
Professor Wen-mei Hwu holds the Sanders-AMD Endowed Chair in the Department of Electrical and Computer Engineering, at the University of Illinois at Urbana-Champaign. His research interests are in the area of architecture, implementation, compilation, and algorithms for parallel computing. He is the chief scientist of the Parallel Computing Institute and director of the IMPACT research group. He is a co-founder and CTO of MulticoreWare. He is the instructor for the Coursera Heterogeneous Parallel Programming course, which more than 60,000 students have taken. For his contributions in research and teaching, he received the ACM SigArch Maurice Wilkes Award, the ACM Grace Murray Hopper Award, the EKN Holmes MacDonald Outstanding Teaching Award, the ISCA Influential Paper Award, the IEEE Computer Society B. R. Rau Award, and the Distinguished Alumni Award in Computer Science of the University of California, Berkeley. He is a fellow of IEEE and ACM. He directs the UIUC CUDA Center of Excellence and serves as one of the principal investigators of the NSF Blue Waters Petascale computer project. Wen-mei received his Ph.D. in computer science from the University of California, Berkeley.
Joe Bungo GPU Educators Program Manager, NVIDIA
Joe Bungo is the GPU Educators Program Manager at NVIDIA, where he enables the use of NVIDIA and GPU technologies in universities in a variety of ways, including curriculum and teaching material development, facilitation of academic ecosystems, and hands-on instructor workshops. Previously, he managed the university program at ARM Inc. and worked as an applications engineer there.

As the performance and functionality requirements of interdisciplinary computing applications rise, industry demand for new graduates familiar with GPU-accelerated computing grows. In the future, many mass-market applications will be what are considered "supercomputing applications" by today's standards. This hands-on tutorial introduces a comprehensive set of academic labs and university teaching material for use in introductory and advanced parallel programming courses. The teaching materials start with the basics and focus on programming GPUs, and include advanced topics such as optimization, advanced architectural enhancements, and integration of a variety of programming languages. This lab utilizes GPU resources in the cloud; you are required to bring your own laptop.

Level: Intermediate technical
Type: Hands-on Lab
Tags: Tools & Libraries; Education & Training

Day: Tuesday, 04/05
Time: 13:00 - 14:30
Location: Room 210B

L6131 - Deep Learning on GPUs: From Large Scale Training to Embedded Deployment on Maxwell

Julie Bernauer Senior Solutions Architect, NVIDIA
Julie Bernauer has been a Senior Solutions Architect for Deep Learning at NVIDIA since 2015. She attended ENS Cachan from 2001 to 2004, where she received a degree in physical chemistry. She obtained her Ph.D. from Université Paris-Sud in 2006 while performing research in the Yeast Structural Genomics group; her thesis focused on the use of Voronoi models for modelling protein complexes. After a postdoctoral position at Stanford University with Prof. Michael Levitt, Nobel laureate in Chemistry 2013, she joined Inria, the French national institute for computer science. While a senior research scientist at Inria, adjunct associate professor of computer science at École Polytechnique, and visiting research scientist at SLAC, her work focused on computational methods and machine learning for structural bioinformatics, specifically scoring functions for macromolecule docking and statistical potentials for molecular simulations. She was the first to successfully introduce machine learning for coarse-grained models in the CAPRI challenge.
Allison Gray Solutions Architect, NVIDIA
TBA

This tutorial will show how to set up a deep learning environment on Jetson TX1 to perform deep learning tasks, in particular inference using pretrained models from a DIGITS server. Other demo applications, including live image classification and image captioning, will be covered.

Level: Intermediate technical
Type: Hands-on Lab
Tags: Deep Learning & Artificial Intelligence

Day: Tuesday, 04/05
Time: 13:00 - 16:00
Location: Room 210C

Hands-on Lab

TALK

Presentation
Details

S6117 - Parallelization and Performance of the NIM Weather Model on CPU, GPU and MIC Architectures

Mark Govett Chief, Advanced Computing Section, NOAA Earth System Research Laboratory
Highly-Rated Speaker
Mark manages the High Performance Computing Section, a software group that both supports model development, parallelization, and porting to high performance computers, and explores advanced computing technologies for the National Oceanic and Atmospheric Administration (NOAA). Mark has worked in high performance computing, code parallelization and compiler development for over 20 years. He has developed two Fortran compilers, the Scalable Modeling System (SMS) for MPI based parallelization, and the F2C-ACC GPU compiler. He also parallelized two weather models using the F2C-ACC compiler and has been collaborating with Cray and PGI to improve the capabilities and performance of their commercial GPU compilers.

In an era defined by increasing diversity in computing architectures, performance portability is a key requirement for weather and climate applications that require massive computing resources. In this talk, you will learn how we developed for, and achieve performance on, CPU, GPU, and MIC architectures using industry-standard OpenACC and OpenMP directives. Performance results from the NIM weather model will be shown for a number of device, node, multi-node, and system configurations. Further, we will highlight communication optimizations that deliver more than a 40% improvement in runtime when scaling to thousands of GPUs.
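
As a flavor of the directive-based approach, a single OpenACC directive can expose a loop's parallelism and express its data movement. The sketch below is our own minimal C illustration (NIM itself is a Fortran code), not an excerpt from the model:

    // Minimal OpenACC illustration (not NIM code): the parallel loop
    // directive exposes the loop's parallelism; the data clauses express
    // host/device data movement.
    void saxpy(int n, float a, const float *x, float *y)
    {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }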

Level: Intermediate technical
Type: Talk
Tags: Earth System Modelling; Supercomputing & HPC; Programming Languages; OpenACC; Press-Suggested Sessions: HPC & Science

Day: Tuesday, 04/05
Time: 13:00 - 13:25
Location: Room 211A

S6144 - Introducing NVIDIA's Data Center GPU Manager

Brent Stolle Software Engineer, NVIDIA
Brent Stolle is a software engineer at NVIDIA.
Rajat Phull Software Engineer, NVIDIA
Rajat Phull is a software engineer at NVIDIA.

NVIDIA is launching a new tool for data center GPU management. This is a freely available, comprehensive GPU management framework that enables cluster management, resource scheduling, and monitoring products from NVIDIA partners, and supports individual users and admins as well. Data Center GPU Manager 1.0, available for Tesla GPUs on Linux, helps to ensure GPU reliability and uptime, streamline common data center administrative tasks, and improve overall resource efficiency while still providing complete control over GPUs and expanded visibility into their behavior. It includes active health monitoring, diagnostics, system alerts, and governance policies, including power and clock management. The talk will provide an overview of the key features of this software stack.
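
DCGM's own API is not reproduced here, but as a hedged illustration of the kind of per-GPU telemetry such a framework aggregates, the sketch below queries power and temperature through NVML, the lower-level management library that ships with the driver (error handling omitted):

    #include <stdio.h>
    #include <nvml.h>

    int main(void)
    {
        nvmlDevice_t dev;
        unsigned int mw, celsius;

        nvmlInit();
        nvmlDeviceGetHandleByIndex(0, &dev);
        nvmlDeviceGetPowerUsage(dev, &mw);        // current draw, milliwatts
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &celsius);
        printf("GPU 0: %u mW, %u C\n", mw, celsius);
        nvmlShutdown();
        return 0;
    }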

Level: All technical
Type: Talk
Tags: Data Center & Cloud Computing; Supercomputing & HPC

Day: Tuesday, 04/05
Time: 13:00 - 13:50
Location: Room LL21C

S6164 - Accelerating Gene Set Enrichment Analysis on CUDA-Enabled GPUs

Bertil Schmidt Professor, JGU Mainz
Bertil Schmidt is a tenured full professor and chair for Parallel and Distributed Architectures at the University of Mainz, Germany. Prior to that he was a faculty member at Nanyang Technological University (Singapore) and at University of New South Wales. Bertil's research group has designed a variety of algorithms and tools for computational science and bioinformatics, mainly focusing on the analysis of large-scale sequence and short read datasets. For his research work, he has received a CUDA Research Centre award, a CUDA Academic Partnership award, a CUDA Professor Partnership award, and Best Paper Awards at IEEE ASAP 2009 and IEEE ASAP 2015. Bertil serves as the champion for bioinformatics and computational biology on gpucomputing.net.
Christian Hundt Professor, University Mainz
Christian Hundt received his diploma in theoretical physics (for the analysis of quantization maps on curved manifolds) and his Ph.D. in computer science (for the efficient subsequence alignment of time series on CUDA-enabled accelerators) from the University of Mainz, Germany, in 2010 and 2015, respectively. In his current position as a postdoctoral researcher in the Parallel and Distributed Architectures group, he investigates the design and parallelization of algorithms in the field of bioinformatics.

Learn how to efficiently parallelize gene set enrichment analysis (GSEA) using CUDA. GSEA is an important bioinformatics method that determines whether given sets of genes are statistically overrepresented between two phenotypes. The GSEA software from the Broad Institute, with several thousand users, is the most popular tool for performing such studies. NGS technologies are gradually replacing microarrays for high-throughput gene expression studies, and the size and availability of input data sets are increasing, leading to high runtimes for the desktop GSEA application. We present an efficient CUDA parallelization of the core GSEA algorithm. By using a combination of parallelization techniques, we achieve speed-ups of around two orders of magnitude on a single GPU.
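
The statistical core of GSEA is a permutation test, which is naturally parallel. As a heavily simplified sketch of the idea (our own illustration, not the authors' implementation), one CUDA thread can compute the unweighted running-sum enrichment score for one phenotype permutation:

    // Simplified illustration (not the authors' code): one thread per
    // permutation. 'ranked' holds gene indices sorted by correlation under
    // that permutation; 'inSet' flags membership in the gene set under test.
    __global__ void enrichmentScores(const int *ranked,   // nPerm x nGenes
                                     const char *inSet,   // nGenes
                                     float *score,        // nPerm
                                     int nGenes, int setSize, int nPerm)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= nPerm) return;

        float hitStep  = 1.0f / setSize;            // KS-style random walk
        float missStep = 1.0f / (nGenes - setSize);
        float running = 0.0f, best = 0.0f;

        for (int g = 0; g < nGenes; ++g) {
            int gene = ranked[p * nGenes + g];
            running += inSet[gene] ? hitStep : -missStep;
            if (fabsf(running) > fabsf(best)) best = running;
        }
        score[p] = best;   // maximum deviation = enrichment score
    }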

Level: Intermediate technical
Type: Talk
Tags: Computational Biology

Day: Tuesday, 04/05
Time: 13:00 - 13:25
Location: Marriott Salon 5

Talk

FEATURED PRESENTATION

Presentation
Details

S6176 - Inside Pascal

Mark Harris Chief Technologist, GPU Computing Software, NVIDIA
Highly-Rated Speaker
Mark Harris is chief technologist for GPU Computing Software at NVIDIA. Mark has 15 years of experience developing software for GPUs, ranging from graphics and games to physically based simulation, parallel algorithms, and high performance computing. Mark has been using GPUs for general-purpose computing since before they even supported floating point arithmetic. While a Ph.D. student at UNC, he recognized this nascent trend and coined a name for it: GPGPU (general-purpose computing on graphics processing units), and started GPGPU.org to provide a forum for those working in the field to share and discuss their work.
Lars Nyland Senior Architect, NVIDIA
Highly-Rated Speaker
Lars Nyland has worked at NVIDIA for 10 years with his full attention on GPU computing.

Level: All technical
Type: Featured Presentation
Tags: Supercomputing & HPC

Day: Tuesday, 04/05
Time: 13:00 - 13:50
Location: Grand Ballroom

Featured Presentation

TALK

Presentation
Details

S6227 - Distributed Deep Learning at Scale

Soumith Chintala Research Engineer, Facebook AI Research
Soumith Chintala is a Research Engineer at Facebook AI Research. Prior to joining Facebook in August 2014, Soumith worked at MuseAmi, where he built deep learning models for music and vision targeted at mobile devices. In the past, Soumith worked on state-of-the-art deep learning models for pedestrian detection, natural image OCR, and depth images, among others, while driving his research heavily with CUDA and multiple GPUs.

This talk provides a brief overview of deep learning research and the challenges involved in scaling it up across multi-GPU and multi-machine clusters while providing software that is flexible enough for research settings. We discuss the clear trends emerging in deep learning from an HPC perspective and cover several examples from our work at Facebook AI Research.

Level: All technical
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Computer Vision & Machine Vision; Press-Suggested Sessions: AI & Deep Learning

Day: Tuesday, 04/05
Time: 13:00 - 13:50
Location: Hall 3

S6253 - VMD: Petascale Molecular Visualization and Analysis with Remote Video Streaming

John Stone Senior Research Programmer, University of Illinois at Urbana-Champaign
Highly-Rated Speaker
John Stone is a senior research programmer in the Theoretical and Computational Biophysics Group at the Beckman Institute for Advanced Science and Technology, and associate director of the NVIDIA CUDA Center of Excellence at the University of Illinois. John is the lead developer of VMD, a high-performance molecular visualization tool used by researchers all over the world. His research interests include molecular visualization, GPU computing, parallel processing, ray tracing, haptics, and virtual environments. John was named an NVIDIA CUDA Fellow in 2010. In 2015, he joined the Khronos Group Advisory Panel for the Vulkan graphics API. John also provides consulting services for projects involving computer graphics, GPU computing, and high performance computing.

We'll showcase recent successes in the use of GPUs to accelerate challenging molecular visualization and analysis tasks on hardware platforms ranging from commodity desktop computers to the latest GPU-accelerated petascale supercomputers by Cray and IBM. We'll highlight the use of in-situ ray tracing and rasterization combined with GPU-accelerated video streaming for high-interactivity remote visualization, CUDA just-in-time compilation to increase the performance of data-driven visualization and analysis algorithms, and we'll describe new, GPU-accelerated, MD trajectory clustering algorithms.

Level: Intermediate technical
Type: Talk
Tags: In-Situ and Scientific Visualization; Computational Chemistry; Rendering & Ray Tracing

Day: Tuesday, 04/05
Time: 13:00 - 13:50
Location: Room LL21D

S6391 - Bootstrapping Labels for One Hundred Million Images

Jimmy Whitaker Software Engineer, Digital Reasoning
Jimmy Whitaker is a software engineer at Digital Reasoning, a cognitive computing company focused on enabling humans to leverage big data to make decisions, where he has been pioneering computer vision efforts. Prior to joining Digital Reasoning, Jimmy completed his M.S. in computer science at the University of Oxford, where he achieved a distinction for his research in the field of steganalysis -- detecting hidden information in images.

We'll describe how we created an iterative labeling process to perform data science on 100 million+ images using a GPU-powered workflow with convolutional neural networks. Recently, deep learning techniques such as deep convolutional neural networks (ConvNets) have achieved state-of-the-art results in many computer vision tasks. The data-driven nature of deep learning normally requires a large number of labeled examples to achieve high accuracies. Unfortunately, much of the publicly available data on the web is not labeled, thus requiring human labelers for large datasets or unsupervised machine learning techniques. Our labeling process allows weak labels and a small number of strong labels to be used to create classifiers for very large datasets.

Level: Beginner technical
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Big Data Analytics; Computer Vision & Machine Vision

Day: Tuesday, 04/05
Time: 13:00 - 13:25
Location: Room 210H

S6422 - Enhancing Visual Realism of Mixed Reality Applications with Stereo Vision

Edwin Azzam CTO, Stereolabs
Edwin Azzam co-founded STEREOLABS in 2010. As STEREOLABS's Chief Technical Officer, Edwin is responsible for leading the company's product development and technology strategy in stereo vision. Prior to founding STEREOLABS, Edwin was a project manager at Astrium Space Transportation, Paris. Edwin holds a Master's degree in Optics & Image Processing from Institut d'Optique, France, as well as a Master's degree in Management from ESSEC Business School. He is a Ph.D. supervisor and a National Technical Expert for the ANR (National Research Agency), where he uses his technical and market expertise in the assessment of national research projects in the field of computer vision and 3D image processing.

Discover how stereo vision and 3D depth sensing on GPU enable the development of mixed reality applications, which merge virtual information into a live 3D video stream of the real world. We will discuss the various stages of a real-time mixed reality processing pipeline, and how NVIDIA's GPU acceleration is integral to every step of the pipeline. We will also show demonstrations of how stereo depth sensing can be used to create 3D virtual playgrounds and real-time augmentation of the environment.

Level: All technical
Type: Talk
Tags: Computer Vision & Machine Vision; Virtual Reality & Augmented Reality; Video & Image Processing; Embedded

Day: Tuesday, 04/05
Time: 13:00 - 13:25
Location: Room 210F

S6423 - Accelerating Approximate Weighted Matching on GPUs

Antonino Tumeo Research Scientist, Pacific Northwest National Laboratory
Highly-Rated Speaker
Dr. Antonino Tumeo has been a research scientist in PNNL's High Performance Computing group since February 2011. Antonino received an M.S. degree in informatic engineering in 2005 and a Ph.D. in computer engineering in 2009 from Politecnico di Milano in Italy. He joined PNNL in 2009 as a postdoctoral research associate; previously, he was a postdoctoral researcher at Politecnico di Milano. His research interests are modeling and simulation of high-performance architectures, hardware-software codesign, FPGA prototyping, and GPGPU computing.

Matching is a fundamental graph problem with numerous applications in science and engineering. This talk discusses the efficient implementation of half-approximate weighted matching on GPUs. We start by describing the Suitor algorithm, currently considered the best algorithm for this problem, and identifying its key implementation challenges. In its basic formulation, the Suitor algorithm appears poorly suited to GPUs due to its irregular memory accesses and use of locks. We then introduce four variants of the algorithm that progressively address these challenges by exploiting Kepler's hardware features. We demonstrate that the final implementation outperforms the best previous GPU matching algorithms by several times, as well as the Suitor algorithm on CPUs.
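
To make the locking challenge concrete: in the Suitor algorithm, each vertex proposes to its heaviest eligible neighbor, and competing proposals to the same vertex must be arbitrated. One lock-free pattern (our own simplified sketch of a single proposal round, not the authors' code) packs each proposal's weight and proposer into a 64-bit key so that a single atomicMax arbitrates it; the float-bits trick assumes non-negative edge weights, and 64-bit atomicMax requires compute capability 3.5 (Kepler GK110):

    // Simplified sketch of one lock-free proposal round (not the authors'
    // implementation). The graph is in CSR form; bestBid[v] holds the packed
    // (weight, proposer) key of the heaviest proposal vertex v has seen.
    __global__ void proposeRound(const int *rowPtr, const int *colIdx,
                                 const float *w,
                                 unsigned long long *bestBid, int n)
    {
        int u = blockIdx.x * blockDim.x + threadIdx.x;
        if (u >= n) return;
        for (int e = rowPtr[u]; e < rowPtr[u + 1]; ++e) {
            int v = colIdx[e];
            // For non-negative floats the raw bit pattern preserves order,
            // so the heaviest (weight, proposer) key wins the atomicMax.
            unsigned long long bid =
                ((unsigned long long)__float_as_uint(w[e]) << 32) |
                (unsigned int)u;
            atomicMax(&bestBid[v], bid);   // no locks required
        }
    }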

Level: Intermediate technical
Type: Talk
Tags: Algorithms; Big Data Analytics; Aerospace & Defense

Day: Tuesday, 04/05
Time: 13:00 - 13:25
Location: Marriott Salon 3

S6452 - Run-Time Scene-Graph Construction from Geographic Source Data

Tim Woodard Chief Technology Officer, Diamond Visionics
Highly-Rated Speaker
Tim Woodard is the chief technology officer at Diamond Visionics, with over 18 years of experience specializing in the design and development of software architectures for real-time, PC-based image generation using Agile development processes, advanced C++, and modern OpenGL techniques. Tim has received patents for the real-time simulator database generation technology that forms the basis of Diamond Visionics' GenesisRTX worldwide database generation system. GenesisRTX provides high-fidelity generation, visualization, and manipulation of visual databases at run-time directly from source data on low-cost PC-based platforms, eliminating the need for traditionally labor-intensive off-line database production processes. He has served as the director of engineering, director of research and development, and principal investigator for a number of Phase I, II, and III U.S. Government Small Business Innovative Research Grants. Tim has also published and presented papers at I/ITSEC, IMAGE, NVIDIA's GPU Technology Conference, ASQ, and ITEC.

In modern computing hardware, the gaps in performance between GPUs, CPUs, RAM, and storage continue to widen. When visualizing large and dense geographic datasets (e.g., imagery, elevation, vectors, features), balancing the workload effectively between these resources (and considering the bottlenecks between them) is crucial. Conventional wisdom for optimal performance from just 10 years ago may not provide the same benefits it once did. In this talk, we demonstrate that by exploiting parallelism on the CPU and especially the GPU, much greater throughput can be achieved. Furthermore, by utilizing modern OpenGL techniques (e.g., NV_command_list), an order of magnitude increase in performance can be achieved when compared to previously available rendering methods.

Level: Intermediate technical
Type: Talk
Tags: Real-Time Graphics ; Aerospace & Defense; Performance Optimization

Day: Tuesday, 04/05
Time: 13:00 - 13:25
Location: Room 210E

S6616 - NCCL: Accelerated Collective Communications for GPUs

Nathan Luehr Senior Devtech Engineer, NVIDIA
Nathan Luehr is a senior developer technology engineer for compute applications at NVIDIA. He earned a Ph.D. in theoretical chemistry from Stanford University in June 2015.

We present NCCL, a library of multi-GPU communication collectives (e.g., broadcast, all-reduce, all-gather). NCCL enables applications to harness the computational throughput of multiple GPUs with minimal developer effort by providing optimized, topology-aware, asynchronous collectives with a familiar API.
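
As a flavor of the API (a minimal sketch using NCCL 1.x-era signatures as we understand them; buffer and stream setup, and error checking, are omitted), a single-process sum all-reduce across four GPUs looks roughly like this:

    #include <cuda_runtime.h>
    #include <nccl.h>

    // Minimal sketch: sum-reduce 'count' floats across 4 GPUs in one
    // process. sendbuf/recvbuf/streams are assumed allocated per device.
    void allReduceAcrossGpus(float **sendbuf, float **recvbuf,
                             cudaStream_t *streams, int count)
    {
        const int nDev = 4;
        int devs[nDev] = {0, 1, 2, 3};
        ncclComm_t comms[nDev];
        ncclCommInitAll(comms, nDev, devs);     // one communicator per GPU

        for (int i = 0; i < nDev; ++i) {
            cudaSetDevice(devs[i]);
            ncclAllReduce(sendbuf[i], recvbuf[i], count,
                          ncclFloat, ncclSum, comms[i], streams[i]);
        }
        for (int i = 0; i < nDev; ++i)          // collectives are async
            cudaStreamSynchronize(streams[i]);
        for (int i = 0; i < nDev; ++i)
            ncclCommDestroy(comms[i]);
    }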

Level: All technical
Type: Talk
Tags: Tools & Libraries; Supercomputing & HPC

Day: Tuesday, 04/05
Time: 13:00 - 13:25
Location: Room 211B

S6650 - Optimizing In-Field Processing Using GPUs

Tarik Saidani Senior Software Engineer, PGS
Tarik is a Senior Software Engineer at PGS, specializing in parallel programming and software optimization. He has worked in the oil and gas industry for the last five years, helping research geophysicists commercialize their applications using parallel programming and software optimization techniques. He holds a Ph.D. in parallel computing from Paris-Sud University.

Learn how GPU accelerators help marine seismic acquisition to efficiently perform one of the fundamental steps in the on-board processing flow. GPUs not only allow unprecedented data processing throughput, but also reduce hardware footprint, power consumption and heat dissipation of the in-field compute system.

Level: Intermediate technical
Type: Talk
Tags: Energy Exploration; Performance Optimization; Signal & Audio Processing

Day: Tuesday, 04/05
Time: 13:00 - 13:25
Location: Marriott Salon 1

S6689 - Creating CONSTRUCT: A GPU-Rendered Short Film

Kevin Margo Director / VFX Supervisor, Blur Studio
Highly-Rated Speaker
Kevin is director of the hit sci-fi short film "Grounded". He joined Blur Studio in 2003 as a scene assembly, lighting, and compositing artist and has since moved into the studio's VFX/CG supervisor role. Recent work includes the prologue for Thor 2: The Dark World and the David Fincher-produced "Halo 4: Scanned" cinematic trailer.

Come watch a special screening of the GPU-rendered independent short film "CONSTRUCT"! Afterwards, Kevin will describe how Chaos Group's V-Ray RT and NVIDIA GPUs were used throughout production of the groundbreaking short film, rendered entirely on GPUs. Go here (http://constructfilm.com/) to see more of the project and here (https://www.youtube.com/watch?v=nnaz8q6FLCk) to see how interactive GPU rendering was used on a motion capture stage during production. As a bonus, Kevin will cover how GPU rendering was recently implemented in his day job at Blur Studio to help visualize the irreverent title sequence of the Fox/Marvel hit film DEADPOOL.

Level: All technical
Type: Talk
Tags: Rendering & Ray Tracing; Media & Entertainment; Real-Time Graphics ; Press-Suggested Sessions: Professional Graphics

Day: Tuesday, 04/05
Time: 13:00 - 13:50
Location: Room LL21B

Talk

FEATURED PRESENTATION

Presentation
Details

S6716 - Featured Presentation: The Psychology of High Performance and the Case for Better Technology

David Johnson Principal Analyst, Forrester Research
David Johnson serves Infrastructure & Operations Professionals and has one passion and one goal: helping companies create workforce computing experiences that engage people and enable them to do their best work. He is an expert in client virtualization technologies, including VDI, terminal services, application virtualization, desktop and mobile OS platforms, and converged infrastructures for VDI and DaaS. Forrester calls these technology capabilities digital workspace delivery. He helps Forrester clients build mastery of three things: workforce computing technologies, the audit and legal compliance aspects of workforce computing strategy, and the human behavior science of motivation and engagement.

Forrester analyst David Johnson examines the puzzle of companies who consistently outperform their competitors, year after year. Contrary to conventional wisdom, it isn't because they have better people or pay them more. It's because they're better at creating the conditions under which their people choose to perform. In this session, you will learn about the psychology of high performance, and how companies invest in technologies to facilitate it, which in turn leads to customer success.

Level: Intermediate technical
Type: Featured Presentation
Tags: Graphics Virtualization; General Interest; Product & Building Design

Day: Tuesday, 04/05
Time: 13:00 - 13:50
Location: Marriott Salon 4

Featured Presentation

TALK

Presentation
Details

S6782 - Securing the San Francisco 49ers and Levi's Stadium

Dan Cory VP of Security, San Francisco 49ers
Dan Cory is in his fifth year with the 49ers and first as the team's vice president of security. In his role, he oversees all elements of security for both the team and Levi's® Stadium. Prior to joining the organization, Cory served in the British Military as a Royal Marine Commando, followed by twelve years of service as a law enforcement officer for Scotland Yard's Special Operations Department. His law enforcement and military career has afforded him the opportunity to work with many different security units conducting operations around the world.

The scope and scale of securing a facility like Levi's Stadium and the San Francisco 49ers is monumental. Learn how video monitoring and analysis allows the 49ers security team to focus on ensuring a safe season for the team and fans alike.

Level: All technical
Type: Talk
Tags: Intelligent Video Analytics (IVA)

Day: Tuesday, 04/05
Time: 13:00 - 13:25
Location: Room LL20D

S6825 - The OpenPOWER Foundation: Revolutionizing Data-Centric Transformation (Presented by IBM)

Sumit Gupta Vice President, High Performance Computing and Analytics, IBM Power Systems
Sumit is responsible for offering and product management of OpenPOWER-based solutions for high performance computing and high performance data analytics. In this role, Sumit is driving the offerings IBM is building for the technical computing and machine/deep learning markets. Sumit joined IBM in May 2015 from NVIDIA, where he was the general manager for the Tesla GPU accelerator business and was central in building that startup business within NVIDIA from zero to several hundred million dollars. Sumit is a product management, marketing, and business leader for enterprise systems and software products, and has previously held positions in marketing, business strategy, and engineering at Tensilica, Tallwood Venture Capital, Intel, S3, and IBM. Sumit has a Ph.D. in computer science from the University of California, Irvine, and a bachelor of technology in electrical engineering from the Indian Institute of Technology, Delhi. He has authored one book, one patent, several book chapters, and more than 20 technical publications.

The growth of the OpenPOWER Foundation has been phenomenal. Why, you might ask? In less than two years, OpenPOWER has grown from five members to over 180, with membership across all tiers of hardware, software, and end users themselves. The Foundation provides a compelling and rapidly growing open approach to infrastructure and software for rapidly changing workloads and evolving IT consumption models. This is a revolution that is making a profound difference in the price/performance criteria of end users, as well as accelerating compelling development for performance to drive business advantage. OpenPOWER members are co-creating their approach to technology—as innovators, producers, and consumers utilizing IBM's Power Architecture.

Level: All technical
Type: Talk
Tags: Big Data Analytics; Data Center & Cloud Computing; Supercomputing & HPC

Day: Tuesday, 04/05
Time: 13:00 - 13:50
Location: Marriott Salon 6

S6829 - Drive Me: Volvo's Autonomous Car Program

Henrik Lind Technical Expert, Volvo Car Corporation
Henrik Lind has a master's degree in electrical engineering from Chalmers University of Technology. He has worked on advanced driver assistance technologies and technology research at Volvo Technological Development since 1997, leading research on sensors and functions. In 2001, Henrik moved to Volvo Cars, where he was responsible for the introduction of radar- and vision-related functions with the aim of providing increased safety and comfort for drivers; he introduced forward collision warning with emergency brake and adaptive cruise control in 2006, followed by further innovations in safety. Since 2013, Henrik has been working on bringing highly automated driving technologies to Volvo Cars. He is an appointed technical specialist.

We'll present the Drive Me project involving 100 highly autonomous vehicles in the vicinity of Gothenburg, Sweden. Henrik will discuss different technologies related to sensors and sensor processing and the resulting requirement for high performance processing in autonomous vehicles.

Level: All technical
Type: Talk
Tags: Self-Driving Cars & Automotive ; Press-Suggested Sessions: Self-Driving Cars & Auto

Day: Tuesday, 04/05
Time: 13:00 - 13:25
Location: Room LL21E

S6838 - Create Full Set of Materials for Hyundai Genesis G380 with Substance Designer, Iray and MDL

David Nikel Digital Model Manager, Hyundai
After 10 years as a modeler for General Motors and four years running his own independent company, David has been the Digital Model Manager at Hyundai USA since 2002.
Jerôme Derel Chief Product Officer, Allegorithmic
Engineer and product designer Jerome Derel joined Allegorithmic in 2014 as chief product officer. Jerome worked for seven years at Dassault Systemes as a visualization expert in the Design Studio and CATIA Design teams, leading projects producing high-quality virtual materials.
Pierre Maheut Product Manager & Senior Industrial Designer, Allegorithmic
With an industrial design background and after 8 years at Dassault Systemes as CATIA Creative Design expert & portfolio manager, Pierre joined Allegorithmic as product manager & senior industrial designer.

Discover how Substance Designer enables the creation of the extensive set of materials for the interior & exterior of the Hyundai Genesis G380. We will show how Allegorithmic's Substance procedural technology and NVIDIA's Material Definition Language (MDL) can be combined to bring materials creation to a level never reached before. Material review will be achieved on the actual fully detailed car model using Substance Designer and NVIDIA Iray integration. Finally, we will explain how Substance can help industrial designers in their creative iterations and exploration phases.

Level: All technical
Type: Talk
Tags: Product & Building Design; Rendering & Ray Tracing; Press-Suggested Sessions: Professional Graphics

Day: Tuesday, 04/05
Time: 13:00 - 13:50
Location: Room LL21A

S6848 - Deep Learning Workloads on CRAY Cluster Systems with NVIDIA™ GPUs (Presented by Cray)

Ryan Olson Principal Performance Engineer, Cray
Ryan Olson has been a member of the Performance Engineering Team at Cray since 2007. Prior to this, Ryan was a postdoctoral research associate at the University of Minnesota, and he completed graduate work at the Ames Laboratory. Ryan holds a Ph.D. in physical chemistry from Iowa State University and a B.A. in chemistry and mathematics from Saint John's University.
Mark Staveley Director of Product Management, Cray
Mark Staveley is a Director of Product Management at Cray. Mark is part of the Analytics Products Team, and his main role is to lead Cray's machine learning efforts. Prior to joining Cray, Mark spent over six years at Microsoft, where he held various roles, including Technical Program Manager for Microsoft Azure's accelerated computing and visualization program, Research Program Manager for Microsoft Research's large-scale data management and processing program, and Senior Engineer on Xbox One and Microsoft Windows HPC Server. Prior to Microsoft, Mark worked as a computational researcher at ACEnet and HPCVL, two of Canada's largest high performance computing centers.

Cray cluster systems have long been used to support supercomputing and scientific applications. In this talk, we'll demonstrate how these same systems can be easily configured to support Docker and, subsequently, various machine learning software packages, including NVIDIA's DIGITS software. Additionally, the Docker containers on these systems can be configured to pull data from Cray's Sonexion scale-out Lustre storage system. With this configuration, our systems offer maximum application flexibility through Docker while simultaneously supporting the high-performance storage requirements of many types of machine learning workloads through a connection to our Lustre ecosystem.

Level: All technical
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Tools & Libraries; Supercomputing & HPC

Day: Tuesday, 04/05
Time: 13:00 - 13:50
Location: Room 212B

S6902 - Virtual Reality: You Are Here

David Luebke Vice President Graphics Research, NVIDIA
Highly-Rated Speaker
David Luebke helped found NVIDIA Research in 2006 after eight years teaching computer science on the faculty of the University of Virginia. David is currently Vice President of Graphics Research at NVIDIA. His personal research interests include virtual and augmented reality, display technology, ray tracing, and graphics architecture. His honors include the NVIDIA Distinguished Inventor award, the NSF CAREER and DOE Early Career PI awards, and the ACM Symposium on Interactive 3D Graphics "Test of Time Award". David has co-authored a book, a SIGGRAPH Electronic Theater piece, a major museum exhibit visited by over 110,000 people, an online course on parallel computing that has reached over 80,000 students, and dozens of papers, articles, chapters, and patents on computer graphics and GPU computing.

NVIDIA Research reviews the technology, the components, and the challenges of virtual reality. We describe how GPUs are addressing these challenges, and our vision for the future of VR.

Level: All technical
Type: Talk
Tags: Virtual Reality & Augmented Reality; Press-Suggested Sessions: Virtual Reality

Day: Tuesday, 04/05
Time: 13:00 - 13:50
Location: Room LL20C

S6157 - Effective Evaluation of Betweenness Centrality on Multi-GPU Systems

Massimo Bernaschi Director of Technology, National Research Council of Italy
Highly-Rated Speaker
Massimo Bernaschi is with CNR, the National Research Council of Italy as Chief Technology Officer of the Institute for Applied Computing. He is also an Adjunct Professor of Systems Programming at "Sapienza" University in Rome; Trainer in Digital Forensics at "Sapienza" and Modena Universities. Before joining CNR in 1998, Massimo worked ten years at the IBM European Center for Scientific and Engineering Computing where he developed the IBM PVMe product and received two Outstanding Technical Achievement Awards. His main scientific interests are parallel computing; modelling of complex systems (finance and biology); systems and network security; high performance computing. Massimo is the author of over 150 papers in peer-reviewed journals and international conferences.

Learn how to use (multi-)GPU systems and CUDA to speed up ranking the importance of each node in a large-scale network. You'll see how to solve an extraordinary challenge, the exact computation of betweenness centrality, by using as building blocks relatively simple algorithms, like breadth-first search (BFS), that have been highly tuned for the latest generation of GPU cards. Our approach is fully scalable and overcomes the limit on the size of the graph that can be studied on a single GPU. We'll present results obtained on both synthetic and real-world graphs.
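
For orientation, here is what one level-synchronous BFS step looks like in CUDA over a CSR graph. This is a minimal illustrative sketch of the kind of building block the session tunes; the names, the CSR layout, and the frontier-by-depth scheme are our assumptions, not the presenters' code:

    // One BFS level over a CSR graph: each thread owns a vertex and expands
    // it only if that vertex sits on the current frontier (depth == level).
    __global__ void bfsLevel(const int *rowPtr, const int *colIdx,
                             int *depth, int level, int numVertices,
                             int *frontierNotEmpty)
    {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= numVertices || depth[v] != level) return;
        for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
            int w = colIdx[e];
            if (depth[w] == -1) {              // first time we reach w
                depth[w] = level + 1;
                *frontierNotEmpty = 1;         // request another level
            }
        }
    }

The host relaunches this kernel once per level until frontierNotEmpty stays zero; Brandes-style betweenness centrality additionally accumulates shortest-path counts during this forward sweep and dependency values in a backward sweep.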

Level: Intermediate technical
Type: Talk
Tags: Algorithms; Performance Optimization

Day: Tuesday, 04/05
Time: 13:30 - 13:55
Location: Marriott Salon 3

S6362 - CNN Based Object Detection in Large Video Images

Tao Wang Chief Scientist, iQIYI ltd. Corp.
Dr. Tao Wang is chief scientist of iQIYI ltd. Corp., the biggest video sharing platform in China, where he works on computer vision and multimedia software applications. He received his Ph.D. in computer science from Tsinghua University in 2003. Tao then worked as a senior researcher in Intel Labs China. He has published more than 60 papers in IJCV, CVPR, CIVR, ICME, and ACM multimedia.

Object detection in real-world video is more challenging than on curated image datasets. We'll present CNN-based object detection research on iQIYI's large collections of images and videos, which is used for content-based ad recommendation.

Level: All technical
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Computer Vision & Machine Vision

Day: Tuesday, 04/05
Time: 13:30 - 13:55
Location: Room 210H

S6420 - Parallel Silence Coding Algorithms for Seismic Data Compression on GPUs

John Cheng Research Scientist, BGP
John is a research scientist with deep industry experience in high-performance computing. He has developed GPU-based seismic imaging products and many parallel applications on heterogeneous computing platforms. John is the author of several books, including Professional CUDA C Programming (Wiley, 2014). With extensive experience in both academic research and industry development, he is gifted at making complex subjects accessible to readers with a concise and illustrative approach. John earned his doctoral degree in computational intelligence from the Tokyo Institute of Technology.

Join industry experts for a discussion of a novel parallel silence coding algorithm on GPUs. Silence coding, combined with Huffman coding to form a lossless scheme, is extensively used in seismic data compression. It is inherently a serial procedure and not easy to parallelize on GPUs. In this session, you'll learn how to convert the sequential computation into a parallel one through prefix-scan operations, a key primitive in many parallel algorithms, and how to quickly implement your kernels using NVIDIA CUB, a library of high-performance parallel primitives and reusable components for every layer of the CUDA programming model. Concepts and performance are illustrated through examples that adjust alternative algorithmic strategies provided in CUB.
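
As a taste of the CUB primitive the abstract refers to, here is the library's standard two-call pattern for an exclusive prefix sum; the surrounding silence-coding logic is omitted, and the flag/offset framing is our own illustrative assumption:

    #include <cub/cub.cuh>

    // Given one 0/1 flag per sample marking where a run starts, an exclusive
    // prefix sum turns the flags into output offsets, so every run can be
    // written to the compressed stream independently and in parallel.
    void computeRunOffsets(const int *d_flags, int *d_offsets, int num_items)
    {
        void  *d_temp = nullptr;
        size_t temp_bytes = 0;
        // First call only queries how much temporary storage the scan needs.
        cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_flags, d_offsets, num_items);
        cudaMalloc(&d_temp, temp_bytes);
        // Second call performs the scan itself.
        cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_flags, d_offsets, num_items);
        cudaFree(d_temp);
    }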

Level: Intermediate technical
Type: Talk
Tags: Energy Exploration; Performance Optimization

Day: Tuesday, 04/05
Time: 13:30 - 13:55
Location: Marriott Salon 1

S6563 - Where Tegra Meets Titan: Asymmetric Computer Vision for Smartphones and Robotics

Tom Drummond Professor, Monash University
Tom Drummond has been a principal investigator on several EU Framework projects and is a chief investigator in the ARC Centre of Excellence for Robotic Vision. Tom studied mathematics for his B.A. at the University of Cambridge. In 1989, he emigrated to Australia and worked for CSIRO in Melbourne for four years before moving to Perth for his Ph.D. in computer science at Curtin University. In 1998, he returned to Cambridge as a postdoctoral research associate, was subsequently appointed a university lecturer, and was later promoted to senior university lecturer. In 2010, he returned to Melbourne and took up a professorship at Monash University.

This presentation will argue that battery life and thermal limits will prevent small mobile devices from implementing the next generation of visual processing algorithms without external assistance from high performance computing. Several innovative methods of distributing these problems between lightweight and high-powered nodes will be explored for a number of visual processing applications relevant to smartphones and robotics. We'll illustrate how these problems can be mapped onto the thread model of GPUs and will present a couple of CUDA tricks used to maximize efficiency.

Level: All technical
Type: Talk
Tags: Computer Vision & Machine Vision; Robotics & Autonomous Machines; Virtual Reality & Augmented Reality

Day: Tuesday, 04/05
Time: 13:30 - 13:55
Location: Room 210F

S6577 - Fighting Infections and Antimicrobial Resistance Through GPU-Accelerated In Silico Models

Radu Marculescu Professor of Electrical & Computer Engineering, Carnegie Mellon University
Dr. Radu Marculescu is a professor in the Electrical and Computer Engineering Department at Carnegie Mellon University. He has received several Best Paper Awards in top conferences and journals covering design automation of integrated systems and embedded systems. Radu currently serves as the editor-in-chief of Foundations & Trends of Electronic Design Automation and as an associate editor of the Elsevier Journal of Nano Communication Networks. Radu has been involved in organizing many symposia and conferences, and has been guest editor of special issues in archival journals and magazines. His current research focuses on cyber-physical, social, and biological systems. He is an IEEE Fellow.

Learn the core principles behind cell-cell communication and understand the use of in silico models and simulation algorithms needed to evaluate the dynamics of heterogeneous microbial populations. Explore the pathogens' inter-cellular network, its dynamics and contribution to biofilm formation. See how the newest GPU-based platforms can enable highly parallel simulations with performance gains of orders of magnitude over existing CPU-only solutions. Understand how the application of social and network sciences to understanding bacterial population dynamics can aid in developing new treatments and better drugs to control the many pathogenic bacteria that use social interactions to cause infections and antimicrobial resistance.

Level: Beginner technical
Type: Talk
Tags: Computational Biology; Supercomputing & HPC; Press-Suggested Sessions: HPC & Science

Day: Tuesday, 04/05
Time: 13:30 - 13:55
Location: Marriott Salon 5

S6583 - WetBrush: GPU-Based 3D Painting Simulation at the Bristle Level

Zhili Chen 3D Graphics Researcher, Adobe Research
Zhili Chen is a 3D graphics researcher at Adobe. He got his Ph.D. in computer science at The Ohio State University in 2015. His research interests include physically based simulation, real-time graphics, 3D reconstruction, and virtual reality.
Byungmoon Kim Software Engineer, Adobe Research
Byungmoon Kim worked at Creative Technology, leading development of MIDI software. He then moved to the US to study at the Georgia Institute of Technology, with broad interests that resulted in three master's degrees, in aerospace engineering, mathematics, and computer science, before he received his Ph.D. in computer science. His research included robot control, spacecraft control and experiments, collision detection, geometry mesh filtering, some haptics, and fluid simulations. After his Ph.D., he worked at NVIDIA as a software engineer on general DirectX driver development, anti-aliasing, driver ambient occlusion, and some advanced stereo features. He later joined Adobe Research, working on the Flash scene graph and physics engine, implicit/explicit hybrid mesh repair for 3D printing, octree/quadtree simulation, interactive selection tools, painting simulation, the face-aware liquify warp tool, and a number of other research projects.

We built a real-time oil painting system that simulates the physical interactions among brush, paint, and canvas at the bristle level, entirely in CUDA. To simulate sub-pixel paint details given limited computational resources, we define the paint liquid in a hybrid fashion: liquid close to the brush is modeled by particles, and liquid away from the brush is modeled by a density field. Based on this representation, we develop a variety of techniques to ensure the performance and robustness of our simulator under large time steps, including brush and particle simulations in non-inertial frames, a fixed-point method for accelerating Jacobi iterations, and a new Eulerian-Lagrangian approach for simulating detailed liquid effects.

Level: Intermediate technical
Type: Talk
Tags: Real-Time Graphics; Computational Fluid Dynamics

Day: Tuesday, 04/05
Time: 13:30 - 13:55
Location: Room 210E

S6628 - Co-Designing GPU-Based Systems and Tools for Numerical Weather Predictions

Thomas Schulthess Director, Swiss National Supercomputing Centre
Thomas Schulthess is director of the Swiss National Supercomputing Centre (CSCS) and a professor for computational physics at ETH Zurich. He received his Ph.D. in physics in 1994. Since 2010, he has taken interest in refactoring climate codes to take advantage of novel, energy-efficient computing architectures.
Carlos Osuna Scientific Developer, MeteoSwiss, Zurich
Carlos Osuna is a scientific software developer at MeteoSwiss, Zurich. Since 2011, he has been involved in research projects at ETH Zurich and MeteoSwiss, refactoring dynamical cores of weather codes using DSLs to port legacy codes to GPUs and provide performance portable applications. He received his Ph.D. in experimental high energy physics in 2003.

We'll discuss the hardware-software co-design project behind the most cost- and energy-efficient system for numerical weather prediction: an appliance based on the Cray CS-Storm system architecture that is loaded with NVIDIA K80 GPUs and has been operated on behalf of MeteoSwiss by CSCS since October 2015.

Level: Intermediate technical
Type: Talk
Tags: Earth System Modelling; Press-Suggested Sessions: HPC & Science

Day: Tuesday, 04/05
Time: 13:30 - 13:55
Location: Room 211A

S6778 - Scaling Human Vision with GPUs

David Luan CEO, Founder, Dextro
David Luan is the co-founder of Dextro, a video analysis company based in NYC that uses machine learning and computer vision to help companies with large video collections easily understand, categorize, search, and visually transcribe their content—without depending on text metadata. David previously built computer vision systems at iRobot's military research group, and commercialized cutting-edge research as a Thiel Fellow.

Searching, filtering, and running aggregations on video at scale requires tools to enable humans to apply rich queries to the dataset, and get accurate answers quickly. Discover how users of Dextro's GPU-powered computer vision platform are able to analyze huge video datasets and extract meaningful answers in a matter of seconds, and how we apply this technique to train and create new categories on the fly. See live demos of Dextro's platform applied to user-generated video and security footage datasets.

Level: All technical
Type: Talk
Tags: Intelligent Video Analytics (IVA); Deep Learning & Artificial Intelligence; Media & Entertainment

Day: Tuesday, 04/05
Time: 13:30 - 13:55
Location: Room LL20D

S6856 - Audi Autonomous Braking with a 3D Monovision Camera

Matthias Rudolph Director Architecture Driver Assistance Systems, Audi AG
Dr. Rudolph studied electrical engineering at the University of Kassel and received his Ph.D. in aerospace engineering and engineering mechanics, with a minor in mathematics, from Iowa State in 1999. After holding various positions at Audi, he took over the lead of the department "Architecture Driver Assistance Systems" in 2009. The zFAS project is one of the department's core developments. Dr. Rudolph is a member of the management at Audi.

To fulfill the Euro NCAP requirements, an autonomous braking system has to be developed. The emergency braking system is designed to brake for pedestrians as well as in car-to-car scenarios. We'll explain how the functional logic was developed and what has to be done to reach a zero-false-positive goal with excellent field performance. Audi was the first OEM to reach this goal with a single 3D mono-vision camera, developing the first ASIL-B camera together with our supplier Kostal; the architecture of the 3D camera is explained as well.

Level: All technical
Type: Talk
Tags: Self-Driving Cars & Automotive

Day: Tuesday, 04/05
Time: 13:30 - 13:55
Location: Room LL21E

Talk

HANGOUT

Presentation
Details

H6104 - Data Center Management and Monitoring

Andrew Iles DC Management Software, NVIDIA
Andrew Iles manages the data center management software team at NVIDIA. He has spent 14 years at NVIDIA working on automation, cluster, and enterprise management products.

This hangout covers data center and cluster management. Experts will answer questions about NVIDIA-provided tools for GPU monitoring, scheduling, and general management in data center environments. The hangout can also cover questions about GPU integration into third-party infrastructure, including common HPC and hyperscale environments.
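
For attendees who want something concrete to ask about, here is a minimal sketch of querying one GPU through NVML, the monitoring library underneath nvidia-smi (error handling trimmed for brevity):

    #include <stdio.h>
    #include <nvml.h>

    int main(void)
    {
        nvmlInit();                                  // start the NVML session
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(0, &dev);         // first GPU in the system

        unsigned int temp;
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);

        nvmlUtilization_t util;                      // GPU and memory utilization in %
        nvmlDeviceGetUtilizationRates(dev, &util);

        printf("temp=%uC gpu=%u%% mem=%u%%\n", temp, util.gpu, util.memory);
        nvmlShutdown();
        return 0;
    }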

Level: All technical
Type: Hangout
Tags: Data Center & Cloud Computing

Day: Tuesday, 04/05
Time: 14:00 - 15:00
Location: Pod A

H6148 - Hangout: CUDA for HPC Simulation and Visualization

Rama Hoetzlein Graphics Research Engineer, NVIDIA
Rama Karl Hoetzlein is a graphics research engineer and devtech with NVIDIA, focusing on professional graphics and simulation. His background includes studies in computer science and fine arts at Cornell University in 2001, and media arts at University of California Santa Barbara in 2010, with a focus on procedural modeling. In 2013 he created Fluids v.3, an open source simulator for efficient particle fluids, and currently investigates sparse volumes at NVIDIA.

This hangout shares ideas, methods, and techniques for using NVIDIA hardware in high performance computing for computational dynamics and for in-situ data visualization. The format will consist of informal discussion based on audience interest, backed by content in slide decks as needed. Topics may include CUDA methodology for HPC visualization and offerings from NVIDIA covering IndeX, GVDB, and OptiX. Special topics may be covered based on interest, such as accelerated grid-based or Lagrangian simulation, with a focus on NVIDIA hardware.

Level: All technical
Type: Hangout
Tags: Computational Fluid Dynamics

Day: Tuesday, 04/05
Time: 14:00 - 15:00
Location: Pod B

H6150 - Hangout: OptiX Ray Tracing Library: Best practices and Use-Case Consultation

R. Keith Morley Dev-Tech Engineer, NVIDIA
Keith Morley was a founding developer of the OptiX ray tracing library and is currently an engineer on NVIDIA's development technology team. His background is in physically based rendering and film rendering.
Dylan Lacewell Dev-Tech Engineer, NVIDIA
Dylan Lacewell is an engineer on NVIDIA's development technology team. His background is in film rendering.

Engineers from the OptiX development technology team will be available for one-on-one discussions about the OptiX ray tracing API. Feel free to bring any general questions you have regarding OptiX, or receive consultation specific to your product's use of OptiX. Example topics: optimization of your OptiX-based application, usage of callable programs for flexible shader design, and best practices for implementing physically based renderers.

Level:
Type: Hangout
Tags: Rendering & Ray Tracing

Day: Tuesday, 04/05
Time: 14:00 - 15:00
Location: Pod C

Hangout

HANDS-ON LAB

Presentation
Details

L6108 - Kokkos, Manycore Performance Portability Made Easy for C++ HPC Applications

H. Carter Edwards Principal Member of Technical Staff, Sandia National Laboratories
Highly-Rated Speaker
H. Carter Edwards has over three decades of experience developing software for simulations of a variety of engineering domains. He has been researching and developing software for HPC algorithms and data structures for the past 16 years at Sandia National Laboratories. An expert in high performance computing, he's currently focusing on thread-scalable algorithms and data structures for heterogeneous many-core architectures, such as NVIDIA GPU, AMD Fusion, and Intel Xeon Phi. He has a B.S. and M.S. in aerospace engineering from the University of Texas at Austin, and worked for 10 years at the Johnson Space Center in the domain of spacecraft guidance, navigation, and control. He has a Ph.D. in computational and applied mathematics, also from the University of Texas at Austin.
Christian Trott Senior Member Technical Staff, Sandia National Laboratories
Christian Trott is a high performance computing expert with experience in designing and implementing software for GPU and MIC compute clusters. He earned a Dr. rer. nat. in theoretical physics from the University of Technology Ilmenau. His prior scientific work focused on computational materials research using ab initio calculations, molecular dynamics simulations, and Monte Carlo methods. As of 2015, Christian is a senior member of technical staff at Sandia National Laboratories. He is a core developer of the Kokkos programming model, with a large role in advising applications on adopting Kokkos to achieve performance portability for next-generation supercomputers.
Jeff Amelang Visiting Professor, Harvey Mudd College
Jeff Amelang focuses on teaching high performance computing technologies and techniques in a variety of contexts. He has taught courses, tutorials, and workshops at several national labs as well as US and international universities. He obtained his MS and PhD in Mechanical Engineering from the California Institute of Technology, with a focus on Computational Science and Engineering. Currently serving as a Visiting Professor at Harvey Mudd College, his favorite courses to teach are on distributed and GPU programming.

The Kokkos C++ library enables development of HPC scientific applications that are performance portable across disparate manycore architectures such as NVIDIA Kepler, AMD Fusion, and Intel Xeon Phi. Kokkos leverages the CUDA 7.5 device lambda capability to provide a highly intuitive and easy-to-use parallel programming model. Kokkos simplifies data management for heterogeneous memory (CPU, GPU, UVM, etc.) through a unique polymorphic multidimensional array view interface. View polymorphism includes mutable multidimensional layout, transparent overloads for atomic operations, and simplified access to GPU texture hardware. Kokkos' advanced features culminate in portable team parallelism that maps efficiently onto CUDA grids, blocks, and shared memory. This lab utilizes GPU resources in the cloud; you are required to bring your own laptop.
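
To give a flavor of the programming model described above, here is a minimal self-contained sketch (our own example, not one of the lab's exercises): a Kokkos::View allocates in the memory space of the active backend, and KOKKOS_LAMBDA bodies compile to CUDA device lambdas when the CUDA backend is enabled.

    #include <Kokkos_Core.hpp>

    int main(int argc, char *argv[])
    {
        Kokkos::initialize(argc, argv);
        {
            const int N = 1 << 20;
            // Views allocate in the default execution space's memory (GPU under CUDA).
            Kokkos::View<double *> x("x", N), y("y", N);

            // Fill both vectors in parallel on the device.
            Kokkos::parallel_for(N, KOKKOS_LAMBDA(const int i) {
                x(i) = 1.0;
                y(i) = 2.0;
            });

            // Dot product via a parallel reduction.
            double dot = 0.0;
            Kokkos::parallel_reduce(N, KOKKOS_LAMBDA(const int i, double &lsum) {
                lsum += x(i) * y(i);
            }, dot);
        }
        Kokkos::finalize();
        return 0;
    }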

Level: Beginner technical
Type: Hands-on Lab
Tags: Supercomputing & HPC; Tools & Libraries

Day: Tuesday, 04/05
Time: 14:00 - 17:00
Location: Room 210A

Hands-on Lab

TALK

Presentation
Details

S6210 - NVIDIA GRID™ and Dassault Catia from Proof of Concept to Production

Fred Devoir Sr. Architect & Manager of IT Infrastructure, Textron Inc.
Fred Devoir is a senior systems architect and manager of IT infrastructure at Textron Inc. Fred has a wide variety of specialized and business systems experience with particular interests in integration and virtualization projects specifically centered around virtual desktop infrastructure (VDI), graphics acceleration, and high performance computing clusters. His past experience includes 22 years of IT professional work experience as an IT manager, senior systems analyst, engineer, and architect for Fortune 500 companies in the aerospace/defense, engineering, medical, and pharmaceutical industries.
Chris Savage Infrastructure Operations Manager, Bell Helicopter
Chris Savage joined Bell Helicopter in 2011 as Disaster Recovery Program Manager following 15 years in IT and BC/DR management. He currently serves as Infrastructure Operations Manager. He earned his B.Sc. degree in Emergency Administration and Planning from the University of North Texas, and holds an MBCP certification from DRI.

Join us for a technical discussion of NVIDIA GRID-accelerated virtual desktop infrastructure to support Dassault Catia workloads, and what it takes to move the infrastructure from proof of concept to production. Learn how to tune Catia for use on virtual desktops, including how to optimize the NVIDIA GRID graphics drivers in Windows to deliver Catia workloads. Learn about frame rate limiters and other performance optimization settings available in the infrastructure. Learn how persona management can assist with concurrent user deployments, and what data must be saved with the user's personalization settings in order to retain the Catia settings.

Level: Intermediate technical
Type: Talk
Tags: Graphics Virtualization; Product & Building Design

Day: Tuesday, 04/05
Time: 14:00 - 14:50
Location: Marriott Salon 4

S6246 - Digital Actors at MPC: Bridging the Uncanny Valley with GPU Technology

Damien Fagnou Global Head of VFX Operations, MPC
Highly-Rated Speaker
Damien Fagnou is the global head of VFX Operations at MPC, where he brings together his expertise in software and production to evolve and refine the creation processes across all feature film VFX work. After finishing university with an M.S. in computer science in France, he worked for an animated series implementing the technology to speed up the motion capture pipeline and rendering. He later accepted a job to help set up the workflow at Attitude studios and then took on the role of Tools and Workflow Programmer at Climax in the U.K. In 2003, he transferred his skills to the film industry and started at leading VFX post-production studio MPC to work on Troy, implementing preview tools and city rendering scripts. In 2005, Damien became R&D lead on Charlie and the Chocolate Factory, 10,000 BC, and Narnia. He then moved closer to production and became MPC's stereographer working on movies, including Pirates of the Caribbean: On Stranger Tides, the Harry Potter films, and Prometheus. After a few years in production, he returned to his software roots and became global head of Software overseeing software development efforts across the company.

Discover the next generation of GPU-enabled facial rigs for digital actors at MPC. Through a mixed approach of linear deformers and non-linear analysis, MPC aims to improve the performance and appearance of its digital actors and improve upon the state of the art in the visual effects industry. You'll learn from industry experts how MPC is using the latest fabric engine technology to ease the transition to GPUs, enabling fast drawing of characters and fast parallel computation of deformers on CUDA.

Level: Intermediate technical
Type: Talk
Tags: Media & Entertainment; Performance Optimization; Algorithms

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Room LL21C

S6334 - 10X Faster Transparency from Low Level Shader Optimisation

Pyarelal Knowles Student, RMIT University
Pyarelal Knowles is a PhD student at RMIT University, Melbourne, with research interests in real-time computer graphics and physics simulations. He completed his Bachelor of IT (Games and Graphics Programming) in 2008, before a Comp. Sci. (Honours) year in 2009 at RMIT.

We discuss techniques to improve the sorting performance of many small lists, enabling much faster order-independent transparency, a problem with elements of both compute and graphics, which are quickly converging. The focus is on technical issues encountered, such as occupancy and memory-hierarchy performance, a comparison between GLSL shaders and CUDA, and some discussion of GPU evolution.
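
For context, order-independent transparency typically gathers a short per-pixel list of fragments that must be sorted by depth before compositing, and for such small lists a simple per-thread insertion sort in registers or local memory often wins. A minimal illustrative sketch (our assumption of the setup, not the speaker's code):

    // Sort one pixel's fragment list by depth. keys hold fragment depths,
    // vals hold packed colors; count is typically small (e.g., 8-32).
    __device__ void sortFragments(float *keys, unsigned int *vals, int count)
    {
        for (int i = 1; i < count; ++i) {
            float        k = keys[i];
            unsigned int v = vals[i];
            int j = i - 1;
            while (j >= 0 && keys[j] > k) {   // shift larger depths right
                keys[j + 1] = keys[j];
                vals[j + 1] = vals[j];
                --j;
            }
            keys[j + 1] = k;
            vals[j + 1] = v;
        }
    }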

Level: Intermediate technical
Type: Talk
Tags: Real-Time Graphics; Algorithms; Performance Optimization

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Room 210E

S6337 - GPU-Accelerated Graph Query for Cyber Applications

Jim Carbonaro Senior Software Engineer, Blazegraph
Jim Carbonaro is subject matter expert for integration and scaling of Blazegraph solutions with real-time analytic processing frameworks, including Spark, Scala, Storm, Kafka, GraphX, and Redis. He is a lead developer of DASL and DASL algorithms for large-scale graph analytics. He led recent work to compare performance of Apache Spark GraphX with Blazegraph-accelerated graph analytics.

Cyberspace is a critical domain for government and commercial organizations. It is about networks and devices and how they interact; graphs model nodes and links and how they are connected. Defending the critical networks in cyberspace requires processing and analyzing extremely large quantities of graph data in near-real time. Key cyber analytics and datasets, ranging from topological vulnerability analysis and traffic flow analysis to network attack graphs, are graphs. This session will discuss how Blazegraph GPU meets this challenge by delivering near-real-time performance at very large data scales, using a flexible and updatable graph representation to support complex analytics, and supporting existing graph frameworks (RDF, Tinkerpop) and query languages (SPARQL).

Level: Intermediate technical
Type: Talk
Tags: Aerospace & Defense; Big Data Analytics; Deep Learning & Artificial Intelligence

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Marriott Salon 2

S6345 - Advances in V-Ray RT GPU

Vladimir Koylazov CTO, Chaos Software Ltd.
Highly-Rated Speaker
Vladimir Koylazov (Vlado) has more than 15 years of software development experience, the majority of which he spent developing and improving the render engine V-Ray. Passionate about 3D graphics and programming, Vlado is the driving force behind Chaos Group's software solutions. Vladimir is CTO of Chaos Software and one of the original creators of the V-Ray renderer.
Blagovest Taskov Lead Developer, Chaos Group
Blagovest Taskov is the lead of the V-Ray RT GPU developer team. He has been working on some of the latest advancements in V-Ray RT GPU, including improved OpenCL support, performance optimizations, and many rendering features.

The talk describes recent advances in the V-Ray RT GPU raytracer for production rendering. With V-Ray 3.0 we will be offering a host of new features, optimizations, and improvements that will simplify artists' workflow while offering advanced capabilities and great speed improvements. One of the key improvements will be the simplified workflow.

Level: Advanced technical
Type: Talk
Tags: Rendering & Ray Tracing; Product & Building Design

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Room LL21B

S6357 - Towards Efficient Communication Methods and Models for Scalable GPU-Centric Computing Systems

Holger Froening Associate Professor, Ruprecht-Karls University of Heidelberg
Holger Froening is an associate professor at the Department of Mathematics and Computer Science at the Ruprecht-Karls University of Heidelberg (Germany), and leads the Computer Engineering Group at the Institute of Computer Engineering. His research interests include parallel computing, computer architecture, interconnection networks, and hardware design, with a recent focus on application-specific heterogeneous computing, data movement optimizations, and associated power and energy aspects.

GPUs are used pervasively for many reasons, including performance increases and improved energy efficiency, and their ubiquity in high performance computing results in a strong need to optimize data movement between GPUs at the cluster level. Existing communication models and methods are designed for CPUs, however. We'll point out the limitations of applying traditional techniques to GPUs, and how to overcome them to support fully GPU-centric traffic sourcing and sinking. Our experiments show that such specialized communication models and methods provide substantial advantages in terms of energy and time. We observe that besides specialization in computing, we also need specialization in communication for utmost performance and energy efficiency.

Level: Intermediate technical
Type: Talk
Tags: Supercomputing & HPC; Performance Optimization

Day: Tuesday, 04/05
Time: 14:00 - 14:50
Location: Room 212A

S6384 - NVIDIA CUDA® for Mobile

Yogesh Kini Manager, CUDA System Software, NVIDIA
Yogesh Kini manages the Tegra CUDA driver team at NVIDIA. For the last four years, he has been working on enabling GPU compute software on different Tegra platforms. His team is responsible for the CUDA API and system software on various embedded, automotive, and mobile platforms based on Tegra SoCs. He holds a B.S. from Manipal Institute of Technology, India.

This session covers a few important mobile use cases that can be accelerated using CUDA, including image processing and camera output post-processing. Attendees will learn about: [1] the Tegra unified memory architecture, which applications can utilize to reduce total memory usage and power consumption; [2] CUDA interoperability with EGLImage; [3] the use of EGLStreams to set up image processing pipelines with CUDA; and [4] Tegra-specific enhancements to CUDA-OpenGL(ES) interop.
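
To make point [1] concrete, here is a minimal hedged sketch of how a managed allocation lets the CPU and GPU stages of a pipeline share one buffer on Tegra's physically unified memory; the trivial invert kernel stands in for a real processing stage and is our own placeholder:

    #include <cuda_runtime.h>

    __global__ void invert(unsigned char *img, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) img[i] = 255 - img[i];          // placeholder processing stage
    }

    int main(void)
    {
        const int n = 1920 * 1080;
        unsigned char *img = nullptr;
        cudaMallocManaged(&img, n);                // one allocation, visible to CPU and GPU

        for (int i = 0; i < n; ++i) img[i] = i & 0xFF;   // CPU produces the frame

        invert<<<(n + 255) / 256, 256>>>(img, n);        // GPU processes it in place
        cudaDeviceSynchronize();                         // results now visible to the CPU

        cudaFree(img);
        return 0;
    }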

Level: Intermediate technical
Type: Talk
Tags: Computer Vision & Machine Vision; Video & Image Processing; Tools & Libraries

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Room 210F

S6401 - Towards Interactive Visual Exploration of Massively Parallel Programs Using a Domain-Specific Language

Tobias Klein Student, TU Vienna / KAUST
Tobias Klein is an M.S. student at TU Vienna working under the direction of Professor Eduard Groller. He is visiting the Visual Computing Center at KAUST for his M.S. thesis research work on the interactive visualization and analysis of massively parallel GPU programs in the context of the KAUST NVIDIA CUDA Research Center, in collaboration with Dr. Peter Rautek and Professor Markus Hadwiger.

Learn about the world of visual exploration of massively parallel programs. We describe our framework for interactive program visualization to help you understand program run-time behavior and find the causes of possible slowdowns and bugs. Our framework comprises a simple domain-specific language for annotating OpenCL kernel code, automatic just-in-time compilation of the necessary code instrumentation, and interactive visualization capabilities. Our problem-specific code annotation facilitates user-centered analysis. We describe a variety of interactive visualizations using the well-known D3 framework, providing insight into the program's structure, execution, and memory accesses. We describe several use cases that show the program visualization capabilities of our approach in action.

Level: Intermediate technical
Type: Talk
Tags: Tools & Libraries; Performance Optimization; Programming Languages

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Room 211B

S6414 - Structure-Preserving Smoothing for Seismic Amplitude Data by Anisotropic Diffusion Using GPGPU

Joner Duarte R&D Software Engineer, Tecgraf/PUC-Rio
Joner Duarte is a researcher in the computational geophysics group at Tecgraf/PUC-Rio. He received his M.Sc. in computer graphics from the Pontifical Catholic University of Rio de Janeiro (2012). He currently researches high performance computing and HCI for virtual reality applied to geophysics.

In this session, we present a new method for attenuating noise in seismic data while preserving structural features. Our approach uses anisotropic diffusion to filter the seismic amplitude data, which involves solving a large linear system. Moreover, we use seismic attributes that represent horizons and faults as restrictions on the diffusion process, and an implicit method for solving the diffusion equation. Using GPGPU to accelerate the linear system solution allows the seismic filtering to run at interactive rates, providing fine adjustment of input parameters before processing the whole dataset.
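
The abstract does not spell out the solver, but an implicit diffusion step of the kind described leads to a sparse, diagonally dominant linear system that relaxes well with a GPU-friendly Jacobi iteration. A simplified 2D sketch under our own assumptions (the edge conductivities cN/cS/cE/cW would come from the anisotropy and the attribute-based restrictions; boundary handling is omitted):

    // One Jacobi sweep for an implicit diffusion step (I - dt*L) u = b on a
    // w x h grid, with per-pixel edge conductivities toward each neighbor.
    __global__ void jacobiSweep(const float *b, const float *uOld, float *uNew,
                                const float *cN, const float *cS,
                                const float *cE, const float *cW,
                                float dt, int w, int h)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x <= 0 || y <= 0 || x >= w - 1 || y >= h - 1) return;
        int i = y * w + x;
        float diag = 1.0f + dt * (cN[i] + cS[i] + cE[i] + cW[i]);
        float nbr  = dt * (cN[i] * uOld[i - w] + cS[i] * uOld[i + w] +
                           cE[i] * uOld[i + 1] + cW[i] * uOld[i - 1]);
        uNew[i] = (b[i] + nbr) / diag;             // classic Jacobi update
    }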

Level: All technical
Type: Talk
Tags: Energy Exploration; Tools & Libraries

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Marriott Salon 1

S6424 - Exploring Scalable Implementations of Triangle Enumeration in Graphs of Diverse Densities: Apache Spark vs. GPUs

Michela Taufer Associate Professor, University of Delaware
Michela Taufer joined the University of Delaware in 2007, where she was promoted to associate professor with tenure in 2012. She earned her M.S. in computer engineering from the University of Padova and her Ph.D. in computer science from the Swiss Federal Institute of Technology (ETH). She was a post-doctoral researcher supported by the La Jolla Interfaces in Science Training Program (also called LJIS) at UC San Diego and the Scripps Research Institute. Before she joined the University of Delaware, Michela was faculty in computer science at the University of Texas at El Paso.
Travis Johnston Postdoctoral Researcher, University of Delaware
Travis Johnston is a post-doctoral researcher working with Dr. Michela Taufer in the Global Computing Laboratory at the University of Delaware. He is working on several projects centered on big data analytics for scientific computation.

We'll present graphs as powerful tools for analyzing complex relationships between entities, and share how many structures common in computer science, like social networks, computer networks, and the world wide web, can be modeled as graphs. Since many real graphs are very large and complex, the associated analysis algorithms must be very efficient and highly parallel. We present two implementations of a key graph-based analysis, triangle enumeration, for two different parallel paradigms: GPU programming and Apache Spark. We'll reveal how the performance of the two implementations compares as the characteristics of the graph change.
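
As a reference point for what is being parallelized, triangle enumeration over a CSR graph reduces to intersecting the adjacency lists of each edge's endpoints, which maps naturally onto one GPU thread (or one Spark task) per edge. A minimal sequential sketch under our own conventions, not the presenters' implementations (adjacency lists sorted and restricted to higher-numbered neighbors so each triangle is found exactly once):

    #include <cstdint>
    #include <vector>

    std::uint64_t countTriangles(const std::vector<int> &rowPtr,
                                 const std::vector<int> &colIdx)
    {
        std::uint64_t total = 0;
        int n = static_cast<int>(rowPtr.size()) - 1;
        for (int u = 0; u < n; ++u)                       // parallelize over vertices or edges
            for (int e = rowPtr[u]; e < rowPtr[u + 1]; ++e) {
                int v = colIdx[e];
                int a = rowPtr[u], b = rowPtr[v];
                while (a < rowPtr[u + 1] && b < rowPtr[v + 1]) {
                    if      (colIdx[a] < colIdx[b]) ++a;  // merge-style intersection
                    else if (colIdx[a] > colIdx[b]) ++b;
                    else { ++total; ++a; ++b; }           // common neighbor closes a triangle
                }
            }
        return total;
    }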

Level: Beginner technical
Type: Talk
Tags: Algorithms; Tools & Libraries; Big Data Analytics

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Marriott Salon 3

S6468 - Democratizing Sequencing with Ion Torrent S5 and S5XL DNA Sequencers Powered by GPUs

Mohit Gupta Staff Software Engineer, Thermo Fisher Scientific
Highly-Rated Speaker
Mohit Gupta is working as a staff software engineer in Genetic, Medical and Applied Sciences division of Life Sciences Solutions, a part of Thermo Fisher Scientific Inc. In this capacity, he is responsible for speeding up algorithms used in data analysis of PGM, Proton, S5, and S5XL DNA sequencers, with a particular focus on GPU computing. Previously, Mohit worked as senior research and development engineer with Mirafra Technologies, Bangalore, India, in the area of electronic design automation working on compiler for hardware description languages like Verilog. He holds a B.Tech. in electrical engineering from the Indian Institute of Technology, Bombay, India and an M.S. in computer engineering from University of California, San Diego. He has published or presented in conferences and workshops like ICCAD, GTC, and DFMY.

Learn how GPUs have accelerated the pace of research in targeted DNA sequencing by providing quick turnaround in the analysis of terabytes of raw data generated by Ion Torrent DNA sequencers, like the S5XL, in a matter of hours. We'll showcase our complete signal processing pipeline running on GPUs and share our results, with lessons learned in developing CUDA code for the Kepler and Maxwell architectures. We'll share our experiences using the CUDA Multi-Process Service (MPS). We'll touch upon several examples in clinical research, drug discovery, and human identification that have received a tremendous boost from the speed of our technology, propelled by GPUs.

Level: All technical
Type: Talk
Tags: Computational Biology; Press-Suggested Sessions: HPC & Science

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Marriott Salon 5

S6536 - Building Immersive NVIDIA SHIELD Android TV Games

Luc Beaulieu CTO, Frima Studio
Luc Beaulieu is chief technology officer at Frima, a 350+ employee company with a passion for digital entertainment. He is currently leading the technical innovation and smart toy groups. Luc has over 20 years of experience in video games, online communities, and digital experiences.
Jean-Philippe Doiron Technology Director, Frima Studio
With Frima since 2009, Jean-Philippe benefits from over 15 years of professional experience in the creation of games and web applications. Before joining Frima, he co-founded an Internet multimedia development company, where he acted as a developer for the better part of six years. He also worked for seven years as a consultant, mostly in Canada, the US, and the UK, developing highly optimized, multithreaded database architectures and reporting systems for transit operators. His current responsibilities as Director of Technology include aligning R&D efforts with Frima's technological needs, laying down quality standards, and providing the R&D team with support. Jean-Philippe is an expert in software and game development, profiling and optimization, DirectX, C++, and Flash 3D.

We'll show how game experiences on the SHIELD Android TV can be extended beyond the border of the monitor. Sound is an obvious example, but what about lights? Find out how Frima increased immersion in its Chariot game by adding connectivity between SHIELD and the Philips Hue lighting system. We'll show how the game interacts with the lights and what had to change to make it work. With Chariot being a console game first, the audience will also learn about performance comparisons between the SHIELD TV and new- and old-generation consoles.

Level: All technical
Type: Talk
Tags: Game Development

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Room 212B

S6594 - Connecting the Dots of Emerging Technologies and Real-World Application

Anthony Cortez Senior Designer | Visualization | Lighting Consultant, Arup
Anthony Cortez is the Arup visualization leader in the Americas Region. A graduate of the Art Institute of Pittsburgh, Anthony has been working as a senior designer for the visualization industry for over 18 years. Anthony has worked on numerous architectural and engineering projects in visualization, lighting design, and motion graphics. Projects include the Fulton Center, the Tappan Zee Bridge, JFK JetBlue Terminal 5, YAS Marina Hotel, Singapore Stadium and the Academy of Arts & Science Theater at the Lighthouse International. In addition to traditional 3D visualization, he integrates with other Arup disciplines to produce validated lighting studies, acoustic and pedestrian/vehicle simulations, fire simulations, as well as real-time simulations using cutting-edge gaming technology.

New technologies are enabling us to bring designs to life in ways never before possible. Real-time graphics engines immerse viewers in a virtual environment, providing a perfect tool for showcasing projects and troubleshooting problems during the design phase. Our engineers and graphics experts collaborate to produce interactive walk-throughs showing options for different spaces and environments. As new synergies continue to emerge in the design field, visualization is key to providing a common basis of understanding among multiple design disciplines and project stakeholders. Arup is helping turn best practice into the next practice by using real-time graphics engines as a presentation tool, greatly improving communication and understanding for project teams and clients alike.

Level: All technical
Type: Talk
Tags: Product & Building Design; Virtual Reality & Augmented Reality

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Room LL21A

S6621 - Unified CPU+GPU Programming for the ASUCA Production Weather Model

Michel Müller Research Assistant, Tokyo Institute of Technology
Michel Müller entered the Ph.D. program at the Department of Energy Sciences, Tokyo Institute of Technology, in 2015, supervised by Professor Aoki. He graduated with an M.S. in electrical engineering and information technology from ETH Zurich in 2012. From 2009 to 2014, he worked as a consultant and then a systems architect at ATEGRA AG, Switzerland.

Porting applications to GPUs still requires compromises between time-to-solution, GPU performance, and CPU performance. This often leads to major challenges for large Fortran-based applications like weather and climate models. We'll focus on two of these challenges, whose significance is shown using real-world code examples and performance results: the differing requirements on parallel task granularity, and the differing storage order, between the two architectures. Our proposed solution is a flexible preprocessor framework called "Hybrid Fortran," which has been used to port both the dynamics and physics of ASUCA, one of the Japan Meteorological Agency's current operational weather models. Finally, an even more hands-off solution to GPU portability is proposed in the shape of a black-box solution.
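
To make the storage-order challenge concrete (an illustrative CUDA sketch, not the presenters' code; ASUCA itself is Fortran): a CPU wants the innermost loop of each core to walk the fastest-varying index for cache locality, while a GPU wants consecutive threads, rather than consecutive iterations of one thread's loop, to touch consecutive addresses:

    // i is the fastest-varying index in memory. Mapping i to threadIdx.x
    // makes neighboring threads read neighboring addresses (coalescing);
    // equivalent CPU code would instead loop over i innermost per core.
    __global__ void addFields(const float *a, const float *b, float *c,
                              int ni, int nj, int nk)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i >= ni || j >= nj) return;
        for (int k = 0; k < nk; ++k) {
            int idx = (k * nj + j) * ni + i;
            c[idx] = a[idx] + b[idx];
        }
    }

A portable framework therefore has to be able to transpose both the loop structure and the storage order per target architecture, which is what the granularity and storage-order discussion above is about.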

Level: Intermediate technical
Type: Talk
Tags: Earth System Modelling; Tools & Libraries; Computational Physics; Supercomputing & HPC; OpenACC

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Room 211A

S6695 - Generative Adversarial Networks

Ian Goodfellow Senior Research Scientist, OpenAI
Ian is a Senior Research Scientist. He is the lead author of the textbook Deep Learning (www.deeplearningbook.org). He studies new methods to improve deep learning. His interests include generative models, machine learning in the adversarial setting, and accelerating the training of neural networks. He has contributed to several open source machine learning libraries that leverage CUDA, including Theano, Pylearn2, and TensorFlow.

Generative adversarial networks (GANs) provide a way to generate realistic images with deep neural networks. Compared to other approaches to generative modeling, GANs are able to learn the cost function. This allows them to learn to capture important details that a fixed, manually designed cost function, such as mean squared error, would ignore. Compared to maximum likelihood estimation (MLE), GANs are specialized for the task of generating realistic samples. Both MLE and GANs are consistent statistical estimators, but have different priorities. MLE aims to assign high likelihood to all of the data, but may also assign high likelihood to other points and thus generate unrealistic samples. GANs aim to always generate realistic samples.
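
For reference, the original GAN formulation (Goodfellow et al., 2014) trains the generator G and discriminator D against each other via the minimax objective

    \min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]

where D(x) estimates the probability that x came from the data rather than from G, and G(z) maps noise z to a sample; the learned D plays exactly the role of the learned cost function described above.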

Level: Intermediate technical
Type: Talk
Tags: Deep Learning & Artificial Intelligence

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Hall 3

S6761 - Advanced System Power Management for Deep Learning and A.I. Machines (Presented by Linear Technology)

Dave Dwelley Office of the CTO, Linear Technology
Dave Dwelley works in the Office of the CTO at Linear Technology. Since joining the company over 25 years ago, Dave has served as an analog chip designer and design manager. He received his BSEE/CD degree from UC Berkeley in 1986, was a member of the original IEEE P802.3af (PoE) task force starting in 2000, and was a founding member of the P802.3at (PoE+) group starting in 2004. He now serves as chairman of the IEEE 802.3 Power over Data Lines study group and participates in the P802.3bp Reduced Twisted Pair Gigabit group. Dave holds 16 patents and spends his free time raising two teenagers and thinking about rebuilding the collection of old cars gathering dust in his garage.

Linear Technology's DC/DC regulator and power management solutions enable designers to increase performance in GPU- and CPU-based systems. Improved electrical, thermal and mechanical properties for core, I/O, and memory rails, combined with expertise and tools for PCB layout, simulation and design verification permit deployment of more efficient, lighter weight, cooler, and more compact systems. This presentation will also focus on methods of controlling, monitoring and debugging power circuits by digitally communicating with the device, reading temperature and load current data while setting voltages and start-up conditions. Future product advancements related to powering automotive electronics will also be discussed.

Level: All technical
Type: Talk
Tags: Performance Optimization; Deep Learning & Artificial Intelligence; Self-Driving Cars & Automotive

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Marriott Salon 6

S6794 - The Technology Powering the Immersive Cinema Experiences from Lucasfilm's ILMxLAB

Lutz Latta Principal Engineer, Lucasfilm
Lutz Latta is the Principal Engineer of ILMxLAB, where he leads the technology development for Lucasfilm's innovative VR, AR, and immersive experiences. Previously he worked extensively on video games, as the Lead Graphics Engineer of Star Wars: 1313 at LucasArts, and on The Lord of the Rings and Command & Conquer games at Electronic Arts Los Angeles.

Bringing cinematic virtual reality to life requires the kind of tight collaboration between technical and creative forces that Lucasfilm has thrived on for over 40 years. We'll dive deep into the technology that powers the creative and technical experiments underway at the studio. We'll discuss how multiple GPUs collaborate to achieve the highest level of photorealism for virtual reality today, how to repurpose offline-rendered, movie-quality assets for real-time rendering in under 11 milliseconds per frame, and some of the lessons learned along the way.

Level: Intermediate technical
Type: Talk
Tags: Virtual Reality & Augmented Reality; Media & Entertainment; Real-Time Graphics; Press-Suggested Sessions: Professional Graphics; Press-Suggested Sessions: Virtual Reality

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Room LL20C

S6809 - Large-Scale Volume and Particle Visualization on GPU Clusters with vl3

Silvio Rizzi Postdoctoral Appointee, Argonne National Laboratory
Silvio Rizzi is a Postdoctoral Appointee at the Argonne Leadership Computing Facility, Argonne National Laboratory. His research interests include large-scale data analysis and visualization, in-situ visualization, GPU and many-core computing, display technologies, augmented and virtual reality, and surgical simulation. Silvio earned a B.Sc. in Electronics Engineering from Universidad Tecnologica Nacional, Mendoza, Argentina, and the degrees of M.S. in Electrical and Computer Engineering and Ph.D. in Industrial Engineering and Operations Research from the University of Illinois at Chicago.

We'll describe vl3, a GPU-accelerated parallel framework for large-scale scientific visualization and visual analytics. We'll explain its parallel architecture, based on a combination of the message passing interface (MPI) and GLSL shaders. In addition, we'll present applications to interactive visualization of large-scale volumetric and particle-based datasets generated on some of the most powerful supercomputers on the planet. We will also discuss strong and weak scalability experiments on up to 125 NVIDIA K80 GPUs. Finally, we'll cover vl3's capability for remote data visualization and streaming ultra-high-resolution images to remote displays, including a large tiled display driven by a workstation with multiple Quadro M6000 cards and Mosaic technology.

Level: All technical
Type: Talk
Tags: In-Situ and Scientific Visualization; Large Scale and Multi-Display Visualization

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Room LL21D

S6830 - WePod: Autonomous Shuttles on Public Roads

Floris Gaisser Researcher, Delft University of Technology
Floris Gaisser finished a master's in mechanical engineering at Delft University of Technology with a specialization in vision-based robotics and intelligent mechanical systems. Currently he is working as a Ph.D. researcher in the Intelligent Vehicles & Cognitive Robotics department on the WePods project. He is also a founding partner in Robot Care Systems. His expertise lies in the field of product and mechanical engineering with a focus on computer vision applications.

The WePod is the first self-driving vehicle on the public road without a steering wheel or pedals. To achieve driving in such a complex environment and to guarantee safety, multiple sensors covering 360 degrees around the vehicle are used. Sensor fusion, road-user detection, classification, and tracking have been implemented on NVIDIA's DrivePX platform. This session will give an overview of the system's architecture and implementation, and present preliminary test results of driving on the public road.

Level: All technical
Type: Talk
Tags: Self-Driving Cars & Automotive; Deep Learning & Artificial Intelligence; Press-Suggested Sessions: AI & Deep Learning; Press-Suggested Sessions: Self-Driving Cars & Auto

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Room LL21E

S6840 - Intelligent Video Analysis System Based on GPU and Distributed Architecture

Shiliang Pu Executive Vice President, Hikvision Research Institute
Shiliang Pu, incumbent Executive Vice President of Hikvision Research Institute, received doctoral degrees from both Rouen University and Zhejiang University. He is regarded as a top expert in the field of image processing and intelligent identification, in Zhejiang Province and beyond. He is the technical leader of a key laboratory of the Ministry of Public Security and a technological innovation leader at CETC.

In this session, we'll show you the impact that GPUs and technology like deep learning will have on the surveillance industry. As a global leader in the security/surveillance market, we will share some of the work we do with governments and municipalities to help them manage massive amounts of video and data.

Level: All technical
Type: Talk
Tags: Intelligent Video Analytics (IVA); Deep Learning & Artificial Intelligence; Video & Image Processing

Day: Tuesday, 04/05
Time: 14:00 - 14:25
Location: Room LL20D

S6116 - Towards Building a GPU Cloud Service for Human-Level Quality Image Understanding

Xiaodong He Senior Researcher, Microsoft
Xiaodong He is a senior researcher in the Deep Learning Technology Center, Microsoft Research, Redmond, Wash. He is also an affiliate full professor in the Department of Electrical Engineering at the University of Washington, Seattle, serving in the Ph.D. reading committee. His research interests include deep learning, speech, natural language, vision, information retrieval, and knowledge representation and management. He has published in IEEE TASLP, IEEE SPM, Proc. IEEE, ICASSP, ACL, EMNLP, NAACL, CVPR, SIGIR, WWW, CIKM, ICLR, NIPS, and other venues. He has received several awards, including the Outstanding Paper Award of ACL 2015. He and colleagues developed the MSR-NRC-SRI entry and the MSR entry that won No. 1 in the 2008 NIST Machine Translation Evaluation and the 2011 IWSLT Evaluation (Chinese-to-English), respectively, and the MSR image captioning system that won first prize at the MS COCO Captioning Challenge 2015. He has held editorial positions on several IEEE Journals and has served on the organizing and program committees of major speech and language processing conferences. He is a senior member of IEEE and a member of ACL.
Kenneth Tran Senior Research Engineer, Microsoft
Kenneth Tran is a senior research engineer in the Deep Learning Technology Center, Microsoft Research. Previously, he was a machine learning scientist in the Cloud Machine Learning group at Microsoft, building a machine learning platform that now powers Azure ML. His research interests include machine learning, optimization, and distributed computing.

Learn the latest deep learning techniques for semantic modeling of image, text, and knowledge graph, all empowered by GPU computing and cloud service. We'll demonstrate how to build deep semantic models across different modalities, and how to apply these models to reach the best results in information retrieval, question answering, and image captioning benchmarks. In particular, facilitated by the recently announced Microsoft Azure GPU compute instances, we'll show how to use GPU clusters to extend the MSR image captioning system, which won first prize in the COCO Captioning Challenge at CVPR 2015, and to build a publically available, large-scale, deep image understanding service that achieves state-of-the-art performance in generating novel captions for images.

Level: Intermediate technical
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Computer Vision & Machine Vision; Data Center & Cloud Computing; Press-Suggested Sessions: AI & Deep Learning

Day: Tuesday, 04/05
Time: 14:30 - 14:55
Location: Hall 3

S6131 - Nvpro-Pipeline: Handling Massive Transform Updates in a SceneGraph

Markus Tavenrath Senior Developer Technology Engineer, NVIDIA
Markus finished his studies in computer science with a focus on computer graphics in 2008. He was one of the first to use ray tracing on CUDA, for his diploma thesis, which brought him straight to NVIDIA. There he primarily worked on GPU ray tracing for SceniX, NVIDIA's scene graph technology, which was showcased at SIGGRAPH 2008. Afterwards, he applied his experience to implement parts of OptiX, improve SceniX, and develop several ray tracing demos. In close cooperation with external partners, he improved rendering performance and scene graph usability as a developer technology engineer. Now he is using this knowledge to experiment with future rendering technologies that bring high interactivity to complex scenes. This work includes both CPU and GPU strategies for solving typical scene graph operations related to rendering.

This session will walk through the new transform hierarchy module in nvpro-pipeline, which computes the world matrices for each node in a transform hierarchy massively in parallel instead of with the traditional serial traversal. Running the algorithm on a high-end GPU like a Quadro M6000 gives a massive speedup over computing the hierarchy on the CPU. In addition, the data that has to be transferred between the CPU and GPU is minimized, which gives another performance boost.
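
Conceptually, the parallel scheme groups nodes by depth and launches one kernel per level, so every node can combine its parent's already-finished world matrix with its own local matrix. A simplified sketch under our own naming (not the nvpro-pipeline source):

    struct mat4 { float m[16]; };                 // row-major 4x4 matrix

    __device__ mat4 mul(const mat4 &a, const mat4 &b)
    {
        mat4 r;
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j) {
                float s = 0.0f;
                for (int k = 0; k < 4; ++k)
                    s += a.m[i * 4 + k] * b.m[k * 4 + j];
                r.m[i * 4 + j] = s;
            }
        return r;
    }

    // Update one depth level: parents at level-1 are already finished, so
    // all nodes at this level are independent and run fully in parallel.
    __global__ void updateLevel(const int *nodesAtLevel, int count,
                                const int *parent, const mat4 *local, mat4 *world)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= count) return;
        int node = nodesAtLevel[t];
        int p = parent[node];
        world[node] = (p < 0) ? local[node]                 // root node
                              : mul(world[p], local[node]); // parent already done
    }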

Level: Intermediate technical
Type: Talk
Tags: Real-Time Graphics

Day: Tuesday, 04/05
Time: 14:30 - 14:55
Location: Room 210E

S6215 - MBE: A GPU-Based Fast, Robust and Precise Solver for Chemical ODEs

Fan Feng Ph.D. Student, Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
Fan Feng is a Ph.D. student with the Supercomputer Center, Computer Network Information Center, Chinese Academy of Sciences, Beijing.
Zifa Wang Professor, Institute of Atmospheric Physics, Chinese Academy of Sciences
Professor Zifa Wang is director of the State Key Laboratory of Atmospheric Boundary Layer Physics and Atmospheric Chemistry (LAPC) of the Institute of Atmospheric Physics, Chinese Academy of Sciences, and an editor of the Chinese Journal of Atmospheric Sciences, Aerosol and Air Quality Research, The Scientific World Journal, and SOLA. He has developed the Nested Air Quality Prediction Modeling System (NAQPMS), a tool for studying air pollution, such as Asian dust storms, across regional and urban scales; it is widely used in China as a real-time air quality forecasting model and was included in the multi-model inter-comparison project MICS-Asia III. He received his Ph.D. in atmospheric physics in 1997 and studied and worked in Japan from 1998 to 2002.

Explore a GPU-based efficient algorithm for chemical ODEs, the core and most costly part of the atmospheric chemistry model in the CAS-ESM project. Chemical ODE systems are numerically difficult because of their stiffness, nonlinearity, and nonnegativity constraints. Traditional solvers, such as LSODE, are hard to parallelize because of their complicated control flow and tight coupling. In our experiments, we obtained a 3-5X speedup on the GPU when the same input is set on each node, which eliminates divergence in the kernel; with real input, however, performance was even worse than the serial code. We therefore developed a new solver, Modified Backward Euler (MBE). In our numerical experiments, MBE is shown to be faster and more precise than LSODE, and it is easy to parallelize, so we can expect a significant speedup on the GPU.
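
The MBE iteration itself is the subject of the talk; as background, a plain backward Euler step with a fixed-point solve and a nonnegativity clamp (the ingredients such a solver builds on) can be sketched as follows, with a hypothetical two-species right-hand side f:

    import numpy as np

    def f(y):
        # Hypothetical stiff two-species chemistry: fast consumption of
        # species 0, slow production back from species 1.
        return np.array([-1e4 * y[0] + 10.0 * y[1],
                          1e4 * y[0] - 10.0 * y[1]])

    def backward_euler_step(y, h, iters=50):
        # Implicit step: solve y_next = y + h*f(y_next) by fixed-point
        # iteration, clamping to keep concentrations nonnegative.
        # (A production stiff solver would use Newton iteration instead,
        # which allows much larger steps.)
        y_next = y.copy()
        for _ in range(iters):
            y_next = np.maximum(y + h * f(y_next), 0.0)
        return y_next

    y = np.array([1.0, 0.0])
    for _ in range(1000):  # h must stay small for the fixed point to converge
        y = backward_euler_step(y, h=1e-5)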

Level: All technical
Type: Talk
Tags: Earth System Modelling; Computational Fluid Dynamics; Algorithms

Day: Tuesday, 04/05
Time: 14:30 - 14:55
Location: Room 211A

S6276 - Autonomous Robotic 3D Printing: Real-Time Path Planning with Computer Vision

Daghan Cam Architect, University College London
Daghan Cam is an architect and researcher based in London. He is the director of Daghan Cam Limited, which operates at the intersection of architecture, technology, and research. He runs a post-graduate research cluster at UCL's Bartlett School of Architecture with Alisa Andrasek and Andy Lomas. He also leads research on GPU computing and is a co-principal investigator of UCL's NVIDIA GPU Research Center. Previously he worked with Zaha Hadid Architects. He has taught workshops and given lectures at AA Visiting Schools in Istanbul, Athens, and London, and at the Ecole d'architecture in Paris. His work on computational design and large-scale robotic fabrication has been widely exhibited, most recently in San Francisco and at Milan Design Week 2015.

Teach your 3D printing robot how to adapt to unpredictable material behavior by using deep learning algorithms. We'll introduce a path planning strategy for iteratively correcting robot target positions in a 3D printing process, using an NVIDIA Jetson card attached to an industrial robotic arm. Initial path generation, real-time visual tracking of material behavior, and the evaluation and recomputation of robot trajectories will be explained through code examples and video recordings from the fabrication process.
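
A minimal sketch of the closed-loop correction idea follows (our simplification: observe_deposition is a stand-in for the camera tracking on the Jetson, and the robot controller interface is elided):

    import numpy as np

    def observe_deposition(commanded):
        # Stand-in for the vision system: pretend extruded material
        # always sags 2 mm below the commanded position.
        return commanded + np.array([0.0, 0.0, -2.0])

    def corrected_path(planned, gain=0.8):
        # After each deposition, compare the observed result with the
        # plan and shift the next target to compensate (proportional
        # feedback on the accumulated error).
        correction = np.zeros(3)
        for target in planned:
            command = target + correction
            observed = observe_deposition(command)
            correction += gain * (target - observed)
            yield command

    plan = [np.array([x, 0.0, 10.0]) for x in np.linspace(0.0, 100.0, 11)]
    for command in corrected_path(plan):
        pass  # send command to the robot controller here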

Level: Beginner technical
Type: Talk
Tags: Product & Building Design; Robotics & Autonomous Machines; Computer Vision & Machine Vision; Deep Learning & Artificial Intelligence

Day: Tuesday, 04/05
Time: 14:30 - 14:55
Location: Room LL21A

S6288 - Automatically Fusing Hundreds of Stencil Kernels for Better Performance and Productivity

Mohamed Wahib Post-Doctoral Researcher, RIKEN Advanced Institute for Computational Science
Mohamed Wahib is currently a post-doctoral researcher in the HPC Programming Framework Research Team at the RIKEN Advanced Institute for Computational Science (RIKEN AICS). He joined RIKEN AICS in 2012 after several years at Hokkaido University, Japan, where he received a Ph.D. in computer science in 2012. Prior to his graduate studies, Mohamed worked for four years as a researcher at Texas Instruments R&D.

This talk proposes an end-to-end framework for automatically transforming stencil-based CUDA programs to exploit inter-kernel data locality. The CUDA-to-CUDA transformation collectively replaces the user-written kernels with auto-generated kernels optimized for data reuse. The transformation is based on two basic operations, kernel fusion and fission, and relies on a series of automated steps: gathering metadata, generating graphs expressing dependency and precedence constraints, searching for optimal kernel fissions/fusions, and generating optimized code. We show how the automatic transformations are practical and effective in exploiting exposed data locality in a variety of real-world applications with large codebases containing dozens of kernels and data arrays.
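
To see what fusion buys, compare a toy unfused pair of stencil passes, which round-trips an intermediate array through memory, with the fused equivalent that recomputes intermediate values from the input's halo (our illustration, not the framework's generated code):

    import numpy as np

    a = np.random.rand(1024)

    def unfused(a):
        # Two separate "kernels": tmp is written to and re-read from
        # (global) memory between the two launches.
        tmp = np.empty_like(a)
        out = np.empty_like(a)
        for i in range(1, len(a) - 1):
            tmp[i] = 0.5 * (a[i - 1] + a[i + 1])      # kernel 1
        for i in range(2, len(a) - 2):
            out[i] = 0.5 * (tmp[i - 1] + tmp[i + 1])  # kernel 2
        return out

    def fused(a):
        # One fused "kernel": intermediate values are recomputed from a,
        # trading a few extra flops for far less memory traffic.
        out = np.empty_like(a)
        for i in range(2, len(a) - 2):
            left = 0.5 * (a[i - 2] + a[i])
            right = 0.5 * (a[i] + a[i + 2])
            out[i] = 0.5 * (left + right)
        return out

    assert np.allclose(unfused(a)[2:-2], fused(a)[2:-2])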

Level: Intermediate technical
Type: Talk
Tags: Tools & Libraries; Supercomputing & HPC; Performance Optimization

Day: Tuesday, 04/05
Time: 14:30 - 14:55
Location: Room 211B

S6389 - Embedded Deep Learning for Object Detection and Classification in Aerial Images

Jon Barker Solution Architect, NVIDIA
Jon Barker is a solution architect with NVIDIA, helping customers and partners develop applications of GPU-accelerated machine learning and data analytics to solve defense and national security problems. He is particularly focused on applications of the rapidly developing field of deep learning. Prior to joining NVIDIA, Jon spent almost a decade as a government research scientist within the U.K. Ministry of Defence and the U.S. Department of Defense R&D communities. While in government service, he led R&D projects in sensor data fusion, big data analytics, and machine learning for multi-modal sensor data to support military situational awareness and aid decision making. He has a Ph.D. and B.S. in pure mathematics from the University of Southampton, U.K.

Learn how deep learning can be applied to object detection, localization, and tracking problems in remote sensing. We'll present a technical case study showing how a convolutional neural network (CNN) trained in the data center using DIGITS can be deployed to an embedded GPU system to carry out low-latency object detection, classification, and tracking in high-resolution aerial imagery. We'll compare different approaches to detection and localization tasks. An example will be given of integrating the Caffe deep learning framework for GPU-accelerated CNN inference with an OpenCV-based image and video processing pipeline. We'll also show how transfer learning can be accomplished using DIGITS to train a CNN when only a small task-specific training dataset is available.
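
A minimal sketch of such an integration is below, assuming a DIGITS-trained classification network whose input blob is named 'data' and output blob 'prob'; the model and video file names are placeholders:

    import caffe
    import cv2
    import numpy as np

    caffe.set_mode_gpu()  # run inference on the GPU

    # Placeholders for a network trained in the data center with DIGITS.
    net = caffe.Net('deploy.prototxt', 'snapshot.caffemodel', caffe.TEST)

    cap = cv2.VideoCapture('aerial.mp4')  # placeholder video source
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # OpenCV delivers BGR uint8 HxWxC; Caffe wants float NxCxHxW.
        # (Caffe models are conventionally BGR, so no channel swap; mean
        # subtraction is omitted here for brevity.)
        data = cv2.resize(frame, (224, 224)).astype(np.float32)
        data = data.transpose(2, 0, 1)[np.newaxis, ...]
        net.blobs['data'].reshape(*data.shape)
        net.blobs['data'].data[...] = data
        probs = net.forward()['prob'][0]  # per-class probabilities
        label = int(probs.argmax())
        # ... hand label/probs to the detection and tracking stages ...
    cap.release()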

Level: Intermediate technical
Type: Talk
Tags: Deep Learning & Artificial Intelligence; Computer Vision & Machine Vision; Robotics & Autonomous Machines; Aerospace & Defense

Day: Tuesday, 04/05
Time: 14:30 - 14:55
Location: Room 210H

S6397 - Real-Time Non-Rigid Image Registration Engine

Randall Miles Senior Research Scientist, Propulsion Science and Technology
Dr. Randall Miles is a physicist, algorithm developer, and senior research scientist at Propulsion Science and Technology. He is lead designer and developer for model database development activities, and key contributor on a variety of projects, including quantum chemistry calculations and radar cross section modeling of CFD fields.

Non-rigid image registration, i.e., morphing, allows a smaller footprint of seed images to be used to create a smooth, continuously changing series of images. We'll present a new high-speed toolkit for image morphing implemented with NVIDIA GPU technology. Run-time improvements of ~80% were achieved through a succession of CUDA optimizations guided by Nsight profiler results. Tests were conducted using available simulated rocket plume images to measure run times and establish performance metrics.
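
The GPU toolkit is the talk's subject; the core morph operation it accelerates, warping two seed images along a registration displacement field and cross-dissolving, can be sketched on the CPU with SciPy as follows (synthetic data, hypothetical field conventions):

    import numpy as np
    from scipy.ndimage import map_coordinates

    def morph(img_a, img_b, disp, t):
        # disp has shape (2, H, W): per-pixel (row, col) displacements
        # registering A onto B. Warp each seed image part of the way,
        # then cross-dissolve the two warps.
        rows, cols = np.indices(img_a.shape, dtype=np.float64)
        warped_a = map_coordinates(img_a, [rows + t * disp[0],
                                           cols + t * disp[1]], order=1)
        warped_b = map_coordinates(img_b, [rows - (1 - t) * disp[0],
                                           cols - (1 - t) * disp[1]], order=1)
        return (1 - t) * warped_a + t * warped_b

    # Tiny synthetic example: two 64x64 seed frames, zero displacement
    # (so the morph degenerates to a plain cross-dissolve).
    a = np.random.rand(64, 64)
    b = np.random.rand(64, 64)
    disp = np.zeros((2, 64, 64))
    frames = [morph(a, b, disp, t) for t in np.linspace(0.0, 1.0, 9)]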

Level: All technical
Type: Talk
Tags: Aerospace & Defense; Video & Image Processing; Performance Optimization; Computer Vision & Machine Vision

Day: Tuesday, 04/05
Time: 14:30 - 14:55
Location: Marriott Salon 2

S6421 - Using OpenACC to Parallelize Seismic One-Way-Based Migration

Maxime Hugues HPC Research Scientist, Total E&P Research & Technology USA, LLC
Maxime Hugues has been an HPC research scientist at TOTAL Houston since 2012. Maxime graduated from the French National Engineer School "ISEN-Toulon" in 2007. The same year, he received an M.S. from the University of Science and Technologies of Lille. He was a Ph.D. fellow at the oil and gas company TOTAL and received his degree in computer science in 2011 from the University of Lille. While doing his Ph.D., he worked as a junior researcher in high performance computing at TOTAL. He continued to work on multi-programming paradigms as a postdoctoral researcher at INRIA and as a visiting scientist at the University of Tsukuba. His research focuses on programming paradigms and innovative hardware for extreme-scale computers.

We'll describe our experience using OpenACC to parallelize one-way-based migration, a seismic application that uses Fourier finite differencing. We describe our approach to optimizing application kernels that involve FFT operations and solving systems of tridiagonal sparse matrices. We discuss the expectations and challenges of using OpenACC, along with potential pitfalls for application users. We highlight the advantages of using OpenACC for high-performance scientific applications and list shortcomings that affect performance.

Level: All technical
Type: Talk
Tags: Energy Exploration; Performance Optimization; Supercomputing & HPC; OpenACC

Day: Tuesday, 04/05
Time: 14:30 - 14:55
Location: Marriott Salon 1

S6428 - Molecular Simulations of DNA Loop Extrusion Explain and Predict Human Genome Architecture

Adrian Sanborn Ph.D. Candidate, Department of Computer Science, Stanford University
Adrian Sanborn is a Ph.D. candidate in the Department of Computer Science at Stanford University and a researcher at the Center for Genome Architecture in Houston. Previously, he graduated summa cum laude from Harvard University with a degree in mathematics and computer science.

We'll show how the human genome's 3D organization, which is closely linked to important cellular functions, can be explained and predicted by molecular simulations. Our recent high-resolution maps of DNA self-contacts revealed that the genome is organized into loops and domains demarcated by the DNA-binding protein CTCF. We present a model, developed using GPU-accelerated molecular simulations, in which loops and domains form through loop extrusion. Our simulations recapitulate DNA contact maps given only CTCF binding locations. When we alter CTCF binding locations using genome editing, our simulations generate accurate predictions for the edited DNA contact maps. These results significantly advance our understanding of genome folding and open a path towards targeted surgery of 3D genomes.
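
A toy one-dimensional version of the extrusion rule (our illustration, not the authors' simulation code) captures the mechanism: an extruder lands on the fiber, its two ends slide outward, and each end halts at a CTCF site whose motif faces the loop interior, the convergent-orientation rule seen in the contact maps:

    import random

    N = 1000  # monomers along the chromatin fiber
    # Hypothetical CTCF sites: position -> motif orientation. The left
    # end of a loop halts at a forward ('>') motif, the right end at a
    # reverse ('<') motif, so loops form between convergent sites.
    ctcf = {200: '>', 500: '<', 800: '>'}

    def extrude(steps=10000):
        left = right = random.randrange(N)  # extruder binds at random
        for _ in range(steps):
            if left > 0 and ctcf.get(left) != '>':
                left -= 1                   # slide left end outward
            if right < N - 1 and ctcf.get(right) != '<':
                right += 1                  # slide right end outward
        return left, right                  # anchors of the final loop

    loops = [extrude() for _ in range(5)]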

Level: All technical
Type: Talk
Tags: Computational Biology; Computational Physics; Press-Suggested Sessions: HPC & Science

Day: Tuesday, 04/05
Time: 14:30 - 14:55
Location: Marriott Salon 5

S6489 - Testing Chordal Graphs with CUDA®

Agnieszka Lupinska Ph.D. Student, Jagiellonian University
Agnieszka Lupinska is a Ph.D. student in computer science at Jagiellonian University in Cracow, where she teaches CUDA programming. From June 2014 to June 2015 she was a software engineer at Nokia Cracow, developing embedded Linux systems for Small Cell, Nokia's product for LTE technology. Her interests include parallel computing, multi-threaded algorithms, low-level language programming, advanced algorithms, and computational complexity.

We'll present a CUDA implementation of an algorithm for testing the chordality of graphs, which uses parallel partition refinement with pivots. A graph G is chordal if each cycle in G of length greater than three has a chord, that is, an edge between two non-adjacent vertices of the cycle. In total, the algorithm takes O(N) time on an N-thread grid and performs O(N+M) work for graphs of N vertices and M edges. We'll compare the performance test results achieved by the CUDA implementation on an NVIDIA GeForce GTX TITAN X with a sequential implementation on a four-core (eight-thread) CPU. We'll present test results for cliques, sparse graphs, dense graphs, and random chordal graphs.
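
For reference, the classic sequential baseline that such a GPU implementation is measured against fits in a few lines of Python: maximum cardinality search followed by the Tarjan-Yannakakis perfect-elimination check (the parallel partition-refinement formulation is the subject of the talk itself):

    def is_chordal(adj):
        # adj: dict mapping each vertex to its set of neighbours.
        # 1) Maximum cardinality search: repeatedly visit the unvisited
        #    vertex with the most already-visited neighbours.
        order, weight, unvisited = [], {v: 0 for v in adj}, set(adj)
        while unvisited:
            v = max(unvisited, key=lambda u: weight[u])
            order.append(v)
            unvisited.discard(v)
            for u in adj[v] & unvisited:
                weight[u] += 1
        pos = {v: i for i, v in enumerate(order)}
        # 2) The reverse of an MCS order is a perfect elimination
        #    ordering iff the graph is chordal: each vertex's earlier-
        #    visited neighbours, minus the latest of them, must all be
        #    adjacent to that latest one.
        for v in order:
            earlier = {u for u in adj[v] if pos[u] < pos[v]}
            if earlier:
                latest = max(earlier, key=lambda u: pos[u])
                if not (earlier - {latest}) <= adj[latest]:
                    return False
        return True

    # A chordless 4-cycle is the smallest non-chordal graph.
    assert not is_chordal({0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}})
    assert is_chordal({0: {1, 2}, 1: {0, 2}, 2: {0, 1}})  # triangle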

Level: Advanced technical
Type: Talk
Tags: Algorithms; Big Data Analytics

Day: Tuesday, 04/05
Time: 14:30 - 14:55
Location: Marriott Salon 3

S6528 - Implementing Deep Learning for Video Analytics on Tegra X1

Carles Fernández Tena Director of Research, Herta Security
Carles Fernández Tena received his B.S. in telecommunication engineering and M.S. in language and speech from the Technical University of Catalonia (UPC) in 2005. He received an M.S. in computer vision and AI from the Autonomous University of Barcelona (UAB) in 2008, where he obtained his Ph.D. cum laude in 2010, receiving the 2010 Extraordinary Ph.D. Award. He has published more than 40 scientific articles in international journals and conferences. He currently leads the research team at Herta Security. His research interests include biometrics, computer vision, and machine learning, particularly unconstrained facial analysis in images and video.

The performance of the Tegra X1 architecture opens the door to real-time evaluation and deployment of deep neural networks for video analytics applications. This session presents a highly optimized, low-latency pipeline for accelerating demographics estimation with deep neural networks on video. The proposed techniques leverage the on-die hardware video decoding engine and the Maxwell GPU cores to conduct advanced video analytics such as gender or age estimation. Our results show that Tegra X1 is the right platform for developing embedded video analytics solutions.
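
The essence of such a pipeline is keeping the decoder and the network busy simultaneously; a minimal sketch of the overlap (stand-ins only: both the hardware decoder and the neural network are stubbed out here) looks like this:

    import queue
    import threading

    frames = queue.Queue(maxsize=4)  # a small queue bounds latency

    def decode_stage():
        # Stand-in for the on-die decode engine: on Tegra X1 it decodes
        # video without occupying the Maxwell GPU cores.
        for i in range(100):
            frames.put(('frame', i))
        frames.put(None)             # end-of-stream marker

    def inference_stage():
        # Stand-in for GPU inference: the demographics network runs on
        # the GPU cores while the next frame is already being decoded.
        while frames.get() is not None:
            pass                     # run the neural network here

    decoder = threading.Thread(target=decode_stage)
    decoder.start()
    inference_stage()
    decoder.join()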

Level: All technical
Type: Talk
Tags: Intelligent Video Analytics (IVA); Embedded; Aerospace & Defense; Deep Learning & Artificial Intelligence; IoT

Day: Tuesday, 04/05
Time: 14:30 - 14:55
Location: Room LL20D

S6693 - Network, Storage, and Workflow Design for GPU Centric Film Productions

Jeff Brue CTO, Open Drives
Jeff is the CTO and founder of Open Drives, a data storage company focused on production IT. He has an extensive background in filmmaking, 3D animation, and storage kernel design, having worked on over 120 feature films as CTO at several post-production facilities and having headed up the first commercial uncompressed-data workflow system in Hollywood with the Viper camera. Since founding Open Drives in 2011, he has been the principal architect for facilities serving such productions as David Fincher's Gone Girl, as well as designing the in-house technology architecture for the Coen Brothers and for 20th Century Fox's Deadpool. Open Drives' studio data clients include Legendary Pictures, Warner Brothers, 20th Century Fox, and Disney, among many others.

This talk provides an overall perspective on GPU-centric workflows for media and entertainment, updating last year's talk on Gone Girl with this year's highlighted film, Deadpool. It focuses particularly on designing storage systems for high-speed GPU workflows in VFX and editing.

Level: All technical
Type: Talk
Tags: Media & Entertainment; Performance Optimization; General Interest

Day: Tuesday, 04/05
Time: 14:30 - 14:55
Location: Room LL21C