High-Performance Graphics 2009
Browsing High-Performance Graphics 2009 by Title
Item Accelerating Shadow Rays Using Volumetric Occluders and Modified kd-Tree Traversal (The Eurographics Association, 2009) Djeu, Peter; Keely, Sean; Hunt, Warren; David Luebke and Philipp Slusallek
Monte Carlo ray tracing remains a simple and elegant method for generating robust shadows. This approach, however, is often hampered by the time needed to evaluate the numerous shadow ray queries required to generate a high-quality image. We propose the use of volumetric occluders stored within a kd-tree in order to accelerate shadow rays cast on a closed, watertight mesh. Intersection with a volumetric occluder is much cheaper than intersection with mesh geometry, although performing these intersections requires modification to the traversal order through the kd-tree. We propose two such modifications, both of which enable the use of volumetric occluders for cheap shadow ray termination. We also propose using a software-managed cache to store and reuse volumetric occluders for even earlier termination. Our approach provides a performance improvement of up to 2.0x for our test scenes while producing images identical to those produced by the unaccelerated baseline.

Item CFU: Multi-Purpose Configurable Filtering Unit for Mobile Multimedia Applications on Graphics Hardware (The Eurographics Association, 2009) Sun, Chih-Hao; Lok, Ka-Hang; Tsao, You-Ming; Chang, Chia-Ming; Chien, Shao-Yi; David Luebke and Philipp Slusallek
To increase the capability of mobile GPUs in image/video processing, this paper proposes a multi-purpose configurable filtering unit (CFU), a new configurable unit for image filtering on stream processing architectures. The CFU is located in the texture unit of a GPU and can efficiently execute many kinds of filtering operations by directly accessing the multi-bank texture cache and specially designed data paths. The proposed CFU supports the following kinds of programmability.
First, programmers can select different sampling point windows. Second, the arithmetic type of the filter can be chosen: in addition to the original texture filtering functions and finite impulse response (FIR) filters, morphological operations from computer vision are also embedded in the CFU. Furthermore, the weighting coefficients of FIR filters and morphological operations can be defined by programmers. Simulation results show that, on average, compared with a conventional texture unit, the CFU reduces processing time by 25.35% in H.264/AVC motion compensation and by 58.6% in video segmentation.

Item Data-Parallel Rasterization of Micropolygons with Defocus and Motion Blur (The Eurographics Association, 2009) Fatahalian, Kayvon; Luong, Edward; Boulos, Solomon; Akeley, Kurt; Mark, William R.; Hanrahan, Pat; David Luebke and Philipp Slusallek
Current GPUs rasterize micropolygons (polygons approximately one pixel in size) inefficiently. We design and analyze the costs of three alternative data-parallel algorithms for rasterizing micropolygon workloads for the real-time domain. First, we demonstrate that efficient micropolygon rasterization requires parallelism across many polygons, not just within a single polygon. Second, we produce a data-parallel implementation of an existing stochastic rasterization algorithm by Pixar, which is able to produce motion blur and depth-of-field effects. Third, we provide an algorithm that leverages interleaved sampling for motion blur and camera defocus.
This algorithm outperforms Pixar's algorithm when rendering objects undergoing moderate defocus or high motion and has the added benefit of predictable performance.

Item Efficient Depth Peeling via Bucket Sort (The Eurographics Association, 2009) Liu, Fang; Huang, Meng-Cheng; Liu, Xue-Hui; Wu, En-Hua; David Luebke and Philipp Slusallek
[...] peeling via bucket sort of fragments on GPU, which makes it possible to capture up to 32 layers simultaneously with correct depth ordering in a single geometry pass. We exploit multiple render targets (MRT) as storage and construct a bucket array of size 32 per pixel. Each bucket is capable of holding only one fragment, and can be concurrently updated using the MAX/MIN blending operation. During rasterization, the depth range of each pixel location is divided uniformly into consecutive subintervals, and a linear bucket sort is performed so that fragments within each subinterval are routed into the corresponding bucket. In a subsequent fullscreen shader pass, the bucket array can be sequentially accessed to get the sorted fragments for further applications. Collisions happen when more than one fragment is routed to the same bucket, which can be alleviated by a multi-pass approach. We also develop a two-pass approach to further reduce the collisions, namely adaptive bucket depth peeling. In the first geometry pass, the depth range is redivided into non-uniform subintervals according to the depth distribution to make sure that there is only one fragment within each subinterval. In the following bucket sorting pass, there will be only one fragment routed into each bucket and collisions will be substantially reduced. Our algorithm shows a speedup of up to 32 times over classical depth peeling, especially for large scenes with high depth complexity, and the experimental results are visually faithful to the ground truth.
It also requires no pre-sorting of geometry or post-sorting of fragments, and is free of read-modify-write (RMW) hazards.

Item Efficient Ray Traced Soft Shadows using Multi-Frusta Tracing (The Eurographics Association, 2009) Benthin, Carsten; Wald, Ingo; David Luebke and Philipp Slusallek
Ray tracing has long been considered to be superior to rasterization because of its ability to trace arbitrary rays, allowing it to simulate virtually any physical light transport effect by just tracing rays. Yet, to look plausible, effects such as soft shadows typically require an extraordinary number of rays. This makes the prospects of real-time performance rather remote. Rasterization, in contrast, has a record of producing such effects in real time by employing specialized and approximate solutions for individual effects. Though ray tracing may still be the right choice for effects like reflections and refractions, using specialized solutions for certain important effects also makes sense for a ray tracer. In this paper, we propose a special solution to ray trace soft shadows that is particularly targeted at Intel's Larrabee architecture. We use a specialized frustum tracing that traces multiple frusta of specialized light-weight shadow packets in parallel, while generating rays within each frustum on demand. The technique can easily be integrated into any packet ray tracer, and fits well into the wide SIMD and cache-size constraints of the Larrabee architecture. Our technique reaches rates of up to several dozen million rays per second per Larrabee core, outperforming traditional packet techniques by up to 6×.
This high performance, combined with a simple light-weight illumination filtering step, makes it possible to achieve real-time soft shadows for game-like scenes.

Item Efficient Stream Compaction on Wide SIMD Many-Core Architectures (The Eurographics Association, 2009) Billeter, Markus; Olsson, Ola; Assarsson, Ulf; David Luebke and Philipp Slusallek
Stream compaction is a common parallel primitive used to remove unwanted elements in sparse data. This allows highly parallel algorithms to maintain performance over several processing steps and reduces overall memory usage. For wide SIMD many-core architectures, we present a novel stream compaction algorithm and explore several variations thereof. Our algorithm is designed to maximize concurrent execution, with minimal use of synchronization. Bandwidth and auxiliary storage requirements are reduced significantly, which allows for substantially better performance. We have tested our algorithms using CUDA on a PC with an NVIDIA GeForce GTX280 GPU. On this hardware, our reference implementation provides a 3× speedup over previously published algorithms.

Item Embedded Function Composition (The Eurographics Association, 2009) Whitted, Turner; Kajiya, Jim; Ruf, Erik; Bittner, Ray; David Luebke and Philipp Slusallek
A low-level graphics processor is assembled from a collection of hardwired functions of screen coordinates embedded directly in the display. Configuration of these functions is controlled by a buffer containing parameters delivered to the processor on-the-fly during display scan. The processor is modular and scalable in keeping with the demands of large, high resolution displays.

Item Fast Minimum Spanning Tree for Large Graphs on the GPU (The Eurographics Association, 2009) Vineet, Vibhav; Harish, Pawan; Patidar, Suryakant; Narayanan, P. J.; David Luebke and Philipp Slusallek
Graphics Processor Units are used for many general-purpose processing tasks due to the high compute power available on them.
Regular, data-parallel algorithms map well to the SIMD architecture of current GPUs. Irregular algorithms on discrete structures like graphs are harder to map to them. Efficient data-mapping primitives can play a crucial role in mapping such algorithms onto the GPU. In this paper, we present a minimum spanning tree algorithm on Nvidia GPUs under CUDA, as a recursive formulation of Boruvka's approach for undirected graphs. We implement it using scalable primitives such as scan, segmented scan and split. The irregular steps of supervertex formation and recursive graph construction are mapped to primitives like split to categories involving vertex ids and edge weights. We obtain 30 to 50 times speedup over the CPU implementation on most graphs and 3 to 10 times speedup over our previous GPU implementation. We construct the minimum spanning tree on a 5 million node and 30 million edge graph in under 1 second on one quarter of the Tesla S1070 GPU.

Item Faster Incoherent Rays: Multi-BVH Ray Stream Tracing (The Eurographics Association, 2009) Tsakok, John A.; David Luebke and Philipp Slusallek
High fidelity rendering via ray tracing requires tracing incoherent rays for global illumination and other secondary effects. Recent research shows that the performance benefits from fast packet traversal schemes that exploit high coherence are lost when coherency is low, due to inefficient use of the CPU's SIMD units. In an effort to solve this problem, methods have been proposed which try to extract the remaining coherency from secondary rays through ray sorting, reordering and streaming. Another category of traversal methods has also been proposed which ignores coherency altogether and uses a higher order tree branching factor while tracing single rays at a time. These single ray methods not only target applications with incoherent rays but are also scalable with larger SIMD widths.
This paper combines ideas from both categories to form a new traversal method which extracts coherency from a group of rays through simple filtering while still providing a fast single ray traversal in cases where there is no coherency present. This new algorithm does not depend on the use of packets, which cleanly decouples traversal from shading, and is scalable for larger SIMD widths. Results show that overall performance benefits are obtained on a current generation CPU architecture.

Item Hardware-Accelerated Global Illumination by Image Space Photon Mapping (The Eurographics Association, 2009) McGuire, Morgan; Luebke, David; David Luebke and Philipp Slusallek
We describe an extension to photon mapping that recasts the most expensive steps of the algorithm - the initial and final photon bounces - as image-space operations amenable to GPU acceleration. This enables global illumination for real-time applications as well as accelerating it for offline rendering. Image Space Photon Mapping (ISPM) rasterizes a light-space bounce map of emitted photons surviving initial-bounce Russian roulette sampling on a GPU. It then traces photons conventionally on the CPU. Traditional photon mapping estimates final radiance by gathering photons from a k-d tree. ISPM instead scatters indirect illumination by rasterizing an array of photon volumes. Each volume bounds a filter kernel based on the a priori probability density of each photon path. These two steps exploit the fact that initial path segments from point lights and final ones into a pinhole camera each have a common center of projection. An optional step uses joint bilateral upsampling of irradiance to reduce the fill requirements of rasterizing photon volumes. ISPM preserves the accurate and physically-based nature of photon mapping, supports arbitrary BSDFs, and captures both high- and low-frequency illumination effects such as caustics and diffuse color interreflection.
An implementation on a consumer GPU and 8-core CPU renders high-quality global illumination at up to 26 Hz at HD (1920×1080) resolution, for complex scenes containing moving objects and lights.

Item Image Space Gathering (The Eurographics Association, 2009) Robison, Austin; Shirley, Peter; David Luebke and Philipp Slusallek
Soft shadows, glossy reflections and depth of field are valuable effects for realistic rendering and are often computed using distribution ray tracing (DRT). These blurry effects often need not be accurate and are sometimes simulated by blurring an image with sharper effects, such as blurring hard shadows to simulate soft shadows. One of the most effective examples of such a blurring algorithm is percentage closer soft shadows (PCSS). That technique, however, does not naturally extend to shadows generated in image space, such as those computed by a ray tracer, nor does it extend to glossy reflections or depth of field. This limitation can be overcome by generalizing PCSS to be phrased in terms of a gather from image space textures implemented with cross bilateral filtering. This paper demonstrates a framework to create visually compelling and phenomenologically accurate approximations of DRT effects based on repeatedly gathering from bilaterally weighted image space texture samples. These gathering and filtering operations are well supported by modern parallel architectures, enabling this technique to run at interactive rates.

Item Morphological Antialiasing (The Eurographics Association, 2009) Reshetov, Alexander; David Luebke and Philipp Slusallek
We present a new algorithm that creates plausibly antialiased images by looking for certain patterns in an original image and then blending colors in the neighborhood of these patterns according to a set of simple rules. We construct these rules to work as a post-processing step in ray tracing applications, allowing approximate, yet fast and robust antialiasing.
The algorithm works for any rendering technique and scene complexity. It does not require casting any additional rays and handles all possible effects, including reflections and refractions.

Item Object Partitioning Considered Harmful: Space Subdivision for BVHs (The Eurographics Association, 2009) Popov, Stefan; Georgiev, Iliyan; Dimov, Rossen; Slusallek, Philipp; David Luebke and Philipp Slusallek
A major factor for the efficiency of ray tracing is the use of good acceleration structures. Recently, bounding volume hierarchies (BVHs) have become the preferred acceleration structures, due to their competitive performance and greater flexibility compared to KD trees. In this paper, we present a study on algorithms for the construction of optimal BVHs. Due to the exponential nature of the problem, constructing optimal BVHs for ray tracing remains an open topic. By exploiting the linearity of the surface area heuristic (SAH), we develop an algorithm that can find optimal partitions in polynomial time. We further generalize this algorithm and show that every SAH-based KD tree or BVH construction algorithm is a special case of the generic algorithm. Based on a number of experiments with the generic algorithm, we conclude that the assumption of non-terminating rays in the surface area cost model becomes a major obstacle for using the full potential of BVHs. We also observe that enforcing space partitioning helps to improve BVH performance. Finally, we develop a simple space partitioning algorithm for building efficient BVHs.

Item A Parallel Algorithm for Construction of Uniform Grids (The Eurographics Association, 2009) Kalojanov, Javor; Slusallek, Philipp; David Luebke and Philipp Slusallek
We present a fast, parallel GPU algorithm for construction of uniform grids for ray tracing, which we implement in CUDA. The algorithm performance does not depend on the primitive distribution, because we reduce the problem to sorting pairs of primitives and cell indices.
Our implementation is able to take full advantage of the parallel architecture of the GPU, and construction speed is faster than CPU algorithms running on multiple cores. Its scalability and robustness make it superior to alternative approaches, especially for scenes with complex primitive distributions.

Item Parallel View-Dependent Tessellation of Catmull-Clark Subdivision Surfaces (The Eurographics Association, 2009) Patney, Anjul; Ebeida, Mohamed S.; Owens, John D.; David Luebke and Philipp Slusallek
We present a strategy for performing view-adaptive, crack-free tessellation of Catmull-Clark subdivision surfaces entirely on programmable graphics hardware. Our scheme extends the concept of breadth-first subdivision, which up to this point has only been applied to parametric patches. While mesh representations designed for a CPU often involve pointer-based structures and irregular per-element storage, neither of these is well-suited to GPU execution. To solve this problem, we use a simple yet effective data structure for representing a subdivision mesh, and design a careful algorithm to update the mesh in a completely parallel manner. We demonstrate that in spite of the complexities of the subdivision procedure, real-time tessellation to pixel-sized primitives can be done. Our implementation does not rely on any approximation of the limit surface, and avoids both subdivision cracks and T-junctions in the subdivided mesh. Using the approach in this paper, we are able to perform real-time subdivision for several static as well as animated models. Rendering performance is scalable for increasingly complex models.

Item Scaling of 3D Game Engine Workloads on Modern Multi-GPU Systems (The Eurographics Association, 2009) Monfort, Jordi Roca; Grossman, Mark; David Luebke and Philipp Slusallek
This work is a first attempt to characterize the 3D game workload running on commodity multi-GPU systems.
Depending on the rendering workload balance mode used, the intra- and interframe dependencies due to render-to-texture require a number of synchronizations that can significantly impact the scalability with multiple GPUs. In this paper, a proprietary analytical tool called EMPATHY has been used to evaluate, for a set of popular DX9 games, the performance of both classic split frame and alternate frame rendering modes as well as combined modes supporting more than 4 GPUs. We have also evaluated the application of the early copy and concurrent update techniques together as an alternative to delayed surface copy of render-to-texture surfaces, showing a 48% improvement for some workloads.

Item Selective and Adaptive Supersampling for Real-Time Ray Tracing (The Eurographics Association, 2009) Jin, Bongjun; Ihm, Insung; Chang, Byungjoon; Park, Chanmin; Lee, Wonjong; Jung, Seokyoon; David Luebke and Philipp Slusallek
While supersampling is an essential element for high quality rendering, high sampling rates, routinely employed in offline rendering, are still considered quite burdensome for real-time ray tracing. In this paper, we propose a selective and adaptive supersampling technique aimed at the development of a real-time ray tracer on today's many-core processors. For efficient utilization of very precious computing time, this technique explores both image space and object space attributes, which can be easily gathered during the ray tracing computation, minimizing rendering artifacts by cleverly distributing ray samples to rendering elements according to priorities that are selectively set by a user.
Our implementation on the current GPU demonstrates that the presented algorithm makes high sampling rates as effective as 9 to 16 samples per pixel more affordable than before for real-time ray tracing.

Item Spatial Splits in Bounding Volume Hierarchies (The Eurographics Association, 2009) Stich, Martin; Friedrich, Heiko; Dietrich, Andreas; David Luebke and Philipp Slusallek
Bounding volume hierarchies (BVH) have become a widely used alternative to kD-trees as the acceleration structure of choice in modern ray tracing systems. However, BVHs adapt poorly to nonuniformly tessellated scenes, which leads to increased ray shooting costs. This paper presents a novel and practical BVH construction algorithm, which addresses the issue by utilizing spatial splitting similar to kD-trees. In contrast to previous preprocessing approaches, our method uses the surface area heuristic to control primitive splitting during tree construction. We show that our algorithm produces significantly more efficient hierarchies than other techniques. In addition, user parameters that directly influence splitting are eliminated, making the algorithm easily controllable.

Item Stream Compaction for Deferred Shading (The Eurographics Association, 2009) Hoberock, Jared; Lu, Victor; Jia, Yuntao; Hart, John C.; David Luebke and Philipp Slusallek
The GPU leverages SIMD efficiency when shading because it rasterizes a triangle at a time, running the same shader on all of its fragments. Ray tracing sacrifices this shader coherence, and the result is that SIMD units often must run different shaders simultaneously resulting in serialization. We study this problem and define a new measure called heterogeneous efficiency to measure SIMD divergence among multiple shaders of different complexities in a ray tracing application. We devise seven different algorithms for scheduling shaders onto SIMD processors to avoid divergence.
In all but simply shaded scenes, we show the expense of sorting shaders pays off with better overall shading performance.

Item Understanding the Efficiency of Ray Traversal on GPUs (The Eurographics Association, 2009) Aila, Timo; Laine, Samuli; David Luebke and Philipp Slusallek
We discuss the mapping of elementary ray tracing operations - acceleration structure traversal and primitive intersection - onto wide SIMD/SIMT machines. Our focus is on NVIDIA GPUs, but some of the observations should be valid for other wide machines as well. While several fast GPU tracing methods have been published, very little is actually understood about their performance. Nobody knows whether the methods are anywhere near the theoretically obtainable limits, and if not, what might be causing the discrepancy. We study this question by comparing the measurements against a simulator that tells the upper bound of performance for a given kernel. We observe that previously known methods are a factor of 1.5–2.5X off from the theoretical optimum, and most of the gap is not explained by memory bandwidth, but rather by previously unidentified inefficiencies in hardware work distribution. We then propose a simple solution that significantly narrows the gap between simulation and measurement. This results in the fastest GPU ray tracer to date. We provide results for primary, ambient occlusion and diffuse interreflection rays.