High-Performance Graphics 2011
Browsing High-Performance Graphics 2011 by Issue Date
Now showing 1 - 20 of 22
Active Thread Compaction for GPU Path Tracing
(ACM, 2011) Wald, Ingo; Carsten Dachsbacher and William Mark and Jacopo Pantaleoni
Modern GPUs like NVIDIA's Fermi internally operate in a SIMD manner by ganging multiple (32) scalar threads together into SIMD warps; if a warp's threads diverge, the warp serially executes both branches, temporarily disabling threads that are not on that path. In this paper, we explore and thoroughly analyze the concept of active thread compaction, i.e., the process of taking multiple partially-filled warps and compacting them to fewer but fully utilized warps, in the context of a CUDA path tracer. Our results show that this technique can indeed lead to significant improvements in SIMD utilization, and corresponding savings in the amount of work performed; however, they also show that certain inadequacies of today's hardware wipe out most of the achieved gains, leaving bottom-up speed-ups of a mere 12-16%. We believe our analysis of why this is the case will provide insight to other researchers experimenting with this technique in different contexts.

Rapid Simplification of Multi-Attribute Meshes
(ACM, 2011) Willmott, Andrew; Carsten Dachsbacher and William Mark and Jacopo Pantaleoni
We present a rapid simplification algorithm for meshes with multiple vertex attributes, targeted at rendering acceleration for realtime applications. Such meshes potentially feature normals, tangents, one or more texture coordinate sets, and animation information, such as blend weights and indices. Simplification algorithms in the literature typically focus on position-based meshes only, though extensions to handle surface attributes have been explored for those techniques based on iterative edge contraction. We show how to achieve the same goal for the faster class of algorithms based on vertex clustering, despite the comparative lack of connectivity information available.
In particular, we show how to handle attribute discontinuities, preserve thin features, and avoid animation-unfriendly contractions, all issues which prevent the base algorithm from being used in a production situation. Our application area is the generation of multiple levels of detail for player-created meshes at runtime, while the main game process continues to run. As such the robustness of the simplification algorithm employed is key; ours has been run successfully on many millions of such models, with no preprocessing required. The algorithm is of application anywhere rapid mesh simplification of standard textured and animated models is desired.

SSLPV: Subsurface Light Propagation Volumes
(ACM, 2011) Børlum, Jesper; Christensen, Brian Bunch; Kjeldsen, Thomas Kim; Mikkelsen, Peter Trier; Noe, Karsten Østergaard; Rimestad, Jens; Mosegaard, Jesper; Carsten Dachsbacher and William Mark and Jacopo Pantaleoni
This paper presents the Subsurface Light Propagation Volume (SSLPV) method for real-time approximation of subsurface scattering effects in dynamic scenes with changing mesh topology and lighting. SSLPV extends the Light Propagation Volume (LPV) technique for indirect illumination in video games. We introduce a new consistent method for injecting flux from point light sources into an LPV grid, a new rendering method which consistently converts light intensity stored in an LPV grid into incident radiance, as well as a model for light scattering and absorption inside heterogeneous materials. Our scheme does not require any precomputation and handles arbitrarily deforming meshes.
We show that SSLPV provides visually pleasing results in real-time at the expense of a few milliseconds of added rendering time.

Improving SIMD Efficiency for Parallel Monte Carlo Light Transport on the GPU
(ACM, 2011) Antwerpen, Dietger van; Carsten Dachsbacher and William Mark and Jacopo Pantaleoni
Monte Carlo Light Transport algorithms such as Path Tracing (PT), Bi-Directional Path Tracing (BDPT) and Metropolis Light Transport (MLT) make use of random walks to sample light transport paths. When parallelizing these algorithms on the GPU the stochastic termination of random walks results in an uneven workload between samples, which reduces SIMD efficiency. In this paper we propose to combine stream compaction and sample regeneration to keep SIMD efficiency high during random walk construction, in spite of stochastic termination. Furthermore, for BDPT and MLT, we propose to evaluate all bidirectional connections of a sample in parallel in order to balance the workload between GPU threads and improve SIMD efficiency during sample evaluation. We present efficient parallel GPU-only implementations for PT, BDPT, and MLT in CUDA. We show that our GPU implementations outperform similar CPU implementations by an order of magnitude.

Lossless Compression of Already Compressed Textures
(ACM, 2011) Ström, Jacob; Wennersten, Per; Carsten Dachsbacher and William Mark and Jacopo Pantaleoni
Texture compression helps rendering by reducing the footprint in graphics memory, thus allowing for more textures, and by lowering the number of memory accesses between the graphics processor and memory, increasing performance and lowering power consumption. Compared to image compression methods like JPEG however, texture codecs are typically much less efficient, which is a problem when downloading the texture over a network or reading it from disk. Therefore, in this paper we investigate lossless compression of already compressed textures.
By predicting compression parameters in the image domain instead of in the parameter domain, a more efficient representation is obtained compared to using general compression such as ZIP or LZMA. This works well also for pixel indices that have previously proved hard to compress. A 4-bit-per-pixel format can thus be compressed to around 2.3 bits per pixel (bpp), or 9.6% of the original size, compared to around 3.0 bpp when using ZIP or 2.8 bpp using LZMA. Compressing the original images with JPEG to the same quality also gives 2.3 bpp, meaning that texture compression followed by our packing is on par with JPEG in terms of compression efficiency.

Depth Buffer Compression for Stochastic Motion Blur Rasterization
(ACM, 2011) Andersson, Magnus; Hasselgren, Jon; Akenine-Moeller, Tomas; Carsten Dachsbacher and William Mark and Jacopo Pantaleoni
Previous depth buffer compression schemes are tuned for compressing depth values generated when rasterizing static triangles. They provide generous bandwidth usage savings, and are of great importance to graphics processors. However, stochastic rasterization for motion blur and depth of field is becoming a reality even for real-time graphics, and previous depth buffer compression algorithms fail to compress such buffers due to the irregularity of the positions and depths of the rendered samples. Therefore, we present a new algorithm that targets compression of scenes rendered with stochastic motion blur rasterization. If possible, our algorithm fits a single time-dependent predictor function for all the samples in a tile. However, sometimes the depths are localized in more than one layer, and we therefore apply a clustering algorithm to split the tile of samples into two layers. One time-dependent predictor function is then created per layer. The residuals between the predictor and the actual depths are then stored as delta corrections.
For scenes with moderate motion, our algorithm can compress down to 65% compared to 75% for the previously best algorithm for stochastic buffers.

Randomized Selection on the GPU
(ACM, 2011) Monroe, Laura; Wendelberger, Joanne; Michalak, Sarah; Carsten Dachsbacher and William Mark and Jacopo Pantaleoni
We implement here a fast and memory-sparing probabilistic top-k selection algorithm on the GPU. The algorithm proceeds via an iterative probabilistic guess-and-check process on pivots for a three-way partition. When the guess is correct, the problem is reduced to selection on a much smaller set. This probabilistic algorithm always gives a correct result and always terminates. Las Vegas algorithms of this kind are a form of stochastic optimization and can be well suited to more general parallel processors with limited amounts of fast memory.

The Alchemy Screen-Space Ambient Obscurance Algorithm
(ACM, 2011) McGuire, Morgan; Osman, Brian; Bukowski, Michael; Hennessy, Padraic; Carsten Dachsbacher and William Mark and Jacopo Pantaleoni
Ambient obscurance (AO) produces perceptually important illumination effects such as darkened corners, cracks, and wrinkles; proximity darkening; and contact shadows. We present the AO algorithm from the Alchemy engine used at Vicarious Visions in commercial games. It is based on a new derivation of screen-space obscurance for robustness, and the insight that a falloff function can cancel terms in a visibility integral to favor efficient operations. Alchemy creates contact shadows that conform to surfaces, captures obscurance from geometry of varying scale, and provides four intuitive appearance parameters: world-space radius and bias, and aesthetic intensity and contrast. The algorithm estimates obscurance at a pixel from sample points read from depth and normal buffers.
It processes dynamic scenes at HD 720p resolution in about 4.5 ms on Xbox 360 and 3 ms on NVIDIA GeForce 580.

MSBVH: An Efficient Acceleration Data Structure for Ray Traced Motion Blur
(ACM, 2011) Gruenschloß, Leonhard; Stich, Martin; Nawaz, Sehera; Keller, Alexander; Carsten Dachsbacher and William Mark and Jacopo Pantaleoni
When a bounding volume hierarchy is used for accelerating the intersection of rays and scene geometry, one common way to incorporate motion blur is to interpolate node bounding volumes according to the time of the ray. However, such hierarchies typically exhibit large overlap between bounding volumes, which results in an inefficient traversal. This work builds upon the concept of spatially partitioning nodes during tree construction in order to reduce overlap in the presence of moving objects. The resulting hierarchies are often significantly cheaper to traverse than those generated by classic approaches.

Adaptive Transparency
(ACM, 2011) Salvi, Marco; Montgomery, Jefferson; Lefohn, Aaron; Carsten Dachsbacher and William Mark and Jacopo Pantaleoni
Adaptive transparency is a new solution to order-independent transparency that closely approximates the ground-truth results obtained with A-buffer compositing but, like a Z-buffer, operates in bounded memory and exhibits consistent performance. The key contribution of our method is an adaptively compressed visibility representation that can be efficiently constructed and queried while rendering. The algorithm supports a wide range and combination of transparent geometry (e.g., foliage, windows, hair, and smoke).
We demonstrate that adaptive transparency is five to forty times faster than realtime A-buffer implementations, closely matches the image quality, and is both higher quality and faster than other approximate order-independent transparency techniques: stochastic transparency, uniform opacity shadow maps, and Fourier opacity mapping.

Voxelized Shadow Volumes
(ACM, 2011) Wyman, Chris; Carsten Dachsbacher and William Mark and Jacopo Pantaleoni
Efficient shadowing algorithms have been sought for decades, but most shadow research focuses on quickly identifying shadows on surfaces. This paper introduces a novel algorithm to efficiently sample light visibility at points inside a volume. These voxelized shadow volumes (VSVs) extend shadow maps to allow efficient, simultaneous queries of visibility along view rays, or can alternately be seen as a discretized shadow volume. We voxelize the scene into a binary, epipolar-space grid where we apply a fast parallel scan to identify shadowed voxels. Using a view-dependent grid, our GPU implementation looks up 128 visibility samples along any eye ray with a single texture fetch. We demonstrate our algorithm in the context of interactive shadows in homogeneous, single-scattering participating media.

High-Performance Software Rasterization on GPUs
(ACM, 2011) Laine, Samuli; Karras, Tero; Carsten Dachsbacher and William Mark and Jacopo Pantaleoni
In this paper, we implement an efficient, completely software-based graphics pipeline on a GPU. Unlike previous approaches, we obey ordering constraints imposed by current graphics APIs, guarantee hole-free rasterization, and support multisample antialiasing. Our goal is to examine the performance implications of not exploiting the fixed-function graphics pipeline, and to discern which additional hardware support would benefit software-based graphics the most. We present significant improvements over previous work in terms of scalability, performance, and capabilities.
Our pipeline is malleable and easy to extend, and we demonstrate that in a wide variety of test cases its performance is within a factor of 2-8x compared to the hardware graphics pipeline on a top-of-the-line GPU. Our implementation is open sourced and available at http://code.google.com/p/cudaraster/

Preface and Table of Contents
(ACM, 2011) Carsten Dachsbacher and William Mark and Jacopo Pantaleoni

Primitive Processing and Advanced Shading Architecture for Embedded Space
(ACM, 2011) Kazakov, Max; Ohbuchi, Eisaku; Carsten Dachsbacher and William Mark and Jacopo Pantaleoni
This paper presents a new graphics architecture enabling content-rich applications for the embedded space by extending hardware architecture in two main areas: geometry processing and configurable per-fragment shading. Our first contribution combines vertex cache and a programmable geometry engine that handles both fixed and variable size geometrical primitives completely on-chip. It enables subdivision surface tessellation, silhouette rendering and other geometry processing algorithms to be implemented in one pass and without external memory access. Our second contribution is in configurable per-fragment shading that is mainly a dot product + lookup table machine being versatile enough to realize Cook-Torrance shading, Schlick anisotropy model and others. Memory storage and memory bandwidth are reduced in the proposed architecture as both compact geometry and material descriptions are possible, enabling complex shapes and sophisticated shading models in embedded space. The architecture has complete HDL and ASIC implementations and was demonstrated during the ESEC 2008 exhibition in Japan.
Exposing all the features of our architecture via OpenGL ES 1.X and 2.0 API enabled extended OpenGL ES engines from Rightware Oy to run on our ASIC implementations.

An Inexpensive Bounding Representation for Offsets of Quadratic Curves
(ACM, 2011) Ruf, Erik; Carsten Dachsbacher and William Mark and Jacopo Pantaleoni
We describe a simple mechanism for bounding the portion of the plane lying between a quadratic Bezier curve segment and its offset curve at distance d. Instead of comprising one or more partial bounding polygons, our representation consists of only a single approximate offset curve segment, also in quadratic Bezier form. Evaluated on a corpus of real-world curves, this technique avoids 68-99% of antialias-distance queries and 41-96% of brush-parameter queries. A proof of correctness is provided.

Precision Selection for Energy-Efficient Pixel Shaders
(ACM, 2011) Pool, Jeff; Lastra, Anselmo; Singh, Montek; Carsten Dachsbacher and William Mark and Jacopo Pantaleoni
In this work, we seek to realize energy savings in modern pixel shaders by reducing the precision of their arithmetic. We explore three schemes for controlling this reduction. The first is a static analysis technique, which analyzes shader programs to choose precision with guaranteed error bounds. This approach may be too conservative in practice since it cannot take advantage of run-time information, so we also examine two methods that take the actual data values into account: a programmer-directed approach and a closed-loop error-tracking approach, both of which can lead to higher savings. To use this last method, we developed several heuristics to control how the precisions will change over time. We simulate several series of frames from commercial applications to evaluate the performance of these different schemes.
The average savings found by the static and dynamic approaches are 31%, 70%, and 62% in the pixel shader's arithmetic, respectively, which could result in as much as a 10-20% savings of the GPU's energy as a whole.

Farthest-Point Optimized Point Sets with Maximized Minimum Distance
(ACM, 2011) Schlömer, Thomas; Heck, Daniel; Deussen, Oliver; Carsten Dachsbacher and William Mark and Jacopo Pantaleoni
Efficient sampling often relies on irregular point sets that uniformly cover the sample space. We present a flexible and simple optimization strategy for such point sets. It is based on the idea of increasing the mutual distances by successively moving each point to the farthest point, i.e., the location that has the maximum distance from the rest of the point set. We present two iterative algorithms based on this strategy. The first is our main algorithm which distributes points in the plane. Our experimental results show that the resulting distributions have almost optimal blue noise properties and are highly suitable for image plane sampling. The second is a variant of the main algorithm that partitions any point set into equally sized subsets, each with large mutual distances; the resulting partitionings yield improved results in more general integration problems such as those occurring in physically based rendering.

Hierarchical Stochastic Motion Blur Rasterization
(ACM, 2011) Munkberg, Jacob; Clarberg, Petrik; Hasselgren, Jon; Toth, Robert; Sugihara, Masamichi; Akenine-Moeller, Tomas; Carsten Dachsbacher and William Mark and Jacopo Pantaleoni
We present a hierarchical traversal algorithm for stochastic rasterization of motion blur, which efficiently reduces the number of inside tests needed to resolve spatio-temporal visibility. Our method is based on novel tile against moving primitive tests that also provide temporal bounds for the overlap.
The algorithm works entirely in homogeneous coordinates, supports MSAA, facilitates efficient hierarchical spatio-temporal occlusion culling, and handles typical game workloads with widely varying triangle sizes. Furthermore, we use high-quality sampling patterns based on digital nets, and present a novel reordering that allows efficient procedural generation with good anti-aliasing properties. Finally, we evaluate a set of hierarchical motion blur rasterization algorithms in terms of depth buffer bandwidth, shading efficiency, and arithmetic complexity.

VoxelPipe: A Programmable Pipeline for 3D Voxelization
(ACM, 2011) Pantaleoni, Jacopo; Carsten Dachsbacher and William Mark and Jacopo Pantaleoni
We present a highly flexible and efficient software pipeline for programmable triangle voxelization. The pipeline, entirely written in CUDA, supports both fully conservative and thin voxelizations, multiple boolean, floating point, vector-typed render targets, user-defined vertex and fragment shaders, and a bucketing mode which can be used to generate 3D A-buffers containing the entire list of fragments belonging to each voxel. For maximum efficiency, voxelization is implemented as a sort-middle tile-based rasterizer, while the A-buffer mode, essentially performing 3D binning of triangles over uniform grids, uses a sort-last pipeline. Despite its major flexibility, the performance of our tile-based rasterizer is always competitive with and sometimes more than an order of magnitude superior to that of state-of-the-art binary voxelizers, whereas our bucketing system is up to 4 times faster than previous implementations. In both cases the results have been achieved through the use of careful load-balancing and high performance sorting primitives.

SAH KD-Tree Construction on GPU
(ACM, 2011) Wu, Zhefeng; Zhao, Fukai; Liu, Xinguo; Carsten Dachsbacher and William Mark and Jacopo Pantaleoni
KD-tree is one of the most efficient acceleration data structures for ray tracing.
In this paper, we present a kd-tree construction algorithm that is precisely SAH-optimized and runs entirely on GPU. We construct the tree nodes in breadth-first order. In order to precisely evaluate the SAH cost, we design a parallel scheme based on the standard parallel scan primitive to count the triangle numbers for all split candidates, and a bucket-based algorithm to sort the AABBs (axis-aligned bounding boxes) of the clipped triangles of the child nodes. The proposed parallel algorithms can be mapped well to the GPU's streaming architecture. The experiments showed that our algorithm can produce kd-trees of the same quality as the off-line CPU algorithms, but runs faster than multi-core CPU algorithms and the GPU SAH BVH-Tree algorithm.
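
As an illustration of the scan-based counting mentioned in the last abstract, the following is a minimal sequential sketch in Python, not the authors' GPU implementation. It evaluates the standard surface area heuristic, cost = c_trav + c_isect * (SA_L / SA * N_L + SA_R / SA * N_R), at every AABB boundary along one axis, obtaining the left and right triangle counts from prefix sums. The function names, the cost constants, and the choice to count flat boxes lying on a plane as "left" are assumptions made for this example.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate


def surface_area(lo, hi):
    """Surface area of the axis-aligned box [lo, hi]."""
    dx, dy, dz = (hi[i] - lo[i] for i in range(3))
    return 2.0 * (dx * dy + dy * dz + dz * dx)


def best_sah_split(node_lo, node_hi, boxes, axis, c_trav=1.0, c_isect=1.0):
    """Return (cost, position, n_left, n_right) of the cheapest split, or None.

    boxes is a list of (lo, hi) AABBs, assumed already clipped to the node.
    c_trav and c_isect are illustrative cost constants (assumptions).
    """
    # Candidate planes: every box boundary strictly inside the node.
    candidates = sorted({corner[axis] for box in boxes for corner in box
                         if node_lo[axis] < corner[axis] < node_hi[axis]})
    if not candidates:
        return None
    m = len(candidates)

    # starts[j]: boxes that begin to the left of candidate j and later ones.
    # ends[j]: boxes that stop being on the right from candidate j onward.
    starts = [0] * (m + 1)
    ends = [0] * (m + 1)
    for lo, hi in boxes:
        flat = lo[axis] == hi[axis]
        # A flat box lying exactly on a plane is counted as "left" (assumption).
        find_start = bisect_left if flat else bisect_right
        starts[find_start(candidates, lo[axis])] += 1
        ends[bisect_left(candidates, hi[axis])] += 1

    # Inclusive prefix sums; on a GPU this step would be a parallel scan.
    left_counts = list(accumulate(starts))
    ended = list(accumulate(ends))
    total = len(boxes)

    node_area = surface_area(node_lo, node_hi)
    best = None
    for i, pos in enumerate(candidates):
        n_left = left_counts[i]
        n_right = total - ended[i]
        left_hi = list(node_hi)
        left_hi[axis] = pos
        right_lo = list(node_lo)
        right_lo[axis] = pos
        cost = c_trav + c_isect * (
            surface_area(node_lo, left_hi) / node_area * n_left
            + surface_area(right_lo, node_hi) / node_area * n_right)
        if best is None or cost < best[0]:
            best = (cost, pos, n_left, n_right)
    return best
```

For example, calling best_sah_split((0, 0, 0), (4, 1, 1), boxes, axis=0) with three boxes spanning x in [0, 1], [0.5, 1], and [3, 4] tests the planes x = 0.5, 1, and 3 and selects x = 1, which separates the two clustered boxes from the third.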