Since its first release in 2007 with compute capability 1.0, CUDA has seen three more architectural releases and eight more compute capabilities, which shows that it is an ever-evolving architecture. Although CUDA is forward compatible, every new release brings new features worth using and increased thread and memory support. As a rule of thumb, each new architecture runs the same CUDA code faster than the previous generation, given that both cards have the same number of cores.

The comparison below lists feature and functionality support across the compute capabilities of NVIDIA’s CUDA-enabled devices. Note that atomic operations weren’t supported in the first release, and since they are so important, NVIDIA now practically compares architectures from 1.1 onward.

Continuing our tradition of providing the best imaging algorithms at lightning-fast speed, we are proud to announce the addition of the DFPD debayer algorithm to CUVI. It is more robust than the existing demosaic and shows no artifacts in high-feature areas. The previous implementation of the demosaic algorithm (which uses bilinear interpolation) is super fast, with a throughput of more than 500 fps on a full HD image on a common GPU, yet it has its downside.

Since the color planes have severe aliasing, a simple interpolation (or HQ bilinear interpolation for that matter) of the individual planes has little effect in removing the artifacts that appear at high-feature regions. Hence we need a better reconstruction approach:


Not only does the new algorithm remove artifacts at high-feature regions, the colors also look more natural and crisp. This is because the DFPD (directional filtering with a posteriori decision) algorithm estimates the green plane better by taking the natural edges of the image into account, and then reconstructs the missing red/blue pixels from that reconstructed green image instead of calculating all values directly.
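To make the idea of edge-aware green estimation concrete, here is a minimal C++ sketch. It is not CUVI's implementation and not the full DFPD algorithm: it uses a simple per-pixel gradient test to pick the interpolation direction, whereas DFPD makes its decision after both directional estimates have been computed. The function name `estimateGreen` and the row-major 16-bit layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Estimate the missing green value at (x, y), which must be a red or blue
// site of a Bayer mosaic stored row-major in `raw`. At such a site the four
// direct neighbours (left, right, up, down) are all green samples.
// Simplified illustration only: the direction is chosen from the local green
// gradients at this single pixel (hypothetical helper, not a CUVI function).
int estimateGreen(const std::vector<std::uint16_t>& raw,
                  int width, int height, int x, int y)
{
    // Border pixels are not handled in this sketch.
    assert(x >= 1 && y >= 1 && x < width - 1 && y < height - 1);

    auto at = [&](int px, int py) {
        return static_cast<int>(raw[static_cast<std::size_t>(py) * width + px]);
    };

    const int left = at(x - 1, y), right = at(x + 1, y);
    const int up   = at(x, y - 1), down  = at(x, y + 1);

    const int gradH = std::abs(left - right);  // green change along the row
    const int gradV = std::abs(up - down);     // green change along the column

    // Interpolate along the direction in which green changes least,
    // so that the estimate does not average across an edge.
    if (gradH < gradV) return (left + right) / 2;
    if (gradV < gradH) return (up + down) / 2;
    return (left + right + up + down) / 4;     // no preferred direction
}
```

With an RGGB pattern, the pixel at (1, 1) is a blue site, so calling `estimateGreen(raw, width, height, 1, 1)` on a small test mosaic is an easy way to see the direction choice at work.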

This huge improvement over the existing implementation comes at a price: more computational cost. The DFPD algorithm runs at roughly half the speed of the previous one; however, it still delivers a whopping 263 fps on a full HD image. Note that this figure excludes memory transfers. And as always with CUVI, you can use this GPU-accelerated DFPD debayer with just three lines of code:

CuviImage input("D:/bayer.tif", CUVI_LOAD_IMAGE_GRAYSCALE_KEEP_DEPTH), output;

cuvi::colorOperations::demosaic_DFPD(input, output, CUVI_BAYER_RGGB);

cuvi::io::saveImage(output, "D:/debayered.png");

There’s an additional, optional refinement step that comes with DFPD to further refine the pixel values and cut down unnatural high frequencies. By default it’s disabled, but you can enable it with a flag:

// Further refine the results
cuvi::colorOperations::demosaic_DFPD(input, output, CUVI_BAYER_RGGB, true);

Download the latest CUVI from here or get more information on the features at our wiki.

CUVILib provides out-of-the-box, hyper-accelerated imaging functionality, ready for use in your film scanning, restoration and recoloring applications. With CUVI, you can deliver supercomputing-like performance to your users without the need to set up expensive high-end CPUs.


A preview of CUDA Toolkit 5 is already available to registered developers, and NVIDIA is expected to roll out the production release soon. Besides the usual addition of more image processing functionality, the new toolkit offers some great features, including:

  1. Dynamic parallelism
  2. GPUDirect for clusters (RDMA)
  3. GPU object linking
  4. NVIDIA Nsight, Eclipse Edition

CUVILib has finally come out of beta. We have added a lot more functionality and made sure that it runs smoothly in mission-critical applications. Its simple API, orders-of-magnitude better performance than competing solutions and cross-platform support give you a complete imaging package. Before we get into what’s new in version 1.2, here are some useful links worth checking out:

The next release of the CUVI library is due within the next 30 days, and we are pleased to announce that it will include lots of functions from the Image Enhancement domain. Our filter module just got better and now supports dozens of predefined filters as well as the option to supply your own custom taps and anchor position. One particular function that I’m excited about in the new release is adjust, which is equivalent to MATLAB’s imadjust function.
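For readers unfamiliar with imadjust, the per-pixel mapping it performs can be sketched in a few lines of C++. This is only an illustration of the mapping on intensities normalized to [0, 1]; the name `adjustIntensity` and its parameters are hypothetical and say nothing about the signature of CUVI's adjust function.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Map `value` from the input range [lowIn, highIn] to the output range
// [lowOut, highOut], with an optional gamma curve. Inputs outside the input
// range are clipped first. All quantities are normalized to [0, 1].
// Hypothetical helper for illustration; not part of the CUVI API.
double adjustIntensity(double value,
                       double lowIn, double highIn,
                       double lowOut, double highOut,
                       double gamma = 1.0)
{
    assert(highIn > lowIn && gamma > 0.0);

    // Clip to the input range, then rescale to [0, 1].
    const double clipped = std::min(std::max(value, lowIn), highIn);
    const double t = (clipped - lowIn) / (highIn - lowIn);

    // Apply gamma and stretch to the output range.
    return lowOut + (highOut - lowOut) * std::pow(t, gamma);
}
```

A gamma below 1 brightens the mid-tones and a gamma above 1 darkens them, while a gamma of exactly 1 gives a purely linear contrast stretch.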

CUVI version 0.5 is cooked in our labs, and we are testing and documenting it at the moment. The new release will be out any time in the coming week. We have been working for almost six months on the new framework, which couldn’t be any simpler or easier to use. In this release we are also enabling our premium feature detectors, which are 10 times faster than OpenCV 2.2.

EVGA has announced the GTX 460 2Win, the first dual-Fermi graphics card, featuring 672 CUDA cores (at 700 MHz) and 2 GB of GDDR5 memory (3600 MHz effective). According to the company, this combination of two lower-end Fermi chips will beat the 3DMark score of the NVIDIA GTX 580. That’s not a biggie, as the GTX 580 has only 512 CUDA cores, but the better news is that the GTX 460 2Win will cost less than the GTX 580, says EVGA.

Image filtering is one of the most basic utilities of image processing and computer vision. Almost any image processing application, such as feature detection, involves applying a series of filters to the image. After reading this guide, you’ll be able to apply filters to images efficiently using the shared memory of the CUDA architecture. Here’s a step-by-step guide to writing your own filter of any type and size. For simplicity I’ll use a 16-bit unsigned grayscale image in this tutorial.
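As a point of reference, the per-pixel computation behind such a filter can be written in plain C++ on the CPU. The sketch below is an assumption-laden illustration rather than the tutorial's kernel: it shows only the weighted sum that each CUDA thread would compute for its pixel, without the shared-memory tiling, and the function name `filterImage`, the replicated-border handling and the float taps are choices made here for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Apply a square kernelSize x kernelSize filter (odd size) to a 16-bit
// grayscale image stored row-major. Taps are applied without flipping
// (correlation form) and border pixels are replicated.
// Hypothetical CPU reference for illustration; not the tutorial's CUDA code.
std::vector<std::uint16_t> filterImage(const std::vector<std::uint16_t>& src,
                                       int width, int height,
                                       const std::vector<float>& taps,
                                       int kernelSize)
{
    assert(kernelSize > 0 && kernelSize % 2 == 1);
    assert(src.size() == static_cast<std::size_t>(width) * height);
    assert(taps.size() == static_cast<std::size_t>(kernelSize) * kernelSize);

    const int radius = kernelSize / 2;
    std::vector<std::uint16_t> dst(src.size());

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int ky = 0; ky < kernelSize; ++ky) {
                for (int kx = 0; kx < kernelSize; ++kx) {
                    // Clamp source coordinates so edge pixels are replicated.
                    const int sx = std::min(std::max(x + kx - radius, 0), width - 1);
                    const int sy = std::min(std::max(y + ky - radius, 0), height - 1);
                    sum += taps[static_cast<std::size_t>(ky) * kernelSize + kx] *
                           src[static_cast<std::size_t>(sy) * width + sx];
                }
            }
            // Round to nearest and saturate to the 16-bit range.
            const float rounded = std::min(std::max(sum + 0.5f, 0.0f), 65535.0f);
            dst[static_cast<std::size_t>(y) * width + x] =
                static_cast<std::uint16_t>(rounded);
        }
    }
    return dst;
}
```

A reference like this is mainly useful for checking the output of a GPU kernel pixel by pixel before optimizing its memory access.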

Note: See updated complete list of differences between all Compute Capabilities of CUDA.

The release of the next-generation CUDA architecture, Fermi, shows that CUDA is still an evolving architecture. Fermi, with a compute capability of 2.0, differs from previous architectures in several ways. In addition to increasing the number of threads per block and packing 512 cores into a single chip, Fermi can also run multiple kernels simultaneously. Shared memory has been increased from 16 KB to 48 KB and, most importantly, the number of streaming processors in one SM has been increased to 32. The comparison below, by NVIDIA, gives a complete picture of the differences between compute capabilities 1.0, 1.1, 1.2, 1.3 and 2.0 of NVIDIA’s CUDA-enabled devices.