After implementing your blurring function, I also implemented what I think is a “bilateral filter”. I don’t know if it is 100% correct, but it seems to do what I intend.

Pseudocode:

[code]
#define DENOISE_AMOUNT 1 // between 0-1

ref = texture( imageMap, IMAGE_TEX_COORDS );
sumColor = ref * weight[0];

#define CHECK(S, X) step( X , intensity_(S) * DENOISE_AMOUNT )

for (int i=1; i<BLUR_SAMPLE_COUNT; i++) {
    cSamp1 = texture( imageMap, IMAGE_TEX_COORDS + OFFSET_BLUR );
    cSamp2 = texture( imageMap, IMAGE_TEX_COORDS - OFFSET_BLUR );

    float diff1 = intensity_(abs(cSamp1 - ref));
    float diff2 = intensity_(abs(cSamp2 - ref));

    sumColor += mix( ref, cSamp1, CHECK(cSamp1 * weight[i], diff1) ) * weight[i];
    sumColor += mix( ref, cSamp2, CHECK(cSamp2 * weight[i], diff2) ) * weight[i];
}

return max(sumColor, 0);
[/code]

For RGB, 'intensity_()' is simply the luminance of the color.
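To make the idea easier to test outside a shader, here is a minimal 1D sketch in Python of the same "use the neighbour only if it is close to the reference" logic. It is not a port of the pseudocode above: the function name `edge_aware_blur`, the fixed `threshold` parameter, and the clamp-to-edge handling are my own assumptions, and the pseudocode derives its threshold from the sample intensity instead.

```python
def edge_aware_blur(signal, weights, threshold):
    """Blur a 1D list of intensities while keeping strong edges.

    weights[0] is the centre weight and weights[i] the weight of the taps
    i samples to either side, so weights[0] + 2 * sum(weights[1:]) should be 1.
    threshold is a hypothetical fixed cut-off (an assumption of this sketch).
    """
    n = len(signal)
    out = []
    for x, ref in enumerate(signal):
        total = ref * weights[0]
        for i in range(1, len(weights)):
            for j in (x - i, x + i):
                sample = signal[min(max(j, 0), n - 1)]  # clamp to the edge
                # Same role as mix(ref, sample, step(...)): fall back to the
                # reference value when the neighbour differs too much.
                chosen = sample if abs(sample - ref) <= threshold else ref
                total += chosen * weights[i]
        out.append(total)
    return out


noisy_step = [0.0, 0.05, 0.0, 1.0, 0.95, 1.0]
smoothed = edge_aware_blur(noisy_step, [0.5, 0.25], threshold=0.2)
```

With a small threshold the noise on each side of the step is averaged, but no value is pulled across the edge. With a very large threshold this reduces to an ordinary 1-2-1 blur.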

I hope I can use this for things like blurring the SSAO map, but I have not tested it for that yet.

If anyone has any advice, I would much appreciate it!

@Christian Cann Schuldt Jensen

I doubt that you are right about the single pass. The Gaussian is separable, but what if the texture does not already contain the blurred values from the first pass? I’m a beginner, though, so correct me if I’m wrong.

P.S.

Sorry if I made any language mistakes, I’m Italian.

This results in the standard

1 2 1
2 4 2
1 2 1

9-tap Gaussian.

You can do the same Gaussian in a single pass if you sample at all 4 corners (4 texture fetches).

You can do a nice and very fast 7-tap blur in a single pass if you sample two opposite corners with an offset of 1/3 away from the center.

This results in a

0 2 1
2 8 2
1 2 0

convolution kernel (seven non-zero taps; the two corners on the other diagonal get zero weight).

If you calculate the strength of the blur by subtracting from the center pixel you can then scale the result so that it closely matches the regular 9-tap gaussian.

I also do a 17-tap blur using only 5 fetches in a single pass by sampling the center pixel and then offsets (0.4,-1.2) , (-1.2,-0.4) , (1.2,0.4) and (-0.4,1.2).

I then adjust the strength of it to more closely match the 9-tap gaussian.
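The footprints described in this comment can be checked with a small Python sketch that accumulates the bilinear weights of each fetch. It assumes plain bilinear filtering with offsets measured in texels from the centre of the centre texel. The helper names `bilinear_footprint` and `grid` are made up for this illustration.

```python
from fractions import Fraction
from math import floor


def bilinear_footprint(offsets):
    """Sum the bilinear weights of one fetch per (dx, dy) offset.

    Returns a dict mapping integer texel coordinates to their total weight.
    """
    kernel = {}
    for dx, dy in offsets:
        x0, y0 = floor(dx), floor(dy)
        fx, fy = dx - x0, dy - y0
        for ix, wx in ((x0, 1 - fx), (x0 + 1, fx)):
            for iy, wy in ((y0, 1 - fy), (y0 + 1, fy)):
                if wx * wy:  # skip texels that receive zero weight
                    kernel[(ix, iy)] = kernel.get((ix, iy), 0) + wx * wy
    return kernel


def grid(kernel, scale, radius=1):
    """Render the kernel as rows of integers, top row first (+y is up)."""
    return [[int(kernel.get((x, y), 0) * scale)
             for x in range(-radius, radius + 1)]
            for y in range(radius, -radius - 1, -1)]


h, t = Fraction(1, 2), Fraction(1, 3)
a, b = Fraction(2, 5), Fraction(6, 5)  # 0.4 and 1.2 as exact fractions

four_corners = bilinear_footprint([(h, h), (h, -h), (-h, h), (-h, -h)])
two_corners = bilinear_footprint([(t, t), (-t, -t)])
five_fetches = bilinear_footprint(
    [(0, 0), (a, -b), (-b, -a), (b, a), (-a, b)])

print(grid(four_corners, 4))  # four fetches at the texel corners
print(grid(two_corners, 9))   # two fetches, 1/3 texel along one diagonal
print(len(five_fetches))      # number of texels touched by the 5 fetches
```

Under these assumptions the four-corner case gives the 1 2 1 / 2 4 2 / 1 2 1 pattern, the two-corner case touches seven texels, and the five-fetch case touches seventeen. The relative weights of the last one are not compared against a Gaussian here.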

Nice article, the connection with Pascal’s triangle is indeed very interesting.

I would like to add two things. First, maybe it would be good to add that the Gaussian has a unit integral, since that is probably one of the reasons the filtering works (in the continuous case at least). The 3D plot should also reflect this (meaning that such a wide Gaussian certainly wouldn’t reach a value of 1.6). This is of course just a minor point.

The second thing that struck me was:

“A Gaussian function with a distribution of 2σ is equivalent with the product of two Gaussian functions with a distribution of σ.”

This is just my opinion, but I think this doesn’t hold. First, I think such a product won’t be a 2σ-Gaussian (properly normalized and all). Second, a repeated application of filters corresponds to applying their *convolution*, which differs from their product. And the convolution of two Gaussians with StD=σ produces a Gaussian with StD=Sqrt(2*σ^2)=σ*Sqrt(2).
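A quick numerical check of this point, as a rough Python sketch: it samples a Gaussian with σ = 3 on integer positions (the value of σ and the ±40 sample range are arbitrary choices for the illustration) and measures the standard deviation of its self-convolution and of its normalized self-product.

```python
from math import exp, sqrt


def normalized(values):
    """Scale the values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def std_dev(xs, weights):
    """Standard deviation of positions xs under normalized weights."""
    mean = sum(x * w for x, w in zip(xs, weights))
    return sqrt(sum(w * (x - mean) ** 2 for x, w in zip(xs, weights)))


def convolve(a, b):
    """Full discrete convolution of two lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


sigma = 3.0
xs = list(range(-40, 41))
g = normalized([exp(-x * x / (2 * sigma * sigma)) for x in xs])

conv = convolve(g, g)                  # applying the filter twice
conv_xs = list(range(-80, 81))         # support of the convolution
prod = normalized([v * v for v in g])  # pointwise product, renormalized

std_single = std_dev(xs, g)
std_conv = std_dev(conv_xs, conv)
std_prod = std_dev(xs, prod)
print(std_single, std_conv, std_prod)
```

The convolution comes out wider by a factor of about Sqrt(2), while the product is narrower by the same factor, so neither one is a 2σ-Gaussian.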

I think what you meant was that the product of two polynomials of order n (the n-th row of Pascal’s triangle) gives you a polynomial of order 2n? That should be true.
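The polynomial statement is easy to confirm, since multiplying polynomials is the same as convolving their coefficient lists. A small Python sketch, with helper names of my own choosing:

```python
from math import comb


def pascal_row(n):
    """Coefficients of (1 + x)^n, i.e. the n-th row of Pascal's triangle."""
    return [comb(n, k) for k in range(n + 1)]


def convolve(a, b):
    """Coefficient list of the product of two polynomials."""
    out = [0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


row4_twice = convolve(pascal_row(4), pascal_row(4))
print(row4_twice == pascal_row(8))
```

So applying the row-n binomial filter twice is the same as applying the row-2n filter once.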

What do you think?

Oskar

This article was incredibly clear and informative. Thank you for taking the time to put it together. I’m still learning this stuff, so this was very helpful.

I wrote up a small Python script based on this post to generate the gaussian texture lookups, optimized for linear sampling. Maybe it can be useful to someone else too:

https://gist.github.com/2332010
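For readers who just want the idea, here is a minimal sketch of how such offsets and weights can be derived. This is not the code from the gist. It assumes binomial weights from row 12 of Pascal's triangle with the two outermost coefficients on each side dropped, and it merges neighbouring texels pairwise. The function name and parameters are made up for the illustration.

```python
from math import comb


def linear_sampling_taps(n=12, drop=2):
    """Return (offsets, weights) for one side of a linearly sampled blur.

    Assumes the remaining half-kernel has an odd number of entries, so the
    taps next to the centre can be merged in pairs.
    """
    row = [comb(n, k) for k in range(n + 1)]
    row = row[drop:len(row) - drop]   # discard the smallest outer coefficients
    total = sum(row)
    half = [c / total for c in row[len(row) // 2:]]  # centre, then one side

    offsets, weights = [0.0], [half[0]]
    for i in range(1, len(half), 2):
        w = half[i] + half[i + 1]
        # Place the fetch between texels i and i + 1 so that bilinear
        # filtering reproduces their individual weights.
        offsets.append((i * half[i] + (i + 1) * half[i + 1]) / w)
        weights.append(w)
    return offsets, weights


offsets, weights = linear_sampling_taps()
print(offsets)
print(weights)
```

With these defaults the result matches the constants used elsewhere in this thread (offsets 0.0, 1.3846153846, 3.2307692308 and weights 0.2270270270, 0.3162162162, 0.0702702703).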

Thanks again

Where is the benefit over the usual, easier way? Look up at coordinates [-4, -2, 0, 2, 4] to get a 10-pixel range, then apply the weighting of each pair of texels and calculate the final value?

Actually, this is a new thing to me; no such thing happens on desktop GPUs.

has the following language:

“Dynamic texture lookups, also known as dependent texture reads, occur when a fragment shader computes texture coordinates rather than using the unmodified texture coordinates passed into the shader. Although the shader language supports this, dependent texture reads can delay loading of texel data, reducing performance. When a shader has no dependent texture reads, the graphics hardware may prefetch texel data before the shader executes, hiding some of the latency of accessing memory.

It may not seem obvious, but any calculation on the texture coordinates counts as a dependent texture read. For example, packing multiple sets of texture coordinates into a single varying parameter and using a swizzle command to extract the coordinates still causes a dependent texture read.”

Perhaps there is a difference in terminology here, but I’ve seen that anything that performs a calculation involving a texture coordinate causes a huge slowdown in the texture fetch in a fragment shader on these mobile devices. This is even true for calculations involving constant offsets, like yours here. Moving the calculations from the fragment to the vertex shader, then providing the results as varyings, took my version of your blur from 50 ms / frame to 2 ms / frame. I think this is one of those areas where the tile-based deferred renderers diverge from standard desktop hardware.

My comments about the for loops are also based on my profiling. The current shader compilers employed for mobile GPUs are not great at these kinds of optimizations yet, so we still need to hand-tune things that should be taken care of for us, like unrolling loops. It’s unfortunate, but hopefully they’ll catch up to desktop compilers soon.

Mobile GPUs are interesting animals, and I’ve had to perform quite a few tweaks like this to get desktop shaders working well on them.

There are no dependent texture reads in my shader either. Everything is uniform. Dependent texture reads mean that the texture coordinates of the fetch are affected by a previous texture fetch. While moving the texture coordinate calculation to the vertex shader might help in some situations, on desktop GPUs the additional varying count would hurt the performance more than performing the math in the fragment shader.

Also, the for loop should not hurt performance either as a proper GLSL compiler implementation should be able to unroll for loops with compile time constant conditions. At least, I can assure you that this happens in case of desktop drivers.

In response to Evan, I was able to adapt the above into a shader program that works well on iOS devices. On an iPhone 4, it can filter a live 640×480 video frame in under 2 ms, and I was able to filter a 2000×1494 image and save it to disk in about 500 ms. The code for this can be found within my open source GPUImage framework under the GPUImageFastBlurFilter class:

https://github.com/BradLarson/GPUImage

I used the following vertex shader:

attribute vec4 position;
attribute vec2 inputTextureCoordinate;

uniform mediump float texelWidthOffset;
uniform mediump float texelHeightOffset;

varying mediump vec2 centerTextureCoordinate;
varying mediump vec2 oneStepLeftTextureCoordinate;
varying mediump vec2 twoStepsLeftTextureCoordinate;
varying mediump vec2 oneStepRightTextureCoordinate;
varying mediump vec2 twoStepsRightTextureCoordinate;

// const float offset[3] = float[]( 0.0, 1.3846153846, 3.2307692308 );

void main()
{
    gl_Position = position;

    vec2 firstOffset = vec2(1.3846153846 * texelWidthOffset, 1.3846153846 * texelHeightOffset);
    vec2 secondOffset = vec2(3.2307692308 * texelWidthOffset, 3.2307692308 * texelHeightOffset);

    centerTextureCoordinate = inputTextureCoordinate;
    oneStepLeftTextureCoordinate = inputTextureCoordinate - firstOffset;
    twoStepsLeftTextureCoordinate = inputTextureCoordinate - secondOffset;
    oneStepRightTextureCoordinate = inputTextureCoordinate + firstOffset;
    twoStepsRightTextureCoordinate = inputTextureCoordinate + secondOffset;
}

and the following fragment shader:

precision highp float;

uniform sampler2D inputImageTexture;

varying mediump vec2 centerTextureCoordinate;
varying mediump vec2 oneStepLeftTextureCoordinate;
varying mediump vec2 twoStepsLeftTextureCoordinate;
varying mediump vec2 oneStepRightTextureCoordinate;
varying mediump vec2 twoStepsRightTextureCoordinate;

// const float weight[3] = float[]( 0.2270270270, 0.3162162162, 0.0702702703 );

void main()
{
    lowp vec3 fragmentColor = texture2D(inputImageTexture, centerTextureCoordinate).rgb * 0.2270270270;
    fragmentColor += texture2D(inputImageTexture, oneStepLeftTextureCoordinate).rgb * 0.3162162162;
    fragmentColor += texture2D(inputImageTexture, oneStepRightTextureCoordinate).rgb * 0.3162162162;
    fragmentColor += texture2D(inputImageTexture, twoStepsLeftTextureCoordinate).rgb * 0.0702702703;
    fragmentColor += texture2D(inputImageTexture, twoStepsRightTextureCoordinate).rgb * 0.0702702703;

    gl_FragColor = vec4(fragmentColor, 1.0);
}

The naming needs a little work, because I use this in two passes, one each for the vertical and horizontal halves, and I just set the corresponding width or height input offset to 0 for the appropriate stage.
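The two-pass arrangement relies on the Gaussian being separable. As a rough CPU-side illustration in Python (not the shader itself), the sketch below runs a 1-2-1 kernel over the rows and then over the columns of an impulse image. The helper names and the clamp-to-edge behaviour are assumptions of the sketch.

```python
def blur_rows(img, kernel):
    """Apply a 1D kernel along each row, clamping reads at the edges."""
    r = len(kernel) // 2
    out = []
    for row in img:
        n = len(row)
        out.append([
            sum(kernel[t] * row[min(max(x + t - r, 0), n - 1)]
                for t in range(len(kernel)))
            for x in range(n)
        ])
    return out


def transpose(img):
    return [list(col) for col in zip(*img)]


def two_pass_blur(img, kernel):
    """Horizontal pass, then a vertical pass over the intermediate result."""
    horizontal = blur_rows(img, kernel)
    return transpose(blur_rows(transpose(horizontal), kernel))


impulse = [[0.0] * 5 for _ in range(5)]
impulse[2][2] = 16.0  # a single bright texel, scaled for readable output

result = two_pass_blur(impulse, [0.25, 0.5, 0.25])
centre = [row[1:4] for row in result[1:4]]
print(centre)
```

The second pass has to read the output of the first pass, which is why the horizontal result goes to an intermediate buffer before the vertical stage runs.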

You do need to move the texture offset calculations to the vertex shader, because dependent texture reads are very expensive on these GPUs. Also, they don’t handle for loops well, so I had to inline the calculations.
