This many points is surely out of scope!

This is about an update to Blender video editing Scopes (waveform, vectorscope, etc.), and a detour into rendering many points on a GPU.

Making scopes more ready for HDR

The current Blender Studio production, Singularity, needed improvements to video editing visualizations, particularly in the HDR area. The visualizations that Blender can do are: histogram, waveform, RGB parade, vectorscope, and the “show overexposed” (“zebra stripes”) overlay. Some of them were not handling HDR content in a useful way, e.g. the histogram and waveform were clamping colors above “white” (1.0) and not displaying their actual value distribution.

So I started to look into that. One of the issues, particularly with the waveform, was that it got calculated on the CPU, by rendering the waveform into a width × 256 bitmap.

This is what a waveform visualization does: each column displays the pixel luminance distribution of that column of the input image. For low dynamic range (8 bit/channel) content, you trivially know that 256 vertical values are enough. But how tall should the waveform image be for HDR content? You could guesstimate something like “the waveform displays +4 extra stops of exposure” and make a 4x taller bitmap.

Or you could…

…move Scopes to the GPU

I thought that doing the calculations needed for the waveform & vectorscope visualizations on the CPU, then sending that bitmap to the GPU for display, sounded a bit silly. And at something like 4K resolution, it is not very fast either! So why not just do it on the GPU?

The process would be:

  • GPU already gets the image it needs to display anyway,
  • Drawing a scope would be rendering a point sprite for each input pixel. Sample the image based on sprite ID in the vertex shader, and position it on the screen accordingly. Waveform puts it at original coordinate horizontally, and at color luminance vertically. Vectorscope puts it based on color YUV U,V values.
  • The points need to use blending in “some way”, so that you can see how many points hit the same luminance level, etc.
  • The points might need to be larger than a pixel, if you zoom in.
  • The points might need to be “smaller than a pixel” if you zoom out, possibly by fading away their blending contribution.
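The per-point placement from the list above can be sketched in C++ like this (Rec.709 luma and chroma constants are my assumption here; the exact weights the Blender code uses may differ):

```cpp
#include <cmath>

struct Float2 { float x, y; };
struct Float3 { float r, g, b; };

// Rec.709 luma weights (assumed; Blender's exact weights may differ).
static float luminance(Float3 c)
{
    return 0.2126f * c.r + 0.7152f * c.g + 0.0722f * c.b;
}

// Waveform: keep the pixel's horizontal position, use luminance as height.
static Float2 waveform_pos(int x, int width, Float3 c)
{
    return Float2{ (float)x / (float)width, luminance(c) };
}

// Vectorscope: place the point at the color's chroma (U, V) coordinates;
// a grayscale pixel lands exactly in the middle at (0, 0).
static Float2 vectorscope_pos(Float3 c)
{
    float y = luminance(c);
    float u = (c.b - y) / 1.8556f; // BT.709 U
    float v = (c.r - y) / 1.5748f; // BT.709 V
    return Float2{ u, v };
}
```

In the actual vertex shader this runs once per sprite ID, sampling the input image at the corresponding pixel.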

So I did all that, and it was easy enough. Performance on my RTX 3080Ti was also much better than with the CPU based scopes. Since rendering alpha blended points makes it easy to have them colored, I also made each point retain a bit of the original image pixel’s hue:

Yay, done! …and then I tested them on my Mac, just to double check that it works. It does! But the new scopes now play back at like 2 frames per second 🤯 Uhh, what is going on? Why?!

I mean, sure, at 4K resolution a full scope now renders 8 million points. But come on, that is on an M4 Max GPU; it should be able to easily do hundreds of millions of primitives in realtime!

Rendering points on a GPU

Turns out, the problematic performance was mostly the vectorscope visualization. Recall that a vectorscope places points based on their signed U,V (from YUV color model). Which means it places a lot of points very near the center, since usually most pixels are not very saturated. A vectorscope of a grayscale image would be all the points right in the middle!

And it turns out, GPUs are not entirely happy when many (tens of thousands or more) points are rendered at the same location and alpha blending is on. And Apple GPUs are extremely not happy about this. “Way too many” things in the same tile are likely to overflow some sort of tile capacity buffers (on tile-based GPUs), and blending “way too many” fragments in the same location is probably running into a bottleneck due to fixed capacity of blending / ROP backend queues (see “A trip through the Graphics Pipeline 2011, part 9”).

Rendering single-pixel points is not terribly efficient on any GPU, of course. GPUs rasterize everything in 2x2 pixel “quads”, so each single pixel point is at least 4 pixel shader executions, with three of them thrown out (see “Counting Quads” or “A trip through the Graphics Pipeline 2011, part 8”).

Could I rasterize the points in a compute shader instead? Would that be faster?

Previous research (“Rendering Point Clouds with Compute Shaders”, related code), as well as “compute based rendering” approaches like Media Molecule’s Dreams or Unreal’s Nanite, suggest that it might be worth a shot.

It was time to do some 🔬📊SCIENCE📊🔬: make a tiny WebGPU test that exercises various point rendering scenarios, and run it on a bunch of GPUs. And I did exactly that: webgpu-point-raster.html, which renders millions of single pixel points into anything from a “regular” (500x500-ish) area down to a “very small” (5x5 pixel) area, with alpha blending, using either the built-in GPU point rendering or a compute shader.

A bunch of people on the interwebs tested it out and I got results from 30+ GPU models, spanning all sorts of GPU architectures and performance levels. Here’s how much time each GPU takes to render 4 million single-pixel points into a roughly 460x460 pixel area (so about 20 points hitting each pixel). The second chart shows how many times slower point rasterization becomes if the same amount of points gets blended into a 5x5 pixel area (160 thousand points per pixel).

From the second chart we can see that even if conceptually the GPU does the same amount of work – the same amount of points doing the same rasterization and blending, with the 2x2 quad overshading affecting both scenarios equally – all the GPUs render slower when the points hit a much smaller screen area. Everyone is 2-5 times slower, and then there are Apple Mac GPUs that are 12-19 times slower. Also, curiously enough, even within the same GPU vendor, it looks like the “high-end” GPUs experience a relatively larger slowdown.

My guess is that this shows the effect of blending units having a limited-size “queue”, and of the fact that blending needs to happen serially and in-order (again, see part 9 mentioned above). As for why Apple GPUs are affected way more than anyone else… I don’t know exactly. Maybe because they do not have fixed function blending hardware at all (instead the shader reads the current pixel value and does blending by modifying it), so in order to maintain the correct blending order, the whole pixel execution needs to be in some sort of “queue”? Curiously, Apple’s own performance tools (Metal frame capture in Xcode) do not say anything useful for this case, except “your fragment shader takes forever!”. Which is not entirely incorrect, but it would be more useful if it said “it is not your code that is slow, it is the blending”.

Let’s do some compute shader point rendering!

The compute shader is a trivially naïve approach: have R,G,B uint per-pixel buffers; each point does an atomic add of its fixed point color; finally a regular fragment shader resolves these buffers into visible colors. It is a “baby’s first compute” type of approach really, without any tricks like using wave/subgroup operations to detect a whole wavefront hitting the same pixel, or distributing points into tiles + prefix sum + rasterizing points inside tiles, or trying to pack the color buffers into something more compact. None of that, so I was not expecting the compute shader approach to be much better.

Here are two charts: how much faster this simple compute shader approach is, compared to built-in GPU point rendering. First for the “4M points in a 460x460 pixel area” case, then for the 5x5 pixel area case:

Several surprising things:

  • Even this trivial compute shader, for the not-too-crazy-overdraw case, is faster than built-in point rasterization on all GPUs. Mostly it is 1.5-2 times faster, with some outliers (AMD GPUs love it – it is like 10x faster than rasterization!).
  • For the “4M points in just a 5x5 pixel area” case, the compute shader approach is even better. I was not expecting that – the atomic additions it does would get crazily contended – but it is around 5x faster than rasterization across the board. My only guess is that while contended atomics are not great, they perhaps are still better than contended blending units?

Finally, a chart to match the rasterization one: how many times slower the compute shader rendering gets when the 460x460 area is reduced to a 5x5 one:

I think this shows “how good the GPU is at dealing with contended atomics”, and it seems to suggest that relatively speaking, AMD GPUs and recent Apple GPUs are not that great there. But again, even with this relative slowdown, the compute shader approach is way faster than the rasterization one, so…

Compute shaders are useful! What a finding!

But let’s get back to Blender.

Blender Scopes on the GPU, with a compute shader

So that’s what I did then – I made the Blender video sequencer waveform/parade/vectorscope be calculated and rendered on the GPU, using a compute shader for the point rasterization. That actually also allows “better” blending than what would be possible with fixed function blending – since I am accumulating the points hitting the same pixel, I can do a non-linear alpha mapping in the final resolve pass.
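As an illustration of that resolve-pass freedom, here is one possible non-linear mapping (my own guess at the general shape, not the curve the actual patch uses): an exponential response that brightens quickly for the first few points but never hard-clips, no matter how many points pile up on a pixel.

```cpp
#include <cmath>

// Map "how much point coverage accumulated on this pixel" to display alpha.
// Fixed-function blending can only apply dst = dst*(1-a) + src*a per point;
// with the full accumulated sum available, any curve is possible.
static float resolve_alpha(float accumulated, float strength)
{
    return 1.0f - std::exp(-accumulated * strength);
}
```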

The pull request #144867 has just landed, so scopes in Blender 5.0 will get faster and look better. All the scopes, everywhere, all at once, now look like this:

Whereas in current Blender 4.5 they look like this:

And for historical perspective, two years ago in Blender 4.0, before I started to dabble in this area, they looked like this:

Also, playback of this screen setup (4K EXR images, all these views/scopes) on my PC was at 1.1FPS in Blender 4.0; at 7.9FPS in Blender 4.5; and at 14.1FPS with these GPU scopes. Still work to do, but hey, progress.

That’s it, bye!


Lossless Float Image Compression

Back in 2021 I looked at OpenEXR lossless compression options (and I think my findings led to a change of the default zip compression level, as well as a change of the compression library from zlib to libdeflate. Yay blogging about things!). Then in 2023 I looked at losslessly compressing a bunch of floating point data, some of which might be image-shaped.

Well, now a discussion somewhere else has nerd-sniped me to look into lossless compression of floating point images, and especially the ones that might have more than just RGB(A) color channels. Read on!

Four bullet point summary, if you’re in a hurry:

  • Keep on using OpenEXR with ZIP compression.
  • Soon OpenEXR might add HTJ2K compression; it compresses slightly better but has worse compression and decompression performance, so YMMV.
  • JPEG-XL is not competitive with OpenEXR in this area today.
  • You can cook up a “custom image compression” that seems to be better than all of EXR, EXR HTJ2K and JPEG-XL, while also being way faster.

My use case and the data set

What I wanted to primarily look at, are “multi-layer” images that would be used for film composition workflows. In such an image, a single pixel does not have just the typical RGB (and possibly alpha) channels, but might have more. Ambient occlusion, direct lighting, indirect lighting, depth, normal, velocity, object ID, material ID, and so on. And the data itself is almost always floating point values (either FP16 or FP32); sometimes with different precision for different channels within the same image.

There does not seem to be a readily available “standard image set” like that to test things on, so I grabbed some that I could find, and rendered some myself out of various Blender splash screen files. Here are the 10 data files I’m testing on (total uncompressed pixel size: 3122MB):

File                            Resolution  Uncompressed size  Channels
Blender281rgb16.exr             3840x2160   47.5MB             RGB half
Blender281rgb32.exr             3840x2160   94.9MB             RGB float
Blender281layered16.exr         3840x2160   332.2MB            21 channels, half
Blender281layered32.exr         3840x2160   664.5MB            21 channels, float
Blender35.exr                   3840x2160   332.2MB            18 channels, mixed half/float
Blender40.exr                   3840x2160   348.0MB            15 channels, mixed half/float
Blender41.exr                   3840x2160   743.6MB            37 channels, mixed half/float
Blender43.exr                   3840x2160   47.5MB             RGB half
ph_brown_photostudio_02_8k.exr  8192x4096   384.0MB            RGB float, from polyhaven
ph_golden_gate_hills_4k.exr     4096x2048   128.0MB            RGBA float, from polyhaven

OpenEXR

OpenEXR is an image file format that has existed since 1999, and is primarily used within film, vfx and game industries. It has several lossless compression modes (see my previous blog post series).

It looks like OpenEXR 3.4 (should be out 2025 Q3) is adding a new HTJ2K compression mode, which is based on “High-Throughput JPEG 2000” format/algorithms, using open source OpenJPH library. The new mode is already in OpenEXR main branch (PR #2041).

So here’s how EXR does on my data set (click for a larger interactive chart):

This is two plots: compression ratio vs. compression performance, and compression ratio vs. decompression performance. In both cases, the best place on the chart is top right – the largest compression ratio, and the best performance.

For performance, I’m measuring it in GB/s, in terms of uncompressed data size. That is, if we have 1GB worth of raw image pixel data and processing it took half a second, that’s 2GB/s throughput (even if the compressed data size might be different). Note that the vertical scales of the two graphs are different. I am measuring compression/decompression time without actual disk I/O, for simplicity – that is, I am “writing” and “reading” “files” from memory. The graph is from a run on an Apple MacBook Pro M4 Max, with things compiled in the “Release” build configuration using Xcode/clang 16.1.
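Spelled out, the two quantities plotted are just this (I am assuming 1GB = 2^30 bytes here; the charts could equally use 10^9):

```cpp
// Throughput is relative to the *uncompressed* data size: 1GB of raw pixels
// processed in half a second is 2GB/s, whatever the compressed size was.
static double throughput_gbps(double uncompressed_bytes, double seconds)
{
    return uncompressed_bytes / (1024.0 * 1024.0 * 1024.0) / seconds;
}

// Ratio of raw size to compressed size; e.g. 2.0x means files half the size.
static double compression_ratio(double uncompressed_bytes, double compressed_bytes)
{
    return uncompressed_bytes / compressed_bytes;
}
```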

Green dot is EXR ZIP at default compression level (which is 4, but changing the level does not affect things much). Blue dot is the new EXR HTJ2K compression – a bit better compression ratio, but also lower performance. Hmm dunno, not very impressive? However:

  • From what I understand, HTJ2K achieves better ratio on RGB images by applying a de-correlation transform. In case of multi-layer EXR files (which is most of my data set), it only does that for one layer (usually the “final color” one), but does not try to do that on, for example, “direct diffuse” layer which is also “actually RGB colors”. Maybe future work within OpenEXR HTJ2K will improve this?
  • Initial HTJ2K evaluation done in 2024 found that a commercial HTJ2K implementation (from Kakadu) is quite a bit faster than the OpenJPH that is used in OpenEXR. Maybe future work within OpenJPH will speed it up?
  • It very well might be that once/if OpenEXR gets lossy HTJ2K, things will be much more interesting. But that is a whole other topic.

I was testing OpenEXR main branch code from 2025 June (3.4.0-dev, rev 45ee12752), and things are multi-threaded via Imf::setGlobalThreadCount(). Addition of HTJ2K compression codec adds 308 kilobytes to executable size by the way (on Windows x64 build).

Moving on.

JPEG-XL lossless

JPEG-XL is a modern image file format that aims to be a good improvement over many existing image formats; both lossless and lossy, supporting standard and high dynamic range images, and so on. There’s a recent “The JPEG XL Image Coding System” paper on arXiv with many details and impressive results, and the reference open source implementation is libjxl.

However, the arXiv paper above does not have any comparisons in how well JPEG-XL does on floating point data (it does have HDR image comparisons, but at 10/12/16 bit integers with a HDR transfer function, which is not the same). So here is me, trying out JPEG-XL lossless mode on images that are either FP16 or FP32 data, often with many layers (JPEG-XL supports this via “additional channels” concept), and sometimes with different floating point types based on channel.

Here’s results with existing EXR data, and JPEG-XL additions in larger red dots (click for an interactive chart):

Immediate thoughts are okay this can achieve better compression, coupled with geez that is slow. Expanding a bit:

  • At compression efforts 1-3, JPEG-XL does not win against OpenEXR (ZIP / HTJ2K) on compression ratio, while being 3x slower to compress and 3x-7x slower to decompress. So that is clearly not a useful place to be.
  • At compression effort levels 4+ it starts winning in compression ratio. Level 4 wins against HTJ2K a bit (1.947x -> 2.09x); the default level 7 wins more (2.186x), and there’s quite a large increase in ratio at level 8 (2.435x). I briefly tried levels 9 and 10, but they do not seem to be giving much ratio gains, while being extraordinarily slow to compress. Even level 8 is already 100 times slower to compress than EXR, and 5-13x slower to decompress. So yeah, if final file size is really important to you, then maybe; on the other hand, 100x slower compression is, well, slow.

Looking at the feature set and documentation of the format, it feels like JPEG-XL is primarily targeted at “actually displayed images, perhaps for the web”. Whereas with EXR, you can immediately see that it is not meant for “images that are displayed” – it does not even have a concept of low dynamic range imagery; everything is geared towards images used in the middle of the pipeline. From that fall out built-in features like an arbitrary number of channels, multi-part images, mipmaps, etc. Within JPEG-XL, everything is centered around “color”, and while it can do more than just color, those feel like bolted-on things. It can do multiple frames, but they have to be the same size/format and are meant in the “animation frames” sense; it can do multiple layers, but those are meant in the “photoshop layers” sense; it talks about storing floating point data, but in the “HDR color, or values a bit out of the color gamut” sense. And that is fine; the JPEG-XL coding system paper itself has a chart of what JPEG-XL wants to be (I circled that in red) and where EXR is (circled in green):

More subjective notes and impressions:

  • Perhaps the floating point paths within libjxl did not (yet?) get the same attention as “regular images” did; it is very possible that they will improve the performance and/or ratio in the future (I was testing end-of-June 2025 code).
  • A cumbersome part of libjxl is that color channels need to be interleaved, and all the “other channels” need to be separate (planar). All my data is fully interleaved, so it costs some performance to arrange it as libjxl wants, both for compression and after decompression. As a user, it would be much more convenient to use if their API was similar to OpenEXR Slice that takes a pointer and two strides (stride between pixels, and stride between rows). Then any combination of interleaved/planar or mixed formats for different channels could be passed with the same API. In my own test code, reading and writing EXR images using OpenEXR is 80 lines of code, whereas JPEG-XL via libjxl is 550 lines.
  • On half-precision floats (FP16), libjxl currently is not fully lossless – subnormal values do not roundtrip correctly (issue #3881). The documentation also says that non-finite values (infinities / NaNs) in both FP32 and FP16 are not expected to roundtrip in an otherwise lossless mode. This is in contrast with EXR, where even for NaNs, their exact bit patterns are fully preserved. Again, this does not matter if the intended use case is “images on screen”, but matters if your use case is “this looks like an image, but is just some data”.
  • From what I can tell, some people did performance evaluations of EXR ZIP vs JPEG-XL by using ffmpeg EXR support; do not do that. At least right now, ffmpeg EXR code is their own custom implementation, that is completely single threaded and lacks some other optimizations that official OpenEXR library does.
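For reference, the stride-based channel description mentioned in the libjxl API complaint above boils down to something like this simplified sketch (modeled loosely on OpenEXR’s Imf::Slice idea, not its actual declaration):

```cpp
#include <cstddef>
#include <cstdint>

// A channel is a base pointer plus two strides. The same description covers
// fully interleaved data (xStride = whole pixel size), planar data
// (xStride = one sample size), or any mix of the two per channel.
struct Slice
{
    uint8_t* base;    // address of this channel's value at pixel (0, 0)
    size_t   xStride; // bytes between horizontally adjacent pixels
    size_t   yStride; // bytes between rows

    void* at(int x, int y) const
    {
        return base + (size_t)y * yStride + (size_t)x * xStride;
    }
};
```

With an API like that, a library never needs to dictate the caller’s memory layout; the caller just describes whatever layout it already has.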

I was testing libjxl main branch code from 2025 June (0.12-dev, rev a75b322e), and things are multi-threaded via JxlThreadParallelRunner. Library adds 6017 kilobytes to executable size (on Windows x64 build).

And now for something completely different:

Mesh Optimizer to compress images, why not?

Back when I was playing around with floating point data compression, one of the things I tried was using meshoptimizer by Arseny Kapoulkine to losslessly compress the data. It worked quite well, so why not try this again. Especially since it got both compression ratio and performance improvements since then.

So let’s try a “MOP”, which is not an actual image format, just something I quickly cooked up:

  • A small header with image size and channel information,
  • Then the image is split into chunks, each 16K pixels in size. Each chunk is compressed independently and in parallel.
  • A small table with the compressed size of each chunk is written after the header, followed by the compressed data for each chunk.
  • Mesh optimizer needs the “vertex size” (pixel size in this case) to be a multiple of four; if that is not the case, the chunk data is padded with zeroes inside the compression/decompression code.
  • And just like the previous time: the mesh optimizer vertex codec is not an LZ-based compressor (it seems to be more of a delta/prediction scheme that packs nicely), and you can further compress the result by just piping it into a regular lossless compressor. In my case, I used zstd.
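The container logic of those bullets, with the actual codec calls stubbed out (the real thing would call meshoptimizer’s vertex codec and optionally zstd; the chunk size and multiple-of-four padding rule are straight from the list above):

```cpp
#include <cstddef>

// Pixels per independently (and in parallel) compressed chunk.
constexpr size_t kChunkPixels = 16 * 1024;

// meshoptimizer's vertex codec needs the "vertex" (here: pixel) size to be
// a multiple of four bytes; pad up, zero-filling the extra bytes.
static size_t padded_pixel_size(size_t pixel_size)
{
    return (pixel_size + 3) & ~static_cast<size_t>(3);
}

// How many chunks a given image needs (last chunk may be partial).
static size_t chunk_count(size_t total_pixels)
{
    return (total_pixels + kChunkPixels - 1) / kChunkPixels;
}

// File layout: header, then the table of per-chunk compressed sizes,
// then the compressed chunk payloads themselves.
```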

So here’s how “MOP” does on my data set (click for a larger interactive chart):

The purple dots are the new “MOP” additions. You can see there are two groups of them: 1) around 2.0x ratio and very high decompression speed is just mesh optimizer vertex codec, 2) around 2.3x ratio and slightly lower decompression speed is mesh optimizer codec followed by zstd.

And that is… very impressive, I think:

  • Just the mesh optimizer vertex codec itself achieves about the same or a slightly higher compression ratio than EXR HTJ2K, while being almost 2x faster to compress and 5x faster to decompress.
  • Coupled with zstd, it achieves compression ratio that is between JPEG-XL levels 7-8 (2.3x), while being 30-100 times faster to compress and 20 times faster to decompress. This combination also very handily wins against EXR (both ZIP and HTJ2K), in both ratio and performance.
  • Arseny is a witch!?

I was testing mesh optimizer v0.24 (2025 June) and zstd v1.5.7 (2025 Feb). Mesh optimizer itself adds just 26 kilobytes (!) of executable code; however zstd adds 405 kilobytes.

And here are the results of all the above, running on a different CPU, OS and compiler (Ryzen 5950X, Windows 10, Visual Studio 2022 v17.14). Everything is several times slower (some of that is due to Apple M4 having crazy high memory bandwidth, some of that CPU differences, some of that compiler differences, some OS behavior with large allocations, etc.). But overall “shape” of the charts is more or less the same:

That’s it for now!

So there. Source code of everything above is over at github.com/aras-p/test_exr_htj2k_jxl. Again, my own takeaways are:

  • EXR ZIP is fine,
  • EXR HTJ2K is slightly better compression, worse performance. There is hope that performance can be improved.
  • JPEG-XL does not feel like a natural fit for these (multi-layered, floating point) images right now. However, it could become one in the future, perhaps.
  • JPEG-XL (libjxl) compression performance is very slow, however it can achieve better ratios than EXR. Decompression performance is also several times slower. It is possible that both performance and ratio could be improved, especially if they did not focus on floating point cases yet.
  • Mesh Optimizer (optionally coupled with zstd) is very impressive, both in terms of compression ratio and performance. It is not an actual image format that exists today, but if you need to losslessly compress some floating point images for internal needs only, it is worth looking at.

And again, all of that was for fully lossless compression. Lossy compression is a whole other topic that I may or may not look into someday. Or someone else could look! Feel free to use the image set I have used.


Voronoi, Hashing and OSL

Sergey from Blender asked me to look into why trying to manually sprinkle some SIMD into Cycles renderer Voronoi node code actually made things slower, and I started to look, and what I did in the end had nothing to do with SIMD whatsoever!

TL;DR: Blender 5.0 changed Voronoi node hash function to a faster one.

Voronoi in Blender

Blender has a Voronoi node that can be used in any node based scenario (materials, compositor, geometry nodes). More precisely, it is actually Worley noise, a procedural noise function. It can be used to produce various interesting patterns:

A typical implementation of Voronoi uses a hash function to randomly offset each grid cell. For something like a 3D noise case, it has to calculate said hash on 27 neighboring cells (3x3x3), for each item being evaluated. That is a lot of hashing!
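The loop structure that causes all that hashing, sketched in C++ (hash_cell stands in for the real cell-offset hash; the distance bookkeeping is elided):

```cpp
// For every evaluated point, 3D Voronoi visits the 3x3x3 neighborhood of
// the cell containing the point, and needs one hash per visited cell to
// get that cell's pseudorandom feature-point offset.
template <typename HashFn>
static int visit_voronoi_neighbors(int cx, int cy, int cz, HashFn hash_cell)
{
    int hash_calls = 0;
    for (int k = -1; k <= 1; k++)
        for (int j = -1; j <= 1; j++)
            for (int i = -1; i <= 1; i++)
            {
                hash_cell(cx + i, cy + j, cz + k); // then distance test vs. best-so-far
                hash_calls++;
            }
    return hash_calls; // 27 hash evaluations per evaluated point
}
```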

The current implementation of, e.g., “calculate a random 0..1 3D offset for a 3D cell coordinate” looked like this in Blender:

// Jenkins Lookup3 Hash Function
// https://burtleburtle.net/bob/c/lookup3.c
#define rot(x, k) (((x) << (k)) | ((x) >> (32 - (k))))
#define mix(a, b, c) { \
    a -= c; a ^= rot(c, 4); c += b; \
    b -= a; b ^= rot(a, 6); a += c; \
    c -= b; c ^= rot(b, 8); b += a; \
    a -= c; a ^= rot(c, 16); c += b; \
    b -= a; b ^= rot(a, 19); a += c; \
    c -= b; c ^= rot(b, 4); b += a; \
}
#define final(a, b, c) { \
    c ^= b; c -= rot(b, 14); \
    a ^= c; a -= rot(c, 11); \
    b ^= a; b -= rot(a, 25); \
    c ^= b; c -= rot(b, 16); \
    a ^= c; a -= rot(c, 4); \
    b ^= a; b -= rot(a, 14); \
    c ^= b; c -= rot(b, 24); \
}
uint hash_uint3(uint kx, uint ky, uint kz)
{
    uint a;
    uint b;
    uint c;
    a = b = c = 0xdeadbeef + (3 << 2) + 13;
    c += kz;
    b += ky;
    a += kx;
    final(a, b, c);
    return c;
}
uint hash_uint4(uint kx, uint ky, uint kz, uint kw)
{
    uint a;
    uint b;
    uint c;
    a = b = c = 0xdeadbeef + (4 << 2) + 13;
    a += kx;
    b += ky;
    c += kz;
    mix(a, b, c);
    a += kw;
    final(a, b, c);
    return c;
}

float uint_to_float_incl(uint n)
{
    return (float)n * (1.0f / (float)0xFFFFFFFFu);
}
float hash_uint3_to_float(uint kx, uint ky, uint kz)
{
    return uint_to_float_incl(hash_uint3(kx, ky, kz));
}
float hash_uint4_to_float(uint kx, uint ky, uint kz, uint kw)
{
    return uint_to_float_incl(hash_uint4(kx, ky, kz, kw));
}
float hash_float3_to_float(float3 k)
{
    return hash_uint3_to_float(as_uint(k.x), as_uint(k.y), as_uint(k.z));
}
float hash_float4_to_float(float4 k)
{
    return hash_uint4_to_float(as_uint(k.x), as_uint(k.y), as_uint(k.z), as_uint(k.w));
}

float3 hash_float3_to_float3(float3 k)
{
    return float3(hash_float3_to_float(k),
        hash_float4_to_float(float4(k.x, k.y, k.z, 1.0)),
        hash_float4_to_float(float4(k.x, k.y, k.z, 2.0)));
}

i.e. it is based on Bob Jenkins’ “lookup3” hash function, and does that “kind of three times”, pretending to hash float3(x,y,z), float4(x,y,z,1) and float4(x,y,z,2). This is to calculate the offset of one grid cell. Repeat that for 27 grid cells in the 3D Voronoi case.

I know! Let’s switch to PCG3D hash!

If you are aware of the “Hash Functions for GPU Rendering” (Jarzynski, Olano, 2020) paper, you might say “hey, maybe instead of using a hash function from 1997, let’s use a dedicated 3D->3D hash function from several decades later”. And you would be absolutely right:

uint3 hash_pcg3d(uint3 v)
{
  v = v * 1664525u + 1013904223u;
  v.x += v.y * v.z;
  v.y += v.z * v.x;
  v.z += v.x * v.y;
  v = v ^ (v >> 16);
  v.x += v.y * v.z;
  v.y += v.z * v.x;
  v.z += v.x * v.y;
  return v;
}
float3 hash_float3_to_float3(float3 k)
{
  uint3 uk = as_uint3(k);
  uint3 h = hash_pcg3d(uk);
  float3 f = float3(h);
  return f * (1.0f / (float)0xFFFFFFFFu);
}

Which is way cheaper (the hash function itself is like 4x faster on modern CPUs). Good! We are done!

If you are using hash functions from the 1990s, try some of the more modern ones! They might be both simpler and of the same or better quality. Hash functions from several decades ago were built on the assumption that multiplication is very expensive, which is very much not the case anymore.

So you do this for the various Voronoi cases of 2D->2D, 3D->3D, 4D->4D. First in the Cycles C++ code (which compiles to both CPU execution and to GPU via CUDA/Metal/HIP/oneAPI), then in the EEVEE GPU shader code (GLSL), then in regular Blender C++ code (which is used in geometry nodes and the CPU compositor).

And you think you are done until you realize…

Cycles with Open Shading Language (OSL)

The test suite reminds you that Blender Cycles can use OSL as the shading backend. Open Shading Language, similar to GLSL, HLSL or RSL, is a C-like language for writing shaders. Unlike some other languages, a “shader” does not output a color; instead it outputs a “radiance closure”, so that the result can be importance-sampled by the renderer, etc.

So I thought, okay, instead of updating the Voronoi code in three places (Cycles CPU, EEVEE GPU, Blender CPU), it will have to be four places. Let’s find out where and how Cycles implements the shader nodes for OSL, update that place, and we’re good.

Except… turns out, OSL does not have unsigned integers (see data types). Also, it does not have bitcast from float to int.

I certainly did not expect an “Advanced shading language for production GI renderers” to not have a concept of unsigned integers in the year 2025 LOL :) I knew nothing about OSL just a day before, and there I was, wondering about the language’s data type system.

Luckily enough, specifically for Voronoi case, all of that can be worked around by:

  • Noticing that everywhere within Voronoi code, we need to calculate a pseudorandom “cell offset” out of integer cell coordinates only. That is, we do not need hash_float3_to_float3, we need hash_int3_to_float3. This works around the lack of bit casts in OSL.
  • We can work around lack of unsigned integers with a slight modification to PCG hash, that just operates on signed integers instead. OSL can do multiplications, XORs and bit shifts, just only on signed integers. Fine with us!
int3 hash_pcg3d_i(int3 v)
{
  v = v * 1664525 + 1013904223;
  v.x += v.y * v.z;
  v.y += v.z * v.x;
  v.z += v.x * v.y;
  v = v ^ (v >> 16);
  v.x += v.y * v.z;
  v.y += v.z * v.x;
  v.z += v.x * v.y;
  return v & 0x7FFFFFFF;
}
float3 hash_int3_to_float3(int3 k)
{
  int3 h = hash_pcg3d_i(k);
  float3 f = float3((float)h.x, (float)h.y, (float)h.z);
  return f * (1.0f / (float)0x7FFFFFFFu);
}

So that works; but instead of only having to change hash_float3_to_float3 and friends, this now required updating all the Voronoi code itself as well, to make it hash integer cell coordinates as inputs.

“Wait, but how did Voronoi OSL code work in Blender previously?!”

Good question! It was using the OSL built-in hashnoise() functions that take float as input, and produce a float output. And… yup, they just happened to use exactly the same Jenkins Lookup3 hash function underneath. Happy coincidence? One implementation copying what the other was doing? I don’t know.

It would be nice if OSL got unsigned integers and bitcasts though. As of today, if you need to hash float->float, you can only use the built-in OSL hash function, which is not particularly fast. For the Voronoi case that can be worked around, but I bet there are other cases where working around it is much harder.

So that’s it!

The pull request that makes Blender Voronoi node 2x-3x faster has been merged for Blender 5.0. It does change the actual resulting Voronoi pattern, e.g. before and after:

So while it “behaves” the same, the literal pattern has changed. And that is why a 5.0 release sounds like good timing to do it.

What did I learn?

  • Actually learned about how Voronoi/Worley noise code works, instead of only casually hearing about it.
  • Learned that various nodes within Blender have four separate implementations, that all have to match in behavior.
  • Learned that there is a shading language, in 2025, that does not have unsigned integers :)
  • There can be (and is) code out there that is using hash functions from the previous millennium, which might not be optimal today.
  • I should still look at the SIMD aspect of this whole thing.

Blender FBX importer via ufbx

Three years ago I found myself speeding up the Blender OBJ importer, and this time I am rewriting the Blender FBX importer. Or rather, letting someone else take care of the actually complex parts of it.

TL;DR: Blender 4.5 will have a new FBX importer.

FBX in Blender

FBX, a 3D and animation interchange format owned over the years by Kaydara, then Alias, then Autodesk, is a proprietary format that is still quite popular in some spaces. The file format itself is actually quite good; its largest downsides are: 1) it is closed, with no public spec, and 2) because it is very flexible, various software packages represent their data in funny and sometimes incompatible ways. The format seems to be effectively dead at this point; after a decade of continued improvements to the format and the SDK, Autodesk seems to have stopped around the year 2020. However, the two big game engines (Unity and Unreal) still both treat FBX as the primary format for getting 3D data into the engine. Going forward, though, perhaps one should use USD, glTF or Alembic.

Blender, by design / on principle, only uses open source libraries for everything that ships inside of it. Which means it cannot use the official (closed source, binary only) Autodesk FBX SDK, and instead had to reverse engineer the format and write its own code to do import/export. And so they did! They added FBX export in 2.44 (year 2007) and import in 2.69 (year 2013), and have a short reverse engineered FBX format description on the developer blog.

The FBX import/export functionality, as was common within Blender at the time, was written in pure Python. Which is great for expressiveness and makes for very compact code, but is not that great for many other reasons. It has, however, been expanded, fixed and optimized over the years – recent versions use NumPy for many of the heavy number crunching parts, the exporter does multi-threaded Deflate compression, and so on. The whole implementation is about 12 thousand lines of Python code, which is very compact, given that it does import, export and all the Blender integration too (that comes at a cost though… in some places it feels too compact, when someone else wants to understand what the code is doing :)).

So far so good! However, the ufbx open source library was born sometime in 2019.

ufbx, and other FBX parsers

ufbx (github, website) by Samuli Raivio is a single source file C library for loading FBX files.

And holy potatoes, it is an excellent library. Seriously: ufbx is one of the best written libraries I’ve ever seen.

  • Compiles out of the box with no configuration needed. You can configure it very extensively, if you want to:
    • Disable parts of library you don’t need (e.g. subdivision surface evaluation, animation layer blending evaluation, etc.).
    • Pass your own memory allocation functions, your own job scheduler, even your own C runtime functions.
    • Disable data validation, disable loading of certain kinds of data, and so on.
  • The API and the data structures exposed by it just make sense. This is highly subjective, but I found it very easy to use and find my way around.
  • It is very extensively tested at multiple levels.

And also, it is very fast. I actually wanted to compare ufbx with the official FBX SDK and several other open source FBX parsing libraries I managed to find (AssImp, OpenFBX).

Here’s the time it takes (in seconds, on a Ryzen 5950X / Windows / VS2022) to read 9 FBX files (total size 2GB) and extract very basic information about the scene (how many vertices in total, etc.). There is sequential time (read files one by one) and parallel time (read files in parallel, independently), as well as the size of the executable that does all that.

Parser                     Time sequential, s   Time parallel, s   Executable size, KB
ufbx                                      9.8                2.7                   457
ufbx w/ internal threads                  4.4                2.6                   462
FBX SDK                                 869.9             crash!                  4508
AssImp                                   33.9               26.7                  1060
OpenFBX                                  26.7               15.8                   312

Or, in more visual form:

Does performance of the official FBX SDK look very bad here? Yes indeed it does. This seems to be due to two reasons:

  • It can not parse several FBX files in parallel. It just can’t due to shared global data of some sorts.
  • On some files (mostly the ones that have lots of animation curves, or lots of instancing), it is very slow. Not to parse them! But to clean up after you are done with parsing. Looks like even if you want to tell it “yeet everything”, it proceeds to do that one entity at a time, doing a lot of work making sure that after removing each individual little shit, it is properly de-registered from any other things that were referencing it. Probably effectively a quadratic complexity amount of work.

I have put the source code and more details of these tests over at github.com/aras-p/test-fbx-parsers.

Anyway, if you need to load/parse FBX files, just use ufbx!
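To give a feel for how little ceremony that takes, here is a minimal sketch of loading a file and walking its nodes with ufbx (the file name is a placeholder, and error handling is trimmed to the basics; this is not Blender's importer code):

```c
#include <stdio.h>
#include "ufbx.h"  /* single header + single source file, no build config needed */

int main(void)
{
    ufbx_load_opts opts = { 0 };  /* zero-initialized opts = sensible defaults */
    ufbx_error error;
    ufbx_scene *scene = ufbx_load_file("model.fbx", &opts, &error);
    if (!scene) {
        fprintf(stderr, "FBX load failed: %s\n", error.description.data);
        return 1;
    }
    /* walk the scene graph and report mesh sizes */
    for (size_t i = 0; i < scene->nodes.count; i++) {
        ufbx_node *node = scene->nodes.data[i];
        if (node->mesh)
            printf("%s: %zu faces\n", node->name.data, node->mesh->num_faces);
    }
    ufbx_free_scene(scene);
    return 0;
}
```

The same opts struct is where all the extensive configuration mentioned above lives (custom allocators, job scheduler, disabling parts of the data), but none of it is required to get going.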

Blender 4.5: new FBX importer

So, I made a new FBX importer for Blender 4.5 (pull request). So far it is marked “experimental”, and for a while both the new one and the Python-based one will co-exist. Until the new one is sufficiently tested and the old one can be removed.

The new importer comes with several advantages, like:

  • It supports ASCII FBX files, as well as binary FBX files that are older than FBX 7.1.
  • Better handling of “geometric transform” (common in 3dsmax), support for more types of Material shading models, imported animations are better organized into Actions/Slots.
  • So far I have found about 20 existing open bug reports where the current FBX importer was importing something wrong, and the new one does the correct thing.
    • Right now there is one bug report about the new importer doing the incorrect thing… looking into it! :)

Oh, and it is quite a bit faster too. While the Python based importer was quite fast (for Python, that is), the new one is often 5x-20x faster, while using less memory too. Here are some tests, import times in seconds (Ryzen 5950X):

Test case           Notes                       Time Python, s   Time new, s
Blender 3.0 Splash  30k instances                         83.4           4.7
Rain Restaurant     300k animation curves                 86.8           4.4
Zero Day            6M triangles, 8K objects              22.1           1.7
Caldera             21M triangles, 6K objects             44.2           4.4

Even if it is ufbx that takes care of all the actually complex parts of the work, I still managed to spend (waste?) quite a lot of time on this.

However, most of that time I spent procrastinating, in the style of: “oh no, now I will have to do materials, this is going to be complex” – proceed to find excuses to not do it just yet – eventually do it, and find out it was not as scary after all. Besides stalling this way, the other major time sink has been learning the innards of Blender (the whole area of Armatures, Bones, EditBones, bPoseChannels is quite something).

The amount of code for just the Blender importer side (i.e. not counting ufbx itself) still ended up at 3000 lines, which is not compact at all.

Anyway, now what remains is to fix the open issues (so far… just one!), do a bunch of improvements I have in mind, and ship this to the world in Blender 4.5. You can try out daily builds yourself!

And then… maybe a new FBX exporter? We will see.


US New Orleans Trip 2025

We just spent a week-and-a-bit in the southern part of the United States, so here’s a bunch of photos and some random thoughts!

Aistė went to a work-related conference in New Orleans, and we (myself and our two kids) joined her after that. A bit of New Orleans, then driving along the Gulf of Mexico coast towards Orlando.

I wanted to see New Orleans mostly due to its historical importance for music, and we wanted to end the trip in Kennedy Space Center due to, well, “space! rockets!”

New Orleans

The conference Aistė attended was great, but she was struck by the anxiety her American colleagues are experiencing over the current shitshow in the country. Decades of achievements in diversity, inclusion, healthcare access and scientific advancement are being wiped away by some idiot manbabies. We still remember the not-too-distant times when the “head of state” was above all reason, rules and logic, and that is not a great setup :(

Anyway. New Orleans definitely looks different from most other US cities I’ve been to. Still, I somehow expected something more, though I’m not sure what exactly. Maybe more spontaneous / magic music moments? They are probably happening somewhere; we just did not stumble into them. Several times we saw homeless people playing music in their own bands out in the parks, and while that is cool, it is also sad.

New Orleans National WWII Museum

I did not have high expectations going into the WWII museum; I expected something bombastic, proudly proclaiming how the US victoriously planted their flag and saved the world (no real reason to think this, I just have stereotypes). The museum is nothing like that; I think it conveys very well how a war is really a shit situation, and everything about it is terrible. Both the Pacific Theater and the European Theater exhibits are full of stories of failed operations, strategic miscalculations, and so on.

That last photo above however… how quaint! <gestures at, well, everything around>

Whitney Plantation

I had not realized that many plantations continued to operate well into the 20th century, often with the same people working at them who used to be slaves. “You are free now! Except you have no property, money, or education. Good luck out there!”

The last picture is from a nearby Evergreen Plantation which is closed now. You might have seen the house in Django Unchained.

Barataria Preserve

Some nice bayou trails near New Orleans at Barataria Preserve.

Battleship Memorial Park

Driving towards Orlando, we stopped at the Battleship Memorial Park in Mobile, Alabama. USS Alabama is very impressive from an engineering standpoint. Heck, it is over 80 years old by now! Of course, it would be ideal if such engineering was not needed at all… but here we are.

Pensacola Beach

We only caught an evening glimpse of Pensacola Beach while driving onwards. The whole color scheme is impressively blue in the evening!

Maclay State Gardens

Alfred B. Maclay State Gardens park near Tallahassee was a somewhat random choice for a stop, and it turned out to be very impressive. The magnolias, azaleas and camellias were in beautiful bloom, and the tillandsias looked like magic.

Gator / Bird watching

An airboat tour at Tohopekaliga lake near Orlando, FL. It was a shame that the tour guide focused, more or less, only on the gators. There were so many other things around!

Kennedy Space Center

Visitor complex at the Kennedy Space Center is very impressive. The only problem is – way too many people! :) At least when we visited, it was packed with primary and middle school kids, who, it turns out, create an impressive amount of noise and chaos.

But seeing the actual Space Shuttle and Saturn V is a sight to behold. The photos do not convey the scale.

That’s it!

So that was our trip! Short and sweet, and we only hit a minor snag on the way back due to a flight delay, which made us miss the next flight, so the total journey back became some 30 hours. The luggage arrived as well, eventually :)

On the plane back I watched Soundtrack to a Coup d’Etat, a 2024 documentary about 1961: newly independent Congo, the UN, the large powers of the world (US and USSR) dividing their spheres of influence, music as soft power, and various plots for eventual control of natural resource deposits. Political backroom deals over natural resources? Proclaiming support and then backstabbing someone? Ugh, the ‘60s, how antiquated – surely something like that would not happen in the 21st century. Right?!

It is a good movie.

Maybe people should smell more blooms.