Surface-Stable Fractal Dither on Playdate

Rune Skovbo Johansen has a really sweet Surface-Stable Fractal Dithering technique, where the dither dots “stick” to 3D surfaces, yet the dot density adapts to the view distance and zoom level.

Some people have asked whether this would be a good technique for Playdate, given that the screen is one-bit color. And so I had to try it out! Here’s a video:

And here’s the code: github.com/aras-p/playdate-dither3d.

This is a long-arse post, so here are the sections:

Is it practical?

My impression: not really practical.

Playdate hardware is like a PC from 1995 - no GPU at all, one fairly simple CPU core. As such, it can do fairly simple 3D rendering (well, you need to write the whole rasterizer on the CPU), but can barely do more than a handful of math operations per rasterized pixel. Rasterizing with screen-space fixed Bayer or Blue Noise dither patterns is the way to go due to their simplicity.

You can barely do textured triangles, whereas the cost of Fractal Dithering is, with some simplifications, at least twice that (you still need to interpolate the texture coordinates, do some additional math on them, do a 3D texture lookup, and more math on the result).

But, while doing this, I’ve learned a thing or two about software rasterizers. Of course, everyone else has learned that in 1997, but I’ve never written a perspective-correct textured triangle rasterizer… As the old saying goes, “the best time to write a triangle rasterizer was thirty years ago. The second best time is today.” So what follows is various notes I have made during the process.

Surface-Stable Fractal Dithering

Rune has a really great video explaining the technique, and an extended feature demo video, plus all the source in the github repository, with most of the shader logic being in Dither3DInclude.cginc HLSL shader file.

Here’s an outline of the steps. We have some scene where the input for dithering is “brightness”:

And the Surface-Stable Fractal Dithering (with a 4x4 dither pattern) turns it into this:

Now, the dots above are still nicely anti-aliased, whereas the Playdate screen is strictly 1-bit. With only two colors, chosen to look similar to the device screen, the result would be like this (note that the resolution here is 2x larger than the Playdate, i.e. 800x480):

In addition to brightness, the dithering process needs the geometry texture coordinates (“UVs”) as well. The dither pattern “sticks” to the surfaces by placing the dots in UV space.

It also needs the derivatives of the UVs in screen space, to know “how fast” they change across the screen projection. That will be used to keep the dither pattern roughly constant size on screen. On the GPU, the derivatives fall naturally out of the 2x2 pixel execution pattern, and in HLSL are provided by the ddx and ddy built-in functions. Here they are, visualized as abs(ddx(uv))*100 and abs(ddy(uv))*100 respectively:

Now, given these four derivative values, the technique uses singular value decomposition to find the minimum and maximum rate of change (these might not be aligned to screen axes). The maximum and minimum frequencies (scaled up 30x) look like:

The minimum frequency, together with user/material-specified dot spacing factor, gets turned into base dither dot spacing value:

Then it is further adjusted by input brightness (if “size variability” material parameter is zero), so that the actual dots stay roughly the same size, but in darker areas their spacing spreads further out.

This spacing is then used to calculate two factors used to sample a 3D lookup texture: 1) by which power of two to adjust the mesh UVs so that the dither dot pattern is placed onto the surface properly, and 2) which dither pattern “slice” to use, so that the pattern more or less seamlessly blends between power-of-two levels.

The mesh UVs, adjusted for 3D texture sampling, look like this, along with an indication of which Z slice of the texture to use:

The result of sampling the 3D lookup texture (prepared ahead of time) is:

The 3D texture itself for the 4x4 dither pattern (64x64x16, with 3D texture slices side by side) looks like this:

We’re almost there! Next up, the algorithm calculates a contrast factor, based on material settings, dot spacing, and the ratio of minimum to maximum UV rate of change. From that, the base brightness value around which the contrast is scaled is calculated (normally it would be 0.5, but where the pattern would be very blurry that would look bad, so there it is scaled away). And finally, the threshold value to compare the radial gradient from the 3D texture against is calculated based on input brightness. The contrast, base value and threshold respectively look like:

And finally we get our result:

So all of that was… <takes a quick look> something like one 3D texture sample, 4 divisions, 2 raises to a power, 3 square roots, 3 exp2s, 1 log2, and several dozen regular multiplies or additions for every pixel. Provided you have UVs and their derivatives already, that is.

That should, like, totally fly on a Playdate, right? 🥹

Anyway, let’s do this! But first…

Perspective correct texturing

Triangles have texture coordinates defined at their vertices, and while rasterizing the triangle, you interpolate the texture coordinates, and at each pixel, read the texture value corresponding to the interpolated coordinate. Here’s a simple checkerboard texture using interpolated UVs (ignore the noisy dithering; it is unrelated):

However, if we look at the same mesh at an angle, it looks really weird:

That is because under perspective projection, you need to use perspective-correct texture mapping, and not just simply interpolate UVs in screen space. With perspective correction things look good; however, that means we now have to do a division per pixel. And divisions are slow. Anyway, this is the least of our problems (for now…).

Displaying brightness on a Playdate

Playdate hardware has 1-bit “memory LCD” display: each pixel can only be “on” or “off”. So typically to display “brightness”, some sort of dithering is used. The example simple 3D rasterizer included in the Playdate SDK (“mini3d”) contains code that rasterizes triangles using different patterns based on brightness:

Another common approach is to use a blue noise texture for thresholding the brightness. I’ve used that approach in “Everybody Wants to Crank the World” Playdate demo as well.

So the question now is, could Surface-Stable Fractal Dithering be another approach to display “brightness” on a 1-bit Playdate screen?

“Porting” Surface-Stable Fractal Dithering to Playdate

I had a triangle rasterizer based on Fabian Giesen’s “Simple watertight triangle rasterizer” working on a Playdate. This is a half-space / barycentric rasterizer, which is a good fit for hardware or modern CPUs with SIMD instruction sets. It might be a bad fit for Playdate CPU, but for now we’re not very concerned with that. The code is nice and small; further recommended reading from Fabian are “The barycentric conspiracy”, “Triangle rasterization in practice”, “Optimizing the basic rasterizer” blog posts, as well as “Fast software rasteriser in JavaScript?” discussion on pouët.

Initial port of Rune’s dithering shader code on a Playdate looked like this:

Welp, this does not look correct at all. Time to debug where exactly it goes wrong!

For development convenience, I have the whole “playdate application” set up as both a Playdate build target, and an application that can build for PC. There’s a super tiny “platform abstraction” that provides a pointer to the “screen”, as well as input controls handling; on a Playdate that goes directly into the SDK, whereas on a PC it is all handled through Sokol. Is nice!

For the “PC” build target, in addition to the regular 1-bit “screen” buffer, I also have a full-color “RGBA per pixel” debug overlay buffer. That way I can have the correct shader with some debug visualizations running in Unity, and my own “shader port” running in a software rasterizer, side by side, with a full color debug overlay. Check it out – left side my code (displaying obviously incorrect result), right side Unity:

The mesh UVs are correct and interpolated correctly (multiplied by 5 to see their interpolation better):

The derivatives in my code, it turns out, were not entirely wrong, but not correct either:

At that point my rasterizer was doing 1 pixel at a time, so in order to get the derivatives I tried to calculate them with some math, and got the math wrong, obviously. With the full proper calculation, they were correct:

Turns out I also had the 3D texture Z layers order wrong, and with that fixed, everything else was correct too. Dither UVs, 3D texture radial pattern, render result, render result with 2 colors only, and finally render result with non-constant input lighting:

So, yay! It works!

It runs at… 830 milliseconds per frame though (1.2FPS). 🐌

Optimizing Fractal Dithering

Trivially move some math from per-pixel to be done once per triangle: 604ms.

Replace a division and an exp2f call by working directly on the floating-point bits. If we are in “regular floats” range (no NaNs/infinities/denormals), x / exp2f((float)i) can be replaced by something like:

// equivalent to `x / exp2f((float)i)`, provided we are not in
// infinities / subnormals territory.
static inline float adjust_float_exp(float x, int i)
{
    union {
        float f;
        uint32_t u;
    } fu;
    fu.f = x;
    fu.u -= (uint32_t)i << 23;
    return fu.f;
}

In the dither shader, this was used to transform mesh UVs to the fractal pattern UVs. That gets us down to 316ms, yay! (by the way, such an optimization for today’s GPUs is barely – if at all – worth doing)

Likewise, the 3D texture fractal level and fraction calculation, which uses a log2f and a floorf, can also be replaced with direct float bit manipulation:

//float spacingLog = log2f(spacing);
//const float patternScaleLevel = floorf(spacingLog); // Fractal level.
//const int patternScaleLevel_i = (int)patternScaleLevel;
//float f = spacingLog - patternScaleLevel; // Fractional part.
//
// instead of above, work on float bits directly:
union {
    float f;
    uint32_t u;
} fu;
fu.f = spacing;
// patternScaleLevel is just float exponent:
const int patternScaleLevel_i = (int)((fu.u >> 23) & 0xFF) - 127;
// fractional part is:
// - take the mantissa bits of spacing,
// - set exponent to 127, i.e. range [0,1)
// - use that as a float and subtract 1.0
fu.u = (fu.u & 0x7FFFFF) | 0x3F800000;
float f = fu.f - 1.0f;

And now we are at 245ms.

And now, switch the rasterizer to operate in 2x2 pixel blocks (hey, just like a GPU does!). This does make the code much longer (commit), but things like derivatives come “for free”, plus it allows doing a bunch of calculations (all the dither dot spacing, 3D texture level etc.) once per 2x2 pixel block. 149ms.

Some more simple math operation moves and we’re at 123ms.

At this point I was out of easy ideas, so I decided that running the “full” effect on a Playdate is not going to work, and it is time to simplify / approximate it.

The effect spends quite some effort in determining a nice “contrast” value, but it comes with a cost: doing singular value decomposition on the four derivatives, a division, and a bunch of other math. Let’s remove all that, and instead determine dither pattern spacing by a simple average of dU/dX, dV/dX, dU/dY, dV/dY. Then there’s no longer an additional contrast tweak based on “blurriness” (ratio of min/max UV change). It runs at 107ms now, but looks different:

The 3D lookup texture for dithering, at 64x64x16 resolution, is 64 kilobytes in size. The CPU cache is only 8KB, and the memory bandwidth is not great. Maybe we could reduce the texture horizontal resolution (to 32x32x16), for a 16KB texture, and it would not reduce quality all that much? Looks a bit different, but hey, 83ms now:

Instead of doing perspective-correct UV interpolation for every pixel, do it for every 2nd pixel only, i.e. for the first column of each 2x2 pixel block. For the other column, do regular interpolation between this and the next block’s UV values (commit). 75ms:

Simplify the final math that does sampled 3D texture result comparison, now it is just a simple “compare value with threshold”. 65ms:

At this point I was out of easy ideas for how to speed it up further (harder ideas: do perspective-correct interpolation at an even lower frequency). However, the whole current approach of using a halfspace/barycentric rasterizer is probably not a good fit for the Playdate CPU (it does not have SIMD instructions that would be useful for this task, after all). So maybe I should try out the classic “scanline” rasterizer approach?

Scanline Rasterizers

The Playdate SDK sample code (“mini3d”) contains a simple scanline triangle rasterizer; however, it can only cover the whole triangle with a screen-space aligned pattern, and has no support for texture coordinates or any other sort of interpolation. But people have taken that and expanded on it, e.g. there’s Mini3D+ that adds near plane clipping, texturing, alpha-testing, fog, etc. Nice!

So let’s try it out. Here’s the same scene, with just a pure black/white checkerboard pattern based on mesh UVs. My existing halfspace/barycentric rasterizer, and the one from Mini3D+ respectively:

Immediate notes:

  • Yes, for the UV-based checkerboard at least, the scanline approach is faster (54ms halfspace, 33ms scanline),
  • However, the scanline one has more “artifacts”: stray black pixels near edges of the plane/cube, and in general things are shifted by a pixel here and there for some reason. At this point I did not know which one is “more correct”, but the difference was bothering me :)
  • The checkerboard lines are “more wiggly” in the scanline one, most visible on the “floor” object.

I tried to “port” the dithering effect to this rasterizer, but got lost in trying to calculate the correct UV derivatives (horizontal ones are easy, vertical ones are harder). And the subtle rendering differences were bothering me, so I decided to actually read up about scanline rasterizers. The seminal series on them are from 1995/1996, by Chris Hecker for Game Developer Magazine. Hecker has the archive and the code drop at his website: Perspective Texture Mapping.

So! Taking the initial (fully floating point) rasterizer from Hecker’s code, the UV based checkerboard renders like this:

This one runs slower than the Mini3D+ one (42ms), but does not have the stray “black pixels” artifacts around some mesh edges, and the lines on the floor are no longer wiggly. However, it is slightly different compared to the halfspace one! Why? This has nothing to do with the task at hand, but the fact was bothering me, so…

Comparing Scanline Rasterizers to actual GPU

Again using my “colored debug overlay on PC build” feature, I made a synthetic “test scene” with various cases of UV mapped geometry, with cases like:

  • Checkerboard should map exactly 1:1 to pixels, at regular orientation, and geometry being rotated by exactly 90 degrees,
  • The same, but geometry coordinates being shifted by less than half a pixel; the result should look the same.
  • Some geometry that should be exactly one pixel away from screen edge,
  • Some geometry where each checkerboard square should map to 1.5 (will have aliasing patterns) or 2 (should be exact) screen pixels.
  • Several cases of perspective projection,
  • Several cases of geometry being clipped by screen edges.

Here’s how it is rendered by the halfspace/barycentric rasterizer:

And then I made a simple Unity shader & C# script that renders exactly the same setup, using actual GPU. Here it is (pasted into the same window frame as the test app):

Not exactly the same, but really close, I’ll claim this is acceptable (FWIW, GPUs use 8 bits subtexel precision, whereas my code uses 4 bits).

The rasterizer from Mini3D+ however looks much more different: 1) some cases do not map the checkerboard to pixels 1:1, 2) the artifacts between some faces are where the rasterizer is not “watertight” and neighboring faces both write to the same pixels, 3) some cases where geometry should be exactly one pixel away from the screen edge are actually not.

Hecker’s “fully floating point” rasterizer looks better, but is still quite a bit different from what the GPU does.

The fixed-point, sub-dividing affine span rasterizer from Hecker’s code (i.e. the last iteration before the assembly-optimized one) looks like this, however: it fixes some artifacts from the previous one, but still covers slightly different pixels compared to the GPU, and introduces UV wrapping artifacts at the right sides of some planes.

My understanding of the difference is that maybe Hecker’s rasterizer follows pre-Direct3D 10 coordinate conventions, i.e. where pixel integer coordinates are placed directly on pixel centers. From part 3 of the article series, there’s this bit:

I chose the corresponding screen coordinates for the destination. I wanted the destination pixel centers to map exactly to the source pixel centers.

And when talking about how one would map a texture directly to screen at 1:1 ratio, he talks about adding -0.5 offset to the coordinates. This sounds very much like what people back in Direct3D 8/9 times had to always keep in mind, or try to solve that automatically in all their shaders.

While this coordinate system intuitively makes sense (pixel centers are at integer coordinates, yay!), eventually everyone realized it causes more problems down the line. The official DirectX Specs website minces no words:

D3D9 and prior had a terrible Pixel Coordinate System where the origin was the center of the top left pixel on the RenderTarget

Armed with that guess, I changed Hecker’s rasterizer code to shift positions by half a pixel, and removed the complexicated dUdXModifier dance it was doing. And it became way closer to what the GPU is doing:

The fixed-point, subdividing affine Hecker’s rasterizer with the above fix was more correct than the one from Mini3D+, and by now was running a tiny bit faster. So I kept only that code, and proceeded with it.

Back to scanline rasterizer

The initial “port” of the Fractal Dithering to the scanline rasterizer was at 102ms, i.e. slower than the halfspace one (63ms). But, I was calculating the UV derivatives for every pixel. Derivatives along the X axis are cheap (just the difference to the next pixel, which the inner scanline loop already computes), but the vertical ones I was doing in a “slow but correct” way.

The derivatives change fairly slowly across the triangle surface however, so what if I calculate dU/dY and dV/dY only at the scanline endpoints, and just interpolate it across? This gets us down to 71ms.

But hey! Maybe I do not need the per-pixel UV derivatives at all? The whole reason for derivatives is to calculate the dither pattern spacing. But, at least in my scenes, the spacing varies very slowly (if at all) across the triangle surface. Recall the previous visualization:

I can just calculate the derivatives at triangle vertices, do all the dither spacing calculations there, and interpolate spacing value across the triangle. 56ms!

Then, do the 3D lookup math directly from the fixed-point UVs that are interpolated by the rasterizer. The previous “replace division by exp2” trick of working on floating-point bits is even simpler in fixed point: just shift by the provided integer amount, and take the fractional bits as needed. 50ms.

And the final optimization step I did so far has nothing to do with dithering step itself: the higher level code was transforming all mesh triangles, calculating their normals for lighting, then sorting them by distance, and finally rasterizing them back-to-front. Here the triangles that are back-facing, outside of the screen, or zero-area are culled. I moved the triangle culling to happen before sorting (there’s no point in sorting invisible triangles anyway), and now the scanline dither effect runs at 45ms (halfspace one at 60ms).

That’s it for now!

So, this scene runs at 45ms (20-22FPS) on the Playdate device right now, which is much better than the initial 830ms (1.2FPS). Can it be made yet faster? Most likely.

Does it make it a practical effect on the Playdate? Dunno, it is quite heavy and at this low resolution, does not look very good (it does not help that some approximations/simplifications I did actually increase dither pattern aliasing).

But hey, this was fun! I learned a thing or three. And if you want either a scanline or a halfspace rasterizer for Playdate that very closely matches what actual GPU would do (i.e. it is a more correct rasterizer than mini3d from Playdate SDK or Mini3D+), you can find them at github.com/aras-p/playdate-dither3d


Doom in Blender VSE

You know how in Blender Video Sequence Editor (VSE) you can create Color strips, and then their color is displayed in the timeline?

You can create many of them, and when sufficiently zoomed out, the strip headings disappear since there’s not enough space for the label:

So if you created say 80 columns and 60 rows of color strips…

…and kept on changing their colors constantly… you could run Doom inside the Blender VSE timeline.

And so that’s what I did. Idea sparked after seeing someone make Doom run in Houdini COPs.

Result

Here’s the result:

And the file/code on github: github.com/aras-p/blender-vse-doom

It is a modal Blender operator that loads the Doom file, creates a VSE timeline full of color strips (80 columns, 60 rows), listens to keyboard input for player control, renders the Doom frame and updates the VSE color strip colors to match the rendered result. The Escape key finishes the operator.

All the Doom-specific heavy lifting is in render.py, written by Mark Dufour and is completely unrelated to Blender. It is just a tiny pure Python Doom loader/renderer. I took it from “Minimal DOOM WAD renderer” and made two small edits to avoid division by zero exceptions that I was getting.

Performance

This runs pretty slow (~3fps) in current Blender (4.1 .. 4.4) 😢

I noticed that it was slow when “running it”, but when stopped, navigating the VSE timeline with all the strips still there was buttery smooth. And so, being the idiot that I am, I went “rah rah, Doom rendering is done in pure Python, of course it is slow!”

Yes, Python is slow, and yes, the minimal Doom renderer (in exactly 666 lines of code – nice!) is not written in “performant Python”. But it turns out… the performance problem is not there. Another case of “never guess, always look at what is going on”.

The pure-Python Doom renderer part takes 7 milliseconds to render an 80x60 “frame”. Could it be faster? Probably. But… it takes 300 milliseconds to update the colors of all the VSE strips.

Note that in Blender 4.0 or earlier it runs even slower, because redrawing the VSE timeline with 4800 strips takes about 100 milliseconds; that is no longer slow (1-2ms) in later versions due to what I did a year ago.

Why does it take 300 milliseconds to update the strip colors? For that of course I brought up Superluminal and it tells me the problem is cache invalidation:

Luckily, cache invalidation is one of the easiest things in computer science, right? 🧌

Anyway, this looks like another case of accidental quadratic complexity: for each strip that gets a new color set on it, there’s code that 1) invalidates any cached results for that strip (ok), and 2) tries to find whether this strip belongs to any meta-strips to invalidate those (which scans all the strips), and 3) tries to find which strips intersect the strip horizontal range (i.e. are “composited above it”), and invalidate partial results of those – this again scans all the strips.

Step 2 above can be easily addressed, I think, as the codebase already maintains data structures for finding which strips are part of which meta-strips, without resorting to “look at everything”.

Step 3 is slightly harder in the current code. However, half a year ago during VSE workshop we talked about how the whole caching system within VSE is maybe too complexicated for no good reason.

Now that I think about it, most or all of that extra cost could be removed, if Someone™️ would rewrite the VSE cache along the lines of how we discussed at the workshop.

Hmm. Maybe I have some work to do. And then the VSE timeline could be properly doomed.


Verbosity of coding styles

Everyone knows that different code styles have different verbosity. You can have very dense code that implements a path tracer in 99 lines of C, or on the back of a business card (one, two). On the other side of the spectrum, you can have very elaborate code where it can take you weeks to figure out where the actual work happens, digging through all the abstraction layers and indirections.

Of course, to be usable in a real-world project, code style would preferably not sit at either extreme. How compact versus how verbose should it be? That, as always, depends on a lot of factors. How many people, and of what skill level, will work on the code? How much churn will the code have? Will it need to keep on adapting to wildly changing requirements very fast? Should 3rd parties be able to extend the code? Does it have a public API that can never change? And a million other things that all influence how to structure it all, how much abstraction (and of what kind) there should be, etc.

A concrete example: Compositor in Blender

The other day I was happily deleting 40 thousand lines of code (just another regular Thursday, eh), and I thought I’d check how much code is in the “new” Compositor in Blender, vs in the old one that I was removing.

What is the “old” and “new” compositor? Well, there have been more than just these two. You see, some months ago I removed the “old-old” (“tiled”) compositor already. There’s a good talk by Habib Gahbiche “Redesigning the compositor” from BCON'24 with all the history of the compositor backends over the years.

So, how large is the compositor backend code in Blender?

I am using scc to count the number of lines. It is pretty good! And counts the 4.3 million lines inside Blender codebase in about one second, which is way faster than some other line counting tools (tokei is reportedly also fast and good). I am using scc --count-as glsl:GLSL since right now scc does not recognize .glsl files as being GLSL, d’oh.

The “Tiled” compositor I removed a while ago (PR) was 20 thousand lines of code. Note however that this was just one “execution mode” of the compositor, and not the full backend.

The “Full-frame” compositor I deleted just now (PR) is 40 thousand lines of C++ code.

What remains is the “new” (used to be called “realtime”) compositor. How large is it? Turns out it is… 27 thousand lines of code. So it is way smaller!

And here’s the kicker: while the previous backends were CPU only, this one works on both CPU and GPU. With no magic, just literally “write the processing code twice: in C++ and GLSL”. “Oh no, code duplication!”… and yet… it is way more compact. Nice!

I know nothing about compositing, or about the relative merits of the “old” vs “new” compositor code. It is entirely possible that the verbosity of the old compositor backend was due to a design that, in retrospect, did not stand the test of time or production usage – after all, the compositor within Blender is an 18-year-old feature by now. Also, while I deleted the old code because I like deleting code, the actual hard work of writing the new code was done mostly by Omar Emara, Habib Gahbiche and others.

I found it interesting that the new code that does more things is much smaller than the old code, and that’s all!


A year in Blender VSE land

Turns out, now is exactly one year of me working on the video sequence editor (VSE).

Going pretty well so far! What I managed to put into Blender 4.1 and 4.2 is in the previous blog posts. Blender 4.3 has just shipped, and everything related to Video Sequence Editor is listed on this page. Items related to performance or thumbnails are my doing.

Some of the work I happened to do for VSE over this past year ended up improving other areas of Blender. E.g. video rendering improvements are useful for anyone who renders videos; or image scaling/filtering improvements are beneficial in other places as well. So that’s pretty cool!

Google Summer of Code

The main user-visible workflow changes in 4.3 VSE (“connected” strips, and preview area snapping) were done by John Kiril Swenson as part of Google Summer of Code, see his report blog post. I was “mentoring” the project, but that was surprisingly easy and things went very smoothly. Not much more to say, except that the project was successful, and the result is actually shipping now as part of Blender. Nice!

Sequencer workshop at Blender HQ

In 2024 August some of us had a “VSE Workshop” at the Blender office in Amsterdam. Besides geeking out on some technical details, most of the discussion was about high-level workflows, which is not exactly my area (I can implement an existing design, or fix some issues, but I’m the least suitable person for doing actual UI or UX work).

But! It was very nice to hear all the discussions, and to see people face to face, at last. Almost five years of working from home is mostly nice, but once in a while getting out of the house is also nice.

There’s a short blog post and a more detailed report thread about the workshop on Blender website/forum.

Surprising no one, what became clear is that the amount of possible work on the video editing tools is way more than the amount of people and the amount of time they can spend implementing them. Like, right now there’s maybe… 1.5 people actually working on it? (my math: three people, part-time). So while Blender 4.1, 4.2 and 4.3 all have VSE improvements, no “hey magically it is now better than Resolve / Premiere / Final Cut Pro” moments anytime soon :)

A side effect of the workshop: I got to cuddle Ton’s dog Bowie, and saw Sergey’s frog collection, including this most excellent güiro:

Blender Conference 2024

I gave a short talk at BCON'24, “How to accidentally start working on VSE”. It was not so much about VSE per se, but more about “how to start working in a new area”. Vibing off the whole conference theme which was “building Blender”.

Here’s slides for it (pdf) and the recording:

The whole conference was lovely. All the talks are in this playlist, and overall feeling is well captured in the BCON'24 recap video.

What’s Next

Blender 4.4 development is happening as we speak, and VSE already got some stuff done for it. For this release, so far:

  • Video improvements: H.265/HEVC support, 10- and 12-bit videos. Some colorspace and general color precision shenanigans.
  • Proxy improvements: proxies for EXR images work properly now, and are faster to build. There’s a ton of possible improvements for video proxies, but not sure how much of that I’ll manage to squeeze into 4.4 release.

Generally, just like this whole past year, I’m doing things without much planning. Stochastic development! Yay!


Vector math library codegen in Debug

This will be about how, when your C++ code has a “vector math library”, the choices of code style in there affect non-optimized build performance.

Backstory

A month ago I got into the rabbit hole of trying to “sanitize” the various ways that images can be resized within the Blender codebase. There were at least 4 different functions to do that, with different filtering modes (expected), but also different corner-case behaviors and other funkiness that was not well documented and not well understood.

I combed through all of that, fixed some arguably wrong behaviors of some of the functions, unified their behavior, etc. etc. Things got faster and better documented. Yay! (PR)

However. While doing that, I also made the code smaller, primarily following the guideline of “code should use our C++ math library, not the legacy C one”. That is, use Blender codebase classes like float4 with related functions and operators (e.g. float4 c = a + b), instead of float c[4]; add_v4_v4v4(c, a, b); and so on. Sounds good? Yes!

But. There’s always a “but”.

Other developers later on noticed that some parts of Blender got slower, in non-optimized (“Debug”) build. Normally people say “oh it’s a Debug build, no one should care about performance of it”, and while in some cases it might be true, when anything becomes slower it is annoying.

In this particular case, it was “saving a file within Blender”. You see, as part of the saving process, it takes a screenshot of your application, resizes it to be smaller and embeds that as a “thumbnail” inside the file itself. And yes, this “resize it” part is exactly what my change affected. Many developers run their build in Debug mode for easier debugging and/or faster builds; some run it in Debug mode with Address Sanitizer on as well. If “save a file”, an operation that you normally do many times, became slower by say 2 seconds, that is annoying.

What can be done?

How Blender’s C++ math library is written today

It is pretty compact and neat! And perhaps too flexible :)

The base of the math vector types is this struct, which is just a fixed-size array of N entries. For the (most common) cases of 2D, 3D and 4D vectors, the struct instead has entries explicitly named x, y, z, w:

template<typename T, int Size>
struct vec_struct_base { std::array<T, Size> values; };
template<typename T> struct vec_struct_base<T, 2> { T x, y; };
template<typename T> struct vec_struct_base<T, 3> { T x, y, z; };
template<typename T> struct vec_struct_base<T, 4> { T x, y, z, w; };

And then it has functions and operators, where most of the implementations use an “unroll with a lambda” style. Here’s the operator that adds two vectors together:

friend VecBase operator+(const VecBase &a, const VecBase &b)
{
    VecBase result;
    unroll<Size>([&](auto i) { result[i] = a[i] + b[i]; });
    return result;
}

with unroll itself being this:

template<class Fn, size_t... I> void unroll_impl(Fn fn, std::index_sequence<I...>)
{
    (fn(I), ...);
}
template<int N, class Fn> void unroll(Fn fn)
{
    unroll_impl(fn, std::make_index_sequence<N>());
}

– it takes “how many times to call the lambda” (typically the vector dimension) and the lambda itself, and then calls the lambda N times, with indices 0 through N-1.

And then most of the functions use the indexing operator to access the elements of a vector:

T &operator[](int index)
{
    BLI_assert(index >= 0);
    BLI_assert(index < Size);
    return reinterpret_cast<T *>(this)[index];
}

Pretty compact and hassle free. And given that C++ famously has “zero-cost abstractions”, these are all zero-cost, right? Let’s find out!

Test case

Let’s do some “simple” image processing code that does not serve a practical purpose, but is simple enough to test things out. Given an input image (RGBA, byte per channel), blur it by averaging 11x11 square of pixels around each pixel, and overlay a slight gradient over the whole image. This input, for example, gets turned into that output:

The filter code itself is this:

inline float4 load_pixel(const uint8_t* src, int size, int x, int y)
{
    x &= size - 1;
    y &= size - 1;
    uchar4 bpix(src + (y * size + x) * 4);
    float4 pix = float4(bpix) * (1.0f / 255.0f);
    return pix;
}
inline void store_pixel(uint8_t* dst, int size, int x, int y, float4 pix)
{
    pix = math_max(pix, float4(0.0f));
    pix = math_min(pix, float4(1.0f));
    pix = math_round(pix * 255.0f);
    ((uchar4*)dst)[y * size + x] = uchar4(pix);
}
void filter_image(int size, const uint8_t* src, uint8_t* dst)
{
    const int kFilter = 5;
    int idx = 0;
    float blend = 0.2f;
    float inv_size = 1.0f / size;
    for (int y = 0; y < size; y++)
    {
        for (int x = 0; x < size; x++)
        {
            float4 pix(0.0f);
            float4 tint(x * inv_size, y * inv_size, 1.0f - x * inv_size, 1.0f);
            for (int by = y - kFilter; by <= y + kFilter; by++)
            {
                for (int bx = x - kFilter; bx <= x + kFilter; bx++)
                {
                    float4 sample = load_pixel(src, size, bx, by);
                    sample = sample * (1.0f - blend) + tint * blend;
                    pix += sample;
                }
            }
            pix *= 1.0f / ((kFilter * 2 + 1) * (kFilter * 2 + 1));
            store_pixel(dst, size, x, y, pix);
        }
    }
}

This code uses very few vector math library operations: create a float4, add them together, multiply them by a scalar, some functions to do min/max/round. There are no branches (besides the loops themselves), no cross-lane swizzles, fancy packing or anything like that.

Let’s run this code in the usual “Release” setting, i.e. optimized build (-O2 on gcc/clang, /O2 on MSVC). Processing a 512x512 input image with the above filter, on Ryzen 5950X, Windows 10, single threaded, times in milliseconds (lower is better):

             MSVC 2022  Clang 17  Clang 14  Gcc 12  Gcc 11
Release (O2)     67        41        45       70      70

Alright, Clang beats the others by a healthy margin here.

Enter Debug builds

At least within Blender (but also elsewhere), besides a build configuration that ships to the users, during development you often work with two or more other build configurations:

  • “Debug”, which often means “the compiler does no optimizations at all” (-O0 on gcc/clang, /Od on MSVC). This is the least confusing debugging experience, since nothing is “optimized out” or “folded together”.
    • On MSVC, people also sometimes put /JMC (“just my code” debugging), and that is default in recent MSVC project templates. Blender uses that too in the “Debug” cmake configuration.
  • “Developer”, which often is the same as “Release” but with some extra checks enabled. In Blender’s case, besides things like “enable unit tests” and “use a guarded memory allocator”, it also enables assertion checks, and in Linux/Mac also enables Address Sanitizer (-fsanitize=address).

While some people argue that “Debug” build configuration should pay no attention to performance at all, I’m not sold on that argument. I’ve seen projects where a non-optimized code build, while it works, produces such bad performance that using the resulting application is an exercise in frustration. Some places explicitly enable some compiler optimizations on an otherwise “Debug” build, since otherwise the result is just unusable (e.g. in C++-heavy codebase, you’d enable function inlining).

However, the “Developer” configuration is an interesting one. It is supposed to be “optimized”, just with “some” extra safety features. I would normally expect that to be “maybe 5% or 10% slower” than the final “Release” build, but not more than that.

Let’s find out!

                          MSVC 2022  Clang 17  Clang 14  Gcc 12  Gcc 11
Release (O2)                  67        41        45       70      70
Developer (O2 + asserts)     591        42        45       71      71
Debug (-O0 / /Od /JMC)     17560      4965      6476     5610    5942

Or, phrased in terms of “how many times a build configuration is slower compared to Release”:

                          MSVC 2022  Clang 17  Clang 14  Gcc 12  Gcc 11
Release (O2)                  1x        1x        1x       1x      1x
Developer (O2 + asserts)      9x        1x        1x       1x      1x
Debug (-O0 / /Od /JMC)      262x      121x      144x      80x     85x

On the Developer config, gcc and clang are good: enabling assertions does not cause a slowdown. On MSVC, however, it makes the code run 9 times slower. All of that only because the vector operator[](int index) has asserts in it, and it is only ever called with indices that are statically known to pass those asserts! So much for an “optimizing compiler”, eh.

The Debug build configuration is just bad everywhere. Yes, it is the worst on MSVC, but come on, anything that is more than 10 times slower than optimized code is going into “too slow to be practical” territory. And here we are talking about things being 80 to over 260 times slower!

What can be done without changing the code?

Perhaps realizing that “no optimizations at all produces an unusably slow result”, some compiler developers have added an “a bit optimized, yet still debuggable” optimization level: -Og. GCC added that in 2013 (gcc 4.8):

-Og should be the optimization level of choice for the standard edit-compile-debug cycle, offering a reasonable level of optimization while maintaining fast compilation and a good debugging experience.

Clang followed suit in 2017 (clang 4.0), however their -Og does exactly the same thing as -O1.

MSVC has no setting like that, but we can at least try to turn off “just my code debugging” (/JMC) flag and see what happens. The slowdown table:

                          MSVC 2022  Clang 17  Clang 14  Gcc 12  Gcc 11
Release (O2)                  1x        1x        1x       1x      1x
Faster Debug (-Og / /Od)    114x        2x        2x      50x     14x
Debug (-O0 / /Od /JMC)      262x      121x      144x      80x     85x

Alright, so:

  • On clang -Og makes the performance good. Expected since this is the same as -O1.
  • On gcc -Og is better than -O0. Curiously gcc 12 is slower than gcc 11 here for some reason.
  • MSVC without /JMC is better, but still very very slow.

Can we change the code to be more Debug friendly?

The current way the math library is written is short and concise. If you are used to C++ lambda syntax, things are very clear:

friend VecBase operator+(const VecBase &a, const VecBase &b)
{
    VecBase result;
    unroll<Size>([&](auto i) { result[i] = a[i] + b[i]; });
    return result;
}

however, without compiler optimizations, for a float4 that produces (on clang):

  • 18 function calls,
  • 8 branches,
  • assembly listing of 150 instructions.

What it actually does is four float additions.

Loop instead of unroll lambda

What if, instead of this unroll+lambda machinery, we used just a simple loop?

friend VecBase operator+(const VecBase &a, const VecBase &b)
{
    VecBase result;
    for (int i = 0; i < Size; i++) result[i] = a[i] + b[i];
    return result;
}

                      MSVC 2022  Clang 17  Clang 14  Gcc 12  Gcc 11
Release unroll             1x        1x        1x      1x      1x
Release loop               1x        1x        1x      3x     10x
Developer unroll           9x
Developer loop             1x
Faster Debug unroll      114x        2x        2x     50x     14x
Faster Debug loop         65x        2x        2x     31x     14x
Debug unroll             262x      121x      144x     80x     85x
Debug loop               126x      102x      108x     58x     58x

This does help Debug configurations somewhat (12 function calls, 9 branches, 80 assembly instructions). However! It hurts Gcc code generation even in full Release mode 😱, so that’s probably a no-go. If it were not for the Gcc slowdown, this would be a win: better performance in Debug configurations, and a simple loop is easier to understand than a variadic template plus a lambda.

Explicit code paths for 2D/3D/4D vector cases

Out of all the possible vector math cases, 2D, 3D and 4D vectors are by far the most common. I’m not sure if other sizes even occur within the Blender codebase, TBH. Maybe we could specialize those to help the compiler a bit? For example:

friend VecBase operator+(const VecBase &a, const VecBase &b)
{
    if constexpr (Size == 4) {
        return VecBase(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
    }
    else if constexpr (Size == 3) {
        return VecBase(a.x + b.x, a.y + b.y, a.z + b.z);
    }
    else if constexpr (Size == 2) { 
        return VecBase(a.x + b.x, a.y + b.y);
    }
    else {
        VecBase result;
        unroll<Size>([&](auto i) { result[i] = a[i] + b[i]; });
        return result;
    }
}

This is very verbose and a bit typo-prone, however :( With some C preprocessor help, most of the ugliness can be hidden inside a macro, and then the actual operator implementation is not terribad:

friend VecBase operator+(const VecBase &a, const VecBase &b)
{
    BLI_IMPL_OP_VEC_VEC(+);
}

                         MSVC 2022  Clang 17  Clang 14  Gcc 12  Gcc 11
Release unroll                1x        1x        1x      1x      1x
Release explicit              1x        1x        1x      1x      1x
Developer unroll              9x
Developer explicit            1x
Faster Debug unroll         114x        2x        2x     50x     14x
Faster Debug explicit        19x        2x        2x      5x      3x
Debug unroll                262x      121x      144x     80x     85x
Debug explicit               55x       18x       30x     22x     21x

This actually helps Debug configurations quite a lot! One downside is that you need a handful of C preprocessor macros to hide away all the complexity that specializes the implementations for 2D/3D/4D vectors.

Not using C++ vector math, use C instead

As a thought exercise – what if instead of using the C++ vector math library, we went back to the C-style of writing code?

Within Blender, right now there’s guidance of “use C++ math library for new code, occasionally rewrite old code to use C++ math library” too. That makes the code more compact and easier to read for sure, but does it have any possible downsides?

Our test image filter code then becomes this (no “math library” used at all, just operations on plain numbers):

inline void load_pixel(const uint8_t* src, int size, int x, int y, float pix[4])
{
    x &= size - 1;
    y &= size - 1;
    const uint8_t* ptr = src + (y * size + x) * 4;
    pix[0] = ptr[0] * (1.0f / 255.0f);
    pix[1] = ptr[1] * (1.0f / 255.0f);
    pix[2] = ptr[2] * (1.0f / 255.0f);
    pix[3] = ptr[3] * (1.0f / 255.0f);
}
inline void store_pixel(uint8_t* dst, int size, int x, int y, const float pix[4])
{
    float r = std::max(pix[0], 0.0f);
    float g = std::max(pix[1], 0.0f);
    float b = std::max(pix[2], 0.0f);
    float a = std::max(pix[3], 0.0f);
    r = std::min(r, 1.0f);
    g = std::min(g, 1.0f);
    b = std::min(b, 1.0f);
    a = std::min(a, 1.0f);
    r = std::round(r * 255.0f);
    g = std::round(g * 255.0f);
    b = std::round(b * 255.0f);
    a = std::round(a * 255.0f);
    uint8_t* ptr = dst + (y * size + x) * 4;
    ptr[0] = uint8_t(r);
    ptr[1] = uint8_t(g);
    ptr[2] = uint8_t(b);
    ptr[3] = uint8_t(a);
}
void filter_image(int size, const uint8_t* src, uint8_t* dst)
{
    const int kFilter = 5;
    int idx = 0;
    float blend = 0.2f;
    float inv_size = 1.0f / size;
    for (int y = 0; y < size; y++)
    {
        for (int x = 0; x < size; x++)
        {
          float pix[4] = { 0,0,0,0 };
          float tint[4] = { x * inv_size, y * inv_size, 1.0f - x * inv_size, 1.0f };
          for (int by = y - kFilter; by <= y + kFilter; by++)
          {
              for (int bx = x - kFilter; bx <= x + kFilter; bx++)
              {
                  float sample[4];
                  load_pixel(src, size, bx, by, sample);
                  sample[0] = sample[0] * (1.0f - blend) + tint[0] * blend;
                  sample[1] = sample[1] * (1.0f - blend) + tint[1] * blend;
                  sample[2] = sample[2] * (1.0f - blend) + tint[2] * blend;
                  sample[3] = sample[3] * (1.0f - blend) + tint[3] * blend;
                  pix[0] += sample[0];
                  pix[1] += sample[1];
                  pix[2] += sample[2];
                  pix[3] += sample[3];
              }
          }
          float scale = 1.0f / ((kFilter * 2 + 1) * (kFilter * 2 + 1));
          pix[0] *= scale;
          pix[1] *= scale;
          pix[2] *= scale;
          pix[3] *= scale;
          store_pixel(dst, size, x, y, pix);
        }
    }
}

                      MSVC 2022  Clang 17  Clang 14  Gcc 12  Gcc 11
Release unroll             1x        1x        1x      1x      1x
Release C                  1x      0.5x      0.5x      1x      1x
Developer unroll           9x
Developer C                1x
Faster Debug unroll      114x        2x        2x     50x     14x
Faster Debug C             4x      0.5x      1.5x      1x      1x
Debug unroll             262x      121x      144x     80x     85x
Debug C                    5x        6x        6x      4x      4x

Writing the code in “pure C” style makes Debug build configuration performance really good! But the more interesting thing is… in a Release build on clang, this is faster than the C++ code. Even for this very simple vector math library, used in a very simple algorithm, the C++ abstraction is not zero-cost!

What about SIMD?

In an ideal world, the compiler would take care of SIMD for us, especially in simple algorithms like the one being tested here. It is just number math, with very clear “four lanes” being operated on (maps perfectly to SSE or NEON registers), no complex cross-lane shuffles, packing or any of that stuff. Just some loops and some math.

Of course, as Matt Pharr writes in the excellent ISPC blog series, “Auto-vectorization is not a programming model” (post) (original quote by Theresa Foley).

What if we manually added template specializations to our math library, where for the “type is float, dimension is 4” case it would just use SIMD intrinsics directly? Note that this is not the right model if you want the absolute best performance that also scales past 4-wide SIMD; the correct way would be to map one SIMD lane to one “scalar item” in your algorithm. But that is a whole other topic; let’s limit ourselves to 4-wide SIMD and map a float4 to one SSE register:

template<> struct vec_struct_base<float, 4> { __m128 simd; };
template<>
inline VecBase<float, 4> operator+(const VecBase<float, 4>& a, const VecBase<float, 4>& b)
{
    VecBase<float, 4> r;
    r.simd = _mm_add_ps(a.simd, b.simd);
    return r;
}

                      MSVC 2022  Clang 17  Clang 14  Gcc 12  Gcc 11
Release unroll             1x        1x        1x      1x      1x
Release C                  1x      0.5x      0.5x      1x      1x
Release SIMD             0.8x      0.5x      0.5x    0.7x    0.7x
Developer unroll           9x
Developer C                1x
Developer SIMD           0.8x
Faster Debug unroll      114x        2x        2x     50x     14x
Faster Debug C             4x      0.5x      1.5x      1x      1x
Faster Debug SIMD         24x      1.3x      1.3x      2x      2x
Debug unroll             262x      121x      144x     80x     85x
Debug C                    5x        6x        6x      4x      4x
Debug SIMD                70x       44x       51x     20x     20x

Two surprising things here:

  • Even for this very simple case that sounds like “of course the compiler would perfectly vectorize this code, it is trivial!”, manually writing SIMD still wins everywhere except on Clang.
  • However, in the Debug build configuration, SIMD intrinsics incur a heavy performance cost, i.e. the code is way slower than the same thing written in pure scalar C style. SIMD intrinsics are still faster than our initial unroll+lambda style, though.

What about O3 optimization level?

You might say, “hey, you are using -O2 on gcc/clang, you should use -O3!” Yes, I’ve tried that, and:

  • On gcc it does not change anything, except it fixes the curious “changing the unroll lambda to a simple loop” slowdown, i.e. under -O3 there is no downside to using a loop in your vector math class compared to unroll+lambda.
  • On clang it makes the various C++ approaches from above almost reach the performance of either “raw C” or “SIMD” styles, but not quite.

What about “force inline”?

(Update 2024 Sep 17) Vittorio Romeo asked “what about using attributes that force inlining”? I don’t have a full table, but a quick summary is:

  • In MSVC that is not possible at all. Under /Od nothing is inlined, even if you mark it as “force inline please”. There’s an open feature request to make that possible, but it is not there (yet?).
  • In Clang force inline attributes help a bit, but not by much.

Learnings

All of the learnings are based on this particular code, which is “simple loops that do some simple pixel operations”. They may or may not transfer to other domains.

  • Clang feels like the best compiler of the three. Consistently fastest code, compared to other compilers, across various coding styles.
  • “C++ has zero cost abstractions” (compared to raw C code) is not true, unless you’re on Clang.
  • Debug build (no optimizations at all) performance of any C++ code style is really bad. The only way I could make it acceptable, while still being C++, is by specializing code for common cases, which I achieved by… using C preprocessor macros 🤦.
  • It is not true that “MSVC has horrible Debug build performance”. Yes it is the worst of all of them, but the other compilers also produce really badly performing code in Debug build config.
  • SIMD intrinsics in a non-optimized build have quite bad performance :(
  • Using “enable some optimizations” build setting, e.g. -Og, might be worth looking into, if your codebase is C++ and heavy on inlined functions, lambdas and that stuff.
  • Using “just my code debugging” (/JMC) on Visual Studio has a high performance cost, on already really bad Debug build performance. I’m not sure if it is worth using at all, ever, anywhere.

All my test code for the above is in this tiny github repo, and a PR for the Blender codebase that does “explicitly specialize for common cases via C macros” is at #127577. Whether it will get accepted is up in the air, since it does arguably make the code “more ugly” (update: the PR got merged into Blender mainline).