Georg Lukas, 2025-11-11 18:24
For my time-lapse photography, I'm using darktable to develop the pictures, processing hundreds, sometimes over a thousand, RAW photos for a single scene. Often, getting a good result requires multiple iterations of export and video rendering, with each export usually taking half an hour or longer on my slightly dated PC.
With OpenCL, exports can be significantly accelerated if the workload fits into GPU memory (darktable will apply image tiling in some, but not all, modules).
This post outlines how to identify workloads that fall back to the CPU and slow down the export pipeline, how to find out how much more GPU memory they would need, and how to make that memory available or adapt your workflow to require less of it.
Background
There is an excellent and comprehensive darktable OpenCL benchmark that I'm using as inspiration for the benchmark CLI commands.
Furthermore, the darktable manual has a section on memory & performance tuning describing the config parameters you can tune to improve GPU memory performance.
Both references outline how to obtain the benchmark values, and this post is meant to show how to parse and interpret them.
Huge thanks go to the people in #darktable on OFTC, who were very helpful by providing the right pointers (and had to bear with me while I was figuring out the details)!
Performing Measurements
The benchmark page suggests running the darktable-cli command-line interface as
follows:
# CPU only:
darktable-cli setubal.orf setubal.orf.xmp test.jpg --core --disable-opencl -d perf
# CPU + GPU, using OpenCL:
darktable-cli setubal.orf setubal.orf.xmp test.jpg --core -d perf -d opencl
You need to prepare a RAW image (setubal.orf) and its accompanying .xmp
sidecar with the desired edits. The call will initialize the pipeline and
process the RAW image, exporting it into test.jpg. The arguments after
--core are passed to darktable, enabling debugging (-d) for the opencl
and perf (performance) subsystems.
We will also need some additional information that is hidden in the -d memory and
-d pipe (processing pipeline) subsystems.
All these debug flags will make darktable very verbose, so we'll have to
heavily apply grep to filter the noise out.
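For example, a combined run that keeps only the per-module timing lines discussed below could look like this (reusing the file names from the benchmark command; adjust the grep pattern to the lines you are interested in):
# all debug subsystems, filtered to the per-module timing lines:
darktable-cli setubal.orf setubal.orf.xmp test.jpg --core \
    -d perf -d opencl -d memory -d pipe 2>&1 | grep dev_pixelpipe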
Interpreting Debug Logs
Logs from perf
The perf subsystem alone will already provide two useful outputs.
The first is the time needed by each module in the pipeline and how it was
executed. Grep for dev_pixelpipe:
4.3001 [dev_pixelpipe] took 0.000 secs (0.000 CPU) initing base buffer [export]
4.3474 [dev_pixelpipe] took 0.047 secs (0.016 CPU) [export] processed `rawprepare' on GPU, blended on GPU
4.3651 [dev_pixelpipe] took 0.018 secs (0.016 CPU) [export] processed `temperature' on GPU, blended on GPU
4.4022 [dev_pixelpipe] took 0.037 secs (0.031 CPU) [export] processed `highlights' on GPU, blended on GPU
4.5788 [dev_pixelpipe] took 0.177 secs (0.062 CPU) [export] processed `demosaic' on GPU, blended on GPU
4.6352 [dev_pixelpipe] took 0.056 secs (0.016 CPU) [export] processed `lens' on GPU, blended on GPU
4.7016 [dev_pixelpipe] took 0.066 secs (0.031 CPU) [export] processed `ashift' on GPU, blended on GPU
4.7428 [dev_pixelpipe] took 0.041 secs (0.031 CPU) [export] processed `exposure' on GPU, blended on GPU
4.7930 [dev_pixelpipe] took 0.050 secs (0.031 CPU) [export] processed `crop' on GPU, blended on GPU
4.8451 [dev_pixelpipe] took 0.052 secs (0.016 CPU) [export] processed `colorin' on GPU, blended on GPU
4.9222 [dev_pixelpipe] took 0.077 secs (0.047 CPU) [export] processed `channelmixerrgb' on GPU, blended on GPU
5.9323 [dev_pixelpipe] took 1.010 secs (0.109 CPU) [export] processed `colorbalancergb' on GPU, blended on GPU
6.0176 [dev_pixelpipe] took 0.085 secs (0.000 CPU) [export] processed `agx' on GPU, blended on GPU
7.8012 [dev_pixelpipe] took 1.784 secs (2.469 CPU) [export] processed `bilat' on CPU, blended on CPU
8.0536 [dev_pixelpipe] took 0.252 secs (0.188 CPU) [export] processed `colorzones' on GPU, blended on GPU
8.1302 [dev_pixelpipe] took 0.077 secs (0.062 CPU) [export] processed `finalscale' on GPU, blended on GPU
8.2186 [dev_pixelpipe] took 0.088 secs (0.031 CPU) [export] processed `colorout' on GPU, blended on GPU
Each line prints the (wall-clock) time spent, the CPU time (this is a sum from all CPU cores used, and thus can be higher than the wall-clock time), and whether the module was executed "on CPU" or "on GPU".
If you have very good eyes, you will see that one module, namely bilat (the
internal name for "local contrast"), is the only one processed "on CPU":
… took 1.784 secs … [export] processed `bilat' on CPU, blended on CPU
bilat is also the slowest module, taking 1.784s, with colorbalancergb being
a distant second at 1.010s (I'm using a blend mask to increase the saturation
of the sky in this image, which is probably the costly calculation that I cannot
get rid of).
The internal module names correspond to the file names in darktable's
src/iop directory
and the human-readable name is returned by the respective module's
name() function.
The second useful output from perf comes at the end, and shows the wall-clock
time and the CPU time used for the whole image export pipeline:
8.2709 [dev_process_export] pixel pipeline processing took 3.975 secs (3.203 CPU)
This tells us that exporting one picture takes 3.975s, half of which is spent in
bilat on the CPU, but it doesn't say why the GPU isn't used.
We can multiply the per-picture time by the number of photos in our export batch
to estimate whether we need to grab a coffee or to take a nap.
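As a rough sketch, that estimate can be scripted; the awk field position assumes the exact log format shown above, and the batch size of 600 is just an example:
# per-image pipeline time, extrapolated to a 600-photo batch:
darktable-cli setubal.orf setubal.orf.xmp test.jpg --core -d perf 2>&1 \
    | awk '/pixel pipeline processing took/ { printf "per image: %ss, batch of 600: %.0f min\n", $7, $7*600/60 }'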
Logs from pipe
To find out why bilat is running on the CPU, we need to add -d pipe to the
mix, which will give us the following three lines about bilat:
6.1278 transform colorspace CPU [export] bilat 6900 (0/0) 4254x2392 sc=1.000; IOP_CS_RGB -> IOP_CS_LAB `linear Rec2020 RGB'
6.4772 process CPU [export] bilat 6900 (0/0) 4254x2392 sc=1.000; IOP_CS_LAB 1520MB
7.7461 blend with form CPU [export] bilat 6900 (0/0) 4254x2392 sc=1.000; IOP_CS_LAB, BLEND_CS_LAB
The relevant info is hidden at the end of the second line: 1520MB is the
memory needed to process the image with bilat. This is less than the 2GB of
GPU memory we have, so why does it fail?
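A filtered invocation that should produce (roughly) these lines, again reusing the file names from the benchmark command:
darktable-cli setubal.orf setubal.orf.xmp test.jpg --core \
    -d perf -d pipe 2>&1 | grep bilat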
Logs from opencl + memory
Certain outputs require a combination of multiple -d flags, like the GPU
memory tracking, which only works with -d opencl -d memory:
…
6.1174 [opencl memory] device 'NVIDIA CUDA NVIDIA GeForce GTX 950' id=0: 0.0MB in use, 989.7MB available GPU mem of 2047.8MB
6.1278 transform colorspace CPU [export] bilat 6900 (0/0) 4254x2392 sc=1.000; IOP_CS_RGB -> IOP_CS_LAB `linear Rec2020 RGB'
…
[opencl_summary_statistics] device 'NVIDIA CUDA NVIDIA GeForce GTX 950' id=0: peak memory usage 548.5 MB
This will print the video memory use on each allocation and the overall peak memory use at the end.
As can be seen here, there is only 989.7MB of available GPU memory, despite 0MB
being in use. nvidia-smi also confirms there are no other allocations, so
darktable should be able to utilize the full 2GB.
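A corresponding filtered invocation for the GPU memory tracking, reusing the file names from above, could be:
darktable-cli setubal.orf setubal.orf.xmp test.jpg --core \
    -d opencl -d memory 2>&1 | grep -E "opencl memory|opencl_summary"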
Increasing darktable's GPU memory
When darktable is started with -d opencl, it will also print the GPU
parameters at startup, including the total GPU memory of 2GB:
[dt_opencl_device_init]
DEVICE: 0: 'NVIDIA GeForce GTX 950', NEW
…
DEVICE_TYPE: GPU, dedicated mem
GLOBAL MEM SIZE: 2048 MB
MAX MEM ALLOC: 512 MB
However, as outlined in the memory & performance tuning / darktable resources section in the manual, darktable will restrict the GPU memory allocation based on two factors, to prevent resource exhaustion:
- headroom: a hardcoded 600MB reserve (*)
- darktable resource profile, OpenCL memory: a fraction of the remaining GPU memory, with a default of 700 (and a divisor of 1024)
The formula is:
usable_memory = (total_memory - headroom) * opencl_memory / 1024
On a 2GB GPU with the "default" memory allocation profile this means:
usable_memory = (2GB - 600MB) * 700 / 1024 = 989.7MB
(*) the docs say that the "headroom" can be disabled by setting the "use all GPU memory" option in the settings, but that option was restricted to multi-GPU systems due to users accidentally enabling it without understanding the side effects. On a single-GPU system, there is no way to disable the headroom.
Performance profiles
In addition to "default" there is also a "small" and a "large" performance profile.
The "large" profile increases the GPU memory fraction from 700 to 900, thus
giving us 1272MB. Manually changing the resource_large value from 900 to the
maximum allowed 1024 in darktablerc brings us to 1448MB, still 72MB short of
the 1520MB needed for bilat. Bummer.
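To double-check these numbers, the formula can be evaluated for all three fractions; the 2047.8MB total is taken from the log above and the 600MB headroom from the manual:
# usable GPU memory for the three OpenCL memory fractions:
for fraction in 700 900 1024; do
    printf "%4d -> %.1f MB\n" "$fraction" \
        "$(echo "(2047.8 - 600) * $fraction / 1024" | bc -l)"
done
# prints 989.7, 1272.5 and 1447.8 MB respectively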
Secret debug profiles
However, there is another path outlined in the dt_opencl_check_tuning() function.
If a negative level is set, the headroom value is ignored.
The level variable corresponds to the resource profile, and negative values
correspond to debug / performance testing profiles that can only be set on the
command line (mapping reference):
-1= "reference" = 8GB RAM, 2GB video RAM-2= "mini" = 1GB RAM, 0.2GB video RAM-3= "notebook" = 4GB RAM, 1GB video RAM
WARNING: kids, don't do this at home!
Running darktable --conf resourcelevel="reference" will ignore the safety
limits of darktable and the physical limits of your hardware and just assume
that your GPU has 2GB of video RAM available.
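This should also work for the non-interactive export, since everything after --core is handed to the darktable core (a sketch, reusing the file names from above):
darktable-cli setubal.orf setubal.orf.xmp test.jpg --core \
    --conf resourcelevel="reference" -d perf -d opencl -d memory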
Let's see how that goes on our 2GB GPU:
3.8193 [opencl memory] device 'NVIDIA CUDA NVIDIA GeForce GTX 950' id=0: 310.5MB in use, 2048.0MB available GPU mem of 2047.8MB
4.1378 [dev_pixelpipe] took 0.319 secs (0.234 CPU) [export] processed `bilat' on GPU, blended on GPU
…
4.4987 [dev_process_export] pixel pipeline processing took 1.415 secs (1.078 CPU)
[opencl_summary_statistics] device 'NVIDIA CUDA NVIDIA GeForce GTX 950' id=0: peak memory usage 1452.8 MB
By manually overriding all the safety margins, we get bilat to run on the GPU,
consuming about 71% of the video memory at peak (1452.8MB), and accelerating the
overall pipeline from 3.975s to 1.415s (-64%).
However, driving without seatbelts is not sustainable, and maybe we can get
around the problem by reducing the memory needed by bilat somehow?
"Local contrast" GPU memory use
The bilat "local contrast" module supports
two modes,
"local laplacian" (default, more robust) and "bilateral grid":

The memory requirements are calculated based on a rather complex formula in the
tiling_callback() function.
The tiling->factor value defines the memory required for processing an image,
relative to its size in bytes. For laplacian mode, it depends on the number of
scaling levels required at a given image resolution, and ends up being ~9.33 for
a 4254x2392 pixel image, which totals 1520MB at 16 bytes per pixel (4
channels with 4-byte float values):
4254*2392*16*9.33 ≈ 1520MB
Notably, tiling is unsupported in laplacian mode, so the module needs to allocate all of that GPU memory at once.
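For a quick plausibility check, and to get a feeling for how much a smaller export would need, we can plug other resolutions into the formula. Keeping the ~9.33 factor constant is only an approximation, since the number of laplacian levels changes with resolution:
echo "4254 * 2392 * 16 * 9.33 / 1000000" | bc -l   # full size: ~1519 MB
echo "1920 * 1080 * 16 * 9.33 / 1000000" | bc -l   # hypothetical 1920x1080 export: ~310 MB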
Given that tiling is only disabled in laplacian mode, we can try the "bilateral grid" mode to see if it is more efficient. The algorithm is different, needs a higher "detail" value to achieve a similar effect, and might end up with artifacts.
However, it allows shifting the workload from the CPU to the GPU without hacks:
3.0816 [dev_pixelpipe] took 0.000 secs (0.000 CPU) initing base buffer [export]
3.1319 [dev_pixelpipe] took 0.050 secs (0.078 CPU) [export] processed `rawprepare' on GPU, blended on GPU
3.1634 [dev_pixelpipe] took 0.031 secs (0.000 CPU) [export] processed `temperature' on GPU, blended on GPU
3.2106 [dev_pixelpipe] took 0.047 secs (0.031 CPU) [export] processed `highlights' on GPU, blended on GPU
3.3463 [dev_pixelpipe] took 0.136 secs (0.156 CPU) [export] processed `demosaic' on GPU, blended on GPU
3.3835 [dev_pixelpipe] took 0.037 secs (0.016 CPU) [export] processed `lens' on GPU, blended on GPU
3.4312 [dev_pixelpipe] took 0.048 secs (0.031 CPU) [export] processed `ashift' on GPU, blended on GPU
3.4737 [dev_pixelpipe] took 0.042 secs (0.031 CPU) [export] processed `exposure' on GPU, blended on GPU
3.5173 [dev_pixelpipe] took 0.044 secs (0.047 CPU) [export] processed `crop' on GPU, blended on GPU
3.5530 [dev_pixelpipe] took 0.036 secs (0.031 CPU) [export] processed `colorin' on GPU, blended on GPU
3.6314 [dev_pixelpipe] took 0.078 secs (0.047 CPU) [export] processed `channelmixerrgb' on GPU, blended on GPU
3.8093 [dev_pixelpipe] took 0.178 secs (0.156 CPU) [export] processed `colorbalancergb' on GPU, blended on GPU
3.8425 [dev_pixelpipe] took 0.033 secs (0.031 CPU) [export] processed `agx' on GPU, blended on GPU
4.0338 [dev_pixelpipe] took 0.191 secs (0.141 CPU) [export] processed `bilat' on GPU, blended on GPU
4.1906 [dev_pixelpipe] took 0.157 secs (0.125 CPU) [export] processed `colorzones' on GPU, blended on GPU
4.2605 [dev_pixelpipe] took 0.070 secs (0.047 CPU) [export] processed `finalscale' on GPU, blended on GPU
4.3451 [dev_pixelpipe] took 0.085 secs (0.062 CPU) [export] processed `colorout' on GPU, blended on GPU
…
4.4328 [dev_process_export] pixel pipeline processing took 1.352 secs (1.094 CPU)
[opencl_summary_statistics] device 'NVIDIA CUDA NVIDIA GeForce GTX 950' id=0: peak memory usage 548.5 MB
The visual result is close to the initial run, but we've reduced the time by 66%, from 3.975s to 1.352s, while significantly reducing the required video memory! This is only marginally faster than ignoring the safety margins in laplacian mode, but much safer to operate.
Conclusion
In most use cases, a few seconds more or less when exporting a single JPEG from darktable won't be noticeable. However, for exporting a 600-photo time-lapse, we just went from 40 minutes to 14!
Inspecting darktable's detailed performance and memory consumption logs paid off, and allowed us to identify the bottleneck and to work around it!
That said, maybe it's time to replace the decade-old GTX 950 anyway, and a card with 4GB or more will be able to perform the laplacian filter in GPU memory on the "default" profile with no further tuning.
