Rendering Moana with Swift

Moana (2048×858 pixels, 64 spp) rendered with Gonzales on a Google Cloud Instance with 8 vCPUs and 64GB of memory in roughly 26 hours. Memory usage is about 60GB. (Denoised with OpenImageDenoise.)

TLDR: Render Disney’s Moana scene in fewer than 10,000 lines of Swift code.

After Walt Disney Animation Studios released the scene description of the island in Moana, several efforts were started to render it with engines other than Disney’s Hyperion. I am aware of the following render engines:

Here I present another one, the Gonzales renderer, written by me. It is heavily inspired by PBRT and written in Swift (with a few lines in C++ to call OpenEXR and Ptex). It is optimized only as far as needed to render the scene in a reasonable amount of time on a free Google Cloud instance (8 vCPUs, 64GB RAM). As far as I know this is the only renderer not written in C/C++ that is able to render Moana. I wrote it with vi and command line Swift on Ubuntu Linux and with Xcode on macOS, so it should be relatively painless to get it compiled on these platforms.

Why Swift?

I was always uncomfortable with header files and the preprocessor in C and C++. From my point of view something (a variable, a function, …) should be declared and defined once, not twice. Also, the textual inclusion of header files brings with it many problems, like having to add implementation details to header files (templates come to mind) or slow compilation caused by the repeated inclusion of headers and its combinatorial explosion. When I started, C++ modules were not available, so I evaluated Python (too slow), Go (too much like C) and some others, but in the end only Rust and Swift were serious contenders. I finally chose Swift because of readability (I just don’t like “fn main” or “impl trait”). Also, being written by the implementors of LLVM and Clang gave me confidence that it would a) not be abandoned in the future and b) meet my performance goals. In short, I wanted a compiled language, no pointers, modules, concepts, ranges, readable templates, and I wanted it now. Also, compilers were invented to make the life of programmers easier by making programs more readable, and looking at template-based code sometimes makes me think we are going backwards in time. I like my stuff readable.

Random notes

Parsing went through a few incarnations. First it was a simple String(file.availableData, encoding: .utf8), but the scene is simply too big to fit in memory that way. Data was not used for similar reasons. Scanner from Foundation was also evicted at some point. In the end I settled on an InputStream read into an UnsafeMutablePointer<UInt8> buffer of 64 kB.
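To make the last step a bit more concrete, here is a minimal sketch of reading a file through one reusable 64 kB buffer. The names readChunks and countLines are made up for this example and are not taken from Gonzales; the real parser does of course more than counting lines.

```swift
import Foundation

/// Streams the file at `path` through one reusable buffer and hands each
/// filled chunk to `handle`. Returns false if the file could not be read.
/// (Hypothetical helper for illustration, not the Gonzales parser.)
func readChunks(
    atPath path: String,
    bufferSize: Int = 64 * 1024,
    handle: (UnsafeBufferPointer<UInt8>) -> Void
) -> Bool {
    guard let stream = InputStream(fileAtPath: path) else { return false }
    stream.open()
    defer { stream.close() }

    // One fixed buffer for the whole file, freed when the function returns.
    let buffer = UnsafeMutablePointer<UInt8>.allocate(capacity: bufferSize)
    defer { buffer.deallocate() }

    while true {
        let count = stream.read(buffer, maxLength: bufferSize)
        if count < 0 { return false }  // read error
        if count == 0 { return true }  // end of file
        handle(UnsafeBufferPointer(start: buffer, count: count))
    }
}

/// Example consumer: counts newlines without ever holding more than
/// one chunk of the file in memory.
func countLines(atPath path: String) -> Int? {
    var lines = 0
    let newline = UInt8(ascii: "\n")
    let ok = readChunks(atPath: path) { chunk in
        for byte in chunk where byte == newline { lines += 1 }
    }
    return ok ? lines : nil
}
```

A tokenizer working this way has to cope with tokens that straddle a chunk boundary, which the sketch ignores.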

The Array dead end: in short, don’t ever use Array in a hot path. That is to say, do not ever create one there. This should have been clear from the beginning since its storage is heap allocated, but the lesson was learned quickly since it always turned up at the top of an analysis done with perf. For fixed-size arrays this can be overcome with tuples or Swift’s internal FixedArray. Even if the Array is only read, subscript getters tend to show up at the top of perf runs.
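The tuple workaround might look roughly like the following sketch. Point3 and minimum are hypothetical names for this example; the point is only that the three components live inline in the struct instead of in a heap-allocated Array buffer.

```swift
/// A 3D point whose components are stored inline in a tuple, so creating
/// one does not allocate on the heap the way a [Float] would.
/// (Illustrative type, not taken from Gonzales.)
struct Point3 {
    var components: (Float, Float, Float)

    init(_ x: Float, _ y: Float, _ z: Float) { components = (x, y, z) }

    /// Array-like access by axis; any axis other than 0 or 1 maps to z.
    subscript(axis: Int) -> Float {
        switch axis {
        case 0: return components.0
        case 1: return components.1
        default: return components.2
        }
    }
}

/// Component-wise minimum, as needed when growing a bounding box.
func minimum(_ a: Point3, _ b: Point3) -> Point3 {
    Point3(min(a[0], b[0]), min(a[1], b[1]), min(a[2], b[2]))
}
```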

In general, I found it quite practical to develop on Linux and macOS in parallel since the available tools to check for performance and memory nicely complement each other. I used mainly these tools:

  • Perf: This Linux kernel tool gives valuable information about where time is spent. Just fire it up, look at the functions showing up at the top and wonder where time is wasted. Hint: it is usually not where you think it is. In my case it was always swift_retain or swift_release, which tells you over and over again to not allocate objects on the heap.
  • Valgrind Memcheck: This shows where the memory has gone. For example, an analysis with this tool is the reason why the acceleration structure is separated from the acceleration structure builder; the memory spent in building a bounding hierarchy was simply never released. It is nice to have no pointers in Swift, no malloc or new, or even shared_pointers, but it is still necessary to think about how memory is used.
  • Xcode profiling: I mostly used Time Profiler, Leaks and Allocations, which give you roughly the same information as Perf and Valgrind but from a different viewpoint. Sometimes it is very helpful to look at the same thing from two different views. Which reminds me of the old times when we used to feed our software to three different compilers (Visual Studio, GCC and the one from IRIX, what was its name again? MIPSPro?).
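The builder/structure split mentioned under Memcheck can be sketched like this. All names (Hierarchy, HierarchyBuilder, BuildNode, makeHierarchy) and the one-dimensional "centroids" are stand-ins invented for the example; the idea is just that the scratch data belongs to the builder and goes away with it, while only the compact result stays alive.

```swift
/// The compact result that stays alive for the whole render.
struct Hierarchy {
    let primitiveOrder: [Int]
}

/// Owns the scratch data that is only needed while building.
final class HierarchyBuilder {
    private struct BuildNode {
        var centroid: Float
        var primitiveIndex: Int
    }

    private var scratch: [BuildNode]

    init(centroids: [Float]) {
        scratch = centroids.enumerated().map {
            BuildNode(centroid: $0.element, primitiveIndex: $0.offset)
        }
    }

    /// Sorts the scratch nodes and copies out only what rendering needs.
    func build() -> Hierarchy {
        scratch.sort { $0.centroid < $1.centroid }
        return Hierarchy(primitiveOrder: scratch.map { $0.primitiveIndex })
    }
}

func makeHierarchy(centroids: [Float]) -> Hierarchy {
    let builder = HierarchyBuilder(centroids: centroids)
    return builder.build()
    // The builder is not referenced after this point, so ARC can release
    // it together with its scratch array; only the Hierarchy survives.
}
```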

Talking about memory, while Swift makes it very easy to write readable and compact code, you still have to think about low-level operations like memory allocations and the like. I frequently switched between structs and classes just to see how memory and performance are affected. The nice thing about not having pointers, new and shared_pointers is that I was able most of the time to just switch between the two without changing anything else in the system.

One tool I didn’t use extensively but which gives nice images is FlameGraph. One thing that can be seen though is that most time is spent in intersection testing for bounding hierarchies and triangles. Things like protocol witness checking do not use much time.

About protocol-based programming: Grepping through today’s Gonzales shows 23 protocols, 57 structs, 47 final classes and 2 non-final classes. Inheritance is almost never used. The two remaining non-final classes are TrowbridgeReitzDistribution and Texture; I’m not happy about either of them and am thinking about redesigning them in the future. All in all, protocol-based programming turns out to result in nice code. For example, I used to have a Primitive class like PBRT but soon changed it to a protocol inheriting from protocols like Boundable, Intersectable, Emitting (gone now) and others. Now that protocol is gone too; the BoundingHierarchyBuild just depends on a Boundable existential type and returns a hierarchy of Intersectables that is used by BoundingHierarchy. All primitives are now stored as an array of existential types consisting of a composition of the protocols Boundable and Intersectable (var primitives = [Boundable & Intersectable]()).
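A stripped-down sketch of that design is shown below. Only the protocol names Boundable and Intersectable come from the text above; the method signatures, Vec3, Ray, Sphere and closestHit are assumptions made for the example and differ from what Gonzales actually uses.

```swift
struct Vec3 { var x, y, z: Float }

func - (a: Vec3, b: Vec3) -> Vec3 { Vec3(x: a.x - b.x, y: a.y - b.y, z: a.z - b.z) }
func dot(_ a: Vec3, _ b: Vec3) -> Float { a.x * b.x + a.y * b.y + a.z * b.z }

/// The direction is assumed to be normalized.
struct Ray {
    var origin: Vec3
    var direction: Vec3
}

// Assumed minimal requirements; the real protocols are richer.
protocol Boundable { func worldBound() -> (lower: Vec3, upper: Vec3) }
protocol Intersectable { func intersect(_ ray: Ray) -> Float? }

struct Sphere: Boundable, Intersectable {
    var center: Vec3
    var radius: Float

    func worldBound() -> (lower: Vec3, upper: Vec3) {
        let r = Vec3(x: radius, y: radius, z: radius)
        return (lower: center - r,
                upper: Vec3(x: center.x + radius, y: center.y + radius, z: center.z + radius))
    }

    /// Returns the distance to the nearest hit in front of the ray origin.
    /// Simplification: a ray starting inside the sphere reports no hit.
    func intersect(_ ray: Ray) -> Float? {
        let oc = ray.origin - center
        let b = dot(oc, ray.direction)
        let c = dot(oc, oc) - radius * radius
        let discriminant = b * b - c
        guard discriminant >= 0 else { return nil }
        let t = -b - discriminant.squareRoot()
        return t > 0 ? t : nil
    }
}

/// Brute-force closest hit over a list of protocol compositions.
func closestHit(_ ray: Ray, in primitives: [Boundable & Intersectable]) -> Float? {
    primitives.compactMap { $0.intersect(ray) }.min()
}
```

Any new shape only has to conform to the two protocols to be stored in the same array; no common base class is involved.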

The primitives in a BoundingHierarchy, on the other hand, are stored as an [AnyObject & Intersectable]. There are two reasons for this: 1. Only intersection is needed. 2. AnyObject forces the stored objects to be reference types (classes), which saves memory. The general protocol layout that has to work for both structs and classes (OpaqueExistentialContainer) uses 40 bytes since Swift tries to store structs inline, whereas class-only protocols (ClassExistentialContainer) use only 16 bytes, as only a pointer to the object and a witness table pointer have to be stored; this can be seen in Swift’s documentation or verified in the source. I emphasize that this is not just an academic discussion: I came across it because it showed up at the top of a memcheck run.
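The two sizes can be checked directly with MemoryLayout. The sketch below uses a placeholder Intersectable protocol with no requirements and a made-up helper existentialSizes; the numbers in the comments assume a 64-bit platform with 8-byte pointers.

```swift
// Placeholder protocol; only its existence matters for the layout.
protocol Intersectable {}

/// Returns the size in bytes of one array element for both ways of
/// storing an Intersectable.
func existentialSizes() -> (opaque: Int, classOnly: Int) {
    // Opaque container: 3 words of inline value buffer, 1 metadata
    // pointer and 1 witness table pointer (5 words, 40 bytes on 64-bit).
    let opaque = MemoryLayout<Intersectable>.size
    // Class-only container: 1 object pointer and 1 witness table pointer
    // (2 words, 16 bytes on 64-bit).
    let classOnly = MemoryLayout<AnyObject & Intersectable>.size
    return (opaque, classOnly)
}
```

With millions of primitives in the array, the 24 bytes saved per element add up quickly.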

One of the reasons you can render Moana in fewer than 10,000 lines is the ability to write compact code in Swift. One extreme example is parameter lists. In PBRT you can attach arbitrary parameters to objects, which results in around 1000 lines of code in paramset.[h|cpp]. In Swift you can achieve the same in about three lines:

protocol Parameter {}
extension Array: Parameter {}
typealias ParameterDictionary = Dictionary<String, Parameter>

Actually, I’m cheating a little bit here but you get the point. (Also, I think this has changed in PBRT-v4.)
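To show how such a dictionary can be queried, here is a small sketch. The three declarations are the ones from above; the lookup helper findOne and its otherwise label are invented for this example and not necessarily what Gonzales does.

```swift
protocol Parameter {}
extension Array: Parameter {}
typealias ParameterDictionary = Dictionary<String, Parameter>

/// Returns the first value stored under `name` if it is an array of T,
/// otherwise the fallback. (Hypothetical helper for illustration.)
func findOne<T>(_ name: String, in parameters: ParameterDictionary, otherwise fallback: T) -> T {
    guard let values = parameters[name] as? [T], let first = values.first else {
        return fallback
    }
    return first
}

/// Example: a dictionary as it might come out of parsing a shape.
func exampleParameters() -> ParameterDictionary {
    let radii: [Float] = [2.5]
    let names: [String] = ["sphere"]
    return ["radius": radii, "name": names]
}
```

Because every Array conforms to Parameter, arrays of floats, strings or anything else can sit in the same dictionary, and the cast in findOne recovers the element type on the way out.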

About interfacing C++ for Ptex and OpenEXR support: Interoperability with C++ is on the way for Swift but wasn’t available when I started and still isn’t as of now. Since I’m using OpenEXR and Ptex only for reading textures and writing images, I resorted to extern "C". One modulemap and a few lines of C++ code later (100 for Ptex, 82 for OpenEXR) I had support for reading and writing OpenEXR images and Ptex textures.

I am releasing the code now as I am able to render Moana on a Google Compute Engine instance with 8 vCPUs and 64GB memory, which is free for three months, so please download the code, get an account and fire it up. 🙂 That said, there is a lot to do as I optimized it only as far as needed to get one image rendered. The following is a big todo list, roughly sorted from easily implemented to big projects which I might or might not tackle in the future.

TODO

  • Ray differentials for direct rays. This should be relatively easy; have a look at how PBRT-v3 does it, implement differential generation in the camera, pump it through the system and use it in the call to Ptex. There it is handled automatically.
  • Better hierarchies: I only implemented the simplest bounding hierarchy, which is nice since it is only 177 lines of code, but it also results in suboptimal rendering times. SAH-optimized hierarchies should be much better in this regard. They also should not be too difficult to implement since I followed PBRT’s implementation very closely.
  • Faster parsing: Integrate Ingo Wald’s fast pbrt parser which parses Moana in seconds instead of half an hour. Or even better: Write a parser for the pbf format in Swift.
  • Faster hierarchy generation: This is somewhat slow. Can something be done about it?
  • An idea about faster parsing, hierarchy generation and scene formats: LLVM IR has three different forms, in-memory, binary bitcode (machine readable) and textual (human readable), and it can losslessly convert between the three. Can we have the same? Like PBRT (human readable), PBF or USD (machine readable) and BHF (binary hierarchy format), where bounding hierarchies are already generated and can simply be mapped into memory.
  • Beginner tasks: I only tried to get Moana to render but it should be fairly easy to enhance Gonzales to be able to render other scenes by adding features or fixing bugs. There are lots of scenes to try. Also there are many exporters for PBRT which should work for Gonzales too.
  • Bump mapping: Should be fairly easy.
  • Displacement mapping: Not so easy.
  • Memory: Lots of memory is used for pixel samples as the image is only written when rendering is finished. Change that to write tiles as they are rendered and discard samples early. This interferes with pixel filtering but since we are denoising anyway maybe this is not needed anymore?
  • Smaller Transforms: As of now Transforms store two 4×4 matrices, the transformation and its inverse. This is a little wasteful since you can always compute one from the other, but inversion is slow. After careful thinking about when which transform is needed, it should be possible to get rid of one. Right now both are used when intersecting a triangle, but is it possible to store triangles (and other objects like curves) in world space to get rid of the transformation of the ray into object space and, similarly, the transformation to world space for surface interactions? And how does this interact with transformed primitives and object instances?
  • Denoising: I am using OpenImageDenoise for the time being but of course an integrated denoiser in Swift would be nice to have. Also, the beauty, albedo and normal images are written separately; this should be rearchitected.
  • USD: Write a parser for Pixar’s Universal Scene Description.
  • Better sampling: Implement discrepancy-based sampling or correlated multi-jittered sampling.
  • Beyond path tracing: Look at PxrUnified and implement Guided Path Tracing (I had a look at it but it looks… confusing) and Manifold Next Event Estimation. I think I saw an implementation somewhere but I forgot. (And if only Weta followed Disney’s lead and published the Gandalf head from that paper, sigh!)
  • Subsurface scattering. Already in PBRT.
  • Faster rendering: Embree has a path tracer. Look at it hard and try to make Gonzales faster.
  • GPU rendering: This should be a big one. PBRT-v4 obviously does this, as do some of the renderers mentioned above. It should be very well possible to follow them and use OptiX to render on a graphics card, but I would much prefer a solution not involving closed source, which would mean that you have to implement your own OptiX. :\ But looking at how CPUs and GPUs are evolving, it might be possible in a distant future to use the same (Swift) code on both of them; you can have instances with 448 CPUs in the cloud and the latest GPUs have a few thousand micro-CPUs, so they look more and more the same. I wonder whether it will still be necessary to program for AVX in the future, as you can just throw more cores at the problem. At the same time memory is getting more and more NUMA-like, so having your data next to the ALU is getting more important. Maybe one day we will have render nodes in the cloud, each responsible for one part of the scene, each node partitioning the scene geometrically and sending only portions to the CPUs. Then the returned intersections could simply be sorted by the t value of the ray, which reminds me of sort-first/middle/last architectures like Chromium.
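As a starting point for the “Better hierarchies” item in the list above, the surface area heuristic boils down to a small cost function. The following is only a sketch: Bounds, sahCost and leafCost are names made up here, and the default cost constants of 1 are placeholders that a real implementation would tune.

```swift
/// Axis-aligned bounding box with corners stored as plain tuples.
struct Bounds {
    var lower: (Float, Float, Float)
    var upper: (Float, Float, Float)

    var surfaceArea: Float {
        let dx = upper.0 - lower.0
        let dy = upper.1 - lower.1
        let dz = upper.2 - lower.2
        return 2 * (dx * dy + dy * dz + dz * dx)
    }
}

/// Estimated cost of splitting `parent` into two children: one traversal
/// step plus the intersection work in each child, weighted by the
/// probability (surface area ratio) that a ray hitting the parent also
/// hits that child. Assumes the parent has non-zero surface area.
func sahCost(
    parent: Bounds,
    left: Bounds, leftCount: Int,
    right: Bounds, rightCount: Int,
    traversalCost: Float = 1,
    intersectionCost: Float = 1
) -> Float {
    let weighted = left.surfaceArea * Float(leftCount) + right.surfaceArea * Float(rightCount)
    return traversalCost + intersectionCost * weighted / parent.surfaceArea
}

/// Cost of not splitting at all and testing every primitive in a leaf.
func leafCost(count: Int, intersectionCost: Float = 1) -> Float {
    intersectionCost * Float(count)
}
```

A builder would evaluate sahCost for a number of candidate splits per node, pick the cheapest one and create a leaf whenever no split beats leafCost.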

That’s it for now. I would be extremely happy to receive comments on what could be done better or implemented more elegantly, bug reports or even pull requests. 😉 Also thanks to Matt Pharr and PBRT, the most valuable resource in the known universe (at least when it involves rendering).

January 14th, 2021.

Andreas
