Tuesday, December 8, 2009

Techtalk Tuesday -- Video editing and image compositing

So I've been thinking about video editing, Linux, and how much I hate Cinelerra.

Now, I don't know a lot about the internals and features of existing video editing tools, but I at least know some of the basics. First, you produce a series of images at a rate intended to give the illusion of movement. Let's look at a single point in time and ignore the animation side of the equation. Let's also focus on visual (as opposed to auditory) factors.

You have source images, you have filters and other transformations you want to apply to them in a particular order, and you want to output them into a combined buffer representing the full visualization of the frame.

Let's break it down a bit further, along the lines of the "I know enough to be dangerous" areas.

Raster image data has several possible variations, aside from what is being depicted. It may have a specific color space, be represented in binary using different color models (RGB vs YUV vs HSL vs HSV), may have additional per-pixel data (like an alpha channel) thrown in, and the subpixel components can have different orderings (RGB vs BGR), sizes (8 bpp to 32 bpp), and even formats (integer, or floats of various sizes and exponent/mantissa arrangements). ICC color profiles fit in there somewhere, too, but I'm not sure where. There's even dpi, though not a lot of folks pay attention to that in still imagery, much less video. Oh, and don't forget stride (empty space often left as padding at the end of an image data row, to take advantage of performance improvements related to byte alignment).
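If I were sketching that out in C, all of that metadata might collapse into a little descriptor struct like this (every name here is mine, made up purely for illustration):

/* One field per axis of variation described above. */
typedef enum { CM_RGB, CM_BGR, CM_YUV, CM_HSL, CM_HSV } ColorModel;
typedef enum { ST_UINT8, ST_UINT16, ST_FLOAT16, ST_FLOAT32 } SampleType;

typedef struct {
    ColorModel model;        /* component meaning and ordering */
    SampleType sampleType;   /* integer vs. float, and its width */
    int        bitsPerPixel; /* total storage per pixel, 8 to 32+ */
    int        hasAlpha;     /* extra per-pixel coverage channel? */
    /* ICC profile, dpi, etc. could hang off here somewhere too */
} PixelFormat;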

Now let's look at how you might arrange image transformations. The simplest way to do it might be to organize the entire operation set as an unbalanced tree, merging from the outermost leaves inward. (Well, that's the simplest way I can visualize it, at least.) Each node would have a number of children equal to the number of its inputs. A simple filter would have one input, so it would have one child. Any more inputs, and you have a compositing node. An alpha merge, binary (XOR/OR/AND), or arithmetic (subtract, add, multiply, etc.) merge would be two-arity, while a mask merge might be three-arity.
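A node in that tree might look something like this (again, invented names; ImageFunc is the single operation prototype I'll get to in a moment):

/* Forward declarations; Buffer gets fleshed out further down. */
typedef struct ConfigParams ConfigParams;
typedef struct Buffer Buffer;

/* The operation prototype as a function pointer, so any node can
   hold any operation. */
typedef void (*ImageFunc)(const ConfigParams *config, int inputCount,
                          const Buffer *inputs[], Buffer *output);

/* A node's arity equals its input count: a source has 0 children,
   a simple filter 1, a binary or arithmetic merge 2, a mask merge 3. */
typedef struct OpNode {
    ImageFunc       func;      /* the operation itself */
    ConfigParams   *config;    /* per-node settings */
    int             arity;     /* number of inputs consumed */
    struct OpNode **children;  /* arity subtrees, merged inward */
} OpNode;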

Fortunately, all of this is pretty simple to describe in code. You only need one prototype for all of your image operations:

/* One signature covers sources, filters, and compositors alike:
   read inputCount input buffers, write the result into *output. */
void imageFunc(const ConfigParams *config,
               int inputCount,
               const Buffer *inputs[],
               Buffer *output);

An image source would have an inputCount of 0; it gets its data from some other location, specified by its ConfigParams.
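So a trivial source might look like this -- fillBuffer and the color config field are hypothetical stand-ins for "wherever the data comes from":

/* A source ignores its (zero) inputs and fills the output from
   whatever its config describes -- here, a solid color. */
void solidColorSource(const ConfigParams *config, int inputCount,
                      const Buffer *inputs[], Buffer *output)
{
    (void)inputs;
    (void)inputCount;                  /* always 0 for a source */
    fillBuffer(output, config->color); /* hypothetical helper */
}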

So assuming you were willing to cast aside performance in the interest of insane levels of flexibility (hey, I love over-engineering stuff like this; be glad I left out my thoughts on scalar filter inputs, vector-to-scalar filters, and multiple outputs (useful for deinterlacing), and that's before even considering how vector graphics map in), you probably want every operation to be able to consider all that frame metadata. Make it part of your BUFFER data type.
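Something along these lines, folding in the format descriptor from earlier (I'll spell it Buffer):

/* The pixels plus everything a filter needs to interpret them. */
typedef struct Buffer {
    PixelFormat    format;        /* model, depth, alpha, etc. */
    int            width, height; /* dimensions in pixels */
    int            stride;        /* bytes per row, padding included */
    unsigned char *data;          /* stride * height bytes of pixels */
} Buffer;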

One needs to make sure input and output formats match as much as possible, to minimize glue-type color model and color space conversions. For varying tradeoffs of performance against accuracy, you could up-convert lower-precision image formats to larger-range, higher-precision ones, assuming the downstream filters support them. Given modern CPU and GPU SIMD capabilities, that might even be a recommended baseline for stock filters.
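As a toy example of one such up-conversion, widening 8-bit samples to normalized floats (plain scalar loop shown; the SIMD version is left as an exercise):

#include <stddef.h>

/* Widen 8-bit samples to floats in [0, 1] so float-native filters
   downstream can work without quantization loss. */
void widenU8ToF32(const unsigned char *src, float *dst, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = (float)src[i] / 255.0f;
}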

Additionally, it *might* be possible to run an optimizing compiler over the operation graph: rearranging mathematically equivalent filters, eliminating discovered redundancy, even generating filter machine code from types and op templates. But that's delving into domain-specific language design, and not something I want to think too hard about at 4 AM. In any case, it would likely be unwise to expose anyone but the most advanced users to the full graph; the user interface should map the more common behaviors onto the underlying code.
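To give a flavor of it, here's one toy rewrite rule: two stacked brightness offsets are mathematically a single offset, so the pass can splice a node out (isBrightness and the offset field are hypothetical):

/* Collapse brightness(brightness(x, a), b) into brightness(x, a + b).
   A real pass would walk the whole tree and free the spliced node. */
OpNode *fuseBrightness(OpNode *node)
{
    if (node->arity == 1 && isBrightness(node) &&
        node->children[0]->arity == 1 && isBrightness(node->children[0])) {
        OpNode *child = node->children[0];
        node->config->offset += child->config->offset; /* combine params */
        node->children[0] = child->children[0];        /* splice child out */
    }
    return node;
}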

There's also a clear opportunity for parallelism: since the tree is processed leaf-to-root, a thread pool could work it with each thread starting from a different leaf, a node becoming runnable once all of its children have finished.
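The bookkeeping might be as simple as this, assuming each node also carries a parent pointer and a pending-children counter (neither of which is in the struct above), plus hypothetical atomic and queue primitives:

/* When a node finishes, its parent may become runnable; runnable
   nodes go onto a shared queue that the worker threads drain.
   Leaves are queued up front, since they have nothing pending. */
void onNodeFinished(OpNode *node)
{
    OpNode *parent = node->parent;                /* assumed field */
    if (parent != NULL &&
        atomicDecrement(&parent->pendingChildren) == 0)
        queuePush(&readyQueue, parent);           /* hypothetical API */
}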

That's an image compositor. Just about any image editing operation you could want can be done in there. One exception I can think of is stereovision video, though the workaround for that is to lock-mirror the tree and make the final composite a map-and-join. (If you want to apply different filters to what each eye sees, you're an evil, devious ba**ard. I salute you, and hope to see it.) Another is gain-type filtering, where a result from nearer the tree root could be used deeper in the tree (such as if you wanted to dynamically avoid clipping due to something you were doing, or if you simply wanted to reuse information lost to subsequent filtering or compositing steps). Still another is cross-branch feeding; I can think of a few interesting effects you could pull off with something like that. There's also layering and de-layering of per-pixel components.

As a bonus, it's flexible enough that you could get rid of that crap compositor that's been sitting at the core of the GIMP since the nineties.
