Adventures in Encoding or: Go Over (9000 words to explain the issues with) Time(ranger) and (take up too much) Space
July 14th, 2012 at 17:50
Lynxara is currently indisposed, so she cannot regale you with Adventures in QC at the moment. To fill the void, I figured I'd finally jump in. The Timeranger 06 release post joked that we forgot about the project, and a few of the comments on it suggested that people believed we might have done exactly that. This post has been a long time coming, but it's especially important in light of that, because the joke was just that: a joke. The assertion that anybody forgot about this is categorically untrue, and the delays all fall on a single member of the group. Ignis was simply too nice to call me out on it.
The short version of this post is this: I am the reason it took so long for Timeranger 06 to come out, because the video is a nightmare and it’s taken forever to figure out how to manage it. Well, that’s only half the truth, as a large portion of the issue was also getting up the nerve to want to deal with said nightmare. The long version of the post is… considerably longer. Let me assure you, the tl;dr tag has never been so appropriate in the history of this blog.
So you’re still here? I can’t say I understand your curiosity, but I commend it. Anyway, probably the best method of illustrating the issue here is by way of comparison with something else I encode on a much more regular basis. Don’t worry if you’ve never encoded yourself, I’m going to walk through the pertinent bits as we go.
The Go-Busters Example
Click here and leave this in another tab/window. This is the base script for last week’s episode of Go-Busters. As I’m sure you can immediately tell, it’s 43 lines long, including a few blanks. It may appear somewhat intimidating, but I assure you it’s not. A long upcoming discussion of the perils of telecine and its inverse aside, this is pretty much easy mode.
The Trivial Bits
Lines 1-10 are just loading up plugins or non-compiled scripts. If Avisynth’s auto-loading functionality could be trusted to work safely, this would be completely redundant, dropping a good quarter of the script by itself. Lines 12-19 are a bit of syntactic sugar I wrote last week after finally deciding that my previous solution to this problem was inelegant. I could relegate these eight lines to another file and Import() them as above, saving seven more. I’ll explain the problem this was designed to solve when we get to its use.
Line 21 is typically unimportant, but given that I try to multithread the execution of this script, it's better to be safe than sorry. Line 22 simply loads the video. Okay, I'm technically lying to you, but if you understand what's actually happening you don't need this explanation, and if you don't, it's not important to you. Line 24 is a horizontal resize that trims off the far left and right edges by a few pixels. Horizontal resizing of interlaced content is safe, and it cuts down on the work required of the later filters to a degree. The edge trimming works essentially like cropping, though technically not quite as well. It's convenient, though, since this way there's no restriction on the width of a temporary step.
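As a sketch, this part of the script amounts to something like the following. The plugin paths, the d2v filename, and the resize numbers are all illustrative stand-ins rather than the real values, and I'm guessing at SetMemoryMax() for line 21 based on the multithreading remark:

```
# lines 1-10: explicit plugin/script loading (paths hypothetical)
LoadPlugin("C:\avisynth\plugins\DGDecode.dll")
LoadPlugin("C:\avisynth\plugins\TIVTC.dll")
Import("C:\avisynth\scripts\helpers.avsi")

SetMemoryMax(512)                         # line 21: keep multithreaded runs from eating all the RAM
MPEG2Source("gobusters-ep.d2v")           # line 22: "loads the video" via the d2v index
Spline36Resize(1280, 1080, 8, 0, -8, 0)   # line 24: horizontal-only resize, trimming ~8px per side
```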
Inverse Telecine and You
Now we're getting to the good stuff. Lines 26-27 are the first step of inverting a 3:2 pulldown. Since Super Hero Time started broadcasting in HD, both shows have been filmed at NTSC film rate (approximately 24fps), but Japanese broadcast is still all NTSC video, and all of its HD is in fact 1440x1080i. You don't have to worry too much about the details of this process, but some illustration is helpful in explaining what exactly is going on. We'll focus on line 27 first, as that is the field matcher, which attempts to do better than what you see in the resulting video frame column. Timeranger is also, theoretically, telecined with 3:2 pulldown, so this is actually important.
tfm() is a field matcher, which means it attempts to match even fields to adjacent odd fields to produce frames that are identical to the original source. tfm() in particular is capable of checking a frame (two fields) ahead or behind to find the match with the fewest artifacts. As a result, instead of having two frames that are an awkward blend of BC and then CD fields, there are two identical C frames in the middle of the sequence. Obviously this could lead to stuttering on playback, and it's a bit of a waste of bitrate to boot: after compression and transmission over the air and wires, noise is introduced, so the copies are no longer perfect duplicates. Solving that issue is the domain of line 36, but we're not there yet.
Backing up for a second, we have line 26. This line treats each pair of fields as if they are truly interlaced (that is, each half-height field is its own distinct point in time rather than half of a single film frame) and attempts to deinterlace them. This process involves a lot of interpolation and guesswork, and is just generally lower quality than field matching, but occasionally field matching is imperfect. More to the point, sometimes there are elements on the screen running at a different frame rate than the majority of the scene. In the case of a show like Go-Busters, this means the clock in the upper left corner, the TV Asahi logo that fades in after the commercial breaks, and the scrolling text at the bottom of the screen. Two of these three elements sit on the same lines, so it is actually possible to tell tfm() not to even look at those lines at match time; at least that way they don't influence the matching, but they can still leave combing artifacts. If you saw Gokaiger and remember the first episode after the tsunami, the zoom-in/out there was also interlaced, so there were changes that did not correspond to the field matching, and even changes where there was no new frame at all. Without the deinterlacing fallback, which is potentially damaging but more correct for interlaced content, the text crawl would look like this instead of this. I don't think I need to tell you which looks better.
Of course, as I keep hammering home, deinterlacing has the potential to lose detail, so it's best to do as little of it as possible. Inside that long tdeint() line is a call to tmm(), which masks off the areas that are combed and constrains the changes to only those parts. More than that, tfm() itself attempts (somewhat poorly) to do the same thing, and will only ever use the deinterlaced version at all if it detects combing in the first place, leaving as many frames as humanly possible untouched.
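Put together, the heart of this (lines 26-27) looks something like the sketch below. I'm paraphrasing from memory rather than copying the real lines, so treat the exact arguments, the order/PP values especially, as illustrative:

```
# line 26: same-rate fallback deinterlace, with tmm() building the combing
# mask that constrains tdeint()'s interpolation to combed areas only
deint = last.tdeint(emask=last.tmm())
# line 27: field matching; clip2 tells tfm() where to pull replacement
# pixels from when combing survives the best available match
tfm(order=1, PP=7, clip2=deint)
```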
As I said earlier, these interlaced elements move on all sixty fields per second, but the video proper does not. This means that at certain times, instead of a frame pattern of ABCCD, we get ABC1C2D. This is not in and of itself a problem, but the second step in the inverse telecine process is to kill the duplicate frames. This is usually done by following the pattern, but occasionally the decimation filter (in this case, line 36) can get confused. The worst case, which I can assure you has happened several times, is that it removes a frame other than C1 or C2, giving a pattern such as AC1C2D. The decimation function thinks nothing of this, but on playback you'll notice jumpy motion followed by the video lingering an extra frame on no actual motion. Throughout Gokaiger, I solved this problem by cropping the bottom 110 lines off and using that to determine which frames to keep. Then Asahi started using the three-line scroll in Sentai as well as Kamen Rider. For a bunch of weeks, I cheated by just increasing the crop to 330 lines, but that occasionally caused the very problem I was trying to avoid, just in sections without scrolling text.
Enter BlankBottom(). The function takes three parameters: the first frame, the last frame, and the number of lines to crop off and pad back with solid black, like the Blah-Bar from 30 Rock. This method of splicing is not at all safe, for the record, so don't try this at home. Doing it this way, I can crop away only the bits that really need it, and let tdecimate() run as normal all other times. I should mention that this is by no means the optimal way to handle 60i scroll over 24p video, but the better methods are slower and take longer to get running. Given the overall low quality of TV broadcast, and the fact that this would only increase the fidelity of a video element that we don't actually want you looking at in the first place, I've opted for the lazy route. Line 31 just takes a copy of the video before it gets attacked with the Blah-Bar, lines 32-34 add varying amounts of Blah to different ranges of frames, and line 36 makes its decimation decisions based on the edited frames while returning the unedited versions.
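I haven't published BlankBottom() itself, but a minimal version of the idea, plus the decimation wiring from lines 31-36, would look something like this. The frame numbers are made up, the variable names are mine, and the splicing really is as unsafe as advertised (it assumes the range sits strictly inside the clip):

```
# minimal BlankBottom() sketch: replace frames [first, last] with copies
# whose bottom `lines` rows are blacked out (crop, pad, splice back in)
function BlankBottom(clip c, int first, int last, int lines) {
    blah = c.Crop(0, 0, 0, -lines).AddBorders(0, 0, 0, lines)
    return c.Trim(0, first - 1) + blah.Trim(first, last) + c.Trim(last + 1, 0)
}

keep = last                                   # line 31: untouched copy
edited = keep.BlankBottom(1200, 1450, 110)    # lines 32-34: Blah-Bar only where needed
edited = edited.BlankBottom(9800, 10050, 330)
edited.tdecimate(mode=1, clip2=keep)          # line 36: decide on `edited`, output `keep`
```

As I understand tdecimate()'s clip2 parameter, it's the same trick tfm() used above, just pointed the other way: the metrics come from the input clip, the output frames from clip2.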
Okay, we’re on the home stretch here. I skipped past line 29 in the interest of keeping the whole inverse telecine block together, but it’s really straightforward. That’s a series of cuts made to skip around commercials. You might notice that the first two segments and the last three are all adjacent; having separate trim points like this allows the tool I use for chapters to automatically calculate all of them at once. That is to say, adding the extra trims that don’t exclude any frames allows me to be lazier. Line 38 is simply doing the vertical resample down to 720 high, without affecting the horizontal any further. Since the video has been decombed, it’s safe now. This could actually be done before decimation I suppose, but there’s not really any benefit either way.
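In script form, line 29 and line 38 are nothing more than this (frame numbers invented, and the resizer is a stand-in for whichever one the real script uses):

```
# line 29: commercial cuts; the redundant adjacent trims exist purely as chapter markers
Trim(812, 1650) + Trim(1651, 14980) + Trim(15900, 29800) + Trim(29801, 31200) + Trim(31201, 33110)
# line 38: vertical-only resample; the width was already handled on line 24
Spline36Resize(Width(), 720)
```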
The last segment, lines 40-43, is about denoising and readying everything for output. dfttest() is the denoiser in question; the sigma here is the strength, and the rest is about limiting ghosting plus a bit of a speed/quality trade-off. The very last argument turns on 16-bit processing mode, which allows for slightly better quality on the way to 10-bit output. smoothgrad() smooths banded sections into continuous gradients where possible, as denoising can introduce banding all on its own. For an example of these at work, look no further than right here. This is not exactly a shining beacon of the quality jump, though that's mostly because I picked a bad frame on purpose, to also help illustrate why I don't care much about perfection on TV broadcasts. Even here, though, the appearance of strong block artifacts is diminished, which improves compression, and this frame is from a high-motion segment where you're less likely to notice the problem anyway.
dither_quantize() dithers the 16-bit color down to 10-bit in a better fashion than the encoder itself would, and the convey line (I always have to copy/paste it) just rearranges the data into a form suitable for the encoder. If you're curious, it also makes the screen look like this. That's the same frame as the previous two screenshots. Obviously, we don't do a lot of the editing process with this on.
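Reconstructed from memory of the dither package documentation, those last four lines come out to roughly this; the sigma and SmoothGrad numbers are placeholders, not the real settings:

```
dfttest(sigma=8, tbsize=1, lsb=true)                     # denoise; tbsize=1 avoids temporal
                                                         # ghosting, lsb=true emits stacked 16-bit
SmoothGrad(radius=16, thr=0.3, lsb_in=true, lsb=true)    # smooth out banding, still in 16-bit
Dither_quantize(10)                                      # dither down to 10 bits ourselves
Dither_convey_yuv4xxp16_on_yvxx()                        # the convey line: repack for the encoder
```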
Finishing Remarks For Easy Encoding
This script gets run into four (probably three starting this weekend) segmented lossless encodes, which are then joined up for the final passes. From there, it's extremely fast (~8m) to downscale and create a workraw, and almost as fast (~16m) to downscale and hardsub for the mp4. The HD encode is only slightly faster from the lossless source than from the script above, but if that were the only encode I was making from the same base video, it wouldn't be worth it.
Okay, so that was a lot of words about what I was calling the easy version. The thing is, all of that is pure rote. I don’t even write most of that script by hand anymore; it is generated by a script. The only thing I really do for it week-to-week is scrub through to find the trim points, and now a second pass to find the BlankBottom points. This tends to take less time than unpacking the raw from the segmented rars I get it in.
I don’t feel like hunting for a BD or recent DVD movie script, but I’ll let you in on a little secret: they’re usually even easier. The Sentai/Kamen Rider DVDs I’ve encoded have been properly telecined, which means the little program that generates the d2v file used all the way up in line 22 can actually do all of the inverse telecine process itself. The Blu-rays are even easier, as the BD spec allows you to just use 24fps video right on the disc. I tend to use higher quality (read: slower) denoising on those, and usually weaker settings as they’re cleaner, but it’s overall a much simpler script, and almost always the same.
The Timeranger Problem
Now we get to Timeranger. This was the first series Toei released on DVD from their back catalog rather than as it was airing. That makes these the oldest DVDs where the video itself was not processed from the very beginning with the intention of being put on DVD. There have obviously been older series to undergo this sort of treatment since, but those were done by engineers with much more experience under their belts. Add to that the fact that this show was made in an era when most principal photography was still done on film, but most editing was done directly on tape, and you have a recipe for issues.
Bad Telecining And Bourbon
The primary issue is this: instead of having a series of ten fields that look roughly like AaBbBcCdDd and can be reassembled cleanly, we have a series of fields with varying amounts of each other blended in. Observe an example. Okay, yes, I'm cheating slightly in that these are not the raw fields but rather the fields after having been run through a medium-quality bobber to get to full height, but I'll get to that. Each of these fields is somewhat blended with its neighbors, though obviously some more than others. The third and eighth are clearly the least messy; the first, fourth, sixth, and ninth are the worst. Unfortunately, ten fields of telecined footage need to yield four frames, and we only have two particularly good picks here. Adding the fifth and tenth fields from the sequence covers every individual point in time across these ten fields with the minimum amount of ugly blending. Doing so would look something like this. This sequence covers the same amount of temporal data as the full ten, but without quite as many distracting blends. Unfortunately for everyone involved, this pattern is not fixed throughout the run of an episode, let alone the full disc. Further, because adjacent fields don't carry the same blending, even though this is nominally telecined, a field matcher won't really work correctly.
This dovetails nicely into the second problem: not all of this content is actually telecined film. There is some 30fps video footage in the episodes, primarily in the OP/ED though not exclusively. That requires some extra work to handle, and all of it combines into a fairly unpleasant experience. How unpleasant? Let me walk you through the steps, each of which is at least less complicated to explain, if more time-consuming to perform, than the example above.
Step One: Bob
Here's the first step. Most of this should look familiar. Checkmate() on line 10 is a filter that removes dot crawl and rainbowing artifacts. These artifacts are caused by running the video through a low-quality connector at some point in the process, typically a composite video cable. This is honestly kind of embarrassing on its own, but it's sadly common. No filter is going to be perfect about removing the damage, but this one does a reasonable job. It only works correctly if applied before any deinterlacing, so it has to go here. Line 11 runs nnedi3 as a bobber. This is in fact the same filter that was used to clean up combing back in the Go-Busters script, but here it's being used to interpolate every single 240-line field into a 480-line frame, as you saw in the image sequences above.
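Minus the plugin loading, the working part of this step is about as short as it sounds; the source line here is a stand-in for however the real script loads the DVD video:

```
MPEG2Source("timeranger06.d2v")    # source loading (filename hypothetical)
Checkmate()                        # line 10: dot crawl / rainbow removal, pre-deinterlace only
nnedi3(field=-2)                   # line 11: bob every field up to a full-height frame
```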
This entire file is honestly a relic of an earlier attempt at this process, wherein I used a much higher quality but much slower bobber, qtgmc(). Unfortunately for my sanity, the thing that makes qtgmc() preferable in general-use scenarios makes it a significantly worse choice for this one: its interpolation takes the fields before and after the current one into account, attempting to make better guesses about what the missing lines should be. That also means it propagates blends into fields that were clean. Typically not entire blends, though, as that would be too easy. It has a habit of creating new half-blended frames that are warped in strange ways, which makes it much harder to pick the correct fields to keep. Even worse, they can obscure things and make it look like the pattern changes more often than it actually does, and as you're about to see, it already changes enough. checkmate() and nnedi3() run fast enough on my system, though, that this entire set of operations could probably just be inserted at the top of the next script. As it stands, I render this video out to a lossless codec and load that to start the next step.
Step Two: ~~Drink Until Blind~~ Cherrypick Clean Fields
Now we're finally to the hair-pulling, suicide-inducing tedium bit. Most of what I need to explain is just what the functions up top are for; the rest is provided for context. DupFrame() lets me append a single duplicated frame anywhere. There are occasions where Select24() over a range of frames has fewer output frames than Select30() would, which can add up. Of course, as I've found out, sometimes I can end up with more frames than there should be. I don't actually know how that happens, but I can tell you that this script returns one more frame than it started with, even though each section's output has the same number of frames as half its input. I don't get it either.
Zoom() does what it says on the tin: it resamples the span it's applied to while discarding the outside pixels as defined by left, top, right, and bottom. It's mostly just less to type. The number in Select*() refers to the fps of the output, rounded up. Well, sort of. In the case of Select60(), it's the same as just trim(start,end), but it's easier to edit down to the reduced versions later than to type trim at first. It's a little thing, but it adds up, believe me. Select30() is a trim followed by SelectEven(). Just like the last couple, it's about saving typing. This is useful for properly 30fps content, but also for checking that I'm not losing frames. Its w, x, y, and z parameters do nothing at all; they exist so I can rename a Select24 call to Select30 without any other changes and check the frame count.
And now we get to Select24(). This function lets me give four frames, counted from 0 to 9, and it will pick those four out of every set of ten frames between the start and end value. In Avisynth, you can’t actually join video segments with different frame rates, so I take a page from the telecine process. I repeat the last frame exactly, giving an output pattern of a,b,c,d,d, which isn’t what 3:2 pulldown uses but since I’m not actually reweaving it doesn’t matter. What is important for the moment is that I now have 30fps output to merge with the true 30fps sections.
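I can sketch plausible implementations of the whole helper family. These are my reconstructions from the descriptions above, not the actual file, so consider every detail illustrative:

```
# duplicate frame n in place (sketch; assumes 0 <= n < FrameCount-1)
function DupFrame(clip c, int n) {
    return c.Trim(0, n) + c.Trim(n, n) + c.Trim(n + 1, 0)
}

# resample back to full size while discarding the given edges (the "less to type" wrapper)
function Zoom(clip c, int left, int top, int right, int bottom) {
    return c.Spline36Resize(c.Width(), c.Height(), left, top, -right, -bottom)
}

# straight trim, named for its nominal 60fps output
function Select60(clip c, int start, int end) {
    return c.Trim(start, end)
}

# keep every other frame for true 30fps content; w/x/y/z are dummies so a
# Select24 call can be renamed to Select30 without touching its arguments
function Select30(clip c, int start, int end, int w, int x, int y, int z) {
    return c.Trim(start, end).SelectEven()
}

# pick four fields (counted 0-9) out of every cycle of ten, repeating the
# last pick so the output stays joinable at 30fps: a,b,c,d,d
function Select24(clip c, int start, int end, int f1, int f2, int f3, int f4) {
    return c.Trim(start, end).SelectEvery(10, f1, f2, f3, f4, f4)
}
```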
With these functions in place, it's just a matter of actually finding the section borders and picking the right frames. I start by splitting at the chapter points, just so it's easier to track where frame gain/loss might happen later on. Then, for each chapter segment, I pick the first 4-of-10 frames, turn on Select24, and scrub through the video a few frames at a time until I start seeing bad blending again. When I do see blends, I back up to the start of the scene (it usually happens on scene cuts, though not always) and start over with Select60 to pick again. A lot of the time this is futile, as the blends I'm seeing are already from the best-case picks; after all, as you can see from the ten-image array above, none of the fields are ever completely clean. This basically just amounts to wasting my time. Sometimes one of the picks needs to be shifted by one. Sometimes the pattern changes entirely, so I don't feel so bad about having to check.
The calls to framenumber(), commented out at the top of each segment, are just there to help me keep track of where I am relative to the segment start, as each Select* is a trim that resets the viewer's frame count relative to the current selection. It's mostly a quality-of-life thing, but without it I'd probably still be working.
This episode apparently didn't do it, but there is a section in episode 7 that does not neatly adhere to the same 4-of-10 pattern across any length of time. It came fairly close to a consistent 8-of-20, with the first and second sets of ten differing, but even that wasn't perfect. It was a fairly short scene overall, so I basically gave up on perfection there, which I think you'll agree was probably good for my sanity, given what you've now seen.
The final new bit here is freezeframe(), which does exactly what you'd expect. In order, the parameters specify the start, the end, and the frame you want to replace them all with. There are two spots where the frames do not change at all over a period, and there is no reason to let noise or bobbing artifacts come into play. This file is again rendered to a lossless file, which is loaded in the final script.
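freezeframe() is a stock Avisynth built-in, so this one I can show exactly as it's used, frame numbers aside:

```
# hold frame 22310 across the whole static span so noise can't wiggle
FreezeFrame(22310, 22395, 22310)
```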
Step Three: Dedup, Degrain, Delovely
The last step is, thankfully, pretty simple again. There are a lot of imports up top, because qtgmc() and smdegrain() are fairly complex pieces of programming that I'm exceptionally glad somebody else figured out how to write. I certainly would not have been able to. Anyway, I'm technically skipping a step here. On the first pass of this file, line 18 is uncommented and everything south of it is commented out. During this pass, dupmc() collects data about the differences between adjacent frames. This is usually used for removing duplicate frames from cheap animation, but seeing as I've been inserting exact duplicates all over the place between Select24() and DupFrame(), and not introducing any encoding artifacts thanks to lossless encoding, it's got plenty to chew on. Once a pass through the entire video with dupmc() is done, that line gets commented out, and the dedup() line can be uncommented. The threshold is intentionally set extremely low, as the collected metrics correctly show that the identical frames do in fact have 0.00% difference. A threshold of 0, unfortunately, turns the removal off entirely rather than simply skipping any frame that has even the slightest difference. dedup() fully removes these duplicate frames from the stream, leading to variable frame rate output, but unlike using tdecimate() to automatically generate variable frame rate output, this is perfectly reliable, because it really can't be wrong.
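The two-pass arrangement looks roughly like this; the filenames and the exact threshold are placeholders, and the parameter names are from the Dup/DeDup documentation as best I remember it:

```
# pass 1: uncomment this line (and comment out dedup below), then run the
# whole video once so the frame-difference metrics get logged
# DupMC(log="timeranger06.dup.txt")

# pass 2: flip the comments; a tiny nonzero threshold drops only the true
# 0.00% duplicates, and `times` writes the VFR timecodes for the mux
DeDup(threshold=0.1, log="timeranger06.dup.txt", times="timeranger06.times.txt")
```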
After that we have a call to qtgmc() in its progressive mode. This attempts to fix the bob shimmer caused by using lesser bobbers (such as nnedi3()). It isn’t quite as good as just using qtgmc() on the interlaced source, but since I’ve already ruled that out above, I think you’ll understand why I’m willing to live with the shortcomings of using it to post-process.
smdegrain() removes grain. I'm sure you could have guessed that yourselves. The film for the show is pretty dirty, so it can definitely use this step. The lsb parameter here turns on 16-bit processing mode using the dither tools, as seen in Go-Busters, but used like this it neither expects 16-bit input nor returns 16-bit output, opting instead to dither back down to 8-bit before you even see the result. The last line generates XviD keyframes, which are useful for frame-timing subtitles. x264's keyframes are much more useful for compression reasons, but it can be conservative about finding scene changes as a result. This file is then rendered out via x264 and is the final result.
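For completeness, the business end of this script, reconstructed with illustrative settings; I'm assuming SCXvid() for the keyframe line:

```
QTGMC(InputType=1)                         # progressive-input mode: clean up nnedi3's bob shimmer
SMDegrain(tr=2, thSAD=300, lsb=true)       # degrain; 16-bit internally, dithered back to 8-bit
SCXvid(log="timeranger06_keyframes.log")   # XviD-style keyframe log for subtitle timing
```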
Never The End
The reasons I wasn't exactly looking forward to this should be pretty evident already, but I'm leaving out some non-technical aspects. Well, non-encoding-related technical aspects, I suppose? The computer I do all the encoding on is my Windows box attached to my television. This may seem strange, but that PC was built specifically to run Windows and decode high-bitrate, high-resolution video. As such, it's the best candidate, strange interface aside.
Unfortunately, this means it doesn't really have a proper workstation setup, such as a place to sit and reasonably see small interface elements like frame numbers in VirtualDub or what I'm typing in Notepad. Instead, I have to walk up to my receiver and use it as a poorly positioned desk, which is in fact how I did all of episodes 4 and 5, and how I do most of the Go-Busters work (though as I've said, that's only about ten minutes of scrubbing). I do have a VNC server on the machine, but even over 802.11n in an area not congested with too much wireless, it's way too sluggish. The encoder box ran Windows 7 Home Premium, so RDP was right out… until it wasn't. Episode 6 was completed via RDP, because now that it's available, it's more than fast enough to scrub through without issue. There are some issues with key repeat thinking I'm holding keys down far longer than I actually am, but that's a tiny irritation compared with the alternatives. It's really not so bad overall, especially now that I'm no longer making my job worse with qtgmc() abuse.
So there you have it. I’m the jerk here when it comes to Timeranger being delayed. I know that other people have encoded these videos before without going into this kind of depth, but for some reason I just took the ugly blended mess that a traditional approach resulted in as a personal failing. Even looking at the first couple episodes kind of makes me cringe at the moment. I’m not even sure if anybody noticed the difference to be honest, so whether it’s actually worth it is very much up for debate, but here it is. Don’t blame anybody else though; they’ve all been on the ball and not distracted by their own sloth, dread, and self-disgust over said sloth and dread.
I literally cannot believe you read all of this. Do not attempt to convince me otherwise. I do not believe I actually wrote all of it in the first place. Also, yes, I am aware this is only around 4,500 words; for the love of god, do not point it out. I was stumbling into the DBZ joke. Oh, and for the record, I'm actually out of bourbon.