31 Aug 2020
In an earlier post, I talked about building an rABS entropy decoder on 6502, which became the heart of my new 4k intro packer experiment for C64. While exploring several different entropy coder variants for simple LZ codecs, and in particular tANS or an approximation of it with adaptive probabilities (spoiler alert: this was a dead-end), I realized that rANS was not only a possible alternative, but actually an OK fit for the system, providing a high precision entropy coder with… well, good enough decode speed and a not-as-tricky-as-I-had-originally-feared implementation :) .
However, I intentionally left out several details of the rest of the codec. The earlier post only describes how the entropy decoder is implemented and some of the modeling, the combination of which is kindof the novel thing here. But essentially, it only gives us a way to pull modeled bits out of an encoded bitstream. The bits we actually encode/decode and how we model those is what a codec is really all about; the entropy coding part, while certainly important, is really just a piece of machinery that allows us to build the rest of the codec effectively and take as much advantage of the modeling choices we make as possible.
So, in this post I’d like to go into detail about the rest of the codec, including the LZ scheme chosen, how we model various symbols, how some of these techniques were chosen/tested, and several parts of the encoder. I’ll save the rABS-specific encoding details/intuitions for a future post where I can talk about that specifically in more depth.
Before I dig into the current codec details, I want to talk about where the design came from, why I embarked on this journey in the first place, and how things evolved over time.
Back in 2017, for makeshift, my first 4k intro for the C64, I developed a custom LZ compressor to compress all code/data. This compressor was called “Admiral P4kbar”, a name which contained both “pack” and “4k” in some form along with a stupid pun; quite the satisfying trifecta. Anyways, when the intro runs, it first relocates the compressed data and decompressor, decompresses the compressed data into memory, and then returns to BASIC to run the decompressed code. This relocation is necessary as we want to pack the intro into a contiguous memory region starting in BASIC memory to make the intro easier to run (just
SYS 1337 or whatever nonsense), but the decompressed data also needs to decompress into BASIC memory so we can use BASIC to generate a sine table, which is used in several effects. But I’m digressing a bit already - what I want to talk about specifically is the LZ codec design used there, so we can eventually get to the new packer’s design.
Makeshift is open-source! You can find the compressor code here and the decompressor code here. The new packer is unfortunately not part of this source release, but will be available when we do our next C64 4k :) .
The LZ codec used in Admiral P4kbar (which I’ll now refer to as “apv1” for brevity) is relatively straightforward - the data is encoded using a sequence of literals and backreferences (usually called matches in LZ parlance, but I just happened to have called them backreferences in this case because reasons). Literals are coded as a single
0 control bit followed by a raw 8-bit sequence (no entropy coding here); backferences are coded as a single
1 control bit followed by a simple entropy-coded length and distance pair. Both of these fields are coded the same way: an index into a 16-entry table is unary-coded, and then the corresponding table entry specifies a 16-bit base value and a number of additional bits (0-15) to read in and add to the base value, which results in the final field value (similar to LZMA’s “distance slots” which are sort of an adjustable gamma coding-like scheme; this kind of thing is really common in LZ codecs). Different tables are used for each field. Lengths always use the same table; distances use one of two tables, one if the length for the current backreference is
1, and the other for all other lengths. The tables themselves are coded as 16 4-bit bit counts (packed in nybbles); each entry’s base is the previous entry’s base + 2 to the power of the previous entry’s additional bit count. This packed table representation does introduce some overhead into the coded bitstream (24 bytes always), but is certainly not a bad way to represent/transmit symbol probabilities to the decoder.
Most of the encoding ideas in apv1 came from Exomizer, the de facto standard for data compression for C64 (and other platforms where the decoder environment is limited). Particularly, the distance/length table scheme is totally stolen from there; as is the algorithm to optimally search for the best table bit counts given a set of symbols and their frequencies (which I reimplemented myself as a learning exercise of course, but it’s the same core idea). The exact tables used differ slightly, but this isn’t terribly important. Both compressors also use a similar iterative scheme to find the best parse/encoding pair given the input data: First, an optimal LZ parse given the current best encoding (initially some predetermined “seed” encoding) is found (which is straightforward as there’s no adaptive modeling during coding), and then an optimal encoding is found for the symbols used in that parse. These steps are repeated one after another until no further gains are achieved; usually only a few iterations (2-3) are performed before we find a minima (ideally global but likely local) and we’re done.
Ultimately, apv1 performs worse than Exomizer, and somewhat embarrassingly, has a slightly larger decompressor to boot. In the end I never actually tested Exomizer on the final binary (only a few times early in the development of apv1 for comparison), but I had estimated that the intro would have been about 100 bytes smaller had we used Exomizer instead. Not an enormous margin, but definitely would have helped to not have to reduce some of our effects so much, or at least have some nicer/smoother transitions.
That said, I’m not entirely unsatisfied with apv1 - it was developed rather quickly in conjuction with the rest of the intro for Solskogen that year, and it was a really good exercise for me. Before that I had done some trivial LZ codecs before, but nothing that performed very well or really even did much bit IO, much less entropy coding in any proper capacity. At the same time though, I wasn’t able to experiment with it as much as I would have liked due to the project timeline, and compression suffered a bit as a result. Nevertheless, I decided to use apv1 anyways (because making/using your own custom packer is always rad!) and made the 4k limit just fine (coming in at 4095 bytes after loads of aggressive hand-optimizing, which is normal for 4k’s), but I always knew we could do a little better, and that I would want to take another stab at this eventually. At the very least, I had to beat Exomizer, at least for this specialized use case.
I didn’t end up touching the project for another year and a half or so, but when developing squishy, I spent some time exploring ways to compress the (somewhat large) decompressor for further size gains. This turned out to be a pretty fruitful endeavor, and not just for that packer - as it turns out, the LZ scheme I implemented (after reading a couple great blogs for fun) turned out to be pretty effective on makeshift as well. Granted, the exact same construction wasn’t phenomenal, since the codec was highly tuned to the decompressor it was compressing (as it should be!). But it was promising, very simple (to keep the decompressor as small as possible), and easy to tweak - so naturally I wanted to see if it could be something useful if adjusted for a C64 implementation.
squishy’s decompressor compressor (which I’ll now refer to as “stage2”, as it compresses the second decompression stage in that packer) uses a very simple LZ-based coding scheme. As usual, there are two types of packets: literals, which represent single bytes, and matches, which represent copies of data previously seen by the codec. Both types of packets start with an entropy-coded 4-bit length field. A value of 0 for length means the packet is a literal; all other values denote a match, and represent how many bytes the match will copy. Following the length field for literals we have an 8-bit value field (this is the literal’s byte data); for matches we have an 11-bit offset field that determines how far behind the write pointer the copy starts. Matches with length 1 are allowed, and as is common with LZ codecs, a match with length > offset is permitted (assuming a match copy is performed forward byte-by-byte, this is a nice way to “implicitly” support repetitions/RLE sequences). A match with an offset of 0 is possible to encode but never used (this wouldn’t make any sense for the decoder, which hasn’t seen the next byte yet), as using an implicit match offset bias doesn’t provide any compression gains in this case and the decoder is a tiny bit smaller without it. There are no repeated offsets.
All of the fields in the stage2 encoding (length, literal value, and match offset) are entropy-coded with the same modeling scheme using a 32-bit arithmetic coder (33 bit in the encoder for carry propagation) that shifts in/out whole bytes at a time. Each type of field gets a separate binary tree of bit predictions, identical to the “normal scheme” described in the LZMA specification. All predictions are 12 bits and use a very simple fixed-point Howard/Vitter-style update scheme as described by ryg, but using a divide instead of a shift for the update rule (which is certainly slower but not noticeably so with so little data, and allows more freedom for the values chosen, since they no longer have to be powers of two) and the learning rate/divide value of 19 (~5.26%). There’s no additional context used to model the fields, as with such a small amount of data, any gains that I got from more advanced model schemes in my experiments were either totally insignificant or resulted in a larger decoder for a net loss.
The various parameters in this encoding (bit counts, learning rates, …) were found with a brute-force search, along with the initial prediction values. Typically, predictions in the model are initialized with a probability of 0.5. Since we have such a small piece of data that never changes, though, it pays to experiment with this initial value as well. The brute-force parameter search found that ~0.415 (1700/4096), along with the other encoding parameters used, provided the best compression.
Looking at the stage2 encoding above, we can already see several differences from apv1. The first one is that we don’t necessarily need a whole bit of overhead for encoding which type of packet we encode (literal or backreference/match), since this is now entropy-coded as part of the length field, and this has some very big potential wins, as we typically encode a couple thousand packets in the final bitstream. Additionally, literals are now entropy-coded as well as matches, which helps even more.
Of course, one key thing with the stage2 codec is that it has a highly-accurate 32-bit arithmetic coder and 12-bit probabilities for entropy coding. A 32-bit arithmetic coder is completely out of the question for 6502; perhaps so are 12-bit probabilities (at least we’d prefer 8 bit ones, maybe even 16 bit if possible), so something else has to be done here if a similar codec is even going to be viable for C64.
Another major difference here is that the stage2 codec relies on adaptive probabilities, not static ones. This certainly helps compression, so we’d like to keep that feature - however, it also limits the viability of several entropy coders that would otherwise be a natural fit for implementing on 6502. Specifically, fast, table-based Huffman or tANS/FSE coders don’t really work anymore; any efficiency they promise over other entropy coders gets totally dwarfed if we’re to rebuild tables for every symbol decoded. There may be ways around this (eg. more sparsely updating model probabilities), but I didn’t bother exploring them. There’s also the not-so-small caveat with adaptive probabilities that they should very much be taken into account in the parse while encoding (which I’ll talk about in a later section). But for now, let’s continue talking about encoding details.
So now let’s finally talk about the new packer’s codec. The new packer is called (somewhat predictably) “Admiral P4kbar V2”, and I’ll refer to it as “apv2” for the rest of this post.
There actually aren’t that many fundamental differences between apv2 and stage2, but the few there are make a lot of difference.
For starters, I still wanted to use adaptive probabilities. Since we’re targeting an 8-bit CPU, these should ideally be 8 bits; perhaps 16 if we would absolutely need it (this would of course depend on the kind of entropy coder we would end up with). With full integer precision, I tested several configurations of different bit widths for predictions and entropy coders, mostly centered around rABS and binary arithmetic/range coders. For example, rABS and more classical binary arithmetic/range coders with 16-bit states and 8-bit probabilities perform nearly identically to one another (as expected), and vs larger state values we don’t really gain any usable precision with such small data (this amounts to a loss of around 8 bytes or so vs a 32-bit state, for example), so using a 16-bit state seems to be sufficient.
Small aside: tANS of course gets away with using much smaller state values. With 8-bit probabilities that would sum to 256, for example, we could use as low as 9 bits to specify the entire tANS state. In 6502 terms this boils down to storing the state in a single 8-bit memory location/register and relying on the carry to specify the 9th bit (and even then we only need it specified in parts of the update where we don’t already know its value, since it’s always assumed that our value lies in the normalization interval between symbols), so this is a very attractive option as well. However, with tANS we’re still left with static probabilities, and we know we can compress better with adaptive ones, so it doesn’t really help us.
So, ultimately, I ended up with rABS. This doesn’t seem like a good fit at first, considering the decoding loop requires a multiply, and the C64 doesn’t have any facilities for performing one efficiently. As it turns out, this actually isn’t that much of an issue in practice, since we can still implement a software multiply that’s fast enough for our purposes (even though we spend a quite significant amount of decoder cycles on this!) and relatively small. On a platform with a hardware multiplier this would of course have been much better (hint hint), but oh well. Anyways, the details of the decoder (including the multiplier I ended up using) have already been described in an earlier post, so I’ll skip those here. Note that that post also goes over the kind of bitwise modeling/updates I use; this is in addition to what I described earlier about how stage2 models/encodes n-bit values as individual bits. apv2 uses the exact same scheme.
Tiny fun-fact: the initial rABS decoder impl that I designed for C64 actually used a 17-bit state, with explicit 8-bit low/high variables and using the carry for the 17th bit. It turned out that the extra bit of precision didn’t produce any noticeable coding efficiency (I didn’t gain a single bit in my tests) and the code would end up smaller without it, so I went with 16 instead.
Several of the bit counts in the apv2 codec were adjusted by brute-force search, just like with stage2. Literals are still 8 bits (whole bytes); match lengths ended up as 8 bits, and match offsets ended up as 12 bits. Not only did this provide the best compression for the intro data I tested it on, it also helped make the implementation smaller by allowing us to assume we’d always read in at least 8 bits (and thus didn’t have to clear the result high byte in the n-bit value decoder before decoding) and that all of our model tables would be at least 256 bytes large, meaning we could specify their addresses with the high byte only (assuming we’d also align them on a 256 byte boundary). Biased initial bit predictions also stuck; they ended up being 110/256 in apv2 (about 0.43, interestingly closer to 0 than in stage2).
One of the things I tried for stage2 that didn’t end up helping was repeated offsets. This is a common LZ trick that works well on structured data. The idea is that if you see an offset, it’s very likely you’ll see the same offset soon (or right away), eg. every 4 or 8 bytes (or whatever size individual data records/fields are). The simplest incarnation of this is to just keep track of the last match offset encoded, and use a special escape code for using that instead of encoding the full offset again. In apv2, this was done by designating offset 0 (which is invalid, as offsets start at 1) as the last offset, which turned out to help significantly and didn’t add too much code size. I tried some more complex variants as well, mostly focusing on schemes that could store multiple previously-used offsets (including cache-based schemes) but they didn’t result in good enough gains (if any at all!) to make up for the additional code complexity (just as repeated offsets as a whole never made it into stage2). Match offset biases also didn’t help at all in apv2, so they didn’t make it in.
By the way, it may have been obvious from the other writings, but it might be worth mentioning: The process for finding this particular codec design was relatively stochastic. With LZ codecs, it’s pretty easy to put a working one together as a starting point, and then you can go through several stages of just throwing a bunch of common tricks at it and see what helps. For example, do you gamma-code lengths? Offsets? Do you have a lower bound for match lengths? Etc. I think most people tweak compressors this way; it’s so quick to try new things, and as long as you have a good parse (this is actually not trivial, again, as we’ll see later) you can generally trust that the changes you’re making are at least being utilized as best they can be. And you don’t even need a full working codec to start the process btw; as long as you have some way of modeling and analyzing/estimating statistics of real data, you can operate purely based on those statistics and assume your coder will match those well enough. It also helps to consider the kinds of data you’re compressing and try to experiment with features that should work well with that, perhaps after doing some targeted data analysis (this is a great way to pick which features you should focus on trying next). Typically, though, I find it’s easier/more intuitive to reason about why a particular modeling change was beneficial after it has been than to predict whether or not it will be, especially since it’s easier to observe the actual statistics the codec was taking advantage of in hindsight. Maybe that’s one of those ymmv things, who knows. It’s also important to have enough data to test on, and specifically relevant data. In this case I was testing with makeshift and an uncompressed binary of datafresh (thanks, Blueberry!); not exactly a huge corpus, but certainly seemed to be relevant enough. I also tried some of the classic corpora with mixed results, but this was more of a sanity check, and I didn’t bother trying to optimize for those at all (text has pretty different statistical properties than executable code/binary data anyways). But my point with this paragraph is that the process for putting codec features together can be fairly random, especially with LZ codecs as your model/parse combination is less predictable than a simpler model/coder combo on its own, but the feedback is quick, and it’s probably the most fun part of the whole packer-making process.
One last thing, there’s a not-so-small detail that ANS entropy coders are LIFO, not FIFO. From an encoding perspective, this isn’t terribly difficult to work around, especially if you’re only compressing a small amount of data (as is the case here): All you really need to do is encode the data in backwards order, and then the decoder will decode in forwards order and we’re all good. But there’s something more subtle that really throws a wrench into the mix - with adaptive probabilities, we have state updates that need to happen in the forward direction, while encoding based on that state needs to happen in the reverse direction. Again, though, this actually isn’t that bad as long as we have a small amount of data. The way we get around this in stage2 and apv2 is that we run the parse with adaptive modeling in the forward direction, buffering up which symbol we coded and its probability as we go (we only need these two values for this, since we’re coding bits), without actually encoding anything. Then, in a second pass, we iterate over those symbol/probability pairs in the reverse direction and actually encode everything. An important detail here is that the resulting coded bitstream also has to be reversed again, as the decoder will read from the end of the stream like popping off a stack, but again, with a very small amount of data, this is trivial. Alternatively, the decoder can read the coded bitstream in reverse order, but since we may need to pad the output bits up to a byte boundary, we’d rather do that in the encoder so the decoder doesn’t ever have to care about it.
So that’s kindof about it for the actual bitstream details; there should be enough info in these paragraphs for the dedicated compression hobbyist to be able to construct something similar (unless I’m missing something - if it seems so and/or you have any questions feel free to send me a comment below). There are a couple loose ends, though.
One thing I’ve been hinting at but haven’t really dealt with in this post yet is the issue of parsing. In LZ codecs the parse is critical, especially when tweaking details of the codec; having a good parsing algorithm allows you to take advantage of your various coding possibilities in the best way possible.
A key thing (and kindof a funny thing when you think about it) with LZ codecs is that by introducing the LZ transform (i.e. turning an input bitstream into an equivalent
[literal | match]* sequence), you’re essentially adding to the redundancy of the stream by embedding the original symbols into an encoding with more degrees of freedom than the original encoding had. What I mean by this is that given a given input symbol stream, it’s usually possible to encode it as one of several different streams in the LZ encoding. For example, if I had a string
ABA, I could encode the first two symbols (
B) as literals, and for the second
A I have a choice - do I encode a literal
A again, or do I encode a match that represents the first
A? Either would be equivalent symbolically and would result in the same input stream after decompression, but the choice is important, as different LZ symbols will have different encoded lengths, especially if their fields are entropy coded (which they will be). In many modern LZ codecs that target very fast decode speed, this choice is often just as important or even moreso (especially for candidate encodings which may result in the same size), as more LZ symbols typically means more iterations through the symbol decode loop. One can even go so far as to try and influence which branches a decoder takes in order to please the target CPU’s branch predictor. But I’m digressing - the important parts here are that this resulting LZ symbol stream is called a parse, how we determine it in our encoder is important, and it’s not hard to imagine that this search space gets fairly explodey on real data, so we’ll probably want to be smart about it.
In apv1, we were essentially working with fixed encoding schemes - even though we were searching to find a good parse/encoding pair, we did so by alternating which one we searched for with respect to the other - so when we were parsing we had a fixed encoding, and vice versa. With a fixed encoding, we could actually do a simple backwards LZSS-style optimal parse given a candidate encoding, which meant that we could actually find the best parse possible for a given encoding/input stream pair. However, in apv2, we use an adaptive probability model, so that same kind of scheme doesn’t apply. We need a parsing algorithm that’s ultimately aware of how the model will adapt over time and change its statistical understanding of the packet data (especially with very limited probability precision as we’re using here!) as it searches the effectively endless available parse paths that it could take through the input stream; otherwise, we’re throwing away a lot of potential gains in compression!
Luckily, this is a pretty well-known problem, and several quite clever people have already worked out really great strategies for it. A particularly nice one (that I implemented) is outlined here, and it appears to go by a few different names. One name that I really do NOT like is “optimal parse”; as outlined in the post I just mentioned, this is really a historical convention and while the algorithm can often find very good parses, it’s not computable which ones are actually optimal, and because of that, it’s not optimal, is it? Bah. Anyways, I’ll call it a forward-arrivals approach here - and it’s used in several places.
The idea is already explained in very good detail in the post I’m now mentioning for the third time - seriously, go read that (or at the very least, this comment on it by ryg is a great summary). Has some good history too. I’ll try to summarize, though: the main idea is that for every position in the file, we can store a number of candidate arrivals, which store some data about how you might have reached this position from a previous position in the file, as well as some of the model statistics that have been adapted along that path so far. You go through the file in a forward direction (hence forward-arrivals) and at each position, you look at what your possible encodings are from there, and if they look pretty good (“pretty good” here meaning “better than other potential arrivals at the target positions as if you chose those encodings” - and note that “better” here isn’t necessarily just “cheapest so far” even though that’s what’s typically used and assumed to be a good proxy for “globally cheapest”!) then you update the potential arrivals at those target positions. At the end of this process, you look at the last position of the file, choose the arrival that represents the lowest cost, and jump backwards to where that came from. This series of backwards jumps represents the LZ symbols that you’ll encode in reverse order - so if you encode them the other way around, then congrats, you’ve probably got a decent bitstream on your hands!
There are a few key points here that make this a good algorithm:
Now, typically the stats/model state that’s actually propagated to the arrivals is pretty small/cheap (and correctly so!) - but remember, this is for 4KB intros, so we actually don’t have that much input data. Put another way, the amount of positions which we’ll have to store arrivals for is quite low - which means we can reasonably get away with using quite large arrival structures if we want to.
In fact, in the apv2 encoder, since we have so few model contexts and only a single last match offset, I actually went ahead and just duplicated the entire model state for each arrival. Yep. This amounts to something like 4.5KB / arrival, which is.. kinda nuts, but means we get very accurate estimates for each potential LZ symbol we might try to encode, and we can track exact model updates after each symbol and the effect they’ll have after encoding.
Let’s take it a step further: Not only can we tractably store more state / arrival, we can also try storing more arrivals / position! For apv2 I chose to use a maximum of 16 arrivals / position - this is a bit high and is well into diminishing returns territory, on top of taking a bit of time to explore all those possible paths, but that’s a price we absolutely can afford in this context.
And what is that price, exactly? Speedwise, the encoder will happily take up to a few minutes to compress an intro, which actually isn’t that bad at all for this sort of thing. Memory-wise, it’s something like 4.5kb per arrival, * 16 arrivals per position, * ~10k or so positions = ~740 megabytes of memory, tops. Yeah, this is unusable in many contexts - but for an intro packer, this is just fine; computer go brrr and all that.
This is a great way to keep some consumer computing appliances warm and all, but actually, the speed thing is kindof a big issue, especially when iterating on intro effects/transitions where turnaround time is paramount - so we’ll need to do something about that.
I mentioned earlier that this kind of parse is tweakable, and actually very easily so. The easiest parameter we might tweak is how many arrivals / position we have. Additionally, we can throw a few classic LZ parser tricks at it; for example, we can decide to have some threshold length where we just take any match that exceeds this length greedily, or perhaps have min/match length bounds, all kinds of stuff. These kinds of things will typically negatively impact compression ratio, but reaaaally help with speed, especially when searching for potential matches, and they tend to compound well for nice overall gains.
This is the reason that packers typically have multiple profiles - we can pick a few parameters for the parse that trade off ratio for speed or vice versa, and pack our intro more quickly (but less efficiently) while iterating, or more efficiently (but less quickly) when we’re ready to pack up the final thing. We can even choose a middle ground if we can tolerate waiting a bit longer but want to get a better idea of how well something might pack. In apv2, I made the (somewhat arbitrary) choice to cover these exact 3 cases. The optimal profile (yep, I went ahead and propagated this terrible convention! oh well..) will use 16 arrivals / position, a minimum match length of 1 (the lowest supported by the codec format), and no greedy matches. The instant profile actually only stores a single arrival / position, accepts matches only if they’re greater than 3 in length, and will greedily accept any match of length 4 and above. Finally, the balanced profile provides a “middle” ground (it’s closer to the instant mode than it is the optimal mode) by storing 2 arrivals / position, min match length 2, and will greedily accept matches of length 64 and above.
Again, these choices were somewhat arbitrary; they were made by experimenting with different values until I found some “comfortable” ones, and they work fine enough. Who knows, I’ll probably change them as I work on the next 4k; either way, it’s good to have them documented here for posterity’s sake.
I already talked about the results previously, but tl;dr, yes, I beat Exomizer for this specific use case (and by a good margin, too!), so yay!
Maybe it’s a bit strange to talk in such detail about a packer that hasn’t even been used in anything real yet, but I’m excited about it and it was really fun to make, so why not. Again, the source for the packer will be released after we do our next C64 4k intro (no promises here, it may never happen, or it might happen soon, who knows), at least as part of the intro source dump like last time so keep an eye out for that.
In any case, I hope this post was at least enlightening in terms of showing how codecs can be developed and how ideas in one compressor may spark ideas in another, and especially in the context of demoscene productions where the requirements for such a thing are a bit wild/esoteric compared to the ones you normally have to deal with.
Edit 02/09/2020: Small clarifications/simplifications, add results link, removed some fluffy “I’ll write about this later” stuff (as it felt a bit distracting reading it back).
31 Aug 2020