Technical ramblings and other stuff

ferris blogs stuff rANS on 6502

11 Feb 2019

I’m in the middle of writing several compression-related posts that are all in various stages of completion, and I seem to have already developed a habit of writing waaay too much stuff. So I thought I’d try to write something a bit more short-form (yeah, it’s not really short at all, but whatever) about an experiment I did that actually went pretty well: I implemented an rANS-based packer on C64 for 4k intros, and I’m quite pleased with the results.

OK, so technically it’s an rABS-based packer, the difference here being that it only entropy codes binary symbols. The reason for this is that we typically observe better compression with adaptive modeling, and updating symbol predictions is much, much simpler with a binary alphabet (look ma, no divides!).

At some point I’ll also publish a post (or perhaps a series of posts) that will talk about the codec in more detail (along with several other things I tried as well), but for this post I’d just like to focus on the rABS implementation that forms the basis of the codec.

The rest of this post will focus almost entirely on the rABS *decoder* used in the packer. While the encoder runs on a modern PC, the decoder runs on the target platform to decode the intro into memory before it runs. Thus, several of the details in the codec will be decided around keeping the decoder as simple as possible, while still being able to deliver the best compression ratio we can manage. Speed is also somewhat a concern, but not our first; nobody will complain if a kickass 4k intro for C64 takes 30 seconds to decode, for example (so long as the user knows it hasn’t crashed, of course).

The idea here is that to decode an rANS/rABS bitstream, we have some running state variable `x`

that contains some information about all of the symbols previously encoded. We’ll decode symbols “out” of this state (essentially popping them off of a stack), and then update the state to a new value. The state always has enough information to decode the last symbol coded at least, though since it has limited precision, sometimes we’ll have to pull in additional state bits from the encoded bitstream. The full symbol decoding algorithm looks something like this:

```
- Determine symbol s from state x and symbol probs (decode)
- Based on symbol s and its frequency Fs, find a new state x' (update)
- If necessary, pull in additional bits into x' (renormalize)
- Assign x = x'
```

Let’s first describe how a symbol `s`

is extracted from a state value `x`

. With rANS/rABS, this is actually pretty straightforward. If `x`

is in some range `I = [L, b * L - 1]`

(where `L`

is some multiple of `M`

, which is the sum of all symbol frequencies) and `b`

is some integer base (2 for streaming bits, 8 for bytes, …), then `I`

is essentially built up by several subranges for each possible symbol, starting at its cumulative frequency `Bs`

and with a range of its frequency `Fs`

, offset by `L`

and scaled by `b`

. We can then determine `s`

just by checking which subrange of `I`

that `x`

falls into. This is nearly identical to what you’d do for an arithmetic decoder, except that the range is defined differently (it would be `[0, M]`

or perhaps `[0, L]`

for an arithmetic coder).

Note that this is extremely simple in the binary alphabet case (like we have with rABS) because `I`

is essentially only divided into two subranges; one for a 1 bit and one for a 0 bit. Determining which range `x`

is in then is just a comparison of `x - M`

against `Bs`

, where `s`

is the symbol corresponding to the upper subrange in `I`

.

As an example, consider the following range for symbols 0 and 1 (and we’re simplifying here with `L`

= `M`

, which interestingly is equivalent to tANS with rANS symbol order, but that’s a whole other can of worms that I would *love* to talk about some other time):

```
M = 8
L = 8 (M)
b = 2
F1 = 2
F0 = 6
x = 11
L 2L - 1
[ 1s | x 0s ] <- I
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
```

In this case we can clearly see that `x`

falls into the subrange of `I`

corresponding to symbol 0, and thus a 0 is decoded from `x`

.

Note that while this example shows a binary alphabet, this works for an arbitrary number of symbols (assuming we have enough precision to uniquely represent each subrange). Another thing I should probably mention is that when

`L`

>`M`

, we actually haveseveralsubranges in`I`

that correspond to each symbol; essentially we have`L`

/`M`

subranges for each symbol as`[0, M)`

is repeated throughout`L`

. This is more or less what gives us the modulo in the rANS state update formula and the ratio of`L`

to`M`

has a lot to say when it comes to the precision of the coder. Don’t worry if this is a bit tough to follow; the details here aren’t actually that important, and I’ll try to fill in as many blanks as possible shortly.

Once we can extract a symbol `s`

from state `x`

, the heart of an rANS/rABS decoder is the following formula that calculates the next state value `x'`

(citing Charles Bloom’s blog series on ANS):

```
x: current state value
x': new state value
Fs: frequency for symbol s
Bs: cumulative frequency for symbol s
M: sum of all symbol frequencies
x' = Fs * (x / M) + (x mod M) - Bs
```

Already we can tweak this a bit by making sure `M`

is a power of two, which turns the divide into a right shift and the modulo into a bitwise `AND`

:

```
x' = Fs * (x >> log2(M)) + (x & (M - 1)) - Bs
```

Now we’re looking pretty good here for something relatively fast/simple.

The last part involves renormalization. This part is easy. After state update, we have a new state `x'`

which we want to be in range `I`

. Often times it already will be; when it’s not, it’s going to be some value less than `L`

. So, we just need to pull in bits from the encoded bitstream until `x'`

is in `L`

:

```
while x' < L {
x' = (x' * 2) + input_bit()
}
```

And that’s it. As we’ll see later, the comparison against `L`

can even be done in the CPU automatically for us, if we choose our ranges correctly.

Putting all of this together, our bit decoding code is becoming clearer:

```
// Input state x and probability for bit 1, output symbol and new state x'
fn decode_bit(mut x: u32, mut probability: u32) -> (u32, u32) {
let mut symbol = 0;
let mut bias = 0;
// Decode symbol
if probability < (x & (M - 1)) {
// x is in the lower subrange; bit is 1, bias is 0
symbol = 1;
} else {
// x is in the upper subrange; bit is 0, bias is probability,
// and probability should represent the upper subrange instead
// of the lower one
bias = probability;
probability = M - probability;
}
// Update state
x = probability * (x >> log2(M)) + (x & (M - 1)) - bias;
// Renormalize
while x < L {
x = (x << 1) | input_bit();
}
(symbol, x)
}
```

Ultimately, this is what we’re going to implement on the C64.

So there are a few things we need to work around here. Particularly, we’d like to do as much arithmetic as we can in 8 bits (with some things in 16 bits where necessary).

With the above code, we can start plugging in some constants based around what the 6502 is good at. With `b = 2`

, `M = 256 (2^8)`

, and `L = 32768 (2^15)`

, we can represent our entire state `x`

in 16 bits (two bytes, which exactly covers `[0, b * L - 1]`

), and the sign bit of the high byte corresponds to whether or not `x`

is above `L`

(and thus whether or not it’s inside `I`

), so we get that comparison for free! Further, the `x >> log2(M)`

term above is just reading the high byte of the state, and the `x & (M - 1)`

term is just reading the low byte of the state. Also, with `M = 256`

our probabilities will always sum to 256; excluding 0 and 256 as possible probabilities (as these values don’t leave any room for the other symbol, and thus would break the entropy coder) we get a range of 1-255 for each probability, which fits perfectly within one byte. All quite nice for our byte-oriented 8-bit CPU, and with plenty of bits of precision between `M`

and `L`

to work with.

The one gotcha here still is the multiply in `Fs * (x >> log2(M))`

, since the 6502 doesn’t have a built-in hardware multiplier. Luckily, since our predictions are always one byte and so is the shifted part of our state, we can do an 8x8 multiply with 16-bit result relatively easily in software, which looks something like this:

```
FAC1 = $58
FAC2 = $59
; A*256 + X = FAC1 * FAC2
MUL8
lda #$00
ldx #$08
clc
m0 bcc m1
clc
adc FAC2
m1 ror
ror FAC1
dex
bpl m0
ldx FAC1
rts
```

There are lots of such routines floating around; this one was taken from Codebase 64. Without some of the extra data moving and inlining this code, we can get it down to just 10 instructions, or ~16 bytes. Of course it’s not particularly fast (on average ~130 cycles or so), but assuming an upper limit of 30 seconds to decode and a ~1mhz CPU, we should have around 915 available cycles/bit for our target size of 4096*8 packed bits, so we’re well within limits here.

There are several faster ways to do multiplication in software; most take advantage of alternative equivalent mathematical forms and/or lookup tables. In this case, I went with the simplest solution in order to keep decoder complexity low, but we can always trade some of this size if speed is more important.

With our various constants plugged in and our multiply available, we can rearrange our decode code a bit to better match what we’ll implement in 6502 asm:

```
// Input state x and probability for bit 1, output symbol and new state x'
fn decode_bit(mut x: u16, mut probability: u8) -> (u8, u16) {
let mut symbol = 0;
let mut bias = 0;
// Decode symbol
if probability < x.low {
// x is in the lower subrange; bit is 1, bias is 0
symbol = 1;
} else {
// x is in the upper subrange; bit is 0, bias is probability,
// and probability should represent the upper subrange instead
// of the lower one
bias = probability;
probability = 0x100 - probability; // set carry, sub from zero
}
// Update state
let mut new_x = x.high * probability; // 8x8 multiply, 16 bit result
new_x += x.low; // 16+8 add
new_x -= bias; // 16+8 sub
x = new_x;
// Renormalize
while sign bit of x_high is not set {
x = (x << 1) | input_bit(); // get new bit in carry, rotate low+high
}
(symbol, x)
}
```

Not too much different than the earlier listing, but hopefully it becomes clearer how we’ll handle everything in 8 or 16 bits. I believe this is clear enough to imagine the full assembler listing - I’ll leave that as an exercise to the reader (if you’re familiar with 6502 this should all be pretty straightfoward). In the actual decoder it’s almost as simple; the code is slightly more complicated as it’s also responsible for updating the model prediction (since both the symbol decode and model update will perform the same branch on the decoded symbol), but beyond that it’s basically the obvious 6502 interpretation of the above listing, with some things folded together to make the code as small as possible.

I mentioned that in this codec we use 8-bit probabilities. While this is enough precision to represent predictions fairly accurately, the update rule becomes critical in order to use this precision as effectively as possible.

Modeling binary probabilities adaptively is really quite simple. ryg has already done a great post about this, so I’ll leave most of the details there, but repeat some of the relevant parts here for clarity.

The model adaption code I use starts with the following:

```
LEARNING_RATE = 4;
prob = 128;
fn adapt() {
if bit == 1 {
prob += (256 - prob) >> LEARNING_RATE;
} else {
prob -= prob >> LEARNING_RATE;
}
}
```

`4`

might seem a bit high for `LEARNING_RATE`

with only 8 bit probabilities, but empirically it performs quite well on a lot of different data sources, with both 16 and 8 bit probabilities. This update rule has some nice properties as well; it’s very easy to implement, and we never have to worry about boundary conditions (`prob`

will never reach `0`

or `256`

so we don’t need to check for either).

It does have one problem though, which ryg outlines in his post. Due to the shift, the probability will never go below `2 ^ LEARNING_RATE - 1`

(`15`

in this case); likewise, it will never go above `256 - 15`

either. This may not sound significant, but instead of representing the range `[0, 1]`

, our probabilities now only represent the range `[15/256, 1 - 15/256]`

or about `[0.0586, 0.9414]`

. That’s actually *very* significant; we’re losing the bottom and top 6% of our available range! This can result in a compression ratio loss of up to a few percent (over 100 bytes for a 4k!).

So, we’d like to come up with a scheme that will be able to use this remaining range effectively. At the same time, we’d like not to change the learning rate too dramatically for the range that we’re able to effectively adapt in, since it performs quite well as-is.

In the end, I went with a simple method that allows us to take advantage of the full range in a trivial way:

```
LEARNING_RATE = 4;
prob = 128;
fn adapt() {
if bit == 1 {
adjustment = (256 - prob) >> LEARNING_RATE;
if adjustment == 0 {
adjustment = 1;
}
prob += adjustment;
if prob > 255 {
prob = 255;
}
} else {
adjustment = prob >> LEARNING_RATE;
if adjustment == 0 {
adjustment = 1;
}
prob -= adjustment;
if prob < 1 {
prob = 1;
}
}
}
```

This is identical to the adaptation code above, but when the model update won’t actually update the prediction, we force it to adjust by 1/256 anyways. Then, we handle over/underflow the obvious way. In 6502 both the `prob > 255`

and `prob < 1`

conditions can be checked by letting the probability overflow or decrease to `0`

, so this code is very small to implement. Admittedly, it’s a bit of a hack, but this simple adjustment gains us 120 additional bytes in makeshift, just by making sure we take advantage of precision we already have in the rest of the decoder. This is the kind of trick that can make 8 bit probabilities actually usable in practice.

So even though I still plan on doing another post detailing the other aspects of the packer, I think I should at least share some actual results here.

The original intro this was used for, makeshift, clocked in at at total of **4095 bytes**, just 1 byte below the 4096 byte limit when it was first released. It used a custom packer that I developed rather quickly, and was the first LZ I had developed that even did bit IO, not to mention anything resembling entropy coding.

The current king of compression for C64 is Exomizer, which provides quite good compression and good decode speed. It’s the most commonly used tool for C64 4k intro compression *by far*, so if we’re to accurately gauge our results, we should see how we stack up with it as well, not just my older packer. Admittedly, I wasn’t able to produce a working version of makeshift with the latest Exomizer (nor was I able to when the intro was first released), but I *was* able to get a best size for the compressed data, and it’s open-source, so we can do some byte counting for the decompressor/stub code and figure out how large a full working version would be. There’s some estimation here, but this should be quite close at the very least: with the best possible packer/overhead size for Exomizer (i.e. not using literal sequences, still relocating some mem to decode into basic mem, and a few other small things that the intro needs to do), makeshift would clock in at **4001 bytes**, a total of **94 bytes gained** or a **2.2% reduction**. Certainly nothing to sneeze at, but of course, the new packer does better.

For more detailed notes and how I came up with the 4001 bytes figure, see this gist. As it

isan estimate after all, it’s possible I’ve made a few mistakes, but it should be well within acceptable margins.

With the new packer at the time of writing, makeshift clocks in at **3892 bytes**, for a total of **203 bytes gained** or a **4.9% reduction**! This beats the best Exomizer size by **109 bytes** (a full **2.7% reduction** on top of what it would have given us), which is very exciting to me at least :) . And in the spirit of “binary or it didn’t happen”, you can download that version here.

Note that this figure is with *almost* zero changes to the actual intro code to tune it to the packer’s strengths. The one change I made was to remove the large portion of 0’s at the end of the original data and replace it with a clear loop that clears that area of memory instead (and I also adjusted the placement of some sections in memory to make room for this loop). This was mostly done to speed up encoding during testing (the current search algorithm is quite slow for long runs of the same byte, as is common for relatively naïve match searching), but the new packer *does* do a bit better with this loop (around 10 bytes gained, nothing more; this is in contrast with the old packer which did better just compressing all the 0’s), so I think this comparison is still totally fair and valid. There should be more potential gains tweaking the code/data to feed the packer better, but I think that’s better saved for the next intro(s) instead of trying to squeeze the most out of something that’s already released.

One thing I haven’t mentioned here is decompression speed. My old packer and Exomizer have very similar speed profiles; they both decode makeshift in just a few seconds (about 2 iirc). The new packer is *far* slower, decoding the intro in about 22.5 seconds. This is an enormous speed difference, but is perfectly acceptable for a 4k intro (assuming there’s a small progress indicator to let the user know it hasn’t crashed, which is what 6 of our decoder bytes go to); perhaps not much else, unless compressed size was very much a concern. Another thing I haven’t talked about is memory used by the decoder. Again, my old packer and Exomizer are similar here and use somewhere around 500 bytes for decoding tables, whereas the new one uses about 4.5kb for all of the possible predictions in the model. Also perhaps not that useful for anything but decoding a 4k, but an easy price to pay for better compression in this scenario, especially since it’s only used by the decompressor and can be clobbered afterwards (or whenever the decoder is not in use if the decoder were to be used on several compressed chunks throughout the intro or something; maybe that makes sense in a 64k…).

As mentioned, I plan on writing another post or two detailing the rest of the details of the codec, not just the entropy coder, so stay tuned for that. And maybe I’ll have even done another few rounds of optimization/packer improvements, and makeshift could be smaller still by then :) .

As a small bonus, I’ve also tried the exact same packer on Datafresh (cheers to blueberry for providing the unpacked data for testing!). I didn’t make a full working intro, but given that the original released intro was

4096 bytesand the latest Exomizer crunches that to3820 bytes, we can assume 276 bytes for the full Exomizer decompressor/stub, which is bang on the money for our estimates above. My new packer compresses this same data to3661 bytes (159 bytes smaller), so if we assume a 343 bytes decompressor/stub for the new packer (same as makeshift minus the relocation overhead), that would yield4004 bytes final size, a 92 byte gain (2.2% reduction). So it’s not just my intro that wins here at least. :)

11 Feb 2019