ferris blogs: Technical ramblings and other stuff
https://yupferris.github.io/blog/

<h1>Stochastic rounding</h1>
<p>So here’s a little quickie based on something I encountered <a href="https://www.graphcore.ai/">in my new job</a> :).</p>
<p>The main motivation is that you’d like to work with lower precision numbers than you typically work with for <em>some reason</em> (typically to reduce storage for some model and ideally do faster calculations). But ofc to do that you need to drop some precision, which involves rounding.</p>
<p>Oftentimes this rounding ends up being a floor or towards-zero round, sometimes a round-to-nearest, but whichever you choose, you’re still losing fractional information.</p>
<p>So the idea behind a stochastic round is that you use some random number that decides whether you round up or down, instead of always doing one or the other or taking the closest integer. The critical detail here is that the distribution of that number matches that of the fractional part of the number you’re rounding.</p>
<!-- excerpt -->
<p>e.g. if you round 1.3 to the nearest integer 100 times, you round up 30% of the time and down 70%.</p>
<p>Likewise, if you round 1.6 100 times, you round up 60% of the time and down 40%.</p>
<p>So if you <em>just</em> sum up that rounded value 100 times, you get the same result as if you had done the adds with higher precision and then rounded the normal way, which is a really cool idea actually! Of course you still get some cumulative error over time in practical use cases, but it actually happens to work out pretty well and tends to be noticeably better than just dropping the extra precision.</p>
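<p>As a quick sanity check, here’s a minimal Rust sketch of the idea in 4.4 fixed point (the helper name and layout are mine, not from any particular implementation): with a uniform threshold in <code class="highlighter-rouge">[0, 16)</code> standing in for the random value, summing the stochastically rounded value over a full set of thresholds recovers the exact fixed-point total.</p>

```rust
// Hypothetical sketch of stochastic rounding in 4.4 fixed point:
// `value` carries 4 fractional bits, and `threshold` stands in for a
// uniform random value in [0, 16).
fn stochastic_round(value: u32, threshold: u32) -> u32 {
    let whole = value >> 4;   // floor(value / 16)
    let fract = value & 0x0f; // fractional bits
    if threshold < fract {
        whole + 1 // round up
    } else {
        whole // round down
    }
}

fn main() {
    // 21/16 = 1.3125: over all 16 equally likely thresholds we round up
    // exactly 5 times, so the rounded values sum to the exact total.
    let value = 21u32;
    let sum: u32 = (0..16).map(|t| stochastic_round(value, t)).sum();
    assert_eq!(sum, value);
}
```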
<p>So naturally, I had to try this in the probability update in my c64 packer, where it should apply nicely since it only uses 8 bits for probabilities and updates. :)</p>
<p>In that packer I currently update probabilities with something like this:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let adjustment = error / 16; // up or down depending on bit
prob += adjustment;
</code></pre></div></div>
<p>This is <em>very</em> simplified (I already wrote about it in much more detail <a href="https://yupferris.github.io/blog/2019/02/11/rANS-on-6502">here</a>) but the point is that I have 4 bits of fractional precision and 4 bits of whole precision for the update (using 8 bit probs), and currently I do the <code class="highlighter-rouge">/16</code> as 4 right shifts, which is essentially a floor divide (<em>and</em> a towards-zero divide since the adjustment is never negative).</p>
<p>To get stochastic rounding, I need some kind of super cheap PRNG that respects the distribution of the fractional bits. After some thinking I came up with something very simple, yet pretty effective. The idea is to use the fractional bits of the number I’m rounding as a <em>decision threshold</em>. If we generate a pseudo-random value with the same number of bits, we can just compare the value with the fractional bits. If the value is less than the fractional bits, we round up; otherwise, we round down. As long as the PRNG we choose has pretty even distribution over the possible fractional bit values, we should achieve decent stochastic rounding.</p>
<p>For the PRNG, I chose to use a rotating value in <code class="highlighter-rouge">[0, 16)</code> that I add 9 to each time (something close to golden ratio for 16) so that this cycles within this interval in a semi-random fashion where each value has equal probability. I got this idea from how <a href="http://fastcompression.blogspot.com/2013/12/finite-state-entropy-new-breed-of.html">FSE state tables are generated</a>, and it’s ofc not a great PRNG, but it’s <em>very</em> cheap (especially on 6502) both in terms of performance and code complexity, so it made for a good test.</p>
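<p>A tiny Rust sketch of that counter (names mine; the packer just keeps this in a byte): since gcd(9, 16) = 1, stepping by 9 visits every value in <code class="highlighter-rouge">[0, 16)</code> exactly once per 16 steps, so each threshold value is equally probable over a cycle.</p>

```rust
// One full cycle of the cheap stochastic counter: step by 9, mask back
// into [0, 16).
fn counter_cycle() -> Vec<u8> {
    let mut counter = 0u8;
    (0..16).map(|_| {
        counter = (counter + 9) & 0x0f;
        counter
    }).collect()
}

fn main() {
    let mut cycle = counter_cycle();
    cycle.sort();
    // Since gcd(9, 16) = 1, every value in [0, 16) occurs exactly once.
    assert_eq!(cycle, (0..16).collect::<Vec<u8>>());
}
```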
<p>This kind of probability update with stochastic rounding looks something like this:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>stochastic_counter = (stochastic_counter + 9) &amp; 0x0f
let adjustment = error / 16
if stochastic_counter &lt; fract(adjustment) {
    adjustment = ceil(adjustment)
} else {
    adjustment = floor(adjustment)
}
prob += adjustment
</code></pre></div></div>
<p>So this can be implemented in a <em>very</em> simple fashion, but how does it perform? Interestingly enough for this use case it actually performs a couple percent better than a floor divide, just by providing additional precision for the probabilities (and even with such a simple PRNG)!</p>
<p>However, it doesn’t end up winning in this setup. It does beat my best compressed size, including my best size with a slightly hacked prob update (which is the one I <em>actually</em> use currently, where I do the first thing above but never let the adjustment fall below 0, just to try and use more of the 8 bit range); against that it only does about 0.2% better. Note that the same hack actually does <em>worse</em> with the more accurate stochastic rounding, which is kinda cool actually. Even with the better compression though, the code is a bit bigger (not much, but significant enough in this case), so compared to my best from before, I lose about 10 bytes.</p>
<p>But I’m quite impressed at how effective the rounding actually is. I might try to calculate some actual error percentages, and perhaps play with better/alternative cheap PRNGs just for the fun of it to see if I can eventually do something better than what I’ve already got. But in either case, I think this is a really cool idea (and one that I hadn’t encountered before now), and it’s totally a viable alternative to simpler rounding when precision is tight!</p>
<blockquote>
<p>Edit 25/03/2019: I realized that I didn’t need to use a decision threshold at all; I could instead just add the stochastic counter bits to the adjustment directly before doing the shifts/floor divide. This is both equivalent and simpler, resulting in the implementation being a bit smaller, so that we only see a net loss of a single byte over my previous scheme. Still not winning, but fun nonetheless :) . This simpler formulation comes from the <a href="https://arxiv.org/abs/1502.02551">paper</a> I initially read that made me excited about this idea in the first place. I certainly should have read through the “system description” section before trying this myself!</p>
</blockquote>
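<p>The two formulations can be checked against each other with a small Rust sketch (my own reconstruction, not the packer code). One nuance: they’re equivalent in distribution rather than call-for-call; for a given fractional part, both round up the same number of times over a full counter cycle.</p>

```rust
// Threshold formulation vs the edit's add-then-shift formulation, for a
// 4-bit fractional part (reconstruction, not the packer's code).
fn round_threshold(error: u32, counter: u32) -> u32 {
    let adjustment = error >> 4;
    if counter < (error & 0x0f) {
        adjustment + 1
    } else {
        adjustment
    }
}

fn round_add(error: u32, counter: u32) -> u32 {
    (error + counter) >> 4
}

fn main() {
    // Per call the two can disagree, but over a full counter cycle each
    // rounds up exactly (error & 0x0f) times, so the distributions (and
    // expected adjustments) match for every error value.
    for error in 0..256u32 {
        let a: u32 = (0..16).map(|c| round_threshold(error, c)).sum();
        let b: u32 = (0..16).map(|c| round_add(error, c)).sum();
        assert_eq!(a, b);
    }
}
```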
Sat, 23 Mar 2019 00:00:00 +0000
https://yupferris.github.io/blog/2019/03/23/stochastic-rounding.html

<h1>rANS on 6502</h1>
<p>I’m in the middle of writing several compression-related posts that are all in various stages of completion, and I seem to have already developed a habit of writing waaay too much stuff. So I thought I’d try to write something a bit more short-form (yeah, it’s not really short at all, but whatever) about an experiment I did that actually went pretty well: I implemented an <a href="https://en.wikipedia.org/wiki/Asymmetric_numeral_systems">rANS</a>-based packer on <a href="https://en.wikipedia.org/wiki/Commodore_64">C64</a> for 4k intros, and I’m quite pleased with the results.</p>
<p>OK, so technically it’s an rABS-based packer, the difference here being that it only entropy codes binary symbols. The reason for this is that we typically observe better compression with adaptive modeling, and updating symbol predictions is much, much simpler with a binary alphabet (look ma, no divides!).</p>
<!-- excerpt -->
<blockquote>
<p>At some point I’ll also publish a post (or perhaps a series of posts) that will talk about the codec in more detail (along with several other things I tried as well), but for this post I’d just like to focus on the rABS implementation that forms the basis of the codec.</p>
</blockquote>
<h2 id="decoding-rabs">Decoding rABS</h2>
<p>The rest of this post will focus almost entirely on the rABS <em>decoder</em> used in the packer. While the encoder runs on a modern PC, the decoder runs on the target platform to decode the intro into memory before it runs. Thus, several details of the codec are decided around keeping the decoder as simple as possible while still delivering the best compression ratio we can manage. Speed is also somewhat of a concern, but not our first priority; nobody will complain if a kickass 4k intro for C64 takes 30 seconds to decode, for example (so long as the user knows it hasn’t crashed, of course).</p>
<p>The idea here is that to decode an rANS/rABS bitstream, we have some running state variable <code class="highlighter-rouge">x</code> that contains some information about all of the symbols previously encoded. We’ll decode symbols “out” of this state (essentially popping them off of a stack), and then update the state to a new value. The state always has enough information to decode the last symbol coded at least, though since it has limited precision, sometimes we’ll have to pull in additional state bits from the encoded bitstream. The full symbol decoding algorithm looks something like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>- Determine symbol s from state x and symbol probs (decode)
- Based on symbol s and its frequency Fs, find a new state x' (update)
- If necessary, pull in additional bits into x' (renormalize)
- Assign x = x'
</code></pre></div></div>
<p>Let’s first describe how a symbol <code class="highlighter-rouge">s</code> is extracted from a state value <code class="highlighter-rouge">x</code>. With rANS/rABS, this is actually pretty straightforward. If <code class="highlighter-rouge">x</code> is in some range <code class="highlighter-rouge">I = [L, b * L - 1]</code> (where <code class="highlighter-rouge">L</code> is some multiple of <code class="highlighter-rouge">M</code>, which is the sum of all symbol frequencies) and <code class="highlighter-rouge">b</code> is some integer base (2 for streaming bits, 8 for bytes, …), then <code class="highlighter-rouge">I</code> is essentially built up by several subranges for each possible symbol, starting at its cumulative frequency <code class="highlighter-rouge">Bs</code> and with a range of its frequency <code class="highlighter-rouge">Fs</code>, offset by <code class="highlighter-rouge">L</code> and scaled by <code class="highlighter-rouge">b</code>. We can then determine <code class="highlighter-rouge">s</code> just by checking which subrange of <code class="highlighter-rouge">I</code> that <code class="highlighter-rouge">x</code> falls into. This is nearly identical to what you’d do for an arithmetic decoder, except that the range is defined differently (it would be <code class="highlighter-rouge">[0, M]</code> or perhaps <code class="highlighter-rouge">[0, L]</code> for an arithmetic coder).</p>
<p>Note that this is extremely simple in the binary alphabet case (like we have with rABS) because <code class="highlighter-rouge">I</code> is essentially only divided into two subranges; one for a 1 bit and one for a 0 bit. Determining which range <code class="highlighter-rouge">x</code> is in is then just a comparison of <code class="highlighter-rouge">x mod M</code> against <code class="highlighter-rouge">Bs</code>, where <code class="highlighter-rouge">s</code> is the symbol corresponding to the upper subrange in <code class="highlighter-rouge">I</code>.</p>
<p>As an example, consider the following range for symbols 0 and 1 (and we’re simplifying here with <code class="highlighter-rouge">L</code> = <code class="highlighter-rouge">M</code>, which interestingly is equivalent to tANS with rANS symbol order, but that’s a whole other can of worms that I would <em>love</em> to talk about some other time):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>M = 8
L = 8 (M)
b = 2
F1 = 2
F0 = 6
x = 11

                L              2L - 1
                [1s |  x  0s        ] &lt;- I
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
</code></pre></div></div>
<p>In this case we can clearly see that <code class="highlighter-rouge">x</code> falls into the subrange of <code class="highlighter-rouge">I</code> corresponding to symbol 0, and thus a 0 is decoded from <code class="highlighter-rouge">x</code>.</p>
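<p>In code, the example boils down to a single range check (a sketch using the example’s constants):</p>

```rust
// The worked example as a range check: with L = M = 8 and F1 = 2, the
// 1s occupy [0, 2) of x mod M and the 0s occupy [2, 8).
const M: u32 = 8;
const F1: u32 = 2;

fn decode_symbol(x: u32) -> u32 {
    if (x & (M - 1)) < F1 { 1 } else { 0 }
}

fn main() {
    assert_eq!(decode_symbol(11), 0); // x = 11 lands in the 0s subrange
    assert_eq!(decode_symbol(9), 1);  // x = 9 lands in the 1s subrange
}
```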
<blockquote>
<p>Note that while this example shows a binary alphabet, this works for an arbitrary number of symbols (assuming we have enough precision to uniquely represent each subrange). Another thing I should probably mention is that when <code class="highlighter-rouge">L</code> > <code class="highlighter-rouge">M</code>, we actually have <em>several</em> subranges in <code class="highlighter-rouge">I</code> that correspond to each symbol; essentially we have <code class="highlighter-rouge">L</code> / <code class="highlighter-rouge">M</code> subranges for each symbol as <code class="highlighter-rouge">[0, M)</code> is repeated throughout <code class="highlighter-rouge">L</code>. This is more or less what gives us the modulo in the rANS state update formula and the ratio of <code class="highlighter-rouge">L</code> to <code class="highlighter-rouge">M</code> has a lot to say when it comes to the precision of the coder. Don’t worry if this is a bit tough to follow; the details here aren’t actually that important, and I’ll try to fill in as many blanks as possible shortly.</p>
</blockquote>
<p>Once we can extract a symbol <code class="highlighter-rouge">s</code> from state <code class="highlighter-rouge">x</code>, the heart of an rANS/rABS decoder is the following formula that calculates the next state value <code class="highlighter-rouge">x'</code> (citing <a href="http://cbloomrants.blogspot.com/2014/02/02-02-14-understanding-ans-4.html">Charles Bloom’s blog series on ANS</a>):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>x: current state value
x': new state value
Fs: frequency for symbol s
Bs: cumulative frequency for symbol s
M: sum of all symbol frequencies
x' = Fs * (x / M) + (x mod M) - Bs
</code></pre></div></div>
<p>Already we can tweak this a bit by making sure <code class="highlighter-rouge">M</code> is a power of two, which turns the divide into a right shift and the modulo into a bitwise <code class="highlighter-rouge">AND</code>:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>x' = Fs * (x >> log2(M)) + (x & (M - 1)) - Bs
</code></pre></div></div>
<p>Now we’re looking pretty good here for something relatively fast/simple.</p>
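<p>A quick Rust check (with made-up example values) that the shift/mask form agrees with the divide/modulo form when <code class="highlighter-rouge">M</code> is a power of two:</p>

```rust
// With M a power of two (here 8, so log2(M) = 3), the divide is a right
// shift and the modulo is a mask; both forms of the state update agree.
fn main() {
    const M: u32 = 8;
    let (fs, bs) = (3u32, 2u32); // made-up frequency and cumulative frequency
    for x in 8..256u32 {
        let divmod = fs * (x / M) + (x % M) - bs;
        let shiftmask = fs * (x >> 3) + (x & (M - 1)) - bs;
        assert_eq!(divmod, shiftmask);
    }
}
```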
<p>The last part involves renormalization. This part is easy. After the state update, we have a new state <code class="highlighter-rouge">x'</code> which we want to be in range <code class="highlighter-rouge">I</code>. Oftentimes it already will be; when it’s not, it’s going to be some value less than <code class="highlighter-rouge">L</code>. So, we just need to pull in bits from the encoded bitstream until <code class="highlighter-rouge">x'</code> is in <code class="highlighter-rouge">I</code>:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>while x' &lt; L {
    x' = (x' * 2) + input_bit()
}
</code></pre></div></div>
<p>And that’s it. As we’ll see later, the comparison against <code class="highlighter-rouge">L</code> can even be done in the CPU automatically for us, if we choose our ranges correctly.</p>
<p>Putting all of this together, our bit decoding code is becoming clearer:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Input state x and probability for bit 1, output symbol and new state x'
fn decode_bit(mut x: u32, mut probability: u32) -> (u32, u32) {
    let mut symbol = 0;
    let mut bias = 0;
    // Decode symbol
    if (x &amp; (M - 1)) &lt; probability {
        // x is in the lower subrange; bit is 1, bias is 0
        symbol = 1;
    } else {
        // x is in the upper subrange; bit is 0, bias is probability,
        // and probability should represent the upper subrange instead
        // of the lower one
        bias = probability;
        probability = M - probability;
    }
    // Update state
    x = probability * (x >> log2(M)) + (x &amp; (M - 1)) - bias;
    // Renormalize
    while x &lt; L {
        x = (x &lt;&lt; 1) | input_bit();
    }
    (symbol, x)
}
</code></pre></div></div>
<p>Ultimately, this is what we’re going to implement on the C64.</p>
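<p>We can sanity-check the decode-side math with a toy-sized Rust sketch using the constants from the earlier example (this is my reconstruction; the packer’s encoder isn’t shown in this post): decoding a symbol from every state in <code class="highlighter-rouge">I</code> and then applying the textbook rANS encoder step should restore the original state exactly.</p>

```rust
// Toy-sized sanity check: decode a symbol from every state in I, then
// apply the textbook rANS encoder step and verify it restores the state.
const M: u32 = 8;  // sum of symbol frequencies
const L: u32 = 8;  // lower bound of I (here L = M)
const F1: u32 = 2; // frequency of bit 1; 1s are the lower subrange (B1 = 0)

fn decode(x: u32) -> (u32, u32) {
    let (fs, bs, s) = if (x & (M - 1)) < F1 {
        (F1, 0, 1)
    } else {
        (M - F1, F1, 0)
    };
    // New state before renormalization: x' = Fs * (x / M) + (x mod M) - Bs
    (s, fs * (x >> 3) + (x & (M - 1)) - bs)
}

fn encode(s: u32, x: u32) -> u32 {
    let (fs, bs) = if s == 1 { (F1, 0) } else { (M - F1, F1) };
    M * (x / fs) + bs + (x % fs)
}

fn main() {
    for x in L..2 * L {
        let (s, x_new) = decode(x);
        assert_eq!(encode(s, x_new), x); // the two steps are inverses
    }
}
```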
<h2 id="rabs-on-6502">rABS on 6502</h2>
<p>So there are a few things we need to work around here. Particularly, we’d like to do as much arithmetic as we can in 8 bits (with some things in 16 bits where necessary).</p>
<p>With the above code, we can start plugging in some constants based around what the 6502 is good at. With <code class="highlighter-rouge">b = 2</code>, <code class="highlighter-rouge">M = 256 (2^8)</code>, and <code class="highlighter-rouge">L = 32768 (2^15)</code>, we can represent our entire state <code class="highlighter-rouge">x</code> in 16 bits (two bytes, which exactly covers <code class="highlighter-rouge">[0, b * L - 1]</code>), and the sign bit of the high byte corresponds to whether or not <code class="highlighter-rouge">x</code> is above <code class="highlighter-rouge">L</code> (and thus whether or not it’s inside <code class="highlighter-rouge">I</code>), so we get that comparison for free! Further, the <code class="highlighter-rouge">x >> log2(M)</code> term above is just reading the high byte of the state, and the <code class="highlighter-rouge">x & (M - 1)</code> term is just reading the low byte of the state. Also, with <code class="highlighter-rouge">M = 256</code> our probabilities will always sum to 256; excluding 0 and 256 as possible probabilities (as these values don’t leave any room for the other symbol, and thus would break the entropy coder) we get a range of 1-255 for each probability, which fits perfectly within one byte. All quite nice for our byte-oriented 8-bit CPU, and with plenty of bits of precision between <code class="highlighter-rouge">M</code> and <code class="highlighter-rouge">L</code> to work with.</p>
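<p>Those “free” byte-level identities are easy to verify in a few lines of Rust:</p>

```rust
// Checking the byte-level identities: x >> 8 is the high byte, x & 0xff
// is the low byte, and with L = 0x8000 the "x >= L" test is just the
// sign bit of the high byte.
fn main() {
    for &x in &[0u16, 1, 0x7fff, 0x8000, 0xabcd, 0xffff] {
        let high = (x >> 8) as u8;
        let low = (x & 0xff) as u8;
        assert_eq!(u16::from(high) * 256 + u16::from(low), x);
        assert_eq!(high & 0x80 != 0, x >= 0x8000);
    }
}
```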
<p>The one gotcha here still is the multiply in <code class="highlighter-rouge">Fs * (x >> log2(M))</code>, since the 6502 doesn’t have a built-in hardware multiplier. Luckily, since our predictions are always one byte and so is the shifted part of our state, we can do an 8x8 multiply with 16-bit result relatively easily in software, which looks something like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FAC1 = $58
FAC2 = $59
; A*256 + X = FAC1 * FAC2
MUL8
lda #$00
ldx #$08
clc
m0 bcc m1
clc
adc FAC2
m1 ror
ror FAC1
dex
bpl m0
ldx FAC1
rts
</code></pre></div></div>
<p>There are lots of such routines floating around; this one was taken from <a href="http://codebase64.org/doku.php?id=base:short_8bit_multiplication_16bit_product">Codebase 64</a>. Without some of the extra data moving, and inlining this code, we can get it down to just 10 instructions, or ~16 bytes. Of course it’s not particularly fast (on average ~130 cycles or so), but assuming an upper limit of 30 seconds to decode and a ~1 MHz CPU, we should have around 915 available cycles/bit for our target size of 4096*8 packed bits, so we’re well within limits here.</p>
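<p>Here’s a Rust model of the same shift-and-add idea (not a cycle-accurate port of the 6502 routine above), checked exhaustively against a native multiply:</p>

```rust
// Shift-and-add multiply: walk the multiplier's bits from the top,
// doubling the partial product and adding the multiplicand for set bits.
fn mul8(a: u8, b: u8) -> u16 {
    let mut acc: u16 = 0;
    for i in (0..8).rev() {
        acc <<= 1; // shift the partial product left
        if (a >> i) & 1 != 0 {
            acc += u16::from(b); // add the multiplicand for a set bit
        }
    }
    acc
}

fn main() {
    // Exhaustive check against the native multiply.
    for a in 0..=255u16 {
        for b in 0..=255u16 {
            assert_eq!(mul8(a as u8, b as u8), a * b);
        }
    }
}
```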
<blockquote>
<p>There are several faster ways to do multiplication in software; most take advantage of alternative equivalent mathematical forms and/or lookup tables. In this case, I went with the simplest solution in order to keep decoder complexity low, but we can always trade some of this size if speed is more important.</p>
</blockquote>
<p>With our various constants plugged in and our multiply available, we can rearrange our decode code a bit to better match what we’ll implement in 6502 asm:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Input state x and probability for bit 1, output symbol and new state x'
fn decode_bit(mut x: u16, mut probability: u8) -> (u8, u16) {
    let mut symbol = 0;
    let mut bias = 0;
    // Decode symbol
    if x.low &lt; probability {
        // x is in the lower subrange; bit is 1, bias is 0
        symbol = 1;
    } else {
        // x is in the upper subrange; bit is 0, bias is probability,
        // and probability should represent the upper subrange instead
        // of the lower one
        bias = probability;
        probability = 0x100 - probability; // set carry, sub from zero
    }
    // Update state
    let mut new_x = x.high * probability; // 8x8 multiply, 16 bit result
    new_x += x.low; // 16+8 add
    new_x -= bias; // 16+8 sub
    x = new_x;
    // Renormalize
    while sign bit of x.high is not set {
        x = (x &lt;&lt; 1) | input_bit(); // get new bit in carry, rotate low+high
    }
    (symbol, x)
}
</code></pre></div></div>
<p>Not too different from the earlier listing, but hopefully it becomes clearer how we’ll handle everything in 8 or 16 bits. I believe this is clear enough to imagine the full assembler listing, so I’ll leave that as an exercise to the reader (if you’re familiar with 6502, this should all be pretty straightforward). In the actual decoder it’s almost as simple; the code is slightly more complicated as it’s also responsible for updating the model prediction (since both the symbol decode and model update will perform the same branch on the decoded symbol), but beyond that it’s basically the obvious 6502 interpretation of the above listing, with some things folded together to make the code as small as possible.</p>
<h2 id="modeling">Modeling</h2>
<p>I mentioned that in this codec we use 8-bit probabilities. While this is enough precision to represent predictions fairly accurately, the update rule becomes critical in order to use this precision as effectively as possible.</p>
<p>Modeling binary probabilities adaptively is really quite simple. <a href="https://fgiesen.wordpress.com/2015/05/26/models-for-adaptive-arithmetic-coding/">ryg has already done a great post about this</a>, so I’ll leave most of the details there, but repeat some of the relevant parts here for clarity.</p>
<p>The model adaptation code I use starts with the following:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LEARNING_RATE = 4;
prob = 128;

fn adapt() {
    if bit == 1 {
        prob += (256 - prob) >> LEARNING_RATE;
    } else {
        prob -= prob >> LEARNING_RATE;
    }
}
</code></pre></div></div>
<p><code class="highlighter-rouge">4</code> might seem a bit high for <code class="highlighter-rouge">LEARNING_RATE</code> with only 8 bit probabilities, but empirically it performs quite well on a lot of different data sources, with both 16 and 8 bit probabilities. This update rule has some nice properties as well; it’s very easy to implement, and we never have to worry about boundary conditions (<code class="highlighter-rouge">prob</code> will never reach <code class="highlighter-rouge">0</code> or <code class="highlighter-rouge">256</code> so we don’t need to check for either).</p>
<p>It does have one problem though, which ryg outlines in his post. Due to the shift, the probability will never go below <code class="highlighter-rouge">2 ^ LEARNING_RATE - 1</code> (<code class="highlighter-rouge">15</code> in this case); likewise, it will never go above <code class="highlighter-rouge">256 - 15</code> either. This may not sound significant, but instead of representing the range <code class="highlighter-rouge">[0, 1]</code>, our probabilities now only represent the range <code class="highlighter-rouge">[15/256, 1 - 15/256]</code> or about <code class="highlighter-rouge">[0.0586, 0.9414]</code>. That’s actually <em>very</em> significant; we’re losing the bottom and top 6% of our available range! This can result in a compression ratio loss of up to a few percent (over 100 bytes for a 4k!).</p>
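<p>The saturation points are easy to confirm empirically with a Rust sketch of the update rule:</p>

```rust
// The simple update rule from above, iterated to its fixed points:
// all-1s input saturates at 256 - 15 = 241, all-0s at 15, so the
// effective probability range is [15/256, 241/256].
const LEARNING_RATE: u32 = 4;

fn adapt(prob: u32, bit: u32) -> u32 {
    if bit == 1 {
        prob + ((256 - prob) >> LEARNING_RATE)
    } else {
        prob - (prob >> LEARNING_RATE)
    }
}

fn main() {
    let (mut up, mut down) = (128u32, 128u32);
    for _ in 0..1000 {
        up = adapt(up, 1);
        down = adapt(down, 0);
    }
    assert_eq!(up, 241);  // never exceeds 256 - 15
    assert_eq!(down, 15); // never drops below 15
}
```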
<p>So, we’d like to come up with a scheme that will be able to use this remaining range effectively. At the same time, we’d like not to change the learning rate too dramatically for the range that we’re able to effectively adapt in, since it performs quite well as-is.</p>
<p>In the end, I went with a simple method that allows us to take advantage of the full range in a trivial way:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">LEARNING_RATE</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span>
<span class="n">prob</span> <span class="o">=</span> <span class="mi">128</span><span class="p">;</span>
<span class="k">fn</span> <span class="nf">adapt</span><span class="p">()</span> <span class="p">{</span>
<span class="k">if</span> <span class="n">bit</span> <span class="o">==</span> <span class="mi">1</span> <span class="p">{</span>
<span class="n">adjustment</span> <span class="o">=</span> <span class="p">(</span><span class="mi">256</span> <span class="err">-</span> <span class="n">prob</span><span class="p">)</span> <span class="o">>></span> <span class="n">LEARNING_RATE</span><span class="p">;</span>
<span class="k">if</span> <span class="n">adjustment</span> <span class="o">==</span> <span class="mi">0</span> <span class="p">{</span>
<span class="n">adjustment</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">prob</span> <span class="o">+=</span> <span class="n">adjustment</span><span class="p">;</span>
<span class="k">if</span> <span class="n">prob</span> <span class="o">></span> <span class="mi">255</span> <span class="p">{</span>
<span class="n">prob</span> <span class="o">=</span> <span class="mi">255</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">adjustment</span> <span class="o">=</span> <span class="n">prob</span> <span class="o">>></span> <span class="n">LEARNING_RATE</span><span class="p">;</span>
<span class="k">if</span> <span class="n">adjustment</span> <span class="o">==</span> <span class="mi">0</span> <span class="p">{</span>
<span class="n">adjustment</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">prob</span> <span class="err">-</span><span class="o">=</span> <span class="n">adjustment</span><span class="p">;</span>
<span class="k">if</span> <span class="n">prob</span> <span class="o">&lt;</span> <span class="mi">1</span> <span class="p">{</span>
<span class="n">prob</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This is identical to the adaptation code above, but when the model update wouldn’t actually change the prediction, we force it to adjust by 1/256 anyway. Then, we handle over/underflow the obvious way. In 6502, both the <code class="highlighter-rouge">prob > 255</code> and <code class="highlighter-rouge">prob &lt; 1</code> conditions can be checked by letting the probability overflow or decrease to <code class="highlighter-rouge">0</code>, so the implementation stays very small. Admittedly, it’s a bit of a hack, but this simple adjustment gains us 120 additional bytes in <a href="http://www.pouet.net/prod.php?which=70972">makeshift</a>, just by making sure we take advantage of precision we already have in the rest of the decoder. This is the kind of trick that can make 8 bit probabilities actually usable in practice.</p>
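For reference, the full-range update above can be expressed as a small standalone Rust function (an illustrative sketch; the real decoder is 6502 assembly, and the name `adapt` mirrors the pseudocode). With the forced minimum adjustment of 1, a long run of bits in either direction now reaches both ends of the range:

```rust
const LEARNING_RATE: u32 = 4;

// Update an 8-bit probability (held in a u32 for convenience) after
// seeing `bit`, forcing a minimum adjustment of 1 so the full range
// [1, 255] stays reachable, and clamping at the boundaries.
fn adapt(prob: u32, bit: u32) -> u32 {
    if bit == 1 {
        let adjustment = ((256 - prob) >> LEARNING_RATE).max(1);
        (prob + adjustment).min(255)
    } else {
        let adjustment = (prob >> LEARNING_RATE).max(1);
        (prob - adjustment).max(1)
    }
}

fn main() {
    // A long run of 0-bits now drives prob all the way down to 1 ...
    let mut prob = 128;
    for _ in 0..1000 {
        prob = adapt(prob, 0);
    }
    assert_eq!(prob, 1);
    // ... and a long run of 1-bits saturates at 255.
    for _ in 0..1000 {
        prob = adapt(prob, 1);
    }
    assert_eq!(prob, 255);
}
```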
<h2 id="results">Results</h2>
<p>So even though I still plan on doing another post detailing the other aspects of the packer, I think I should at least share some actual results here.</p>
<p>The original intro this was used for, <a href="http://www.pouet.net/prod.php?which=70972">makeshift</a>, clocked in at a total of <strong>4095 bytes</strong>, just 1 byte below the 4096 byte limit when it was first released. It used a custom packer that I developed rather quickly, and was the first LZ I had developed that even did bit IO, not to mention anything resembling entropy coding.</p>
<p>The current king of compression for C64 is <a href="https://bitbucket.org/magli143/exomizer/wiki/Home">Exomizer</a>, which provides quite good compression and good decode speed. It’s the most commonly used tool for C64 4k intro compression <em>by far</em>, so if we’re to accurately gauge our results, we should see how we stack up with it as well, not just my older packer. Admittedly, I wasn’t able to produce a working version of makeshift with the latest Exomizer (nor was I able to when the intro was first released), but I <em>was</em> able to get a best size for the compressed data, and it’s open-source, so we can do some byte counting for the decompressor/stub code and figure out how large a full working version would be. There’s some estimation here, but this should be quite close at the very least: with the best possible packer/overhead size for Exomizer (i.e. not using literal sequences, still relocating some mem to decode into basic mem, and a few other small things that the intro needs to do), makeshift would clock in at <strong>4001 bytes</strong>, a total of <strong>94 bytes gained</strong> or a <strong>2.2% reduction</strong>. Certainly nothing to sneeze at, but of course, the new packer does better.</p>
<blockquote>
<p>For more detailed notes and how I came up with the 4001 bytes figure, see <a href="https://gist.github.com/yupferris/3d27f8a845c895aed2c0dd6c5493e174">this gist</a>. As it <em>is</em> an estimate after all, it’s possible I’ve made a few mistakes, but it should be well within acceptable margins.</p>
</blockquote>
<p>With the new packer at the time of writing, makeshift clocks in at <strong>3892 bytes</strong>, for a total of <strong>203 bytes gained</strong> or a <strong>4.9% reduction</strong>! This beats the best Exomizer size by <strong>109 bytes</strong> (a full <strong>2.7% reduction</strong> on top of what it would have given us), which is very exciting to me at least :) . And in the spirit of “binary or it didn’t happen”, you can download that version <a href="http://logicoma.io/misc/makeshift-2019.prg">here</a>.</p>
<p>Note that this figure is with <em>almost</em> zero changes to the actual intro code to tune it to the packer’s strengths. The one change I made was to remove the large portion of 0’s at the end of the original data and replace it with a clear loop that clears that area of memory instead (and I also adjusted the placement of some sections in memory to make room for this loop). This was mostly done to speed up encoding during testing (the current search algorithm is quite slow for long runs of the same byte, as is common for relatively naïve match searching), but the new packer <em>does</em> do a bit better with this loop (around 10 bytes gained, nothing more; this is in contrast with the old packer which did better just compressing all the 0’s), so I think this comparison is still totally fair and valid. There should be more potential gains tweaking the code/data to feed the packer better, but I think that’s better saved for the next intro(s) instead of trying to squeeze the most out of something that’s already released.</p>
<p>One thing I haven’t mentioned here is decompression speed. My old packer and Exomizer have very similar speed profiles; they both decode makeshift in just a few seconds (about 2 iirc). The new packer is <em>far</em> slower, decoding the intro in about 22.5 seconds. This is an enormous speed difference, but it’s perfectly acceptable for a 4k intro (assuming there’s a small progress indicator to let the user know it hasn’t crashed, which is what 6 of our decoder bytes go to); it probably wouldn’t be acceptable for much else, unless compressed size was very much a concern. Another thing I haven’t talked about is memory used by the decoder. Again, my old packer and Exomizer are similar here, using somewhere around 500 bytes for decoding tables, whereas the new one uses about 4.5kb for all of the possible predictions in the model. That’s also probably only useful for decoding a 4k, but it’s an easy price to pay for better compression in this scenario, especially since the memory is only used by the decompressor and can be clobbered afterwards (or whenever the decoder is not in use, if the decoder were used on several compressed chunks throughout the intro; maybe that makes sense in a 64k…).</p>
<p>As mentioned, I plan on writing another post or two covering the rest of the codec, not just the entropy coder, so stay tuned for that. And maybe by then I’ll have done another few rounds of optimization/packer improvements, and makeshift could be smaller still :) .</p>
<blockquote>
<p>As a small bonus, I’ve also tried the exact same packer on <a href="http://www.pouet.net/prod.php?which=71707">Datafresh</a> (cheers to <a href="https://github.com/askeksa">blueberry</a> for providing the unpacked data for testing!). I didn’t make a full working intro, but given that the original released intro was <strong>4096 bytes</strong> and the latest Exomizer crunches that to <strong>3820 bytes</strong>, we can assume 276 bytes for the full Exomizer decompressor/stub, which is bang on the money for our estimates above. My new packer compresses this same data to <strong>3661 bytes (159 bytes smaller)</strong>, so if we assume a 343 bytes decompressor/stub for the new packer (same as makeshift minus the relocation overhead), that would yield <strong>4004 bytes final size, a 92 byte gain (2.2% reduction)</strong>. So it’s not just my intro that wins here at least. :)</p>
</blockquote>
Mon, 11 Feb 2019 00:00:00 +0000
https://yupferris.github.io/blog/2019/02/11/rANS-on-6502.html
https://yupferris.github.io/blog/2019/02/11/rANS-on-6502.htmlHello world<p><strong>Hello world!!</strong> Time to start another blog. But yeah uhm why?</p>
<p>I’ve been working on a few (particularly compression-related) projects that are difficult to cover nicely in <a href="https://www.twitch.tv/ferrisstreamsstuff">stream format</a>, as I’d like to 1. have the information in writing, which should be easier to digest; and 2. not spend a lot of time preparing for such a stream/series that would be organized enough to be useful. I’d of course still like to do streams on these projects, and plan to do so, but they’d probably be more like high-level overviews with Q&A and would use the relevant blog post(s) as a rough outline and place to point to for learning more.</p>
<!-- excerpt -->
<p>I’d also like to have an alternative channel (besides streaming/twitter specifically) to output some thoughts or details or whatever whenever I feel like. I will <strong>not</strong> have any sort of set publishing schedule and I don’t necessarily want to make technical writing something I do regularly; I just want to have the option and a more focused place to open a dialog about more technical discussions.</p>
<p>So yeah. Hope to end up with something useful here. Time will tell I suppose :)</p>
Sun, 10 Feb 2019 00:00:00 +0000
https://yupferris.github.io/blog/2019/02/10/hello-world.html
https://yupferris.github.io/blog/2019/02/10/hello-world.html