Technical ramblings and other stuff

ferris blogs stuff Stochastic rounding

23 Mar 2019

So here’s a little quickie based on something I encountered in my new job :) .

The main motivation is that you’d like to work with lower precision numbers than you typically work with for *some reason* (typically to reduce storage for some model and ideally do faster calculations). But ofc to do that you need to drop some precision, which involves rounding.

Often times this rounding ends up being a floor or towards-zero round, sometimes a nearest round, but whichever you choose you’re still losing fractional information.

So the idea behind a stochastic round is that you use some random number that decides whether you round up or down, instead of always doing one or the other or taking the closest integer. The critical detail here is that the distribution of that number matches that of the fractional part of the number you’re rounding.

eg. if you round 1.3 to the nearest integer 100 times, you round up 30% of the time, down 70%.

likewise if you round 1.6 100 times, you round up 60% of the time, down 40%.

So if you *just* sum up that rounded value 100 times, you get the same result as if you had done the adds with higher precision and then rounded the normal way, which is a really cool idea actually! Of course you still get some cumulative error over time in practical use cases, but it actually happens to work out pretty well and tends to be noticeably better than just dropping the extra precision.

So naturally, I had to try this in the probability update in my c64 packer, where it should apply nicely since it only uses 8 bits for probabilities and updates. :)

In that packer I currently update probabilities with something like this:

```
let adjustment = error/16 (up or down depending on bit)
prob += adjustment
```

This is *very* simplified (I already wrote about it in much more detail here) but the point is that I have 4 bits of fractional precision and 4 bits of whole precision for the update (using 8 bit probs), and currently I do the `/16`

as 4 right shifts, which is essentially a floor divide (*and* a towards-zero divide since the adjustment is never negative).

To get stochastic rounding, I need some kind of super cheap PRNG that respects the distribution of the fractional bits. After some thinking I came up with something very simple, yet pretty effective. The idea is to use the fractional bits of the number I’m rounding as a *decision threshold*. If we generate a pseudo-random value with the same number of bits, we can just compare the value with the fractional bits. If the value is less than the fractional bits, we round up; otherwise, we round down. As long as the PRNG we choose has pretty even distribution over the possible fractional bit values, we should achieve decent stochastic rounding.

For the PRNG, I chose to use a rotating value in `[0, 16)`

that I add 9 to each time (something close to golden ratio for 16) so that this cycles within this interval in a semi-random fashion where each value has equal probability. I got this idea from how FSE state tables are generated, and it’s ofc not a great PRNG, but it’s *very* cheap (especially on 6502) both in terms of performance and code complexity, so it made for a good test.

This kind of probability update with stochastic rounding looks something like this:

```
stochastic_counter = (stochastic_counter + 9) & 0x0f
let adjustment = error/16
if stochastic_counter < fract(adjustment) {
adjustment = ceil(adjustment)
} else {
adjustment = floor(adjustment)
}
prob += adjustment
```

So this can be implemented in a *very* simple fashion, but how does it perform? Interestingly enough for this use case it actually performs a couple percent better than a floor divide, just by providing additional precision for the probabilities (and even with such a simple PRNG)!

However, it doesn’t end up winning in this setup; it certainly beats my best compressed size, including my best size with slightly hacked prob update (which is the one I *actually* use currently where I do the first thing above but I never let adjustment fall below 0 just to try and use more of the 8 bit range; against that it only does about 0.2% better.. note that same hack actually does *worse* with the more accurate stochastic rounding, which is kinda cool actually). Even with the better compression though, the code is a bit bigger (not much, but significant enough in this case); so compared to my best from before, I lose like 10 bytes.

But I’m quite impressed at how effective the rounding actually is. I might try and calculate some actual error percentages, and perhaps play with better/alternative cheap PRNG’s just for the fun of it to see if eventually I can actually do something better than what I’ve already got. But in either case, I think this is a really cool idea (and one that I hadn’t encountered before now), and it’s totally a viable alternative to simpler rounding when precision is tight!

Edit 25/03/2019: I realized that I didn’t need to use a decision threshold at all; I could instead just add the stochastic counter bits to the adjustment directly before doing the shifts/floor divide. This is both equivalent and simpler, resulting in the implementation being a bit smaller, so that we only see a net loss of a single byte over my previous scheme. Still not winning, but fun nonetheless :) . This simpler formulation comes from the paper I initially read that made me excited about this idea in the first place. I certainly should have read through the “system description” section before trying this myself!

23 Mar 2019