Technical ramblings and other stuff

10 Aug 2021

Another quickie, re this paper. It came up in conversation recently, and I ended up taking some notes while expanding on it to regain my understanding, and I thought they turned out to be somewhat useful.

Note to self: there’s a newer paper by the same author which appears to build on this paper, with the addition of asynchronous comparisons for full/empty flags, which I haven’t yet read/understood; should be interesting.

The paper covers asynchronous FIFO design (eg. writing in one clock domain, reading in another). This is metastability city if done incorrectly, but the gist is that we really only have to synchronize access to the underlying memory, and this can be achieved by carefully signaling the read/write pointers to the opposite domains by means of classic n-FF’s-in-series synchronizers, and encoding the values with gray codes such that even in the presence of metastability errors for individual bits, the logic is still robust (eg. when the values are wrong, they are always wrong *temporarily* and *conservatively*).

The “punchline” is the design presented in figure 4/style #2 (figure 3/style #1 is somewhat interesting in its own right but not what we want in most cases; more on that later).
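As a quick illustration of why gray codes make this robust (a generic sketch of the encoding, not code from the paper): consecutive pointer values differ in exactly one bit, so a synchronizer that samples a pointer mid-transition can only resolve to the old or the new value, never to some unrelated third value.

```c
#include <stdint.h>

/* Binary -> Gray: consecutive values differ in exactly one bit, so a
   pointer captured mid-transition resolves to either the old or the
   new value - never an arbitrary intermediate one. */
static uint32_t bin_to_gray(uint32_t b) { return b ^ (b >> 1); }

/* Gray -> binary: an XOR prefix scan undoes the encoding. */
static uint32_t gray_to_bin(uint32_t g) {
    for (uint32_t s = 1; s < 32; s <<= 1)
        g ^= g >> s;
    return g;
}
```

In hardware you'd of course do this with XOR gates rather than a loop, but the single-bit-difference property is the same.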

29 Jul 2021

This is a(n edited/corrected) “repost” of this Twitter thread, but I thought it would be nice to have all the notes collected here as well, so here goes.

I finally got to the punchline of “Mathematics of the DFT” and I must say, I’ve *really* enjoyed this book. It’s certainly dense and several parts took some rumination before they clicked, but I’m very satisfied.

I highly recommend reading the book if you’re curious about this stuff or want to fill in gaps in your knowledge. That said, here are some points that I found particularly interesting, many of which were totally new to me:

- The “add angles and multiply magnitudes” thing for complex multiplications isn’t arbitrary; it falls out of the algebra if you interpret the complex numbers as simple binomials. (side note: my Norwegian wife didn’t know “FOIL”, which makes sense but I found funnily surprising)
- All real sinusoids can be expressed as a sum of a conjugate pair of complex sinusoids. Put another way, all real signals *must* have matching positive and negative frequency components, each half the amplitude of the real signal components. The conjugate pair part is important; when summed, it makes the imaginary components cancel, and you’re left with a double-magnitude real component.
- Expressions describing rotations in the complex plane can be broken down into two parts: a *phasor*, which contains the magnitude and phase offset, and a *carrier*, which contains the time-dependent part.
- Continuous sinusoids at different frequencies are always orthogonal. When sampled, this only holds when the periods of the sinusoids are factors of the signal length. These are equivalent if you imagine the continuous case as moving the samples infinitely close together.
- Vectors can be “projected” onto one another. Geometrically, this corresponds to finding the point on one vector that connects to the end of the other vector at a right angle.
- This projection can be expressed as a scalar times one of the vectors, where the scalar (called the “coefficient of projection”) is a scaled inner product of the two vectors.
- For a vector of dimension `N`, one can project the vector onto `N` orthogonal vectors, and then perfectly reconstruct the original vector again by multiplying the coefficients of projection by the vectors and summing the results.
- The DFT (and many related transforms) is thus simply this: scaled coefficients of projection of a length-`N` signal, interpreted as an `N`-dimensional vector, projected onto `N` orthogonal basis vectors that happen to be sampled sinusoids which are `N` harmonics of the sampling rate divided by `N`. The scaling term in the IDFT accounts for this scaling.
- Lots of things follow from this understanding, eg. how the Cooley-Tukey FFT algorithm works (the even and odd samples each project onto `N/2` sinusoids described by roots of unity, with the odd results multiplied by phasors), which is very satisfying!

Admittedly the later chapters in the book (7 and 8) are much more dry imo; 7 isn’t particularly interesting for the most part (for me at least) and 8 seems somewhat thin. The appendices are a similar story, but I suppose they’re meant to be concise. Still, I’m glad I pushed through and carefully read the first 6 chapters; totally worth it IMO :)

30 Jun 2021

Similar to my last post in this series, here’s another quickie about a strange filter implementation I found in Jean-Marc Valin’s “A Free Codec For Free Speech”. Of course, speex has been obsolete for some time now, but I think CELP (and LPC in general) is neat, and I’ve been studying up on some proper DSP theory lately, in particular filters and their corresponding z-transforms. So, when I found a simple filter description along with pseudocode in the speex paper, I thought I’d try deriving one from the other for practice.

The filter in question is described in section IV.C (“Common Layer 8 Problems”). It’s a bog-standard DC blocker with the following transfer function:

\[N(z)=\frac{1-z^{-1}}{1-\alpha z^{-1}}\]

Not terribly interesting, but the pseudocode listing (modified slightly for clarity) is:

```
#define ALPHA .98f
static float mem = 0.f;
for (i = 0; i < frame_size; i++) {
    mem = ALPHA * mem + (1 - ALPHA) * input[i];
    output[i] = output[i] - mem;
}
```

Right away, there’s something fishy here: `output[i]` is defined in terms of `output[i]`, which makes no sense! But surely that’s the only issue, right?

Wrong.
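Without giving the rest away: directly transcribing the transfer function above yields the difference equation `y[n] = x[n] - x[n-1] + alpha * y[n-1]`. For reference, a sketch of that direct form (names mine, not from the paper):

```c
#define ALPHA 0.98f

/* Direct transcription of N(z) = (1 - z^-1) / (1 - ALPHA * z^-1):
   y[n] = x[n] - x[n-1] + ALPHA * y[n-1]. State persists across
   frames, like the original listing's static mem. */
static void dc_block(const float *input, float *output, int frame_size) {
    static float prev_x = 0.f, prev_y = 0.f;
    for (int i = 0; i < frame_size; i++) {
        float y = input[i] - prev_x + ALPHA * prev_y;
        prev_x = input[i];
        prev_y = y;
        output[i] = y;
    }
}
```

As expected of a DC blocker, a constant input decays geometrically toward zero at a rate set by `ALPHA`.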

07 Jun 2021

In our 64k intros, in addition to synthesis, we (among other groups) typically use samples for small one-shot sounds that are notoriously hard to synthesize, yet crucial for the overall impression of a track (for example, kick/snare drums, vocal chops, etc.).

Of course, we don’t just store raw PCM data, because not only is it quite large, it doesn’t compress well. Instead, since at least the mid-2000s, the most popular “good enough” approach has been to use a GSM 06.10 codec to compress the sample(s), and then use a decoder shipped with Windows via the ACM API to decode the sample(s) during initialization.

The resulting compression ratio and audio quality that we get by using this codec on this kind of data aren’t particularly great; in fact, they’re really bad by today’s standards. On top of that, even decoding a single sample takes ~10 ACM API calls, plus a bunch of data juggling. Gargaj and I actually looked into this a bit, and it’s apparently ~265b, which is admittedly decent, but not exactly free. It’s also not tweakable - you get the rate you get, and that’s it. Still, it pays for itself, it’s worked for a long time, and the fact that Windows ships an *encoder* for it as well means it doesn’t require using/integrating any additional 3rd party code into our tooling, which, when it comes to codecs, can be pretty important! So it’s no surprise that the technique has been left alone for so long, but we can surely do better.

30 May 2021

When implementing an algorithm from a paper/blog post/whatever, it’s not uncommon to transcribe the included (pseudo)code while trying to understand the ideas presented. Unfortunately, this code is often presented in a poor manner, and in some cases, is just plain wrong. This is particularly annoying when the paper is behind a paywall. I’ve encountered this enough times that I think I should start writing quick posts explaining my findings and how to fix them. This is the first of those posts.

The decoder pseudocode in T. Fischer’s “A pyramid vector quantizer” is awful. It’s presented in a near-unreadable fashion that’s difficult to transcribe into basically any programming language, and what’s worse, it’s missing a crucial detail in step 3: it doesn’t adjust `xb` to account for the decoded sign for dimension `i`, which causes the decoder to de-sync for the remaining dimension(s). The simplest pathological case I found was `b = 4, L = 2, K = 2`.

Actually, a really funny thing is that if you transcribe the code directly, it *will* work, so long as your environment allows integer overflows and you don’t mind waiting a while. But of course, this is not usable or helpful.
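For context, the quantity underlying Fischer’s enumeration is the number of points on the pyramid, i.e. integer vectors of dimension `L` whose absolute values sum to `K`. A sketch of the standard counting recurrence (my own transcription, not the paper’s notation):

```c
/* Number of lattice points on the pyramid S(L, K): integer vectors of
   dimension L whose absolute values sum to K. Each point either has a
   zero in its first coordinate (N(L-1, K)) or a nonzero one, which
   splits into the remaining cases of the recurrence:
   N(L, K) = N(L-1, K) + N(L, K-1) + N(L-1, K-1). */
static long pvq_count(int L, int K) {
    if (K == 0) return 1; /* only the zero vector */
    if (L == 0) return 0; /* no dimensions left but K > 0 */
    return pvq_count(L - 1, K) + pvq_count(L, K - 1)
         + pvq_count(L - 1, K - 1);
}
```

A memoized table of these counts is what the encoder/decoder walk through when enumerating a vector’s index.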

31 Aug 2020

In an earlier post, I talked about building an rABS entropy decoder on 6502, which became the heart of my new 4k intro packer experiment for C64. While exploring several different entropy coder variants for simple LZ codecs, and in particular tANS or an approximation of it with adaptive probabilities (spoiler alert: this was a dead-end), I realized that rABS was not only a possible alternative, but actually an OK fit for the system, providing a high precision entropy coder with… well, *good enough* decode speed and a not-as-tricky-as-I-had-originally-feared implementation :) .

However, I intentionally left out several details of the rest of the codec. The earlier post only describes how the entropy decoder is implemented and some of the modeling, the combination of which is kind of the novel thing here. But essentially, it only gives us a way to pull modeled bits out of an encoded bitstream. The bits we actually encode/decode and how we model those is what a codec is *really* all about; the entropy coding part, while certainly important, is really just a piece of machinery that allows us to build the rest of the codec effectively and take as much advantage of the modeling choices we make as possible.

So, in this post I’d like to go into detail about the rest of the codec, including the LZ scheme chosen, how we model various symbols, how some of these techniques were chosen/tested, and several parts of the encoder. I’ll save the rABS-specific encoding details/intuitions for a future post where I can talk about that specifically in more depth.

11 Feb 2019

Of the various compression/packer experiments that I’ve toyed with, one of my favorites has to be an rABS-based packer on C64 for 4k intros. It’s an interesting challenge to try and beat state-of-the-art packers for the platform for this use case, and I’m quite pleased with the construction I came up with, as well as the results. In particular, the way the **decoder** is built and how probabilities are modeled are particularly interesting, and that’s what I’d like to focus on in this post.

Quick note: while rANS is a more likely fit for a resource-constrained system like C64, I went with rABS instead, the difference being that we’re going to work with binary symbols. The reason for this is that we typically observe better compression with adaptive modeling, and updating symbol predictions is much, much simpler with a binary alphabet (look ma, no divides!). This of course makes per-symbol decoding much more expensive as we have to break symbols down into individual bit components and perform decoding/model updates on those (so, much more frequently), but for this particular use case, this is a tradeoff worth making for the higher compression ratio achieved.
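To give an idea of the shape of this (a generic portable-C sketch of a binary rABS decode step plus a shift-based adaptive update, not the actual 6502 implementation from the post):

```c
#include <stdint.h>

#define PROB_BITS  12
#define PROB_SCALE (1u << PROB_BITS)

/* Decode one bit from rABS state *x, where p0 = P(bit == 0) scaled to
   PROB_SCALE. The low PROB_BITS of the state select the symbol, and
   the state is then rebuilt so that encode/decode are exact inverses. */
static unsigned rabs_decode_bit(uint32_t *x, uint32_t p0) {
    uint32_t s = *x & (PROB_SCALE - 1);
    unsigned bit = s >= p0;
    *x = bit ? (PROB_SCALE - p0) * (*x >> PROB_BITS) + s - p0
             : p0 * (*x >> PROB_BITS) + s;
    return bit;
}

/* Adaptive model update: nudge p0 toward the bit just seen, using only
   a shift and an add/subtract - no divides, which is the whole point
   of sticking to a binary alphabet on a 6502. */
static void model_update(uint32_t *p0, unsigned bit, int rate) {
    if (bit) *p0 -= *p0 >> rate;
    else     *p0 += (PROB_SCALE - *p0) >> rate;
}
```

In the real decoder the state is of course kept in a handful of bytes and the multiplies are done with shift/add routines, but the arithmetic is the same.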

10 Feb 2019

**Hello world!!** Time to start another blog. But yeah uhm why?

I’ve been working on a few (particularly compression-related) projects that are difficult to cover nicely in stream format, as I’d like to 1. have the information in writing, which should be easier to digest; and 2. not spend a lot of time preparing for such a stream/series that would be organized enough to be useful. I’d of course still like to do streams on these projects, and plan to do so, but they’d probably be more like high-level overviews with Q&A and would use the relevant blog post(s) as a rough outline and place to point to for learning more.