ferris blogs stuff

Technical ramblings and other stuff




10 Aug 2021

Cummings/Sunburst async FIFO notes

Another quickie, re this paper. It came up in conversation recently, and I ended up taking some notes while expanding on it to regain my understanding, and thought they turned out somewhat useful.

Note to self: there’s a newer paper by the same author which appears to build on this paper, with the addition of asynchronous comparisons for full/empty flags, which I haven’t yet read/understood; should be interesting.

The paper covers asynchronous FIFO design (eg. writing in one clock domain, reading in another). This is metastability city if done incorrectly, but the gist is that we really only have to synchronize access to the underlying memory, and this can be achieved by carefully signaling the read/write pointers to the opposite domains by means of classic n-FF’s-in-series synchronizers, and encoding the values with gray codes such that even in the presence of metastability errors for individual bits, the logic is still robust (eg. when the values are wrong, they are only wrong temporarily and conservatively).
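For reference, here are the standard binary/gray conversions and a quick check of the single-bit-change property that makes all of this work (a Python sketch of my own, not code from the paper):

```python
def bin2gray(b: int) -> int:
    # Each gray bit is the XOR of adjacent binary bits.
    return b ^ (b >> 1)

def gray2bin(g: int) -> int:
    # Inverse: prefix-XOR from the MSB down.
    b = g
    while g:
        g >>= 1
        b ^= g
    return b

# Key property: consecutive values differ in exactly one bit,
# even across the wrap-around.
WIDTH = 4
for i in range(2 ** WIDTH):
    a = bin2gray(i)
    b = bin2gray((i + 1) % (2 ** WIDTH))
    assert bin(a ^ b).count("1") == 1
```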

The “punchline” is the design presented in figure 4/style #2 (figure 3/style #1 is somewhat interesting in its own right but not what we want in most cases; more on that later).

In this design, there’s a single binary counter (per side, either read/write, but the figure just shows one) which is used to hold the address as well as drive the read/write port address. Then there’s a gray code register that tracks the current value of the binary register, and that one is used purely for synchronization.

The gray code output (ptr) has an additional MSB compared to the address output (addr), and is explained as follows (introducing m=n-1 for clarity): the FIFO holds 2^m entries, so the binary addr value is m bits wide. But to differentiate between the empty and full states (in both of which all m address bits are equal) there’s another MSB which can be compared (equal means empty, not equal means full). Thus the internal counter and the gray code outputs (ptr) both need m+1 bits, which is shown as n in the diagram.
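A small Python model of the binary-pointer version of this comparison (the sizes and names here are my own illustration, not from the paper; the paper’s actual full test is done on the gray-coded values, which compares the top two bits differently):

```python
DEPTH_BITS = 3              # m: 2**m = 8 entries (hypothetical size)
PTR_BITS = DEPTH_BITS + 1   # n = m + 1, includes the wrap-tracking MSB
MASK = (1 << PTR_BITS) - 1

def addr(ptr):
    # Memory address: just the lower m bits of the pointer.
    return ptr & ((1 << DEPTH_BITS) - 1)

def empty(rptr, wptr):
    # All n bits equal: reader has caught up to the writer.
    return rptr == wptr

def full(rptr, wptr):
    # MSBs differ, lower m bits equal: the writer has wrapped
    # exactly one more time than the reader.
    return (rptr ^ wptr) == (1 << DEPTH_BITS)

# Sanity check: after 8 writes and no reads, the FIFO is full.
wptr = 8 & MASK
rptr = 0
assert full(rptr, wptr) and not empty(rptr, wptr)
```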

Then, in the full FIFO design (figure 5) we can see that only the ptr values (the gray-code encoded ones) cross domains, and they go through synchronizers (2 FF’s shown in this case, could be more; but note the thick lines indicating multiple bits - this is a situation that’s prone to synchronization errors between the bits, but gray codes ensure that the resulting values are conservative/safe). The other signals only touch the “source” domain, except for the dual-ported RAM, but the two sides always manipulate different addresses, as ensured by the surrounding logic, so this is safe.
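To see why gray encoding makes the multi-bit crossing safe, here’s a hypothetical Python sketch enumerating every value a synchronizer could capture if each changing bit independently resolves to its old or new state during the transition:

```python
def sampled_values(old, new, width):
    # All values a synchronizer could capture if each changing bit
    # independently resolves to old or new during the transition.
    diff = old ^ new
    bits = [i for i in range(width) if (diff >> i) & 1]
    results = set()
    for mask in range(1 << len(bits)):
        v = old
        for j, i in enumerate(bits):
            if (mask >> j) & 1:
                v ^= 1 << i
        results.add(v)
    return results

# Binary 7 -> 8 flips all four bits, so any of 16 values could be
# sampled - including wildly wrong ones.
assert len(sampled_values(0b0111, 0b1000, 4)) == 16

def bin2gray(b):
    return b ^ (b >> 1)

# The same transition in gray code flips exactly one bit, so the
# sampled value is always either the old or the new pointer.
g7, g8 = bin2gray(7), bin2gray(8)
assert sampled_values(g7, g8, 4) == {g7, g8}
```

This is the “conservative/safe” behavior: the worst case is seeing a slightly stale pointer, which only makes the FIFO look a bit more full/empty than it really is.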

One nice property of this design, as illustrated by figure 5, is that it’s highly composable - almost all of it is actually synchronous, except the synchronizers - so each part can be tested/understood in isolation, or alternatively, put together as a synchronous FIFO (which example 1 shows, more or less). This is good for both testing and understanding, as it’s much easier to consider the behavior that way (though it’d certainly be wasteful/overkill as a synchronous design).

Back to figure 3/style #1: the main thing is that it stores the internal counter (n bits, or m+1 to match my earlier explanation) in gray code format, and forms an additional (n-1)-bit (m-bit) output (by combining the top two MSBs) so that we can still differentiate full/empty states (just as in the binary case, since the top bit behaves the same) and still address the memory properly. So it’s basically the same as the binary approach; it’s just that generating the MSB of an (n-1)-bit gray code from an n-bit gray code isn’t as simple as taking bit n-2, as it would be in the binary case. The registers are kept separate for this reason (we still need to send the whole n bits over to the other side, but we have to derive the (n-1)-bit output’s MSB, and we still want to register that for other logic). But it’s confusing that it expects external logic to recombine the address bits again, and the slow combinational path with gray code encoding/decoding just to keep the counter in gray code format seems awful/convoluted.
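For the curious, “combining the top two MSBs” is just an XOR, and the result really is a valid (n-1)-bit gray sequence. A quick Python check (widths and names are my own, not from the paper):

```python
def bin2gray(b):
    return b ^ (b >> 1)

N = 4  # n-bit gray pointer; the memory address is n-1 bits

def addr_gray(ptr_gray):
    # Style #1 trick: the (n-1)-bit address MSB is the XOR of the
    # pointer's top two bits; the lower n-2 bits pass through.
    msb = ((ptr_gray >> (N - 1)) ^ (ptr_gray >> (N - 2))) & 1
    return (msb << (N - 2)) | (ptr_gray & ((1 << (N - 2)) - 1))

# The derived output matches an (n-1)-bit gray counter that wraps
# every 2**(n-1) counts.
for i in range(2 ** N):
    expected = bin2gray(i % (2 ** (N - 1)))
    assert addr_gray(bin2gray(i)) == expected
```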

Style #2 is much clearer, and as the paper says, should also be faster.

Note that zipcpu also has a post about this design, including formally verifying its correctness.

I must admit that I struggled for a while to understand how it was safe to ship the data signals without any synchronization, but the internal construction of the memory takes care of that (again assuming you always write to/read from different addresses, which the synchronized logic takes care of). Curiously, the paper actually doesn’t show an enable signal for the read port, so there may be some subtlety there for the empty case (asynchronous read/write at the same address)… that seems spoopy! Note that the zipcpu article does have such a signal (i_rd) and I’m fairly certain you really want it too!

Edit 06/09/2021: After some email correspondence with the paper author, I’m now convinced that whether or not you want/need the enable signal for the read port depends on the underlying memory construction. For example, with a typical 8T 1R/1W cell topology, the read lines are electrically isolated from the write lines, so asynchronous read/write at the same address doesn’t cause any issues in this context. Further, registering the read address/enable signals (assuming they’re not required to meet timing) only adds further latency for read accesses, so you specifically do not want those here. However, depending on your target platform, you may not have a choice (eg. recent Xilinx BRAMs only support synchronous reads, and are likely to be inferred above a certain capacity even if your Verilog code unambiguously describes asynchronous read behavior) and might just have to use registered address/enable signals anyways.

As always, the paper goes into much more detail; read it if you haven’t already. But this should at least help with digesting it.

One last thing: all of this is overkill in cases where you have to synchronize a single-bit signal with relatively large pulse widths with respect to your target clock domain’s clock period; in that case just use the n-FF synchronizer approach. For cases where you want to ship a relatively-infrequent, single-cycle pulse from one domain to the other, the n-FF synchronizer approach is still usually what you want, just with an added xor-encoder and edge-detect-decoder on the respective ends of the sync FF’s (thus converting short pulses into edges of long pulses, which is the situation you want when using n-FF synchronizers). This paper by the same author is an excellent source for understanding various CDC solutions for different use cases; this should basically be required reading for this problem domain.
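A behavioral Python sketch of that toggle/edge-detect scheme (this models only the sampling order in a single simulation loop, not actual metastability timing; the names and the assumption that pulses are well-separated are mine):

```python
def sync_pulses(src_pulses, extra_dst_cycles=4):
    # XOR-encode: each source pulse toggles a level signal.
    toggle = 0
    levels = []
    for p in src_pulses:
        toggle ^= p
        levels.append(toggle)
    # Pad so the last toggle has time to propagate through the FFs.
    levels += [toggle] * extra_dst_cycles

    # Destination: 2-FF synchronizer, then edge detect (XOR of the
    # last two synchronized samples) recovers one pulse per toggle.
    ff1 = ff2 = prev = 0
    out = []
    for level in levels:
        prev = ff2
        ff2 = ff1
        ff1 = level
        out.append(ff2 ^ prev)
    return out

# Two well-separated source pulses arrive as two destination pulses.
assert sum(sync_pulses([1, 0, 0, 0, 1, 0, 0, 0])) == 2
```

Note the separation requirement: if two source pulses land closer together than the destination can resolve, the toggle flips back before the edge is ever seen and a pulse is lost - which is why this is only suitable for relatively-infrequent pulses.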
