Error correction is a phenomenon that occurs in many instances of daily life but often goes unnoticed. Live streaming, browsing the internet, watching a movie on a compact disc (CD), or even storing data in memory - all of these rely on error-correcting codes to function. The key idea in error correction is the addition of redundant bits to the original data to counter the erroneous bits caused by channel noise. Major advancements in this area have come from their usage in communication systems, where large amounts of data need to be sent at high speeds through noisy channels. As we move towards 5G and beyond, the data rates keep increasing, consequently reducing the number of bits available for error correction. Yet, the goal is to design error-correcting codes that can operate at these high data rates while still providing reliable communication. One such code designed to meet the 5G standard specifications is the Quasi-cyclic Low-density Parity-check (QC-LDPC) code.

The family of LDPC codes is widely used for its excellent performance across most signal-to-noise ratio (SNR1) regimes. QC-LDPC codes are particularly favored for their efficient hardware implementation. In the 5G standard, the specifications given for QC-LDPC codes (in terms of base graphs and lifting sizes) need some elaboration to understand how they map onto hardware. I will explain these aspects in this post and, in particular, look at the structure and functioning of the QC-LDPC decoder. You can skip the next section if you are familiar with the basics of LDPC codes.

Coding theory and LDPC basics

G and H matrices

To ensure reliable reception of information bits through a channel, we encode them by adding extra bits in a suitable manner. Given a set of $k$ message (useful information) bits ($\textbf{m}$) to be transmitted, we add $(n-k)$ parity bits, thus generating a codeword ($\textbf{c}$) of $n$ bits which is transmitted across the channel. The codeword $\textbf{c}$ is generated by multiplying the message $\textbf{m}$ with the Generator matrix $\textbf{G}$. Thus the same set of message vectors can map to different sets of codewords depending on the generator matrix used, leading to different possible codes. One can also construct a code where the $k$ message bits are embedded, together and in order, in the codeword $\textbf{c}$. Such a code is called a systematic code. In this case, the remaining $(n-k)$ bits are determined by parity-check equations over the $k$ message bits, thus leading to the name "parity bits".

Upon getting the received vector, we have to decode it to obtain the $k$ transmitted message bits. This is done with the help of a Parity Check matrix $\textbf{H}$, which when multiplied with a valid codeword ($\textbf{c}^{T}$) gives the null vector. Thus all the parity-check conditions defined by the system of equations $\textbf{H}\textbf{c}^{T} = 0$ must be satisfied by every codeword belonging to the code. The $k$ message bits can be retrieved from the codeword using the property of systematic codes1. Due to these properties, the parity check matrix and its structure form the basis of any decoder design. A simple example to illustrate this is a single parity check (SPC) code. As hinted by the name, the code has only 1 parity bit whose value is equal to the sum (XOR2) of all the other $k$ message bits. Its parity check matrix will thus be a row vector of size $(k+1)$ with all unit entries. The below set of equations demonstrates this example. $$ \begin{align*} \textbf{m} &= \begin{bmatrix} 0 & 1 & 0 & 1\\ \end{bmatrix} \\ \\ \textbf{G} &= \begin{bmatrix} 1 & 0 & 0 & 0 & 1\\ 0 & 1 & 0 & 0 & 1\\ 0 & 0 & 1 & 0 & 1\\ 0 & 0 & 0 & 1 & 1\\ \end{bmatrix} = \begin{bmatrix} \textbf{I}_4 & \textbf{1}\\ \end{bmatrix}\\ \textbf{H} &= \begin{bmatrix} 1 & 1 & 1 & 1 & 1\\ \end{bmatrix} = \begin{bmatrix} \textbf{1}^T & \textbf{I}_1\\ \end{bmatrix} \\ \\ \textbf{c} = \textbf{mG} &= \begin{bmatrix} 0 & 1 & 0 & 1 & {\color{red}0}\\ \end{bmatrix} \\ c_5 &= m_1+m_2+m_3+m_4 = {\color{red}0}\\ \\ \textbf{Hc}^T &= 0 \end{align*} $$ The dynamics of obtaining the $\textbf{G}$ and $\textbf{H}$ matrices and the relation between them is beyond the scope of this basic introduction. I recommend reading the book "Channel codes: classical and modern" (reference below) for more details.
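To make the matrix operations concrete, here is a minimal sketch (in Python with numpy, all arithmetic modulo 2) that reproduces the single parity-check example above; the variable names are chosen to mirror the equations:

```python
import numpy as np

# Single parity-check (SPC) code for k = 4 message bits, n = 5 codeword bits.
k = 4
G = np.hstack([np.eye(k, dtype=int), np.ones((k, 1), dtype=int)])  # G = [I_4 | 1]
H = np.ones((1, k + 1), dtype=int)                                 # H = [1 1 1 1 1]

m = np.array([0, 1, 0, 1])     # message vector
c = m @ G % 2                  # codeword: the message bits followed by one parity bit
syndrome = H @ c % 2           # must be the all-zero vector for a valid codeword

print("codeword:", c)          # codeword: [0 1 0 1 0]
print("syndrome:", syndrome)   # syndrome: [0]
```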

Why is coding non-trivial?

The aspect that makes coding a difficult exercise can be attributed more to the reason for doing it than the act of coding itself. Similar to the above construct, consider a message vector $\textbf{m}$ with $k$ bits that is passed directly through the channel without any coding. Upon reception, we know that there will be erroneous bits. Let us start with the simplest of cases - when there is only one erroneous bit whose position is known. The trivial job here is to flip the bit and obtain the correct message. Now let us add the first layer of complication - it is known that only one bit is erroneous, but we are not told which one it is. If we take a random guess at the position and flip the bit, the probability of obtaining the correct message is $1/k$. One can try using an input pattern to increase the chances, but for practical setups we must be prepared for random input scenarios, thus ruling out this option. To further the misery, the number of erroneous bits is practically almost always more than 1, and more importantly unknown in quantity. So imagine looking at a received message vector and not knowing how many bits are erroneous or where they are located. For all we know, any of the $k$ bits we are looking at might be erroneous, and we may have to flip it and pray it works. Sounds precarious! It is, especially if you look at the random selection odds for this setup. If there are $e$ erroneous bits in the received message, then the probability of randomly flipping the right bits to correctly recover the transmitted message vector falls to $1/{k \choose e}$.

This is where the redundant bits provide an extra degree of freedom and allow for pre-transmission processing (encoding). A simple illustration of how this helps localize the errors is: consider a 4-bit message vector and 2 parity bits (a 6-bit codeword). If the first two message bits have an odd number of ones, then the first parity bit (5th codeword bit) is 1, else it is 0. A similar condition is used for the second parity bit (6th codeword bit) based on the 3rd and 4th message bits. In this case, there is no overlap of the message bits between the two parity conditions. Now, upon passing the codeword through the channel and obtaining the received vector $\textbf{r}$, if the 1st, 2nd and 5th bits of $\textbf{r}$ have an odd number of ones, then at least one of them was flipped. A similar check holds for the 3rd, 4th and 6th bits. But as you can see, we are still unable to pinpoint the exact bit position where the error happened. For that, one needs to involve overlapping of message bits in the parity conditions. These conditions can be systematically obtained through the $\textbf{G}$ and $\textbf{H}$ matrices described earlier. One such specific method used for setting up the conditions is described below.
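As a quick, made-up numerical companion to this illustration (same toy construction: two non-overlapping parity checks over a 4-bit message), the sketch below shows that a single bit flip only reveals which group contains the error, not the exact bit:

```python
import numpy as np

# Parity-check matrix of the toy example: the first check covers message bits 1-2
# and parity bit 5, the second check covers message bits 3-4 and parity bit 6.
H = np.array([[1, 1, 0, 0, 1, 0],
              [0, 0, 1, 1, 0, 1]])

c = np.array([1, 0, 1, 1, 1, 0])   # a valid codeword (both checks are satisfied)
r = c.copy()
r[1] ^= 1                          # the channel flips the 2nd codeword bit

syndrome = H @ r % 2
print(syndrome)                    # [1 0] -> the error lies somewhere in {bit 1, bit 2, bit 5},
                                   #          but its exact position is still unknown
```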

LDPC codes

The low-density parity-check (LDPC) codes, originally introduced by Gallager in 1962, have been used for their ability to achieve performance close to capacity3. This is attributed to the construction of LDPC codes resembling the random-like code constructions used in the proof of Shannon's capacity theorem. As the name suggests, these codes have parity-check4 matrices which are sparse - "low-density". Since there is no pre-defined structure to the parity check matrix, we look at each row as a parity check condition that needs to be satisfied. This approach is reasonable since the number of non-zero entries in the matrix is small. Each of these parity check conditions can be equivalently viewed as a single parity-check code acting on the codeword bits corresponding to the non-zero entries of that row. Thus, if a matrix has M rows, we have M single parity-check codes that need to be satisfied. This thought process leads to an intuitive visualization of LDPC codes called the Tanner graph.

A Tanner graph consists of two sets of nodes - check nodes and variable nodes. The check nodes depict the single parity check constraints that need to be satisfied. The variable nodes represent the codeword bits that participate in these parity check constraints. Since each single parity check condition (check node) connects only to codeword bits (variable nodes) and never to another check node, it follows that the Tanner graph is a bipartite graph. Any parity check matrix (not necessarily of an LDPC code) can be represented as a Tanner graph.

Above we see an animation of how an LDPC code is represented on a Tanner graph and how the received bits are decoded upon passing through a binary erasure channel5. The animation describing how to go from the parity-check matrix to the Tanner graph is explained later for QC-LDPC codes. We can observe above that the bit information is collected by each check node (green square nodes) from all the connected variable nodes (purple round nodes) in order to deduce the missing bit information. Thus, information is passed between the check nodes and variable nodes during decoding. Since the scale of the above Tanner graph is small, it is easy to pick the first check node that needs to be solved. But on a bigger scale, when these connections and erasures are random, the decoding process and the hardware design become tricky. To alleviate these concerns, some structure is introduced into the LDPC parity check matrix, and one such modification leads to the QC-LDPC codes.
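For readers who want to see the erasure-decoding idea from the animation in code, below is a minimal peeling-decoder sketch (the parity-check matrix and erasure pattern are made up for illustration). Whenever a check node has exactly one erased neighbour, that bit is recovered as the XOR of the known neighbours, and the process repeats until nothing more can be resolved:

```python
import numpy as np

def peel_bec(H, r):
    """Decode a word received over a binary erasure channel (entries 0, 1, or None
    for erasures) by repeatedly solving check nodes that have a single erased bit."""
    r = list(r)
    progress = True
    while progress and None in r:
        progress = False
        for row in H:                                  # one check node per row of H
            idx = [j for j, h in enumerate(row) if h]  # connected variable nodes
            erased = [j for j in idx if r[j] is None]
            if len(erased) == 1:                       # exactly one unknown bit
                j = erased[0]
                r[j] = sum(r[i] for i in idx if i != j) % 2
                progress = True
    return r

# Toy parity-check matrix and received word with two erasures (illustrative values).
H = np.array([[1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 1]])
r = [1, None, 0, None, 1, 1]
print(peel_bec(H, r))   # [1, 1, 0, 0, 1, 1] -> both erasures are recovered
```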

Usage of LLR

In the previous animation, we saw 1s and 0s as received bits during the decoding process. In practice, we receive signal values that are real-valued - due to modulation and channel noise6. It is not optimal to threshold these values to 0 or 1, as this leads to a loss of information and degradation in performance. Using the received value one can only estimate whether the transmitted bit is more likely to be 0 or 1, and such a scenario is best represented using probabilities. For decoding, we can use either the 'probability of obtaining the received value given that the transmitted bit was 0' or the 'probability of obtaining the received value given that the transmitted bit was 1' (the likelihoods). But probabilities are small numbers and, upon multiplication (during decoding), become even smaller. Thus it is favorable to use the logarithm of these probabilities, as it also converts the multiplications into additions. Another thing to consider is that these likelihoods are not normalized and hence will not add up to 1.

Hence, taking all these factors into consideration, the decoder is fed with the logarithm of the ratio of these probabilities - called the log-likelihood ratio (LLR). Below is the equation which defines the LLR, where $\textbf{r}$ is the received vector and $c_{n}$ is the $n$th bit of the codeword. We use a posteriori probabilities as input to the decoder instead of the likelihoods, as they are better suited to the decoder operations. The prior probabilities are obtained from the received signal value.
$$ LLR \triangleq \log\displaystyle \frac{P(c_{n}=0|\textbf{r})}{P(c_{n}=1|\textbf{r})} $$
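As an example of how such an input LLR might be computed in practice, assume BPSK modulation (bit 0 mapped to +1, bit 1 to -1) over an AWGN channel with noise variance $\sigma^2$ and equally likely bits; this is a common setup, not the only one, and under it the channel LLR reduces to a simple scaling of the received sample:

```python
import numpy as np

def channel_llr_awgn_bpsk(r, sigma2):
    """LLR = log( P(c=0 | r) / P(c=1 | r) ) for BPSK (0 -> +1, 1 -> -1) over AWGN.
    With equally likely bits this simplifies to 2 * r / sigma^2."""
    return 2.0 * np.asarray(r) / sigma2

r = np.array([0.9, -0.2, 1.4, -1.1])          # noisy received samples (illustrative)
print(channel_llr_awgn_bpsk(r, sigma2=0.5))   # [ 3.6 -0.8  5.6 -4.4]
```

A large positive LLR means the bit is very likely 0, a large negative one means it is very likely 1, and values near zero indicate uncertainty.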

  1. If the code is non-systematic, then each message bit is mapped to a particular codeword bit and is retrieved later using the corresponding decoded bit

  2. The XOR operation replaces the sum operation here because we are considering only binary fields, i.e. only bits are involved in all the calculations

  3. Shannon capacity or channel capacity is the tight upper bound on the rate at which information can be reliably transmitted over a communication channel.

  4. Here parity-check means even parity, i.e. sum (XOR) of the bits under consideration should be equal to 0

  5. A simple binary channel where the bit is erased instead of getting flipped when the noise is high

  6. Since real channels are typically modeled as Rayleigh fading or AWGN channels rather than binary erasure channels

Defining parameters of a QC-LDPC code

The aim of using QC-LDPC codes over generic LDPC codes is to reduce the additional hardware required in the encoder and decoder implementations. A welcome consequence of this is the compression in representing the parity-check matrix. This has been exploited extensively by the 5G standard (and previous standards that used QC-LDPC codes), which uses a structure called the prototype matrix (PM) to shrink the parity-check matrix representation. The prototype matrix is a matrix whose entries are integers between $0$ and $Z-1$, along with blank entries (denoted by -1 in this post for easy recognition). The size of the “blown-up” parity check matrix is determined by this parameter $Z$, called the lifting factor. The parity check matrix is obtained by expanding each entry of the prototype matrix into a $Z \times Z$ sparse matrix. This $Z \times Z$ block, called the cyclic shift matrix, is a circularly shifted version of the identity matrix. The amount of circular shift is indicated by the corresponding prototype matrix entry. The blank entries represent $Z \times Z$ null matrices.
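A minimal sketch of this lifting operation is shown below, using a small made-up prototype matrix rather than one from the 5G standard; the shift direction (a right circular shift here) is a convention, so treat it as illustrative:

```python
import numpy as np

def expand_prototype(P, Z):
    """Expand a prototype matrix P into a binary parity-check matrix H.
    An entry s >= 0 becomes the Z x Z identity matrix circularly shifted by s (mod Z),
    and a blank entry (-1) becomes the Z x Z all-zero matrix."""
    rows, cols = P.shape
    H = np.zeros((rows * Z, cols * Z), dtype=int)
    for i in range(rows):
        for j in range(cols):
            if P[i, j] >= 0:
                shift = P[i, j] % Z                          # modulo-Z adaptation of the shift
                block = np.roll(np.eye(Z, dtype=int), shift, axis=1)
                H[i*Z:(i+1)*Z, j*Z:(j+1)*Z] = block
    return H

P = np.array([[1, 0, -1],
              [2, -1, 3]])            # illustrative prototype matrix, not a 5G base graph
H = expand_prototype(P, Z=4)
print(H.shape)                        # (8, 12): each entry became a 4 x 4 block
```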

In the above animation, we observe the effect of the lifting factor on the parity-check matrix. The prototype matrices considered in the animation (and in later ones in the post) are smaller than practical ones in order to make them easier to follow. From the example, we see that an increase in the value of $Z$ leads to the parity-check matrix becoming larger and sparser. It is also evident from the illustration why these codes are named “low density”. Another parameter that affects the size of the parity check matrix is the code rate (CR). It directly controls the number of rows and columns of the prototype matrix, in turn determining the number of parity bits used in the code. The code rate is tuned to improve performance depending on the SNR, channel parameters and application. This post deals with understanding the structure and functioning of the QC-LDPC codes, and thus it suffices to deal with prototype matrices having a fixed code rate.

It should be noted that in the 5G standard rulebook, only two base graphs2 (BG) are specified, with their entries ranging from 0 to the maximum allowable lifting factor ($Z_{max}$). To obtain base graphs for lower values of $Z$3, one needs to perform a ‘modulo $Z$’ operation on each entry of the specified base graph (as shown in the above animation). One of the advantages of employing this lifting factor-based representation is that it allows the hardware to process $Z$ variable and check nodes in parallel. This parallel processing capability becomes more evident when the algorithm flow is observed on the Tanner graph.
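For instance, a single hypothetical base-graph row with shifts specified against $Z_{max}$ would be adapted to smaller lifting sizes as follows (blank entries, marked -1, are left untouched):

```python
row_zmax = [307, -1, 19, 50, -1, 369]     # hypothetical base-graph row defined at Z_max = 384

def adapt_row(row, Z):
    """Apply the 'modulo Z' operation to the non-blank entries of a base-graph row."""
    return [s if s < 0 else s % Z for s in row]

for Z in (384, 64, 13):
    print(Z, adapt_row(row_zmax, Z))
# 384 [307, -1, 19, 50, -1, 369]
# 64  [51, -1, 19, 50, -1, 49]
# 13  [8, -1, 6, 11, -1, 5]
```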

Obtaining the Tanner graph

Before looking at the flow on the Tanner graph, we must first decipher how to convert the given prototype matrix into a Tanner graph. This comprises two steps: 1) expansion of the prototype matrix into the parity check matrix, and 2) establishing the connections between check nodes and variable nodes from the non-zero entries of the parity check matrix. The animation below summarizes these two steps.
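Complementing the animation, step 2 simply reads off, for every row of the expanded parity-check matrix, which variable nodes participate in that check. A minimal, self-contained sketch (the small H below corresponds to expanding the 1 x 2 prototype matrix [[1, 0]] with Z = 3, using the same right-shift convention as the earlier expansion sketch):

```python
import numpy as np

def tanner_edges(H):
    """Return, for each check node (row of H), the list of connected variable nodes."""
    return {check: [var for var, h in enumerate(row) if h]
            for check, row in enumerate(np.asarray(H))}

H = np.array([[0, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 0],
              [1, 0, 0, 0, 0, 1]])
print(tanner_edges(H))   # {0: [1, 3], 1: [2, 4], 2: [0, 5]}
```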

Due to the base graph structure of QC-LDPC codes, the check nodes are connected in specific patterns to the variable nodes, thus allowing for a simplified decoding process as compared to a generic LDPC code with no structure. Though the Tanner graph is not directly used to decode the codeword, it plays a key role in visualizing the decoding process, and the algorithms are structured around this idea.

Decoding algorithms

A simple example was demonstrated in the ‘Coding theory and LDPC basics’ section to show the necessity of information transfer between check nodes and variable nodes. This transfer of information between the check and variable nodes connected by edges in the Tanner graph is called Message Passing. This message passing between the two sets of nodes can be carried out in two ways: 1) the flooding schedule and 2) the layered schedule. The flooding schedule involves all the variable nodes simultaneously transferring information to all their connected check nodes and vice versa. Here, the information passed by a variable node to any two connected check nodes is the same. This process portrays the intuitive idea of message passing on a Tanner graph. In the case of the layered schedule, the set of variable nodes connected to the first check node send information to it and receive information from it, then the set of variable nodes connected to the second check node perform the same operation, and so on until all the check nodes have been processed. It can be seen that the information sent by one particular variable node to two different check nodes is different, as the later check node also benefits from the update produced by the earlier check node. Thus, more information is exchanged in one iteration of the layered schedule as compared to one iteration of the flooding schedule. But this comes at the cost of increased latency per iteration. The animation below demonstrates the message passing for the flooding versus the layered schedule.

For practical purposes such as 5G, the flooding schedule is not preferred as it requires significantly more hardware (to process all the check nodes simultaneously) compared to the layered schedule. In fact, during the layered schedule, to reduce the hardware consumption further, not all the variable nodes transfer information to their check node concurrently. A group of $Z$ variable nodes sends information to $Z$ check nodes as indicated by the prototype matrix entry. Though $Z$ check nodes are being updated simultaneously, it is still a valid layered schedule since they do not have any common variable nodes (due to the one-to-one mapping of the cyclic shift matrix structure). Here we must observe and appreciate the significance of the lifting factor ($Z$) in the structure and its role in improving the hardware processing.

An intuitive way to understand the messages being passed is: each check node collects all the messages from its connected variable nodes. It sends back information that enables the ‘parity-check sum’ (XOR) of its connected variable nodes to become zero. Each variable node, upon receiving the LLRs, performs a majority operation (similar to decoding a repetition code4) on its check node messages. This information is then transferred to another connected check node at a later stage in the iteration. One of the basic decoding algorithms which employs such message passing on QC-LDPC codes is the sum-product algorithm (SPA). It works on the principle of iterative decoding, where a fixed number of iterations of the layered schedule is executed to obtain the final codeword. It employs the MAP framework, where log-likelihood ratios of a posteriori probabilities are exchanged between the variable and check nodes. It should be noted that the information received by these nodes is extrinsic information - information not already available to them.
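In LLR terms, the variable node's 'majority' operation is simply an addition of LLRs, and the extrinsic principle means that the message sent to a given check node excludes what was last received from that same check node. A minimal sketch of one variable node's outgoing messages (the LLR values are made up):

```python
def variable_node_messages(channel_llr, check_msgs):
    """Messages from one variable node to each of its connected check nodes.
    The message towards check node i is the channel LLR plus all incoming check
    messages EXCEPT the one from check node i (extrinsic information only)."""
    total = channel_llr + sum(check_msgs)
    return [total - m for m in check_msgs]

# One variable node connected to three check nodes (illustrative LLR values).
print(variable_node_messages(channel_llr=1.2, check_msgs=[0.8, -2.5, 0.4]))
# approximately [-0.9, 2.4, -0.5]: each outgoing message leaves out that check's own contribution
```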

The mathematical expression for the check node messages in the sum-product algorithm is a complex function and extremely hardware intensive (described later in the ‘Hardware optimizations’ section). This complex function is approximated by a simple minimum operation, giving rise to the min-sum (MS) algorithm. The performance of the min-sum approximation is improved by applying an offset correction, leading to the offset min-sum (OMS) algorithm. This OMS algorithm is widely used in practice.

Algorithm flow with time

The OMS algorithm for 5G employs the layered schedule5 and processes, at a time, the $Z$ variable and check nodes corresponding to a prototype matrix entry in parallel. Each check node gathers information from all the connected variable nodes (going through all the entries in a prototype matrix row) and performs processing to obtain the check node update. These check node updates are then passed on to the variable nodes, which, upon processing, send the variable node updates to the next set of check nodes, and so on. The animation below depicts this flow.

The above illustration shows how the prototype matrix is processed by the hardware. Though the prototype matrix in this example has at most one non-zero element per column, there can be cases with multiple non-zero entries in a column, which lead to certain variable nodes having multiple check node connections. It can be seen above that the algorithm proceeds to the second row of the prototype matrix only after completing all the information transfer in the first row - as expected in the layered schedule. The check nodes, variable nodes, and the message passing are all implemented using circuit components. This, along with the further optimizations performed at the hardware level to improve the throughput and area consumption, is elaborated in the next section. You can skip the next section if you do not want to get tangled in the hardware nuances.

Hardware optimizations

SPA versus OMS

To understand the complexity of the sum-product algorithm implementation, we look at its mathematical representation. We see below the check node update equation (without the exact index 'i' representation for brevity): $$L_{CN} = 2\tanh^{-1}\left(\displaystyle\prod_{i}\tanh\left(\frac{L_{VN_{i}}}{2}\right)\right)$$ Here, $L_{CN}$ and $L_{VN}$ represent the check node and variable node messages respectively. Computing the $\tanh$ for each variable node message received, and taking the $\tanh^{-1}$ of their product, requires a lot of processing circuitry. And this is just for one check node; the situation becomes even more dire when implementing $Z$ check nodes in parallel. Thus we need to approximate this check node update, which is exactly what the min-sum algorithm does. This approximation leads to a degradation in performance, which is compensated to some extent by the OMS algorithm. The check node update for OMS is given below. It has been simplified to two operations, i.e., 1) taking the product of the signs of all the variable node messages and 2) finding the minimum of their magnitudes. The value $c$ is the subtracted offset. $$L_{CN} = \displaystyle\left(\prod_{i}\mathrm{sign}(L_{VN_{i}})\right)\cdot\max\{\,0,\,\min_{i}|L_{VN_{i}}|-c\,\} $$
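A side-by-side software sketch of these two check-node updates (with the extrinsic index handling omitted, matching the simplified equations above, and an illustrative offset value) might look like this:

```python
import numpy as np

def cn_update_spa(vn_msgs):
    """Sum-product check-node update: 2 * atanh( prod_i tanh(L_i / 2) )."""
    return 2.0 * np.arctanh(np.prod(np.tanh(np.asarray(vn_msgs) / 2.0)))

def cn_update_oms(vn_msgs, offset=0.5):
    """Offset min-sum approximation: product of signs times (minimum magnitude - offset)."""
    vn_msgs = np.asarray(vn_msgs, dtype=float)
    sign = np.prod(np.sign(vn_msgs))
    return sign * max(0.0, np.min(np.abs(vn_msgs)) - offset)

msgs = [2.1, -0.9, 3.4]                        # illustrative variable-node messages
print(cn_update_spa(msgs))                     # about -0.64
print(cn_update_oms(msgs))                     # -0.4
```

Even in software the contrast is visible: the SPA update needs transcendental functions per message, while the OMS update only needs comparisons, sign flips and one subtraction, which is why it maps so much better onto hardware.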

Architecture of the decoder

The decoder mainly comprises three units: 1) memories, 2) a cyclic shifter and 3) processing units. The memory units are needed to store the messages passed between the check nodes and variable nodes, and also to store the final LLR. The graph connections shown in the animations are realized in hardware using the cyclic shifters. They pass the information from the memory to the processing units as dictated by the connections. The processing units compute the minimum and the product-of-signs operations on the received messages and send the information back to be updated in the memories.

Early termination

Earlier, we mentioned that the decoder executes a fixed number of iterations ($I$) of the layered schedule to obtain the codeword. That can be 5, 10, 20 or more iterations. But is it necessary that every received codeword undergo the same number of iterations to obtain the decoded output? What if a certain received codeword can be decoded after just two iterations, since it had only one bit flipped with respect to the transmitted codeword? In truth, this is entirely possible, and we would be wasting the remaining $(I-2)$ iterations, in turn reducing the average throughput. But if we are able to recognize that the final LLR at any particular iteration corresponds to a valid codeword, we can stop the decoding process for that received codeword. This process is called early termination. Though this requires additional hardware, it can still reduce the power consumption by reducing the number of cycles the decoder stays ON. There are several techniques, optimized in hardware and performance, to perform this operation. The brute-force way of approaching it is by thresholding the final LLR at each iteration and verifying that its product with the parity check matrix is zero. This is optimal in terms of performance but poor in terms of the extra circuitry needed. This trade-off is exploited to improve the overall decoder design.
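In a software model, the brute-force check described above boils down to a hard decision on the current LLRs followed by a syndrome computation on the expanded parity-check matrix (the H and LLR values below are illustrative):

```python
import numpy as np

def can_terminate(H, llr):
    """Hard-decide the current LLRs and check whether all parity constraints hold.
    A positive LLR maps to bit 0 and a negative LLR to bit 1, per the LLR definition used here."""
    hard = (np.asarray(llr) < 0).astype(int)   # thresholding the final LLR
    return not np.any(H @ hard % 2)            # True if H * c^T = 0, i.e. a valid codeword

H = np.array([[1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 1]])
llr = np.array([-3.1, -2.4, 0.8, 1.7, -2.9, -0.6])   # illustrative decoder state
print(can_terminate(H, llr))                         # True -> decoding can stop early
```

In hardware, this check is what costs the extra circuitry mentioned above, since the syndrome has to be computed alongside the regular decoding datapath.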

Outlook on the decoding process

As seen in the earlier section, during the execution of the OMS algorithm there are a lot of sub-operations happening, especially in the hardware. As a result, it is possible to lose sight of the big picture and focus only on local aspects when implementing the hardware. However, the aim with which we started the decoding process is what the hardware perspective re-emphasizes as well – getting the estimated LLR (and codeword) closer, with every iteration, to the LLR representation of the transmitted codeword. All the message passing happening along the Tanner graph in accordance with the prototype matrix is a collective effort to nudge the estimated LLR in that direction. This is depicted in the animation below, where the variable nodes update the ‘Q-memory’ that stores the final LLR.

The prior probability information required by the decoder is provided by the input/channel LLRs. The transmitted codeword, upon passing through the channel, provides the received signal that is used to calculate these input LLRs. The below animation summarizes all the steps involved in the OMS algorithm (including the input/output LLR).

As the number of iterations grows, the estimated LLR slowly moves away from the input LLR bias and towards satisfying the parity check constraints defined through the Tanner graph. A maximum number of iterations is set, after which the decoder provides the final LLR/codeword. However, the loop can be exited early once it is recognized that all the parity check constraints are met. This method is called early termination and is employed to improve the average throughput.
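To tie all of these steps together, here is a compact software model of the whole loop, assuming layered offset-min-sum decoding performed directly on the expanded parity-check matrix, one check node at a time (a real QC-LDPC decoder would process $Z$ such rows in parallel using cyclic shifters, but the schedule and the early-termination exit are the same):

```python
import numpy as np

def decode_layered_oms(H, llr_in, iterations=10, offset=0.5):
    """Layered offset-min-sum decoding on an expanded parity-check matrix H.
    Processes one check node (row) per layer; returns the hard-decision codeword."""
    H = np.asarray(H)
    q = np.asarray(llr_in, dtype=float).copy()        # running a posteriori LLRs ('Q-memory')
    r = np.zeros(H.shape, dtype=float)                 # stored check-to-variable messages
    neighbours = [np.flatnonzero(row) for row in H]

    for _ in range(iterations):
        for i, idx in enumerate(neighbours):           # one layer = one check node
            t = q[idx] - r[i, idx]                     # extrinsic variable-to-check messages
            for k, j in enumerate(idx):
                others = np.delete(t, k)
                mag = max(0.0, np.min(np.abs(others)) - offset)
                r[i, j] = np.prod(np.sign(others)) * mag
            q[idx] = t + r[i, idx]                     # update the a posteriori LLRs
        hard = (q < 0).astype(int)
        if not np.any(H @ hard % 2):                   # early termination: valid codeword found
            break
    return hard

H = np.array([[1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 1]])
llr = np.array([2.3, -1.9, 0.4, -2.7, 1.8, -2.2])      # illustrative channel LLRs
print(decode_layered_oms(H, llr))                      # [0 1 1 1 0 1]
```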

Conclusion

As we glimpsed in the ‘Hardware optimizations’ section, there is a lot to assimilate before understanding the complete functioning of a QC-LDPC decoder on hardware. Setting that thought aside, is there anything to look beyond once we have a functional OMS QC-LDPC decoder? Of course! There are several optimizations that can be employed to improve the performance of the QC-LDPC code and its decoder. To improve the SNR performance, one can employ decoding algorithms other than the OMS algorithm - such as the Adjusted Min-Sum or Adapted Min-Sum algorithms. To improve the hardware efficiency or throughput, one can choose how to store the message updates or how to effectively traverse the base graph while performing the decoding. A change in the code or scheduling can have ramifications at the hardware level and vice versa. Hence it is important to have a holistic approach to understanding these codes along with their hardware in order to find such optimizations.


The code for the animations can be found here.

Acknowledgments

I want to thank Prof. Andreas Burg and Dr. Hassan Harb for guiding me during my Master’s thesis on this topic which helped me gain these insights. I also want to thank my friend Rajat with whom I had discussions regarding the animations.

References

  1. Robert Gallager. Low-density parity-check codes. IRE Transactions on Information Theory, 8(1):21–28, 1962.
  2. William Ryan and Shu Lin. Channel Codes: Classical and Modern. Cambridge University Press, 2009.
  3. F. A. Newagy, Y. A. Fahmy, and M. M. S. El-Soudani. Designing near Shannon limit LDPC codes using particle swarm optimization algorithm. In 2007 IEEE International Conference on Telecommunications and Malaysia International Conference on Communications, pages 119–123, 2007.
  4. Marc P. C. Fossorier. Quasi-cyclic low-density parity-check codes from circulant permutation matrices. IEEE Transactions on Information Theory, 50(8):1788–1793, 2004.
  5. Christoph Studer, N. Preyss, C. Roth, and A. Burg. Configurable high-throughput decoder architecture for quasi-cyclic LDPC codes. In 2008 42nd Asilomar Conference on Signals, Systems and Computers, pages 1137–1142. IEEE, 2008.
  6. YouTube: Hamming & low density parity check codes.

Footnotes

  1. SNR is a parameter used to determine the quality of the signal with respect to the background noise 

  2. Base graph is the prototype matrix at minimum code rate, i.e. the largest (in shape) possible prototype matrix 

  3. The 5G standard allows only 51 values for $Z$, lying between 2 and 384 ($Z_{max}$). Each of these values can be written as a small integer multiplied by a power of 2. The value of the lifting factor is chosen based on the application. 

  4. In a repetition code, we transmit the same bit multiple times and decode by taking the majority among the received bits. It should be observed that the variable nodes function as repetition decoders while the check nodes function as single parity-check decoders. 

  5. Currently researchers are exploring hybrid layered and flooding OMS algorithms to improve throughput