I, Me, MySelf and CS Problems: May 2011

Wednesday, May 25, 2011

Advance Data Structure B-Tree, B+ Tree, RB Tree, AVL Tree & It Application

Introduction

B-Tree is a balanced m-way tree. This discussion from Wiki is a good material to introduce one to the characteristics and node layout of a B-Tree algorithm: http://en.wikipedia.org/wiki/B-tree.

This article discusses an in-memory B-Tree implementation. Although B-Tree is typically used in file/database system indexing (see my B-Tree disk based version for details @: http://www.4atech.net), there is a significant value in implementing a B-Tree based collection or dictionary in the .NET framework or in any language/platform for that matter.

Advantages of B-Tree In-memory

Typical in-memory sorted dictionary data structures today are based on the Binary Tree algorithm, which is not to be confused with B-Tree. Each node of a Binary Tree can contain a single item, whereas a B-Tree can contain a user defined number of items per node, and its nodes are kept balanced. This is a very important differentiation. Being that each node has a single item, storing a large number of items in a Binary Tree will generate a tall and narrow node graph with numerous nodes.

In a corresponding B-Tree implementation, the graph will tend to be shorter and wider with a lot fewer nodes. This is because a node in a B-Tree is typically configured to have numerous items; e.g., a B-Tree dictionary with 12 items will require a single node to contain the items, and a Binary Tree will require 12 nodes which can be scattered around in the heap (memory). Increasing our item sampling to thousands, hundreds of thousands, or millions, we're talking about a significant differentiation and significant optimization that a corresponding B-Tree based sorted dictionary can provide. A million single item node of a Binary Tree vs. around eighty three thousand of a B-Tree, for a 12 item node setup, and even far less if the user specifies more items per node than mentioned.

With this characterization, it is easy to imagine that a Binary Tree will tend to use more memory, and will tend to generate more memory fragmentation than a respective dictionary utilizing the B-Tree algorithm. And in production environments, we can't afford to have a system that is prone to memory fragmentation as in time, the application will degrade in performance and can cause out of memory conditions due to lack of contiguous allocable space.

More &v Detailed Information

Source
B-Tree http://www.codeproject.com/KB/collections/BTreeSortedDictionary.aspx
B+ Tree http://www.codeproject.com/KB/files/NTFSUndelete.aspx
RB Tree http://www.codeproject.com/KB/recipes/redblackcs.aspx
AVL Tree
1 http://www.codeproject.com/KB/architecture/avl_tree.aspx
2 http://www.codeproject.com/KB/architecture/avl_cpp.aspx
3 http://www.codeproject.com/KB/cpp/avltree.aspx

Saturday, May 21, 2011

CRC Implementation Code in C

CRCs are among the best checksums available to detect and/or correct errors in communications transmissions. Unfortunately, the modulo-2 arithmetic used to compute CRCs doesn't map easily into software. This article shows how to implement an efficient CRC in C.

I'm going to complete my discussion of checksums by showing you how to implement CRCs in software. I'll start with a naïve implementation and gradually improve the efficiency of the code as I go along. However, I'm going to keep the discussion at the level of the C language, so further steps could be taken to improve the efficiency of the final code simply by moving into the assembly language of your particular processor.

For most software engineers, the overwhelmingly confusing thing about CRCs is their implementation. Knowing that all CRC algorithms are simply long division algorithms in disguise doesn't help. Modulo-2 binary division doesn't map particularly well to the instruction sets of off-the-shelf processors. For one thing, generally no registers are available to hold the very long bit sequence that is the numerator. For another, modulo-2 binary division is not the same as ordinary division. So even if your processor has a division instruction, you won't be able to use it.
Modulo-2 binary division

Before writing even one line of code, let's first examine the mechanics of modulo-2 binary division. We'll use the example in Figure 1 to guide us. The number to be divided is the message augmented with zeros at the end. The number of zero bits added to the message is the same as the width of the checksum (what I call c); in this case four bits were added. The divisor is a c+1-bit number known as the generator polynomial.

Modulo-2 Binary Division Example

Figure 1. An example of modulo-2 binary division

The modulo-2 division process is defined as follows:

* Call the uppermost c+1 bits of the message the remainder
* Beginning with the most significant bit in the original message and for each bit position that follows, look at the c+1 bit remainder:
o If the most significant bit of the remainder is a one, the divisor is said to divide into it. If that happens (just as in any other long division) it is necessary to indicate a successful division in the appropriate bit position in the quotient and to compute the new remainder. In the case of modulo-2 binary division, we simply:
+ Set the appropriate bit in the quotient to a one, and
+ XOR the remainder with the divisor and store the result back into the remainder
o Otherwise (if the first bit is not a one):
+ Set the appropriate bit in the quotient to a zero, and
+ XOR the remainder with zero (no effect)
o Left-shift the remainder, shifting in the next bit of the message. The bit that's shifted out will always be a zero, so no information is lost.

The final value of the remainder is the CRC of the given message.

What's most important to notice at this point is that we never use any of the information in the quotient, either during or after computing the CRC. So we won't actually need to track the quotient in our software implementation. Also note here that the result of each XOR with the generator polynomial is a remainder that has zero in its most significant bit. So we never lose any information when the next message bit is shifted into the remainder.
Bit by bit

Listing 1 contains a naive software implementation of the CRC computation just described. It simply attempts to implement that algorithm as it was described above for this one particular generator polynomial. Even though the unnecessary steps have been eliminated, it's extremely inefficient. Multiple C statements (at least the decrement and compare, binary AND, test for zero, and left shift operations) must be executed for each bit in the message. Given that this particular message is only eight bits long, that might not seem too costly. But what if the message contains several hundred bytes, as is typically the case in a real-world application? You don't want to execute dozens of processor opcodes for each byte of input data.

#define POLYNOMIAL 0xD8 /* 11011 followed by 0's */

uint8_t
crcNaive(uint8_t const message)
{
uint8_t remainder;

/*
* Initially, the dividend is the remainder.
*/
remainder = message;

/*
* For each bit position in the message....
*/
for (uint8_t bit = 8; bit > 0; --bit)
{
/*
* If the uppermost bit is a 1...
*/
if (remainder & 0x80)
{
/*
* XOR the previous remainder with the divisor.
*/
remainder ^= POLYNOMIAL;
}

/*
* Shift the next bit of the message into the remainder.
*/
remainder = (remainder << 1);
}

/*
* Return only the relevant bits of the remainder as CRC.
*/
return (remainder >> 4);

} /* crcNaive() */

Listing 1. A naive CRC implementation in C
Code clean up

Before we start making this more efficient, the first thing to do is to clean this naive routine up a bit. In particular, let's start making some assumptions about the applications in which it will most likely be used. First, let's assume that our CRCs are always going to be 8-, 16-, or 32-bit numbers. In other words, that the remainder can be manipulated easily in software. That means that the generator polynomials will be 9, 17, or 33 bits wide, respectively. At first it seems we may be stuck with unnatural sizes and will need special register combinations, but remember these two facts:

* The most significant bit of any generator polynomial is always a one
* The uppermost bit of the XOR result is always zero and promptly shifted out of the remainder

Since we already have the information in the uppermost bit and we don't need it for the XOR, the polynomial can also be stored in an 8-, 16-, or 32-bit register. We can simply discard the most significant bit. The register size that we use will always be equal to the width of the CRC we're calculating.

As long as we're cleaning up the code, we should also recognize that most CRCs are computed over fairly long messages. The entire message can usually be treated as an array of unsigned data bytes. The CRC algorithm should then be iterated over all of the data bytes, as well as the bits within those bytes.

The result of making these two changes is the code shown in Listing 2. This implementation of the CRC calculation is still just as inefficient as the previous one. However, it is far more portable and can be used to compute a number of different CRCs of various widths.

/*
* The width of the CRC calculation and result.
* Modify the typedef for a 16 or 32-bit CRC standard.
*/
typedef uint8_t crc;

#define WIDTH (8 * sizeof(crc))
#define TOPBIT (1 << (WIDTH - 1))

crc
crcSlow(uint8_t const message[], int nBytes)
{
crc remainder = 0;

/*
* Perform modulo-2 division, a byte at a time.
*/
for (int byte = 0; byte < nBytes; ++byte)
{
/*
* Bring the next byte into the remainder.
*/
remainder ^= (message[byte] << (WIDTH - 8));

/*
* Perform modulo-2 division, a bit at a time.
*/
for (uint8_t bit = 8; bit > 0; --bit)
{
/*
* Try to divide the current data bit.
*/
if (remainder & TOPBIT)
{
remainder = (remainder << 1) ^ POLYNOMIAL;
}
else
{
remainder = (remainder << 1);
}
}
}

/*
* The final remainder is the CRC result.
*/
return (remainder);

} /* crcSlow() */

Listing 2. A more portable but still naive CRC implementation
Byte by byte

The most common way to improve the efficiency of the CRC calculation is to throw memory at the problem. For a given input remainder and generator polynomial, the output remainder will always be the same. If you don't believe me, just reread that sentence as "for a given dividend and divisor, the remainder will always be the same." It's true. So it's possible to precompute the output remainder for each of the possible byte-wide input remainders and store the results in a lookup table. That lookup table can then be used to speed up the CRC calculations for a given message. The speedup is realized because the message can now be processed byte by byte, rather than bit by bit.

The code to precompute the output remainders for each possible input byte is shown in Listing 3. The computed remainder for each possible byte-wide dividend is stored in the array crcTable[]. In practice, the crcInit() function could either be called during the target's initialization sequence (thus placing crcTable[] in RAM) or it could be run ahead of time on your development workstation with the results stored in the target device's ROM.

crc crcTable[256];

void
crcInit(void)
{
crc remainder;

/*
* Compute the remainder of each possible dividend.
*/
for (int dividend = 0; dividend < 256; ++dividend)
{
/*
* Start with the dividend followed by zeros.
*/
remainder = dividend << (WIDTH - 8);

/*
* Perform modulo-2 division, a bit at a time.
*/
for (uint8_t bit = 8; bit > 0; --bit)
{
/*
* Try to divide the current data bit.
*/
if (remainder & TOPBIT)
{
remainder = (remainder << 1) ^ POLYNOMIAL;
}
else
{
remainder = (remainder << 1);
}
}

/*
* Store the result into the table.
*/
crcTable[dividend] = remainder;
}

} /* crcInit() */

Listing 3. Computing the CRC lookup table

Of course, whether it is stored in RAM or ROM, a lookup table by itself is not that useful. You'll also need a function to compute the CRC of a given message that is somehow able to make use of the values stored in that table. Without going into all of the mathematical details of why this works, suffice it to say that the previously complicated modulo-2 division can now be implemented as a series of lookups and XORs. (In modulo-2 arithmetic, XOR is both addition and subtraction.)

A function that uses the lookup table contents to compute a CRC more efficiently is shown in Listing 4. The amount of processing to be done for each byte is substantially reduced.

crc
crcFast(uint8_t const message[], int nBytes)
{
uint8_t data;
crc remainder = 0;

/*
* Divide the message by the polynomial, a byte at a time.
*/
for (int byte = 0; byte < nBytes; ++byte)
{
data = message[byte] ^ (remainder >> (WIDTH - 8));
remainder = crcTable[data] ^ (remainder << 8);
}

/*
* The final remainder is the CRC.
*/
return (remainder);

} /* crcFast() */

Listing 4. A more efficient, table-driven, CRC implementation

As you can see from the code in Listing 4, a number of fundamental operations (left and right shifts, XORs, lookups, and so on) still must be performed for each byte even with this lookup table approach. So to see exactly what has been saved (if anything) I compiled both crcSlow() and crcFast() with IAR's C compiler for the PIC family of eight-bit RISC processors. 1 I figured that compiling for such a low-end processor would give us a good worst-case comparison for the numbers of instructions to do these different types of CRC computations. The results of this experiment were as follows:

* crcSlow(): 185 instructions per byte of message data
* crcFast(): 36 instructions per byte of message data

So, at least on one processor family, switching to the lookup table approach results in a more than five-fold performance improvement. That's a pretty substantial gain considering that both implementations were written in C. A bit more could probably be done to improve the execution speed of this algorithm if an engineer with a good understanding of the target processor were assigned to hand-code or tune the assembly code. My somewhat-educated guess is that another two-fold performance improvement might be possible. Actually achieving that is, as they say in textbooks, left as an exercise for the curious reader.
CRC standards and parameters

Now that we've got our basic CRC implementation nailed down, I want to talk about the various types of CRCs that you can compute with it. As I mentioned last month, several mathematically well understood and internationally standardized CRC generator polynomials exist and you should probably choose one of those, rather than risk inventing something weaker.

In addition to the generator polynomial, each of the accepted CRC standards also includes certain other parameters that describe how it should be computed. Table 1 contains the parameters for three of the most popular CRC standards. Two of these parameters are the "initial remainder" and the "final XOR value". The purpose of these two c-bit constants is similar to the final bit inversion step added to the sum-of-bytes checksum algorithm. Each of these parameters helps eliminate one very special, though perhaps not uncommon, class of ordinarily undetectable difference. In effect, they bulletproof an already strong checksum algorithm.

CRC-CCITT

CRC-16

CRC-32
Width

16 bits

16 bits

32 bits
(Truncated) Polynomial

0x1021

0x8005

0x04C11DB7
Initial Remainder

0xFFFF

0x0000

0xFFFFFFFF
Final XOR Value

0x0000

0x0000

0xFFFFFFFF
Reflect Data?

No

Yes

Yes
Reflect Remainder?

No

Yes

Yes
Check Value

0x29B1

0xBB3D

0xCBF43926

Table 1. Computational parameters for popular CRC standards

To see what I mean, consider a message that begins with some number of zero bits. The remainder will never contain anything other than zero until the first one in the message is shifted into it. That's a dangerous situation, since packets beginning with one or more zeros may be completely legitimate and a dropped or added zero would not be noticed by the CRC. (In some applications, even a packet of all zeros may be legitimate!) The simple way to eliminate this weakness is to start with a nonzero remainder. The parameter called initial remainder tells you what value to use for a particular CRC standard. And only one small change is required to the crcSlow() and crcFast() functions:

crc remainder = INITIAL_REMAINDER;

The final XOR value exists for a similar reason. To implement this capability, simply change the value that's returned by crcSlow() and crcFast() as follows:

return (remainder ^ FINAL_XOR_VALUE);

If the final XOR value consists of all ones (as it does in the CRC-32 standard), this extra step will have the same effect as complementing the final remainder. However, implementing it this way allows any possible value to be used in your specific application.

In addition to these two simple parameters, two others exist that impact the actual computation. These are the binary values "reflect data" and "reflect remainder". The basic idea is to reverse the bit ordering of each byte within the message and/or the final remainder. The reason this is sometimes done is that a good number of the hardware CRC implementations operate on the "reflected" bit ordering of bytes that is common with some UARTs. Two slight modifications of the code are required to prepare for these capabilities.

What I've generally done is to implement one function and two macros. This code is shown in Listing 5. The function is responsible for reflecting a given bit pattern. The macros simply call that function in a certain way.

#define REFLECT_DATA(X) ((uint8_t) reflect((X), 8))
#define REFLECT_REMAINDER(X) ((crc) reflect((X), WIDTH))

uint32_t
reflect(uint32_t data, uint8_t nBits)
{
uint32_t reflection = 0;

/*
* Reflect the data about the center bit.
*/
for (uint8_t bit = 0; bit < nBits; ++bit)
{
/*
* If the LSB bit is set, set the reflection of it.
*/
if (data & 0x01)
{
reflection |= (1 << ((nBits - 1) - bit));
}

data = (data >> 1);
}

return (reflection);

} /* reflect() */

Listing 5. Reflection macros and function

By inserting the macro calls at the two points that reflection may need to be done, it is easier to turn reflection on and off. To turn either kind of reflection off, simply redefine the appropriate macro as (X). That way, the unreflected data byte or remainder will be used in the computation, with no overhead cost. Also note that for efficiency reasons, it may be desirable to compute the reflection of all of the 256 possible data bytes in advance and store them in a table, then redefine the REFLECT_DATA() macro to use that lookup table.

Tested, full-featured implementations of both crcSlow() and crcFast() are available for download. These implementations include the reflection capabilities just described and can be used to implement any parameterized CRC formula. Simply change the constants and macros as necessary.

The final parameter that I've included in Table 1 is a "check value" for each CRC standard. This is the CRC result that's expected for the simple ASCII test message "123456789." To test your implementation of a particular standard, simply invoke your CRC computation on that message and check the result:

crcInit();

checksum = crcFast("123456789", 9);

If checksum has the correct value after this call, then you know your implementation is correct. This is a handy way to ensure compatibility between two communicating devices with different CRC implementations or implementors.

Additive CheckSum Algorithm

Whenever you connect two or more computers together with the intent of exchanging information, you assume that the exchange will take place without errors. But what if some of the data is lost or corrupted in transit? Communication protocols usually attempt to detect such errors automatically. To do that they use checksums.

The most important part of listening to someone speak is ensuring that you've heard them correctly. Your brain performs the tasks of error detection and correction for you, automatically. It does this by examining extra bits of information from the speaker and the speech; if a phrase or sentence makes sense as a whole and it makes sense coming from the mouth of the particular speaker, then the individual words were probably heard correctly. The same principle applies when you are reading. But what happens when computers are communicating with one another? How does the receiving computer know if an error has occurred in transmission?

Establishing correctness is more difficult for computers than humans. At the lowest level, communication between computers consists of nothing but a stream of binary digits. Meaning is only assigned to that particular sequence of bits at higher levels. We call that meaningful sequence of bits the message; it is analogous to a spoken or written phrase. If one or more bits within the message are inverted (a logic one becomes a logic zero, or vice versa) as it travels between computers, the receiver has no way to detect the error. No environmental or syntactical context is available to the receiver, since it cannot understand the message in its transmitted form.
Achieving parity

If we want communicating computers to detect and correct transmission errors automatically, we must provide a replacement for context. This usually takes the form of an error correction code or error detection code. A simple type of error detection code that you are probably already familiar with is called a parity bit. A parity bit is a single, extra binary digit that is appended to the message by the sender and transmitted along with it. Depending on the type of parity used, the parity bit ensures that the total number of logic ones sent is even (even parity) or odd (odd parity). For an example of even parity, consider the sequence:

10101110 1

in which the eight-bit message contains five ones and three zeros. The parity bit is one in this case to force the total number of ones in the transmitted data stream to be even.

When a parity bit is appended to a message, one additional bit of data must be sent between the computers. So there must be some benefit associated with the lost bandwidth. The most obvious benefit is that if any single bit in the message is inverted during transmission, the number of ones will either increase or decrease by one, thus making the received number of ones odd and the parity incorrect. The receiver can now detect such an error automatically and request a retransmission of the entire stream of bits. Interestingly, the receiver can also now detect any odd number of bit inversions. (Note that all of this still applies even when the parity bit is one of the inverted bits.)

However, as you may have already noticed, if two bits are inverted (two ones become zeros, for example), the parity of the stream of bits will not change; a receiver will not be able to detect such errors. In fact, the same is true for any even number of bit inversions. A parity bit is a weak form of error detection code. It has a small cost (one bit per message), but it is unable to detect many types of possible errors. (By the way, odd parity has the same costs, benefits, and weaknesses as even parity.)

Perhaps this is not acceptable for your application. An alternative that might occur to you is to send the entire message twice. If you're already spending bandwidth sending an error detection code, why not spend half of the bandwidth? The problem is that you could actually be spending much more than half of the available bandwidth. If a computer receives two copies of a message and the bits that comprise them aren't exactly the same, it cannot tell which one of the two is correct. In fact, it may be that both copies contain errors. So the receiver must request a retransmission whenever there is a discrepancy between the message and the copy sent as an error detection code.

With that in mind, let's compare the bandwidth costs of using a parity bit vs. resending the entire message. We'll assume that all messages are 1,000-bits long and that the communications channel has a bit error rate (average number of inverted bits) of one per 10,000 bits sent. Using a parity bit, we'd spend 0.1% of our bandwidth sending the error detection code (one bit of error protection for 1,000 bits of message) and have to retransmit one out of every 10 messages due to errors. If we send two complete copies of each message instead, the smallest unit of transmission is 2,000 bits (50% of our bandwidth is now spent sending the error detection code). In addition, we'll have to retransmit one out of every five messages. Therefore, the achievable bandwidths are approximately 90% and 40%, respectively. As the width of the code increases, the message plus code lengthens and becomes more vulnerable to bit errors and, as a result, expensive retransmissions.

From this type of analysis it should be clear that keeping the size of an error detection code as small as possible is a good thing, even if it does mean that some types of errors will be undetectable. (Note that even sending two copies of the message is not a perfect solution. If the same bit errors should happen to occur in both copies, the errors will not be detected by the receiver.) In addition to reducing the bandwidth cost of transmitting the error detection code itself, this also increases the message throughput. But we still don't want to go so far as using a one-bit error detection scheme like parity. That would let too many possible errors escape detection.

In practice, of course, we can achieve far better error detection capabilities than just odd numbers of inverted bits. But in order to do so we have to use error detection codes that are more complex than simple parity, and also contain more bits. I'll spend the remainder of this column and most of the next two discussing the strengths and weaknesses of various types of checksums, showing you how to compute them, and explaining how each can be best employed for purposes of error detection.
Checksums

As the name implies, a checksum usually involves a summation of one sort or another. For example, one of the most common checksum algorithms involves treating the message like a sequence of bytes and summing them. Listing 1 shows how this simple algorithm might be implemented in C.

uint8_t
Sum(uint8_t const message[], int nBytes)
{
uint8_t sum = 0;

while (nBytes-- > 0)
{
sum += *(message++);
}

return (sum);

} /* Sum() */

Listing 1. A sum-of-bytes checksum

The sum-of-bytes algorithm is straightforward to compute. However, understanding its strengths and weaknesses as a checksum is more difficult. What types of errors does it detect? What errors is it unable to detect? These are important factors to consider whenever you are selecting a checksum algorithm for a particular application. You want the algorithm you choose to be well matched to the types of errors you expect to occur in your transmission environment. For example, a parity bit would be a poor choice for a checksum if bit errors will frequently occur in pairs.

A noteworthy weakness of the sum-of-bytes algorithm is that no error will be detected if the entire message and data are somehow received as a string of all zeros. (A message of all zeros is a possibility, and the sum of a large block of zeros is still zero.) The simplest way to overcome this weakness is to add a final step to the checksum algorithm: invert the bits of the final sum. The new proper checksum for a message of all zeros would be FFh. That way, if the message and checksum are both all zeros, the receiver will know that something's gone terribly wrong. A modified version of the checksum implementation is shown in Listing 2.

uint8_t
SumAndInvert(uint8_t const message[], int nBytes)
{
uint8_t sum = 0;

while (nBytes-- > 0)
{
sum += *(message++);
}

return (~sum);

} /* SumAndInvert() */

Listing 2. A slightly more robust sum-of-bytes checksum algorithm

This final inversion does not affect any of the other error detection capabilities of this checksum, so let's go back to discussing the basic sum-of-bytes algorithm in Listing 1. First, it should be obvious that any single bit inversion in the message or checksum will be detectable. Such an error will always affect at least one bit within the checksum. (It could affect more, of course, but not less.) Observe that the sum-of-bytes is performed by essentially lining up all of the bytes that comprise the message and performing addition on them, like this:

10110101
11000000
00001101
+ 11100011
----------
01100101

Because of this mathematical arrangement, simultaneous single-bit inversions could occur in each of any number of the "columns." At least one bit of the checksum will always be affected. No matter how the inversions occur, at least the lowest-order column with an error will alter the checksum. This is an important point, because it helps us to understand the broader class of errors that can be detected by this type of checksum. I'll say it again: no matter how the inversions occur, at least the lowest order column with an error will alter the checksum, which means that two or more bit inversions in higher-order columns may cancel each other out. As long as the lowest-order column with an error has only one, it doesn't matter what other errors may be hidden within the message.

Now let's step back for a minute and talk about errors more formally. What exactly does a transmission error look like? Well, the first and most obvious type of error is a single bit that is inverted. That happens sometimes and is easy to detect (even a parity bit will do the job). Other times, bit inversions are part of an error burst. Not all of the bits within an error burst must be inverted. The defining characteristic is simply an inverted bit on each end. So an error burst may be 200 bits long, but contain just two inverted bits--one at each end.

A sum-of-bytes checksum will detect the vast majority of error bursts, no matter what their length. However, describing exactly which ones is generally difficult. Only one class of error bursts is always detectable: those of length eight bits or less. As long as the two inverted bits that bound the error burst have no more than six bits between them, the error will always be detected by this algorithm. That's because our previous requirement for error detection--that the lowest-order column with an error have only one error--is always met when the length of the error burst is no longer than the width of the checksum. We can know with certainty that such errors will always be detected.

Of course, many longer error bursts will also be detected. The probability of detecting a random error burst longer than eight bits is 99.6%. Errors will only be overlooked if the modified message has exactly the same sum as the original one, for which there is a 2-8 chance. That's much better error detection than a simple parity bit, and for not too much more cost.

The sum-of-bytes algorithm becomes even more powerful when the width of the checksum is increased. In other words, increasing the number of bits in the checksum causes it to detect even more types of errors. A 16-bit sum-of-words checksum will detect all single bit errors and all error bursts of length 16 bits or fewer. It will also detect 99.998% of longer error bursts. A 32-bit sum will detect even more errors. In practice, this increase in performance must be weighed against the increased cost of sending the extra checksum bits as part of every exchange of data.
Internet checksum

Many protocol stacks include some sort of a checksum within each protocol layer. The TCP/IP suite of protocols is no exception in this regard. In addition to a checksum at the lowest layer (within Ethernet packets, for example), checksums also exist within each IP, UDP, and TCP header. Figure 1 shows what some of these headers look like in the case of some data sent via UDP/IP. Here the fields of the IP header are summed to generate the 16-bit IP checksum and the data, fields of the UDP header, and certain fields of the IP header are summed to generate the 16-bit UDP checksum.

Figure 1. UDP and IP headers with checksum fields highlighted

A function that calculates the IP header checksum is shown in Listing 3. This function can be used before sending an IP packet or immediately after receiving one. If the packet is about to be sent, the checksum field should be set to zero before calculating the checksum and filled with the returned value afterward. If the packet has just been received, the checksum routine is expected to return a value of 0xFFFF to indicate that the IP header was received correctly. This result is a property of the type of addition used to compute the IP checksum.

uint16_t
NetIpChecksum(uint16_t const ipHeader[], int nWords)
{
uint16_t sum = 0;

/*
* IP headers always contain an even number of bytes.
*/
while (nWords-- > 0)
{
sum += *(ipHeader++);
}

/*
* Use carries to compute 1's complement sum.
*/
sum = (sum >> 16) + (sum & 0xFFFF);
sum += sum >> 16;

/*
* Return the inverted 16-bit result.
*/
return ((unsigned short) ~sum);

} /* NetIpChecksum() */

Listing 3. IP header checksum calculation

The IP header to be checksummed should be passed to NetIpChecksum() as an array of 16-bit words. Since the length of an IP header is always a multiple of four bytes, it is sufficient to provide the header length as a number of 16-bit words. The checksum is then computed by the function and returned to the caller for insertion into the header for validation of the header contents.

When you first look at this function, you may be overcome with a feeling of deja vu. The IP checksum algorithm begins in much the same way as the sum-of-bytes algorithm I discussed earlier. However, this algorithm is actually different. First, of course, we're computing a 16-bit checksum instead of an eight-bit one, so we're summing words rather than bytes. That difference is obvious. Less obvious is that we're actually computing a ones complement sum.

Recall that most computers store integers in a twos complement representation. Simply adding two integers, as we did in the previous algorithm, will therefore result in a twos complement sum. In order to compute the ones complement sum, we need to perform our addition with "end around carry." This means that carries out of the most significant bit (MSB) are not discarded, as they were previously. Instead, carries are added back into the checksum at the least significant bit (LSB). This could be done after each addition, but testing for a carry after each addition is expensive in C. A faster way is to let the carries accumulate in the upper half of a 32-bit accumulator. Once the sum-of-words is complete, the upper half of the 32-bit accumulator is turned into a 16-bit value and added to the 16-bit twos complement sum (the lower half). One final carry is possible at that point, and must be included in the final sum. The IP checksum is the inverted ones complement sum of all of the words in the IP header.

For checksum purposes, ones complement arithmetic has an important advantage over twos complement arithmetic. Recall that the biggest weakness of a parity bit is that it can't detect a pair of bit errors or any even number of errors, for that matter. A twos complement sum suffers from a similar weakness. Only one of the bits in the sum (the MSB) is affected by changes in the most significant of the 16 columns. If an even number of bit errors occurs within that column, the checksum will appear to be correct and the error will not be detected. A ones complement sum does not suffer this weakness.

Because carries out of the MSB are added back into the LSB during a ones complement sum, errors in any one column of the data will be reflected in more than one bit of the checksum. So a ones complement sum is a stronger checksum (for example, will detect more errors) than a twos complement sum, and only slightly more expensive. Hence, it is chosen for use in a lot of different situations.

The checksums within the UDP and TCP headers are computed in exactly the same way as the IP checksum. In fact, the only major difference is the set of words over which the sum is calculated. (A minor difference is that UDP checksums are optional.) In both cases, these layer 4 protocols include the message, their own header fields, and a few important fields of the IP header in the checksum calculation. The inclusion of some IP header fields in the UDP and TCP checksums is one of the biggest reasons that these layers of the protocol stack cannot be completely separated from the IP layer in a software implementation. The reason that IP checksums include only the IP header is that many intermediate computers must validate, read, and modify the contents of the IP header as a packet travels from its source to the destination. By reducing the number of words involved in the calculation, the speed with which all IP packets move across the Internet is improved. Including a larger set of words in the UDP and TCP checksums does not slow down communications; rather, it increases the likelihood that the message received is the same as the message sent.

Source http://www.netrino.com/Embedded-Systems/How-To/Additive-Checksums