MD5 optimized for AMD64

Marc Bevand
<m (at) zorinaq.com>
November 27, 2004

Download

md5-amd64.html - This article
md5-amd64.tar.bz2 - This article + MD5 implementation

Updates

2005-02-15: Gilles Vollant has ported md5-amd64 to ML64 syntax.
2005-06-25: OpenSSL 0.9.8 (to be released in a few days) includes md5-amd64, thanks to Andy Polyakov who incorporated md5-amd64 source code into their source tree.
2005-07-05: Charles Liu found a new way to optimize G() in round 2. The global speed of md5-amd64 has been increased by 6% (now running at 356 MB/s instead of 336 MB/s on an Opteron 244).
2009-11-06: Arkadiusz Miskiewicz reported to me that GNU binutils 2.21 made 'as' more strict by erroring out when displacement values can't be represented as 32-bit signed integers. This broke md5-amd64 which used instructions like "lea 0xd76aa478(%eax,%r10d),%eax" instead of the more correct "lea -680876936(%eax,%r10d),%eax". I fixed md5-amd64.
2010-05-02: Updated email and URL.

Introduction

MD5 is a message-digest algorithm taking a message as input and producing a 128-bit digest. This article presents md5-amd64, a 64-bit assembly language implementation of this algorithm optimized for AMD64 processors from AMD (Opteron, Athlon 64 and Athlon 64 FX). Obviously, the code also runs on EM64T processors from Intel since they contain cloned support for the AMD64 extensions (Nocona Xeon and recent Pentium 4 models).

As a completely unrelated note, in August 2004, some researchers have found a way to generate full collisions for MD5. Therefore, if you were planning to use it, it may be better to choose another hash function.

Overview

This section gives the reader a quick overview on the main operations executed by MD5. For a complete description, see RFC 1321. Here are the main features of MD5:

The input message is interpreted as a sequence of 32-bit little-endian words.
A 4-word buffer is used (A, B, C, D).
64 constants, constructed from the sine function, are used (T_i, where 1 <= i <= 64).
MD5 defines the rotate operator <<< (<<< s means left-rotate s bits).
MD5 defines 4 functions, each taking 3 words as arguments:
F(x,y,z) = (x AND y) OR (NOT(x) AND z)
G(x,y,z) = (x AND z) OR (y AND NOT(z))
H(x,y,z) = x XOR y XOR z
I(x,y,z) = y XOR (x OR NOT(z))

MD5 processes the input message in 16-word blocks (let X[k] denote words of the current block being processed, where 0 <= k <= 15). For each 16-word (64 bytes) block, MD5 executes 4 rounds of 16 steps (that is, a total of 64 steps). Each step looks like:

t = x + ((t + func(x,y,z) + X[k] + T_i) <<< s)

  x,y,z,t is a permutation of A,B,C,D
  k,i,s are fixed for each step
  func() is F(), G(), H(), I() (depending on the round: 1, 2, 3 or 4)

Finally, after some operations not described in this article, the 128-bit message-digest is obtained by concatenating a representation of A, B, C, D.

Performance Considerations

This section lists the various aspects of MD5 that have been analyzed in order to achieve maximum performance.

Endianness

When implementing MD5 for big-endian architectures, bytes of words have to be reversed since the algorithm interprets the input message as a sequence of 32-bit little-endian words. Fortunately the AMD64 architecture is little-endian, so no reordering is needed. md5-amd64 uses the simple mov instruction to load 32-bit words from memory.

64-bit registers

MD5 has been designed to be fast on 32-bit machine, it does all its operations on 32-bit integers. As a consequence, 64-bit registers are not useful, and are never used by md5-amd64.

Memory and Cache Latency

In order to hide memory/cache latencies when reading data from memory, each word is preloaded in a register before it is actually needed. Moreover the hardware prefetcher have shown to be very efficient, so no prefetch* instructions are used.

Number of operations in round 1

Round 1 uses the function F() which involves 4 operations (1 NOT, 2 AND and 1 OR):

    F(x,y,z) = (x AND y) OR (NOT(x) AND z)

AMD64 defines a MMX/SSE2 instruction performing a logical bitwise AND NOT (or ANDN) between its 2 operands: pandn. Moveover such an instruction would make possible to do 2 operations in parallel (AND and ANDN), followed by 1 operation (OR) acting on the results of the 2 previous operations:

    F(x,y,z) = (x AND y) OR (z ANDN x)

Unfortunately, pandn operates on MMX or XMM registers, this would require to either move values between integer and MMX/XMM registers (as the rest of the MD5 implementation would use integer registers), or implement MD5 using only MMX/XMM registers. Both solutions are inefficient because moving values between integer and MMX/XMM registers is a high-latency operation (with current AMD64 processors), and using only MMX/XMM registers would require exclusive use of the MMX/SSE/SSE2 instruction set, which is not suitable to implement MD5.

Finally, the well-known equivalent form of F() that uses only 3 operations (2 XOR and 1 AND) has been chosen:

    F(x,y,z) = ((y XOR z) AND x) XOR z

Parallelism of operations in round 2

Thanks to Charles Liu for the following idea about how to optimize G().

G() in round 2 is similar to F() in round 1. Indeed the function is defined as:

    G(x,y,z) = (x AND z) OR (y AND NOT(z))

And it possesses an equivalent 3-operation form:

    G(x,y,z) = ((x XOR y) AND z) XOR y

Howevere there is a fundamental difference with F() that makes the 4-operation form of G() more attractive: one of the OR's operands is (y AND NOT(z)), and the values of both y and z are known well in advance (actually their values are computed at least 2 steps earlier). So the processor can easily parallelize the computation of this operand while computing the other operand (x AND z). With F() the processor cannot achieve this level of parallelism because both OR's operands use x: (x AND y) OR (NOT(x) AND z) and x is actually computed during the step just before the current one (the fact that the value is not known well in advance makes the processor unable to do proper parallelization).

To sum up the situation, G() is faster when using the 4-operation form because of parallelization possibilities.

Precomputing `NOT(z)` in round 4

In round 4, I() computes NOT(z). It appeared that precomputing this value in each previous step improves performance by a non-negligible factor, because the operation is executed in parallel with the previous step.

At first sight, it may seem that precomputing values is a method that could be used in all rounds, but the fact is that each step is very dependent on the previous one. So this method is either impossible to do, or does not improve performance at all.

As a general remark, superscalar architectures are of almost no help for MD5, RFC 1810 already explained it: MD5 mostly consists of serial computations, therefore not many operations can execute in parallel.

Instruction ordering

While developing md5-amd64, assembly instructions have been ordered with extreme care. This is indeed an important task because it greatly affects overall performance.

Resulting Implementation

md5-amd64 can be obtained from the download section (see links at the top of this article), it is distributed as source code in the public domain. Please note that md5-amd64 has been designed to be easily integrated into OpenSSL: a single md5_block_asm_host_order() function is exported, other routines from OpenSSL are needed in order to use it as a full independent MD5 implementation.

Here is a table comparing 3 implementations of MD5:

MD5 throughput on an Opteron 244 (1.8 GHz) processor
Implementation	Throughput (MB/s)
md5-amd64 64-bit (AMD64 asm language)	356
OpenSSL MD5 32-bit (i386 asm language)	312
OpenSSL MD5 64-bit (C language)	204

md5-amd64 64-bit was benchmarked using the md5speed command (provided by the md5-amd64 tarball). The test consists of calling md5_block_asm_host_order() for 3 secs on the same 8192-byte data block. A small data block has been chosen so that it fits in the L1 cache, because we want to minimize interaction with the relatively high-latency RAM subsystem.

OpenSSL MD5 32-bit was benchmarked using the openssl speed md5 command. The result from the 8192-byte block test has been selected for a fair and equitable comparison with md5-amd64.

OpenSSL MD5 64-bit was benchmarked using the openssl speed md5 command and the md5speed command (provided by the md5-amd64 tarball). In both cases the throughput is equivalent.

All benchmarks used OpenSSL version 0.9.7d, compiled with GCC 3.4.2 with options -march=opteron -O3. As of today, this is the best way to obain high performance for the OpenSSL C language implementation of MD5.

Conclusion

On one hand, md5-amd64 is only 14% faster than the i386 assembly language MD5 implementation of OpenSSL (356 vs. 312 MB/s). But on the other hand, md5-amd64 is 75% faster than the OpenSSL C language implementation of MD5 compiled for AMD64 (356 vs. 204 MB/s).

In the first case, the small 14% improvement was easily predictable: because of the design of MD5, 64-bit processors are of almost no help for MD5, and modern superscalar architectures cannot be fully exploited by MD5 (md5-amd64 achieves only 1.8 instruction per clock on the Opteron processor, which represents only a 60% efficiency on this three-way superscalar processor).

In the second case, the nice 75% improvement confirms that hand-coded assembler can sometimes outperform what compilers can do by a great margin.

Thanks

The author would like to thank those people for their comments about the article, in alphabetical order:

Goswin Brederlow
Falk Hueffner
Charles Liu <vaccha (at) yahoo.com.tw>
Jim Lynch <jim (at) jam.sessionsnet.org>