RC4 implementation for AMD64

Marc Bevand
<m (at) zorinaq.com>
November 1, 2004


Download

rc4-amd64.html - Latest version of this paper
rc4-amd64.tar.bz2 - RC4 implementation for AMD64


Eric Young, from RSA Security, informed me they optimized RC4 for AMD64 a long time ago. He estimates their implementation "scales to 331.4e6 bytes/sec" (316 MB/s) on a 1.8 GHz AMD64 processor, which is comparable to rc4-amd64's results (319 MB/s).
An AMD64 assembly version of RC4 had in fact already been committed to the OpenSSL CVS repository in July 2004; I was not aware of this when developing rc4-amd64. Andy Polyakov, from OpenSSL, has since updated their assembly version based on some ideas presented in this paper, which boosted its performance by a factor of about 1.5.
Some people reported to me that rc4-amd64 runs at 354 MB/s on a 2.0 GHz Opteron x46 processor, and 393 MB/s on a 2.2 GHz Opteron x48. This means rc4-amd64 theoretically scales to 461 MB/s on a 2.6 GHz FX-55 processor (fastest AMD64 processor currently sold by AMD).
Jose Castejon-Amenedo, from HP, pointed out to me that "HP's RC4 implementation for IA64 delivers 381 MB/s - on a 1.3 GHz Madison." Theoretically, this code scales to 469 MB/s on a 1.6 GHz Itanium 2 (fastest IA64 processor currently sold by Intel). Because of this, rc4-amd64 is not the "world's fastest RC4 implementation running on general purpose CPUs", as was claimed in the conclusion.
Dean Gaudet advised me to replace the ror instruction with a combination of shl+bswap. This is indeed about 1% faster, and the implementation has been updated (rc4-amd64 now runs at 322 MB/s on a 1.8 GHz processor, instead of 319 MB/s).
Andy Polyakov has updated OpenSSL's i386 RC4 implementation, so the performance gap between AMD64 and i386 has narrowed.
Mozilla developers are currently working to incorporate rc4-amd64 into their code base.
Updated email and URL.
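
The ror versus shl+bswap substitution mentioned in the updates above can be illustrated with a small C sketch. The paper does not show the code surrounding the ror instruction, so the sketch rests on an assumption: that the instruction is used to pack eight output bytes, one at a time, into a single 64-bit register. The function names are hypothetical, and the helpers only model what the instructions do to a 64-bit value. Under that assumption, both variants build the same word:

```c
#include <stdint.h>

/* Models "ror" by 8 on a 64-bit register: rotate right by one byte. */
uint64_t rotr8(uint64_t v)
{
    return (v >> 8) | (v << 56);
}

/* Models "bswap" on a 64-bit register: reverse the order of the 8 bytes. */
uint64_t swap_bytes64(uint64_t v)
{
    uint64_t r = 0;
    int i;
    for (i = 0; i < 8; i++) {
        r = (r << 8) | (v & 0xff);
        v >>= 8;
    }
    return r;
}

/* Variant A (assumed original): write each byte into the low 8 bits,
 * then rotate right by 8. After 8 steps, b[0] sits in the lowest byte. */
uint64_t pack_with_ror(const unsigned char b[8])
{
    uint64_t w = 0;
    int i;
    for (i = 0; i < 8; i++) {
        w = (w & ~(uint64_t)0xff) | b[i];
        w = rotr8(w);
    }
    return w;
}

/* Variant B (assumed replacement): shift left by 8 before each byte,
 * then apply a single byte swap at the end. */
uint64_t pack_with_shl_bswap(const unsigned char b[8])
{
    uint64_t w = 0;
    int i;
    for (i = 0; i < 8; i++)
        w = (w << 8) | b[i];
    return swap_bytes64(w);
}
```

For the bytes 1 through 8, both functions return 0x0807060504030201, that is, the bytes in little-endian order. Whether this matches the structure of the real rc4-amd64 loop is not confirmed by the text.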


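Introduction

RC4, or Arcfour, is one of the fastest stream ciphers available. Unfortunately none of the current implementations of this cipher achieve really good speeds on AMD64 processors. This is true when such processors run in 32-bit mode as well as 64-bit mode. The reason for this lack of speed depends on the mode: in OpenSSL's case, 32-bit mode runs an optimized i386 assembly language version, while 64-bit mode falls back to a generic C language version.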

This unhappy situation is partly due to the fact that the AMD64 architecture is very recent: AMD sold its first AMD64 processors in 2003, while i386 processors were introduced in 1986. Generally speaking, most CPU-hungry i386 applications are already very well optimized, while much of the work remains to be done for AMD64. This paper demonstrates that large speedups are possible by writing AMD64 assembly code: the throughput of the best currently known RC4 implementations has been surpassed by a factor of two.

Current implementations

OpenSSL offers some of the fastest implementations of RC4. Here is the throughput it can reach on recent i386 & AMD64 processors:

OpenSSL 0.9.7d (openssl speed rc4)
Processor                       Throughput (MB/s)
Athlon XP 3000+ 2.2 GHz         198
Opteron 244 1.8 GHz (32-bit)    163
Opteron 244 1.8 GHz (64-bit)    135
Pentium 4 3.0 GHz               120

All of these results were measured in 32-bit mode, except for the Opteron row labeled 64-bit. As we can see, the Opteron is quite slow in 64-bit mode (135 MB/s): it is just a little faster than the Pentium 4 (120 MB/s; Pentium 4 processors are known to be particularly slow in this benchmark). The Opteron is faster in 32-bit mode (163 MB/s), but not by much. As explained in the introduction, this 32-bit/64-bit difference arises because OpenSSL provides two different RC4 implementations: one is an optimized i386 assembly language version, and the other is a generic C language version.

Note: OpenSSL version 0.9.7d has been used on each processor. The GCC version used to compile OpenSSL really matters only for the Opteron in 64-bit mode because it is the only case where C code is being evaluated. This benchmark used GCC 3.4.2 with the options -march=opteron -O3 (basically the best that is available nowadays). Another important point to consider when compiling the C version of RC4 for OpenSSL is how the following macros are defined. This benchmark used -DRC4_INT='unsigned char' -DRC4_CHUNK='unsigned long' (the default values defined in opensslconf.h).
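
To make concrete what a "generic C language version" of RC4 looks like, here is a minimal sketch of the textbook algorithm. It is not OpenSSL's code: the names rc4_state, rc4_init and rc4_crypt are hypothetical, the RC4_INT typedef only mirrors the 'unsigned char' setting from the note above, and the sketch processes one byte at a time, so it does not reproduce whatever word-sized processing RC4_CHUNK enables in OpenSSL.

```c
#include <stddef.h>

/* Element type of the state table; mirrors -DRC4_INT='unsigned char'. */
typedef unsigned char RC4_INT;

/* Hypothetical state layout for this sketch (not OpenSSL's RC4_KEY). */
typedef struct {
    RC4_INT x, y;      /* the two running indices */
    RC4_INT s[256];    /* permutation of 0..255 */
} rc4_state;

/* Key scheduling: start from the identity permutation and mix in the key.
 * keylen must be between 1 and 256. */
void rc4_init(rc4_state *st, const unsigned char *key, size_t keylen)
{
    unsigned int i, j = 0;
    for (i = 0; i < 256; i++)
        st->s[i] = (RC4_INT)i;
    for (i = 0; i < 256; i++) {
        RC4_INT t = st->s[i];
        j = (j + t + key[i % keylen]) & 0xff;
        st->s[i] = st->s[j];
        st->s[j] = t;
    }
    st->x = 0;
    st->y = 0;
}

/* Keystream generation: one swap and one table lookup per byte, with the
 * keystream byte XORed into the data. Encryption and decryption are the
 * same operation. */
void rc4_crypt(rc4_state *st, const unsigned char *in,
               unsigned char *out, size_t len)
{
    unsigned int x = st->x, y = st->y;
    size_t n;
    for (n = 0; n < len; n++) {
        RC4_INT a, b;
        x = (x + 1) & 0xff;
        a = st->s[x];
        y = (y + a) & 0xff;
        b = st->s[y];
        st->s[x] = b;          /* swap s[x] and s[y] */
        st->s[y] = a;
        out[n] = in[n] ^ st->s[(a + b) & 0xff];
    }
    st->x = (RC4_INT)x;
    st->y = (RC4_INT)y;
}
```

With the key "Key" and the input "Plaintext", this sketch yields the bytes BB F3 16 E8 D9 40 AF 0A D3, a widely published RC4 test vector. Each output byte depends on the table updates made for the previous byte, which is why the inner loop is hard for a compiler to speed up.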

New implementation

The speeds presented above are unacceptable: AMD64 processors are capable of much faster RC4 throughput. That's why I decided to write an AMD64 assembly language version of RC4: rc4-amd64. It can be obtained from the "Download" section at the top of this paper. Please note that I have placed this implementation in the public domain, so that it can be useful to the maximum number of people. Here is the throughput of rc4-amd64:

rc4-amd64 (rc4speed, provided with rc4-amd64)
Processor                       Throughput (MB/s)
Opteron 244 1.8 GHz (64-bit)    322

In 64-bit mode, we now get 322 MB/s with rc4-amd64, instead of 135 MB/s with OpenSSL: rc4-amd64 is more than 2.3 times as fast as OpenSSL (a gain of more than 130%). And, believe it or not, rc4-amd64 was a weekend hack: I did not spend more than 10 hours working on it. I hope this shows what is possible when code is optimized for the AMD64 architecture.

Note: obviously, rc4-amd64 cannot be tested in 32-bit mode, since it has been designed for 64-bit mode.
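
The figures quoted in this paper can be cross-checked with a few lines of arithmetic. The sketch below uses hypothetical helper names and makes two assumptions: that 1 MB means 2^20 bytes (consistent with 331.4e6 bytes/sec being quoted as 316 MB/s), and that throughput scales linearly with clock frequency, which is the same simplification behind the "theoretically scales to" estimates.

```c
/* Assumption: 1 MB = 2^20 bytes, as implied by 331.4e6 bytes/s ~ 316 MB/s. */
#define BYTES_PER_MB 1048576.0

/* Ratio between two throughputs measured on the same machine. */
double speedup(double new_mbps, double old_mbps)
{
    return new_mbps / old_mbps;
}

/* Average number of clock cycles spent per byte of data processed. */
double cycles_per_byte(double clock_ghz, double mbps)
{
    return (clock_ghz * 1e9) / (mbps * BYTES_PER_MB);
}

/* Linear extrapolation of a measured throughput to another clock speed. */
double scale_to_clock(double mbps, double from_ghz, double to_ghz)
{
    return mbps * to_ghz / from_ghz;
}
```

With the numbers from the tables, speedup(322, 135) is about 2.39, cycles_per_byte(1.8, 322) is about 5.3 against about 12.7 for cycles_per_byte(1.8, 135), and scale_to_clock(381, 1.3, 1.6) gives about 469 MB/s, matching the IA64 estimate quoted earlier.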

Technical details

rc4-amd64 includes major hand optimizations. Here are the most effective ones:


Conclusion

rc4-amd64 has been presented. It offers 2.3 times the throughput of OpenSSL, which makes it, as of today, the world's fastest RC4 symmetric cipher implementation running on i386 or AMD64 general purpose CPUs (according to Jose Castejon-Amenedo, HP has a faster RC4 implementation for IA64).

Many other areas could benefit from such AMD64 optimizations: data compression algorithms, checksumming algorithms, video encoding, etc. Unfortunately AMD64 is so young that it will take some time (1, 2, 4 years?) before all real-world applications run at full speed on this evolutionary architecture.