Featured in
Architecture & Design

Mini-talks: The Machine Intelligence Landscape: A Venture Capital Perspective by David Beyer. The future of global, trustless transactions on the largest graph: blockchain by Olaf Carlson-Wee. Algorithms for Anti-Money Laundering by Richard Minerich.

Featured in
Process & Practices

In-App Subscriptions Made Easy

There are various types of subscriptions: recurring, non-recurring, free-trial periods, various billing cycles and any possible billing variation one can imagine. But with lack of information online, you might discover that mobile subscriptions behave differently from what you expected. This article will make your life somewhat easier when addressing an in-app subscriptions implementation.

Featured in
Operations & Infrastructure

Mini-talks: The Machine Intelligence Landscape: A Venture Capital Perspective by David Beyer. The future of global, trustless transactions on the largest graph: blockchain by Olaf Carlson-Wee. Algorithms for Anti-Money Laundering by Richard Minerich.

Featured in
Enterprise Architecture

Mini-talks: The Machine Intelligence Landscape: A Venture Capital Perspective by David Beyer. The future of global, trustless transactions on the largest graph: blockchain by Olaf Carlson-Wee. Algorithms for Anti-Money Laundering by Richard Minerich.

Encrypting the Internet

The evolution of the Internet has resulted in large quantities of information being exchanged by businesses or private individuals. The nature of this information is typically both public and private, and much of it is transmitted over the hyper text transfer protocol (HTTP) in an insecure manner. A small amount of traffic, however, is transmitted by way of the secure sockets layer (SSL) over HTTP, known as HTTPS. HTTPS is a secure cryptographic protocol that provides encryption and message authentication over HTTP. The introduction of SSL over HTTP significantly increases the cost of processing traffic for service providers, as it sometimes requires an investment in expensive end-point acceleration devices. In this article, we present new technologies and results that show the economy of using general-purpose hardware for high-volume HTTPS traffic. Our solution is three pronged. First, we discuss new CPU instructions and show how to use them to significantly accelerate basic cryptographic operations, including symmetric encryption and message authentication. Second, we present results from a novel software implementation of the RSA algorithm that accelerates another compute-intensive part of the HTTPS protocol—public key encryption. Third, we show that the efficiency of a web server can be improved by balancing the web server workload with the public key cryptographic workload on a processor that is enabled with simultaneous multi-threading (SMT) technology. In conclusion, we show that these advances provide web services the tools to greatly reduce the cost of implementing HTTPS for all their HTTP traffic.

Introduction

As of January 2009, it is estimated that the Internet connects six hundred and twenty five million hosts. Every second, vast amounts of information are exchanged amongst these millions of computers. These data contain public and private information, which is often confidential and needs to be protected. Security protocols for safeguarding information are routinely used in banking and e-commerce. Private information, however, has not been protected on the Internet in general. Examples of private information (beyond banking and e-commerce data) include personal e-mail, instant messages, presence, location, streamed video, search queries, and interactions on a wide variety of on-line social networks. The reason for this neglect is primarily economic. Security protocols rely on cryptography, and as such are compute-resource-intensive. As a result, securing private information requires that an on-line service provider invest heavily in computation resources. In this article we present new technologies that can reduce the cost of on-line secure communications, thus making it a viable option for a large number of services.

A lot of private information is transmitted over the HTTP in an insecure manner. HTTP exists in the application layer of the TCP/IP protocol stack. The Secure Sockets Layer (SSL) and its successor, Transport Layer Security (TLS) are security technologies applied to the same layer. In this article, we specifically refer to SSL/TLS over the HTTP application layer, known as HTTPS. The introduction of HTTPS significantly increases the cost of processing traffic for web-service providers, due to the fact that it is not possible for previous-generation, web-server hardware to process high-volume HTTPS traffic with all the added cryptographic overhead. In order to process this high-volume traffic, a web-service provider has to invest in expensive end-point SSL/TLS acceleration devices. This added cost makes HTTPS a selective or premium choice among web-service providers. Consequently, a large amount of private information is transmitted over the web in an insecure manner and can, therefore, be intercepted or modified en route. In this article we provide a solution to this problem by presenting new technologies and results that show that it is now possible to use general-purpose hardware for high-volume HTTPS traffic.

Organization of this Article

Our solution to mitigating the overhead of an SSL-enabled HTTP session is three pronged. First, we discuss new processor instructions and show how to use them to accelerate basic cryptographic operations by factors. This substantially reduces the server load during the bulk data transfer phase of HTTPS. Second, we present results from a novel implementation of the Rivest Shamir Adleman (RSA) asymmetric cryptographic algorithm [1] that accelerates the most compute-intensive stage of the HTTPS protocol: that is, the stage in which the server has to decrypt handshake messages coming from a large number of clients. Third, we analyze a web server and show how its efficiency can be improved by balancing a web-server workload with a cryptographic workload on a processor enabled with simultaneous multi-threading (SMT) technology. By doing this, we show that the cryptographic overhead can be hidden by performing it in parallel with memory accesses that have long stall times.

We then elaborate on our motivation and vision of deploying HTTPS everywhere. First, we present an in-depth study of an SSL session and its resource requirements. We then describe our three-pronged strategy, together with our experiments and results.

Motivation

The motivation behind our research is primarily to enable widespread use of, and access to, HTTPS. It is important for service providers and users to be able to trust each other for their mutual benefit. An important aspect of the trust comes from knowing that private communications are kept confidential and adhere to the policies established between providers and users. Users need to be educated and informed about the benefits of HTTPS for privacy in on-line communications. Providers need to adopt ubiquitous HTTPS offerings to ensure that they hold up their end of the deal. Enabling HTTPS without expensive investment is important in creating such a partnership.

HTTPS provides an end-to-end solution to data privacy and authenticity. This end-to-end solution ensures that when users transmit information from their device to a provider, the information cannot be seen by man-in-the-middle spyware. This is important due to the fact that packets travel over untrusted networks all the time in the Internet. Although most routing devices are hidden from direct observation, they are not impervious to motivated eavesdroppers. Even more observable are the publicly accessible wireless access points that are in use all over the world. These access points broadcast information to all devices managed by them. If there is not an end-to-end solution for security, these communications can be easily observed by network neighbors. There are other solutions to the security problem, such as Layer 3 Virtual Private Networks (VPNs), but VPNs are typically limited to networks where users communicate with other users within a centrally managed network; that is, having multiple users but a single provider. In such cases, the network provider already has strict policies about data privacy and security that are communicated to users via training. For example, e-mails within an enterprise are often allowed only over the enterprise-managed VPN. For the larger Internet, users connect across the networks of multiple providers. In addition, in recent years we have seen a reduction in the use of a wide variety of communication protocols (for example, FTP) in favor of the HTTP protocol. In this environment, HTTPS is the most viable solution to enabling private and secure communications amongst the large and growing numbers of users and providers.

Future applications of HTTPS may include widespread e-mail encryption, secure video streaming, secure instant messaging and encrypted web searching. These are a few of the many applications of HTTPS that are not widely used today. Moreover, with each passing year, users are putting more of their personal and private information on-line. Cloud computing enables them to access their information across all their devices everywhere. We believe that it is inevitable that users will demand HTTPS support from their providers for all their communications. Being prepared for that day led us to research and develop the technologies described in this article. We envision that with these advancements, every HTTP-based communication made by every device today will be HTTPS-based in the near future. We refer to this as “https://everywhere!”.

Anatomy of a Secure Sockets Layer Session

Secure Sockets Layer

Secure sockets layer (SSL) (later versions known as Transport Layer Security, TLS) includes a handshake phase and a cryptographic data exchange phase. The overall SSL handshake is shown in Figure 1. In our diagram, in phase 1, the handshake begins when a client sends a server a list of algorithms the client is willing to support as well as a random number used as input to the key generation process.

In phase 2, the server chooses a cipher and sends it back, along with a certificate containing the server’s public key. The certificate proves the server’s identity. We note that the domain name of the server is also verified via the certificate (which helps eliminate phishing sites) and demonstrates to the user they are talking with the correct server/service. In addition, the server provides a second random number that is used as part of the key generation process. In phase 3, the client verifies the server’s certificate and extracts the server’s public key. The client then generates a random secret string called a pre-master secret and encrypts it by using the server’s public key. The pre-master secret is sent to the server. In phase 4, the server decrypts the pre-master secret by using RSA. This is one of the most compute-intensive parts of the SSL transaction on the server. The client and server then independently compute their session keys by using the pre-master secret to apply a procedure called a key derivation function (KDF) twice. In phases 5 and 6, the SSL handshake phase ends with the communicating parties sending authentication codes to each other, computed on all original handshake messages.

In SSL, the data are transferred by using a record protocol. The record protocol breaks a data stream into a series of fragments, each of which is independently protected and transmitted. In other words, in IPsec, protection is supported on an IP-packet-by-IP-packet basis, whereas in SSL, protection is supported on a fragment-by-fragment basis. Before a fragment is transmitted, it is protected against attacks by the calculation of a message authentication code on the fragment. The fragment’s authentication code is appended to the fragment, thereby forming a payload that is encrypted by using the cipher selected by the server. Finally, a record header is added to the payload. The concatenated header and encrypted payload are referred to as a record.

A secure web server is clearly a memory-intensive application. For an SSL connection, the most significant type of overhead is the one related to cryptography. This includes the operations of encrypting packets with a symmetric key, providing message authentication support, and setting up the session by using RSA, as mentioned previously. In the section that follows, we describe in more detail two encryption algorithms that we accelerate with technologies described in this article: the Advanced Encryption Standard (AES) and Rivest Shamir Adleman (RSA).

The Advanced Encryption Standard and the RSA Algorithm

Advanced Encryption Standard

AES is the United States Government’s standard for symmetric encryption, defined by FIPS Publication #197 (2001) [2, 3]. It is used in a large variety of applications where high throughput and security are required. In HTTPS, it can be used to provide confidentiality for the information that is transmitted over the Internet. AES is a symmetric encryption algorithm, which means that the same key is used for converting a plaintext to ciphertext, and vice versa. The structure of AES is shown in Figure 2.

Figure 2 Structure of AES (Source: Intel Corporation, 2009)

AES first expands a key (that can be 128, 192, or 256 bits long) into a key schedule. A key schedule is a sequence of 128-bit words, called round keys, that are used during the encryption process. The encryption process itself is a succession of a set of mathematical transformations called AES rounds.

During an AES round the input to the round is first XOR’d with a round key from the key schedule. The exclusive OR (XOR) logical operation can also be seen as addition without generating carries.

In the next step of a round, each of the 16 bytes of the AES state is replaced by another value by using a non-linear transformation called S-box. The AES S-box consists of two stages. The first stage is an inversion, not in regular integer arithmetic, but in a finite field arithmetic based on the set GF(28). The second stage is an affine transformation. During encryption, the input x, which is considered an element of GF(28); that is, an 8-bit vector, is first inverted, and then an affine map is applied to the result. During decryption, the input (y) goes through the inverse affine map and is then inverted in GF(28). The GF(28) inversions just mentioned are performed in GF(28), defined by the irreducible polynomial p(x) = x8 + x4 + x3 + x + 1 or 0x11B.

Next, the replaced byte values undergo two linear transformations called ShiftRows and MixColumns. ShiftRows is just a byte permutation. The MixColumns transformation operates on the columns of a matrix representation of the AES state. Each column is replaced by another one that results from a matrix multiplication. The transformation used for encryption is shown in Equation 1. In this equation, matrix-times-vector multiplications are performed according to the rules of the arithmetic of GF(28) with the same irreducible polynomial that is used in the AES S-box, namely, p (x) = x8 + x4 + x3 + x + 1.

During decryption, inverse ShiftRows is followed by inverse MixColumns. The inverse MixColumns transformation is shown in Equation 2.

Note that while the MixColumns transformation multiplies the bytes of each column with the factors 1, 1, 2 and 3, the inverse MixColumns transformation multiplies the bytes of each column by the factors 0x9, 0xE, 0xB, and 0xD. The same process is repeated 10, 12, or 14 times depending on the key size (128, 192, or 256 bits). The last AES round omits the MixColumns transformation.

The RSA Algorithm

RSA is a public key cryptographic scheme. The main idea behind public key cryptography is that encryption techniques can be associated with back doors. By back doors we mean secrets, known only to at least one of the communicating parties, which can simplify the decryption process. In public key cryptography, a message is encrypted by using a public key. A public key is associated with a secret called the private key. Without knowledge of the private key it is difficult to decrypt a message. Similarly, it is very difficult for an attacker to determine what the plaintext is.

We further explain how public key cryptography works by presenting the RSA algorithm as an example. In this algorithm, the communicating parties choose two random large prime numbers p and q. For maximum security, p and q are of equal length. The communicating parties then compute the product:

for some l. D and E can be used interchangeably, meaning that encryption can be done by using D, and decryption can be done by using E.

RSA is typically implemented using Chinese Remainder Theorem that reduces a single modular exponentiation operation into two operations of half length. Each modular exponentiation in turn is implemented, by using the square-and-multiply technique that reduces the exponentiation operation into a sequence of modular squaring and modular multiplication operations. Square-and-multiply may also be augmented with some windowing method for reducing the number of modular multiplications. Finally, modular squaring and multiplication operations can be reduced to big number multiplications by using reduction techniques such as Montgomery’s or Barrett’s [4, 5].

Acceleration Technologies

We are currently researching solutions to realize the vision of encrypting the Internet so that HTTPS sessions are accelerated by factors. The next micro-architecture generation adds new instructions for potentially speeding up symmetric encryption by 3-10 times. These instructions not only provide better performance but also protect applications against an important type of threat known as side-channel attacks. Second, we have developed improved integer arithmetic software that can speed up key exchange and establishment procedures by a factor of 40 to 100 percent.

Third, the Intel® Core™ i7 micro-architecture re-introduces the SMT feature into the CPU. SMT is ideal for hiding the cycles of compute-intensive public key encryption software under the stall times of network application memory lookups.

New Processor Instructions

In the next generation of Intel processors, a new set of instructions will be introduced that enable high performance and secure round encryption and decryption. These instructions are AESENC (AES round encryption), AESENCLAST (AES last round encryption), AESDEC (AES round decryption), and AESDECLAST (AES last round decryption). Two additional instructions are also introduced for implementing the key schedule transformation, AESIMC and AESKEYGENASSIST.

The design of these new processor instructions is based on the structure of AES. Systems such as AES involve complex mathematical operations such as finite field multiplications and inversions [6], as discussed earlier. These operations are time or memory consuming when implemented in software, but they are much faster and more power efficient when implemented by using combinatorial logic. Moreover, the operands involved in finite field operations can fit into the SIMD registers of the IA architecture. In this article, we discuss the concept of implementing an entire AES round as a single IA processor instruction by using combinatorial logic. An AES round instruction is much faster than its equivalent table-lookup-based software routine and can also be pipelined, thereby allowing the computation of an independent AES round result potentially every clock cycle.

The AESENC instruction implements these transformations of the AES specification in the order presented: ShiftRows, S-box, MixColumns, and AddRoundKey. The AESENCLAST implements ShiftRows, S-box, and AddRoundKey but not MixColumns, since the last round omits this transformation. The AESDEC instruction implements inverse ShiftRows, inverse S-box, inverse MixColumns, and AddRoundKey. Finally, the AESDECLAST instruction implements inverse ShiftRows, inverse S-box, and AddRoundKey, omitting the inverse MixColumns transformation. More details about these AES instructions can be found in [7].

Our AES instructions can be seen as cryptographic primitives for implementing not only AES but a wide range of cryptographic algorithms. For example, several submissions to NIST’s recent SHA-3 hash function competition use the AES round or its primitives as building blocks for computing cryptographic hashes. Moreover, combinations of instruction invocations can be used for creating more generic mathematical primitives for finite field computations. Our new instructions outperform by approximately 3-10 times the best software techniques doing equivalent mathematical operations on the same platform.

Together with the AES instructions, Intel will offer one new instruction supporting carry-less multiplication, named PCLMULQDQ. This instruction performs carryless multiplication of two 64-bit quadwords that are selected from the first and second operands, according to the immediate byte value.

Carry-less multiplication, also known as Galois Field (GF) multiplication, is the operation of multiplying two numbers without generating or propagating carries. In the standard integer multiplication, the first operand is shifted as many times as the positions of bits equal to “1” in the second operand. The product of the two operands is derived by adding the shifted versions of the first operand to each other. In carry-less multiplication, the same procedure is followed, except that additions do not generate or propagate carry. In this way, bit additions are equivalent to the exclusive OR (XOR) logical operation.

Carry-less multiplication is an essential component of the computations done as part of many systems and standards, including cyclic redundancy check (CRC), Galois/counter mode (GCM), and binary elliptic curves, and it is very inefficient when implemented in software in today’s processors. Thus, an instruction that accelerates carry-less multiplication is important for accelerating GCM and all communication protocols that depend on it [8].

Improved Key Establishment Software

We have also developed integer arithmetic software that can accelerate big number multiplication and modular reduction by at least 2X. Such routines are used not only in RSA public key encryption but also in Diffie Hellman key exchange and elliptic curve cryptography (ECC). Using our software, we are able to accelerate RSA 1024 from a performance of approximately 1500 signatures per second (OpenSSL v.0.9.8g) or 2000 signatures per second (OpenSSL v.0.9.8.h), to potentially 2900 signatures per second on a single Intel® Core i7 processor. Similarly, we are able to accelerate other popular cryptographic schemes such as RSA 2048 and Elliptic Curve Diffie-Hellman, based on the NIST B-233 curve.

The performance of RSA can be improved by accelerating the big number multiplication that is an essential and compute-intensive part of the algorithm. Our implementation uses an optimized schoolbook big number multiplication algorithm. RSA is a compute-intensive operation consuming millions of clocks on multiplying, adding, and subtracting 64-bit quantities. However, the state which RSA accesses is small, typically consisting of key information as well as 16-32 multipliers that fit into the L1 cache of Intel CPUs. With our software, an RSA 1024 decrypt operation consumes about 0.99 million clocks, whereas the corresponding RSA 2048 decrypt operation consumes about 6.73 million clocks on an Intel Core i7 processor. This is about 40 percent faster than corresponding operations that use OpenSSL (v. 0.9.8h).

The code listed in Code 1 illustrates the main idea, which is to combine multiply and add operations with a register recycling technique for intermediate values. In Code 1, 'a' and 'b' hold the two large numbers to be multiplied, and the results are stored in 'r'. These operations are repeated over the entire inputs to generate intermediate values that are then combined with addition to produce the large number multiplication result.

Code 1 RSA Implementation (Source: Intel Corporation, 2009)

We also investigated other techniques for big number multiplication, including Karatsuba-like constructions, but we found this schoolbook algorithm implementation to be the fastest [9, 10].

Simultaneous Multi-threading

The most recent Intel i7 core micro-architecture re-introduces the feature of hyperthreading (now referred to as simultaneous multi-threading or SMT) into the CPU. SMT represents a major departure from the earlier core micro-architecture, where each core was single threaded. As part of our research, we have demonstrated that SMT can result in substantial performance improvements for a certain class of workloads. Such workloads are associated with secure web transactions. We propose a new programming model where one compute-intensive thread performs only RSA public key encryption operations, and another thread performs memory access-intensive tasks. We show that RSA is an ideal companion thread for four representative memory access-intensive workloads when SMT is used, resulting in a 10–100 percent potential efficiency increase.

The system benefits most when a thread performing dependent-memory lookups is paired with an RSA thread. The throughput of the memory thread almost doubles, reaching the value it would have had if it hadn’t been paired with RSA. Another way to interpret the same result is that the RSA computation comes for free, because of SMT. In reality the RSA computation is hidden under the very long stall times of the memory thread. We also observe that the throughput of a single memory thread is increased by approximately 30 percent when SMT is switched on, and the memory thread is multiplexed with another memory thread. The same throughput is almost doubled when the memory thread is paired with an RSA thread. These results indicate that RSA is a much better companion thread than a second memory thread, due to the fact that one workload is memory access-intensive, and the other workload is compute-intensive. If an RSA thread is paired with a memory thread, then RSA performance also increases by 21 percent when SMT is switched ON as compared to OFF [11].

To further validate our position that SMT is beneficial especially to crypto workloads, we built a test bed running SpecWeb* 2005. The test bed consisted of a server machine using an Intel Core i7 processor connected to two client machines running a total of four client engines. We measured the server’s capacity with SMT turned on and off for the banking and support (regular HTTP) workloads. Our experiments indicate that SMT improves the overall system performance by at least 10 percent—more for the banking workload than the support workload. This result is in accordance with our earlier experiments, and it indicates that crypto workloads can take advantage of SMT.

The overall impact of our cryptographic algorithm acceleration technologies is shown in Figure 3. The first bar represents the crypto overhead of a 230 Kbyte SSL transaction as it runs on an Intel Core i7 processor today. The encryption scheme used is AES-256 in the counter mode. The next bar shows the acceleration gain if AES is implemented with the new instructions. The third bar shows the incremental gain by using our RSA software and SMT. Finally, the last bar shows the gain associated with replacing SHA1 with GCM. GCM is a message authentication scheme offering the same functionality as HMAC-SHA1. As is evident from the figure, our acceleration technologies substantially reduce the crypto overheads resulting in significant performance and efficiency improvement.

Conclusion

In summary, Intel is researching new technologies that offer cryptographic algorithm acceleration by factors. We described new processor instructions that can accelerate AES symmetric encryption. This acceleration substantially reduces the server load during the bulk data transfer phase of HTTPS. We also present results from a novel implementation of the RSA asymmetric cryptographic algorithm. This accelerates a very compute-intensive stage of the HTTPS protocol, a stage in which the server has to decrypt handshake messages coming from a large number of clients. Third, we analyze a web server and present some initial experimental results indicating that the efficiency of the server can be improved by balancing a web server workload with a cryptographic workload on an SMT-enabled processor. This shows that the cryptographic overhead can be hidden by performing it in parallel with memory accesses with long stall times. Our ultimate goal is to make general purpose processors capable of processing and forwarding encrypted traffic at very high speeds so that the Internet can be gradually transformed into a completely secure information delivery infrastructure. We also believe that these technologies can benefit other usage models, such as disk encryption and storage.

[10] M. E. Kounavis. “A New Method for Fast Integer Multiplication and its Application to Cryptography.” In Proceedings 2007 International Symposium on Performance Evaluation of Computer and Telecommunication Systems. San Diego, CA, 2007.

[11] S. Grover and M. Kounavis. “On the Impact of Simultaneous Multithreading on the Performance of Cryptographic Workloads.” Technical Report, available from the authors upon request.

This article and more on similar subjects may be found in the Intel Technology Journal, June 2009 Edition, “Advances in Internet Security”. More information can be found at http://intel.com/technology/itj.

About the Authors

Satyajit Grover is a Software Engineer working at Intel Labs. His duties involve researching and prototyping new ideas in the area of system security and integrity. He has been working in this area at Intel for over two years. Previous to that he was a graduate student and research assistant at the Computer Science Department at Portland State University. His e-mail is satyajit.grover at intel.com.

Xiaozhu Kang is a Research Scientist working at Intel Labs. Her research interests include algorithm design and performance analysis. She obtained a Ph.D. Degree in Electrical Engineering from Columbia University in 2008, and she joined Intel in January of 2009. Before that, she worked as an intern in Intel Corporation, Mathworks Corporation, and NEC Labs. Her e-mail is xiaozhu.kang at intel.com.

Michael Kounavis is a Senior Research Scientist working at Intel Labs. Michael is responsible for conducting research on novel digital arithmetic and cryptographic algorithms with the aim of accelerating a wide range of client, server, and networking applications. Michael is a co-inventor of the CRC32 SSE4 instruction of the Intel® Core i7 architecture used for iSCSI CRC generation. He is also a co-recipient of the 2008 Intel Achievement Award for his work on AES instructions. His e-mail is michael.e.kounavis at intel.com.

Frank Berry is a Principal Engineer working at Intel Labs. His area of expertise is the hardware/software interface where the hardware and software work closely together. His expertise extends to operating system internals, device drivers, and networking stacks. Frank has received two Intel Achievement Awards for his work on the InfiniBand Architecture and AES Instructions. His e-mail is frank.berry at intel.com.

I'm sure there will be a huge problem. The problem in hardware support for encryption will be disability to check the implementation of the encryption. An intruder may use a bug for a long time before the bug becomes disclosed.

Surely, no serious agencies will trust a third-party encryption hardware.

Firstly, thank you for a great article. It's also very good to know that SSL is being treated seriously by chip manufacturers. We're using a great deal more servers than neccessary due to the relative high CPU required for SSL. 10x speed improvements would be very welcome!

I am, however, a little skeptical of too much SSL. If everything is secure, then nothing can be cached. This will vastly increase the traffic across the entire internet and will slow everyone's connections down. Rather than knee-jerk "Encrypt everything" mentalities, we need to know what to encrypt and what not to encrypt.