Abstract

Optical networks play a crucial role in the provisioning of grid and cloud computing services. Their high bandwidth and low latency effectively give users universal access to computational and storage resources, which can thus be fully exploited without limiting performance penalties. Given the rising importance of such cloud/grid services hosted in (remote) data centers, the various users (ranging from academics and enterprises to non-professional consumers) are increasingly dependent on the network connecting these data centers, which must be designed to ensure maximal service availability, i.e., to minimize interruptions. In this chapter, the authors outline the challenges in the design, i.e., the dimensioning, of large-scale backbone (optical) networks interconnecting data centers. This amounts to extending the classical Routing and Wavelength Assignment (RWA) algorithms to so-called anycast RWA, but it also pertains to jointly dimensioning not just the network but also the data center resources (i.e., servers). The authors specifically focus on resiliency, given the criticality of the grid/cloud infrastructure in today’s businesses, and, for highly critical services, they also include specific design approaches to achieve disaster resiliency.

Introduction

Back in the 1960s, John McCarthy envisioned the concept of “computation as a public utility,” making computing power as easily accessible as the classical utilities that provide users with water, gas, and electricity. That seminal idea reappeared in the 1990s in the form of grid computing, borrowing its name from the power grid, where “the grid” was intended to be a highly powerful computing resource that scientists could easily tap into for performing challenging tasks. Similarly, today’s cloud computing paradigm is built on the idea of relieving the user from worrying about the resources required to run applications and to store data, as well as on the idea of enabling access to such applications and data from basically any device. Clearly, such a concept can be made possible only through a high-capacity, low-latency network that connects the user to “the cloud,” i.e., the distributed computing/storage resources. Undeniably, the development of optical network technology has been a major driver that enabled the realization of such grids/clouds.

The rise of broadband access networks and of high-speed optical networking in Wide Area Networks (WANs) has increased the geographical scale of distributed computing paradigms, extending their range from on-site computing facilities to the cost-efficient aggregation of IT resources for both processing and storage in large-scale data centers. These can now supply a broad spectrum of applications, serving a wide audience ranging from end consumers and business users to scientists requiring High Performance Computing (HPC) facilities. The basic concepts underlying so-called grid technology, which originated in the e-Science domain (e.g., to process massive data flows from the Large Hadron Collider [LHC] at CERN, in Switzerland, used for the Higgs boson discovery), have meanwhile evolved into today’s cloud applications. For a more elaborate discussion of these applications, as well as of the relevant optical technology that can help to meet their challenging requirements, we refer the reader to (Develder & De Leenheer, et al., 2012). The resulting optical grid/cloud constituents are summarized in Figure 1.

Figure 1.

An optical grid/cloud interconnects various data sources (experimental facilities, sensors, etc.) to infrastructure for data storage and processing (data centers, high performance computing, etc.) to deliver services to various types of users. Such a distributed architecture owes its success to optical networking infrastructure, both in backbone and access networks (adapted from Develder, et al., 2012).

Given that virtually all types of today’s applications rely heavily on network connectivity, as well as on the IT resources that constitute the workhorses of the grid/cloud, it is crucial that this infrastructure is able to provide its services resiliently. Protection of cloud services differs in nature from traditional traffic protection. In the optical layer (which is the focus area of this chapter), traffic between two nodes is generally protected by provisioning a backup path between those nodes. In an optical cloud, however, a specific service or content item is generally available from multiple locations (such as data centers or servers). Thus, we no longer necessarily need to provide a backup path between the requesting node and the original server node, as the service can be continued or restored from another location after a failure. Cloud service protection also includes protection of the content that is an integral part of the service. Routing and protection of connections and services largely depend on the placement of content, which is itself another important problem in a cloud. Protection of services in a cloud therefore has different requirements than protection of traditional traffic and can benefit from distinct protection methods. Moreover, large-scale network failures due to natural disasters and intentional attacks pose a major problem. Although upper-layer schemes (such as TCP retransmission, IP-layer rerouting, etc.) are in place to recover from a network failure, they are incapable of dealing with disaster failures, mostly because such failures are spatially correlated and recovery may require cross-layer signaling between the optical backbone and the upper layers.
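The difference between protecting a fixed node pair and protecting an anycast service can be illustrated with a minimal Python sketch. It is not the chapter's dimensioning method: it is a simple greedy two-step heuristic (shortest primary path first, then a link-disjoint backup path), and the toy topology, the node names, and the helper functions `shortest_path` and `protect_anycast` are all assumptions made purely for illustration. Wavelength assignment and capacity are ignored.

```python
from collections import deque

def shortest_path(adj, source, targets, banned_links=frozenset()):
    """Breadth-first search (hop count) from source to the nearest node in
    targets, never traversing a link listed in banned_links."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node in targets:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj[node]:
            if nxt not in parent and frozenset((node, nxt)) not in banned_links:
                parent[nxt] = node
                queue.append(nxt)
    return None  # no reachable target

def protect_anycast(adj, source, data_centers):
    """Greedy heuristic: route the primary path to the nearest data center,
    then look for a link-disjoint backup path to ANY data center."""
    targets = set(data_centers)
    primary = shortest_path(adj, source, targets)
    if primary is None:
        return None, None
    used = {frozenset(link) for link in zip(primary, primary[1:])}
    backup = shortest_path(adj, source, targets, used)
    return primary, backup

# Hypothetical toy topology (undirected, given as adjacency lists).
adj = {
    "A": ["B", "C"],
    "B": ["A", "DC1"],
    "C": ["A", "D"],
    "D": ["C", "DC2"],
    "DC1": ["B"],
    "DC2": ["D"],
}

# Unicast-style protection: the backup must also end at DC1.
fixed = protect_anycast(adj, "A", ["DC1"])
# Anycast protection: the backup may relocate the service to DC2.
relocated = protect_anycast(adj, "A", ["DC1", "DC2"])
```

In this toy network, DC1 is attached through a single link, so no link-disjoint backup path to DC1 exists and `fixed` has no backup. Allowing the backup to end at a different data center yields a disjoint path to DC2. A greedy two-step search like this can miss disjoint path pairs that a joint computation would find, so it should be read only as an illustration of the relocation idea.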