
“We grow in direct proportion to the amount of chaos we can sustain and dissipate” ― Ilya Prigogine, Order out of Chaos: Man’s New Dialogue with Nature

Abstract

According to Gartner “Alpha organizations aggressively focus on disruptive innovation to achieve competitive advantage. Characterized by unknowns, disruptive innovation requires business and IT leaders to go beyond traditional management techniques and implement new ground rules to enable success.”

While there is a lot of buzz about “game changing” technologies and “disruptive innovation”, real “game changers” and “disruptive innovators” are few and far between. Leap-frog innovation is more like a “phase transition” in physics. A system is composed of individual elements, each with a well-defined function, which interact with each other and the external world through a well-defined structure. The system usually exhibits normal equilibrium behavior that is predictable, and when there are small fluctuations, incremental innovation allows the system to adjust itself and maintain the equilibrium with predictability. Only when external forces inflict large or wild unexpected fluctuations on the system is the equilibrium threatened; the system then exhibits emergent behavior in which an unstable equilibrium introduces unpredictability into its evolution dynamics. A phase transition occurs when the structure of the system is reconfigured through an architectural transformation, resulting in order from chaos.

The difference between “Kaizen” (incremental improvement) and “disruptive innovation” lies in dealing with a stable equilibrium with small fluctuations versus dealing with a meta-stable equilibrium with large-scale fluctuations. The current datacenter is in a similar transition from “being” to “becoming”, driven both by its hyper-scale structure and by the fluctuations that the hardware and software systems delivering business processes are experiencing, caused by rapidly changing business priorities on a global scale, workload fluctuations and latency constraints. Is the current von Neumann stored program control implementation of the Turing machine reaching its limit? Is the datacenter poised for a phase transition from current ad-hoc distributed computing practices to a new theory-driven self-* architecture? In this blog we discuss a non-von Neumann managed Turing oracle machine network with a control architecture as an alternative.

The representation of the dynamics of a physical system as a linear, reversible (hence deterministic), temporal order of states requires that, in a deep sense, physical systems never change their identities through time; hence they can never become anything radically new (e.g., they must at most merely rearrange their parts, parts whose being is fixed). However, as elements interact with each other and their environment, the system dynamics can dramatically change when large fluctuations in the interactions induce a structural transformation leading to chaos and the eventual emergence of a new order out of chaos. This is denoted as “becoming”. In short, the dynamics of near equilibrium states with small-scale fluctuations in a system represent the “being”, while large deviations from the equilibrium, the emergence of an unstable equilibrium and the final restoration of order in a new equilibrium state represent the “becoming”. According to Plato, “being” is absolute, independent, and transcendent. It never changes and yet causes the essential nature of things we perceive in the world of “becoming”. The world of becoming is the physical world we perceive through our senses. This world is always in movement, always changing. The two aspects, the static structures and their dynamics of evolution, are two sides of a coin. Dynamics (becoming) represents time, and the static configuration at any particular instant represents the “being”. Prigogine applied this concept to understand the chemistry of matter, phase transitions and the like. Individual elements represent function and the groups (constituting a system) represent structure with dynamics. Fluctuations caused by the interactions within the system and between the system and its environment cause the dynamics of the system to induce transitions from being to becoming. Thus, function, structure and fluctuations determine the system and its dynamics, defining the complexity, chaos and order.

Why is it Relevant to Datacenters?

Datacenters are dynamic systems where software working with hardware delivers information processing services that allow modeling, interaction, reasoning, analysis and control of the environment external to them. Figure 1 shows the hardware, software and their interaction among themselves and the external world. There are two distinct systems interacting with each other to deliver the intent of the datacenter which is to execute specific computational workflows that model, monitor and control the external world processes using the computing resources:

Service workflows modeling the process dynamics of the system depicting the external world and its interactions. Usually this consists of the functional requirements of the system under consideration, such as business logic and sensor and actuator monitoring and control (the computed). The model consists of various functions captured in a structure (e.g., a directed acyclic graph, DAG) and its evolution in time. This model does not include the computing resources required to execute the process dynamics. It is assumed that the resources (CPU, memory, time etc.) will be available for the computation.

The non-functional requirements that address the resources required to execute the functions as a function of time and of fluctuations, both in the interactions in the external world and in the computing resources available to accomplish the intent defined in the functional requirements. The computation as implemented in the von Neumann stored program control model of the Turing machine requires time (impacted by CPU speed, network latency, bandwidth, storage IOPs, throughput and capacity) and memory. The computing model assumes unbounded resources, including time, for completing the computation. Today, these resources are provided by a cluster of servers and other devices containing multi-core CPUs and memory, networked with different types of storage. The computations are executed in the server or device by allocating the resources using an operating system, which is itself software that mediates the resources among various computations.
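As a minimal sketch of the first item above, a service workflow can be written down as a DAG of functions that says nothing about the resources needed to run it. The node names below are hypothetical and chosen only for illustration; the ordering uses Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a function node, each value is the set
# of nodes whose output it depends on. No CPU, memory or time appears here.
WORKFLOW = {
    "read_sensor": set(),
    "apply_business_logic": {"read_sensor"},
    "drive_actuator": {"apply_business_logic"},
    "log_result": {"apply_business_logic"},
}

def execution_order(dag):
    """Return one valid order in which the nodes of the DAG can execute."""
    return list(TopologicalSorter(dag).static_order())

ORDER = execution_order(WORKFLOW)
```

The resources needed to run each node (the second item above) would have to be described separately, which is the decoupling the rest of this post argues for.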

On the right hand side of Figure 1, we depict the computing resources required to execute the functions in a given structure, whether it is distributed or not. In the middle, we represent the application workflows composed of various components constituting an application area network (AAN) that is executed in a distributed computing cluster (DCC) made up of the hardware resources with specified service levels (CPU, memory, network bandwidth, cluster latency, storage IOPs, throughput and capacity). The left hand side shows a desired end-to-end process configuration and evolution monitoring and control mechanism. When all is said and done, the process workflows need to execute various functions using the computing resources made available in the form of a distributed cluster providing the required CPU, memory, network bandwidth, latency, storage IOPs, throughput and capacity. The structure is determined by the non-functional requirements such as resource availability, performance, security and cost. Fluctuations evolve the process dynamics and require adjusting the resources to meet the needs of applications to cope with the fluctuations.

Figure 1: Decoupling service orchestration and infrastructure orchestration to deliver function, structure and dynamic process flow to address the fluctuations both in resource availability and service demand

There are two ways to match the resources available to the computing nodes connected by links that execute the business process dynamics. First approach is the current state of the art and the second one is an alternative approach based on extensions to the current von Neumann stored program implementation of the Turing machine.

Current State of the Art

In the current approach, the infrastructure is infused with intelligence about various applications and their evolving needs, and adjusts the resources (the time of computation, affected by CPU, network bandwidth, latency, storage capacity, throughput and IOPs, and the memory required for the computation). Current IT has evolved from a model where the resources are provisioned anticipating the peak workloads and the structure of the application network is optimized for coping with deviations from equilibrium. Conventional computing models using physical servers (often referred to as bare-metal) cannot cope with wild fluctuations if the time to provision a new server is much larger than the time it takes for the fluctuations to set in and their magnitude cannot be predicted well enough to pre-plan the provisioning of additional resources. Virtualization of the servers and on-demand provisioning of virtual machines (VMs) reduce the provisioning times substantially, making it possible to institute auto-scaling, auto-failover and live migration across distributed resources using VM image mobility. However, it comes with a price:

The virtual image is still tied to the infrastructure (the network, storage and computing resources supporting the VM), and moving a VM involves manipulating a multitude of distributed resources, often owned or operated by different owners, and touching many infrastructure management systems, thus increasing the complexity and cost of management.

If the distributed infrastructure is homogeneous and supports VM mobility, it is simpler, but the solution forces vendor lock-in and does not allow taking advantage of commodity infrastructure offered by multiple suppliers.

If the distributed infrastructure is heterogeneous, VM mobility now must depend on myriad management systems and most often, these management systems themselves need other management systems to manage their resources.

VM mobility and management also increase bandwidth and storage requirements and lead to a proliferation of point solutions and tools for moving across heterogeneous distributed infrastructure, which increases operational complexity and cost.

Current state of the art based on the mobility of VMs and infrastructure orchestration is summarized in figure 2.

Figure 2: The infrastructure orchestration based on second guessing the application quality of service requirements and its dynamic behavior

It clearly shows the futility of orchestrating service availability, performance, compliance, cost and security in a very distributed and heterogeneous environment where scale and fluctuations dominate. The cost and complexity of navigating multiple infrastructure service offerings often outweigh the benefits of commodity computing. It is one reason why enterprises complain that 70% of their budget often is spent on keeping the service lights on.

Alternative Approach: A Clean Separation of Business Logic Implementation and the Operational Realization of Non-functional Requirements

Another approach is to decouple application and business process workflow management from the distributed infrastructure mobility by placing the applications in the right infrastructure that has the right resources, monitoring the evolution of the applications and proactively managing the infrastructure to add or delete resources with predictability based on history. Based on the recovery point objective (RPO) and recovery time objective (RTO), the application structure is adjusted to create active/passive or active/active nodes to manage application QoS and workflow/business process QoS. This approach requires a top-down method of business process implementation, with the specification of the business process intent followed by a hierarchical and temporal specification of process dynamics with context, constraints, communication, control of the group and its constituents, and the initial conditions for the equilibrium quality of service (QoS). The details include:

Non-functional requirements that specify availability, performance, security, compliance and cost constraints, and the policies specified with hierarchical and temporal process flows. The intent at a higher level is translated into the downstream intent of the computing nodes contributing to the workflow.

A method to implement autonomic behavior with visibility and control of application components so that they can be managed with policies defined. When scale and fluctuations demand a change in the structure to transition to a new equilibrium state, the policy implementation processes proactively add or subtract computing nodes or find existing nodes to replicate, repair, recombine or reconfigure the application components. The structural change implements the transition from being to becoming.
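To make the two items above more concrete, here is a minimal sketch of a policy that turns non-functional requirements into a structural decision. The `Policy` fields, the thresholds and both function names are assumptions made for illustration; they are not taken from any actual implementation.

```python
import math
from dataclasses import dataclass

@dataclass
class Policy:
    # Hypothetical non-functional requirements for one application component.
    max_load_per_node: float   # requests per second a single node may carry
    min_nodes: int             # never shrink below this many nodes
    rto_seconds: float         # recovery time objective

def desired_nodes(policy, observed_load):
    """Smallest node count that keeps per-node load within the policy."""
    needed = math.ceil(observed_load / policy.max_load_per_node)
    return max(policy.min_nodes, needed)

def topology(policy, standby_failover_seconds):
    """Use active/active when a passive standby cannot meet the RTO."""
    if standby_failover_seconds <= policy.rto_seconds:
        return "active/passive"
    return "active/active"
```

For example, with `Policy(max_load_per_node=100.0, min_nodes=2, rto_seconds=30.0)` an observed load of 450 requests per second calls for five nodes, and a standby that needs two minutes to take over pushes the structure to active/active.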

A New Architecture to Accommodate Scale and Fluctuations: Toward the Oneness of the Computer and the Computed

There is a fundamental reason why the current Turing, von Neumann stored program computing model cannot address large-scale distributed computing with fluctuations both in resources and in computation workloads without increasing complexity and cost (Mikkilineni et al. 2012). As von Neumann put it, “It is a theorem of Gödel that the description of an object is one class type higher than the object.” An important implication of Gödel’s incompleteness theorem is that it is not possible to have a finite description with the description itself as the proper part. In other words, it is not possible to read yourself or process yourself as a process. In short, Gödel’s theorems prohibit “self-reflection” in Turing machines. According to Alan Turing, Gödel’s theorems show that “every system of logic is in a certain sense incomplete, but at the same time it indicates means whereby from a system L of logic a more complete system L′ may be obtained. By repeating the process we get a sequence L, L1 = L′, L2 = L1′, … each more complete than the preceding. A logic Lω may then be constructed in which the provable theorems are the totality of theorems provable with the help of the logics L, L1, L2, … Proceeding in this way we can associate a system of logic with any constructive ordinal. It may be asked whether such a sequence of logics of this kind is complete in the sense that to any problem A, there corresponds an ordinal α such that A is solvable by means of the logic Lα.”

This observation along with his introduction of the oracle-machine influenced many theoretical advances including the development of generalized recursion theory that extended the concept of an algorithm. “An o-machine is like a Turing machine (TM) except that the machine is endowed with an additional basic operation of a type that no Turing machine can simulate.” Turing called the new operation the ‘oracle’ and said that it works by ‘some unspecified means’. When the Turing machine is in a certain internal state, it can query the oracle for an answer to a specific question and act accordingly depending on the answer. The o-machine provides a generalization of the Turing machines to explore means to address the impact of Gödel’s incompleteness theorems and problems that are not explicitly computable but are limit computable using relative reducibility and relative computability.
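The interaction pattern of an o-machine can be sketched in a few lines. This is only an illustration of the query-and-branch structure: in Turing's definition the oracle answers questions that no Turing machine can, whereas here the oracle is an ordinary Python callable, and the function name `run_with_oracle` is hypothetical.

```python
from typing import Callable, List

def run_with_oracle(tape: List[int], oracle: Callable[[int], bool]) -> List[int]:
    """Scan the tape and, at each cell, ask the oracle a yes/no question.

    The machine treats the oracle as opaque: it does not know how the
    answer is produced, only that it can ask and then act on the reply.
    """
    output = []
    for symbol in tape:
        # Query state: hand the current symbol to the oracle and branch.
        output.append(1 if oracle(symbol) else 0)
    return output
```

Calling `run_with_oracle([1, 2, 3, 4], lambda n: n % 2 == 0)` uses a trivially computable stand-in oracle; the point is that the machine's code does not change when a different oracle is supplied.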

According to Mark Burgin, an Information processing system (IPS) “has two structures—static and dynamic. The static structure reflects the mechanisms and devices that realize information processing, while the dynamic structure shows how this processing goes on and how these mechanisms and devices function and interact.”

The software contains the algorithms (à la the Turing machine) that specify information processing tasks while the hardware provides the required resources to execute the algorithms. The static structure is defined by the association of software and hardware devices and the dynamic structure is defined by the execution of the algorithms. The meta-knowledge of the intent of the algorithm, the association of specific algorithm execution to a specific device, and the temporal evolution of information processing and exception handling when the computation deviates from the intent (be it because of software behavior or the hardware behavior or their interaction with the environment) is outside the software and hardware design and is expressed in non-functional requirements. Mark Burgin calls this Infware, which contains the description and specification of the meta-knowledge and can also be implemented using the hardware and software to enforce the intent with appropriate actions.

The implementation of Infware using Turing machines introduces the same dichotomy mentioned by Turing with respect to the manager of managers conundrum. This is consistent with the observation of Cockshott et al. (2012): “The key property of general-purpose computers is that they are general purpose. We can use them to deterministically model any physical system, of which they are not themselves a part, to an arbitrary degree of accuracy. Their logical limits arise when we try to get them to model a part of the world that includes themselves.”

The goals of the distributed system determine the resource requirements and computational process definition of individual service components based on their priorities, workload characteristics and latency constraints. The overall system resiliency, efficiency and scalability depend upon the individual service component workload and latency characteristics of their interconnections that in turn depend on the placement of these components (configuration) and available resources. The resiliency (fault, configuration, accounting, performance and security often denoted by FCAPS) is measured with respect to a service’s tolerance to faults, fluctuations in contention for resources, performance fluctuations, security threats and changing system-wide priorities. Efficiency depicts the optimal resource utilization. Scaling addresses end-to-end resource provisioning and management with respect to increasing the number of computing elements required to meet service needs.

A possible solution to address resiliency with respect to scale and fluctuations is an application network architecture based on increasing the intelligence of computing nodes, which was presented at the Turing centenary conference (2012) for improving the resiliency, efficiency and scaling of information processing systems. In its essence, the distributed intelligent managed element (DIME) network architecture extends the conventional computational model of information processing networks, allowing improvement of the efficiency and resiliency of computational processes. This approach is based on organizing the process dynamics under the supervision of intelligent agents. The DIME network architecture utilizes the DIME computing model with a non-von Neumann parallel implementation of a managed Turing machine with a signaling network overlay and adds cognitive elements to evolve super recursive information processing. The DIME network architecture introduces three key functional constructs to enable process design, execution, and management to improve both resiliency and efficiency of application area networks delivering distributed service transactions using both software and hardware (Burgin and Mikkilineni):

Machines with an Oracle: Executing an algorithm, the DIME basic processor P performs the {read -> compute -> write} instruction cycle or its modified version the {interact with a network agent -> read -> compute -> interact with a network agent -> write} instruction cycle. This allows the different network agents to influence the further evolution of computation, while the computation is still in progress. We consider three types of network agents: (a) A DIME agent. (b) A human agent. (c) An external computing agent. It is assumed that a DIME agent knows the goal and intent of the algorithm (along with the context, constraints, communications and control of the algorithm) the DIME basic processor is executing and has the visibility of available resources and the needs of the basic processor as it executes its tasks. In addition, the DIME agent also has the knowledge about alternate courses of action available to facilitate the evolution of the computation to achieve its goal and realize its intent. Thus, every algorithm is associated with a blueprint (analogous to a genetic specification in biology), which provides the knowledge required by the DIME agent to manage the process evolution. An external computing agent is any computing node in the network with which the DIME unit interacts.

Blueprint or policy managed fault, configuration, accounting, performance and security (FCAPS) monitoring and control: The DIME agent uses the blueprint to configure, instantiate and manage the DIME basic processor executing the algorithm. It uses concurrent DIME basic processors, with their own blueprints specifying their evolution, to monitor the vital signs of the managed DIME basic processor, and it implements various policies to assure non-functional requirements such as availability, performance, security and cost management while the managed DIME basic processor is executing its intent. This approach integrates the evolution of the execution of an algorithm with concurrent management of available resources to assure the progress of the computation.

DIME network management control overlay over the managed Turing oracle machines: In addition to read/write communication of the DIME basic processor (the data channel), other DIME basic processors communicate with each other using a parallel signaling channel. This allows the external DIME agents to influence the computation of any managed DIME basic processor in progress based on the context and constraints. The external DIME agents are DIMEs themselves. As a result, changes in one computing element could influence the evolution of another computing element at run time without halting its Turing machine executing the algorithm. The signaling channel and the network of DIME agents can be programmed to execute a process, the intent of which can be specified in a blueprint. Each DIME basic processor can have its own oracle managing its intent, and groups of managed DIME basic processors can have their own domain managers implementing the domain’s intent to execute a process. The management DIME agents specify, configure, and manage the sub-network of DIME units by monitoring and executing policies to optimize the resources while delivering the intent.
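The three constructs above can be pictured with a small sketch of a processor that runs the {interact with a network agent -> read -> compute -> interact with a network agent -> write} cycle, with a signaling channel kept separate from the data channel. The class `ManagedProcessor`, its methods and the "pause"/"resume" commands are hypothetical stand-ins chosen for illustration, not the actual DIME implementation.

```python
from collections import deque

class ManagedProcessor:
    """Hypothetical stand-in for a DIME basic processor under management."""

    def __init__(self, compute):
        self.compute = compute
        self.signals = deque()   # signaling channel, separate from the data path
        self.paused = False

    def signal(self, command):
        """Called by a managing agent; takes effect at the next interaction point."""
        self.signals.append(command)

    def _interact(self):
        # Drain pending commands from the signaling channel.
        while self.signals:
            command = self.signals.popleft()
            if command == "pause":
                self.paused = True
            elif command == "resume":
                self.paused = False

    def step(self, data_in, data_out):
        """Run one cycle; return True if a value was written."""
        self._interact()                           # interact with a network agent
        if self.paused or not data_in:
            return False
        result = self.compute(data_in.popleft())   # read, then compute
        self._interact()                           # interact again before the write
        data_out.append(result)                    # write
        return True
```

A managing agent can call `signal("pause")` between steps and the processor stops consuming input without being torn down, then continues after `signal("resume")`. In this sketch a pause that arrives during a cycle only takes effect on the next cycle, which is a simplification.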

The result is a new computing model, a management model and a programming model which infuse self-awareness, using an intelligent Infware, into a group of software components deployed on a distributed cluster of hardware devices while enabling the monitoring and control of the dynamics of computation to conform to the intent of the computational process. The control architecture based on the DIME network architecture (DNA) appropriately configures the software and hardware components to execute the intent. As the computation evolves, the control agents monitor the evolution and make appropriate adjustments to maintain an equilibrium conforming to the intent. When the fluctuations create conditions for unstable equilibrium, the control agents reconfigure the structure in order to create a new equilibrium state that conforms to the intent based on policies.

Figure 3 shows the Infware, hardware and software executing a web service using DNA.

Figure 3: Hardware and software networks with a process control Infware orchestrating the life-cycle evolution of a web service deployed on a Distributed Computing Cluster

The hardware components are managed dynamically to configure an elastic distributed computing cluster (DCC) to provide the required resources to execute the computations. The software components are organized as managed Turing oracle machines with a control architecture to create AANs that can be monitored and controlled to execute the intent using the network management abstractions of replication, repair, recombination and reconfiguration. With DNA, the datacenters are able to evolve from being to becoming.

It is important to note that DNA has been implemented (Mikkilineni et al. 2012, 2014) to demonstrate a couple of functions that cannot be accomplished today with the current state of the art:

Migrating a workflow being executed in a physical server (a web service transaction including a web server, application server and a database) to another physical server without a reboot or losing transactions to maintain recovery time and recovery point objectives. No virtual machines are required although they can be used just as if they were bare-metal servers.

Provide workflow auto-scaling, auto-failover and live migration with retention of application state using distributed computing clusters with heterogeneous infrastructure (bare metal servers, private and public clouds etc.) without infrastructure orchestration to accomplish them (e.g., without moving virtual machine images or LXC container based images).

The approach using DNA allows the implementation of the above functions without requiring changes to existing applications, OSs or current infrastructure because the architecture non-intrusively extends the current Turing computing model to a managed Turing oracle machine network with control network overlay. It is not a coincidence that similar abstractions are present in how cellular organisms, human organizations and telecommunication networks self-govern and deliver the intent of the system (Mikkilineni 2012).

Only time will tell if the DNA implementation of Infware is an incremental or leap-frog innovation.

Acknowledgements

This work originated from discussions started in IEEE WETICE 2009 to address the complexity, security and compliance issues in Cloud Computing. The work of Dr. Giovanni Morana, the C3DNA Team and the theoretical insights from Professor Eugene Eberbach, Professor Mark Burgin and Pankaj Goyal are behind the current implementation of DNA.

“It’s very likely that on the basis of philosophy that every error has to be caught, explained, and corrected, a system of the complexity of the living organism would not run for a millisecond.”
—– von Neumann, Papers of John von Neumann on Computing and Computing Theory, Hixon Symposium, September 20, 1948, Pasadena, CA, The MIT Press, 1987.

Communication, Collaboration and Commerce at the Speed of Light:

With the advent of many-core servers, high bandwidth network technologies connecting these servers, and a new class of high performance storage devices that can be optimized to meet the workload needs (IOPs intensive, throughput sensitive or capacity hungry workloads), the Information Technology (IT) industry is looking at a transition from its server-centric, low-bandwidth, client-server origins to geographically distributed, highly scalable and resilient composed service creation, delivery and assurance environments that meet the rapidly changing business priorities, latency constraints, fluctuations in workloads and availability of required resources. Distributed service composition and delivery brings new challenges with scale and fluctuations both in demand and the availability of resources. New approaches are emerging to improve resiliency and the efficiency of distributed system design, deployment, management and control.

The Jazz Metaphor:

The quest for transition is best described by the Jazz metaphor aptly summarized by Holbrook [1] (Holbrook 2003), “Specifically, creativity in all areas seems to follow a sort of dialectic in which some structure (a thesis or configuration) gives way to a departure (an antithesis or deviation) that is followed, in turn, by a reconciliation (a synthesis or integration that becomes the basis for further development of the dialectic). In the case of jazz, the structure would include the melodic contour of a piece, its harmonic pattern, or its meter…. The departure would consist of melodic variations, harmonic substitutions, or rhythmic liberties…. The reconciliation depends on the way that the musical departures or violations of expectations are integrated into an emergent structure that resolves deviation into a new regularity, chaos into a new order, and surprise into a new pattern as the performance progresses.”

The Thesis:

The thesis in the IT evolution is the automation of business processes and service delivery using client-server architectures. It served well as long as the service scale and fluctuations of service delivery infrastructure resources were within certain bounds that allowed the action to increase or decrease available resources and meet the fluctuating demands. In addition, the resiliency of the service is always adjusted by improving the resiliency (availability, performance and security) of the infrastructure through various appliances, processes and tools. This introduced a timescale for meeting the resiliency required for various applications in terms of recovery time objectives and recovery point objectives. The resulting management “time constant” (defined as the time to recover a service to meet customer satisfaction) has been continuously decreasing with the use of newer technologies, tools and process automation.
However, with the introduction of the high-speed Internet, access to mobile technology and globalization of e-commerce, the scale and fluctuations in service demand have radically changed, which has put challenging demands on provisioning the resources within shorter and shorter periods of time. Figure 1 summarizes the key drivers that are forcing the drastic reduction of the management time constant.

Figure 1: Global communication, collaboration and commerce at the speed of light is forcing the drastic reduction in IT resource management time constant

The Anti-Thesis:

The result is the anti-thesis (the word is not used pejoratively; in the Jazz metaphor it denotes innovation, creativity and a touch of anti-establishment rebellion): to virtualize the infrastructure management (compute, storage and network resources) and provide intelligent resource management services that utilize commodity infrastructure connected by fat pipes. Software defined data center (SDDC) is used to represent the dynamic provisioning of server clusters connected by a network attached to required storage, all meeting the service levels required by the applications that are composed to create a service transaction. The idea is to monitor the resource utilization by these service components and adjust the resources as required to meet the Quality of Service (QoS) needs of the service transaction (in terms of CPU, memory, network bandwidth, latency, storage throughput, IOPs and capacity). Network function virtualization (NFV) is used to denote the dynamic provisioning and management of network services such as routing, switching and controlling commodity hardware that is solely devoted to connecting various devices to assure desired network bandwidth and latency. Storage function virtualization (SFV) similarly denotes the dynamic provisioning and management of commodity storage hardware with required IOPs, throughput and capacity. ACI denotes application centric infrastructure, which is sensitive to the needs of a particular application and dynamically adjusts the resources to provide the right CPU, memory, bandwidth, latency, storage IOPs, throughput and capacity. The drive to move away from proprietary network and storage equipment to commodity high performance hardware, made ubiquitous with open interface architectures, is intended to foster competition and innovation both in hardware and software. The open software is supposed to match the needs of the application by tuning the resources dynamically using the compute, network and storage management functions made available with open-source software.
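The monitoring step described above can be sketched as a comparison of measured values against the service levels a transaction requires. The metric names, the numbers in `REQUIRED` and the function `violations` are all hypothetical; a real SDDC controller would obtain these values from its own telemetry.

```python
# Hypothetical service levels for one service transaction.
REQUIRED = {
    "cpu_cores": 4,
    "memory_gb": 16,
    "bandwidth_mbps": 1000,
    "latency_ms": 5,
    "storage_iops": 20000,
}

# For these metrics a smaller measured value is better; for the rest, larger.
LOWER_IS_BETTER = {"latency_ms"}

def violations(required, measured):
    """Return the sorted names of metrics that miss their required level."""
    missed = []
    for name, need in required.items():
        have = measured[name]
        ok = have <= need if name in LOWER_IS_BETTER else have >= need
        if not ok:
            missed.append(name)
    return sorted(missed)
```

The list returned by `violations` is what a resource manager would act on, for example by adding bandwidth or moving a component to faster storage.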

Unfortunately, the anti-thesis brings its own issues in transforming the current infrastructure, which has evolved over a few decades, to the new paradigm.

The new approach has to accommodate current infrastructure and applications and allow seamless migration to the new paradigm without vendor lock-in to use new infrastructure. A fork-lift strategy that involves time, money and service interruption will not work.

Current infrastructure is designed to provide low latency high performance application quality of service with various levels of security. For mission critical applications to migrate to new paradigm, these requirements have to be met without compromise.

The new paradigm should not require a new way of developing applications; it must support current development languages and processes without new methodology lock-in. An application is defined both by functional requirements that dictate the specific domain functions and logic as well as non-functional requirements that define operational constraints related to service availability, reliability, performance, security and cost dictated by business priorities, workload fluctuations and resource latency constraints. A non-functional requirement specifies criteria that can be used to judge the operation of a system, rather than specific behaviors. The plan for implementing functional requirements is detailed in the system design. The plan for implementing non-functional requirements is detailed in the system architecture. The architecture for non-functional requirements plays a key role in whether the open systems approach will succeed or fail. An architecture that defines a plug and play approach requires a composition scheme, which leads to the next issue.

There must be a way to compose applications developed by different vendors without having to look inside their implementation. In essence there must be a composition architecture that allows applications to be developed independently but can be composed to create new applications without having to modify the original components. Even when you have open-sourced applications, integrating them and creating new workflows and services is a labor intensive and knowledge sensitive task. The efficiency will be thwarted by the need for service engagements, training and maintenance of integrated workflows.
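As a rough sketch of what composition "without looking inside" could mean, the example below treats each component as a black box with a single request-to-response interface. The `compose` helper and the three stand-in components are hypothetical and only illustrate the idea; they are not a proposed composition architecture.

```python
from typing import Callable

# A component is treated as a black box: a callable from request to response.
Component = Callable[[dict], dict]


def compose(*components: Component) -> Component:
    """Chain black-box components into a new component without modifying them."""
    def composite(request: dict) -> dict:
        result = request
        for component in components:
            result = component(result)
        return result
    return composite


# Independently developed components (stand-ins for software from different vendors).
def authenticate(request: dict) -> dict:
    return {**request, "authenticated": request.get("user") is not None}


def price_order(request: dict) -> dict:
    return {**request, "total": request["quantity"] * request["unit_price"]}


def confirm(request: dict) -> dict:
    return {**request, "confirmed": request["authenticated"] and request["total"] > 0}


checkout = compose(authenticate, price_order)
# The composite is itself a component, so composition can be applied recursively.
order_service = compose(checkout, confirm)
response = order_service({"user": "alice", "quantity": 3, "unit_price": 5})
```

The point of the sketch is that `order_service` is built from `checkout` exactly as `checkout` is built from its parts, and none of the original functions had to change.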

Current approaches suggested in the anti-thesis movement, which embrace virtual machines (VM), open-sourced applications and cloud computing, fail on all these counts by increasing complexity or requiring vendor, API and architecture dependency. The result is a higher operating cost and an integration dependency on ad-hoc software and services.

The increase in complexity with scale and distribution is more an issue of architecture and is not addressed by throwing more ad-hoc software to automate with managers of managers, point solutions and tools. It has to do more with the limitation of current computing architecture than lack of good ad-hoc software approaches.

Server virtualization creates a Virtual Machine image that can be replicated easily on different physical servers with shared resources. The introduction of a hypervisor to virtualize hardware resources (CPU and memory) allows multiple virtual machine images to share the resources of a physical server. NFV and SFV provide management functions to control the underlying commodity hardware. OpenStack and other infrastructure provisioning mechanisms have evolved through the anti-thesis movement to integrate VM provisioning with NFV and SFV provisioning and create clusters of VMs on which the applications can deliver the service transactions. Figure 2 shows an OpenStack implementation of such a service provisioning process. A cluster of VMs required for a service delivery can be provisioned with the required service level agreements to assure the right CPU, memory, bandwidth, latency, storage IOPs, throughput and capacity. It is also important to note that OpenStack can provision not only a VM cluster but also a physical server cluster or a mixture. It allows adding, deleting or tuning a VM on demand. In addition, OpenStack allows applications themselves to be part of the image and of snapshots that can be reused to replicate the VM on any server. Clusters with appropriate applications and dependencies, with connectivity and firewall rules, can be provisioned and replicated. This allows for orchestration of VM images to provide auto-failover, auto-scaling, live-migration and auto-protection for service delivery.

Figure 2: OpenStack is used to provision infrastructure with required service level agreements to assure cpu, memory, bandwidth, storage IOPs, throughput, storage capacity of individual virtual machine (VM) and the network latency of the VM cluster

Unfortunately, the anti-thesis movement depends solely on infrastructure mobility and management through VMs and associated plumbing, which requires a lock-in on the availability of the same OpenStack in a distributed environment or complex image orchestration add-ons. More recently, instead of moving the whole virtual image containing the OS, run-time environments and applications along with their configurations, a mini-OS image (using a subset of operating system services) is created with the application and its configuration. LXC containers and Docker containers are examples. The use of VM or container mobility to move applications from one infrastructure to another, in order to manage the infrastructure SLAs that meet the QoS needs of an application, has created a plethora of ad-hoc solutions adding to the complexity. Figure 3 shows the current state-of-the-art.

Figure 3: Current state-of-the-art that provides application QoS through Virtual Machine mobility or container mobility where container is also an image

While this approach meets application scaling and fluctuation needs as long as the infrastructure meets certain requirements, it has shortcomings in distributed heterogeneous infrastructures provided by different vendors:

Multiple Orchestrators are required when different architectures and infrastructure management systems are involved

Figure 4 shows the complexity involved in scaling services across distributed heterogeneous infrastructures with different owners using different infrastructure management systems. Integrating multiple distributed infrastructures with disparate management systems is not a highly scalable solution without increasing complexity and cost.

Obviously, if scale, distribution and fluctuations (both in demand and resources) are not a requirement, then the thesis will do well. Today, there are still many mainframe systems providing high transaction rates, albeit at a higher cost. The anti-thesis is born out of the need for a high degree of scalability, distribution and fluctuations with higher efficiency. Big data analysis and large scale collaboration systems are examples. However, there is a large class of services that would like to leverage commodity infrastructure and resiliency with security and application QoS management, without vendor lock-in or the high cost of complexity.

There are three stakeholders in an enterprise who want different things from infrastructure to provide QoS assurance:

Ability to “migrate service” or “tune infrastructure SLAs” based on Policies and application demand

Ability to burst into cloud without vendor-lock-in

The developers want:

Focus on business logic coding and specification of run-time requirements for resources (application intent, context, communications, control and constraints) without worrying about run-time infrastructure configurations

End-to-end visibility and profiling at run-time across the stack for Debugging

In essence, service developers would want to focus on functional requirement fulfillment without having to worry about resource availability in a fluctuating environment. Monitoring resource utilization and taking action on non-deterministic impact of scaling and fluctuations should be supported by a common architecture that decouples application execution from underlying resource management distributed or not.

Figure 4: Complexity in a distributed infrastructure where scaling and fluctuations are increasing

The Synthesis:

The synthesis depends on addressing the scaling and fluctuation issues without vendor lock-in or architecture lock-in that restricts developers to use their current environments and requires accommodating current infrastructure while allowing new infrastructure with NFV and SFV to seamlessly integrate. For example the anti-thesis solutions require certain features in their OSs and new middleware must run in distributed environments. This leaves a host of legacy systems out.

A call for the synthesis is emerging from two quarters:

Industry analysts such as Gartner who predict that a service governor will emerge in due time. “A service governor [2] is a runtime execution engine that has several inputs: business priorities, IT service descriptions (and dependency model), service quality and cost policies. In addition, it takes real-time data feeds that assess the performance of user transactions and the end-to-end infrastructure, and uses them to dynamically optimize the consumption of real and virtual IT infrastructure resources to meet the business requirements and service-level agreements (SLAs). It performs optimization through dynamic capacity management (that is, scaling resources up and down) and dynamically tuning the environment for optimum throughput given the demand. The service governor is the culmination of all technologies required to build the real-time infrastructure (RTI), and it’s the runtime execution management tool that pulls everything together.”

From the academic community who recognize the limitations of Turing’s formulation of computation in terms of functions to process information using simple read, compute (change state) and write instructions combined with the introduction of program, data duality by von Neumann which has allowed information technology (IT) to model, monitor, reason and control any physical system. Prof. Mark Burgin [3] in his 2005 book on super recursive algorithms states “it is important to see how different is functioning of a real computer or network from what any mathematical model in general and a Turing machine,(as an abstract, logical device), in particular, reputedly does when it follows instructions. In comparison with instructions of a Turing machine, programming languages provide a diversity of operations for a programmer. Operations involve various devices of computer and demand their interaction. In addition, there are several types of data. As a result, computer programs have to give more instructions to computer and specify more details than instructions of a Turing machine. The same is true for other models of computation. For example, when a finite automaton represents a computer program, only certain aspects of the program are reflected. That is why computer programs give more specified description of computer functioning, and this description is adapted to the needs of the computer. Consequently, programs demand a specific theory of programs, which is different from the theory of algorithms and automata.”

In short, the programs (or functions) developers develop to code business logic do not contain knowledge about how compute, storage and network devices interact with each other (structure) and how to deal with changing business priorities, workload variations and latency constraints (fluctuations that force changes to structure). This knowledge has to be incorporated in the architecture of the new computing, management and programming model.
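To make the service governor idea quoted from Gartner a little more concrete, here is a deliberately small sketch of one decision it would have to make: dividing limited capacity among services according to business priority and current demand. The `Service` fields, the `govern` function and the numbers are assumptions made for illustration; the Gartner description does not prescribe any particular algorithm.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    priority: int   # business priority; a lower number means more important
    demand: int     # capacity units currently needed (from real-time feeds)


def govern(services, capacity):
    """Allocate limited capacity to services in business-priority order."""
    allocation = {}
    remaining = capacity
    for service in sorted(services, key=lambda s: s.priority):
        granted = min(service.demand, remaining)
        allocation[service.name] = granted
        remaining -= granted
    return allocation


services = [
    Service("reporting", priority=3, demand=40),
    Service("checkout", priority=1, demand=50),
    Service("search", priority=2, demand=30),
]
allocation = govern(services, capacity=100)
```

Here the two higher-priority services receive their full demand and the lowest-priority one gets whatever is left. A real governor would rerun such a decision continuously as priorities, workloads and latency measurements change.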

These non-functional requirements are requirements that specify criteria that can be used to judge the operation of a system, rather than specific behavior. This should be contrasted with functional requirements that define specific behavior or functions that deal with algorithms, or business logic. The plan for implementing functional requirements is detailed in the system design. The plan for implementing non-functional requirements is detailed in the system architecture. These requirements include availability, reliability, performance, security, scalability and efficiency at run-time. The new architecture must encapsulate the intent of the program, its operational requirements such as the context, connectivity to other components, constraints and control abstractions that are required to manage the non-functional requirements. Figure 5 shows an architecture where the service management architecture is decoupled from the infrastructure management systems monitoring and managing distributed resources that may belong to different providers with different incentives.

The infrastructure control plane provides automation, monitoring and management of infrastructure required for applications to execute their intent. The output of the infrastructure is a cluster of physical servers or virtual servers with an operating system in each server to provide well-defined computing resources in terms of total CPU, Memory, network bandwidth, latency, storage IOPs, throughput and capacity. The infrastructure control plane will be able to provide required clusters on demand and elastically scale the nodes or the individual node resources on demand. The elastic on-demand resources use automation processes or NFV and SFV resources connected to Virtual or Physical servers.

As Professor Mark Burgin points out, the intent and the application monitoring to process information, apply knowledge, and change the circumstance must be part of the service management knowledge, independent of distributed infrastructure management systems, in order to provide true scalability, distribution and resiliency and to avoid vendor lock-in or infrastructure, architecture or API lock-in. In addition, the service control plane must support recursive service composition so that it has end-to-end service visibility and control and can use the best resources wherever they are available to meet the quality of service dictated by business priorities, latency constraints and workload fluctuations. The application quality of service must not be dictated or limited by the infrastructure limitations. Only then can we predictably deploy highly reliable services on not-so-reliable distributed infrastructure and increase efficiency to meet demand that is not as predictable.

Borrowing from biological and intelligent systems, which specialize in exploiting architectures that provide predictability, we can argue that infusing cognition into service management will provide such an architecture. Cognition [4] is associated with intent and its accomplishment through various processes that monitor and control a system and its environment. Cognition is associated with a sense of “self” (the observer) and the systems with which it interacts (the environment or the “observed”). Cognition [4] extensively uses time, history and reasoning in executing and regulating tasks that constitute a cognitive process. There is a fundamental reason why the current Turing, von Neumann stored program computing model cannot address large-scale distributed computing with fluctuations both in resources and in computation workloads without increasing complexity and cost. As von Neumann [5] put it, “It is a theorem of Gödel that the description of an object is one class type higher than the object.” An important implication of Gödel’s incompleteness theorem is that it is not possible to have a finite description with the description itself as the proper part. In other words, it is not possible to read yourself or process yourself as a process. In short, Gödel’s theorems prohibit “self-reflection” in Turing machines. Turing’s O-machine was designed to provide information that is not available in the computing algorithm executed by the TM. More recently, the super recursive algorithms proposed by Mark Burgin [3] point a way to model the knowledge about the hardware and software in order to reason and act to self-manage. He proves that the super recursive algorithms are more efficient than plain Turing computations, which assume unbounded resources.

Perhaps we should look for “synthesis” solutions not in familiar places where we feel comfortable with more ad-hoc software and services that are labor and knowledge intensive. We should look for clues in biology, human organizational networks and even telecommunication networks to transform current datacenters from being infrastructure management systems to services switching centers of the future [6]. This requires a search for new computing, management and programming models that do not disturb current applications, operating systems or infrastructure while facilitating smooth migration to a more harmonious melody of orchestrated services on a global scale with high efficiency and resiliency.

The “gap between the hardware and the software of a concrete computer and even greater gap between pure functioning of the computer and its utilization by a user, demands description of many other operations that lie beyond the scope of a computer program, but might be represented by a technology of computer functioning and utilization”

Introduction

According to Holbrook (Holbrook 2003), “Specifically, creativity in all areas seems to follow a sort of dialectic in which some structure (a thesis or configuration) gives way to a departure (an antithesis or deviation) that is followed, in turn, by a reconciliation (a synthesis or integration that becomes the basis for further development of the dialectic). In the case of jazz, the structure would include the melodic contour of a piece, its harmonic pattern, or its meter…. The departure would consist of melodic variations, harmonic substitutions, or rhythmic liberties…. The reconciliation depends on the way that the musical departures or violations of expectations are integrated into an emergent structure that resolves deviation into a new regularity, chaos into a new order, surprise into a new pattern as the performance progresses.”

In this Jazz metaphor, current IT evolved from a thesis, is currently experiencing an anti-thesis, and is ripe for a synthesis that would blend the old and the new with a harmonious melody to create a new generation of highly scalable, distributed, secure services with desired availability, cost and performance characteristics to meet the changing business priorities, highly fluctuating workloads and latency constraints.

The Hardware Upheaval and the Software Shortfall

There are three major factors driving datacenter traffic and its patterns:
1. A multi-tier architecture which determines the availability, reliability, performance, security and cost of initiating a user transaction to an end-point and delivering that service transaction to the user. The composition and management of the service transaction involves both the north-south traffic from end-user to the end-point (most often over the Internet) and the east-west traffic that flows through various service components such as DMZ servers, web servers, application servers and databases. Most often these components exist within the datacenter or connected through a WAN to other datacenters. Figure 1 shows a typical configuration.

Service Transaction Delivery Network

The transformation from client-server architectures to a “composed service” model, along with server virtualization that allows the mobility of Virtual Machines at run-time, is introducing new traffic patterns in which east-west traffic inside the datacenter grows by orders of magnitude compared to the north-south traffic going from end-user to the service end-point or vice-versa. Traditional applications that evolved from client-server architectures use TCP/IP for all the traffic that goes across servers. While some optimizations attempt to improve performance for applications that go across servers using high-speed network technologies such as InfiniBand, Ethernet etc., TCP/IP and socket communications still dominate even among virtual servers within the same physical server.

2. The advent of many-core servers with tens and even hundreds of computing cores with high bandwidth communication among them drastically alters the traffic patterns. When two applications are using two cores within a processor, the communication between them is not very efficient if it uses socket communication and the TCP/IP protocol instead of shared memory. When the two applications are running in two processors within the same server, it is more efficient to use PCI Express or other high-speed bus protocols instead of socket communication using TCP/IP. If the two applications are running in two servers within the same datacenter, it is more efficient to use Ethernet or InfiniBand. With the advent of mobility of applications using containers or even Virtual Machines, it is more efficient to switch the communication mechanism based on the context of where they are running. This context sensitive switching is a better alternative to replicating current VLAN and socket communications inside the many-core server. It is important to recognize that the many-core servers and processors constitute a network where each node itself is a sub-network with different bandwidths and protocols (socket-based low-bandwidth communication between servers, InfiniBand or PCI Express bus based communication across processors in the same server, and shared memory based low latency communication across the cores inside the processor). Figure 2 shows the network of networks using many-core processors.

A Network of Networks with Multiple Protocols

3. The many-core servers with a new class of flash memory and high-bandwidth networks offer a new architecture for services creation, delivery and assurance going far beyond the current infrastructure-centric service management systems that have evolved from single-CPU and low-bandwidth origins. Figure 3 shows a potential architecture where many-core servers are connected with high-bandwidth networks that obviate the need for the current complex web of infrastructure technologies and their management systems. The many-core servers, each with huge solid-state drives, SAS attached inexpensive disks and optical switching interfaces connected to WAN routers, offer a new class of services architecture if only the current software shortfall is plugged to match the hardware advances in server, network and storage devices.

If Server is the Cloud, What is the Service Delivery Network?

This would eliminate the current complexity mainly involved in dealing with TCP/IP across east-west traffic and infrastructure based service delivery and management systems to assure availability, reliability, performance, cost and security. For example, current security mechanisms that have evolved from TCP/IP communications do not make sense across east/west traffic and emerging container based architectures with layer 7 switching and routing independent of server and network security offer new efficiencies and security compliance.
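The context sensitive switching described in the second factor above can be sketched as a simple placement check. This is a minimal illustration, not an existing library: the `Location` fields and the `select_transport` function are hypothetical, and the choice of TCP/IP over a WAN for applications in different datacenters is an assumption added to complete the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Where an application instance runs (all fields are illustrative)."""
    datacenter: str
    server: str
    processor: int


def select_transport(a: Location, b: Location) -> str:
    """Pick the communication mechanism from the relative placement of a and b."""
    if a.datacenter != b.datacenter:
        return "tcp/ip over WAN"          # assumption: not stated in the text
    if a.server != b.server:
        return "ethernet or infiniband"   # two servers in the same datacenter
    if a.processor != b.processor:
        return "pci express"              # two processors in the same server
    return "shared memory"                # cores within the same processor


web = Location("dc-east", "server-1", processor=0)
app = Location("dc-east", "server-1", processor=1)
transport_before = select_transport(web, app)

# Hypothetical migration of the app to another server in the same datacenter.
app_moved = Location("dc-east", "server-2", processor=0)
transport_after = select_transport(web, app_moved)
```

The same pair of applications gets a different transport before and after the move, which is the behavior the text argues for instead of using sockets and TCP/IP everywhere.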

The current evolution of commodity clouds and distributed virtual datacenters provides on-demand resource provisioning, auto-failover, auto-scaling and live-migration of Virtual Machines, but these remain tied to the IP address and the associated complexity of dealing with infrastructure management in distributed environments to assure the end-to-end service transaction quality of service (QoS).

The QoS Gap

This either introduces vendor lock-in that precludes the advantages of commodity hardware or introduces complexity in dealing with a multitude of distributed infrastructures and their management to tune the service transaction QoS. Figure 4 shows the current state of the art. One can quibble whether it includes every product available or whether they are depicted correctly to represent their functionality, but the general picture describes the complexity and/or vendor lock-in dilemma. The important point to recognize is that the service transaction QoS depends on tuning the SLAs of distributed resources at run-time across multiple infrastructure owners with disparate management systems and incentives. The QoS tuning of service transactions is not scalable without increasing cost and complexity if it depends on tuning the distributed infrastructure with a multitude of point solutions and myriad infrastructure management systems.

What the Enterprise IT Wants:

There are three business drivers that are at the heart of the Enterprise Quest for an IT framework:

Compression of Time-to-Market: Proliferation of mobile applications, social networking, and web-based communication, collaboration and commerce is increasing the pressure on enterprise IT to support rapid service development, deployment and management processes. Consumer facing services demand quick response to rapidly changing workloads, and the large-scale computing, network and storage infrastructure supporting service delivery requires rapid reconfiguration to meet the fluctuations in workloads and infrastructure reliability, availability, performance and security.

Compression of Time-to-Fix: With consumers demanding “always-on” services supporting choice, mobility and collaboration, the availability, performance and security of end to end service transaction is at a premium and IT is under great pressure to respond by compressing the time to fix the “service” regardless of which infrastructure is at fault. In essence, the business is demanding the deployment of reliable services on not so reliable distributed infrastructure.

Cost Reduction of IT operation and management which is running at about 60% to 70% of its budget going to keep the “service lights” on: Current service administration and management paradigm that originated with server-centric and low-bandwidth network architecture is resource-centric and assumes that the resources (CPU, memory, network bandwidth, latency, storage capacity, throughput and IOPs) allocated to an application at install time can be changed to meet rapidly changing workloads and business priorities in real-time. Current state-of-the art uses virtual servers, network and storage that can be dynamically provisioned using software API. Thus the application and service (a group of applications providing a service transaction) QoS (quality of service defining the availability, performance, security and cost) can be tuned by dynamically reconfiguring the infrastructure. There are three major issues with this approach:

With a heterogeneous, distributed and multi-vendor infrastructure, tuning the infrastructure requires myriad point solutions, tools and integration packages to monitor current utilization of the resources by the service components, correlate and reason to define the actions required and coordinate many distributed infrastructure management systems to reconfigure the resources.

Public clouds and software as a service have worked well for new application development, for non-mission critical applications, and for applications that can be re-architected to optimize for the Cloud API and leverage the application/service components available. However, they add cost for IT to migrate the many existing mission critical applications that demand high security, high performance and low latency. The suggested Hybrid solutions require adopting a new cloud architecture in the datacenters or using myriad orchestration packages that add complexity and tool fatigue.

In order to address the need to compress time to market and time to fix and to reduce the complexity, enterprises small and big are desperately looking for solutions.

The lines of business owners want:

End-to-end visibility and control of service QoS independent of who provides the infrastructure

Availability, performance and security governance based on policies

Accounting of resource utilization and dynamic resolution of contention for resources

Application architecture decoupled from infrastructure by separating functional and non-functional requirements so that the application developers focus on business functions while deployment and operations are adjusted at run-time based on business priorities, latency constraints and workload fluctuations

Provide cloud-like services (on-demand provisioning of applications, self-repair, auto-scaling, live-migration and end-to-end security) at service level instead of at infrastructure level so that they can leverage their own datacenter resources, or commodity resources abundant in public clouds, without depending on cloud architectures, vendor API and cloud management systems.

Provide a suite of applications as a service (databases, queues, web servers etc.)

Service composition schemes that allow developers to reuse components and

Ability to provide end to end service level security independent of server and network security deployed to manage distributed resources

Ability to provide end-to-end service QoS visibility and control (on-demand service provisioning, auto-failover, auto-scaling, live migration and end-to-end security) across distributed physical or virtual servers in private or public infrastructure

Ability to reduce complexity and eliminate point solutions and myriad tools to manage distributed private and public infrastructure

Application Developers want:

To focus on developing service components, test them in their own environments and publish them in a service catalogue for reuse

Ability to compose services, test and deploy in their own environments and publish them in the service catalogue ready to deploy anywhere

Ability to specify the intent, context, constraints, communication, and control aspects of the service at run-time for managing non-functional requirements

An infrastructure that uses the specification to manage the run-time QoS with on-demand service provisioning on appropriate infrastructure (a physical or virtual server with appropriate service level assurance, SLA), manage run-time policies for fail-over, auto-scaling, live-migration and end-to-end security to meet run-time changes in business priorities, workloads and latency constraints.

Separation of run-time safety and survival of the service from sectionalizing, isolating, diagnosing and fixing at leisure

Get run-time history of service component behavior and ability to conduct correlated analysis to identify problems when they occur.
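One way to picture the specification developers ask for in the list above is as a small record carried alongside each service component, which an infrastructure-independent manager reads at run-time. The sketch below is hypothetical throughout: the `ServiceSpec` fields mirror the five aspects named in the text (intent, context, constraints, communication, control), but the field contents, the metric names and the `decide` function are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceSpec:
    """Hypothetical run-time specification attached to a service component."""
    intent: str                      # what the component is for
    context: dict                    # where it may run (e.g. allowed regions)
    constraints: dict                # QoS limits (e.g. max latency, min replicas)
    communication: list              # names of components it talks to
    control: dict = field(default_factory=dict)  # policy name -> action


def decide(spec: ServiceSpec, observed: dict) -> list:
    """Return the control actions triggered by the observed run-time metrics."""
    actions = []
    if observed["latency_ms"] > spec.constraints["max_latency_ms"]:
        actions.append(spec.control.get("on_latency_breach", "alert"))
    if observed["healthy_replicas"] < spec.constraints["min_replicas"]:
        actions.append(spec.control.get("on_replica_loss", "alert"))
    return actions


spec = ServiceSpec(
    intent="serve product catalogue",
    context={"regions": ["eu", "us"]},
    constraints={"max_latency_ms": 200, "min_replicas": 2},
    communication=["database", "cache"],
    control={"on_latency_breach": "auto-scale", "on_replica_loss": "fail-over"},
)
actions = decide(spec, {"latency_ms": 350, "healthy_replicas": 1})
```

The business logic of the component never appears in this record, which is the separation of functional and non-functional requirements the text calls for.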

We need to discover a path to bridge the current IT to the new IT without changing applications, or the OSs or the current infrastructure while providing a way to migrate to a new IT where service transaction QoS management is truly decoupled from myriad distributed infrastructure management systems. This is not going to happen with current ad-hoc programming approaches. We need a new or at least an improved theory of computing.

As Cockshott et al. (2012) point out, current computing, management and programming models fall short when you try to include the computers and the computed in the same model.

“the key property of general-purpose computer is that they are general purpose. We can use them to deterministically model any physical system, of which they are not themselves a part, to an arbitrary degree of accuracy. Their logical limits arise when we try to get them to model a part of the world that includes themselves.”

There are emerging technologies that might just provide the synthesis (reconciliation depends on the way that the architecture departures or violations of expectations are integrated into an emergent structure that resolves deviation into a new regularity, chaos into a new order, surprise into a new pattern as the transformation progresses) required to build the harmony by infusing cognition into computing. Only future will tell if this architecture is expressive enough and efficient as Mark Burgin claims in his elegant book on “Super Recursive Algorithms” quoted above.

Is Information Technology poised for a renaissance, the first since the great masters (Turing, von Neumann, Shannon, etc.) developed the original thesis, with a synthesis that takes us beyond the current distributed-cloud-management anti-thesis?

The IEEE WETICE2015 International conference track on “the Convergence of Distributed Clouds, GRIDs and their Management” to be held in Cyprus next June (15 – 18) will address some of these emerging trends and attempt to bridge the old and the new.

Trouble in IT Paradise with Darkening Clouds:

If you ask an enterprise CIO over a couple of drinks what the biggest hurdle is today in delivering the right resources to the business at the right time and at the right price, the answer would be that “the IT is too darn complex.” Over a long period of time, the infrastructure vendors have hijacked Information Technologies with their complex silos, and expediency has given rise to myriad tools and point solutions that overlay a management web. In addition, the Venture Capitalists looking for quickie “insertion points” with no overarching architectural framework have proliferated tools and appliances that have contributed to the current complexity and tool fatigue.

After a couple more drinks, if you press the CIO on why his/her mission critical applications are not migrating to the cloud, which claims lesser complexity, the CIO laments that no cloud provider is willing to sign a warranty that assures the availability, performance and security levels of those mission critical applications. “Every cloud provider talks about infrastructure service levels but is not willing to step up to assure application availability, performance and security. There are myriad off-the-main-street providers that claim to offer orchestration to provide the service levels, but no one yet is signing on the dotted line.” The situation is more complicated when the resources span multiple infrastructure providers.

Decoupling the strong binding between application management and infrastructure management is key for the CIO as more applications are developed with shorter time to market. The CIO’s top five priorities are transnational applications demanding distributed resources, security, cost, compliance and uptime. A Gartner report claims that CIOs spend 74% of the IT budget on keeping the application “lights on” and another 18% on “changing the bulbs” and other maintenance activities. (It is interesting to recall that before Strowger’s switch eliminated many operators sitting in long rows plugging countless jacks into countless plugs, the cost of adding and managing new subscribers was rising in a geometric proportion. According to the Bell System chronicles, one large city general manager of a telephone company at that time wrote that he could see the day coming soon when he would go broke merely by adding a few more subscribers, because the cost of adding and managing a subscriber was far greater than the corresponding revenue generated. The only difference between today’s IT datacenter and the central office before Strowger’s switch is that “very expensive consultants, countless hardware appliances, and countless software systems that manage them” replace “many operators, countless plugs and countless jacks”.)

In order to utilize commodity infrastructure while maintaining high security and mobility for performance and availability, CIOs are looking for solutions that let them focus on application quality of service (QoS). They are willing to outsource infrastructure management to providers who can assure application mobility, availability and security, provided they retain end-to-end service visibility and control.

While the public clouds seem to offer a way to leverage commodity infrastructure with on-demand Virtual Machine provisioning, there are four hurdles that are preventing CIOs from embracing the clouds for mission critical applications:

Current mission critical and even non-mission critical applications and services (groups of applications) are used to highly secure, low-latency infrastructures that have been hardened and managed, and CIOs are loath to spend more money to obtain the same level of SLAs in public clouds.

The dependence on a particular service provider’s infrastructure APIs, Virtual Machine Image Management (nested or not) infrastructure dependencies, and the added cost and complexity of self-healing, auto-scaling and live-migration services create lock-in to the service provider’s infrastructure and management services. This defeats the intent to leverage the commodity infrastructure offered by different service providers.

The increasing scope creep of infrastructure providers moving “up-the-stack” to provide application awareness and insert their APIs into application development, in the name of satisfying non-functional requirements (availability, security, performance optimization) at run-time, has started to increase the complexity and cost of application and service development. The resulting proliferation of tools and point solutions, without a global architectural framework for using resources from multiple service providers, has increased the integration and troubleshooting cost.

Global communications, collaboration and commerce at the speed of light have increased the scale of computing, and distributed computing resource management has fallen short in coping with both that scale and the fluctuations, whether caused by demand or by changes in resource availability, performance and security.

The Inadequacy of Ad-hoc Programming to Solve Distributed Computing Complexity:

Unfortunately, the complexity is more a structural issue than an operational or infrastructure technology issue, and it cannot be resolved with ad-hoc programming techniques to manage the resources. Cockshott et al. conclude their book “Computation and its limits” with the paragraph “The key property of general-purpose computer is that they are general purpose. We can use them to deterministically model any physical system, of which they are not themselves a part, to an arbitrary degree of accuracy. Their logical limits arise when we try to get them to model a part of the world that includes themselves.” While the success of IT in modeling and executing business processes has led to the current distributed datacenters and cloud computing infrastructures that provide on-demand computing resources, the structure and fluctuations that dictate the evolution of computation have introduced complexity in dealing with real-time changes in the interaction between the infrastructure and the computations it performs. The complexity manifests in the following ways:

In a distributed computing environment, ensuring that the right computing resources (CPU, memory, network bandwidth, latency, storage capacity, throughput and IOPs) are available to the right software component contributing to the service transaction requires orchestration and management of myriad computing infrastructures, often owned by different providers with different profit motives and incentives. The resulting complexity in resource management to assure availability, performance and security of service transactions adds to the cost of computing. For example, it is estimated that up to 70% of the current IT budget is consumed in assuring service availability, performance and security. The complexity is compounded in distributed computing environments that are supported by heterogeneous infrastructures with disparate management systems.

In a large-scale dynamic distributed computation supported by myriad infrastructure components, the increased component failure probabilities introduce non-determinism (for example, Google is observing emergent behavior in its scheduling of distributed computing resources when dealing with a large number of resources) that must be addressed by a service control architecture that decouples functional and non-functional aspects of computing.

Fluctuations in the computing resource requirements dictated by changing business priorities, workload variations that depend on service consumption profiles and real-time latency constraints dictated by the affinity of service components, all demand a run-time response to dynamically adjust the computing resources. Current dependence on myriad orchestrators and management systems cannot scale in a distributed infrastructure without either a vendor lock-in on infrastructure access methods or a universal standard that often stifles innovation and competition to meet fast changing business needs.

Thus the function, structure and fluctuations involved in dynamic processes delivering service transactions are driving the search for new computation, management and programming models that unify the computer and the computed and decouple service management from infrastructure management at run-time.

It is the Architecture, Stupid:

A business process is defined both by functional requirements that dictate the business domain functions and logic and by non-functional requirements that define operational constraints related to service availability, reliability, performance, security and cost dictated by business priorities, workload fluctuations and resource latency constraints. A non-functional requirement specifies criteria that can be used to judge the operation of a system, rather than specific behaviors. The plan for implementing functional requirements is detailed in the system design. The plan for implementing non-functional requirements is detailed in the system architecture. While much progress has been made in system design and development, the architecture of distributed systems falls short of addressing the non-functional requirements for two reasons:

Current distributed systems architecture, from its server-centric and low-bandwidth origins, has created layers of resource management-centric ad-hoc software to address various uncertainties that arise in a distributed environment. Lack of support for concurrency, synchronization, parallelism and mobility of applications, dictated by the current serial von-Neumann stored program control, has given rise to ad-hoc software layers that monitor and manage distributed resources. While this approach may have been adequate when distributed resources were owned by a single provider and controlled by a framework that provides architectural support for implementing non-functional requirements, the proliferation of commodity distributed resource clouds offered by different service providers with different management infrastructures adds scaling and complexity issues. The current OpenStack and AWS API discussions are a clear example: they force a choice of one or the other, or increased complexity to use both.

The resource-centric view of IT currently demotes application and service management to second-class citizenship, where the QoS of an application/service is monitored and managed by myriad resource management systems overlaid with multiple correlation and analysis layers. These layers manipulate the distributed resources to adjust the CPU, memory, bandwidth, latency, storage IOPs, throughput and capacity, all of which are required for the application/service to meet its quality of service. Obviously, this approach cannot scale unless a single set of standards evolves or a single vendor lock-in occurs.

Unless an architectural framework evolves to decouple application/service management from myriad infrastructure management systems owned and operated by different service providers with different profit motives, the complexity and cost of management will only increase.

A Not So Cool Metaphor to Deliver Very Cool Services Anywhere, Anytime and On-demand:

A lesson on an architectural framework that addresses non-functional requirements while connecting billions of users anywhere, anytime, on demand is found in the Plain Old Telephone System (POTS). From the beginnings of AT&T to today’s remaking of at&t, much has changed, but two things remain constant: the universal service (on a global scale) and the telecom grade “trust” that are taken for granted. Very recently, Mark Zuckerberg proclaimed at the largest mobile technology conference in Barcelona that his very cool service Facebook wants to be the dial tone for the Internet. Originally, the dial tone was introduced to assure the telephone user that the exchange is functioning when the telephone is taken off-hook, by breaking the silence (before an operator responded) with an audible tone. Later on, the automated exchanges provided a benchmark for telecom grade trust that assures managed resources on-demand with high availability, performance and security. Today, as soon as the user goes off hook, the network recognizes the profile based on the dialing telephone number. As soon as the called party’s number is dialed, the network recognizes the destination profile and provisions all the network resources required to make the desired connection, commence billing, and monitor and assure the connection till one of the parties initiates a disconnect. During the call, if the connection experiences any changes that impact the non-functional requirements, the network intelligence takes appropriate action based on policies. The resulting resiliency (availability, performance, and security), efficiency and scaling ability to connect billions of users on demand have come to be known as “Telecom grade trust”. An architectural flaw in the original service design (exploited by Steve Jobs by building a blue box) was fixed by introducing an architectural change to separate the data path and the control path. The resulting 800 service call model provided a new class of services such as call forwarding, call waiting and conference call.

The Internet, on the other hand, evolved to connect billions of computers together anywhere, anytime from the prophetic statement made by J. C. R. Licklider “A network of such (computers), connected to one another by wide-band communication lines [which provided] the functions of present-day libraries together with anticipated advances in information storage and retrieval and [other] symbiotic functions.” The convergence of voice over IP, data and video networks has given rise to a new generation of services enabling communication, collaboration and commerce at the speed of light. The result is that the datacenter has replaced the central office to become the hub from which myriad voice, video and data services are created and delivered on a global scale. However, the management of these services, which determines their resiliency, efficiency and scaling, is another matter. In order to provide on-demand services anywhere, anytime with a prescribed quality of service in an environment of wildly fluctuating workloads, changing business priorities and latency constraints dictated by the proximity of service consumers and suppliers, resources have to be managed in real-time across distributed pools to match the service QoS to resource SLAs. The telephone network is designed to share resources on a global scale and to connect them as required in real-time to meet the non-functional service requirements, while current datacenters (whether privately owned or publicly provided as cloud services) are not. There are three structural deficiencies that keep the current distributed datacenter architecture from matching telecom grade resiliency, efficiency and scaling:

The data path and service control path are not decoupled, giving rise to the same problems that Steve Jobs exploited and that forced a re-architecting of the telephone network.

The service management is strongly coupled with the resource management systems and does not scale as the resources become distributed and multiple service providers provide those resources with different profit motives and incentives. Since the resources are becoming commodity, every service provider wants to go up the stack to provide lock-in.

The current trend to infuse resource management APIs into service logic to provide resource management at run-time, and application-aware architectures that want to establish intimacy with applications, only increase complexity and make service composition with reusable service components all the more difficult because of their increased lock-in with resource management systems.

Resource management based datacenter operations miss an important feature of services/applications management, which is that not all services are created equal. They have different latency and throughput requirements. They have different business priorities and different workload characteristics and fluctuations. What works for the goose does not work for the gander. In addition to the current complexity and cost of resource management to assure service availability, reliability, performance and security, there is an even more fundamental issue that plagues the current distributed systems architecture. A distributed transaction that spans multiple servers, networks and storage devices in multiple geographies uses resources that span multiple datacenters. Managing the fault, configuration, accounting, performance and security (FCAPS) of a distributed transaction requires end-to-end connection management, much like a telecommunication service spanning distributed resources. Therefore, focusing only on resource management in a datacenter, without visibility and control of all resources participating in the transaction, will not provide assurance of service availability, reliability, performance and security at run-time.

New Dial Tones for Application/Service Development, Deployment and Operation:

Current Web-scale applications are distributed transactions that span multiple resources widely scattered across multiple locations owned and managed by different providers. In addition, the transactions are transient, making connections with various components to fulfill an intent and closing them, only to reconnect when they need them again. This is very much in contrast to the always-on distributed computing paradigm of yesterday.

In creating, deploying and operating these services, there are three key stakeholders and associated processes:

Resource providers deliver the vital resources required to create, deploy and operate these services on demand, anywhere, anytime (the resource dial tone). The vital resources are just the CPU, memory, network latency, bandwidth and storage capacity, throughput and IOPs required to execute the application or service that has been compiled to “1”s and “0”s (the Turing Machine). The resource consumers do not care how these are provided as long as the resource providers maintain the service levels they agree to when the application or service requests the resources at provisioning time (matching the QoS request with the SLA and maintaining it during the application/service life-time). The resource dial tone that assures the QoS with a resource SLA is offered to two different types of consumers. First, the application developers, who use these resources to develop the service components and compose them to create more complex services with their own QoS requirements. Second, the service operators, who use the SLAs to provide management of QoS at run-time to deliver the services to end users.

The application developers like to use their tools and best practices without any constraints from resource providers, and the run-time vital signs required to execute their services should be independent of where or by whom the vital resources are provided. The resources must support the QoS specified by the developer or service composer depending on the context, communication, control and constraint needs. They do not care how they get the CPU, memory, bandwidth, storage capacity, throughput or IOPs or how the latency constraints are met. This model is a major departure from the current SDN route, which focuses on giving control of resources to applications; that is not a scalable solution and does not allow decoupling of resource management from service management.

The service operators provide run-time QoS assurance by brokering the QoS demands to match the best available resource pool that meets the cost and quality constraints (the management dial tone that assures non-functional requirements). The brokering function is a network service, à la service switching, that matches the applications/services to the right resources.

The brokering service must then provide the non-functional requirements management at run-time just as in POTS.
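As a rough illustration of what policy-driven management of non-functional requirements at run-time could look like, here is a minimal Python sketch. Everything in it is hypothetical: the names `QoSTarget`, `Observation`, `Action` and `choose_action`, and in particular the mapping from a deviation to a response (unreachable means fail-over, excess latency means live-migration, low throughput means auto-scaling), are assumptions made for the example and not a description of any product or of how POTS is actually implemented.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """Hypothetical policy responses named after the run-time policies in the text."""
    NONE = "none"
    AUTO_SCALE = "auto-scale"
    LIVE_MIGRATE = "live-migrate"
    FAIL_OVER = "fail-over"


@dataclass(frozen=True)
class QoSTarget:
    """What the service was promised at provisioning time (assumed fields)."""
    max_latency_ms: float
    min_throughput_ops: float


@dataclass(frozen=True)
class Observation:
    """What the broker measures for a service component at run-time (assumed fields)."""
    reachable: bool
    latency_ms: float
    throughput_ops: float


def choose_action(target: QoSTarget, seen: Observation) -> Action:
    """Map an observed deviation from the QoS target to one policy action.

    The priority order is an assumption of this sketch:
    survival first, then latency, then throughput.
    """
    if not seen.reachable:
        return Action.FAIL_OVER
    if seen.latency_ms > target.max_latency_ms:
        return Action.LIVE_MIGRATE
    if seen.throughput_ops < target.min_throughput_ops:
        return Action.AUTO_SCALE
    return Action.NONE


target = QoSTarget(max_latency_ms=50.0, min_throughput_ops=1000.0)
# A reachable component with good latency but low throughput.
print(choose_action(target, Observation(True, 20.0, 400.0)).value)
```

The point of the sketch is only that the decision is expressed in terms of the service's QoS, not in terms of any particular provider's management system.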

The New Service Operations Center (SOC) with End-to-end Service Visibility and Control Independent of Distributed Infrastructure Management Centers Owned by Different Infrastructure Providers:

The new Telco model that the broker facilitates allows the enterprises and other infrastructure users to focus on services architecture and management and use infrastructure as a commodity from different infrastructure providers just as Telcos provide shared resources with network services.

Figure 1: The Telco Grade Services Architecture that decouples end to end service transaction management from infrastructure management systems at run-time

The service broker matches the QoS of the service and its components with the service levels offered by different infrastructure providers, based on the service blueprint which defines the context, constraints, communications and control abstraction of the service at hand. The service components are provided with the desired CPU, memory, bandwidth, latency, storage IOPs, throughput and capacity. The decoupling of service management from distributed infrastructure management systems puts the safety and survival of services first and allows sectionalization, isolation, diagnosis and fixing of infrastructure at leisure, as is the case today with POTS.
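The matching step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `QoSRequest`, `SLAOffer`, `satisfies` and `broker` are hypothetical names, the provider names and numbers are invented, and "pick the cheapest feasible offer" is one possible selection rule among many.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QoSRequest:
    """QoS a service component asks for in its blueprint (assumed fields)."""
    cpu_cores: int
    memory_gb: float
    max_latency_ms: float
    min_iops: int


@dataclass(frozen=True)
class SLAOffer:
    """Service level an infrastructure provider is willing to commit to."""
    provider: str
    cpu_cores: int
    memory_gb: float
    latency_ms: float
    iops: int
    cost_per_hour: float


def satisfies(offer: SLAOffer, request: QoSRequest) -> bool:
    """True when the offer meets every constraint in the request."""
    return (
        offer.cpu_cores >= request.cpu_cores
        and offer.memory_gb >= request.memory_gb
        and offer.latency_ms <= request.max_latency_ms
        and offer.iops >= request.min_iops
    )


def broker(request: QoSRequest, offers: list[SLAOffer]) -> Optional[SLAOffer]:
    """Return the cheapest offer that meets the request, or None if none does."""
    feasible = [o for o in offers if satisfies(o, request)]
    return min(feasible, key=lambda o: o.cost_per_hour, default=None)


# Invented example data: three providers, one request.
offers = [
    SLAOffer("provider-a", 8, 32.0, 5.0, 20000, 0.90),
    SLAOffer("provider-b", 4, 16.0, 12.0, 8000, 0.40),
    SLAOffer("provider-c", 4, 16.0, 40.0, 8000, 0.25),
]
request = QoSRequest(cpu_cores=4, memory_gb=16.0, max_latency_ms=15.0, min_iops=5000)
chosen = broker(request, offers)
print(chosen.provider if chosen else "no feasible offer")  # prints "provider-b"
```

Here the cheapest offer overall is rejected because it misses the latency constraint, which is the sense in which the broker matches on quality first and cost second.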

It is important to note that the service dial tone Zuckerberg is talking about is not related to the resource dial tone or management dial tone required for providing service connections and management at run-time. He is talking about the application end user receiving the content. Facebook application developers do not care how the computing resources are provided as long as their service QoS is maintained to meet the business priorities, workloads and latency constraints to deliver their service on a global scale. Facebook’s CIO would rather spend time maintaining the service QoS by getting the resources wherever they are available to meet the service needs at reasonable cost. In fact, most CIOs would get rid of the infrastructure management burden if they had QoS assurance and end-to-end service visibility and service control (they could not care less about access to resources or their management systems) to manage the non-functional requirements at run-time. After all, Facebook’s open compute project is a side effect of trying to fill a gap left by infrastructure providers – not their main line of business. The crash that resulted after Zuckerberg’s announcement of the WhatsApp acquisition was not the “cool” application’s fault. They probably could have used a service broker/switch providing the old fashioned resource dial tone so that they could provide the service dial tone to their users.

This is similar to a telephone company assuring appropriate resources to connect different users based on their profiles, or the Internet connecting devices based on their QoS needs at run-time. The broker acts as a service switch that connects various service components at run-time and matches the QoS demands with appropriate resources.

With the right technology, the service broker/switch may yet provide the required service level warranties to the enterprise CEOs from well-established carriers with money and muscle.

Will at&t and other Telcos have the last laugh by incorporating this brokering service switch in the network and making current distributed datacenters (cloud or otherwise with physical or virtual infrastructure) a true commodity?

“The key property of general-purpose computer is that they are general purpose. We can use them to deterministically model any physical system, of which they are not themselves a part, to an arbitrary degree of accuracy. Their logical limits arise when we try to get them to model a part of the world that includes themselves.”

Summary

The “Convergence of Clouds, Grids and their Management” conference track is devoted to discussing current and emerging trends in virtualization, cloud computing, high-performance computing, Grid computing and cognitive computing. The tradition that started in WETICE2009 “to analyze current trends in Cloud Computing and identify long-term research themes and facilitate collaboration in future research in the field that will ultimately enable global advancements in the field that are not dictated or driven by the prototypical short term profit driven motives of a particular corporate entity” has resulted in a new computing model that was included in the Turing Centenary Conference proceedings in 2012. More recently, a product based on these ideas was discussed in the 2013 Open Server Summit (www.serverdesignsummit.com), where many new ideas and technologies were presented to exploit the new generation of many-core servers, high-bandwidth networks and high-performance storage. We present here some thoughts on current trends which we hope will stimulate further research to be discussed in the WETICE 2014 conference track in Parma, Italy (http://wetice.org).

Introduction

Current IT datacenters have evolved from their server-centric, low-bandwidth origins to distributed and high-bandwidth environments where resources can be dynamically allocated to applications using computing, network and storage resource virtualization. While Virtual machines improve resiliency and provide live migration to reduce the recovery time objectives in case of service failures, the increased complexity of hypervisors, their orchestration, Virtual Machine images and their movement and management adds an additional burden in the datacenter.

Further automation trends continue to move toward static applications (locked-in-a-virtual machine, often as one application in one virtual machine) in a dynamic infrastructure (virtual servers, virtual networks, virtual storage, Virtual Image managers etc.). The safety and survival of applications and end to end service transactions delivered by a group of applications are managed by dynamically monitoring and controlling the resources at run-time in real-time. As services migrate to distributed environments where applications contributing to a service transaction are deployed in different datacenters and public or private clouds often owned by different providers, resource management across distributed resources is provided using myriad point solutions and tools that monitor, orchestrate and control these resources. A new call for application-centric infrastructure proposes that the infrastructure provide (http://blogs.cisco.com/news/application-centric-infrastructure-a-new-era-in-the-data-center/):

Application Velocity (Any workload, anywhere): Reducing application deployment time through a fully automated and programmatic infrastructure for provisioning and placement. Customers will be able to define the infrastructure requirements of the application, and then have those requirements applied automatically throughout the infrastructure.

A common platform for managing physical, virtual and cloud infrastructure: The complete integration across physical and virtual, normalizing endpoint access while delivering the flexibility of software and the performance, scale and visibility of hardware across multi-vendor, virtualized, bare metal, distributed scale out and cloud applications

Systems Architecture: A holistic approach with the integration of infrastructure, services and security along with the ability to deliver simplification of the infrastructure, integration of existing and future services with real time telemetry system wide.

Common Policy, Management and Operations for Network, Security, Applications: A common policy management framework and operational model driving automation across Network, Security and Application IT teams that is extensible to compute and storage in the future.

Open APIs, Open Source and Multivendor: A broad ecosystem of partners who will be empowered by a comprehensive published set of APIs and innovations contributed to open source.

The best of Custom and Merchant Silicon: To provide highly scalable, programmatic performance, low-power platforms and optics innovations that protect investments in existing cabling plants, and optimize capital and operational expenditures.

Perhaps this approach will work in a utopian IT landscape where either the infrastructure is provided by a single vendor or universal standards force all infrastructures to support a common API. Unfortunately, the real world evolves in a diverse, heterogeneous and competitive environment, and what we are left with is a strategy that cannot scale and lacks end-to-end service visibility and control. End-to-end security becomes difficult to assure because of the myriad security management systems that control distributed resources. The result is open source systems that attempt to fill this niche. Unfortunately, in a highly networked world where multiple infrastructure providers provide a plethora of diverse technologies that evolve at a rapid rate to absorb high-paced innovations, orchestrating the infrastructure to meet the changing workload requirements that applications must deliver is a losing battle. The complexity and tool fatigue resulting from layers of virtualization and orchestration of orchestrators are crippling the operation and management of datacenters (virtualized or not), with 70% of current IT budgets going toward keeping the lights on. An explosion of tools, special purpose appliances (for Disaster Recovery, IP security, Performance optimization etc.) and administrative controls has escalated operation and management costs. A Gartner report estimates that for every $1 spent on development of an application, another $1.31 is spent on assuring safety & survival. While all vendors agree upon Open Source, Open API, and multi-vendor support, reality is far from it. An example is the recent debate about whether OpenStack should include Amazon AWS API support while the leading cloud provider conveniently ignores the competing API.

The Strategy of Dynamic Virtual Infrastructure

The following picture, presented at the Open Server Summit, shows a vision of the future datacenter with a virtual switch network overlaid on the physical network.

In addition to the physical network connecting physical servers, a virtual network overlay inside each physical server connects the virtual machines it hosts. In addition, a plethora of virtual machines are being introduced to replace the physical routers and switches that control the physical network. The quest to dynamically reconfigure the network at run-time to meet the changing application workloads, business priorities and latency constraints has introduced layers of additional network infrastructure, albeit software-defined. While applications are locked in a virtual server, the infrastructure is evolving to dynamically reconfigure itself to meet changing application needs. Unfortunately, this strategy cannot scale in a distributed environment where different infrastructure providers deploy myriad heterogeneous technologies and management strategies, and it results in orchestrators of orchestrators contributing to complexity and tool fatigue in both datacenters and cloud environments (private or public).

Figure 2 shows a new storage management architecture also presented in the Open Server Summit.

The PCIe switch allows a converged physical storage fabric at half the cost and half the power of current infrastructure. In order to leverage these benefits, the management infrastructure has to accommodate it which adds to the complexity.

In addition, it is estimated that the data traffic inside the datacenter is about 1000 times the traffic exchanged with users outside. This completely changes the role of TCP/IP traffic inside the datacenter and consequently the communication architecture between applications inside the datacenter. It no longer makes sense for virtual machines running inside a many-core server to use TCP/IP as long as they are within the datacenter. In fact, it makes more sense for them to communicate via shared memory when they are executed on different cores within a processor, via a high-speed bus when they are executed on different processors in the same server, and via a high-speed network when they are executed on different servers in the same datacenter. TCP/IP is only needed when communicating with users outside the datacenter who can only be accessed via the Internet.
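The tiering described above can be sketched as a simple selection rule. This is only an illustration in Python: the `Location` record, the `Transport` names and the `choose_transport` function are hypothetical and not part of any product discussed here. The rule picks the narrowest channel that spans both endpoints and falls back to TCP/IP only when an endpoint is outside the datacenter.

```python
from dataclasses import dataclass
from enum import Enum


class Transport(Enum):
    SHARED_MEMORY = "shared memory"
    SERVER_BUS = "high-speed bus"
    DATACENTER_NETWORK = "high-speed network"
    TCP_IP = "TCP/IP"


@dataclass(frozen=True)
class Location:
    """Hypothetical placement of a communicating endpoint."""
    datacenter: str
    server: str
    processor: int
    core: int


def choose_transport(a: Location, b: Location) -> Transport:
    """Pick the narrowest channel that spans both endpoints."""
    if a.datacenter != b.datacenter:
        # One endpoint is outside the datacenter: only the Internet reaches it.
        return Transport.TCP_IP
    if a.server != b.server:
        # Different servers in the same datacenter.
        return Transport.DATACENTER_NETWORK
    if a.processor != b.processor:
        # Different processors in the same server.
        return Transport.SERVER_BUS
    # Cores within one processor.
    return Transport.SHARED_MEMORY
```

For example, two endpoints on different cores of the same processor resolve to `Transport.SHARED_MEMORY`, while an endpoint in another datacenter resolves to `Transport.TCP_IP`.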

Figure 3 shows the server evolution.

Figure 3: Servers for the New Style of IT – Presented at Open Server Summit 2013, Dwight Barron, HP Fellow and Chief Technologist, Hyper-scale Server Business Segment, HP Servers Global Business Unit, Hewlett-Packard

As the following picture presents, current evolution of the datacenter is designed to provide dynamic control of resources for addressing the work-load fluctuations at run-time, changing business priorities and real-time latency constraints. The applications are static in a Virtual or Physical Server and the software defined infrastructure dynamically adjusts to changing application needs.

With the advent of many-core servers, high bandwidth technologies connecting these servers, and a new class of high performance storage devices that can be optimized to meet the workload needs (IOPs intensive, throughput sensitive or capacity hungry), is it time to look at a static infrastructure with dynamic application/service management to reduce IT complexity in both datacenters and clouds (public or private)? This is possible if we can virtualize the applications inside a server (physical or virtual) and decouple the safety and survival of the applications, and of the groups of applications that contribute to a distributed transaction, from the myriad resource management systems that provision and control a plethora of distributed resources supporting these applications.

The Cognitive Container discussed at the Open Server Summit (http://lnkd.in/b7-rfuK) provides the decoupling required between application and service management and the underlying distributed resource management systems. The Cognitive Container is specially designed to decouple, at run-time, the management of an application, and of the service transactions that a group of distributed applications execute, from the infrastructure management systems controlling their resources, which are often owned or operated by different providers. The safety and survival of the application at run-time is given priority by infusing knowledge about the application (such as its intent, non-functional attributes, run-time constraints, connections and communication behaviors) into the container and using this information to monitor and manage the application at run-time. The Cognitive Container is instantiated and managed by a Distributed Cognitive Transaction Platform (DCTP) that sits between the applications and the OS, facilitating the run-time management of Cognitive Containers. The DCTP does not require any changes to the application, OS or infrastructure and uses the local OS in a physical or virtual server. A network of Cognitive Containers infused with similar knowledge about the service transaction they execute is also managed at run-time to assure safety and survival based on policies dictated by business priorities, run-time workload fluctuations and real-time latency constraints. Using the properties of replication, repair, recombination and reconfiguration, the Cognitive Container network provides dynamic service management independent of infrastructure management systems at run-time. The Cognitive Containers are designed to use the local operating system to monitor the application’s vital signs (CPU, memory, bandwidth, latency, storage capacity, IOPs and throughput) and run-time behavior, and to manage the application so that it conforms to the policies.

The Cognitive Container can be deployed in a physical or virtual server and does not require any changes to the applications, OSs or the infrastructure. Only the knowledge about the functional and non-functional requirements has to be infused into the Cognitive Container. The following figure shows a Cognitive Network deployed in a distributed infrastructure. The Cognitive Container and the service management are designed to provide auto-scaling, self-repair, live-migration and end-to-end service transaction security independent of the infrastructure management system.

Using the Cognitive Container network, it is possible to create a federated service creation, delivery and assurance platform that transcends physical and virtual server boundaries and geographical locations, as shown in the figure below.

This architecture provides an opportunity to simplify the infrastructure: a tiered server, storage and network infrastructure that is static and hardwired can provide various servers (physical or virtual) with the specified service levels (CPU, memory, network bandwidth, latency, storage capacity and throughput) that the Cognitive Containers are looking for based on their QoS requirements. It does not matter what technology is used to provision these servers with the required service levels. The Cognitive Containers monitor these vital signs using the local OS and, if they are not adequate, migrate to other servers where they are adequate, based on policies determined by business priorities, run-time workload fluctuations and real-time latency constraints.

The infrastructure provisioning then becomes a simple matter of matching the Cognitive Container to the server based on QoS requirements. Thus the Cognitive Container services network provides a mechanism to deploy intelligent (self-aware, self-reasoning and self-controlling) services using dumb infrastructure with limited intelligence about services and applications (matching application profile to the server profile) on stupid pipes that are designed to provide appropriate performance based on different technologies as discussed in the Open Server Summit.
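The matching and migration rule described above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the `ServiceLevel` fields, the server names and the functions `adequate`, `match_server` and `next_host` are hypothetical and do not describe the DCTP implementation. A server is adequate when every offered service level meets the container's requirement, and a container stays where it is until its current server stops being adequate.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ServiceLevel:
    """Hypothetical profile; field names are illustrative, not a DCTP schema."""
    cpu_cores: int
    memory_gb: int
    bandwidth_mbps: int
    latency_ms: float
    storage_gb: int
    throughput_mbps: int


def adequate(offered: ServiceLevel, required: ServiceLevel) -> bool:
    """True when every offered service level meets the requirement."""
    return (
        offered.cpu_cores >= required.cpu_cores
        and offered.memory_gb >= required.memory_gb
        and offered.bandwidth_mbps >= required.bandwidth_mbps
        and offered.latency_ms <= required.latency_ms  # lower latency is better
        and offered.storage_gb >= required.storage_gb
        and offered.throughput_mbps >= required.throughput_mbps
    )


def match_server(required: ServiceLevel,
                 servers: Dict[str, ServiceLevel]) -> Optional[str]:
    """Return the first server whose profile satisfies the requirement."""
    for name, offered in servers.items():
        if adequate(offered, required):
            return name
    return None


def next_host(current: str, required: ServiceLevel,
              measured: Dict[str, ServiceLevel]) -> Optional[str]:
    """Stay put while the current server is adequate; otherwise look elsewhere."""
    if adequate(measured[current], required):
        return current
    others = {n: s for n, s in measured.items() if n != current}
    return match_server(required, others)
```

In this sketch the infrastructure only publishes profiles; all decisions about placement and migration sit with the container side, which mirrors the separation of concerns argued for in the text. A real policy would also weigh business priorities and workload fluctuations, which are left out here.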

The managing and safekeeping of applications required to cope with the non-deterministic impact on workloads of changing demands, business priorities, latency constraints, limited resources and security threats is very similar to how cellular organisms manage life in a changing environment. The efficient managing and safekeeping of life at the lowest level of biological architecture, which provides this resiliency, was on von Neumann’s mind when he presented his Hixon lecture (Von Neumann, J. (1987) Papers of John von Neumann on Computing and Computing Theory, Hixon Symposium, September 20, 1948, Pasadena, CA, The MIT Press, Massachusetts, p474). ‘‘The basic principle of dealing with malfunctions in nature is to make their effect as unimportant as possible and to apply correctives, if they are necessary at all, at leisure. In our dealings with artificial automata, on the other hand, we require an immediate diagnosis. Therefore, we are trying to arrange the automata in such a manner that errors will become as conspicuous as possible, and intervention and correction follow immediately.’’ Comparing the computing machines and living organisms, he points out that the computing machines are not as fault tolerant as the living organisms. He goes on to say ‘‘It’s very likely that on the basis of philosophy that every error has to be caught, explained, and corrected, a system of the complexity of the living organism would not run for a millisecond.’’ Perhaps the Cognitive Container bridges this gap by infusing self-management into computing machines that manage the external world while also managing themselves with self-awareness, reasoning, and control based on policies and best practices.

Cognitive Containers or not, the question is how do we address the problem of ever increasing complexity and cost in current datacenter and cloud offerings? This will be a major theme in the 4th conference track on the Convergence of Distributed Clouds, Grids and their management at WETICE2014 in Parma, Italy.

Here is an excerpt from the WETICE2013 Track #3 - Convergence of Distributed Clouds, Grids and Their Management

Convergence of Distributed Clouds, Grids and their Management – CDCGM2013

WETICE2013 – Hammamet, June 17 – 20, 2013

Track Chair’s Report

Dr. Rao Mikkilineni, IEEE Member, and Dr. Giovanni Morana

Abstract

The Convergence of distributed clouds, grids and their management conference track focuses on virtualization and cloud computing as they enjoy wider acceptance. A recent IDC report predicts that by 2016, $1 of every $5 will be spent on cloud-based software and infrastructure. Three papers address key issues in cloud computing such as resource optimization and scaling to address changing workloads and energy management. In addition, the DIME network architecture proposed in WETICE2010 is discussed in two papers in this conference, both showing its usefulness in addressing fault, configuration, accounting, performance and security of service transactions within the service oriented architecture implementation and also spanning across multiple clouds.

While virtualization has brought resource elasticity and application agility to the services infrastructure management, the resulting layers of orchestration and the lack of end-to-end service visibility and control spanning across multiple service provider infrastructure have added an alarming degree of complexity. Hopefully, reducing the complexity in the next generation datacenters will be a major research topic in this conference.

Introduction

While virtualization and cloud computing have brought elasticity to computing resources and agility to applications in a distributed environment, they have also increased complexity of managing various distributed applications contributing to a distributed service transaction delivery by adding layers of orchestration and management systems. There are three major factors contributing to the complexity:

Current IT datacenters have evolved from their server-centric, low-bandwidth origins to distributed and high-bandwidth environments where resources can be dynamically allocated to applications using computing, network and storage resource virtualization. While Virtual machines improve resiliency and provide live migration to reduce the recovery time objectives in case of service failures, the increased complexity of hypervisors, their orchestration, Virtual Machine images and their movement and management adds an additional burden in the datacenter. A recent global survey commissioned by Symantec Corporation involving 2,453 IT professionals at organizations in 32 countries concludes [1] that the complexity introduced by virtualization, cloud computing and proliferation of mobile devices is a major problem. The survey asked respondents to rate the level of complexity in each of five areas on a scale of 0 to 10, and the results show that data center complexity affects all aspects of computing, including security and infrastructure, disaster recovery, storage and compliance. For example, respondents on average rated all the areas 6.56 or higher on the complexity scale, with security topping the list at 7.06. The average level of complexity for all areas for companies around the world was 6.69. The survey shows that organizations in the Americas on average rated complexity highest, at 7.81, and those in Asia-Pacific/Japan lowest, at 6.15.

As the complexity increases, the response is to introduce more automation of resource administration and operational controls. However, the increased complexity of management of services may be more a fundamental architectural issue related to Gödel’s prohibition of self-reflection in Turing machines [2] than a software design or an operational execution issue. Cockshott et al. [3] conclude their book “Computation and its limits” with the paragraph “The key property of general-purpose computer is that they are general purpose. We can use them to deterministically model any physical system, of which they are not themselves a part, to an arbitrary degree of accuracy. Their logical limits arise when we try to get them to model a part of the world that includes themselves.” Automation of dynamic resource administration at run-time makes the computer itself a part of the model and also a part of the problem.

As the services increasingly span across multiple datacenters often owned and operated by different service providers and operators, it is unrealistic to expect that more software that coordinates the myriad resource management systems belonging to different owners is the answer for reducing complexity. A new approach that decouples the service management from underlying distributed resource management systems which are often non-communicative and cumbersome is in order.

The current course becomes even more untenable with the advent of many-core servers with tens and even hundreds of computing cores with high bandwidth communication among them. It is hard to imagine replicating current TCP/IP based socket communication, “isolate and fix” diagnostic procedures, and the multiple operating systems (which do not have end-to-end visibility or control of business transactions that span across multiple cores, multiple chips, multiple servers and multiple geographies) inside the next generation many-core servers without addressing their shortcomings. The many-core servers and processors constitute a network where each node itself is a sub-network with different bandwidths and protocols (socket-based low bandwidth communication between servers, InfiniBand or PCI Express bus based communication across processors in the same server, and shared memory based low latency communication across the cores inside the processor).

The tradition that started in WETICE2009 “to analyze current trends in Cloud Computing and identify long-term research themes and facilitate collaboration in future research in the field that will ultimately enable global advancements in the field that are not dictated or driven by the prototypical short term profit driven motives of a particular corporate entity” has resulted in a new computing model that was included in the Turing Centenary Conference proceedings in 2012 [3, 4]. Two papers in this conference continue the investigation of its usefulness. Hopefully, this tradition will result in other novel and different approaches to address the datacenter complexity issue while incremental improvements continue as is evident from another three papers.

“WETICE 2012 Convergence of Distributed Clouds, Grids and their Management Conference Track is devoted to transform current labor intensive, software/shelf-ware-heavy, and knowledge-professional-services dependent IT management into self-configuring, self-monitoring, self-protecting, self-healing and self-optimizing distributed workflow implementations with end-to-end resource management by facilitating the development of a Unified Theory of Computing.”

Here is more food for thought…

Abstract:

Cellular biology has evolved to capture dynamic representations of self and its surroundings and a systemic view of monitoring and control of both the self and the surroundings to optimize the organism’s chances of survival. Signaling plays a key role in shaping the structure and behavior of cellular organisms to exhibit a high degree of resiliency by monitoring and controlling its own activity and its interactions with the outside environment with a Zen-like one-ness of the observer and the observed. Evolution has invented the genetic transactions of replication, repair, recombination and reconfiguration to support the survival of living cells by organizing themselves to execute a coordinated set of activities and signaling provides a vehicle for managing the system-wide behavior.

By introducing signaling and self-management in a Turing node and a signaling network as an overlay over the computing network, the current von-Neumann computing model is evolved to bring the architectural resiliency of cellular organisms to computing infrastructure. The new approach introduces the genetic transactions of replication, repair, recombination and reconfiguration to program self-resiliency in distributed computing systems executing a managed workflow. Perhaps, the injection of parallelism and network based composition of “Self” identity are the first steps in introducing the elements of homeostasis and self-management required for developing consciousness in the computing infrastructure.

Introduction:

As recent advances in neuroscience throw new light on the process of evolution of the cellular computing models, it is becoming clear that communication and collaboration mechanisms of distributed computing elements and end-to-end distributed transaction management played a crucial role in the development of self-resiliency, efficiency and scaling which are exhibited by diverse forms of life from the cellular organisms to highly evolved human beings. According to Antonio Damasio (Damasio 2010), managing and safe keeping life is the fundamental premise of biological value and this biological value has influenced the evolution of brain structures. “Life regulation, a dynamic process known as homeostasis for short, begins in unicellular living creatures, such as bacterial cell or a simple amoeba, which do not have a brain but are capable of adaptive behavior. It progresses in individuals whose behavior is managed by simple brains, as in the case with worms, and it continues its march in individuals whose brains generate both behavior and mind (insects and fish being examples)….” Homeostasis is the property of a system that regulates its internal environment and tends to maintain a stable, constant condition of properties like temperature or chemical parameters that are essential to its survival. System-wide homeostasis goals are accomplished through a representation of current state, desired state, a comparison process and control mechanisms.

He goes on to say that “consciousness came into being because of biological value, as a contributor to more effective value management. But consciousness did not invent biological value or the process of valuation. Eventually, in human minds, consciousness revealed biological value and allowed the development of new ways and means of managing it.” The governance of life’s processes is present even in single-celled organisms that lack a brain and it has evolved to the conscious awareness which is the hallmark of highly evolved human behavior. “Deprived of conscious knowledge, deprived of access to the byzantine devices of deliberation available in our brains, the single cell seems to have an attitude: it wants to live out its prescribed genetic allowance. Strange as it may seem, the want, and all that is necessary to implement it, precedes the explicit knowledge and deliberation regarding life conditions, since the cell clearly has neither. The nucleus and the cytoplasm interact and carry out complex computations aimed at keeping the cell alive. They deal with the moment-to-moment problems posed by the living conditions and adapt the cell to the situation in a survivable manner. Depending on the environmental conditions, they rearrange the position and distribution of molecules in their interior, and they change the shape of sub-components, such as microtubules, in an astounding display of precision. They respond under duress and under nice treatment too. Obviously, the cell components carrying out those adaptive adjustments were put into place and instructed by the cell’s genetic material.” This vivid insight brings to light the cellular computing model that:

Spells out the computational workflow components as a stable sequence of patterns that accomplishes a specific purpose,

Implements a parallel management workflow with another sequence of patterns that assures the successful execution of the system’s purpose (the computing network to assure biological value with management and safekeeping),

Uses a signaling mechanism that controls the execution of the workflow for gene expression (the regulatory network) and

The efficient managing and safekeeping of life is evident at the lowest level of biological architecture, which provides the resiliency that von Neumann discussed in his Hixon lecture (von Neumann, 1987). ‘‘The basic principle of dealing with malfunctions in nature is to make their effect as unimportant as possible and to apply correctives, if they are necessary at all, at leisure. In our dealings with artificial automata, on the other hand, we require an immediate diagnosis. Therefore, we are trying to arrange the automata in such a manner that errors will become as conspicuous as possible, and intervention and correction follow immediately.’’ Comparing the computing machines and living organisms, he points out that the computing machines are not as fault tolerant as the living organisms. He goes on to say ‘‘It’s very likely that on the basis of philosophy that every error has to be caught, explained, and corrected, a system of the complexity of the living organism would not run for a millisecond.’’

The connection between consciousness and computing models is succinctly summarized by Samad and Cofer (Samad, Cofer, 2001). While there is no accepted precise definition of the term consciousness, “it is generally held that it is a key to human (and possibly other animal) behavior and to the subjective sense of being human. Consequently, any attempt to design automation systems with humanlike autonomous characteristics requires designing in some elements of consciousness. In particular, the property of being aware of one’s multiple tasks and goals within a dynamic environment and of adapting behavior accordingly.” They point to two theoretical limitations of formal systems that may inhibit the implementation of computational consciousness and hence limit our ability to design human-like autonomous systems. “First, we know that all digital computing machines are “Turing-equivalent”-They differ in processing speeds, implementation technology, input/output media, etc., but they are all (given unlimited memory and computing time) capable of exactly the same calculations. More importantly, there are some problems that no digital computer can solve. The best known example is the halting problem; we know that it is impossible to realize a computer program that will take as input another, arbitrary, computer program and determine whether or not the program is guaranteed to always terminate.

Second, by Gödel’s proof, we know that in any mathematical system of at least a minimal power there are truths that cannot be proven. The fact that we humans can demonstrate the incompleteness of a mathematical system has led to the claims that Gödel’s proof does not apply to humans.”

An important implication of Gödel’s incompleteness theorem is that it is not possible to have a finite description with the description itself as the proper part. In other words, it is not possible to read yourself or process yourself as process. In short, Gödel’s theorems prohibit “self-reflection” in Turing machines. Louise Barrett highlights (Barrett, 2011) the difference between Turing machines implemented using von Neumann architecture and biological systems. “Although the computer analogy built on von Neumann architecture has been useful in a number of ways, and there is also no doubt that work in classic artificial intelligence (or, as it is often known, Good Old Fashioned AI: GOFAI) has had its successes, these have been somewhat limited, at least from our perspective here as students of cognitive evolution.” She argues that Turing machines based on algorithmic symbolic manipulation using von Neumann architecture gravitate toward those aspects of cognition, like natural language, formal reasoning, planning, mathematics and playing chess, in which abstract symbols are processed in a logical fashion, and leave out other aspects of cognition that deal with producing adaptive behavior in a changeable environment. Unlike the approach where perception, cognition and action are clearly separated, she suggests that the dynamic coupling between various elements of the system, where each change in one element continually influences every other element’s direction of change, has to be accounted for in any computational model that includes the system’s sensory and motor functions along with analysis. To be fair, such couplings in the observed can be modeled and managed using a Turing machine network, and the Turing network itself can be managed and controlled by another serial Turing network. What is not possible is the tight integration of the models of the observer/manager and the observed/managed with a description of the “self” (or a specification of the manager) using parallelism and signaling that are the norm and not an exception in biology.

A more interesting controversy that has erupted regarding the need for new computing models (Wegner, Eberbach, 2004, Cockshott, Michaelson, 2007, Goldin, Wegner, 2008) throws some new light on the need for re-examining the Turing machines, Gödel’s prohibition of self-reflection and von Neumann’s conjecture. An even more recent discussion of the need for new computing models was presented in the Ubiquity symposium (ACM Ubiquity, 2011). As we describe later, these authors are attempting to address how to model computational problems that cannot be solved by a single Turing machine but can be solved using a set of Turing machines interacting with each other. In particular, the property of being aware of one’s multiple tasks and goals within a dynamic environment and of adapting behavior accordingly, which is related to consciousness as mentioned earlier, is one such problem that a single Turing machine cannot solve. The insights from biology suggest that modeling the temporal dynamics of the observer and the observed, while also assuring the safe-keeping of the observer (with a “self” identity), requires modifications to the Turing machine to accommodate changes to the behavior while computation is still in progress.

Self, Consciousness, and Emotions – The Dynamic Representation of the Observer and the Observed:

Self-reflection, setting expectations, monitoring the deviations and taking corrective action are essential for managing the business of life through homeostasis, and evolution has figured out how to encapsulate the right descriptions to execute life’s processes using the genetic transactions of replication, repair, recombination and reconfiguration by exploiting parallelism and signaling. As Jonah Lehrer (Lehrer, 2010) describes in his book “How We Decide”, “Dopamine neurons automatically detect the subtle patterns that we would otherwise fail to notice; they assimilate all the data that we can’t consciously comprehend. And, then, once they come up with a set of refined predictions about how the world works, they translate these predictions to emotions.” Emotions, it seems, are the instinctual, localized, component-level suggestions for corrective actions based on local experience. Conscience [1], on the other hand, is the adult who correlates the instinctual suggestions with a much larger perspective and makes decisions based on global priorities.

It is becoming clear from the recent advances in neuroscience, that self-reflection is a key component in living organisms. Homeostasis is not possible without a dynamic and active representation of the observer and the observed.
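The homeostatic cycle described above (a representation of the current state, a desired state, a comparison and a corrective action) can be illustrated with a minimal feedback loop in Python. The function name `regulate`, the proportional correction and the parameter values are assumptions made for illustration; they are not taken from the cited works and say nothing about how a cell actually regulates itself.

```python
from typing import List


def regulate(current: float, desired: float, gain: float = 0.5,
             tolerance: float = 0.01, max_steps: int = 100) -> List[float]:
    """Drive `current` toward `desired`; return the sequence of observed states."""
    history = [current]
    for _ in range(max_steps):
        error = desired - current      # comparison of desired and current state
        if abs(error) <= tolerance:    # close enough: no correction needed
            break
        current += gain * error        # corrective action proportional to the deviation
        history.append(current)
    return history
```

Calling `regulate(30.0, 37.0)` returns a sequence of states that approaches 37.0, with the deviation halving at each step under the default gain. The point of the sketch is that the loop needs both a model of the observed value and a stored expectation held by the observer, which is the dual representation the text argues for.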

A cellular organism is the simplest form of life that maintains an internal environment that supports its essential biochemical reactions, despite changes in the external environment. Therefore, a selectively permeable plasma membrane surrounding a concentrated aqueous solution of chemicals is a feature of all cells. In addition, a cellular organism is capable of self-replication and self-repair, and it may be unicellular or multicellular. Unicellular organisms perform all the functions of life. Multicellular organisms contain several different cell types that are specialized to perform specific functions. The cell adapts to its environment by recognition and transduction of a broad range of environmental signals, which in turn activate response mechanisms by regulating the expression of proteins that take part in the corresponding processes. The nucleus of the cell houses deoxyribonucleic acid (DNA), the genetic blueprint of the organism, which determines the structure and function of the organism as a whole. The DNA serves two functions. First, it contains instructions for assembling the structural and enzymatic proteins of the cell. Cellular enzymes in turn control the formation of other cellular structures and also determine the functional activity of the cell by regulating the rate at which metabolic reactions proceed. Second, by replicating (making copies of itself), DNA perpetuates the genetic blueprint within all new cells formed within the body and is responsible for passing on genetic information from the survivors to successors.

A gene is a stretch of DNA that contains instructions or code for a particular function such as synthesizing a protein or dictating the assembly of amino acids. A unique set of genes is packaged as chromosomes in complex organisms. A gene regulatory network represents relationships between genes that can be established from measuring how the expression level of each one affects the expression level of the others. In any global cellular network, genes do not interact directly with other genes. Instead, gene induction or repression occurs through the action of specific proteins, which are in turn products of certain genes as well. In essence, gene networks are abstract models that display causal relationships between gene activities and are represented by directed graphs. Nearly all of the cells of a multicellular organism contain the same DNA. Yet this same genetic information yields a large number of different cell types. The fundamental difference between a neuron and a liver cell, for example, is which genes are expressed. The regulatory gene network forms a cellular control circuitry defining the overall behavior of the various cells. According to Antonio Damasio (Damasio, 2010), the brain architecture is an evolutionary aid to the business of managing life, which consists of managing the body, and the management gains precision and efficiency with the presence of circuits of neurons assisting the management. In describing the role of neurons, he says that “neurons are about life and managing life in other cells of the body, and that that aboutness requires two-way signaling. Neurons act on other body cells, via chemical messages or excitation of muscles, but in order to do their job, they need inspiration from the very body they are supposed to prompt, so to speak. In simple brains, the body does its prompts simply by signaling to subcortical nuclei. Nuclei are filled with “dispositional know-how,” the sort of knowledge that does not require detailed mapped representations. But in complex brains, the map-making cerebral cortices describe the body and its doings in so much explicit detail that the owners of those brains become capable, for example, of “imaging” the shape of their limbs and their positions in space, or the fact that their elbows hurt or their stomach does”.
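The idea of a gene regulatory network as a directed graph, where the same network yields different cell types depending on which genes are expressed, can be illustrated with a toy Boolean model in Python. Everything here is an assumption made for illustration: the gene names `A`, `B`, `C` and `R`, the signed edges and the update rule (a regulated gene is on when its net input from expressed regulators is positive) are a deliberately crude simplification, not a model from the cited literature.

```python
from typing import Dict, List, Set, Tuple

# Directed graph: regulator -> list of (target, sign); +1 induces, -1 represses.
Network = Dict[str, List[Tuple[str, int]]]

# Hypothetical network: A induces B, B induces C, R represses C.
NETWORK: Network = {"A": [("B", +1)], "B": [("C", +1)], "R": [("C", -1)]}


def step(network: Network, expressed: Set[str]) -> Set[str]:
    """One synchronous update of the set of expressed genes."""
    regulated = {target for edges in network.values() for target, _ in edges}
    score: Dict[str, int] = {}
    for regulator in expressed:
        for target, sign in network.get(regulator, []):
            score[target] = score.get(target, 0) + sign
    # Genes with no incoming edges keep their current state.
    unregulated_on = {g for g in expressed if g not in regulated}
    induced = {g for g, s in score.items() if s > 0}
    return unregulated_on | induced


def settle(network: Network, expressed: Set[str], max_steps: int = 20) -> Set[str]:
    """Iterate until the expression pattern stops changing (or steps run out)."""
    for _ in range(max_steps):
        nxt = step(network, expressed)
        if nxt == expressed:
            break
        expressed = nxt
    return expressed
```

Starting from `{"A"}` the toy network settles with `A`, `B` and `C` expressed, while starting from `{"A", "R"}` the repressor keeps `C` off. The graph is identical in both runs; only the initial expression differs, which is the sense in which one blueprint supports several cell types.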

The complex network of neural connections and signaling mechanisms collaborate to create a dynamic, active and temporal representation of both the observer and the observed with myriad patterns, associations and constraints among their components. It seems that the business of managing life is more than mere book-keeping that is possible with a Turing machine. It involves the orchestration of an ensemble with a self-identity both at the group and the component level contributing to the system’s biological value. It is a hierarchy of individual components where each node itself is a sub-network with its own identity and purpose which is consistent with the system-wide purpose. To be sure, each component is capable of book-keeping and algorithmic manipulation of symbols. In addition, identity and representations of the observer and the observed at both the component and group level make system-wide self-reflection possible.

In short, the business of managing life is implemented by a system consisting of a network of networks with multiple parallel links that transmit both control information and the mission critical data required to sense and to control the observed by the observer. The data and control networks provide the capabilities to develop an internal representation of both the observer and the observed along with the processes required to implement the business of managing life. The organism is made up of autonomic components making up an ensemble collaborating and coordinating a complex set of life’s processes that are executed to sense and control both the observer and the observed. In this sense, the brain and the body are part of a collaborating system that has a unique identity and a structure that preserves the interrelationships. The system consists of:

Components, each with a purpose within a larger system (specialization),

All of a system’s parts must be present for the system to carry out its purpose optimally,

A system’s parts must be arranged in a specific way for the system to carry out its purpose (separation of concerns),

Systems change in response to feedback (collect information, analyze information and control environment using specialized resources), and

Systems maintain their stability (in accomplishing their purpose) by making adjustments based on feedback (homeostasis).

[1] According to Antonio Damasio (Damasio, 2010), consciousness pertains to the knowing of any object or action attributed to a self, while conscience pertains to the good or evil to be found in actions or objects. The identity of self and its safekeeping are essential parts of life processes. “The non-conscious neural signaling of an individual organism begets the proto-self which permits core self and core consciousness, which allow for an auto-biographical self which permits extended consciousness. At the end of the chain, extended consciousness permits conscience.”

Figure 1 shows the model of core consciousness, its relationship to the observed, and extended consciousness proposed by Damasio (Damasio, 1999) based on his studies in neuroscience.

Figure 1: The mapping of the observer, the observed and myriad models, associations and processes executed using parallel signaling and data exchange networks. Each component itself is a sub-network with a purpose defined by its own internal models.

Literature is filled with discussion about Gödel’s prohibition of self-reflection in Turing machines and why consciousness cannot emerge from brain models that depend on Turing machines. There are many theories on how the human brain is unique and may even involve quantum phenomena or gravity waves (Scott, 1995; Davies, 1992). However, Damasio (Damasio, 2010) takes an evolutionary approach to discuss genomic unconsciousness, the feeling of conscious will, educating the cognitive conscious, the reflective self and its consequences. He goes on to say, “in one form or another, the cultural developments manifest the same goal as the form of automated homeostasis.” “They respond to a detection of the imbalance in the life process, and seek to correct it within the constraints of human biology and of the physical and social environment.”

Instead of adding to the already existing controversy (Scott, 1995) on consciousness, we take a different route using Damasio’s emphasis on homeostasis along with the dynamic representation of the observer and the observed. We apply them to extend the Turing machine and its von Neumann Serial computing implementation. We ask how we can utilize the abstractions that assist in the business of managing life in cellular organisms, discussed above, to enhance the resiliency of distributed computing systems. In the next section we analyze the current implementation of Turing machines and suggest adding some of the abstractions that have proven useful in managing life’s processes to develop a computing model that addresses the problem of being aware of one’s multiple tasks and goals within a dynamic environment and of adapting behavior accordingly.

Turing Machines, Super Turing Machines and DIME Networks:

While a single stored program control (SPC) node lacks self-reflection, which is prohibited by Gödel’s theorems, networks of Turing machines have been successfully used to implement business workflows that observe and manage the external world. This is accomplished by modeling the observed (external to the computing infrastructure) and orchestrating the temporal dynamics of the observed. This has helped us develop complex control systems that can be monitored and controlled with the resiliency of cellular organisms.

However, what is missing is the same resiliency in the infrastructure (the observer) that implements the control of the observed. Learning from Damasio’s analysis, in order to introduce consciousness, we must introduce the “self” identity of the observer, together with the observer’s awareness of its multiple tasks and goals within a dynamic environment and its ability to adapt its behavior accordingly. The “self” specification must include a hierarchy of goals and execution mechanisms to accommodate his concepts of “core” and “extended” selves.

The evolution of computing seems to follow a path similar to that of cellular organisms, in the sense that it emerged as an individual computing element (the von Neumann stored program implementation of the Turing machine) and evolved into today’s networks of managed computing elements executing complex workflows that monitor and control the external environment.

The Turing machine originally started as a static closed system (Goldin, Wegner, 2008) analogous to a single cell. It was designed for computing algorithms that correspond to a mathematical world view. This is the case with assembly language programming, where a CPU is programmed and the Turing machine is implemented using the von Neumann Stored Program Control computing model, as shown in Figure 2.

Figure 2: A Turing machine with von Neumann Stored Program Control implementation in its simplest form

The Church-Turing thesis stipulates that “Turing machines can compute any effective (partially recursive) functions over naturals (strings).” Goldin and Wegner argue that the Church-Turing thesis applies only to effective computations rather than to computation by arbitrary physical machines, dynamical systems or humans.

To address this issue, we stipulate that “all computations can be represented as workflows specified by a directed acyclic graph (DAG). Algorithms are a subset of all computations. An algorithm can be viewed as a workflow of instructions executed by a stored program control (SPC) computing unit (constituting an atomic unit of computation). Then, based on the programming paradigm of one’s choice, one can compose other computing units such as procedures, functions, objects etc., to execute the specified workflow.” This can reconcile the operating system conundrum, namely that operating systems do not terminate as Turing machines are required to. As soon as an operating system is introduced, the Turing machine SPC implementation immediately becomes a workflow of computations to implement a process, where each process now behaves as a new Turing machine with SPC implementation. It is as if the operating system is a manager (implementing a management workflow using a group of management Turing machines dedicated to this purpose) controlling a series of other computing Turing machines based on policies set in the operating system. The operating system instructions and the computational flow dependent instructions are mixed to serially execute the process and a sequence of processes. This is analogous to the evolution of multi-cellular organisms, where individual cells establish a common management protocol to execute their goals with shared resources. The individual processes may or may not have a common goal, but they share the same resources. The operating system communicates with the processes to exert its role using shared memory, as shown in Figure 3. While the individual processes do not have fault, configuration, accounting, performance and security management of self, the operating system provides these functions using the signaling abstractions of addressing, alerting, mediation and supervision.

Figure 3: Operating system implements the managed Turing processes.
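The stipulation that every computation can be represented as a workflow specified by a DAG can be made concrete with a small sketch. It is only an illustration under stated assumptions: the task names (`read_a`, `read_b`, `add`) and the helper `run_workflow` are hypothetical and not part of any DIME implementation, and the ordering is delegated to Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies):
    """Execute tasks in an order that respects the DAG of dependencies.

    tasks: maps a task name to a function taking a dict of upstream results.
    dependencies: maps a task name to the set of task names it depends on.
    """
    results = {}
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = tasks[name](upstream)
    return results


# Hypothetical three-step workflow: two independent reads feed one sum.
tasks = {
    "read_a": lambda upstream: 2,
    "read_b": lambda upstream: 3,
    "add": lambda upstream: upstream["read_a"] + upstream["read_b"],
}
dependencies = {"read_a": set(), "read_b": set(), "add": {"read_a", "read_b"}}

results = run_workflow(tasks, dependencies)
print(results["add"])  # prints 5
```

Each task here plays the role of an atomic computing unit, and the graph alone decides the order of execution. Nothing in this sketch manages the units themselves, which is the gap the rest of this section is concerned with.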

Since then, multi-threading in a single processor, networking and interactive computing have influenced computation. In a network, concurrency and the influence of one node on another (the impact of the environment on the computation) are new elements that have to be addressed. The Pi calculus and super Turing models (Eberbach, E., Wegner, P., Goldin, D., 2011) are attempts to address these aspects. While these attempts are embroiled in controversy (Cockshott, Michaelson, 2007), what is not in dispute is that a network of computers represents a network of organized Turing machines where each node is a group of Turing machines managed locally. See Figure 4.

Figure 4: A networked set of Turing machines provides distributed computing services. However, this does not provide coordination and management across the two sets of Turing machines.

In such a network, the local operating systems cannot provide fault, configuration, accounting, performance and security (FCAPS) management of the system as a whole. The disciplines of distributed computing and distributed systems management evolved to address the FCAPS management of the system in an ad-hoc manner, without a formal computing model for the system as a whole. This becomes even more complicated when the system as a whole acts in unison with a system-wide purpose, where one element can influence other elements, as pointed out by Louise Barrett (Barrett, 2011).

In this case, the description of the functions performed and the influence of one computation on another have to be encoded at compile time, and each computing element does not have the ability to change its behavior at run time. In addition, the function of the operating system is to allocate resources appropriately to the consumers (processes running applications), and the applications themselves do not have any influence on the resources during run time. For example, if the workload fluctuates, the application has no way of monitoring and controlling the resources.

Figure 5: A network of Turing machines implementing a service workflow that manages the external environment (the observed). The management of the observer is also implemented using the same serial Turing machines where in some nodes the management of the observer and the observed are mixed in serial fashion and some other nodes are exclusively devoted to managing the observer.

If multiple applications are contending for resources, external policies have to be implemented as other Turing machines, and the applications themselves are not aware of these external influences. In order to manage a distributed set of Turing machines, another set of Turing machines is introduced to provide service management that improves the fault, configuration, accounting, performance and security characteristics of the distributed system. See Figure 5.

The DIME (Distributed Intelligent Managed Element) computing model allows the specification and execution of a recursive composition model where each computing unit at any level specifies and executes the workflow at the lower level. The specification at a higher level eliminates the self-reflection prohibition of Gödel’s theorems on computational units. The parallel implementation of the management workflow and the computational workflow at each level allows one component in the workflow to influence another component at the lower level. At any level, the computational unit specifies and assures the execution of the lower-level workflow; thus it becomes the observer, observing and controlling the workflow execution at the lower level (which is the observed).

This model eliminates the problem of separating the communication between the computing system components in a system from the communication between the computing system and its environment. In current computing models of systems design, treating them as two separate issues has created the current disconnect in distributed systems theories (Goldin, Wegner, 2007, pp. 22).

Figure 6: A Distributed Intelligent Managed Element (DIME) with local management of the Turing computing node and signaling channel. The FCAPS attributes of the Turing node are continuously monitored and controlled based on local policies. In addition the signaling channel allows coordination with global policies.

The DIME network architecture (Mikkilineni 2011) consists of four components:

A DIME node which encapsulates the von Neumann computing element with self-management of FCAPS,

Signaling capability that allows intra-DIME and inter-DIME communication and control,

An infrastructure that allows implementing distributed service workflows as a set of tasks, arranged or organized in a DAG and executed by a managed network of DIMEs, and

An infrastructure that assures DIME network management using the signaling network overlay over the computing workflow.

The self-management and task execution (using the DIME component called MICE, the managed intelligent computing element) are performed in parallel using the stored program control computing devices. The DIME encapsulates the “dispositional know-how.” Each DIME is programmable to control the MICE and to provide continuous supervision of the programs executed by the MICE. The DIME FCAPS management makes it possible to model and represent the dynamic behavior of each DIME, the state of the MICE and its evolution as a function of time based on both internal and external stimuli. The parallel management architecture allows the observer (a network or sub-network) that forms a group to monitor and control itself while facilitating the monitoring and control of the observed in the external environment. Parallelism allows dynamic information flow both in the signaling channel and in the external I/O channels of the Turing computing nodes.
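A toy sketch may help to picture the parallel arrangement described above: one thread stands in for the MICE and executes the task, while a second thread listens on a signaling channel and controls the first. This is not the DIME implementation; the class name `Dime`, the commands `"report"` and `"stop"`, and the use of Python threads and queues are all assumptions made for illustration.

```python
import queue
import threading
import time


class Dime:
    """Toy DIME: a MICE thread computes while a manager thread supervises it."""

    def __init__(self, task):
        self.task = task                       # callable: step index -> result
        self.signals = queue.Queue()           # signaling channel (commands)
        self.stop_requested = threading.Event()
        self.outputs = []                      # results produced by the MICE
        self.log = []                          # commands seen by the manager
        self.reports = []                      # progress snapshots on "report"
        self._mice = threading.Thread(target=self._run_mice, daemon=True)
        self._manager = threading.Thread(target=self._run_manager, daemon=True)

    def _run_mice(self):
        # Computation workflow: keeps computing until management says stop.
        step = 0
        while not self.stop_requested.is_set():
            self.outputs.append(self.task(step))
            step += 1
            time.sleep(0.001)

    def _run_manager(self):
        # Management workflow: runs in parallel and reacts to signals.
        while True:
            command = self.signals.get()       # blocks until a signal arrives
            self.log.append(command)
            if command == "report":
                self.reports.append(len(self.outputs))
            elif command == "stop":
                self.stop_requested.set()
                return

    def start(self):
        self._mice.start()
        self._manager.start()

    def join(self):
        self._manager.join()
        self._mice.join()


dime = Dime(task=lambda step: step * step)
dime.start()
dime.signals.put("report")   # monitoring request over the signaling channel
dime.signals.put("stop")     # control command over the signaling channel
dime.join()
print(dime.log)  # prints ['report', 'stop']
```

The point of the sketch is only that the task function never inspects the signaling channel itself. Supervision happens beside the computation rather than being interleaved with its instructions.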

Special features of the DIME network architecture (DNA) that contribute to self-resiliency include:

Each Turing computing node is controlled by the FCAPS policies set in each DIME. Each read and write is dynamically configurable based on the FCAPS policies.

Each node itself can be a sub-network of DIMEs with goals set by the sub-network policies.

It is easy to show that the DIME network architecture supports the genetic transactions of replication, repair, recombination and rearrangement. Figure 7 shows a single node execution of a service in a DIME network.

Figure 7: Single node execution of a DIME

A single DIME node, which can execute a workflow by itself or by instantiating a sub-network, provides a way to implement a managed DAG (directed acyclic graph) executing a workflow. Replication is implemented by executing the same service, as shown in Figure 8.

Figure 8: DIME Replication

By defining service S2 to execute itself, we replicate the S2 DIME. Note that S2 is a service that can be programmed to stop instantiating itself further when resources are not available. In addition, dynamic FCAPS management (parallel service monitoring and control) allows changing the behavior of a DIME: the ability to execute control commands in parallel allows dynamic replacement of services during run time. For example, by stopping service S2 and loading and executing service S1, we dynamically change the service during run time. We can also redirect I/O dynamically during run time. Any DIME can also allow sub-network instantiation and control, as shown in Figure 9. The workflow orchestrator instantiates the worker nodes, monitors the heartbeat and performance of the workers, and implements fault tolerance, recovery and performance management policies.

Figure 9: Dynamic Service Replication & Reconfiguration

It can also implement accounting and security monitoring and management using the signaling channel. Redirection of I/O allows dynamic reconfiguration of worker input and output thus providing computational network control.
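The replication and service replacement described above can be sketched in a few lines. This is a simplified model under stated assumptions rather than the actual DIME mechanism: `ResourcePool`, `Dime.load`, the naming of child nodes and the bodies of `s1` and `s2` are hypothetical, and replication is modeled as a plain recursive call instead of a new process on another node.

```python
class ResourcePool:
    """Hypothetical pool of free slots available for new DIME instances."""

    def __init__(self, free_slots):
        self.free_slots = free_slots

    def acquire(self):
        if self.free_slots == 0:
            return False
        self.free_slots -= 1
        return True


class Dime:
    def __init__(self, name, service):
        self.name = name
        self.service = service
        self.children = []

    def load(self, service):
        """Replace the service at run time (stop the old one, load the new)."""
        self.service = service

    def execute(self, pool):
        return self.service(self, pool)


def s1(dime, pool):
    return f"{dime.name}: S1"


def s2(dime, pool):
    """S2 executes itself on a new DIME until resources run out."""
    if pool.acquire():
        child = Dime(f"{dime.name}.0", s2)
        dime.children.append(child)
        child.execute(pool)
    return f"{dime.name}: S2"


def count(dime):
    """Number of DIMEs in the tree rooted at `dime`."""
    return 1 + sum(count(child) for child in dime.children)


pool = ResourcePool(free_slots=3)
root = Dime("root", s2)
root.execute(pool)
print(count(root))         # prints 4: the root plus three replicas

root.load(s1)              # dynamic replacement of S2 by S1
print(root.execute(pool))  # prints root: S1
```

Replication stops on its own once the pool is exhausted, which mirrors the remark that S2 can be programmed to stop instantiating itself when resources are not available.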

In summary, the dynamic configuration at DIME node level and the ability to implement at each node, a managed directed acyclic graph using a DIME sub-network provides a powerful paradigm for designing and deploying managed services that are decoupled from the hardware infrastructure management. Figure 11 shows a workflow implementation of monitoring and controlling an external environment (temperature monitoring and fan control to maintain the temperature in a range) using a self-managed DIME network with signaling network overlay.

Figure 11: A workflow implementation using a DIME network. There are two FCAPS management workflows, one managing the observer (computing infrastructure) and the other managing the observed (Thermometer and the Fan)
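The temperature example can be sketched as two small pieces of logic, one managing the observed (thermometer and fan) and one managing the observer (the worker that reports readings). The thresholds, the use of `None` to represent a missed report, and the function names are assumptions made for illustration only; they are not taken from the implementation behind Figure 11.

```python
def fan_command(temperature, fan_on, low=20.0, high=25.0):
    """Hysteresis control: fan on above `high`, off below `low`, else unchanged."""
    if temperature > high:
        return True
    if temperature < low:
        return False
    return fan_on


def run_workflow(readings, low=20.0, high=25.0):
    """Return the fan state after each reading and the number of worker restarts."""
    fan_on = False
    fan_states = []
    restarts = 0
    for reading in readings:
        if reading is None:
            # Managing the observer: a missed report is treated as a worker
            # fault. The restart is only counted here, and the last fan
            # command is held until readings resume.
            restarts += 1
        else:
            # Managing the observed: keep the temperature within the band.
            fan_on = fan_command(reading, fan_on, low, high)
        fan_states.append(fan_on)
    return fan_states, restarts


fan_states, restarts = run_workflow([22.0, 26.0, 23.0, None, 19.0])
print(fan_states)  # prints [False, True, True, True, False]
print(restarts)    # prints 1
```

In a DIME network the two concerns would run as separate, parallel workflows connected by signaling. The sketch folds them into one loop only to keep the example short.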

While the DIME network architecture provides food for thought about Turing machines, new computing models and the role of the representations of the observer and the observed in consciousness, it also has practical utility in developing software that exploits the parallelism and performance of many-core servers (Mikkilineni et al., 2011). Some of the results demonstrating self-repair and auto-scaling to control the response time of a web server were presented at the Server Design Summit (Mikkilineni, 2011).

Conclusion:

The limitation of Turing machines as a complete model of computation has been pointed out by (Wegner, Eberbach, 2004). While this claim was challenged by (Cockshott, Michaelson, 2007), the challenge was rebutted by (Goldin, Wegner, 2008). The main argument for a new computing model was the need to account for the interaction between conventional algorithmic computation and the environment outside the computing element. The Turing model dealing with algorithms is closed and static and does not address the changes affecting the computation from outside while the computation is in progress. In order to account for networked systems in which each change in one element continually influences every other element’s direction of change, more expressive computing models are required. The von Neumann implementation of the Turing machine, with its serial processing and its mixing of algorithmic computation and interaction across a network of von Neumann computing nodes, has given rise to a complex management infrastructure that makes it difficult to implement the architectural resiliency of cellular organisms in our IT infrastructure.

The DIME computing model, by implementing parallel management infrastructure to monitor and control the Turing machine at the atomic level, allows the read and write functions of the conventional Turing machine to be influenced by external interaction. The hierarchical network based (where a node itself can be a sub-network) composition model of DIME network architecture allows the identification of “self” (the observer) at various levels and the representation of the interaction between the observer and the observed.

The beauty of the DIME computing model is that it does not impact the current implementation of the service workflow using von-Neumann SPC nodes (monitoring and control of the observed external systems). But by introducing parallel control and management of the service workflow, the DIME network architecture provides the required scaling, agility and resilience both at the node level and at the network level (integrating the management and control of self, the observer). The signaling based network level control of a service workflow that spans across multiple nodes allows the end-to-end connection level quality of service management independent of the hardware infrastructure management systems that do not provide any meaningful visibility or control to the end-to-end service transaction implementation at run time. The only requirement for the DIME infrastructure provider is to assure that the node OS provides the required services for the service controller to load the Service Regulator and the Service Execution Packages to create and execute the DIME.

The network management of DIME services allows hierarchical scaling using the network composition of sub-networks. Each DIME with its autonomy on local resources through FCAPS management and its network awareness through signaling can keep its own history to provide negotiated services to other DIMEs thus enabling a collaborative workflow execution.

Each node has a unique identity and supports local behavior and its control using local policies that are programmable using the conventional von Neumann SPC Turing machines. Each sub-network and network allows a group identity (group self) and supports group behavior and control. The resulting network of networks enables a system-wide, resilient business of managing both the self and the services that monitor and control external behavior. The parallel control network allows dynamic connection management of component functions to create dynamic workflows that accommodate a changing environment.

The cellular implementation of the business of managing life may also show us the way to the business of managing our computing infrastructure, which has already proven valuable in implementing the business of managing our lives and our environment, transcending the body and mind of a single individual. As von Neumann remarked (von Neumann, 1966), “A theorem of Gödel that the next logical step, the description of an object, is one class type higher than the object and is therefore asymptotically longer to describe.” He admitted to twisting the theorem a little while describing the evolution of diversifying computational ecology from simple strings of 0s and 1s (von Neumann, 1987). Perhaps the recursive nature of a network containing sub-networks as nodes, along with FCAPS management at both the node and network level, offers the definition of “self-identity” at various levels. While self-reflection at any level is prohibited by Gödel, a higher-level “self” provides the required management and control to lower levels. A parallel signaling network, which allows dynamic replication, repair, recombination and reconfiguration, provides a degree of resiliency, efficiency and scaling that is not possible with a network of serial von Neumann implementations of Turing machines alone. This may well be a prescription for injecting the property of being aware of one’s multiple tasks and goals within a dynamic environment and of adapting behavior accordingly.

Scott, A., (1995). The Controversial New Science of Consciousness: Stairway to the Mind. New York, NY: Copernicus, Springer-Verlag. P.184.

“At the hierarchical level of human consciousness it is not possible to report a consensus of the scientific community because there is none. Materialists, functionalists, and dualists are - according to a recent issue of the popular science magazine Omni (October 1993) - engaged in

Slinging mud and hitting low like politicians arguing about tax hikes. Although the epithets are more rarified - here it is “obscurantist” and “crypto-Cartesian” rather than “liberal” and “right wing” - recent exchanges between neuroscientists and philosophers of mind (and in each group among themselves) feature the same sort of relentless defensiveness and stark opinionated name calling we expect from irate congressmen or trash-talking linebackers.

To the extent that this is a true appraisal of the current status of consciousness, it is unfortunate. Like life, the phenomenon of consciousness is intimately related to several levels of the scientific hierarchy, so the appropriate scientists - cytologists, electrophysiologists, neuroscientists, anesthesiologists, sociologists and ethnologists - should be working together. It is difficult to see how this elusive phenomenon might otherwise be understood.”

Davies, P., (1992). The Mind of God: The Scientific Basis for a Rational World. New York, NY: Simon and Schuster.

Starting from the mainframe datacenters, where applications were accessed using narrow-bandwidth networks and dumb terminals, and evolving to client-server and peer-to-peer distributed computing architectures that exploit higher-bandwidth connections, business process automation has contributed significantly to reducing the total cost of ownership (TCO). With the Internet, global e-commerce was enabled, and the resulting growth in commerce led to an explosion of storage. Storage networking and the resulting NAS (network attached storage) and SAN (storage area network) technologies have further changed the dynamics of the enterprise IT infrastructure in a significant way to meet business process automation needs. The storage backup and recovery technologies have further improved the resiliency of service delivery processes by reducing the time it takes to respond in case of service failure. Figure 1 shows the evolution of the recovery time objective, which dropped from days to minutes and seconds. The recovery point objective (RPO) is the point in time to which you must recover data as dictated by business needs. The recovery time objective (RTO) is the period of time after an outage in which the application and its data must be restored to a predetermined state defined by the RPO. While the productivity, flexibility and global connectivity made possible by this evolution have radically transformed the business economics of information systems, the complexity of heterogeneous and multi-vendor solutions has created a high dependence on specialized training and service expertise to assure the availability, reliability, performance and security of various business applications.

Figure 1: The evolution of Recovery Time Objective. Virtualization of server technology provides an order of magnitude improvement in the way applications are backed-up, recovered and protected against disasters.
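A small worked example may make the two objectives easier to tell apart. It assumes the common reading in which the RPO bounds how much recent data may be lost (the gap between the last recoverable copy and the outage) and the RTO bounds how long the service may stay down. The timestamps, the targets and the function names are made up for illustration.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_backup, outage_start, service_restored):
    """Return (data loss window, downtime) for one outage."""
    data_loss_window = outage_start - last_backup   # compared against the RPO
    downtime = service_restored - outage_start      # compared against the RTO
    return data_loss_window, downtime


def meets_objectives(last_backup, outage_start, service_restored, rpo, rto):
    loss, downtime = recovery_metrics(last_backup, outage_start, service_restored)
    return loss <= rpo and downtime <= rto


# Hypothetical outage: last backup at 02:00, failure at 02:10, restored at 02:13.
last_backup = datetime(2011, 1, 1, 2, 0)
outage_start = datetime(2011, 1, 1, 2, 10)
service_restored = datetime(2011, 1, 1, 2, 13)

loss, downtime = recovery_metrics(last_backup, outage_start, service_restored)
print(loss)      # prints 0:10:00
print(downtime)  # prints 0:03:00

# Meets a 15 minute RPO and a 5 minute RTO, but not a 1 minute RTO.
print(meets_objectives(last_backup, outage_start, service_restored,
                       rpo=timedelta(minutes=15), rto=timedelta(minutes=5)))
print(meets_objectives(last_backup, outage_start, service_restored,
                       rpo=timedelta(minutes=15), rto=timedelta(minutes=1)))
```

With these numbers the first check passes and the second fails, which shows that shortening the RTO alone can push an otherwise acceptable recovery out of compliance.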

Successful implementation must integrate various server, network and storage centric products with their local optimization best-practices with end-to-end optimization strategies. While each vendor attempts to assure their success with more software and services, the small and medium enterprises often cannot afford the escalating software and service expenses associated with optimization strategies and become vulnerable. The exponential growth in services demand for voice, data and video in the consumer market also has introduced severe strains on current IT infrastructures. There are three main issues that are currently driving distributed computing solutions to seek new approaches:

Current IT datacenters have evolved to meet business services needs in an evolutionary fashion, from server-centric application design to client-server networking to storage area networking, without an end-to-end optimized architectural transformation along the way. The server, network and storage vendors optimized management in their own local domains, often duplicating functions from other domains to compete in the marketplace. For example, cache memory is used to improve the performance of service transactions by improving response time. However, the redundancy of cache management in servers, storage and even network switches makes tuning the response time a complex task requiring multiple management systems. Application developers have also started to introduce server, storage and network management within their applications. For example, Oracle is not just a database application. It is also a storage manager and a network manager as well as an application manager. It tries to optimize all its resources for performance tuning. No wonder it takes an army of experts to keep it going. The result is an over-provisioned datacenter with multiple functions duplicated many times by the server, storage and networking vendors. Large enterprises with big profit margins throw human bodies, tons of hardware and a host of custom software and shelf-ware packages at their needs. Some datacenter managers do not even know what assets they have, which is of course yet another opportunity for vendors to sell an asset management system to discover what is available, and services to provide asset management using such an asset manager. Another such system is de-duplication software, which finds multiple copies of the same files and removes the duplicates. This shows how expensive it is to clean up after the fact.

Heterogeneous technologies from multiple vendors that are supposed to reduce IT costs actually increase complexity and management costs. Today, many CFOs consider IT a black hole that sucks in expensive human consultants and continually demands capital and operational expenses to add hardware and software, which often end up as shelf-ware because of their complexity. Even for mission-critical business services, enterprise CFOs are starting to question the productivity and effectiveness of current IT infrastructures. It becomes even more difficult to justify the costs and complexity needed to support the massive scalability and wild fluctuations in workloads demanded by consumer services. The price point is set low for the mass market, but the demand for massive scalability is high (a relatively simple, but massive, service like Facebook is estimated to use about 40,000 servers and Google is estimated to run a million servers to support its business).

More importantly, Internet-based consumer services such as social networking, e-mail and video streaming applications have introduced new elements: wild fluctuations in demand and massive scale of delivery to a diverse set of customers. The result is an increased sensitivity to the economics of service creation, delivery and assurance. Unless the cost structure of IT management infrastructure is addressed, the mass-market needs cannot be met profitably. Large service providers such as Amazon, Google and Facebook have understandably implemented alternatives to cope with wildly fluctuating workloads, massive scaling of customers and the latency constraints imposed by demanding response time requirements.

Cloud computing technology has evolved to meet the needs of massive scaling, wild fluctuations in consumer demand and response time control of distributed transactions spanning multiple systems, players and geographies. More importantly, cloud computing changes backup and disaster recovery (DR) strategies drastically, reducing the RTO to minutes and seconds and doing much better than SAN/NAS-based server-less backup and recovery strategies. Live migration of a virtual machine is accomplished as follows:

The entire state of a virtual machine is encapsulated by a set of files stored on shared storage such as Fibre Channel or iSCSI Storage Area Network (SAN) or Network Attached Storage (NAS).

The active memory and precise execution state of the virtual machine are rapidly transferred over a high-speed network, allowing the virtual machine to instantaneously switch from running on the source host to running on the destination host. This entire process can take less than a few seconds on a Gigabit Ethernet network.

The networks being used by the virtual machine are virtualized by the underlying host. This ensures that even after the migration, the virtual machine network identity and network connections are preserved.
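The three steps above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the names `VirtualMachine`, `Host` and `live_migrate` are hypothetical, and the iterative "pre-copy" of dirty memory pages is a common technique assumed here for illustration, not something the description above specifies. The disk path stands for files on shared storage that both hosts can see, and the MAC address stands for the virtualized network identity that survives the move.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualMachine:
    name: str
    disk_path: str        # files on shared storage (SAN/NAS), visible to every host
    mac_address: str      # virtual network identity, preserved across hosts
    memory: dict = field(default_factory=dict)  # page number -> contents
    dirty: set = field(default_factory=set)     # pages written since the last copy

    def write(self, page, value):
        self.memory[page] = value
        self.dirty.add(page)


@dataclass
class Host:
    name: str
    running: dict = field(default_factory=dict)  # VM name -> VirtualMachine


def live_migrate(vm, source, dest, max_rounds=5, workload=None):
    """Copy memory while the VM keeps running, then switch hosts.

    `workload(vm, round_number)` is an optional hook that simulates the guest
    writing to memory during a copy round. Returns the number of rounds used.
    """
    staged = {}                 # memory image being assembled on the destination
    to_send = set(vm.memory)    # the first round sends every page
    vm.dirty.clear()
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        for page in to_send:
            staged[page] = vm.memory[page]
        if workload is not None:
            workload(vm, rounds)        # the VM is still live and dirtying pages
        to_send = set(vm.dirty)
        vm.dirty.clear()
        if not to_send:                 # nothing changed during this round
            break
    for page in to_send:                # brief pause: copy whatever is still dirty
        staged[page] = vm.memory[page]
    # Switch over: same disk files and same network identity, new host.
    moved = VirtualMachine(vm.name, vm.disk_path, vm.mac_address, staged)
    del source.running[vm.name]
    dest.running[vm.name] = moved
    return rounds
```

Note that only memory contents are copied; the disk is never moved because it lives on shared storage, which is what keeps the switch-over short.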

While virtual machines improve resiliency and live migration reduces the RTO, the increased complexity of hypervisors, their orchestration, virtual machine images and their management adds an additional burden in the datacenter. Figure 2 shows the evolution of current datacenters from the mainframe days to the cloud computing transformation. The cost of creating and delivering a service has continuously decreased with the increased performance of hardware and software technologies. What used to take months and years to develop and deliver now takes only weeks and hours. On the other hand, as service demand increased with ubiquitous access using the Internet and broadband networks, the need for resiliency (availability, reliability, performance and security management), efficiency and scaling also put new demands on service assurance and hence on the need for continuous reduction of RTO and RPO. The introduction of SAN server-less backup and virtual machine migration has in turn increased complexity, and hence the cost of managing the service transactions during delivery, while reducing the RTO and RPO.

Figure 2: Cost of Service Creation, Delivery and Assurance with the Evolution of Datacenter Technologies. The management cost has exploded because myriad point-solution appliances, software and shelf-ware are cobbled together from multiple vendors. Any future solution that addresses the datacenter management conundrum must provide end-to-end service visibility and control transcending multiple service provider resource management systems. Future datacenter focus will be on a transformation from Resources Management to Services Switching to provide telecom-grade “trust”.

The increased complexity of managing services implemented using the von Neumann serial computing model executing a Turing machine turns out to be more a fundamental architectural issue, related to Gödel’s prohibition of self-reflection in Turing machines, than a software design issue. Cockshott et al. conclude their book “Computation and its limits” with the paragraph “The key property of general-purpose computer is that they are general purpose. We can use them to deterministically model any physical system, of which they are not themselves a part, to an arbitrary degree of accuracy. Their logical limits arise when we try to get them to model a part of the world that includes themselves.” While the last statement is not strictly correct (for example, current operating systems facilitate incorporating computing resources and their management interspersed with the computations that attempt to model any physical system to be executed in a Turing machine), it still points to a fundamental limitation of current Turing machine implementations of computations using the serial von Neumann stored program control computing model. The universal Turing machine allows a sequence of connected Turing machines to synchronously model a physical system as a description specified by a third party (the modeler). The context, constraints, communication abstractions and control of various aspects during the execution of the model (which specifies the relationship between the computer acting as the observer and the computed acting as the observed) cannot also be included in the same description of the model because of Gödel’s theorems of incompleteness and decidability. Figure 3 shows the evolution of computing from mainframe/client-server computing, where the management was labor-intensive, to the cloud computing paradigm, where the management services (which include the computers themselves in the model controlling the physical world) are automated.

Figure 3: Evolution of Computing with respect to Resiliency, Efficiency and Scaling.

The first phase (of conventional computing) depended on manual operations and served well when the service transaction times and service management times could be very far apart and did not affect the service response times. As the service demands increased, service management automation helped reduce the gap between the two transaction times at the expense of increased complexity and resulting cost of management. It is estimated that 70% of today’s IT budget goes to self-maintenance and only 30% goes to new service development. Figure 4 shows current layers of systems contributing to cloud management.

Figure 4: Services and their management complexity

The origin of complexity is easy to understand. Current ad-hoc distributed service management practices originated from server-centric operating systems and narrow bandwidth connections. The need to address end-to-end service transaction management, along with the resource allocation and contention resolution required to cope with changing circumstances (which depend on business priorities, latency and workload fluctuations), was accommodated as an afterthought. In addition, an open competitive marketplace has driven server-centric, network-centric and storage-centric devices and appliances to multiply. The resulting duplication of many of the management functions in multiple devices, without an end-to-end architectural view, has largely contributed to the cost and complexity of management. For example, storage volume management is duplicated in server, network and storage devices, leading to a complex web of performance optimization strategies. Special purpose appliance solutions have sprouted to provide application, network, storage and server security, often duplicating many of the functions. The lack of an end-to-end architectural framework has led to point solutions that dominate the service management landscape, often negating the efficiency improvements in service development and delivery made possible by hardware performance improvements (Moore’s law) and by software technologies and development frameworks.

The escape from this conundrum is to re-examine the computation models and circumvent the computational limit by going beyond Turing machines and the serial von Neumann computing model. A recently proposed computing model implemented in the DIME network architecture (Designing a New Class of Distributed Systems, Springer 2011) attempts to provide a new approach based on the old O-machine proposed by Turing in his thesis. Phase 3 in figure 3 shows the new computing model, which implements a non-von Neumann managed Turing machine to provide hierarchical self-management of temporal computing processes. The implementation exploits the parallel threads and high bandwidth available with many-core processors and provides auto-scaling, live migration, performance optimization and end-to-end transaction security by providing FCAPS (fault, configuration, accounting, performance and security) management of each Linux process; a network of such Linux processes provides a distributed service transaction. This eliminates the need for hypervisors and virtual machines and their management while reducing complexity. Since a Linux process is virtualized instead of a virtual machine, backup and DR are performed at the process level and also cover the network of processes providing the service. Hence the approach is much more lightweight than VM-based backup and DR.

The resulting decoupling of services management from infrastructure management provides a new approach to service management, including backup and DR. While the DIME computing model is in its infancy, two prototypes have already demonstrated its usefulness: one with a LAMP stack and another with a new native OS designed for many-core servers. Unlike virtual machine based backup and DR, the DIME network architecture supports auto-provisioning, auto-scaling, self-repair, live migration, secure service isolation and end-to-end distributed transaction security across multiple devices at the process level in an operating system. Therefore, this approach not only avoids the complexity of hypervisors and virtual machines (although it still works with virtual servers) but also allows live migration to be applied to existing applications without requiring changes to their code. In addition, it offers a new approach in which the hardware infrastructure is simpler, without the burden of anticipating service level requirements, and lets the intelligence of services management reside in the services infrastructure, leading to the deployment of intelligent self-managing services using a dumb infrastructure on stupid networks.

In conclusion, we emphasize that the DIME network architecture works with or without Hypervisors and associated Virtual Machine, IaaS and PaaS complexity and allows uniform service assurance across hybrid clouds independent of the service provider management systems. Only the Virtual server provisioning commands are required to configure just enough OS, DIMEX libraries and execute service components using DNA.

The power of DIME network architecture is easy to understand. By introducing parallel management to the Turing machine, we are converting a computing element to a managed computing element. In current operating systems, it is at the process level. In the new native operating system (parallax-OS) we have demonstrated, it is the Core in a many-core processor. A managed element provides plug-in dynamism to service architecture.
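The idea of a "managed computing element" can be made concrete with a small sketch. It is not the DIME implementation: the class name `ManagedElement` and its methods are hypothetical, a Python thread stands in for the Linux process or core mentioned above, and only a sliver of FCAPS (fault and performance counters plus reconfiguration) is shown. The point it illustrates is that the management path runs in parallel with the computation and can change its behavior without the computation's own code being involved.

```python
import threading
import time


class ManagedElement:
    """A computation paired with a parallel management interface (illustrative only)."""

    def __init__(self, task, interval=0.01):
        self._task = task               # the computation: a callable taking the step number
        self._interval = interval       # pause between steps, in seconds
        self._lock = threading.Lock()   # guards _task against concurrent reconfiguration
        self._stop = threading.Event()  # set by the manager to end the computation
        self.steps = 0                  # performance counter
        self.faults = 0                 # fault counter
        self._worker = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Computing path: repeatedly execute whichever task is currently configured.
        while not self._stop.is_set():
            with self._lock:
                task = self._task
            try:
                task(self.steps)
            except Exception:
                self.faults += 1        # record the fault and keep running
            self.steps += 1
            time.sleep(self._interval)

    # Management path: callable from another thread while the computation runs.
    def start(self):
        self._worker.start()

    def reconfigure(self, new_task):
        with self._lock:
            self._task = new_task

    def status(self):
        return {"steps": self.steps, "faults": self.faults,
                "alive": self._worker.is_alive()}

    def stop(self):
        self._stop.set()
        self._worker.join()
```

A manager would call `start()`, poll `status()`, and use `reconfigure()` or `stop()` in response to workload or fault conditions; the task itself never needs to know it is being managed.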

Figure 7 shows a service deployment in a Hybrid cloud with integrated service assurance across the private and public clouds without using service provider management infrastructure. Only the local operating system is utilized in DIME service network management.

Figure 7: A DNA based services deployment and assurance in a Hybrid Cloud. The decoupling of dynamic service provisioning and management from infrastructure resource provisioning and management (server, network and storage administration) enabled by DNA makes static provisioning of resource pools possible, while dynamic migration of services allows them to seek the right resources at the right time based on workloads, business priorities and latency constraints.

As mentioned earlier, the DIME network architecture is still in its infancy, and researchers are developing both the theory and the practice to validate its usefulness in mission-critical environments. Hopefully, in this year of the Turing centenary celebration, some new approaches will address the limits of computation pointed out by Cockshott et al. in their book. Paraphrasing Turing (Turing was unimpressed by Wilkes’s EDSAC design, commenting that it was “much more in the American tradition of solving one’s difficulties by means of much equipment rather than by thought.”), a lot of appliances or code may often not be a sustainable substitute for thoughtful architecture.

Introduction

Frustrated by the inability to fiddle with Internet routing in the real world, Stanford computer scientist Nick McKeown and colleagues developed a standard called OpenFlow that essentially opens up the Internet to researchers, allowing them to define data flows using software, a sort of “software-defined networking.” Installing a small piece of OpenFlow firmware (software embedded in hardware) gives engineers access to flow tables, rules that tell switches and routers how to direct network traffic. Yet it protects the proprietary routing instructions that differentiate one company’s hardware from another. SDN is nothing more than the separation of network data traffic processing from the logic and rules controlling the flow, inspection and modification of that data. Traditional network hardware, i.e., switches and routers, implements these functions in proprietary firmware partitioned respectively into what are known as the data and control planes. While this is a fine research project, now that the major vendors are starting to take it seriously and are attempting to introduce it into real-world datacenters, one must ask whether it will add to or reduce complexity in the already complex datacenter. There, a host of piecemeal solutions are offered by mega corporations seeking to continually increase their revenues, with no incentive to reduce complexity by reducing the number of hardware and software components deployed, since that would cut into their product sales.
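The separation of data and control planes described above can be sketched in a few lines. This is a toy model, not the OpenFlow protocol or any vendor API: the classes `Switch` and `Controller`, the dictionary-based match rules and the `"output:<port>"` action strings are all hypothetical names chosen for illustration. The switch only looks up its flow table; the decision logic lives entirely in the controller, which is consulted when no rule matches.

```python
class Controller:
    """Control plane: holds the routing policy and decides what rules to install."""

    def __init__(self, routes, default_action="drop"):
        self.routes = routes                  # destination address -> output port
        self.default_action = default_action
        self.queries = 0                      # how often a switch had to ask

    def decide(self, packet):
        self.queries += 1
        dst = packet["dst"]
        port = self.routes.get(dst)
        action = f"output:{port}" if port is not None else self.default_action
        return {"dst": dst}, action           # (match rule, action) to install


class Switch:
    """Data plane: forwards packets purely by looking up its flow table."""

    def __init__(self, controller):
        self.controller = controller
        self.flow_table = []                  # list of (match, action), checked in order

    def install(self, match, action):
        self.flow_table.append((match, action))

    def forward(self, packet):
        for match, action in self.flow_table:
            if all(packet.get(key) == value for key, value in match.items()):
                return action
        # Table miss: defer to the control plane, then cache its decision.
        match, action = self.controller.decide(packet)
        self.install(match, action)
        return action
```

In this sketch the first packet to a destination costs one controller query and later packets are handled by the switch alone, which is why changing network behavior becomes a matter of changing controller software rather than switch firmware.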

Systems theory tells us that as the number of components in a system increases, the cost of complexity could outweigh the benefits unless architectural reorganization provides a way out. We argue that the management complexity in current IT infrastructure design, based on the serial von Neumann stored program control implementation of the universal Turing machine, is a more fundamental architecture issue, related to the lack of resiliency of the computing model, than a software design issue. Cockshott et al. (2012) conclude their book “Computation and its limits” with the paragraph “The key property of general-purpose computer is that they are general purpose. We can use them to deterministically model any physical system, of which they are not themselves a part, to an arbitrary degree of accuracy. Their logical limits arise when we try to get them to model a part of the world that includes themselves.” Current generation distributed systems are implemented using a network of Turing machines in which the service and its management are intermixed, as shown in figure 1. The resources utilized by the nodes in a network are often controlled by a plethora of management systems which are outside the purview of the service workflow that is utilizing the resources. Thus the end-to-end service transaction response is controlled by these management systems, which introduce a layer of complexity in coordination and contention resolution, making the service much simpler than its management.

Figure 1: Serial von Neumann implementation of Turing Machines

The limitations of the SPC computing architecture were clearly on his mind when von Neumann gave his lecture at the Hixon Symposium in 1948 in Pasadena, California (von Neumann, 1987, p. 408). “The basic principle of dealing with malfunctions in nature is to make their effect as unimportant as possible and to apply correctives, if they are necessary at all, at leisure. In our dealings with artificial automata, on the other hand, we require an immediate diagnosis. Therefore, we are trying to arrange the automata in such a manner that errors will become as conspicuous as possible, and intervention and correction follow immediately.” Comparing computing machines and living organisms, he points out that computing machines are not as fault tolerant as living organisms. He goes on to say “It’s very likely that on the basis of philosophy that every error has to be caught, explained, and corrected, a system of the complexity of the living organism would not run for a millisecond” (von Neumann, 1987, p. 408). It is clear that von Neumann recognized a problem in the way we design computing systems.

“Normally, a literary description of what an automaton is supposed to do is simpler than the complete diagram of the automaton. It is not true a priori that this always will be so. There is a good deal in formal logic which indicates that when an automaton is not very complicated the description of the function of the automaton is simpler than the description of the automaton itself, as long as the automaton is not very complicated, but when you get to high complications, the actual object is much simpler than the literary description.” (von Neumann, 1987, pp. 454-457). He remarked, “It is a theorem of Gödel that the description of an object is one class type higher than the object and is therefore asymptotically infinitely longer to describe.” (von Neumann, 1987, pp. 454-457). The conjecture of von Neumann leads to the fact that “one cannot construct an automaton which will predict the behavior of any arbitrary automaton” (von Neumann, 1987, p. 456). This is so with the Turing machine implemented by the SPC model.

In simpler terms, the management complexity is related to the classical Russell’s paradox, which can be paraphrased as follows: “Who manages the managers?” Gödel’s prohibition of self-reflection in a Turing machine mandates a hierarchy of Turing machines acting as managers that manage other Turing machines, which implement the computations described as a sequence of instructions compiled into a sequence of 1s and 0s. The universal Turing machine (or the general purpose computer) implements these TMs in a synchronous workflow, thus prohibiting changes to computations at run-time in any Turing machine while the computation is in progress in that machine (i.e., you cannot change the behavior of that computation (compiled code) until its execution is interrupted).

Current generation server, networking and storage equipment and their management systems have evolved from server-centric and bandwidth-limited network architectures to today’s cloud computing architecture with virtual servers and broadband networks. During the last six decades, many layers of computing abstractions have been introduced to map the execution of complex computational workflows to a sequence of 1s and 0s that eventually get stored in the memory and operated upon by the CPU to achieve the desired result. These include process definition languages, programming languages, file systems, databases, operating systems, etc. While this has helped in automating many business processes, the exponential growth in services in the consumer market has also introduced severe strains on current IT infrastructure. In order to respond rapidly to the demands that changing workloads, business priorities and latency constraints place on distributed computing resources, new layers of resource management have been added with the introduction of hypervisors, virtual machines (VMs) and their management. While these layers have made application or service management more agile, they have introduced a new layer of issues related to their own management. For example, new layers of virtual machine level clustering, intrusion detection and performance management are being introduced in addition to the already existing clusters, intrusion detection and performance management systems at the infrastructure, operating system and distributed resource management layers.

However, this approach is completely unsuited to exploiting the new generation of many-core servers and high-bandwidth networks now available. The advent of many-core servers, with tens and even hundreds of computing cores and high-bandwidth communication among them, makes the current generation of server, networking and storage equipment and their management systems, which evolved from server-centric and bandwidth-limited architectures, unsuited to efficient use in the next generation computing infrastructure. It is hard to imagine replicating current TCP/IP-based socket communication, “isolate and fix” diagnostic procedures, and the multiple operating systems (which do not have end-to-end visibility or control of business transactions that span multiple cores, multiple chips, multiple servers and multiple geographies) inside the next generation many-core servers without addressing their shortcomings. The many-core servers and processors constitute a network where each node itself is a sub-network with different bandwidths and protocols (socket-based low-bandwidth communication between servers, InfiniBand or PCI Express bus based communication across processors in the same server, and shared-memory based low-latency communication across the cores inside the processor).

In order to cope with the scaling issues and utilize the hierarchical many-core network of networks effectively, next generation service architecture has to emulate the architectural resiliency of cellular organisms that tolerate faults and implement command and control structures which enable execution of self-configuring, self-monitoring, self-protecting, self-healing and self-optimizing (in short self-*) business processes. This requires new computing models that break the Turing machine barrier to computation by allowing the computer and the computed to be treated in the same model.

Papers Solicited to Address Next Generation Datacenter Infrastructure and Technologies:

The conference on “Convergence of Distributed Clouds, Grids and their Management”, sponsored under the aegis of WETICE 2013, is devoted to addressing next generation computing models which support real-time resource reconfiguration of distributed business workflow execution based on latency constraints, changing workloads and business priorities. It is devoted to addressing the assurance of reliability, availability, performance, account management and security of distributed business process execution with appropriate visibility and control.

The objective of the Conference was first stated in WETICE 2009: “to analyze current trends in Cloud Computing and identify long-term research themes and facilitate collaboration in future research in the field that will ultimately enable global advancements in the field that are not dictated or driven by the prototypical short-term profit driven motives of a particular corporate entity.” We are glad to report that the discussions started in 2009 have directly resulted in an alternative approach to self-managing distributed computing systems, totally different from the current industry trend, that shows a way to eliminate the complexity of virtual machines and hypervisors. If this approach is proven to be theoretically sound (as a paper in WETICE 2012 investigated) and its usefulness (whose feasibility was demonstrated by two proofs of concept at the last conference) extends to mission-critical environments, the DIME network architecture may yet prove to be an important contribution to computer science.

Following the tradition, the target of WETICE 2013 is to transform current complex, redundant, costly and knowledge-intensive IT management into self-configuring, self-monitoring, self-healing and self-optimizing distributed workflow implementations with service management limited only by the speed of light. We identify the emerging area of software defined networks (SDN) as a potential candidate for further investigation, free of the bias that often surrounds commercial profit motives, to see whether SDNs will reduce the overall complexity of the datacenter or are yet another layer of complexity.

Papers are solicited to advance the next generation distributed computing and its management infrastructure that leverages the new hardware innovations. The goals of the conference include (but are not limited to):

“There are two kinds of creation myths: those where life arises out of the mud, and those where life falls from the sky. In this creation myth, computers arose from the mud and code fell from the sky.”

— George Dyson, “Turing’s Cathedral: The Origins of the Digital Universe”, New York: Random House, 2012.

“The DIME network architecture arose out of the need to manage the ephemeral nature of life in the Digital Universe”

— Rao Mikkilineni (2012)

Abstract:

The explosion of current cloud computing software offerings (both open-sourced and proprietary) to create public, private and hybrid clouds raises a question. Is it resulting in higher resiliency, efficiency and scaling of service offerings or increasing the complexity by introducing more components in an already crowded datacenter deploying myriad appliances, management frameworks, tools and people, all claiming to help lower total cost of operation? As the reliability, availability, performance, security and efficiency of the total system depends both on the number of components and their configuration, the architecture of a system plays an important role in defining the overall system resiliency, efficiency and scaling. We discuss current cloud computing architecture, the resulting complexity and investigate possible solutions using the self-organizing fractals theory and non-equilibrium thermodynamics. Evolution has taught us that when complexity increases, often, an architectural transformation occurs to lower the overall system entropy. Is a phase transition about to occur in our data centers seeded by the new many-core servers and high bandwidth communications?
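The dependence of availability on both the number of components and their configuration can be shown with a short worked calculation. This is a textbook sketch under a strong assumption that component failures are independent; the function names are chosen here for illustration. Components that must all work (a series configuration) multiply their availabilities, so every added component lowers the total, while redundant components (a parallel configuration) fail only when all of them fail.

```python
from math import prod


def series_availability(components):
    """All components must be up: availabilities multiply (independence assumed)."""
    return prod(components)


def parallel_availability(components):
    """Any one component suffices: the system is down only if all are down."""
    return 1 - prod(1 - a for a in components)


# Ten components in a chain, each 99% available: roughly 0.904 overall.
chain = series_availability([0.99] * 10)

# Two redundant components, each 99% available: roughly 0.9999 overall.
pair = parallel_availability([0.99, 0.99])

print(f"chain of 10: {chain:.4f}, redundant pair: {pair:.4f}")
```

The same ten components can therefore yield very different system availability depending on how they are arranged, which is why adding management layers in series with a service can reduce resiliency even when each layer is individually reliable.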

Introduction:

According to Holbrook (Holbrook 2003), “Specifically, creativity in all areas seems to follow a sort of dialectic in which some structure (a thesis or configuration) gives way to a departure (an antithesis or deviation) that is followed, in turn, by a reconciliation (a synthesis or integration that becomes the basis for further development of the dialectic). In the case of jazz, the structure would include the melodic contour of a piece, its harmonic pattern, or its meter…. The departure would consist of melodic variations, harmonic substitutions, or rhythmic liberties…. The reconciliation depends on the way that the musical departures or violations of expectations are integrated into an emergent structure that resolves deviation into a new regularity, chaos into a new order, surprise into a new pattern as the performance progresses.” He goes on to explain exquisitely what “all that jazz” means and what it has to do with Dynamic Open Complex Adaptive System or DOCAS.

I borrow the jazz metaphor to understand the current state of affairs in cloud computing. Cloud computing started innocently enough as an attempt to automate systems administration tasks of computing systems to improve the resiliency (availability, reliability, performance and security), efficiency and scaling of services provided by web-hosting data centers. Before the advent of global web e-commerce enabled by broadband networks and ubiquitous access to high-powered computing, the workload fluctuations were not wild enough to demand very fast response in provisioning to meet them. While enterprise datacenters were not pushed to deal with the wild fluctuations that some web-services companies were, companies such as Amazon, Google, Facebook and Twitter, dealing with uncertain (non-deterministic) workload fluctuations, took a different approach to improve resiliency and scaling. They took advantage of the increased power in blade servers, high bandwidth networks and virtualization technologies to create virtual machine (VM) based systems administration, with multiple VMs in a physical device consolidating workloads that are managed with dynamic resource provisioning. This has become known as cloud computing. Strictly speaking, a VM is not essential for automation to improve scaling, auto-failover and live migration of applications and their data, and companies such as Google have chosen their own automation strategies without using VMs. On the other hand, many other enterprises have taken a more conservative approach by not adopting the cloud strategy, avoiding the risk of impacting the availability, performance and security of their highly tuned mission-critical applications. They are probably correct, given the continued occasional outages, security breaches and cost escalation in managing complexity with many public clouds.

Amazon and Google went one step further by offering their flexible infrastructures to developers outside their companies, who could rent the resources to develop, deploy and service their own applications, thus unleashing a new class of developers. Startups could substitute OPEX for CAPEX to obtain the resources required for their new product and services development. The resulting explosion of applications and services has created a new demand for more clouds and more automation of systems administration to extend resiliency and provide a high degree of isolation among multiple tenants sharing resources while resolving the resulting contentions. The result is a complex web of Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS) offerings to meet the needs of developers, service providers and service consumers. To be sure, these offerings are not independent. On the contrary, each layer influences the others in a complex set of interactions, often in a non-deterministic way, based on workloads, business priorities and latency constraints. Figure 1 shows an example of these relationships.

Figure 1: Complex relationships of information flow between nested layers and information flows between components in each layer. The complexity is only compounded by multi-vendor offerings in each layer (not shown here)

The origin of complexity is easy to understand. While attempting to solve the issue of multi-tenancy and agility, the introduction of virtual machines gives rise to another complexity: virtual image management and sprawl control. In order to address the VM mobility issue, recent efforts to introduce application level mobility using other container constructs, such as Gears and Cartridges in the case of the Red Hat PaaS (or Dynos in the case of Heroku, the Salesforce PaaS), introduce yet another layer of management of Gears and Cartridges (or Dynos). Another example is the Eucalyptus Infrastructure as a Service, which goes to great lengths to provide High Availability (HA) of the infrastructure platform but fails to guarantee HA of applications. It is left to the applications to fend for themselves. These ad-hoc approaches to automating management have mushroomed the software required, steepened the learning curve and made operation and maintenance even more complex. While all platforms demonstrate drag and drop software with pretty displays that allow developers to easily create new services, there is no guarantee that if something goes wrong, one will be able to debug and find the root cause. Nor is there any assurance that when multiple services and applications are deployed on the same platform, the feature interactions and shared resource management provided by a plethora of independently designed management systems will cooperate to provide the required reliability, availability, performance and security at the service level. More importantly, when the services cross server, data-center and geographical boundaries, there is no visibility and control of end-to-end service connections and their FCAPS management.
Obviously, the platform vendors are only too eager to provide professional services and additional software to resolve the issues, but without end-to-end service connection visibility and control that spans multiple modules, systems, geographies and management systems, troubleshooting expenses could outweigh the realized benefits. What we probably need is not more “code” but an intelligent architecture that results in a synthesis of computing services and their management, and a decoupling of end-to-end service connection and service component management from underlying resource (server, network and storage) management.

Self-organizing Fractals and Non-equilibrium Thermodynamics:

Fortunately, the self-organizing fractal theory (SOFT) and non-equilibrium thermodynamics (NET) (Kurakin 2011) provide a way to analyze complex systems and identify solutions. A very good glimpse into the theory can be found in the video (http://www.scivee.tv/node/4994). According to the SOFT-NET theory, the process of self-organization is scale-invariant and proceeds through sequential organizational state transitions, in a manner characteristic of far-from-equilibrium systems, with macrostructure-processes emerging via phase transition and self-organization of microstructure-processes. Once they have emerged as a result of an organizational transition, newborn structure-processes strive to persist and expand, growing in size/number, diversity, complexity, and order, while feeding on pre-existing energy/matter gradients. Economic competition among alternatively organized structure-processes feeding on the same energy/matter gradients leads to the elimination of economically deficient or inferior structure-processes and the improvement, diversification, and specialization of survivors, who are forced to fill and exploit all the available resource niches (the Darwinian phase of self-organization) (Kurakin 2007). Promoted by mutually profitable exchanges of energy/matter, the self-organization of specializing survivors (structure-processes) into larger scale structure-processes transforms (mostly) competing alternatives into (mostly) cooperating complements. As a result, Darwinian competition is transferred onto a larger spatiotemporal scale, where it commences among alternative organizations of self-organized survivors (the organizational phase) (Kurakin 2007).
Such an economy-driven, scale-invariant process of self-organization leads to the emergence of increasingly long-lived, multi-scale, hierarchical organizations (structures-processes) that expand over increasingly larger scales of space and time, feeding on available energy/matter gradients and eventually destroying them. Yet because energy/matter exists as a non-equilibrium system of interdependent gradients and conjugated fluxes of interconverting energy/matter forms, new gradients and fluxes are created and become dominant as old gradients and fluxes are consumed and destroyed. Such processes are responsible for the continuous birth, death, and transformation of energy/matter forms.

Obviously, cloud computing systems (or for that matter, distributed computing systems in general based on Turing machines) are not living organisms and thus are not susceptible to self-organization. However, if you substitute information for energy/matter, there are many similarities between the structure and dynamics of computing systems and those of living self-organizing systems. The nested computing layers, the meta-stable organizational patterns (both macro- and micro-structures) in each layer, and process evolution through inter-layer interaction are the same features that contribute to self-organization. So one can ask what is missing for cloud computing environments to become self-organizing. The answer lies in two observations:

The first is Gödel’s prohibition of self-reflection by the computing element that forms the fundamental building block of the computing domain, the Turing machine (TM) (Samad and Cofer, 2001).

The second is the lack of the scale-invariant macro- and micro-structure-processes mentioned above for organizing computing components and their management across the various nested layers. This lack results from the current ad-hoc implementation of computing processes using the serial von Neumann implementation of the Turing machine.

I have discussed both these deficiencies elsewhere (Mikkilineni 2011, 2012). The DIME network architecture proposed there attempts to address them.

The DIME Network Architecture:

In its simplest form, a DIME comprises a policy manager (covering the fault, configuration, accounting, performance, and security aspects, often denoted by FCAPS); a computing element called MICE (Managed Intelligent Computing Element); and two communication channels. The FCAPS elements of the DIME provide setup, monitoring, analysis and reconfiguration based on workload variations, policy-based system priorities and latency constraints. They are interconnected and controlled using a signaling channel, which overlays a computing channel that provides I/O connections to the MICE (Mikkilineni 2011). The DIME computing element acts like the oracle machine that Turing introduced in his thesis, and circumvents Gödel’s halting and undecidability issues by separating computing from its management and pushing the management to a higher level. Figure 2 shows the DIME computing model.

Figure 2: The DIME Computing Model. For details on the different implementations of DIME networks (a LAMP stack without VMs and a native Parallax OS) visit http://www.youtube.com/kawaobjects
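
To make the separation of computing and management concrete, here is a minimal Python sketch of a single DIME node. It is only an illustration under stated assumptions, not the implementation described in the cited papers: the class names `Mice` and `Dime`, the `"stop"`/`"start"` commands and the queue-based channels are all hypothetical. What it tries to show is that the computing element never inspects or alters its own state; every change arrives from outside over the signaling channel.

```python
from dataclasses import dataclass, field
from queue import Queue
from typing import Callable, List


@dataclass
class Mice:
    """Hypothetical stand-in for the MICE: it only computes, it never manages itself."""
    task: Callable[[int], int]
    enabled: bool = True


@dataclass
class Dime:
    """One DIME node: a manager, a MICE and two separate channels (all names assumed)."""
    name: str
    mice: Mice
    signaling: Queue = field(default_factory=Queue)  # management commands
    computing: Queue = field(default_factory=Queue)  # workload input
    results: List[int] = field(default_factory=list)

    def manage(self) -> None:
        """Drain the signaling channel and reconfigure the MICE accordingly."""
        while not self.signaling.empty():
            command = self.signaling.get()
            if command == "stop":
                self.mice.enabled = False
            elif command == "start":
                self.mice.enabled = True

    def step(self) -> None:
        """Handle management first, then at most one unit of computing work."""
        self.manage()
        if self.mice.enabled and not self.computing.empty():
            self.results.append(self.mice.task(self.computing.get()))


node = Dime("node-1", Mice(task=lambda x: x * 2))
node.computing.put(5)
node.step()                  # computes 5 * 2
node.signaling.put("stop")   # an external manager pauses the MICE
node.computing.put(7)
node.step()                  # 7 stays queued while the MICE is stopped
```

In this toy version the two channels are in-process queues; the point is only that a command on the signaling channel changes the behavior of the computing element without any change to the task it runs.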

In addition, the introduction of signaling in the DIME network architecture allows a fractal composition scheme that creates a recursive distributed computing engine, with scale-invariant FCAPS management of the computing workflow at the node, sub-network and network levels. Figure 3 compares living organisms, with their self-organizing fractal attributes, and Cloud computing infrastructure organized to exhibit self-management fractal attributes.

Figure 3: Comparison of the nested hierarchical organization of living organisms and DIME network architecture.
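
The fractal composition can be pictured as a tree in which every level, from node to sub-network to network, exposes the same management interface. The sketch below is a hypothetical illustration in Python: the class `ManagedUnit`, the policy key `max_load_per_leaf` and the load figures are assumptions made for the example, not part of the DIME specification. A policy set at the root propagates downward, and the same check is applied unchanged at every scale.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ManagedUnit:
    """A node, sub-network or network; the same class is used at every level."""
    name: str
    load: float = 0.0  # hypothetical measurement, meaningful only on leaves
    policy: Dict[str, float] = field(default_factory=dict)
    children: List["ManagedUnit"] = field(default_factory=list)

    def apply_policy(self, policy: Dict[str, float]) -> None:
        """Push a policy defined at this level down to everything below it."""
        self.policy = dict(policy)
        for child in self.children:
            child.apply_policy(policy)

    def leaf_count(self) -> int:
        if not self.children:
            return 1
        return sum(child.leaf_count() for child in self.children)

    def total_load(self) -> float:
        return self.load + sum(child.total_load() for child in self.children)

    def violations(self) -> List[str]:
        """Apply the same per-leaf load check at this level, then recurse."""
        limit = self.policy.get("max_load_per_leaf", float("inf"))
        found = [self.name] if self.total_load() > limit * self.leaf_count() else []
        for child in self.children:
            found.extend(child.violations())
        return found


network = ManagedUnit("network", children=[
    ManagedUnit("subnet-a", children=[ManagedUnit("a1", load=0.9),
                                      ManagedUnit("a2", load=0.9)]),
    ManagedUnit("subnet-b", children=[ManagedUnit("b1", load=0.1)]),
])
network.apply_policy({"max_load_per_leaf": 0.8})
overloaded = network.violations()  # names of units that exceed the policy
```

Here the network as a whole is within policy, while `subnet-a` and its two nodes are not, so the violation is visible at the level where a corrective action would have to be taken.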

While both models exhibit the genetic transactions of replication, repair, recombination and reconfiguration (Stanier and Moore, 2006) (Mikkilineni 2011), there is a fundamental difference between the two. The DIME network architecture is not self-organizing but it is self-managing based on initial policies and constraints defined at the root levels of the hierarchies. These policies can be modified during run time but only with the influence of agents external to the computing element whose behavior is under modification (at the DIME node, sub-network and network level).

At each level, the FCAPS management defines the initial conditions and policy constraints (a meta-model if you will, denoting the context and defining the destiny of the ensuing process workflow) that determine the information flows and workflows executed by the DIME network downstream. The resulting metastable configurations are monitored and managed by the managers upstream. This model exhibits the three-step process that provides self-management in living organisms: establish routine, monitor cues, and respond with corrective action based on FCAPS parameters at every level. Figure 4 shows the metastable configuration entropy of the whole system. The monitored FCAPS parameters provide the measure of system entropy shown there, and each reconfiguration moves the system from a higher-entropy state to a lower-entropy one, providing a “measure” of the stable pattern.

Figure 4: System Entropy as a function of time
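
The three steps (establish routine, monitor cues, respond with corrective action) can be sketched as a small control loop. The Python below is a hypothetical illustration only: the post does not define how system entropy is computed, so the `disorder` function (total load above a policy limit) is an assumed stand-in for it, and `rebalance` is an assumed corrective action.

```python
from typing import Dict, List, Tuple


def disorder(loads: Dict[str, float], limit: float) -> float:
    """Assumed stand-in for system entropy: total load above the policy limit."""
    return sum(max(0.0, load - limit) for load in loads.values())


def rebalance(loads: Dict[str, float], limit: float) -> Dict[str, float]:
    """Corrective action: shift excess from the busiest node to the idlest one."""
    busiest = max(loads, key=loads.get)
    idlest = min(loads, key=loads.get)
    excess = max(0.0, loads[busiest] - limit)
    headroom = max(0.0, limit - loads[idlest])
    moved = min(excess, headroom)
    updated = dict(loads)
    updated[busiest] -= moved
    updated[idlest] += moved
    return updated


def manage(loads: Dict[str, float], limit: float,
           max_rounds: int = 10) -> Tuple[Dict[str, float], List[float]]:
    """Monitor the disorder measure and reconfigure until it reaches zero."""
    history = [disorder(loads, limit)]          # monitor cues
    for _ in range(max_rounds):
        if history[-1] == 0.0:
            break
        loads = rebalance(loads, limit)         # respond with corrective action
        history.append(disorder(loads, limit))
    return loads, history


# The policy limit (the "routine") is set by the manager, not by the nodes.
final_loads, history = manage({"a": 90.0, "b": 20.0, "c": 40.0}, limit=60.0)
```

The `history` list plays the role of the curve in Figure 4: each reconfiguration can only lower the measure or leave it unchanged, because load is moved from a node above the limit to a node with headroom below it.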

The SOFT-NET theories provide a path to reexamine the way we design distributed computing systems. Perhaps living organisms, with their self-organizing properties, could show us a way to bring self-management to cloud computing configurations and so improve resiliency, efficiency and scaling. The DIME network architecture is a baby step toward implementing a recursive distributed computing engine that executes managed workflows, which constitute hierarchical and temporal sequences of events executing business workflows.

The DIME network architecture raises some interesting questions about Turing machines and their management. How is it related to the Universal Turing Machine (UTM)? It is important to point out that I do not claim that DIME networks are the answer to Cloud computing woes, or that the UTM can or cannot do what a DIME network does. While communicating Turing machines are modeled by a UTM (Penrose 1989), can managed Turing machine networks also be modeled by the UTM? Are the scale-invariant organizational macro- and micro-structure-processes discussed in SOFT-NET theory essential for self-organizing systems? What are the differences between living self-organizing systems and self-managing networks? I leave these questions to the experts. I only point out that the DIME is inspired by the oracle machine discussed by Turing in his thesis and implements the architectural resiliency of cellular organisms in distributed computing infrastructure by introducing parallel management of both the computing elements and networks. While its feasibility has been demonstrated (Mikkilineni, Morana and Seyler, 2012), the DIME network architecture is still in its infancy and presents an opportunity, on the eve of Turing’s centenary celebration, to investigate its usefulness and theoretical soundness. Only time will tell if the DIME network architecture is useful in mission-critical environments. Figure 5 shows a comparison of physical-server-based computing, virtual-machine-based cloud computing and a DIME network implementation on a Linux server that eliminates the hypervisors and virtual machines.

Figure 5: Comparison between conventional, cloud and DIME network computing paradigms. The DIME network architecture requires no Hypervisors, Virtual Machines, IaaS or PaaS. Linux processes are FCAPS managed and networked using a middleware library without any changes to the Operating System.

The DIME network architecture, with its self-management, parallel signaling network overlay and recursive distributed computing engine model, supports all the features that current cloud computing provides and more, while eliminating the need for Hypervisors, Virtual Machines, IaaS and PaaS. The DIME network architecture (DNA) offers simplicity by providing FCAPS management of a Linux process through a middleware library, using the standard services of the Linux operating system and the parallelism available in a multi-core/many-core processor.

Conclusion:

I conclude with one lesson (Mikkilineni and Sarathy, 2009) that I take away from working in POTS (Plain Old Telephone System), PANS (Pretty Amazing New Services enabled by the Internet), SANs and Clouds: wherever there is networking, switching always trumps other approaches. When services are executed by a network of distributed components, service switching and end-to-end service connection management are the ultimate meta-stable structure-processes, and it seems that cellular organisms, telephone networks, and human network eco-systems have figured this out. Signaling and nested FCAPS management structure-processes seem to be the common ingredients. Therefore, I predict that data centers, which are currently computing resource management centers, will eventually transform themselves into services switching centers, just as in telephony. Perhaps computer scientists should look to telephony, neuroscience and organizational dynamics for answers rather than engaging in hackathons and coding ad-hoc complex systems to manage distributed computing resources. SOFT-NET theories seem to be pointing in the right direction. The solution may lie in discovering scale-invariant micro- and macro-structure-processes that provide nested FCAPS management and self-managed local and global policy enforcement. Perhaps Holbrook’s “All that Jazz” metaphor is an appropriate one for cloud computing research. The time may be ripe for the reconciliation (the synthesis of the thesis of implementing services and the antithesis of services management).