Openness and standardization have been a perennial topic for the computer industry. Since the early 1980s, when Intel processors led the PC revolution, open hardware standards have transformed the computer industry with standardized hardware components and building blocks. HW standards such as USB, PCI-E, SATA, and SAS are common to servers and PCs alike. At the same time, many software standards emerged (DLL, CORBA, Web Services, etc.) to ensure software interoperability. Open standards have become the gene pool of today’s computing infrastructure.

What role will open standards and open source solutions play in the cloud computing era? Look at the most popular cloud service providers today: Google, Microsoft, Amazon, etc. None of them has open standards. At most they have open interfaces for others to interact with, but the cloud solution stack is mostly proprietary. If history is a mirror of the future, we can foresee that as cloud services become more popular, open standards will play an increasingly important role. A natural question is how large a role open standards can play in the context of cloud computing. That question interests many of us. Let me share my opinion on it.

As indicated in the chart below, the level of open standards decreases as we go higher up the cloud services stack. At the very bottom, the hardware building blocks, we need strong interoperability and interchangeable (disposable?) components. They should be generally independent of cloud middleware and application services. At the infrastructure as a service (IaaS) and platform as a service (PaaS) layers, cloud operators are more likely to use open standards and generic building blocks to build their infrastructure services, even though these have to be optimized for, and work well with, the cloud environment (cloud middleware or cloud OS) of their choice. In the upper layers of the cloud solution stack, where application services (SaaS) are defined, there is a greater need for cloud operators to offer differentiated services. That is where they will put their “secret sauce” for competitiveness. There it will be much more difficult to drive open standard building blocks and components, beyond focusing on interoperable interfaces such as web services standards.

Based on the analysis above, it is reasonable to assume that open standard and open source opportunities are most promising at the HW building block, IaaS, and PaaS layers. That is where the industry is more likely to build consensus. For the upper layers, especially SaaS, we should focus on interface standards rather than on standard building blocks and open source solutions.

Intel has been a leader in HW standard building blocks for the last 30 years and has changed the industry. It is natural to assume that Intel should focus on IaaS and PaaS building blocks, as well as on how these open standards could be applied in open datacenters (ODC), as “adjacent” growth opportunities to embrace the cloud computing boom. Some conventional wisdom says that Intel is not relevant to the cloud, as cloud computing by definition abstracts HW. I would say just the opposite: Intel will continue to play a critical role in defining and promoting open standards and open source solutions for IaaS and PaaS, so that the cloud can actually mushroom. There is a strong correlation between how fast cloud computing can proliferate and how well Intel plays its role in leading open cloud solutions at the IaaS and PaaS layers. What do you think?

• ODC – Open Data Center: Currently stands for a set of interoperable technologies optimized for IaaS, PaaS and SaaS datacenters. At the most basic levels, such as power management, these optimizations will also apply to traditional enterprise datacenters, but higher-level management will be tailored for high-density IaaS and SaaS datacenters.

• SaaS – Software as a Service: A model of software deployment whereby a provider licenses an application to customers for use as a service on demand. Examples include Google Apps, Salesforce.com, etc.

• PaaS – Platform as a Service: Facilitates deployment of applications without the cost and complexity of buying and managing the underlying hardware and software layers. It provides all of the facilities required to support the complete life cycle of building and delivering web applications and services, entirely available from the Internet, with no software downloads or installation for developers, IT managers or end users.

• IaaS – Infrastructure as a Service: Rather than purchasing servers, software, data center space or network equipment, clients buy those resources as a fully outsourced service. The service is typically billed on a utility computing basis, and the amount of resources consumed (and therefore the cost) will typically reflect the level of activity.

As a former datacenter engineering manager, I have personal experience with the management issues at datacenters, especially power allocation and cooling. We often assumed the worst-case scenario, as we could not predict when server power consumption would peak. When it did peak, we had no way to control it. It was like driving blindfolded and hoping for the best. The safest bet was to make the road as wide as possible, that is, to leave enough headroom in the power budget so that we would not run into power issues. But this resulted in underutilized power, or stranded power, which is quite a waste.

Over the course of the last several years, we met with many IPDC (internet portal datacenter) companies. We heard over and over again about their datacenter power management challenges, which were even worse than what I had experienced. Many of the IPDC companies we talked with leased racks from datacenter service providers under strict power limits per rack. The number of servers they could fit per rack had a direct impact on their bottom line. They did not want to under-populate the racks, as they would have to pay more rent for the same number of servers; they could not over-populate the racks, as that would exceed the power limits. Their power management issues can best be summarized as follows:

·Over-allocation of power: Power allocation to servers does not match actual server power consumption. Power is typically allocated for worst case scenario based on server nameplate. Static allocation of power budget based on worst case scenario leads to inefficiencies and does not maximize use of available power capacity and rack space.

·Under-population of rack space: As a direct result of the over-allocation problem, there is a lot of empty space on racks. When the business needs more compute capacity, it has to pay for additional racks. Often there is not enough datacenter space available to rent, so companies have to go to other cities or even other countries, which increases operational cost and support staff.

·No capacity planning: There is no effective means to forecast and optimize power and performance dynamically at rack level. To improve power utilization, datacenters need to track actual power and cooling consumption and dynamically adjust workload and power distribution for optimal performance at rack and datacenter levels.
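To make the over-allocation problem concrete, here is a minimal sketch of how stranded power could be estimated for one rack. The function name and all of the numbers (rack limit, per-server allocation, measured draw) are made up for illustration; this is not a real tool, just the arithmetic behind the issues listed above.

```python
def stranded_power(rack_limit_w, allocated_per_server_w, measured_per_server_w):
    """Return (servers_that_fit, stranded_watts) for one rack.

    Servers are placed using the static allocation (for example the
    nameplate value), but each one only draws the measured amount.
    """
    servers = int(rack_limit_w // allocated_per_server_w)
    actual_draw = servers * measured_per_server_w
    return servers, rack_limit_w - actual_draw


# Hypothetical numbers, for illustration only.
servers, stranded = stranded_power(
    rack_limit_w=4000,
    allocated_per_server_w=500,
    measured_per_server_w=300,
)
print(f"{servers} servers fit; {stranded} W of the rack budget is never used")
```

The gap between the allocated and the measured figure is exactly what better monitoring is meant to expose.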

This is where Node Manager comes into play. Let’s take a look at what Node Manager and its companion software tool from Intel for rack- and group-level power management, Intel® Data Center Manager (DCM), can do:

Intel® Intelligent Power Node Manager (Node Manager)

Node Manager is an out-of-band (OOB) power management policy engine embedded in Intel server chipsets. Processors can regulate their own power consumption through manipulation of the P- and T-states. Node Manager works with the BIOS and OS power management (OSPM) to perform this manipulation and dynamically adjust platform power to achieve the best performance per watt for a single node. Node Manager has the following features:

·Dynamic Power Monitoring: Measures actual power consumption of a server platform within acceptable error margin of +/- 10%. Node Manager gathers information from PSMI instrumented power supplies, provides real-time power consumption data singly or as a time series, and reports through IPMI interface.

·Platform Power Capping: Sets platform power to a targeted power budget while maintaining maximum performance for the given power level. Node Manager receives power policy from an external management console through IPMI interface and maintains power at targeted level by dynamically adjusting CPU p-states.

·Power Threshold Alerting: Node Manager monitors platform power against targeted power budget. When the target power budget cannot be maintained, Node Manager sends out alerts to the management console
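The three features above can be pictured with a toy model. The sketch below is only my illustration of the idea of capping by stepping through P-states and alerting when the cap cannot be held. It is not Node Manager's actual algorithm or interface (the real engine runs in firmware and is driven over IPMI), and the P-state power table and idle power are invented numbers.

```python
from dataclasses import dataclass, field

# Hypothetical full-load platform power (watts) for each P-state,
# from fastest (index 0) to slowest. Illustrative values only.
PSTATE_POWER_W = [230, 210, 190, 170, 150]
IDLE_POWER_W = 100  # assumed idle platform power


def power_at(pstate, utilization):
    """Estimated platform power for a P-state at a given CPU utilization (0..1)."""
    return IDLE_POWER_W + (PSTATE_POWER_W[pstate] - IDLE_POWER_W) * utilization


@dataclass
class NodeCapper:
    cap_w: float
    pstate: int = 0
    alerts: list = field(default_factory=list)

    def step(self, utilization):
        """One control interval: pick a P-state for the load, alert if over cap."""
        slowest = len(PSTATE_POWER_W) - 1
        # Throttle down while over the cap and a slower state exists.
        while power_at(self.pstate, utilization) > self.cap_w and self.pstate < slowest:
            self.pstate += 1
        # Speed back up while the next faster state still fits under the cap.
        while self.pstate > 0 and power_at(self.pstate - 1, utilization) <= self.cap_w:
            self.pstate -= 1
        power = power_at(self.pstate, utilization)
        if power > self.cap_w:
            self.alerts.append(f"cap of {self.cap_w} W cannot be held at {power:.0f} W")
        return power


node = NodeCapper(cap_w=170)
for load in (0.5, 1.0, 0.6):
    watts = node.step(load)
    print(f"load={load:.0%} pstate=P{node.pstate} power={watts:.0f} W")
```

In this model the node stays in its fastest state at moderate load and only gives up performance when a spike would push it over the cap, which is the behavior the capping feature is described as providing.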

DCM is a software technology that provides power and thermal monitoring and management for servers, racks and groups of servers in datacenters. It builds on Node Manager and customers' existing management consoles to bring platform power efficiency to end users. DCM implements group-level policies that aggregate node data across the entire rack or data center to track metrics and historical data and to provide alerts to IT managers. This allows IT managers to establish group-level power policies that limit consumption dynamically. DCM allows data centers to increase rack density, manage power peaks, and right-size the power and cooling infrastructure. It is a software development kit (SDK) designed to plug in to software management console products. It also has a reference user interface, which was used in this POC as a proxy for a management software product. Key DCM features are:

·Group (server, rack, row, PDU and logical group) level monitoring and aggregation of power and thermals

·Log and query of trend data for up to one year

·Policy driven intelligent group power capping

·User defined group level power alerts and notifications

·Support of distributed architectures (across multiple racks)
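To give a feel for what a group-level capping policy involves, here is a minimal sketch that splits a rack budget into per-server caps based on recently measured demand. The proportional rule and the function name are my own assumptions for illustration; I am not describing how DCM actually computes its policies.

```python
def allocate_caps(rack_budget_w, demand_w):
    """Split a rack power budget into per-server caps.

    demand_w maps a server name to its recently measured draw in watts.
    If total demand fits the budget, each server gets its demand plus an
    equal share of the spare budget. Otherwise every cap is scaled down
    proportionally so that the caps add up to the rack budget.
    """
    if not demand_w:
        return {}
    total = sum(demand_w.values())
    if total <= rack_budget_w:
        spare = (rack_budget_w - total) / len(demand_w)
        return {name: d + spare for name, d in demand_w.items()}
    scale = rack_budget_w / total
    return {name: d * scale for name, d in demand_w.items()}


# Hypothetical rack: three servers asking for more than the rack allows.
caps = allocate_caps(2200, {"web1": 1000, "web2": 2000, "db1": 1400})
for name, cap in caps.items():
    print(f"{name}: cap {cap:.0f} W")
```

Re-running such an allocation at every monitoring interval is one simple way to let busy servers borrow headroom from idle ones while the rack total stays fixed.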

What will the combination of DCM and Node Manager do for datacenter power management? Here is the magic part… With DCM setting policies at group and rack level, Node Manager can dynamically report the power consumed by a server and adjust it within a certain range, so that the overall power consumption of the rack or of a particular server group can be kept within a given target. Why is this important? Let me use a real example to explain it:

IPDC Company XYZ (a name I cannot disclose in public) has a mission-critical workload at their datacenter that runs 24x7, with workload fluctuations during the day. The CPU utilization is mostly at 50~60%, with a few cases where it jumps to 100%, which is typical for datacenter operations. To be on the safe side, their current approach is to pre-qualify the Xeon® 5400 server for the worst case at 100% CPU utilization, which ran at ~300W. They used 300W for power allocation, which is significantly lower than the nameplate value of the power supply (650W).

With Xeon® 5500, for the same workload at 100% throughput, the platform power consumption goes down to 230W, a 70W reduction from the previous generation CPU. That alone is a good reason to switch to the new platform, thanks to the advanced intelligent power optimization features of Xeon® 5500. But the story does not end there…

On top of that, we further analyzed the effect of power capping using Node Manager and DCM. After many tests, we noticed that if we cap at 170W, the performance impact for the workload at 60% CPU utilization and below is almost negligible. This means that with a 170W power cap, the platform can deliver the same level of service most of the time with 60W less (230W-170W) power consumption. For the occasional spike above 60% CPU utilization, there will be some performance impact. However, since Company XYZ operates below 60% CPU utilization most of the time, the performance impact is tolerable. As a result, we can squeeze more out of the power allocation using the dynamic power management features of Node Manager and DCM.

What does this mean for Company XYZ? Well, we can do the math. The rack they lease today has a limit of 2,200W per rack. With the current Xeon® 5400 servers, they can put up to 7 servers per rack at 300W per server. With Xeon® 5500, they can safely put 9 servers at 230W per server, a 28% increase in server density on the rack. On top of that, by using Node Manager and DCM to manage power at rack level with a power limit of 2,200W and to dynamically adjust the power allocation among the servers, we can put at least 12 servers at an average power allocation of 170W per server, a 71% increase in server density compared with the situation today! This means great savings for Company XYZ. In this case, the power consumption of each individual server on the rack can go above or below 170W. DCM dynamically adjusts the power capping policy while holding the power consumption of the entire rack below 2,200W.
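The math above is easy to check. The short sketch below only redoes the division with the figures from this example (the 2,200W rack limit and the 300W, 230W and 170W per-server allocations); the function name is just for illustration.

```python
RACK_LIMIT_W = 2200  # per-rack power limit from the example


def servers_per_rack(per_server_w, rack_limit_w=RACK_LIMIT_W):
    """Whole servers that fit when each one is allocated per_server_w watts."""
    return rack_limit_w // per_server_w


baseline = servers_per_rack(300)  # Xeon 5400, allocated at 300 W
newer = servers_per_rack(230)     # Xeon 5500, allocated at 230 W
capped = servers_per_rack(170)    # Xeon 5500 with a 170 W average cap

for label, count in [("300 W", baseline), ("230 W", newer), ("170 W", capped)]:
    gain = (count - baseline) / baseline * 100
    print(f"{label}: {count} servers per rack, {gain:.1f}% more than today")
```

This gives 7, 9 and 12 servers per rack, or gains of 28.6% and 71.4% over today's 7 servers, in line with the figures quoted above.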

Of course, power management results vary from workload to workload. Workload-based optimization is needed to achieve the best result. We also assume that the datacenter can provide sufficient cooling for devices that consume power within the given power limit. Even though the result from this test cannot be applied universally to all IPDC customers, we finally have a platform that can dynamically and intelligently monitor and adjust platform power based on workload. As a datacenter manager, you can manage power at rack level and datacenter level with optimized power allocation to fully utilize the datacenter power. Are you ready to give it a try?

Dynamic Power Management Has Significant Values - a Baidu Case Study

Jackson He, Intel Corporation

We have just completed a proof of concept (POC) project with Baidu.com, the biggest search portal company in China (60+% market share in China), using the Intel® Dynamic Power Node Manager Technology (Node Manager) to dynamically optimize server performance and power consumption and so maximize the server density of a rack. We used Node Manager to identify optimal control points, which became the basis for setting power optimization policies at the node level. A management console, Intel® Datacenter Manager (Datacenter Manager), was used to manage servers at rack level and to coordinate power and performance optimization between servers to ensure maximum server density and performance yield for a given rack power envelope. The POC showed significant benefits and the customer liked the results:

At the single node level, up to 40W savings per system without performance impact when an optimal power management policy is applied

At rack level, up to 20% additional capacity within the same rack-level power envelope when an aggregated optimal power management policy is applied

Compared with today's datacenter operation at Baidu, rack density could increase by 20~40% by using Intel Node Manager

Some background of the technologies tested in this POC:

Intel® Dynamic Power Node Manager (Node Manager)

Node Manager is an out-of-band (OOB) power management policy engine that is embedded in Intel server chipset. It works with BIOS and OS power management (OSPM) to dynamically adjust platform power to achieve maximum performance/power at node (server) level. Node Manager has the following features:

Dynamic Power Monitoring: Measures actual power consumption of a server platform within acceptable error margin of +/- 10%. Node Manager gathers information from PSMI instrumented power supply, provides real-time power consumption data (point in time, or average over an interval), and reports through IPMI interface.

Platform Power Capping: Sets platform power to a targeted power budget while maintaining maximum performance for the given power level. Node Manager receives power policy from an external management console through IPMI interface and maintains power at targeted level by dynamically adjusting CPU p-states.

Power Threshold Alerting: Node Manager monitors platform power against targeted power budget. When the target power budget cannot be maintained, Node Manager sends out alerts to the management console

As internet services grow and more users embrace the internet, approaching 1 billion connected users, one of the biggest challenges for data-center operators today is the increasing cost of power and cooling as a portion of the total cost of operations. As shown in Figure 1, over the past decade the cost of power and cooling has increased 400%, and these costs are expected to continue to rise. In some cases, power costs account for 40-50% of the total data-center operation budget. To make matters worse, there is still a need to deploy more servers to support new business solutions. Data centers are therefore faced with the twin problem of how to deploy new services in the face of rising power and cooling costs. In a recent survey of data centers, 59% identified power and cooling as the key factors limiting server deployment.

Figure 1: IDC Report of data center cost structure and trend

At the same time, with increased energy costs and awareness of global warming, there is increased regulatory scrutiny of both idle and max power of servers and clients (desktops and laptops). The "green" datacenter is no longer a "nice to have" feature, but a necessity for business operation and environmental regulatory compliance. Figure 2 highlights existing and emerging regulations on power and energy consumption worldwide. Future datacenters have to be able to clearly measure and prove regulatory conformance in order to operate properly.

Figure 2: Existing and emerging energy and power regulations

To sum it up, the power management trends for future datacenters are multifaceted and will not be covered by a single company or a single business segment. They could be summarized in the following areas:

At environment level: conform to increased government regulations on energy and power and to increased power constraints (limited available power) - need innovative ways to conform to "green datacenter" regulations while delivering great value to the business.

At the datacenter level: more computing power is needed with increased demand; emergence of mega datacenter and modular datacenter (datacenter in a container); the overall power and cooling distributions need to match the increased need - new datacenter designs and power/cooling management needed.

At rack level: higher power density and higher server density per rack is needed to pack more computing power for a given space and cooling; workload balance between racks to increase power efficiency and overall datacenter reliability - need effective rack-level power and cooling monitoring and dynamic management capabilities

At server level: need lower idle and max processing power, so that platform power consumption trend is more linear with platform performance; dynamically adjust power consumption based on policy and workload - need more server-level instrumentations for power/cooling monitoring and more control knobs to dynamically optimize power and performance.

I hope you agree with me on the overall datacenter power management trends for the coming year. These trends pose challenges in each of the areas listed above. These challenges also mean opportunities for innovative solutions to thrive. I'd like to hear your feedback about these trends. I will talk more about challenges and potential solutions in upcoming blogs. You are welcome to share your thoughts on where you believe datacenter power management is going. Thanks a lot.

Server Management: How much is good enough?

Jackson He

Digital Enterprise Group, Intel Corporation

Manageability is a hot topic for IT managers. There are many solutions out there from software vendors and hardware vendors alike. The motto seems to be "the more, the better". Is that true in all cases? Not really. IT managers have also realized that more management means more complexity and cost. This raises several questions: "How much server management is good enough?", "What are the basic needs for server management?", and "Is there a way to balance more complex management against simpler and cheaper options?"

We have done a lot of research and visited datacenters around the world to understand the top server management challenges. Some of our observations are as follows:

Datacenter management too complex & too expensive: Typically a datacenter has several separate management systems for server platforms, for the network, and for the security infrastructure. For server management alone, there is a dedicated management module (BMC - baseboard management controller) on each server and management console software to centrally manage the hundreds or thousands of servers in the datacenter. Each BMC adds extra cost to each server, and the management software also costs a lot to build or buy, and to deploy. The datacenter management cost and utility expense (power and cooling) can be 45 times more than the acquisition cost of the server hardware over the life of the server (4 years).

Server management implementations inconsistent among platform providers: To make things worse, server management is not uniform across servers from different OEMs. That is, servers from HP and Dell have different BMCs and therefore need different management interfaces and extended software to manage them, which adds time and cost for software development. This aggravates the management cost pressure and limits the choice of platforms from different OEMs. As a result, server management becomes even more complex and expensive in the name of more powerful management, a vicious cycle that gets worse as the datacenter grows.

Packaged server management solutions not quite fit for the IPDC environment: Packaged server management solutions from ISVs, like IBM Tivoli, HP Openview, CA, etc., are designed for the enterprise, where the total number of servers is relatively low, applications are diverse, and dedicated IT engineers are looking for turnkey solutions. An IPDC (internet portal datacenter), by contrast, has fewer, more specialized applications distributed across a large number of servers. Most IPDC datacenters develop their own server management software. However, they have to customize the management software for the specific BMC of the servers they choose. Furthermore, IPDC datacenters do not need the same level of server management granularity as the enterprise. They typically care a lot about system status and availability monitoring and about the means to remotely power servers on and off, but not as much about remote diagnosis and repair as their enterprise counterparts. IPDC datacenters have more homogeneous applications and a larger pool of servers. It is more flexible and cost-effective for them to simply redistribute the workload, turn off troubled servers, and fix them in batches later. As a result, most IPDC customers write their own server management software that interacts with a few basic management functions on the server - they really don't need a full-featured BMC.

These findings make us wonder whether there are ways to make server management simpler. Is there an opportunity to achieve something where "less is better than more"? Can we break up the large, complex server management solutions that try to be "one size fits all" into something tiered, with simpler, common server management at the bottom and more complex, customized server management layered on top of it?

Such a modular way of offering server management solutions would address the basic needs common to most datacenters. In this way, we can standardize basic server management features and make them available on all Intel Architecture servers, while allowing OEMs and ISVs to build differentiation on top of them. It also gives customers more flexibility to pick and choose based on their management needs. IPDC customers may choose the basic set of server management that comes with the servers, regardless of which OEM they come from, which means more vendors to choose from and lower cost. Enterprise customers can choose more complex and customized solutions from particular OEMs and ISVs that best fit their datacenter management needs. Such a division of server management functions may provide relief from the ever-increasing complexity and cost of datacenter server management.
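One way to picture this tiering is as a small common interface with optional extensions on top. The sketch below is purely hypothetical: the class and method names are mine, the servers are simulated in memory, and the "basic" set (health status plus remote power on/off) is simply taken from the IPDC needs described earlier, not from any agreed standard.

```python
from abc import ABC, abstractmethod


class BasicServerManagement(ABC):
    """Hypothetical common tier: the few functions every server would expose."""

    @abstractmethod
    def is_healthy(self) -> bool: ...

    @abstractmethod
    def power_on(self) -> None: ...

    @abstractmethod
    def power_off(self) -> None: ...


class SimulatedServer(BasicServerManagement):
    """In-memory stand-in for a real server, for illustration only."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.powered = True

    def is_healthy(self):
        return self.healthy and self.powered

    def power_on(self):
        self.powered = True

    def power_off(self):
        self.powered = False


class RemoteDiagnostics(SimulatedServer):
    """An OEM-style extension layered on top of the basic tier."""

    def diagnose(self):
        return "ok" if self.healthy else "needs repair"


def retire_troubled(servers):
    """IPDC-style handling: power off unhealthy servers, to be fixed in a batch later."""
    batch = [s for s in servers if not s.is_healthy()]
    for s in batch:
        s.power_off()
    return [s.name for s in batch]


fleet = [
    SimulatedServer("a"),
    SimulatedServer("b", healthy=False),
    RemoteDiagnostics("c", healthy=False),
]
print(retire_troubled(fleet))
```

The point of the design is that `retire_troubled` only depends on the basic tier, so it works the same way for every server regardless of vendor, while richer functions such as `diagnose` remain available where a vendor provides them.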

If this is true, where do we draw the line for basic server management? What is the minimum "good enough" server management feature set? I'd like to hear your ideas on this as well. I will talk more about our thoughts on this in my next blog. Stay tuned.