VMware Farm Optimization

By Jeremy Kampwerth
Introduction To Me
• Windows and Unix System Administrator for 8
years.
• Capacity and Performance Engineer for 6
years.
• Apparently I like working for large companies.
• I consider myself a jack of many trades.
• Presentations like this are not one of the
trades.
In General
• Topic has VMware in the title but this is not
about VMware specifically as the concepts I
will discuss could be applied to any virtual
environment.
• The concepts are more about common sense
than they are trade secrets.
• My role was to assist the project by providing
analysis and technical expertise.
– I was not doing the dirty work
Introduction to the Topic
• Working hand-in-hand with the virtual
support team to rein in the wildfire that was
virtual sprawl.
• Optimize VMs based on historical utilization
data with added controls around application
requirements.
• Today the capacity team is part of the before-and-after
process to regulate and review.
Introduction to the Topic
• I will discuss
– How we got into the mess and how the
capacity team helped get us out of it.
– How and why the capacity team was engaged.
– The expensive tools we used to do the job.
– The guidelines we used to make safe decisions.
– What were the failures?
– What led to the successes?
How did we get into this mess?
• Many factors led to the wildfire (virtual
sprawl)
– Corporate decision to push virtualization
– Lack of controls in request process
• Led to many over-provisioned VMs
– Existing large non-centralized environment
managed across many different internal
organizations each with a different set of rules
Ask the Capacity Team for help
• Surely the internal capacity team was the first
call.
• Surely before they ask for money they would
think of the capacity team.
• Surely upper management would know the
capacity team exists.
– Luckily they did
What was being asked
• Can we help the virtual team rein in the
madness?
• Can we produce the same results as the outside
company?
• Can we do it in a safe manner?
• Can we do it reliably and reproducibly?
What did the outside company do?
• Looked at data for thousands of VMs
– Data only contained 4 weeks
• Analysis via Modeling tool
– Fancy tool with top secret formula
• CMDB details not considered
– No application relationships
– No account for age of the VM
• Many reductions found
– Over 40% vCPU reduction
– Over 70% vMEM reduction
Our Guidelines
• Make sure the server is being used for what it was intended
– In deployment for 180 days
• Consider the application
– Match by role and function
• Within each application, all production web servers should be sized the same
• Enough Data
– Minimum 90 days of data
• Peak utilization
– No arguing (but but why?)
– 15 minute interval
– Add headroom
• 20% headroom for vCPU
• 5% headroom for vMem (consumed memory)
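The guidelines above can be read as a small sizing rule. The following is a minimal sketch only, not the team's actual tooling: the `VmHistory` fields, the function names, and the use of GB for memory are assumptions made for illustration. It checks the 180-day and 90-day minimums, then adds 20% headroom to peak vCPU and 5% to peak consumed memory, rounding up.

```python
import math
from dataclasses import dataclass


@dataclass
class VmHistory:
    """Hypothetical record of what is known about one VM."""
    days_deployed: int    # days since the VM went into service
    days_of_data: int     # days of utilization history available
    peak_vcpu: float      # highest 15-minute vCPU sample, in vCPUs
    peak_vmem_gb: float   # highest consumed-memory sample (GB assumed)


def is_eligible(vm, min_deployed=180, min_data=90):
    """Guideline checks: deployed long enough and enough data collected."""
    return vm.days_deployed >= min_deployed and vm.days_of_data >= min_data


def recommend_size(vm, vcpu_headroom=0.20, vmem_headroom=0.05):
    """Peak utilization plus headroom, rounded up to a whole unit."""
    vcpus = math.ceil(vm.peak_vcpu * (1 + vcpu_headroom))
    vmem_gb = math.ceil(vm.peak_vmem_gb * (1 + vmem_headroom))
    return vcpus, vmem_gb


# Illustrative values only, not taken from the analysis.
vm = VmHistory(days_deployed=400, days_of_data=120, peak_vcpu=1.5, peak_vmem_gb=6.0)
if is_eligible(vm):
    print(recommend_size(vm))
```

Because the rule uses the single highest sample rather than an average or percentile, one busy interval in the history is enough to keep a VM at its larger size, which is what makes the approach conservative.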
Candidacy Analysis Overview
• Location
– Outside Co.: Only subset of location included
– Internal: All locations included
• Resources
– Outside Co.: vCPU & vMem
– Internal: Initial vCPU & Small vMem pilot
• Asset Status/Duration
– Outside Co.: Not taken into consideration
– Internal: 180 days Deployed
• Environment Matching (by AppID)
– Outside Co.: Not taken into consideration
– Internal: Match Prod/BCP, Match Non-Prod
• Minimum Data Required for Recommendation
– Outside Co.: 4 weeks via Modeling tool
– Internal: 90 days of data
• vCPU Formulas Used
– Outside Co.: Not Disclosed
– Internal: Single Max vCPU (15min Interval) + 20%, rounded up
• vMem Formulas Used
– Outside Co.: Not Disclosed
– Internal: Single Max Consumed vMem + 5%, rounded up
• VMs Analyzed
– Outside Co.: thousands
– Internal: thousands plus thousands
• Candidates Identified
– Outside Co.: 92%
– Internal: 15%
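Environment matching by AppID can be sketched as a grouping step applied after the individual recommendations are computed. This is an illustration under stated assumptions: the record fields, the `prod_bcp` / `non_prod` group labels, and the choice to size every member of a group to the largest recommendation in that group are all hypothetical. The slides say only that Prod/BCP and Non-Prod are matched, and a real implementation would also match by role and function.

```python
from collections import defaultdict


def match_sizes(recommendations):
    """Give every VM the largest (vCPUs, vMem) found in its AppID group.

    Each recommendation is a dict with hypothetical keys:
    name, app_id, group ("prod_bcp" or "non_prod"), vcpus, vmem_gb.
    """
    largest = defaultdict(lambda: (0, 0))
    for rec in recommendations:
        key = (rec["app_id"], rec["group"])
        vcpus, vmem_gb = largest[key]
        largest[key] = (max(vcpus, rec["vcpus"]), max(vmem_gb, rec["vmem_gb"]))
    return {
        rec["name"]: largest[(rec["app_id"], rec["group"])]
        for rec in recommendations
    }


# Illustrative records only.
recs = [
    {"name": "web-prod-1", "app_id": "APP1", "group": "prod_bcp", "vcpus": 2, "vmem_gb": 8},
    {"name": "web-bcp-1", "app_id": "APP1", "group": "prod_bcp", "vcpus": 1, "vmem_gb": 4},
    {"name": "web-dev-1", "app_id": "APP1", "group": "non_prod", "vcpus": 1, "vmem_gb": 2},
]
matched = match_sizes(recs)
print(matched)
```

Here the lightly used BCP server inherits the production server's size instead of being cut to its own measured peak. That keeps it able to carry the production load in a failover.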
Analysis Comparison
Data Center 1 (Server 1 to Server 6)
Current Configuration   Outside Co. Recommendation   Internal Recommendation   Reason for Difference
4x8                     1x1                          2x8                       Max vCPU 1.48, Max vMem 7.8
4x8                     1x1                          2x8                       BCP Match PROD
4x8                     1x1                          4x8                       Max vCPU 2.92, Max vMem 7.67
4x8                     1x1                          None                      Disposed
Data Center 2
Server     Current Configuration   Outside Co. Recommendation   Internal Recommendation   Reason for Difference
Server 1   4x16                    1x4                          3x12                      Max vCPU 2.8, Max vMem 11.4
Server 2   4x16                    1x7                          2x16                      Max vCPU 1.52, Max vMem 15.62
Server 3   4x16                    1x6                          2x16                      Max vCPU 1.14, Max vMem 15.62
Server 4   4x16                    1x7                          3x16                      Max vCPU 2.72, Max vMem 15.61
Data Center 3 (Server 1 to Server 4)
Current Configuration   Outside Co. Recommendation   Internal Recommendation   Reason for Difference
4x6                     1x1                          2x6                       Max vCPU 1.44, Max vMem 5.49
4x6                     1x1                          2x6                       Max vCPU 1.7, Max vMem 5.6
2x4                     1x2                          None                      Disposed
The Process
• Capacity team to produce the results and
review them with the project team to identify
candidates.
• Project team to communicate plan to planners
and application owners.
• Allow for rebuttal
– But you better bring the facts
• Optimize
The First Year Results
• Of the 15% of VM candidates identified
– 23% were cancelled after appeals process
– Of the completed
• 50% reduction in configured vCPUs
• vMem was excluded
– 100% of reductions made with no issues
The Second Year Results
• Of the 8% of VM candidates identified
– 24% were cancelled after appeals process
– Of the completed
• 20% reduction in configured vCPUs
• 10% reduction in configured vMem
– 100% of reductions made with no issues
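To put the yearly figures on a common base, the short calculation below estimates what share of all analyzed VMs was actually optimized each year. It is a sketch that assumes the candidate percentage is measured against all analyzed VMs and the cancellation percentage against the candidates. The function name is illustrative.

```python
def optimized_share(candidate_share, cancelled_share):
    """Share of all analyzed VMs that were optimized after appeals."""
    return candidate_share * (1 - cancelled_share)


year1 = optimized_share(0.15, 0.23)  # 15% candidates, 23% cancelled
year2 = optimized_share(0.08, 0.24)  # 8% candidates, 24% cancelled

print(f"Year 1: {year1:.2%} of analyzed VMs optimized")
print(f"Year 2: {year2:.2%} of analyzed VMs optimized")
```

Under those assumptions, roughly 11.5% of analyzed VMs were resized in the first year and roughly 6% in the second.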
Realized Benefits
• Better performing VMs
– Over-provisioning of resources can hurt
• Better performing Hosts
– Accurate view allowed for higher utilization of the
clusters
• Costs
– Delayed purchase of new farms for over a year
• Time to focus on future
– New farms running more powerful hardware allowed
for a many-to-one replacement
What were the issues?
• Communication breakdown
– First knowledge of optimization was from the
change request
• Lack of understanding
– Not knowing how and why
• Coordination of optimizations
– Had to learn how things would work
What led to the Success?
• Management backing
– You will be optimized unless you can produce
evidence
• Conservative formula
– Peak utilization served us well
• Communication, Communication,
Communication
• Processes in place
– Appeals process
– Resources on demand (or at least with a phone call)
What if we got more aggressive
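• All scenarios
– Location: All Infrastructure
– Resources: vCPU & vMem
– Asset Status/Duration: 180 days Deployed
– Environment Matching (by AppID): Match Prod/BCP, Match Non-Prod
– VMs Analyzed: Over 10k
• Scenario 1 (Same Year 2)
– Data Required for Recommendation: 90 days minimum (15 month max)
– vCPU Formulas Used: Single Max vCPU (15min Interval) + 20%, rounded up
– vMem Formulas Used: Single Max Consumed vMem (15min Interval) + 5%, rounded up
– Candidates Identified: 4%
• Scenario 2 (1 hour interval)
– Data Required for Recommendation: 90 days minimum (15 month max)
– vCPU Formulas Used: Single Max vCPU (1 hour Interval) + 20%, rounded up
– vMem Formulas Used: Single Max Consumed vMem (1 hour Interval) + 5%, rounded up
– Candidates Identified: 18%
• Scenario 3 (12 month max)
– Data Required for Recommendation: 90 days minimum (12 month max)
– vCPU Formulas Used: Single Max vCPU (1 hour Interval) + 20%, rounded up
– vMem Formulas Used: Single Max Consumed vMem (1 hour Interval) + 5%, rounded up
– Candidates Identified: 19%
• Scenario 4 (Less overhead)
– Data Required for Recommendation: 90 days minimum (12 month max)
– vCPU Formulas Used: Single Max vCPU (1 hour Interval), rounded up
– vMem Formulas Used: Single Max Consumed vMem (1 hour Interval), rounded up
– Candidates Identified: 29%
• RISK increases from Scenario 1 to Scenario 4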
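A longer sampling interval finds more candidates because it smooths short spikes before the maximum is taken. The sketch below illustrates this under an assumption the slides do not state: that each 1-hour value is the mean of four 15-minute samples. The sample data and function names are made up for illustration.

```python
import math


def hourly_averages(samples_15min):
    """Average each complete run of four 15-minute samples into one hourly value."""
    return [
        sum(samples_15min[i:i + 4]) / 4
        for i in range(0, len(samples_15min) - 3, 4)
    ]


def sized(peak, headroom=0.20):
    """Peak plus headroom, rounded up to whole vCPUs."""
    return math.ceil(peak * (1 + headroom))


# Two hours of illustrative vCPU usage with one short spike.
samples = [0.4, 0.5, 2.9, 0.6, 0.5, 0.4, 0.5, 0.6]

peak_15min = max(samples)                   # the spike itself
peak_1hour = max(hourly_averages(samples))  # the spike diluted across its hour

print(sized(peak_15min), sized(peak_1hour))
```

With this data the 15-minute peak calls for 4 vCPUs, while the hourly peak calls for 2. The smaller size would be saturated during the spike, which is the risk the more aggressive scenarios accept.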
Where are we today
• Part of the request process
– Previously we may or may not have been asked for sizing
– Currently all sizings come through us
• All existing servers get a sizing recommendation
• Annual Optimization Review
– At least one optimization per year
– Optimization now includes vCPU, vMem, and
storage
• Storage follows the same type of guidelines, but the analysis is not
done by the capacity team
Thanks for listening!