nutanix

Tuesday, February 16, 2016

Resiliency is the ability of a server, network, or storage system to recover quickly and continue operating even when there has been an equipment failure or other disruption.

Invisible Infrastructure should have the ability to fix failures without any disruption to the end user or the application.

In Nutanix, every component is designed to be resilient to failures. As we all know, hardware will eventually fail and some part of the software will have bugs; it is the job of the software to build in resiliency for when a failure occurs.

The Nutanix XCP platform is inherently resilient to failures due to the intelligence built into the distributed software. A Nutanix XCP administrator can configure additional resiliency at the container/datastore level if resiliency needs to cover multiple simultaneous failures. However, the self-healing capability of the Nutanix XCP platform reduces the need to configure a higher level of resiliency than the default.

In this blog, let me introduce the different types of resiliency available in our system.

1. Hardware Resiliency:

a. Disk Resiliency

b. Node Resiliency

c. Block/Rack Resiliency

d. Cluster Resiliency/Data Center Resiliency

2. Software Resiliency:

a. Quarantining the non-optimal node

b. Auto-migration of software services to a different node (Cassandra Forwarding/Zookeeper migration)

c. Fail Fast concept

d. No strict leadership

e. Shared Nothing architecture

f. Fault tolerance - FT-1 and FT-2

a. Disk Resiliency:

Nutanix Architecture

If the Nutanix cluster is configured to tolerate one failure at a single point in time (FT=1*), then when a block of data is written, a second copy of the data is stored on a disk in another node.

If any disk fails, the data on that disk will be under-replicated, but all the blocks of data from the failed disk can still be accessed from the other nodes in the cluster. The Nutanix self-healing process running on all the nodes in the cluster will re-replicate the data in the background.
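The placement and re-protection idea described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Nutanix's actual implementation: the node names and the `place_replicas`/`reprotect` helpers are hypothetical.

```python
import random

def place_replicas(block_id, nodes, rf=2):
    """Pick rf distinct nodes to hold a block's copies (FT=1 means rf=2)."""
    if rf > len(nodes):
        raise ValueError("replication factor exceeds node count")
    return random.sample(nodes, rf)

def reprotect(placement, failed_node, nodes):
    """After a failure, re-replicate any block that lost a copy onto a healthy node."""
    healthy = [n for n in nodes if n != failed_node]
    for block_id, replicas in placement.items():
        if failed_node in replicas:
            survivors = [n for n in replicas if n != failed_node]
            # choose a healthy node that does not already hold a copy
            candidates = [n for n in healthy if n not in survivors]
            replicas[:] = survivors + [random.choice(candidates)]
    return placement
```

Because every node holds some of the under-replicated copies, every node can participate in the background re-replication, which is why the rebuild is fast.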

Data Resiliency Status

Nutanix Controller VM data resides on the SSD and is mirrored to a second SSD on the same node. In the event of an SSD failure, the Nutanix CVM will continue to run without any disruption.
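The mirroring idea can be illustrated with a small sketch. This is an assumption-laden illustration, not the actual CVM code: the `mirrored_write` helper is hypothetical, and the two file paths stand in for the two SSDs.

```python
import os

def mirrored_write(block, paths):
    """Write the same block to every path, so losing one 'SSD' loses nothing."""
    for p in paths:
        with open(p, "wb") as f:
            f.write(block)
            f.flush()
            os.fsync(f.fileno())  # make sure the copy is durable before moving on
```

If one copy becomes unreadable, the surviving copy is byte-for-byte identical, so reads continue uninterrupted.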

The Nutanix cluster proactively monitors the disks through the S.M.A.R.T. utility; if the value of any attribute indicates a potential failure, the cluster will copy the data from the disk to other nodes/disks in the cluster and mark the disk offline.
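A simplified version of this kind of proactive check might look like the following. The threshold values and the `disk_at_risk` helper are hypothetical; a real deployment would read the attributes from a tool such as smartctl rather than hard-coding them.

```python
# Hypothetical per-attribute limits; real values come from S.M.A.R.T. reporting.
THRESHOLDS = {
    "Reallocated_Sector_Ct": 10,
    "Current_Pending_Sector": 1,
    "Offline_Uncorrectable": 1,
}

def disk_at_risk(smart_attrs):
    """Return the attributes whose raw values meet or exceed their limits."""
    return [name for name, limit in THRESHOLDS.items()
            if smart_attrs.get(name, 0) >= limit]
```

A non-empty result would be the trigger to evacuate the disk's data and mark it offline before a hard failure occurs.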

After the disk is physically replaced, the Nutanix UI will guide the customer to re-add the disk to the cluster.

UI: Adding the replaced disk to the cluster

*FT=1 or FT=2 is configurable. More details at http://nutanix.blogspot.com/2014/06/nutanix-40-feature-increased-resiliency.html

Disk offline logic: (hades process)

On a less serious note: who has the better self-healing powers, The Wolverine or DP?

Friday, June 12, 2015

In creating a world-class product, the challenge is always to develop something that is flexible, useful, and simple. Many times, simplicity is lost at the expense of greater flexibility, or flexibility is lost when a product becomes overly simple.

During the development process at Nutanix, questions like these are always asked: “How do we build a solution that is simple yet flexible at the same time? How do we make the datacenter invisible? How do we help our customers spend more time with their loved ones than in the datacenter?”

The Nutanix datacenter solution is easy to operate, install, configure, and troubleshoot thanks to features such as Easy and Non-Disruptive upgrades, Foundation, and Cluster Health.

Highlighted below is the evolution of one of our most exciting features, one that all of our customers will see and use: the Nutanix Foundation installer tool.

Then

Before Foundation, the Nutanix factory had a repository of scripts that allowed us to install, test, and ship blocks to customers. The scripts had limited options for network configurations, hypervisors, and Nutanix OS versions. Our customers told us they needed the flexibility to install the hypervisor of their choice, the hypervisor version of their preference, and the Nutanix OS with the networking configuration that best suited their datacenter at the time of installation, not at the time of ordering.

While we were developing this tool, one of the hypervisor vendors decided to throw a curveball at us and forced us to stop shipping their hypervisor from the Nutanix factory, even when the customer wanted it. They considered us a competitor even though we were enabling their bare hypervisor to move up a few layers and provide storage as well.

This made us more determined to give our field, partners, and customers the flexibility of changing the hypervisor at the time of installation. This tool will eventually help our customers non-disruptively change hypervisors at any time on a running cluster.

Now

To meet these customer needs, the Nutanix Foundation installer tool was created. It was developed with the aim of providing an uncompromisingly simple factory installer that is both flexible and can be used reliably in the customer datacenter.

Foundation 2.1 allows the customer/partner to configure the network parameters, install any hypervisor and NOS version of their choice, create the cluster, and run their production workload within a few hours of receiving the Nutanix block.

So far, the feedback from our customers and partners has been fantastic.

In the short time that Foundation has been available, the tool has quickly evolved to meet new customer needs.

Beyond

The inception of Foundation 3.0 started from the desire to make adding a node as simple as pairing an iWatch to an iPhone.

We strive to keep Foundation uncompromisingly simple and still extend the features without becoming a “featuritis - feature nightmare” by constantly reviewing it with our UX team and being in touch with our customers.

Very soon, we will be shipping Foundation within NOS 4.5; it will facilitate these lifecycle management features:

Faster imaging of the nodes by our customers.

Seamless expansion of clusters. (AOS 4.5)

Efficient replacement of boot drives. (AOS 4.5.2)

Network-aware cluster installs and cluster expansion.

BMC/BIOS upgrades. (AOS 4.5.2)

Reset of the cluster back to the factory defaults. (AOS 4.5)

Easy onboarding of any node to the cluster.

I am excited to see how the new Foundation framework will enable future lifecycle management innovations.

Tuesday, January 27, 2015

The Swiss Army knife is a pocket-size multi-tool that equips you for everyday challenges. NCC equips you with multiple Nutanix troubleshooting tools in one package. NCC provides multiple utilities (plugins) for the Nutanix infrastructure administrator and, if needed, the ability to execute NCC automatically and email the results at a configurable time interval.
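As a rough illustration of such a scheduled run, the wrapper below executes a check command and emails the result. This is a hedged sketch, not NCC's actual scheduling mechanism: the `execute` and `notify` callbacks are hypothetical stand-ins for a shell runner and a mail sender, wired up by whatever scheduler (e.g. cron) invokes it.

```python
def scheduled_check(execute, notify):
    """Run the NCC health checks and email the results; meant to be invoked periodically."""
    output = execute("ncc health_checks run_all")
    # crude triage: flag the report if any check did not pass
    status = "PASS" if "FAIL" not in output else "ATTENTION"
    notify(subject=f"NCC result: {status}", body=output)
    return status
```

In practice the `notify` callback would wrap an SMTP client, and the schedule interval would match the "configurable time interval" mentioned above.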

NCC is developed by Nutanix Engineering based on input provided by support engineers, customers, on-call engineers, and solution architects. NCC helps the Nutanix customer identify a problem and either fix it or report it to Nutanix Support. NCC enables faster problem resolution by reducing the time taken to triage an issue.

When should we run NCC?

After a new install.

Before and after any cluster activities - add node, remove node, reconfiguration, and upgrades.

Anytime you are troubleshooting an issue.

As mentioned in the cluster health blog, NCC is the collector agent for cluster health.

2. Execute "ncc health_checks run_all" and monitor for messages other than PASS.

3. List of NCC statuses

4. Results of an NCC check run on a lab cluster

5. Displaying and analyzing the failed tests.

The FAILUREs are due to sub-optimal CVM memory and network errors. To fix the issues:

- Increase CVM memory to 16G or more. (KB 1513 - https://portal.nutanix.com/#/page/kbs/details?targetId=kA0600000008djKCAQ)

- Check the network (rx_missed_errors - check for network port flaps and network driver issues; KB 1679 and KB 1381).

c. Log Collector Feature of NCC: (similar to show tech_support of Cisco or vm-support of VMware)

NCC Log Collector collects the logs from all the CVMs in parallel.
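The fan-out idea behind collecting from all CVMs at once can be sketched as follows. This is a simplified illustration, not the actual log collector: the CVM addresses and the `collect_logs` stand-in (which in reality would fetch a bundle over SSH/SCP) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical CVM addresses; a real cluster discovers its own members.
CVM_IPS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

def collect_logs(cvm_ip):
    """Stand-in for fetching one CVM's log bundle (e.g. over SSH/SCP)."""
    return (cvm_ip, f"logs-from-{cvm_ip}.tar.gz")

def collect_all(cvm_ips):
    """Contact every CVM at once instead of one at a time."""
    with ThreadPoolExecutor(max_workers=len(cvm_ips)) as pool:
        return dict(pool.map(collect_logs, cvm_ips))
```

Collecting in parallel means the wall-clock time is bounded by the slowest CVM rather than the sum of all of them, which matters on large clusters.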

1. Execute ncc log_collector to find the list of logs that will be collected.

2. To collect all the logs for the last 4 hours: ncc log_collector run_all

For example, stargate.INFO will cover the time period during which it was collected: