Alan Shepard in the Freedom 7 capsule before launch, 1961. Shepard became the first American in space on May 5, 1961, launching on the Mercury-Redstone 3 mission in a capsule he named Freedom 7; the suborbital flight lasted 15 minutes. (Photo credit: Wikipedia)

I’m writing this partly for my wife and non-technical friends. As I thought the idea through, it occurred to me that it isn’t particularly technical, and it’s a great way of explaining what I actually spend a lot of my time doing when I design solutions.

Recently both my wife’s phone and mine have broken in various ways: she dropped hers and it’s never been the same since, while mine had some dodgy hardware failure and trying to fix it made it worse. The technology we buy is generally put together by the cheapest hardware provider, with the lowest-cost commodity components (not always the case, but largely true). Mass production lowers costs; so does cutting corners and reducing redundancy (multiple components that exist solely to take over when one fails).

In general home life most of us don’t bother with redundancy. Some of us may have more than one desktop PC or laptop, but most of us have only one personal phone, one big-screen TV, one Blu-ray player, and so on. We buy commodity technology and accept that it is a commodity, prone to failure. Some of us don’t realise this and get upset when technology fails (perhaps a lot of us don’t realise this!), but when you remember that the technology you own isn’t necessarily the best (the best is expensive) but the cheapest, you’ll understand why it’s so prone to failure. This is why the home technology insurance industry is so strong!

You don’t see technology insurance in the IT world, though, not in the same guise. You have support contracts (which in many ways you could consider an insurance policy!), and these generally guarantee a replacement within four hours or one business day. That’s too long to wait for a recovery, and this is where my job comes in.

I design systems to fail: I push systems until they fail, write down why they failed, and work out whether we should or could prevent it. One thing we always try to get away from is the single point of failure (SPOF): any single unit, device or component whose failure would take systems offline. But SPOFs scale: a computer room is a SPOF, so is a country, so is the planet Earth (some organisations have genuinely considered satellites for keeping data replicated off-planet). This is where you weigh cost against benefit. Protecting against country failure is probably irrelevant for most people I talk to in the UK: if the UK is gone, is a UK company really going to care that it’s still online and available? Global banks, insurance firms and other global corporations definitely will care, but most “normal” companies won’t.

So everything I design is designed in pairs, or more. At the most basic layer we start with pairs of components: power connections, network connections, servers, rack cabinets (the frames that hold computers in a data centre), power substations, internet feeds and, increasingly often, complete data centres (what we generally call Disaster Recovery, or DR, or Disaster Avoidance). Depending on the size, we increase the tolerance for failure, N+1 being the minimum (N being the number of systems you need to actually meet the technical requirements, plus 1 for failover capacity). More often than not this is now N+2, and increasingly with DR we actually go to (N+2) + (N+2), as the second data centre needs the same capacity as the primary in order to recover from a catastrophic failure.
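
To see why the pairs pay off, a rough back-of-the-envelope model helps: if each component is available a fraction a of the time, a redundant set of n components (any one of which can carry the load, failing independently) is down only when all n fail at once. A minimal sketch; the 99.9% figure is an illustrative assumption, not a vendor number:

```python
def combined_availability(a: float, n: int) -> float:
    """Availability of n redundant components, each with availability a,
    assuming independent failures and that any single survivor can
    carry the full load."""
    return 1 - (1 - a) ** n

# A single component that is up 99.9% of the time...
single = combined_availability(0.999, 1)
# ...paired with an identical standby (the "+1" in N+1):
pair = combined_availability(0.999, 2)

print(f"single: {single:.6f}, pair: {pair:.6f}")
```

The same formula also shows why each extra layer of redundancy buys less: going from one component to two removes most of the downtime, and N+2 mainly protects you while the first failure is being repaired.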

There is an old saying, “fail to plan, plan to fail”, but in the IT industry we flip this around: everything should be designed to fail. I push people annoyingly hard to understand what happens when different components fail, how that will impact the application or service, and how we can design around it. Making things fail and understanding failure events is a huge part of my work.

Somewhere, somehow I am probably related to Gene Kranz (although not directly, to my knowledge), the well-known NASA Flight Director, and I find it a happy coincidence that his famous book is titled “Failure is not an Option”. My entire job revolves around the same ethos, but failure is not an option because I have to find every failure before it happens. NASA go to N+2 or N+3 in most circumstances, as human lives are directly at stake, as well as hundreds of millions of dollars! Yet when Alan Shepard was asked how he felt just before the launch of the Redstone rocket, he famously said: ‘… every part of this ship was built by the low bidder.’

So next time a technology device breaks at home, remember that what you bought was produced by the cheapest bidder, in the most efficient mass-production model. Every time you read a news story about a banking system going down, Facebook being unavailable, or some other IT system failure, one of two things has happened:

1. Someone like me didn’t do their job thoroughly enough (I like to think this is the less likely one).

2. Someone like me did their job thoroughly, but someone higher up decided the cost wasn’t worth the benefit of removing that chance of failure.

You might think that given the publicity and bad press, option 2 is the rarity, but you would be surprised how often things don’t fail, and the longer things don’t fail, the harder it is to justify the cost of introducing redundancy.

DevOps vs IT-Ops
17 Mar 2014

I’ve spent the last few months closely monitoring the job boards, and because of my web development background I get flagged for development jobs. Interestingly, the vast majority of development roles seem to be pitched as DevOps roles. Initially this got me interested, as I’d happily do a DevOps role (my dev skills are rusty, but I can do the Ops side pretty well). But it seems the majority of DevOps roles are simply development roles with a bit of config management included, and the config management is code related, not infrastructure related.

If you look at the IT operations side of things, these guys are getting more involved in automated builds, infrastructure configuration management and the ubiquitous immutable server concept. The problem is that there is significant cross-over in the tooling for DevOps and IT-Ops. If you’re looking at something like Chef, Puppet, Ansible or Salt, one of the key decision factors is whether you are a developer or an infrastructure person: developers are more likely to understand GitHub repositories and workflows, while infrastructure people are more at home scripting an automated build. With the major infrastructure virtualisation vendors coming to the party with things like VMware’s Application Director and Data Director, as well as Microsoft’s App Controller, this market is quickly becoming busy.

But the key question is still: are you a developer or an infrastructure person? Either an infrastructure person is building a template to hand over to development, or a developer is taking a pre-built template and automating their code over the top of it. What about DevOps, then? At what point will the infrastructure operations team actually work closely with the development team? Maybe the question is closer to: at what point will the infrastructure team let the development team get closer to the infrastructure, and at what point will the development team let the infrastructure team get closer to their code? There are still too many arguments one way or the other (“your code isn’t optimised for a virtual stack”, “your infrastructure isn’t dynamic enough for our code”, and so on).

I don’t doubt that the tooling is available, but I don’t think any one tool can yet cover the broad spectrum that both Dev and Ops need. The biggest challenge is the politics between the teams. I don’t think this is necessarily egos; it’s more that you’re taking two teams that have traditionally caused each other a lot of headaches and suddenly asking them to get along and work hand-in-hand. There are huge benefits in DevOps if it’s done properly, but the challenge of doing it properly isn’t a small one! I think both teams can learn a lot from each other too.

Should the converged infrastructure vendors (VCE Vblock, NetApp FlexPod, IBM Pure, Dell Active Infrastructure, HDS UCP, HP CloudSystem, etc.) start shipping with automation tools built in, it will be a very interesting challenge for the IT-Ops team! DevOps teams are already circumventing IT-Ops and going straight to platforms that can service them, so my gut feeling is that IT-Ops need to start making the changes first.

Explain Snapshots
11 Mar 2014

This seems to be a popular search term, so I think it’s worth covering. It’s touched on in my old top post about Fractional Reservation, but I’ll cover the alternatives here as well.

NetApp snapshots used to be pretty unique in the industry, but the general industry term for the technology is now Append-on-Write or Redirect-on-Write (new writes are appended to the “end”, or redirected to free blocks, depending how you look at it), and quite a few vendors do it this way. Put very simply, all new data is written to new (zeroed) blocks on disk. This does mean that snapshot space has to live logically in the same location as the production data, but that really shouldn’t be a problem with wide-striping / aggregates / storage pools (pick your preferred vendor term). When a snapshot is taken, the inode table is frozen and copied. The inode table points to the data blocks, and those data blocks now become fixed. As the active filesystem “changes” blocks, the changes are actually written to new locations on disk, so there is no overhead to the write (the new blocks are already zeroed). In other technologies (not NetApp) this also forms the basis of automated tiering: once data is “locked” by a snapshot it will never be over-written, so it can safely be tiered out of SSD or even SAS, as read performance is rarely an issue. NetApp use Flash Pools to augment this, and a snapshot is a trigger for data to be tiered out of Flash Pools, as it will never be “overwritten”.
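
A toy model makes the mechanism clearer. The names here (Volume, block map, slots) are illustrative, not any vendor’s on-disk format; the point is simply that a snapshot freezes the block map while overwrites land in fresh blocks:

```python
class Volume:
    """Minimal redirect-on-write model: a block map points at immutable
    block slots; an overwrite allocates a new slot instead of touching
    the old one, so snapshots are just frozen copies of the map."""
    def __init__(self):
        self.blocks = {}        # slot id -> data
        self.block_map = {}     # logical block number -> slot id
        self.snapshots = []     # each snapshot is a frozen block map
        self._next_slot = 0

    def write(self, lbn, data):
        slot = self._next_slot          # always a fresh (pre-zeroed) slot
        self._next_slot += 1
        self.blocks[slot] = data
        self.block_map[lbn] = slot      # redirect the logical block

    def snapshot(self):
        self.snapshots.append(dict(self.block_map))  # copy the map only

    def read(self, lbn, snap=None):
        mapping = self.block_map if snap is None else self.snapshots[snap]
        return self.blocks[mapping[lbn]]

vol = Volume()
vol.write(0, "v1")
vol.snapshot()          # snapshot 0 pins the slot holding "v1"
vol.write(0, "v2")      # the overwrite goes to a new slot; no data copied
print(vol.read(0), vol.read(0, snap=0))   # v2 v1
```

Note that taking the snapshot copied nothing but the (small) map, and the overwrite cost exactly one write, which is the whole appeal of the approach.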

Other, more traditional vendors (although this list is rapidly shrinking) take a Copy-on-Write approach. This means that once a snapshot is taken and the used blocks are “locked” in, any overwrite has to copy the old data to a new location (one write), zero the production block (one write) and then write the new data (one write), so CoW generally carries a 3x overhead on over-write performance. One advantage, however, is that snapshots can then be stored in a totally different location, and the layout of snapshots and production data can be much more tightly controlled. This is why the traditional enterprise arrays (DS8000, HDS VSP, EMC VMAX) still do things with Copy-on-Write: their arrays are so ridiculously quick anyway that the CoW overhead is generally negligible.
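
Counting physical writes per logical overwrite shows where that 3x figure comes from. This sketch uses the accounting above (copy out, zero, write new); real arrays vary in the exact I/O mix, so treat the constants as illustrative:

```python
def physical_writes(n_overwrites: int, method: str) -> int:
    """Physical writes needed for n logical overwrites under each scheme."""
    if method == "cow":
        # copy old data out + zero the production block + write new data
        return 3 * n_overwrites
    if method == "row":
        # redirect-on-write: one write to an already-zeroed free block
        return 1 * n_overwrites
    raise ValueError(f"unknown method: {method}")

for n in (1, 1000):
    print(n, physical_writes(n, "cow"), physical_writes(n, "row"))
```
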

There is another option: journalled snapshots. I mention this method as it’s what VMware does (although VMware is clearly not a storage vendor). Ignoring VMware VSAN, I’m talking about traditional VMware snapshots of a VMDK. The snapshot locks the written data of the VMDK file (as in NetApp snapshots), and all subsequent writes go to a journal file. A few different technologies use this journal mechanism, although in quite different ways (EMC RecoverPoint and Actifio both use journals, but very differently). The problem with VMware journal snapshots comes when you delete one: all the writes that landed in the journal file have to be re-written into the base VMDK, while production writes are still going to the disk at the same time. This can create a massive performance problem.
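
The journal approach, including the consolidation step that makes deletion expensive, can be sketched like this (illustrative names, not VMware’s implementation):

```python
class JournalledDisk:
    """Base image plus an overlay journal; deleting the snapshot means
    replaying every journalled write back into the base image."""
    def __init__(self, base):
        self.base = dict(base)   # logical block -> data
        self.journal = None      # None = no snapshot open

    def snapshot(self):
        self.journal = {}        # base is now frozen; new writes go here

    def write(self, lbn, data):
        target = self.journal if self.journal is not None else self.base
        target[lbn] = data

    def read(self, lbn):
        if self.journal is not None and lbn in self.journal:
            return self.journal[lbn]
        return self.base[lbn]

    def delete_snapshot(self):
        # The expensive part: every journalled write is replayed into
        # the base image, competing with live production I/O.
        replayed = len(self.journal)
        self.base.update(self.journal)
        self.journal = None
        return replayed

disk = JournalledDisk({0: "a", 1: "b"})
disk.snapshot()
disk.write(0, "a2")      # cheap while the snapshot is open
print(disk.read(0), disk.delete_snapshot())   # a2 1
```

The longer a snapshot lives and the busier the VM, the bigger that replay gets, which is exactly why long-lived VMware snapshots on busy VMs hurt so much at deletion time.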

Append-on-Write / Redirect-on-Write is now employed by Dell Compellent, HDS HUS, EMC VNX, HP 3PAR, IBM V7000 and many others, including most start-ups. It’s definitely the most efficient way of taking snapshots, and many vendors claim theoretically no limits to the snapshot capacity.

Downsides?

If you are too aggressive with your snapshot schedules, the system is always copying inode tables and managing a large number of snapshots, and never gets a chance to do its housekeeping. To do AoW/RoW you need free blocks, and to get free blocks the system needs a chance to do housekeeping. Generally this involves a disk scrub of some sort: visiting each data block, checking whether any snapshot still points at it, and if not, zeroing the data and freeing it up. I’ve seen busy storage systems not free up deleted space for more than 24 hours. Aggressive snapshots can also hurt data-locality performance: writes will no longer land near related data blocks, and unless your storage system is specifically tuned to handle this sort of load (as most storage systems should be), this can create a performance problem. Even if your system is tuned, performing maintenance to rebalance the data (either manually triggered or run automatically by the appliance) can bring significant performance gains for sequential workloads. Additionally, the more free space you have, the quicker your storage system will be, as it doesn’t have to “seek” for the next free block; a full system has to “seek” more to find free space and may be forced to break data into smaller chunks to fill in the holes.
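
The scrub described above is essentially a reference sweep: a block is reclaimable only if neither the active block map nor any snapshot’s frozen map still points at it. A toy version, continuing the same simplified maps-and-slots model used earlier (not any vendor’s on-disk format):

```python
def scrub(blocks, active_map, snapshot_maps):
    """Free every block slot that no map (active or snapshot) references.
    Returns the number of slots reclaimed."""
    referenced = set(active_map.values())
    for snap in snapshot_maps:
        referenced |= set(snap.values())
    reclaimable = [slot for slot in list(blocks) if slot not in referenced]
    for slot in reclaimable:
        del blocks[slot]        # a real system would zero it and mark it free
    return len(reclaimable)

blocks = {0: "v1", 1: "v2", 2: "orphan"}
active = {0: 1}                 # logical block 0 now lives in slot 1
snaps = [{0: 0}]                # a snapshot still pins slot 0
print(scrub(blocks, active, snaps))   # 1  (only slot 2 is freed)
```

Notice that slot 0 cannot be reclaimed while any snapshot references it; delete enough snapshots and the same sweep frees it, which is why a system drowning in snapshots never catches up on free space.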

Don’t overlook housekeeping, and don’t treat system maximums as a design target! Keep well within the limits of the storage system (whatever the vendor) and the storage system will look after you.

One final (very important) point: snapshots are not backups! You need to get your data off the storage system, in a format that can be recovered without the original storage system, before you can call it a backup. Don’t rely too heavily on snapshots, and consider what RTO/RPO is acceptable if you lose your entire storage system (snapshots and all).

VMware CPU Ready Time
11 Mar 2014

I have been surprised that this has recently come back to haunt me as an issue, and a major one at that.

So what’s the issue? Long story short, if you starve your virtual estate of CPU resources you’ll get CPU ready-state issues. Broadly there are two causes: you’ve over-committed your CPU resources (the consolidation ratio is too high), or your virtual machines are sized too big (and their workload is too high).

VMware vSphere is very clever with its CPU virtualisation. To let multiple virtual machines share the same CPUs, it schedules them in and out. Needless to say this happens very quickly, and generally the only thing you’ll notice is that you consume very little CPU and achieve a very high consolidation ratio. The problem really occurs with large VMs (4+ vCPUs). vSphere has to be a lot more intelligent about these, as all vCPUs need to be scheduled at the same time, or skewed only slightly (part of the relaxed co-scheduling in 5.0+). The window of opportunity to schedule them gets narrower the more vCPUs you assign: a 4 vCPU machine needs to wait for 4 logical cores to be available (hyper-threaded cores count as individual logical cores), and an 8 vCPU machine needs to wait for 8. The busier a vSphere host is, the longer the queue for CPU resources and the harder it is to schedule all the vCPUs. While a machine is waiting for CPU resources, it is in a ready state (meaning it has CPU work to process, but can’t, as no resources are available). The relaxed co-scheduling means it doesn’t always have to wait for all vCPUs to be scheduled at the same time on logical cores, but it’s a useful rule of thumb when sizing.
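
vCenter reports ready time as a summation counter in milliseconds per sampling interval, so to compare it against rules of thumb you normally convert it to a percentage of the interval. A sketch of the usual conversion; the 20-second interval is vCenter’s real-time chart default, so check your chart’s actual interval before relying on the number:

```python
def ready_percent(ready_summation_ms: float, interval_s: float = 20.0) -> float:
    """Convert a vCenter CPU-ready summation (ms accumulated per sample)
    into a percentage of the sampling interval, per vCPU."""
    return ready_summation_ms * 100.0 / (interval_s * 1000.0)

# 1,000 ms of accumulated ready time in a 20 s real-time sample = 5% ready.
print(ready_percent(1000))   # 5.0
```

For a multi-vCPU VM the summation covers all vCPUs, so divide by the vCPU count if you want a per-vCPU figure to compare against a threshold.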

So if you see this, what can you do about it? Well, there are a few options, each with different viability.

Set CPU affinity

Not necessarily ideal, but I recently saw this issue on a Citrix farm where the vCPUs added up to fewer than the logical cores, yet the CPUs were still, on average, 30% in ready state (bad!). It turns out that VMware was still trying to schedule the vCPUs intelligently and maintain NUMA affinity (keeping memory pages local to physical CPUs, as not all memory has a direct path to all CPUs), so it was constantly re-scheduling these vCPUs and causing the issue. Once we’d set CPU affinity on the Citrix VMs, the ready state dropped to less than 5% at peak. As the Citrix servers didn’t need any of the advanced VMware clustering features and the scaling model was fixed (need more Citrix servers? add more physical hosts), this was a safe, simple and easy fix. In other environments it really isn’t ideal at all.

Reduce the vCPU count

Definitely one of my top recommendations, although getting buy-in from application owners can be tough. If you see CPU scheduling issues, I would almost guarantee that reducing your virtual machine sizes (in vCPU terms) would improve performance. Look to scale out instead: have more, smaller VMs. Single vCPU servers don’t need to be co-scheduled (they’ll run on any free logical core), so they rarely, if ever, suffer from CPU ready-state issues. I would choose this option every time if I could, and you could ramp up your consolidation ratios if everything was single vCPU.

Grow your estate

More logical cores means less scheduling overhead and less CPU contention. An expensive choice maybe, but definitely a viable one. Look to remove or upgrade older hosts and put in servers with more cores. There is a caveat: some workloads prefer the higher clock speed of lower-core-count CPUs, but this is a rarity. Most applications will be fine, and CPU is generally a resource you have spare (ready state aside).

Upgrade

VMware are constantly improving the ESXi CPU scheduler, and those improvements will help with your issues. People with CPU ready-state issues on 4.0 saw them completely disappear with an upgrade to 5.x. You’ll not get an exact figure on the improvement, as it really depends on the make-up of your hosts and VMs, but ESXi upgrades are easy these days, and it’s a low-risk fix.

Separate clusters

Have a high-consolidation cluster for your low vCPU machines (2 vCPUs or fewer, say) where you ramp up the consolidation ratio, then a separate cluster for high-performance systems where the consolidation ratio is low, maybe as low as 1:1 (vCPU:pCPU). But do this within reason: DRS performs best with a good-sized cluster, so if you only have 3 hosts in total, don’t do this! If you have fewer than 5 or 6 hosts, look at achieving the same effect with DRS affinity groups instead, although that comes with some management overhead.

The bottom line is that this issue is caused by over-committed resources or an under-sized estate. Make sure your reporting is tuned so that you can pick up on CPU ready state (vCenter Operations Manager does); if your tooling doesn’t, look at alternatives or script (r)esxtop to gather the stats for you. If you’re a hosting provider, make sure you aren’t adversely affecting your customers with this issue: they won’t be able to see a high CPU ready state, but their VMs will perform very badly.
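
If you do end up scripting esxtop, its batch mode (`esxtop -b -d 5 -n 60 > stats.csv`) dumps every counter into CSV, which is easy to post-process. A minimal sketch that pulls out the worst ready-time offenders; esxtop’s counter names vary by version, so the “% Ready” header match here is an assumption to adapt to your output:

```python
import csv
import io

def worst_ready(csv_text, top=3):
    """Return (column_name, max_value) pairs for the CPU '% Ready'
    counters in an esxtop batch-mode CSV, worst first."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, samples = rows[0], rows[1:]
    ready_cols = [i for i, name in enumerate(header) if "% Ready" in name]
    maxima = []
    for i in ready_cols:
        values = [float(row[i]) for row in samples if row[i]]
        maxima.append((header[i], max(values)))
    return sorted(maxima, key=lambda kv: kv[1], reverse=True)[:top]

# A synthetic two-VM sample in the esxtop batch CSV shape (hypothetical names):
sample = (
    '"Time","\\\\host\\Group Cpu(1:web01)\\% Ready","\\\\host\\Group Cpu(2:db01)\\% Ready"\n'
    '"10:00:00","2.5","18.0"\n'
    '"10:00:05","3.0","22.5"\n'
)
for name, value in worst_ready(sample):
    print(name, value)
```
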

New beginnings
11 Mar 2014

First new post in what, 2-3 years? WAFL.co.uk still performs admirably, and now that I’ve moved roles I figure it’s time to revisit my old flames. My home lab needs a rebuild and upgrade, and that should get documented!

So I’ve moved into the big scary world of contract-based work. I started my first role a couple of weeks ago, and so far it’s going great. I want to keep a record of my exploits and the challenges real customers are facing, and share some of my more generic musings. The site will be less NetApp-centric, but I still have my roots in storage!

This first role involves a lot of interesting challenges, but there’s some great technology available here too: a strong DevOps team (that needs help integrating the Ops bit, doesn’t everyone?), lots of Big Data challenges, and an immediate project to look at creating a much more responsive infrastructure, including where cloud services fit in. I started life as a web developer, and it’s great being back at a dotcom company and seeing how the challenges have evolved.

NetApp Debuts OnCommand Performance Manager

Performance Manager 1.0 RC1 is deployed, not installed, as a virtual appliance within VMware ESX or ESXi. A virtual appliance is a prebuilt software bundle, containing an operating system and software applications that are integrated, managed, and updated as a package. This software distribution method simplifies what would be an otherwise complex installation process.

Upon deployment, the Linux 2.6.32-based virtual appliance creates a virtual machine containing the user software, third-party applications, and all configuration information pre-installed on the virtual machine. Much of the virtual appliance middleware is built primarily with Java and includes several open-source components – most notably from (but not limited to) the Apache Software Foundation, the Debian Project, and the Free Software Foundation.

Sizing Performance Manager is based upon a number of factors: the number of clustered Data ONTAP clusters, maximum number of nodes in each cluster, and maximum number of volumes on any node in a cluster.

In order to meet the official supportability status from NetApp, Performance Manager 1.0 RC1 requires 12GB of (reserved) memory, 4 virtual CPUs, and a total of 9572 MHz of (reserved) CPU. This qualified configuration meets minimum levels of acceptable performance and configuring these settings smaller than specified is not supported. Interestingly, increasing any of these resources is permitted – but not recommended – as doing so provides little additional value.

In fact, according to December 2013 AutoSupport data from NetApp, most customers should expect to deploy a single Performance Manager virtual appliance; as one instance will be suitable for 95% of all currently deployed clustered Data ONTAP systems.

It’s also interesting to note that quite a few products exist in the OnCommand portfolio that provide performance monitoring. So when would one product be better suited over another?

Let's start with Performance Manager; this is the least common denominator. It monitors and troubleshoots deep into the clustered Data ONTAP system.

OnCommand Balance provides troubleshooting and optimization at the next level -- especially in a server virtualization environment. It monitors clustered Data ONTAP, the VMs, and hosts. It correlates all data by using performance analytics and provides guidance.

Finally, OnCommand Insight is a full storage resource management solution for large, complex, multivendor storage in virtual and physical environments. NetApp customers add Insight if they have other vendors’ storage, such as EMC, HP, IBM, HDS, etc.

No license is required to deploy OnCommand Performance Manager 1.0 RC1. For the required VMware server and browser versions, refer to the Interoperability Matrix Tool (IMT) page on the NetApp Support Site.

NetApp Unveils FAS8000
19 Feb 2014

NetApp today launched the FAS8000 Series, its latest enterprise platform for shared infrastructure, with three new models: FAS8020, FAS8040, and FAS8060, which replace the FAS/V3220, FAS/V3250, and FAS/V6220 respectively. The new line will initially ship with Data ONTAP 8.2.1 RC2, supporting either 7-Mode or clustered Data ONTAP.

All systems are available in either standalone or HA configurations within a single chassis. Any standalone FAS8000 controller configuration can have a second controller (of the same model) added to the chassis to become HA.

The new FAS8000 has been qualified with the DS2246, DS4246, DS4486, DS4243, DS14mk4, and the DS14mk2-AT disk shelves with IOM6, IOM3, ESH4, and AT-FCX shelf modules. Virtualized storage from multiple vendors can also be added to the FAS8000 -- without a dedicated V-Series “gateway” system -- with the new “FlexArray” software feature.

NetApp will not offer a separate FlexCache model for the FAS8000 Series.

Let’s explore the technical details of each one of these new storage systems.

NetApp supports single and dual controller configurations in one chassis, but unlike previous systems, I/O Expansion Module (IOXM) configurations are not supported. The increased mix of high-performance on-board ports and the flexibility offered by the new Unified Target Adapter 2 (UTA2) ports reduces the need for higher slot counts on the FAS8000 series.

As with the NetApp NVRAM8 on previous FAS systems, each FAS8020 PCM includes 4GB of NVRAM9 (8GB per HA pair) with battery backup; should a power loss occur, the NVRAM contents are destaged onto NAND Flash memory. Once power is restored, the resulting NVLOG is then replayed to restore the system. NVRAM9 is integrated on the motherboard and does not take up a slot.

The FAS8020 is built upon the Gen 3 PCI Express (PCIe) architecture for embedded devices (such as PCI bridges, Ethernet / Fibre Channel / InfiniBand adapters, and SAS controllers). Its slots support wide links up to x8 lanes.

Interestingly, the HA interconnect for the FAS8020 now leverages 40Gb QDR InfiniBand adapters; this is a substantial upgrade from the 10GBASE-KR (copper) or 10GBASE-SR (fiber) technology found within the FAS/V3220.

Also new to the FAS8020 is the new Unified Target Adapter (UTA) 2; a storage industry first from NetApp. It supports 16Gb Fibre Channel (FC) or 10Gb Ethernet, providing future flexibility. Both ports must be set to the same "personality" and changing one UTA port will change the second port to the same personality. The FAS8020 has a single ASIC for onboard UTA2 ports, and fault tolerance across ASICs requires adding the X1143A-R6.

The FC personality on the UTA2 will autorange link speeds from 16/8/4 Gb FC, but does not work at 2 or 1 Gbps. The 10GbE will not autorange below 10GbE speeds. It is important to note that UTA2 ports are not supported with older DS14 FC shelves or FC tape devices. To connect to DS14 shelves or FC tape, use X1132A-R6 or X2054A-R6.

The FAS8020 can hold a maximum of 480 drives or 240 SSDs (per HA system), with a maximum capacity of 1,920TB. The maximum amount of Flash Cache and Flash Pool (combined) capacity is up to 6TB per HA pair. Maximum aggregate size on a FAS8020 is 150TB and the maximum volume size is 70TB.

Clustered Data ONTAP node limits of a FAS8020 are 24 for NAS and 8 for SAN with a homogeneous cluster. Not surprisingly, platform-mixing rules with heterogeneous (mixed) clusters limit the FAS8020 with FAS/V3220s, 3250s, 3270s, and 6210s to a maximum of 8 nodes for both SAN and NAS clusters. More modern systems (such as the FAS/V6220s, 6240/50s, and 6280/90s) are qualified with 24 nodes for NAS and 8 nodes for SAN with the FAS8020 – just like homogeneous clusters. Also be aware that the FAS8020 running within a heterogeneous (mixed) cluster of FAS/V3210s, FAS/V3240s, or FAS22xx systems has only been qualified for 4 nodes for SAN and 4 nodes for NAS at this time.

FAS8040/60

The 6U form factor FAS8040 and FAS8060 (codenamed “Bimota”) are targeted towards medium-to-large enterprise customers with business-critical applications and cloud infrastructure.

Each FAS8060 PCM includes dual-socket, 2.1 GHz Intel E5-2658 “Sandy Bridge-EP” processors with a total of 16 cores (32 per HA pair), an Intel Patsburg-J SouthBridge, and 64GB of DDR3 physical memory (128GB per HA pair). Data ONTAP CPU utilization on each FAS8060 PCM reaches 1,560%; this means all 16 cores are actively servicing the workload (100% equals one core).

NetApp supports single and dual controller configurations in one chassis, but unlike previous systems, I/O Expansion Module (IOXM) configurations are not supported. The increased mix of high-performance on-board ports and the flexibility offered by the new UTA2 ports reduces the need for higher slot counts on the FAS8000 series.

As with the NetApp NVRAM8 on previous FAS systems, each FAS8040 or FAS8060 PCM includes 8GB of NVRAM9 (16GB per HA pair) with battery backup; should a power loss occur, the NVRAM contents are destaged onto NAND Flash memory. Once power is restored, the resulting NVLOG is then replayed to restore the system. NVRAM9 is integrated on the motherboard and does not take up a slot.

Interestingly, the HA interconnect for the FAS8040/60 now leverages 40Gb QDR InfiniBand adapters; this is a substantial upgrade from the 10GBASE-KR (copper) or 10GBASE-SR (fiber) technology found within the FAS/V3250 and a modest improvement from the FAS/V6220’s 20Gb DDR InfiniBand interconnect.

According to NetApp, the FAS8040 and FAS8060 should use all four on-board ports for the cluster interconnect in order to reach peak performance for remote workloads within Data ONTAP 8.2.1. This best practice also ensures deployments can take advantage of potential performance increases in future releases of Data ONTAP.

Also new to the FAS8040 and FAS8060 is the new Unified Target Adapter (UTA) 2; a storage industry first from NetApp. It supports 16Gb Fibre Channel (FC) or 10Gb Ethernet, providing future flexibility. Both ports must be set to the same "personality" and changing one UTA port will change the second port to the same personality. The FAS8040 and FAS8060 have two UTA2 ASICs, and port pairs are e0e/0e and e0f/0f sharing an ASIC, while port pairs e0g/0g and e0h/0h share the second ASIC.

The FC personality on the UTA2 will autorange link speeds from 16/8/4 Gb FC, but does not work at 2 or 1 Gbps. The 10GbE will not autorange below 10GbE speeds. It is important to note that UTA2 ports are not supported with older DS14 FC shelves or FC tape devices. To connect to DS14 shelves or FC tape, use X1132A-R6 or X2054A-R6.

The FAS8040 can hold a maximum of 720 drives or 240 SSDs (per HA system), with a maximum capacity of 2,880TB. The maximum amount of Flash Cache and Flash Pool (combined) capacity is up to 12TB per HA pair. Maximum aggregate size on a FAS8040 is 180TB and the maximum volume size is 100TB.

The FAS8060 can hold a maximum of 1,200 drives or 240 SSDs (per HA system), with a maximum capacity of 4,800TB. The maximum amount of Flash Cache and Flash Pool (combined) capacity is up to 18TB per HA pair. Maximum aggregate size on a FAS8060 is 324TB and the maximum volume size is 100TB.

Clustered Data ONTAP node limits of a FAS8040/60 are 24 for NAS and 8 for SAN with a homogeneous cluster. Not surprisingly, platform-mixing rules with heterogeneous (mixed) clusters limit the FAS8040/60 with FAS/V3220s, 3250s, 3270s, and 6210s to a maximum of 8 nodes for both SAN and NAS clusters. More modern systems (such as the FAS/V6220s, 6240/50s, and 6280/90s) are qualified with 24 nodes for NAS and 8 nodes for SAN with the FAS8040/60 – just like homogeneous clusters.

SUMMARYThe FAS8000 is available to quote and order immediately with Data ONTAP 8.2.1 RC2. Shipments are planned to begin March 2014. Starting in May 2014, FAS8000 systems will be orderable as Factory Configured Clusters (FCCs).]]>

NetApp today launched the FAS8000 Series, its latest enterprise platform for shared infrastructure, with three new models: FAS8020, FAS8040, and FAS8060, which replace the FAS/V3220, FAS/V3250, and FAS/V6220, respectively. This new line will initially ship with Data ONTAP 8.2.1 RC2, supporting either 7-Mode or clustered Data ONTAP.

All systems are available in either standalone or HA configurations within a single chassis. Any standalone FAS8000 controller configuration can have a second controller (of the same model) added to the chassis to become HA.

The new FAS8000 has been qualified with the DS2246, DS4246, DS4486, DS4243, DS14mk4, and the DS14mk2-AT disk shelves with IOM6, IOM3, ESH4, and AT-FCX shelf modules. Virtualized storage from multiple vendors can also be added to the FAS8000 — without a dedicated V-Series “gateway” system — with the new “FlexArray” software feature.

NetApp will not offer a separate FlexCache model for the FAS8000 Series.

Let’s explore the technical details of each one of these new storage systems.

FAS8020

NetApp supports single and dual controller configurations in one chassis, but unlike previous systems, I/O Expansion Module (IOXM) configurations are not supported. The increased mix of high-performance on-board ports and the flexibility offered by the new Unified Target Adapter 2 (UTA2) ports reduce the need for higher slot counts on the FAS8000 series.

As with the NetApp NVRAM8 on previous FAS systems, each FAS8020 PCM includes 4GB of NVRAM9 (8GB per HA pair) with battery backup; should a power loss occur, the NVRAM contents are destaged onto NAND Flash memory. Once power is restored, the resulting NVLOG is then replayed to restore the system. NVRAM9 is integrated on the motherboard and does not take up a slot.

The FAS8020 is built on a Gen 3 PCI Express (PCIe) architecture for its embedded devices (such as PCI bridges, Ethernet/Fibre Channel/InfiniBand adapters, and SAS controllers), and its expansion slots support wide links of up to x8 lanes.

Interestingly, the HA interconnect for the FAS8020 now leverages 40Gb QDR InfiniBand adapters; this is a substantial upgrade from the 10GBASE-KR (copper) or 10GBASE-SR (fiber) technology found within the FAS/V3220.

Also new to the FAS8020 is the Unified Target Adapter (UTA) 2, a storage industry first from NetApp. It supports either 16Gb Fibre Channel (FC) or 10Gb Ethernet, providing flexibility for the future. Both ports must be set to the same “personality”; changing one UTA2 port changes the second port to match. The FAS8020 has a single ASIC for its onboard UTA2 ports, so fault tolerance across ASICs requires adding the X1143A-R6 adapter.

The FC personality on the UTA2 autoranges link speeds across 16/8/4Gb FC, but does not operate at 2 or 1Gbps; likewise, the 10GbE personality will not autorange below 10GbE. It is important to note that UTA2 ports are not supported with older DS14 FC shelves or FC tape devices; to connect to DS14 shelves or FC tape, use the X1132A-R6 or X2054A-R6 adapters.

The FAS8020 can hold a maximum of 480 drives or 240 SSDs (per HA system), with a maximum capacity of 1,920TB. The maximum amount of Flash Cache and Flash Pool (combined) capacity is up to 6TB per HA pair. Maximum aggregate size on a FAS8020 is 150TB and the maximum volume size is 70TB.

Clustered Data ONTAP node limits of a FAS8020 are 24 for NAS and 8 for SAN with a homogeneous cluster. Not surprisingly, platform-mixing rules with heterogeneous (mixed) clusters limit the FAS8020 with FAS/V3220s, 3250s, 3270s, and 6210s to a maximum of 8 nodes for both SAN and NAS clusters. More modern systems (such as the FAS/V6220s, 6240/50s, and 6280/90s) are qualified with 24 nodes for NAS and 8 nodes for SAN with the FAS8020 – just like homogeneous clusters. Also be aware that the FAS8020 running within a heterogeneous (mixed) cluster of FAS/V3210s, FAS/V3240s, or FAS22xx systems has only been qualified for 4 nodes for SAN and 4 nodes for NAS at this time.

FAS8040/60

The 6U form factor FAS8040 and FAS8060 (codenamed: “Bimota”) are targeted towards medium to large enterprise customers with business-critical applications and cloud infrastructure.

Each FAS8060 PCM includes dual-socket, 2.1 GHz Intel E5-2658 “Sandy Bridge-EP” processors with a total of 16 cores (32 per HA pair), an Intel Patsburg-J southbridge, and 64GB of DDR3 physical memory (128GB per HA pair). Data ONTAP CPU utilization on each FAS8060 PCM can reach 1,560%; since 100% equals one fully busy core, this means all 16 cores are actively servicing the workload.
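Since Data ONTAP reports utilization on a scale where 100% equals one core, the headline figure maps directly onto per-core load. A quick back-of-the-envelope check (illustrative Python, not NetApp tooling):

```python
# Data ONTAP reports CPU utilization where 100% equals one fully busy core.
TOTAL_CORES = 16             # cores per FAS8060 PCM (dual-socket, 8 cores each)
reported_utilization = 1560  # percent, as cited in the text

avg_per_core = reported_utilization / TOTAL_CORES
print(f"Average load per core: {avg_per_core:.1f}%")   # 97.5% -- essentially all 16 cores busy
print(f"Theoretical maximum:   {TOTAL_CORES * 100}%")  # 1600%
```

In other words, 1,560% out of a theoretical 1,600% means every core is averaging roughly 97.5% utilization.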

NetApp supports single and dual controller configurations in one chassis, but unlike previous systems, I/O Expansion Module (IOXM) configurations are not supported. The increased mix of high-performance on-board ports and the flexibility offered by the new UTA2 ports reduce the need for higher slot counts on the FAS8000 series.

As with the NetApp NVRAM8 on previous FAS systems, each FAS8040 or FAS8060 PCM includes 8GB of NVRAM9 (16GB per HA pair) with battery backup; should a power loss occur, the NVRAM contents are destaged onto NAND Flash memory. Once power is restored, the resulting NVLOG is then replayed to restore the system. NVRAM9 is integrated on the motherboard and does not take up a slot.

Interestingly, the HA interconnect for the FAS8040/60 now leverages 40Gb QDR InfiniBand adapters; this is a substantial upgrade from the 10GBASE-KR (copper) or 10GBASE-SR (fiber) technology found within the FAS/V3250 and a modest improvement from the FAS/V6220’s 20Gb DDR InfiniBand interconnect.

According to NetApp, the FAS8040 and FAS8060 should use all four on-board ports for the cluster interconnect in order to reach peak performance for remote workloads within Data ONTAP 8.2.1. This best practice also ensures deployments can take advantage of potential performance increases in future releases of Data ONTAP.

Also new to the FAS8040 and FAS8060 is the Unified Target Adapter (UTA) 2, a storage industry first from NetApp. It supports either 16Gb Fibre Channel (FC) or 10Gb Ethernet, providing flexibility for the future. Both ports must be set to the same “personality”; changing one UTA2 port changes the second port to match. The FAS8040 and FAS8060 have two UTA2 ASICs: ports e0e/0e and e0f/0f share one ASIC, while e0g/0g and e0h/0h share the second.
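Given that port-to-ASIC mapping, an administrator wanting resilience against a single ASIC failure would spread links across the two ASICs. A small illustrative sketch (the helper function is hypothetical, not a NetApp tool; port names come from the text above):

```python
# Onboard UTA2 port-to-ASIC mapping on the FAS8040/8060, per the text above.
UTA2_ASIC_MAP = {
    "e0e": 0, "0e": 0, "e0f": 0, "0f": 0,  # first ASIC
    "e0g": 1, "0g": 1, "e0h": 1, "0h": 1,  # second ASIC
}

def fault_tolerant_pair(ports):
    """Pick one port from each ASIC so that a single ASIC failure
    cannot take down both links. (Illustrative helper only.)"""
    chosen = {}
    for port in ports:
        asic = UTA2_ASIC_MAP[port]
        chosen.setdefault(asic, port)  # keep the first port seen per ASIC
    return sorted(chosen.values())

print(fault_tolerant_pair(["e0e", "e0f", "e0g", "e0h"]))  # ['e0e', 'e0g']
```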

The FC personality on the UTA2 autoranges link speeds across 16/8/4Gb FC, but does not operate at 2 or 1Gbps; likewise, the 10GbE personality will not autorange below 10GbE. It is important to note that UTA2 ports are not supported with older DS14 FC shelves or FC tape devices; to connect to DS14 shelves or FC tape, use the X1132A-R6 or X2054A-R6 adapters.

The FAS8040 can hold a maximum of 720 drives or 240 SSDs (per HA system), with a maximum capacity of 2,880TB. The maximum amount of Flash Cache and Flash Pool (combined) capacity is up to 12TB per HA pair. Maximum aggregate size on a FAS8040 is 180TB and the maximum volume size is 100TB.

The FAS8060 can hold a maximum of 1,200 drives or 240 SSDs (per HA system), with a maximum capacity of 4,800TB. The maximum amount of Flash Cache and Flash Pool (combined) capacity is up to 18TB per HA pair. Maximum aggregate size on a FAS8060 is 324TB and the maximum volume size is 100TB.
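A quick sanity check on these figures: for every model, the maximum capacity divided by the maximum drive count implies the same 4TB largest-qualified drive (illustrative Python):

```python
# Maximum drive counts and capacities per HA system, from the figures above.
models = {
    "FAS8020": {"max_drives": 480,  "max_capacity_tb": 1920},
    "FAS8040": {"max_drives": 720,  "max_capacity_tb": 2880},
    "FAS8060": {"max_drives": 1200, "max_capacity_tb": 4800},
}

# Each model's maximum capacity implies the same largest-qualified drive size.
for name, spec in models.items():
    implied_tb_per_drive = spec["max_capacity_tb"] / spec["max_drives"]
    print(f"{name}: {implied_tb_per_drive:.0f}TB per drive")  # 4TB for each model
```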

Clustered Data ONTAP node limits of a FAS8040/60 are 24 for NAS and 8 for SAN with a homogeneous cluster. Not surprisingly, platform-mixing rules with heterogeneous (mixed) clusters limit the FAS8040/60 with FAS/V3220s, 3250s, 3270s, and 6210s to a maximum of 8 nodes for both SAN and NAS clusters. More modern systems (such as the FAS/V6220s, 6240/50s, and 6280/90s) are qualified with 24 nodes for NAS and 8 nodes for SAN with the FAS8040/60 – just like homogeneous clusters.

Summary

The FAS8000 is available to quote and order immediately with Data ONTAP 8.2.1 RC2. Shipments are planned to begin March 2014. Starting in May 2014, FAS8000 systems will be orderable as Factory Configured Clusters (FCCs).

Super Storage: A Fan’s View of the NFL’s Data Storage (11 February 2014)

Like most Americans, I recently watched the biggest, boldest, and coldest event in American football: Super Bowl XLVIII with 112.2 million of my closest “friends”.

But even if you didn’t get excited about the big game, you might still be interested to learn about the role of data storage for the most-watched television program in American history.

During the week leading up to the Super Bowl, I had the privilege to help ring the opening bell at the NASDAQ MarketSite in New York City — and what an experience! I also had the opportunity to chat with the NFL’s Director of Information Technology, Aaron Amendolia, to explore how they leverage NetApp storage systems for data management.

It starts with 40 NetApp FAS2200 Series storage systems that store, protect, and serve data to all 32 NFL teams, thousands of personnel, and millions of fans. For example:

Want player stats during the game? All game play raw data is instantly available and served by NetApp storage systems.

Like those action shots? Television and newspaper photographers take hundreds of thousands of photos and videographers capture high-definition video of regular-season games, the playoffs, and the Super Bowl – all stored on NetApp storage systems.

See someone wearing a badge? NetApp provides the infrastructure that supports security credentialing for everyone from hot dog vendors to the NFL commissioner.

I also learned that the NFL leverages the entire protocol stack (both SAN and NAS), with over 90% of their infrastructure running virtual machines on NetApp storage systems.

Yet, every Super Bowl is unique.

The NFL’s end users are often located in hotels with high-latency connections, and hardware is subjected to harsh environments rarely found in datacenters (spilled soda, dirt, grit, etc.). The good news is that SnapMirror, the replication software built into Data ONTAP, allows the NFL to fail over in the event of a problem.

In fact, they regularly test their disaster recovery plans with (live) failover and failback.

Sure, your favorite team may not have made it to the Super Bowl this year, but the partnership with the NFL and its 32 teams doesn’t end with the Super Bowl. It’s still business-as-usual for the NFL’s IT infrastructure: updating playbooks, streaming video, etc.

All of which require a super storage platform like NetApp: the official data storage provider of the NFL.


NetApp today announced the availability of Flash Accel 1.3.0, its server-side software that turns supported server flash into a cache for the backend Data ONTAP storage systems.

Coherency, Persistence

As with previous releases, Flash Accel 1.3.0 detects and corrects coherency at the block level, rather than flushing the entire cache. Flushing the entire cache would avoid data coherency issues altogether, but it is terrible for performance. Flash Accel’s cache invalidation corrects only the affected blocks while keeping the rest of the cache persistent.

The benefit of combining intelligent data coherency with persistence is that the cache both delivers its peak performance (i.e. when the cache is warm) and sustains that peak for as long as possible (by keeping the cache warm for as long as possible).
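The difference between block-level invalidation and a full flush can be sketched with a toy cache (illustrative only; Flash Accel’s real implementation is proprietary):

```python
# Toy illustration: block-level invalidation keeps the cache warm,
# while a full flush throws away every cached block.
cache = {0: "A", 1: "B", 2: "C", 3: "D"}  # block number -> cached data

def invalidate_stale(cache, stale_blocks):
    """Block-level coherency: drop only the blocks that changed
    on the backend storage; everything else stays warm."""
    for block in stale_blocks:
        cache.pop(block, None)

def full_flush(cache):
    """The naive alternative: coherent, but the cache is now cold."""
    cache.clear()

invalidate_stale(cache, {1})
print(cache)  # {0: 'A', 2: 'C', 3: 'D'} -- still mostly warm
```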

Side note: Flash Accel code manages server cache, accelerating access to data that is stored and managed by Data ONTAP on the storage system. Flash Accel is NOT Data ONTAP code.

What’s New

Flash Accel 1.3.0 adds the following features and functionality:

- Support for Windows Server bare-metal caching:
  - Windows 2008 R2, Windows 2012, and Windows 2012 R2
  - FC and iSCSI support for bare metal
  - Clustered applications supported (cold cache on failover)
- Support for Windows 2012 and 2012 R2 VMs, plus vSphere 5.5 support (note: when Flash Accel is used with the Flash Accel Management Console (FAMC), vSphere 5.5 support will be added within weeks of general availability of Flash Accel 1.3)
- Up to 4TB of cache per server
- Support for the sTEC PCIe Accelerator

Note: for VMware environments, Flash Accel 1.3.0 is initially available only for use with FAMC; support for use with the NetApp Virtual Storage Console (VSC) will be available when VSC 5.0 releases.

Check the NetApp Interoperability Matrix Tool (IMT) under the new “Flash Accel” storage solution within the “Server Caching Solutions” category for the most up to date supported configurations.

As part of this release, memory consumption has been reduced. Previously, Flash Accel required 0.006 GB of physical memory on the host for every gigabyte of device memory, plus 0.006 GB of additional memory on the VM for every gigabyte of cache space allocated. With version 1.3.0, these requirements drop to 0.0035 GB (on the host) and 0.0035 GB (on the VM) per gigabyte.
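To put those per-gigabyte ratios in context, here is the overhead for a maximal 4TB (4,096GB) server cache under the old and new figures (simple arithmetic in Python; the cache size is just an example):

```python
# Host-side memory overhead per GB of cache: Flash Accel pre-1.3 vs 1.3.0.
OLD_GB_PER_GB = 0.006
NEW_GB_PER_GB = 0.0035
cache_gb = 4096  # a maximal 4TB server cache, for illustration

old_overhead = OLD_GB_PER_GB * cache_gb
new_overhead = NEW_GB_PER_GB * cache_gb
print(f"Pre-1.3: {old_overhead:.1f} GB of host memory")  # 24.6 GB
print(f"1.3.0  : {new_overhead:.1f} GB of host memory")  # 14.3 GB
```

For a fully sized cache, that is roughly 10GB of host memory saved.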

It’s also important to note that the default cache block size in version 1.3.0 has been increased from 4 KB to 8 KB.

Flash Accel 1.3.0 can be downloaded by any NetApp FAS or V-Series customer at no cost. To download Flash Accel, visit the NetApp Support Site.


Snap Creator was originally developed in October 2007 by NetApp Professional Services and Rapid Response Engineering to reduce (or even eliminate) scripting. Nowadays, Snap Creator is a fully supported software distribution available from the NetApp Support Site.

The Snap Creator Team provides two versions of Snap Creator: a community version and an official NetApp release. The community version includes the latest plug-ins, enhancements, and features but is not supported by NetApp Support. The NetApp Version is fully tested and supported, but does not include the latest plug-ins, features, and enhancements.

Let’s explore the architecture of the recently released Snap Creator 4.1 Community Release in November 2013.

Snap Creator Server Architecture

The Snap Creator Server is normally installed on a centralized server. It includes a Workflow Engine, which is a multi-threaded, XML-driven component that executes all Snap Creator commands.

Both the Snap Creator GUI and CLI, as well as third-party solutions (such as PowerShell Cmdlets), leverage the Snap Creator API. For example, NetApp Workflow Automation can leverage PowerShell Cmdlets to communicate to Snap Creator.

To store its configurations, Snap Creator includes configuration files and profiles in its Repository; this includes global configs and profile-level global configs. If you’re familiar with previous versions of Snap Creator Server, one of the new components is the Extended Repository. This extension provides a database location for every job, imported information about jobs, and even plug-in metadata.

For persistence, the Snap Creator Database stores details on Snap Creator schedules and jobs, as well as RBAC users and roles.

The Storage Interface is the server-side component that handles communication via Data ONTAP APIs to execute storage system operations such as SnapVault, SnapMirror, etc. Snap Creator also includes the Unified Manager Interface that communicates to NetApp OnCommand Unified Manager; it actually uses Unified Manager APIs (instead of ONTAP APIs).

Finally, the Snap Creator Server includes an Agent Interface that communicates with the Agent (which is generally installed on hosts external to the Snap Creator Server). Let’s move on to the Snap Creator Agent.

Snap Creator Agent Architecture

The Snap Creator Agent 4.1 was recently rewritten entirely in Java to be multi-threaded on all operating systems. As part of this rewrite, encryption was enabled throughout the software: releases up to and including version 4.0 communicate only over HTTP, whereas version 4.1 communicates only over (encrypted) HTTPS.

First, the Agent Interface (on the Server) talks with the Agent’s RESTful Interface. The primary component of the Agent is the Operation/Execution Manager. It is responsible for handling any incoming, outgoing, and/or completed requests while the Execution Manager actually completes those requests.

To execute multiple tasks, each Snap Creator Agent includes a Thread Pool of worker threads; the number of worker threads determines how many operations can run concurrently.

But what happens if an operation exceeds its timeout value? This is when the Agent’s Watchdog can be triggered by the Execution Manager; it is often invoked during quiesce operations.
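The execution model described above (a pool of worker threads, plus a watchdog that gives up on operations exceeding their timeout) can be sketched in a few lines of Python. This is an illustrative pattern only, not Snap Creator’s actual (Java) implementation:

```python
import concurrent.futures

# Sketch of the Agent's execution model: worker threads run the operations,
# while a watchdog-style timeout stops waiting on any operation that overruns.
def run_operations(operations, workers=4, timeout_s=5.0):
    results, timed_out = {}, []
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(op): name for name, op in operations.items()}
        for future, name in futures.items():
            try:
                results[name] = future.result(timeout=timeout_s)
            except concurrent.futures.TimeoutError:
                timed_out.append(name)  # the "watchdog" gives up on this one
    return results, timed_out

results, late = run_operations({"quiesce": lambda: "ok", "snapshot": lambda: "done"})
print(results, late)  # {'quiesce': 'ok', 'snapshot': 'done'} []
```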

Moving on, there’s a Context Store which holds information that is needed over the lifetime of the workflow.

The Snap Creator Agent also includes a Plug-in Factory that instantiates the plug-ins and communicates directly to the Context Store.

It’s also important to note that Java plug-ins can talk directly back to the Context Store on the Agent, as well as to the Snap Creator Server, via token-based storage access. This means they can execute storage operations without communicating back through the Snap Creator Server. While this capability hasn’t been leveraged to date, it is available moving forward.

But what happens if the Plug-in Factory instantiates a non-Java plug-in?

In this scenario, it executes the code of the existing Plug-in Integration Engine, which can run version 4.0 or 3.6 plug-ins as well as custom plug-ins written in Perl, Unix shell scripting, PowerShell, etc.
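The dispatch logic can be pictured roughly as follows. This is a hypothetical sketch of the pattern, not Snap Creator source code; the class and function names are invented for illustration:

```python
# Hypothetical sketch of the factory dispatch described above; names invented.
class JavaPlugin:
    """A native (Java-style) plug-in the factory can instantiate directly."""
    def __init__(self, name):
        self.name = name
    def run(self):
        return f"{self.name}: executed natively"

class IntegrationEngine:
    """Runs legacy (3.6/4.0) and script-based plug-ins (Perl, shell, PowerShell)."""
    def run(self, name, kind):
        return f"{name}: executed via integration engine ({kind})"

def plugin_factory(name, kind):
    # Native plug-ins are instantiated directly; everything else
    # falls back to the integration engine.
    if kind == "java":
        return JavaPlugin(name).run()
    return IntegrationEngine().run(name, kind)

print(plugin_factory("mysql", "java"))       # executed natively
print(plugin_factory("backup.sh", "shell"))  # routed to the integration engine
```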

Summary

To recap, we’ve discussed both the server and agent architecture of Snap Creator 4.1. To learn more about developing plug-ins, visit SnapCreator.com.