Thursday, 18 December 2014

Root Volume Not Working Properly: Recovery Required (On a SIM)

If you use NetApp Simulators a lot, you’ll likely come
across one of the following at some point:

Call home for ROOT VOLUME NOT WORKING PROPERLY: RECOVERY REQUIRED

- or -

SYSTEM MESSAGES

The root volume (/mroot) is dangerously low on space (less than 10MB). To make space available,
delete old Snapshot copies, delete unneeded files, and/or expand the root
volume’s capacity. After enough space is made available, reboot this
controller...

This error is very unlikely to ever happen on a
production system. It happens on SIMs because the root volume is
so tiny (less than 900MB), whereas on production systems it will be over 250GB
and probably much, much more (CDOT systems come out of the factory with vol0 set at
95% of the size of the 3-disk root aggregate, so pretty much at least 95% of the
size of the smallest disk, which is usually > 900GB these days!)

So, if you’re reading this post, odds are you’ve got this
error. So how do you fix it?

1) If you’ve not logged into the console already, do so.

2) You’ll find yourself at a NODENAME::> prompt. At
that prompt, type ::>

node run local

3) Disable vol0’s snapshot schedule >

snap sched vol0 0 0 0

4) Delete any snapshots of vol0 >

snap delete -a vol0

5) Set vol0’s snap reserve to 0% >

snap reserve vol0 0

Now if you do a >

df vol0

- you should have plenty of available capacity, and we could reboot the node to get the CDOT SIM
back up again, but there’s more we can do in the nodeshell!
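As a rough way to judge "plenty of available capacity", the sketch below compares the avail figure from df against the 10MB threshold quoted in the console message. It assumes df is reporting kilobytes and that 1MB means 1024KB; the function names are made up for illustration and are not part of ONTAP.

```python
# Threshold taken from the console message: "less than 10MB".
# Assumption: 1MB = 1024KB, and the df figures are in KB.
LOW_SPACE_THRESHOLD_KB = 10 * 1024


def root_volume_is_low(avail_kb: int) -> bool:
    """Return True if the available space is under the 10MB warning threshold."""
    return avail_kb < LOW_SPACE_THRESHOLD_KB


def headroom_kb(avail_kb: int) -> int:
    """Return how many KB of free space remain above the warning threshold."""
    return avail_kb - LOW_SPACE_THRESHOLD_KB


if __name__ == "__main__":
    # Illustrative value only; use the avail column from your own df output.
    avail = 20480
    print(root_volume_is_low(avail), headroom_kb(avail))
```

If the first value printed is True, the volume is still in the danger zone and the remaining steps are worth doing before the reboot.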

6) Disable aggregate snapshots >

aggr status

snap sched -A aggr0 0 0 0

{Replace aggr0 with the correct name if that is not the name of the root aggregate}

7) Delete all aggregate snapshots >

snap delete -a -A aggr0

8) Verify/set aggr0’s snap reserve to 0% >

snap reserve -A aggr0 0

9) Check the size of aggr0, and attempt to set vol0 to be
100% of that size >

df -A aggr0

df vol0

vol size vol0 921600k

- but it will error and tell you “Cannot grow root volume to more than 95% of the available aggregate size which
is currently ...”; set vol0 to be that size >

vol size vol0 870664k

{Replace the sizes above if you get different values from your SIM}

Finally check the size of vol0 with >

df vol0

- and reboot to the CDOT SIM to get the cluster back up
and running again.

NODENAME> exit

NODENAME::> reboot
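The 95% limit from step 9 can be estimated ahead of time with a little arithmetic. This is only a sketch: the input is whatever figure ONTAP treats as the available aggregate size, which may not match the totals you read off df -A, so the number printed in the nodeshell error message remains the authoritative one. The helper names are hypothetical.

```python
def max_root_vol_kb(aggr_available_kb: int, percent: int = 95) -> int:
    """Return the largest root volume size (KB) allowed by the percentage limit.

    Integer arithmetic is used so the result is rounded down to a whole KB.
    """
    return aggr_available_kb * percent // 100


def vol_size_command(vol: str, size_kb: int) -> str:
    """Build the nodeshell 'vol size' command string for a size given in KB."""
    return f"vol size {vol} {size_kb}k"


if __name__ == "__main__":
    # Illustrative input only; substitute the figure from your own SIM.
    limit = max_root_vol_kb(1000000)
    print(vol_size_command("vol0", limit))
```

If the nodeshell still rejects the estimated size, use the value quoted in its error message instead.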

When the CDOT SIM is back up, there is yet more we can
do!

10) If you have spare disks, add them to aggr0, and
expand vol0 to 95% of aggr0’s size. The SIM comes with 3 x 1GB disks in RAID-DP.
I reckon 7 is an excellent number, so we’ll increase aggr0 to 7 disks as below:

Log in to the cluster ::>

storage disk show -container unassigned

storage disk assign -node NODENAME -all true

storage aggr add-disks -aggregate aggr0 -diskcount 4

df -A aggr0

system node run local vol size vol0 4372700k

11) If you didn’t have spare disks (or even if you did),
you could convert aggr0 to RAID4 (again, it’s just a SIM), which frees up one of
the parity disks as a spare, then add that disk to aggr0.

From the clustershell ::>

storage aggregate modify -aggregate aggr0 -raidtype raid4

storage aggregate show -aggregate aggr0 -fields state

storage aggregate add-disks -aggregate aggr0 -diskcount 1

disk show -container-type aggregate -aggregate aggr0

And again increase the size of vol0 to 95% of aggr0’s
size like we did in steps 9 and 10.
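To see why steps 10 and 11 buy so much space, it helps to count data disks. The sketch below assumes a single RAID group with equal-sized disks, with RAID-DP using two parity disks and RAID4 using one; the function and table names are made up for illustration.

```python
# Parity disks per RAID group, keyed by the -raidtype value.
PARITY_DISKS = {"raid4": 1, "raid_dp": 2}


def data_disks(total_disks: int, raid_type: str) -> int:
    """Return the number of data disks in a single RAID group."""
    parity = PARITY_DISKS[raid_type]
    if total_disks <= parity:
        raise ValueError("not enough disks for this RAID type")
    return total_disks - parity


if __name__ == "__main__":
    print(data_disks(3, "raid_dp"))  # the SIM as shipped
    print(data_disks(7, "raid_dp"))  # after adding 4 spares in step 10
    print(data_disks(3, "raid4"))    # after step 11 on a 3-disk SIM
```

Under those assumptions, the default 3-disk RAID-DP aggregate has a single data disk, so even one extra data disk roughly doubles the usable space.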

12) Finally, there is some advanced tidy up of vol0 we
can do via the Systemshell (Note: This
blog post is just about Simulators - in the real world, the Systemshell should
only be used under advice and guidance from NetApp Support!)

::> set -privilege diagnostic

::*> security login unlock -username diag

::*> security login password -username diag

::*> systemshell

login: diag

Password: {As set above}

% cd /mroot/etc/log

% pwd

% ls

% rm *.log.*

% ls

% cd /mroot/etc/log/mlog

% pwd

% ls

% rm *.log.*

% ls

% cd /mroot/etc/software

% pwd

% ls

% rm *

% ls

% exit

::*> set -privilege admin

::> df vol0

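Note: /mroot/etc/software will not exist unless some software has been downloaded to
the SIM in its lifetime. If the cd to it fails, skip the rm * that follows, otherwise
it will delete the files in the directory you are still in (/mroot/etc/log/mlog).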
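If you are wondering what rm *.log.* actually removes, the sketch below applies the same wildcard pattern to a few file names. The names are hypothetical examples rather than a listing from a real SIM, and Python's fnmatch is used here as a stand-in for the shell's globbing, which behaves the same way for this pattern.

```python
from fnmatch import fnmatchcase


def rotated_logs(filenames, pattern="*.log.*"):
    """Return the names that the wildcard pattern would match."""
    return [name for name in filenames if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    sample = [
        "mgwd.log",
        "mgwd.log.0000000001",
        "messages.log",
        "messages.log.0000000007",
    ]
    print(rotated_logs(sample))
```

The pattern only matches names with something after ".log.", so the rotated copies are removed while the current .log files are left alone.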

THE END...

Recreating the Error?

This is super simple if you really want to see it:

::> system node run local

> df vol0

> vol size vol0 XXXXXXXXk

{Where XXXXXXXXk is a size just 1k more than the used space on vol0!}