One of the disks of our EVA4000 died today. This diskgroup (all volumes vraid5 with sparing level 1 and almost no space left for more volumes, 1TiB drives) is being rebuilt with "spare space" right now, and it will take at least 15 hours to do the leveling/rebuilding.

We can't get a new disk until Friday. So, the question is: what would happen if another disk dies before the leveling completes? Would we lose data? And after that, how many additional disks could die before we lose data, 1 or 2?

In "usual" RAID, we would be vulnerable to data loss while the rebuild takes place, but in this case the space reserved for sparing is twice the size of the largest disk, so at the very least the effect should be the same as having two spares.

Thanks in advance.

Update: I have found some interesting threads about this, but I still can't answer the question, so I'm starting a bounty.

3 Answers

Short version

Leveling is the process after the rebuilding. If your array is leveling,
you are just as safe as you were before the disk failed.

Long version

When you lose a disk, the EVA will automatically use space on the remaining
healthy disks to recreate a redundant copy of the data that used to be on that
disk. If you had one volume group with one big virtual disk with Vraid5 parity
and you lost a single disk, the EVA will regenerate the data that used to be
on the failed disk into the free space of one remaining disk. If there isn't
enough space there it will use 2, 3 or more disks, but you will get a
redundant copy of your data in the shortest time possible. How long that
takes, I cannot tell you. But you will be back to the "you can lose a disk and
not lose your data" state in a very short time. That is, of course, if you
have enough free space on your disks.

You mentioned sparing. I am not familiar with this term but I assume you are
talking about the "failure protection level", which is the space that the EVA
reserves for an emergency like the one you are describing. Single protection
level means that it will reserve the size of two of your largest disks, and
double the size of four disks. The EVA will not report this space as free. So
if you have single protection level and are using 95% with 16 1TB disks, you
will have 2TB reserved, and are using 95% of the remaining 14TB. That is
13.3TB used, which leaves 2.7TB of raw space unallocated: the 2TB reserve plus
0.7TB that is actually free. And if you take the Vraid5 into account, the
13.3TB is 10.64TB of usable space and 2.66TB used for parity.
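
The arithmetic above can be checked with a short Python sketch. It is only an illustration of the numbers in this answer, not of how the EVA accounts for space internally. The function name and its parameters are made up for the example, and it assumes a single protection level reserves two disks and that Vraid5 stripes are 4 data + 1 parity.

```python
def capacity_breakdown(disks, disk_tb, protection_disks, occupancy, stripe_width=5):
    """Split raw disk-group capacity (in TB) into reserve, used, data and parity.

    Assumptions (not taken from any HP tool): the reserve is a whole number of
    disks, and Vraid5 spends one chunk in every `stripe_width` on parity.
    """
    raw = disks * disk_tb
    reserved = protection_disks * disk_tb   # single protection level = 2 disks
    allocatable = raw - reserved            # what the array lets you allocate
    used = allocatable * occupancy          # space taken by the vdisks
    free = allocatable - used               # free space as the array reports it
    parity = used / stripe_width            # 1 parity chunk per 4+1 stripe
    data = used - parity
    return {
        "raw": raw,
        "reserved": reserved,
        "used": used,
        "free": free,
        "unallocated incl. reserve": free + reserved,
        "data": data,
        "parity": parity,
    }


breakdown = capacity_breakdown(disks=16, disk_tb=1.0, protection_disks=2, occupancy=0.95)
for name, tb in breakdown.items():
    print(f"{name:28s}{tb:6.2f} TB")
```

With 16 disks this gives 13.3TB used, 0.7TB free inside the allocatable space (2.7TB if the reserve is counted), 10.64TB of data and 2.66TB of parity.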

Once the EVA has made a redundant copy on as few disks as possible, it will
start leveling (I personally prefer to call it "balancing") the data. This
process involves moving the data around so all your disks end up with
approximately the same amount of data. It takes an awfully long time,
especially if your usage is quite high, but you are safe if you have another
failure during this time.

Go into Command View and check the status of the volume group. If it says
that it is leveling, you are just as safe as you were before the failure.

You are now down to 15TB of raw disk space and you are using 13.3TB.
The EVA wants to maintain a single protection level but it cannot reserve 2TB
(you only have 1.7TB unused) so it is probably reporting the requested
protection level as single, and the actual protection level as
none. It may also be reporting your usage as going over 100%, since you
are using 13.3TB and to satisfy the single protection requirement you should
be under 13TB (15TB total - 2TB reserved for single protection).
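
The same kind of back-of-the-envelope check for the situation after the failure, again as a hedged sketch. `protection_status` and its field names are hypothetical and are not Command View fields. It simply compares the unused raw space with the reserve that a single protection level asks for.

```python
def protection_status(raw_tb, used_tb, disk_tb=1.0, requested_disks=2):
    """Compare unused raw space with the reserve a protection level wants.

    Illustrative only: `requested_disks=2` is the single protection level
    described in this answer; the real array may round or account differently.
    """
    reserve_needed = requested_disks * disk_tb
    unused = raw_tb - used_tb
    ceiling = raw_tb - reserve_needed       # most you may use and keep the reserve
    return {
        "unused_tb": unused,
        "reserve_shortfall_tb": max(0.0, reserve_needed - unused),
        "occupancy_vs_ceiling": used_tb / ceiling,
        "reserve_satisfied": unused >= reserve_needed,
    }


status = protection_status(raw_tb=15.0, used_tb=13.3)
print(f"unused:            {status['unused_tb']:.1f} TB")
print(f"reserve shortfall: {status['reserve_shortfall_tb']:.1f} TB")
print(f"occupancy:         {status['occupancy_vs_ceiling']:.1%} of the 13 TB ceiling")
print(f"reserve satisfied: {status['reserve_satisfied']}")
```

With 15TB raw and 13.3TB used, the reserve is 0.3TB short and usage works out to a little over 102% of the 13TB ceiling, which is why the array may show occupancy above 100%.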

Even so, you can lose another disk and still have healthy storage, because
the unused space is enough to rebuild it. If you lose a second disk after
that, it will be the Vraid5 redundancy that protects your data (though you
may see a degradation in performance). And of course, if you are lucky you may
survive a third and a fourth disk failure, as long as they are not in the
same Vraid stripe (EVA's Vraid5 is more like RAID5+0, with stripes spanning
over 5 disks).
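
To make the "how many more disks" reasoning concrete, here is a deliberately simplified Python model. It assumes a failure can be fully rebuilt whenever the allocated data still fits on the surviving disks, and that each rebuild finishes before the next disk fails. It ignores leveling, stripe placement and anything else the controller does, so treat the result as a rough estimate. The function name is made up for the example.

```python
def rebuildable_failures(disks, disk_tb, used_tb):
    """Count further disk failures whose data still fits on the survivors.

    Simplified assumption: a rebuild succeeds if the remaining raw capacity
    is at least the allocated space. Failures beyond this count would have
    to be covered by Vraid5 parity alone.
    """
    count = 0
    while disks > 1 and (disks - 1) * disk_tb >= used_tb:
        disks -= 1
        count += 1
    return count


# 15 disks left, 13.3 TB allocated: 14 TB still fits the data, 13 TB does not.
print(rebuildable_failures(disks=15, disk_tb=1.0, used_tb=13.3))  # prints 1
```

Under these assumptions one more failure can be rebuilt into unused space, and the one after that leaves you relying on parity.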

Update: Unrelated to your question, but the latest FATA firmware
update has a "Fix for self-initiated resets that may occur under rare
circumstances". Believe me, it does not feel nice to see disks get thrown out
of a volume group for no reason.

Update 2: Updated because single protection level means the space for
two disks.

The exact numbers are 16 disks (now 15), single protection level, 95% of space used for vdisks. When I posted the question I thought that the array was doing the leveling, but I think it actually started after the rebuilding you are describing. Thanks.
– Samuel, Nov 5 '12 at 13:45

I had a similar experience with my MSA 4400. We kept it running at 95% capacity, but it started having some 9 drive failures a month, so I'm somewhat familiar with the ragged edge of data loss disaster.

You have several levels of scratch space that can prevent you from losing data, and it's hard to tell which one you're currently in. Spare space is a big one, obviously. The level of Vraid you use also plays a part. And even when you swap that drive, the array will have to rebuild again.

The main thing you need to watch for is the failure protection level on your pool. You can set a requested level (like double) and then compare that to the actual level (like single or none). That said, even if you go from double to none in a single drive failure (one of the things I hate most about this box is that it allows that), you still have several ways the array can prevent you from losing data using parity from vraid or other black magic.

I've used pretty much all of the EVAs since day one and they're generally excellent, but I will not go near their 1TB FATA disks after we ran into enormous difficulties with them. They're not rated for 24x365 work, only a 30% duty cycle, and working them harder than this just kills them. The problem is that levelling is a 24hrs/day process, so you get a domino effect: one FATA drive fails, levelling starts, which kills other disks, which means levelling restarts. It's a farce and HP have never properly held their hands up to it, so I avoid those disks.
– Chopper3, Nov 5 '12 at 12:55

I was seeing 9 failures a month on my 450GB drives. I think I got a lemon.
– Basil, Nov 5 '12 at 14:08

For HP EVA:
Level 1 = the capacity of the two largest configured drives is reserved for sparing

This means that if you lose 2 of your disks, you are left without spares and rely only on RAID5 parity.
In your current situation, you can lose 1 more disk without array degradation, and 2 more without data loss, but with degraded performance.
In our organization we ALWAYS keep 2 spare disks outside of the enclosure, at the same temperature (so no temperature acclimatization is needed before insertion).