RAID “Write Hole” Phenomenon

If power fails while a RAID is being written to, the result can be the “write hole” phenomenon. It can affect any redundant RAID level, including RAID 1, RAID 5 and RAID 6, and it leaves the array in a state where it is impossible to determine which data blocks or parity information were not written to disk.

A write hole gives no immediate sign that anything is wrong, so it can go unnoticed and cause problems at a later time. Although this situation is fairly rare, it can lead to serious problems, especially if data recovery is required. This highlights why it is important not to become complacent and think “I have a RAID, so I don’t need a backup”, or you could suffer serious data loss.

Data Not Written

As already described, when a power failure occurs it is possible for some data not to be written to all the disks in a RAID array. With modern journaling file systems a power failure is not usually a problem, as interrupted writes are still recorded in the journal and can be dealt with when the file system is next mounted. A RAID system, however, may be performing many read/write tasks in parallel, so the separate writes that make up a single update may not all reach their disks before the power is lost.

If the missing write was a data block, the journal may well correct the issue when the file system is mounted. A failure to write the parity for a stripe is more serious, as it can go undetected until that parity data is required.
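The effect of stale parity can be shown with a minimal sketch. It assumes a simplified single stripe of three data blocks plus one XOR parity block held in memory; the block contents and the `xor_blocks` helper are illustrative only and do not represent any particular controller's behaviour.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# A consistent stripe: three data blocks plus their parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(*data)

# An update to block 0 reaches its disk, but power is lost before the
# matching parity update is written. The parity is now stale.
data[0] = b"ZZZZ"

# Nothing looks wrong until a drive fails. Suppose the disk holding
# block 1 dies and the array rebuilds it from the survivors and parity.
rebuilt_block_1 = xor_blocks(data[0], data[2], parity)

# Block 1 was never part of the interrupted write, yet it is rebuilt wrongly.
print(rebuilt_block_1 == b"BBBB")  # False
```

Note that the damaged block is not the one that was being written at the time of the power failure, which is part of what makes the problem so hard to trace later.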

Resynchronisation Issues

Take a RAID 1 mirrored pair as an example, where the same data is written to both disks. If a discrepancy is detected between the two disks after a power failure, it is almost impossible to know which disk holds the correct version of that data. The same is true in a RAID containing calculated parity information, when the parity data does not match the data blocks stored in a stripe.
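As a rough illustration of the mirror case, the sketch below compares two mirror members block by block. The function name `mismatched_blocks`, the 4-byte block size and the disk contents are all assumptions made for the example.

```python
def mismatched_blocks(disk_a: bytes, disk_b: bytes, block_size: int = 4) -> list[int]:
    """Return the indices of blocks that differ between two mirror members."""
    if len(disk_a) != len(disk_b):
        raise ValueError("mirror members must be the same size")
    return [
        offset // block_size
        for offset in range(0, len(disk_a), block_size)
        if disk_a[offset:offset + block_size] != disk_b[offset:offset + block_size]
    ]


# Hypothetical mirror contents after a power failure: block 1 differs.
disk_a = b"AAAA" + b"NEW!" + b"CCCC"
disk_b = b"AAAA" + b"old." + b"CCCC"

print(mismatched_blocks(disk_a, disk_b))  # [1]
```

The comparison reports where the two disks disagree, but nothing in the blocks themselves says which copy is the correct one.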

This means that running a resynchronisation could consolidate the incorrect data as part of the RAID, leading to corruption of either file system data structures or file contents. Scheduled resynchronisation is recommended as part of RAID maintenance, but is not guaranteed to fix this problem. The act of writing data to the RAID will also cause the parity in that particular data slice to be resynchronised.
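The following sketch shows why a resynchronisation can make a stripe look healthy without making it correct. It assumes a simplified model in which a resync recomputes XOR parity from whatever the data blocks currently hold; the helper names and block contents are hypothetical.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(data: list[bytes], parity: bytes) -> bool:
    """Check that the stored parity matches the stored data blocks."""
    return xor_blocks(*data) == parity


def resync_parity(data: list[bytes]) -> bytes:
    """Recompute parity from whatever the data blocks currently hold."""
    return xor_blocks(*data)


# What the interrupted write was meant to leave on disk.
intended = [b"NEW1", b"NEW2", b"CCCC"]

# What actually reached the disks: block 0 and the parity were updated,
# but block 1 still holds its old contents.
on_disk = [b"NEW1", b"OLD2", b"CCCC"]
parity = xor_blocks(*intended)

print(stripe_is_consistent(on_disk, parity))  # False
parity = resync_parity(on_disk)
print(stripe_is_consistent(on_disk, parity))  # True
print(on_disk == intended)                    # False
```

After the resync the stripe passes a consistency check, but the old contents of block 1 are now protected by the parity as if they were correct.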

Data Recovery and UPS

Installing an uninterruptible power supply (UPS) for a system running a RAID is the best choice when it comes to avoiding the “write hole” phenomenon. With a UPS in place, a controlled shutdown of the server can take place, avoiding the issue of file system corruption.

During data recovery from RAID systems, it is almost impossible to determine which disks hold the correct data if a “write hole” is detected. Through manual intervention it may be possible to resolve some of these issues, but others may be impossible to determine, so it’s important to reduce the risks of suffering “write hole” damage.

RAID 5 vs RAID 10

Redundant Array of Independent Disks (RAID) offers many benefits, from data read/write speed increase through to data redundancy. Each RAID level is a compromise between data security, hardware requirements and read/write speeds.

Your budget will be a big factor in determining which RAID level is most appropriate, but if there is no constraint, data security should be high on the list. No matter which RAID level is selected, it is important not to fall into the trap of thinking, “I have a RAID, so I don’t need a backup.” Otherwise your future will almost certainly include RAID data recovery.

RAID 10 Provides 100% Redundancy

RAID 10 stripes data across a set of mirrored pairs, and therefore requires double the number of drives for a given capacity. This provides full redundancy, but as with any RAID system, the failure of one drive could be closely followed by another. If both drives in a mirrored pair fail, it will bring the RAID to a halt, so although this gives the best data security, there is still some risk.

RAID 10 can also in many instances provide faster read and write times, as there is no need to calculate parity. RAID 10 hardware is often set up to take the data read from the fastest responding drive. It is still possible in theory for a RAID 10 to run with 50% of its drives failed, provided that no mirrored pair loses both drives, but running in this state would be a huge risk to the integrity of your data. RAID 10 is a common option for high availability servers, such as those running Exchange and SQL databases.
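The survival rule described above can be written as a short check. This is only a sketch: the drive labels and the `raid10_survives` function are made up for the example, and each mirror is assumed to contain exactly two drives.

```python
def raid10_survives(pairs: list[tuple[str, str]], failed: set[str]) -> bool:
    """A RAID 10 stays online while every mirrored pair keeps one working drive."""
    return all(not (a in failed and b in failed) for a, b in pairs)


# Six hypothetical drives arranged as three mirrored pairs.
pairs = [("d0", "d1"), ("d2", "d3"), ("d4", "d5")]

# Half the drives have failed, but never both halves of a pair.
print(raid10_survives(pairs, {"d0", "d2", "d4"}))  # True

# Only two drives have failed, but they form a complete pair.
print(raid10_survives(pairs, {"d0", "d1"}))        # False
```

What matters is therefore which drives fail rather than how many: the array can outlast three failures in one case and be stopped by two in another.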

RAID 5 Offers Higher Capacity

RAID 5 stripes the data across the drives, with one drive in each data slice containing the parity information, which can be used to reconstruct the data for a missing drive. This means only the capacity of a single drive is used for redundancy, allowing for a much larger data volume from the same number of drives.
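The capacity difference is simple arithmetic, sketched below under the assumption that all drives are the same size and that file system overhead is ignored. The function names and the example of six 4 TB drives are illustrative only.

```python
def raid5_usable_tb(drives: int, drive_tb: float) -> float:
    """RAID 5 gives up the capacity of one drive to parity."""
    if drives < 3:
        raise ValueError("RAID 5 needs at least three drives")
    return (drives - 1) * drive_tb


def raid10_usable_tb(drives: int, drive_tb: float) -> float:
    """RAID 10 gives up half of its drives to mirroring."""
    if drives < 4 or drives % 2 != 0:
        raise ValueError("RAID 10 needs an even number of drives, at least four")
    return (drives // 2) * drive_tb


# Six drives of 4 TB each, a hypothetical example.
print(raid5_usable_tb(6, 4.0))   # 20.0
print(raid10_usable_tb(6, 4.0))  # 12.0
```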

A RAID 5 array can run in degraded mode if a single drive fails, but this both causes a performance hit and puts your data at imminent risk. The failure of just one additional drive will cause the RAID to fail. RAID 5 is however still one of the most commonly used RAID array architectures.
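A minimal sketch of a degraded read illustrates both points. It models one stripe as a list of blocks with the parity block last and a failed drive represented by `None`; this layout, the `read_block` function and the block contents are simplifying assumptions (real RAID 5 implementations rotate the parity position between stripes).

```python
from functools import reduce
from typing import Optional


def xor_blocks(*blocks: bytes) -> bytes:
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def read_block(stripe: list[Optional[bytes]], index: int) -> tuple[bytes, int]:
    """Read one block from a stripe of data blocks plus parity.

    Returns the block and the number of drives that had to be read.
    """
    block = stripe[index]
    if block is not None:
        return block, 1
    survivors = [b for b in stripe if b is not None]
    if len(survivors) != len(stripe) - 1:
        raise RuntimeError("more than one drive missing: stripe cannot be rebuilt")
    return xor_blocks(*survivors), len(survivors)


data = [b"AAAA", b"BBBB", b"CCCC"]
healthy = data + [xor_blocks(*data)]
degraded = [healthy[0], None, healthy[2], healthy[3]]

print(read_block(healthy, 1))   # (b'BBBB', 1)
print(read_block(degraded, 1))  # (b'BBBB', 3)
```

In degraded mode the same block costs three reads instead of one, and if a second entry in the stripe were missing the function could only raise an error, which corresponds to the array failing.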

Data Recovery Issues

Despite the mirrored drive pairs, RAID 10 arrays are still sometimes seen for data recovery. Provided that earlier drive failures have not been ignored, which could leave one drive in a mirrored pair holding out-of-date data, RAID 10 offers two chances of recovery for each data slice of the RAID, giving extremely high data recovery success rates.

Although RAID 5 arrays have a higher level of risk attached, the data recovery success rate is also very high, as it’s rare for the drive failures to be severe enough to cause the loss of large areas of the data volume.

Any redundancy for your data is certainly a better option than none, so the choice really comes down to budget, and how much risk you’re willing to take with the overall integrity of your data. This needs to be weighed against the possible financial harm your company would face, even for a temporary loss of data access.