RAID “Write Hole” Phenomenon

If power fails while a RAID array is being written to, the result can be the “write hole” phenomenon. It can affect any redundant RAID level, including RAID 1, RAID 5 and RAID 6: after the failure, it is impossible to determine which data blocks or parity information were not written to disk.

A write hole often goes unnoticed when it occurs and may only cause problems at a later time. Although the situation is fairly rare, it can lead to serious problems, especially if data recovery is required. This is why it is important not to become complacent and think “I have a RAID, so I don’t need a backup”; otherwise you could suffer serious data loss.

Data Not Written

As already described, when a power failure occurs it is possible for some data not to be written to all the disks in a RAID array. With modern journaling file systems a power failure is not usually a problem, as any failed writes are still stored in the journal, but a RAID system may be performing many read/write tasks in parallel, which may lead to unusual timing issues.

If the missing write is a data block, the journal may well correct the issue when the file system is next mounted. A failure to write the parity block, however, could cause a serious issue and go undetected until that parity data is required, for example to reconstruct the contents of a failed disk.
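The effect of stale parity can be illustrated with a minimal sketch. It assumes a simplified RAID 5 style stripe of three data blocks protected by byte-wise XOR parity; the block contents and the helper name xor_blocks are hypothetical and chosen only for illustration, and real controllers work on much larger blocks.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte (simplified RAID 5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# A consistent stripe: three data blocks plus their parity block.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks(d0, d1, d2)

# Power fails after the new contents of d1 reach the disk,
# but before the matching parity update is written.
d1_new = b"XXXX"

# The stripe is now inconsistent, although nothing reports it.
stripe_consistent = xor_blocks(d0, d1_new, d2) == parity

# Later the disk holding d2 fails and its block is rebuilt from the
# surviving data blocks and the stale parity.
rebuilt_d2 = xor_blocks(d0, d1_new, parity)

print("stripe consistent:", stripe_consistent)
print("rebuilt d2 matches original:", rebuilt_d2 == d2)
```

In this sketch the rebuilt block differs from the original d2 even though d2 itself was never being written, which is why a parity failure can stay hidden until a disk has to be reconstructed.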

Resynchronisation Issues

Take a RAID 1 mirrored pair as an example, in which the same data is written to both disks. If a discrepancy is detected between them after a power failure, it is almost impossible to know which disk holds the correct version of the data. The same is true of a RAID that uses calculated parity when the parity does not match the data blocks stored in a stripe.

This means that running a resynchronisation could make the incorrect data a permanent part of the RAID, corrupting either file system data structures or file contents. Scheduled resynchronisation is recommended as part of RAID maintenance, but it is not guaranteed to fix this problem. Writing data to the RAID will also cause the parity in the affected stripe to be resynchronised.
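The mirror case can be sketched in a few lines. This is an illustration only, assuming a naive resynchronisation that always treats the first disk as the source; the functions find_mismatches and resync and the block contents are hypothetical, and real RAID implementations may choose the source differently.

```python
def find_mismatches(disk_a, disk_b):
    """Return the indices of blocks that differ between two mirror members."""
    return [i for i, (a, b) in enumerate(zip(disk_a, disk_b)) if a != b]


def resync(source, target):
    """Naive resync: overwrite every differing block on target with source's copy."""
    for i in find_mismatches(source, target):
        target[i] = source[i]


# Block 1 was being updated from b"old1" to b"new1" when power failed.
disk_a = [b"blk0", b"old1", b"blk2"]  # the write never reached this disk
disk_b = [b"blk0", b"new1", b"blk2"]  # the write completed on this disk

# The comparison shows where the copies differ, not which copy is correct.
mismatched = find_mismatches(disk_a, disk_b)

# Resynchronising from disk_a makes the mirror consistent again,
# but it does so by discarding the newer data held on disk_b.
resync(disk_a, disk_b)

print("mismatched blocks:", mismatched)
print("mirror consistent:", disk_a == disk_b)
print("new data survived:", b"new1" in disk_b)
```

Had the resynchronisation run in the other direction, the new block would have survived, but nothing stored in the two copies indicates which direction is the right one.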

Data Recovery and UPS

Installing an uninterruptible power supply (UPS) for a system running a RAID is the best way to avoid the “write hole” phenomenon. With a UPS in place, a controlled shutdown of the server can take place when mains power is lost, avoiding the issue of file system corruption.

During data recovery from RAID systems, it is almost impossible to determine which disks hold the correct data if a “write hole” is detected. Through manual intervention it may be possible to resolve some of these issues, but others may be impossible to determine, so it’s important to reduce the risks of suffering “write hole” damage.

