Saturday, June 26, 2010

Integrity and corruption: can file systems be scalable?

About a month ago I was at an interesting meeting at Juelich regarding support for Lustre.  Lustre is a cluster file system designed primarily for large-scale HPC installations, and it stores its metadata and data in a collection of ext4-like disk file systems on the Lustre servers.

Two long-standing desirable features for Lustre were mentioned frequently:
  1. end-to-end data integrity
  2. more scalable disk file systems for the servers
The popular opinion is that both goals can be achieved by combining Lustre's network checksum mechanisms with ZFS (or btrfs) as the backend disk file system, and indeed I made press announcements with Sun in 2007 to spread this opinion.

In this post I want to tell you that I've learned this approach is flawed in at least one technical respect.

FSCK is not dead

FSCK is a tool to check and repair file systems.  It is as old as the first file systems, and originally its purpose was to repair the file system, in particular to complete or undo compound operations which, due to a sudden failure, did not complete.  Compound operations in file systems are numerous.  For example, creating a file typically initializes the inode of the new file as well as writing a name for the file in the parent directory.
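To make that crash window concrete, here is a toy Python sketch of file creation as a compound operation.  The in-memory dictionaries are hypothetical stand-ins, not any real on-disk format; the point is only that two independent structures must be updated, and a crash between the updates leaves an orphan inode with no name pointing at it, which is exactly the kind of damage fsck repairs.

```python
# Toy in-memory structures standing in for on-disk metadata (illustration only).
inode_table = {}         # inode number -> file metadata
directories = {2: {}}    # directory inode number -> {name: inode number}

def create_file(parent_ino, name, new_ino):
    # Step 1: allocate and initialize the inode of the new file.
    inode_table[new_ino] = {"nlink": 1, "size": 0}
    # A crash here leaves an orphan inode: allocated, but reachable by no name.
    # Step 2: write the name for the file into the parent directory.
    directories[parent_ino][name] = new_ino

create_file(2, "hello.txt", 11)
```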

By the mid-1990s disks and the file systems on them had become so large that fsck started to take a long time to complete, and most file systems introduced alternative recovery mechanisms relying on write-ahead logging or shadow-copy recovery schemes (techniques that had been in use in databases much earlier).
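The idea behind these recovery schemes can be sketched in a few lines of Python.  The log format and helper functions below are my own illustration, not how ext4's journal or any shadow-copy scheme actually lays out its records: the compound operation is described in the log before it is applied, so after a crash only the log has to be scanned, not the whole file system.

```python
import json

journal = []  # stand-in for the on-disk log

def log_operation(record):
    # 1. Describe the whole compound operation before touching the file system.
    journal.append(json.dumps(record))

def apply_operation(record, fs):
    # 2. Apply the individual updates; a crash anywhere in here is recoverable.
    fs["inodes"][record["ino"]] = {"nlink": 1, "size": 0}
    fs["dirs"][record["parent"]][record["name"]] = record["ino"]

def recover(fs):
    # After a crash, replay logged operations that may not have completed.
    # (Real journals also track which transactions committed and make replay
    # idempotent; that bookkeeping is omitted here.)
    for line in journal:
        apply_operation(json.loads(line), fs)

fs = {"inodes": {}, "dirs": {2: {}}}
log_operation({"op": "create", "parent": 2, "name": "hello.txt", "ino": 11})
recover(fs)  # pretend we crashed right after logging the operation
```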

FSCK stuck around.  It could handle failures that are not easily handled by the advanced recovery mechanisms, such as corruption on the drives.  Corruption means that either the data cannot be read, or its format is not what the file system expects.   Such corruption has at least three causes:
  1. Known and accepted behavior of storage devices.  Two examples are caching in disk drives and the write hole in software raid devices.
  2. Spontaneous degradation of recorded information.  This is a considerable worry in very large file systems such as those found in large HPC installations.
  3. Bugs in the file system software itself.
ZFS-like efforts, including in particular the T10-DIF approach to data integrity for storage devices, have successfully taken aim at causes 1 and 2.  There have been extensive discussions regarding the best checksum-based repair mechanisms (see the FAST proceedings, for example, and in particular the studies of Andrea and Remzi Arpaci-Dusseau), but cause 3 is rarely if ever articulated.
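As a rough illustration of how checksums address causes 1 and 2, here is a small Python sketch.  The per-block CRC and the mirror-based repair are my own simplification, not the ZFS or T10-DIF on-disk formats: a checksum stored with each block lets a read detect silent corruption, which can then be repaired from a good replica if one exists.

```python
import zlib

def write_block(store, blockno, data, replica):
    # Store each block together with a checksum of its contents.
    store[blockno] = (data, zlib.crc32(data))
    replica[blockno] = (data, zlib.crc32(data))

def read_block(store, blockno, replica):
    data, stored_sum = store[blockno]
    if zlib.crc32(data) != stored_sum:          # detect silent corruption
        data, stored_sum = replica[blockno]     # repair from a good copy
        if zlib.crc32(data) != stored_sum:
            raise IOError("block %d corrupt on all copies" % blockno)
        store[blockno] = (data, stored_sum)     # rewrite the bad copy
    return data

primary, mirror = {}, {}
write_block(primary, 0, b"superblock", mirror)
primary[0] = (b"garbage\x00\x00", primary[0][1])   # simulate bit rot
assert read_block(primary, 0, mirror) == b"superblock"
```

Note that a software bug (cause 3) can write a perfectly well-checksummed block containing the wrong contents, which is why this mechanism alone does not retire fsck.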

I am grateful to Nikita Danilov for pinpointing software bugs as the reason that an FSCK utility remains required.

While one might optimistically argue that widely used file systems can become bug-free, complex ones like Lustre will not.

FSCK and scalability

Checking file systems turns out to be one of the greatest challenges for scalability.   The HPCS initiative posed the requirement to check and repair a file system with a trillion (10^12) files within 100 hours, and simple computations by Sun's Lustre group (you can find these by searching for File System Integrity Check Design on the Lustre wiki) show that the challenges are enormous.
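The arithmetic that makes this requirement daunting is simple; the numbers below are my own back-of-the-envelope version, not the figures from the Lustre design document.

```python
files = 10**12            # one trillion files
seconds = 100 * 3600      # 100 hours
rate = files / seconds
print("required rate: %.1f million files per second" % (rate / 1e6))
# prints: required rate: 2.8 million files per second
```

Sustaining millions of metadata inspections per second, including the random reads they imply, is far beyond what a conventional fsck pass over disk-based storage can deliver.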

Unfortunately, once file system check and repair is required, the scalability of all file systems becomes questionable.  The repair tool needs to iterate over all objects stored in the file system, and this can take unacceptably long on advanced file systems like ZFS and btrfs just as much as on more traditional ones like ext4.

This shows the shortcoming of the Lustre-ZFS proposal: it addresses data integrity, but not scalability.

Solutions?

Val Aurora has proposed chunkfs, which divides a file system into tiles so that the fsck operation can be parallelized. But fundamentally the amount of data on a single drive is so large that check times are likely to be unreasonable, and within a single drive parallelization doesn't help.
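A toy version of the tile idea is sketched below.  The tile layout and the single consistency rule it checks are hypothetical, and the sketch assumes tiles are self-contained so they can be checked independently.  Parallelism of this kind helps when tiles live on different drives; when every tile sits on the same drive, its bandwidth remains the bottleneck.

```python
from concurrent.futures import ProcessPoolExecutor

def check_tile(tile):
    # Stand-in for real consistency checks: verify that every directory entry
    # in the tile points at an inode that exists within the same tile.
    errors = [name for name, ino in tile["dirents"].items()
              if ino not in tile["inodes"]]
    return tile["id"], errors

tiles = [
    {"id": 0, "inodes": {11}, "dirents": {"a": 11}},
    {"id": 1, "inodes": {12}, "dirents": {"b": 13}},   # dangling entry
]

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        for tile_id, errors in pool.map(check_tile, tiles):
            print(tile_id, errors or "clean")
```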

So far not many better approaches have been proposed.  Restoring from backups or snapshots is one, and this is the practice in the database community.  But again, merely restoring a single drive can take a long time.  Another is continuous, online checking, but the true effectiveness of such approaches in the face of software bugs has not been verified.

In a future post we will describe a different approach which is likely to cover the issues about as effectively as fsck has.

Monday, February 16, 2009

A talk at CU

Last week I gave a seminar at the Computer Science Department at the University of Colorado in Boulder. The topic was an architecture for storage management based on just a few elements. One can summarize them as:
  1. update streams - to track last committed and executed requests
  2. a changelog, sufficient to do and undo operations, including data writes, with the ability to restrict the changelog to a fileset (see the sketch after this list)
  3. versioning of objects, management and operations on FIDs
  4. some database elements inside the file system, e.g. the ability to locate objects on certain servers
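For the changelog element (2), a minimal sketch of what such a record might carry is given below.  The field names and FID strings are my own illustration, not Lustre's actual changelog format: each record holds enough information both to redo and to undo an operation, and records can be filtered down to a single fileset.

```python
from collections import namedtuple

Record = namedtuple("Record", "seq fileset op redo undo")

changelog = [
    Record(1, "projA", "setattr",
           redo={"fid": "0x200000400:0x1:0x0", "mode": 0o644},
           undo={"fid": "0x200000400:0x1:0x0", "mode": 0o600}),
    Record(2, "projB", "unlink",
           redo={"parent": "0x200000401:0x2:0x0", "name": "old.dat"},
           undo={"parent": "0x200000401:0x2:0x0", "name": "old.dat",
                 "fid": "0x200000400:0x7:0x0"}),
]

def records_for_fileset(log, fileset):
    # Restricting the changelog to one fileset, as in element 2 above.
    return [r for r in log if r.fileset == fileset]

for r in records_for_fileset(changelog, "projA"):
    print(r.seq, r.op, "undo with", r.undo)
```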
Most of these have over the years been discovered by the Lustre effort. However, an organic, piecemeal implementation will not show that this is a concise and, likely, usable proposal. Showing that would require building something from the ground up that implements the key elements of this system and demonstrating recoverability, clustering and various storage management features. While I trust this can be done, I wonder how many data management applications cannot be served by my proposal.

Some key elements, such as sessions, are now, years after their implementation in Lustre, finding their way into systems like pNFS. However, one needs to go well beyond that to obtain the aggregate benefits.