
Filesystem crash recovery

Filesystem writes have two major components. At the bottom level, you are writing blocks of data out to the disk. On top of that, there is some amount of filesystem metadata to update. Examples of metadata include the directory tree, the list of blocks and attributes associated with each file, and the list of which blocks on disk are free.

Like many disk-oriented activities, filesystems have a very clear performance vs. reliability trade-off they need to make. The usual reliability concern is what happens in the situation where you're writing changes to a file and the power goes out in the middle.

Consider the case where you're writing out a new block to a file, one that makes the file bigger (rather than overwriting an existing block). You might do that in the following order:

  1. Write data block.
  2. Write file metadata referencing use of that block.
  3. Add data block to the list of used space metadata.

What happens if power goes out between steps 2 and 3 here? You now have a block that is used for something, but the filesystem believes it's still free. The next process that allocates a block is going to get that same block, and now two files refer to it. That's an example of a bad order of operations that no sensible filesystem design would use. Instead, a good filesystem design would:

  1. Add data block to the list of used space metadata.
  2. Write data block.
  3. Write file metadata referencing use of that block.

If there was a crash between steps 1 and 2 here, it's possible to identify the blocks that were marked as used but not yet fully written and put to use. Simple filesystem designs do that by iterating over all the allocated disk blocks, reconciling the list of blocks that should be used or free against what's actually used. Examples of this include the fsck program used to validate simple UNIX filesystems and the chkdsk program used on FAT32 and NTFS volumes under Windows.
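The two orderings above, and the reconciliation pass that follows a crash, can be sketched with a toy in-memory model. Everything here is a hypothetical illustration: `ToyFS`, `append_block`, and `reconcile` are invented names, the "disk" is a few Python collections, and `reconcile` only captures the spirit of an fsck-style check, which does far more in a real filesystem.

```python
from dataclasses import dataclass, field

# Step orderings taken from the two lists above.
BAD_ORDER = ("data", "file_meta", "used_meta")
GOOD_ORDER = ("used_meta", "data", "file_meta")


@dataclass
class ToyFS:
    """Hypothetical stand-in for on-disk state."""
    used: set = field(default_factory=set)      # used space metadata
    files: dict = field(default_factory=dict)   # file name -> list of block numbers
    data: dict = field(default_factory=dict)    # block number -> contents


def append_block(fs, name, block, payload, order, crash_after=None):
    """Apply the three steps in `order`; stop after `crash_after` steps to simulate power loss."""
    for done, step in enumerate(order, start=1):
        if step == "data":
            fs.data[block] = payload
        elif step == "file_meta":
            fs.files.setdefault(name, []).append(block)
        elif step == "used_meta":
            fs.used.add(block)
        if done == crash_after:
            return


def reconcile(fs):
    """Compare the used list against what files actually reference, then repair the used list."""
    referenced = {b for blocks in fs.files.values() for b in blocks}
    leaked = fs.used - referenced        # marked used, but no file refers to it
    unaccounted = referenced - fs.used   # in a file, but still considered free
    fs.used = set(referenced)
    return sorted(leaked), sorted(unaccounted)


bad = ToyFS()
append_block(bad, "a.txt", 7, b"x", BAD_ORDER, crash_after=2)
print(reconcile(bad))    # ([], [7])

good = ToyFS()
append_block(good, "a.txt", 7, b"x", GOOD_ORDER, crash_after=1)
print(reconcile(good))   # ([7], [])
```

With the bad ordering, the crash leaves block 7 inside a file while the allocator still thinks it is free, which is the dangerous case. With the good ordering, the worst outcome is a block marked used that nothing references, which the check can safely return to the free list.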

Journaling filesystems

The more modern approach is to use what's called a journal to improve this situation. A fully journaled write would look like this:

  1. Write transaction start metadata to the journal.
  2. Write used space metadata change to the journal.
  3. Write data block change to the journal.
  4. Write file metadata change to the journal.
  5. Add data block to the list of used space metadata.
  6. Write data block.
  7. Write file metadata referencing use of that block.
  8. Write transaction end metadata to the journal.

What this gets you is the ability to recover from any sort of crash the filesystem might encounter. If a given write transaction never reached its final step, the filesystem can either ignore (data block write) or undo (metadata write) any partially completed work that's part of that transaction. This lets you avoid long filesystem consistency checks after a crash, because you only need to replay any open journal entries to fix all the filesystem metadata. The time needed to do this is proportional to the size of the journal, whereas the runtime of the old filesystem checking routines is proportional to the size of the volume.
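As a rough illustration of that recovery pass, the sketch below scans a journal and undoes the metadata changes of any transaction that has no end record. The record layout (tuples such as `("used", txid, block)`) and the `recover` function are assumptions made for this example only; they do not describe the on-disk journal format of any real filesystem.

```python
def recover(journal, used, files):
    """Undo metadata changes from transactions that never wrote an 'end' record.

    journal: list of tuples in a hypothetical format:
        ("begin", txid), ("used", txid, block), ("data", txid, block, payload),
        ("file", txid, name, block), ("end", txid)
    used:  set of block numbers currently marked as used
    files: dict mapping file name -> list of block numbers
    Returns the set of transaction ids that were rolled back.
    """
    finished = {rec[1] for rec in journal if rec[0] == "end"}
    rolled_back = set()
    for rec in journal:              # one pass: cost grows with journal size only
        kind, txid = rec[0], rec[1]
        if txid in finished:
            continue                 # completed transaction, nothing to fix
        rolled_back.add(txid)
        if kind == "used":
            used.discard(rec[2])     # undo the used space change
        elif kind == "file":
            name, block = rec[2], rec[3]
            if block in files.get(name, []):
                files[name].remove(block)   # undo the file metadata change
        # "data" records are ignored: once the metadata is undone,
        # nothing refers to the partially written block.
    return rolled_back


# Transaction 1 completed; transaction 2 crashed after step 6 (data block written).
journal = [
    ("begin", 1), ("used", 1, 7), ("data", 1, 7, b"a"), ("file", 1, "a.txt", 7), ("end", 1),
    ("begin", 2), ("used", 2, 8), ("data", 2, 8, b"b"), ("file", 2, "a.txt", 8),
]
used = {7, 8}
files = {"a.txt": [7]}

print(recover(journal, used, files))   # {2}
print(used, files)                     # {7} {'a.txt': [7]}
```

The loop only ever touches journal records, never the full set of blocks on the volume, which is why recovery time tracks the size of the journal.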

The first thing that should jump out at you here is that you're writing everything twice, plus the additional transaction metadata, so each update costs more than double the total writes it would without a journal.
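Counting the steps in the two lists above makes the overhead concrete. This is a small sketch that counts write operations only; it ignores the fact that a data block is usually much larger than a metadata record, and the names `PLAIN_WRITE`, `JOURNALED_WRITE`, and `write_amplification` are made up for the illustration.

```python
# Steps of the non-journaled "good" ordering.
PLAIN_WRITE = [
    "used space metadata",
    "data block",
    "file metadata",
]

# Steps of the fully journaled write.
JOURNALED_WRITE = [
    "journal: transaction start",
    "journal: used space metadata",
    "journal: data block",
    "journal: file metadata",
    "used space metadata",
    "data block",
    "file metadata",
    "journal: transaction end",
]


def write_amplification(journaled, plain):
    """Ratio of write operations with journaling to write operations without it."""
    return len(journaled) / len(plain)


ratio = write_amplification(JOURNALED_WRITE, PLAIN_WRITE)
print(f"{len(JOURNALED_WRITE)} writes instead of {len(PLAIN_WRITE)}: {ratio:.2f}x")
# 8 writes instead of 3: 2.67x
```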

The second thing to note is more subtle. Journaling in the filesystem is nearly identical to how write-ahead logging works to protect database writes in PostgreSQL. So if you run the database on a fully journaled filesystem, you're paying this overhead at two levels, and a single change can end up being written about four times. Writes to the WAL, itself a journal, are journaled by the filesystem, and then the writes of the database blocks to disk are journaled too.

Since the overhead of full journaling is so high, few filesystems use it. Instead, the common situation is that only metadata writes are journaled, not the data block changes. This meshes well with PostgreSQL, where the database protects against data block issues but not filesystem metadata issues.
