
Understanding consistency and data loss

Digging into the PostgreSQL transaction log without thinking about consistency is impossible. In the first part of this chapter, we tried hard to explain the basic idea of the transaction log in general. You learned that it is hard, or even impossible, to keep data files in good shape without the ability to log changes beforehand.

So far, we have mostly talked about corruption. It is definitely not good to lose data files because of corrupted entries in them, but corruption is not the only issue you have to be concerned about. Two other important topics are:

  • Performance
  • Data loss

While these might seem like obvious choices for important topics, we feel that they are neither well understood nor given the respect they deserve. Therefore, we cover them here in detail.

In our daily business as PostgreSQL consultants and trainers, we usually tend to see people who are only focused on performance.

Performance is everything. We want to be fast; tell us how to be fast…

The awareness of potential data loss, or even a concept to handle it, seems to be new to many people. Let's try to put it this way: what good is higher speed if data is lost even faster? The point of this is not that performance is not important; performance is highly important. However, we simply want to point out that performance is not the only component in the bigger picture.

All the way to the disk

To understand issues related to data loss and consistency, we have to see how a chunk of data is sent to the disk. The following diagram illustrates how this happens:

When PostgreSQL wants to read or write a block, it usually has to go through a couple of layers. When a block is written, it will be sent to the operating system. The operating system will cache the data and perform some operations on it. At some point, the operating system will decide to pass the data to some lower level. This might be the disk controller. The disk controller will cache, reorder, and possibly merge the writes, and finally pass them on to the disk. Inside the disk, there might be one more caching level before the data ends up on the real physical storage device.

In our example, we have used four layers. In many enterprise systems, there can be even more layers. Just imagine a virtual machine with storage mounted over the network, such as SAN, NAS, NFS, ATA-over-Ethernet, iSCSI, and so on. Many abstraction layers will pass data around, and each of them will try to do its share of optimization.

From memory to memory

What happens when PostgreSQL passes an 8 K block to the operating system? The only correct answer to this question might be, "something." When a normal write to a file is performed, there is absolutely no guarantee that the data is actually sent to the disk. In reality, writing to a file is nothing more than a copy operation from PostgreSQL memory to some system memory. Both memory areas are in RAM, so in the event of a crash, data can be lost. Practically speaking, it makes no difference which copy is lost if the entire RAM content is wiped out by a failure.
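The copy-only nature of a plain write can be sketched in C as follows. This is a minimal illustration, not PostgreSQL code; the function name and file path are made up:

```c
#include <fcntl.h>
#include <unistd.h>

/* A successful write() only means the bytes were copied into the
 * kernel's page cache. Both the source buffer and the cached copy
 * live in RAM, so a crash at this point can still lose the data. */
int cached_write(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_CREAT | O_WRONLY | O_APPEND, 0600);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, buf, len);  /* copy: process RAM -> kernel RAM */
    close(fd);                        /* close() does NOT imply a disk flush */
    return (n == (ssize_t) len) ? 0 : -1;
}
```

A return value of 0 tells us only that the kernel accepted the data, not that it has reached the physical disk.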

The following code snippet illustrates the basic problem we are facing:

test=# \d t_test
     Table "public.t_test"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer |

test=# BEGIN;
BEGIN
test=# INSERT INTO t_test VALUES (1);
INSERT 0 1
test=# COMMIT;
COMMIT

Just as in the previous chapter, we are using a table with only one column. The goal is to run a transaction inserting a single row.

If a crash happens shortly after BEGIN, no data will be in danger because nothing has happened yet. Even if a crash happens after the INSERT statement but before COMMIT, nothing is lost. The user has not issued a COMMIT command yet, so the transaction is known to be running and thus unfinished. If a crash occurs, the application will notice that things were unsuccessful and (hopefully) react accordingly. Also keep in mind that every transaction that is not committed will eventually end up as a ROLLBACK.

However, the situation is quite different if the user has issued a COMMIT statement and it has returned successfully. Whatever happens, the user will expect the committed data to be available.

Note

Users expect that successful writes will be available after an unexpected reboot. This persistence is also required by the ACID criteria. In computer science, ACID is a set of properties that guarantee that database transactions are processed reliably.

From the memory to the disk

To make sure that the kernel will pass data from the memory to the disk, PostgreSQL has to take some precautions. Upon a COMMIT command, a system call will be issued. It forces data to the transaction log.

Note

PostgreSQL does not have to force data to the data files at this point because we can always repair broken data files from the XLOG. If data is stored in the XLOG safely, the transaction can be considered safe.

The system call necessary to force data to the disk is called fsync(). The following listing has been copied from the BSD manual page. In our opinion, it is one of the best manual pages ever written dealing with the topic:

FSYNC(2)                 BSD System Calls Manual                FSYNC(2)

NAME
     fsync -- synchronize a file's in-core state with that on disk

SYNOPSIS
     #include <unistd.h>

     int
     fsync(int fildes);

DESCRIPTION
     Fsync() causes all modified data and attributes of fildes to be
     moved to a permanent storage device.  This normally results in all
     in-core modified copies of buffers for the associated file to be
     written to a disk.

     Note that while fsync() will flush all data from the host to the
     drive (i.e. the "permanent storage device"), the drive itself may
     not physically write the data to the platters for quite some time
     and it may be written in an out-of-order sequence.

     Specifically, if the drive loses power or the OS crashes, the
     application may find that only some or none of their data was
     written.  The disk drive may also re-order the data so that later
     writes may be present, while earlier writes are not.

     This is not a theoretical edge case.  This scenario is easily
     reproduced with real world workloads and drive power failures.

It essentially says that the kernel tries to make its image of the file in the memory consistent with the image of the file on the disk. It does so by forcing all changes to the storage device. It is also clearly stated that we are not talking about a theoretical scenario here; flushing to disk is a highly important issue.

Without a disk flush on COMMIT, you just cannot be sure that your data is safe, and this means that you can actually lose data in the event of some serious trouble.
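In C, the pattern a database follows on COMMIT can be sketched like this: write the record, then call fsync() before reporting success. The helper name and path below are made up for illustration; the real WAL machinery is of course far more involved:

```c
#include <fcntl.h>
#include <unistd.h>

/* Append a record and force it to stable storage before returning.
 * Returns 0 on success, -1 on failure. */
int durable_append(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_CREAT | O_WRONLY | O_APPEND, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t) len) {  /* still cached in RAM */
        close(fd);
        return -1;
    }
    if (fsync(fd) != 0) {  /* the expensive part: wait for the device */
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Only when fsync() has returned successfully may the database report COMMIT to the client.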

Essentially, both speed and consistency are important, but they can actually work against each other. Flushing changes to the disk is especially expensive because real hardware is involved. The overhead is not a mere five percent but a lot more. With the introduction of SSDs, the overhead has gone down dramatically, but it is still substantial.

A word about batteries

Most production servers make use of a RAID controller to manage disks. The important point here is that disk flushes and performance are usually strongly related to RAID controllers. If the RAID controller has no battery, which is usually the case, then it takes insanely long to flush. The RAID controller has to wait for the slowest disk to return. However, if a battery is available, the RAID controller can assume that a power loss will not prevent an acknowledged disk write from completing once power is restored. So, the controller can cache a write and simply pretend to flush. Therefore, a simple battery can easily increase flush performance tenfold.

Tip

Keep in mind that what we have outlined in this section is general knowledge, and every piece of hardware is different. We highly recommend that you check out and understand your hardware and RAID configuration to see how flushes are handled.

Beyond the fsync function

The fsync function is not the only system call that flushes data to the disk. Depending on the operating system you are using, different flush calls are available. In PostgreSQL, you can decide on your preferred flush call by changing wal_sync_method. Again, this change can be made by tweaking postgresql.conf.

The methods available are open_datasync, fdatasync, fsync, fsync_writethrough, and open_sync.
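As a sketch, the setting is changed in postgresql.conf like this (the value shown is just an example; the sensible choice depends on your operating system):

```
# postgresql.conf
wal_sync_method = fdatasync    # pick the flush call suited to your OS
```

After changing the value, the configuration has to be reloaded for the new method to take effect.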

Tip

If you want to change these values, we highly recommend checking out the manual pages of the operating system you are using to make sure that you have made the right choice.

PostgreSQL consistency levels

Ensuring consistency and preventing data loss is costly. Every disk flush is expensive and we should think twice before flushing to the disk. To give the user choices, PostgreSQL offers various levels of data protection. These various choices are represented by two essential parameters, which can be found in postgresql.conf:

  • fsync
  • synchronous_commit

The fsync parameter controls whether data is flushed to the disk at all. In the default configuration, PostgreSQL will always flush a commit out to the disk. If fsync is off, however, there is no guarantee at all that a COMMIT statement will survive a crash. Data can be lost, and there might even be data corruption. To protect all of your data, it is necessary to keep fsync set to on. If you can afford to lose some or all of your data, you can relax the flushing standards a little.

The synchronous_commit parameter is related to XLOG writes. Normally, PostgreSQL will wait until data has been written to the XLOG completely. Especially short transactions can suffer considerably, and therefore, different options are offered:

  • on: PostgreSQL will wait until XLOG has been fully and successfully written. If you are storing credit card data, you would want to make sure that a financial transaction is not lost. In this case, flushing to the disk is essential.
  • off: There will be a time difference between reporting success to the client and safely writing the data to the disk. In a setting like this, recently committed transactions can be lost (though, unlike with fsync off, not corrupted). Let's assume a database that stores information about who is currently online on a website. Suppose your system crashes and comes back up 20 minutes later. Do you really care about that data? After 20 minutes, everybody has to log in again anyway, and it is not worth sacrificing performance to protect data that will be outdated in a couple of minutes anyway.
  • local: In the case of a replicated database instance, we wait only for the local instance to flush to the disk. The advantage here is that you still have a high level of protection because you flush to one disk; at the same time, we can reasonably assume that both servers will not crash at the same moment, so we can relax the standards on the slave a little.
  • remote_write: PostgreSQL will wait until a synchronous standby server reports success for a given transaction.

In contrast to setting fsync to off, changing synchronous_commit to off will not result in corruption. However, in the event of a crash, we might lose a handful of transactions that have already been committed successfully. The window of potential data loss is governed by an additional postgresql.conf setting called wal_writer_delay: with synchronous_commit set to off, the amount of committed but unflushed work is roughly limited by this interval.
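As a sketch, relaxing durability for expendable data could look like this in postgresql.conf (the values shown are examples only):

```
# postgresql.conf
synchronous_commit = off     # report COMMIT before the WAL flush completes
wal_writer_delay = 200ms     # interval at which the WAL writer flushes
```

Note that synchronous_commit can also be changed per session or even per transaction, so you can keep full durability for critical writes and relax it only where some data loss is acceptable.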

Tip

Changing synchronous_commit might look like a small performance tweak; in reality, however, changing the sync behavior is one of the dominant factors when running small writing transactions. The gain might not just be a handful of percentage points. If you are lucky, it could be tenfold or even more (depending on the hardware, work load, I/O subsystem, and so on).

Keep in mind that configuring a database is not just about speed. Consistency is at least as important as speed, and therefore, you should think carefully whether you want to trade speed for potential data loss.

It is important to fully understand the consistency-related topics outlined in this chapter. When it comes to deciding on your cluster architecture, data security will be an essential part, and it is highly desirable to be able to judge whether a certain architecture makes sense for your data. After all, work on databases is all about protecting data. Full awareness of your durability requirements is definitely a big plus.
