- PostgreSQL Replication (Second Edition)
- Hans-Jürgen Schönig
Understanding consistency and data loss
Digging into the PostgreSQL transaction log without thinking about consistency is impossible. In the first part of this chapter, we tried hard to explain the basic idea of the transaction log in general. You learned that it is hard, or even impossible, to keep data files in good shape without the ability to log changes beforehand.
So far, we have mostly talked about corruption. It is definitely not good to lose data files because of corrupted entries in them, but corruption is not the only issue you have to be concerned about. Two other important topics are:
- Performance
- Data loss
While these might seem like obvious concerns, our feeling is that they are neither well understood nor given the attention they deserve. That is why we take them into consideration here.
In our daily business as PostgreSQL consultants and trainers, we usually tend to see people who are only focused on performance.
Performance is everything. We want to be fast; tell us how to be fast…
The awareness of potential data loss, or even a concept to handle it, seems to be new to many people. Let's try to put it this way: what good is higher speed if data is lost even faster? The point of this is not that performance is not important; performance is highly important. However, we simply want to point out that performance is not the only component in the bigger picture.
All the way to the disk
To understand issues related to data loss and consistency, we have to see how a chunk of data is sent to the disk. The following diagram illustrates how this happens:

When PostgreSQL wants to read or write a block, it usually has to go through a couple of layers. When a block is written, it will be sent to the operating system. The operating system will cache the data and perform some operations on it. At some point, the operating system will decide to pass the data to some lower level. This might be the disk controller. The disk controller will cache, reorder, and massage the write, and finally pass it on to the disk. Inside the disk, there might be one more caching level before the data ends up on the real physical storage device.
In our example, we have used four layers. In many enterprise systems, there can be even more layers. Just imagine a virtual machine with storage mounted over the network, such as SAN, NAS, NFS, ATA-over-Ethernet, iSCSI, and so on. Many abstraction layers will pass data around, and each of them will try to do its share of optimization.
From memory to memory
What happens when PostgreSQL passes an 8 K block to the operating system? The only correct answer to this question might be, "something." When a normal write to a file is performed, there is absolutely no guarantee that the data actually reaches the disk. In reality, writing to a file is nothing more than a copy operation from PostgreSQL's memory to some system memory. Both memory areas are in RAM, so in the event of a crash, data can be lost. Practically speaking, it makes no difference which side loses the data if the entire RAM is wiped out by a failure.
The following code snippet illustrates the basic problem we are facing:
test=# \d t_test
     Table "public.t_test"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer |

test=# BEGIN;
BEGIN
test=# INSERT INTO t_test VALUES (1);
INSERT 0 1
test=# COMMIT;
COMMIT
Just as in the previous chapter, we are using a table with only one column. The goal is to run a transaction inserting a single row.
If a crash happens shortly before COMMIT, no data will be in danger because nothing has happened yet. Even if a crash happens shortly after the INSERT statement but before COMMIT, nothing can happen. The user has not issued a COMMIT command yet, so the transaction is known to be running and thus unfinished. If a crash occurs, the application will notice that things were unsuccessful and (hopefully) react accordingly. Also keep in mind that every transaction that is not committed will eventually end up as a ROLLBACK.

However, the situation is quite different if the user has issued a COMMIT statement and it has returned successfully. Whatever happens, the user will expect the committed data to be available.
Note
Users expect that successful writes will still be available after an unexpected reboot. This persistence is the durability required by the ACID criteria. In computer science, ACID (atomicity, consistency, isolation, durability) is a set of properties that guarantee that database transactions are processed reliably.
From memory to the disk
To make sure that the kernel will pass data from the memory to the disk, PostgreSQL has to take some precautions. Upon a COMMIT command, a system call will be issued. It forces data out to the transaction log.
Note
PostgreSQL does not have to force data to the data files at this point because we can always repair broken data files from the XLOG. If data is stored in the XLOG safely, the transaction can be considered safe.
The system call necessary to force data to the disk is called fsync(). The following listing has been copied from the BSD manual page. In our opinion, it is one of the best manual pages ever written dealing with the topic:
FSYNC(2)                 BSD System Calls Manual                FSYNC(2)

NAME
     fsync -- synchronize a file's in-core state with that on disk

SYNOPSIS
     #include <unistd.h>

     int
     fsync(int fildes);

DESCRIPTION
     Fsync() causes all modified data and attributes of fildes to be moved
     to a permanent storage device.  This normally results in all in-core
     modified copies of buffers for the associated file to be written to a
     disk.

     Note that while fsync() will flush all data from the host to the drive
     (i.e. the "permanent storage device"), the drive itself may not
     physically write the data to the platters for quite some time and it
     may be written in an out-of-order sequence.

     Specifically, if the drive loses power or the OS crashes, the
     application may find that only some or none of their data was written.
     The disk drive may also re-order the data so that later writes may be
     present, while earlier writes are not.

     This is not a theoretical edge case.  This scenario is easily
     reproduced with real world workloads and drive power failures.
It essentially says that the kernel tries to make its image of the file in the memory consistent with the image of the file on the disk. It does so by forcing all changes to the storage device. It is also clearly stated that we are not talking about a theoretical scenario here; flushing to disk is a highly important issue.
Without a disk flush on COMMIT, you just cannot be sure that your data is safe, and this means that you can actually lose data in the event of some serious trouble.
Speed and consistency are both essentially important, and they can actually work against each other. Flushing changes to the disk is especially expensive because real hardware is involved. The overhead we face is not some five percent, but a lot more. With the introduction of SSDs, the overhead has gone down dramatically, but it is still substantial.
A word about batteries
Most production servers make use of a RAID controller to manage disks. The important point here is that disk flushes and performance are usually strongly related to RAID controllers. If the RAID controller has no battery, which is usually the case, then it takes insanely long to flush. The RAID controller has to wait for the slowest disk to return. However, if a battery is available, the RAID controller can assume that a power loss will not prevent an acknowledged disk write from completing once power is restored. So, the controller can cache a write and simply pretend to flush. Therefore, a simple battery can easily increase flush performance tenfold.
Tip
Keep in mind that what we have outlined in this section is general knowledge, and every piece of hardware is different. We highly recommend that you check out and understand your hardware and RAID configuration to see how flushes are handled.
Beyond the fsync function
The fsync function is not the only system call that flushes data to the disk. Depending on the operating system you are using, different flush calls are available. In PostgreSQL, you can decide on your preferred flush call by changing wal_sync_method. Again, this change can be made by tweaking postgresql.conf.
The methods available are open_datasync, fdatasync, fsync, fsync_writethrough, and open_sync.
Tip
If you want to change these values, we highly recommend checking out the manual pages of the operating system you are using to make sure that you have made the right choice.
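As a quick sketch of how such a change might look from psql, assuming PostgreSQL 9.4 or later (where ALTER SYSTEM is available; on older versions you would edit postgresql.conf directly), the following session checks the current method and then sets a value explicitly, purely as an illustration. On many Linux systems the default is fdatasync, so your output may differ. A configuration reload is enough because wal_sync_method does not require a restart:

test=# SHOW wal_sync_method;
 wal_sync_method
-----------------
 fdatasync
(1 row)

test=# ALTER SYSTEM SET wal_sync_method = 'open_datasync';
ALTER SYSTEM
test=# SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)

Needless to say, such a change should only be made after consulting the relevant manual pages, as suggested in the Tip above.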
PostgreSQL consistency levels
Ensuring consistency and preventing data loss is costly. Every disk flush is expensive, and we should think twice before flushing to the disk. To give the user choices, PostgreSQL offers various levels of data protection. These choices are represented by two essential parameters, which can be found in postgresql.conf:
- fsync
- synchronous_commit
The fsync parameter controls whether data is flushed to the disk at all. In the default configuration, PostgreSQL will always flush a commit out to the disk. If fsync is off, however, there is no guarantee at all that a COMMIT statement will survive a crash. Data can be lost and there might even be data corruption. To protect all of your data, it is necessary to keep fsync set to on. If you can afford to lose some or all of your data, you can relax the flushing standards a little.
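A minimal sketch of how to verify this on a running server from psql follows; on is the default and what you would normally want to see:

test=# SHOW fsync;
 fsync
-------
 on
(1 row)

Unlike synchronous_commit, fsync cannot be changed with a plain SET inside a session; it has to be adjusted in postgresql.conf (or via ALTER SYSTEM) and activated with a configuration reload.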
The synchronous_commit parameter is related to XLOG writes. Normally, PostgreSQL will wait until the data has been written to the XLOG completely. Especially short transactions can suffer considerably, and therefore, different options are offered (a short example follows this list):
- on: PostgreSQL will wait until the XLOG has been fully and successfully written. If you are storing credit card data, you want to make sure that a financial transaction is not lost. In this case, flushing to the disk is essential.
- off: There will be a time difference between reporting success to the client and safely writing the data to the disk. In a setting like this, a small amount of committed data can be lost. Let's assume a database that stores information about who is currently online on a website. Suppose your system crashes and comes back up 20 minutes later. Do you really care about your data? After 20 minutes, everybody would have to log in again anyway, and it is not worth sacrificing performance to protect data that will be outdated in a couple of minutes anyway.
- local: In the case of a replicated database instance, we will wait only for the local instance to flush to the disk. The advantage here is that you still have a high level of protection because you flush to one disk. At the same time, we can safely assume that both servers will not crash simultaneously, so we can relax the standards on the slave a little.
- remote_write: PostgreSQL will wait until a synchronous standby server reports success for a given transaction.
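Because synchronous_commit can be changed by an ordinary user at runtime, it can even be relaxed for a single transaction while the rest of the system stays fully synchronous. The following sketch reuses the t_test table from earlier in this chapter; the inserted value is just an arbitrary example:

test=# BEGIN;
BEGIN
test=# SET LOCAL synchronous_commit TO off;
SET
test=# INSERT INTO t_test VALUES (2);
INSERT 0 1
test=# COMMIT;
COMMIT

SET LOCAL affects only the current transaction, so the durability guarantees of all other sessions remain untouched.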
In contrast to setting fsync to off, changing synchronous_commit to off will not result in corruption. However, in the event of a crash, we might lose a handful of transactions that have already been committed successfully. The amount of potential data loss is governed by an additional postgresql.conf setting called wal_writer_delay. When synchronous_commit is set to off, we can never lose more data than defined by the wal_writer_delay configuration variable.
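To see which window applies to your own installation, it is enough to check the setting; 200 milliseconds is the usual default, but your configuration may of course differ:

test=# SHOW wal_writer_delay;
 wal_writer_delay
------------------
 200ms
(1 row)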
Tip
Changing synchronous_commit might look like a small performance tweak; in reality, however, changing the sync behavior is one of the dominant factors when running small writing transactions. The gain might not just be a handful of percentage points. If you are lucky, it could be tenfold or even more (depending on the hardware, workload, I/O subsystem, and so on).
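If you want to get a feeling for the difference on your own hardware, psql's \timing switch is a simple way to compare a flushed and an unflushed commit. This is only a sketch, not a benchmark; the Time: lines that psql prints after each statement are omitted here because the actual numbers depend entirely on your disks:

test=# \timing
Timing is on.
test=# SET synchronous_commit TO on;
SET
test=# INSERT INTO t_test VALUES (3);
INSERT 0 1
test=# SET synchronous_commit TO off;
SET
test=# INSERT INTO t_test VALUES (4);
INSERT 0 1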
Keep in mind that configuring a database is not just about speed. Consistency is at least as important as speed, and therefore, you should think carefully whether you want to trade speed for potential data loss.
It is important to fully understand the consistency-related topics outlined in this chapter. When it comes to deciding on your cluster architecture, data security will be an essential part, and it is highly desirable to be able to judge whether a certain architecture makes sense for your data. After all, work on databases is all about protecting data. Full awareness of your durability requirements is definitely a big plus.