This article has three sections, covering the fsync frequency of clog, atomic clog operations, and clog consistency under asynchronous commit.
## PostgreSQL pg_clog fsync Frequency Analysis
When exactly does pg_clog need to call fsync? Let's find out.
First, an introduction to [pg_clog](https://wiki.postgresql.org/wiki/Hint_Bits) quoted from the PostgreSQL wiki:
> Some details here are in src/backend/access/transam/README:
> 1\. “pg_clog records the commit status for each transaction that has been assigned an XID.”
> 2\. “Transactions and subtransactions are assigned permanent XIDs only when/if they first do something that requires one — typically, insert/update/delete a tuple, though there are a few other places that need an XID assigned.”
>
> pg_clog is updated only at sub or main transaction end. When the transactionid is assigned the page of the clog that contains that transactionid is checked to see if it already exists and if not, it is initialised.
> pg_clog is allocated in pages of 8kB apiece (the same as BLCKSZ, so not necessarily 8 kB; see the analysis below).
> Each transaction needs 2 bits, so on an 8 kB page there is space for 4 transactions/byte * 8k bytes = 32k transactions.
> On allocation, pages are zeroed, which is the bit pattern for “transaction in progress”.
> So when a transaction starts, it only needs to ensure that the pg_clog page that contains its status is allocated, but it need not write anything to it.
> In 8.3 and later, this happens not when the transaction starts, but when the Xid is assigned (i.e. when the transaction first calls a read-write command).
> In previous versions it happens when the first snapshot is taken, normally on the first command of any type with very few exceptions.
>
> This means that one transaction in every 32K writing transactions does have to do extra work when it assigns itself an XID, namely create and zero out the next page of pg_clog.
> And that doesn’t just slow down the transaction in question, but the next few guys that would like an XID but arrive on the scene while the zeroing-out is still in progress.
> This probably contributes to reported behavior that the transaction execution time is subject to unpredictable spikes.
Every 32K transactions one CLOG page must be extended: the new page is zero-filled and pg_fsync may be called. Compared with fsyncing the XLOG this should be relatively lightweight, but it can still produce unpredictable response latency, because if extending the CLOG page blocks, every session waiting on that clog page is affected.
The pg_fsync happens only when the CLOG buffer has no empty slot: a dirty page is then chosen among the CLOG buffer slots and flushed out, and that is the moment pg_fsync is issued. In other words:
CLOG pages don’t make their way out to disk until the internal CLOG buffers are filled, at which point the least recently used buffer there is evicted to permanent storage.
Let's trace through the code to see how pg_clog ends up calling pg_fsync to flush dirty pages.
Each time a new transaction ID is allocated, ExtendCLOG is called; if the CLOG page computed from the transaction ID does not exist yet, it is extended. But not every extension leads to pg_fsync, because checkpoints also flush the clog buffers to disk. Only when, at the moment a new CLOG page is requested, no clog buffer slot is free does the backend have to pick a victim page itself and call pg_fsync to flush the corresponding pg_clog file.
src/backend/access/transam/varsup.c
~~~
/*
* Allocate the next XID for a new transaction or subtransaction.
*
* The new XID is also stored into MyPgXact before returning.
*
* Note: when this is called, we are actually already inside a valid
* transaction, since XIDs are now not allocated until the transaction
* does something. So it is safe to do a database lookup if we want to
* issue a warning about XID wrap.
*/
TransactionId
GetNewTransactionId(bool isSubXact)
{
......
/*
* If we are allocating the first XID of a new page of the commit log,
* zero out that commit-log page before returning. We must do this while
* holding XidGenLock, else another xact could acquire and commit a later
* XID before we zero the page. Fortunately, a page of the commit log
* holds 32K or more transactions, so we don't have to do this very often.
*
* Extend pg_subtrans too.
*/
ExtendCLOG(xid);
ExtendSUBTRANS(xid);
......
~~~
`ExtendCLOG(xid)` extends the clog page. It calls `TransactionIdToPgIndex` to compute the remainder of the XID modulo CLOG_XACTS_PER_PAGE; if the remainder is nonzero, no extension is needed.
src/backend/access/transam/clog.c
~~~
#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_PAGE)
/*
* Make sure that CLOG has room for a newly-allocated XID.
*
* NB: this is called while holding XidGenLock. We want it to be very fast
* most of the time; even when it's not so fast, no actual I/O need happen
* unless we're forced to write out a dirty clog or xlog page to make room
* in shared memory.
*/
void
ExtendCLOG(TransactionId newestXact)
{
int pageno;
/*
* No work except at first XID of a page. But beware: just after
* wraparound, the first XID of page zero is FirstNormalTransactionId.
*/
if (TransactionIdToPgIndex(newestXact) != 0 && // nonzero remainder: no extension needed
!TransactionIdEquals(newestXact, FirstNormalTransactionId))
return;
pageno = TransactionIdToPage(newestXact);
LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroCLOGPage(pageno, true);
LWLockRelease(CLogControlLock);
}
~~~
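To make the trigger condition concrete, here is a minimal standalone sketch (not PostgreSQL source; it assumes the default BLCKSZ of 8192) showing that the extension branch fires exactly once every 32768 XIDs:
~~~
#include <stdio.h>

typedef unsigned int TransactionId;

#define BLCKSZ 8192
#define CLOG_XACTS_PER_PAGE (BLCKSZ * 4)
#define TransactionIdToPage(xid)    ((xid) / (TransactionId) CLOG_XACTS_PER_PAGE)
#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_PAGE)

int main(void)
{
    TransactionId xids[] = {32767, 32768, 32769, 65536};
    for (int i = 0; i < 4; i++)
        printf("xid %u -> page %u, index %u%s\n",
               xids[i],
               TransactionIdToPage(xids[i]),
               TransactionIdToPgIndex(xids[i]),
               TransactionIdToPgIndex(xids[i]) == 0 ? "  <- ExtendCLOG zeroes a new page here" : "");
    return 0;
}
~~~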
`ZeroCLOGPage(pageno, true)` calls `SimpleLruZeroPage` to extend and initialize the CLOG page, and emits an XLOG record about it.
~~~
/*
* Initialize (or reinitialize) a page of CLOG to zeroes.
* If writeXlog is TRUE, also emit an XLOG record saying we did this.
*
* The page is not actually written, just set up in shared memory.
* The slot number of the new page is returned.
*
* Control lock must be held at entry, and will be held at exit.
*/
static int
ZeroCLOGPage(int pageno, bool writeXlog)
{
int slotno;
slotno = SimpleLruZeroPage(ClogCtl, pageno);
if (writeXlog)
WriteZeroPageXlogRec(pageno);
return slotno;
}
~~~
`SimpleLruZeroPage(ClogCtl, pageno)` calls `SlruSelectLRUPage(ctl, pageno)` to pick a slot from the clog shared buffers.
src/backend/access/transam/slru.c
~~~
/*
* Initialize (or reinitialize) a page to zeroes.
*
* The page is not actually written, just set up in shared memory.
* The slot number of the new page is returned.
*
* Control lock must be held at entry, and will be held at exit.
*/
int
SimpleLruZeroPage(SlruCtl ctl, int pageno)
{
SlruShared shared = ctl->shared;
int slotno;
/* Find a suitable buffer slot for the page */
slotno = SlruSelectLRUPage(ctl, pageno);
Assert(shared->page_status[slotno] == SLRU_PAGE_EMPTY ||
(shared->page_status[slotno] == SLRU_PAGE_VALID &&
!shared->page_dirty[slotno]) ||
shared->page_number[slotno] == pageno);
/* Mark the slot as containing this page */
shared->page_number[slotno] = pageno;
shared->page_status[slotno] = SLRU_PAGE_VALID;
shared->page_dirty[slotno] = true;
SlruRecentlyUsed(shared, slotno);
/* Set the buffer to zeroes */
MemSet(shared->page_buffer[slotno], 0, BLCKSZ);
/* Set the LSNs for this new page to zero */
SimpleLruZeroLSNs(ctl, slotno);
/* Assume this page is now the latest active page */
shared->latest_page_number = pageno;
return slotno;
}
~~~
`SlruSelectLRUPage(SlruCtl ctl, int pageno)` looks for an empty slot in the clog buffer; if there is none, it must call `SlruInternalWritePage(ctl, bestvalidslot, NULL)` to write out a shared buffer page.
~~~
/*
* Select the slot to re-use when we need a free slot.
*
* The target page number is passed because we need to consider the
* possibility that some other process reads in the target page while
* we are doing I/O to free a slot. Hence, check or recheck to see if
* any slot already holds the target page, and return that slot if so.
* Thus, the returned slot is *either* a slot already holding the pageno
* (could be any state except EMPTY), *or* a freeable slot (state EMPTY
* or CLEAN).
*
* Control lock must be held at entry, and will be held at exit.
*/
static int
SlruSelectLRUPage(SlruCtl ctl, int pageno)
{
......
/* See if page already has a buffer assigned */ // first check whether the clog buffer already holds this page; if so, return that slot without any pg_fsync
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
if (shared->page_number[slotno] == pageno &&
shared->page_status[slotno] != SLRU_PAGE_EMPTY)
return slotno;
}
......
/* If no slot already holds the page, pick the least recently used page from the clog buffer; note that the latest page is never chosen, and pages that are not I/O-busy are preferred.
* If we find any EMPTY slot, just select that one. Else choose a
* victim page to replace. We normally take the least recently used
* valid page, but we will never take the slot containing
* latest_page_number, even if it appears least recently used. We
* will select a slot that is already I/O busy only if there is no
* other choice: a read-busy slot will not be least recently used once
* the read finishes, and waiting for an I/O on a write-busy slot is
* inferior to just picking some other slot. Testing shows the slot
* we pick instead will often be clean, allowing us to begin a read at
* once.
*
* Normally the page_lru_count values will all be different and so
* there will be a well-defined LRU page. But since we allow
* concurrent execution of SlruRecentlyUsed() within
* SimpleLruReadPage_ReadOnly(), it is possible that multiple pages
* acquire the same lru_count values. In that case we break ties by
* choosing the furthest-back page.
*
* Notice that this next line forcibly advances cur_lru_count to a
* value that is certainly beyond any value that will be in the
* page_lru_count array after the loop finishes. This ensures that
* the next execution of SlruRecentlyUsed will mark the page newly
* used, even if it's for a page that has the current counter value.
* That gets us back on the path to having good data when there are
* multiple pages with the same lru_count.
*/
cur_count = (shared->cur_lru_count)++;
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
int this_delta;
int this_page_number;
if (shared->page_status[slotno] == SLRU_PAGE_EMPTY) // an empty slot appeared in the meantime; return it
return slotno;
this_delta = cur_count - shared->page_lru_count[slotno];
if (this_delta < 0)
{
/*
* Clean up in case shared updates have caused cur_count
* increments to get "lost". We back off the page counts,
* rather than trying to increase cur_count, to avoid any
* question of infinite loops or failure in the presence of
* wrapped-around counts.
*/
shared->page_lru_count[slotno] = cur_count;
this_delta = 0;
}
this_page_number = shared->page_number[slotno];
if (this_page_number == shared->latest_page_number)
continue;
if (shared->page_status[slotno] == SLRU_PAGE_VALID) // a valid page that is not I/O-busy
{
if (this_delta > best_valid_delta ||
(this_delta == best_valid_delta &&
ctl->PagePrecedes(this_page_number,
best_valid_page_number)))
{
bestvalidslot = slotno;
best_valid_delta = this_delta;
best_valid_page_number = this_page_number;
}
}
else
{
if (this_delta > best_invalid_delta ||
(this_delta == best_invalid_delta &&
ctl->PagePrecedes(this_page_number,
best_invalid_page_number)))
{
bestinvalidslot = slotno; // when every page is I/O-busy, we have no choice but to pick one of them
best_invalid_delta = this_delta;
best_invalid_page_number = this_page_number;
}
}
}
/*
* If all pages (except possibly the latest one) are I/O busy, we'll
* have to wait for an I/O to complete and then retry. In that
* unhappy case, we choose to wait for the I/O on the least recently
* used slot, on the assumption that it was likely initiated first of
* all the I/Os in progress and may therefore finish first.
*/
if (best_valid_delta < 0) // no SLRU_PAGE_VALID page was found: all pages are I/O-busy
{
SimpleLruWaitIO(ctl, bestinvalidslot);
continue;
}
/*
* If the selected page is clean, we're set.
*/
if (!shared->page_dirty[bestvalidslot]) // if the page is no longer dirty (e.g. a checkpoint flushed it), just return it
return bestvalidslot;
......
// Only when none of the steps above yields a usable slot does the backend
// flush a dirty page itself (SlruInternalWritePage calls pg_fsync).
/*
* Write the page. Note the third argument, fdata, is NULL here.
*/
SlruInternalWritePage(ctl, bestvalidslot, NULL);
......
~~~
`SlruInternalWritePage(SlruCtl ctl, int slotno, SlruFlush fdata)` calls `SlruPhysicalWritePage` to perform the write.
~~~
/*
* Write a page from a shared buffer, if necessary.
* Does nothing if the specified slot is not dirty.
*
* NOTE: only one write attempt is made here. Hence, it is possible that
* the page is still dirty at exit (if someone else re-dirtied it during
* the write). However, we *do* attempt a fresh write even if the page
* is already being written; this is for checkpoints.
*
* Control lock must be held at entry, and will be held at exit.
*/
static void
SlruInternalWritePage(SlruCtl ctl, int slotno, SlruFlush fdata)
{
......
/* Do the write */
ok = SlruPhysicalWritePage(ctl, pageno, slotno, fdata);
......
~~~
The SLRU page status codes:
~~~
/*
* Page status codes. Note that these do not include the "dirty" bit.
* page_dirty can be TRUE only in the VALID or WRITE_IN_PROGRESS states;
* in the latter case it implies that the page has been re-dirtied since
* the write started.
*/
typedef enum
{
SLRU_PAGE_EMPTY, /* buffer is not in use */
SLRU_PAGE_READ_IN_PROGRESS, /* page is being read in */
SLRU_PAGE_VALID, /* page is valid and not being written */
SLRU_PAGE_WRITE_IN_PROGRESS /* page is being written out */
} SlruPageStatus;
~~~
`SlruPhysicalWritePage(ctl, pageno, slotno, fdata)` involves the `SlruCtlData` structure; for pg_clog, do_fsync=true.
~~~
/*
* Physical write of a page from a buffer slot
*
* On failure, we cannot just ereport(ERROR) since caller has put state in
* shared memory that must be undone. So, we return FALSE and save enough
* info in static variables to let SlruReportIOError make the report.
*
* For now, assume it's not worth keeping a file pointer open across
* independent read/write operations. We do batch operations during
* SimpleLruFlush, though.
*
* fdata is NULL for a standalone write, pointer to open-file info during
* SimpleLruFlush.
*/
static bool
SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno,
SlruFlush fdata)
{
......
int fd = -1;
......
// if the file does not exist, create it automatically
if (fd < 0)
{
/*
* If the file doesn't already exist, we should create it. It is
* possible for this to need to happen when writing a page that's not
* first in its segment; we assume the OS can cope with that. (Note:
* it might seem that it'd be okay to create files only when
* SimpleLruZeroPage is called for the first page of a segment.
* However, if after a crash and restart the REDO logic elects to
* replay the log from a checkpoint before the latest one, then it's
* possible that we will get commands to set transaction status of
* transactions that have already been truncated from the commit log.
* Easiest way to deal with that is to accept references to
* nonexistent files here and in SlruPhysicalReadPage.)
*
* Note: it is possible for more than one backend to be executing this
* code simultaneously for different pages of the same file. Hence,
* don't use O_EXCL or O_TRUNC or anything like that.
*/
SlruFileName(ctl, path, segno);
fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY,
S_IRUSR | S_IWUSR);
......
/*
* If not part of Flush, need to fsync now. We assume this happens
* infrequently enough that it's not a performance issue.
*/
if (!fdata) // fdata=NULL was passed in and ctl->do_fsync=true, so pg_fsync below is called
{
if (ctl->do_fsync && pg_fsync(fd)) // do_fsync=true for pg_clog and multixact
{
slru_errcause = SLRU_FSYNC_FAILED;
slru_errno = errno;
CloseTransientFile(fd);
return false;
}
if (CloseTransientFile(fd))
{
slru_errcause = SLRU_CLOSE_FAILED;
slru_errno = errno;
return false;
}
}
~~~
The code behind `ctl->do_fsync` && `pg_fsync(fd)`:
src/include/access/slru.h
~~~
/*
* SlruCtlData is an unshared structure that points to the active information
* in shared memory.
*/
typedef struct SlruCtlData
{
SlruShared shared;
/*
* This flag tells whether to fsync writes (true for pg_clog and multixact
* stuff, false for pg_subtrans and pg_notify).
*/
bool do_fsync;
/*
* Decide which of two page numbers is "older" for truncation purposes. We
* need to use comparison of TransactionIds here in order to do the right
* thing with wraparound XID arithmetic.
*/
bool (*PagePrecedes) (int, int);
/*
* Dir is set during SimpleLruInit and does not change thereafter. Since
* it's always the same, it doesn't need to be in shared memory.
*/
char Dir[64];
} SlruCtlData;
typedef SlruCtlData *SlruCtl;
~~~
src/backend/access/transam/slru.c
~~~
......
void
SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
LWLock *ctllock, const char *subdir)
......
ctl->do_fsync = true; /* default behavior */ // do_fsync defaults to true when an SLRU is initialized
......
~~~
Below is the clog SLRU initialization; note that it does not touch do_fsync, which therefore stays true.
src/backend/access/transam/clog.c
~~~
/*
* Number of shared CLOG buffers.
*
* Testing during the PostgreSQL 9.2 development cycle revealed that on a
* large multi-processor system, it was possible to have more CLOG page
* requests in flight at one time than the number of CLOG buffers which existed
* at that time, which was hardcoded to 8\. Further testing revealed that
* performance dropped off with more than 32 CLOG buffers, possibly because
* the linear buffer search algorithm doesn't scale well.
*
* Unconditionally increasing the number of CLOG buffers to 32 did not seem
* like a good idea, because it would increase the minimum amount of shared
* memory required to start, which could be a problem for people running very
* small configurations. The following formula seems to represent a reasonable
* compromise: people with very low values for shared_buffers will get fewer
* CLOG buffers as well, and everyone else will get 32.
*
* It is likely that some further work will be needed here in future releases;
* for example, on a 64-core server, the maximum number of CLOG requests that
* can be simultaneously in flight will be even larger. But that will
* apparently require more than just changing the formula, so for now we take
* the easy way out.
*/
Size
CLOGShmemBuffers(void)
{
return Min(32, Max(4, NBuffers / 512));
}
void
CLOGShmemInit(void)
{
ClogCtl->PagePrecedes = CLOGPagePrecedes;
SimpleLruInit(ClogCtl, "CLOG Ctl", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
CLogControlLock, "pg_clog");
}
~~~
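As a quick sanity check of the CLOGShmemBuffers formula above, here is a tiny standalone sketch (the 128MB shared_buffers figure is just an assumed example; Min/Max mirror the PostgreSQL macros):
~~~
#include <stdio.h>

#define Min(x, y) ((x) < (y) ? (x) : (y))
#define Max(x, y) ((x) > (y) ? (x) : (y))

int main(void)
{
    int NBuffers = 16384;   /* shared_buffers = 128MB with 8kB blocks */
    /* CLOGShmemBuffers(): Min(32, Max(4, 16384 / 512)) = 32 pages */
    printf("%d\n", Min(32, Max(4, NBuffers / 512)));
    return 0;
}
~~~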
Below is the subtrans SLRU initialization; it sets do_fsync=false, so extending a subtrans page never calls pg_fsync.
src/backend/access/transam/subtrans.c
~~~
void
SUBTRANSShmemInit(void)
{
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
SimpleLruInit(SubTransCtl, "SUBTRANS Ctl", NUM_SUBTRANS_BUFFERS, 0,
SubtransControlLock, "pg_subtrans");
/* Override default assumption that writes should be fsync'd */
SubTransCtl->do_fsync = false;
}
~~~
multixact.c does not modify do_fsync either, so it also needs fsync.
MultiXactShmemInit(void)@src/backend/access/transam/multixact.c
The pg_fsync code:
src/backend/storage/file/fd.c
~~~
/*
* pg_fsync --- do fsync with or without writethrough
*/
int
pg_fsync(int fd)
{
/* #if is to skip the sync_method test if there's no need for it */
#if defined(HAVE_FSYNC_WRITETHROUGH) && !defined(FSYNC_WRITETHROUGH_IS_FSYNC)
if (sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH)
return pg_fsync_writethrough(fd);
else
#endif
return pg_fsync_no_writethrough(fd);
}
/*
* pg_fsync_no_writethrough --- same as fsync except does nothing if
* enableFsync is off
*/
int
pg_fsync_no_writethrough(int fd)
{
if (enableFsync)
return fsync(fd);
else
return 0;
}
/*
* pg_fsync_writethrough
*/
int
pg_fsync_writethrough(int fd)
{
if (enableFsync)
{
#ifdef WIN32
return _commit(fd);
#elif defined(F_FULLFSYNC)
return (fcntl(fd, F_FULLFSYNC, 0) == -1) ? -1 : 0;
#else
errno = ENOSYS;
return -1;
#endif
}
else
return 0;
}
~~~
From the code analysis above: when extending a clog page, if the CLOG buffer has no empty slot, the backend process must flush a CLOG page itself, which is where the pg_fsync call comes from.
A clog page is the same size as the database block size (BLCKSZ), 8 kB by default (unless changed when compiling the server) and at most 32 kB. Each transaction needs 2 bits in pg_clog to record its status (in progress, committed, aborted, or sub-committed), so an 8 kB clog page stores the status of 4 transactions/byte * 8k bytes = 32K transactions; in other words, the clog must grow by one page every 32K transactions.
Below are the main clog macros.
src/backend/access/transam/clog.c
~~~
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
* everywhere else in Postgres.
*
* Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
* CLOG page numbering also wraps around at 0xFFFFFFFF/CLOG_XACTS_PER_PAGE,
* and CLOG segment numbering at
* 0xFFFFFFFF/CLOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT. We need take no
* explicit notice of that fact in this module, except when comparing segment
* and page numbers in TruncateCLOG (see CLOGPagePrecedes).
*/
/* We need two bits per xact, so four xacts fit in a byte */
#define CLOG_BITS_PER_XACT 2
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
#define CLOG_XACT_BITMASK ((1 << CLOG_BITS_PER_XACT) - 1)
#define TransactionIdToPage(xid) ((xid) / (TransactionId) CLOG_XACTS_PER_PAGE)
#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_PAGE)
#define TransactionIdToByte(xid) (TransactionIdToPgIndex(xid) / CLOG_XACTS_PER_BYTE)
#define TransactionIdToBIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_BYTE)
~~~
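As an illustration of these macros, here is a minimal standalone sketch (assuming BLCKSZ = 8192) that locates the two status bits of a given xid:
~~~
#include <stdio.h>

typedef unsigned int TransactionId;

#define BLCKSZ 8192
#define CLOG_BITS_PER_XACT 2
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
#define TransactionIdToPage(xid)    ((xid) / (TransactionId) CLOG_XACTS_PER_PAGE)
#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_PAGE)
#define TransactionIdToByte(xid)    (TransactionIdToPgIndex(xid) / CLOG_XACTS_PER_BYTE)
#define TransactionIdToBIndex(xid)  ((xid) % (TransactionId) CLOG_XACTS_PER_BYTE)

int main(void)
{
    TransactionId xid = 100001;
    /* prints: xid 100001 -> page 3, byte 424, bit offset 2 */
    printf("xid %u -> page %u, byte %u, bit offset %u\n",
           xid, TransactionIdToPage(xid), TransactionIdToByte(xid),
           TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT);
    return 0;
}
~~~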
Check the database block size:
~~~
postgres@digoal-> pg_controldata |grep block
Database block size: 8192
WAL block size: 8192
~~~
We can use SystemTap to check whether pg_fsync gets called. To observe a backend process flushing clog dirty pages itself, widen the checkpoint interval and keep the clog shared buffer pages small (the default formula below caps them at 32).
You will then observe the backend process actively flushing clog dirty pages.
~~~
Size
CLOGShmemBuffers(void)
{
return Min(32, Max(4, NBuffers / 512));
}
~~~
The call path to trace:
~~~
src/backend/access/transam/slru.c
SlruPhysicalWritePage
......
SlruFileName(ctl, path, segno);
fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY,
S_IRUSR | S_IWUSR);
......
src/backend/storage/file/fd.c
OpenTransientFile
pg_fsync(fd)
~~~
The stap script:
~~~
[root@digoal ~]# cat trc.stp
global f_start[999999]
probe process("/opt/pgsql/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c").call {
f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms$$)
# printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}
probe process("/opt/pgsql/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c").return {
t=gettimeofday_ms()
a=execname()
b=cpu()
c=pid()
d=pp()
e=tid()
if (f_start[a,c,e,b]) {
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $return$$)
# printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
}
}
probe process("/opt/pgsql/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").call {
f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms$$)
# printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}
probe process("/opt/pgsql/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").return {
t=gettimeofday_ms()
a=execname()
b=cpu()
c=pid()
d=pp()
e=tid()
if (f_start[a,c,e,b]) {
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $return$$)
# printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
}
}
probe process("/opt/pgsql/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").call {
f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms$$)
# printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}
probe process("/opt/pgsql/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").return {
t=gettimeofday_ms()
a=execname()
b=cpu()
c=pid()
d=pp()
e=tid()
if (f_start[a,c,e,b]) {
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $return$$)
# printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
}
}
~~~
Start pgbench with a script that calls txid_current() to allocate new transaction IDs:
~~~
postgres@digoal-> cat 7.sql
select txid_current();
~~~
Run the test; it produces roughly 32K requests per second:
~~~
postgres@digoal-> pgbench -M prepared -n -r -P 1 -f ./7.sql -c 1 -j 1 -T 100000
progress: 240.0 s, 31164.4 tps, lat 0.031 ms stddev 0.183
progress: 241.0 s, 33243.3 tps, lat 0.029 ms stddev 0.127
progress: 242.0 s, 32567.3 tps, lat 0.030 ms stddev 0.179
progress: 243.0 s, 33656.6 tps, lat 0.029 ms stddev 0.038
progress: 244.0 s, 33948.1 tps, lat 0.029 ms stddev 0.021
progress: 245.0 s, 32996.8 tps, lat 0.030 ms stddev 0.046
progress: 246.0 s, 34156.7 tps, lat 0.029 ms stddev 0.015
progress: 247.0 s, 33259.5 tps, lat 0.029 ms stddev 0.074
progress: 248.0 s, 32979.6 tps, lat 0.030 ms stddev 0.043
progress: 249.0 s, 32892.6 tps, lat 0.030 ms stddev 0.039
progress: 250.0 s, 33090.7 tps, lat 0.029 ms stddev 0.020
progress: 251.0 s, 33238.3 tps, lat 0.029 ms stddev 0.017
progress: 252.0 s, 32341.3 tps, lat 0.030 ms stddev 0.045
progress: 253.0 s, 31999.0 tps, lat 0.030 ms stddev 0.167
progress: 254.0 s, 33332.6 tps, lat 0.029 ms stddev 0.056
progress: 255.0 s, 30394.6 tps, lat 0.032 ms stddev 0.027
progress: 256.0 s, 31862.7 tps, lat 0.031 ms stddev 0.023
progress: 257.0 s, 31574.0 tps, lat 0.031 ms stddev 0.112
~~~
Find the backend process to trace:
~~~
postgres@digoal-> ps -ewf|grep postgres
postgres 2921 1883 29 09:37 pts/1 00:00:05 pgbench -M prepared -n -r -P 1 -f ./7.sql -c 1 -j 1 -T 100000
postgres 2924 1841 66 09:37 ? 00:00:13 postgres: postgres postgres [local] SELECT
~~~
Extract the pg_clog-related trace results from the log:
~~~
[root@digoal ~]# stap -vp 5 -DMAXSKIPPED=9999999 -DSTP_NO_OVERLOAD -DMAXTRYLOCK=100 ./trc.stp -x 2924 >./stap.log 2>&1
0 postgres(2924): -> time:1441503927731, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").call, par:ctl={.shared=0x7f74a9fe39c0, .do_fsync='\001', .PagePrecedes=0x4b1960, .Dir="pg_clog"} pageno=12350 slotno=10 fdata=ERROR
31 postgres(2924): -> time:1441503927731, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").call, par:fileName="pg_clog/0181" fileFlags=66 fileMode=384
53 postgres(2924): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").return, par:14
102 postgres(2924): -> time:1441503927731, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").call, par:fd=14
1096 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").return, par:0
1113 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").return, par:'\001'
1105302 postgres(2924): -> time:1441503928836, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").call, par:ctl={.shared=0x7f74a9fe39c0, .do_fsync='\001', .PagePrecedes=0x4b1960, .Dir="pg_clog"} pageno=12351 slotno=11 fdata=ERROR
1105329 postgres(2924): -> time:1441503928836, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").call, par:fileName="pg_clog/0181" fileFlags=66 fileMode=384
1105348 postgres(2924): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").return, par:14
1105405 postgres(2924): -> time:1441503928836, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").call, par:fd=14
1106440 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").return, par:0
1106452 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").return, par:'\001'
2087891 postgres(2924): -> time:1441503929819, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").call, par:ctl={.shared=0x7f74a9fe39c0, .do_fsync='\001', .PagePrecedes=0x4b1960, .Dir="pg_clog"} pageno=12352 slotno=12 fdata=ERROR
2087917 postgres(2924): -> time:1441503929819, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").call, par:fileName="pg_clog/0182" fileFlags=66 fileMode=384
2087958 postgres(2924): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").return, par:14
2088013 postgres(2924): -> time:1441503929819, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").call, par:fd=14
2089250 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").return, par:0
2089265 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").return, par:'\001'
~~~
From the timestamps, an fsync occurs roughly once per second:
~~~
postgres=# select 1441503928836-1441503927731;
?column?
----------
1105
(1 row)
postgres=# select 1441503929819-1441503928836;
?column?
----------
983
(1 row)
~~~
The pgbench output above showed about 32,000 transactions per second, which is exactly the number of transactions one clog page holds (the block size here is 8 kB): each transaction needs 2 bits, each byte stores 4 transactions, and 8192 * 4 = 32768, so one page extension (and one fsync) per second is expected.
If you instead want to observe the case where backend processes do not flush clog buffer dirty pages, shorten the checkpoint interval (or issue checkpoints manually) and enlarge the clog buffer pages, for example:
~~~
Size
CLOGShmemBuffers(void)
{
return Min(1024, Max(4, NBuffers / 2));
}
~~~
With the same stap script, you will then no longer observe backend processes actively flushing clog dirty pages.
Based on the analysis above, if you find backend processes frequently fsyncing pg_clog, a few optimizations are possible:
1. Because each extension of the pg_clog file changes the file size, even pg_fdatasync from the backend process would also have to write the file system metadata journal (taking EXT4 as an example, assuming the data mount option is not writeback); journal writes are serialized across the whole file system, so this easily blocks.
So when the backend picks a clog victim page, avoiding the most recent page numbers already helps (ideally, never choosing pages in the most recent clog file);
Another approach is to first call `sync_file_range` with SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER, which writes no metadata, and call pg_fsync only after the data has been written, shortening the wait for the data fsync (see the sketch after this list);
2. Preallocate pg_clog files. A single pg_clog segment file currently grows up to SLRU_PAGES_PER_SEGMENT (32) pages, i.e. 32 times BLCKSZ; the file could be preallocated at full size instead of being extended (and changing size) each time;
3. Defer the backend process's fsync requests to checkpoint processing.
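Here is a minimal sketch of the `sync_file_range` idea from item 1. It only illustrates the system-call sequence on Linux and is not PostgreSQL code; the helper name flush_clog_page and its offset/length parameters are hypothetical:
~~~
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* hypothetical helper: flush one clog page range durably */
int flush_clog_page(int fd, off_t offset, size_t len)
{
    /* write back just this range's data pages; no metadata is touched */
    if (sync_file_range(fd, offset, len,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER) != 0)
        return -1;
    /* fsync is still required for durability (it covers metadata such as
     * the file size), but most data is already on disk, so the wait is shorter */
    return fsync(fd);
}
~~~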
[References]
https://wiki.postgresql.org/wiki/Hint_Bits
http://blog.163.com/digoal@126/blog/static/1638770402015840480734/
src/backend/access/transam/varsup.c
src/backend/access/transam/clog.c
src/backend/access/transam/slru.c
src/include/access/slru.h
src/backend/access/transam/subtrans.c
src/backend/storage/file/fd.c
## pg_clog Atomic Operations and pg_subtrans (Subtransactions)
Without subtransactions it is fairly easy to keep pg_clog updates atomic. But once subtransactions are assigned XIDs, and some subtransaction XIDs land on a different CLOG page from the parent transaction's XID, keeping the transaction consistent requires an atomic CLOG write.
PostgreSQL achieves atomic CLOG writes with a two-phase (2PC-style) update (see the sketch below):
1. First set the subtransactions on CLOG pages other than the main transaction's page to sub-committed;
2. Then, under one lock on the main transaction's CLOG page, set its same-page subtransactions to sub-committed, set the main transaction to committed, and set those same-page subtransactions to committed;
3. Finally set the subtransactions on the other CLOG pages to committed.
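A minimal standalone sketch of these three steps (not the PostgreSQL implementation; set_status stands in for the page update that the real code performs under CLogControlLock):
~~~
#include <stdio.h>

typedef unsigned int TransactionId;
#define CLOG_XACTS_PER_PAGE 32768
#define PAGE(xid) ((xid) / CLOG_XACTS_PER_PAGE)

static void set_status(TransactionId xid, const char *status)
{
    printf("page %u: xid %u -> %s\n", PAGE(xid), xid, status);
}

static void commit_tree(TransactionId xid, int nsub, const TransactionId *sub)
{
    int i;
    /* 1. sub-commit the subxids that live on other pages */
    for (i = 0; i < nsub; i++)
        if (PAGE(sub[i]) != PAGE(xid))
            set_status(sub[i], "sub-committed");
    /* 2. one page, one lock: sub-commit the same-page subxids, commit the
     *    main xid, then commit those same-page subxids */
    for (i = 0; i < nsub; i++)
        if (PAGE(sub[i]) == PAGE(xid))
            set_status(sub[i], "sub-committed");
    set_status(xid, "committed");
    for (i = 0; i < nsub; i++)
        if (PAGE(sub[i]) == PAGE(xid))
            set_status(sub[i], "committed");
    /* 3. now commit the subxids on the other pages */
    for (i = 0; i < nsub; i++)
        if (PAGE(sub[i]) != PAGE(xid))
            set_status(sub[i], "committed");
}

int main(void)
{
    /* mirrors the comment's example below: t and t1 on page 1, t2/t3 on page 2, t4 on page 3 */
    const TransactionId sub[] = {32769, 65537, 65538, 98305};
    commit_tree(32768, 4, sub);
    return 0;
}
~~~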
src/backend/access/transam/clog.c
~~~
/*
* TransactionIdSetTreeStatus
*
* Record the final state of transaction entries in the commit log for
* a transaction and its subtransaction tree. Take care to ensure this is
* efficient, and as atomic as possible.
*
* xid is a single xid to set status for. This will typically be
* the top level transactionid for a top level commit or abort. It can
* also be a subtransaction when we record transaction aborts.
*
* subxids is an array of xids of length nsubxids, representing subtransactions
* in the tree of xid. In various cases nsubxids may be zero.
*
* lsn must be the WAL location of the commit record when recording an async
* commit. For a synchronous commit it can be InvalidXLogRecPtr, since the
* caller guarantees the commit record is already flushed in that case. It
* should be InvalidXLogRecPtr for abort cases, too.
*
* In the commit case, atomicity is limited by whether all the subxids are in
* the same CLOG page as xid. If they all are, then the lock will be grabbed
* only once, and the status will be set to committed directly. Otherwise
* we must
* 1\. set sub-committed all subxids that are not on the same page as the
* main xid
* 2\. atomically set committed the main xid and the subxids on the same page
* 3\. go over the first bunch again and set them committed
* Note that as far as concurrent checkers are concerned, main transaction
* commit as a whole is still atomic.
*
* Example:
* TransactionId t commits and has subxids t1, t2, t3, t4
* t is on page p1, t1 is also on p1, t2 and t3 are on p2, t4 is on p3
* 1\. update pages2-3:
* page2: set t2,t3 as sub-committed
* page3: set t4 as sub-committed
* 2\. update page1:
* set t1 as sub-committed,
* then set t as committed,
* then set t1 as committed
* 3\. update pages2-3:
* page2: set t2,t3 as committed
* page3: set t4 as committed
*
* NB: this is a low-level routine and is NOT the preferred entry point
* for most uses; functions in transam.c are the intended callers.
*
* XXX Think about issuing FADVISE_WILLNEED on pages that we will need,
* but aren't yet in cache, as well as hinting pages not to fall out of
* cache yet.
*/
~~~
The actual calling entry points are in transam.c; the functions in subtrans.c are low-level interfaces.
So what is a subtransaction?
Subtransactions are created when we use savepoints. Like its parent, a subtransaction may consume an XID. Once a subtransaction has an XID, atomic CLOG operations come into play, because the parent's and all subtransactions' CLOG entries must stay consistent.
When no XID is consumed, a subtransaction is identified by its SubTransactionId instead. For example, a stap trace of SubTransSetParent shows a subtransaction XID being registered:
~~~
...... pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").return, par:pageno=? entryno=? slotno=607466858 ptr=0
~~~
Open a new session and you will see that the subtransactions consumed XIDs too: the next allocated XID starts from 607466859.
~~~
postgres@digoal-> psql
psql (9.4.4)
Type "help" for help.
postgres=# select txid_current();
txid_current
--------------
607466859
(1 row)
~~~
[References]
src/backend/access/transam/clog.c
src/backend/access/transam/subtrans.c
src/backend/access/transam/transam.c
src/backend/access/transam/README
src/include/c.h
## CLOG Consistency and Asynchronous Commit
Asynchronous commit means a transaction returns without waiting for its WAL buffer to be fsynced to disk, and writing the CLOG does not wait for the XLOG to reach disk either.
pg_clog and pg_xlog are stored separately. Now consider: the pg_clog entry of a committed transaction has reached disk, its XLOG has not, and the database crashes right at that moment. During recovery the transaction's XLOG is missing, so the data cannot be restored to its final state, yet pg_clog claims the transaction committed. That would be inconsistent.
So for asynchronous transactions, before a CLOG page is written out, we must make sure the XLOG of its transactions has already been flushed to disk.
How does PostgreSQL track the relationship between a transaction and the LSN of the XLOG it produced?
It is not a one-to-one mapping: many transactions map to one LSN.
src/backend/access/transam/clog.c
LSN groups: for every 32 transactions, only the highest LSN among them is recorded, which saves space.
~~~
/* We store the latest async LSN for each group of transactions */
#define CLOG_XACTS_PER_LSN_GROUP 32 /* keep this a power of 2 */
/* number of LSN groups per CLOG page */
#define CLOG_LSNS_PER_PAGE (CLOG_XACTS_PER_PAGE / CLOG_XACTS_PER_LSN_GROUP)
#define GetLSNIndex(slotno, xid) ((slotno) * CLOG_LSNS_PER_PAGE + \
((xid) % (TransactionId) CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_LSN_GROUP)
~~~
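A minimal standalone sketch (assuming BLCKSZ = 8192, so CLOG_XACTS_PER_PAGE = 32768 and CLOG_LSNS_PER_PAGE = 32768 / 32 = 1024) showing how 32 consecutive XIDs share one LSN group slot:
~~~
#include <stdio.h>

#define CLOG_XACTS_PER_PAGE 32768                 /* 8192 bytes * 4 xacts/byte */
#define CLOG_XACTS_PER_LSN_GROUP 32
#define CLOG_LSNS_PER_PAGE (CLOG_XACTS_PER_PAGE / CLOG_XACTS_PER_LSN_GROUP)
#define GetLSNIndex(slotno, xid) ((slotno) * CLOG_LSNS_PER_PAGE + \
    ((xid) % (unsigned) CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_LSN_GROUP)

int main(void)
{
    /* xids 32800..32831 map to the same group; 32832 starts the next one */
    printf("%u %u %u\n",
           GetLSNIndex(0, 32800u),   /* -> 1 */
           GetLSNIndex(0, 32831u),   /* -> 1 */
           GetLSNIndex(0, 32832u));  /* -> 2 */
    return 0;
}
~~~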
The LSNs are stored in this data structure:
src/include/access/slru.h
~~~
/*
* Shared-memory state
*/
typedef struct SlruSharedData
{
......
/*
* Optional array of WAL flush LSNs associated with entries in the SLRU
* pages. If not zero/NULL, we must flush WAL before writing pages (true
* for pg_clog, false for multixact, pg_subtrans, pg_notify). group_lsn[]
* has lsn_groups_per_page entries per buffer slot, each containing the
* highest LSN known for a contiguous group of SLRU entries on that slot's
* page. (Only pg_clog needs to record group_lsn.)
*/
XLogRecPtr *group_lsn; // an array holding the highest LSN of each 32-transaction group
int lsn_groups_per_page;
......
~~~
src/backend/access/transam/clog.c
~~~
/*
 * lsn must be the WAL location of the commit record when recording an async
 * commit. For a synchronous commit it can be InvalidXLogRecPtr, since the
 * caller guarantees the commit record is already flushed in that case. It
 * should be InvalidXLogRecPtr for abort cases, too.
 */
void
TransactionIdSetTreeStatus(TransactionId xid, int nsubxids,
TransactionId *subxids, XidStatus status, XLogRecPtr lsn)
{
......
~~~
When a transaction's status is updated, the LSN of its LSN group is raised to the new maximum (an operation purely on the CLOG buffer):
~~~
/*
* Sets the commit status of a single transaction.
*
* Must be called with CLogControlLock held
*/
static void
TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno)
{
......
/*
* Update the group LSN if the transaction completion LSN is higher.
*
* Note: lsn will be invalid when supplied during InRecovery processing,
* so we don't need to do anything special to avoid LSN updates during
* recovery. After recovery completes the next clog change will set the
* LSN correctly.
*/
	if (!XLogRecPtrIsInvalid(lsn))
	{
		int		lsnindex = GetLSNIndex(slotno, xid);

		if (ClogCtl->shared->group_lsn[lsnindex] < lsn) // raise the group LSN
			ClogCtl->shared->group_lsn[lsnindex] = lsn;
	}
......
~~~
Marking a transaction tree as committed: the asynchronous variant takes an extra LSN argument, used to update the group's maximum LSN.
~~~
/*
* TransactionIdCommitTree
* Marks the given transaction and children as committed
*
* "xid" is a toplevel transaction commit, and the xids array contains its
* committed subtransactions.
*
* This commit operation is not guaranteed to be atomic, but if not, subxids
* are correctly marked subcommit first.
*/
void
TransactionIdCommitTree(TransactionId xid, int nxids, TransactionId *xids)
{
TransactionIdSetTreeStatus(xid, nxids, xids,
TRANSACTION_STATUS_COMMITTED,
InvalidXLogRecPtr);
}
/*
* TransactionIdAsyncCommitTree
* Same as above, but for async commits. The commit record LSN is needed.
*/
void
TransactionIdAsyncCommitTree(TransactionId xid, int nxids, TransactionId *xids,
XLogRecPtr lsn)
{
TransactionIdSetTreeStatus(xid, nxids, xids,
TRANSACTION_STATUS_COMMITTED, lsn);
}
/*
* TransactionIdAbortTree
* Marks the given transaction and children as aborted.
*
* "xid" is a toplevel transaction commit, and the xids array contains its
* committed subtransactions.
*
* We don't need to worry about the non-atomic behavior, since any onlookers
* will consider all the xacts as not-yet-committed anyway.
*/
void
TransactionIdAbortTree(TransactionId xid, int nxids, TransactionId *xids)
{
TransactionIdSetTreeStatus(xid, nxids, xids,
TRANSACTION_STATUS_ABORTED, InvalidXLogRecPtr);
}
~~~
Getting the LSN corresponding to an XID; note that for a frozen (non-normal) XID, an invalid LSN (InvalidXLogRecPtr) is returned.
src/backend/access/transam/transam.c
~~~
/*
* TransactionIdGetCommitLSN
*
* This function returns an LSN that is late enough to be able
* to guarantee that if we flush up to the LSN returned then we
* will have flushed the transaction's commit record to disk.
*
* The result is not necessarily the exact LSN of the transaction's
* commit record! For example, for long-past transactions (those whose
* clog pages already migrated to disk), we'll return InvalidXLogRecPtr.
* Also, because we group transactions on the same clog page to conserve
* storage, we might return the LSN of a later transaction that falls into
* the same group.
*/
XLogRecPtr
TransactionIdGetCommitLSN(TransactionId xid)
{
XLogRecPtr result;
/*
* Currently, all uses of this function are for xids that were just
* reported to be committed by TransactionLogFetch, so we expect that
* checking TransactionLogFetch's cache will usually succeed and avoid an
* extra trip to shared memory.
*/
if (TransactionIdEquals(xid, cachedFetchXid))
return cachedCommitLSN;
/* Special XIDs are always known committed */
if (!TransactionIdIsNormal(xid))
return InvalidXLogRecPtr;
/*
* Get the transaction status.
*/
(void) TransactionIdGetStatus(xid, &result);
return result;
}
/*
* Interrogate the state of a transaction in the commit log.
*
* Aside from the actual commit status, this function returns (into *lsn)
* an LSN that is late enough to be able to guarantee that if we flush up to
* that LSN then we will have flushed the transaction's commit record to disk.
* The result is not necessarily the exact LSN of the transaction's commit
* record! For example, for long-past transactions (those whose clog pages already migrated to disk), we'll return InvalidXLogRecPtr. Also, because
* we group transactions on the same clog page to conserve storage, we might
* return the LSN of a later transaction that falls into the same group.
*
* NB: this is a low-level routine and is NOT the preferred entry point
* for most uses; TransactionLogFetch() in transam.c is the intended caller.
*/
XidStatus
TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
{
int pageno = TransactionIdToPage(xid);
int byteno = TransactionIdToByte(xid);
int bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
int slotno;
int lsnindex;
char *byteptr;
XidStatus status;
/* lock is acquired by SimpleLruReadPage_ReadOnly */
slotno = SimpleLruReadPage_ReadOnly(ClogCtl, pageno, xid);
byteptr = ClogCtl->shared->page_buffer[slotno] + byteno;
status = (*byteptr >> bshift) & CLOG_XACT_BITMASK;
lsnindex = GetLSNIndex(slotno, xid);
*lsn = ClogCtl->shared->group_lsn[lsnindex];
LWLockRelease(CLogControlLock);
return status;
}
~~~
Everything above operates on the CLOG buffer. When a buffer page must be written to disk, the real consistency requirement applies: before the CLOG page is written, the XLOG of the transactions on that page must already be flushed to disk. This is exactly where the per-group max LSN recorded above comes in.
The code is as follows:
src/backend/access/transam/slru.c
~~~
/*
* Physical write of a page from a buffer slot
*
* On failure, we cannot just ereport(ERROR) since caller has put state in
* shared memory that must be undone. So, we return FALSE and save enough
* info in static variables to let SlruReportIOError make the report.
*
* For now, assume it's not worth keeping a file pointer open across
* independent read/write operations. We do batch operations during
* SimpleLruFlush, though.
*
* fdata is NULL for a standalone write, pointer to open-file info during
* SimpleLruFlush.
*/
static bool
SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
{
SlruShared shared = ctl->shared;
int segno = pageno / SLRU_PAGES_PER_SEGMENT;
int rpageno = pageno % SLRU_PAGES_PER_SEGMENT;
int offset = rpageno * BLCKSZ;
char path[MAXPGPATH];
int fd = -1;
/*
* Honor the write-WAL-before-data rule, if appropriate, so that we do not
* write out data before associated WAL records. This is the same action
* performed during FlushBuffer() in the main buffer manager.
*/
if (shared->group_lsn != NULL)
{
/*
* We must determine the largest async-commit LSN for the page. This
* is a bit tedious, but since this entire function is a slow path
* anyway, it seems better to do this here than to maintain a per-page
* LSN variable (which'd need an extra comparison in the
* transaction-commit path).
*/
XLogRecPtr max_lsn;
int lsnindex,
lsnoff;
lsnindex = slotno * shared->lsn_groups_per_page;
max_lsn = shared->group_lsn[lsnindex++];
for (lsnoff = 1; lsnoff < shared->lsn_groups_per_page; lsnoff++)
{
XLogRecPtr this_lsn = shared->group_lsn[lsnindex++];
if (max_lsn < this_lsn)
max_lsn = this_lsn;
}
		if (!XLogRecPtrIsInvalid(max_lsn)) // if max_lsn is a valid LSN, XLogFlush must be called first to push the WAL buffers up to that LSN out to disk
		{
/*
* As noted above, elog(ERROR) is not acceptable here, so if
* XLogFlush were to fail, we must PANIC. This isn't much of a
* restriction because XLogFlush is just about all critical
* section anyway, but let's make sure.
*/
START_CRIT_SECTION();
XLogFlush(max_lsn);
END_CRIT_SECTION();
}
}
......
~~~
Summary:
How is the write-WAL-before-data rule kept for asynchronous transactions?
pg_clog groups every 32 transactions and stores the group's maximum LSN in the SlruSharedData structure.
Before a clog buffer page is written to disk, the XLOG up to that page's recorded LSNs must first have been flushed to disk.
[References]
src/backend/access/transam/clog.c
src/include/access/slru.h
src/backend/access/transam/transam.c
src/backend/access/transam/slru.c