Inside PostgreSQL Shared Memory BRUCE MOMJIAN, ENTERPRISEDB October, 2008
Abstract POSTGRESQL is an open-source, full-featured relational database. This presentation gives an overview of the shared memory structures used by Postgres.
Outline
1. File storage format
2. Shared memory creation
3. Shared buffers
4. Row value access
5. Locking
6. Other structures
File System /data
[Diagram: Postgres processes accessing the /data directory]
File System /data/base
/data
    /base
    /global
    /pg_clog
    /pg_multixact
    /pg_subtrans
    /pg_tblspc
    /pg_twophase
    /pg_xlog
File System /data/base/db
/data
    /base
        /16385 (production)
        /1 (template1)
        /16821 (test)
        /17982 (devel)
        /21452 (marketing)
File System /data/base/db/table
/data
    /base
        /16385
            /24692 (customer)
            /27214 (order)
            /25932 (product)
            /25952 (employee)
            /27839 (part)
File System Data Pages
/data
    /base
        /16385
            /24692: 8k | 8k | 8k | 8k
[Diagram: the table file /24692 is stored as a sequence of 8k pages]
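Because a table file is a sequence of fixed-size 8k pages, the byte position of a page follows directly from its block number. The following is a minimal sketch of that arithmetic, not PostgreSQL's own storage code: `PAGE_SIZE`, `page_offset`, and `read_page` are hypothetical names chosen for illustration.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 8192          /* assumed page size, matching the 8k pages above */

/* Byte offset of page number blockno within a table file. */
static int64_t page_offset(uint32_t blockno)
{
    return (int64_t) blockno * PAGE_SIZE;
}

/* Read one page into buf (PAGE_SIZE bytes); returns 0 on success, -1 on error. */
static int read_page(FILE *fp, uint32_t blockno, unsigned char *buf)
{
    if (fseek(fp, (long) page_offset(blockno), SEEK_SET) != 0)
        return -1;
    return fread(buf, 1, PAGE_SIZE, fp) == PAGE_SIZE ? 0 : -1;
}
```

For example, `page_offset(3)` is 3 * 8192 = 24576, the first byte of the fourth page.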
Data Pages
/data/base/16385/24692: 8k | 8k | 8k | 8k
Layout of one 8k page:
    Page Header
    Item | Item | Item
    Tuple | Tuple | Tuple
    Special
File System Block Tuple
/data/base/16385/24692: 8k | 8k | 8k | 8k
Layout of one 8k page:
    Page Header
    Item | Item | Item
    Tuple | Tuple | Tuple
    Special
[Diagram: one Tuple is singled out from the page]
File System Tuple
Tuple layout:
    Header
    Value | Value | Value | Value | Value | Value
[Diagram: int4in('9241') supplies a Value on input; textout() returns 'Martin' from a Value on output]
Header fields:
    OID      - object id of tuple (optional)
    xmin     - creation transaction id
    xmax     - destruction transaction id
    cmin     - creation command id
    cmax     - destruction command id
    ctid     - tuple id (page / item)
    natts    - number of attributes
    infomask - tuple flags
    hoff     - length of tuple header
    bits     - bit map representing NULLs
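The `bits` field above is a bitmap with one bit per attribute. A minimal sketch of testing it follows, using the same convention as the `att_isnull` macro that appears later in this deck: a set bit means the attribute is present, a clear bit means it is NULL. The name `attr_is_null` is hypothetical, and `attnum` is assumed to be zero-based.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Test attribute attnum (zero-based) against a NULL bitmap.
 * Bit i lives in byte i / 8, at bit position i % 8.
 * A set bit means "value present"; a clear bit means "NULL".
 */
static bool attr_is_null(const uint8_t *bits, int attnum)
{
    return !(bits[attnum >> 3] & (1 << (attnum & 0x07)));
}
```

With a one-byte bitmap of `0x05` (binary 00000101), attributes 0 and 2 are present and attribute 1 is NULL.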
Tuple Header C Structures
    typedef struct HeapTupleFields
    {
        TransactionId t_xmin;       /* inserting xact ID */
        TransactionId t_xmax;       /* deleting or locking xact ID */

        union
        {
            CommandId     t_cid;    /* inserting or deleting command ID, or both */
            TransactionId t_xvac;   /* VACUUM FULL xact ID */
        }           t_field3;
    } HeapTupleFields;

    typedef struct HeapTupleHeaderData
    {
        union
        {
            HeapTupleFields  t_heap;
            DatumTupleFields t_datum;
        }           t_choice;

        ItemPointerData t_ctid;     /* current TID of this or newer tuple */

        /* Fields below here must match MinimalTupleData! */

        uint16      t_infomask2;    /* number of attributes + various flags */
        uint16      t_infomask;     /* various flag bits, see below */
        uint8       t_hoff;         /* sizeof header incl. bitmap, padding */

        /* ^ - 23 bytes - ^ */

        bits8       t_bits[1];      /* bitmap of NULLs -- VARIABLE LENGTH */

        /* MORE DATA FOLLOWS AT END OF STRUCT */
    } HeapTupleHeaderData;
Shared Memory Creation
postmaster --fork()--> postgres, postgres
Each of the three processes (the postmaster and the two postgres backends) has:
    Program (Text)
    Data
    Shared Memory
    Stack
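The effect of creating shared memory before fork() can be sketched in a few lines. This is only an illustration under stated assumptions: it uses an anonymous shared mmap() as a stand-in for whatever segment the postmaster actually allocates, and `shared_after_fork` is a hypothetical name. The point is that a region mapped as shared before fork() is the same memory in parent and child, unlike Data and Stack.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE         /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Map a shared region, fork a child that writes value into it, and
 * return what the parent reads back afterwards (or -1 on error).
 */
static int shared_after_fork(int value)
{
    int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;
    *shared = 0;

    pid_t pid = fork();
    if (pid < 0)
    {
        munmap(shared, sizeof(int));
        return -1;
    }
    if (pid == 0)
    {
        *shared = value;        /* child writes into the shared region */
        _exit(0);
    }

    int result = -1;
    if (waitpid(pid, NULL, 0) == pid)
        result = *shared;       /* parent sees the child's write */
    munmap(shared, sizeof(int));
    return result;
}
```

Calling `shared_after_fork(42)` returns 42 in the parent; had the integer lived in ordinary Data or Stack memory, the parent would still see 0.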
Shared Memory
    PROC
    Lightweight Locks
    XLOG Buffers
    Proc Array
    Lock Hashes
    CLOG Buffers
    LOCK
    Subtrans Buffers
    PROCLOCK
    Two-Phase Structs
    Auto Vacuum
    Multi-XACT Buffers
    Btree Vacuum
    Free Space Map
    Statistics
    Background Writer
    Synchronized Scan
    Shared Invalidation
    Buffer Descriptors
    Shared Buffers
    Semaphores
Shared Buffers
Buffer Descriptors:
    Pin Count - prevent page replacement
    LWLock - for page changes
Shared Buffers: 8k | 8k | 8k
Layout of the 8k page held in one buffer:
    Page Header
    Item | Item | Item
    Tuple | Tuple | Tuple
    Special
[Diagram: Postgres processes read() pages from /data/base/16385/24692 (8k | 8k | 8k | 8k) into the shared buffers and write() them back]
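The role of the pin count can be sketched as follows. This is a simplified, single-process illustration, not PostgreSQL's buffer manager: `BufferDescSketch`, `pin_buffer`, `unpin_buffer`, and `can_replace` are hypothetical names, and a real implementation must also protect the count against concurrent updates.

```c
#include <stdbool.h>

/* Hypothetical, simplified descriptor: one per 8k shared buffer. */
typedef struct BufferDescSketch
{
    int block;      /* disk block currently held by the buffer */
    int pin_count;  /* number of users currently accessing the page */
} BufferDescSketch;

/* Mark the page as in use so it cannot be replaced. */
static void pin_buffer(BufferDescSketch *desc)
{
    desc->pin_count++;
}

/* Drop one pin; the count never goes below zero. */
static void unpin_buffer(BufferDescSketch *desc)
{
    if (desc->pin_count > 0)
        desc->pin_count--;
}

/* A page may be replaced only while nobody has it pinned. */
static bool can_replace(const BufferDescSketch *desc)
{
    return desc->pin_count == 0;
}
```

A buffer pinned twice stays ineligible for replacement until both users have unpinned it.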
HeapTuples
Shared Buffers: 8k | 8k | 8k
Layout of the 8k page held in one buffer:
    Page Header
    Item | Item | Item
    Tuple | Tuple | Tuple
    Special
[Diagram: a HeapTuple in a Postgres process holds a C pointer to a Tuple stored in a shared buffer page]
Tuple layout:
    Header
    Value | Value | Value | Value | Value | Value
[Diagram: int4in('9241') supplies a Value on input; textout() returns 'Martin' from a Value on output]
Header fields: as listed under "File System Tuple"
Finding A Tuple Value in C
    Datum
    nocachegetattr(HeapTuple tuple, int attnum, TupleDesc tupleDesc, bool *isnull)
    {
        HeapTupleHeader tup = tuple->t_data;
        Form_pg_attribute *att = tupleDesc->attrs;

        /*
         * Excerpt: the slide omits the declarations and setup of tp (start
         * of the tuple's data area), bp (the NULL bitmap), and off (the
         * running byte offset).
         */

        {
            int         i;

            /*
             * Note - This loop is a little tricky.  For each non-null attribute,
             * we have to first account for alignment padding before the attr,
             * then advance over the attr based on its length.  Nulls have no
             * storage and no alignment padding either.  We can use/set
             * attcacheoff until we reach either a null or a var-width attribute.
             */
            off = 0;
            for (i = 0;; i++)       /* loop exit is at "break" */
            {
                if (HeapTupleHasNulls(tuple) && att_isnull(i, bp))
                    continue;       /* this cannot be the target att */

                if (att[i]->attlen == -1)
                    off = att_align_pointer(off, att[i]->attalign, -1, tp + off);
                else
                    /* not varlena, so safe to use att_align_nominal */
                    off = att_align_nominal(off, att[i]->attalign);

                if (i == attnum)
                    break;

                off = att_addlength_pointer(off, att[i]->attlen, tp + off);
            }
        }

        return fetchatt(att[attnum], tp + off);
    }
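The align-then-advance pattern of that loop can be shown in isolation. The sketch below is a simplification under stated assumptions: it handles only fixed-width, non-null attributes, and `align_up` and `fixed_attr_offset` are hypothetical names, not PostgreSQL functions.

```c
#include <stddef.h>

/* Round off up to the next multiple of align (align must be a power of two). */
static size_t align_up(size_t off, size_t align)
{
    return (off + align - 1) & ~(align - 1);
}

/*
 * Byte offset of attribute target within the tuple's data area.
 * lens[i] is the width of attribute i, aligns[i] its required alignment.
 * Assumes every attribute up to target is fixed-width and non-null.
 */
static size_t fixed_attr_offset(const size_t *lens, const size_t *aligns,
                                int target)
{
    size_t off = 0;

    for (int i = 0;; i++)               /* loop exit is at "break" */
    {
        off = align_up(off, aligns[i]); /* padding before the attribute */
        if (i == target)
            break;
        off += lens[i];                 /* advance over the attribute */
    }
    return off;
}
```

For three attributes of widths 4, 2, and 8 bytes, each aligned to its own width, the offsets are 0, 4, and 8: the third attribute would start at byte 6 but is padded up to the next multiple of 8.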
Value Access in C
    #define fetch_att(T,attbyval,attlen) \
    ( \
        (attbyval) ? \
        ( \
            (attlen) == (int) sizeof(int32) ? \
                Int32GetDatum(*((int32 *)(T))) \
            : \
            ( \
                (attlen) == (int) sizeof(int16) ? \
                    Int16GetDatum(*((int16 *)(T))) \
                : \
                ( \
                    AssertMacro((attlen) == 1), \
                    CharGetDatum(*((char *)(T))) \
                ) \
            ) \
        ) \
        : \
        PointerGetDatum((char *) (T)) \
    )
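Test And Set Lock Can Succeed Or Fail
The test-and-set exchanges a 1 into the lock word (which holds 0 or 1) and examines the previous value:
    Was 0 on exchange - Success (lock word changes from 0 to 1)
    Was 1 on exchange - Failure, lock already taken (lock word stays 1)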
Test And Set Lock x86 Assembler
    static __inline__ int
    tas(volatile slock_t *lock)
    {
        register slock_t _res = 1;

        /*
         * Use a non-locking test before asserting the bus lock.  Note that the
         * extra test appears to be a small loss on some x86 platforms and a small
         * win on others; it's by no means clear that we should keep it.
         */
        __asm__ __volatile__(
            "   cmpb    $0,%1   \n"
            "   jne     1f      \n"
            "   lock            \n"
            "   xchgb   %0,%1   \n"
            "1: \n"
    :       "+q"(_res), "+m"(*lock)
    :
    :       "memory", "cc");
        return (int) _res;
    }
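The same exchange can be expressed portably. The sketch below swaps the x86 assembler for C11 `<stdatomic.h>`; it is an illustration of the technique, not the code PostgreSQL uses, and `slock_sketch_t`, `tas_sketch`, and `unlock_sketch` are hypothetical names.

```c
#include <stdatomic.h>

/* Hypothetical lock word: 0 = free, 1 = taken. */
typedef atomic_uchar slock_sketch_t;

/*
 * Atomically store 1 into the lock word and return the previous value:
 * 0 means the lock was free and is now ours, 1 means it was already taken.
 */
static int tas_sketch(slock_sketch_t *lock)
{
    return (int) atomic_exchange(lock, 1);
}

/* Release the lock by storing 0 back into the lock word. */
static void unlock_sketch(slock_sketch_t *lock)
{
    atomic_store(lock, 0);
}
```

As with `tas()`, a return value of 0 means success and a nonzero value means the lock was already held.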
Spin Lock Always Succeeds
The test-and-set is retried until it succeeds:
    Was 0 on exchange - Success
    Was 1 on exchange - Failure, lock already taken; retry after a sleep of increasing duration
Spinlocks are designed for short-lived locking operations, like access to control structures. They should not be used to protect code that makes kernel calls or other heavy operations.
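A retry loop with sleeps of increasing duration might look like the following. This is a minimal sketch assuming C11 atomics and POSIX `nanosleep()`; the constants `SPINS_PER_DELAY`, `MIN_DELAY_USEC`, and `MAX_DELAY_USEC` and the function names are hypothetical values and names chosen for illustration, not PostgreSQL's.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L     /* exposes nanosleep() */
#endif
#include <stdatomic.h>
#include <time.h>

#define SPINS_PER_DELAY 100         /* assumed: busy-wait attempts before sleeping */
#define MIN_DELAY_USEC  1000L       /* assumed: first sleep, 1 ms */
#define MAX_DELAY_USEC  1000000L    /* assumed: longest sleep, 1 s */

/* Keep trying the test-and-set until it succeeds. */
static void spin_acquire(atomic_flag *lock)
{
    int  spins = 0;
    long delay_usec = MIN_DELAY_USEC;

    while (atomic_flag_test_and_set(lock))  /* true: lock already taken */
    {
        if (++spins < SPINS_PER_DELAY)
            continue;                       /* busy-wait a little first */

        struct timespec ts;
        ts.tv_sec = delay_usec / 1000000L;
        ts.tv_nsec = (delay_usec % 1000000L) * 1000L;
        nanosleep(&ts, NULL);

        spins = 0;
        delay_usec *= 2;                    /* sleep longer next time */
        if (delay_usec > MAX_DELAY_USEC)
            delay_usec = MAX_DELAY_USEC;
    }
}

/* Release the lock so that a waiting test-and-set can succeed. */
static void spin_release(atomic_flag *lock)
{
    atomic_flag_clear(lock);
}
```

The lock is declared as `atomic_flag lock = ATOMIC_FLAG_INIT;`. Because a waiter burns CPU or sleeps blindly, the protected section between `spin_acquire` and `spin_release` should stay very short.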
Lightweight Locks Sleep On Lock
[Diagram: the shared memory structures listed under "Shared Memory"]
A process requesting a lightweight lock attempts to acquire it and goes to sleep on a semaphore if the request fails. Spinlocks control access to the lightweight lock control structure.
Database Object Locks
[Diagram: the PROC, PROCLOCK, and LOCK structures and the Lock Hashes]
Proc
[Diagram: PROC slots marked empty, used, used, empty, used, empty, alongside the Proc Array]
Other Shared Memory Structures
[Diagram: the shared memory structures listed under "Shared Memory"]
Conclusion
Pink Floyd: Wish You Were Here