The storage manager code contains the following modules (with related C++ classes):
- Attention:
- In the storage manager, fixing a page in the buffer pool acquires a latch, and the verbs "fix" and "latch" are used interchangeably here. For searching convenience, the symbol @ has been inserted where latching/fixing occurs.
The I/O manager was, in the early days of SHORE, expected to have more responsibility than it now has; today it is little more than a wrapper for the Volume Manager. For the purpose of this discussion, the I/O manager and the volume manager are the same entity. A single read-write lock associated with the I/O-Volume manager serializes access: read-only functions acquire the lock in read mode; updating functions acquire it in write mode.
- Note:
- Much of the page- and extent-allocation code relies on the fact that access to the manager is serialized, and this lock is a major source of contention.
The volume manager handles formatting of volumes, allocation and deallocation of pages and extents in stores. Within a page, allocation of space is up to the manager of the storage structure (btree, rtree, or file).
The following sections describe the volume manager (VOL_M):
Files and indexes are types of stores. A store is a persistent data structure to which pages are allocated and deallocated, but which is independent of the purpose for which it is used (index or file).
Pages are reserved and allocated for a store in units of ss_m::ext_sz (enumeration value smlevel_0::ext_sz, found in sm_base.h), a compile-time constant that indicates the size of an extent.
An extent is a set of contiguous pages, represented by a persistent data structure extlink_t. Extents are linked together to form the entire structure of a store. The head of this list has a reference to it from a store-node (stnode_t), described below. Extents (extlink_t) are co-located on extent-map pages at the beginning of the volume.
Each extent has an owner, which is the store id (snum_t) of the store to which it belongs. Free extents are not linked together; they simply have no owner (signified by an extlink_t::owner == 0).
An extent id is a number of type extnum_t. It is arithmetically determined from a page number, and the pages in an extent are arithmetically derived from an extent number. The extnum_t is used in acquiring locks on extents and it is used for locating the associated extlink_t and the extent-map page on which the extlink_t resides. Scanning the pages in a store can be accomplished by scanning the list of extlink_t.
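As a minimal sketch of this arithmetic (the helper names here are hypothetical; the real code uses the compile-time constant smlevel_0::ext_sz):

#include <cstdint>

typedef uint32_t shpid_t;   // page number within a volume
typedef uint32_t extnum_t;  // extent number within a volume

// Hypothetical helpers illustrating the arithmetic described above.
extnum_t extent_of_page(shpid_t pnum, uint32_t ext_sz) {
    return pnum / ext_sz;            // extent number derived from a page number
}
shpid_t first_page_of_extent(extnum_t ext, uint32_t ext_sz) {
    return ext * ext_sz;             // pages of an extent are contiguous
}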
All the allocation metadata for a page reside in its extent, which contains a bitmap indicating which of its pages are allocated. One cannot determine the allocation status of a page from the page itself: the extent-map page must be inspected.
Extents also contain (unlogged, advisory) metadata in the form of a pbucketmap; this contains a bucket number for each page in the extent, indicating the amount of free space on the page. The map has meaning only to the File Manager. The file manager asks the I/O layer (which then descends to the volume manager for this purpose) to find the next page whose advisory bucket number is sufficiently large for the file manager's record-allocation needs. Between the time this request is made and the time the file manager fixes and inspects the page, the page might no longer have sufficient space. Nevertheless, this advisory bucket number in the extlink_t reduces the number of page-fixes needed to find a page with the required space, and it does improve the effective fill-factor for file pages.
Maintaining the bucket map is costly in that it fixes and dirties extent-map pages, even though it does not log these updates.
The bucket map is maintained only for extents whose pages are file_p (small-object) pages.
A stnode_t holds metadata for a store, including a reference to the first extent in the store. A store always contains at least one allocated extent, even if no pages in that extent are allocated. Scanning the pages in a store can be accomplished by scanning the list of extlink_t.
Store nodes are co-located on store-map pages at the beginning of a volume, after the extent maps. The volume is formatted to allow as many store nodes as there are extents.
There is a close interaction among various object identifiers, the data structures in which the objects reside, and the locks acquired on the objects.
Simply put:
- a volume identifier (ID) consists of an integral number, e.g., 1, represented in an output stream as v(1).
- a store identifier consists of a volume ID and a store number, e.g., 3, represented s(1.3).
- an index ID and a file ID are merely store IDs.
- a page ID contains a store ID and a page number, e.g., 48, represented p(1.3.48).
- a record ID for a record in a file contains a page ID and a slot number, e.g., 2, represented r(1.3.48.2).
Clearly, from a record ID, its page and slot can be derived without consulting any indices. It is also clear that records cannot move, which has ramifications for Space Reservation on a Page, described below. The Lock Manager understands these identifiers as well as extent IDs, and generates locks from identifiers.
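A minimal sketch of how these identifiers nest (illustrative layouts, not the storage manager's exact vid_t/stid_t/lpid_t/rid_t definitions):

#include <cstdint>

// Illustrative ID layouts: each richer ID embeds the one before it.
struct vid_t  { uint16_t vol; };                 // v(1)
struct stid_t { vid_t vol; uint32_t store; };    // s(1.3)
struct lpid_t { stid_t stid; uint32_t page; };   // p(1.3.48)
struct rid_t  { lpid_t pid; uint16_t slot; };    // r(1.3.48.2)

// From a record ID, its page and slot fall out with no index lookup:
lpid_t   page_of(const rid_t& r) { return r.pid; }
uint16_t slot_of(const rid_t& r) { return r.slot; }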
A volume contains these pre-defined structures:
- Header: page 0 (the first page) of the volume; contains :
- a format version #
- the long volume id
- extent size
- number of extents
- number of extents used for store 0 (see below)
- number of pages used for store 0 (see below)
- the first page of the extent map
- the first page of the store map
- page size
- store #0 : a "pseudo-store" containing the extent-map and store-map pages. This starts with page 1 (the second page) of the volume.
- store #1 : directory of the stores (used by the storage manager): this is a btree index mapping store-number to metadata about the store, including (but not limited to) the store's use (btree/rtree/file-small-object-pages/file-large-object-pages), and, in the case of indices, the root page of the index, and, in the case of files, the store number of the associated large-object-page store.
- store #2 : root index (for use by the server)
Finding an extent to allocate to a store requires searching through the extent-map pages for an extent that is both unallocated (owner is zero) and not locked.
- The storage manager caches the minimum free extent number with which to start a search; this number is reset to its static lower bound when the volume is mounted, meaning that the first extent operation after a mount starts its search at the head of the volume. Subsequent searches start at the lowest known free extent number.
- Extent-map pages are latched as needed for this linear search @.
- The first appropriate extent found is IX-locked. These locks are acquired explicitly from the lock manager; extent locks are not in the lock hierarchy. A sketch of the search loop appears below.
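The sketch below illustrates that loop under stated assumptions; fix_extlink, unfix_extlink, and try_ix_lock_extent are hypothetical stand-ins for volume-manager and lock-manager internals:

#include <cstdint>

typedef uint32_t extnum_t;
struct extlink_t { uint32_t owner; /* pmap, pbucketmap, prev, next, ... */ };

extlink_t* fix_extlink(extnum_t e);        // latches the extent-map page @
void       unfix_extlink(extnum_t e);
bool       try_ix_lock_extent(extnum_t e); // conditional lock-manager request

// Linear search for the first extent that is both unowned and not locked.
extnum_t find_free_extent(extnum_t min_free, extnum_t num_extents) {
    for (extnum_t e = min_free; e < num_extents; ++e) {
        extlink_t* link = fix_extlink(e);
        bool found = (link->owner == 0) && try_ix_lock_extent(e);
        unfix_extlink(e);
        if (found)
            return e;  // IX lock held; caller updates the cached minimum
    }
    return 0;          // illustrative convention: no free extent found
}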
Allocating a set of extents to a store is a matter of linking them together and then appending the list to the tail of the store's linked list:
- Locks are not acquired for previous and next extents in the list; EX-latches protect these structures;
- New extents are always linked in at the tail of the list.
- One extent-map page is fixed at a time @. Entire portions of the extent list that reside on the same extent-map page are linked while holding the page latched, and logged in a single log record. This is useful only for creating large objects; all other page-allocations result in allocation of one or zero extents.
Allocation is handled slightly differently in the two contexts in which it is performed.
Extents are freed:
- when a transaction deletes a store and commits, and
- when a transaction has deleted the last page in the extent; this involves acquiring an IX lock on the extent, but does not preclude other transactions from allocating pages in the same extent. Also, since the transaction might abort, the extent must not be re-used for another store by another transaction. Furthermore, the page-deleting transaction could re-use the pages. For these reasons, extents are left in a store until the transaction commits (see Transaction Manager and Commit-Time Handling of Extent-Deallocation, and Commit-Time Handling of Store-Destruction). Thus, deallocating an extent before a transaction commits comprises:
- clearing the extent-has-allocated-page bit in the already-held extent IX lock; this involves no page-fixes.
At commit time, the transaction deallocates extents in two contexts:
- When destroying stores that were marked for deletion by the transaction, and
- While freeing extents marked for freeing by the transaction as a result of (incremental) page-freeing.
For the latter case, the transaction asks the transaction manager to identify all extents on which it has locks (the lock manager's job). If
- the lock manager can upgrade the extent's lock to EX mode, and
- the extent still contains no allocated pages,
then the lock manager frees the extent. (An optimization avoids excessive page-fixing here: the extent lock contains a bit indicating whether the extent contains any allocated pages.)
Deallocating an extent from a store (at transaction-commit) comprises:
- Identifying the previous- and next- extent numbers;
- Identifying the pages containing the extlink_t structures for the extent to be freed and for the previous- and next- extent structures, which may mean as many as three pages;
- Sorting the page numbers and EX-latching @ the extent-map pages in ascending order to avoid latch-latch deadlocks;
- Ensuring that the previous- and next- extent numbers on the to-be-freed extent have not changed (so that we know that we have fixed the right pages). There should be no opportunity for these links to change, since the volume manager is a monitor (protected by a read-write lock);
- Updating the extents and physically logging each of the updates.
- Updating caches related to the store.
If the extent is in a still-allocated store, the entity freeing the extent (the lock manager) will have acquired an EX lock on the extent for the transaction. If the extent is part of a destroyed store, the store will have an EX lock on it and this will prevent any other transaction from trying to deallocate the extent.
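The sort-then-latch step above can be sketched as follows (ex_latch is a hypothetical stand-in for the EX-latch operation on an extent-map page):

#include <algorithm>
#include <cstdint>

typedef uint32_t shpid_t;

void ex_latch(shpid_t page);  // hypothetical: EX-latch an extent-map page @

// Latch the (up to three) extent-map pages in ascending page-number order,
// latching each distinct page once, to avoid latch-latch deadlocks.
void latch_map_pages(shpid_t p_prev, shpid_t p_self, shpid_t p_next) {
    shpid_t pages[3] = { p_prev, p_self, p_next };
    std::sort(pages, pages + 3);
    for (int i = 0; i < 3; ++i) {
        if (i > 0 && pages[i] == pages[i-1])
            continue;      // extlink_t structures co-located on one page
        ex_latch(pages[i]);
    }
}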
This extent-based design has the following implications:
- Before it can be used, a volume must be formatted for a given size so that the number of extent map pages and store map pages can be established;
- Extending the volume requires reformatting, so the server is forced to perform database reorganization in this case;
- Location of an extlink_t can be determined arithmetically from a page number (thus also from a record ID), which is cheaper than looking in an index of any sort; however, this means that extra work is done to validate the record ID (that is, its page's store-membership);
- Because the store-membership of a page is immaterial in locating the page, the buffer-pool manager need not pay any attention to stores; in fact, it reduces I/O costs by sorting pages by {volume,page number} and writing contiguous pages in one system call;
- Because the store-membership of an extent is immaterial in locating the extent, extent locks do not contain a store number, and extent locks can be acquired regardless of the extent's allocation state. One can test the "locked" status of an extent prior to allocating it.
- Extent-map pages tend to be hot (remain in the buffer pool), which minimizes I/O;
- Extent-map pages could be a source of latch contention; however, they are latched only in the volume manager, which redirects the contention to the volume mutex;
- The number of page fixes required for finding free extents is bounded by the number of extent-map pages on the volume, and some operations employ O(n) (linear) searches, as described in the item below;
- Pages may be reserved for allocation in a file without being allocated, so optimal use of the volume requires that the allocated extents be searched before new extents are allocated;
- Deallocating a page and changing store flags (logging attributes) of a page or store does not require touching the page itself; entire stores are deallocated by latching and updating only the required extent-map pages;
- The high fan-out of extent-map pages to pages ensures that deallocating stores is cheap;
- Clustering of pages is achieved, which is useful for large objects and can be helpful for file scans;
- Prefetching of file pages can be achieved by inspection of extent maps;
- Files need not impose their own structure on top of stores: store order is file order; the fact that the storage manager avoids superimposing a file structure has ramifications of its own.
The volume layer does not contain any means of spreading out or clustering extents over extent-map pages for clustering (or for latch-contention mitigation).
For each store the storage manager keeps certain metadata about the store in a directory, which is an index maintained by the Directory Manager.
Creating a store comprises:
- Finding an unused store number. This is a linear search through the store node map pages for a stnode_t (one with no associated extent list) that is not locked. The linear search starts at a revolving location. In the worst case, it will search all the stnode_t and therefore fix @ all the store map pages;
- Acquiring a store lock in EX mode, long-duration;
- Finding an extent for the store (details here);
- Updating the stnode_t @ to reserve it (legitimize the store number);
- Logging the store-creation operation without the first extent;
- Allocating the extent to the store (details here) (we cannot allocate an extent without a legitimate owning store to which to allocate it);
- Updating the stnode_t to add the first extent @;
- Logging the store operation that adds the first extent to the store.
Destroying a store before transaction-commit comprises these steps:
- Verify that the store number is a valid number (no latches required);
- Mark the store as deleted:
- Latch the store map page (that holds the stnode_t) for the store @;
- EX-lock the store (long duration);
- Mark the stnode_t as "t_deleting_store" @ (meaning it is to be deleted at end-of-transaction, see Transaction Manager);
- Log this marking store operation;
- Add the store to a list of stores to free when the transaction commits (See Transaction Manager);
- Clear caches related to the store.
Removing a store (at commit time) marked for deletion comprises these steps:
- Verify that the store is still marked for deletion (partial rollback does not inspect the list of stores to delete, and in any case this has to be done on restart because the list is transient) @,
- Update the stnode_t to indicate that the store's extents are about to be deallocated @;
- Log the above store operation for crash recovery;
- Free (really) the extents in the store @;
- Update the stnode_t to clear its first extent @;
- Log the store as having been deleted in toto;
- Clear cached information for this store.
Allocating an extent to a store does not make its pages "visible" to the server. They are considered "reserved". Pages within the extent have to be allocated (their bits in the extent's bitmap must be set).
When the store is used for an index, the page is not visible until it has been formatted and inserted (linked) into the index. In the case of files, however, the situation is complicated by the lack of linkage of file pages by another means. Pages used for large objects are referenced through an index in the object's metadata, but pages used for small objects become part of the file when the page's extent bitmap indicates that it is allocated. This has some significant ramifications:
- neither deallocation nor allocation of file pages requires latching of previous and next pages for linking purposes;
- the file manager and index managers handle page allocation somewhat differently;
- the file manager must go to great lengths to ensure that the page is not accessible until both allocated and formatted, and to ensure the safety of the file structure in event of error or crash.
Despite the fact that the intended uses of the page require different handling, a significant part of page allocation is generic and is handled by the volume layer. To handle some of the contextual differences, the volume layer uses a callback to the calling manager.
Allocating a page to a store comprises these steps (in fact, the code is written to allocate a number of pages at once; this description is simplified):
- Locate a reserved page in the store
- if the page must be appended to the store, special precautions are needed to ensure that the reserved page is the next unallocated page in the last extent of the store @;
- if the page need not be appended, any reserved page will do
- look in the cache for extents with reserved pages
- if none found and we are not in append-only context, search the file's extents for reserved pages @; (This can be disabled by changing the value of the constant never_search in sm_io.cpp.)
- If no reserved pages are found, find and allocate an extent (details here);
- Acquire an IX lock on the extent in which we found reserved page(s)
- Find a reserved page in the given extent that has no lock on it (if no such thing exists, skip this extent and find another @);
- Acquire a lock on the available page (mode IX or EX, and duration depend on the context)
- the file manager when allocating a small-record file page uses IX mode, long(commit-) duration, which means that deallocated pages will not be reallocated until after the deallocating transaction commits (see Commit-Time Handling of Extent-Deallocation);
- the file manager when allocating pages for large records uses long duration, IX or EX mode, depending on the exact use of the page for (various) large-record structures; long-duration locks mean that deallocated pages will not be reallocated until the deallocating transaction commits;
- btree index manager uses EX mode, instant duration, meaning that deallocated pages can be re-used;
- rtree index manager uses EX mode, instant duration, meaning that deallocated pages can be re-used;
- Call back to (file or index) manager to accept or reject this page:
- file manager allocating a small-record file page fixes @ the page (for formatting) and returns "accept";
- file manager allocating a large-record page just returns "accept";
- rtree and btree index managers just return "accept";
- Log the page allocation, set "has-page-allocated" indicator in the extent lock.
As mentioned above, there are times when the volume manager is told to allocate new pages at the end of the store (append). This happens when the file manager allocates small-object file pages unless the caller passes in the policy t_compact, indicating that it should search the file for available pages. The server can choose its policy when calling ss_m::create_rec (see pg_policy_t). When the server uses append_file_i, only the policy t_append is used, which enforces append-only page allocation.
Deallocating a page in a store comprises these steps:
- Acquire a long-duration EX lock on the page;
- Verify the store-membership of the page if required to do so (by the file manager in cases in which it was forced to unfix and fix @ the page);
- Acquire a long-duration IX lock on the page's extent;
- Fix @ the extent-map page, update the extent's bitmap, and log the update.
- If the extent now contains only reserved pages, mark the extent as removable (clear the extent-has-allocated-page bit in the already-held IX lock).
The volume manager does not contain any persistent indices to assist in finding free pages in a store's allocated extents (which it can do only when not forced to append to the store). To minimize the need for linear searches through the store's extents, the volume manager caches information about reserved pages in each store, the reserved-page cache. This is a map (set of pairs mapping snum_t -> extnum_t) to find extents already allocated to a store that contain free pages. This cache is consulted before new extents are allocated to a store. Since after restart the cache is necessarily empty, it is primed when first needed for the purpose of allocating anything for the store.
The volume manager also keeps a last-page cache. This is a map from snum_t to extnum_t; the extent number is that of the last extent allocated to the store. From this one can arithmetically derive the last (reserved or possibly allocated) page in the store. Note that the last page allocated to the store might be anywhere in the store's extent list; all pages after it might be reserved but not allocated.
Priming caches is an expensive operation. It is not done on each volume mount, because volumes are mounted and dismounted several times during recovery, and priming on each mount would be prohibitive. Every attempt to allocate a page checks the store's reserved-page cache; if it is empty, it is primed. Every time an extent is allocated or a last-page-in-file is located, the last-page cache is updated.
Pages in a volume come in a variety of page types, all the same size. The size of a page is a compile-time constant, controlled by a build-time configuration option (see Configuration Options). The default page size is 8192 bytes.
All pages are slotted (those that don't need the slot structure may use only one slot) and have the following layout:
- header, including
- lsn_t log sequence number of last page update
- page id
- links to next and previous pages (used by some storage structures)
- page tag (indicates type of page)
- space management metadata (space_t)
- store flags (logging level metadata)
- slots (grow down)
- slot table array of pointers to the slots (grows up)
- footer (copy of log sequence number of last page update)
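A schematic of this layout (illustrative field names and stand-in types; the real page class definitions differ):

#include <cstdint>

typedef uint64_t lsn_sketch_t;   // stand-in for the real lsn_t
typedef uint32_t shpid_t;

// Schematic page layout, following the list above.
struct page_sketch_t {
    lsn_sketch_t lsn;          // header: LSN of the last page update
    uint64_t     pid;          // header: page ID (store ID and page number)
    shpid_t      prev, next;   // header: sibling links (used by some structures)
    uint16_t     tag;          // header: page type
    uint8_t      space[16];    // header: space-management metadata (space_t)
    uint32_t     store_flags;  // header: logging-level metadata
    uint8_t      body[8104];   // slots grow down; the slot table (array of
                               // slot pointers) grows up (size illustrative)
    lsn_sketch_t lsn_copy;     // footer: copy of the header LSN
};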
Each page type is a C++ class that derives from the base class page_p, which implements functionality common to all (or most) page types. The types are as follows:
- extlink_p : extent-link pages, used by vol_m
- stnode_p : store-node pages, used by vol_m
- file_p : slotted pages of file-of-record, used by file_m
- lgdata_p : pages of large records, used by file_m
- lgindex_p : index pages for large records, used by file_m
- keyed_p : slotted pages of indexes, used by btree_m
- zkeyed_p : slotted pages of indexes, used by btree_m
- rtree_p : slotted pages of spatial indexes, used by rtree_m
Issues specific to the page types will be dealt with in the descriptions of the modules that use them.
Different storage structures offer different opportunities for fine-grained locking and need different means of allocating space within a page. Special care is taken to reserve space on a page when slots are freed (records are deleted) so that rollback can restore the space on the page. Page types that use this space reservation have page_p::rsvd_mode() == true.
In the case of B-trees, space reservation is not used because undo and redo are handled logically -- entries can be re-inserted in a different page. But in the case of files, records are identified by physical ID, which includes page and slot number, so records must be reinserted just where they first appeared.
Holes in a page are coalesced (moved to the end of the page) as needed, when the total free space on the page satisfies a need but the contiguous free space does not. Hence, a record truncation followed by an append to the same record does not necessarily cause the shifting of other records on the same page.
A count of free bytes is maintained for all pages. Free-space metadata are maintained for rsvd_mode() pages:
- When a transaction releases a slot on a page with rsvd_mode(), the slot remains "reserved" for use by the same transaction.
- That slot is not free to be allocated by another transaction until the releasing transaction commits. This is because if the transaction aborts, the slot must be restored with the same slot number. Not only must the slot number be preserved, but the number of bytes consumed by that slot must remain available lest the transaction abort.
- The storage manager keeps track of the youngest active transaction that is freeing space (i.e., "reserving" it) on the page and the number of bytes freed ("reserved") by the youngest transaction.
- When the youngest transaction to reserve space on the page becomes older than the oldest active transaction in the system, the reserved space becomes free. This check for freeing up the reserved space happens whenever a transaction tries to allocate space on the page.
- During rollback, a transaction can use any amount of reserved space, but during forward processing, it can only use space it reserved, and that is known only if the transaction in question is the youngest transaction described in the above paragraph.
- The changes to space-reservation metadata (space_t) are not logged. The actions that result in updates to this metadata are logged (as page mark and page reclaim).
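A minimal sketch of the freeing rule described above, assuming transaction IDs increase monotonically with start order (so a younger transaction has a larger ID):

#include <cstddef>
#include <cstdint>

typedef uint64_t tid_t;

struct rsvd_space_sketch {
    tid_t  youngest_reserver;  // youngest xct that freed ("reserved") space here
    size_t reserved_bytes;     // bytes reserved on this page
};

// Called when a transaction tries to allocate space on a rsvd_mode() page.
size_t usable_bytes(rsvd_space_sketch& s, size_t free_bytes, tid_t oldest_active) {
    if (s.youngest_reserver < oldest_active) {
        // Every transaction that reserved space here has ended, so the
        // reservations can no longer be needed for rollback: free them.
        free_bytes += s.reserved_bytes;
        s.reserved_bytes = 0;
    }
    return free_bytes;
}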
A file is a group of variable-sized records. A record is the smallest persistent datum that has identity. A record may also have a "user header", whose contents are for use by the server. As records vary in size, so their storage representation varies. The storage manager changes the storage representation as needed.
A file comprises two stores. One store is allocated for slotted (small-record) pages, called file_p pages. One store is allocated for large records, and contains lgdata_p and lgindex_p pages. Small records are those whose size is less than or equal to sm_config_info_t.max_small_rec. A record larger than this has a slot on a small-record page, which slot contains metadata referring to pages in the large-record store. The scan order of a file is the physical order of the records in the small-record store.
Every record, large or small, has the following metadata in the record's slot on the file_p page; these data are held in a rectag_t structure:
struct rectag_t {
uint2_t hdr_len;
uint2_t flags;
smsize_t body_len;
};
The flags have these values:
- t_small : a small record, entirely contained on the file_p
- t_large_0 : a large record, the slot on the file_p contains the user header, while the body is a list of chunks (pointers to contiguous lgdata_p pages)
- t_large_1 : a large record, the slot on the file_p contains the user header, while the body is a reference to a single lgindex_p page, which is the root of a 1-level index of lgdata_p pages.
- t_large_2 : like t_large_1 but the index may be two levels deep. This has not been implemented.
Internally (inside the storage manager), the class record_t is a handle on the record's tag and is the class through which the rectag_t is manipulated.
A record is exposed to the server through a set of ss_m methods (ss_m::create_rec, ss_m::append_rec, etc), and through the pin_i class.
- Attention:
- All updates to records are accomplished by copying out part or all of the record from the buffer pool to the server's address space, performing updates there, and handing the new data to the storage manager. User (server) data are not updated directly in the buffer pool.
The server may cause the file_p and at most one large data page to be pinned for a given record through the pin_i class; the server must take care not to create latch-latch deadlocks by holding a record pinned while attempting to pin another. An ordering protocol among the pages pinned must be observed to avoid such deadlocks.
- Note:
- The system only detects lock-lock deadlocks. Deadlocks involving mutexes or latches or other blocking mechanisms will cause the server to hang.
Creating a file comprises these steps:
- Create a store of file_p pages for the record headers and small-record bodies (details are here);
- Create a store of lgindex_p and lgdata_p pages for the large objects (details are here);
- Acquire a long-duration EX lock on the store id;
- Create the "file structure" on the first store. This means a single page is allocated and formatted as a file_p page with no records in use (details are here); ;
- Insert the file meta-data in the file-directory index (see Directory Manager). This involves several page fixes for the btree insertion @.
Destroying a file comprises these steps:
- Consult the directory entry for the file's small-object store, thus acquiring an EX long-duration lock on the file; this will involve page fixes for reading the btree directory.
- Verify that the store is really used for a file, and still exists; this requires SH-latching a store map page to inspect the stnode_t for the store @;
- Mark the store for destruction (details are in Creating and Destroying Stores ).
- Remove transient histogram information for the store;
- From the store directory, determine the large-object store associated with the file;
- Mark the large-object store for destruction (details are in Creating and Destroying Stores ).
- Remove the file's metadata from the directory @ (Directory Manager).
The file manager caches information about page utilization for (small-object) pages in each file so that it can control fragmentation of files. This cached information takes the form of a histoid_t, which contains a heap and a histogram.
The heap keeps track of the amount of free space in (recently-used) pages in the file, and it is searchable so that it can return the page with the smallest free space that is larger than a given value in bytes. The free-space-on-page value that it uses for this purpose is the most liberal value -- it is possible that some of the space on the page is reserved for a transaction that has not yet committed (if that transaction destroyed a record, it can use space that other transactions cannot).
- Bug:
- GNATS 157 The histoid_t heap should have some size limit (number of entries).
The histogram has a small number of buckets, each of which counts the number of pages in the file containing free space between the bucket min and the bucket max.
When a record is created, the file manager tries to use an already-allocated page that has space for the record. It determines what space is needed for the record from the length hint and the data given in the ss_m::create_rec call.
Three policies can be used (in combination) to search for pages with space in which to create a new record:
- t_cache : look in the heap for a page with space.
- t_compact : if the histograms say there are any pages with sufficient space somewhere in the file, do a linear search of the file for such a page, updating histogram heap metadata in the process. This is potentially costly but useful when the file has not been inspected since the last restart, because the heap has no records for the file except those inserted due to a record-update or removal.
- t_append : append the new record to the file
Using append_file_i to create records means only t_append is used, ensuring that the record will always be appended to the file. The policy can be given on the ss_m::create_rec call; the default is t_cache | t_compact | t_append.
If the file manager does not find a page in the file with sufficient space for the record, or if it must append to the end of the file and the last page hasn't the needed space, the file manager asks the I/O manager to allocate a page.
Once the file manager has located a page with sufficient space to create the record, the I/O manager and volume manager worry about Space Reservation on a Page.
Creating a record comprises these steps:
- Estimate the space required for the record, based on the sizes of the data and header vectors and the length-hint given on the ss_m::create_rec call.
- Choose a record implementation for the given size (a small object or a large object, which determines the amount of space needed in the slot of the file_p page)
- Find and lock a slot in a page:
- if appending to the file, find and EX-lock (with long-duration) the next available slot in the last page of the file. Finding the last page in the file requires SH-latching extent-map pages @; the storage manager caches last-page information to reduce page-fixing to find the last page of a store. Searches for the last page of a store start with the cached last-page's extent if it is still part of the store; otherwise they start with the head of the store's list, and update the cache.
- If there is no such slot or if it is not large enough, allocate a new page (at the end of the file).
- if not appending to the file, consult the histograms to find a page already in the file, one that contains a slot large enough for the new record.
- Once we have located a (potentially) suitable page, verify that the page is indeed legitimate (since we used transient, cached information to locate this page). If not, reject the page. Note that normally we cover latches with locks to avoid deadlocks, but in this case we must latch first because we have no idea which slot to lock, nor do we know if the page is still in the expected file. We may have to try several pages before finding one that is truly suitable, so this entire protocol is handled in the histogram code. The protocol is as follows:
- Conditionally EX-latch @ the page. If we cannot do so, give up on this page and try another;
- Once the page is fixed, verify that its page ID contains the expected store ID (to detect a race) (if not, reject this page and find another);
- Check the allocation status of the page:
- try to IS-lock the page, and if we cannot do so immediately, we reject the page and try another;
- verify that the page is allocated in its extent, and that the extent's owner is the expected store. This involves SH-latching @ the extent map page while holding the file page latched;
- Acquire an EX lock on the next available slot with enough space (space that is usable by this transaction, subject to Space Reservation on a Page);
- Once we have a suitable page with an EX record lock, create the record.
Under the best of circumstances, creating a small record involves three page latches @: the file page (in which the record is inserted) is latched, the extent-map page is fixed to verify the file page's allocation status, and, after the record is created, the extent-map page is re-latched to update the space-utilization metadata for the extent.
Updating a record (by way of a server-provided record ID) consists in these steps, which may require up to two extra page latches (besides latching the page containing the record):
- Latch the page on which the record resides @;
- Verify the legitimacy of the record ID (see Record Access by Record ID);
- Verify that the file page still contains the record identified by the record ID;
- Perform the update.
Freeing a record comprises these steps:
- EX-lock the record (with long duration);
- From the record ID, determine its containing page;
- EX-latch the page @;
- Mark the slot free, releasing the space but leaving it reserved (see Space Reservation on a Page);
- If the slot is the last used slot on the page, and the page is not the first allocated page in the file, free the page (finding the first page in the file requires SH-latching a store-map page @ and at least one extent-map page @);
- Update the histograms to reflect the space on the page and whether the page is still in the file. This may change the page's bucket and require EX-latching @ the extent-map page to update the pspacemap.
When a server issues a storage manager request to read or update a record (using any ss_m::update_rec or pin_i::pin or pin_i::repin, that is, any method that takes a record identifier), the storage manager must verify the legitimacy of the record identifier. (If the storage manager performs an update based on a stale record ID, unrecoverable errors can result.) Verifying the legitimacy of a record ID is, unfortunately, an expensive operation:
- Verify that the store ID on the page matches that in the record ID;
- Verify that page is a file page;
- Verify that the store exists (inspect the stnode_t for the store @ )
- Verify that the page is allocated to the store (fix the extlink_t @ and verify that its owner is still the store in question and that the page's bit is set in the extent's page map).
The storage manager does not inspect the store's metadata (directory index entry) to see that it is a file store, because if the page is still allocated to the store and the page is a file page, the store must still be a file store.
Free space is managed in several ways:
- Free extents have no owner (persistent datum in the extlink_t).
- Allocated extents with free pages are cached by the volume manager (transient data).
- For file pages, allocated extents contain a map of the free-space bucket to which each of their pages belongs (persistent, in the extlink_t).
- The volume manager caches the last page in a file (transient).
- The volume manager keeps a cache of extents in a store that contain reserved pages (transient).
- The volume manager caches the lowest unallocated extent in a volume (transient).
- The file manager maintains a heap and histograms (transient) that cache the free-space bucket data on the pages.
The values associated with keys in a B-tree index are opaque to the storage manager, except when IM (Index Management locking protocol) is used, in which case the value is treated as a record ID, but no integrity checks are done. It is the responsibility of the server to see that the value is legitimate in this case.
B-trees can be bulk-loaded from files of sorted key-value pairs, as long as the keys are in lexicographic form.
- Bug:
- GNATS 116 Btree doesn't sort elements for duplicate keys in bulk-load. This is a problem inherited from the original SHORE storage manager.
The implementation of B-trees is straight from the Mohan ARIES/IM and ARIES/KVL papers. See [MOH1], which covers both topics.
Those two papers give a thorough explanation of the arcane algorithms, including logging considerations. Anyone considering changing the B-tree code is strongly encouraged to read these papers carefully. Some of the performance tricks described in these papers are not implemented here. For example, the ARIES/IM paper describes performing logical undo of insert operations only when physical undo is not possible; the storage manager always undoes inserts logically.
- Bug:
- GNATS 137 Latches can now be downgraded; btree code should use this.
The spatial indexes in the storage manager are R*-trees, a variant of R-trees that performs frequent restructuring to yield higher performance than normal R-trees. The entire index is locked. See [BKSS].
All storage structures created by a server have entries in a B+-Tree index called the store directory or just directory. This index is not exposed to the server.
The storage manager maintains some transient and some persistent data for each store. The directory's key is the store ID, and the value it returns from a lookup is a sdesc_t ("store descriptor") structure, which contains both the persistent and transient information.
The persistent information is in a sinfo_s structure; the transient information is resident only in the cache of sdesc_t structures that the directory manager maintains.
The metadata include:
- what kind of storage structure uses this store (btree, rtree, file)
- if a B-tree, is it unique and what kind of locking protocol does it use?
- what stores compose this storage structure (e.g., if file, what is the large-object store and what is the small-record store?)
- what is the root page of the structure (if an index)
- what is the key type if this is an index
The lock manager understands the following kinds of locks:
- volume
- extent
- store
- page
- kvl
- record
- user1
- user2
- user3
- user4
Lock requests are issued with a lock ID (lockid_t), which encodes the identity of the entity being locked, the kind of lock, and, by inference, a lock hierarchy for a subset of the kinds of locks above. The lock manager does not insist that lock identifiers refer to any existing object.
The lock manager enforces two lock hierarchies:
- Volume - store - page - record
- Volume - store - key-value
Note that the following lock kinds are not in any hierarchy: extent, user1, user2, user3, and user4.
Other than the way the lock identifiers are inspected for the purpose of enforcing the hierarchy, lock identifiers are considered opaque data by the lock manager.
The lockid_t structure can be constructed from the IDs of the various entities in (and out of) the hierarchy; see lockid_t and the example lockid_test.cpp.
The hierarchy is used for implicit acquisition of locks as follows:
- for each parent lock in the hierarchy, determine its lock mode based on the mode of the child:
- parent_mode[child IS or SH] = IS
- parent_mode[child IX or SIX or UD or EX] = IX
- parent_mode[child none] = none
- for each parent lock in the hierarchy that is not already held in sufficient mode, acquire the parent lock in the mode determined above
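In code form, the parent-mode rule above amounts to the following sketch (the enumeration shown is illustrative; the real lock manager's lock_mode_t differs):

enum lock_mode_t { NL, IS, IX, SH, SIX, UD, EX };

// Parent mode implied by a child lock mode, per the table above.
lock_mode_t parent_mode(lock_mode_t child) {
    switch (child) {
    case IS: case SH:
        return IS;
    case IX: case SIX: case UD: case EX:
        return IX;
    default:
        return NL;   // child none => parent none
    }
}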
The lock manager escalates up the hierarchy by default. The escalation thresholds are based on run-time options. They can be controlled (set, disabled) on a per-object level. For example, escalation to the store level can be disabled when increased concurrency is desired. Escalation can also be controlled on a per-transaction or per-server basis.
Locks are acquired by storage manager operations as appropriate to the use of the data (read/write). (Update locks are not acquired by the storage manager.)
The storage manager's API allows explicit acquisition of locks by a server. User modes user1, user2, user3 and user4 are provided for that purpose.
Freeing locks is automatic at transaction commit and rollback.
There is limited support for freeing locks in the middle of a transaction:
- locks of duration less than t_long can be unlocked with unlock(), and
- quarks (sm_quark_t) simplify acquiring and freeing locks mid-transaction:
A quark is a marker in the list of locks held by a transaction. When the quark is destroyed, all locks acquired since the creation of the quark are freed. Quarks cannot be used while more than one thread is attached to the transaction, although the storage manager does not strictly enforce this (due to the cost). When a quark is in use for a transaction, the locks acquired will be of short duration, the assumption being that the quark will be closed before commit-time.
Extent locks are an exception; they must be held long-term for page allocation and deallocation to work, so even in the context of an open quark, extent locks will be held until end-of-transaction.
The lock manager uses a hash table whose size is determined by a configuration option. The hash function used by the lock manager is known not to distribute locks evenly among buckets. This is partly due to the nature of lock IDs.
To avoid expensive lock manager queries, each transaction keeps a cache of the last <N> locks acquired (the number <N> is a compile-time constant). This close association between the transaction manager and the lock manager is encapsulated in several classes in the file lock_x.
The lock manager uses a statistical deadlock-detection scheme known as "Dreadlocks" [KH1]. Each storage manager thread (smthread_t) has a unique fingerprint, which is a set of bits; the deadlock detector ORs together the fingerprints of the elements in a waits-for-dependency list; each thread, when blocking, holds a digest (the ORed bitmap). It is therefore cheap for a thread to detect a cycle when it needs to block awaiting a lock: it looks at the holders of the lock, and if it finds itself in any of their digests, a cycle would result. This works well when the total number of threads relative to the bitmap size is such that it is possible to assign a unique bitmap to each thread. If you cannot do so, you will get false-positive deadlocks "detected". The storage manager counts, in its statistics, the number of times it could not assign a unique fingerprint to a thread. If you notice excessive transaction-aborts due to false-positive deadlocks, you can compile the storage manager to use a larger number of bits in the fingerprint.
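A minimal sketch of the cycle test, assuming a digest fits in a single machine word (the real fingerprint width is a build-time property):

#include <cstdint>

typedef uint64_t digest_t;   // OR of the fingerprints in a waits-for list

// A thread about to block checks the digests of the lock's current holders.
// If its own fingerprint bits already appear in a holder's digest, blocking
// would close a waits-for cycle (or be a false positive when bits are shared).
bool would_deadlock(digest_t my_fingerprint,
                    const digest_t* holder_digests, int n) {
    for (int i = 0; i < n; ++i)
        if ((holder_digests[i] & my_fingerprint) == my_fingerprint)
            return true;
    return false;
}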
When a transaction commits, these steps are taken to manage stores:
- Stores that were given sm_store_property_t t_load_file or t_insert_file are turned into t_regular stores;
- The transaction enters a state called "freeing space" so that stores marked for deletion can be handled properly in the event of a crash/restart before the transaction logs its commit-completion;
- Stores marked for deletion are removed (see Commit-Time Handling of Store-Destruction);
- Extents marked for freeing are freed. These are extents marked for freeing in stores that were not marked for deletion; rather, these are extents that are marked for deletion due to incremental freeing of their pages.
Because these are logged actions, and they occur if and only if the transaction commits, the storage manager guarantees that the ending of the transaction and the re-marking and deletion of stores is atomic. This is accomplished by putting the transaction into a state xct_freeing_space and writing a log record to that effect. The space is freed, the stores are converted, and a final log record is written before the transaction is truly ended. In the event of a crash while a transaction is freeing space, recovery searches all the store metadata for stores marked for deletion and deletes those that would otherwise have been missed in redo.
Log records for redoable-undoable operations contain both the redo- and undo- data, hence an operation never causes two different log records to be written for redo and for undo. This, too, controls logging overhead.
The protocol for applying an operation to an object is as follows:
- Lock the object.
- Fix the page(s) affected in exclusive mode.
- Apply the operation.
- Write the log record(s) for the operation.
- Unfix the page(s).
The protocol for writing log records is as follows:
- Grab the transaction's log buffer in which the last log record is to be cached by calling xct_t::get_logbuf()
- Ensure that we have reserved enough log space for this transaction to insert the desired log record and to undo it. This is done by passing in the type of the log record we are about to insert, and by using a "fudge factor" (multiplier) associated with the given log record type. The fudge factor indicates, on average, how many bytes tend to be needed to undo the action being logged.
- Write the log record into the buffer (the idiom is to construct it there using C++ placement-new).
- Release the buffer with xct_t::give_logbuf(), passing in as an argument the fixed page that was affected by the update being logged. This does several things:
- writes the transaction ID, previous LSN for this transaction into the log record
- inserts the record into the log and remembers this record's LSN
- marks the given page dirty.
Between the time the xct log buffer is grabbed and the time it is released, the buffer is held exclusively by the one thread that grabbed it, and updates to the xct log buffer can be made freely. (Note that this per-transaction log buffer is unrelated to the log buffer internal to the log manager.)
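Schematically, the idiom reads as follows; this is a sketch based on the description above, with the signatures of get_logbuf and give_logbuf simplified and page_insert_log standing in for any log record class:

// Sketch of the per-transaction log-buffer idiom (signatures simplified).
logrec_t* lrec = 0;
W_DO(xd->get_logbuf(lrec));                       // grab the buffer; reserve space
new (lrec) page_insert_log(page, idx, cnt, vec);  // placement-new into the buffer
W_DO(xd->give_logbuf(lrec, &page));               // fills tid and prev-LSN, inserts
                                                  // the record, remembers its LSN,
                                                  // and marks the page dirty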
During recovery, no logging is done in analysis or redo phases; only during the undo phase are log records inserted. Log-space reservation is not needed until recovery is complete; the assumption is that if the transaction had enough log space prior to recovery, it has enough space during recovery. Prepared transactions pose a challenge, in that they are not resolved until after recovery is complete. Thus, when a transaction-prepare is logged, the log-space-reservations of that transaction are logged along with the rest of the transaction state (locks, coordinator, etc.) and before recovery is complete, these transactions acquire their prior log-space reservations.
The above protocol is enforced by the storage manager in helper functions that create log records; these functions are generated by Perl scripts from the source file logdef.dat. (See Log Record Types.)
The file logdef.dat also contains the fudge factors for log-space reservation. These factors were experimentally determined. There are corner cases involving btree page SMOs (structure-modification operations), in which the fudge factors will fail. [An example is when a transaction aborts after having removed entries, and after other transactions have inserted entries; the aborting transaction needs to re-insert its entries, which now require splits.] The storage manager has no resolution for this. The fudge factors handle the majority of cases without reserving excessive log-space.
- Bug:
- GNATS 156 Btree SMOs during rollback can cause problems.
The input to the above-mentioned Perl script is the source of all log record types. Each log record type is listed in the file, which is fairly self-explanatory and is reproduced here:
# <std-header style='data' orig-src='shore'>
#
# $Id: logdef.dat,v 1.67 2010/10/27 17:04:23 nhall Exp $
#
# SHORE -- Scalable Heterogeneous Object REpository
#
# Copyright (c) 1994-99 Computer Sciences Department, University of
# Wisconsin -- Madison
# All Rights Reserved.
#
# Permission to use, copy, modify and distribute this software and its
# documentation is hereby granted, provided that both the copyright
# notice and this permission notice appear in all copies of the
# software, derivative works or modified versions, and any portions
# thereof, and that both notices appear in supporting documentation.
#
# THE AUTHORS AND THE COMPUTER SCIENCES DEPARTMENT OF THE UNIVERSITY
# OF WISCONSIN - MADISON ALLOW FREE USE OF THIS SOFTWARE IN ITS
# "AS IS" CONDITION, AND THEY DISCLAIM ANY LIABILITY OF ANY KIND
# FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.
#
# This software was developed with support by the Advanced Research
# Project Agency, ARPA order number 018 (formerly 8230), monitored by
# the U.S. Army Research Laboratory under contract DAAB07-91-C-Q518.
# Further funding for this work was provided by DARPA through
# Rome Research Laboratory Contract No. F30602-97-2-0247.
#
# -- do not edit anything above this line -- </std-header>
#########################################################################
# #
# WARNING: if you add, delete or change any of the log records, #
# or their data members, or their semantics you also need to #
# update log_base::version_major and/or log_base::version_minor #
# in log_base.cpp. #
# #
# For every log record type, the perl script generates a class #
# class <type>_log { #
# void fill(const lpid_t*p, uint2_t tag, int len); #
# public: #
# <type>_log(<arg>); #
# // and... #
# // iff R bit set: #
# void redo(page_p *page); #
# // iff U bit set: #
# void undo(page_p *page); #
# } #
# #
# The format of the file is as follows: #
# type = log record type #
# X = transaction log (generated by transactions) #
# If set, logstub_gen.cpp contains a function #
# rc_t log_<type> (<arg>) to generate the log recs #
# according to convention. If not, the code else- #
# where in the SM has to be written by hand to gen #
# the log record. #
# S = sync (not used at all anymore) #
# R = redoable (-->t_redo bit set in log record) #
# Includes redo method in class #
# U = undoable (-->t_undo) #
# Includes undo method in class #
# F = format NOT USED #
# A = space-allocation: #
# If NOT set, generated code decides if logging #
# should be done, based on : #
# 1) smlevel_1::log, smlevel_0::logging_enabled, #
# 2) (if page argument present) page.store_flags #
# == st_tmp #
# 3) xct() attached and xct()->is_log_on() #
# #
# If A bit IS SET, checks #2, #3 are left out #
# #
# L = logical undo log record -- don't fix the page #
# for undo. Irrelevant if not #
# an undoable log record. #
# --> t_logical #
# #
# fudge = observed fudge factor for log space reservations #
# #
# arg = arguments to constructor #
# SPECIAL CASE: first argument is "page": #
# 1) store flags checked to turn off logging for #
# st_tmp files. #
# 2) give_logbuf() call passes page for 2nd arg #
# 3) page.set_dirty() if logging is skipped #
# #
#########################################################################
# type XSRUFAL fudge arg #
#########################################################################
comment 1011001 1.0 (const char* msg);
compensate 1000001 0.0 (const lsn_t& rec_lsn);
skip 0000000 0.0 ();
chkpt_begin 0000000 0.0 (const lsn_t &lastMountLSN);
chkpt_bf_tab 0000000 0.0 (int cnt, const lpid_t* pid,
const lsn_t* rec_lsn);
chkpt_xct_tab 0000000 0.0 (const tid_t& youngest,
int cnt, const tid_t* tid,
const smlevel_1::xct_state_t* state,
const lsn_t* last_lsn, const lsn_t* undo_nxt);
chkpt_dev_tab 0000000 0.0 (int cnt, const char** dev_name, const vid_t* vid);
chkpt_end 0000000 0.0 (const lsn_t& master, const lsn_t& min_rec_lsn);
mount_vol 0010010 0.0 (const char *dev_name, const vid_t &vid);
dismount_vol 0010010 0.0 (const char *dev_name, const vid_t &vid);
#########################################################################
# type XSRUFAL fudge arg #
#########################################################################
xct_abort 1000000 0.0 ();
xct_freeing_space 1000000 0.0 ();
xct_end 1000000 0.0 ();
xct_end_group 1000000 0.0 (const xct_t** l, int llen);
xct_prepare_st 1010000 0.0 (const gtid_t* g, const server_handle_t& h);
xct_prepare_lk 1010000 0.0 (int num, lock_mode_t mode, lockid_t* lks);
xct_prepare_alk 1010000 0.0 (int num, lockid_t* lks, lock_mode_t* modes);
xct_prepare_stores 1010000 0.0 (int num, const stid_t* stids);
xct_prepare_fi 1010000 0.0 (int numex, int numix, int numsix, int numextent, const lsn_t& first, int rsvd, int ready, int used);
#########################################################################
# type XSRUFAL fudge arg #
#########################################################################
# page allocation log records - testable(physical) for redo
# alloc_file_page is marked "logical" because there's no need to fix the
# page in the automagic-handling code.
alloc_file_page 1001011 2.4 (const lpid_t& pid, const lsn_t& rec_lsn);
alloc_pages_in_ext 1011011 1.0 (const page_p& page, snum_t snum,
extnum_t idx, const Pmap& pmap);
free_pages_in_ext 1011011 1.5 (const page_p& page, snum_t snum,
extnum_t idx, const Pmap& pmap);
# page allocation log records - testable(physical) for redo
# create_ext_list for creation of an extent list all on same page
# free_ext_list is reverse of create_ext_list; all on same page
create_ext_list 1010011 0.0 (const page_p& page, const stid_t& stid,
extnum_t prev, extnum_t next,
extnum_t count, const extnum_t* list);
free_ext_list 1010011 0.0 (const page_p& page, const stid_t& stid,
extnum_t head, extnum_t count);
# set_ext_next: when extent lists cross page boundaries
set_ext_next 1010011 0.0 (const page_p& page, extnum_t ext,
extnum_t new_next);
store_operation 1011011 1.0 (const page_p& page,
const store_operation_param& op);
#########################################################################
# type XSRUFAL fudge arg #
#########################################################################
#page_link used by btree pages only, for now
page_link 1011000 1.0 (const page_p& page, shpid_t new_prev,
shpid_t new_next);
# page_insert used by page_p::insert_expand (inserting into a slot): generic
# page_remove used by page_p::remove_compress (removing a slot): semi-generic
page_insert 1011000 1.0 (const page_p& page, int idx, int cnt,
const cvec_t* vec);
page_remove 1011000 1.0 (const page_p& page, int idx, int cnt);
# A page format reflects two operations: the page init/format part and
# the insertion of something into the first slot.
# generic page format: the page init part isn't undoable but the
# insert_expand/reclaim part is undoable
page_format 1011000 1.22 (const page_p& page, int idx, int cnt,
const cvec_t* vec);
# page_mark: marks a slot as deleted
# page_reclaim: opposite of page_mark: makes a slot in-use
page_mark 1011000 1.1 (const page_p& page, int idx);
page_reclaim 1011000 1.67 (const page_p& page, int idx, const cvec_t& vec);
# shift: move data from slot to slot: very generic
# used by btree & rtree pages
page_shift 1011000 1.0 (const page_p& page, int idx2,
page_s::slot_length_t off2,
page_s::slot_length_t len2,
int idx1, page_s::slot_length_t off1);
# splice and splicez: very generic.
# used by btree, rtree, file, large obj pages,
# for cut/paste/overwrite/merge_slots, etc.
page_splice 1011000 1.18 (const page_p& page, int idx, int start, int len,
const cvec_t& vec);
page_splicez 1011000 1.0 (const page_p& page, int idx, int start,
int len, int osave, int nsave, const cvec_t& vec);
# page_set_byte: for now used only by extlink pages
page_set_byte 1011000 1.0 (const page_p& page, int idx, u_char old,
u_char bits, int op);
#
# page_image: for now used only by rtree pages & btree pages
page_image 1010000 0.0 (const page_p& page);
#########################################################################
# type XSRUFAL fudge arg #
#########################################################################
btree_purge 1011001 3.29 (const page_p& page);
btree_insert 1011001 1.42 (const page_p& page, int idx,
const cvec_t& key, const cvec_t& el,
bool unique);
btree_remove 1011001 2.91 (const page_p& page, int idx,
const cvec_t& key, const cvec_t& el,
bool unique);
rtree_insert 1011001 1.0 (const page_p& page, int idx,
const nbox_t& key, const cvec_t& el);
rtree_remove 1011001 1.0 (const page_p& page, int idx,
const nbox_t& key, const cvec_t& el);
#########################################################################
# type XSRUFAL fudge arg #
#########################################################################
The bodies of the methods of the class <log-rec-name>_log are hand-written and reside in the storage manager's log-record implementation file.
Adding a new log record type consists of adding a line to the log-record definition file shown above, adding the method definitions to the implementation file, and adding calls to the free function log_<log-rec-name>(args) in the storage manager. The base class for every log record is logrec_t, which is worth study but is not documented here.
Some logged actions are compensated, meaning that their log records are skipped during rollback. Compensation may be needed because some operations simply cannot be undone. The protocol for compensating actions is as follows (a sketch appears after the note below):
- Fix the needed pages.
- Grab an anchor in the log. This is an LSN for the last log record written for this transaction.
- Update the pages and log the updates as usual.
- Write a compensation log record (or piggy-back the compensation on the last-written log record for this transaction to reduce logging overhead) and free the anchor.
- Note:
- Grabbing an anchor prevents all other threads in a multi-threaded transaction from gaining access to the transaction manager. Be careful with this, as it can cause mutex-latch deadlocks where multi-threaded transactions are concerned. In other words, two threads cannot concurrently update in the same transaction.
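A minimal sketch of the protocol above, assuming an anchor/compensate pair on the transaction object; xct(), anchor(), compensate(), and the helper update_and_log() are illustrative stand-ins, not necessarily the real interface:

void compensated_update(page_p& page, const lpid_t& pid)
{
    page.fix(pid, LATCH_EX);         // @ 1. fix the needed pages
    lsn_t anchor = xct()->anchor();  // 2. grab an anchor: lsn of the
                                     //    transaction's last log record
    update_and_log(page);            // 3. update and log as usual
    xct()->compensate(anchor);       // 4. write (or piggy-back) the
                                     //    compensation; frees the anchor
    page.unfix();
}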
In some cases, the following protocol is used to avoid excessive logging by general update functions that, if logging were turned on, would generate log records of their own.
- Fix the pages needed in exclusive mode.
- Turn off logging for the transaction.
- Perform the updates by calling some general functions. If an error occurs, undo the updates explicitly.
- Turn on logging for the transaction.
- Log the operation. If an error occurs, undo the updates with logging turned off.
- Unfix the pages.
The mechanism for turning off logging for a transaction is to construct an instance of xct_log_switch_t.
When the instance is destroyed, the original logging state is restored. The switch applies only to the transaction that is attached to the thread at the time the switch instance is constructed, and it prevents other threads of the transaction from using the log (or doing much else in the transaction manager) while the switch exists.
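A minimal sketch of the switch in use; only xct_log_switch_t is named by this document, while the enumerator OFF and the helper functions are assumptions made for illustration:

void update_without_logging(page_p& page, const lpid_t& pid)
{
    page.fix(pid, LATCH_EX);           // @ fix pages in exclusive mode
    {
        xct_log_switch_t toggle(OFF);  // logging off for the attached xct
        perform_generic_updates(page); // would otherwise log on its own
    }                                  // destructor restores logging state
    log_whole_operation(page);         // one log record for the operation
    page.unfix();
}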
The log is a collection of files, all in the same directory, whose path is determined by a run-time option. Each file in the directory is called a "log file" and represents a "partition" of the log. The log is partitioned into files to make it possible to archive portions of the log to free up disk space. A log file has the name log.<n> where <n> is a positive integer. The log file name indicates the set of logical sequence numbers (lsn_t) of log records (logrec_t) that are contained in the file. An lsn_t has a high part and a low part, and the high part (a.k.a., file part) is the <n> in the log file name.
The user-settable run-time option sm_logsize indicates the maximum number of KB that may be open at once; this, in turn, determines the size of a partition file, since the number of partition files is a compile-time constant. The storage manager computes partition sizes from the user-provided log size, such that each partition's size is a convenient multiple of blocks (more about which, below).
A new partition is opened when the tail of the log approaches the end of a partition, that is, when the next insertion into the log is at an offset larger than the maximum partition size. (There is a fudge factor of BLOCK_SIZE in here for convenience in implementation.)
The low part of an lsn_t represents the byte-offset into the log file at which the log record with that lsn_t sits.
Thus, the total file size of a log file log.<n> is the size of all log records in the file, and the lsn_t of each log record in the file is lsn_t(<n>, <byte-offset>) of the log record within the file.
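For example, under the lsn_t(<n>, <byte-offset>) form described above (the accessors hi() and lo() are assumptions about lsn_t's interface made for this sketch):

// A log record at byte offset 8192 of file log.3 has this lsn:
lsn_t where(3, 8192);        // high (file) part 3, low (offset) part 8192
assert(where.hi() == 3);     // names the partition file log.3
assert(where.lo() == 8192);  // byte offset of the record in that file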
The log is, conceptually, a forever-expanding set of files. The log manager will open at most PARTITION_COUNT log files at any one time.
- PARTITION_COUNT = smlevel_0::max_openlog
- smlevel_0::max_openlog (sm_base.h) = SM_LOG_PARTITIONS
- SM_LOG_PARTITIONS is a compile-time constant (which can be overridden in config/shore.def).
The log is considered to have run out of space if logging requires that more than smlevel_0::max_openlog partitions are needed. Partitions are needed only as long as they contain log records needed for recovery, which means:
- log records for pages not yet made durable (min recovery lsn)
- log records for uncommitted transactions (min xct lsn)
- log records belonging to the last complete checkpoint
After a checkpoint is taken and its log records are durable, the storage manager tries to scavenge all partitions that do not contain necessary log records. The buffer manager provides the min recovery lsn; the transaction manager provides the min xct lsn; and the log manager keeps track of the location of the last completed checkpoint in its master_lsn. The file part of the minimum of these lsns indicates the lowest partition that cannot be scavenged; all earlier partitions are removed.
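A sketch of that scavenging decision; the accessor and helper names here are illustrative, not the real interface:

lsn_t candidates[] = { bf_min_rec_lsn(),    // lowest dirty-page rec_lsn
                       xct_min_lsn(),       // oldest xct's first log record
                       log_master_lsn() };  // last completed checkpoint
lsn_t lowest = candidates[0];
for (int i = 1; i < 3; i++)
    if (candidates[i] < lowest) lowest = candidates[i];
// Partitions numbered below the file part of 'lowest' are scavenged:
for (int p = oldest_partition(); p < lowest.hi(); p++)
    scavenge_partition(p);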
When the log is in danger of running out of space (because there are long-running transactions, for example), the server may be called via the LOG_WARN_CALLBACK_FUNC argument to ss_m::ss_m. This callback may abort a transaction to free up log space, but the act of aborting itself consumes log space. It may instead archive a log file and remove it. If the server provided a LOG_ARCHIVED_CALLBACK_FUNC argument to ss_m::ss_m, that callback can be used to retrieve archived log files when they are needed for rollback.
- Warning:
- This functionality is not complete and has not been well-tested.
Log files (partitions) are written in fixed-sized blocks. The log manager pads writes, if necessary, to make them BLOCK_SIZE.
- BLOCK_SIZE = 8192, a compile-time constant.
A skip_log record indicates the logical end of a partition. The log manager ensures that the last log record in a file is always a skip_log record.
Log files (partitions) are composed of segments. A segment is an integral number of blocks.
- SEGMENT_SIZE = 128*BLOCK_SIZE, a compile-time constant.
The smallest partition is one segment plus one block; a partition may be many segments plus one block. The extra block enables the log manager to write the skip_log record that marks the end of the file.
The partition size is determined by the storage manager run-time option, sm_logsize, which determines how much log can be open at any time, i.e., the combined sizes of the PARTITION_COUNT partitions.
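The arithmetic below is one possible reading of this sizing rule, shown for illustration only; the storage manager's actual rounding code is not reproduced here:

const long BLOCK_SIZE   = 8192;              // compile-time constant
const long SEGMENT_SIZE = 128 * BLOCK_SIZE;  // compile-time constant

// Divide sm_logsize (KB) among the partitions, round down to whole
// segments, and add the one block reserved for the skip_log record.
long partition_bytes(long sm_logsize_kb, int partition_count)
{
    long per_partition = (sm_logsize_kb * 1024) / partition_count;
    long segments = per_partition / SEGMENT_SIZE;
    if (segments < 1) segments = 1;          // smallest legal partition
    return segments * SEGMENT_SIZE + BLOCK_SIZE;
}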
The maximum size of a log record (logrec_t) is 3 storage manager pages. The page size happens to match the block size, but the two compile-time constants are not interdependent. A segment is substantially larger than a block, so it can hold at least several maximum-sized log records, preferably many.
Inserting a log record consists of copying it into the log manager's log buffer (1 segment in size). The buffer wraps so long as there is room in the partition. Meanwhile, a log-flush daemon thread writes out unflushed portions of the log buffer. The log daemon can lag behind insertions, so each insertion checks for space in the log buffer before it performs the insert. If there isn't enough space, it waits until the log flush daemon has made room.
When insertion of a log record would wrap around the buffer and the partition has no room for more segments, a new partition is opened, and the entire newly-inserted log record will go into that new partition. Meanwhile, the log-flush daemon will see that the rest of the log buffer is written to the old partition, and the next time the log flush daemon performs a flush, it will be flushing to the new partition.
The bookkeeping of the log buffer's free and used space is handled by the notion of epochs. An epoch keeps track of the start and end of the unflushed portion of the segment (log buffer). Thus, an epoch refers to only one segment (logically, log buffer copy within a partition). When an insertion fills the log buffer and causes it to wrap, a new epoch is created for the portion of the log buffer representing the new segment, and the old epoch keeps track of the portion of the log buffer representing the old segment. The inserted log record usually spans the two segments, as the segments are written contiguously to the same log file (partition).
When an insertion causes a wrap and there is no more room in the partition to hold the new segment, a new epoch is created for the portion of the log buffer representing the new segment, and the old epoch keeps track of the portion of the log buffer representing the old segment, as before. Now, however, the inserted log record is inserted, in its entirety, in the new segment. Thus, no log record spans partitions.
Meanwhile, the log-flush daemon knows about the possible existence of two epochs. When an old epoch is valid, it flushes that epoch. When a new epoch is also valid, it flushes that one as well. If the two epochs have the same target partition, the two flushes are combined into a single write.
The act of flushing an epoch to a partition consists in a single write of a size that is an even multiple of BLOCK_SIZE. The flush appends a skip_log record, and zeroes as needed, to round out the size of the write. Writes re-write portions of the log already written, in order to overwrite the skip_log record at the tail of the log (and put a new one at the new tail).
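A compressed sketch of the epoch bookkeeping just described; the structure and function names are invented for illustration:

struct epoch {
    bool valid;       // does this epoch describe unflushed bytes?
    int  partition;   // target log file
    long start, end;  // unflushed byte range within the log buffer
};
epoch old_e, new_e;   // at most two epochs exist at a time

void flush_epochs()
{
    if (old_e.valid) flush_one(old_e);  // old epoch first
    if (new_e.valid) flush_one(new_e);  // then the new one
    // The implementation combines these into a single write when both
    // epochs target the same partition; flush_one() pads to BLOCK_SIZE
    // and appends the skip_log record, as described above.
    old_e.valid = new_e.valid = false;
}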
The storage manager performs ARIES-style logging and recovery. This means the logging and recovery system has these characteristics:
- uses write-ahead logging (WAL)
- repeats history on restart before doing any rollback
- all updates are logged, including those performed during rollback
- compensation records are used in the log to bound the amount of logging done for rollback and guarantee progress in the case of repeated failures and restarts.
Each time a storage manager (ss_m class) is constructed, the logs are inspected, the last checkpoint is located, and its lsn is remembered as the master_lsn, then recovery is performed. Recovery consists of three phases: analysis, redo and undo.
The analysis pass reads the log starting at the master_lsn and examines the log records written thereafter. Reading the log records of the last completed checkpoint, it reconstructs the transaction table and the buffer pool's dirty page table, and mounts the devices and volumes that were mounted at the time of the checkpoint. From the dirty page table it determines the redo_lsn, the lowest recovery lsn of the dirty pages, which is where the next phase of recovery must begin.
The redo pass starts reading the log at the redo_lsn and, for each log record thereafter, decides whether that log record's work needs to be redone. The general protocol is (sketched in code after the list):
- if the log record is not redoable, it is ignored
- if the log record is redoable and contains a page ID, the page is inspected and its lsn is compared to that of the log record. If the page lsn is at or later than the log record's sequence number, the page already reflects this update and the action is not redone.
- if the page lsn is earlier than the log record's sequence number, the update is redone and the page's lsn is advanced to that of the log record.
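The redo decision, sketched; the accessors on the log record are assumptions made for illustration:

void redo_one(logrec_t& r)
{
    if (!r.is_redoable()) return;    // ignored
    if (r.has_page_id()) {
        page_p page;
        page.fix(r.pid(), LATCH_EX); // @ fix the affected page
        if (page.lsn() < r.lsn())    // page predates this update?
            r.redo(&page);           // repeat history; page lsn advances
        page.unfix();
    }
}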
After redo, the state of the database matches that at the time of the crash. The undo pass now rolls back the transactions that remain active. Care is taken to undo the log records in reverse chronological order, rather than letting each transaction roll back at its own pace. This is necessary because some operations use page fixing for concurrency control: pages are protected only with latches when there is no page lock in the lock hierarchy, which occurs when logical logging and high-concurrency locking are used, as in the B-trees. A crash in the middle of a compensated action such as a page split must result in the split being undone before any other operations on the tree are undone.
- Bug:
- GNATS 49 (performance) There is no concurrent undo.
After the storage manager has recovered, control returns from its constructor method to the caller (the server). There might be transactions left in prepared state. The server is now free to resolve these transactions by communicating with its coordinator.
Write-ahead logging requires a close interaction between the log manager and the buffer manager: before a page can be flushed from the buffer pool, the log might have to be flushed.
This also requires a close interaction between the transaction manager and the log manager.
All three managers understand a log sequence number (lsn_t). Log sequence numbers serve to identify and locate log records in the log, to timestamp pages, to identify the last update performed by a transaction, and to identify the last log record written by a transaction. Since every update is logged, every update can be identified by a log sequence number. Each page bears the log sequence number of the last update that affected it.
A page cannot be written to disk until the log record with that page's lsn has been written to the log (and is on stable storage). A log sequence number is a 64-bit structure, with part identifying a log partition (file) number and the rest identifying an offset within the file.
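This write-ahead constraint, sketched; durable_lsn() and the flush-up-to call are illustrative names, not necessarily the real interface:

// Before writing a dirty frame to disk, ensure the log is durable
// at least up to the page's lsn (the WAL invariant).
void write_frame(bfcb_t* cb)
{
    if (log->durable_lsn() < cb->frame->lsn)
        log->flush(cb->frame->lsn);  // may block on log I/O
    volume->write_page(*cb->frame);  // now safe to write the page
}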
The log is partitioned to simplify archiving to tape (not implemented). The log comprises 8 partitions, where each partition's size is limited to approximately 1/8 the maximum log size given in the run-time configuration option sm_logsize. A partition resides in a file named log.<n>, where <n> is the partition number. The configuration option sm_logdir names a directory (which must exist before the storage manager is started) in which the storage manager may create and destroy log files.
The storage manager may have at most 8 active partitions at any one time. An active partition is one that is needed because it contains log records for running transactions. Such partitions could (if it were supported) be streamed to tape and their disk space reclaimed. Space is reclaimed when the oldest transaction ends and the new oldest transaction's first log record is in a newer partition than that in which the old oldest transaction's first log record resided. Until tape archiving is implemented, the storage manager issues an error (eOUTOFLOGSPACE) if it consumes sufficient log space to be unable to abort running transactions and perform all resulting necessary logging within the 8 partitions available.
- Note:
- Determining the point at which there is insufficient space to abort all running transactions is a heuristic matter, and it is not reliable. A transaction "reserves" log space for rollback, meaning that no other transaction can consume that space until the transaction ends. A transaction has to reserve significantly more space to roll back B-tree deletions than it needed for forward processing, because the log overhead of the compensating insertions is considerably larger than that of the deletions. The (compile-time) page size is also a factor in this heuristic.
Log records are buffered by the log manager until forced to stable storage, to reduce I/O costs. The log manager keeps a buffer of a size that is determined by a run-time configuration option. The buffer is flushed to stable storage when necessary. The last log record in the buffer is always a skip log record, which indicates the end of the log partition.
Ultimately, archiving to tape is necessary. The storage manager does not perform write-aside or any other work in support of long-running transactions.
The checkpoint manager chkpt_m sleeps until kicked into action by the log manager; when kicked, it takes a checkpoint and then sleeps again. Taking a checkpoint amounts to these steps (sketched in code after the list):
- Write a chkpt_begin log record.
- Write a series of log records recording the mounted devices and volumes.
- Write a series of log records recording the buffer pool's dirty pages. For each dirty page in the buffer pool, the page ID and its recovery lsn are logged. A page's recovery lsn is metadata stored in the buffer manager's control block, but is not written on the page. It represents an lsn prior to or equal to the log's current lsn at the time the page was first marked dirty. Hence, it is less than or equal to the lsn of the log record for the first update to that page after the page was read into the buffer pool (and remained there until this checkpoint). The minimum of all the recovery lsns written in this checkpoint will be a starting point for crash recovery, if this is the last checkpoint completed before a crash.
- Write a series of log records recording the states of the known transactions, including the prepared transactions.
- Write a chkpt_end log record.
- Tell the log manager where this checkpoint is: the lsn of the chkpt_begin log record becomes the new master_lsn of the log. The master_lsn is written in a special place in the log so that it can always be discovered on restart.
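The sequence above, compressed into a sketch; the function names are illustrative, not the real chkpt_m interface:

void take_checkpoint()
{
    lsn_t begin = log_chkpt_begin();      // step 1
    log_mounted_devices_and_volumes();    // step 2
    log_dirty_page_table();               // step 3: pid + rec_lsn pairs
    log_transaction_table();              // step 4: incl. prepared xcts
    log_chkpt_end();                      // step 5
    log->set_master(begin);               // step 6: new master_lsn
}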
These checkpoint log records may interleave with other log records, making the checkpoint "fuzzy"; this way the world doesn't have to grind to a halt while a checkpoint is taken, but there are a few operations that must be serialized with all or portions of a checkpoint. Those operations use mutex locks to synchronize. Synchronization of operations is as follows:
- Checkpoints cannot happen simultaneously - they are serialized with respect to each other.
- A checkpoint and the following are serialized:
- mount or dismount a volume
- prepare a transaction
- commit or abort a transaction (a certain portion of this must wait until a checkpoint is not happening)
- heroics to cope with a shortage of log space
- The portion of a checkpoint that logs the transaction table is serialized with the following:
- operations that can run only with one thread attached to a transaction (including the code that enforces this)
- transaction begin, end
- determining the number of active transactions
- constructing a virtual table from the transaction table
The buffer manager is the means by which all other modules (except the log manager) read and write pages. A page is read by calling bf_m::fix. If the page requested cannot be found in the buffer pool, the requesting thread reads the page and blocks waiting for the read to complete.
All frames in the buffer pool are the same size, and they cannot be coalesced, so the buffer manager manages a set of pages of fixed size.
The buffer manager maintains a hash table mapping page IDs to buffer control blocks. A control block points to its frame, and from a frame one can arithmetically locate its control block (in bf_m::get_cb(const page_s *)). The hash table for the buffer pool uses cuckoo hashing (see [P1]) with multiple hash functions and multiple slots per bucket. These are compile-time constants and can be modified (bf_htab.h).
Cuckoo hashing is subject to cycles, in which making room on one table bucket A would require moving something else into A. Using at least two slots per bucket reduces the chance of a cycle.
The implementation limits the number of probes it makes looking for an empty slot, and thus the number of moves it performs to make room. If cycles are present the limit will be hit, but hitting the limit does not necessarily indicate a cycle. If the limit is hit, the insertion fails. The "normal" solution in this case is to rebuild the table with different hash functions; the storage manager does not handle this case.
- Bug:
- GNATS 47 In event of insertion failure, the hash table will have to be rebuilt with different hash functions, or will have to be modified in some way.
- Bug:
- GNATS 35 The buffer manager hash table implementation contains a race. While a thread performs a hash-table lookup, an item can move from one bucket to another (though not from one slot to another within a bucket). The implementation contains a temporary work-around until the problem is fixed more gracefully: if a lookup fails to find its target, it performs an expensive exhaustive search; the statistics record these as bf_harsh_lookups.
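A sketch of a lookup in such a table, with H hash functions and S slots per bucket (both compile-time constants, per the text); the types and names below are illustrative, not the bf_htab.h interface:

const int H = 2;   // number of hash functions
const int S = 2;   // slots per bucket

bfcb_t* lookup(const bfpid_t& pid)
{
    for (int h = 0; h < H; h++) {
        bucket_t& b = table[hash_fn[h](pid)];
        for (int s = 0; s < S; s++)
            if (b.slot[s] && b.slot[s]->pid() == pid)
                return b.slot[s];
    }
    return 0;   // absent -- or moved mid-lookup (see GNATS 35 above)
}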
When a page is fixed, the buffer manager looks for a free buffer-pool frame, and if one is not available, it must choose a victim to replace. It uses a clock-based algorithm to determine where in the buffer pool to start looking for an unlatched frame: on the first pass of the buffer pool it considers only clean frames; on the second pass it also considers dirty pages; and on the third or subsequent pass it considers any frame.
The buffer manager forks background threads to flush dirty pages. The buffer manager makes an attempt to avoid hot pages and to minimize the cost of I/O by sorting and coalescing requests for contiguous pages. Statistics kept by the buffer manager tell the number of resulting write requests of each size.
There is one bf_cleaner_t thread for each volume, and it flushes pages for that volume; this is done so that it can combine contiguous pages into single write requests to minimize I/O. Each bf_cleaner_t is a master thread with multiple page-writer slave threads. The number of slave threads per master thread is controlled by a run-time option. The master thread can be disabled (thereby disabling all background flushing of dirty pages) with a run-time option.
The buffer manager writes dirty pages even if the transaction that dirtied the page is still active (steal policy). Pages stay in the buffer pool as long as they are needed, except when chosen as a victim for replacement (no force policy).
The replacement algorithm is clock-based (it sweeps the buffer pool, noting and clearing reference counts). This is a cheap way to achieve something close to LRU; it avoids much of the overhead and mutex bottlenecks associated with LRU.
The buffer manager maintains a hash table that maps page IDs to buffer frame control blocks (bfcb_t), which in turn point to frames in the buffer pool. The bfcb_t keeps track of the page in the frame, the page ID of the previously-held page, whether the frame is in transit, the dirty/clean state of the page, the number of page fixes (pins) held on the page (i.e., reference counts), the recovery lsn of the page, etc. The control block also contains a latch. A page, when fixed, is always fixed in a latch mode, either LATCH_SH or LATCH_EX.
- Bug:
- GNATS 40 bf_m::upgrade_latch() drops the latch and re-acquires it in the new mode if it cannot perform the upgrade without blocking. This issue is inherited from the original SHORE storage manager. Blocking in this case would enable a deadlock in which two threads hold the latch in SH mode and both want to upgrade to EX mode. When this happens, the statistics counter bf_upgrade_latch_race is incremented.
Page fixes are expensive (in CPU time, even if the page is resident).
Each page type defines a set of fix methods that are virtual in the base class for all pages; the rest of the storage manager interacts with the buffer manager primarily through these methods of the page classes. The macro MAKEPAGECODE is used for each page subtype; it defines all the fix methods on the page in such a way that bf_m::fix() is properly called in each case.
A page frame may be latched for a page without the page being read from disk; this is done when a page is about to be formatted.
The buffer manager is responsible for maintaining WAL; this means it may not flush to disk dirty pages whose log records have not reached stable storage yet. Temporary pages (see sm_store_property_t) do not get logged, so they do not have page lsns to assist in determining their clean/dirty status, and since pages may change from temporary (unlogged) to logged, they require special handling, described below.
When a page is unfixed, sometimes it has been updated and must be marked dirty. The protocol used in the storage manager is as follows (a sketch follows below):
- Fixing with latch mode EX signals intent to dirty the page. If the page is not already dirty, the buffer control block for the page is given a recovery lsn equal to the page's lsn. This means that any dirtying of the page will be logged with a record whose lsn is larger than this recovery lsn. Fixing an already-dirty page in EX mode does not change the page's recovery lsn.
- A thread updates a page in the buffer pool only when it has the page EX-fixed(latched).
- After the update to the page, the thread writes a log record to record the update. The log functions (generated by Perl) determine whether a log record should be written (not, for example, if the page is a tmp page or if logging is turned off); if not, they call page.set_dirty() so that any subsequent unfix notices that the page is dirty. If the log record is written, the modified page is unfixed with unfix_dirty() (in xct_impl::give_logbuf).
- Before unfixing a page, if it was written, it must be marked dirty first with
- set_dirty followed by unfix, or
- unfix_dirty (which is set_dirty + unfix).
- Before unfixing a page, if it was NOT written, unfix it with bf_m::unfix so its recovery lsn gets cleared. This happens only if this is the last thread to unfix the page. The page could have multiple fixers (latch holders) only if it were fixed in SH mode. If fixed (latched) in EX mode, this will be the only thread to hold the latch and the unfix will clear the recovery lsn.
It is possible that a page is fixed in EX mode and marked dirty, but never updated after all, then unfixed. The buffer manager attempts to recognize this situation and clear the control block's dirty bit and recovery lsn.
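The protocol above, compressed into a sketch; do_update() and log_update() are hypothetical stand-ins for the caller's update and logging code:

void update_page(page_p& page, const lpid_t& pid)
{
    page.fix(pid, LATCH_EX);   // @ intent to dirty; rec_lsn set if clean
    if (do_update(page)) {
        log_update(page);      // log functions call set_dirty() when the
                               // record is suppressed (tmp page, etc.)
        page.unfix_dirty();    // set_dirty + unfix
    } else {
        page.unfix();          // clears rec_lsn if last to unfix
    }
}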
Things get a little complicated where the buffer manager's page-writer threads are concerned. The page-writer threads acquire a share latch and copy dirty pages, this being faster than holding the latch for the duration of the write to disk. When the write is finished, the page-writer re-latches the page with the intention of marking it clean if no intervening updates have occurred; this means changing the dirty bit and updating the recovery lsn in the buffer control block. The difficulty lies in determining whether the page is indeed clean, that is, matches the latest durable copy. In the absence of unlogged (t_temporary) pages this would not be terribly difficult, but it would still have to cope with the case in which the page was (updated and) written by another thread between the copy and the re-fix: it might have been cleaned, or that other thread might be operating in lock-step with this thread. The conservative handling would be not to change the recovery lsn in the control block if the page's lsn has changed; however, this has serious consequences for hot pages: their recovery lsns might never move toward the tail of the log (they remain artificially low), and thus hot pages can prevent scavenging of log partitions. If log partitions cannot be scavenged, the server runs out of log space. For this reason, the buffer manager goes to some lengths to update the recovery lsn whenever possible. To further complicate matters, the page could have changed stores, so its page type or store (logging) property could differ. The details of this problem are handled in a function called determine_rec_lsn().
The buffer manager keeps a set of N mutexes to synchronize the various threads that can write pages to disk. Each of these mutexes covers a run of pages of size smlevel_0::max_many_pages. N is substantially smaller than the number of "runs" in the buffer pool (size of the buffer pool / max_many_pages), so each of the N mutexes actually covers several runs: page-writer-mutex = page / max_many_pages % N.
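The selection arithmetic from the formula above, written out (the function name is illustrative):

// Which of the N page-writer mutexes covers a given page:
int page_writer_mutex_index(shpid_t page, int max_many_pages, int N)
{
    return (page / max_many_pages) % N;
}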
Pages can be written to disk by "foreground" threads under several circumstances. All foreground page-writing goes through the method bf_m::_scan. This is called for:
- discarding all pages from the buffer pool (bf_m::_discard_all)
- discarding all pages belonging to a given store from the buffer pool (bf_m::_discard_store), e.g., when a store is destroyed.
- discarding all pages belonging to a given volume from the buffer pool (bf_m::_discard_volume), e.g., when a volume is destroyed.
- forcing all pages to disk (bf_m::_force_all) with or without invalidating their frames, e.g., during clean shutdown.
- forcing all pages of a store to disk (bf_m::_force_store) with or without invalidating their frames, e.g., when changing a store's property from unlogged to logged.
- forcing all pages of a volume to disk (bf_m::_force_volume) with or without invalidating the frames, e.g., when dismounting a volume.
- forcing all pages whose recovery lsn is less than or equal to a given lsn_t, e.g., for a clean shutdown, after restart.