internal.h

00001 // -*- mode:c++; c-basic-offset:4 -*-
00002 /*<std-header orig-src='shore' incl-file-exclusion='INTERNAL_H'>
00003 
00004  $Id: internal.h,v 1.13 2011/09/08 18:10:53 nhall Exp $
00005 
00006 SHORE -- Scalable Heterogeneous Object REpository
00007 
00008 Copyright (c) 1994-99 Computer Sciences Department, University of
00009                       Wisconsin -- Madison
00010 All Rights Reserved.
00011 
00012 Permission to use, copy, modify and distribute this software and its
00013 documentation is hereby granted, provided that both the copyright
00014 notice and this permission notice appear in all copies of the
00015 software, derivative works or modified versions, and any portions
00016 thereof, and that both notices appear in supporting documentation.
00017 
00018 THE AUTHORS AND THE COMPUTER SCIENCES DEPARTMENT OF THE UNIVERSITY
00019 OF WISCONSIN - MADISON ALLOW FREE USE OF THIS SOFTWARE IN ITS
00020 "AS IS" CONDITION, AND THEY DISCLAIM ANY LIABILITY OF ANY KIND
00021 FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.
00022 
00023 This software was developed with support by the Advanced Research
00024 Project Agency, ARPA order number 018 (formerly 8230), monitored by
00025 the U.S. Army Research Laboratory under contract DAAB07-91-C-Q518.
00026 Further funding for this work was provided by DARPA through
00027 Rome Research Laboratory Contract No. F30602-97-2-0247.
00028 
00029 */
00030 
00031 /*  -- do not edit anything above this line --   </std-header>*/
00032 
00033 /* This file contains doxygen documentation only */
00034 
00035 /**\page IMPLNOTES Implementation Notes
00036  *
00037  * \section MODULES Storage Manager Modules
00038  * The storage manager code contains the following modules (with related C++ classes):
00039  *
00040  * - \ref SSMAPI (ss_m) 
00041  *   Most of the programming interface to the storage manager is encapsulated
00042  *   in the ss_m class. 
00043  * - \ref VOL_M (vol_m) and \ref DIR_M (dir_m)
00044  *   These managers handle volumes, page allocation and stores, which are the
00045  *   structures underlying files of records, B+-Tree indexes, and
00046  *   spatial indexes (R*-Trees).
00047  * - \ref FILE_M (file_m), \ref BTREE_M (btree_m), and \ref RTREE_M (rtree_m)
00048  *   handle the storage structures available to servers.
00049  * - \ref LOCK_M (lock_m) 
00050  *   The lock manager is quasi-stand-alone.
00051  * - \ref XCT_M (xct_t) and * \ref LOG_M (log_m) handle transactions,
00052  *   logging, and recovery.
00053  * - \ref BF_M (bf_m)
00054  *   The buffer manager works closely with \ref XCT_M and \ref LOG_M.
00055  *
00056  * \attention
00057  * \anchor ATSIGN 
00058  * \htmlonly
00059  * <b><font color=#B2222>
00060  * In the storage manager, fixing a page in the buffer pool acquires
00061  a latch
00062  and the verbs "fix" and "latch" are used interchangeably here.
00063  For searching convenience, where latching/fixing occurs the symbol @
00064  has been inserted.
00065  * </b></font> 
00066  * \endhtmlonly
00067  *
00068  * \section VOL_M I/O Manager and Volume Manager
00069  * The I/O manager was, in the early days of SHORE, expected to
00070  * have more responsibility than it now has; now it is little more
00071  * than a wrapper for the \ref VOL_M.   
00072  * For the purpose of this discussion, 
00073  * the I/O Manager and the volume manager are the same entity.
00074  * There is a single read-write lock associated 
00075  * with the I/O-Volume manager to serialize access.  
00076  * Read-only functions acquire the lock in read mode; updating 
00077  * functions acquire the lock in write mode.
00078  *
00079  * \note
00080  * Much of the page- and extent-allocation code relies on the fact that
00081  * access to the manager is serialized, and this lock is a major source of
00082  * contention.
00083  *
00084  * The volume manager handles formatting of volumes,
00085  * allocation and deallocation of pages and extents in stores.
00086  * Within a page, allocation of space is up to the manager of the
00087  * storage structure (btree, rtree, or file).
00088  *
00089  * The following sections describe the \ref VOL_M:
00090  * - \ref EXTENTSTORE 
00091  * - \ref STORENODE 
00092  * - \ref OBJIDS 
00093  * - \ref STONUMS 
00094  * - \ref ALLOCEXT 
00095  * - \ref XAFTERXCT 
00096  * - \ref IMPLICAT 
00097  * - \ref ALLOCST 
00098  * - \ref SAFTERXCT 
00099  * - \ref ALLOCPG 
00100  * - \ref VOLCACHES 
00101  * - \ref PAGES 
00102  * - \ref RSVD_MODE 
00103  *
00104  * \subsection EXTENTSTORE Extents and Stores
00105  *
00106  * Files and indexes are types of \e stores. A store is a persistent
00107  * data structure to which pages are allocated and deallocated, but which
00108  * is independent of the purpose for which it is used (index or file).
00109  *
00110  * Pages are reserved and allocated for a store in units of ss_m::ext_sz 
00111  * (enumeration value smlevel_0::ext_sz, found in sm_base.h), 
00112  * a compile-time constant that indicates the size of an extent.
00113  *
00114  * An extent is a set of contiguous pages, represented 
00115  * by a persistent data structure \ref extlink_t.  Extents are
00116  * linked together to form the entire structure of a store.  
00117  * The head of this list has a reference to it from a store-node
00118  * (\ref stnode_t), described below.
00119  * Extents (extlink_t) are co-located on extent-map pages at 
00120  * the beginning of the volume. 
00121  *
00122  * Each extent has an owner, 
00123  * which is the store id (\ref snum_t) of the store to which it belongs.
00124  * Free extents are not linked together; 
00125  * they simply have no owner (signified by an  \ref extlink_t::owner == 0).
00126  *
00127  * An extent id is a number of type \ref extnum_t.  It is arithmetically
00128  * determined from a page number, and the pages in an extent are arithmetically derived from an extent number. 
00129  * The \ref extnum_t is used in acquiring locks on
00130  * extents and it is used for locating the associated \ref extlink_t and the
00131  * extent-map page on which the \ref extlink_t resides.
00132  * Scanning the pages in a store can be accomplished by scanning the 
00133  * list of \ref extlink_t.   
00134  * 
00135  * The entire allocation metadata for a page are in its extent, which contains a
00136  * bitmap indicating which of its pages are allocated.
00137  * One cannot determine the allocation status of a page from the page 
00138  * itself: the extent map page must be inspected.
00139  *
00140  * Extents also contain (unlogged, advisory) metadata in the form of
00141  * a pbucketmap; this contains a \e bucket \e number 
00142  * for each page in the extent,
00143  * indicating of the amount of free space on the page.
00144  * The map has meaning only to the \ref FILE_M.  
00145  * The file manager asks the I/O layer
00146  * (which then descends to the volume manager for this purpose) to find the
00147  * next page whose advisory bucket number is sufficiently large for the
00148  * file manager's record-allocation needs.  Between the time this request
00149  * is made and the time the file manager fixes
00150  * and inspects the page,
00151  * the page might no longer have sufficient space.  Nevertheless, this
00152  * advisory bucket number in the extlink_t reduces the number of page-fixes
00153  * to find a page with the needed space, and
00154  * it does improve the effective fill-factor for file pages.
00155  *
00156  * Maintaining the bucket map is costly in that it fixes and dirties 
00157  * extent-map pages,
00158  * even though it does not log these updates.
00159  *
00160  * The bucket map is maintained \e only for extents whose pages are 
00161  * file_p (small-object) pages.
00162  *
00163  * \subsection STORENODE Store Nodes
00164  * A \ref stnode_t holds metadata for a store, including a reference to
00165  * the first extent in the store. 
00166  * A store \e always contains at least one allocated extent, even if
00167  * no pages in that extent are allocated.
00168  * Scanning the pages in a store can be accomplished by scanning the 
00169  * list of \ref extlink_t.   
00170  *
00171  * Store nodes are co-located on store-map pages at the beginning of a volume, 
00172  * after the extent maps. The volume is formatted to allow
00173  * as many store nodes as there are extents.
00174  *
00175  * \subsection OBJIDS Object Identifiers, Object Location, and Locks
00176  *
00177  * There is a close interaction among various object identifiers,
00178  * the data structures in which the 
00179  * objects reside, and the locks acquired on the objects.
00180  * 
00181  * Simply put:
00182  * - a volume identifier (ID) consists of an 
00183  *   integral number, e.g., 1, represented in an output stream as v(1).
00184  * - a store identifier consists of a volume ID and a store number, e.g., 3,
00185  *   represented s(1.3).
00186  * - an index ID and a file ID are merely store IDs.
00187  * - a page ID contains a store ID and a page number, e.g., 48, represented
00188  *   p(1.3.48).
00189  * - a record ID for a record in a file contains a 
00190  *   page ID and a slot number, e.g., 2, represented r(1.3.48.2).
00191  *
00192  *  Clearly, from a record ID, its page and slot can 
00193  *  be derived without consulting any indices.  It is 
00194  *  also clear that records cannot move, which has 
00195  *  ramifications for \ref RSVD_MODE, described below.
00196  *  The \ref LOCK_M understands these identifiers as well as extent
00197  *  IDs, and generates locks from identifiers.  
00198  *
00199  * \subsection STONUMS Predefined Stores
00200  *
00201  * A volume contains these pre-defined structures:
00202  * - Header: page 0 (the first page) of the volume; contains :
00203  *   - a format version #
00204  *   - the long volume id
00205  *   - extent size
00206  *   - number of extents
00207  *   - number of extents used for store 0 (see below)
00208  *   - number of pages used for store 0 (see below)
00209  *   - the first page of the extent map
00210  *   - the first page of the store map
00211  *   - page size
00212  * - store #0 : a "pseudo-store" containing the extent-map and store-map pages. This
00213  *   starts with page 1 (the second page) of the volume.
00214  * - store #1 :  directory of the stores (used by the storage manager): this is
00215  *   a btree index mapping store-number to metadata about the store, 
00216  *   including (but not limited to) the store's use (btree/rtree/file-small-object-pages/file-large-object-pages), 
00217  *   and, in the case of indices, the root page of the index,
00218  *   and, in the case of files, the store number of the associated large-object-page store. 
00219  * - store #2 :  root index (for use by the server)
00220  *
00221  * \subsection ALLOCEXT Allocation and Deallocation of Extents 
00222  *
00223  * \anchor ALLOCEXTA 
00224  * Finding an extent to allocate to a store requires
00225  * searching through the 
00226  *   extent-map pages for an extent that is both 
00227  *   unallocated (owner is zero) and not locked. 
00228  *  - The storage 
00229  *   manager caches the minimum free extent number with which to start a
00230  *   search; this number is reset to its static lower bound when the
00231  *   volume is mounted, meaning that the first extent operation after a
00232  *   mount starts its search at the head of the volume.
00233  * - Subsequent searches start the search with the lowest free extent
00234  *   number.
00235  * - Extent-map pages are latched as needed for this linear search 
00236  *   \ref ATSIGN "\@".
00237  * - The first appropriate extent found is IX-locked.
00238  *   These locks are explicitly acquired by the lock manager; 
00239  *   extent locks are not in the lock hierarchy;
00240  *
00241  * \anchor ALLOCEXTA2 
00242  * Allocating a set of extents to a store is a matter of linking
00243  * them together and then appending the list to the 
00244  * tail of the store's linked list:
00245  *  - Locks are \e not acquired for previous and next extents in the list; 
00246  *    EX-latches protect these structures;
00247  *  - New extents are always linked in at the tail of the list.
00248  *  - One extent-map page is fixed at a time 
00249  *   \ref ATSIGN "\@".  Entire portions of the
00250  *   extent list that reside on the same extent-map page are 
00251  *   linked while holding the page latched, and 
00252  *   logged in a single log record. This is useful only for creating
00253  *   large objects; all other page-allocations result in allocation of
00254  *   one or zero extents.
00255  *
00256  * Allocation is handled slightly differently in the two contexts 
00257  * in which it is performed: 
00258  * - creating a store (see \ref ALLOCST), and 
00259  * - inserting new pages into an existing store (see \ref ALLOCPG ).
00260  * These two cases are described in more detail below.
00261  *
00262  * Extents are freed:
00263  * - when a transaction deletes a store and commits, and
00264  * - when a transaction has deleted the last page in the extent; this involves acquiring an IX lock on the extent, but does not
00265  * preclude other transactions from allocating pages in the same extent. 
00266  * Also, since the transaction might abort, the extent must not be re-used 
00267  * for another store by another transaction. Furthermore, the page-deleting 
00268  * transaction could re-use the pages.  For these reasons, extents are left
00269  * in a store until the transaction commits (see \ref XCT_M and \ref XAFTERXCT,
00270  * and \ref SAFTERXCT). 
00271  * \anchor ALLOCEXTD1 
00272  * Thus, deallocating an extent \e before a transaction commits comprises:
00273  *  - clearing the extent-has-allocated-page bit in the already-held
00274  *   extent IX lock, and does not involve any page-fixes.
00275  *
00276  * \subsection XAFTERXCT Commit-Time Handling of Extent-Deallocation
00277  * At commit time, the transaction deallocates extents in two contexts:
00278  * - When destroying stores that were marked for deletion by the transaction,and
00279  * - While freeing extents marked for freeing by the transaction as a result
00280  *   of (incremental) page-freeing.
00281  *
00282  * For the latter case, the transaction asks the transaction
00283  * manager to 
00284  * identify all extents on which it has locks (lock manager's job).  
00285  * If 
00286  * - the lock manager can upgrade the extent's lock to EX mode \e and 
00287  * - the extent still contains no allocated pages, 
00288  * the lock manager frees the extent.
00289  * (An optimization avoids excessive page-fixing here: the
00290  * extent lock contains a bit indicating whether the extent contains any
00291  * allocated pages.)
00292  *
00293  * \anchor ALLOCEXTD2 
00294  * Deallocating an extent from a store (at transaction-commit) comprises:
00295  * - Identifying the previous- and next- extent numbers;
00296  * - Identifying the pages containing the \ref extlink_t structures for the
00297  *   extent to be freed and for the previous- and next- extent structures,
00298  *   which may mean as many as three pages;
00299  * - Sorting the page numbers and EX-latching 
00300  *   \ref ATSIGN "\@"
00301  *   the extent-map pages in 
00302  *   ascending order to avoid latch-latch deadlocks;
00303  * - Ensuring that the previous- and next- extent numbers on the 
00304  *   to-be-freed extent have not changed (so that we know that we have 
00305  *   fixed the right pages);
00306  *   There should be no opportunity for these links to change since
00307  *   the volume manager is a monitor (protected by a read-write lock).
00308  * - Updating the extents and physically logging each of the updates.
00309  * - Updating caches related to the store.
00310  *
00311  * If the extent is in a still-allocated store, the entity freeing
00312  * the extent (the lock manager) will have acquired an EX lock on
00313  * the extent for the transaction. If the extent is part of a
00314  * destroyed store, the store will have an EX lock on it and this will
00315  * prevent any other transaction from trying to deallocate the extent.
00316  *
00317  * \subsection IMPLICAT Implications of This Design
00318  * This extent-based design has the following implications:
00319  * - Before it can be used, a volume must be formatted for a given size 
00320  *   so that the number of extent map pages and store map pages can 
00321  *   be established;
00322  * - Extending the volume requires reformatting, so the server is forced to
00323  *   perform database reorganization in this case;
00324  * - Location of an extlink_t can be determined arithmetically from a page
00325  *   number (thus also from a record ID), 
00326  *   which is cheaper than looking in an index of any sort; however,
00327  *   this means that
00328  *   extra work is done to validate the record ID (that is, its 
00329  *   page's store-membership);
00330  * - Because the store-membership of a page is immaterial in locating the page,
00331  *   the buffer-pool manager need not pay any attention to stores; in fact,
00332  *   it reduces I/O costs by sorting pages by {volume,page number} and writing 
00333  *   contiguous pages in one system call;
00334  * - Because the store-membership of an extent is immaterial in locating the
00335  *   extent, extent locks do not contain a store number, and their locks
00336  *   can be aquired regardless of their allocation state.  One can
00337  *   test the "locked" status of an extent prior to allocating it.
00338  * - Extent-map pages tend to be hot (remain in the buffer pool), which 
00339  *   minimizes I/O;
00340  * - Extent-map pages could be a source of latch contention, however
00341  *   they are only latched in the volume manager, which redirects the
00342  *   contention to the volume mutex;
00343  * - The number of page fixes required for finding free extents is bounded 
00344  *   by the number of extent-map pages on the volume, and in some cases
00345  *   employs O(n) (linear) searches, as described in the item below;
00346  * - Pages may be reserved for allocation in a file without being allocated, 
00347  *   so optimal use of the volume requires that the allocated extents 
00348  *   be searched before new extents are allocated; 
00349  * - Deallocating a page and changing store flags (logging attributes)
00350  *   of a page or store does not require touching the page itself; entire
00351  *   stores are deallocated by latching and updating only the required
00352  *   extent-map pages;
00353  * - The high fan-out of extent-map pages to pages ensures that deallocating
00354  *   stores is cheap;
00355  * - Clustering of pages is achieved, which is useful for large objects and
00356  *   can be helpful for file scans;
00357  * - Prefetching of file pages can be achieved by inspection of extent maps;
00358  * - Files need not impose their own structure on top of stores: store
00359  *   order is file order; the fact that the storage manager avoids 
00360  *   superimposing a file structure has \ref FILERAMIFICATIONS "ramifications of its own".
00361  *
00362  *
00363  * The volume layer does not contain any means of spreading out or clustering
00364  * extents over extent-map pages for clustering (or for latch-contention 
00365  * mitigation).
00366  *
00367  * \subsection ALLOCST Creating and Destroying Stores
00368  *
00369  * For each store  the storage manager keeps certain metadata about the store
00370  * in a \e directory, which is an index maintained by the \ref DIR_M.
00371  * 
00372  * \anchor STORECREATE
00373  * Creating a store comprises:
00374  * - Finding an unused store number. This is a linear search through the
00375  *   store node map pages for a stnode_t (one with no associated extent list)
00376  *   that is not locked. The linear search starts at a revolving location.
00377  *   In the worst case, it will search all the stnode_t and therefore
00378  *   fix 
00379  *   \ref ATSIGN "\@"
00380  *   all the store map pages;
00381  * - Aquiring a store lock in EX mode, long-duration;
00382  * - Finding an extent for the store (details \ref ALLOCEXTA "here");
00383  * - Updating the stnode_t 
00384  *   \ref ATSIGN "\@"
00385  *   to reserve it (legitimize the store number);
00386  * - Logging the creation store operation without the first extent;
00387  * - Allocating the extent to the store (details \ref ALLOCEXTA2 "here")
00388  *   (we cannot allocate an extent without a legitimate owning store to which
00389  *   to allocate it);
00390  * - Updating the stnode_t to add the first extent 
00391  *   \ref ATSIGN "\@";
00392  * - Logging the store operation to add the first exent to the store.
00393  *
00394  * Destroying a store before transaction-commit comprises these steps:
00395  * - Verify that the store number is a valid number (no latches required);
00396  * - Mark the store as deleted:
00397  *   - Latch the store map page (that holds the stnode_t) for the store 
00398  *   \ref ATSIGN "\@";
00399  *   - EX-lock the store (long duration);
00400  *   - Mark the stnode_t as "t_deleting_store" 
00401  *   \ref ATSIGN "\@"
00402  *   (meaning it is to be
00403  *   deleted at end-of-transaction, see \ref XCT_M);
00404  *   - Log this marking store operation;
00405  *   - Add the store to a list of stores to free when the transaction 
00406  *   commits (See \ref XCT_M);
00407  * - Clear caches related to the store.
00408  *
00409  * \subsection SAFTERXCT Commit-Time Handling of Store-Destruction
00410  * Removing a store (at commit time) marked for deletion comprises these steps:
00411  * - Verify that the store is still marked for deletion (partial rollback
00412  *   does not inspect the list of stores to delete, and in any case, this
00413  *   has to be one on restart because the list is transient) 
00414  *   \ref ATSIGN "\@",
00415  * - Update the stnode_t to indicate that the store's extents are about to
00416  *   be deallocated 
00417  *   \ref ATSIGN "\@";
00418  * - Log the above store operation for crash recovery;
00419  * - Free (really) the extents in the store 
00420  *   \ref ATSIGN "\@";
00421  * - Update the stnode_t to clear its first extent 
00422  *   \ref ATSIGN "\@";
00423  * - Log the store as having been deleted in toto;
00424  * - Clear cached information for this store.
00425  *
00426  * \subsection ALLOCPG Allocation and Deallocation of Pages
00427  *
00428  * Allocating an extent to a store does not make its pages "visible" to the
00429  * server. They are considered "reserved".
00430  * Pages within the extent have to be allocated 
00431  * (their bits in the extent's bitmap must be set).  
00432  *
00433  * When the store is used for an index, the page is not 
00434  * visible until it has been formatted
00435  * and inserted (linked) into the index.  
00436  * In the case of files, however,
00437  * the situation is complicated by the lack of linkage of file pages by
00438  * another means.  Pages used for large objects are referenced through an
00439  * index in the object's metadata, but pages used 
00440  * for small objects become part of the
00441  * file when the 
00442  * page's extent bitmap indicates that it is
00443  * allocated. 
00444  * \anchor FILERAMIFICATIONS
00445  * This has some significant ramifications:
00446  * - neither deallocation nor allocation of file pages requires latching of
00447  *   previous and next pages for linking purposes;
00448  * - the file manager and index managers handle page allocation somewhat
00449  *   differently;
00450  * - the file manager must go to great lengths to ensure that the
00451  *   page is not accessible until both allocated and formatted, and
00452  *   to ensure the safety of
00453  *   the file structure in event of error or crash.
00454  *
00455  * Despite the fact that the intended uses of the page require different
00456  * handling, a significant part of page allocation is 
00457  * generic and is handled by the volume layer.  To handle some of the
00458  * contextual differences, the volume layer uses a callback to the
00459  * calling manager.  
00460  *
00461  * \anchor ALLOCPGTOSTORE
00462  * Allocating a page to a store comprises these steps
00463  * (in fact, the code is
00464  * written to allocate a number of pages at once; this description is 
00465  * simplified):
00466  * - Locate a reserved page in the store
00467  *   - if the page must be \e appended to the store, special precautions are
00468  *   needed to ensure that the reserved page is the next unallocated page 
00469  *   in the last extent of the store 
00470  *   \ref ATSIGN "\@";
00471  *   - if the page need not be appended, any reserved page will do
00472  *      - look in the cache for extents with reserved pages
00473  *      - if none found \e and we are not in append-only context,
00474  *      search the file's extents for reserved pages 
00475  *   \ref ATSIGN "\@";
00476  *      (This can be disabled by changing the value 
00477  *      of the constant \e never_search in sm_io.cpp.)
00478  * - If no reserved pages are found, find and allocate an extent 
00479  *   (details \ref ALLOCEXTA "here");
00480  * - Acquire an IX lock on the extent in which we found reserved page(s)
00481  * - Find a reserved page in the given extent that has \e no \e lock on it
00482  *   (if no such thing exists, skip this extent and find another
00483  *   \ref ATSIGN "\@");
00484  * - Acquire a lock on the available page 
00485  *   (mode IX or EX, and duration depend on the context)
00486  *   - the file manager when allocating a small-record file page
00487  *   uses IX mode, long(commit-) duration, 
00488  *   which means that
00489  *   deallocated pages will not be reallocated until after the deallocating 
00490  *   transaction commits (see \ref XAFTERXCT);
00491  *   - the file manager when allocating pages for large records 
00492  *   uses long duration, 
00493  *   IX or EX mode, depending on the exact use of the page for 
00494  *   (various) large-record structures; long-duration locks mean that
00495  *   deallocated pages will not be reallocated until the deallocating ;
00496  *   - btree index manager uses EX mode, instant duration , meaning that
00497  *   deallocated pages can be re-used;
00498  *   - rtree index manager uses EX mode, instant duration, meaning that
00499  *   deallocated pages can be re-used;
00500  * - Call back to (file or index) manager to accept or reject this page:
00501  *   - file manager allocating a small-record file page
00502  *   fixes 
00503  *   \ref ATSIGN "\@"
00504  *   the page (for formatting) and returns "accept";
00505  *   - file manager allocating a large-record page just returns "accept";
00506  *   - rtree and btree index managers just return "accept";
00507  * - Log the page allocation, 
00508  *   set "has-page-allocated" indicator in the extent lock.
00509  *
00510  * As mentioned above, there are times when the volume manager is told to
00511  * allocate new pages at the end of the store (append).  This happens 
00512  * when the file manager allocates small-object file pages unless the
00513  * caller passes in the policy t_compact, indicating that it should search
00514  * the file for available pages.
00515  * The server can choose its policy when calling \ref ss_m::create_rec
00516  * (see \ref pg_policy_t).
00517  * When the server uses \ref append_file_i, only the policy t_append
00518  * is used, which enforces append-only page allocation.
00519  *
00520  * Deallocating a page in a store comprises these steps:
00521  * - Acquire a long-duration EX lock on the page;
00522  * - Verify the store-membership of the page if required to do so (by
00523  *   the file manager in cases in which it was forced to  unfix
00524  *   and  fix 
00525  *   \ref ATSIGN "\@"
00526  *   the page);
00527  * - Acquire a long-duration IX lock on the page's extent;
00528  * - Fix 
00529  *   \ref ATSIGN "\@"
00530  *   the extent-map page and update the extent's bitmap, log the update. 
00531  * - If the extent now contains only reserved pages, mark the extent as
00532  *   removable (clear the extent-has-allocated-page bit in the already-held
00533  *   IX lock).
00534  *
00535  *
00536  * \subsection VOLCACHES Volume Manager Caches 
00537  * The volume manager does not contain any persistent indices to 
00538  * assist in finding free pages in a store's allocated extents (which it
00539  * can do only when not forced to append to the store). 
00540  * To minimize the need for linear searches through the store's extents,
00541  * the volume manager 
00542  * caches information about reserved pages in each store, the
00543  * \b reserved-page \b cache.
00544  * This is a map (set of pairs mapping snum_t -> extnum_t) to find extents
00545  * already allocated to a store that contain free pages. This
00546  * cache is consulted before new extents are allocated to a store.
00547  * Since after restart the cache is necessarily empty, it is primed when
00548  * first needed for the purpose of allocating anything for the store. 
00549  *
00550  * The volume manager also keeps a \b last-page \b cache.
00551  * This is a map from snum_t to extnum_t; the extent number is that of
00552  * the last extent allocated to the store. From this one can arithmetically
00553  * derive the last page in (reserved or possibly allocated) the store.
00554  * Note that the last page \e allocated to the store might be anywhere in
00555  * the store's exent list; all pages after that might be \e reserved but
00556  * not allocated.
00557  *
00558  * Priming caches is an expensive operation.
00559  * It is not done on each volume mount, because volumes are mounted and
00560  * dismounted several times during recovery, and priming on each
00561  * mount would be prohibitive.
00562  * Every attempt to allocate a page checks the store's reserved-page 
00563  * cache; if it is empty, it is primed.
00564  * Every time an extent is allocated or a last-page-in-file is located,
00565  * the last-page cache is updated.
00566  *
00567  * \subsection PAGES Page Types
00568  * Pages in a volume come in a variety of page types, all the same size.
00569  * The size of a page is a compile-time constant.  It is controlled by
00570  * a build-time configuration option (see 
00571  * \ref CONFIGOPT). the default page size is 8192 bytes.
00572  *
00573  * All pages are \e slotted (those that don't need the slot structure
00574  * may use only one slot) and have the following layout:
00575  * - header, including
00576  *   - lsn_t log sequence number of last page update
00577  *   - page id
00578  *   - links to next and previous pages (used by some storage structures)
00579  *   - page tag (indicates type of page)
00580  *   - space management metadata (space_t)
00581  *   - store flags (logging level metadata)
00582  * - slots (grow down)
00583  * - slot table array of pointers to the slots (grows up)
00584  * - footer (copy of log sequence number of last page update)
00585  *
00586  * Each page type is a C++ class that derives from the base class
00587  * page_p.  Page_p implements functionality that is common to all
00588  * (or most) page types. The types are as follows:
00589  *
00590  * - extlink_p : extent-link pages, used by vol_m
00591  * - stnode_p  : store-node pages, used by vol_m
00592  * - file_p    : slotted pages of file-of-record, used by file_m
00593  * - lgdata_p  : pages of large records, used by file_m
00594  * - lgindex_p : pages of large records, used by file_m
00595  * - keyed_p   : slotted pages of indexes, used by btree_m
00596  * - zkeyed_p  : slotted pages of indexes, used by btree_m
00597  * - rtree_p   : slotted pages of spatial indexes, used by rtree_m
00598  *
00599  * Issues specific to the page types will be dealt with in the descriptions of the modules that use them.
00600  *
00601  *  
00602  * \subsection RSVD_MODE Space Reservation on a Page
00603  *
00604  * Different storage structures offer different opportunities for fine-grained 
00605  * locking and need different means of allocation space within a page.
00606  * Special care is taken to reserve space on a page when slots 
00607  * are freed (records are deleted) so that rollback can restore 
00608  * the space on the page.  
00609  * Page types that use this space reservation have 
00610  * \code page_p::rsvd_mode() == true \endcode.
00611  *
00612  * In the case of B-trees, space reservation is not used because
00613  * undo and redo are handled logically -- entries 
00614  * can be re-inserted in a different page.  But in the case of files, 
00615  * records are identified by physical ID, which includes page and slot number,
00616  * so records must be reinserted just where they first appeared. 
00617  *
00618  * Holes in a page are coalesced (moved to the end of the page) as needed, 
00619  * when the total free space on the page satisfies a need but the 
00620  * contiguous free space does not.  Hence, a record truncation followed 
00621  * by an append to the same record does not necessarily cause the 
00622  * shifting of other records on the same page.
00623  *
00624  * A count of free bytes is maintained for all pages. Free-space
00625  * metadata are maintained for rsvd_mode() pages:
00626  * - When a transaction releases a slot on a page with rsvd_mode(), the slot 
00627  * remains
00628  * "reserved" for use by the same transaction. 
00629  * - That slot is not free to be allocated by another transaction until 
00630  * the releasing transaction commits.  
00631  * This is because if the transaction aborts, the slot must
00632  * be restored with the same slot number.  
00633  * Not only must the slot number be  preserved, 
00634  * but the number of bytes consumed by that slot must remain 
00635  * available lest the transaction abort.
00636  * - The storage manager keeps track of the youngest active transaction
00637  * that is freeing space (i.e., "reserving" it) on the page 
00638  * and the number of bytes freed ("reserved")
00639  * by the youngest transaction.  
00640  * - When the youngest transaction to reserve space on the page becomes
00641  * older than the oldest active transaction in the system, the reserved
00642  * space becomes free. This check for freeing up the reserved space happens
00643  * whenever a transaction tries to allocate space on the page.
00644  * - During rollback, a transaction can use \e any amount of
00645  * reserved space, but during forward processing, it can only use space
00646  * it reserved, and that is known only if the transaction in question is
00647  * the youngest transaction described in the above paragraph.
00648  * - The changes to space-reservation metadata (space_t) are not logged.
00649  * The actions that result in updates to this metadata are logged (as
00650  * page mark and page reclaim).
00651  *
00652  * \section FILE_M File Manager
00653  * A file is a group of variable-sized records.
00654  * A record is the smallest persistent datum that has identity.
00655  * A record may also have a "user header", whose contents are
00656  * for use by the server.  
00657  * As records vary in size, so their storage representation varies. 
00658  * The storage manager changes the storage representation as needed.
00659  *
00660  * A file comprises two stores.  
00661  * One store is allocated for slotted (small-record) pages, called file_p
00662  * pages.  
00663  * One store is allocated for large records, and contains lgdata_p and
00664  * lgindex_p pages.
00665  * Small records are those whose size is less than or equal to
00666  * sm_config_info_t.max_small_rec.  A record larger than this
00667  * has a slot on a small-record page, which slot contains metadata
00668  * refering to pages in the large-record store.
00669  * The scan order of a file is the physical order of the records
00670  * in the small-record store.
00671  *
00672  * Every record, large or small, has the following metadata in the
00673  * record's slot on the file_p page; these data are held in a rectag_t
00674  * structure:
00675  * \code
00676 struct rectag_t {
00677     uint2_t   hdr_len;  // length of user header, may be zero
00678     uint2_t   flags;    // enum recflags_t: indicates internal implementation
00679     smsize_t  body_len; // true length of the record 
00680 };
00681 \endcode
00682   * The flags have have these values:
00683     - t_small   : a small record, entirely contained on the file_p
00684     - t_large_0 : a large record, the slot on the file_p contains the
00685                   user header, while the body is a list
00686                   of chunks (pointers to contiguous lgdata_p pages)
00687     - t_large_1 : a large record, the slot on the file_p contains the
00688                   user header, while the body is a reference to a single
00689                   lgindex_p page, which is the root of a 1-level index of
00690                   lgdata_p pages.
00691     - t_large_2 : like t_large_1 but the index may be two levels deep. This
00692                   has not been implemented.
00693  *
00694  * Internally (inside the storage manager), the class record_t is a
00695  * handle on the record's tag and is the class through which the
00696  * rectag_t is manipulated.
00697  *
00698  * A record is exposed to the server through a set 
00699  * of ss_m methods (\ref ss_m::create_rec,
00700  * \ref ss_m::append_rec, etc), and through the \ref pin_i class.
00701  *
00702  * \attention
00703  * All updates to records are accomplished by copying out part or all of
00704  * the record from the buffer pool to the server's address space, performing
00705  * updates there, and handing the new data to the storage manager. 
00706  * User (server) data are not updated directly in the buffer pool.
00707  *
00708  * The server may cause the file_p and at most one large data page to
00709  * be pinned for a given record through the pin_i class; the server must
00710  * take care not to create latch-latch deadlocks by holding a record pinned
00711  * while attempting to pin another.  An ordering protocol among the pages
00712  * pinned must be observed to avoid such deadlocks.
00713  *
00714  * \note The system only detects lock-lock deadlocks.  Deadlocks involving
00715  * mutexes or latches or other blocking mechanisms will cause the server to
00716  * hang.
00717  *
00718  * \subsection ALLOCFL Creating and Destroying a File of Records
00719  *
00720  * Creating a file comprises these steps:
00721  * - Create a store of file_p pages for the record headers and 
00722  *   small-record bodies 
00723  *   (details are \ref STORECREATE "here");
00724  * - Create a store of lgindex_p and lgdata_p pages for the 
00725  *   large objects
00726  *   (details are \ref STORECREATE "here");
00727  * - Acquire a long-duration EX lock on the store id;
00728  * - Create the "file structure" on the first store. This means a
00729  *   single page is allocated
00730  *   and formatted as a file_p page with no records
00731  *   in use
00732  *   (details are \ref ALLOCPAGETOSTORE "here");
00733  *   ;
00734  * - Insert the file meta-data in the file-directory index (see \ref DIR_M).
00735  *   This involves several page fixes for the btree insertion 
00736  *   \ref ATSIGN "\@".
00737  *
00738  * Destroying a file comprises these steps:
00739  * - Consult the directory entry for the file's small-object store, 
00740  *   thus acquiring an EX long-duration lock on the file;
00741  *   This will involve page fixes for reading the btree directory.
00742  * - Verify that the store is really used for a file, and still exists;
00743  *   This requires SH-latching a store map page to inspect the stnode_t for
00744  *   the store 
00745  *   \ref ATSIGN "\@";
00746  * - Mark the store for destruction
00747  *   (details are in \ref ALLOCST ).
00748  * - Remove transient histogram information for the store;
00749  * - From the store directory, determine the large-object store associated
00750  *   with the file;
00751  * - Mark the large-object store for destruction
00752  *   (details are in \ref ALLOCST ).
00753  * - Remove the file's metatdata from the directory 
00754  *   \ref ATSIGN "\@"
00755  *   (\ref DIR_M).
00756  *
00757  * \subsection HISTO Finding Space for a Record
00758  *
00759  * The file manager caches information
00760  * about page utilization for (small-object) pages in each file so that it can
00761  * control fragmentation of files.
00762  * This cached information takes the form of a 
00763  * histoid_t, which contains a \b heap and a \b histogram.  
00764  *
00765  * The \b heap keeps track of the amount of free space in 
00766  * (recently-used) pages in the heap, and it is 
00767  * searchable so that it can
00768  * return the page with the smallest free space that is larger than a
00769  * given value in bytes.
00770  * The free-space-on-page value that it uses for this purpose
00771  * is the most liberal value -- it's possible that some of the space on
00772  * the page is reserved for a transaction that has not yet committed
00773  * (if that transaction destroyed a record, it can use space that other
00774  * transactions cannot).
00775  * \bug GNATS 157 The histoid_t heap should have some size limit (number of entries).
00776  *
00777  * The  \b histogram has a small number of buckets, each of which counts
00778  * the number of pages in the file containing free space between 
00779  * the bucket min and the bucket max.
00780  *
00781  *
00782  * When a record is created, the file manager 
00783  * tries to use an already-allocated
00784  * page that has space for the record. 
00785  * It determines what space is needed
00786  * for the record from the length hint and
00787  * the data given in the \ref ss_m::create_rec call. 
00788  *
00789  * Three policies used can be used (in combination) to search for pages
00790  * with space in which to create a new record:
00791  * - t_cache : look in the \b heap for a page with space. 
00792  * - t_compact : if the \b histograms say 
00793  *   there are any pages with  sufficient space somewhere in the file, 
00794  *   do a linear search of the file for such a page, updating histogram
00795  *   heap metadata in the process.  This is potentially
00796  *   costly but useful when the file has not been inspected since the
00797  *   last restart, because the heap has no records for the file except
00798  *   those inserted due to a record-update or removal.
00799  * - t_append : append the new record to the file
00800  *
00801  * Using append_file_t to create records means only t_append is used, 
00802  * ensuring that the record will always be appended to the file.
00803  * \ref ss_m::create_rec uses t_cache | t_compact | t_append.
00804  *
00805  * The policy can be given on the \ref ss_m::create_rec call. The default
00806  * is t_cache | t_compact | t_append.
00807  *
00808  * If the file manager does not find a page in the file with sufficient
00809  * space for the record, or if it must append to the end of the file
00810  * and the last page hasn't the needed space, the file manager asks
00811  * the I/O manager to allocate a page.
00812  *
00813  * Once the file manager has located a page with sufficient space to
00814  * create the record, the \ref VOL_M worries about 
00815  * \ref RSVD_MODE.
00816  *
00817  * Creating a record comprises these steps:
00818  * - Estimate the space required for the record, based on the sizes of 
00819  *   the data and header vectors and the length-hint given 
00820  *   on the ss_m::create_rec call.
00821  * - Choose a record implementation for the given size (
00822  *   a small object or a large object, which determines the amount 
00823  *   of space needed in the slot of the file_p page)
00824  * - Find and lock a slot in a page:
00825  *   - if appending to the file, find and EX-lock (with long-duration) the next 
00826  *   available slot in the last page of the file.
00827  *   Finding the last page in the file requires SH-latching extent-map 
00828  *   pages 
00829  *   \ref ATSIGN "\@";
00830  *   the storage manager caches last-page information to reduce
00831  *   page-fixing to find the last page of a store.
00832  *   Searches for the last page of a store start with the cached
00833  *   last-page's extent if it is still part of the store; otherwise they
00834  *   start with the head of the store's list, and update the cache.
00835  *   - If there is no such slot or if it is not large enough, 
00836  *   allocate a new page (at the end of the file).
00837  *   - if not appending to the file, consult the histograms to find a 
00838  *   page already in the file, 
00839  *   one that contains a slot large enough for the new record.
00840  * - Once we have located a (potentially) suitable page, 
00841  *   verify that the page is indeed legitimate (since we used
00842  *   transient, cached information to locate this page).
00843  *   If not, reject  the page.
00844  *   Note that normally we cover latches with locks to avoid
00845  *   deadlocks, but in this case we must latch first because
00846  *   we have no idea which slot to lock, nor do we know
00847  *   if the page is still in the expected file.
00848  *   We may have to try several pages before finding one 
00849  *   that is truly suitable, so this entire protocol 
00850  *   is handled in the histogram code. The protocol is as follows:
00851  *   - Conditionally EX-latch 
00852  *   \ref ATSIGN "\@"
00853  *   the page. 
00854  *   If we cannot do so, give up on  this page and try another;
00855  *   - Once the page is fixed, verify that its 
00856  *   page ID contains the expected store ID (to detect a race) 
00857  *   (if not, reject this page and find another);
00858  *   - Check the allocation status of the page:
00859  *     - try to IS-lock the page, and if we 
00860  *     cannot do so immediately, we reject the page and try another;
00861  *     - verify that the page is allocated in its extent, and that 
00862  *     the extent's owner is the expected store. This involves 
00863  *     SH-latching 
00864  *      \ref ATSIGN "\@"
00865  *     the extent map page while holding the file page latched;
00866  *   - Acquire an EX lock on the next available slot with enough 
00867  *   space (space that is usable by this transaction,
00868  *   subject to \ref RSVD_MODE);
00869  * - Once we have a suitable page with an EX record lock,  
00870  *   create the record.
00871  * 
00872  * Under the best of circumstances, creating a small record involves three page
00873  * latches 
00874  *   \ref ATSIGN "\@":
00875  * one file page (in which to insert a record),
00876  * the extent map page is fixed to verify the file page's allocation status,
00877  * and then after the record is created, the 
00878  * the extent map page is re-latched to update the space-utilization metadata
00879  * for the extent.
00880  *
00881  * Updating a record (by way of a server-provided record ID) consists 
00882  * in these steps, which may require up to two extra page latches (besides
00883  * latching the page containing the record):
00884  * - Latch the page on which the record resides 
00885  *   \ref ATSIGN "\@";
00886  * - Verify the legitimacy of the record ID (see \ref UPDATEREC);
00887  * - Verify that the file page still contains the record identified by the 
00888  *   record ID;
00889  * - Perform the update.
00890  *
00891  * Freeing a record comprises these steps:
00892  * - EX-lock the record (with long duration);
00893  * - From the record ID, determine its containing page;
00894  * - EX-latch the page 
00895  *   \ref ATSIGN "\@";
00896  * - Mark the slot free, releasing the space but 
00897  *   leaving it reserved (see \ref RSVD_MODE);
00898  * - If the slot is the last used slot on the page, and the page is not the
00899  *   first allocated page in the file, 
00900  *   free the page (finding the first page in the
00901  *   file requires  SH-latching 
00902  *   a store-map page 
00903  *   \ref ATSIGN "\@"
00904  *   and at least one extent-map page
00905  *   \ref ATSIGN "\@");
00906  * - Update the histograms to reflect the space on the page and whether the
00907  *   page is still in the file.  This may change the page's bucket and
00908  *   require EX-latching 
00909  *   \ref ATSIGN "\@"
00910  *   the extent-map page to update the pspacemap.
00911  *
00912  *
00913  * \subsection UPDATEREC  Record Access by Record ID
00914  *
00915  * When a server issues a storage manager request to
00916  * read or update a record (using any ss_m::update_rec or pin_i::pin
00917  * or pin_i::repin, that is, any method that
00918  * takes a record identifier), the storage manager must verify
00919  * the legitimacy of the record identifier.  (If the storage
00920  * manager performs an update based on a record ID that is stale,
00921  * it can result in unrecoverable errors. )
00922  * Verifying the legitimacy of a record ID is, unfortunately,
00923  * an expensive operation:
00924  * - Verify that store id on the page matches that in the record ID;
00925  * - Verify that page is a file page;
00926  * - Verify that the store exists (inspect the stnode_t for the store
00927  *   \ref ATSIGN "\@"
00928  *   )
00929  * - Verify that the page is allocated to the store (fix the extlink_t
00930  *   \ref ATSIGN "\@"
00931  *   and verify that its owner is still the store in question and that
00932  *   the page's bit is set in the extent's page map).
00933  *
00934  * The storage manager does not inspect the store's metatdata (directory
00935  * index entry) to 
00936  * see that it is a file store because if the page is still allocated
00937  * to the store and the page is a file page, the store must still be a file
00938  * store.
00939  *
00940  * \subsection FREESPACE Summary of Free-Space Management
00941  *
00942  * Free space is managed in several ways:
00943  * - Free extents have no owner (persistent datum in the extlink_t).
00944  * - Allocated extents with free pages are cached by the 
00945  *   volume manager (transient data).
00946  * - Allocated extents contain a map of the free-space buckets to which its
00947  *   pages belong for file pages (persistent, in the extlink_t).
00948  * - The volume manager keeps caches the last page in a file (transient).
00949  * - The volume manager keeps a cache of extents in a store that
00950  *   contain reserved pages (transient).
00951  * - The volume manager caches the lowest unallocated extent in a volume (transient).
00952  * - The file manager maintains a heap and histograms (transient) that
00953  *   cache the free-space bucket data on the pages. 
00954  *
00955  * \section BTREE_M B+-Tree Manager
00956  *
00957  * The values associated with the keys are opaque to the storage
00958  * manager, except when IM (Index Management locking protocol) is used, 
00959  * in which case the value is
00960  * treated as a record ID, but no integrity checks are done.  
00961  * It is the responsibility of the server to see that the value is
00962  * legitimate in this case.
00963  *
00964  * B-trees can be bulk-loaded from files of sorted key-value pairs,
00965  * as long as the keys are in \ref LEXICOFORMAT "lexicographic form". 
00966  * \bug GNATS 116 Btree doesn't sort elements for duplicate keys in bulk-load. 
00967  * This is a problem inherited from the original SHORE storage manager.
00968  *
00969  * The implementation of B-trees is straight from the Mohan ARIES/IM
00970  * and ARIES/KVL papers. See \ref MOH1, which covers both topics.
00971  *
00972  * Those two papers give a thorough explanation of the arcane algorithms,
00973  * including logging considerations.  
00974  * Anyone considering changing the B-tree code is strongly encouraged 
00975  * to read these papers carefully.  
00976  * Some of the performance tricks described in these papers are 
00977  * not implemented here.  
00978  * For example, the ARIES/IM paper describes performance of logical 
00979  * undo of insert operations if and only if physical undo 
00980  * is not possible.  
00981  * The storage manager always undoes inserts logically.
00982  *
00983  * \bug GNATS 137 Latches can now be downgraded; btree code should use this.
00984  *
00985  * \section RTREE_M R*-Tree Manager
00986  *
00987  * The spatial indexes in the storage manager are R*-trees, a variant
00988  * of R-trees that perform frequent restructuring to yield higher
00989  * performance than normal R-trees.  The entire index is locked.
00990  * See \ref BKSS.
00991  *
00992  * \section DIR_M Directory Manager
00993  * All storage structures created by a server
00994  * have entries in a B+-Tree index called the 
00995  * \e store \e directory or just \e directory.
00996  * This index is not exposed to the server.
00997  *
00998  * The storage manager maintains  some transient and some persistent data
00999  * for each store.  The directory's key is the store ID, and the value it
01000  * returns from a lookup is a
01001  * sdesc_t ("store descriptor") structure, which
01002  * contains both the persistent and transient information. 
01003  *
01004  * The persistent information is in a sinfo_s structure; the 
01005  * transient information is resident only in the cache of sdesc_t 
01006  * structures that the directory manager
01007  * maintains.
01008  *
01009  * The metadata include:
01010  * - what kind of storage structure uses this store  (btree, rtree, file)
01011  * - if a B-tree, is it unique and what kind of locking protocol does it use?
01012  * - what stores compose this storage structure (e.g., if file, what is the
01013  *   large-object store and what is the small-record store?)
01014  * - what is the root page of the structure (if an index)
01015  * - what is the key type if this is an index
01016  *
01017  * \section LOCK_M Lock Manager
01018  *
01019  * The lock manager understands the folling kind of locks
01020  * - volume
01021  * - extent
01022  * - store
01023  * - page
01024  * - kvl
01025  * - record
01026  * - user1
01027  * - user2
01028  * - user3
01029  * - user4
01030  *
01031  * Lock requests are issued with a lock ID (lockid_t), which
01032  * encodes the identity of the entity being locked, the kind of
01033  * lock, and, by inference,  a lock hierarchy for a subset of the
01034  * kinds of locks above.
01035  * The lock manager does not insist that lock identifiers 
01036  * refer to any existing object.  
01037  *
01038  * The lock manager enforces two lock hierarchies:
01039  * - Volume - store - page - record
01040  * - Volume - store - key-value
01041  *
01042  * Note that the following lock kinds are not in any hierarchy:
01043  * -extent
01044  * -user1, user2, user3, user4
01045  *
01046  * Other than the way the lock identifiers are inspected for the purpose 
01047  * of enforcing the hierarchy, lock identifiers are considered opaque 
01048  * data by the lock manager.  
01049  *
01050  * The lockid_t structure can be constructed from the IDs of the
01051  * various entities in (and out of ) the hierarchy; see lockid_t and
01052  * the example lockid_test.cpp.
01053  *
01054  * \subsection LOCK_M_IMPLICIT Implicit locks
01055  * The hierarchy is used for implicit acquisition of locks as follows:
01056  * - for each parent lock in the hierarchy, determine its lock mode 
01057  *   based on the mode of the child:
01058  *   - parent_mode[child IS or SH] =  IS
01059  *   - parent_mode[child IX or SIX or UD or EX] =  IX
01060  *   - parent_mode[child none] =  none
01061  * - for each parent lock in the hierarchy that is not already held in
01062  *   sufficient mode, acquire the parent lock in the mode determined above
01063  *
01064  * \subsection LOCK_M_ESC Escalation
01065  * The lock manager escalates up the hierarchy by default.  
01066  * The escalation thresholds are based on run-time options.  
01067  * They can be controlled (set, disabled) on a per-object level.  
01068  * For example, escalation to the store level can be disabled when 
01069  * increased concurrency is desired.  
01070  * Escalation can also be controlled on a per-transaction or per-server basis.
01071  *
01072  * \subsection LOCK_M_SM Lock Acquisition and Release by Storage Manager
01073  * Locks are acquired by storage manager operations as appropriate to the
01074  * use of the data (read/write). (Update locks are not acquired by the
01075  * storage manager.)
01076  *
01077  * The storage manager's API allows explicit acquisition 
01078  * of locks by a server.    User modes user1, user2, user3 and user4 are provided for that purpose.
01079  *
01080  * Freeing locks is automatic at transaction commit and rollback.  
01081  *
01082  * There is limited support for freeing locks in the middle of 
01083  * a transaction:
01084  * - locks of duration less than t_long can be unlocked with unlock(), and
01085  * - quarks (sm_quark_t) simplify acquiring and freeing locks mid-transaction:
01086  *
01087  * \subsubsection QUARK Quarks
01088  * A quark is a marker in the list of locks held by a transaction.  
01089  * When the quark is destroyed, all locks acquired since the 
01090  * creation of the quark are freed.  Quarks cannot be used while more than
01091  * one thread is attached to the transaction, although the storage 
01092  * manager does not strictly enforce this (due to the cost).
01093  * When a quark is in use for a transaction, the locks acquired
01094  * will be of short duration, the assumption being that the quark
01095  * will be closed before commit-time.  
01096  *
01097  * Extent locks are an exception; they must be held long-term for
01098  * page allocation and deallocation to work, so even in the context
01099  * of an open quark, extent locks will be held until end-of-transaction.
01100  *
01101  * The lock manager uses a hash table whose size is determined by
01102  * a configuration option.  
01103  * The hash function used by the lock manager is known not 
01104  * to distribute locks evenly among buckets.
01105  * This is partly due to the nature of lock IDs.
01106  *
01107  * \subsection LCACHE Lock Cache
01108  * To avoid expensive lock manager queries, each transaction 
01109  * keeps a cache of the last <N> locks acquired (the number
01110  * <N> is a compile-time constant).
01111  * This close association between the transaction manager and
01112  * the lock manager is encapsulated in several classes in the file lock_x.
01113  *
01114  * \subsection DLD Deadlock Detection
01115  * The lock manager uses a statistical deadlock-detection scheme
01116  * known as "Dreadlocks" [KH1].
01117  * Each storage manager thread (smthread_t) has a unique fingerprint, which is
01118  * a set of bits; the deadlock detector ORs together the bits of the
01119  * elements in a waits-for-dependency-list; each thread, when 
01120  * blocking, holds a  digest (the ORed bitmap).  
01121  * It is therefore cheap for a thread to detect a cycle when it needs to 
01122  * block awaiting a lock: look at the holders
01123  * of the lock and if it finds itself in any of their digests, a
01124  * cycle will result.
01125  * This works well when the total number of threads relative to the bitmap
01126  * size is such that it is possible to assign a unique bitmap to each
01127  * thread.   
01128  * If you cannot do so, you will have false-positive deadlocks
01129  * "detected".
01130  * The storage manager counts, in its statistics, the number of times
01131  * it could not assign a unique fingerprint to a thread.  
01132  * If you notice excessive transaction-aborts due to false-positive
01133  * deadlocks,
01134  * you can compile the storage manager to use a larger
01135  * number bits in the 
01136  * \code sm_thread_map_t \endcode
01137  * found in 
01138  * \code smthread.h \endcode.
01139  *
01140  * \section XCT_M Transaction Manager
01141  * When a transaction commits, these steps are taken to
01142  * manage stores:
01143  * - Stores that were given sm_store_property_t t_load_file or 
01144  *   t_insert_file are turned * into t_regular stores;
01145  * - The transaction enters a state called "freeing space" so that
01146  *   stores marked for deletion can be handled properly in event of
01147  *   a crash/restart before the transaction logs it commit-completion;
01148  * - Stores marked for deletion are removed (see \ref SAFTERXCT);
01149  * - Extents marked for freeing are freed. These are extents marked for
01150  *   freeing in stores that were not marked for deletion; rather,
01151  *   these are extents that are marked for deletion due to 
01152  *   incremental freeing of their pages.
01153  *
01154  * Because these are logged actions, and they occur if and only if the 
01155  * transaction commits, the storage manager guarantees that the ending
01156  * of the transaction and re-marking and deletion of stores is atomic.
01157  * This is accomplished by putting the transaction into a state
01158  * xct_freeing_space, and writing a log record to that effect.
01159  * The space is freed, the stores are converted, and a final log record is written before the transaction is truly ended.
01160  * In the event of a crash while a transaction is freeing space, 
01161  * recovery searches all the 
01162  * store metadata for stores marked for deletion
01163  * and deletes those that would otherwise have been missed in redo.
01164  *
01165  * \section LOG_M Log Manager
01166  *
01167  * \subsection LOG_M_USAGE How the Server Uses the Log Manager
01168  *
01169  * Log records for redoable-undoable operations contain both the 
01170  * redo- and undo- data, hence an operation never causes two 
01171  * different log records to be written for redo and for undo.  
01172  * This, too, controls logging overhead.  
01173  *
01174  * The protocol for applying an operation to an object is as follows:
01175  * - Lock the object.
01176  * - Fix the page(s) affected in exclusive mode.
01177  * - Apply the operation.
01178  * - Write the log record(s) for the operation.
01179  * - Unfix the page(s).
01180  *
01181  * The protocol for writing log records is as follows:
01182  * - Grab the transaction's log buffer in which the last log record is to be 
01183  *   cached by calling xct_t::get_logbuf()
01184  *   - Ensure that we have reserved enough log space for this transaction 
01185  *   to insert the desired log record an to undo it. This is done by
01186  *   by passing in 
01187  *   the type of the log record we are about to insert, and by using a
01188  *   "fudge factor" (multiplier) associated with the given log record type.
01189  *   The fudge factor indicates on average, how many bytes tend to be needed to undo the action being logged.
01190  * - Write the log record into the buffer (the idiom is to construct it
01191  *      there using C++ placement-new).
01192  * - Release the buffer with xct_t::give_logbuf(),
01193  *    passing in as an argument the fixed page that was affected
01194  *    by the update being logged.  This does several things: 
01195  *    - writes the transaction ID, previous LSN for this transaction 
01196  *      into the log record
01197  *    - inserts the record into the log and remembers this record's LSN
01198  *    - marks the given page dirty.
01199  *
01200  * Between the time the xct log buffer is grabbed and the time it is
01201  * released, the buffer is held exclusively by the one thread that
01202  * grabbed it, and updates to the xct log buffer can be made freely.
01203  * (Note that this per-transaction log buffer is unrelated to the log buffer
01204  * internal to the log manager.)
01205  *
01206  * During recovery, no logging is done in analysis or redo phases; only during
01207  * the undo phase are log records inserted.  Log-space reservation is not
01208  * needed until recovery is complete; the assumption is that if the
01209  * transaction had enough log space prior to recovery, it has enough space
01210  * during recovery.
01211  * Prepared transactions pose a challenge, in that they are not resolved until
01212  * after recovery is complete. Thus, when a transaction-prepare is logged,
01213  * the log-space-reservations of that transaction are logged along with the rest of the transaction state (locks, coordinator, etc.) and before 
01214  * recovery is complete, these transactions acquire their prior log-space
01215  * reservations.
01216  *
01217  * The above protocol is enforced by the storage manager in helper
01218  * functions that create log records; these functions are generated
01219  * by Perl scripts from the source file logdef.dat.  (See \ref LOGRECS.)
01220  *
01221  * The file logdef.dat also contains the fudge factors for log-space
01222  * reservation. These factors were experimentally determined.
01223  * There are corner cases involving btree page SMOs (structure-modification operations), in which the
01224  * fudge factors will fail.  [An example is when a transaction aborts after
01225  * having removed entries, and after other transactions have inserted
01226  * entries; the aborting transaction needs to re-insert its entries, which 
01227  * now require splits.]
01228  * The storage manager has no resolution for this.
01229  * The fudge factors handle the majority of cases without reserving excessive
01230  * log-space.
01231  * \bug GNATS 156  Btree SMOs during rollback can cause problems.
01232  *
01233  *\subsection LOGRECS Log Record Types
01234  * The input to the above-mentioned Perl script is the source of all
01235  * log record types.  Each log record type is listed in the  file
01236  * \code logdef.dat \endcode
01237  * which is fairly self-explanatory, reproduced here:
01238  * \include logdef.dat
01239  *
01240  * The bodies of the methods of the class <log-rec-name>_log
01241  * are hand-written and reside in \code logrec.cpp \endcode.
01242  *
01243  * Adding a new log record type consists in adding a line to
01244  * \code logdef.dat, \endcode
01245  * adding method definitions to 
01246  * \code logrec.cpp, \endcode
01247  * and adding the calls to the free function log_<log-rec-name>(args)
01248  * in the storage manager.
01249  * The base class for every log record is logrec_t, which is worth study
01250  * but is not documented here.
01251  *
01252  * Some logging records are \e compensated, meaning that the 
01253  * log records are skipped during rollback. 
01254  * Compensations may be needed because some operation simply cannot
01255  * be undone.  The protocol for compensating actions is as follows:
01256  * - Fix the needed pages.
01257  * - Grab an \e anchor in the log.  
01258  *   This is an LSN for the last log record written for this transaction.
01259  * - Update the pages and log the updates as usual.
01260  * - Write a compensation log record (or piggy-back the compensation on
01261  *   the last-written log record for this transaction to reduce 
01262  *   logging overhead) and free the anchor.
01263  *
01264  * \note Grabbing an anchor prevents all other threads in a multi-threaded
01265  * transaction from gaining access to the transaction manager.  Be careful
01266  * with this, as it can cause mutex-latch deadlocks where multi-threaded
01267  * transactions are concerned.  In other words, two threads cannot concurrently
01268  * update in the same transaction.
01269  *
01270  * In some cases, the following protocol is used to avoid excessive
01271  * logging by general update functions that, if logging were turned
01272  * on, would generate log records of their own.
01273  * - Fix the pages needed in exclusive mode.
01274  * - Turn off logging for the transaction.
01275  * - Perform the updates by calling some general functions.  If an error occurs, undo the updates explicitly.
01276  * - Turn on logging for the transaction.
01277  * - Log the operation.  If an error occurs, undo the updates with logging turned off..
01278  * - Unfix the pages.
01279  *
01280  * The mechanism for turning off logging for a transaction is to
01281  * construct an instance of xct_log_switch_t.
01282  *
01283  * When the instance is destroyed, the original logging state
01284  * is restored.  The switch applies only to the transaction that is 
01285  * attached to the thread at the time the switch instance is constructed, 
01286  * and it prevents other threads of the transaction from using 
01287  * the log (or doing much else in the transaction manager) 
01288  * while the switch exists.
01289  *
01290  * \subsection LOG_M_INTERNAL Log Manager Internals
01291  *
01292  * The log is a collection of files, all in the same directory, whose
01293  * path is determined by a run-time option.
01294  * Each file in the directory is called a "log file" and represents a
01295  * "partition" of the log. The log is partitioned into files to make it 
01296  * possible to archive portions of the log to free up disk space.
01297  * A log file has the name \e log.<n> where <n> is a positive integer.
01298  * The log file name indicates the set of logical sequence numbers (lsn_t)
01299  * of log records (logrec_t) that are contained in the file.  An
01300  * lsn_t has a \e high part and a \e low part, and the
01301  * \e high part (a.k.a., \e file part) is the <n> in the log file name.
01302  *
01303  * The user-settable run-time option sm_logsize indicates the maximum 
01304  * number of KB that may be opened at once; this, in turn, determines the
01305  * size of a partition file, since the number of partition files is
01306  * a compile-time constant.
01307  * The storage manager computes partition sizes based on the user-provided
01308  * log size, such that partitions sizes are a convenient multiple of blocks
01309  * (more about which, below).
01310  *
01311  * A new partition is opened when the tail of the log approaches the end
01312  * of a partition, that is, when the next insertion into the log
01313  * is at an offset larger than the maximum partition size.  (There is a
01314  * fudge factor of BLOCK_SIZE in here for convenience in implementation.)
01315  * 
01316  * The \e low part of an lsn_t represents the byte-offset into the log file
01317  * at which the log record with that lsn_t sits.
01318  *
01319  * Thus, the total file size of a log file \e log.<n>
01320  * is the size of all log records in the file, 
01321  * and the lsn_t of each log record in the file is
01322  * lsn_t(<n>, <byte-offset>) of the log record within the file.
01323  *
01324  * The log is, conceptually, a forever-expanding set of files. The log 
01325  * manager will open at most PARTITION_COUNT log files at any one time.
01326  * - PARTITION_COUNT = smlevel_0::max_openlog
01327  * - smlevel_0::max_openlog (sm_base.h) = SM_LOG_PARTITIONS
01328  * - SM_LOG_PARTITIONS a compile-time constant (which can be overridden in 
01329  *   config/shore.def).
01330  *
01331  * The log is considered to have run out of space if logging requires that
01332  * more than smlevel_0::max_openlog partitions are needed.
01333  * Partitions are needed only as long as they contain log records 
01334  * needed for recovery, which means:
01335  * - log records for pages not yet made durable (min recovery lsn)
01336  * - log records for uncommitted transactions (min xct lsn)
01337  * - log records belonging to the last complete checkpoint
01338  *
01339  * Afer a checkpoint is taken and its log records are durable,
01340  * the storage manager tries to scavenge all partitions that do not
01341  * contain necessary log records.  The buffer manager provides the
01342  * min recovery lsn; the transaction manager provides the min xct lsn,
01343  * and the log manager keeps track of the location of the last 
01344  * completed checkpoint in its master_lsn.  Thus the minimum of the
01345  * 
01346  * \e file part of the minmum of these lsns indicates the lowest partition 
01347  * that cannot be scavenged; all the rest are removed.
01348  *
01349  * When the log is in danger of runing out of space 
01350  * (because there are long-running  transactions, for example) 
01351  * the server may be called via the
01352  * LOG_WARN_CALLBACK_FUNC argument to ss_m::ss_m.  This callback may
01353  * abort a transaction to free up log space, but the act of aborting
01354  * consumes log space. It may also archive a log file and remove it.
01355  * If the server provided a
01356  * LOG_ARCHIVED_CALLBACK_FUNC argument to ss_m::ss_m, this callback
01357  * can be used to retrieve archived log files when needed for
01358  * rollback.
01359  * \warning This functionality is not complete and has not been
01360  * well-tested.
01361  *
01362  * Log files (partitions) are written in fixed-sized blocks.  The log
01363  * manager pads writes, if necessary, to make them BLOCK_SIZE. 
01364  * - BLOCK_SIZE = 8192, a compile-time constant.
01365  *
01366  * A skip_log record indicates the logical end of a partition.
01367  * The log manager ensures that the last log record in a file 
01368  * is always a skip_log record. 
01369  *
01370  * Log files (partitions) are composed of segments. A segment is
01371  * an integral number of blocks.
01372  * - SEGMENT_SIZE = 128*BLOCK_SIZE, a compile-time constant.
01373  *
01374  * The smallest partition is one segment plus one block, 
01375  * but may be many segments plus one block. The last block enables
01376  * the log manager to write the skip_log record to indicate the
01377  * end of the file.
01378  *
01379  * The partition size is determined by the storage manager run-time option,
01380  * sm_logsize, which determines how much log can be open at any time,
01381  * i.e., the combined sizes of the PARTITION_COUNT partitions.
01382  *
01383  * The maximum size of a log record (logrec_t) is 3 storage manager pages.
01384  * A page happens to match the block size but the two compile-time
01385  * constants are not inter-dependent. 
01386  * A segment is substantially larger than a block, so it can hold at least
01387  * several maximum-sized log records, preferably many.
01388  * 
01389  * Inserting a log record consists of copying it into the log manager's
01390  * log buffer (1 segment in size).  The buffer wraps so long as there
01391  * is room in the partition.  Meanwhile, a log-flush daemon thread
01392  * writes out unflushed portions of the log buffer. 
01393  * The log daemon can lag behind insertions, so each insertion checks for
01394  * space in the log buffer before it performs the insert. If there isn't
01395  * enough space, it waits until the log flush daemon has made room.
01396  *
01397  * When insertion of a log record would wrap around the buffer and the
01398  * partition has no room for more segments, a new partition is opened,
01399  * and the entire newly-inserted log record will go into that new partition.
01400  * Meanwhile, the log-flush daemon will see that the rest of the log
01401  * buffer is written to the old partition, and the next time the
01402  * log flush daemon performs a flush, it will be flushing to the
01403  * new partition.
01404  *
01405  * The bookkeeping of the log buffer's free and used space is handled
01406  * by the notion of \e epochs.
01407  * An epoch keeps track of the start and end of the unflushed portion
01408  * of the segment (log buffer).  Thus, an epoch refers to only one
01409  * segment (logically, log buffer copy within a partition).
01410  * When an insertion fills the log buffer and causes it to wrap, a new
01411  * epoch is created for the portion of the log buffer representing
01412  * the new segment, and the old epoch keeps track of the portion of the 
01413  * log buffer representing the old segment.  The inserted log record
01414  * usually spans the two segements, as the segments are written contiguously
01415  * to the same log file (partition).
01416  *
01417  * When an insertion causes a wrap and there is no more room in the
01418  * partition to hold the new segment, a new 
01419  * epoch is created for the portion of the log buffer representing
01420  * the new segment, and the old epoch keeps track of the portion of the 
01421  * log buffer representing the old segment, as before.  
01422  * Now, however, the inserted log record is inserted, in its entirety,
01423  * in the new segment.  Thus, no log record spans partitions.
01424  *
01425  * Meanwhile, the log-flush buffer knows about the possible existence of
01426  * two epochs.  When an old epoch is valid, it flushes that epoch.
01427  * When a new epoch is also valid, it flushes that new one as well.
01428  * If the two epochs have the same target partition, the two flushes are
01429  * done with a single write.
01430  *
01431  * The act of flushing an epoch to a partition consists in a single
01432  * write of a size that is an even multiple of BLOCK_SIZE.  The
01433  * flush appends a skip_log record, and zeroes as needed, to round out the
01434  * size of the write.  Writes re-write portions of the log already
01435  * written, in order to overwrite the skip_log record at the tail of the
01436  * log (and put a new one at the new tail).
01437  *
01438  *
01439  *\subsection RECOV Recovery
01440  * The storage manager performs ARIES-style logging and recovery.
01441  * This means the logging and recovery system has these characteristics:
01442  * - uses write-ahead logging (WAL)
01443  * - repeats history on restart before doing any rollback 
01444  * - all updates are logged, including those performed during rollback
01445  * - compensation records are used in the log to bound the amount
01446  *   of logging done for rollback 
01447  *   and guarantee progress in the case of repeated 
01448  *   failures and restarts.
01449  *
01450  * Each time a storage manager (ss_m class) is constructed, the logs
01451  * are inspected, the last checkpoint is located, and its lsn is
01452  * remembered as the master_lsn, then recovery is performed.
01453  * Recovery consists of three phases: analysis, redo and undo.
01454  *
01455  *\subsubsection RECOVANAL Analysis
01456  * This pass analyzes the log starting at the master_lsn, and
01457  *   reading log records written thereafter.  Reading the log records for the
01458  *   last completed checkpoint, it reconstructs the transaction table, the
01459  *   buffer-pool's dirty page table, and mounts the devices and
01460  *   volumes that were mounted at the time of the checkpoint.
01461  *   From the dirty page table, it determines the \e redo_lsn, 
01462  *   the lowest recovery lsn of the dirty pages, which is 
01463  *   where the next phase of recovery must begin.
01464  *
01465  *\subsubsection RECOVREDO Redo
01466  * This pass starts reading the log at the redo_lsn, and, for each
01467  *   log record thereafter, decides whether that log record's 
01468  *   work needs to be redone.  The general protocol is:
01469  *   - if the log record is not redoable, it is ignored
01470  *   - if the log record is redoable and contains a page ID, the
01471  *   page is inspected and its lsn is compared to that of the log
01472  *   record. If the page lsn is later than the log record's sequence number,
01473  *   the page does not need to be updated per this log record, and the
01474  *   action is not redone.
01475  *
01476  *\subsubsection RECOVUNDO Undo
01477  *  After redo,  the state of the database matches that at the time 
01478  *  of the crash.  Now the storage manager rolls back the transactions that 
01479  *  remain active.  
01480  *  Care is taken to undo the log records in reverse chronological order, 
01481  *  rather than allowing several transactions to roll back 
01482  *  at their own paces.  This is necessary because some operations 
01483  *  use page-fixing for concurrency-control (pages are protected 
01484  *  only with latches if there is no page lock in
01485  *  the lock hierarchy -- this occurs when 
01486  *  logical logging and high-concurrency locking are used, 
01487  *  in the B-trees, for example.  A crash in the middle of 
01488  * a compensated action such as a page split must result in 
01489  * the split being undone before any other operations on the 
01490  * tree are undone.). 
01491  * \bug GNATS 49 (performance) There is no concurrent undo.
01492  *
01493  * After the storage manager has recovered, control returns from its
01494  * constructor method to the caller (the server).
01495  * There might be transactions left in prepared state.  
01496  * The server is now free to resolve these transactions by 
01497  * communicating with its coordinator. 
01498  *
01499  *\subsection LSNS Log Sequence Numbers
01500  *
01501  * Write-ahead logging requires a close interaction between the
01502  * log manager and the buffer manager: before a page can be flushed
01503  * from the buffer pool, the log might have to be flushed.
01504  *
01505  * This also requires a close interaction between the transaction
01506  * manager and the log manager.
01507  * 
01508  * All three managers understand a log sequence number (lsn_t).
01509  * Log sequence numbers serve to identify and locate log records
01510  * in the log, to timestamp pages, identify timestamp the last
01511  * update performed by a transaction, and the last log record written
01512  * by a transaction.  Since every update is logged, every update
01513  * can be identified by a log sequence number.  Each page bears
01514  * the log sequence number of the last update that affected that
01515  * page.
01516  *
01517  * A page cannot be written to disk until  the log record with that
01518  * page's lsn has been written to the log (and is on stable storage).
01519  * A log sequence number is a 64-bit structure,  with part identifying
01520  * a log partition (file) number and the rest identifying an offset within the file. 
01521  *
01522  * \subsection LOGPART Log Partitions
01523  *
01524  * The log is partitioned to simplify archiving to tape (not implemented)
01525  * The log comprises 8 partitions, where each partition's
01526  * size is limited to approximately 1/8 the maximum log size given
01527  * in the run-time configuration option sm_logsize.
01528  * A partition resides in a file named \e log.<n>, where \e n
01529  * is the partition number.
01530  * The configuration option sm_logdir names a directory 
01531  * (which must exist before the storage manager is started) 
01532  * in which the storage manager may create and destroy log files.
01533  *
01534  *  The storage manger may have at most 8 active partitions at any one time.  
01535  *  An active partition is one that is needed because it 
01536  *  contains log records for running transactions.  Such partitions 
01537  *  could (if it were supported) be streamed to tape and their disk 
01538  *  space reclaimed.  Space is reclaimed when the oldest transaction 
01539  *  ends and the new oldest transaction's first log record is 
01540  *  in a newer partition than that in which the old oldest 
01541  *  transaction's first log record resided.  
01542  *  Until tape archiving is implemented, the storage 
01543  *  manager issues an error (eOUTOFLOGSPACE) 
01544  *  if it consumes sufficient log space to be unable to 
01545  *  abort running transactions and perform all resulting necessary logging 
01546  *  within the 8 partitions available. 
01547  * \note Determining the point at which there is insufficient space to
01548  * abort all running transactions is a heuristic matter and it
01549  * is not reliable.  The transaction "reserves" log space for rollback, meaning
01550  * that no other transaction can consume that space until the transaction ends.'
01551  * A transaction has to reserve significantly more space to roll back than it
01552  * needs for forward processing B-tree deletions; this is because the log overhead
01553  * for the insertions is considerably larger than that for deletion.
01554  * The (compile-time) page size is also a factor in this heuristic.
01555  *
01556  * Log records are buffered by the log manager until forced to stable 
01557  * storage to reduce I/O costs.  
01558  * The log manager keeps a buffer of a size that is determined by 
01559  * a run-time configuration option.  
01560  * The buffer is flushed to stable storage when necessary.  
01561  * The last log in the buffer is always a skip log record, 
01562  * which indicates the end of the log partition.
01563  *
01564  * Ultimately, archiving to tape is necessary.  The storage manager
01565  * does not perform write-aside or any other work in support of
01566  * long-running transactions.
01567  *
01568  * The checkpoint manager chkpt_m sleeps until kicked into action
01569  * by the log manager, and when it is kicked, it takes a checkpoint, 
01570  * then sleeps again.  Taking a checkpoint amounts to these steps:
01571  * - Write a chkpt_begin log record.
01572  * - Write a series of log records recording the mounted devices and volumes..
01573  * - Write a series of log records recording the mounted devices.
01574  * - Write a series of log records recording the buffer pool's dirty pages.
01575  *    For each dirty page in the buffer pool, the page id and its recovery lsn 
01576  *    is logged.  
01577  *    \anchor RECLSN
01578  *    A page's  recovery lsn is metadata stored in the buffer 
01579  *    manager's control block, but is not written on the page. 
01580  *    It represents an lsn prior to or equal to the log's current lsn at 
01581  *    the time the page was first marked dirty.  Hence, it
01582  *    is less than or equal to the LSN of the log record for the first
01583  *    update to that page after the page was read into the buffer
01584  *    pool (and remained there until this checkpoint).  The minimum
01585  *    of all the  recovery lsn written in this checkpoint 
01586  *    will be a starting point for crash-recovery, if this is 
01587  *    the last checkpoint completed before a crash.
01588  * - Write a series of log records recording the states of the known 
01589  *    transactions, including the prepared transactions.  
01590  * - Write a chkpt_end log record.
01591  * - Tell the log manage where this checkpoint is: the lsn of the chkpt_begin
01592  *   log record becomes the new master_lsn of the log. The master_lsn is
01593  *   written in a special place in the log so that it can always be 
01594  *   discovered on restart.
01595  *
01596  *   These checkpoint log records may interleave with other log records, making
01597  *   the checkpoint "fuzzy"; this way the world doesn't have to grind to
01598  *   a halt while a checkpoint is taken, but there are a few operations that
01599  *   must be serialized with all or portions of a checkpoint. Those operations
01600  *   use mutex locks to synchronize.  Synchronization of operations is
01601  *   as follows:
01602  *   - Checkpoints cannot happen simultaneously - they are serialized with
01603  *   respect to each other.
01604  *   - A checkpoint and the following are serialized:
01605  *      - mount or dismount a volume
01606  *      - prepare a transaction
01607  *      - commit or abort a transaction (a certain portion of this must
01608  *        wait until a checkpoint is not happening)
01609  *      - heriocs to cope with shortage of log space
01610  *   - The portion of a checkpoint that logs the transaction table is
01611  *     serialized with the following:
01612  *      - operations that can run only with one thread attached to
01613  *        a transaction (including the code that enforces this)
01614  *      - transaction begin, end
01615  *      - determining the number of active transactions
01616  *      - constructing a virtual table from the transaction table
01617  *
01618  * \section BF_M Buffer Manager
01619  * The buffer manager is the means by which all other modules (except
01620  * the log manager) read and write pages.  
01621  * A page is read by calling bf_m::fix.
01622  * If the page requested cannot be found in the buffer pool, 
01623  * the requesting thread reads the page and blocks waiting for the 
01624  * read to complete.
01625  *
01626  * All frames in the buffer pool are the same size, and 
01627  * they cannot be coalesced, 
01628  * so the buffer manager manages a set of pages of fixed size.
01629  *
01630  * \subsection BFHASHTAB Hash Table
01631  * The buffer manager maintains a hash table mapping page IDs to
01632  * buffer control blocks.  A control block points to its frame, and
01633  * from a frame one can arithmetically locate its control block (in
01634  * bf_m::get_cb(const page_s *)).
01635  * The hash table for the buffer pool uses cuckoo hashing 
01636  * (see \ref P1) with multiple hash functions and multiple slots per bucket.  
01637  * These are compile-time constants and can be modified (bf_htab.h).
01638  *
01639  * Cuckoo hashing is subject to cycles, in which making room on one 
01640  * table bucket A would require moving something else into A.
01641  * Using at least two slots per bucket reduces the chance of a cycle.
01642  *
01643  * The implementation contains a limit on the number of times it looks for
01644  * an empty slot or moves that it has to perform to make room.  It does
01645  * If cycles are present, the limit will be hit, but hitting the limit
01646  * does not necessarily indicate a cycle.  If the limit is hit,
01647  * the insert will fail.
01648  * The "normal" solution in this case is to rebuild the table with
01649  * different hash functions. The storage manager does not handle this case.
01650  * \bug  GNATS 47 
01651  * In event of insertion failure, the hash table will have to be rebuilt with
01652  * different hash functions, or will have to be modified in some way.
01653  *
01654  * \bug GNATS 35 The buffer manager hash table implementation contains a race.
01655  * While a thread performs a hash-table
01656  * lookup, an item could move from one bucket to another (but not
01657  * from one slot to another within a bucket).
01658  * The implementation contains a temporary work-around for
01659  * this, until the problem is more gracefully fixed: if lookup fails to
01660  * find the target of the lookup, it performs an expensive lookup and
01661  * the statistics record these as bf_harsh_lookups. This is expensive.
01662  *
01663  * \subsection REPLACEMENT Page Replacement
01664  * When a page is fixed, the buffer manager looks for a free buffer-pool frame,
01665  * and if one is not available, it has to choose a victim to replace. 
01666  * It uses a clock-based algorithm to determine where in the buffer pool
01667  * to start looking for an unlatched frame:
01668  * On the first pass of the buffer pool it considers only clean frames. 
01669  * On the second pass it will consider dirty pages,
01670  * and on the third or subsequent pass it will consider any frame.
01671  *
01672  * The buffer manager forks background threads to flush dirty pages. 
01673  * The buffer manager makes an attempt to avoid hot pages and to minimize 
01674  * the cost of I/O by sorting and coalescing requests for contiguous pages. 
01675  * Statistics kept by the buffer manager tell the number of resulting write 
01676  * requests of each size.
01677  *
01678  * There is one bf_cleaner_t thread for each volume, and it flushes pages for that
01679  * volume; this is done so that it can combine contiguous pages into
01680  * single write requests to minimize I/O.  Each bf_cleaner_t is a master thread with
01681  * multiple page-writer slave threads.  The number of slave threads per master
01682  * thread is controlled by a run-time option.
01683  * The master thread can be disabled (thereby disabling all background
01684  * flushing of dirty pages) with a run-time option. 
01685  *
01686  * The buffer manager writes dirty pages even if the transaction
01687  * that dirtied the page is still active (steal policy). Pages
01688  * stay in the buffer pool as long as they are needed, except when
01689  * chosen as a victim for replacement (no force policy).
01690  *
01691  * The replacement algorithm is clock-based (it sweeps the buffer
01692  * pool, noting and clearing reference counts). This is a cheap
01693  * way to achieve something close to LRU; it avoids much of the
01694  * overhead and mutex bottlenecks associated with LRU.
01695  *
01696  * The buffer manager maintains a hash table that maps page IDs to buffer 
01697  * frame  control blocks (bfcb_t), which in turn point to frames
01698  * in the buffer pool.  The bfcb_t keeps track of the page in the frame, 
01699  * the page ID of the previously-held page, 
01700  * and whether it is in transit, the dirty/clean state of the page, 
01701  * the number of page fixes (pins) held on the page (i.e., reference counts), 
01702  * the \ref RECLSN "recovery lsn" of the page, etc.  
01703  * The control block also contains a latch.  A page, when fixed,
01704  * is always fixed in a latch mode, either LATCH_SH or LATCH_EX.
01705  * \bug GNATS 40 bf_m::upgrade_latch() drops the latch and re-acquires in
01706  * the new mode, if it cannot perform the upgrade without blocking. 
01707  * This is an issue inherited from the original SHORE storage manager.
01708  * To block in this case
01709  * would enable a deadlock in which two threads hold the latch in SH mode
01710  * and both want to upgrade to EX mode.  When this happens, the statistics 
01711  * counter \c bf_upgrade_latch_race is incremented.
01712  *
01713  * Page fixes are expensive (in CPU time, even if the page is resident).
01714  *
01715  * Each page type defines a set of fix methods that are virtual in 
01716  * the base class for all pages: The rest of the storage manager 
01717  * interacts with the buffer manager primarily through these methods 
01718  * of the page classes.  
01719  * The macros MAKEPAGECODE are used for each page subtype; they 
01720  * define all the fix methods on the page in such a way that bf_m::fix() 
01721  * is properly called in each case. 
01722  *
01723  * A page frame may be latched for a page without the page being 
01724  * read from disk; this
01725  * is done when a page is about to be formatted. 
01726  *
01727  * The buffer manager is responsible for maintaining WAL; this means it may not
01728  * flush to disk dirty pages whose log records have not reached stable storage yet.
01729  * Temporary pages (see sm_store_property_t) do not get logged, so they do not
01730  * have page lsns to assist in determining their clean/dirty status, and since pages
01731  * may change from temporary (unlogged) to logged, they require special handling, described
01732  * below.
01733  *
01734  * When a page is unfixed, sometimes it has been updated and must be marked dirty.
01735  * The protocol used in the storage manager is as follows:
01736  *
01737  * - Fixing with latch mode EX signals intent to dirty the page. If the page
01738  *   is not already dirty, the buffer control block for the page is given a
01739  *   recovery lsn of the page's lsn. This means that any dirtying of the page
01740  *   will be done with a log record whose lsn is larger than this recovery lsn.
01741  *   Fixing with EX mode of an already-dirty page does not change 
01742  *   the recovery lsn  for the page.
01743  *
01744  * - Clean pages have a recovery lsn of lsn_t::null.
01745  *
01746  * - A thread updates a page in the buffer pool only when it has the
01747  *   page EX-fixed(latched).
01748  *
01749  * - After the update to the page, the thread writes a log record to 
01750  *   record the update.  The log functions (generated by Perl) 
01751  *   determine if a log record should be written (not if a tmp 
01752  *   page, or if logging turned off, for example),
01753  *   and if not, they call page.set_dirty() so that any subsequent
01754  *   unfix notices that the page is dirty.
01755  *   If the log record is written, the modified page is unfixed with
01756  *   unfix_dirty() (in xct_impl::give_logbuf).
01757  *
01758  * - Before unfixing a page, if it was written, it must be marked dirty first
01759  *   with 
01760  *   - set_dirty followed by unfix, or
01761  *   - unfix_dirty (which is set_dirty + unfix).
01762  *
01763  * - Before unfixing a page, if it was NOT written, unfix it with bf_m::unfix
01764  *   so its recovery lsn gets cleared.  This happens only if this is the
01765  *   last thread to unfix the page.  The page could have multiple fixers 
01766  *   (latch holders) only if it were fixed in SH mode.  If fixed (latched)
01767  *   in EX mode,  this will be the only thread to hold the latch and the
01768  *   unfix will clear the recovery lsn.
01769  *
01770  *  It is possible that a page is fixed in EX mode, marked dirty but never
01771  *  updated after all,  then unfixed.  The buffer manager attempts to recognize
01772  *  this situation and clean the control block "dirty" bit and recovery lsn.
01773  *
01774  * Things get a little complicated where the buffer-manager's 
01775  * page-writer threads are
01776  * concerned.  The  page-writer threads acquire a share latches and copy
01777  * dirty pages; this being faster than holding the latch for the duration of the
01778  * write to disk
01779  * When the write is finished,  the page-writer re-latches the page with the
01780  * intention of marking it clean if no intervening updates have occurred. This
01781  * means changing the \e dirty bit and updating the recovery lsn in the buffer 
01782  * control block. The difficulty lies in determining if the page is indeed clean,
01783  * that is, matches the latest durable copy.
01784  * In the absence of unlogged (t_temporary) pages, this would not be terribly
01785  * difficult but would still have to cope with the case that the page was
01786  * (updated and) written by another thread between the copy and the re-fix.
01787  * It might have been cleaned, or that other thread might be operating in
01788  * lock-step with this thread.
01789  * The conservative handling would be not to change the recovery lsn in the
01790  * control block if the page's lsn is changed, however this has 
01791  * serious consequences
01792  * for hot pages: their recovery lsns might never be moved toward the tail of
01793  * the log (the recovery lsns remain artificially low) and 
01794  * thus the hot pages can prevent scavenging of log partitions. If log
01795  * partitions cannot be scavenged, the server runs out of log space.
01796  * For this reason, the buffer manager goes to some lengths to update the
01797  * recovery lsn if at all possible.
01798  * To further complicate matters, the page could have changed stores, 
01799  * and thus its page type or store (logging) property could differ.
01800  * The details of this problem are handled in a function called determine_rec_lsn().
01801  *
01802  * \subsection PAGEWRITERMUTEX Page Writer Mutexes
01803  *
01804  * The buffer manager keeps a set of \e N mutexes to sychronizing the various
01805  * threads that can write pages to disk.  Each of these mutexes covers a
01806  * run of pages of size smlevel_0::max_many_pages. N is substantially smaller
01807  * than the number of "runs" in the buffer pool (size of 
01808  * the buffer pool/max_many_pages), so each of the N mutexes actually covers
01809  * several runs:  
01810  * \code
01811  * page-writer-mutex = page / max_many_pages % N
01812  * \endcode
01813  *
01814  * \subsection BFSCAN Foreground Page Writes and Discarding Pages
01815  * Pages can be written to disk by "foreground" threads under several
01816  * circumstances.
01817  * All foreground page-writing goes through the method bf_m::_scan.
01818  * This is called for:
01819  * - discarding all pages from the buffer pool (bf_m::_discard_all)
01820  * - discarding all pages belonging to a given store from the buffer pool 
01821  *   (bf_m::_discard_store), e.g., when a store is destroyed.
01822  * - discarding all pages belonging to a given volume from the buffer pool 
01823  *   (bf_m::_discard_volume), e.g., when a volume is destroyed.
01824  * - forcing all pages to disk (bf_m::_force_all) with or without invalidating
01825  *   their frames, e.g., during clean shutdown.
01826  * - forcing all pages of a store to disk (bf_m::_force_store) with 
01827  *   or without invalidating
01828  *   their frames, e.g., when changing a store's property from unlogged to
01829  *   logged.
01830  * - forcing all pages of a volume to disk (bf_m::_force_store) with 
01831  *   without invalidating the frames, e.g., when dismounting a volume.
01832  * - forcing all pages whose recovery lsn is less than or equal to a given
01833  *   lsn_t, e.g.,  for a clean shutdown, after restart.
01834  */
01835 /**\page Logging 
01836  *
01837  * See \ref LOG_M.
01838  * */
01839 
01840 /**\page DEBUGAID Debugging Aids
01841   *\section SSMDEBUGAPI Storage Manager Methods for Debugging
01842  *
01843  * The storage manager contains a few methods that are useful for
01844  * debugging purposes. Some of these should be used for not other
01845  * purpose, as they are not thread-safe, or might be very expensive.
01846  * See \ref SSMAPIDEBUG.
01847  * 
01848  *\section SSMDEBUG Build-time Debugging Options
01849  *
01850  * At configure time, you can control which debugger-related options
01851  * (symbols, inlining, etc) with the debug-level options. See \ref CONFIGOPT.
01852  * \section SSMTRACE Tracing (--enable-trace)
01853  * When this build option is used, additional code is included in the build to
01854  * enable some limited tracing.  These C Preprocessor macros apply:
01855  * -W_TRACE
01856  *  --enable-trace defines this.
01857  * -FUNC
01858  *  Outputs the function name when the function is entered.
01859  * -DBG 
01860  *  Outputs the arguments.
01861  * -DBGTHRD 
01862  *  Outputs the arguments.
01863  *
01864  *  The tracing is controlled by these environment variables:
01865  *  -DEBUG_FLAGS: a list of file names to trace, e.g. "smfile.cpp log.cpp"
01866  *  -DEBUG_FILE: name of destination for the output. If not defined, the output
01867  *    is sent to cerr/stderr.
01868  *
01869  * See \ref CONFIGOPT.
01870  *  \note This tracing is not thread-safe, as it uses streams output.
01871  * \section SSMENABLERC Return Code Checking (--enable-checkrc)
01872  * If a w_rc_t is set but not checked with method is_error(), upon destruction the
01873  * w_rc_t will print a message to the effect "error not checked".
01874  * See \ref CONFIGOPT.
01875  *
01876  */

Generated on Mon Jan 2 15:13:57 2012 for Shore Storage Manager by  doxygen 1.4.7