// -*- mode:c++; c-basic-offset:4 -*-
/*<std-header orig-src='shore' incl-file-exclusion='INTERNAL_H'>

 $Id: internal.h,v 1.9 2010/08/23 23:01:06 nhall Exp $

SHORE -- Scalable Heterogeneous Object REpository

Copyright (c) 1994-99 Computer Sciences Department, University of
                      Wisconsin -- Madison
All Rights Reserved.

Permission to use, copy, modify and distribute this software and its
documentation is hereby granted, provided that both the copyright
notice and this permission notice appear in all copies of the
software, derivative works or modified versions, and any portions
thereof, and that both notices appear in supporting documentation.

THE AUTHORS AND THE COMPUTER SCIENCES DEPARTMENT OF THE UNIVERSITY
OF WISCONSIN - MADISON ALLOW FREE USE OF THIS SOFTWARE IN ITS
"AS IS" CONDITION, AND THEY DISCLAIM ANY LIABILITY OF ANY KIND
FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.

This software was developed with support by the Advanced Research
Project Agency, ARPA order number 018 (formerly 8230), monitored by
the U.S. Army Research Laboratory under contract DAAB07-91-C-Q518.
Further funding for this work was provided by DARPA through
Rome Research Laboratory Contract No. F30602-97-2-0247.

*/

/*  -- do not edit anything above this line --   </std-header>*/

/* This file contains doxygen documentation only */

/**\page IMPLNOTES Implementation Notes
 *
 * \section MODULES Storage Manager Modules
 * The storage manager code contains the following modules (with related C++ classes):
 *
 * - \ref SSMAPI (ss_m)
 *   Most of the programming interface to the storage manager is encapsulated
 *   in the ss_m class.
 * - \ref IO_M (io_m), \ref VOL_M (vol_m) and \ref DIR_M (dir_m)
 *   These managers handle volumes, page allocation and stores, which are the
 *   structures underlying files of records, B+-Tree indexes, and
 *   spatial indexes (R*-Trees).
 * - \ref FILE_M (file_m), \ref BTREE_M (btree_m), and \ref RTREE_M (rtree_m)
 *   handle the storage structures available to servers.
 * - \ref LOCK_M (lock_m)
 *   The lock manager is quasi-stand-alone.
 * - \ref XCT_M (xct_t) and \ref LOG_M (log_m) handle transactions,
 *   logging, and recovery.
 * - \ref BF_M (bf_m)
 *   The buffer manager works closely with \ref XCT_M and \ref LOG_M.
 *
 * \section IO_M I/O Manager
 * The I/O manager was, in the early days of SHORE, expected to
 * have more responsibility than it now has; now it is little more
 * than a wrapper for the \ref VOL_M.
 * For the purpose of this discussion,
 * the I/O manager and the volume manager are the same entity.
 * There is a single read-write lock associated
 * with the I/O-Volume manager to serialize access.
 * Read-only functions acquire the lock in read mode; updating
 * functions acquire the lock in write mode.
 * Much of the page- and extent-allocation code relies on the fact that
 * access to the manager is serialized, and this lock is a major source of
 * contention.
 *
 * \section VOL_M Volume Manager
 * The volume manager handles formatting of volumes,
 * allocation and deallocation of pages and extents in stores.
 * Within a page, allocation of space is up to the manager of the
 * storage structure (btree, rtree, or file).
 *
 * \subsection EXTENTSTORE Extents and Stores
 *
 * Files and indexes are types of \e stores. A store is a persistent
 * data structure to which pages are allocated and deallocated, but which
 * is independent of the purpose for which it is used (index or file).
 *
 * Pages are reserved and allocated for a store in units of ss_m::ext_sz
 * (enumeration value smlevel_0::ext_sz, found in sm_base.h),
 * a compile-time constant that indicates the size of an extent.
 *
 * An extent is a set of contiguous pages, represented
 * by a persistent data structure \ref extlink_t. Extents are
 * linked together to form the entire structure of a store.
 * The head of this list has a reference to it from a store node
 * (\ref stnode_t), described below.
 * Extents (extlink_t) are co-located on extent-map pages at
 * the beginning of the volume.
 *
 * Each extent has an owner,
 * which is the store id (\ref snum_t) of the store to which it belongs.
 * Free extents are not linked together;
 * they simply have no owner (signified by \ref extlink_t::owner == 0).
 *
 * An extent id is a number of type \ref extnum_t. It is arithmetically
 * determined from a page number, and the pages in an extent are
 * arithmetically derived from an extent number (see the sketch at the
 * end of this subsection).
 * The \ref extnum_t is used in acquiring locks on
 * extents, and it is used for locating the associated \ref extlink_t and the
 * extent-map page on which the \ref extlink_t resides.
 * Scanning the pages in a store can be accomplished by scanning the
 * list of \ref extlink_t.
 *
 * All of the allocation metadata for a page reside in its extent, which
 * contains a bitmap indicating which of its pages are allocated.
 * One cannot determine the allocation status of a page from the page
 * itself: the extent-map page must be inspected.
 *
 * Extents also have an (unlogged, advisory) indication of the amount of
 * unused space on each of their pages; this takes the form of a bucket
 * number that has meaning only to the file manager. This metadatum is used in
 * the volume manager's vol_t::next_page_with_space() method, which is
 * used only by the file manager. The file manager asks the I/O layer
 * (which then descends to the volume manager for this purpose) to find the
 * next page whose advisory bucket number is sufficiently large for the
 * file manager's record-allocation needs. Between the time this request
 * is made and the time the file manager latches and inspects the page,
 * the page might no longer have sufficient space. Nevertheless, this
 * advisory bucket number in the extlink_t reduces the number of page fixes
 * and improves the effective fill-factor for file pages.
 *
 * \subsubsection STORENODE Store Nodes
 * A \ref stnode_t holds metadata for a store, including a reference to
 * the first extent in the store.
 * A store \e always contains at least one allocated extent, even if
 * no pages in that extent are allocated.
 * Scanning the pages in a store can be accomplished by scanning the
 * list of \ref extlink_t.
 *
 * Store nodes are co-located on store-map pages at the beginning of a volume,
 * after the extent maps.
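 *
 * To make the preceding concrete, the following is a minimal sketch of the
 * two persistent structures and of the page-number/extent-number arithmetic.
 * It is illustrative only: the field names, the placeholder types, and the
 * helper functions are assumptions made for this sketch, not the storage
 * manager's actual declarations.
 * \code
 * #include <cstdint>
 *
 * typedef uint32_t snum_t_sk;    // store number (stands in for snum_t)
 * typedef uint32_t extnum_t_sk;  // extent number (stands in for extnum_t)
 * typedef uint32_t shpid_t_sk;   // page number within a volume
 *
 * const int ext_sz_sk = 8;       // stands in for smlevel_0::ext_sz
 *
 * struct extlink_t_sk {          // one entry on an extent-map page
 *     snum_t_sk   owner;         // 0 means the extent is free
 *     extnum_t_sk next;          // next extent in the owning store's list
 *     uint8_t     pmap;          // bitmap: which of the ext_sz pages are allocated
 *     uint8_t     pbucket;       // advisory space bucket, meaningful to file_m only
 * };
 *
 * struct stnode_t_sk {           // one entry on a store-map page
 *     extnum_t_sk head;          // first extent of the store; always present
 *     uint16_t    flags;         // other store metadata
 * };
 *
 * // The arithmetic mapping (ignoring any offset for the volume's header and
 * // map pages, which the real code accounts for):
 * extnum_t_sk extent_of(shpid_t_sk pg)        { return pg / ext_sz_sk; }
 * shpid_t_sk  first_page_of(extnum_t_sk ext)  { return ext * ext_sz_sk; }
 * \endcode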
 *
 * \subsection ALLOCEXTSTORE Allocation and Deallocation of Extents and Pages
 *
 * The volume manager handles allocating extents to stores.
 * Deallocation of extents is shared by the volume manager
 * and the lock manager
 * (see the discussion in lockid_t::set_ext_has_page_alloc).
 *
 * \subsubsection ALLOCEXT Allocation and Deallocation of Extents
 * Allocating an extent to a store consists in:
 * - locating a free extent: a search through the extent-map pages for an
 *   extent that is both unallocated (owner is zero) and not locked;
 * - acquiring an IX lock on the extent. These locks are explicitly acquired
 *   by the lock manager; extent locks are not in the lock hierarchy;
 * - linking the extent into the linked list for the store, usually at the end
 *   of the list. Locks are \e not acquired for the previous and next extents
 *   in the list; page latches protect these structures;
 * - physically logging the updates to the extent-map pages.
 *
 * Deallocating an extent from a store consists in:
 * - closely interacting with the lock manager (described below);
 * - identifying the previous- and next-extent numbers;
 * - identifying the pages containing the \ref extlink_t structures for the
 *   extent to be freed and for the previous- and next-extent structures,
 *   which may mean as many as three pages;
 * - sorting the page numbers and latching the extent-map pages in ascending
 *   order to avoid latch-latch deadlocks;
 * - ensuring that the previous- and next-extent numbers on the
 *   to-be-freed extent have not changed (so that we know that we have
 *   latched the right pages);
 * - updating the extents and physically logging each of the updates.
 *
 * Extents are freed only when a transaction has deleted the last page in
 * an extent; this involves acquiring an IX lock on the extent, but does not
 * preclude other transactions from allocating pages in the same extent.
 * Also, since the transaction might abort, the extent must not be re-used
 * for another store by another transaction. Furthermore, the page-deleting
 * transaction could re-use the pages. For these reasons, extents are left
 * in a store until the transaction commits.
 *
 * At commit time, the transaction cooperates with the lock manager to
 * identify all extents on which it has locks (the lock manager's job).
 * If the lock manager can upgrade the extent's lock to EX mode \e and
 * the extent still contains no allocated pages, the extent can
 * be deallocated (the transaction's job).
 * (An optimization avoids excessive page-latching here: the
 * extent lock contains a bit indicating whether the extent contains any
 * allocated pages.)
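 *
 * A hedged sketch of that commit-time pass follows. None of the names below
 * are the storage manager's actual interfaces; they merely illustrate the
 * division of labor between the lock manager and the committing transaction
 * described above.
 * \code
 * #include <vector>
 *
 * // Pseudo-interfaces assumed for this sketch only.
 * struct xct_sk; struct lock_m_sk; struct vol_m_sk;
 * typedef unsigned extnum_sk;
 * std::vector<extnum_sk> extents_locked_by(lock_m_sk&, xct_sk&);
 * bool extent_has_page_alloc(lock_m_sk&, xct_sk&, extnum_sk);
 * bool try_upgrade_to_EX(lock_m_sk&, xct_sk&, extnum_sk);
 * bool extent_is_empty(vol_m_sk&, extnum_sk);
 * void free_extent(vol_m_sk&, xct_sk&, extnum_sk);  // physically logged
 *
 * void free_empty_extents_at_commit(xct_sk& xct, lock_m_sk& lm, vol_m_sk& vol)
 * {
 *     // Lock manager's job: enumerate extent locks held by this transaction.
 *     for (extnum_sk ext : extents_locked_by(lm, xct)) {
 *         // Cheap check first: the extent lock carries a bit saying whether
 *         // the extent still has allocated pages; if so, skip it.
 *         if (extent_has_page_alloc(lm, xct, ext)) continue;
 *         // Try IX -> EX; failure means another transaction has an interest.
 *         if (!try_upgrade_to_EX(lm, xct, ext)) continue;
 *         // Transaction's job: re-verify on the extent-map page, then free.
 *         if (extent_is_empty(vol, ext))
 *             free_extent(vol, xct, ext);
 *     }
 * }
 * \endcode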
 *
 * This extent-based design has the following implications:
 * - before it can be used, a volume must be formatted for a given size so
 *   that the number of extent-map pages and store-map pages can be established;
 * - the location of an extlink_t can be determined arithmetically, which is
 *   cheaper than looking in an index of any sort;
 * - the number of page fixes required for finding free extents is bounded by
 *   the number of extent-map pages on the volume;
 * - extent-map pages tend to be hot (remain in the buffer pool), which
 *   minimizes I/O;
 * - extent-map pages tend to be a source of latch contention;
 * - pages may be reserved for allocation in a file without being allocated,
 *   so optimal use of the volume requires that the allocated extents be
 *   searched before new extents are allocated, which is both slow and a
 *   source of latch contention.
 *
 * The volume layer does not contain any means of spreading out or clustering
 * extents over extent-map pages to address the latch-contention issue.
 *
 * \subsubsection ALLOCPG Allocation and Deallocation of Pages Within a Store
 *
 * Allocating an extent to a store does not make its pages "visible" to the
 * server. They are considered "reserved".
 * Pages within the extent have to be allocated
 * (their bits in the extent's bitmap must be set).
 *
 * When the store is used for an index, the page is not
 * visible until it has been formatted
 * and inserted (linked) into the index.
 * In the case of files, however,
 * the situation is complicated because file pages are not linked to each
 * other by any other means. Pages used for large objects are referenced
 * through an index in the object's metadata, but pages used
 * for small objects become part of the
 * file as soon as the
 * page's extent bitmap indicates that the page is
 * allocated.
 * This has some significant ramifications:
 * - neither deallocation nor allocation of file pages requires latching of
 *   previous and next pages for linking purposes;
 * - the file manager and index managers handle page allocation somewhat
 *   differently;
 * - the file manager has to go to great lengths to ensure that the
 *   page is not accessible until it is both allocated and formatted, and
 *   to ensure the safety of
 *   the file structure in the event of error or crash.
 *
 * Despite the fact that the intended uses of the page require different
 * handling, a significant part of page allocation is
 * generic and is handled by the volume layer. To handle some of the
 * contextual differences, the volume layer uses a callback to the
 * calling manager.
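 *
 * A hedged sketch of such a callback follows. The interface and names are
 * hypothetical; they only illustrate the accept/reject hand-off from the
 * volume layer to the calling manager described above.
 * \code
 * // Placeholder for a latched, locked candidate page handed to the callback.
 * struct candidate_page_sk { };   // latched in EX mode, page lock already held
 *
 * class alloc_page_filter_sk {
 * public:
 *     // Called by the volume layer for each candidate page it has latched and
 *     // locked; the callee may format the page and says whether to keep it.
 *     virtual bool accept(candidate_page_sk& p) = 0;
 *     virtual ~alloc_page_filter_sk() { }
 * };
 *
 * // The file manager's filter formats small-record pages before accepting;
 * // index managers (and large-record allocation) would simply accept.
 * class file_m_filter_sk : public alloc_page_filter_sk {
 *     static void format_as_file_page(candidate_page_sk&) { }  // stand-in
 *     bool accept(candidate_page_sk& p) override {
 *         format_as_file_page(p);
 *         return true;
 *     }
 * };
 * \endcode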
 *
 * Allocating a page to a store consists in
 * (in fact, the code is
 * written to allocate a number of pages at once; this description is
 * simplified):
 * - locate a reserved page in the store:
 *   - if the page must be \e appended to the store, special precautions are
 *     needed to ensure that the reserved page is the next unallocated page in
 *     the last extent of the store;
 *   - if the page need not be appended, any reserved page will do;
 *   - if there are no reserved pages in the store, allocate an extent (now
 *     there are reserved pages); if the page must be appended, the
 *     new extent must be linked to the last extent in the store;
 * - acquire an IX lock on the extent in which we found reserved page(s);
 * - find a reserved page that has \e no \e lock on it;
 * - acquire a lock on the page (the mode, IX or EX, and the duration depend
 *   on the context):
 *   - the file manager allocating a small-record file page
 *     uses IX mode and long (commit) duration,
 *     which means that
 *     deallocated pages will not be re-allocated until the deallocating
 *     transaction commits;
 *   - the file manager allocating pages for large records uses long duration,
 *     and IX or EX mode, depending on the exact use of the page for
 *     (various) large-record structures; long-duration locks mean that
 *     deallocated pages will not be re-allocated until the deallocating
 *     transaction commits;
 *   - the btree index manager uses EX mode and instant duration, meaning that
 *     deallocated pages can be re-used;
 *   - the rtree index manager uses EX mode and instant duration, meaning that
 *     deallocated pages can be re-used;
 * - call back to the (file or index) manager to accept or reject this page:
 *   - the file manager allocating a small-record file page
 *     formats the page and returns "accept";
 *   - the file manager allocating a large-record page just returns "accept";
 *   - the rtree and btree index managers just return "accept";
 * - log the page allocation and set the "has-page-allocated" indicator in the
 *   extent lock.
 *
 * As mentioned above, there are times when the volume manager is told to
 * allocate new pages at the end of the store (append). This happens
 * when the file manager allocates small-object file pages, unless the
 * caller passes in the policy t_compact, indicating that it should search
 * the file for available pages.
 * The server can choose its policy when calling \ref ss_m::create_rec
 * (see \ref pg_policy_t).
 * When the server uses \ref append_file_i, only the policy t_append
 * is used, which enforces append-only page allocation.
 *
 * The volume manager does not contain any persistent indices to
 * assist in finding free pages in a store's allocated extents (which it
 * can do only when not forced to append to the store).
 * To minimize the need for linear searches through the store's extents,
 * the volume manager keeps a cache of {store, extent} pairs, used to find
 * extents already allocated to a store that contain free pages. This
 * cache is consulted before new extents are allocated to a store.
 * Since after restart the cache is necessarily empty, it is primed when
 * first needed for the purpose of allocating anything for the store.
 *
 * Priming the cache is an expensive operation.
 * It is not done on each volume mount, because volumes are mounted and
 * dismounted several times during recovery, and priming on each
 * mount would be prohibitive.
 * Every attempt to allocate a page checks the store's extent
 * cache; if the cache is empty, it is primed.
 *
 * If the cache does not yield a reserved page in an allocated extent, the
 * storage manager will search the file (a linear search) for such pages.
 * (This can be disabled by changing the value of the constant \e never_search
 * in sm_io.cpp.)
 *
 * Deallocating a page in a store consists in:
 * - acquire a long-duration EX lock on the page;
 * - verify the store membership of the page, if the file manager requires it
 *   because it was forced to unlatch and re-latch the page;
 * - acquire a long-duration IX lock on the page's extent;
 * - latch the extent-map page, update the extent's bitmap, and log the update.
 *
 * \subsubsection STONUMS Predefined Stores
 *
 * A volume contains these pre-defined structures:
 * - Header: page 0 (the first page) of the volume; contains:
 *   - a format version number
 *   - the long volume id
 *   - extent size
 *   - number of extents
 *   - number of extents used for store 0 (see below)
 *   - number of pages used for store 0 (see below)
 *   - the first page of the extent map
 *   - the first page of the store map
 *   - page size
 * - store #0 : a "pseudo-store" containing the extent-map and store-map pages.
 *   This starts with page 1 (the second page) of the volume.
 * - store #1 : directory of the stores (used by the storage manager): this is
 *   a btree index mapping store number to metadata about the store,
 *   including (but not limited to) the store's use
 *   (btree/rtree/file-small-object-pages/file-large-object-pages),
 *   and, in the case of indexes, the root page of the index,
 *   and, in the case of files, the store number of the associated
 *   large-object-page store. These metadata are encapsulated in an sinfo_s
 *   structure (see sdesc.h) and are manipulated by the directory manager
 *   (dir_m, in dir.cpp).
 * - store #2 : root index (for use by the server)
 *
 * \subsection PAGES Page Types
 * Pages in a volume come in a variety of page types, all of the same size.
 * The size of a page is a compile-time constant. It is controlled by
 * a build-time configuration option (see
 * \ref CONFIGOPT). The default page size is 8192 bytes.
 *
 * All pages are \e slotted (those that don't need the slot structure
 * may use only one slot) and have the following layout (a hedged sketch
 * appears below):
 * - header, including:
 *   - lsn_t: log sequence number of the last page update
 *   - page id
 *   - links to next and previous pages (used by some storage structures)
 *   - page tag (indicates the type of page)
 *   - space-management metadata (space_t)
 *   - store flags (logging-level metadata)
 * - slots (grow down)
 * - slot-table array of pointers to the slots (grows up)
 * - footer (copy of the log sequence number of the last page update)
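 *
 * The following sketch summarizes that layout. It is illustrative only: the
 * placeholder types and field names stand in for the real declarations
 * (page_p, lsn_t, space_t, the page id type), and the sizes shown are
 * assumptions.
 * \code
 * #include <cstdint>
 *
 * typedef uint64_t lsn_sk;                        // stands in for lsn_t
 * struct lpid_sk { uint32_t vol, store, page; };  // stands in for a page id
 * struct space_sk { uint16_t nfree, nrsvd; };     // stands in for space_t
 *
 * enum { page_sz_sk = 8192, data_sz_sk = page_sz_sk - 64 };  // sizes assumed
 *
 * struct page_layout_sk {
 *     lsn_sk   lsn1;               // header: LSN of the last page update
 *     lpid_sk  pid;                // page id
 *     uint32_t next, prev;         // links used by some storage structures
 *     uint16_t tag;                // page type tag
 *     uint16_t store_flags;        // logging-level metadata
 *     space_sk space;              // space-management metadata
 *     char     data[data_sz_sk];   // slots grow down; slot table grows up
 *     lsn_sk   lsn2;               // footer: copy of lsn1
 * };
 * \endcode
 *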
 * Each page type is a C++ class that derives from the base class
 * page_p. Page_p implements functionality that is common to all
 * (or most) page types. The types are as follows:
 *
 * - extlink_p : extent-link pages, used by vol_m
 * - stnode_p : store-node pages, used by vol_m
 * - file_p : slotted pages of files of records, used by file_m
 * - lgdata_p : data pages of large records, used by file_m
 * - lgindex_p : index pages for large records, used by file_m
 * - keyed_p : slotted pages of indexes, used by btree_m
 * - zkeyed_p : slotted pages of indexes, used by btree_m
 * - rtree_p : slotted pages of spatial indexes, used by rtree_m
 *
 * Issues specific to the page types will be dealt with in the descriptions of
 * the modules that use them.
 *
 * \subsection OBJIDS Object Identifiers, Object Location, and Locks
 *
 * There is a close interaction among the various object identifiers,
 * the data structures in which the objects reside, and the locks acquired
 * on the objects.
 *
 * Simply put:
 * - a volume identifier (ID) consists of an integral number, e.g., 1,
 *   represented in an output stream as v(1).
 * - a store identifier consists of a volume ID and a store number, e.g., 3,
 *   represented as s(1.3).
 * - an index ID and a file ID are merely store IDs.
 * - a page ID contains a store ID and a page number, e.g., 48, represented
 *   as p(1.3.48).
 * - a record ID for a record in a file contains a page ID and a slot number,
 *   e.g., 2, represented as r(1.3.48.2).
 *
 * Clearly, from a record ID, its page and slot can be derived without
 * consulting any indexes. It is also clear that records cannot move, which
 * has ramifications for \ref RSVD_MODE, below.
 *
 * The \ref LOCK_M understands these identifiers, and can generate locks
 * from them.
 *
 * \subsection RSVD_MODE Space Reservation on a Page
 *
 * Different storage structures offer different opportunities for fine-grained
 * locking and need different means of allocating space within a page.
 * Special care is taken to reserve space on a page when slots
 * are freed (records are deleted) so that rollback can restore
 * the space on the page.
 * Page types that use this space reservation have
 * \code page_p::rsvd_mode() == true \endcode.
 *
 * In the case of B-trees, space reservation is not used because
 * undo and redo are handled logically -- entries
 * can be re-inserted on a different page. But in the case of files,
 * records are identified by physical ID, which includes page and slot number,
 * so records must be reinserted just where they first appeared.
 *
 * Holes in a page are coalesced (moved to the end of the page) as needed,
 * when the total free space on the page satisfies a need but the
 * contiguous free space does not. Hence, a record truncation followed
 * by an append to the same record does not necessarily cause the
 * shifting of other records on the same page.
 *
 * A count of free bytes is maintained for all pages. More space-allocation
 * metadata is maintained for rsvd_mode() pages:
 * - When a transaction releases a slot on a page with rsvd_mode(), the slot
 *   remains "reserved" for use by the same transaction.
 * - That slot is not free to be allocated by another transaction until
 *   the releasing transaction commits.
 *   This is because if the transaction aborts, the slot must
 *   be restored with the same slot number.
 *   Not only must the slot number be preserved,
 *   but the number of bytes consumed by that slot must remain
 *   available in case the transaction aborts.
 * - The storage manager keeps track of the youngest active transaction
 *   that is freeing space (i.e., "reserving" it) on the page
 *   and the number of bytes freed ("reserved")
 *   by that youngest transaction.
 * - When the youngest transaction to reserve space on the page becomes
 *   older than the oldest active transaction in the system, the reserved
 *   space becomes free. This check for freeing up the reserved space happens
 *   whenever a transaction tries to allocate space on the page.
 * - During rollback, a transaction can use \e any amount of
 *   reserved space, but during forward processing, it can only use space
 *   it reserved, and that is known only if the transaction in question is
 *   the youngest reserving transaction described above.
 * - The changes to space-reservation metadata (space_t) are not logged.
 *   The actions that result in updates to this metadata are logged (as
 *   page mark and page reclaim).
 *
 * \section FILE_M File Manager
 * A file is a group of variable-sized records.
 * A record is the smallest persistent datum that has identity.
 * A record may also have a "user header", whose contents are
 * for use by the server.
 * As records vary in size, so does their storage representation,
 * and the storage manager changes the storage representation as needed.
 * A file comprises two stores.
 * One store is allocated for slotted (small-record) pages, called file_p
 * pages.
 * One store is allocated for large records, and contains lgdata_p and
 * lgindex_p pages.
 * Small records are those whose size is less than or equal to
 * sm_config_info_t.max_small_rec. A record larger than this
 * has a slot on a small-record page, and that slot contains metadata
 * referring to pages in the large-record store (a sketch of this choice
 * of representation appears below).
 * The scan order of a file is the physical order of the records
 * in the small-record store.
 *
 * Every record, large or small, has the following metadata in the
 * record's slot on the file_p page; these data are held in a rectag_t
 * structure:
 * \code
 struct rectag_t {
     uint2_t   hdr_len;  // length of user header, may be zero
     uint2_t   flags;    // enum recflags_t: indicates internal implementation
     smsize_t  body_len; // true length of the record
 };
 \endcode
 * The flags have these values:
 * - t_small : a small record, entirely contained on the file_p
 * - t_large_0 : a large record; the slot on the file_p contains the
 *   user header, while the body is a list
 *   of chunks (pointers to contiguous lgdata_p pages)
 * - t_large_1 : a large record; the slot on the file_p contains the
 *   user header, while the body is a reference to a single
 *   lgindex_p page, which is the root of a 1-level index of
 *   lgdata_p pages.
 * - t_large_2 : like t_large_1, but the index may be two levels deep. This
 *   has not been implemented.
 *
 * Internally (inside the storage manager), the class record_t is a
 * handle on the record's tag and is the class through which the
 * rectag_t is manipulated.
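 *
 * The following is a hedged sketch of how a record's implementation could be
 * chosen from its size. Only max_small_rec corresponds to a documented
 * quantity (sm_config_info_t.max_small_rec); the chunk-capacity threshold and
 * all names here are assumptions made for the sketch.
 * \code
 * #include <cstddef>
 *
 * enum recflags_sk { t_small_sk, t_large_0_sk, t_large_1_sk };
 *
 * recflags_sk choose_rec_impl(std::size_t body_len,
 *                             std::size_t max_small_rec,    // from sm_config_info_t
 *                             std::size_t max_chunk_bytes)  // chunk-list capacity (assumed)
 * {
 *     if (body_len <= max_small_rec)
 *         return t_small_sk;    // body lives entirely in the file_p slot
 *     if (body_len <= max_chunk_bytes)
 *         return t_large_0_sk;  // slot holds a list of chunks of lgdata_p pages
 *     return t_large_1_sk;      // slot refers to a one-level lgindex_p index
 * }
 * \endcode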
 *
 * A record is exposed to the server through a set of ss_m methods
 * (\ref ss_m::create_rec, \ref ss_m::append_rec, etc.), and through the
 * \ref pin_i class.
 *
 * \attention
 * All updates to records are accomplished by copying out part or all of
 * the record from the buffer pool to the server's address space, performing
 * the updates there, and handing the new data to the storage manager.
 * User (server) data are not updated directly in the buffer pool.
 *
 * The server may cause the file_p and at most one large data page to
 * be pinned for a given record through the pin_i class; the server must
 * take care not to create latch-latch deadlocks by holding a record pinned
 * while attempting to pin another. An ordering protocol among the pages
 * pinned must be observed to avoid such deadlocks.
 *
 * \note The system only detects lock-lock deadlocks. Deadlocks involving
 * mutexes or latches or other blocking mechanisms will cause the server to
 * hang.
 *
 * \subsection HISTOFIND Allocating Space for a Record
 *
 * When a record is created, the file manager tries to use an already-allocated
 * page that has space for the record. It determines what space is needed
 * for the record from the length hint and
 * the data given in the \ref ss_m::create_rec call.
 * The file manager caches information
 * about page utilization for pages in each store.
 * The page-utilization data for a store take the form of a
 * histoid_t, which contains a heap and a histogram.
 * The heap keeps track of the amount of
 * free space in (recently-used) pages of the store, and it is
 * searchable so that it can
 * return the page with the smallest free space that is larger than a
 * given value.
 * The free-space-on-page value that it uses for this purpose
 * is the most liberal value -- it is possible that some of the space on
 * the page is reserved for a transaction that has not yet committed
 * (if that transaction destroyed a record, it can use space that other
 * transactions cannot).
 * \bug GNATS 157 The histoid_t heap should have some size limit (number of entries).
 *
 * The histogram has a small number of buckets, each of which counts
 * the number of pages in the file containing free space between
 * the bucket min and the bucket max.
 *
 * Three policies can be used (in combination) to search for pages
 * with space in which to create a new record (a sketch of how they combine
 * appears below):
 * - t_cache : look in the heap for a page with space.
 * - t_compact : if the histograms say there are any pages with
 *   sufficient space somewhere in the file,
 *   do a linear search of the file for such a page, updating histogram and
 *   heap metadata in the process. This is potentially
 *   costly but useful when the file has not been inspected since the
 *   last restart, because the heap has no records for the file except
 *   those inserted due to a record update or removal.
 * - t_append : append the new record to the file.
 *
 * Using \ref append_file_i to create records means only t_append is used,
 * ensuring that the record will always be appended to the file.
 * \ref ss_m::create_rec uses t_cache | t_compact | t_append.
 *
 * The policy can be given on the \ref ss_m::create_rec call. The default
 * is t_cache | t_compact | t_append.
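 *
 * A hedged sketch of how the three policy bits combine follows. The policy
 * names mirror \ref pg_policy_t; everything else (helper functions, types)
 * is hypothetical and stands in for the histoid/heap machinery described
 * above.
 * \code
 * #include <cstddef>
 *
 * enum pg_policy_sk { t_cache_sk = 1, t_compact_sk = 2, t_append_sk = 4 };
 * struct page_id_sk { };
 *
 * // Stand-ins for the histoid_t heap/histogram and the append path:
 * bool heap_lookup(std::size_t need, page_id_sk& out);
 * bool histogram_says_space_exists(std::size_t need);
 * bool linear_search_file(std::size_t need, page_id_sk& out);  // refreshes heap/histogram
 * bool append_to_file(std::size_t need, page_id_sk& out);      // may allocate a new page
 *
 * bool find_page_for_rec(int policy, std::size_t need, page_id_sk& out)
 * {
 *     if ((policy & t_cache_sk) && heap_lookup(need, out))
 *         return true;                       // a recently-used page has room
 *     if ((policy & t_compact_sk) && histogram_says_space_exists(need)
 *             && linear_search_file(need, out))
 *         return true;                       // found by scanning the file
 *     if (policy & t_append_sk)
 *         return append_to_file(need, out);  // last page, or a new page at the end
 *     return false;
 * }
 * \endcode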
 *
 * If the file manager does not find a page in the file with sufficient
 * space for the record, or if it must append to the end of the file
 * and the last page does not have the needed space, the file manager asks
 * the I/O manager to allocate a page.
 *
 * Once the file manager has located a page with sufficient space to
 * create the record, the I/O and volume managers worry about
 * \ref RSVD_MODE.
 *
 * Creating a record consists in:
 * - estimate the space required for the record, based on the sizes of
 *   the data and header vectors and the length hint given
 *   on the ss_m::create_rec call;
 * - choose a record implementation for the given size
 *   (a small object or a large object, which determines the amount
 *   of space needed in the slot of the file_p page);
 * - find and lock a slot in a page:
 *   - if appending to the file, find and EX-lock (with long duration) the
 *     next available slot in the last page of the file.
 *     If there is no such slot or if it is not large enough,
 *     allocate a new page (at the end of the file);
 *   - if not appending to the file, consult the histograms to find a
 *     page already in the file,
 *     one that contains a slot large enough for the new record;
 *   - once we have located a (potentially) suitable page, latch it, find a
 *     suitable slot, and lock that slot. Note that normally we cover latches
 *     with locks to avoid deadlocks, but in this case we must latch, then
 *     inspect the page, because we have no idea which slot to lock.
 *     We may have to try several pages before finding one
 *     that is truly suitable, so this entire protocol
 *     is handled in the histogram code. The protocol is as follows:
 *     - conditionally EX-latch the page. If we cannot do so, give up on
 *       this page and try another;
 *     - once the page is latched, verify that its page ID contains the
 *       expected store ID (to detect a race); if not, reject this page and
 *       find another;
 *     - check the allocation status of the page:
 *       - try to IS-lock the page, and if we cannot do so immediately, reject
 *         the page and try another;
 *       - verify that the page is allocated in its extent, and that the
 *         extent's owner is the expected store. This involves SH-latching the
 *         extent-map page while holding the file page latched;
 *     - acquire an EX lock on the next available slot with enough space
 *       (space that is usable by this transaction -- subject to \ref RSVD_MODE);
 * - once we have a suitable page with an EX record lock, create the record in
 *   the slot of the page located above.
 *
 * Freeing a record consists in:
 * - EX-lock the record (with long duration);
 * - from the record ID, determine its containing page;
 * - EX-latch the page;
 * - mark the slot free, releasing the space but leaving it reserved
 *   (see \ref RSVD_MODE);
 * - if the slot is the last used slot on the page, and the page is not the
 *   first page in the file, free the page;
 * - update the histograms to reflect the space on the page and whether the
 *   page is still in the file.
 *
 * \section BTREE_M B+-Tree Manager
 *
 * The values associated with the keys are opaque to the storage
 * manager, except when IM (Index Management locking protocol) is used,
 * in which case the value is
 * treated as a record ID, but no integrity checks are done.
 * It is the responsibility of the server to see that the value is
 * legitimate in this case.
 *
 * B-trees can be bulk-loaded from files of sorted key-value pairs,
 * as long as the keys are in \ref LEXICOFORMAT "lexicographic form".
 * \bug GNATS 116 Btree doesn't sort elements for duplicate keys in bulk-load.
 * This is a problem inherited from the original SHORE storage manager.
 *
 * The implementation of B-trees is straight from the Mohan ARIES/IM
 * and ARIES/KVL papers. See \ref MOH1, which covers both topics.
 *
 * Those two papers give a thorough explanation of the arcane algorithms,
 * including logging considerations.
 * Anyone considering changing the B-tree code is strongly encouraged
 * to read these papers carefully.
 * Some of the performance tricks described in these papers are
 * not implemented here.
 * For example, the ARIES/IM paper describes performing logical
 * undo of insert operations only if physical undo
 * is not possible.
 * The storage manager always undoes inserts logically.
 *
 * \bug GNATS 137 Latches can now be downgraded; btree code should use this.
 *
 * \section RTREE_M R*-Tree Manager
 *
 * The spatial indexes in the storage manager are R*-trees, a variant
 * of R-trees that performs frequent restructuring to yield higher
 * performance than normal R-trees. The entire index is locked.
 * See \ref BKSS.
 *
 * \section DIR_M Directory Manager
 * All storage structures created by a server
 * have entries in a B+-Tree index called the
 * \e store \e directory or just \e directory.
 * This index is not exposed to the server.
 *
 * The storage manager maintains some transient and some persistent data
 * for each store. The directory's key is the store ID, and the value it
 * returns from a lookup is an
 * sdesc_t ("store descriptor") structure, which
 * contains both the persistent and the transient information.
 *
 * The persistent information is in a sinfo_s structure; the
 * transient information is resident only in the cache of sdesc_t
 * structures that the directory manager
 * maintains.
 *
 * The metadata include:
 * - what kind of storage structure uses this store (btree, rtree, file)
 * - if a B-tree, is it unique and what kind of locking protocol does it use?
 * - what stores compose this storage structure (e.g., if a file, what is the
 *   large-object store and what is the small-record store?)
 * - what is the root page of the structure (if an index)
 * - what is the key type (if an index)
 *
 * \section LOCK_M Lock Manager
 *
 * The lock manager understands the following kinds of locks:
 * - volume
 * - extent
 * - store
 * - page
 * - kvl
 * - record
 * - user1
 * - user2
 * - user3
 * - user4
 *
 * Lock requests are issued with a lock ID (lockid_t), which
 * encodes the identity of the entity being locked, the kind of
 * lock, and, by inference, a lock hierarchy for a subset of the
 * kinds of locks above.
 * The lock manager does not insist that lock identifiers
 * refer to any existing object.
 *
 * The lock manager enforces two lock hierarchies:
 * - Volume - store - page - record
 * - Volume - store - key-value
 *
 * Note that the following lock kinds are not in any hierarchy:
 * - extent
 * - user1, user2, user3, user4
 *
 * Other than the way the lock identifiers are inspected for the purpose
 * of enforcing the hierarchy, lock identifiers are considered opaque
 * data by the lock manager.
 *
 * The lockid_t structure can be constructed from the IDs of the
 * various entities in (and out of) the hierarchy; see lockid_t and
 * the example lockid_test.cpp.
 *
 * \subsection LOCK_M_IMPLICIT Implicit locks
 * The hierarchy is used for implicit acquisition of locks as follows
 * (the parent-mode computation is sketched below):
 * - for each parent lock in the hierarchy, determine its lock mode
 *   based on the mode of the child:
 *   - parent_mode[child IS or SH] = IS
 *   - parent_mode[child IX or SIX or UD or EX] = IX
 *   - parent_mode[child none] = none
 * - for each parent lock in the hierarchy that is not already held in a
 *   sufficient mode, acquire the parent lock in the mode determined above.
 *
 * \subsection LOCK_M_ESC Escalation
 * The lock manager escalates up the hierarchy by default.
 * The escalation thresholds are based on run-time options.
 * They can be controlled (set, disabled) on a per-object level.
 * For example, escalation to the store level can be disabled when
 * increased concurrency is desired.
 * Escalation can also be controlled on a per-transaction or per-server basis.
 *
 * \subsection LOCK_M_SM Lock Acquisition and Release by Storage Manager
 * Locks are acquired by storage manager operations as appropriate to the
 * use of the data (read/write). (Update locks are not acquired by the
 * storage manager.)
 *
 * The storage manager's API allows explicit acquisition
 * of locks by a server. User modes user1, user2, user3, and user4 are
 * provided for that purpose.
 *
 * Freeing locks is automatic at transaction commit and rollback.
 *
 * There is limited support for freeing locks in the middle of
 * a transaction:
 * - locks of duration less than t_long can be unlocked with unlock(), and
 * - quarks (sm_quark_t) simplify acquiring and freeing locks mid-transaction.
 *
 * \subsubsection QUARK Quarks
 * A quark is a marker in the list of locks held by a transaction.
 * When the quark is destroyed, all locks acquired since the
 * creation of the quark are freed. Quarks cannot be used while more than
 * one thread is attached to the transaction, although the storage
 * manager does not strictly enforce this (due to the cost).
 * When a quark is in use for a transaction, the locks acquired
 * will be of short duration, the assumption being that the quark
 * will be closed before commit time.
 *
 * Extent locks are an exception; they must be held long-term for
 * page allocation and deallocation to work, so even in the context
 * of an open quark, extent locks will be held until end-of-transaction.
 *
 * The lock manager uses a hash table whose size is determined by
 * a configuration option.
 * The hash function used by the lock manager is known not
 * to distribute locks evenly among buckets.
 * This is partly due to the nature of lock IDs.
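 *
 * The following is a hedged sketch of the parent-mode computation used for
 * implicit locks (see \ref LOCK_M_IMPLICIT above). The mode names mirror the
 * documentation; the enum and function are illustrative, not the lock
 * manager's actual declarations.
 * \code
 * enum lmode_sk { NL_sk, IS_sk, IX_sk, SH_sk, SIX_sk, UD_sk, EX_sk };
 *
 * lmode_sk parent_mode(lmode_sk child)
 * {
 *     switch (child) {
 *     case IS_sk: case SH_sk:
 *         return IS_sk;   // reads need intent-share on the ancestors
 *     case IX_sk: case SIX_sk: case UD_sk: case EX_sk:
 *         return IX_sk;   // updates need intent-exclusive on the ancestors
 *     default:
 *         return NL_sk;   // no child lock implies no parent lock
 *     }
 * }
 * \endcode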
 *
 * \subsection LCACHE Lock Cache
 * To avoid expensive lock manager queries, each transaction
 * keeps a cache of the last <N> locks acquired (the number
 * <N> is a compile-time constant).
 * This close association between the transaction manager and
 * the lock manager is encapsulated in several classes in the file lock_x.
 *
 * \subsection DLD Deadlock Detection
 * The lock manager uses a statistical deadlock-detection scheme
 * known as "Dreadlocks" [KH1].
 * Each storage manager thread (smthread_t) has a unique fingerprint, which is
 * a set of bits; the deadlock detector ORs together the bits of the
 * elements in a waits-for-dependency list, and each thread, when
 * blocking, holds a digest (the ORed bitmap).
 * It is therefore cheap for a thread to detect a cycle when it needs to
 * block awaiting a lock: it looks at the holders
 * of the lock, and if it finds itself in any of their digests, a
 * cycle would result.
 * This works well when the total number of threads relative to the bitmap
 * size is such that it is possible to assign a unique bitmap to each
 * thread.
 * If you cannot do so, you will have false-positive deadlocks
 * "detected".
 * The storage manager counts, in its statistics, the number of times
 * it could not assign a unique fingerprint to a thread.
 * If you notice excessive transaction aborts due to false-positive
 * deadlocks,
 * you can compile the storage manager to use a larger
 * number of bits in the
 * \code sm_thread_map_t \endcode
 * found in
 * \code smthread.h \endcode.
 *
 * \section XCT_M Transaction Manager
 * When a transaction commits, the stores that are marked for deletion
 * are deleted, and the stores that were given sm_store_property_t t_load_file
 * or t_insert_file are turned into t_regular stores.
 * Because these are logged actions, and they occur if and only if the
 * transaction commits, the storage manager guarantees that the ending
 * of the transaction and the re-marking and deletion of stores are atomic.
 * This is accomplished by putting the transaction into a state,
 * xct_freeing_space, and writing a log record to that effect.
 * The space is freed, the stores are converted, and a final log record is
 * written before the transaction is truly ended.
 * In the event of a crash while a transaction is freeing space,
 * recovery searches all the
 * store metadata for stores marked for deletion
 * and deletes those that would otherwise have been missed in redo.
 *
 * \section LOG_M Log Manager
 *
 * \subsection LOG_M_USAGE How the Server Uses the Log Manager
 *
 * Log records for redoable-undoable operations contain both the
 * redo- and undo-data, hence an operation never causes two
 * different log records to be written for redo and for undo.
 * This helps control logging overhead.
 *
 * The protocol for applying an operation to an object is as follows
 * (a sketch follows this list):
 * - Lock the object.
 * - Fix the page(s) affected in exclusive mode.
 * - Apply the operation.
 * - Write the log record(s) for the operation.
 * - Unfix the page(s).
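 *
 * A hedged sketch of that protocol, in code form, follows. Every name in it
 * is a stand-in; it is not the storage manager's API, only an illustration
 * of the ordering (lock, fix, apply, log, unfix) required by write-ahead
 * logging.
 * \code
 * struct object_id_sk { };
 * struct update_sk { };
 * struct page_handle_sk { };
 * enum latch_mode_sk { LATCH_SH_sk, LATCH_EX_sk };
 *
 * // Stand-ins for the lock manager, buffer manager, and logging helpers:
 * void acquire_EX_lock(const object_id_sk&);
 * page_handle_sk fix_page(const object_id_sk&, latch_mode_sk);
 * void apply_update(page_handle_sk&, const update_sk&);
 * void write_log_record(page_handle_sk&, const update_sk&);
 * void unfix_page(page_handle_sk&);
 *
 * void update_object(const object_id_sk& obj, const update_sk& upd)
 * {
 *     acquire_EX_lock(obj);                          // 1. lock the object
 *     page_handle_sk p = fix_page(obj, LATCH_EX_sk); // 2. fix the page(s) in EX mode
 *     apply_update(p, upd);                          // 3. apply the operation
 *     write_log_record(p, upd);                      // 4. log it before unfixing,
 *                                                    //    preserving WAL ordering
 *     unfix_page(p);                                 // 5. unfix the page(s)
 * }
 * \endcode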
 *
 * The protocol for writing log records is as follows
 * (a sketch appears at the end of this subsection):
 * - Grab the transaction's log buffer, in which the last log record is to be
 *   cached, by calling xct_t::get_logbuf().
 * - Ensure that we have reserved enough log space for this transaction
 *   to insert the desired log record and to undo it. This is done by
 *   passing in
 *   the type of the log record we are about to insert, and by using a
 *   "fudge factor" (multiplier) associated with the given log record type.
 *   The fudge factor indicates, on average, how many bytes tend to be needed
 *   to undo the action being logged.
 * - Write the log record into the buffer (the idiom is to construct it
 *   there using C++ placement-new).
 * - Release the buffer with xct_t::give_logbuf(),
 *   passing in as an argument the fixed page that was affected
 *   by the update being logged. This does several things:
 *   - writes the transaction ID and the previous LSN for this transaction
 *     into the log record;
 *   - inserts the record into the log and remembers this record's LSN;
 *   - marks the given page dirty.
 *
 * Between the time the xct log buffer is grabbed and the time it is
 * released, the buffer is held exclusively by the one thread that
 * grabbed it, and updates to the xct log buffer can be made freely.
 * (Note that this per-transaction log buffer is unrelated to the log buffer
 * internal to the log manager.)
 *
 * During recovery, no logging is done in the analysis or redo phases; only
 * during the undo phase are log records inserted. Log-space reservation is
 * not needed until recovery is complete; the assumption is that if the
 * transaction had enough log space prior to recovery, it has enough space
 * during recovery.
 * Prepared transactions pose a challenge, in that they are not resolved until
 * after recovery is complete. Thus, when a transaction-prepare is logged,
 * the log-space reservations of that transaction are logged along with the
 * rest of the transaction state (locks, coordinator, etc.), and before
 * recovery is complete, these transactions re-acquire their prior log-space
 * reservations.
 *
 * The above protocol is enforced by the storage manager in helper
 * functions that create log records; these functions are generated
 * by Perl scripts from the source file logdef.dat. (See \ref LOGRECS.)
 *
 * The file logdef.dat also contains the fudge factors for log-space
 * reservation. These factors were experimentally determined.
 * There are corner cases involving btree page SMOs
 * (structure-modification operations) in which the
 * fudge factors will fail. [An example is when a transaction aborts after
 * having removed entries, and after other transactions have inserted
 * entries; the aborting transaction needs to re-insert its entries, which
 * now require splits.]
 * The storage manager has no resolution for this.
 * The fudge factors handle the majority of cases without reserving excessive
 * log space.
 * \bug GNATS 156 Btree SMOs during rollback can cause problems.
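 *
 * Before turning to the log record types, here is a hedged sketch of the
 * log-insertion protocol just described. xct_t::get_logbuf() and
 * xct_t::give_logbuf() are the documented entry points; the log record type,
 * the fudge-factor handling, and the page type shown here are placeholders.
 * \code
 * #include <cstddef>
 * #include <new>
 *
 * // Illustrative only; not the storage manager's real signatures.
 * struct fixed_page_sk { };
 * struct my_logrec_sk {
 *     my_logrec_sk(int arg) : value(arg) { }   // carries redo and undo data together
 *     int value;
 * };
 *
 * char* get_logbuf_sk();                         // stands in for xct_t::get_logbuf()
 * bool  reserve_log_space_sk(std::size_t bytes_with_fudge);
 * void  give_logbuf_sk(fixed_page_sk* page);     // stands in for xct_t::give_logbuf()
 *
 * bool log_my_update(fixed_page_sk& page, int arg)
 * {
 *     char* buf = get_logbuf_sk();                          // grab the xct log buffer
 *     if (!reserve_log_space_sk(sizeof(my_logrec_sk) * 2))  // fudge factor assumed to be 2
 *         return false;                                     // out of log space
 *     new (buf) my_logrec_sk(arg);                          // placement-new into the buffer
 *     give_logbuf_sk(&page);                                // insert record, mark page dirty
 *     return true;
 * }
 * \endcode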
 *
 * \subsection LOGRECS Log Record Types
 * The input to the above-mentioned Perl script is the source of all
 * log record types. Each log record type is listed in the file
 * \code logdef.dat \endcode
 * which is fairly self-explanatory, reproduced here:
 * \include logdef.dat
 *
 * The bodies of the methods of the class <log-rec-name>_log
 * are hand-written and reside in \code logrec.cpp \endcode.
 *
 * Adding a new log record type consists in adding a line to
 * \code logdef.dat, \endcode
 * adding method definitions to
 * \code logrec.cpp, \endcode
 * and adding the calls to the free function log_<log-rec-name>(args)
 * in the storage manager.
 * The base class for every log record is logrec_t, which is worth study
 * but is not documented here.
 *
 * Some log records are \e compensated, meaning that the
 * log records are skipped during rollback.
 * Compensations may be needed because some operations simply cannot
 * be undone. The protocol for compensating actions is as follows:
 * - Fix the needed pages.
 * - Grab an \e anchor in the log.
 *   This is an LSN for the last log record written for this transaction.
 * - Update the pages and log the updates as usual.
 * - Write a compensation log record (or piggy-back the compensation on
 *   the last-written log record for this transaction to reduce
 *   logging overhead) and free the anchor.
 *
 * \note Grabbing an anchor prevents all other threads in a multi-threaded
 * transaction from gaining access to the transaction manager. Be careful
 * with this, as it can cause mutex-latch deadlocks where multi-threaded
 * transactions are concerned. In other words, two threads cannot concurrently
 * update in the same transaction.
 *
 * In some cases, the following protocol is used to avoid excessive
 * logging by general update functions that, if logging were turned
 * on, would generate log records of their own.
 * - Fix the pages needed in exclusive mode.
 * - Turn off logging for the transaction.
 * - Perform the updates by calling some general functions. If an error
 *   occurs, undo the updates explicitly.
 * - Turn on logging for the transaction.
 * - Log the operation. If an error occurs, undo the updates with logging
 *   turned off.
 * - Unfix the pages.
 *
 * The mechanism for turning off logging for a transaction is to
 * construct an instance of xct_log_switch_t.
 *
 * When the instance is destroyed, the original logging state
 * is restored. The switch applies only to the transaction that is
 * attached to the thread at the time the switch instance is constructed,
 * and it prevents other threads of the transaction from using
 * the log (or doing much else in the transaction manager)
 * while the switch exists.
 *
 * \subsection LOG_M_INTERNAL Log Manager Internals
 *
 * The log is a collection of files, all in the same directory, whose
 * path is determined by a run-time option.
 * Each file in the directory is called a "log file" and represents a
 * "partition" of the log. The log is partitioned into files to make it
 * possible to archive portions of the log to free up disk space.
 * A log file has the name \e log.<n> where <n> is a positive integer.
 * The log file name indicates the set of log sequence numbers (lsn_t)
 * of the log records (logrec_t) that are contained in the file.
 * An lsn_t has a \e high part and a \e low part, and the
 * \e high part (a.k.a. \e file part) is the <n> in the log file name
 * (a sketch of this split follows).
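 *
 * The following hedged sketch shows how the two parts of an lsn_t relate to
 * the log files. The 32/32 split and the field names are assumptions for the
 * sketch; the real lsn_t is a 64-bit structure whose exact layout differs.
 * \code
 * #include <cstdint>
 * #include <string>
 *
 * struct lsn_sk {
 *     uint32_t file;   // "high" (file) part: the <n> in log.<n>
 *     uint32_t rba;    // "low" part: byte offset of the record within log.<n>
 * };
 *
 * // The partition file that holds the log record with this lsn:
 * std::string partition_file_for(const lsn_sk& lsn)
 * {
 *     return "log." + std::to_string(lsn.file);
 * }
 * \endcode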
 *
 * The user-settable run-time option sm_logsize indicates the maximum
 * number of KB that may be opened at once; this, in turn, determines the
 * size of a partition file, since the number of partition files is
 * a compile-time constant.
 * The storage manager computes partition sizes based on the user-provided
 * log size, such that partition sizes are a convenient multiple of blocks
 * (more about blocks below).
 *
 * A new partition is opened when the tail of the log approaches the end
 * of a partition, that is, when the next insertion into the log
 * is at an offset larger than the maximum partition size. (There is a
 * fudge factor of BLOCK_SIZE in here for convenience in implementation.)
 *
 * The \e low part of an lsn_t represents the byte offset into the log file
 * at which the log record with that lsn_t sits.
 *
 * Thus, the total file size of a log file \e log.<n>
 * is the size of all log records in the file,
 * and the lsn_t of each log record in the file is
 * lsn_t(<n>, <byte-offset of the log record within the file>).
 *
 * The log is, conceptually, a forever-expanding set of files. The log
 * manager will open at most PARTITION_COUNT log files at any one time.
 * - PARTITION_COUNT = smlevel_0::max_openlog
 * - smlevel_0::max_openlog (sm_base.h) = SM_LOG_PARTITIONS
 * - SM_LOG_PARTITIONS is a compile-time constant (which can be overridden in
 *   config/shore.def).
 *
 * The log is considered to have run out of space if logging requires that
 * more than smlevel_0::max_openlog partitions are needed.
 * Partitions are needed only as long as they contain log records
 * needed for recovery, which means:
 * - log records for pages not yet made durable (min recovery lsn)
 * - log records for uncommitted transactions (min xct lsn)
 * - log records belonging to the last complete checkpoint
 *
 * After a checkpoint is taken and its log records are durable,
 * the storage manager tries to scavenge all partitions that do not
 * contain necessary log records. The buffer manager provides the
 * min recovery lsn, the transaction manager provides the min xct lsn,
 * and the log manager keeps track of the location of the last
 * completed checkpoint in its master_lsn. Thus the
 * \e file part of the minimum of these lsns indicates the lowest partition
 * that cannot be scavenged; all earlier partitions are removed.
 *
 * When the log is in danger of running out of space
 * (because there are long-running transactions, for example)
 * the server may be called via the
 * LOG_WARN_CALLBACK_FUNC argument to ss_m::ss_m. This callback may
 * abort a transaction to free up log space, but the act of aborting
 * consumes log space. It may also archive a log file and remove it.
 * If the server provided a
 * LOG_ARCHIVED_CALLBACK_FUNC argument to ss_m::ss_m, this callback
 * can be used to retrieve archived log files when needed for
 * rollback.
 * \warning This functionality is not complete and has not been
 * well-tested.
 *
 * Log files (partitions) are written in fixed-sized blocks.
 * The log
 * manager pads writes, if necessary, to make them BLOCK_SIZE.
 * - BLOCK_SIZE = 8192, a compile-time constant.
 *
 * A skip_log record indicates the logical end of a partition.
 * The log manager ensures that the last log record in a file
 * is always a skip_log record.
 *
 * Log files (partitions) are composed of segments. A segment is
 * an integral number of blocks.
 * - SEGMENT_SIZE = 128*BLOCK_SIZE, a compile-time constant.
 *
 * The smallest partition is one segment plus one block,
 * but a partition may be many segments plus one block. The last block
 * enables the log manager to write the skip_log record to indicate the
 * end of the file.
 *
 * The partition size is determined by the storage manager run-time option
 * sm_logsize, which determines how much log can be open at any time,
 * i.e., the combined sizes of the PARTITION_COUNT partitions.
 *
 * The maximum size of a log record (logrec_t) is 3 storage manager pages.
 * A page happens to match the block size, but the two compile-time
 * constants are not inter-dependent.
 * A segment is substantially larger than a block, so it can hold at least
 * several maximum-sized log records, preferably many.
 *
 * Inserting a log record consists of copying it into the log manager's
 * log buffer (1 segment in size). The buffer wraps so long as there
 * is room in the partition. Meanwhile, a log-flush daemon thread
 * writes out unflushed portions of the log buffer.
 * The log daemon can lag behind insertions, so each insertion checks for
 * space in the log buffer before it performs the insert. If there isn't
 * enough space, it waits until the log-flush daemon has made room.
 *
 * When insertion of a log record would wrap around the buffer and the
 * partition has no room for more segments, a new partition is opened,
 * and the entire newly-inserted log record will go into that new partition.
 * Meanwhile, the log-flush daemon will see that the rest of the log
 * buffer is written to the old partition, and the next time the
 * log-flush daemon performs a flush, it will be flushing to the
 * new partition.
 *
 * The bookkeeping of the log buffer's free and used space is handled
 * by the notion of \e epochs.
 * An epoch keeps track of the start and end of the unflushed portion
 * of the segment (log buffer). Thus, an epoch refers to only one
 * segment (logically, a log-buffer copy within a partition).
 * When an insertion fills the log buffer and causes it to wrap, a new
 * epoch is created for the portion of the log buffer representing
 * the new segment, and the old epoch keeps track of the portion of the
 * log buffer representing the old segment. The inserted log record
 * usually spans the two segments, as the segments are written contiguously
 * to the same log file (partition).
 *
 * When an insertion causes a wrap and there is no more room in the
 * partition to hold the new segment, a new
 * epoch is created for the portion of the log buffer representing
 * the new segment, and the old epoch keeps track of the portion of the
 * log buffer representing the old segment, as before.
 * Now, however, the inserted log record is inserted, in its entirety,
 * in the new segment. Thus, no log record spans partitions.
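 *
 * A hedged sketch of the epoch bookkeeping just described follows. The field
 * names and the two-epoch arrangement are illustrative; the log manager's
 * real structures differ in detail.
 * \code
 * #include <cstddef>
 * #include <cstdint>
 *
 * struct epoch_sk {
 *     uint32_t    partition;  // target partition (the file part of the lsn)
 *     std::size_t base;       // offset in the partition where this segment lands
 *     std::size_t start;      // first unflushed byte within the log-buffer segment
 *     std::size_t end;        // end of the bytes inserted so far in the segment
 * };
 *
 * // The log manager tracks at most two valid epochs: the segment currently
 * // receiving inserts and, after a wrap, the previous segment awaiting flush.
 * struct log_buffer_epochs_sk {
 *     epoch_sk cur;           // segment currently being filled
 *     epoch_sk old;           // previous segment, valid until the flush daemon
 *                             // has written it out
 * };
 * \endcode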
 *
 * Meanwhile, the log-flush daemon knows about the possible existence of
 * two epochs. When an old epoch is valid, it flushes that epoch.
 * When a new epoch is also valid, it flushes that new one as well.
 * If the two epochs have the same target partition, the two flushes are
 * done with a single write.
 *
 * The act of flushing an epoch to a partition consists of a single
 * write of a size that is an integral multiple of BLOCK_SIZE. The
 * flush appends a skip_log record, and zeroes as needed, to round out the
 * size of the write. Writes re-write portions of the log already
 * written, in order to overwrite the skip_log record at the tail of the
 * log (and put a new one at the new tail).
 *
 *
 *\subsection RECOV Recovery
 * The storage manager performs ARIES-style logging and recovery.
 * This means the logging and recovery system has these characteristics:
 * - uses write-ahead logging (WAL)
 * - repeats history on restart before doing any rollback
 * - all updates are logged, including those performed during rollback
 * - compensation records are used in the log to bound the amount
 *   of logging done for rollback and to guarantee progress in the case
 *   of repeated failures and restarts.
 *
 * Each time a storage manager (ss_m class) is constructed, the logs
 * are inspected, the last checkpoint is located, and its lsn is
 * remembered as the master_lsn; then recovery is performed.
 * Recovery consists of three phases: analysis, redo and undo.
 *
 *\subsubsection RECOVANAL Analysis
 * This pass reads the log starting at the master_lsn and analyzes the
 * log records written thereafter. From the log records of the
 * last completed checkpoint, it reconstructs the transaction table and the
 * buffer-pool's dirty page table, and it mounts the devices and
 * volumes that were mounted at the time of the checkpoint.
 * From the dirty page table, it determines the \e redo_lsn,
 * the lowest recovery lsn of the dirty pages, which is
 * where the next phase of recovery must begin.
 *
 *\subsubsection RECOVREDO Redo
 * This pass starts reading the log at the redo_lsn, and, for each
 * log record thereafter, decides whether that log record's
 * work needs to be redone. The general protocol is (sketched below):
 * - if the log record is not redoable, it is ignored
 * - if the log record is redoable and contains a page ID, the
 *   page is inspected and its lsn is compared to that of the log
 *   record. If the page lsn is at least as recent as the log record's
 *   lsn, the page already reflects this update, and the
 *   action is not redone.
 *
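 * The redo decision can be illustrated with a small sketch. The names
 * below (toy_log_rec, toy_page, needs_redo) are hypothetical and do not
 * correspond to the storage manager's restart code; the sketch only
 * captures the page-lsn comparison described above.
 * \code
 * #include <cstdint>
 *
 * // Hypothetical, simplified log record and page, for illustration only.
 * struct toy_log_rec {
 *     uint64_t lsn;        // this record's log sequence number
 *     bool     redoable;
 *     uint32_t page_id;    // 0 if the record carries no page id
 * };
 *
 * struct toy_page {
 *     uint64_t lsn;        // lsn of the last update applied to this page
 * };
 *
 * // Decide whether a single log record must be redone against its page.
 * bool needs_redo(const toy_log_rec& r, const toy_page& p) {
 *     if (!r.redoable || r.page_id == 0)
 *         return false;          // nothing to redo, or no page to inspect
 *     return p.lsn < r.lsn;      // page is older than the record: repeat history
 * }
 * \endcode
 *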
 *\subsubsection RECOVUNDO Undo
 * After redo, the state of the database matches that at the time
 * of the crash. Now the storage manager rolls back the transactions that
 * remain active.
 * Care is taken to undo the log records in reverse chronological order,
 * rather than allowing several transactions to roll back
 * at their own paces. This is necessary because some operations
 * use page-fixing for concurrency control (pages are protected
 * only with latches if there is no page lock in
 * the lock hierarchy; this occurs when
 * logical logging and high-concurrency locking are used,
 * in the B-trees, for example). A crash in the middle of
 * a compensated action such as a page split must result in
 * the split being undone before any other operations on the
 * tree are undone.
 * \bug GNATS 49 (performance) There is no concurrent undo.
 *
 * After the storage manager has recovered, control returns from its
 * constructor method to the caller (the server).
 * There might be transactions left in prepared state.
 * The server is now free to resolve these transactions by
 * communicating with its coordinator.
 *
 *\subsection LSNS Log Sequence Numbers
 *
 * Write-ahead logging requires a close interaction between the
 * log manager and the buffer manager: before a page can be flushed
 * from the buffer pool, the log might have to be flushed.
 *
 * This also requires a close interaction between the transaction
 * manager and the log manager.
 *
 * All three managers understand a log sequence number (lsn_t).
 * Log sequence numbers serve to identify and locate log records
 * in the log, to timestamp pages, to identify the last
 * update performed by a transaction, and to identify the last log record
 * written by a transaction. Since every update is logged, every update
 * can be identified by a log sequence number. Each page bears
 * the log sequence number of the last update that affected that
 * page.
 *
 * A page cannot be written to disk until the log record with that
 * page's lsn has been written to the log (and is on stable storage);
 * a sketch of this check appears below.
 * A log sequence number is a 64-bit structure, with part identifying
 * a log partition (file) number and the rest identifying an offset within the file.
 *
 * \subsection LOGPART Log Partitions
 *
 * The log is partitioned to simplify archiving to tape (not implemented).
 * The log comprises 8 partitions, where each partition's
 * size is limited to approximately 1/8 the maximum log size given
 * in the run-time configuration option sm_logsize.
 * A partition resides in a file named \e log.<n>, where \e n
 * is the partition number.
 * The configuration option sm_logdir names a directory
 * (which must exist before the storage manager is started)
 * in which the storage manager may create and destroy log files.
 *
 * The storage manager may have at most 8 active partitions at any one time.
 * An active partition is one that is needed because it
 * contains log records for running transactions. Such partitions
 * could (if it were supported) be streamed to tape and their disk
 * space reclaimed. Space is reclaimed when the oldest transaction
 * ends and the new oldest transaction's first log record is
 * in a newer partition than that in which the old oldest
 * transaction's first log record resided.
 * Until tape archiving is implemented, the storage
 * manager issues an error (eOUTOFLOGSPACE)
 * if it consumes sufficient log space to be unable to
 * abort running transactions and perform all resulting necessary logging
 * within the 8 partitions available.
 * \note Determining the point at which there is insufficient space to
 * abort all running transactions is a heuristic matter and it
 * is not reliable. A transaction "reserves" log space for rollback, meaning
 * that no other transaction can consume that space until the transaction ends.
 * A transaction has to reserve significantly more space to roll back
 * B-tree deletions than it needed to perform them in forward processing;
 * this is because the log overhead for the compensating insertions is
 * considerably larger than that for the deletions.
 * The (compile-time) page size is also a factor in this heuristic.
 *
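 * As promised above, the write-ahead-logging rule reduces to a simple check
 * before every page write. The sketch below is illustrative only;
 * toy_log, durable_lsn, flush_until, and write_page are hypothetical names,
 * not the storage manager's API.
 * \code
 * #include <cstdint>
 *
 * // Hypothetical interfaces, for illustration of the WAL rule only.
 * struct toy_log {
 *     uint64_t durable_lsn;               // everything below this is on stable storage
 *     void flush_until(uint64_t lsn) {    // force the log up to (and including) lsn
 *         if (durable_lsn <= lsn) durable_lsn = lsn + 1;
 *     }
 * };
 *
 * struct toy_page {
 *     uint64_t lsn;                       // lsn of the last update applied to the page
 * };
 *
 * // WAL: a dirty page may reach disk only after the log record that
 * // produced its lsn is durable.
 * void write_page(toy_log& log, const toy_page& p /*, disk ... */) {
 *     if (p.lsn >= log.durable_lsn)
 *         log.flush_until(p.lsn);
 *     // ... now it is safe to issue the disk write for the page ...
 * }
 * \endcode
 *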
 *
 * Log records are buffered by the log manager until forced to stable
 * storage, to reduce I/O costs.
 * The log manager keeps a buffer of a size that is determined by
 * a run-time configuration option.
 * The buffer is flushed to stable storage when necessary.
 * The last log record in the buffer is always a skip_log record,
 * which indicates the end of the log partition.
 *
 * Ultimately, archiving to tape is necessary. The storage manager
 * does not perform write-aside or any other work in support of
 * long-running transactions.
 *
 * The checkpoint manager chkpt_m sleeps until kicked into action
 * by the log manager, and when it is kicked, it takes a checkpoint,
 * then sleeps again. Taking a checkpoint amounts to these steps
 * (see the sketch following this discussion):
 * - Write a chkpt_begin log record.
 * - Write a series of log records recording the mounted devices and volumes.
 * - Write a series of log records recording the buffer pool's dirty pages.
 *   For each dirty page in the buffer pool, the page id and its recovery lsn
 *   are logged.
 *   \anchor RECLSN
 *   A page's recovery lsn is metadata stored in the buffer
 *   manager's control block, but is not written on the page.
 *   It represents an lsn prior to or equal to the log's current lsn at
 *   the time the page was first marked dirty. Hence, it
 *   is less than or equal to the LSN of the log record for the first
 *   update to that page after the page was read into the buffer
 *   pool (and remained there until this checkpoint). The minimum
 *   of all the recovery lsns written in this checkpoint
 *   will be a starting point for crash-recovery, if this is
 *   the last checkpoint completed before a crash.
 * - Write a series of log records recording the states of the known
 *   transactions, including the prepared transactions.
 * - Write a chkpt_end log record.
 * - Tell the log manager where this checkpoint is: the lsn of the chkpt_begin
 *   log record becomes the new master_lsn of the log. The master_lsn is
 *   written in a special place in the log so that it can always be
 *   discovered on restart.
 *
 * These checkpoint log records may interleave with other log records, making
 * the checkpoint "fuzzy"; this way the world doesn't have to grind to
 * a halt while a checkpoint is taken, but there are a few operations that
 * must be serialized with all or portions of a checkpoint. Those operations
 * use mutex locks to synchronize. Synchronization of operations is
 * as follows:
 * - Checkpoints cannot happen simultaneously; they are serialized with
 *   respect to each other.
 * - A checkpoint and the following are serialized:
 *   - mount or dismount a volume
 *   - prepare a transaction
 *   - commit or abort a transaction (a certain portion of this must
 *     wait until a checkpoint is not happening)
 *   - heroics to cope with a shortage of log space
 * - The portion of a checkpoint that logs the transaction table is
 *   serialized with the following:
 *   - operations that can run only with one thread attached to
 *     a transaction (including the code that enforces this)
 *   - transaction begin, end
 *   - determining the number of active transactions
 *   - constructing a virtual table from the transaction table
 *
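 * For orientation, here is the sketch promised above: the sequence of
 * checkpoint log records condensed into one code path. The names
 * (take_checkpoint, log_chkpt_begin, and so on) are hypothetical stubs,
 * not the actual chkpt_m implementation; they merely restate the steps
 * listed above.
 * \code
 * #include <cstdint>
 * #include <vector>
 *
 * // Hypothetical stand-ins; the real work is done by chkpt_m and the log manager.
 * struct dirty_page { uint32_t pid; uint64_t rec_lsn; };
 *
 * uint64_t log_chkpt_begin()                                { return 1; } // lsn of chkpt_begin (stubbed)
 * void     log_mounted_devices_and_volumes()                {}
 * void     log_dirty_pages(const std::vector<dirty_page>&)  {}
 * void     log_transaction_table()                          {}
 * void     log_chkpt_end()                                  {}
 * void     set_master_lsn(uint64_t)                         {}
 *
 * // One (fuzzy) checkpoint: other log records may be interleaved between
 * // these steps; only the operations listed above are serialized with it.
 * void take_checkpoint(const std::vector<dirty_page>& dirty)
 * {
 *     uint64_t master = log_chkpt_begin();   // 1. chkpt_begin
 *     log_mounted_devices_and_volumes();     // 2. mounted devices and volumes
 *     log_dirty_pages(dirty);                // 3. page id + recovery lsn of each dirty page
 *     log_transaction_table();               // 4. states of known (incl. prepared) transactions
 *     log_chkpt_end();                       // 5. chkpt_end
 *     set_master_lsn(master);                // 6. lsn of chkpt_begin becomes the master_lsn
 * }
 * \endcode
 *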
 *
 * \section BF_M Buffer Manager
 * The buffer manager is the means by which all other modules (except
 * the log manager) read and write pages.
 * A page is read by calling bf_m::fix.
 * If the page requested cannot be found in the buffer pool,
 * the requesting thread reads the page and blocks waiting for the
 * read to complete.
 *
 * All frames in the buffer pool are the same size, and
 * they cannot be coalesced,
 * so the buffer manager manages a set of pages of fixed size.
 *
 * \subsection BFHASHTAB Hash Table
 * The buffer manager maintains a hash table mapping page IDs to
 * buffer control blocks. A control block points to its frame, and
 * from a frame one can arithmetically locate its control block (in
 * bf_m::get_cb(const page_s *)).
 * The hash table for the buffer pool uses cuckoo hashing
 * (see \ref P1) with multiple hash functions and multiple slots per bucket.
 * These are compile-time constants and can be modified (bf_htab.h).
 *
 * Cuckoo hashing is subject to cycles, in which making room on one
 * table bucket A would require moving something else into A.
 * Using at least two slots per bucket reduces the chance of a cycle.
 *
 * The implementation limits the number of probes for an empty slot and
 * the number of moves it will perform to make room.
 * If cycles are present, the limit will be hit, but hitting the limit
 * does not necessarily indicate a cycle. If the limit is hit,
 * the insert fails.
 * The "normal" solution in this case is to rebuild the table with
 * different hash functions. The storage manager does not handle this case.
 * \bug GNATS 47
 * In the event of insertion failure, the hash table will have to be rebuilt with
 * different hash functions, or will have to be modified in some way.
 *
 * \bug GNATS 35 The buffer manager hash table implementation contains a race.
 * While a thread performs a hash-table
 * lookup, an item could move from one bucket to another (but not
 * from one slot to another within a bucket).
 * The implementation contains a temporary work-around for
 * this, until the problem is more gracefully fixed: if a lookup fails to
 * find its target, it performs an expensive exhaustive lookup, and
 * the statistics record these as bf_harsh_lookups.
 *
 * \subsection REPLACEMENT Page Replacement
 * When a page is fixed, the buffer manager looks for a free buffer-pool frame,
 * and if one is not available, it has to choose a victim to replace.
 * It uses a clock-based algorithm to determine where in the buffer pool
 * to start looking for an unlatched frame.
 * On the first pass over the buffer pool it considers only clean frames;
 * on the second pass it will also consider dirty pages;
 * and on the third and subsequent passes it will consider any frame.
 *
 * The buffer manager forks background threads to flush dirty pages.
 * The buffer manager makes an attempt to avoid hot pages and to minimize
 * the cost of I/O by sorting and coalescing requests for contiguous pages.
 * Statistics kept by the buffer manager record the number of resulting write
 * requests of each size.
 *
 * There is one bf_cleaner_t thread for each volume, and it flushes pages for that
 * volume; this is done so that it can combine contiguous pages into
 * single write requests to minimize I/O. Each bf_cleaner_t is a master thread with
 * multiple page-writer slave threads. The number of slave threads per master
 * thread is controlled by a run-time option.
 * The master thread can be disabled (thereby disabling all background
 * flushing of dirty pages) with a run-time option.
 *
 * The buffer manager writes dirty pages even if the transaction
 * that dirtied the page is still active (steal policy). Pages
 * stay in the buffer pool as long as they are needed, except when
 * chosen as a victim for replacement (no-force policy).
 *
 * The replacement algorithm is clock-based (it sweeps the buffer
 * pool, noting and clearing reference counts); a simplified sketch of
 * such a sweep appears below. This is a cheap
 * way to achieve something close to LRU; it avoids much of the
 * overhead and mutex bottlenecks associated with LRU.
 *
 * The buffer manager maintains a hash table that maps page IDs to buffer
 * frame control blocks (bfcb_t), which in turn point to frames
 * in the buffer pool. The bfcb_t keeps track of the page in the frame,
 * the page ID of the previously-held page
 * and whether it is in transit, the dirty/clean state of the page,
 * the number of page fixes (pins) held on the page (i.e., reference counts),
 * the \ref RECLSN "recovery lsn" of the page, etc.
 * The control block also contains a latch. A page, when fixed,
 * is always fixed in a latch mode, either LATCH_SH or LATCH_EX.
 * \bug GNATS 40 bf_m::upgrade_latch() drops the latch and re-acquires it in
 * the new mode if it cannot perform the upgrade without blocking.
 * This is an issue inherited from the original SHORE storage manager.
 * To block in this case
 * would enable a deadlock in which two threads hold the latch in SH mode
 * and both want to upgrade to EX mode. When this happens, the statistics
 * counter \c bf_upgrade_latch_race is incremented.
 *
 * Page fixes are expensive (in CPU time, even if the page is resident).
 *
 * Each page type defines a set of fix methods that are virtual in
 * the base class for all pages; the rest of the storage manager
 * interacts with the buffer manager primarily through these methods
 * of the page classes.
 * The MAKEPAGECODE macros are used for each page subtype; they
 * define all the fix methods on the page in such a way that bf_m::fix()
 * is properly called in each case.
 *
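 * Here is the simplified sketch of a clock-style victim search promised
 * above. It is not the buffer manager's actual algorithm (which also
 * distinguishes clean, dirty, and latched frames across its passes); the
 * names toy_frame and pick_victim are hypothetical.
 * \code
 * #include <cstddef>
 * #include <vector>
 *
 * // Hypothetical frame descriptor, for illustration.
 * struct toy_frame {
 *     bool     in_use;
 *     bool     latched;
 *     unsigned ref_count;   // "recently used" indication, cleared by the sweep
 * };
 *
 * // One clock sweep: advance a hand around the pool, clearing reference
 * // counts, and stop at the first unlatched frame that is no longer referenced.
 * int pick_victim(std::vector<toy_frame>& pool, size_t& hand)
 * {
 *     for (size_t tries = 0; tries < 2 * pool.size(); ++tries) {
 *         toy_frame& f = pool[hand];
 *         size_t idx = hand;
 *         hand = (hand + 1) % pool.size();
 *
 *         if (!f.in_use) return (int)idx;     // free frame: use it directly
 *         if (f.latched) continue;            // someone holds the page: skip it
 *         if (f.ref_count > 0) {              // recently used: give it another chance
 *             --f.ref_count;
 *             continue;
 *         }
 *         return (int)idx;                    // unreferenced, unlatched: the victim
 *     }
 *     return -1;                              // no victim found within the sweep limit
 * }
 * \endcode
 *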
 *
 * A page frame may be latched for a page without the page being
 * read from disk; this
 * is done when a page is about to be formatted.
 *
 * The buffer manager is responsible for maintaining WAL; this means it may not
 * flush to disk dirty pages whose log records have not yet reached stable storage.
 * Temporary pages (see sm_store_property_t) do not get logged, so they do not
 * have page lsns to assist in determining their clean/dirty status, and since pages
 * may change from temporary (unlogged) to logged, they require special handling, described
 * below.
 *
 * When a page is unfixed, sometimes it has been updated and must be marked dirty.
 * The protocol used in the storage manager is as follows (a condensed
 * sketch follows the list):
 *
 * - Fixing with latch mode EX signals intent to dirty the page. If the page
 *   is not already dirty, the buffer control block for the page is given a
 *   recovery lsn of the page's lsn. This means that any dirtying of the page
 *   will be done with a log record whose lsn is larger than this recovery lsn.
 *   Fixing an already-dirty page in EX mode does not change
 *   the recovery lsn for the page.
 *
 * - Clean pages have a recovery lsn of lsn_t::null.
 *
 * - A thread updates a page in the buffer pool only when it has the
 *   page EX-fixed (latched).
 *
 * - After the update to the page, the thread writes a log record to
 *   record the update. The log functions (generated by Perl)
 *   determine if a log record should be written (not, for example, if the
 *   page is a tmp page or if logging is turned off),
 *   and if not, they call page.set_dirty() so that any subsequent
 *   unfix notices that the page is dirty.
 *   If the log record is written, the modified page is unfixed with
 *   unfix_dirty() (in xct_impl::give_logbuf).
 *
 * - Before unfixing a page, if it was written, it must be marked dirty first
 *   with
 *   - set_dirty followed by unfix, or
 *   - unfix_dirty (which is set_dirty + unfix).
 *
 * - Before unfixing a page, if it was NOT written, unfix it with bf_m::unfix
 *   so its recovery lsn gets cleared. This happens only if this is the
 *   last thread to unfix the page. The page could have multiple fixers
 *   (latch holders) only if it were fixed in SH mode. If fixed (latched)
 *   in EX mode, this will be the only thread to hold the latch and the
 *   unfix will clear the recovery lsn.
 *
 * It is possible that a page is fixed in EX mode, marked dirty but never
 * updated after all, then unfixed. The buffer manager attempts to recognize
 * this situation and clear the control block's "dirty" bit and recovery lsn.
 *
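 * Here is the condensed sketch promised above. It compresses the
 * fix/update/log/unfix protocol into one code path; the helper names
 * (toy_fix_ex, toy_log_update, and so on) are hypothetical and stand in
 * for the page fix methods, the generated log functions, and
 * bf_m::unfix / unfix_dirty.
 * \code
 * // Hypothetical helpers standing in for the real page/buffer/log interfaces.
 * struct toy_page_handle { bool dirty; };
 *
 * void toy_fix_ex(toy_page_handle&)        {}  // EX latch; sets recovery lsn if page was clean
 * void toy_apply_update(toy_page_handle&)  {}  // modify the page image in the buffer pool
 * bool toy_log_update(toy_page_handle&)    { return true; }  // false if no record written
 * void toy_set_dirty(toy_page_handle& p)   { p.dirty = true; }
 * void toy_unfix(toy_page_handle&)         {}  // clears recovery lsn if last unfix and not dirty
 * void toy_unfix_dirty(toy_page_handle& p) { toy_set_dirty(p); toy_unfix(p); }
 *
 * void update_one_page(toy_page_handle& p, bool intend_to_update)
 * {
 *     toy_fix_ex(p);                       // signal intent to dirty the page
 *     if (!intend_to_update) {
 *         toy_unfix(p);                    // never written: plain unfix clears the recovery lsn
 *         return;
 *     }
 *     toy_apply_update(p);
 *     if (toy_log_update(p)) {             // log record written (logged store)
 *         toy_unfix_dirty(p);              // set_dirty + unfix
 *     } else {                             // tmp page / logging off: no log record
 *         toy_set_dirty(p);                // still must be marked dirty
 *         toy_unfix(p);
 *     }
 * }
 * \endcode
 *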
 * Things get a little complicated where the buffer manager's
 * page-writer threads are
 * concerned. The page-writer threads acquire share latches and copy
 * dirty pages, this being faster than holding the latch for the duration of the
 * write to disk.
 * When the write is finished, the page-writer re-latches the page with the
 * intention of marking it clean if no intervening updates have occurred. This
 * means changing the \e dirty bit and updating the recovery lsn in the buffer
 * control block. The difficulty lies in determining if the page is indeed clean,
 * that is, matches the latest durable copy.
 * In the absence of unlogged (t_temporary) pages, this would not be terribly
 * difficult, but it would still have to cope with the case that the page was
 * (updated and) written by another thread between the copy and the re-fix.
 * It might have been cleaned, or that other thread might be operating in
 * lock-step with this thread.
 * The conservative handling would be not to change the recovery lsn in the
 * control block if the page's lsn has changed; however, this has
 * serious consequences
 * for hot pages: their recovery lsns might never be moved toward the tail of
 * the log (the recovery lsns remain artificially low) and
 * thus the hot pages can prevent scavenging of log partitions. If log
 * partitions cannot be scavenged, the server runs out of log space.
 * For this reason, the buffer manager goes to some lengths to update the
 * recovery lsn if at all possible.
 * To further complicate matters, the page could have changed stores,
 * and thus its page type or store (logging) property could differ.
 * The details of this problem are handled in a function called determine_rec_lsn().
 *
 * \subsection PAGEWRITERMUTEX Page Writer Mutexes
 *
 * The buffer manager keeps a set of \e N mutexes to synchronize the various
 * threads that can write pages to disk. Each of these mutexes covers a
 * run of pages of size smlevel_0::max_many_pages. N is substantially smaller
 * than the number of "runs" in the buffer pool (buffer pool size /
 * max_many_pages), so each of the N mutexes actually covers
 * several runs:
 * \code
 * page-writer-mutex = page / max_many_pages % N
 * \endcode
 *
 * \subsection BFSCAN Foreground Page Writes and Discarding Pages
 * Pages can be written to disk by "foreground" threads under several
 * circumstances.
 * All foreground page-writing goes through the method bf_m::_scan.
 * This is called for:
 * - discarding all pages from the buffer pool (bf_m::_discard_all)
 * - discarding all pages belonging to a given store from the buffer pool
 *   (bf_m::_discard_store), e.g., when a store is destroyed.
 * - discarding all pages belonging to a given volume from the buffer pool
 *   (bf_m::_discard_volume), e.g., when a volume is destroyed.
 * - forcing all pages to disk (bf_m::_force_all) with or without invalidating
 *   their frames, e.g., during clean shutdown.
 * - forcing all pages of a store to disk (bf_m::_force_store) with
 *   or without invalidating
 *   their frames, e.g., when changing a store's property from unlogged to
 *   logged.
 * - forcing all pages of a volume to disk (bf_m::_force_store) with
 *   or without invalidating the frames, e.g., when dismounting a volume.
 * - forcing all pages whose recovery lsn is less than or equal to a given
 *   lsn_t, e.g., for a clean shutdown, after restart.
 */
/**\page Logging
 *
 * See \ref LOG_M.
 * */

/**\page DEBUGAID Debugging Aids
 *\section SSMDEBUGAPI Storage Manager Methods for Debugging
 *
 * The storage manager contains a few methods that are useful for
 * debugging purposes. Some of these should be used for no other
 * purpose, as they are not thread-safe or might be very expensive.
 * See \ref SSMAPIDEBUG.
 *
 *\section SSMDEBUG Build-time Debugging Options
 *
 * At configure time, you can control the debugger-related options
 * (symbols, inlining, etc.) with the debug-level options. See \ref CONFIGOPT.
 * \section SSMTRACE Tracing (--enable-trace)
 * When this build option is used, additional code is included in the build to
 * enable some limited tracing. These C preprocessor macros apply:
 * - W_TRACE
 *   --enable-trace defines this.
 * - FUNC
 *   Outputs the function name when the function is entered.
 * - DBG
 *   Outputs the arguments.
 * - DBGTHRD
 *   Outputs the arguments.
 *
 * The tracing is controlled by these environment variables:
 * - DEBUG_FLAGS: a list of file names to trace, e.g. "smfile.cpp log.cpp"
 * - DEBUG_FILE: name of the destination for the output. If not defined, the output
 *   is sent to cerr/stderr.
 *
 * See \ref CONFIGOPT.
 * \note This tracing is not thread-safe, as it uses stream output.
 * \section SSMENABLERC Return Code Checking (--enable-checkrc)
 * If a w_rc_t is set but not checked with the method is_error(), upon destruction the
 * w_rc_t will print a message to the effect "error not checked".
 * See \ref CONFIGOPT.
 *
 */