Minirel Buffer Manager Report

Minirel Buffer Manager Report


Prepared by Kan Qiu (kan@cs) and Fei Su (fei@cs) May 13 ,1995


INTRODUCTION

Buffer manager plays an important role in the Minirel multi-user database system. It manages a number of memory frames and book-keeps based on a hashing table inside the shared memory region. Buffer manager sits between upper level database access methods and the lower level space management. Buffer manager is also extensible with respect to different replacement polocies, i.e., user can change the buffer frame replacement policy to any user defined policy as long as user can provide an replacer class whose public interfaces are identical to those of the default replacer class Clock based on the clock algorithm. Writes of dirty pages are handled synchronously when dirty pages are replaced ; the buffer manager also support the reservation of blocks of pages for use, e.g., by fancy join method that utilize large amounts of buffer space. Buffer manager observe steal and no force policy.

Buffer manager provides basic methods such as page pinning and unpinning, page allocation and deallocation , buffer frame reservation and release for some special join methods . Most minirel components of Minirel have to interact with buffer manager. Most public interfaces of buffer manager are designed for access methods(B+Tree, Linear Hashing , Grid Files , R-Trees,etc.) and join methods. Buffer manager has to use interfaces provided by space manager to do disk I/O for frame replacement. Also buffer manager communicates with recovery manager to provide the checkpoint service and log manager to enforce write after log(WAL) policy. After all, buffer manager uses synchronization machanisms provided by OS group to achieve mutual exclusion during its critical sections.

EXTERNAL INTERFACES

class BufMgr {
public:
   BufMgr();	//Buffer pool size is set to be a default integer

   ~BufMgr();  	//flush all valid dirty pages to disk

   Status pinPage(int PageId_in_a_DB, Page*& page,int emptyPage=0);
		// check if this page is in buffer pool, otherwise
		// find a frame for this page , read in and pin it.
		// also write out the old page if it's dirty before 
		// reading if emptyPage==TRUE, then actually no 
		// read is done to bring the page

   Status unpinPage(int globalPageId_in_a_DB, int dirty=FALSE);
		// if pincount>0, decrement it and if it becomes zero,
		// put it in a group of replacement candidates.
		// if pincount=0 before this call, return error.

   Status newPage(int& firstPageId, Page*& firstpage,int howmany=1); 
		// call DB object to allocate a run of new pages and 
                // find a frame in the buffer pool for the first page
		// and pin it. If buffer is full, ask DB to deallocate 
		// all these pages and return error

    	Status 	freePage(int globalPageId); 
		// user should call this method if it needs to 
		// delete a page this routine will call DB to 
		//deallocate the page 

    	Status 	reserve(int number, Page* pagelist[]);
		// reservation of a number of frames for special 
		// purpose,actually it just pin these frames and 
		// remove them from the buffer table.

	Status	commit();   
		// for Xact to notify bufferMgr to 
		// really delete pages

   	Status	release(int number, Page* pagelist[]); 
		//only release those reserved by previous interface
		//This routine includes unpinPage operation.

	Status 	dirtyPageTable(int& Num_of_dirtyPage,
			int globalPageId[],	
				//user provides an array to hold Ids
			lsn_t	recovery_LSN[]	
				//user provides the space 
			);
		//(we don't assume that recovery manager has
		//to know the size of the buffer pool)
		//and an array of integers in the user
		//provided integer buffer of some maximum
		//size of buffer pool size.

	void 	last_lsn_flushed(lsn_t last_lsn);
		// To be called by log manager

    	void 	printSelf(void);
};

INTERNAL DESIGN

Buffer manager is implemented by the BufMgr class. Besides the lock machanisms, private members of the BufMgr class include an array of Page objects ,which is the real buffer pool, a BufHashTbl object, which is in charge of book-keeping for each frame and hashing search, a REPLACER(Clock) object enforcing the clock algorithm and finally a lsn_t object to record the latest log tail that's written to the disk. We discuss all major internal designs as follows:

ACCOMPLISHMENTS

We had to rewrite the buffer BufMgr class although new code inherit some basic data structure names and functionality. First of all, old code does not have independent replacement class, so we had to design an totally isolated and extensible relacement class. Second, we decided to try to avoid dynamic memory allocation inside the buffer manager because we did not expect to be able to dynamically allocate memory in the shared mode without any extra trouble. Third, multi-user buffer manager class is much more complicated than the single user version, besides our overall architecture changed a lot from the original single user minirel, therefore those three hundred lines of old buffer manager is not very attractive to us. However, old version of buffer manager did serve as a good example about how the interfaces look like and how buffer manager is structured. We successfully combined codes from space manager and os group and new error protocol and wrote a comprehensive test.C driver to test multi-user code.

TESTING

We wrote a driver named test.C with independent friendly user interface to test every public interfaces of our buffer manager code. For example, we intentionally created some break point in our code to check the correctness of the concurrency control. To test the case of multiple users are accessing the same page that is not in the buffer pool, we set a stop point inside one of the critical sections. Tests proved that our implementation can correctly handle the case of multiple reading of a single page from the disk by different users. Unfortunately, however, we failed to prectect the consistency of replaced page during a particular race situation. Therefore future enhancement has to be made.

RETROSPECTIVE

In the multi-user version of minirel, the most important problem for buffer manager is to ensure mutual exclusion and synchronization at least price of concurrency. In retrospect, we did much work to make it functional in multi-user version, but we also made some mistakes.

Because disk I/O is very slow, we don't want to block all other processes when one process is doing disk read or write. This means when one process wants to do disk I/O, it will release the lock on the buffer manager, and let other processes call any functions in buffer manager, thus synchronization problem is a big issue in this case.

There are many problems related to disk I/O. One is read/read problem, i.e. if one process wants a page which is being read from the disk by other process, it has to be blocked until the disk read is completed. We solved this problem by adding a semaphore to each descriptor and let processes wait on this semaphore when disk read is in progress. At the same time, lock on buffer manager is released to improve concurrency. A potential problem to this approach is that maybe too many semaphores are needed in the case that there are many many page frames in the buffer pool, because we need one semaphore for each page frame. Now we are going to change to another approach in which we associate a waiting list to each page frame and put each process to sleep on its own semaphore. In this approach, less semaphores are needed. Actually, only two or three semaphores are needed in the buffer manager class.

Another problem related to disk I/O is more subtle than the previous one. And we overlooked it until Professor Carey pointed it out us. The problem is, when a process decides to do a disk read due to a miss in the pool, Let's first describe the situation here. Every time when a victim frame is picked up, if the frame contains a dirty page, the victim page has to be written to disk and the new page needed is read from disk to the frame. Here both disk write and disk read are needed, and both will take a long time thus both need to be outside the buffer manager critical section. Both operations could be delayed due to some reason, at the same time other process could get into buffer manager asking for exactly the victim page which has not yet been written into disk and will get a miss in the buffer hash table and read the page from disk , thus get an older version, leaving the database in a inconsistent state.

We are going to solve this problem. Our solution is to associate the buffer manager a write-out list and semaphore to control the write list. We will put all scheduled write-out page id in this list. If a miss happens when looking up a page id, the process will first grab a victim frame, second check the write-out list, if the page is still in the list, copy it into the frame, otherwise read it from disk as usual. The write-out list is updated just before the new page is read into the victim frame. Here we just outlined the solution. We are still working on it.

APPENDIX