Advanced Systems Topics Explanations

During my time supervising the University of Cambridge Computer Laboratory Part II Advanced Systems Topics course, various questions and confusions arose. Here are in-depth explanations of the issues raised. To the best of my knowledge they are correct, but if in doubt, ask the lecturer.

Capability Systems

A capability-based system allows the "principle of least privilege" to be adhered to in a system: instead of a highly privileged kernel with less privileged applications, each application is simply given exactly those privileges that it requires, and no more. This is very different from the hierarchical systems that we are used to, where, e.g. a process run by root has all of root's privileges. Now we can give an application access to, say, the hard disk at block level, but not the TLB.

A capability consists of a segment descriptor, indicating the base address and the limit (i.e. how large the segment is). It also contains an access code, specifying what type of access the capability allows to that segment (read-only, read-write, etc.). Possession of a capability allows the described access.
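
As a concrete illustration, a capability might be represented like this. This is a minimal sketch: the field names and widths are assumptions, not the CAP's actual layout.

    #include <stdint.h>

    /* Sketch only: not the real CAP bit layout. */
    enum access_bits {
        CAP_READ    = 1 << 0,
        CAP_WRITE   = 1 << 1,
        CAP_EXECUTE = 1 << 2,
    };

    struct capability {
        uint32_t base;    /* segment base address */
        uint32_t limit;   /* segment size */
        uint32_t access;  /* bitmask of access_bits: the access code */
    };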

The CAP is a hardware-level capability system. There are both data (D) registers and capability (C) registers. The system enforces the restriction that no transfer is possible from D to C registers, i.e. a process cannot simply manufacture a capability. Only a component of the operating system holds the capabilities necessary to create capabilities in D registers and transfer them to C registers (by analogy, the Royal Mint is the only body legally allowed to print bank notes).

When a process enters a new procedure, it uses the ENTER instruction, which takes as arguments the address of the code to execute, and a list of the registers containing the capabilities that should be given to the procedure. Note that each register may contain a capability that allows access to a segment containing further capabilities in memory. The usage of the ENTER instruction ensures that an application may only gain certain privileges when executing particular code (the procedure that has been called). The ENTER instruction loads a capability for another capability segment, which contains as its first capability one to allow access to the code of the procedure being called, and the remainder being the capabilities that the procedure is to be allowed. When the procedure is entered, the parent process' capabilities are saved into a process resource list. On execution of a RETURN instruction, the parent's capabilities are restored.
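
Very roughly, the semantics might be sketched like this. Every name here is invented, the real CAP does all of this in hardware/microcode, and the capability struct is the one sketched earlier:

    #include <string.h>

    #define N_CREGS   8    /* number of capability registers (assumed) */
    #define MAX_DEPTH 32   /* maximum call depth (assumed) */

    static struct capability c_regs[N_CREGS];          /* the C registers */
    static struct capability prl[MAX_DEPTH][N_CREGS];  /* process resource list */
    static int depth;

    /* Stubs: load the callee's capability segment into the C registers
       (its first entry covers the callee's code, the rest are its rights),
       and transfer control to the code a capability describes. */
    extern void load_capability_segment(const struct capability *seg,
                                        struct capability *regs);
    extern void jump_to(const struct capability *code);

    void enter(const struct capability *callee_caps) {
        memcpy(prl[depth++], c_regs, sizeof c_regs);  /* save parent's caps */
        load_capability_segment(callee_caps, c_regs);
        jump_to(&c_regs[0]);                          /* c_regs[0]: callee code */
    }

    void return_(void) {
        memcpy(c_regs, prl[--depth], sizeof c_regs);  /* restore parent's caps */
    }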

Capabilities may be refined by applications, using an assembly instruction which only allows a capability to become less privileged, never more. This is useful for giving called procedures fewer privileges.
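
A refinement check over the struct sketched above might look like this:

    #include <stdbool.h>

    /* A refined capability must cover a sub-range of its parent's segment
       and a subset of its parent's access bits: privilege can only shrink. */
    bool valid_refinement(const struct capability *parent,
                          const struct capability *child) {
        if (child->base < parent->base ||
            child->base + child->limit > parent->base + parent->limit)
            return false;                    /* would widen the segment */
        if (child->access & ~parent->access)
            return false;                    /* would add privileges */
        return true;
    }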

Capabilities in the CAP are relative. This means that instead of specifying an absolute base address, a capability may instead specify its base relative to another capability, as well as privileges relative to that "parent" capability. The result is that all capabilities are in fact relative to the OS's store of privileges (the Master Resource List). This makes revocation of privileges much easier, as only one entry needs to be changed at OS level for all the "descendants" to be modified. When using relative capabilities, we must now evaluate them on use, i.e. follow their references back up the tree to the absolute capability from which they are descended. Such evaluation takes time, and therefore the CAP retains a cache of evaluated capabilities.
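
Evaluation might be sketched as below, assuming each relative capability stores an offset within its parent (again, invented names; a real implementation would consult the cache of evaluated capabilities before walking the chain):

    struct rel_capability {
        const struct rel_capability *parent;  /* NULL at the Master Resource List */
        uint32_t base;                        /* offset within the parent */
        uint32_t limit;
        uint32_t access;                      /* privileges relative to the parent */
    };

    /* Follow the parent chain up to the absolute root, accumulating base
       offsets and intersecting privileges.  (Limit checks elided.) */
    struct capability evaluate(const struct rel_capability *c) {
        struct capability abs = { 0, c->limit, ~0u };
        for (; c; c = c->parent) {
            abs.base   += c->base;
            abs.access &= c->access;          /* never exceed an ancestor */
        }
        return abs;
    }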

I/O in the CAP is handled by designating small segments of memory as corresponding to the block devices. If a process has the capability for the particular area of memory, it is deemed to have that capability for the block device. The CAP has a capability-based file system which is described in Wilkes and Needham's book on the subject.

Unfortunately the CAP was before its time: the hardware was complex and expensive, and overall the system was somewhat slow. Now EROS is using many of the same ideas to create a viable, modern, capability-based OS.

EROS stores capabilities in what it terms cpages. These are accessible only to the kernel, not to user processes. The capability, in addition to specifying the limits of a segment and the access code, specifies the type of the object that is stored in that segment. Sixteen cpages are stored in each cgroup, and cgroups are arranged in hierarchical trees, from the domain root downwards. The cgroups furthest from the root then contain cpages which point to actual data pages (rather than further cgroups). The tree of cgroups is of variable height. When walking down the tree, the most restrictive privilege encountered on the way to the data page applies.
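
A sketch of that walk follows; the structure names are guesses at the flavour of the design, not EROS's actual types:

    #include <stdint.h>

    #define CGROUP_FANOUT 16   /* 16 cpages per cgroup */

    struct cgroup_entry {
        uint32_t access;       /* access code granted at this level */
        int      is_data;      /* 1 if this cpage points at a data page */
        union {
            struct cgroup *group;   /* further cgroups... */
            void          *page;    /* ...or an actual data page */
        } u;
    };

    struct cgroup { struct cgroup_entry slot[CGROUP_FANOUT]; };

    /* Walk from the domain root down to a data page, intersecting the
       privileges met on the way: the most restrictive one applies. */
    void *walk(const struct cgroup *g, const int *path, uint32_t *access) {
        *access = ~0u;
        for (;;) {
            const struct cgroup_entry *e = &g->slot[*path++];
            *access &= e->access;
            if (e->is_data) return e->u.page;
            g = e->u.group;
        }
    }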

Objects in EROS are accessed by referencing a capability register, which in turn points to the object and passes messages to/from it. Memory is simply a cache of persistent storage, i.e. all storage is persistent. This is achieved by regular checkpointing of the entire system state (including process state). The system is made fast by means of hand-coded kernel operations (similar to the L4 microkernel), and by caching a representation of the segment tree in the page table (i.e. we no longer have to walk the tree, as we simply associate the evaluated capabilities with each page table entry [? Not certain about this!]).

B+ Trees

A basic B-tree of order d is made of nodes which contain at most 2d keys and at most 2d+1 pointers. Each node (the root excepted) must have at least d keys and d+1 pointers. Hence each node is at least half full. The methods of insertion and deletion used ensure that the tree remains balanced. Hence the longest path in a B-tree of n keys is (approximately) at most log_d(n) nodes, d being the order of the tree. Leaves are all at the same depth in the tree. Note that at any node, each key is stored with its associated data value; leaves are therefore simply nodes without any forward pointers to other nodes. The reason for requiring a balanced tree was that access to each separate node was likely to require a transfer from secondary storage. Today this is still the case when we are indexing secondary storage (the file system) itself.
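
As a sketch, such a node and its search loop might look like this in C (with a toy order):

    #define D 2    /* the order d; 2 is only for illustration */

    struct btree_node {
        int   nkeys;                         /* d <= nkeys <= 2d (root excepted) */
        long  keys[2 * D];                   /* at most 2d keys... */
        void *values[2 * D];                 /* ...each stored with its value */
        struct btree_node *child[2 * D + 1]; /* at most 2d+1 pointers; NULL in leaves */
    };

    /* Search: at most ~log_d(n) node visits, one disk transfer each. */
    void *btree_search(const struct btree_node *n, long key) {
        while (n) {
            int i = 0;
            while (i < n->nkeys && key > n->keys[i]) i++;
            if (i < n->nkeys && key == n->keys[i]) return n->values[i];
            n = n->child[i];                 /* NULL in a leaf: not found */
        }
        return NULL;
    }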

B+ Trees are a variant of B-trees, with several important differences. In particular, no key/value pairs are stored in the internal nodes. Instead, those nodes simply provide a "roadmap" for the rapid location of the key/value pairs in the leaves. The usage of this roadmap means that when a key is deleted, it does not need to be removed from the roadmap, only from the relevant leaf. Another significant difference is the presence of pointers from leaf to leaf, allowing easy sequential access. This is useful when considering indexing a file system, as one may perform a random access using the roadmap, or a sequential access using the leaf-to-leaf pointers. An interesting optimisation is to map each node onto one page of memory, which means the page manager (assuming an LRU replacement policy) will keep the often-used root node and its immediate children resident rather than causing them to be fetched from disk on each use.
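
A sketch of the leaf level and a range scan over the leaf-to-leaf pointers (continuing the toy order above; assume the roadmap has already located the leaf for the lower bound):

    struct bplus_leaf {
        int   nkeys;
        long  keys[2 * D];
        void *values[2 * D];
        struct bplus_leaf *next;   /* leaf-to-leaf pointer: sequential access */
    };

    /* Visit every key/value pair in [lo, hi], given the leaf the roadmap
       located for key lo.  No further roadmap lookups are needed. */
    void range_scan(const struct bplus_leaf *leaf, long lo, long hi,
                    void (*visit)(long key, void *value)) {
        for (; leaf; leaf = leaf->next)
            for (int i = 0; i < leaf->nkeys; i++) {
                if (leaf->keys[i] < lo) continue;
                if (leaf->keys[i] > hi) return;
                visit(leaf->keys[i], leaf->values[i]);
            }
    }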

Comer's paper on B-trees is an excellent read -- it's not complicated and is very comprehensive.
Take a look at this explanation and Perl code for B-tree traversal if you want some pictorial explanations.

Sparse & Dense Indexes

Perhaps the easiest illustration of indexes is to consider databases as a motivating example. A primary key is one by which the data is ordered in the file, and each record has a unique primary key. A secondary key may also be unique per record, but the data is not sorted by it, i.e. records with the same value for the secondary key field may be distributed throughout the database.

A dense index is one where we have an entry in our (primary key) index for every record, i.e. there is a pointer straight to each record from the index. Conversely, a sparse index contains only some of the possible values of the primary key. When searching for a primary key value that is not in the sparse index, we simply find the value just smaller than the one we are searching for, follow its pointer into the database, and then scan through the records until we find the one with the requested primary key. Sparse indexes are therefore slower, but more space efficient (and hence more of the index fits in memory), and require less upkeep (as on insertion or deletion of a record we do not necessarily have to update the index).
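
The lookup just described might be sketched like this, with invented types, assuming the file is sorted by primary key:

    #include <stddef.h>

    struct index_entry { long key; size_t recno; };  /* key -> first record */
    struct record      { long key; /* ... payload ... */ };

    struct record *sparse_lookup(const struct index_entry *idx, size_t nidx,
                                 struct record *rec, size_t nrec, long key) {
        /* binary-search for the largest index key <= the target... */
        size_t lo = 0, hi = nidx;
        while (lo < hi) {
            size_t mid = (lo + hi) / 2;
            if (idx[mid].key <= key) lo = mid + 1; else hi = mid;
        }
        if (lo == 0) return NULL;            /* key below the whole index */
        /* ...then scan forward through the (sorted) records */
        for (size_t r = idx[lo - 1].recno; r < nrec && rec[r].key <= key; r++)
            if (rec[r].key == key) return &rec[r];
        return NULL;
    }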

A multi-level sparse index simply makes the index more of a hierarchy, i.e. now the entries in the top-level sparse index point to ranges in the lower-level sparse index, which in turn contains entries that point to ranges of actual records (or rather, point to individual records, but from those individual records we must search a range of records to actually find the record with the key we are searching for). This is somewhat similar to the idea of multi-level page tables, except that the second level table is dense.

For secondary indexes, we require dense rather than sparse indexes. The reason for this is that records are not stored in the file in order of secondary key. Hence, we must now index every secondary key, as we have no guarantee that a key with value just less or greater than the current one is to be found anywhere nearby in the database.

See also Osmar Zaiane's slides on indexes.

RVM

Recoverable Virtual Memory "refers to a region of virtual memory on which transactional semantics apply" (SMH in AST notes). We consider only atomicity (i.e. an action either commits totally or does nothing at all) and durability (i.e. the result of a committed action survives a crash). Isolation and consistency are not considered, as the expense is too great: these properties can be built on top of RVM if so desired. To achieve this, for each transaction we take a copy of the original data before any operation is performed on it, and store it in an undo log in memory; this allows user-initiated undos to take place, and ensures atomicity. When the transaction commits, the updates are written to a (sequential) redo log on disk. This ensures durability. When the redo log on disk becomes full, it is read sequentially and the updates are applied to the data on disk. Usage of a redo log gives the transactions themselves fast disk writes, as they write sequentially to the log rather than performing random accesses to the disk when writing directly to the database. One disadvantage of RVM is the existence of these three copies, and especially the time taken to write the data to each of them in turn.
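
The shape of this might be sketched as follows; everything here (the names, the fixed-size undo log, the helper functions) is invented for illustration, not RVM's real interface:

    #include <stdlib.h>
    #include <string.h>

    #define MAX_RANGES 64

    struct range { void *addr; size_t len; void *old; };
    struct tx    { struct range undo[MAX_RANGES]; int n; };  /* in-memory undo log */

    extern void redo_log_append(const void *data, size_t len);  /* sequential write */
    extern void redo_log_force(void);                           /* one flush */

    void tx_begin(struct tx *t) { t->n = 0; }

    /* Declare a region about to be modified: save the old bytes first. */
    void tx_set_range(struct tx *t, void *addr, size_t len) {
        struct range *r = &t->undo[t->n++];
        r->addr = addr; r->len = len;
        r->old = malloc(len);
        memcpy(r->old, addr, len);
    }

    void tx_abort(struct tx *t) {           /* atomicity: undo, newest first */
        while (t->n > 0) {
            struct range *r = &t->undo[--t->n];
            memcpy(r->addr, r->old, r->len);
            free(r->old);
        }
    }

    void tx_commit(struct tx *t) {          /* durability: sequential log write */
        for (int i = 0; i < t->n; i++) {
            redo_log_append(t->undo[i].addr, t->undo[i].len);
            free(t->undo[i].old);
        }
        redo_log_force();
        t->n = 0;
    }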

Rio Vista & NVRAM

Rio Vista is a system designed to make RVM faster. The main problems with RVM are that it keeps up to three copies of the data and employs expensive synchronous disk writes. Rio mitigates this by using an area of memory, backed by a UPS (i.e. it is non-volatile RAM, though I suspect one could use a technology such as flash -- is this fast enough?), that acts as a file cache. This sits between memory and the disk, caching both reads and writes. If the cache fills up, data is flushed to disk. By flushing lazily, only as and when required, the system eliminates the common-case cost of disk writes. If the machine crashes, the dirty data is simply flushed to disk from the NVRAM on restart.

Rio intercepts mmap() calls, so that instead of mapping an area of disk into memory, an area of the NVRAM file cache is mapped instead. This gives large performance gains for files that fit in the NVRAM cache, but with large files, e.g. databases, the system suffers from thrashing (in the way that all caches do).

Vista is an RVM library built to utilise the Rio NVRAM file cache. Because it can assume that the NVRAM will survive a crash, it does not need a redo log, and hence decreases the number of copies of the data that must be kept, producing significant speed-ups.

Big Reader Locks

When using an MRSW lock (otherwise known as a read/write spin lock), multiple processes on multiple CPUs access the same lock data structure to update a common read counter. While this allows multiple simultaneous reader threads, it has the problem that updating the lock data structure requires the CPU on which the thread concerned is running to obtain exclusive access to the cache line containing the lock structure. Hence, to increment the reader count, a CPU first has to get a copy of the cache line, update it, and then invalidate all other copies, whereupon all other CPUs have to fetch new copies. Clearly this isn't particularly efficient when all that the threads do the majority of the time is read.

To attempt to solve this problem, big reader locks, otherwise known as big reader read/write spin locks, split the lock data structure so that the read portion is spread over all the CPUs, whilst the write portion remains as it was: a single shared value. This allows each CPU to update its own read flag in its hardware cache, without needing to snoop on or invalidate other CPUs' read flags. An obvious implementation detail is that the data structure should ensure that the read flags for each CPU really are in different cache lines!

Acquiring a read lock involves setting the relevant lock flag corresponding to the CPU concerned in the __brlock_array structure, and then waiting until the single spin lock that is used for writing is open (i.e. waiting until no one is writing before beginning to read).

Acquiring the write lock involves acquiring the single write spin lock, and then checking that all lock flags in __brlock_array corresponding to the big reader lock are cleared. If they are not, the writer releases the write spin lock and starts again. The reasoning behind this is probably that we wish read threads to have priority over write ones. Were we to simply hold the write spin lock whilst waiting for the remaining readers to finish, we would block any new ones from starting (you could argue this is what you would want in a normal MRSW lock). In any case, a write may only take place when there are no readers, which may take a significant amount of time (not least because the CPU attempting to take out the write lock must examine the spin lock flag that each of the other CPUs owns, which involves transferring cache lines between CPUs).
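
Putting the two paragraphs above together, here is a sketch in C. It follows the description here rather than Linux's actual __brlock_array code, and memory barriers are omitted for brevity:

    #define NCPUS      8
    #define CACHE_LINE 64

    struct brlock {
        struct {
            volatile int locked;
            char pad[CACHE_LINE - sizeof(int)];  /* one cache line per CPU */
        } reader[NCPUS];
        volatile int writer;                     /* the single write spin lock */
    };

    static void spin_lock(volatile int *l)   { while (__sync_lock_test_and_set(l, 1)) ; }
    static void spin_unlock(volatile int *l) { __sync_lock_release(l); }

    void br_read_lock(struct brlock *b, int cpu) {
        b->reader[cpu].locked = 1;   /* touches only this CPU's cache line */
        while (b->writer) ;          /* wait until no one is writing */
    }                                /* (real code needs a barrier between the two) */

    void br_read_unlock(struct brlock *b, int cpu) {
        b->reader[cpu].locked = 0;
    }

    void br_write_lock(struct brlock *b) {
        for (;;) {
            spin_lock(&b->writer);
            int busy = 0;            /* must pull in every CPU's cache line */
            for (int c = 0; c < NCPUS; c++)
                busy |= b->reader[c].locked;
            if (!busy) return;       /* no readers: write lock held */
            spin_unlock(&b->writer); /* back off so readers can progress */
        }
    }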

Big reader locks are rarely used. On the Intel architecture, Linux uses only a single big reader lock, in the networking code.

xFS

xFS stripes its log out to the relevant group of machines. The idea is that each machine coalesces writes privately, and then, instead of flushing the log to its own disk, flushes it to the group of remote disks (think of RAID made much more distributed). This way we avoid having to transfer entire new blocks to all of the other machines. xFS uses managers, which are responsible for file metadata (any machine can be a manager), plus stripe groups over which each file is actually distributed (i.e. not every file is striped over every server).

Exokernel

Exokernel attempts to expose as many of the low-level properties of the machine it is running on as possible. The idea is then to allow applications to provide their own custom buffer cache replacement policies, page management, and so on. Most applications will just want the standard OS-provided low-level functions, so these "system call" functions are provided in the form of user-mode libraries, known as libOSes. An application programmer can then ship their own libOS with their application, with different low-level handlers for a few particular OS functions, optimised to that application. Exokernel enforces security and protection at the lowest levels, i.e. below the libOS level.

For disks, this is done by allowing applications to provide libFS libraries, which allow custom ways of storing files (e.g. the metadata layout). There needs to be some translation from this custom representation to one which the underlying exokernel understands. This is done using structures known as templates (which, once written, are immutable). A template contains three Untrusted Deterministic Functions (UDFs), the important one being owns-udf(). A UDF takes as input the custom metadata, m, and translates it into the set of blocks that m points to. Exokernel can then check that the process concerned has access to those blocks, in accordance with its low-level security policy. UDFs are written by the same people who write the libFS library (obviously: they design the metadata layout, so they also supply the translation function). The UDF language is a RISC-like one which is guaranteed to be deterministic, i.e. it will always give as its output a set of blocks.

For example, supposing a libFS wishes to allocate a disk block b by placing a pointer to it in some metadata structure, m. We first run owns-udf(m) to check that that process has access to the blocks concerned. Then we make a copy of m, (call it n), and adjust it to add the pointer to the new block b. We then run owns-udf(n) and see whether it gives the same set of blocks that owns-udf(m) did, plus the new block b. If so, then we allow the operation to proceed (i.e. we replace m with the new metadata n).
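
That check can be sketched in C. In the real system owns-udf() is RISC-like bytecode interpreted by the kernel; here it is a plain function pointer, and a block set is a 64-bit bitmask purely for brevity:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t blockset_t;   /* bit i set = "owns disk block i" */
    typedef blockset_t (*owns_udf_t)(const void *metadata);

    /* May metadata m be replaced by n, which claims to add just block b? */
    bool allow_alloc(owns_udf_t owns_udf, const void *m, const void *n, int b) {
        blockset_t before = owns_udf(m);   /* the kernel also checks the caller
                                              may access these blocks */
        blockset_t after  = owns_udf(n);
        /* legal iff the new set is exactly the old set plus block b */
        return after == (before | ((blockset_t)1 << b));
    }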

This allows block-level granularity of file sharing. The paper claims that for web applications their custom web server (Cheetah) performs 9 times better than the best Unix one. There are various reasons why, including storing precomputed packet checksums on disk.

Texas Pointer Swizzling

Texas uses 64-bit pointers to index a single, persistent virtual address space, as follows. Pointers held on disk are full 64-bit persistent addresses, while pointers the running program sees are ordinary hardware-supported virtual addresses. When a page is first brought into memory, every persistent pointer it contains is converted ("swizzled") into a virtual address, and each page those pointers refer to is reserved in the virtual address space and access-protected. When the application first touches one of those protected pages, the resulting fault causes it to be loaded and swizzled in turn, with the pages it references being reserved.

In this way Texas stays "one step ahead" of the application, which never sees any 64-bit pointers.
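
Here is a sketch of the fault-time swizzling loop using Unix memory protection. The helper functions are hypothetical, and the real Texas machinery (pointer location maps, reserved-page bookkeeping) is elided:

    #include <signal.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096

    /* Hypothetical helpers. */
    extern uint64_t persistent_id_of(void *page);
    extern void     load_page(uint64_t pid, void *dst);
    extern int      pointer_offsets(uint64_t pid, unsigned offs[], int max);
    extern void    *reserve_protected_frame(uint64_t persistent_ptr);

    static void on_fault(int sig, siginfo_t *si, void *ctx) {
        void *page = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(PAGE_SIZE - 1));
        mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);
        uint64_t pid = persistent_id_of(page);
        load_page(pid, page);

        /* Swizzle every persistent pointer in the page: each becomes the
           address of a reserved, access-protected frame, so touching it
           later re-enters this handler.  The swizzled frontier thus stays
           one page ahead of the application. */
        unsigned offs[128];
        int n = pointer_offsets(pid, offs, 128);
        for (int i = 0; i < n; i++) {
            uint64_t *slot = (uint64_t *)((char *)page + offs[i]);
            *slot = (uintptr_t)reserve_protected_frame(*slot);
        }
        (void)sig; (void)ctx;
    }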