the necessary information to proceed with the deallocation. In a 32-bit address space, 1 MB is enough for the BIBOP table (768 KB in Linux, since 25% of the virtual address space is reserved for kernel memory). In 64-bit address spaces, multilevel trees or tries can be used instead [10], to encode information only for the segments of the address space that are actually used by the application. We are currently investigating these options in an ongoing effort to port Streamflow to a 64-bit system. The BIBOP technique allows the elimination of headers for small objects without introducing artificial segmentation of the virtual address space. The elimination of headers allows Streamflow to better exploit spatial locality opportunities. It also facilitates the support of arbitrarily small objects. In the current implementation the minimum object granularity is 4 bytes.

Object allocation: When a memory request is received, Streamflow directs it to the appropriate object size class in the local heap of the thread that initiated the request. In the common case, the first page block in the list of that size class has available objects. There are two categories of available objects. Those that have already gone through one or more allocation/deallocation cycles populate the freed LIFO list and are preferred for subsequent allocations. This design decision, combined with the LIFO organization of the list, favors temporal locality, since recently deallocated objects are reused as soon as possible. If the freed list is empty, Streamflow allocates one of the objects of the page block that have never been allocated before. The beginning of the memory area that accommodates such objects is pointed to by unallocated, a bump-pointer that is advanced by one object on each such allocation.
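The fast path described above involves no atomic instructions. A minimal sketch follows; the structure layout and field names (freed, unallocated, end, obj_size) are illustrative assumptions, not Streamflow's actual definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified page block descriptor. */
typedef struct pageblock {
    void  *freed;        /* LIFO of locally freed objects              */
    char  *unallocated;  /* bump-pointer into never-allocated space    */
    char  *end;          /* end of the page block's object area        */
    size_t obj_size;     /* object size of this size class             */
} pageblock_t;

/* Fast-path allocation: prefer recycled objects, then bump-allocate. */
static void *pb_alloc(pageblock_t *pb)
{
    if (pb->freed) {                 /* 1. recently freed objects first  */
        void *obj = pb->freed;
        pb->freed = *(void **)obj;   /* pop LIFO: link stored in object  */
        return obj;
    }
    if (pb->unallocated + pb->obj_size <= pb->end) {
        void *obj = pb->unallocated; /* 2. never-allocated objects       */
        pb->unallocated += pb->obj_size;
        return obj;
    }
    return NULL;                     /* 3. full: caller rotates the block */
}
```

The LIFO pop in step 1 is what realizes the temporal-locality preference: the most recently deallocated object is always the next one handed out.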

Object deallocation: Object deallocations are usually initiated by the same thread that allocated the object. If this is the case, the object is simply inserted, without any synchronization, into the freed LIFO of the parent page block from which it originated. If the page block becomes empty after the deallocation, it is handled by the page block caching policy, which is described later on.
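The local free path can be sketched together with the BIBOP lookup that maps a headerless object to its parent page block. The parameters below (toy page size, small table, offset-based addressing) are assumptions made to keep the sketch self-contained; the real table covers the whole 32-bit address space:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12                 /* 4 KB pages; illustrative          */
#define NPAGES     1024               /* pages covered by this toy table   */

typedef struct pageblock {
    void *freed;                      /* local freed LIFO (owner-only)     */
} pageblock_t;

static pageblock_t *bibop[NPAGES];    /* page number -> owning page block  */
static uintptr_t    region_base;      /* base address the table covers     */

/* Register every page spanned by a page block in the BIBOP table. */
static void bibop_register(pageblock_t *pb, void *start, size_t bytes)
{
    uintptr_t first = ((uintptr_t)start - region_base) >> PAGE_SHIFT;
    uintptr_t last  = ((uintptr_t)start - region_base + bytes - 1) >> PAGE_SHIFT;
    for (uintptr_t p = first; p <= last; p++)
        bibop[p] = pb;
}

/* Local free: the object's address alone identifies its page block,
   so the push onto the freed LIFO needs no synchronization. */
static void pb_free_local(void *obj)
{
    pageblock_t *pb = bibop[((uintptr_t)obj - region_base) >> PAGE_SHIFT];
    *(void **)obj = pb->freed;        /* link stored inside the object      */
    pb->freed = obj;
}
```

This is why headers can be eliminated: the per-page table entry replaces the per-object metadata that a conventional allocator would prepend to each object.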

Remote object deallocations are deallocations of an object by a thread other than the one that allocated it. Remote deallocations need to be treated differently, since only the owner-thread of each page block can modify the freed LIFO. In this case, the object is inserted into the remotely freed LIFO list of the parent page block. The insertion into the list is performed via a 64-bit atomic cmp&swap operation which simultaneously updates the remotely freed LIFO head and checks the owner identifier (id) of the parent page block, to ensure that the page block is actually owned by a thread³. Objects inserted into the remotely freed LIFO will eventually be transferred to the freed LIFO by the owner-thread of the page block. The decoupling of local and remote operations is a key design point which drastically improves the latency and scalability of Streamflow by eliminating atomic instructions from the critical path of the most frequent operations. Furthermore, Streamflow uses the minimum number of atomic instructions required for thread-safe remote object deallocations.
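The remote free can be sketched with C11 atomics, assuming the 64-bit packing of the remotely freed head and the owner id described in footnote 3. Representing objects as 32-bit offsets (so the head fits in 32 bits on any test machine) is an assumption of this sketch:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative packing: high 32 bits hold the remotely freed LIFO head
   (an object offset), low 32 bits hold the owner id; 0 means orphaned. */
#define PACK(head, id) (((uint64_t)(uint32_t)(head) << 32) | (uint32_t)(id))
#define HEAD(w)        ((uint32_t)((w) >> 32))
#define ID(w)          ((uint32_t)(w))

typedef struct pageblock {
    _Atomic uint64_t remote;   /* {remotely freed head, owner id} */
} pageblock_t;

/* Push one object on the remotely freed LIFO with a single 64-bit CAS.
   Returns 0 if the page block turns out to be orphaned (id == 0), in
   which case the caller must take the adoption path instead. */
static int remote_free(pageblock_t *pb, uint32_t obj, uint32_t *next_field)
{
    uint64_t old = atomic_load(&pb->remote);
    do {
        if (ID(old) == 0)
            return 0;                 /* no owner: adopt, don't push      */
        *next_field = HEAD(old);      /* link object to the current head  */
    } while (!atomic_compare_exchange_weak(&pb->remote, &old,
                                           PACK(obj, ID(old))));
    return 1;
}
```

Because the head and the id live in one word, a single cmp&swap both performs the push and revalidates ownership, which is how the operation stays at one atomic instruction.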

When a memory request cannot be served by the page block at the head of the appropriate object size class because the page block is full, the owner-thread checks the remotely freed LIFO for objects freed earlier to the page block by remote threads. If such objects exist, they are all removed with a single atomic cmp&swap operation and transferred to the freed list. The memory request then proceeds exactly as the common-case memory request described earlier. The lazy reclamation policy for remotely freed objects, combined with the page block rotation strategy, guarantees that remotely freed memory objects will eventually be reused. However,

3 Given that id and remotely freed need to be updated by a single 64-bit atomic operation, they are always placed in 64 consecutive bits in the page block header.

their reuse will be delayed until it is absolutely necessary: when the parent page block runs out of free memory. This strategy minimizes the number of atomic operations required for accessing the remotely freed list. If, however, the page block at the head of the object size class is full and its remotely freed list does not contain any objects, the page block is rotated to the end of the list and a new page block is fetched from the cache or requested from the page manager.

Thread termination: Whenever a thread terminates, Streamflow ensures that the free memory of partially free or locally cached page blocks in its heap is made available to other threads. Empty and partially full page blocks are handled by the caching policy described below. If a page block appears to be full, its remotely freed list is checked for remotely freed objects. If the list is not empty, the objects are removed, with a single atomic cmp&swap operation, and transferred to the freed list; the page block is then managed by the caching policy as a completely or partially free block. If the list is empty, the thread declares the page block “orphaned,” by setting the id of its owner to NULL. Any orphaned page block can be “adopted” and attached to the heap of the first thread that, upon deallocating an object originating from it, observes that the page block is orphaned.
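The decision taken for a full page block at termination, reclaim the remote objects in one batch or orphan the block, can be sketched as follows, again assuming the illustrative 64-bit packing of the remotely freed head and owner id from footnote 3:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define PACK(head, id) (((uint64_t)(uint32_t)(head) << 32) | (uint32_t)(id))
#define HEAD(w)        ((uint32_t)((w) >> 32))
#define ID(w)          ((uint32_t)(w))

typedef struct pageblock { _Atomic uint64_t remote; } pageblock_t;

/* Returns the detached remote LIFO head (non-zero) if objects were
   reclaimed in one batch, or 0 after successfully orphaning the block. */
static uint32_t reclaim_or_orphan(pageblock_t *pb, uint32_t my_id)
{
    uint64_t old = atomic_load(&pb->remote);
    for (;;) {
        if (HEAD(old) != 0) {
            /* Batch reclaim: one CAS detaches the whole remote list. */
            if (atomic_compare_exchange_weak(&pb->remote, &old,
                                             PACK(0, my_id)))
                return HEAD(old);
        } else {
            /* Orphan: clear the id, but only while the list is still
               empty; a concurrent remote free forces a retry, and the
               newly freed objects are reclaimed instead. */
            if (atomic_compare_exchange_weak(&pb->remote, &old, PACK(0, 0)))
                return 0;
        }
    }
}
```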

The id is set to NULL with an atomic 64-bit cmp&swap operation, which simultaneously verifies that the remotely freed list remains empty. Should the instruction fail, one or more objects have been freed into the remotely freed LIFO after the last check, so the page block is no longer full. The atomic operation eliminates the possibility of declaring a page block orphaned after objects have been returned to its remotely freed list. The free memory of such a page block would never be reused, since no thread would ever have the opportunity to observe it as orphaned.

Page block caching: Page block caching is the boundary that separates the multithreaded memory allocator front-end from the page manager back-end. When the allocator needs a new page block, it first checks a thread-local cache, then the global caches, and if no cached page blocks of the correct size are found, it passes a request on to the page manager. The local caches are synchronization-free LIFO lists, and the global cache is a lock-free LIFO list. The caching layer is the last level at which Streamflow applies lock-free, non-blocking synchronization. Its purpose is to relieve strain on the page manager.
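The acquisition order can be sketched as below. The cache representations and the page-manager call are stand-ins for illustration, not Streamflow's API; in particular, the global cache pop is shown as a plain list operation where the real implementation uses a lock-free LIFO:

```c
#include <assert.h>
#include <stddef.h>

typedef struct pageblock { struct pageblock *next; } pageblock_t;

static pageblock_t *local_cache;    /* synchronization-free LIFO (per thread) */
static pageblock_t *global_cache;   /* lock-free LIFO in reality              */

/* Stub for the back-end: the real page manager would map fresh pages. */
static pageblock_t *page_manager_alloc(void) { return NULL; }

static pageblock_t *acquire_pageblock(void)
{
    pageblock_t *pb;
    if ((pb = local_cache)) {        /* 1. thread-local cache, no atomics */
        local_cache = pb->next;
        return pb;
    }
    if ((pb = global_cache)) {       /* 2. global cache (lock-free pop)   */
        global_cache = pb->next;
        return pb;
    }
    return page_manager_alloc();     /* 3. fall through to the back-end   */
}
```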

Page blocks in local caches are organized according to their size. Due to the minimum size, maximum size, and power-of-two size limitations for page blocks, multiple object classes use page blocks of the same size. Orphaned page blocks whose original owner thread has terminated are placed on a global list, which must preserve the page block’s object class, since there are still live objects allocated from the page block. Completely free page blocks can be placed on a global free cache upon thread termination, or when a thread releases a page block and the local cache is overpopulated. In order to maintain low virtual memory usage, our implementation constrains the population of the local and global caches to one and zero page blocks, respectively. Orphaned page blocks can always be stored in the global list of orphaned blocks, independent of the list’s population.
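The release side of this policy, with the default limits of one local and zero global cached blocks, can be sketched as follows; the counters and the back-end stand-in are illustrative assumptions:

```c
#include <assert.h>
#include <stddef.h>

typedef struct pageblock { struct pageblock *next; } pageblock_t;

#define LOCAL_CACHE_LIMIT  1    /* default: one block per local cache     */
#define GLOBAL_CACHE_LIMIT 0    /* default: bypass the global free cache  */

static pageblock_t *local_cache;
static int          local_count;
static pageblock_t *global_cache;
static int          global_count;
static int          returned_to_manager;  /* stands in for the back-end */

/* Release a completely free page block according to the cache limits. */
static void release_pageblock(pageblock_t *pb)
{
    if (local_count < LOCAL_CACHE_LIMIT) {
        pb->next = local_cache;           /* keep it, synchronization-free */
        local_cache = pb;
        local_count++;
    } else if (global_count < GLOBAL_CACHE_LIMIT) {
        pb->next = global_cache;          /* lock-free push in reality     */
        global_cache = pb;
        global_count++;
    } else {
        returned_to_manager++;            /* hand the pages back to the
                                             page manager                  */
    }
}
```

With these defaults the second completely free block a thread releases already flows back to the page manager, which is what keeps virtual memory usage low.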

Discussion: From the discussion so far it is clear that Streamflow performs the vast majority of memory allocation/deallocation operations without introducing synchronization. Synchronization between threads is only required in the infrequent cases of: i) remote object deallocations, ii) batch reclamation of remotely freed objects, iii) declaration of a page block as orphaned, iv) adoption of an orphaned page block, and v) page block returns to or requests from the page manager. Even in these cases, with the exception of