October 22 2016
by ScrimpyCat

Simple Lock-Free Epoch Memory Reclamation

Epoch-based memory reclamation, or EBR, is an approach that was outlined in this paper. The idea is that all you need in order to safely manage the delayed reclamation of memory used by your lock-free algorithms is to separate references into 3 groups. These groups are the epochs. At any given time, the oldest epoch that can be safely reclaimed is 2 epochs behind the most recent.

This means that when a thread enters a new epoch, it is placed 2 places ahead of the current global epoch, and proceeds to add any of its memory that needs reclaiming there. Then, once no threads are accessing the current global (stale) epoch because they have all moved 2 places ahead, that old memory can safely be reclaimed and the global epoch is moved one place further. This allows for the continued delayed reclamation of memory, while ensuring that only the memory which can potentially still be accessed is kept around.
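
As a rough illustration of how the 3 epoch slots (indexed modulo 3) relate to a global epoch G:

//Roles of the 3 epoch slots for a given global epoch G (indices taken mod 3):
//
// epoch G     - stale: no new threads enter it; reclaimable once unreferenced
// epoch G + 1 - may still be referenced by threads that entered it previously
// epoch G + 2 - current: new managed sections enter here and add their garbage
//
//Once both epoch G and epoch G + 1 are unreferenced, epoch G's memory can be
//freed and the global epoch advances to G + 1.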

The Approach

Unfortunately EBR seems to be one of the less frequently covered strategies for memory reclamation in lock-free algorithms, which has made it quite difficult to come across other implementations. However, in the implementations I was able to find, I noticed that when it came time to decide whether the stale epoch could be reclaimed or not, they typically iterated over all the threads registered with the system, which requires a lock. To address this, it occurred to me that you could instead retain references to each epoch that is currently in use. This does come at the expense of introducing additional CAS operations, but it allows the implementation to be lock-free and quite straightforward to understand.

Implementation

We will implement this technique using C11 (atomics and threads), though for simplicity we will not bother with overriding the memory order (all atomic operations will use the default sequentially consistent ordering).

To start off we will make our definitions.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdatomic.h>
#include <threads.h>

typedef uint64_t epoch_t;

typedef struct entry_t {
    struct entry_t *next;
    void *item;
    void (*reclaimer)(void*);
} entry_t;

typedef struct {
    entry_t *list;
    uint32_t refCount;
} managed_list_t;

typedef struct {
    entry_t *head;
    entry_t *tail;
    epoch_t epoch;
} local_t;

typedef struct {
    _Atomic(managed_list_t) managed[3];
    _Atomic(epoch_t) epoch;
    tss_t key;
} gc_t;

gc_t *Create(void);
void Destroy(gc_t *GC);
void Reclaim(gc_t *GC, entry_t *Node, epoch_t Epoch);
void Start(gc_t *GC);
void Manage(gc_t *GC, void *Item, void (*Reclaimer)(void*));
void Stop(gc_t *GC);

entry_t represents a reclaimable item that is managed by our system, managed_list_t is an epoch's global list of entries together with its reference count, local_t is our thread-local data, and gc_t is our garbage collector/EBR system reference.

The first functions we will implement are the creation and destruction functions for the system. These are relatively straightforward: one initialises the system with its initial state, the other cleans the system up and destroys it.

gc_t *Create(void)
{
    gc_t *GC = malloc(sizeof(gc_t));
    if (GC)
    {
        if (tss_create(&GC->key, free) != thrd_success)
        {
            free(GC);
            return NULL;
        }

        atomic_init(&GC->managed[0], (managed_list_t){ .list = NULL, .refCount = 0 });
        atomic_init(&GC->managed[1], (managed_list_t){ .list = NULL, .refCount = 0 });
        atomic_init(&GC->managed[2], (managed_list_t){ .list = NULL, .refCount = 0 });
        atomic_init(&GC->epoch, 0);
    }

    return GC;
}

void Destroy(gc_t *GC)
{
    tss_delete(GC->key);

    for (int Loop = 0; Loop < 3; Loop++)
    {
        managed_list_t Managed = atomic_load(&GC->managed[Loop]);
        Reclaim(GC, Managed.list, atomic_load(&GC->epoch));
    }

    free(GC);
}

Next will be the Reclaim function. This function will be used when a stale epoch is able to be reclaimed. It will reclaim the items that belong to that epoch, free our entry_t nodes, and increment the global epoch. An improvement that could be made is to use a pool-based allocator for our entry_t nodes, as this would alleviate some of the overhead of allocating/deallocating those nodes (a rough sketch of this follows the function below).

void Reclaim(gc_t *GC, entry_t *Node, epoch_t Epoch)
{
    while (Node)
    {
        Node->reclaimer(Node->item); //Reclaim entries

        entry_t *Temp = Node;
        Node = Node->next;
        free(Temp); //Free node
    }

    atomic_compare_exchange_strong(&GC->epoch, &Epoch, Epoch + 1); //Increment the global epoch only once for the given epoch (as multiple threads may reach this point, though only one will have a non-NULL Node)
}
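
As a sketch of that pool-based improvement, a thread-local freelist could be used. This conveniently sidesteps the ABA problem that a shared lock-free freelist would need to handle, since nodes simply migrate between the per-thread pools. Note that PoolAlloc and PoolFree are hypothetical names, not part of the implementation above.

static thread_local entry_t *Pool = NULL;

static entry_t *PoolAlloc(void)
{
    entry_t *Node = Pool;
    if (Node) Pool = Node->next; //Reuse a node previously returned to this thread's pool
    else Node = malloc(sizeof(entry_t)); //Otherwise fall back to the system allocator

    return Node;
}

static void PoolFree(entry_t *Node)
{
    Node->next = Pool; //Push the node onto this thread's freelist
    Pool = Node;
}

Manage and Reclaim would then call PoolAlloc/PoolFree in place of malloc/free.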

Next will be the Start function. This function will be used by our lock-free algorithms to indicate when a memory managed section will begin (adding items to be managed or reading managed items). It will first get the local state for the thread, or create it if there is none. Then the local epoch will be moved 2 places ahead of the global epoch, so it is no longer referencing the global (stale) epoch, and we retain a reference to the epoch we will be using.

void Start(gc_t *GC)
{
    local_t *Local = tss_get(GC->key);
    if (!Local)
    {
        Local = malloc(sizeof(local_t));
        if (!Local) return; //Handle failure however you want

        Local->epoch = 0;
        tss_set(GC->key, Local);
    }

    //Repeat until we successfully retain a reference to an epoch +2 ahead of the global
    for (epoch_t GlobalEpoch = atomic_load(&GC->epoch); ; )
    {
        epoch_t Epoch = (GlobalEpoch + 2) % 3; //Set current thread's epoch to +2 above global epoch
        *Local = (local_t){ .head = NULL, .tail = NULL, .epoch = Epoch };

        //Retain a reference to the current epoch
        managed_list_t Managed;
        do {
            Managed = atomic_load(&GC->managed[Epoch]);
        } while (!atomic_compare_exchange_weak(&GC->managed[Epoch], &Managed, ((managed_list_t){ .list = Managed.list, .refCount = Managed.refCount + 1 })));

        //Check to see global epoch has not changed since retaining the local epoch reference
        epoch_t NewGlobalEpoch = atomic_load(&GC->epoch);
        if (GlobalEpoch == NewGlobalEpoch) break;

        GlobalEpoch = NewGlobalEpoch;

        //Release the reference to the current epoch (first optimistically assuming it hasn't changed since, otherwise falling back to a CAS-loop)
        if (!atomic_compare_exchange_strong(&GC->managed[Epoch], &Managed, ((managed_list_t){ .list = Managed.list, .refCount = Managed.refCount - 1 })))
        {
            do {
                Managed = atomic_load(&GC->managed[Epoch]);
            } while (!atomic_compare_exchange_weak(&GC->managed[Epoch], &Managed, ((managed_list_t){ .list = Managed.list, .refCount = Managed.refCount - 1 })));
        }
    }
}

Next we will implement the Manage function. This function will be used to add any items that we wish to have managed by the system. So we will create a new entry and then store it in the local thread state. The reason we store it in the thread's local state here is to avoid an additional CAS-loop that would otherwise be required. Instead we keep a list of entries from that thread, keeping track of the head and tail nodes, which will be merged only once the managed section has finished.

void Manage(gc_t *GC, void *Item, void (*Reclaimer)(void*))
{
    entry_t *Entry = malloc(sizeof(entry_t));
    if (!Entry) return; //Handle failure however you want

    //Add the item to the current local list of managed items
    local_t *Local = tss_get(GC->key);
    *Entry = (entry_t){ .next = Local->head, .item = Item, .reclaimer = Reclaimer };
    Local->head = Entry;

    if (!Local->tail) Local->tail = Local->head;
}

Finally, all that is left is to implement our Stop function. This function will be used by our lock-free algorithms to indicate when a memory managed section has ended (no more items being added for management or managed items being read). It will first get the global epoch, which will become our stale epoch (if it meets the right conditions). We then need to release our reference to our current epoch and merge our entries into it; both can be achieved with a single CAS-loop. Finally, to work out whether the global epoch is stale and can be reclaimed, we need to check that there are no longer any references to it nor to the epoch after it (all references are now 2 places ahead). If those conditions are met we can safely reclaim it.

void Stop(gc_t *GC)
{
    local_t *Local = tss_get(GC->key);

    const epoch_t GlobalEpoch = atomic_load(&GC->epoch);
    const epoch_t Epoch = Local->epoch;

    managed_list_t Managed;
    if (Local->head)
    {
        //Release a reference to the current epoch and add local entries to global entries for this epoch
        do {
            Managed = atomic_load(&GC->managed[Epoch]);
            Local->tail->next = Managed.list;
        } while (!atomic_compare_exchange_weak(&GC->managed[Epoch], &Managed, ((managed_list_t){ .list = Local->head, .refCount = Managed.refCount - 1 })));
    }

    else
    {
        //Release a reference to the current epoch
        do {
            Managed = atomic_load(&GC->managed[Epoch]);
        } while (!atomic_compare_exchange_weak(&GC->managed[Epoch], &Managed, ((managed_list_t){ .list = Managed.list, .refCount = Managed.refCount - 1 })));
    }

    const epoch_t StaleEpoch = GlobalEpoch % 3;
    Managed = atomic_load(&GC->managed[StaleEpoch]);
    if (Managed.refCount == 0) //Check if the stale epoch is no longer referenced
    {
        managed_list_t NextManaged = atomic_load(&GC->managed[(StaleEpoch + 1) % 3]);
        if ((NextManaged.refCount == 0) && (atomic_compare_exchange_strong(&GC->managed[StaleEpoch], &Managed, ((managed_list_t){ .list = NULL, .refCount = 0 })))) //Check if the stale epoch can be reclaimed
        {
            if (GlobalEpoch == atomic_load(&GC->epoch)) Reclaim(GC, Managed.list, GlobalEpoch); //Before reclaiming double check another thread hasn't beaten us to it
            else if (Managed.list)
            {
                //If another thread has beaten us and there were nodes attached, proceed to add them back
                entry_t *List = Managed.list, *Tail = Managed.list;
                while (Tail->next) Tail = Tail->next;

                for (epoch_t GlobalEpoch = atomic_load(&GC->epoch); ; )
                {
                    //Retain a reference to an epoch +2 places ahead of global
                    epoch_t Epoch = (GlobalEpoch + 2) % 3;
                    managed_list_t Managed;
                    do {
                        Managed = atomic_load(&GC->managed[Epoch]);
                    } while (!atomic_compare_exchange_weak(&GC->managed[Epoch], &Managed, ((managed_list_t){ .list = Managed.list, .refCount = Managed.refCount + 1 })));

                    //Check to see global epoch has not changed since retaining the local epoch reference
                    epoch_t NewGlobalEpoch = atomic_load(&GC->epoch);
                    if (GlobalEpoch == NewGlobalEpoch)
                    {
                        //Release a reference to the current epoch and add local entries to global entries for this epoch
                        do {
                            Managed = atomic_load(&GC->managed[Epoch]);
                            Tail->next = Managed.list;
                        } while (!atomic_compare_exchange_weak(&GC->managed[Epoch], &Managed, ((managed_list_t){ .list = List, .refCount = Managed.refCount - 1 })));

                        break;
                    }

                    GlobalEpoch = NewGlobalEpoch;

                    //Release the reference to the current epoch (first optimistically assuming it hasn't changed since, otherwise falling back to a CAS-loop)
                    if (!atomic_compare_exchange_strong(&GC->managed[Epoch], &Managed, ((managed_list_t){ .list = Managed.list, .refCount = Managed.refCount - 1 })))
                    {
                        do {
                            Managed = atomic_load(&GC->managed[Epoch]);
                        } while (!atomic_compare_exchange_weak(&GC->managed[Epoch], &Managed, ((managed_list_t){ .list = Managed.list, .refCount = Managed.refCount - 1 })));
                    }
                }
            }
        }
    }
}
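
That completes the system. To illustrate how it is intended to be used, below is a minimal sketch of a managed section wrapped around a hypothetical lock-free stack pop (stack_t, node_t, and StackPop are assumptions made for the example, not part of the implementation above).

void *Pop(gc_t *GC, stack_t *Stack)
{
    Start(GC); //Begin the memory managed section

    node_t *Node = StackPop(Stack); //Hypothetical lock-free detach of a node
    void *Item = Node ? Node->item : NULL;

    if (Node) Manage(GC, Node, free); //Defer freeing the node until no thread can still be referencing it

    Stop(GC); //End the managed section, possibly reclaiming a stale epoch

    return Item;
}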

Caveats

CAS-loops are expensive, and unfortunately this implementation requires a number of them, though in the best case it only requires 2 of them plus 2 strong CAS operations. Other memory reclamation strategies avoid them entirely: lock-based EBR, hazard pointers, QSBR, etc. It may be possible to replace them somehow, but I haven't given too much thought as to how that might be achieved for now.

The other problem is thread failure, as brought up in the paper. If a thread fails in the critical section, its reference will remain forever and memory will start to "leak", as the global epoch will no longer be able to advance and so that memory won't be reclaimed. A strategy to avoid this might be to watch for non-recoverable thread failure, replace the GC the other threads are using, and destroy the old one.

Concluding

To see this in practice, the implementation can be found here. One interesting approach that can be taken from this is simply using the retained references for critical sections and avoiding the epochs altogether. That strategy would only be appropriate in situations with very low contention, as it would only be able to reclaim memory when no threads are currently in the critical section, but for some use cases that may be sufficient; a rough sketch of that variant is shown below.
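
For instance, a minimal sketch of that epoch-less variant might look like the following. It reuses entry_t from above and assumes, as with Manage, that items handed to it are already unlinked from the shared data structure; guard_t, GuardStart, and GuardStop are hypothetical names.

typedef struct {
    entry_t *list;
    uint32_t refCount;
} guard_t;

void GuardStart(_Atomic(guard_t) *Guard)
{
    //Retain a reference to the critical section
    guard_t Old;
    do {
        Old = atomic_load(Guard);
    } while (!atomic_compare_exchange_weak(Guard, &Old, ((guard_t){ .list = Old.list, .refCount = Old.refCount + 1 })));
}

void GuardStop(_Atomic(guard_t) *Guard)
{
    //Release the reference, detaching the list if we are the last thread to leave
    guard_t Old;
    do {
        Old = atomic_load(Guard);
    } while (!atomic_compare_exchange_weak(Guard, &Old, ((guard_t){ .list = Old.refCount == 1 ? NULL : Old.list, .refCount = Old.refCount - 1 })));

    if (Old.refCount == 1)
    {
        //No other thread was in the critical section, so the detached entries can be reclaimed
        for (entry_t *Node = Old.list; Node; )
        {
            Node->reclaimer(Node->item);

            entry_t *Temp = Node;
            Node = Node->next;
            free(Temp);
        }
    }
}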

Lastly, as always, lock-free programming is hard. Hopefully there are no errors, but use at your own risk.

Updates

Major changes have been made to this article. The details of those changes are outlined below.

12th May 2018

As it would turn out, lock-free programming is certainly hard and deceptive. The original implementation presented in this article (which can be seen here) did end up having some bugs. Amusingly, I had been using an implementation like this for years and had never experienced a crash related to memory reclamation.

It was only recently, when I went to stress test a new lock-free algorithm I was working on, that I started noticing incorrect results (unfortunately stress testing my other lock-free algorithms never happened to bring up this issue). For ages I thought something must've been wrong with this new algorithm, but I couldn't work out what. Eventually I managed to track it down to the epoch memory reclaimer implementation I was using, which is what led me to making the above fixes.

The changes made were: