The database that never lets go

This is part 3 in the Slowest Database series. Be sure to check out part 1 first.

Background

In part 2, we left off with deletes being the slowest operation, at around 2 seconds to delete 10 words in only a ~1 MB file. This seems like the place to focus.

The last delete implementation worked like this:

Loop over the file looking for all occurrences of the string to be deleted
Buffer the entire database, post-delete, in memory
Write this into a new file
Delete the old file

Aside from the performance, this has big problems with RAM consumption; our database can only do deletes if we can hold the entire thing in memory. This doesn’t work.

UnbufferedStringDb

I originally planned on part 3 being about storing the data in sorted order, but storing data sorted on disk is surprisingly complicated. We’ll get there, but I want to do simpler intermediate steps first. The next step we’ll take is figuring out how to not buffer and write the entire file post-delete. This will involve:

Identifying the line indexes of the file that need deleted
Marking them deleted (not actually removing data at this point) so that they are unsearchable
Updating our read logic so that it doesn’t detect these “deleted” records

This doesn’t free any disk space. This only makes data unsearchable. We’ll continue to iterate on this.

Relevant changes in this iteration:

public async Task Delete(string data)
{
    var deletedIndexes = new List<string>();
    var currentIndex = 0;

    var lines = File.ReadLines(filePath);
    foreach (var line in lines)
    {
        if (line == data)
        {
            deletedIndexes.Add(currentIndex.ToString());
        }

        currentIndex++;
    }

    await SetCount(rowCount - deletedIndexes.Count);

    File.AppendAllLines(freeListFilePath, deletedIndexes);
}

public async Task<string> Find(string data)
{
    var lines = File.ReadLines(filePath);
    var currentIndex = 0;

    foreach (var line in lines)
    {
        if (line == data && !IsDeleted(currentIndex))
        {
            return line;
        }

        currentIndex++;
    }

    return await Task.FromResult<string>(default);
}

private bool IsDeleted(int recordIndex)
{
    var lines = File.ReadLines(freeListFilePath);
    foreach (var line in lines)
    {
        if (int.Parse(line) == recordIndex)
        {
            return true;
        }
    }

    return false;
}

Our Find method now naively checks a file to see if the current index is marked deleted, and if so, skips it.

Benchmark results

The big change in benchmarking here is adding an additional insert and search after deletions are run. This is relevant in this implementation because, as you might expect, searches get slower as the “freelist” grows, and the freelist never gets smaller.

All values are in milliseconds.

Operation	StringDb	StringDb2	UnbufferedStringDb
Seed 1 MB	893	935	921
Insert 10	16	12	13
Insert 10 less common	4	20	14
Get total count	52	1	0
Search 10	1	16	20
Search 10 less common	24	29	34
Search 10 non-words	196	198	206
Delete 10	2014	2254	263
Delete 10 less common	2015	2051	226
Delete 10 non-words	2002	2206	223
Insert 10 less common	5	12	13
Search 10 less common	237	204	1077

Thoughts

This achieved the primary goal of speeding up deletes (~10x faster), but had the cost of slowing down searches as the freelist grows (~5x slower at this size). This was anticipated, though I will admit the slowdown was bigger than I figured even at a fairly small freelist (0.5 MB).

There were some optimizations to the freelist that I intentionally avoided at this point, such as keeping some or all of it in RAM. One of the goals of this project is focusing on how things are stored on disk. Buffering data in memory makes many performance problems trivial, but is then limited by system RAM. I’m making careful considerations here, only consuming RAM where the max it can take up is predictable and tolerable (such as the total record count fitting into a single int).

Next steps

The big problem now is search performance after deletions, so let’s focus there. Stay tuned!