Improving RRB-Tree Performance through Transience
posted

I finally completed my master’s thesis! Its source code is available at https://github.com/hyPiRion/c-rrb, and the text is available at thesis.pdf.
Abstract
The RRB-tree is a confluently persistent data structure based on the persistent vector, with efficient concatenation and slicing, and effectively constant time indexing, updates and iteration. Although efficient appends have been discussed, they have not been properly studied.
This thesis formally describes the persistent vector and the RRB-tree, and presents three optimisations for the RRB-tree which have been successfully used in the persistent vector. The differences between the implementations are discussed, and the performance is measured. To measure the performance, the C library librrb is implemented with the proposed optimisations.
Results show that the optimisations improves the append performance of the RRB-tree considerably, and suggests that its performance is comparable to mutable array lists in certain situations.
Disclaimer
I think it should be noted that I do not know any type theory at all, so take the notation in Chapter 4 and 7 with a grain of salt. The intention is not to introduce proper type theory notation, but rather explain what transience/transients are. I’d love some feedback on how to improve those parts by seasoned type theorists.
Intended Audience
If you just want a basic understanding of how the persistent vector works, have a look at my blog post series on the persistent vector instead. Additionally, I’d recommend going to the persistent data structure presentation by Dann Toliver on StrangeLoop this year (2014) if you’re going. I do not know the guy, but I know visualisation helped me incredibly much with understanding how they worked.
If you want to understand the RRB-tree, you can do one of two things:
- Read the original paper on RRB-trees and watch this youtube video. If it’s still not evident (I found it somewhat dense), attempt to read my thesis.
- Wait for me to write a blog post series on the RRB-tree. (No guarantees!)
Why do I mention this? Well, it turns out that most theses are intended for academics, and this one is not an exception. It’s not that I want the thing to be unreadable – far from it – but I felt the need to prove a considerable amount of stuff to back up my claims. As a result, my thesis is very math-y and there are a lot of details in it. While it’s not necessarily a problem, I don’t want people unfamiliar with the mathematical notation to be scared away from the persistent vector data structure family.
The mathematics is not “advanced”, mind you. If you want to follow the proofs, you have to understand geometric series, logarithmic identities, and L’Hôpital’s rule. Additionally, you should be familiar with computational complexity theory (understanding what Big-O notation is, and its cousin Theta notation). It’s probably not easy to understand it all at first read, but I think it’s not really required if you want to skim through the thesis. I try to give an informal explanation alongside the formal one to make it understandable for everyone.
If you want to really understand these data structures, or want to implement them, I think you should read it regardless of your background. Apart from the math parts described above, the thesis is designed to not require much background knowledge. The only thing you have to know is the trie data structure.
What’s Next for RRB-Trees?
First and foremost, I want to release librrb. I do need to clean up a couple
of things; I want to make the source code understandable before publishing it.
The source will be released under the MIT license, and hopefully, it should be
relatively easy to port it to other languages. Edit: The source code is now available!
Secondly, transience is actually a really easy concept, so I’d like to do a quick explanation of how it works. It’s the final piece for explaining the Clojure’s persistent vector blog series.
My thesis does not contain anything about the core.rrb-vector library, as I,
unfortunately, didn’t get enough time to write a good comparison: There’s sadly
no explanation on how the optimisations in core.rrb-vector works, and
deciphering imperative Clojure code is not my strongest skill. I think that’s
the biggest weakness of my thesis because Michał Marczyk has done an amazing
job on the library.
But from what I gather, the core.rrb-vector has a tail and support for
transients. It does not seem to use the direct append technique but does
something very similar. I’ll have to look at the implementation in detail and
attempt to comprehend it whenever I get time enough to do so, so that I can
learn the differences and do a write-up on it.
I also have some thoughts on visualising the persistent vector and the RRB-tree, but I’ll wait until Dann Toliver presents his talk. No need to reinvent the wheel if it’s already good enough :)
Errata
I’d be happy to get feedback/errata and answer any questions you might have. Please email me with them: I’ll fix them whenever I get time, and upload a new version with some revision history added.
Additional Information
The design of my thesis was initially stolen from Bastian Müller, from his thesis “Efficient Dynamic Method Dispatch on the Java Virtual Machine”, which was an attempt to replicate the design of Dave Hermann’s dissertation “A Theory of Typed Hygienic Macros”. It’s very similar to the template I’ve open sourced as tethysthesis, but there are a couple of (mostly unnoticeable) changes I need to merge in.
If you want to refer to my thesis, the BibTeX entry is here:
@mastersthesis{lorange2014rrb,
    author = {L'orange, Jean Niklas},
    title  = {{Improving RRB-Tree Performance through Transience}},
    school = {Norwegian University of Science and Technology},
    year   = {2014},
    month  = {June}
}
If you want to host my thesis on your webpage or parts of it, you’re of course free to do so. I’d be happy if you refer to the original as I will update it to fix errata (http://hypirion.com/thesis.pdf).