How Voron works: Insight into the new RavenDB storage engine
In this Level 400 talk, we go deep into how Voron is implemented, including all the gory details of creating a high performance transnational storage.
- Level 400: Diving into Voron Oren Eini firstname.lastname@example.org ayende.com/blog Hibernating Rhinos
- Voron is… Low level key / value store Transactional / ACID MVCC Multi layers
- background LevelDB LMDB Esent
- Seeks are slow 0.01 ms – Compress 1kb with Zippy 0.25 ms – Read 1 MB from memory 0.50 ms – Ping inside data center 10.0 ms – Disk seek 10.0 ms – Read 1 MB from network 30.0 ms – Read 1 MB from disk
- Binary Trees, Eh? F B A D C E G H I
- B+ Trees
- Implementation 4KB Pages B+ Tree Page translation table MVCC Journal file Scratch file Memory mapped
- Modifying the tree Find appropriate #to modify. Get a scratch page, copy #to scratch page. Register scratch #with the old ## in #translation table (PTT). Modify the #as you wish. On commit, the PTT becomes publicly visible. All changed pages are written to journal file. If rollback, revert to previous PTT, release scratch pages, done.
- #0 -> #3 #1 -> #1 #0 -> #3 #1 -> #5
- Background Find pages in scratch that have no one looking at older versions of them. Copy to data file. Clear the scratch space.
- How it works Only I/O during commits is a single write through, compressed, of data to journal. Moving data to data file is done in async. No need to call fsync(). Full & incremental backups.
- Missing the forest Voron isn’t a B+ Tree system. It doesn’t have a tree, it has trees. Plural. <blink>Important</blink>
- Falling trees Single root tree Contain many additional trees. Tree is similar to a table. Operations on tree: Add(key, value) Del(key, value) Find(key) : value Iterate() (Seek,Next, Prev)
- How it works?
- With indexes
- Finding stuff * Not the most efficient method
- So, Voron has trees… Root tree Free Space tree Contains references to named trees Enough? Tree of trees MultiAdd, MultiDelete, MultiRead
- Why multi trees? Optimization – if has just 1 item (and no value) can directly use the parent tree store. Store multiple items for a single value.
- Iterating multi trees
- What voron does? Opens up a lot of interesting scenarios. We have far better control over persistence now. Very low level (bits & bytes). Very fast! Concurrency benefits: Reads Writes* * Yet Voron allows only a single writer!
- What it does not? It isn’t about Linux. It can’t run on Linux*. Need to implment: PosixPureMemoryPager PosixPageFileBackedMemoryMappedPager PosixMemoryMapPager Waiting for big Linux push post 3.0 release.
- the cloud story… Scratch / temp usage Utilize fast local drives that can go away. Slow I/O only hold us for tx commit (and we optimized that).
- Summary Voron learned from LevelDB, LMDB, Esent. Journal for Atomicity, Consistency & Durability. MVCC for Consistency & Isolation. Root tree, named tress, multi trees.