How Voron works: Insight into the new RavenDB storage engine
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
In this Level 400 talk, we go deep into how Voron is implemented, including all the gory details of creating a high performance transnational storage.
- Level 400: Diving into Voron Oren Eini [email protected] ayende.com/blog Hibernating Rhinos
- Voron is… Low level key / value store Transactional / ACID MVCC Multi layers
- background LevelDB LMDB Esent
- Seeks are slow 0.01 ms – Compress 1kb with Zippy 0.25 ms – Read 1 MB from memory 0.50 ms – Ping inside data center 10.0 ms – Disk seek 10.0 ms – Read 1 MB from network 30.0 ms – Read 1 MB from disk
- Binary Trees, Eh? F B A D C E G H I
- B+ Trees
- Implementation 4KB Pages B+ Tree Page translation table MVCC Journal file Scratch file Memory mapped
- Modifying the tree Find appropriate #to modify. Get a scratch page, copy #to scratch page. Register scratch #with the old ## in #translation table (PTT). Modify the #as you wish. On commit, the PTT becomes publicly visible. All changed pages are written to journal file. If rollback, revert to previous PTT, release scratch pages, done.
- #0 -> #3 #1 -> #1 #0 -> #3 #1 -> #5
- Background Find pages in scratch that have no one looking at older versions of them. Copy to data file. Clear the scratch space.
- How it works Only I/O during commits is a single write through, compressed, of data to journal. Moving data to data file is done in async. No need to call fsync(). Full & incremental backups.
- Missing the forest Voron isn’t a B+ Tree system. It doesn’t have a tree, it has trees. Plural. <blink>Important</blink>
- Falling trees Single root tree Contain many additional trees. Tree is similar to a table. Operations on tree: Add(key, value) Del(key, value) Find(key) : value Iterate() (Seek,Next, Prev)
- How it works?
- With indexes
- Finding stuff * Not the most efficient method
- So, Voron has trees… Root tree Free Space tree Contains references to named trees Enough? Tree of trees MultiAdd, MultiDelete, MultiRead
- Why multi trees? Optimization – if has just 1 item (and no value) can directly use the parent tree store. Store multiple items for a single value.
- Iterating multi trees
- What voron does? Opens up a lot of interesting scenarios. We have far better control over persistence now. Very low level (bits & bytes). Very fast! Concurrency benefits: Reads Writes* * Yet Voron allows only a single writer!
- What it does not? It isn’t about Linux. It can’t run on Linux*. Need to implment: PosixPureMemoryPager PosixPageFileBackedMemoryMappedPager PosixMemoryMapPager Waiting for big Linux push post 3.0 release.
- the cloud story… Scratch / temp usage Utilize fast local drives that can go away. Slow I/O only hold us for tx commit (and we optimized that).
- Summary Voron learned from LevelDB, LMDB, Esent. Journal for Atomicity, Consistency & Durability. MVCC for Consistency & Isolation. Root tree, named tress, multi trees.