How Voron works: Insight into the new RavenDB storage engine

Data & Analytics

oren-eini
of 25
Description
In this Level 400 talk, we go deep into how Voron is implemented, including all the gory details of creating a high performance transnational storage.
Text
  • Level 400: Diving into Voron Oren Eini ayende@ayende.com ayende.com/blog Hibernating Rhinos
  • Voron is…  Low level key / value store  Transactional / ACID  MVCC  Multi layers
  • WHY?!
  • background  LevelDB  LMDB  Esent
  • Seeks are slow  0.01 ms – Compress 1kb with Zippy  0.25 ms – Read 1 MB from memory  0.50 ms – Ping inside data center  10.0 ms – Disk seek  10.0 ms – Read 1 MB from network  30.0 ms – Read 1 MB from disk
  • Binary Trees, Eh? F B A D C E G H I
  • B+ Trees
  • Implementation  4KB Pages  B+ Tree  Page translation table  MVCC  Journal file  Scratch file  Memory mapped
  • Modifying the tree  Find appropriate #to modify.  Get a scratch page, copy #to scratch page.  Register scratch #with the old ## in #translation table (PTT).  Modify the #as you wish.  On commit, the PTT becomes publicly visible.  All changed pages are written to journal file.  If rollback, revert to previous PTT, release scratch pages, done.
  • #0 -> #3 #1 -> #1 #0 -> #3 #1 -> #5
  • Background  Find pages in scratch that have no one looking at older versions of them.  Copy to data file.  Clear the scratch space.
  • How it works  Only I/O during commits is a single write through, compressed, of data to journal.  Moving data to data file is done in async.  No need to call fsync().  Full & incremental backups.
  • Missing the forest  Voron isn’t a B+ Tree system.  It doesn’t have a tree, it has trees. Plural.  <blink>Important</blink>
  • Falling trees  Single root tree  Contain many additional trees.  Tree is similar to a table.  Operations on tree:  Add(key, value)  Del(key, value)  Find(key) : value  Iterate() (Seek,Next, Prev)
  • How it works?
  • With indexes
  • Finding stuff * Not the most efficient method
  • So, Voron has trees…  Root tree  Free Space tree  Contains references to named trees  Enough?  Tree of trees  MultiAdd, MultiDelete, MultiRead
  • Why multi trees?  Optimization – if has just 1 item (and no value) can directly use the parent tree store.  Store multiple items for a single value.
  • Iterating multi trees
  • What voron does?  Opens up a lot of interesting scenarios.  We have far better control over persistence now.  Very low level (bits & bytes).  Very fast!  Concurrency benefits:  Reads  Writes*  * Yet Voron allows only a single writer!
  • What it does not?  It isn’t about Linux. It can’t run on Linux*.  Need to implment:  PosixPureMemoryPager  PosixPageFileBackedMemoryMappedPager  PosixMemoryMapPager  Waiting for big Linux push post 3.0 release.
  • the cloud story…  Scratch / temp usage  Utilize fast local drives that can go away.  Slow I/O only hold us for tx commit (and we optimized that).
  • Summary  Voron learned from LevelDB, LMDB, Esent.  Journal for Atomicity, Consistency & Durability.  MVCC for Consistency & Isolation.  Root tree, named tress, multi trees.
  • Questions?
Comments
Top