At StumbleUpon’s headquarters in San Francisco on Tuesday night, about 70 engineers and HBase committers discussed the latest optimizations and uses of HBase, an open-source distributed database that simplifies managing and retrieving huge amounts of data. StumbleUpon relies on HBase to process all the data we receive on user preferences and stumbling patterns, and it helps us ensure our recommendations are accurate and high-quality. Many large-scale websites use HBase and Hadoop, the software framework on which it runs, to process jobs like this. On Tuesday, we heard from three HBase users who presented hacks they’ve created to get even more out of the architecture:
Todd Lipcon (twitter.com/tlipcon) from Cloudera, a software company that provides Apache Hadoop-based software and services, discussed how to avoid pauses caused by the Java garbage collector, moments when retrieving data is impossible. Todd explained that most garbage collection pauses are caused by fragmentation, and that MSLAB (MemStore-Local Allocation Buffers), a new memory allocator for HBase that plays well with the garbage collector, effectively moves HBase's in-memory write buffer allocations into contiguous 2 MB chunks.
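To make the chunking idea concrete, here is a minimal Python sketch of an allocator that copies small values into large contiguous buffers instead of leaving each one as its own object. It is a toy illustration only, not HBase's code (HBase is written in Java), and the names `SlabAllocator`, `store`, and `load` are made up for this example.

```python
CHUNK_SIZE = 2 * 1024 * 1024  # 2 MB, the chunk size mentioned in the talk


class SlabAllocator:
    """Toy allocator (hypothetical): packs small values into big contiguous chunks."""

    def __init__(self, chunk_size=CHUNK_SIZE):
        self.chunk_size = chunk_size
        self.chunks = [bytearray(chunk_size)]
        self.offset = 0  # next free byte in the current (last) chunk

    def store(self, value: bytes):
        """Copy value into the current chunk and return (chunk_index, offset, length)."""
        if len(value) > self.chunk_size:
            # Assumption for this sketch: oversized values are simply rejected.
            raise ValueError("value is larger than one chunk")
        if self.offset + len(value) > self.chunk_size:
            # The current chunk cannot fit the value, so start a fresh chunk.
            self.chunks.append(bytearray(self.chunk_size))
            self.offset = 0
        start = self.offset
        self.chunks[-1][start:start + len(value)] = value
        self.offset += len(value)
        return len(self.chunks) - 1, start, len(value)

    def load(self, ref):
        """Read a value back from the reference returned by store()."""
        index, start, length = ref
        return bytes(self.chunks[index][start:start + length])
```

Because values live inside a few large buffers, dropping a whole buffer frees one big contiguous region rather than many scattered small ones, which is the intuition behind avoiding fragmentation.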
Next, Benoit Sigoure, Site Reliability Engineer from StumbleUpon, talked about designing OpenTSDB, a distributed, scalable time series database. He discussed how OpenTSDB uses HBase to store and retrieve billions of highly granular data points in real time and to present this data in custom graphs that even business folk like me can understand. The upshot: no single point of failure and an intuitive interface for viewing data (like stumbles over a time period, for example). Check out his entire presentation here.
Finally, Darren Erik Vengroff from RichRelevance, which designs software for product recommendations one might see on Amazon and Netflix, talked about creating BigQL, a language that data analysts who know SQL could understand, built on top of HBase's new coprocessor feature. His aim was to give these data analysts a program that, in his words, is “easy to use, fits their way of thinking, and solves their problems, but that’s also tuned to the backend data stores we want to run on.”
But I have to admit that my favorite part of the evening was the analogies that engineers have for certain program operations and scenarios. Here are some of the best from the evening, in my opinion:
- Garbage collection – Operations attempting to dispose of memory occupied by objects that are no longer being used – i.e. the garbage.
- “Juliet Pause” – This can happen when a server pauses to run a “stop-the-world” full garbage collection, during which one or all of the server’s processes are paused. To the control center that monitors server activity, this server appears unresponsive – i.e. dead – and so it assumes control over the “dead” server’s files and begins cleaning up and redeploying the “dead” server’s resources across the server network. Eventually, the garbage collection completes and the paused server “wakes up,” only to find that its world has been radically altered, and so it immediately kills itself, reminiscent of Juliet’s awakening in Act 5, Scene 3 of Romeo and Juliet.
- Swiss cheese – The fragmentation – i.e. free spaces or “holes” – left in memory after unused objects are cleared out.
- Shingling – A technique where one looks at overlapping time ranges for data – i.e., a little bit before and a little bit after the desired time range, like shingles on a roof.
- Time Series Daemons (pronounced like “demons”) – Data storage nodes in OpenTSDB that sit on top of HBase and serve data back when requested. The Daemons specialize in storing many small, independent, and discrete observations efficiently.
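For the curious, the “Juliet Pause” sequence from the list above can be sketched as a tiny Python simulation. This is a simplified model under stated assumptions, not how HBase is implemented: the `Coordinator` class, the heartbeat timeout, and the server and resource names are all hypothetical stand-ins for the control center and servers described above.

```python
class Coordinator:
    """Hypothetical control center that tracks server heartbeats."""

    def __init__(self, timeout):
        self.timeout = timeout      # seconds of silence before a server is presumed dead
        self.last_heartbeat = {}    # server name -> time of its last heartbeat
        self.owner = {}             # resource name -> server currently responsible

    def heartbeat(self, server, now):
        self.last_heartbeat[server] = now

    def assign(self, resource, server):
        self.owner[resource] = server

    def reap(self, now, standby):
        """Declare silent servers dead and hand their resources to the standby."""
        dead = [s for s, t in self.last_heartbeat.items() if now - t > self.timeout]
        for server in dead:
            del self.last_heartbeat[server]
            for resource, owner in list(self.owner.items()):
                if owner == server:
                    self.owner[resource] = standby
        return dead

    def is_alive(self, server):
        return server in self.last_heartbeat


def wake_up(coordinator, server):
    """A server returning from a long pause checks whether it was declared dead."""
    return "continue" if coordinator.is_alive(server) else "shut down"


# Example timeline: rs1 stalls in a long garbage collection while rs2 keeps reporting in.
control = Coordinator(timeout=30)
control.heartbeat("rs1", now=0)
control.heartbeat("rs2", now=0)
control.assign("region-a", "rs1")
control.heartbeat("rs2", now=40)               # rs1 is silent, still paused
declared_dead = control.reap(now=45, standby="rs2")
rs1_decision = wake_up(control, "rs1")         # rs1 finally wakes up
```

In this timeline the paused server misses its heartbeat window, its resource is reassigned, and when it wakes it finds it has already been written off.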
Besides getting a peek into the world of big data and imaginative engineering terms (not to mention munching on pizza and drinking beer from the StumbleUpon kegerator), I got the chance to chat with leading HBase developers like Ted Dunning, Chief Applications Architect at MapR Technologies. When I asked him about HBase’s place in the data tech community these days, Ted referenced this Arthur C. Clarke quote: “Any sufficiently advanced technology is indistinguishable from magic.”
“HBase is just beginning to grow up, to be magic in the sense that it hides the complexity,” he said. He added that the question that all new technology promoters face is the same: “How do you communicate a need to people who don’t know what they can’t do?”
To find out more about these events and join the Bay Area HBase User Group, click here.