We’ve been thrilled about the recent press around HBase, a database that makes it easier to retrieve, save, and manage vast amounts of data, and one that StumbleUpon has been using for over a year and a half. We wanted to shed some light on this system in plain English, so I sat down with Ryan Rawson, System Architect and HBase committer here at StumbleUpon, to chat about how HBase works and why StumbleUpon uses it.
KG: What is HBase?
RR: HBase is an open-source distributed database that runs on Hadoop, a software framework that allows you to store and process large amounts of data on multiple machines. HBase is “distributed” because it operates on many computers at once. By harnessing the power of multiple computers, we can solve problems that are much bigger than any single computer can tackle. HBase is also open source, meaning that we can collaborate with more than 100 companies all around the world, including Facebook, to improve it and benefit from each other’s work.
KG: How does HBase work?
RR: HBase gives you real-time access to data stored in Hadoop. By building on top of Hadoop, we are leveraging the scale that Yahoo has pioneered and that other companies have helped to improve. When we want to retrieve a range of data, we first ask a directory service, kind of like a library card catalog, which machine hosts that range. We then ask that machine for the data, which it retrieves and sends back to us. But even though we contact a directory, the system isn’t strictly centralized. There’s not one machine handling every request, which means that if parts of the system go down, we don’t lose access to the entire set of data at the same time. We can also get data faster because we’re spreading the load over multiple machines, taking advantage of parallelism.
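(Editor's aside: the directory lookup RR describes can be pictured with a small Python sketch. This is a toy model under stated assumptions, not HBase's client API. The class RegionDirectory, its methods locate and hosts_for_range, and the host names are all hypothetical, and it assumes each machine serves one contiguous, sorted range of row keys.)

```python
import bisect


class RegionDirectory:
    """Toy stand-in for the directory service (hypothetical, not HBase code)."""

    def __init__(self, regions):
        # regions: (start_key, host) pairs; each range runs from its start key
        # up to, but not including, the next range's start key.
        self._regions = sorted(regions)
        self._starts = [start for start, _ in self._regions]

    def locate(self, row_key):
        # The owning range is the last one whose start key is <= row_key.
        index = bisect.bisect_right(self._starts, row_key) - 1
        if index < 0:
            raise KeyError(row_key)
        return self._regions[index][1]

    def hosts_for_range(self, start_key, stop_key):
        # Every host whose range overlaps [start_key, stop_key).
        first = max(bisect.bisect_right(self._starts, start_key) - 1, 0)
        last = bisect.bisect_left(self._starts, stop_key) - 1
        return [host for _, host in self._regions[first:last + 1]]


# Three made-up machines, each owning a slice of the key space.
directory = RegionDirectory([("", "host-a"), ("g", "host-b"), ("p", "host-c")])
single_host = directory.locate("kitten")        # one key lives on one machine
scan_hosts = directory.hosts_for_range("e", "h")  # a scan may span several
```

In this sketch the directory only answers "who has this range?"; the data itself would be read from the hosts it names, which is why no single machine has to handle every request.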
KG: What are StumbleUpon’s data needs?
RR: To make the best recommendations, we have to manage a lot of user signals. Every thumb-up, stumble, and share (among other feedback from users) is stored so we can make better decisions about what pages to show you next. Our data must be organized safely, retrieved quickly, deleted at times, and refreshed often.
KG: What do you like about HBase?
RR: It’s cost-effective, fast at data retrieval, and dependable. Instead of requiring one or two very large computers, HBase runs on a network of smaller and cheaper machines. By storing data across these smaller machines, we get better performance, since we can always add more machines to improve data storage and retrieval as StumbleUpon’s data store grows. Plus, we can worry less about any one machine failing. Our developers love working with HBase, since it uses really cool technology and involves working with “big data.” We’re working with something that most people can’t imagine or never get the chance to work with. This is the leading edge of data storage technology, and we really feel like we’re inventing the future in a lot of ways. The fact that Facebook decided to build their next generation of messaging technology on HBase is a validation of what we’ve done and plan to do.
KG: What kinds of problems do you work on with HBase these days?
RR: I am currently working on making the next major release of HBase harder, better, faster, and stronger! (Cue the Daft Punk song.) I’m also looking into adding features that help StumbleUpon in particular.
KG: Last question – and this one’s important. Where did that cute Hadoop elephant come from? :-)
RR: This one is easy: The Hadoop project’s founder Doug Cutting named it after his son’s stuffed elephant. The rest is history!
For the latest on how StumbleUpon works with HBase, check out the talk that Jean-Daniel Cryans, database engineer at StumbleUpon, gave at Hadoop World in October.