Searching for Serendipity

StumbleUpon uses search for a multitude of applications, spanning both user-facing features and internal-only services. Just as you can search your own favorites for that grilled cheese competition you saw once, we use the same backend service to augment and present the recommendations we give you: sorting and ranking by term frequency, suggesting tags, detecting content types, categorizing pages, and more.

Why ElasticSearch?

Recently we shared that stumbling activity generates about 2.5 billion data points a week. While the original Solr-based backend sorting through all of that has served its heart out, the inherent limitations of a monolithic data store have really started to show. The amount of data generated by more than 15 million users and their billions of ratings dictated something more easily expandable, and the growth of our engineering staff highlighted the need for a platform that was both easy to experiment with and simple to support operationally.

We were among the earliest HBase adopters for similar reasons: as our data sets grew, we needed to both store and serve them in a scalable manner. When we outgrew the basic limitations of MySQL replication and sharding, HBase provided a horizontally scalable, schema-free solution on commodity hardware. Because it was designed to be distributed from the ground up, ElasticSearch is the search analogue to HBase, freeing us from some restrictions that Solr imposes:

– Failed nodes in ElasticSearch are easily replaced at a later time while replicas serve the load
– Changing an index’s replica count in response to load is easily done at runtime (see the sketch after this list)
– Developers are up and running within minutes, thanks to the schema-free design
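
For example, adding a replica to absorb extra read load is a single call to the index settings API. A minimal sketch, assuming a hypothetical index named suggest:

curl -XPUT 'http://localhost:9200/suggest/_settings' -d '{
   "index": { "number_of_replicas": 2 }
}'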

Use Case: Search Suggestions

We saw ElasticSearch’s flexibility and performance in action when we improved our recommendation engine with a small, high-performance search index optimized to return results for partially typed user input. In less than an hour, we had the index live and serving results.

In contrast to our normal “do everything” index, which is configured with one shard per node and two replicas per shard, we decided to use a single shard with a replica on every remaining cluster node. Because a single shard can handle the finite and relatively small amount of data in this index, we avoid the overhead of distributed search. Furthermore, by having a replica on each node not hosting the primary shard, we can query any machine for up-to-date results, and we gain fault tolerance should the primary shard’s hardware fail.
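
Creating such an index is a one-liner. A minimal sketch, again using the hypothetical suggest index; the auto_expand_replicas setting asks ElasticSearch to keep a replica on every other node as the cluster grows or shrinks:

curl -XPUT 'http://localhost:9200/suggest/' -d '{
   "settings": {
      "index": {
         "number_of_shards": 1,
         "auto_expand_replicas": "0-all"
      }
   }
}'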

We configured explicit mappings to support this query pattern. Here’s an excerpt:

"tag": {
   "type": "multi_field",
   "fields": {
      "edgengram": {
         "include_in_all": false,
         "analyzer": "edgengram",
         "type": "string"
      },
      "tag": {
        "type": "string"
      },
      "ngram": {
        "include_in_all": false,
        "analyzer": "ngram",
        "type": "string"
      },
      "snowball": {
        "include_in_all": false,
        "analyzer": "snowball",
        "type": "string"
      }
   }
}

 

Configuring the tag multi-field this way lets us search flexibly: by default, the plain tag field is searched using standard tokenization and analysis. We also have ngrammed and edge-ngrammed fields backed by custom analyzers: these fields are pre-tokenized into chunks (ngrams) at index time for fast partial-word matching, such that “food” becomes {f,o,d,fo,oo,od,foo,ood,food} in the ngram case and {f,fo,foo,food} in the edge-ngram case. There is also a field called snowball, using the built-in Snowball analyzer, for alternative stemming requirements.
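
For reference, those custom analyzers live in the index settings. A sketch of what such definitions might look like (the gram sizes and filter names are illustrative, and the filter type spellings vary across ElasticSearch versions):

"analysis": {
   "filter": {
      "gram_filter":      { "type": "nGram",     "min_gram": 1, "max_gram": 8 },
      "edge_gram_filter": { "type": "edgeNGram", "min_gram": 1, "max_gram": 8 }
   },
   "analyzer": {
      "ngram": {
         "type": "custom",
         "tokenizer": "standard",
         "filter": ["lowercase", "gram_filter"]
      },
      "edgengram": {
         "type": "custom",
         "tokenizer": "standard",
         "filter": ["lowercase", "edge_gram_filter"]
      }
   }
}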

By keeping only a bounded set of tag data on a single shard with pre-computed ngrams, we can make AJAX calls as the user types to retrieve the best matches to their input in near-realtime.
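
The request itself is equally lightweight. A sketch of what a typeahead lookup might look like as a user types “gril” on the way to “grilled cheese” (index and type names as above; query_string used for illustration):

curl -XGET 'http://localhost:9200/suggest/tag/_search' -d '{
   "query": {
      "query_string": { "default_field": "tag.edgengram", "query": "gril" }
   },
   "size": 10
}'

Because “gril” was already indexed as an edge ngram of “grilled”, this is an ordinary term lookup at query time, with no wildcard expansion needed.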

What’s Upcoming

To make development even easier for our engineers, we plan to open-source our Fluent Query Builder. Developers write queries in a simple domain-specific language, yielding much more readable and resilient code. Here is an example querying the mapping given above:

$result = $search->{$indexname}->search('tag')
   ->query()
      ->bool()
         ->should()
            ->field('tag.edgengram')->query($tag_input)->end()
            ->field('alias.edgengram')->query($tag_input)->end()
         ->end()
      ->end()
   ->end()
   ->fields($fields)
   ->explain(false);

This fluent model abstracts the query-building details away from developers and lets them focus on business logic rather than implementation details, insulating application code if, for example, the ElasticSearch API changes or the query logic grows complex.
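
For comparison, the builder call above produces roughly the following raw JSON (a sketch assuming the legacy field query; “gril” stands in for $tag_input, and the returned fields are illustrative):

{
   "query": {
      "bool": {
         "should": [
            { "field": { "tag.edgengram": { "query": "gril" } } },
            { "field": { "alias.edgengram": { "query": "gril" } } }
         ]
      }
   },
   "fields": ["tag", "alias"],
   "explain": false
}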

Also, we’ll be enhancing the monitoring capabilities of ElasticSearch with native collection into our OpenTSDB project. No project is complete without best-of-breed monitoring and alerting, and OpenTSDB’s precise data capture, combined with ElasticSearch’s simple REST interface, will let us monitor the performance and health of our ElasticSearch backends with ease.
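
As a sketch of the idea (the metric name, host tag, and TSD address are hypothetical; 4242 is OpenTSDB’s default port), a collector can poll ElasticSearch’s REST endpoints and emit OpenTSDB’s line-based put protocol:

# Poll cluster health and push one data point to OpenTSDB.
health=$(curl -s 'http://localhost:9200/_cluster/health')
active=$(echo "$health" | sed 's/.*"active_shards":\([0-9]*\).*/\1/')
echo "put elasticsearch.shards.active $(date +%s) $active host=es1" | nc tsd.example.com 4242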

– Ken MacInnis