Searching on our
site is an expensive operation:
- When you make a query, your query is sent to all the machines which might have matching articles
- Based on the list of those articles, our system then creates the related phrases, categories or calculates the sentiment of all the matching articles (this is not yet a public feature) and returns them to the searcher.
Especially the second part is an expensive operation. While the first step "only" needs to distribute the search, map docids and sentenceids (for sentence/post level searching), the second part has to dynamically create lists of keywords (do some calculations and also sort them!) based on
all the search results we found in the first step.
This is all being done in a distributed fashion, as you can't have all the information on one machine. Even if the information is distributed over multiple machines, you still have locality problems: If you have seperated your document id space over multiple machines (and you can locally extract a list of keywords), you end up with seperate lists of keywords on multiple machines, which you have to merge. So you have to come up with a clever way of doing this and maybe even change the initial distribution of documents, especially when you want to calculate variations of frequency over time.
Summize (they were independent at first from twitter) is a search engine for
twitter. They provide very fresh but basic search results ordered by date.
One of their advantages (or cleverness) is that they limited their search engine to twitter posts: As it doesn't make sense to sort the twitter posts on relevance (they have a maximum 160 character size), they display their results sorted by date. This gives them a huge advantage:
- They only need to fetch the 10 most recent documents of the entire collection.
- The returned text size is much smaller.
- They don't need to broadcast the query to all of their machines at once.
1) Both sites (ours and theirs) use lucene as backend (you can easily detect this on how you can write queries). While for our search engine, lucene has to fetch all the results (to sort the results by relevance and for us to get all the document ids for further calculations), they do simply have to take the first 10 results sorted by document id (no ordering has to be done if the document ids are already sorted in the index). Lucene is also much faster at returning only a small subset of documents, then returning the whole result list.
2) As their result text size is much smaller, it also make sense to store the twitter posts directly in the index for summize. As you are able to search in both posts and sentences on our site, it would be a waste of space to save the text more than once in the index, so we don't save the posts in the index and fetch the data later, which causes a very small performance penalty at the end.
3) Summize probably (this is how I would do it) broadcasts their query at first only to a handful of servers, or maybe only one server with the most recent data. As many people only search for very common terms, 95% of all queries can probably be handled like that. If not enough results are returned, the query is being broadcast to more servers (probably using an exponential backoff strategy or all at once). As the first server is hit each time, the server is replicated multiple times. The further you go down in time, the servers are less and less replicated as less and less search requests will reach the servers. This is also explains that when you are searching for something very uncommon (eg
http://search.twitter.com/search?q=great+not+and+apple), the query can take up to 14 seconds or trigger an error page if the query is not in the cache or cached in lucene.
Iterend does need to get all the data and broadcasts the query to all the servers.
While Summize's architecture certainly has it's advantages, it also makes it harder to add aditionnal features like sentiment detection or phrase extraction over the entire collection of results, and not only the subset of results that's being displayed.
Eg. in the past, they had a feature where you could enter a query and you would get a map of about 60 postitive/negative/neutral posts. They never added a global overview of all posts, as this would have caused the query to be executed in their entire cluster.
Btw, sample search results for "Summize" are available
here ;).