9.26.2008

Short downtime

As we were making a few changes to our blog spider a few days ago, the spider stopped working a little more than 24 hours ago. Unfortunately we didn't notice this. This is also why there was a 20 hours delay when readwriteweb wrote their article (Thanks!).
Normally articles are added between 30 minutes or an hour after they are posted (depending on how often the blog is being fetched).
We are fixing that bug right now. Sorry for the inconvenience. It should be fixed soon and you can access our site normally.


EDIT:
Everything is working normally again and incoming posts are normally added to the search engine.

9.23.2008

Searching... or why Summize is often faster

Searching on our site is an expensive operation:
  1. When you make a query, your query is sent to all the machines which might have matching articles
  2. Based on the list of those articles, our system then creates the related phrases, categories or calculates the sentiment of all the matching articles (this is not yet a public feature) and returns them to the searcher.
Especially the second part is an expensive operation. While the first step "only" needs to distribute the search, map docids and sentenceids (for sentence/post level searching), the second part has to dynamically create lists of keywords (do some calculations and also sort them!) based on all the search results we found in the first step.
This is all being done in a distributed fashion, as you can't have all the information on one machine. Even if the information is distributed over multiple machines, you still have locality problems: If you have seperated your document id space over multiple machines (and you can locally extract a list of keywords), you end up with seperate lists of keywords on multiple machines, which you have to merge. So you have to come up with a clever way of doing this and maybe even change the initial distribution of documents, especially when you want to calculate variations of frequency over time.


Summize (they were independent at first from twitter) is a search engine for twitter. They provide very fresh but basic search results ordered by date.

One of their advantages (or cleverness) is that they limited their search engine to twitter posts: As it doesn't make sense to sort the twitter posts on relevance (they have a maximum 160 character size), they display their results sorted by date. This gives them a huge advantage:
  1. They only need to fetch the 10 most recent documents of the entire collection.
  2. The returned text size is much smaller.
  3. They don't need to broadcast the query to all of their machines at once.
1) Both sites (ours and theirs) use lucene as backend (you can easily detect this on how you can write queries). While for our search engine, lucene has to fetch all the results (to sort the results by relevance and for us to get all the document ids for further calculations), they do simply have to take the first 10 results sorted by document id (no ordering has to be done if the document ids are already sorted in the index). Lucene is also much faster at returning only a small subset of documents, then returning the whole result list.

2) As their result text size is much smaller, it also make sense to store the twitter posts directly in the index for summize. As you are able to search in both posts and sentences on our site, it would be a waste of space to save the text more than once in the index, so we don't save the posts in the index and fetch the data later, which causes a very small performance penalty at the end.

3) Summize probably (this is how I would do it) broadcasts their query at first only to a handful of servers, or maybe only one server with the most recent data. As many people only search for very common terms, 95% of all queries can probably be handled like that. If not enough results are returned, the query is being broadcast to more servers (probably using an exponential backoff strategy or all at once). As the first server is hit each time, the server is replicated multiple times. The further you go down in time, the servers are less and less replicated as less and less search requests will reach the servers. This is also explains that when you are searching for something very uncommon (eg http://search.twitter.com/search?q=great+not+and+apple), the query can take up to 14 seconds or trigger an error page if the query is not in the cache or cached in lucene.
Iterend does need to get all the data and broadcasts the query to all the servers.

While Summize's architecture certainly has it's advantages, it also makes it harder to add aditionnal features like sentiment detection or phrase extraction over the entire collection of results, and not only the subset of results that's being displayed.
Eg. in the past, they had a feature where you could enter a query and you would get a map of about 60 postitive/negative/neutral posts. They never added a global overview of all posts, as this would have caused the query to be executed in their entire cluster.

Btw, sample search results for "Summize" are available here ;).

9.16.2008

What makes us different from other search engines

We have less team members and less servers than our competitors ;). But besides that:

Upgrade today

Hi,

we increased the number of blog sources from 50 000 to 250 000 blogs.

While this still seems rather low for some people, please keep in mind that we offer additional features (like sentence search, phrase extraction, sentiment detection, clustering, etc..) that take many more ressources than simple traditional search. We are also still a small startup and can't afford (at least not right now ;)) to have thousands of servers running for us, so we will focus on the most linked blogs and not include the smaller blogs for now.

Anyway, we should cover a much bigger diversity of blogs now.