We made a few changes to our blog spider a few days ago, and as a result the spider stopped working a little more than 24 hours ago. Unfortunately, we didn't notice this. This is also why there was a 20-hour delay when ReadWriteWeb wrote their article (thanks!).
Normally, articles are added between 30 minutes and an hour after they are posted (depending on how often the blog is fetched).
We are fixing the bug right now. Sorry for the inconvenience. It should be fixed soon and you can access our site normally.
EDIT:
Everything is working normally again, and incoming posts are being added to the search engine as usual.
9.26.2008
9.23.2008
Searching... or why Summize is often faster
Searching on our site is an expensive operation:
- When you make a query, your query is sent to all the machines which might have matching articles
- Based on the list of those articles, our system then creates the related phrases, categories or calculates the sentiment of all the matching articles (this is not yet a public feature) and returns them to the searcher.
This is all done in a distributed fashion, as you can't keep all the information on one machine. Even when the information is distributed over multiple machines, you still have locality problems: if you have split your document id space over multiple machines (and you can extract a list of keywords locally), you end up with separate lists of keywords on multiple machines, which you then have to merge. So you have to come up with a clever way of doing this, and maybe even change the initial distribution of documents, especially when you want to calculate variations of frequency over time.
Summize (they were independent from Twitter at first) is a search engine for Twitter. They provide very fresh but basic search results ordered by date.
One of their advantages (or their cleverness) is that they limited their search engine to Twitter posts: as it doesn't make sense to sort Twitter posts by relevance (they have a maximum size of 140 characters), they display their results sorted by date. This gives them a huge advantage:
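The merge step described above can be sketched roughly as follows. This is only a minimal illustration, not our actual implementation: the in-memory `SHARDS` list, the function names and the sample keywords are all made up for the example, and threads stand in for real machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each one maps a document id to the keywords of that document.
SHARDS = [
    {1: ["iphone", "apple"], 2: ["apple", "mac"]},
    {3: ["apple", "iphone"], 4: ["android"]},
]

def local_keyword_counts(shard, matching_ids):
    """Count keywords for the matching documents that live on one shard."""
    counts = Counter()
    for doc_id in matching_ids:
        if doc_id in shard:
            counts.update(shard[doc_id])
    return counts

def global_keyword_counts(shards, matching_ids):
    """Send the request to every shard, then merge the partial keyword lists."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda s: local_keyword_counts(s, matching_ids), shards))
    total = Counter()
    for partial in partials:
        total.update(partial)  # the merge: add up per-shard counts
    return total

top_keywords = global_keyword_counts(SHARDS, {1, 2, 3}).most_common(2)
```

The merge itself is cheap here because every shard returns complete counts. On a real cluster each machine would probably return only its top keywords, which is exactly where the merge becomes tricky.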
- They only need to fetch the 10 most recent documents of the entire collection.
- The returned text size is much smaller.
- They don't need to broadcast the query to all of their machines at once.
2) As their result text size is much smaller, it also makes sense for Summize to store the Twitter posts directly in the index. As you are able to search in both posts and sentences on our site, it would be a waste of space to save the text more than once in the index, so we don't save the posts in the index and fetch the data later, which causes a very small performance penalty at the end.
3) Summize probably (this is how I would do it) broadcasts the query at first only to a handful of servers, or maybe only to the one server with the most recent data. As many people only search for very common terms, 95% of all queries can probably be handled like that. If not enough results are returned, the query is broadcast to more servers (probably using an exponential backoff strategy, or to all of them at once). As the first server is hit every time, that server is replicated multiple times. The further you go back in time, the less the servers are replicated, as fewer and fewer search requests will reach them. This also explains why, when you search for something very uncommon (e.g. http://search.twitter.com/search?q=great+not+and+apple), the query can take up to 14 seconds or trigger an error page if the query is not in the cache or cached in Lucene.
Iterend, by contrast, does need all the data and therefore broadcasts the query to all the servers.
While Summize's architecture certainly has its advantages, it also makes it harder to add additional features like sentiment detection or phrase extraction over the entire collection of results, and not only over the subset of results that is being displayed.
E.g. in the past, they had a feature where you could enter a query and get a map of about 60 positive/negative/neutral posts. They never added a global overview of all posts, as this would have caused the query to be executed on their entire cluster.
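To make the guess in point 3 a bit more concrete, here is a small sketch of such a tiered broadcast. It is purely an assumption about how this could work, not Summize's actual code: the shards are plain lists of `(timestamp, text)` tuples ordered from newest to oldest, and `tiered_search` and the sample posts are invented for the example.

```python
# Hypothetical shards, newest first; each post is a (timestamp, text) tuple.
SHARDS_BY_AGE = [
    [(9, "apple event"), (8, "apple pie")],
    [(7, "great day"), (6, "apple juice")],
    [(5, "not great"), (4, "rare word")],
    [(3, "apple"), (2, "rare word again")],
]

def tiered_search(shards, term, wanted=10):
    """Query the newest shard first, then 2 more, then 4 more, and so on,
    stopping as soon as enough matching posts have been collected."""
    results = []
    queried = 0
    start, batch = 0, 1
    while start < len(shards) and len(results) < wanted:
        for shard in shards[start:start + batch]:
            results.extend(post for post in shard if term in post[1])
            queried += 1
        start += batch
        batch *= 2  # widen the broadcast exponentially
    results.sort(reverse=True)  # newest first
    return results[:wanted], queried

common_hits, common_cost = tiered_search(SHARDS_BY_AGE, "apple", wanted=2)
rare_hits, rare_cost = tiered_search(SHARDS_BY_AGE, "rare", wanted=2)
```

With this sample data the common term is answered by the newest shard alone, while the rare term ends up touching all four shards. Because the shards are ordered by time, stopping early never misses a more recent post.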
Btw, sample search results for "Summize" are available here ;).
9.16.2008
What makes us different from other search engines
We have fewer team members and fewer servers than our competitors ;). But besides that:
- We display an overview of what is currently being discussed in the blogosphere, so that you are able to dive into the different areas you are interested in.
- All articles are linked to structured Wikipedia information, which makes it possible to search by category. (e.g. http://blogs.iterend.com/en/?query=category%3A"Swimmer"+category%3A"Medalist"&date=alltime )
- Search results are clustered and you can search on sentence level, post level or blog level.
- Next to the search results, relevant phrases and categories are displayed, so you are able to restrict your search or get an overview of the information you are looking for. (e.g. http://blogs.iterend.com/en/?query=techcrunch&date=alltime)
- You can search for posts related to a given topic, url or cluster (e.g. http://blogs.iterend.com/en/?query=related%3A"iphone"&date=alltime)
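If you want to build such search links from a script, something like the following should work. It is only a sketch: the `query` and `date` parameters and the `category:"..."` and `related:"..."` syntax are taken from the example links above, while the helper names are made up. Note that `urlencode` also escapes the quotes as `%22`, which is equivalent to the links shown above.

```python
from urllib.parse import urlencode

BASE_URL = "http://blogs.iterend.com/en/"

def category(name):
    """Build a category filter such as category:"Swimmer"."""
    return f'category:"{name}"'

def related(topic):
    """Build a related-posts filter such as related:"iphone"."""
    return f'related:"{topic}"'

def search_url(terms, date="alltime"):
    """Join the terms with spaces and encode them as the query parameter."""
    return BASE_URL + "?" + urlencode({"query": " ".join(terms), "date": date})

swimmer_url = search_url([category("Swimmer"), category("Medalist")])
iphone_url = search_url([related("iphone")])
```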
Upgrade today
Hi,
we increased the number of blog sources from 50 000 to 250 000 blogs.
While this may still seem rather low to some people, please keep in mind that we offer additional features (like sentence search, phrase extraction, sentiment detection, clustering, etc.) that take many more resources than simple traditional search. We are also still a small startup and can't afford (at least not right now ;)) to have thousands of servers running for us, so we will focus on the most linked blogs and not include the smaller blogs for now.
Anyway, we should cover a much wider range of blogs now.