If you have been following the news over the last few days, you could get the impression that the end of Google is imminent.
– Cuil Exits Stealth Mode With A Massive Search Engine
– Google Beats Cuil Hands Down In Size And Relevance, But That Isn’t The Whole Story
– And….Cuil Goes Offline
– Do Not Mistype Cuil
The traditional media were already speculating about the death of Google:
– The Google Killer is here..it’s gone..it’s back again
– Will Cuil Kill Google?
– Cuil Hopes to be a Google-Killer
This made me very curious, so I started reading a lot of articles about Cuil. The search engine business is not a world of miracles; its knowledge is simply not spread very far. So when someone claims to do something completely new, like “indexing at 10% of Google’s cost”, it calls for attention.
One interesting observation is that it seems to be very difficult for most media organizations to get their facts straight. I read about as many explanations of the difference between Cuil and Google as I read reports. Most media outlets seem not to know what the PageRank algorithm is, yet have no problem explaining it incorrectly. ZDNet Germany, for example, writes: “The most important difference between Cuil and Google is the method with which the search engine evaluates the relevance of search results. While Google captures all clicks on a search result and calculates the relevance of a page in terms of the PageRank, the startup uses a contextual evaluation of the results.”
This is just plain wrong. PageRank has nothing to do with clicking on search results. PageRank is an algorithm that uses the link structure of the web to estimate the importance of pages. It is based on the concept of a random surfer who travels the web by randomly clicking the links on a page: the more often the random surfer lands on a page, the higher that page’s rank. (The anchor text of incoming links is a separate signal, used to match a page to keywords.) Oh, and by the way: the name doesn’t derive from “page” as in web page, but from “Page” as in Larry Page.
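To make the random-surfer idea concrete, here is a minimal PageRank sketch using power iteration. This is an illustrative toy, not Google’s implementation; the graph, function name, and parameters are my own choices, and real implementations add many refinements.

```python
# Minimal PageRank via power iteration. Rank mass flows along links;
# the damping factor d models the random surfer, who with probability
# (1 - d) jumps to a random page instead of following a link.
def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with uniform rank
    for _ in range(iterations):
        new_rank = {p: (1.0 - d) / n for p in pages}  # teleport share
        for page, outgoing in links.items():
            if outgoing:
                share = d * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += d * rank[page] / n
        rank = new_rank
    return rank

# Toy web: "c" is linked to by both "b" and "d", so it collects the
# most rank mass and ends up as the top-ranked page.
graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # prints "c"
```

The whole computation is a few dozen lines over the global link graph, which is why it is routinely assigned as a student exercise; the expensive part at web scale is the size of that graph, not the algorithm itself.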
Golem boldly asserts that the difference is that Cuil “has collected 121 billion web pages and sorts them not only by link analysis and traffic, but also captures the content and tries to put it into context”. The article contradicts itself, contradicts other media, and contradicts reality.
First, the article gives the impression that this is a large number compared to Google’s index size. Google stopped publishing its index size some years ago, so nobody can check that claim directly. But if you check how many results common searches like “dog” or “house” return, Cuil reports only half as many as Google. So either they are not very effective at finding anything in their pile of web pages, or 121 billion is not as impressive a number as it sounds.
Next, the article claims they use link analysis. That sounded strange to me, as they claim NOT to use PageRank, the most common scoring algorithm based on link analysis. So what type of link analysis do they use?
I had to do more research. Luckily, I found a great article by one of the Cuil protagonists that reveals a lot about how these people think about search engine technology. They do not use link analysis or scoring because “Page rank is lengthy analysis of a global nature and will cause you to buy more machines and get bogged down on this one complicated step”.
Obviously they are not able to produce a simple PageRank implementation (every student at our university builds one as part of the reinforcement learning courses), and so they simply ignore the well-known problem of spamdexing.
Obviously, when the technology that Cuil uses was state of the art, the Cuil people were not yet in the industry. Otherwise they would know that a whole generation of search engines died because it did not use PageRank: WebCrawler went bust, and AltaVista now buys its search results from Yahoo.
So, they don’t use link analysis. Do they use traffic analysis? Cuil claims not to record any personal data about its users, so obviously they don’t analyze their own traffic. And they don’t provide a service like analytics that would let them capture the traffic on other sites. So what kind of traffic do they analyze? I conclude that Golem just added some random attributes in the hope that nobody would know they are wrong.
The last claim is that they only analyze the content. That I can believe. It is consistent with the Patterson paper, which dismisses almost every achievement of search engine technology of the last ten years as too complicated and essentially unnecessary. At no point does the paper address the real problem of people who stuff their pages with popular keywords just to lure visitors to porn. All thinking about Cuil seems to be based on the premise that every website will cooperate nicely. When you see how much porn simple searches already return on Cuil, this does not promise a bright future.
TechCrunch spotlights that Cuil can crawl 90% cheaper than Google. This may be true, as Cuil has eliminated almost every complex procedure from the crawl. But when you read the Patterson paper, you discover that they just moved those procedures to query time. So if you add up all the costs of the search engine, you don’t get any cost advantage. I would bet it actually gets more expensive, because there are far fewer ways to optimize at query time.
So, in the end, the only achievement Cuil can claim is to sell a product that was discredited ten years ago and to make people believe it is a revolution. It is a bit like selling unpasteurized milk while claiming that getting rid of the bacteria is just too expensive and that people want it cheap.
If you want to read a good overview of the subject after all this ranting, have a look at the new marketing blog.