Winning at Search: The Algorithm or The Infrastructure?

We are on the eve of Google announcing their 3Q results. Google has become a major force in discovery and advertising by virtue of their ability to surface the most relevant result for a user across the broadest set of queries on the Internet. Dozens of start-ups and certainly a few large players have tried to dethrone Google, but few have succeeded. The switching costs are zero, yet Google’s market share has only gone up. Narrowing the domain has helped: by limiting topical areas to things like shopping or health, companies have created market-share distributions more favorable than in broad search. However, an end user is not going to use or remember 100 different search engines optimized for 100 different topics. In fact, as it has in Health and in Local, Google has picked off verticals one by one to super-optimize. This all got me thinking about how a start-up could ever beat Google at the broad game of search.

Search decomposes into a few different elements. The first is a “spider” – a virtual bot that scours the web, parses web pages, and builds a representation of the web. The second is an algorithm that takes those parsed pieces and decides which pages are more important than others given a set of constraints or inputs. The third is a massive index that stores all of this analysis so that at “query time” an engine can quickly take the digested knowledge and weights and return a result.
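To make the decomposition concrete, here is a toy sketch of those three elements over a three-page pretend web. The data, the scoring rule (a simple in-link count), and every name in it are my own illustration, not a description of how Google actually builds any of this; the point is only how the pieces fit together, and that the heavy lifting happens before query time.

```python
# Toy illustration of the three elements: spider, ranking, index.
# Everything here is hypothetical -- a sketch of the decomposition,
# not how any real engine is implemented.
from collections import defaultdict

# A stand-in "web": page -> (text, outgoing links)
WEB = {
    "a.com": ("cheap flights to boston", ["b.com"]),
    "b.com": ("boston hotels and flights", ["a.com", "c.com"]),
    "c.com": ("restaurant reviews in boston", ["a.com"]),
}

def spider(seed):
    """Element 1: crawl outward from a seed page, parsing each page once."""
    seen, frontier, parsed = set(), [seed], {}
    while frontier:
        url = frontier.pop()
        if url in seen:
            continue
        seen.add(url)
        text, links = WEB[url]
        parsed[url] = text.split()          # crude "parse" into terms
        frontier.extend(links)
    return parsed

def score(parsed):
    """Element 2: assign each page a weight (here, just its in-link count)."""
    inlinks = defaultdict(int)
    for url in parsed:
        for link in WEB[url][1]:
            inlinks[link] += 1
    return inlinks

def build_index(parsed, weights):
    """Element 3: inverted index from term -> pages, pre-sorted by weight."""
    index = defaultdict(list)
    for url, terms in parsed.items():
        for term in set(terms):
            index[term].append(url)
    for term in index:
        index[term].sort(key=lambda u: weights[u], reverse=True)
    return index

# "Query time": the expensive work is already baked into the index.
parsed = spider("a.com")
index = build_index(parsed, score(parsed))
print(index["boston"])   # pages containing "boston", best-weighted first
```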

It’s my view that algorithms are not people- or resource-intensive. A few guys thinking very hard can come up with simple, revolutionary ideas, as Sergey Brin and Larry Page did with PageRank. Sure, Google has an incredible number of variables and residual terms that help refine its algorithm, but at the end of the day, genuinely new math is rarely invented or discovered. In fact, I’d wager a “better algorithm” already exists somewhere in academic labs throughout the country. If it can be written or built by a few people, it is within the realm of startup possibility today.
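For a sense of how small the core math can be, here is a simplified power-iteration rendering of the idea Brin and Page published as PageRank: a page is important if important pages link to it. This is a textbook-style sketch for illustration, not Google's production code; the hard part is running something like it over billions of pages, not the math itself.

```python
# Simplified PageRank via power iteration: rank flows along links,
# damped so that some rank is always spread evenly across all pages.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                 # dangling page: spread evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```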

I tend to believe the biggest challenge for a start-up remains circumventing the need to re-create Google’s infrastructure just to run an algorithm at web scale. Google spends over $2.8bln in CAPEX a year. They spend significantly more in CAPEX than they do on search-algorithm-specific R&D. I have heard estimates that maintenance and improvement of Google’s algorithms can be satisfied by a few hundred engineers, a small number relative to the 5,800 headcount in R&D. Google’s CAPEX buys machines that process huge streams of information, run calculations, and store all that data in massive repositories. In fact, it is estimated that a normal Google search query involves anywhere from 700 to 1,000 servers! Their compute farm grows as the web grows.

To fundamentally change the playing field, a breakthrough is needed in the indexing and spidering scheme. An index can’t require anywhere near the amount of storage that Google currently has on its disks, and the spider must parse pages into that index far more efficiently. Perhaps the spider performs distributed analysis while out in the web rather than in a central location; maybe the index is broken up or organized in a completely novel way. Without breaking that CAPEX curve, a startup would be hard pressed to go as broad as Google and yet be more relevant, given the head start in investment that Google already has.
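As one purely hypothetical illustration of “distributed analysis while out in the web”: imagine each crawler node boiling a page down to a small fixed-size summary before anything is shipped back, so the central index stores kilobytes per page rather than the page itself. The summary format below (top terms plus a content hash) is my own invention for the sketch, not a known production design.

```python
# Hypothetical sketch: a crawler node reduces each page to a compact
# summary locally, so only a tiny payload travels to the central index.
import hashlib
from collections import Counter

TOP_K = 10  # how many terms each node keeps per page

def summarize_at_the_edge(url, page_text):
    """Runs on the crawler node: keep only what ranking will need later."""
    terms = page_text.lower().split()
    return {
        "url": url,
        "fingerprint": hashlib.sha1(page_text.encode()).hexdigest(),
        "top_terms": Counter(terms).most_common(TOP_K),  # small, fixed size
        "length": len(terms),
    }

# The central index only ever sees the compact summaries.
summary = summarize_at_the_edge("example.com", "boston flights boston hotels")
print(summary["top_terms"])   # [('boston', 2), ('flights', 1), ('hotels', 1)]
```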

I fully acknowledge the first objection to the above: Microsoft has all the resources in the world, and it has not been able to replicate Google’s effectiveness. I cannot claim to know how Microsoft’s money has been spent, but my hunch is that Microsoft has tried to catch up by using variants of the same approach as Google. The problem with that is Microsoft started significantly behind, and playing by the same rules will continue to leave them behind. Cashback is an interesting attempt to buy traffic, but startups don’t have that option. I would also concede that the more Google feeds its algorithm with data from increased usage of the engine, the more disadvantaged any new approach becomes.

All that being said, my current bias is that to defeat the Google machine, a start-up needs massive innovations in spidering and indexing (or the concepts they represent), not better algorithms. The few that have started with a better algorithm have always had to narrow their scope once they hit the wall of how much they must spend on capital equipment. I am fascinated by the discussion and would love any feedback on the above. I’d also enjoy reading about anything going on in academia that shows promise. And if you’d like my views on particular sub-segments within search (vertical, social, etc.), feel free to ping me…