Winning at Search: The Algorithm or The Infrastructure?

We are on the eve of Google announcing their results for the third quarter.  Google has become a major force in discovery and advertising by virtue of its ability to surface the most relevant results across the broadest set of queries on the Internet.  Dozens of start-ups, and certainly a few large players, have tried to dethrone Google, but few have succeeded.  The switching costs are zero, yet Google’s market share has only gone up.  Narrowing the domain has helped: by limiting topical areas to things like shopping or health, companies have created market-share distributions more favorable than in broad search.  However, an end user is not going to use or remember 100 different search engines optimized for 100 different topics.  In fact, as it has in Health and in Local, Google has picked off verticals one by one to super-optimize.  All of this got me thinking about how a start-up could ever beat Google at the broad game of search.

Search decomposes into a few elements.  The first is a “spider” – a virtual bot that scours the web, parses web pages, and builds a representation of the web.  The second is an algorithm that takes those parsed pieces and decides which pages are more important than others given a set of constraints or inputs.  The third is a massive index that stores all of this analysis so that at “query time”, an engine can quickly take the digested knowledge and weights and return a result.
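To make that decomposition concrete, here is a toy sketch of all three elements run against a three-page, in-memory “web” (the pages, the crude in-link ranking, and every name here are invented for illustration; a real engine fetches over HTTP and ranks far more cleverly):

```python
from collections import defaultdict

# A toy, in-memory "web": url -> (page text, outgoing links).
WEB = {
    "a.com": ("cheap flights to paris", ["b.com", "c.com"]),
    "b.com": ("paris hotels and flights", ["a.com"]),
    "c.com": ("used cars for sale", ["b.com"]),
}

def spider(seed):
    """Element 1: crawl outward from a seed, returning every page reached."""
    seen, frontier = set(), [seed]
    while frontier:
        url = frontier.pop()
        if url in seen or url not in WEB:
            continue
        seen.add(url)
        frontier.extend(WEB[url][1])  # follow the links parsed off the page
    return seen

def build_index(pages):
    """Element 3: an inverted index mapping each term to the pages containing it."""
    index = defaultdict(set)
    for url in pages:
        for term in WEB[url][0].split():
            index[term].add(url)
    return index

def rank(urls):
    """Element 2: a deliberately crude importance measure -- in-link count."""
    inlinks = {u: sum(u in WEB[v][1] for v in WEB) for u in urls}
    return sorted(urls, key=lambda u: -inlinks[u])

def query(index, terms):
    """Query time: intersect posting lists, then rank the survivors."""
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return rank(hits)

index = build_index(spider("a.com"))
print(query(index, ["paris", "flights"]))  # ['b.com', 'a.com']
```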

It’s my view that algorithms are not people- or resource-intensive.  A few guys thinking very hard can come up with simple, revolutionary ideas, as Sergey Brin and Larry Page did with PageRank.  Sure, Google has an incredible number of variables and residual terms that help refine its algorithm, but at the end of the day, it’s very rare that new math is invented or discovered.  In fact, I’d wager a “better algorithm” already exists somewhere in academic labs throughout the country.  If it can be written or built by a few, it is within the realm of startup possibility today.
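PageRank itself is the best evidence that the core idea fits in a handful of lines.  Here is a minimal power-iteration sketch over a toy link graph (the graph is invented; the 0.85 damping factor is the textbook default):

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration over a link graph: a page is important if important
    pages link to it. `graph` maps each page to its list of outlinks."""
    n = len(graph)
    ranks = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new = {page: (1 - damping) / n for page in graph}
        for page, outlinks in graph.items():
            # Each page splits its current rank evenly among its outlinks.
            for target in outlinks:
                new[target] += damping * ranks[page] / len(outlinks)
        ranks = new
    return ranks

# Toy graph: two pages link to "hub", which links back to "a".
print(pagerank({"a": ["hub"], "b": ["hub"], "hub": ["a"]}))
```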

I tend to believe the biggest challenge for a start-up remains circumventing the need to re-create Google’s infrastructure underneath an algorithm.  Google spends over $2.8bln a year in CAPEX – significantly more than it spends on search-algorithm-specific R&D.  I have heard estimates that maintenance and improvement of Google’s algorithms can be satisfied by a few hundred engineers, a small number relative to the 5,800 headcount in R&D.  Google’s CAPEX buys machines that process huge streams of information, run calculations, and store all that data in massive repositories.  In fact, it is estimated that a normal Google search query involves anywhere from 700 to 1,000 servers!  Their compute farms grow as the web grows.

To fundamentally change the playing field, a breakthrough is needed in the indexing and spidering schema.  An index can’t require anywhere near the amount of storage that Google currently has on its disks, and the spider must more efficiently parse pages to go into that index.  Perhaps the spider performs distributed analysis while out in the web rather than in a central location; maybe the index is broken up or organized in a completely novel way.  Without breaking Google’s CAPEX curve, a startup would be hard pressed to go as broad as Google, and be more relevant, against the head start in investment that Google already has.
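As one illustration of the kind of leverage I mean on the index side: posting lists are just sorted document IDs, and storing the gaps between them in a variable-byte encoding, rather than as raw integers, is a long-known way to slash index storage.  A minimal sketch (the posting list is invented):

```python
def varint_encode(postings):
    """Delta-encode a sorted posting list, then pack each gap into
    variable-length bytes (7 data bits per byte; high bit = more to come)."""
    out, prev = bytearray(), 0
    for doc_id in postings:
        gap = doc_id - prev
        prev = doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)

postings = [1000000, 1000003, 1000009, 1000012]  # doc IDs containing some term
packed = varint_encode(postings)
print(len(packed), "bytes vs", 4 * len(postings), "as raw 32-bit ints")  # 6 bytes vs 16
```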

I fully acknowledge the first objection to the above:  Microsoft has all the resources in the world, and has not been able to replicate Google’s effectiveness.  I cannot claim to know how Microsoft’s money has been spent, but my hunch is that Microsoft has tried to catch up by using variants of the same approach as Google.  The problem is that Microsoft started significantly behind, and playing by the same rules will keep them behind.  Cashback is an interesting attempt to buy traffic, but startups don’t have that option.  I would also concede that the more Google feeds its algorithm with data from increased usage of the engine, the more disadvantaged any new approach becomes.

All that being said, my current bias is that a start-up needs massive innovations in spidering and indexing (or the concepts they represent) to defeat the Google machine, not better algorithms.  The few that have started with a better algorithm have always had to narrow their bounds after running into the wall of capital-equipment spend.  I am fascinated by the discussion and would love any feedback on the above.  I’d also enjoy reading about anything going on in academia that shows promise.  And if you’d like my views on particular sub-segments within search (vertical, social, etc.), feel free to ping me…

Is Cloud Computing Stupid?

This was the supposition of Richard Stallman, founder of the Free Software Foundation.  Coming from a venture investor hoping to back businesses that are ultimately profitable, with strong customer stickiness and sustainable defensibility, it may shock you to hear that I find some of Stallman’s assertions quite reasonable.  The cloud does have the potential to create lock-in under a certain set of circumstances, and cloud offerings can fairly be called proprietary development platforms.  Where I disagree is with the conclusion that customers should therefore stay far away from cloud computing platforms (such as CPUoD, SaaS, and PaaS, as defined in my last post).  In fact, I believe that given the rise of open systems, APIs, and standardized data access and retrieval layers, customers can enjoy all the benefits of a cloud platform while maintaining sufficiently healthy competitive dynamics between vendors to keep them open and honest.

There is an obvious issue with Stallman’s position, which is that only 0.01% of customers have the expertise and resources to build their own server farm out of all open-source components and manage a fully controlled application and data environment.  Putting that aside, I’m focused on the rest of the customers out there, large and small, who only have time to focus on their own value proposition, and for whom time to market makes use of clouds a very seductive option.

Most SaaS applications today can be decomposed into forms that collect data, links that connect to data, workflow that pushes data to people in the right order, analytics that repurpose data “A” into new data “B”, and presentation to display data.  These SaaS applications are “multi-tenant” in nature – meaning there is one version of the application that all customers use.  While there are customizations, 90%+ of the app looks the same from customer to customer.  If an application boils down to a calculation and presentation layer between various “rest states” of data, and a single application serves many customers interchangeably, then “uniqueness” lies in the data, not the application.  Therefore, the primary inhibitor to switching to a different application is concern for one’s data.  The easier I can get my data into and out of an application, the less beholden I am to any one vendor.  And if I am not beholden to a vendor, I can insist on the value proposition I need when purchasing the application.  Thus, to me, the argument boils down to data portability.
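To picture that decomposition concretely – an “application” reduced to nothing but transforms between rest states of data – here is a toy sketch (every name, field, and rule below is invented for illustration):

```python
# A generic "SaaS app" as a pipeline of transforms over resting data.
def form(raw):           # forms: collect and normalize input data
    return {"amount": float(raw["amount"]), "owner": raw["owner"]}

def workflow(record):    # workflow: route data to the right person
    record["approver"] = "manager" if record["amount"] > 1000 else "clerk"
    return record

def analytics(records):  # analytics: repurpose data "A" into new data "B"
    return {"total": sum(r["amount"] for r in records)}

def present(summary):    # presentation: display data
    return f"Total spend: ${summary['total']:.2f}"

records = [workflow(form({"amount": "1500", "owner": "ops"})),
           workflow(form({"amount": "200", "owner": "it"}))]
print(present(analytics(records)))  # Total spend: $1700.00
```

Swap any of these four functions for a competitor’s version and nothing of value is lost, so long as the records themselves can move – which is the point: the uniqueness is in the data.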

As a very simple consumer analogy, let’s pick the fun world of photo upload applications.  If I could easily extract all my Flickr photos and pump them into any competing service (Ofoto, Shutterfly, Picasa), then I could feel fairly comfortable that Flickr is highly incented to offer the best functionality at the best cost.  If they do not, I take my photos out and push them into the superior offering.  While many services do not provide such photo portability, I believe the long-term winners will be those that do, as savvy consumers will flock to them.

In the old days, data was stored in proprietary formats that could only be read by the application writing the data.  In fact, way back, even the physical storage of data to disk was proprietary!  Things have come a long way with the advent of standards such as SCSI, SQL, ODBC/JDBC, and XML, as well as published ways to extract information via APIs over a ubiquitous transport layer, TCP/IP.  Data is isolated from the application and can be extracted via a variety of methods.  Almost all of the major SaaS suppliers today offer APIs (of varying quality, perhaps) to push and pull information out of their applications.  Many also allow connectivity at the database layer and have built-in export functionality.  The means to get at the data are provided by the application provider, and I would expect this to increase significantly over time.
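To make “push and pull via APIs” concrete, here is a minimal export loop against a hypothetical paginated API (the endpoint shape, page size, and record fields are all invented; real SaaS APIs vary widely):

```python
import json

def fetch_page(offset, limit):
    """Stand-in for a vendor API call (e.g. GET /api/records?offset=..&limit=..).
    A real client would issue an HTTP request here; this stub returns canned data."""
    data = [{"id": i, "name": f"record-{i}"} for i in range(25)]
    return data[offset:offset + limit]

def export_all(page_size=10):
    """Pull every record out of the application, page by page, into a
    portable, vendor-neutral format (here: a JSON string)."""
    records, offset = [], 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break
        records.extend(page)
        offset += page_size
    return json.dumps(records)

backup = export_all()
print(f"exported {len(json.loads(backup))} records")  # exported 25 records
```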

The next challenge, after being able to access the data, is to make sure data on one side is intelligible to any other application one might want to use.  Fortunately, there are a number of vendors who offer data integration and migration capabilities in the “cloud”.  As an example, FirstMark has an investment in a company called Boomi.  There are others.  These companies build software that takes the “taxonomy” of one application and translates it for other applications to use.  The applications can be comparable, to migrate from one to another, or they can be complementary, so that one set of data can be leveraged in multiple dimensions and data-input redundancies avoided.
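In spirit, the core of what such vendors do reduces to a mapping between taxonomies.  A toy sketch (the field names and mapping are invented; real products also handle types, validation, and workflow far beyond this):

```python
# Hypothetical field taxonomies for two competing CRM-style applications.
APP_A_TO_APP_B = {
    "FullName":  "contact_name",
    "EmailAddr": "email",
    "Acct":      "account_id",
}

def translate(record, mapping):
    """Re-key a record from one application's taxonomy into another's,
    dropping fields the target application has no equivalent for."""
    return {mapping[k]: v for k, v in record.items() if k in mapping}

exported = {"FullName": "Ada Lovelace", "EmailAddr": "ada@example.com", "Legacy": "x"}
print(translate(exported, APP_A_TO_APP_B))
# {'contact_name': 'Ada Lovelace', 'email': 'ada@example.com'}
```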

If data is portable, then customers benefit greatly by leveraging a “cloud”.  Cloud vendors have extraordinary CAPEX leverage, of a kind few companies can match.  The bandwidth and storage consumed by users of EC2 & S3 now exceed that of Amazon.com and all its other sites combined!  Quite a striking example, and it’s hard to fathom matching that kind of purchasing power.  In addition, the people and software investments to scale the infrastructure – the processes and procedures, the knowledge – are all very costly to duplicate.  Done right, clouds can be a much cheaper place to operate, and they allow customers to focus on their core value proposition, as long as those customers insist on data flexibility.

The above is also true for PaaS vendors.  Most PaaS vendors go out of their way to note that applications built on their platform have APIs built in out of the gate.  Now, it is true that ISVs choosing a PaaS platform are buying into a proprietary programming style.  They are also at the mercy of the PaaS vendor’s viability, and of its restraint in not jumping into the SaaS game by building competing applications.  But ISVs have the same data portability options as an end customer.  If they choose to build on another PaaS, they simply have to ensure their PaaS vendor allows them to pump data from one platform to the other.

None of this is easy.  Data movement has always been challenging.  But I believe we are now in a permanent era in which you cannot “hide” data behind layers upon layers of proprietary code.  Customers and ISVs must insist that any cloud vendor they choose provide easy and standardized means to access and move their data.  If we all do a good job insisting and asking the right questions, the winners in the cloud battles will be those that embrace openness and portability, and that focus on retaining customers by having the best application rather than by scaring them with lock-in.

What is Cloud Computing?

Given Larry Ellison’s recent objections to the term “cloud computing”, and that I will likely write about the space often, I thought I would take a shot at defining things that get lumped into the term. 

I tend to agree that “cloud computing” is an abused term, but if you parse the various definitions, I think you come out with four categories:

·         Co-location and web hosters:  The forefathers of the cloud computing space.  They created specialized data centers with redundant infrastructure (power, network connectivity, etc.) for third parties to leverage.  Customers were separated by cages, where they could put their own servers into racks (or lease the hoster’s servers).  Applications and data were technically outside the offices of the customer, and accessed via IP protocols and the Internet cloud.  Put the Internet cloud together with computing that happens elsewhere, and one could conceptually call that “cloud computing”.

·         CPU/Storage on demand (“CPUoD”):  These players start with their own data center facilities and servers, but have leveraged the explosion in hypervisors to virtualize server pools.  They then layer on a standardized OS environment, web servers, load balancers, databases, etc.  The application must be built for that run-time environment, but if it is, one simply focuses on developing the application and buys compute/storage that executes the software and stores the data under usage-driven pricing (a back-of-the-envelope pricing sketch follows this list).  Some players optimize for specific languages, such as Google’s App Engine for Python, while others differentiate with specialized diagnostics and monitoring services on top of their cloud.  Some are stateful, some are stateless; some offer persistent storage, some dynamic storage.  But at the end of the day, it is a standardized operating environment for which one pays per GHz and/or GB to run ANY application.  I’d view this as the basic “brick” of cloud computing.

·         Software as a service (“SaaS”):  On the other end of the spectrum, software-as-a-service providers build all the way up through the application/UI layer to offer a business function to the end user in a shared, multi-tenant, recurring-revenue model.  While extensible and customizable, it is one instance of the software that serves many customers.  It is often lumped into cloud computing because the data center cost (where the software executes and the data resides) and the assumed scalability are bundled into the price charged to the end user for the application.  The vendor can either 1) take their own racks, cages, and servers (as in the first option above) to build their own internal CPUoD environment and write their application on top of that controlled stack, or 2) use a CPUoD provider and write their application for that environment.  The end user pays for an application that scales with usage (which may or may not require more compute), but the scalability and cost of the infrastructure are hidden from the user.  From the customer’s standpoint, this is a “cloud” + application.  But buyer beware: as Bob Moul of Boomi points out, many things calling themselves SaaS are not.

·         Platform as a service (“PaaS”):  This is the newest category.  It began when Salesforce realized that their SaaS application could be decomposed into more basic units that could be building blocks for any application: forms, tabs, and links, tied together with workflow logic and wrapped around data.  Force.com is a generic representation of an application – no data, no logic, but all the means to present, push, and pull information.  To build an application, one “programs visually”: customize a form, create a workflow for the application, specify the data types via fields, and your app is built.  PaaS removes the engineering-level concepts of writing code in languages like C++ or Java (compiling, debugging, inheritance, message passing, etc.), and incorporates the infrastructure scalability of CPUoD.  Like SaaS, the purchaser of an application built on a PaaS platform pays an application fee and assumes the infrastructure scales transparently.  Unlike SaaS, PaaS creates multi-tenancy across applications!  There is a single shared instance of a platform that supports multiple applications running on one or many CPUoD infrastructures.
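Here is the back-of-the-envelope sketch of CPUoD’s usage-driven pricing promised above; the rates are illustrative placeholders, not any vendor’s actual price list:

```python
# Back-of-the-envelope, usage-driven CPUoD pricing.  Both rates below are
# invented for illustration only.
CPU_RATE_PER_GHZ_HOUR = 0.05      # $ per GHz-hour of compute
STORAGE_RATE_PER_GB_MONTH = 0.15  # $ per GB-month of storage

def monthly_bill(avg_ghz, storage_gb, hours=730):
    """Pay only for what you use: compute metered in GHz-hours,
    storage metered in GB-months (~730 hours in a month)."""
    return avg_ghz * hours * CPU_RATE_PER_GHZ_HOUR + storage_gb * STORAGE_RATE_PER_GB_MONTH

# A small app averaging 2 GHz of compute with 50 GB stored:
print(f"${monthly_bill(2, 50):.2f}/month")  # $80.50/month
```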

Where’s the opportunity for startups?  Well, building and running clouds is a complex and costly activity.  It’s hard to envision a young company having any comparable buying leverage on the CAPEX side.  One cannot hope to get anywhere near the same discount as Google on CPUs and motherboards.  And people use Amazon because it’s cheap.  The only hopes I see for companies to make it are 1) differentiated scaling systems that drive down the OPEX cost equation, 2) such a differentiated coding/support environment that people are willing to pay a real premium, or 3) gaining critical mass in a specific ecosystem of diverse applications that generates a network effect for one’s cloud.  The other area I like is plays that ride on top of clouds, providing value-added services that fill gaps left by the CPUoD/SaaS/PaaS providers.  That shifts the game from an economic-capital exercise to an intellectual-capital exercise, where nimble innovators thrive!