It was a tweet that fired the imagination like few others. On May 10, 2011, at 1:35 in the afternoon, Eric Brewer told the world he was redesigning the most important operation on the internet.
Brewer, a professor of computer science at the University of California, Berkeley, was headed for Google, where he would help build a brand-new computing platform that could span dozens of data centers across the globe and instantly process requests from billions of people in a matter of milliseconds. “I will be leading the design of the next gen of infrastructure at Google,” he wrote. “The cloud is young: much to do, many left to reach.”
Brewer now regrets the tweet. It leaves out so many other Googlers working alongside him. “I am actually providing design leadership — and an outside perspective,” he tells Wired in an e-mail, “but it is a multi-person effort.” And yet, that’s all he’ll say. Google, you see, treats its globe-spanning infrastructure as the most important of trade secrets.
The web giant believes much of its success stems from its ability to craft software and hardware capable of juggling more data, more quickly than practically any other operation on Earth. And, well, that’s about right. The Googlenet is what so much of the computing world looks to as the modern ideal. Occasionally, the company will reveal pieces of its top-secret infrastructure — which now spans as many as three dozen data centers — and others will follow its lead. The followers include everyone from Facebook, Yahoo, and Twitter to the NSA.
That’s why the tweet was so intriguing. Eric Brewer and his team are building what may be the future of the internet. At this point, we don’t know what all this will look like. But we can at least understand who Eric Brewer is and, to a certain extent, why he was chosen for the task.
I will be leading the design of the next gen of infrastructure at Google. The cloud is young: much to do, many left to reach.
— Eric Brewer (@eric_brewer) May 10, 2011
Before Google, There Was Inktomi
Eric Brewer isn’t just an academic. In the mid-1990s, one of his Berkeley research projects spawned a web search engine called Inktomi. Nowadays, Inktomi is remembered, if it’s remembered at all, as one of the many web search engines that flourished during the dot-com boom before bowing to Google in the decade that followed. But Inktomi was a little different. Before it was purchased by Yahoo in 2002, it pioneered a computing philosophy that served as bedrock not only for the Google empire but for the web as a whole.
When Inktomi was founded in 1996, two years before Google, web search engines and other massive online applications were served from big, beefy machines based on microprocessors that used the RISC architecture and other chips specifically designed for very large tasks. Alta Vista, the dominant search engine prior to the arrival of Inktomi, ran on enormous machines built around the Alpha processor, a RISC chip designed by its parent company, the Digital Equipment Corporation. But Eric Brewer realized that, when building this sort of sprawling application, it made far more sense to spread the load across a sea of servers built for much smaller tasks.
“Eric was able to demonstrate that a cluster of hundreds of cheap computers could actually significantly outperform the fastest supercomputers of the day,” says David Wagner, who studied under Brewer and is now a professor at UC Berkeley specializing in computer security.
This model makes it easier to expand an application — adding new machines as needed — and it makes it easier to accommodate hardware failures. But it also means you’re using technology that improves at a faster clip. “By working with low-end, everyday machines, you benefit from volume. You benefit from the fact that this is what everyone else is buying,” says Wagner. “Volume drives Moore’s Law, so these commodity machines were getting faster at a faster rate than supercomputers.”
Plus, these machines use less power, and when you expand your application to “internet-scale,” power accounts for a significant amount of your overall cost.
The idea at the heart of Inktomi would redefine the internet. Following in the footsteps of Brewer’s company, Google built its search empire on commodity servers equipped with processors based on the x86 architecture Intel originally built for desktop PCs. In 2001, Jim Mitchell and Gary Lauterbach, two bigwigs at Sun Microsystems, visited Google’s server room and saw hundreds of dirt-cheap motherboards slotted into what looked like bread racks you’d find in a bakery. Sun was another company that built big, beefy RISC machines, and though it had close ties to Google, Mitchell and Lauterbach knew it would never sell a single machine to the fledgling search company.
“Those servers are so cheap and use so little power,” Mitchell told Lauterbach, “we have no hope of building a product to help them.”
Google would eventually take this idea to extremes, designing its own stripped-down servers in an effort to save additional cost and power. And the rest of the web followed suit. Today, the web runs on cheap x86 servers, and some large outfits, including Facebook and Amazon, are designing their own machines in an effort to push the outside of the envelope. You could argue this was the only way the web could evolve — and Eric Brewer knew that it would.
“Eric’s big insight was that the internet would soon grow so big that there won’t be any computer big enough to run it — and that the only way to accommodate this was to rethink the architecture of the software so it could run on hundreds of thousands of machines,” says Armando Fox, another Berkeley distributed systems guru who studied with Brewer. “Today, we take that for granted. But in 1995, it was new thinking. Eric rightly gets credit on having that vision before a lot of other people — and executing on it.”
It only makes sense, then, that Google would tap Brewer to help rebuild its infrastructure for the coming decades. The Googlenet is state-of-the-art. But it’s also getting old, and according to one former engineer, it’s already feeling its age.
Brewer fits the bill not only because he has real-world experience with the sort of infrastructure Google is built on, but also because he continues to stretch the boundaries of distributed-systems research. Inktomi made him a millionaire, but he soon returned to the academic world. “When Inktomi went public, I thought I would never see him again. But a couple of years later, he was back at Berkeley,” says David Wagner. “You could tell where his heart was.”
Nowadays, Brewer is best known for the CAP Theorem — or Brewer’s Theorem — which grew out of his experience at Inktomi. The CAP Theorem originated with a 2000 speech given by Brewer and was later mathematically proven by two other academics, MIT’s Nancy Lynch and one of her graduate students, Seth Gilbert. In short, it says that a system the size of the Googlenet always comes with a compromise.
When you spread data across hundreds of machines, the theorem explains, you can guarantee that the data is consistent, meaning every machine using the system has access to the same set of data at the same time. You can guarantee that the system is always available, meaning that each time a machine requests a piece of information, it receives a definitive response. And you can guarantee partition tolerance, meaning the system can continue to operate when part of the system fails. But you can’t guarantee all three. You can guarantee two of the three, but not all.
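The trade-off can be seen in a toy sketch. This is only an illustration under stated assumptions, not how Google, Inktomi, or any real system works: the `TwoNodeStore` class, its `"CP"` and `"AP"` modes, and the `demo` helper are hypothetical names invented here, and a "partition" is modeled as a simple flag that stops two replicas from talking to each other.

```python
class TwoNodeStore:
    """Toy key-value store with two replicas (hypothetical, illustration only)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = {"a": {}, "b": {}}
        self.partitioned = False  # True means the replicas cannot communicate

    def write(self, node, key, value):
        """Write via one replica; return True if the write was accepted."""
        if self.partitioned:
            if self.mode == "CP":
                # Favor consistency: refuse the write rather than let
                # the replicas drift apart. Availability is sacrificed.
                return False
            # Favor availability: accept the write locally, knowing the
            # other replica will not see it. Consistency is sacrificed.
            self.replicas[node][key] = value
            return True
        # No partition: copy the write to every replica.
        for data in self.replicas.values():
            data[key] = value
        return True

    def read(self, node, key):
        return self.replicas[node].get(key)


def demo(mode):
    """Write once normally, then once during a partition."""
    store = TwoNodeStore(mode)
    store.write("a", "x", 1)
    store.partitioned = True
    accepted = store.write("a", "x", 2)
    return accepted, store.read("a", "x"), store.read("b", "x")


print(demo("CP"))  # write refused, both replicas still agree
print(demo("AP"))  # write accepted, replicas now disagree
```

In the `"CP"` run the second write is turned away, so the system is unavailable for that request but both replicas still hold the same value. In the `"AP"` run the write succeeds, but the two replicas return different answers until the partition heals.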
“If you’re working with a large-scale distributed system,” explains Seth Gilbert, now an assistant professor in the department of computer science at the National University of Singapore, “you can’t get everything you want.”
The point, as Brewer explains in a recent article in Computer magazine, is that developers must realize there are tradeoffs to be made in building massively distributed applications with separate “partitions” guaranteed not to fail at the same time. “The CAP theorem asserts that any networked shared-data system can have only two of three desirable properties,” he says. “However, by explicitly handling partitions, designers can optimize consistency and availability, thereby achieving some trade-off of all three.”
According to David Wagner and Seth Gilbert, the theorem had a direct effect on the way distributed systems were built. “Before Eric proposed this, people were trying to build systems that did all three. That’s what you want to tell your customers,” Gilbert says. “It showed people there are tradeoffs. But it also showed them that they needed to focus their efforts, ask themselves: ‘What is most important for the system you’re building?’” If you don’t do this, says David Wagner, you end up with a system that will fail in ways you never anticipated.
Wagner points to Amazon’s popular cloud services as a prime example of a distributed system that was surely built with the CAP Theorem in mind. Amazon partitions its service, dividing it into “availability zones” guaranteed not to fail at the same time, he says, but it doesn’t guarantee consistency across multiple zones.
How will this play into “the next gen of infrastructure at Google”? At this point, we can only speculate. Apparently, the traditional flaw in Google’s infrastructure involved availability. It uses a mechanism called Chubby to keep multiple machines from reading and writing data on a server at the same time, and it’s designed to fail on occasion. According to rumor, this has become increasingly problematic in recent years, as the Google infrastructure expands, and Gilbert guesses that Brewer will seek to solve this limitation. “You would expect them to make a different tradeoff,” he says.
Whatever direction Google takes, you can bet it will look well beyond the status quo. In addition to calling on Brewer, the company has apparently tapped several other engineers with vast experience in this area. Brewer says his desk is within 10 feet of Jeff Dean, Sanjay Ghemawat, and Luiz André Barroso. That would be three of the engineers who designed the Google infrastructure the first time around.