Tuesday, May 17, 2016

Looking under the hood in cloud datacenters

There's a great article in the May 2016 issue of Communications of the ACM authored by a group of Googlers, including former Berkeley PhD student David Oppenheimer (now the tech lead on Kubernetes) and my PhD advisor Eric Brewer (Berkeley professor on leave to be Google's VP Infrastructure), on the evolution of cloud-based management systems at Google. They were doing "containerization" of cloud servers before almost anyone else, and contributed to the Linux containerization code. The article distills over 10 years of designing and working with different approaches to containerizing software to share the hardware in Google's datacenters, and some lessons learned and still-open problems. A couple of key messages:

  • The "unit of virtualization" is no longer the virtual machine but the application, since modern systems like Kubernetes and Docker virtualize the app container and the files and resources on which it depends, but actually hide some details of the underlying OS, instead relying on OS-independent APIs. Apps are therefore less sensitive to OS-specific APIs and easier to port across containers.
  • As a result of having both resource containers and filesystem-virtualization containers map to an app (or small collection of apps that make up a service), troubleshooting and monitoring naturally have affinity to apps rather than (virtual) machines, simplifying devops.
  • A "label" scheme in which a given instance of an app has one or more labels simplifies deployment and devops as well. For example, if one instance of a service is misbehaving, that instance can (eg) have the "production" label removed from it, so that monitoring/load balancing services will exclude it from their operations but leave it running so it can be debugged in situ.
The article also points to still-open problems, including (surprise!) configuration (every configuration DSL eventually becomes Turing-complete and they're all mutually incompatible), dependency management (related to configuration, but also introduces new complications, i.e. what if service X depends on Y but Y must first go through a registration or authentication step before it can be deployed), and more.

Overall a really interesting read that suggests, ironically, that the programmer-facing view of cloud datacenters is actually a "back to the 1990s" view of managing a collection of individual apps with APIs (remember ASPs?) rather than managing a collection of either virtual or physical machine images.

No comments:

Post a Comment