diff --git a/docs/design/control-plane-resilience.md b/docs/design/control-plane-resilience.md
new file mode 100644
index 00000000000..8becccec19b
--- /dev/null
+++ b/docs/design/control-plane-resilience.md
@@ -0,0 +1,269 @@

PLEASE NOTE: This document applies to the HEAD of the source tree

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).

--

# Kubernetes/Ubernetes Control Plane Resilience

## Long Term Design and Current Status

### by Quinton Hoole, Mike Danese and Justin Santa-Barbara

### December 14, 2015

## Summary

Some confusion exists around how we currently ensure, and in future
want to ensure, resilience of the Kubernetes (and by implication
Ubernetes) control plane. This document is an attempt to capture that
definitively. It covers areas including self-healing, high
availability, bootstrapping and recovery. Most of the information in
this document already exists in the form of github comments,
PR's/proposals, scattered documents, and corridor conversations, so
this document is primarily a consolidation and clarification of
existing ideas.

## Terms

* **Self-healing:** automatically restarting or replacing failed
  processes and machines without human intervention
* **High availability:** continuing to be available and work correctly
  even if some components are down or uncontactable. This typically
  involves multiple replicas of critical services, and a reliable way
  to find available replicas. Note that it's possible (but not
  desirable) to have high availability properties (e.g. multiple
  replicas) in the absence of self-healing properties (e.g. if a
  replica fails, nothing replaces it). Fairly obviously, given enough
  time, such systems typically become unavailable (after enough
  replicas have failed).
* **Bootstrapping:** creating an empty cluster from nothing
* **Recovery:** recreating a non-empty cluster after perhaps
  catastrophic failure/unavailability/data corruption

## Overall Goals

1. **Resilience to single failures:** Kubernetes clusters constrained
   to single availability zones should be resilient to individual
   machine and process failures by being both self-healing and highly
   available (within the context of such individual failures).
1. **Ubiquitous resilience by default:** The default cluster creation
   scripts for (at least) GCE, AWS and basic bare metal should adhere
   to the above (self-healing and high availability) by default (with
   options available to disable these features to reduce control plane
   resource requirements if required). It is hoped that other
   cloud providers will also follow the above guidelines, but the
   above 3 are the primary canonical use cases.
1. **Resilience to some correlated failures:** Kubernetes clusters
   which span multiple availability zones in a region should by
   default be resilient to complete failure of one entire availability
   zone (by similarly providing self-healing and high availability in
   the default cluster creation scripts as above).
1. **Default implementation shared across cloud providers:** The
   differences between the default implementations of the above for
   GCE, AWS and basic bare metal should be minimized. This implies
   using shared libraries across these providers in the default
   scripts in preference to highly customized implementations per
   cloud provider. This is not to say that highly differentiated,
   customized per-cloud cluster creation processes (e.g. for GKE on
   GCE, or some hosted Kubernetes provider on AWS) are discouraged.
   But those fall squarely outside the basic cross-platform OSS
   Kubernetes distro.
1. **Self-hosting:** Where possible, Kubernetes' existing mechanisms
   for achieving system resilience (replication controllers, health
   checking, service load balancing etc.) should be used in preference
   to building a separate set of mechanisms to achieve the same thing.
   This implies that self-hosting (the Kubernetes control plane on
   Kubernetes) is strongly preferred, with the caveat below.
1. **Recovery from catastrophic failure:** The ability to quickly and
   reliably recover a cluster from catastrophic failure is critical,
   and should not be compromised by the above goal to self-host
   (i.e. it goes without saying that the cluster should be quickly and
   reliably recoverable, even if the cluster control plane is
   broken). This implies that such catastrophic failure scenarios
   should be carefully thought out, and be the subject of regular
   continuous integration testing and disaster recovery exercises.

## Relative Priorities

1. **(Possibly manual) recovery from catastrophic failures:** Having a
   Kubernetes cluster, and all applications running inside it,
   disappear forever is perhaps the worst possible failure mode. So it
   is critical that we be able to recover the applications running
   inside a cluster from such failures in some well-bounded time
   period.
   1. In theory a cluster can be recovered by replaying all API calls
      that have ever been executed against it, in order, but most
      often that state has been lost, and/or is scattered across
      multiple client applications or groups. So in general it is
      probably infeasible.
   1. In theory a cluster can also be recovered to some relatively
      recent non-corrupt backup/snapshot of the disk(s) backing the
      etcd cluster state. But we have no default consistent
      backup/snapshot, verification or restoration process. And we
      don't routinely test restoration, so even if we did routinely
      perform and verify backups, we have no hard evidence that we
      can in practice effectively recover from catastrophic cluster
      failure or data corruption by restoring from these backups. So
      there's more work to be done here.
1. **Self-healing:** Most major cloud providers provide the ability to
   easily and automatically replace failed virtual machines within a
   small number of minutes (e.g. GCE
   [Auto-restart](https://cloud.google.com/compute/docs/instances/setting-instance-scheduling-options#autorestart)
   and Managed Instance Groups,
   AWS [Auto-recovery](https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/)
   and [Auto scaling](https://aws.amazon.com/autoscaling/) etc). This
   can fairly trivially be used to reduce control-plane down-time due
   to machine failure to a small number of minutes per failure
   (i.e. typically around "3 nines" availability), provided that:
   1. cluster persistent state (i.e. etcd disks) is either:
      1. truly persistent (i.e. remote persistent disks), or
      1. reconstructible (e.g. using etcd [dynamic member
         addition](https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member)
         or [backup and
         recovery](https://github.com/coreos/etcd/blob/master/Documentation/admin_guide.md#disaster-recovery)).
   1. and boot disks are either:
      1. truly persistent (i.e. remote persistent disks), or
      1. reconstructible (e.g. using boot-from-snapshot,
         boot-from-pre-configured-image or
         boot-from-auto-initializing image).
1. **High Availability:** This has the potential to increase
   availability above the approximately "3 nines" level provided by
   automated self-healing, but it's somewhat more complex, and
   requires additional resources (e.g. redundant API servers and etcd
   quorum members). In environments where cloud-assisted automatic
   self-healing might be infeasible (e.g. on-premise bare-metal
   deployments), it also gives cluster administrators more time to
   respond (e.g. replace/repair failed machines) without incurring
   system downtime.

## Design and Status (as of December 2015)

### API Server

**Resilience Plan:**

Multiple stateless, self-hosted, self-healing API servers behind an HA
load balancer, built out by the default "kube-up" automation on GCE,
AWS and basic bare metal (BBM). Note that the single-host approach of
having etcd listen only on localhost to ensure that only the API server
can connect to it will no longer work, so alternative security will be
needed in this regard (either using firewall rules, SSL certs, or
something else). All necessary flags are currently supported to enable
SSL between the API server and etcd (OpenShift runs like this out of
the box), but this needs to be woven into the "kube-up" and related
scripts. Detailed design of self-hosting and related bootstrapping and
catastrophic failure recovery will be covered in a separate design doc.

**Current Status:**

No scripted self-healing or HA on GCE, AWS or basic bare metal
currently exists in the OSS distro. To be clear, "no self-healing"
means that even if multiple e.g. API servers are provisioned for HA
purposes, if they fail, nothing replaces them, so eventually the
system will fail. Self-healing and HA can be set up manually by
following documented instructions, but this is not currently an
automated process, and it is not tested as part of continuous
integration. So it's probably safest to assume that it doesn't
actually work in practice.

### Controller manager and scheduler

**Resilience Plan:**

Multiple self-hosted, self-healing, warm-standby, stateless controller
managers and schedulers with leader election and automatic failover of
API server clients, automatically installed by the default "kube-up"
automation.

**Current Status:**

As above.

### etcd

**Resilience Plan:**

Multiple (3-5) etcd quorum members behind a load balancer with session
affinity (to prevent clients from being bounced from one member to
another).

Regarding self-healing, if a node running etcd goes down, it is always
necessary to do three things:

1. allocate a new node (not necessary if running etcd as a pod, in
   which case specific measures are required to prevent user pods from
   interfering with system pods, for example using node selectors),
1. start an etcd replica on that new node, and
1. have the new replica recover the etcd state (e.g. via etcd
   [dynamic member addition](https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member)).

In the case of a remote persistent disk, the etcd state can be
recovered by attaching the remote persistent disk to the replacement
node, so the state is recoverable even if all other replicas are down.

There are also significant performance differences between local disks
and remote persistent disks. For example, the sustained throughput of
local disks in GCE is approximately 20x that of remote disks.

Hence we suggest that self-healing be provided by remotely mounted
persistent disks in non-performance-critical, single-zone cloud
deployments. For performance-critical installations, faster local
SSD's should be used, in which case remounting on node failure is not
an option, so etcd runtime configuration should be used to replace the
failed machine. Similarly, for cross-zone self-healing, cloud
persistent disks are zonal, so automatic runtime configuration is
required. Similarly, basic bare metal deployments cannot generally
rely on remote persistent disks, so the same approach applies there.

**Current Status:**

Somewhat vague instructions exist on how to set some of this up
manually in a self-hosted configuration. But automatic bootstrapping
and self-healing are not described (and are not implemented for the
non-PD cases). This all still needs to be automated and continuously
tested.
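To make the backup-based recovery path discussed above more concrete,
below is a minimal sketch of an etcd v2-style backup and
disaster-recovery flow. The data directory paths are illustrative, the
authoritative procedure is the etcd admin guide linked earlier, and
none of this is currently part of any "kube-up" automation.

```sh
# Take a consistent backup of the etcd data directory
# (paths are illustrative).
etcdctl backup \
  --data-dir /var/lib/etcd \
  --backup-dir /var/lib/etcd-backup

# After a catastrophic failure, restart etcd as a single-member cluster
# from the backup, then grow the quorum back via runtime reconfiguration
# (dynamic member addition).
etcd \
  --data-dir /var/lib/etcd-backup \
  --force-new-cluster
```

Any such procedure would need to be wired into the cluster turn-up
scripts and exercised regularly (per the continuous integration and
disaster recovery testing goals above) before it could be relied upon.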
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/control-plane-resilience.md?pixel)]()

diff --git a/docs/design/federated-services.md b/docs/design/federated-services.md
new file mode 100644
index 00000000000..6febfb2137e
--- /dev/null
+++ b/docs/design/federated-services.md
@@ -0,0 +1,550 @@

PLEASE NOTE: This document applies to the HEAD of the source tree

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).

--

# Kubernetes Cluster Federation (a.k.a. "Ubernetes")

## Cross-cluster Load Balancing and Service Discovery

### Requirements and System Design

### by Quinton Hoole, Dec 3 2015

## Requirements

### Discovery, Load-balancing and Failover

1. **Internal discovery and connection**: Pods/containers (running in
   a Kubernetes cluster) must be able to easily discover and connect
   to endpoints for Kubernetes services on which they depend in a
   consistent way, irrespective of whether those services exist in a
   different Kubernetes cluster within the same cluster federation.
   Henceforth referred to as "cluster-internal clients", or simply
   "internal clients".
1. **External discovery and connection**: External clients (running
   outside a Kubernetes cluster) must be able to discover and connect
   to endpoints for Kubernetes services on which they depend.
   1. **External clients predominantly speak HTTP(S)**: External
      clients are most often, but not always, web browsers, or at
      least speak HTTP(S) - notable exceptions include Enterprise
      Message Buses (Java, TLS), DNS servers (UDP), SIP servers and
      databases.
1. **Find the "best" endpoint:** Upon initial discovery and
   connection, both internal and external clients should ideally find
   "the best" endpoint if multiple eligible endpoints exist. "Best"
   in this context implies the closest (by network topology) endpoint
   that is both operational (as defined by some positive health check)
   and not overloaded (by some published load metric). For example:
   1. An internal client should find an endpoint which is local to its
      own cluster if one exists, in preference to one in a remote
      cluster (if both are operational and non-overloaded).
      Similarly, one in a nearby cluster (e.g. in the same zone or
      region) is preferable to one further afield.
   1. An external client (e.g. in New York City) should find an
      endpoint in a nearby cluster (e.g. U.S. East Coast) in
      preference to one further away (e.g. Japan).
1. **Easy fail-over:** If the endpoint to which a client is connected
   becomes unavailable (no network response/disconnected) or
   overloaded, the client should reconnect to a better endpoint,
   somehow.
   1. In the case where there exist one or more connection-terminating
      load balancers between the client and the serving Pod, failover
      might be completely automatic (i.e. the client's end of the
      connection remains intact, and the client is completely
      oblivious of the fail-over). This approach incurs network speed
      and cost penalties (by traversing possibly multiple load
      balancers), but requires zero smarts in clients, DNS libraries,
      recursing DNS servers etc, as the IP address of the endpoint
      remains constant over time.
   1. In a scenario where clients need to choose between multiple load
      balancer endpoints (e.g. one per cluster), multiple DNS A
      records associated with a single DNS name enable even relatively
      dumb clients to try the next IP address in the list of returned
      A records (without even necessarily re-issuing a DNS resolution
      request). For example, all major web browsers will try all A
      records in sequence until a working one is found (TBD: justify
      this claim with details for Chrome, IE, Safari, Firefox).
   1.
In a slightly more sophisticated scenario, upon disconnection, a + smarter client might re-issue a DNS resolution query, and + (modulo DNS record TTL's which can typically be set as low as 3 + minutes, and buggy DNS resolvers, caches and libraries which + have been known to completely ignore TTL's), receive updated A + records specifying a new set of IP addresses to which to + connect. + +### Portability + +A Kubernetes application configuration (e.g. for a Pod, Replication +Controller, Service etc) should be able to be successfully deployed +into any Kubernetes Cluster or Ubernetes Federation of Clusters, +without modification. More specifically, a typical configuration +should work correctly (although possibly not optimally) across any of +the following environments: + +1. A single Kubernetes Cluster on one cloud provider (e.g. Google + Compute Engine, GCE) +1. A single Kubernetes Cluster on a different cloud provider + (e.g. Amazon Web Services, AWS) +1. A single Kubernetes Cluster on a non-cloud, on-premise data center +1. A Federation of Kubernetes Clusters all on the same cloud provider + (e.g. GCE) +1. A Federation of Kubernetes Clusters across multiple different cloud + providers and/or on-premise data centers (e.g. one cluster on + GCE/GKE, one on AWS, and one on-premise). + +### Trading Portability for Optimization + +It should be possible to explicitly opt out of portability across some +subset of the above environments in order to take advantage of +non-portable load balancing and DNS features of one or more +environments. More specifically, for example: + +1. For HTTP(S) applications running on GCE-only Federations, + [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules) + should be usable. These provide single, static global IP addresses + which load balance and fail over globally (i.e. across both regions + and zones). These allow for really dumb clients, but they only + work on GCE, and only for HTTP(S) traffic. +1. For non-HTTP(S) applications running on GCE-only Federations within + a single region, + [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/) + should be usable. These provide TCP (i.e. both HTTP/S and + non-HTTP/S) load balancing and failover, but only on GCE, and only + within a single region. + [Google Cloud DNS](https://cloud.google.com/dns) can be used to + route traffic between regions (and between different cloud + providers and on-premise clusters, as it's plain DNS, IP only). +1. For applications running on AWS-only Federations, + [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/) + should be usable. These provide both L7 (HTTP(S)) and L4 load + balancing, but only within a single region, and only on AWS + ([AWS Route 53 DNS service](https://aws.amazon.com/route53/) can be + used to load balance and fail over across multiple regions, and is + also capable of resolving to non-AWS endpoints). + +## Component Cloud Services + +Ubernetes cross-cluster load balancing is built on top of the following: + +1. [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules) + provide single, static global IP addresses which load balance and + fail over globally (i.e. across both regions and zones). These + allow for really dumb clients, but they only work on GCE, and only + for HTTP(S) traffic. +1. 
[GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/)
   provide both HTTP(S) and non-HTTP(S) load balancing and failover,
   but only on GCE, and only within a single region.
1. [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/)
   provide both L7 (HTTP(S)) and L4 load balancing, but only within a
   single region, and only on AWS.
1. [Google Cloud DNS](https://cloud.google.com/dns) (or any other
   programmable DNS service, like
   [CloudFlare](http://www.cloudflare.com)) can be used to route
   traffic between regions (and between different cloud providers and
   on-premise clusters, as it's plain DNS, IP only). Google Cloud DNS
   doesn't provide any built-in geo-DNS, latency-based routing, health
   checking, weighted round robin or other advanced capabilities.
   It's plain old DNS. We would need to build all the aforementioned
   on top of it. It can provide internal DNS services (i.e. serve RFC
   1918 addresses).
   1. [AWS Route 53 DNS service](https://aws.amazon.com/route53/) can
      be used to load balance and fail over across regions, and is
      also capable of routing to non-AWS endpoints. It provides
      built-in geo-DNS, latency-based routing, health checking,
      weighted round robin and optional tight integration with some
      other AWS services (e.g. Elastic Load Balancers).
1. Kubernetes L4 Service Load Balancing: This provides both a
   [virtual cluster-local](http://kubernetes.io/v1.1/docs/user-guide/services.html#virtual-ips-and-service-proxies)
   and a
   [real externally routable](http://kubernetes.io/v1.1/docs/user-guide/services.html#type-loadbalancer)
   service IP which is load-balanced (currently simple round-robin)
   across the healthy pods comprising a service within a single
   Kubernetes cluster.
1. [Kubernetes Ingress](http://kubernetes.io/v1.1/docs/user-guide/ingress.html):
   A generic wrapper around cloud-provided L4 and L7 load balancing
   services, and roll-your-own load balancers running in pods,
   e.g. HA Proxy.

## Ubernetes API

The Ubernetes API for load balancing should be compatible with the
equivalent Kubernetes API, to ease porting of clients between
Ubernetes and Kubernetes. Further details below.

## Common Client Behavior

To be useful, our load balancing solution needs to work properly with
real client applications. There are a few different classes of
those...

### Browsers

These are the most common external clients. These are all
well-written. See below.

### Well-written clients

1. Do a DNS resolution every time they connect.
1. Don't cache beyond TTL (although a small percentage of the DNS
   servers on which they rely might).
1. Do try multiple A records (in order) to connect.
1. (in an ideal world) Do use SRV records rather than hard-coded port
   numbers (see the illustrative DNS records after this section).

Examples:

+ all common browsers (except for SRV records)
+ ...

### Dumb clients

1. Don't do a DNS resolution every time they connect (or do cache
   beyond the TTL).
1. Do try multiple A records.

Examples:

+ ...

### Dumber clients

1. Only do a DNS lookup once on startup.
1. Only try the first returned DNS A record.

Examples:

+ ...

### Dumbest clients

1. Never do a DNS lookup - are pre-configured with a single (or
   possibly multiple) fixed server IP(s). Nothing else matters.
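For illustration, here is roughly what a well-written client might see
when resolving a federated service name. The zone, IP addresses and
SRV record below are hypothetical, and assume a federated DNS setup
along the lines described later in this document (multiple A records,
one per cluster load balancer, plus an optional SRV record carrying
the service port):

    $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.117.10
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77

    $ dig +noall +answer SRV _client._tcp.my-service.my-namespace.my-federation.my-domain.com
    _client._tcp.my-service.my-namespace.my-federation.my-domain.com 180 IN SRV 10 50 2379 my-service.my-namespace.my-federation.my-domain.com.

A well-written client would try each returned A record in turn, and
re-resolve (respecting the 180 second TTL) after a disconnection; the
dumber classes of client above stop at one or other of those steps.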
## Architecture and Implementation

### General control plane architecture

Each cluster hosts one or more Ubernetes master components (Ubernetes
API servers, controller managers with leader election, and etcd quorum
members). This is documented in more detail in a
[separate design doc: Kubernetes/Ubernetes Control Plane Resilience](https://docs.google.com/document/d/1jGcUVg9HDqQZdcgcFYlWMXXdZsplDdY6w3ZGJbU7lAw/edit#).

In the description below, assume that 'n' clusters, named
'cluster-1'... 'cluster-n' have been registered against an Ubernetes
Federation "federation-1", each with their own set of Kubernetes API
endpoints, e.g.
"[http://endpoint-1.cluster-1](http://endpoint-1.cluster-1),
[http://endpoint-2.cluster-1](http://endpoint-2.cluster-1)
... [http://endpoint-m.cluster-n](http://endpoint-m.cluster-n)".

### Federated Services

Ubernetes Services are pretty straightforward. They're comprised of
multiple equivalent underlying Kubernetes Services, each with their
own external endpoint, and a load balancing mechanism across them.
Let's work through how exactly that works in practice.

Our user creates the following Ubernetes Service (against an Ubernetes
API endpoint):

    $ kubectl create -f my-service.yaml --context="federation-1"

where my-service.yaml contains the following:

    kind: Service
    metadata:
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
    spec:
      ports:
      - port: 2379
        protocol: TCP
        targetPort: 2379
        name: client
      - port: 2380
        protocol: TCP
        targetPort: 2380
        name: peer
      selector:
        run: my-service
      type: LoadBalancer

Ubernetes in turn creates one equivalent service (identical config to
the above) in each of the underlying Kubernetes clusters, each of
which results in something like this:

    $ kubectl get -o yaml --context="cluster-1" service my-service

    apiVersion: v1
    kind: Service
    metadata:
      creationTimestamp: 2015-11-25T23:35:25Z
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
      resourceVersion: "147365"
      selfLink: /api/v1/namespaces/my-namespace/services/my-service
      uid: 33bfc927-93cd-11e5-a38c-42010af00002
    spec:
      clusterIP: 10.0.153.185
      ports:
      - name: client
        nodePort: 31333
        port: 2379
        protocol: TCP
        targetPort: 2379
      - name: peer
        nodePort: 31086
        port: 2380
        protocol: TCP
        targetPort: 2380
      selector:
        run: my-service
      sessionAffinity: None
      type: LoadBalancer
    status:
      loadBalancer:
        ingress:
        - ip: 104.197.117.10

Similar services are created in `cluster-2` and `cluster-3`, each of
which are allocated their own `spec.clusterIP`, and
`status.loadBalancer.ingress.ip`.

In Ubernetes `federation-1`, the resulting federated service looks as
follows:

    $ kubectl get -o yaml --context="federation-1" service my-service

    apiVersion: v1
    kind: Service
    metadata:
      creationTimestamp: 2015-11-25T23:35:23Z
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
      resourceVersion: "157333"
      selfLink: /api/v1/namespaces/my-namespace/services/my-service
      uid: 33bfc927-93cd-11e5-a38c-42010af00007
    spec:
      clusterIP:
      ports:
      - name: client
        nodePort: 31333
        port: 2379
        protocol: TCP
        targetPort: 2379
      - name: peer
        nodePort: 31086
        port: 2380
        protocol: TCP
        targetPort: 2380
      selector:
        run: my-service
      sessionAffinity: None
      type: LoadBalancer
    status:
      loadBalancer:
        ingress:
        - hostname: my-service.my-namespace.my-federation.my-domain.com

Note that the federated service:
1. Is API-compatible with a vanilla Kubernetes service.
1. Has no clusterIP (as it is cluster-independent).
1. Has a federation-wide load balancer hostname.

In addition to the set of underlying Kubernetes services (one per
cluster) described above, Ubernetes has also created a DNS name
(e.g. on [Google Cloud DNS](https://cloud.google.com/dns) or
[AWS Route 53](https://aws.amazon.com/route53/), depending on
configuration) which provides load balancing across all of those
services. For example, in a very basic configuration:

    $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.117.10
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157

Each of the above IP addresses (which are just the external load
balancer ingress IP's of each cluster service) is of course load
balanced across the pods comprising the service in each cluster.

In a more sophisticated configuration (e.g. on GCE or GKE), Ubernetes
automatically creates a
[GCE Global L7 Load Balancer](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
which exposes a single, globally load-balanced IP:

    $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 107.194.17.44

Optionally, Ubernetes also configures the local DNS servers (SkyDNS)
in each Kubernetes cluster to preferentially return the local
clusterIP for the service in that cluster, with other clusters'
external service IP's (or a global load-balanced IP) also configured
for failover purposes:

    $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 10.0.153.185
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157

If Ubernetes Global Service Health Checking is enabled, multiple
service health checkers running across the federated clusters
collaborate to monitor the health of the service endpoints, and
automatically remove unhealthy endpoints from the DNS record (e.g. a
majority quorum is required to vote a service endpoint unhealthy, to
avoid false positives due to individual health checker network
isolation).

### Federated Replication Controllers

So far we have a federated service defined, with a resolvable load
balancer hostname by which clients can reach it, but no pods serving
traffic directed there. So now we need a Federated Replication
Controller. These are also fairly straightforward, being comprised of
multiple underlying Kubernetes Replication Controllers which do the
hard work of keeping the desired number of Pod replicas alive in each
Kubernetes cluster.
    $ kubectl create -f my-service-rc.yaml --context="federation-1"

where `my-service-rc.yaml` contains the following:

    kind: ReplicationController
    metadata:
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
    spec:
      replicas: 6
      selector:
        run: my-service
      template:
        metadata:
          labels:
            run: my-service
        spec:
          containers:
          - image: gcr.io/google_samples/my-service:v1
            name: my-service
            ports:
            - containerPort: 2379
              protocol: TCP
            - containerPort: 2380
              protocol: TCP

Ubernetes in turn creates one equivalent replication controller
(identical config to the above, except for the replica count) in each
of the underlying Kubernetes clusters, each of which results in
something like this:

    $ ./kubectl get -o yaml rc my-service --context="cluster-1"
    kind: ReplicationController
    metadata:
      creationTimestamp: 2015-12-02T23:00:47Z
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
      selfLink: /api/v1/namespaces/my-namespace/replicationcontrollers/my-service
      uid: 86542109-9948-11e5-a38c-42010af00002
    spec:
      replicas: 2
      selector:
        run: my-service
      template:
        metadata:
          labels:
            run: my-service
        spec:
          containers:
          - image: gcr.io/google_samples/my-service:v1
            name: my-service
            ports:
            - containerPort: 2379
              protocol: TCP
            - containerPort: 2380
              protocol: TCP
            resources: {}
          dnsPolicy: ClusterFirst
          restartPolicy: Always
    status:
      replicas: 2

The exact number of replicas created in each underlying cluster will
of course depend on what scheduling policy is in force. In the above
example, the scheduler created an equal number of replicas (2) in each
of the three underlying clusters, to make up the total of 6 replicas
required. To handle entire cluster failures, various approaches are
possible, including:

1. **simple overprovisioning**, such that sufficient replicas remain
   even if a cluster fails. This wastes some resources, but is simple
   and reliable.
2. **pod autoscaling**, where the replication controller in each
   cluster automatically and autonomously increases the number of
   replicas in its cluster in response to the additional traffic
   diverted from the failed cluster. This saves resources and is
   relatively simple, but there is some delay in the autoscaling.
3. **federated replica migration**, where the Ubernetes Federation
   Control Plane detects the cluster failure and automatically
   increases the replica count in the remaining clusters to make up
   for the lost replicas in the failed cluster. This does not seem to
   offer any benefits relative to pod autoscaling above, and is
   arguably more complex to implement, but we note it here as a
   possibility.

### Implementation Details

The implementation approach and architecture is very similar to
Kubernetes, so if you're familiar with how Kubernetes works, none of
what follows will be surprising. One additional design driver not
present in Kubernetes is that Ubernetes aims to be resilient to
individual cluster and availability zone failures. So the control
plane spans multiple clusters. More specifically:

+ Ubernetes runs its own distinct set of API servers (typically one
  or more per underlying Kubernetes cluster). These are completely
  distinct from the Kubernetes API servers for each of the underlying
  clusters.
+ Ubernetes runs its own distinct quorum-based metadata store (etcd,
  by default).
  Approximately 1 quorum member runs in each underlying cluster
  ("approximately" because we aim for an odd number of quorum members,
  and typically don't want more than 5 quorum members, even if we have
  a larger number of federated clusters, so 2 clusters -> 3 quorum
  members, 3 -> 3, 4 -> 3, 5 -> 5, 6 -> 5, 7 -> 5 etc).

Cluster Controllers in Ubernetes watch the Ubernetes API server/etcd
state, and apply changes to the underlying Kubernetes clusters
accordingly. They also provide an anti-entropy mechanism for
reconciling Ubernetes "desired desired" state against Kubernetes
"actual desired" state.

[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federated-services.md?pixel)]()

diff --git a/docs/design/federation-phase-1.md b/docs/design/federation-phase-1.md
new file mode 100644
index 00000000000..baf1e47247f
--- /dev/null
+++ b/docs/design/federation-phase-1.md
@@ -0,0 +1,434 @@

PLEASE NOTE: This document applies to the HEAD of the source tree

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).

--

# Ubernetes Design Spec (phase one)

**Huawei PaaS Team**

## INTRODUCTION

In this document we propose a design for the “Control Plane” of
Kubernetes (K8S) federation (a.k.a. “Ubernetes”). For background of
this work please refer to
[this proposal](../../docs/proposals/federation.md).
The document is arranged as follows. First we briefly list scenarios
and use cases that motivate K8S federation work. These use cases drive
the design, and they also serve to validate it. We summarize the
functionality requirements from these use cases, and define the “in
scope” functionalities that will be covered by this design (phase
one). After that we give an overview of the proposed architecture, API
and building blocks, and we also go through several activity flows to
see how these building blocks work together to support the use cases.

## REQUIREMENTS

There are many reasons why customers may want to build a K8S
federation:

+ **High Availability:** Customers want to be immune to the outage of
  a single availability zone, region or even a cloud provider.
+ **Sensitive workloads:** Some workloads can only run on a particular
  cluster. They cannot be scheduled to or migrated to other clusters.
+ **Capacity overflow:** Customers prefer to run workloads on a
  primary cluster. But if the capacity of the cluster is not
  sufficient, workloads should be automatically distributed to other
  clusters.
+ **Vendor lock-in avoidance:** Customers want to spread their
  workloads on different cloud providers, and can easily increase or
  decrease the workload proportion of a specific provider.
+ **Cluster Size Enhancement:** Currently a K8S cluster can only
  support a limited size. While the community is actively improving
  it, it can be expected that cluster size will be a problem if K8S is
  used for large workloads or public PaaS infrastructure. While we can
  separate different tenants into different clusters, it would be good
  to have a unified view.

Here are the functionality requirements derived from the above use
cases:

+ Clients of the federation control plane API server can register and
  deregister clusters.
+ Workloads should be spread to different clusters according to the
  workload distribution policy.
+ Pods are able to discover and connect to services hosted in other
  clusters (in cases where inter-cluster networking is necessary,
  desirable and implemented).
+ Traffic to these pods should be spread across clusters (in a manner
  similar to load balancing, although it might not be strictly
  speaking balanced).
+ The control plane needs to know when a cluster is down, and migrate
  the workloads to other clusters.
+ Clients have a unified view and a central control point for the
  above activities.

## SCOPE

It’s difficult to have a perfect design that implements all of the
above requirements in one pass. Therefore we will go with an iterative
approach to design and build the system. This document describes
phase one of the overall work.
In phase one we will cover only the
following objectives:

+ Define the basic building blocks and API objects of the control plane
+ Implement a basic end-to-end workflow
  + Clients register federated clusters
  + Clients submit a workload
  + The workload is distributed to different clusters
  + Service discovery
  + Load balancing

The following parts are NOT covered in phase one:

+ Authentication and authorization (other than basic client
  authentication against the ubernetes API, and from the ubernetes
  control plane to the underlying kubernetes clusters).
+ Deployment units other than replication controller and service
+ Complex distribution policies for workloads
+ Service affinity and migration

## ARCHITECTURE

The overall architecture of the control plane is shown below:

![Ubernetes Architecture](ubernetes-design.png)

Some design principles we are following in this architecture:

1. Keep the underlying K8S clusters independent. They should have no
   knowledge of the control plane or of each other.
1. Keep the Ubernetes API interface compatible with the K8S API as
   much as possible.
1. Re-use concepts from K8S as much as possible. This reduces
   customers’ learning curve and is good for adoption.

Below is a brief description of each module contained in the above
diagram.

## Ubernetes API Server

The API Server in the Ubernetes control plane works just like the API
Server in K8S. It talks to a distributed key-value store to persist,
retrieve and watch API objects. This store is completely distinct
from the kubernetes key-value stores (etcd) in the underlying
kubernetes clusters. We still use `etcd` as the distributed
storage so customers don’t need to learn and manage a different
storage system, although it is envisaged that other storage systems
(Consul, ZooKeeper) will probably be developed and supported over
time.

## Ubernetes Scheduler

The Ubernetes Scheduler schedules resources onto the underlying
Kubernetes clusters. For example it watches for unscheduled Ubernetes
replication controllers (those that have not yet been scheduled onto
underlying Kubernetes clusters) and performs the global scheduling
work. For each unscheduled replication controller, it calls the policy
engine to decide how to split workloads among clusters. It creates a
Kubernetes Replication Controller on one or more underlying clusters,
and posts them back to the `etcd` storage (see the illustrative sub-RC
sketch at the end of this section).

One subtlety worth noting here is that the scheduling decision is
arrived at by combining the application-specific request from the user
(which might include, for example, placement constraints) and the
global policy specified by the federation administrator (for example,
"prefer on-premise clusters over AWS clusters" or "spread load equally
across clusters").

## Ubernetes Cluster Controller

The cluster controller performs the following two kinds of work:

1. It watches all the sub-resources that are created by Ubernetes
   components, like a sub-RC or a sub-service. It then creates the
   corresponding API objects on the underlying K8S clusters.
1. It periodically retrieves the available resource metrics from the
   underlying K8S cluster, and updates them as object status of the
   `cluster` API object. An alternative design might be to run a pod
   in each underlying cluster that reports metrics for that cluster to
   the Ubernetes control plane. Which approach is better remains an
   open topic of discussion.
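To illustrate the scheduler's output, here is a hypothetical sub-RC as
it might be posted back for one underlying cluster. The names, cluster
count and even split below are purely illustrative (a 6-replica nginx
workload spread across three registered clusters); the actual split is
decided by the policy engine:

```
# Hypothetical sub-RC generated by the Ubernetes scheduler for cluster Foo;
# clusters Bar and Baz would receive equivalent objects with their own share.
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-controller
spec:
  replicas: 2   # 6 requested replicas split evenly across 3 clusters
  selector:
    app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
```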
## Ubernetes Service Controller

The Ubernetes service controller is a federation-level implementation
of the K8S service controller. It watches service resources created on
the control plane, and creates corresponding K8S services on each
involved K8S cluster. Besides interacting with service resources on
each individual K8S cluster, the Ubernetes service controller also
performs some global DNS registration work.

## API OBJECTS

## Cluster

Cluster is a new first-class API object introduced in this design. For
each registered K8S cluster there will be such an API resource in the
control plane. The way clients register or deregister a cluster is to
send corresponding REST requests to the following URL:
`/api/{$version}/clusters`. Because the control plane behaves like a
regular K8S client to the underlying clusters, the spec of a cluster
object contains necessary properties like the K8S cluster address and
credentials. The status of a cluster API object will contain the
following information:

1. Which phase of its lifecycle it is in
1. Cluster resource metrics for scheduling decisions
1. Other metadata like the version of the cluster

$version.clusterSpec
| Name | Description | Required | Schema | Default |
|------|-------------|----------|--------|---------|
| Address | address of the cluster | yes | address | |
| Credential | the type (e.g. bearer token, client certificate etc.) and data of the credential used to access the cluster. It’s used for system routines (not on behalf of users) | yes | string | |

$version.clusterStatus
| Name | Description | Required | Schema | Default |
|------|-------------|----------|--------|---------|
| Phase | the recently observed lifecycle phase of the cluster | yes | enum | |
| Capacity | represents the available resources of a cluster | yes | any | |
| ClusterMeta | other cluster metadata like the version | yes | ClusterMeta | |
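Putting the spec and status tables above together, a cluster object
might look something like the following. The YAML is purely
illustrative (the exact field names and schema are still to be
finalized in phase one); it would be registered by POSTing it to
`/api/{$version}/clusters`, e.g. via a hypothetical
`kubectl create -f cluster-foo.yaml --context="federation-1"`:

```
# Hypothetical cluster registration object; field names follow the
# clusterSpec/clusterStatus tables above and are illustrative only.
kind: Cluster
metadata:
  name: cluster-foo
spec:
  address: https://104.197.10.20    # K8S API endpoint of the cluster
  credential: <bearer token or client certificate data>  # system credential, not per-user
status:
  phase: running
  capacity:
    cpu: "200"
    memory: 800Gi
  clusterMeta:
    version: v1.1.3
```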

**For simplicity we didn’t introduce a separate “cluster metrics” API
object here.** The cluster resource metrics are stored in the cluster
status section, just as they are for nodes in K8S. In phase one it
only contains available CPU resources and memory resources. The
cluster controller will periodically poll the underlying cluster API
Server to get the cluster capacity. In phase one it gets the metrics
by simply aggregating metrics from all nodes. In future we will
improve this with more efficient ways like leveraging heapster, and
also more metrics will be supported. Similar to node phases in K8S,
the “phase” field includes the following values:

+ pending: newly registered clusters or clusters suspended by the
  admin for various reasons. They are not eligible for accepting
  workloads.
+ running: clusters in normal status that can accept workloads.
+ offline: clusters temporarily down or not reachable.
+ terminated: clusters removed from the federation.

Below is the state transition diagram.

![Cluster State Transition Diagram](ubernetes-cluster-state.png)

## Replication Controller

A global workload submitted to the control plane is represented as an
Ubernetes replication controller. When a replication controller
is submitted to the control plane, clients need a way to express its
requirements or preferences on clusters. Depending on different use
cases it may be complex. For example:

+ This workload can only be scheduled to cluster Foo. It cannot be
  scheduled to any other clusters (use case: sensitive workloads).
+ This workload prefers cluster Foo. But if there is no available
  capacity on cluster Foo, it’s OK to be scheduled to cluster Bar
  (use case: capacity overflow).
+ Seventy percent of this workload should be scheduled to cluster Foo,
  and thirty percent should be scheduled to cluster Bar (use case:
  vendor lock-in avoidance).

In phase one, we only introduce a _clusterSelector_ field to filter
acceptable clusters. In the default case there is no such selector,
which means any cluster is acceptable.

Below is a sample of the YAML to create such a replication controller.

```
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-controller
spec:
  replicas: 5
  selector:
    app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
  clusterSelector:
    name in (Foo, Bar)
```

Currently clusterSelector (implemented as a
[LabelSelector](../../pkg/apis/extensions/v1beta1/types.go#L704))
only supports a simple list of acceptable clusters. Workloads will be
evenly distributed on these acceptable clusters in phase one. After
phase one we will define syntax to represent more advanced
constraints, like cluster preference ordering, the desired number of
workload splits, the desired ratio of workloads spread across
different clusters, etc.

Besides this explicit “clusterSelector” filter, a workload may have
some implicit scheduling restrictions. For example it defines
“nodeSelector” which can only be satisfied on some particular
clusters. How to handle this will be addressed after phase one.

## Ubernetes Services

The Service API object exposed by Ubernetes is similar to service
objects on Kubernetes. It defines access to a group of pods. The
Ubernetes service controller will create corresponding Kubernetes
service objects on underlying clusters. These are detailed in a
separate design document: [Federated Services](federated-services.md).
## Pod

In phase one we only support scheduling replication controllers. Pod
scheduling will be supported in a later phase. This is primarily in
order to keep the Ubernetes API compatible with the Kubernetes API.

## ACTIVITY FLOWS

## Scheduling

The diagram below shows how workloads are scheduled on the Ubernetes
control plane:

1. A replication controller is created by the client.
1. The API Server persists it into the storage.
1. The cluster controller periodically polls the latest available
   resource metrics from the underlying clusters.
1. The scheduler watches all pending RCs. It picks up an RC, makes
   policy-driven decisions and splits it into different sub-RCs.
1. Each cluster controller watches the sub-RCs bound to its
   corresponding cluster. It picks up the newly created sub-RC.
1. The cluster controller issues requests to the underlying cluster
   API Server to create the RC. In phase one we don’t support complex
   distribution policies. The scheduling rule is basically:
   1. If an RC does not specify any nodeSelector, it will be scheduled
      to the least loaded K8S cluster(s) that has enough available
      resources.
   1. If an RC specifies _N_ acceptable clusters in the
      clusterSelector, all replicas will be evenly distributed among
      these clusters.

There is a potential race condition here. Say at time _T1_ the control
plane learns there are _m_ available resources in a K8S cluster. As
the cluster is working independently it still accepts workload
requests from other K8S clients or even another Ubernetes control
plane. The Ubernetes scheduling decision is based on this snapshot of
available resources. However, when the actual RC creation happens on
the cluster at time _T2_, the cluster may not have enough resources
at that time. We will address this problem in later phases with some
proposed solutions like resource reservation mechanisms.

![Ubernetes Scheduling](ubernetes-scheduling.png)

## Service Discovery

This part has been included in the section “Federated Service” of the
document
“[Ubernetes Cross-cluster Load Balancing and Service Discovery Requirements and System Design](federated-services.md)”.
Please refer to that document for details.

[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federation-phase-1.md?pixel)]()

diff --git a/docs/design/ubernetes-cluster-state.png b/docs/design/ubernetes-cluster-state.png
new file mode 100644
index 00000000000..56ec2df8b3f
Binary files /dev/null and b/docs/design/ubernetes-cluster-state.png differ
diff --git a/docs/design/ubernetes-design.png b/docs/design/ubernetes-design.png
new file mode 100644
index 00000000000..44924846c5d
Binary files /dev/null and b/docs/design/ubernetes-design.png differ
diff --git a/docs/design/ubernetes-scheduling.png b/docs/design/ubernetes-scheduling.png
new file mode 100644
index 00000000000..01774882bdf
Binary files /dev/null and b/docs/design/ubernetes-scheduling.png differ