Merge pull request #11938 from justinsb/docs_aws_internals

Auto commit by PR queue bot
2015-10-27 11:30:09 -07:00
parent d6d0976d55 7006133621
commit ad01a84ecb
1 changed files with 314 additions and 0 deletions
--- a/docs/design/aws_under_the_hood.md
+++ b/docs/design/aws_under_the_hood.md
@@ -0,0 +1,314 @@
+<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
+
+<!-- BEGIN STRIP_FOR_RELEASE -->
+
+<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
+     width="25" height="25">
+<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
+     width="25" height="25">
+<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
+     width="25" height="25">
+<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
+     width="25" height="25">
+<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
+     width="25" height="25">
+
+<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+<strong>
+The latest 1.0.x release of this document can be found
+[here](http://releases.k8s.io/release-1.0/docs/design/aws_under_the_hood.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+</strong>
+--
+
+<!-- END STRIP_FOR_RELEASE -->
+
+<!-- END MUNGE: UNVERSIONED_WARNING -->
+
+# Peeking under the hood of Kubernetes on AWS
+
+This document provides high-level insight into how Kubernetes works on AWS and
+maps to AWS objects.  We assume that you are familiar with AWS.
+
+We encourage you to use [kube-up](../getting-started-guides/aws.md) to create
+clusters on AWS. We recommend that you avoid manual configuration but are aware
+that sometimes it's the only option.
+
+Tip: You should open an issue and let us know what enhancements can be made to
+the scripts to better suit your needs.
+
+That said, it's also useful to know what's happening under the hood when
+Kubernetes clusters are created on AWS. This can be particularly useful if
+problems arise or in circumstances where the provided scripts are lacking and
+you manually created or configured your cluster.
+
+**Table of contents:**
+ * [Architecture overview](#architecture-overview)
+ * [Storage](#storage)
+ * [Auto Scaling group](#auto-scaling-group)
+ * [Networking](#networking)
+ * [NodePort and LoadBalancer services](#nodeport-and-loadbalancer-services)
+ * [Identity and access management (IAM)](#identity-and-access-management-iam)
+ * [Tagging](#tagging)
+ * [AWS objects](#aws-objects)
+ * [Manual infrastructure creation](#manual-infrastructure-creation)
+ * [Instance boot](#instance-boot)
+
+### Architecture overview
+
+Kubernetes is a cluster of several machines that consists of a Kubernetes
+master and a set number of nodes (previously known as 'minions') for which the
+master is responsible. See the [Architecture](architecture.md) topic for
+more details.
+
+By default on AWS:
+
+* Instances run Ubuntu 15.04 (the official AMI).  It includes a sufficiently
+  modern kernel that pairs well with Docker and doesn't require a
+  reboot.  (The default SSH user is `ubuntu` for this and other Ubuntu images.)
+* Nodes use aufs instead of ext4 as the filesystem / container storage (mostly
+  because this is what Google Compute Engine uses).
+
+You can override these defaults by passing different environment variables to
+kube-up.
+
+### Storage
+
+AWS supports persistent volumes by using [Elastic Block Store (EBS)](../user-guide/volumes.md#awselasticblockstore).  These can then be
+attached to pods that should store persistent data (e.g. if you're running a
+database).
+
+By default, nodes in AWS use [instance storage](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html)
+unless you create pods with persistent volumes
+[(EBS)](../user-guide/volumes.md#awselasticblockstore).  In general, Kubernetes
+containers do not have persistent storage unless you attach a persistent
+volume, and so nodes on AWS use instance storage.  Instance storage is cheaper,
+often faster, and historically more reliable.  Unless you can make do with whatever
+space is left on your root partition, you must choose an instance type that provides
+you with sufficient instance storage for your needs.
+
+Note: The master uses a persistent volume ([etcd](architecture.md#etcd)) to track
+its state.  Similar to nodes, containers are mostly run against instance
+storage, except that we repoint some important data onto the persistent volume.
+
+The default storage driver for Docker images is aufs.  btrfs (selected by passing
+the environment variable `DOCKER_STORAGE=btrfs` to kube-up) is also a good choice.
+It is relatively reliable with Docker, and its reliability has improved with modern
+kernels.  It can easily span multiple volumes, which is particularly useful
+when we are using an instance type with multiple ephemeral instance disks.
+
+### Auto Scaling group
+
+Nodes (but not the master) are run in an
+[Auto Scaling group](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/AutoScalingGroup.html)
+on AWS.  Currently auto-scaling (e.g. based on CPU) is not actually enabled
+([#11935](http://issues.k8s.io/11935)).  Instead, the Auto Scaling group means
+that AWS will relaunch any nodes that are terminated.
+
+We do not currently run the master in an Auto Scaling group, but we should
+([#11934](http://issues.k8s.io/11934)).
+
+### Networking
+
+Kubernetes uses an IP-per-pod model.  This means that a node, which runs many
+pods, must have many IPs.  AWS uses virtual private clouds (VPCs) and advanced
+routing support so that each node is assigned a /24 CIDR from which its pods
+receive their IPs. The assigned CIDR is then configured to route to that
+node's instance in the VPC routing table.
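+
+As a rough illustration of the routing described above, the following Python
+sketch carves one /24 per node out of a cluster-wide range and looks up which
+instance a pod IP routes to.  The `10.244.0.0/16` range, the instance IDs, and
+the function names are assumptions made for this example, not values taken
+from kube-up.  It also assumes the CIDR is assigned per node, as the node
+self-configuration step later in this document describes.
+
+```python
+import ipaddress
+
+# Hypothetical cluster-wide pod range; the real value depends on your setup.
+CLUSTER_CIDR = ipaddress.ip_network("10.244.0.0/16")
+
+
+def assign_node_cidrs(instance_ids):
+    """Give each instance its own /24 carved out of CLUSTER_CIDR."""
+    subnets = CLUSTER_CIDR.subnets(new_prefix=24)
+    return {cidr: instance_id for cidr, instance_id in zip(subnets, instance_ids)}
+
+
+def route(routes, pod_ip):
+    """Return the instance that the route table sends pod_ip to, or None."""
+    address = ipaddress.ip_address(pod_ip)
+    for cidr, instance_id in routes.items():
+        if address in cidr:
+            return instance_id
+    return None
+
+
+# Made-up instance IDs standing in for two nodes.
+routes = assign_node_cidrs(["i-node-a", "i-node-b"])
+print(route(routes, "10.244.1.17"))  # i-node-b
+```
+
+The dictionary plays the role of the VPC routing table: one entry per node,
+keyed by that node's CIDR.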
+
+It is also possible to use overlay networking on AWS, but that is not the default
+configuration of the kube-up script.
+
+### NodePort and LoadBalancer services
+
+Kubernetes on AWS integrates with [Elastic Load Balancing
+(ELB)](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/US_SetUpASLBApp.html).
+When you create a service with `Type=LoadBalancer`, Kubernetes (the
+kube-controller-manager) will create an ELB, create a security group for the
+ELB which allows access on the service ports, attach all the nodes to the ELB,
+and modify the security group for the nodes to allow traffic from the ELB to
+the nodes.  This traffic reaches kube-proxy where it is then forwarded to the
+pods.
+
+ELB has some restrictions:
+* it requires that all nodes listen on a single port,
+* it acts as a forwarding proxy (i.e. the source IP is not preserved).
+
+To work with these restrictions, in Kubernetes, [LoadBalancer
+services](../user-guide/services.md#type-loadbalancer) are exposed as
+[NodePort services](../user-guide/services.md#type-nodeport).  Then
+kube-proxy listens externally on the cluster-wide port that's assigned to
+NodePort services and forwards traffic to the corresponding pods.
+
+So for example, if we configure a service of Type LoadBalancer with a
+public port of 80:
+* Kubernetes will assign a NodePort to the service (e.g. 31234).
+* ELB is configured to proxy traffic on the public port 80 to the NodePort
+  that is assigned to the service (31234).
+* Then any incoming traffic that ELB forwards to the NodePort (e.g. port 31234)
+  is recognized by kube-proxy and sent to the correct pods for that service.
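+
+The two hops in this example can be modelled in a few lines of Python.  This
+is only a toy model under stated assumptions: the pod addresses are made up,
+and round-robin selection is a stand-in for whatever balancing kube-proxy
+actually applies.
+
+```python
+import itertools
+
+# Ports come from the example above; the pod addresses are made up.
+ELB_LISTENERS = {80: 31234}  # public port -> NodePort
+NODE_PORT_BACKENDS = {31234: ["10.244.0.5:8080", "10.244.1.17:8080"]}
+
+# One round-robin iterator per NodePort stands in for kube-proxy's balancing.
+_round_robin = {
+    port: itertools.cycle(pods) for port, pods in NODE_PORT_BACKENDS.items()
+}
+
+
+def forward(public_port):
+    """Follow a request from the ELB listener to a backing pod."""
+    node_port = ELB_LISTENERS[public_port]  # ELB proxies to the NodePort
+    return next(_round_robin[node_port])    # kube-proxy picks a pod
+```
+
+Calling `forward(80)` repeatedly alternates between the two pod addresses.
+Because the ELB acts as a proxy, neither hop preserves the client's source IP.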
+
+Note that we do not automatically open NodePort services in the AWS firewall
+(although we do open LoadBalancer services).  This is because we expect that
+NodePort services are more of a building block for things like inter-cluster
+services or for LoadBalancer.  To consume a NodePort service externally, you
+will likely have to open the port in the node security group
+(`kubernetes-minion-<clusterid>`).
+
+### Identity and Access Management (IAM)
+
+kube-up sets up two IAM roles, one for the master called
+[kubernetes-master](../../cluster/aws/templates/iam/kubernetes-master-policy.json)
+and one for the nodes called
+[kubernetes-minion](../../cluster/aws/templates/iam/kubernetes-minion-policy.json).
+
+The master is responsible for creating ELBs and configuring them, as well as
+setting up advanced VPC routing.  Currently it has blanket permissions on EC2,
+along with rights to create and destroy ELBs.
+
+The nodes do not need a lot of access to the AWS APIs.  They need to download
+a distribution file, and then are responsible for attaching and detaching EBS
+volumes from themselves.
+
+The node policy is relatively minimal.  The master policy is probably overly
+permissive.  The security conscious may want to lock down the IAM policies
+further ([#11936](http://issues.k8s.io/11936)).
+
+We should make it easier to extend IAM permissions and also ensure that they
+are correctly configured ([#14226](http://issues.k8s.io/14226)).
+
+### Tagging
+
+All AWS resources are tagged with a tag named "KubernetesCluster", with a value
+that is the unique cluster-id.  This tag is used to identify a particular
+'instance' of Kubernetes, even if two clusters are deployed into the same VPC.
+Resources are considered to belong to the same cluster if and only if they have
+the same value in the tag named "KubernetesCluster".  (The kube-up script is
+not configured to create multiple clusters in the same VPC by default, but it
+is possible to create another cluster in the same VPC.)
+
+Within the AWS cloud provider logic, we filter requests to the AWS APIs to
+match resources with our cluster tag.  By filtering the requests, we ensure
+that we see only our own AWS objects.
+
+Important: If you choose not to use kube-up, you must pick a unique cluster-id
+value, and ensure that all AWS resources have a tag with
+`Name=KubernetesCluster,Value=<clusterid>`.
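+
+The filtering rule can be sketched as follows.  This is a simplified model,
+not the cloud provider code: resources are plain dictionaries, tags are
+flattened into a `Tags` mapping, and the IDs and cluster names are made up.
+
+```python
+CLUSTER_TAG = "KubernetesCluster"
+
+
+def belongs_to_cluster(resource, cluster_id):
+    """True if the resource carries the cluster tag with exactly this id."""
+    return resource.get("Tags", {}).get(CLUSTER_TAG) == cluster_id
+
+
+def filter_cluster_resources(resources, cluster_id):
+    """Keep only the resources tagged as belonging to cluster_id."""
+    return [r for r in resources if belongs_to_cluster(r, cluster_id)]
+
+
+# Made-up resources that share one VPC but belong to different clusters.
+resources = [
+    {"Id": "i-1", "Tags": {"KubernetesCluster": "prod"}},
+    {"Id": "i-2", "Tags": {"KubernetesCluster": "staging"}},
+    {"Id": "vol-3", "Tags": {}},
+]
+print([r["Id"] for r in filter_cluster_resources(resources, "prod")])  # ['i-1']
+```
+
+An untagged resource such as `vol-3` matches no cluster, which is why manually
+created resources must carry the tag.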
+
+### AWS objects
+
+The kube-up script does a number of things in AWS:
+
+* Creates an S3 bucket (`AWS_S3_BUCKET`) and then copies the Kubernetes distribution
+  and the salt scripts into it.  They are made world-readable and the HTTP URLs
+  are passed to instances; this is how Kubernetes code gets onto the machines.
+* Creates two IAM profiles based on templates in [cluster/aws/templates/iam](../../cluster/aws/templates/iam/):
+    * `kubernetes-master` is used by the master.
+    * `kubernetes-minion` is used by nodes.
+* Creates an AWS SSH key named `kubernetes-<fingerprint>`.  Fingerprint here is
+  the OpenSSH key fingerprint, so that multiple users can run the script with
+  different keys and their keys will not collide (with near-certainty). It will
+  use an existing key if one is found at `AWS_SSH_KEY`, otherwise it will create
+  one there.  (With the default Ubuntu images, if you have to SSH in: the user is
+  `ubuntu` and that user can `sudo`).
+* Creates a VPC for use with the cluster (with a CIDR of 172.20.0.0/16) and
+  enables the `dns-support` and `dns-hostnames` options.
+* Creates an internet gateway for the VPC.
+* Creates a route table for the VPC, with the internet gateway as the default
+  route.
+* Creates a subnet (with a CIDR of 172.20.0.0/24) in the AZ `KUBE_AWS_ZONE`
+  (defaults to us-west-2a).  Currently, each Kubernetes cluster runs in a
+  single AZ on AWS.  There are, however, two philosophies under discussion on
+  how to achieve High Availability (HA):
+     * cluster-per-AZ: An independent cluster for each AZ, where each cluster
+       is entirely separate.
+     * cross-AZ-clusters: A single cluster spans multiple AZs.
+
+  The debate is still open: cluster-per-AZ is discussed as more robust, but
+  cross-AZ-clusters are more convenient.
+* Associates the subnet with the route table.
+* Creates security groups for the master (`kubernetes-master-<clusterid>`)
+  and the nodes (`kubernetes-minion-<clusterid>`).
+* Configures security groups so that masters and nodes can communicate. This
+  includes intercommunication between masters and nodes, opening SSH publicly
+  for both masters and nodes, and opening port 443 on the master for the HTTPS
+  API endpoints.
+* Creates an EBS volume for the master of size `MASTER_DISK_SIZE` and type
+  `MASTER_DISK_TYPE`.
+* Launches a master with a fixed IP address (172.20.0.9) that is also
+  configured for the security group and all the necessary IAM credentials. An
+  instance script is used to pass vital configuration information to Salt. Note:
+  The hope is that over time we can reduce the amount of configuration
+  information that must be passed in this way.
+* Once the instance is up, it attaches the EBS volume and sets up a manual
+  routing rule for the internal network range (`MASTER_IP_RANGE`, defaults to
+  10.246.0.0/24).
+* For auto-scaling, it creates a launch configuration and a group for the nodes.
+  The name for both is <*KUBE_AWS_INSTANCE_PREFIX*>-minion-group. The default
+  name is kubernetes-minion-group. The auto-scaling group has a min and max size
+  that are both set to `NUM_MINIONS`. You can change the size of the auto-scaling
+  group from within the AWS API or Console to add or remove nodes. Each node
+  self-configures, meaning that it comes up; runs Salt with the stored
+  configuration; connects to the master; is assigned an internal CIDR; and then
+  the master configures the route table with the assigned CIDR. The kube-up
+  script performs a health-check on the nodes but it's a self-check that is
+  not required.
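+
+To make the naming and addressing conventions in this list concrete, here is a
+small Python sketch.  The CIDRs, the master IP, and the name patterns are the
+ones listed above; the function names are hypothetical, and the default prefix
+of `kubernetes` is inferred from the default group name rather than read from
+the script.
+
+```python
+import ipaddress
+
+VPC_CIDR = ipaddress.ip_network("172.20.0.0/16")
+SUBNET_CIDR = ipaddress.ip_network("172.20.0.0/24")
+MASTER_IP = ipaddress.ip_address("172.20.0.9")
+
+
+def resource_names(cluster_id, instance_prefix="kubernetes"):
+    """Names of the security groups and the node group for one cluster."""
+    return {
+        "master_security_group": f"kubernetes-master-{cluster_id}",
+        "node_security_group": f"kubernetes-minion-{cluster_id}",
+        "node_group": f"{instance_prefix}-minion-group",
+    }
+
+
+def layout_is_consistent():
+    """The subnet must sit inside the VPC and contain the master's fixed IP."""
+    return SUBNET_CIDR.subnet_of(VPC_CIDR) and MASTER_IP in SUBNET_CIDR
+```
+
+A check like `layout_is_consistent()` is worth running against your own values
+if you change any of the ranges when creating the objects by hand.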
+
+
+If attempting this configuration manually, we highly recommend following along
+with the kube-up script, and being sure to tag everything with a tag with name
+`KubernetesCluster` and value set to a unique cluster-id.  Also, passing the
+right configuration options to Salt when not using the script is tricky: the
+plan here is to simplify this by having Kubernetes take on more node
+configuration, and even potentially remove Salt altogether.
+
+
+### Manual infrastructure creation
+
+While this work is not yet complete, advanced users might choose to manually
+create certain AWS objects while still making use of the kube-up script (to configure
+Salt, for example).  These objects can currently be manually created:
+
+* Set the `AWS_S3_BUCKET` environment variable to use an existing S3 bucket.
+* Set the `VPC_ID` environment variable to reuse an existing VPC.
+* Set the `SUBNET_ID` environment variable to reuse an existing subnet.
+* If your route table has a matching `KubernetesCluster` tag, it will
+  be reused.
+* If your security groups are appropriately named, they will be reused.
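+
+If you drive kube-up from another program, the three environment variables
+above can be assembled as shown in this Python sketch.  The helper name and
+the identifiers are hypothetical; only the variable names come from the list
+above.
+
+```python
+import os
+
+
+def kube_up_env(bucket=None, vpc_id=None, subnet_id=None, base=None):
+    """Return a copy of the environment with the reuse variables set."""
+    env = dict(os.environ if base is None else base)
+    overrides = {"AWS_S3_BUCKET": bucket, "VPC_ID": vpc_id, "SUBNET_ID": subnet_id}
+    # Only set the variables for objects that should actually be reused.
+    env.update({name: value for name, value in overrides.items() if value})
+    return env
+
+
+# Hypothetical identifiers; pass the result as env= to subprocess.run when
+# invoking kube-up.
+env = kube_up_env(vpc_id="vpc-0example", subnet_id="subnet-0example", base={})
+print(sorted(env))  # ['SUBNET_ID', 'VPC_ID']
+```
+
+Variables that are left unset fall back to kube-up's default behaviour of
+creating the object.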
+
+Currently there is no way to do the following with kube-up:
+
+* Use an existing AWS SSH key with an arbitrary name.
+* Override the IAM credentials in a sensible way
+  ([#14226](http://issues.k8s.io/14226)).
+* Use different security group permissions.
+* Configure your own auto-scaling groups.
+
+If any of the above items apply to your situation, open an issue to request an
+enhancement to the kube-up script. You should provide a complete description of
+the use-case, including all the details around what you want to accomplish.
+
+### Instance boot
+
+The instance boot procedure is currently pretty complicated, primarily because
+we must marshal configuration from Bash to Salt via the AWS instance script.
+As we move more post-boot configuration out of Salt and into Kubernetes, we
+will hopefully be able to simplify this.
+
+When the kube-up script launches instances, it builds an instance startup
+script which includes some configuration options passed to kube-up, and
+concatenates some of the scripts found in the cluster/aws/templates directory.
+These scripts are responsible for mounting and formatting volumes, downloading
+Salt and Kubernetes from the S3 bucket, and then triggering Salt to actually
+install Kubernetes.
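+
+The assembly step can be pictured with the Python sketch below.  It is an
+illustration only: kube-up itself does this in Bash, and the function name,
+the `readonly` variable format, and the fragment contents are assumptions made
+for the example rather than the script's real output.
+
+```python
+import shlex
+
+
+def build_startup_script(options, fragments):
+    """Join quoted options and script fragments into one startup script."""
+    lines = ["#!/bin/bash"]
+    # Pass the kube-up options through as shell variables.
+    lines += [
+        f"readonly {name}={shlex.quote(value)}"
+        for name, value in sorted(options.items())
+    ]
+    # Append each template fragment in the order given.
+    lines += [fragment.strip() for fragment in fragments]
+    return "\n".join(lines) + "\n"
+
+
+script = build_startup_script(
+    {"DOCKER_STORAGE": "btrfs"},
+    ["echo 'format and mount volumes'", "echo 'download and run Salt'"],
+)
+print(script)
+```
+
+Quoting every value with `shlex.quote` keeps options that contain spaces or
+shell metacharacters from being reinterpreted when the instance runs the
+script.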
+
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/aws_under_the_hood.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->