From df60a2466bb7642fe2a97dc1b87f111da950dead Mon Sep 17 00:00:00 2001
From: Eric Tune <etune@google.com>
Date: Wed, 21 Jan 2015 13:14:36 -0800
Subject: [PATCH] Reorganize mitigations.

---
 docs/availability.md | 53 ++++++++++++++++++++++++++------------------
 1 file changed, 32 insertions(+), 21 deletions(-)

diff --git a/docs/availability.md b/docs/availability.md
index 8a92688964a..e940f45a7e2 100644
--- a/docs/availability.md
+++ b/docs/availability.md
@@ -4,13 +4,13 @@ This document collects advice on reasoning about and provisioning for high-avail
 
 ## Failure modes
 
-This is an incomplete list of things that could go wrong, and how to deal with it.
+This is an incomplete list of things that could go wrong, and how to deal with them.
 
 Root causes:
   - VM(s) shutdown
   - network partition within cluster, or between cluster and users.
   - crashes in Kubernetes software 
-  - data loss or unavailability from storage
+  - data loss or unavailability of persistent storage (e.g. GCE PD or AWS EBS volume).
   - operator error misconfigures kubernetes software or application software.
 
 Specific scenarios:
@@ -18,21 +18,11 @@ Specific scenarios:
     - Results
       - unable to stop, update, or start new pods, services, replication controller
       - existing pods and services should continue to work normally, unless they depend on the Kubernetes API
-    - Mitigations
-      - Use cloud provider best practices for improving availability of a VM, such as automatic restart and reliable
-        storage for writeable state (GCE PD or AWS EBS volume).
-      - High-availability (replicated) APIserver is a planned feature for Kubernetes.  Will tolerate one or more
-        similtaneous apiserver failures.
-      - Multiple independent clusters will tolerate failure of all apiservers in one cluster.  
   - Apiserver backing storage lost
     - Results
       - apiserver should fail to come up.
       - kubelets will not be able to reach it but will continute to run the same pods and provide the same service proxying.
       - manual recovery or recreation of apiserver state necessary before apiserver is restarted.
-    - Mitigations
-      - High-availability (replicated) APIserver is a planned feature for Kubernetes.  Each apiserver has independent
-        storage.  Etcd will recover from loss of one member.  Risk of total data loss greatly reduced.
-      - snapshot PD/EBS-volume periodically
   - Supporting services (node controller, replication controller manager, scheduler, etc) VM shutdown or crashes
     - currently those are colocated with the apiserver, and their unavailability has similar consequences as apiserver
     - in future, these will be replicated as well and may not be co-located
@@ -40,27 +30,48 @@ Specific scenarios:
   - Node (thing that runs kubelet and kube-proxy and pods) shutdown
     - Results
       - pods on that Node stop running
-    - Mitigations
-      - replication controller should be used to restart copy of the pod elsewhere
-      - service should be used to hide changes in the pod IP address after restart
-      - applications (containers) should tolerate unexpected restarts
   - Kubelet software fault
     - Results
       - crashing kubelet cannot start new pods on the node
       - kubelet might delete the pods or not
       - node marked unhealthy
       - replication controllers start new pods elsewhere
-    - Mitigations
-      - same as for Node shutdown case
   - Cluster operator error
     - Results:
       - loss of pods, services, etc
       - lost of apiserver backing store
       - users unable to read API
       - etc
-    - Mitigations
-      - run additional cluster(s) and do not make changes to all at once.
-      - snapshot apiserver PD/EBS-volume periodically
+
+Mitigations:
+- Action: Use IaaS providers automatic VM restarting feature for IaaS VMs.
+  - Mitigates: Apiserver VM shutdown or apiserver crashing
+  - Mitigates: Supporting services VM shutdown or crashes
+
+- Action use IaaS providers reliable storage (e.g GCE PD or AWS EBS volume) for VMs with apiserver+etcd.
+  - Mitigates: Apiserver backing storage lost
+
+- Action: Use Replicated APIserver feature (when complete: feature is planned but not implemented)
+  - Mitigates: Apiserver VM shutdown or apiserver crashing
+    - Will tolerate one or more similtaneous apiserver failures.
+  - Mitigates: Apiserver backing storage lost
+    - Each apiserver has independent storage.  Etcd will recover from loss of one member.  Risk of total data loss greatly reduced.
+
+- Action: Snapshot apiserver PDs/EBS-volumes periodically
+  - Mitigates: Apiserver backing storage lost
+  - Mitigates: Some cases of operator error
+  - Mitigates: Some cases of kubernetes software fault
+
+- Action: use replication controller and services in front of pods
+  - Mitigates: Node shutdown
+  - Mitigates: Kubelet software fault
+
+- Action: applications (containers) designed to tolerate unexpected restarts
+  - Mitigates: Node shutdown
+  - Mitigates: Kubelet software fault
+
+- Action: Multiple independent clusters (and avoid making risky changes to all clusters at once)
+  - Mitigates: Everything listed above.
 
 ## Chosing Multiple Kubernetes Clusters