Merge pull request #23666 from smarterclayton/initctrproposal

Automatic merge from submit-queue

Proposal for implementing init containers

Addresses #1589. Implemented in #23567. Docs in https://github.com/kubernetes/kubernetes.github.io/pull/679

```release-note
Init containers enable pod authors to perform tasks before their normal containers start. Each init container is started in order, and failing containers will prevent the application from starting.
```
2016-06-17 17:02:17 -07:00
parent 7ab303efbe bdde25cf43
commit d03a5fab29
1 changed files with 473 additions and 0 deletions
--- a/docs/proposals/container-init.md
+++ b/docs/proposals/container-init.md
@@ -0,0 +1,473 @@
+<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
+
+<!-- BEGIN STRIP_FOR_RELEASE -->
+
+<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
+     width="25" height="25">
+
+<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+</strong>
+--
+
+<!-- END STRIP_FOR_RELEASE -->
+
+<!-- END MUNGE: UNVERSIONED_WARNING -->
+
+# Pod initialization
+
+@smarterclayton
+
+March 2016
+
+## Proposal and Motivation
+
+Within a pod there is a need to initialize local data or adapt to the current
+cluster environment that is not easily achieved in the current container model.
+Containers start in parallel after volumes are mounted, leaving no opportunity
+for coordination between containers without specialization of the image. If
+two containers need to share common initialization data, both images must
+be altered to cooperate using filesystem or network semantics, which introduces
+coupling between images. Likewise, if an image requires configuration in order
+to start and that configuration is environment dependent, the image must be
+altered to add the necessary templating or retrieval.
+
+This proposal introduces the concept of an **init container**, one or more
+containers started in sequence before the pod's normal containers are started.
+These init containers may share volumes, perform network operations, and perform
+computation prior to the start of the remaining containers. They may also, by
+virtue of their sequencing, block or delay the startup of application containers
+until some precondition is met. In this document we refer to the existing pod
+containers as **app containers**.
+
+This proposal also provides a high level design of **volume containers**, which
+initialize a particular volume, as a feature that specializes some of the tasks
+defined for init containers. The init container design anticipates the existence
+of volume containers and highlights where they will require future work.
+
+## Design Points
+
+* Init containers should be able to:
+  * Perform initialization of shared volumes
+    * Download binaries that will be used in app containers as execution targets
+    * Inject configuration or extension capability to generic images at startup
+    * Perform complex templating of information available in the local environment
+    * Initialize a database by starting a temporary execution process and applying
+      schema info.
+  * Delay the startup of application containers until preconditions are met
+  * Register the pod with other components of the system
+* Reduce coupling:
+  * Between application images, eliminating the need to customize those images for
+    Kubernetes generally or specific roles
+  * Inside of images, by specializing which containers perform which tasks
+    (install git into init container, use filesystem contents
+    in web container)
+  * Between initialization steps, by supporting multiple sequential init containers
+* Init containers allow simple start preconditions to be implemented that are
+  decoupled from application code
+  * The order init containers start should be predictable and allow users to easily
+    reason about the startup of a container
+  * Complex ordering and failure will not be supported - all complex workflows can
+    if necessary be implemented inside of a single init container, and this proposal
+    aims to enable that ordering without adding undue complexity to the system.
+    Pods in general are not intended to support DAG workflows.
+* Both run-once and run-forever pods should be able to use init containers
+* As much as possible, an init container should behave like an app container
+  to reduce complexity for end users, for clients, and for divergent use cases.
+  An init container is a container with the minimum alterations to accomplish
+  its goal.
+* Volume containers should be able to:
+  * Perform initialization of a single volume
+  * Start in parallel
+  * Perform computation to initialize a volume, and delay start until that
+    volume is initialized successfully.
+  * Using a volume container that does not populate a volume to delay pod start
+    (in the absence of init containers) would be an abuse of the goal of volume
+    containers.
+* Container pre-start hooks are not sufficient for all initialization cases:
+  * They cannot easily coordinate complex conditions across containers
+  * They can only function with code in the image or code in a shared volume,
+    which would have to be statically linked (not a common pattern in wide use)
+  * They cannot be implemented with the current Docker implementation - see
+    [#140](https://github.com/kubernetes/kubernetes/issues/140)
+
+
+
+## Alternatives
+
+* Any mechanism that runs user code on a node before regular pod containers
+  should itself be a container and modeled as such - we explicitly reject
+  creating new mechanisms for running user processes.
+* The container pre-start hook (not yet implemented) requires execution within
+  the container's image and so cannot adapt existing images. It also cannot
+  block startup of containers.
+* Running a "pre-pod" would defeat the purpose of the pod being an atomic
+  unit of scheduling.
+
+
+## Design
+
+Each pod may have 0..N init containers defined along with the existing
+1..M app containers.
+
+On startup of the pod, after the network and volumes are initialized, the
+init containers are started in order. Each container must exit successfully
+before the next is invoked. If a container fails to start (due to the runtime)
+or exits with failure, it is retried according to the pod RestartPolicy.
+RestartPolicyNever pods will immediately fail and exit. RestartPolicyAlways
+pods will retry the failing init container with increasing backoff until it
+succeeds. To align with the design of application containers, init containers
+will only support "infinite retries" (RestartPolicyAlways) or "no retries"
+(RestartPolicyNever).
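
The startup sequence described above can be sketched as a small simulation. This is only an illustration under stated assumptions, not the Kubelet implementation: `run_init_containers`, the callable-per-container model, and the `max_attempts` cap are hypothetical and exist so the example terminates.

```python
from typing import Callable, List, Tuple


def run_init_containers(
    init_containers: List[Tuple[str, Callable[[], bool]]],
    restart_policy: str,
    max_attempts: int = 10,
) -> Tuple[bool, List[str]]:
    """Run init containers in order and return (initialized, attempt log).

    Each entry is (name, run), where run() returns True for a successful exit.
    max_attempts is a hypothetical cap so this sketch always terminates; the
    proposal describes unbounded retries for RestartPolicyAlways.
    """
    log: List[str] = []
    for name, run in init_containers:
        attempts = 0
        while True:
            attempts += 1
            ok = run()
            log.append(f"{name}:{'ok' if ok else 'fail'}")
            if ok:
                break  # the next init container may now start
            if restart_policy == "Never" or attempts >= max_attempts:
                return False, log  # pod fails; app containers never start
            # A real implementation would wait with increasing backoff here.
    return True, log
```

With `restart_policy="Never"` the first failure ends the pod, while `"Always"` keeps retrying the same init container before moving on to the next one.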
+
+A pod cannot be ready until all init containers have succeeded. The ports
+on an init container are not aggregated under a service. A pod that is
+being initialized is in the `Pending` phase but should have a distinct
+condition. Each app container and all future init containers should have
+the reason `PodInitializing`. The pod should have a condition `Initializing`
+set to `false` until all init containers have succeeded, and `true` thereafter.
+If the pod is restarted, the `Initializing` condition should be set to `false`.
+
+If the pod is "restarted" (all containers are stopped and started again due to
+a node restart, a change to the pod definition, or admin interaction), all
+init containers must execute again. Restartable conditions are defined as:
+
+* An init container image is changed
+* The pod infrastructure container is restarted (shared namespaces are lost)
+* The Kubelet detects that all containers in a pod are terminated AND
+  no record of init container completion is available on disk (due to GC)
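
The three restartable conditions can be read as a single predicate. The sketch below is an assumption-laden illustration: `PodObservation` and its field names are hypothetical and only mirror the bullets above.

```python
from dataclasses import dataclass


@dataclass
class PodObservation:
    """Hypothetical snapshot of what the Kubelet knows about a pod."""
    init_image_changed: bool
    infra_container_restarted: bool
    all_containers_terminated: bool
    init_completion_recorded: bool


def must_rerun_init(obs: PodObservation) -> bool:
    """Return True when any restartable condition from the list holds."""
    return (
        obs.init_image_changed
        or obs.infra_container_restarted
        # Terminated containers alone are not enough: the completion record
        # must also be missing (for example, removed by garbage collection).
        or (obs.all_containers_terminated and not obs.init_completion_recorded)
    )
```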
+
+Changes to the init container spec are limited to the container image field.
+Altering the container image field is equivalent to restarting the pod.
+
+Because init containers can be restarted, retried, or reexecuted, container
+authors should make their init behavior idempotent by handling volumes that
+are already populated or the possibility that this instance of the pod has
+already contacted a remote system.
+
+Each init container has all of the fields of an app container. The following
+fields are prohibited from being used on init containers by validation:
+
+* `readinessProbe` - init containers must exit for pod startup to continue,
+  are not included in rotation, and so cannot define readiness distinct from
+  completion.
+
+Init container authors may use `activeDeadlineSeconds` on the pod and
+`livenessProbe` on the container to prevent init containers from failing
+forever. The active deadline includes init containers.
+
+Because init containers are semantically different in lifecycle from app
+containers (they are run serially, rather than in parallel), for backwards
+compatibility and design clarity they will be identified as distinct fields
+in the API:
+
+    pod:
+      spec:
+        containers: ...
+        initContainers:
+        - name: init-container1
+          image: ...
+          ...
+        - name: init-container2
+        ...
+      status:
+        containerStatuses: ...
+        initContainerStatuses:
+        - name: init-container1
+          ...
+        - name: init-container2
+          ...
+
+This separation also serves to make the order of container initialization
+clear - init containers are executed in the order that they appear, then all
+app containers are started at once.
+
+The name of each app and init container in a pod must be unique - it is a
+validation error for any container to share a name.
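
A minimal sketch of the two validation rules stated so far (unique names across both lists, and no `readinessProbe` on init containers). The function name, the plain-dict pod spec, and the error strings are assumptions made for illustration, not the API server's validation code.

```python
from typing import Dict, List


def validate_pod_spec(spec: Dict) -> List[str]:
    """Return a list of validation errors for a pod spec given as a dict."""
    errors: List[str] = []
    seen = set()
    # Init and app containers share one namespace for names.
    for field in ("initContainers", "containers"):
        for container in spec.get(field, []):
            name = container.get("name", "")
            if name in seen:
                errors.append(f"duplicate container name: {name}")
            seen.add(name)
            if field == "initContainers" and "readinessProbe" in container:
                errors.append(
                    f"init container {name} must not set readinessProbe"
                )
    return errors
```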
+
+While init containers are in alpha state, they will be serialized as an annotation
+on the pod with the name `pod.alpha.kubernetes.io/init-containers` and the status
+of the containers will be stored as `pod.alpha.kubernetes.io/init-container-statuses`.
+Mutation of these annotations is prohibited on existing pods.
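
As an illustration of the alpha representation, the sketch below round-trips a list of init containers through the annotation. It assumes a JSON encoding, which this proposal does not specify, and the helper names are hypothetical.

```python
import json
from typing import Dict, List

INIT_CONTAINERS_ANNOTATION = "pod.alpha.kubernetes.io/init-containers"


def init_containers_to_annotations(init_containers: List[Dict]) -> Dict[str, str]:
    """Encode init containers as a single annotation value (JSON assumed)."""
    return {INIT_CONTAINERS_ANNOTATION: json.dumps(init_containers, sort_keys=True)}


def init_containers_from_annotations(annotations: Dict[str, str]) -> List[Dict]:
    """Decode init containers; a missing annotation means none are defined."""
    raw = annotations.get(INIT_CONTAINERS_ANNOTATION)
    return json.loads(raw) if raw else []
```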
+
+
+### Resources
+
+Given the ordering and execution for init containers, the following rules
+for resource usage apply:
+
+* The highest of any particular resource request or limit defined on all init
+  containers is the **effective init request/limit**
+* The pod's **effective request/limit** for a resource is the higher of:
+  * sum of all app containers request/limit for a resource
+  * effective init request/limit for a resource
+* Scheduling is done based on effective requests/limits, which means
+  init containers can reserve resources for initialization that are not used
+  during the life of the pod.
+* The lowest QoS tier of init containers per resource is the **effective init QoS tier**,
+  and the highest QoS tier of both init containers and regular containers is the
+  **effective pod QoS tier**.
+
+So the following pod:
+
+    pod:
+      spec:
+        initContainers:
+        - limits:
+            cpu: 100m
+            memory: 1GiB
+        - limits:
+            cpu: 50m
+            memory: 2GiB
+        containers:
+        - limits:
+            cpu: 10m
+            memory: 1100MiB
+        - limits:
+            cpu: 10m
+            memory: 1100MiB
+
+has an effective pod limit of `cpu: 100m`, `memory: 2200MiB` (highest init
+container cpu is larger than sum of all app containers, sum of container
+memory is larger than the max of all init containers). The scheduler, node,
+and quota must respect the effective pod request/limit.
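
The rule above (the maximum over init containers versus the sum over app containers) can be checked with a short calculation. This is a sketch under assumptions: quantities are plain integers (CPU in millicores, memory in MiB), `effective_resources` is a hypothetical name, and a missing value is treated as zero rather than as "unbounded".

```python
from typing import Dict, List

Resources = Dict[str, int]  # resource name -> quantity in a fixed unit


def effective_resources(init: List[Resources], app: List[Resources]) -> Resources:
    """Per resource: max(highest init container value, sum of app containers)."""
    names = sorted({name for container in init + app for name in container})
    result: Resources = {}
    for name in names:
        init_max = max((c.get(name, 0) for c in init), default=0)
        app_sum = sum(c.get(name, 0) for c in app)
        result[name] = max(init_max, app_sum)
    return result


# The example pod above, with 1GiB = 1024MiB and 2GiB = 2048MiB.
init_limits = [{"cpu": 100, "memory": 1024}, {"cpu": 50, "memory": 2048}]
app_limits = [{"cpu": 10, "memory": 1100}, {"cpu": 10, "memory": 1100}]
pod_limits = effective_resources(init_limits, app_limits)
```

Here `pod_limits` works out to 100 millicores of CPU (the largest init container wins) and 2200 MiB of memory (the app container sum wins).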
+
+In the absence of a defined request or limit on a container, the effective
+request/limit will be applied. For example, the following pod:
+
+    pod:
+      spec:
+        initContainers:
+        - limits:
+            cpu: 100m
+            memory: 1GiB
+        containers:
+        - request:
+            cpu: 10m
+            memory: 1100MiB
+
+will have an effective request of `10m / 1100MiB`, and an effective limit
+of `100m / 1GiB`, i.e.:
+
+    pod:
+      spec:
+        initContainers:
+        - request:
+            cpu: 10m
+            memory: 1100MiB
+          limits:
+            cpu: 100m
+            memory: 1GiB
+        containers:
+        - request:
+            cpu: 10m
+            memory: 1100MiB
+          limits:
+            cpu: 100m
+            memory: 1GiB
+
+and thus have the QoS tier **Burstable** (because request is not equal to
+limit).
+
+Quota and limits will be applied based on the effective pod request and
+limit.
+
+Pod level cGroups will be based on the effective pod request and limit, the
+same as the scheduler.
+
+
+### Kubelet and container runtime details
+
+Container runtimes should treat the set of init and app containers as one
+large pool. An individual init container execution should be identical to
+an app container, including all standard container environment setup
+(network, namespaces, hostnames, DNS, etc).
+
+All app container operations are permitted on init containers. The
+logs for an init container should be available for the duration of the pod
+lifetime or until the pod is restarted.
+
+During initialization, app container status should be shown with the reason
+`PodInitializing` if any init containers are present. Each init container
+should show appropriate container status, and all init containers that are
+waiting for earlier init containers to finish should have the `reason`
+`PendingInitialization`.
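
The status reasons in the paragraph above can be sketched as a function of how far initialization has progressed. The function name and the `current` index parameter are hypothetical; only the two reason strings come from this document.

```python
from typing import Dict, List


def waiting_reasons(
    init_names: List[str], app_names: List[str], current: int
) -> Dict[str, str]:
    """Map container name -> waiting reason while the pod initializes.

    current is the index of the init container executing now; a value of
    len(init_names) means every init container has already succeeded.
    """
    reasons: Dict[str, str] = {}
    if current >= len(init_names):
        return reasons  # initialization finished, nothing is held back
    for name in init_names[current + 1:]:
        reasons[name] = "PendingInitialization"
    for name in app_names:
        reasons[name] = "PodInitializing"
    return reasons
```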
+
+The container runtime should aggressively prune failed init containers.
+The container runtime should record whether all init containers have
+succeeded internally, and only invoke new init containers if a pod
+restart is needed (for Docker, if all containers terminate or if the pod
+infra container terminates). Init containers should follow backoff rules
+as necessary. The Kubelet *must* preserve at least the most recent instance
+of an init container to serve logs and data for end users and to track
+failure states. The Kubelet *should* prefer to garbage collect completed
+init containers over app containers, as long as the Kubelet is able to
+track that initialization has been completed. In the future, container
+state checkpointing in the Kubelet may remove or reduce the need to
+preserve old init containers.
+
+For the initial implementation, the Kubelet will use the last termination
+container state of the highest indexed init container to determine whether
+the pod has completed initialization. During a pod restart, initialization
+will be restarted from the beginning (all initializers will be rerun).
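
The completion check described for the initial implementation might look like the following. This is a sketch, assuming the Kubelet can see the exit code of each init container's last termination; the function name and the use of `None` for "no terminated state known" are illustrative choices.

```python
from typing import List, Optional


def initialization_complete(last_exit_codes: List[Optional[int]]) -> bool:
    """Decide completion from the highest indexed init container only.

    last_exit_codes[i] is the exit code from the last termination state of
    init container i, or None when no terminated state is available.
    """
    if not last_exit_codes:
        return True  # a pod without init containers has nothing to wait for
    return last_exit_codes[-1] == 0
```

Because init containers run strictly in order, a successful exit of the last one implies that all earlier ones succeeded during the same pass.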
+
+
+### API Behavior
+
+All APIs that access containers by name should operate on both init and
+app containers. Because names are unique the addition of the init container
+should be transparent to use cases.
+
+A client with no knowledge of init containers should see appropriate
+container status `reason` and `message` fields while the pod is in the
+`Pending` phase, and so be able to communicate that to end users.
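
Because names are unique across both lists, a name-based lookup can search them uniformly. A minimal sketch, assuming a pod spec represented as a plain dict and a hypothetical `find_container` helper:

```python
from typing import Dict, Optional, Tuple


def find_container(spec: Dict, name: str) -> Optional[Tuple[str, Dict]]:
    """Return (field, container) for the named container, or None."""
    for field in ("initContainers", "containers"):
        for container in spec.get(field, []):
            if container.get("name") == name:
                return field, container
    return None
```

An operation such as fetching logs by container name could use a lookup like this without caring whether the target is an init or an app container.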
+
+
+### Example init containers
+
+* Wait for a service to be created
+
+        pod:
+          spec:
+            initContainers:
+            - name: wait
+              image: centos:centos7
+              command: ["/bin/sh", "-c", "for i in {1..100}; do sleep 1; if dig myservice; then exit 0; fi; done; exit 1"]
+            containers:
+            - name: run
+              image: application-image
+              command: ["/my_application_that_depends_on_myservice"]
+
+* Register this pod with a remote server
+
+        pod:
+          spec:
+            initContainers:
+            - name: register
+              image: centos:centos7
+              command: ["/bin/sh", "-c", "curl -X POST http://$MANAGEMENT_SERVICE_HOST:$MANAGEMENT_SERVICE_PORT/register -d 'instance=$(POD_NAME)&ip=$(POD_IP)'"]
+              env:
+              - name: POD_NAME
+                valueFrom:
+                  field: metadata.name
+              - name: POD_IP
+                valueFrom:
+                  field: status.podIP
+            containers:
+            - name: run
+              image: application-image
+              command: ["/my_application_that_depends_on_myservice"]
+
+* Wait for an arbitrary period of time
+
+        pod:
+          spec:
+            initContainers:
+            - name: wait
+              image: centos:centos7
+              command: ["/bin/sh", "-c", "sleep 60"]
+            containers:
+            - name: run
+              image: application-image
+              command: ["/static_binary_without_sleep"]
+
+* Clone a git repository into a volume (can be implemented by volume containers in the future):
+
+        pod:
+          spec:
+            initContainers:
+            - name: download
+              image: image-with-git
+              command: ["git", "clone", "https://github.com/myrepo/myrepo.git", "/var/lib/data"]
+              volumeMounts:
+              - mountPath: /var/lib/data
+                volumeName: git
+            containers:
+            - name: run
+              image: centos:centos7
+              command: ["/var/lib/data/binary"]
+              volumeMounts:
+              - mountPath: /var/lib/data
+                volumeName: git
+            volumes:
+            - emptyDir: {}
+              name: git
+
+* Execute a template transformation based on environment (can be implemented by volume containers in the future):
+
+        pod:
+          spec:
+            initContainers:
+            - name: copy
+              image: application-image
+              command: ["/bin/cp", "mytemplate.j2", "/var/lib/data/"]
+              volumeMounts:
+              - mountPath: /var/lib/data
+                volumeName: data
+            - name: transform
+              image: image-with-jinja
+              command: ["/bin/sh", "-c", "jinja /var/lib/data/mytemplate.j2 > /var/lib/data/mytemplate.conf"]
+              volumeMounts:
+              - mountPath: /var/lib/data
+                volumeName: data
+            containers:
+            - name: run
+              image: application-image
+              command: ["/myapplication", "-conf", "/var/lib/data/mytemplate.conf"]
+              volumeMounts:
+              - mountPath: /var/lib/data
+                volumeName: data
+            volumes:
+            - emptyDir: {}
+              name: data
+
+* Perform a container build
+
+        pod:
+          spec:
+            initContainers:
+            - name: copy
+              image: base-image
+              workingDir: /home/user/source-tree
+              command: ["make"]
+            containers:
+            - name: commit
+              image: image-with-docker
+              command:
+              - /bin/sh
+              - -c
+              - docker commit $(complex_bash_to_get_container_id_of_copy) \
+                docker push $(commit_id) myrepo:latest
+              volumeMounts:
+              - mountPath: /var/run/docker.sock
+                volumeName: dockersocket
+
+## Backwards compatibility implications
+
+Since this is a net new feature in the API and Kubelet, new API servers during upgrade may not
+be able to rely on Kubelets implementing init containers. The management of feature skew between
+master and Kubelet is tracked in issue [#4855](https://github.com/kubernetes/kubernetes/issues/4855).
+
+
+## Future work
+
+* Unify pod QoS class with init containers
+* Implement container / image volumes to make composition of runtime from images efficient
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/container-init.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->