diff --git a/docs/proposals/api-group.md b/docs/proposals/api-group.md index 435994fe25b..58f67699687 100644 --- a/docs/proposals/api-group.md +++ b/docs/proposals/api-group.md @@ -1,119 +1 @@ -# Supporting multiple API groups - -## Goal - -1. Breaking the monolithic v1 API into modular groups and allowing groups to be enabled/disabled individually. This allows us to break the monolithic API server to smaller components in the future. - -2. Supporting different versions in different groups. This allows different groups to evolve at different speed. - -3. Supporting identically named kinds to exist in different groups. This is useful when we experiment new features of an API in the experimental group while supporting the stable API in the original group at the same time. - -4. Exposing the API groups and versions supported by the server. This is required to develop a dynamic client. - -5. Laying the basis for [API Plugin](../../docs/design/extending-api.md). - -6. Keeping the user interaction easy. For example, we should allow users to omit group name when using kubectl if there is no ambiguity. - - -## Bookkeeping for groups - -1. No changes to TypeMeta: - - Currently many internal structures, such as RESTMapper and Scheme, are indexed and retrieved by APIVersion. For a fast implementation targeting the v1.1 deadline, we will concatenate group with version, in the form of "group/version", and use it where a version string is expected, so that many code can be reused. This implies we will not add a new field to TypeMeta, we will use TypeMeta.APIVersion to hold "group/version". - - For backward compatibility, v1 objects belong to the group with an empty name, so existing v1 config files will remain valid. - -2. /pkg/conversion#Scheme: - - The key of /pkg/conversion#Scheme.versionMap for versioned types will be "group/version". For now, the internal version types of all groups will be registered to versionMap[""], as we don't have any identically named kinds in different groups yet. In the near future, internal version types will be registered to versionMap["group/"], and pkg/conversion#Scheme.InternalVersion will have type []string. - - We will need a mechanism to express if two kinds in different groups (e.g., compute/pods and experimental/pods) are convertible, and auto-generate the conversions if they are. - -3. meta.RESTMapper: - - Each group will have its own RESTMapper (of type DefaultRESTMapper), and these mappers will be registered to pkg/api#RESTMapper (of type MultiRESTMapper). - - To support identically named kinds in different groups, We need to expand the input of RESTMapper.VersionAndKindForResource from (resource string) to (group, resource string). If group is not specified and there is ambiguity (i.e., the resource exists in multiple groups), an error should be returned to force the user to specify the group. - -## Server-side implementation - -1. resource handlers' URL: - - We will force the URL to be in the form of prefix/group/version/... - - Prefix is used to differentiate API paths from other paths like /healthz. All groups will use the same prefix="apis", except when backward compatibility requires otherwise. No "/" is allowed in prefix, group, or version. Specifically, - - * for /api/v1, we set the prefix="api" (which is populated from cmd/kube-apiserver/app#APIServer.APIPrefix), group="", version="v1", so the URL remains to be /api/v1. 
- - * for new kube API groups, we will set the prefix="apis" (we will add a field in type APIServer to hold this prefix), group=GROUP_NAME, version=VERSION. For example, the URL of the experimental resources will be /apis/experimental/v1alpha1. - - * for OpenShift v1 API, because it's currently registered at /oapi/v1, to be backward compatible, OpenShift may set prefix="oapi", group="". - - * for other new third-party API, they should also use the prefix="apis" and choose the group and version. This can be done through the thirdparty API plugin mechanism in [13000](http://pr.k8s.io/13000). - -2. supporting API discovery: - - * At /prefix (e.g., /apis), API server will return the supported groups and their versions using pkg/api/unversioned#APIVersions type, setting the Versions field to "group/version". This is backward compatible, because currently API server does return "v1" encoded in pkg/api/unversioned#APIVersions at /api. (We will also rename the JSON field name from `versions` to `apiVersions`, to be consistent with pkg/api#TypeMeta.APIVersion field) - - * At /prefix/group, API server will return all supported versions of the group. We will create a new type VersionList (name is open to discussion) in pkg/api/unversioned as the API. - - * At /prefix/group/version, API server will return all supported resources in this group, and whether each resource is namespaced. We will create a new type APIResourceList (name is open to discussion) in pkg/api/unversioned as the API. - - We will design how to handle deeper path in other proposals. - - * At /swaggerapi/swagger-version/prefix/group/version, API server will return the Swagger spec of that group/version in `swagger-version` (e.g. we may support both Swagger v1.2 and v2.0). - -3. handling common API objects: - - * top-level common API objects: - - To handle the top-level API objects that are used by all groups, we either have to register them to all schemes, or we can choose not to encode them to a version. We plan to take the latter approach and place such types in a new package called `unversioned`, because many of the common top-level objects, such as APIVersions, VersionList, and APIResourceList, which are used in the API discovery, and pkg/api#Status, are part of the protocol between client and server, and do not belong to the domain-specific parts of the API, which will evolve independently over time. - - Types in the unversioned package will not have the APIVersion field, but may retain the Kind field. - - For backward compatibility, when handling the Status, the server will encode it to v1 if the client expects the Status to be encoded in v1, otherwise the server will send the unversioned#Status. If an error occurs before the version can be determined, the server will send the unversioned#Status. - - * non-top-level common API objects: - - Assuming object o belonging to group X is used as a field in an object belonging to group Y, currently genconversion will generate the conversion functions for o in package Y. Hence, we don't need any special treatment for non-top-level common API objects. - - TypeMeta is an exception, because it is a common object that is used by objects in all groups but does not logically belong to any group. We plan to move it to the package `unversioned`. - -## Client-side implementation - -1. clients: - - Currently we have structured (pkg/client/unversioned#ExperimentalClient, pkg/client/unversioned#Client) and unstructured (pkg/kubectl/resource#Helper) clients. 
The structured clients are not scalable because each of them implements specific interface, e.g., `[here]../../pkg/client/unversioned/client.go#L32`--fixed. Only the unstructured clients are scalable. We should either auto-generate the code for structured clients or migrate to use the unstructured clients as much as possible. - - We should also move the unstructured client to pkg/client/. - -2. Spelling the URL: - - The URL is in the form of prefix/group/version/. The prefix is hard-coded in the client/unversioned.Config. The client should be able to figure out `group` and `version` using the RESTMapper. For a third-party client which does not have access to the RESTMapper, it should discover the mapping of `group`, `version` and `kind` by querying the server as described in point 2 of #server-side-implementation. - -3. kubectl: - - kubectl should accept arguments like `group/resource`, `group/resource/name`. Nevertheless, the user can omit the `group`, then kubectl shall rely on RESTMapper.VersionAndKindForResource() to figure out the default group/version of the resource. For example, for resources (like `node`) that exist in both k8s v1 API and k8s modularized API (like `infra/v2`), we should set kubectl default to use one of them. If there is no default group, kubectl should return an error for the ambiguity. - - When kubectl is used with a single resource type, the --api-version and --output-version flag of kubectl should accept values in the form of `group/version`, and they should work as they do today. For multi-resource operations, we will disable these two flags initially. - - Currently, by setting pkg/client/unversioned/clientcmd/api/v1#Config.NamedCluster[x].Cluster.APIVersion ([here](../../pkg/client/unversioned/clientcmd/api/v1/types.go#L58)), user can configure the default apiVersion used by kubectl to talk to server. It does not make sense to set a global version used by kubectl when there are multiple groups, so we plan to deprecate this field. We may extend the version negotiation function to negotiate the preferred version of each group. Details will be in another proposal. - -## OpenShift integration - -OpenShift can take a similar approach to break monolithic v1 API: keeping the v1 where they are, and gradually adding groups. - -For the v1 objects in OpenShift, they should keep doing what they do now: they should remain registered to Scheme.versionMap["v1"] scheme, they should keep being added to originMapper. - -For new OpenShift groups, they should do the same as native Kubernetes groups would do: each group should register to Scheme.versionMap["group/version"], each should has separate RESTMapper and the register the MultiRESTMapper. - -To expose a list of the supported Openshift groups to clients, OpenShift just has to call to pkg/cmd/server/origin#call initAPIVersionRoute() as it does now, passing in the supported "group/versions" instead of "versions". - - -## Future work - -1. Dependencies between groups: we need an interface to register the dependencies between groups. It is not our priority now as the use cases are not clear yet. 
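To illustrate the "group/version" bookkeeping convention used throughout this proposal, the following is a minimal, self-contained sketch (the helper name `splitGroupVersion` is illustrative and not part of the proposal) showing how a TypeMeta.APIVersion string maps back to a group and a version, with the bare "v1" form treated as the legacy group with an empty name:

```go
package main

import (
	"fmt"
	"strings"
)

// splitGroupVersion interprets a TypeMeta.APIVersion value of the form
// "group/version". A value with no "/" (e.g. "v1") belongs to the legacy
// group with an empty name, which keeps existing v1 config files valid.
func splitGroupVersion(apiVersion string) (group, version string) {
	if i := strings.Index(apiVersion, "/"); i >= 0 {
		return apiVersion[:i], apiVersion[i+1:]
	}
	return "", apiVersion
}

func main() {
	for _, v := range []string{"v1", "experimental/v1alpha1"} {
		g, ver := splitGroupVersion(v)
		fmt.Printf("apiVersion=%q -> group=%q version=%q\n", v, g, ver)
	}
}
```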
- - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/api-group.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/api-group.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/api-group.md) diff --git a/docs/proposals/apiserver-watch.md b/docs/proposals/apiserver-watch.md index 9764768f198..1e22086e673 100644 --- a/docs/proposals/apiserver-watch.md +++ b/docs/proposals/apiserver-watch.md @@ -1,145 +1 @@ -## Abstract - -In the current system, most watch requests sent to apiserver are redirected to -etcd. This means that for every watch request the apiserver opens a watch on -etcd. - -The purpose of the proposal is to improve the overall performance of the system -by solving the following problems: - -- having too many open watches on etcd -- avoiding deserializing/converting the same objects multiple times in different -watch results - -In the future, we would also like to add an indexing mechanism to the watch. -Although Indexer is not part of this proposal, it is supposed to be compatible -with it - in the future Indexer should be incorporated into the proposed new -watch solution in apiserver without requiring any redesign. - - -## High level design - -We are going to solve those problems by allowing many clients to watch the same -storage in the apiserver, without being redirected to etcd. - -At the high level, apiserver will have a single watch open to etcd, watching all -the objects (of a given type) without any filtering. The changes delivered from -etcd will then be stored in a cache in apiserver. This cache is in fact a -"rolling history window" that will support clients having some amount of latency -between their list and watch calls. Thus it will have a limited capacity and -whenever a new change comes from etcd when a cache is full, the oldest change -will be remove to make place for the new one. - -When a client sends a watch request to apiserver, instead of redirecting it to -etcd, it will cause: - - - registering a handler to receive all new changes coming from etcd - - iterating though a watch window, starting at the requested resourceVersion - to the head and sending filtered changes directory to the client, blocking - the above until this iteration has caught up - -This will be done be creating a go-routine per watcher that will be responsible -for performing the above. - -The following section describes the proposal in more details, analyzes some -corner cases and divides the whole design in more fine-grained steps. - - -## Proposal details - -We would like the cache to be __per-resource-type__ and __optional__. Thanks to -it we will be able to: - - have different cache sizes for different resources (e.g. bigger cache - [= longer history] for pods, which can significantly affect performance) - - avoid any overhead for objects that are watched very rarely (e.g. events - are almost not watched at all, but there are a lot of them) - - filter the cache for each watcher more effectively - -If we decide to support watches spanning different resources in the future and -we have an efficient indexing mechanisms, it should be relatively simple to unify -the cache to be common for all the resources. - -The rest of this section describes the concrete steps that need to be done -to implement the proposal. - -1. 
Since we want the watch in apiserver to be optional for different resource -types, this needs to be self-contained and hidden behind a well defined API. -This should be a layer very close to etcd - in particular all registries: -"pkg/registry/generic/registry" should be built on top of it. -We will solve it by turning tools.EtcdHelper by extracting its interface -and treating this interface as this API - the whole watch mechanisms in -apiserver will be hidden behind that interface. -Thanks to it we will get an initial implementation for free and we will just -need to reimplement few relevant functions (probably just Watch and List). -Moreover, this will not require any changes in other parts of the code. -This step is about extracting the interface of tools.EtcdHelper. - -2. Create a FIFO cache with a given capacity. In its "rolling history window" -we will store two things: - - - the resourceVersion of the object (being an etcdIndex) - - the object watched from etcd itself (in a deserialized form) - - This should be as simple as having an array an treating it as a cyclic buffer. - Obviously resourceVersion of objects watched from etcd will be increasing, but - they are necessary for registering a new watcher that is interested in all the - changes since a given etcdIndex. - - Additionally, we should support LIST operation, otherwise clients can never - start watching at now. We may consider passing lists through etcd, however - this will not work once we have Indexer, so we will need that information - in memory anyway. - Thus, we should support LIST operation from the "end of the history" - i.e. - from the moment just after the newest cached watched event. It should be - pretty simple to do, because we can incrementally update this list whenever - the new watch event is watched from etcd. - We may consider reusing existing structures cache.Store or cache.Indexer - ("pkg/client/cache") but this is not a hard requirement. - -3. Create the new implementation of the API, that will internally have a -single watch open to etcd and will store the data received from etcd in -the FIFO cache - this includes implementing registration of a new watcher -which will start a new go-routine responsible for iterating over the cache -and sending all the objects watcher is interested in (by applying filtering -function) to the watcher. - -4. Add a support for processing "error too old" from etcd, which will require: - - disconnect all the watchers - - clear the internal cache and relist all objects from etcd - - start accepting watchers again - -5. Enable watch in apiserver for some of the existing resource types - this -should require only changes at the initialization level. - -6. The next step will be to incorporate some indexing mechanism, but details -of it are TBD. - - - -### Future optimizations: - -1. The implementation of watch in apiserver internally will open a single -watch to etcd, responsible for watching all the changes of objects of a given -resource type. However, this watch can potentially expire at any time and -reconnecting can return "too old resource version". In that case relisting is -necessary. In such case, to avoid LIST requests coming from all watchers at -the same time, we can introduce an additional etcd event type: -[EtcdResync](../../pkg/storage/etcd/etcd_watcher.go#L36) - - Whenever relisting will be done to refresh the internal watch to etcd, - EtcdResync event will be send to all the watchers. 
It will contain the - full list of all the objects the watcher is interested in (appropriately - filtered) as the parameter of this watch event. - Thus, we need to create the EtcdResync event, extend watch.Interface and - its implementations to support it and handle those events appropriately - in places like - [Reflector](../../pkg/client/cache/reflector.go) - - However, this might turn out to be unnecessary optimization if apiserver - will always keep up (which is possible in the new design). We will work - out all necessary details at that point. - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/apiserver-watch.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/apiserver-watch.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/apiserver-watch.md) diff --git a/docs/proposals/apparmor.md b/docs/proposals/apparmor.md index d7051567792..98ecc8ce623 100644 --- a/docs/proposals/apparmor.md +++ b/docs/proposals/apparmor.md @@ -1,310 +1 @@ - - -- [Overview](#overview) - - [Motivation](#motivation) - - [Related work](#related-work) -- [Alpha Design](#alpha-design) - - [Overview](#overview-1) - - [Prerequisites](#prerequisites) - - [API Changes](#api-changes) - - [Pod Security Policy](#pod-security-policy) - - [Deploying profiles](#deploying-profiles) - - [Testing](#testing) -- [Beta Design](#beta-design) - - [API Changes](#api-changes-1) -- [Future work](#future-work) - - [System component profiles](#system-component-profiles) - - [Deploying profiles](#deploying-profiles-1) - - [Custom app profiles](#custom-app-profiles) - - [Security plugins](#security-plugins) - - [Container Runtime Interface](#container-runtime-interface) - - [Alerting](#alerting) - - [Profile authoring](#profile-authoring) -- [Appendix](#appendix) - - - -# Overview - -AppArmor is a [mandatory access control](https://en.wikipedia.org/wiki/Mandatory_access_control) -(MAC) system for Linux that supplements the standard Linux user and group based -permissions. AppArmor can be configured for any application to reduce the potential attack surface -and provide greater [defense in depth](https://en.wikipedia.org/wiki/Defense_in_depth_(computing)). -It is configured through profiles tuned to whitelist the access needed by a specific program or -container, such as Linux capabilities, network access, file permissions, etc. Each profile can be -run in either enforcing mode, which blocks access to disallowed resources, or complain mode, which -only reports violations. - -AppArmor is similar to SELinux. Both are MAC systems implemented as a Linux security module (LSM), -and are mutually exclusive. SELinux offers a lot of power and very fine-grained controls, but is -generally considered very difficult to understand and maintain. AppArmor sacrifices some of that -flexibility in favor of ease of use. Seccomp-bpf is another Linux kernel security feature for -limiting attack surface, and can (and should!) be used alongside AppArmor. - -## Motivation - -AppArmor can enable users to run a more secure deployment, and / or provide better auditing and -monitoring of their systems. Although it is not the only solution, we should enable AppArmor for -users that want a simpler alternative to SELinux, or are already maintaining a set of AppArmor -profiles. We have heard from multiple Kubernetes users already that AppArmor support is important to -them. 
The [seccomp proposal](../../docs/design/seccomp.md#use-cases) details several use cases that -also apply to AppArmor. - -## Related work - -Much of this design is drawn from the work already done to support seccomp profiles in Kubernetes, -which is outlined in the [seccomp design doc](../../docs/design/seccomp.md). The designs should be -kept close to apply lessons learned, and reduce cognitive and maintenance overhead. - -Docker has supported AppArmor profiles since version 1.3, and maintains a default profile which is -applied to all containers on supported systems. - -AppArmor was upstreamed into the Linux kernel in version 2.6.36. It is currently maintained by -[Canonical](http://www.canonical.com/), is shipped by default on all Ubuntu and openSUSE systems, -and is supported on several -[other distributions](http://wiki.apparmor.net/index.php/Main_Page#Distributions_and_Ports). - -# Alpha Design - -This section describes the proposed design for -[alpha-level](../../docs/devel/api_changes.md#alpha-beta-and-stable-versions) support, although -additional features are described in [future work](#future-work). For AppArmor alpha support -(targeted for Kubernetes 1.4) we will enable: - -- Specifying a pre-loaded profile to apply to a pod container -- Restricting pod containers to a set of profiles (admin use case) - -We will also provide a reference implementation of a pod for loading profiles on nodes, but an -official supported mechanism for deploying profiles is out of scope for alpha. - -## Overview - -An AppArmor profile can be specified for a container through the Kubernetes API with a pod -annotation. If a profile is specified, the Kubelet will verify that the node meets the required -[prerequisites](#prerequisites) (e.g. the profile is already configured on the node) before starting -the container, and will not run the container if the profile cannot be applied. If the requirements -are met, the container runtime will configure the appropriate options to apply the profile. Profile -requirements and defaults can be specified on the -[PodSecurityPolicy](security-context-constraints.md). - -## Prerequisites - -When an AppArmor profile is specified, the Kubelet will verify the prerequisites for applying the -profile to the container. In order to [fail -securely](https://www.owasp.org/index.php/Fail_securely), a container **will not be run** if any of -the prerequisites are not met. The prerequisites are: - -1. **Kernel support** - The AppArmor kernel module is loaded. Can be checked by - [libcontainer](https://github.com/opencontainers/runc/blob/4dedd0939638fc27a609de1cb37e0666b3cf2079/libcontainer/apparmor/apparmor.go#L17). -2. **Runtime support** - For the initial implementation, Docker will be required (rkt does not - currently have AppArmor support). All supported Docker versions include AppArmor support. See - [Container Runtime Interface](#container-runtime-interface) for other runtimes. -3. **Installed profile** - The target profile must be loaded prior to starting the container. Loaded - profiles can be found in the AppArmor securityfs \[1\]. - -If any of the prerequisites are not met an event will be generated to report the error and the pod -will be -[rejected](https://github.com/kubernetes/kubernetes/blob/cdfe7b7b42373317ecd83eb195a683e35db0d569/pkg/kubelet/kubelet.go#L2201) -by the Kubelet. - -*[1] The securityfs can be found in `/proc/mounts`, and defaults to `/sys/kernel/security` on my -Ubuntu system. 
The profiles can be found at `{securityfs}/apparmor/profiles` -([example](http://bazaar.launchpad.net/~apparmor-dev/apparmor/master/view/head:/utils/aa-status#L137)).* - -## API Changes - -The initial alpha support of AppArmor will follow the pattern -[used by seccomp](https://github.com/kubernetes/kubernetes/pull/25324) and specify profiles through -annotations. Profiles can be specified per-container through pod annotations. The annotation format -is a key matching the container, and a profile name value: - -``` -container.apparmor.security.alpha.kubernetes.io/= -``` - -The profiles can be specified in the following formats (following the convention used by [seccomp](../../docs/design/seccomp.md#api-changes)): - -1. `runtime/default` - Applies the default profile for the runtime. For docker, the profile is - generated from a template - [here](https://github.com/docker/docker/blob/master/profiles/apparmor/template.go). If no - AppArmor annotations are provided, this profile is enabled by default if AppArmor is enabled in - the kernel. Runtimes may define this to be unconfined, as Docker does for privileged pods. -2. `localhost/` - The profile name specifies the profile to load. - -*Note: There is no way to explicitly specify an "unconfined" profile, since it is discouraged. If - this is truly needed, the user can load an "allow-all" profile.* - -### Pod Security Policy - -The [PodSecurityPolicy](security-context-constraints.md) allows cluster administrators to control -the security context for a pod and its containers. An annotation can be specified on the -PodSecurityPolicy to restrict which AppArmor profiles can be used, and specify a default if no -profile is specified. - -The annotation key is `apparmor.security.alpha.kubernetes.io/allowedProfileNames`. The value is a -comma delimited list, with each item following the format described [above](#api-changes). If a list -of profiles are provided and a pod does not have an AppArmor annotation, the first profile in the -list will be used by default. - -Enforcement of the policy is standard. See the -[seccomp implementation](https://github.com/kubernetes/kubernetes/pull/28300) as an example. - -## Deploying profiles - -We will provide a reference implementation of a DaemonSet pod for loading profiles on nodes, but -there will not be an official mechanism or API in the initial version (see -[future work](#deploying-profiles-1)). The reference container will contain the `apparmor_parser` -tool and a script for using the tool to load all profiles in a set of (configurable) -directories. The initial implementation will poll (with a configurable interval) the directories for -additions, but will not update or unload existing profiles. The pod can be run in a DaemonSet to -load the profiles onto all nodes. The pod will need to be run in privileged mode. - -This simple design should be sufficient to deploy AppArmor profiles from any volume source, such as -a ConfigMap or PersistentDisk. Users seeking more advanced features should be able extend this -design easily. - -## Testing - -Our e2e testing framework does not currently run nodes with AppArmor enabled, but we can run a node -e2e test suite on an AppArmor enabled node. The cases we should test are: - -- *PodSecurityPolicy* - These tests can be run on a cluster even if AppArmor is not enabled on the - nodes. 
- - No AppArmor policy allows pods with arbitrary profiles - - With a policy a default is selected - - With a policy arbitrary profiles are prevented - - With a policy allowed profiles are allowed -- *Node AppArmor enforcement* - These tests need to run on AppArmor enabled nodes, in the node e2e - suite. - - A valid container profile gets applied - - An unloaded profile will be rejected - -# Beta Design - -The only part of the design that changes for beta is the API, which is upgraded from -annotation-based to first class fields. - -## API Changes - -AppArmor profiles will be specified in the container's SecurityContext, as part of an -`AppArmorOptions` struct. The options struct makes the API more flexible to future additions. - -```go -type SecurityContext struct { - ... - // The AppArmor options to be applied to the container. - AppArmorOptions *AppArmorOptions `json:"appArmorOptions,omitempty"` - ... -} - -// Reference to an AppArmor profile loaded on the host. -type AppArmorProfileName string - -// Options specifying how to run Containers with AppArmor. -type AppArmorOptions struct { - // The profile the Container must be run with. - Profile AppArmorProfileName `json:"profile"` -} -``` - -The `AppArmorProfileName` format matches the format for the profile annotation values describe -[above](#api-changes). - -The `PodSecurityPolicySpec` receives a similar treatment with the addition of an -`AppArmorStrategyOptions` struct. Here the `DefaultProfile` is separated from the `AllowedProfiles` -in the interest of making the behavior more explicit. - -```go -type PodSecurityPolicySpec struct { - ... - AppArmorStrategyOptions *AppArmorStrategyOptions `json:"appArmorStrategyOptions,omitempty"` - ... -} - -// AppArmorStrategyOptions specifies AppArmor restrictions and requirements for pods and containers. -type AppArmorStrategyOptions struct { - // If non-empty, all pod containers must be run with one of the profiles in this list. - AllowedProfiles []AppArmorProfileName `json:"allowedProfiles,omitempty"` - // The default profile to use if a profile is not specified for a container. - // Defaults to "runtime/default". Must be allowed by AllowedProfiles. - DefaultProfile AppArmorProfileName `json:"defaultProfile,omitempty"` -} -``` - -# Future work - -Post-1.4 feature ideas. These are not fully-fleshed designs. - -## System component profiles - -We should publish (to GitHub) AppArmor profiles for all Kubernetes system components, including core -components like the API server and controller manager, as well as addons like influxDB and -Grafana. `kube-up.sh` and its successor should have an option to apply the profiles, if the AppArmor -is supported by the nodes. Distros that support AppArmor and provide a Kubernetes package should -include the profiles out of the box. - -## Deploying profiles - -We could provide an official supported solution for loading profiles on the nodes. One option is to -extend the reference implementation described [above](#deploying-profiles) into a DaemonSet that -watches the directory sources to sync changes, or to watch a ConfigMap object directly. Another -option is to add an official API for this purpose, and load the profiles on-demand in the Kubelet. - -## Custom app profiles - -[Profile stacking](http://wiki.apparmor.net/index.php/AppArmorStacking) is an AppArmor feature -currently in development that will enable multiple profiles to be applied to the same object. If -profiles are stacked, the allowed set of operations is the "intersection" of both profiles -(i.e. 
stacked profiles are never more permissive). Taking advantage of this feature, the cluster -administrator could restrict the allowed profiles on a PodSecurityPolicy to a few broad profiles, -and then individual apps could apply more app specific profiles on top. - -## Security plugins - -AppArmor, SELinux, TOMOYO, grsecurity, SMACK, etc. are all Linux MAC implementations with similar -requirements and features. At the very least, the AppArmor implementation should be factored in a -way that makes it easy to add alternative systems. A more advanced approach would be to extract a -set of interfaces for plugins implementing the alternatives. An even higher level approach would be -to define a common API or profile interface for all of them. Work towards this last option is -already underway for Docker, called -[Docker Security Profiles](https://github.com/docker/docker/issues/17142#issuecomment-148974642). - -## Container Runtime Interface - -Other container runtimes will likely add AppArmor support eventually, so the -[Container Runtime Interface](container-runtime-interface-v1.md) (CRI) needs to be made compatible -with this design. The two important pieces are a way to report whether AppArmor is supported by the -runtime, and a way to specify the profile to load (likely through the `LinuxContainerConfig`). - -## Alerting - -Whether AppArmor is running in enforcing or complain mode it generates logs of policy -violations. These logs can be important cues for intrusion detection, or at the very least a bug in -the profile. Violations should almost always generate alerts in production systems. We should -provide reference documentation for setting up alerts. - -## Profile authoring - -A common method for writing AppArmor profiles is to start with a restrictive profile in complain -mode, and then use the `aa-logprof` tool to build a profile from the logs. We should provide -documentation for following this process in a Kubernetes environment. 
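As a companion to the profile formats described in the API changes section above, the sketch below shows one way a component could validate a profile reference before use. It is illustrative only; the function and constant names are hypothetical and not part of this proposal:

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical constants mirroring the two supported profile formats.
const (
	profileRuntimeDefault  = "runtime/default"
	profileLocalhostPrefix = "localhost/"
)

// validateProfileRef checks that an annotation value is either
// "runtime/default" or "localhost/<profile_name>" with a non-empty name.
func validateProfileRef(ref string) error {
	if ref == profileRuntimeDefault {
		return nil
	}
	if name := strings.TrimPrefix(ref, profileLocalhostPrefix); name != ref && name != "" {
		return nil
	}
	return fmt.Errorf("invalid AppArmor profile reference %q", ref)
}

func main() {
	for _, ref := range []string{"runtime/default", "localhost/k8s-nginx", "unconfined"} {
		fmt.Println(ref, "->", validateProfileRef(ref))
	}
}
```

Note that, consistent with the API section above, an explicit "unconfined" value is rejected rather than treated as a third format.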
- -# Appendix - -- [What is AppArmor](https://askubuntu.com/questions/236381/what-is-apparmor) -- [Debugging AppArmor on Docker](https://github.com/docker/docker/blob/master/docs/security/apparmor.md#debug-apparmor) -- Load an AppArmor profile with `apparmor_parser` (required by Docker so it should be available): - - ``` - $ apparmor_parser --replace --write-cache /path/to/profile - ``` - -- Unload with: - - ``` - $ apparmor_parser --remove /path/to/profile - ``` - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/apparmor.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/apparmor.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/apparmor.md) diff --git a/docs/proposals/client-package-structure.md b/docs/proposals/client-package-structure.md index 2d30021df92..d44ee86c6db 100644 --- a/docs/proposals/client-package-structure.md +++ b/docs/proposals/client-package-structure.md @@ -1,316 +1 @@ - - -- [Client: layering and package structure](#client-layering-and-package-structure) - - [Desired layers](#desired-layers) - - [Transport](#transport) - - [RESTClient/request.go](#restclientrequestgo) - - [Mux layer](#mux-layer) - - [High-level: Individual typed](#high-level-individual-typed) - - [High-level, typed: Discovery](#high-level-typed-discovery) - - [High-level: Dynamic](#high-level-dynamic) - - [High-level: Client Sets](#high-level-client-sets) - - [Package Structure](#package-structure) - - [Client Guarantees (and testing)](#client-guarantees-and-testing) - - - -# Client: layering and package structure - -## Desired layers - -### Transport - -The transport layer is concerned with round-tripping requests to an apiserver -somewhere. It consumes a Config object with options appropriate for this. -(That's most of the current client.Config structure.) - -Transport delivers an object that implements http's RoundTripper interface -and/or can be used in place of http.DefaultTransport to route requests. - -Transport objects are safe for concurrent use, and are cached and reused by -subsequent layers. - -Tentative name: "Transport". - -It's expected that the transport config will be general enough that third -parties (e.g., OpenShift) will not need their own implementation, rather they -can change the certs, token, etc., to be appropriate for their own servers, -etc.. - -Action items: -* Split out of current client package into a new package. (@krousey) - -### RESTClient/request.go - -RESTClient consumes a Transport and a Codec (and optionally a group/version), -and produces something that implements the interface currently in request.go. -That is, with a RESTClient, you can write chains of calls like: - -`c.Get().Path(p).Param("name", "value").Do()` - -RESTClient is generically usable by any client for servers exposing REST-like -semantics. It provides helpers that benefit those following api-conventions.md, -but does not mandate them. It provides a higher level http interface that -abstracts transport, wire serialization, retry logic, and error handling. -Kubernetes-like constructs that deviate from standard HTTP should be bypassable. -Every non-trivial call made to a remote restful API from Kubernetes code should -go through a rest client. - -The group and version may be empty when constructing a RESTClient. This is valid -for executing discovery commands. The group and version may be overridable with -a chained function call. 
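To make the preceding paragraph concrete, the sketch below (not part of the proposed interface) shows the base URLs a RESTClient would address for discovery versus a specific group/version, assuming the prefix/group/version URL layout described in the API group proposal (e.g. /apis/experimental/v1alpha1):

```go
package main

import (
	"fmt"
	"path"
)

// baseURL shows how prefix, group, and version combine into the request
// base; with an empty group and version the same client addresses the
// discovery endpoints directly.
func baseURL(prefix, group, version string) string {
	return "/" + path.Join(prefix, group, version)
}

func main() {
	fmt.Println(baseURL("apis", "", ""))                     // "/apis" (discovery)
	fmt.Println(baseURL("apis", "experimental", "v1alpha1")) // "/apis/experimental/v1alpha1"
	fmt.Println(baseURL("api", "", "v1"))                    // "/api/v1" (legacy group)
}
```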
- -Ideally, no semantic behavior is built into RESTClient, and RESTClient will use -the Codec it was constructed with for all semantic operations, including turning -options objects into URL query parameters. Unfortunately, that is not true of -today's RESTClient, which may have some semantic information built in. We will -remove this. - -RESTClient should not make assumptions about the format of data produced or -consumed by the Codec. Currently, it is JSON, but we want to support binary -protocols in the future. - -The Codec would look something like this: - -```go -type Codec interface { - Encode(runtime.Object) ([]byte, error) - Decode([]byte]) (runtime.Object, error) - - // Used to version-control query parameters - EncodeParameters(optionsObject runtime.Object) (url.Values, error) - - // Not included here since the client doesn't need it, but a corresponding - // DecodeParametersInto method would be available on the server. -} -``` - -There should be one codec per version. RESTClient is *not* responsible for -converting between versions; if a client wishes, they can supply a Codec that -does that. But RESTClient will make the assumption that it's talking to a single -group/version, and will not contain any conversion logic. (This is a slight -change from the current state.) - -As with Transport, it is expected that 3rd party providers following the api -conventions should be able to use RESTClient, and will not need to implement -their own. - -Action items: -* Split out of the current client package. (@krousey) -* Possibly, convert to an interface (currently, it's a struct). This will allow - extending the error-checking monad that's currently in request.go up an - additional layer. -* Switch from ParamX("x") functions to using types representing the collection - of parameters and the Codec for query parameter serialization. -* Any other Kubernetes group specific behavior should also be removed from - RESTClient. - -### Mux layer - -(See TODO at end; this can probably be merged with the "client set" concept.) - -The client muxer layer has a map of group/version to cached RESTClient, and -knows how to construct a new RESTClient in case of a cache miss (using the -discovery client mentioned below). The ClientMux may need to deal with multiple -transports pointing at differing destinations (e.g. OpenShift or other 3rd party -provider API may be at a different location). - -When constructing a RESTClient generically, the muxer will just use the Codec -the high-level dynamic client would use. Alternatively, the user should be able -to pass in a Codec-- for the case where the correct types are compiled in. - -Tentative name: ClientMux - -Action items: -* Move client cache out of kubectl libraries into a more general home. -* TODO: a mux layer may not be necessary, depending on what needs to be cached. - If transports are cached already, and RESTClients are extremely light-weight, - there may not need to be much code at all in this layer. - -### High-level: Individual typed - -Our current high-level client allows you to write things like -`c.Pods("namespace").Create(p)`; we will insert a level for the group. - -That is, the system will be: - -`clientset.GroupName().NamespaceSpecifier().Action()` - -Where: -* `clientset` is a thing that holds multiple individually typed clients (see - below). -* `GroupName()` returns the generated client that this section is about. -* `NamespaceSpecifier()` may take a namespace parameter or nothing. 
-* `Action` is one of Create/Get/Update/Delete/Watch, or appropriate actions - from the type's subresources. -* It is TBD how we'll represent subresources and their actions. This is - inconsistent in the current clients, so we'll need to define a consistent - format. Possible choices: - * Insert a `.Subresource()` before the `.Action()` - * Flatten subresources, such that they become special Actions on the parent - resource. - -The types returned/consumed by such functions will be e.g. api/v1, NOT the -current version inspecific types. The current internal-versioned client is -inconvenient for users, as it does not protect them from having to recompile -their code with every minor update. (We may continue to generate an -internal-versioned client for our own use for a while, but even for our own -components it probably makes sense to switch to specifically versioned clients.) - -We will provide this structure for each version of each group. It is infeasible -to do this manually, so we will generate this. The generator will accept both -swagger and the ordinary go types. The generator should operate on out-of-tree -sources AND out-of-tree destinations, so it will be useful for consuming -out-of-tree APIs and for others to build custom clients into their own -repositories. - -Typed clients will be constructable given a ClientMux; the typed constructor will use -the ClientMux to find or construct an appropriate RESTClient. Alternatively, a -typed client should be constructable individually given a config, from which it -will be able to construct the appropriate RESTClient. - -Typed clients do not require any version negotiation. The server either supports -the client's group/version, or it does not. However, there are ways around this: -* If you want to use a typed client against a server's API endpoint and the - server's API version doesn't match the client's API version, you can construct - the client with a RESTClient using a Codec that does the conversion (this is - basically what our client does now). -* Alternatively, you could use the dynamic client. - -Action items: -* Move current typed clients into new directory structure (described below) -* Finish client generation logic. (@caesarxuchao, @lavalamp) - -#### High-level, typed: Discovery - -A `DiscoveryClient` is necessary to discover the api groups, versions, and -resources a server supports. It's constructable given a RESTClient. It is -consumed by both the ClientMux and users who want to iterate over groups, -versions, or resources. (Example: namespace controller.) - -The DiscoveryClient is *not* required if you already know the group/version of -the resource you want to use: you can simply try the operation without checking -first, which is lower-latency anyway as it avoids an extra round-trip. - -Action items: -* Refactor existing functions to present a sane interface, as close to that - offered by the other typed clients as possible. (@caeserxuchao) -* Use a RESTClient to make the necessary API calls. -* Make sure that no discovery happens unless it is explicitly requested. (Make - sure SetKubeDefaults doesn't call it, for example.) - -### High-level: Dynamic - -The dynamic client lets users consume apis which are not compiled into their -binary. It will provide the same interface as the typed client, but will take -and return `runtime.Object`s instead of typed objects. 
There is only one dynamic -client, so it's not necessary to generate it, although optionally we may do so -depending on whether the typed client generator makes it easy. - -A dynamic client is constructable given a config, group, and version. It will -use this to construct a RESTClient with a Codec which encodes/decodes to -'Unstructured' `runtime.Object`s. The group and version may be from a previous -invocation of a DiscoveryClient, or they may be known by other means. - -For now, the dynamic client will assume that a JSON encoding is allowed. In the -future, if we have binary-only APIs (unlikely?), we can add that to the -discovery information and construct an appropriate dynamic Codec. - -Action items: -* A rudimentary version of this exists in kubectl's builder. It needs to be - moved to a more general place. -* Produce a useful 'Unstructured' runtime.Object, which allows for easy - Object/ListMeta introspection. - -### High-level: Client Sets - -Because there will be multiple groups with multiple versions, we will provide an -aggregation layer that combines multiple typed clients in a single object. - -We do this to: -* Deliver a concrete thing for users to consume, construct, and pass around. We - don't want people making 10 typed clients and making a random system to keep - track of them. -* Constrain the testing matrix. Users can generate a client set at their whim - against their cluster, but we need to make guarantees that the clients we - shipped with v1.X.0 will work with v1.X+1.0, and vice versa. That's not - practical unless we "bless" a particular version of each API group and ship an - official client set with earch release. (If the server supports 15 groups with - 2 versions each, that's 2^15 different possible client sets. We don't want to - test all of them.) - -A client set is generated into its own package. The generator will take the list -of group/versions to be included. Only one version from each group will be in -the client set. - -A client set is constructable at runtime from either a ClientMux or a transport -config (for easy one-stop-shopping). - -An example: - -```go -import ( - api_v1 "k8s.io/kubernetes/pkg/client/typed/generated/v1" - ext_v1beta1 "k8s.io/kubernetes/pkg/client/typed/generated/extensions/v1beta1" - net_v1beta1 "k8s.io/kubernetes/pkg/client/typed/generated/net/v1beta1" - "k8s.io/kubernetes/pkg/client/typed/dynamic" -) - -type Client interface { - API() api_v1.Client - Extensions() ext_v1beta1.Client - Net() net_v1beta1.Client - // ... other typed clients here. - - // Included in every set - Discovery() discovery.Client - GroupVersion(group, version string) dynamic.Client -} -``` - -Note that a particular version is chosen for each group. It is a general rule -for our API structure that no client need care about more than one version of -each group at a time. - -This is the primary deliverable that people would consume. It is also generated. - -Action items: -* This needs to be built. It will replace the ClientInterface that everyone - passes around right now. 
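As a usage illustration of the call shape `clientset.GroupName().NamespaceSpecifier().Action()`, here is a self-contained sketch. The types are stand-ins for the generated interfaces; real generated clients return typed objects (e.g. a pod list for the blessed version of the group) rather than strings:

```go
package main

import "fmt"

// The types below only exist to show the call shape
// clientset.GroupName().NamespaceSpecifier().Action().
type podLister struct{ namespace string }

func (p podLister) List() string { return "listing pods in " + p.namespace }

type coreGroup struct{}

func (coreGroup) Pods(namespace string) podLister { return podLister{namespace: namespace} }

type clientset struct{}

func (clientset) API() coreGroup { return coreGroup{} }

func main() {
	var cs clientset
	fmt.Println(cs.API().Pods("kube-system").List())
}
```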
- -## Package Structure - -``` -pkg/client/ -----------/transport/ # transport & associated config -----------/restclient/ -----------/clientmux/ -----------/typed/ -----------------/discovery/ -----------------/generated/ ---------------------------// -----------------------------------// ---------------------------------------------/.go -----------------/dynamic/ -----------/clientsets/ ----------------------/release-1.1/ ----------------------/release-1.2/ ----------------------/the-test-set-you-just-generated/ -``` - -`/clientsets/` will retain their contents until they reach their expire date. -e.g., when we release v1.N, we'll remove clientset v1.(N-3). Clients from old -releases live on and continue to work (i.e., are tested) without any interface -changes for multiple releases, to give users time to transition. - -## Client Guarantees (and testing) - -Once we release a clientset, we will not make interface changes to it. Users of -that client will not have to change their code until they are deliberately -upgrading their import. We probably will want to generate some sort of stub test -with a clientset, to ensure that we don't change the interface. - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/client-package-structure.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/client-package-structure.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/client-package-structure.md) diff --git a/docs/proposals/cluster-deployment.md b/docs/proposals/cluster-deployment.md index c6466f89cf2..d7f7ab364dd 100644 --- a/docs/proposals/cluster-deployment.md +++ b/docs/proposals/cluster-deployment.md @@ -1,171 +1 @@ -# Objective - -Simplify the cluster provisioning process for a cluster with one master and multiple worker nodes. -It should be secured with SSL and have all the default add-ons. There should not be significant -differences in the provisioning process across deployment targets (cloud provider + OS distribution) -once machines meet the node specification. - -# Overview - -Cluster provisioning can be broken into a number of phases, each with their own exit criteria. -In some cases, multiple phases will be combined together to more seamlessly automate the cluster setup, -but in all cases the phases can be run sequentially to provision a functional cluster. - -It is possible that for some platforms we will provide an optimized flow that combines some of the steps -together, but that is out of scope of this document. - -# Deployment flow - -**Note**: _Exit critieria_ in the following sections are not intended to list all tests that should pass, -rather list those that must pass. - -## Step 1: Provision cluster - -**Objective**: Create a set of machines (master + nodes) where we will deploy Kubernetes. - -For this phase to be completed successfully, the following requirements must be completed for all nodes: -- Basic connectivity between nodes (i.e. nodes can all ping each other) -- Docker installed (and in production setups should be monitored to be always running) -- One of the supported OS - -We will provide a node specification conformance test that will verify if provisioning has been successful. - -This step is provider specific and will be implemented for each cloud provider + OS distribution separately -using provider specific technology (cloud formation, deployment manager, PXE boot, etc). 
-Some OS distributions may meet the provisioning criteria without needing to run any post-boot steps as they -ship with all of the requirements for the node specification by default. - -**Substeps** (on the GCE example): - -1. Create network -2. Create firewall rules to allow communication inside the cluster -3. Create firewall rule to allow ```ssh``` to all machines -4. Create firewall rule to allow ```https``` to master -5. Create persistent disk for master -6. Create static IP address for master -7. Create master machine -8. Create node machines -9. Install docker on all machines - -**Exit critera**: - -1. Can ```ssh``` to all machines and run a test docker image -2. Can ```ssh``` to master and nodes and ping other machines - -## Step 2: Generate certificates - -**Objective**: Generate security certificates used to configure secure communication between client, master and nodes - -TODO: Enumerate certificates which have to be generated. - -## Step 3: Deploy master - -**Objective**: Run kubelet and all the required components (e.g. etcd, apiserver, scheduler, controllers) on the master machine. - -**Substeps**: - -1. copy certificates -2. copy manifests for static pods: - 1. etcd - 2. apiserver, controller manager, scheduler -3. run kubelet in docker container (configuration is read from apiserver Config object) -4. run kubelet-checker in docker container - -**v1.2 simplifications**: - -1. kubelet-runner.sh - we will provide a custom docker image to run kubelet; it will contain -kubelet binary and will run it using ```nsenter``` to workaround problem with mount propagation -1. kubelet config file - we will read kubelet configuration file from disk instead of apiserver; it will -be generated locally and copied to all nodes. - -**Exit criteria**: - -1. Can run basic API calls (e.g. create, list and delete pods) from the client side (e.g. replication -controller works - user can create RC object and RC manager can create pods based on that) -2. Critical master components works: - 1. scheduler - 2. controller manager - -## Step 4: Deploy nodes - -**Objective**: Start kubelet on all nodes and configure kubernetes network. -Each node can be deployed separately and the implementation should make it ~impossible to change this assumption. - -### Step 4.1: Run kubelet - -**Substeps**: - -1. copy certificates -2. run kubelet in docker container (configuration is read from apiserver Config object) -3. run kubelet-checker in docker container - -**v1.2 simplifications**: - -1. kubelet config file - we will read kubelet configuration file from disk instead of apiserver; it will -be generated locally and copied to all nodes. - -**Exit critera**: - -1. All nodes are registered, but not ready due to lack of kubernetes networking. - -### Step 4.2: Setup kubernetes networking - -**Objective**: Configure the Kubernetes networking to allow routing requests to pods and services. - -To keep default setup consistent across open source deployments we will use Flannel to configure -kubernetes networking. However, implementation of this step will allow to easily plug in different -network solutions. - -**Substeps**: - -1. copy manifest for flannel server to master machine -2. create a daemonset with flannel daemon (it will read assigned CIDR and configure network appropriately). - -**v1.2 simplifications**: - -1. flannel daemon will run as a standalone binary (not in docker container) -2. 
flannel server will assign CIDRs to nodes outside of kubernetes; this will require restarting kubelet -after reconfiguring network bridge on local machine; this will also require running master nad node differently -(```--configure-cbr0=false``` on node and ```--allocate-node-cidrs=false``` on master), which breaks encapsulation -between nodes - -**Exit criteria**: - -1. Pods correctly created, scheduled, run and accessible from all nodes. - -## Step 5: Add daemons - -**Objective:** Start all system daemons (e.g. kube-proxy) - -**Substeps:**: - -1. Create daemonset for kube-proxy - -**Exit criteria**: - -1. Services work correctly on all nodes. - -## Step 6: Add add-ons - -**Objective**: Add default add-ons (e.g. dns, dashboard) - -**Substeps:**: - -1. Create Deployments (and daemonsets if needed) for all add-ons - -## Deployment technology - -We will use Ansible as the default technology for deployment orchestration. It has low requirements on the cluster machines -and seems to be popular in kubernetes community which will help us to maintain it. - -For simpler UX we will provide simple bash scripts that will wrap all basic commands for deployment (e.g. ```up``` or ```down```) - -One disadvantage of using Ansible is that it adds a dependency on a machine which runs deployment scripts. We will workaround -this by distributing deployment scripts via a docker image so that user will run the following command to create a cluster: - -```docker run gcr.io/google_containers/deploy_kubernetes:v1.2 up --num-nodes=3 --provider=aws``` - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/cluster-deployment.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/cluster-deployment.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/cluster-deployment.md) diff --git a/docs/proposals/container-init.md b/docs/proposals/container-init.md index 6e9dbb4a041..90c4096e12e 100644 --- a/docs/proposals/container-init.md +++ b/docs/proposals/container-init.md @@ -1,444 +1 @@ -# Pod initialization - -@smarterclayton - -March 2016 - -## Proposal and Motivation - -Within a pod there is a need to initialize local data or adapt to the current -cluster environment that is not easily achieved in the current container model. -Containers start in parallel after volumes are mounted, leaving no opportunity -for coordination between containers without specialization of the image. If -two containers need to share common initialization data, both images must -be altered to cooperate using filesystem or network semantics, which introduces -coupling between images. Likewise, if an image requires configuration in order -to start and that configuration is environment dependent, the image must be -altered to add the necessary templating or retrieval. - -This proposal introduces the concept of an **init container**, one or more -containers started in sequence before the pod's normal containers are started. -These init containers may share volumes, perform network operations, and perform -computation prior to the start of the remaining containers. They may also, by -virtue of their sequencing, block or delay the startup of application containers -until some precondition is met. In this document we refer to the existing pod -containers as **app containers**. 
- -This proposal also provides a high level design of **volume containers**, which -initialize a particular volume, as a feature that specializes some of the tasks -defined for init containers. The init container design anticipates the existence -of volume containers and highlights where they will take future work - -## Design Points - -* Init containers should be able to: - * Perform initialization of shared volumes - * Download binaries that will be used in app containers as execution targets - * Inject configuration or extension capability to generic images at startup - * Perform complex templating of information available in the local environment - * Initialize a database by starting a temporary execution process and applying - schema info. - * Delay the startup of application containers until preconditions are met - * Register the pod with other components of the system -* Reduce coupling: - * Between application images, eliminating the need to customize those images for - Kubernetes generally or specific roles - * Inside of images, by specializing which containers perform which tasks - (install git into init container, use filesystem contents - in web container) - * Between initialization steps, by supporting multiple sequential init containers -* Init containers allow simple start preconditions to be implemented that are - decoupled from application code - * The order init containers start should be predictable and allow users to easily - reason about the startup of a container - * Complex ordering and failure will not be supported - all complex workflows can - if necessary be implemented inside of a single init container, and this proposal - aims to enable that ordering without adding undue complexity to the system. - Pods in general are not intended to support DAG workflows. -* Both run-once and run-forever pods should be able to use init containers -* As much as possible, an init container should behave like an app container - to reduce complexity for end users, for clients, and for divergent use cases. - An init container is a container with the minimum alterations to accomplish - its goal. -* Volume containers should be able to: - * Perform initialization of a single volume - * Start in parallel - * Perform computation to initialize a volume, and delay start until that - volume is initialized successfully. - * Using a volume container that does not populate a volume to delay pod start - (in the absence of init containers) would be an abuse of the goal of volume - containers. -* Container pre-start hooks are not sufficient for all initialization cases: - * They cannot easily coordinate complex conditions across containers - * They can only function with code in the image or code in a shared volume, - which would have to be statically linked (not a common pattern in wide use) - * They cannot be implemented with the current Docker implementation - see - [#140](https://github.com/kubernetes/kubernetes/issues/140) - - - -## Alternatives - -* Any mechanism that runs user code on a node before regular pod containers - should itself be a container and modeled as such - we explicitly reject - creating new mechanisms for running user processes. -* The container pre-start hook (not yet implemented) requires execution within - the container's image and so cannot adapt existing images. It also cannot - block startup of containers -* Running a "pre-pod" would defeat the purpose of the pod being an atomic - unit of scheduling. 
- - -## Design - -Each pod may have 0..N init containers defined along with the existing -1..M app containers. - -On startup of the pod, after the network and volumes are initialized, the -init containers are started in order. Each container must exit successfully -before the next is invoked. If a container fails to start (due to the runtime) -or exits with failure, it is retried according to the pod RestartPolicy. -RestartPolicyNever pods will immediately fail and exit. RestartPolicyAlways -pods will retry the failing init container with increasing backoff until it -succeeds. To align with the design of application containers, init containers -will only support "infinite retries" (RestartPolicyAlways) or "no retries" -(RestartPolicyNever). - -A pod cannot be ready until all init containers have succeeded. The ports -on an init container are not aggregated under a service. A pod that is -being initialized is in the `Pending` phase but should have a distinct -condition. Each app container and all future init containers should have -the reason `PodInitializing`. The pod should have a condition `Initializing` -set to `false` until all init containers have succeeded, and `true` thereafter. -If the pod is restarted, the `Initializing` condition should be set to `false. - -If the pod is "restarted" all containers stopped and started due to -a node restart, change to the pod definition, or admin interaction, all -init containers must execute again. Restartable conditions are defined as: - -* An init container image is changed -* The pod infrastructure container is restarted (shared namespaces are lost) -* The Kubelet detects that all containers in a pod are terminated AND - no record of init container completion is available on disk (due to GC) - -Changes to the init container spec are limited to the container image field. -Altering the container image field is equivalent to restarting the pod. - -Because init containers can be restarted, retried, or reexecuted, container -authors should make their init behavior idempotent by handling volumes that -are already populated or the possibility that this instance of the pod has -already contacted a remote system. - -Each init container has all of the fields of an app container. The following -fields are prohibited from being used on init containers by validation: - -* `readinessProbe` - init containers must exit for pod startup to continue, - are not included in rotation, and so cannot define readiness distinct from - completion. - -Init container authors may use `activeDeadlineSeconds` on the pod and -`livenessProbe` on the container to prevent init containers from failing -forever. The active deadline includes init containers. - -Because init containers are semantically different in lifecycle from app -containers (they are run serially, rather than in parallel), for backwards -compatibility and design clarity they will be identified as distinct fields -in the API: - - pod: - spec: - containers: ... - initContainers: - - name: init-container1 - image: ... - ... - - name: init-container2 - ... - status: - containerStatuses: ... - initContainerStatuses: - - name: init-container1 - ... - - name: init-container2 - ... - -This separation also serves to make the order of container initialization -clear - init containers are executed in the order that they appear, then all -app containers are started at once. - -The name of each app and init container in a pod must be unique - it is a -validation error for any container to share a name. 
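To make the naming rule concrete, a minimal sketch of the uniqueness check is shown below. It is illustrative only: the real check would live in API validation, and the `Container` type and `findDuplicateContainerName` helper here are reduced stand-ins, not the actual API types or functions.

```go
// Container is a reduced stand-in for the API container type; only the Name
// field matters for this check.
type Container struct {
	Name string
}

// findDuplicateContainerName returns the first name shared by two containers
// in the pod (init or app) and true, or "" and false if all names are unique.
func findDuplicateContainerName(initContainers, appContainers []Container) (string, bool) {
	seen := map[string]bool{}
	for _, list := range [][]Container{initContainers, appContainers} {
		for _, c := range list {
			if seen[c.Name] {
				return c.Name, true
			}
			seen[c.Name] = true
		}
	}
	return "", false
}
```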
- -While pod containers are in alpha state, they will be serialized as an annotation -on the pod with the name `pod.alpha.kubernetes.io/init-containers` and the status -of the containers will be stored as `pod.alpha.kubernetes.io/init-container-statuses`. -Mutation of these annotations is prohibited on existing pods. - - -### Resources - -Given the ordering and execution for init containers, the following rules -for resource usage apply: - -* The highest of any particular resource request or limit defined on all init - containers is the **effective init request/limit** -* The pod's **effective request/limit** for a resource is the higher of: - * sum of all app containers request/limit for a resource - * effective init request/limit for a resource -* Scheduling is done based on effective requests/limits, which means - init containers can reserve resources for initialization that are not used - during the life of the pod. -* The lowest QoS tier of init containers per resource is the **effective init QoS tier**, - and the highest QoS tier of both init containers and regular containers is the - **effective pod QoS tier**. - -So the following pod: - - pod: - spec: - initContainers: - - limits: - cpu: 100m - memory: 1GiB - - limits: - cpu: 50m - memory: 2GiB - containers: - - limits: - cpu: 10m - memory: 1100MiB - - limits: - cpu: 10m - memory: 1100MiB - -has an effective pod limit of `cpu: 100m`, `memory: 2200MiB` (highest init -container cpu is larger than sum of all app containers, sum of container -memory is larger than the max of all init containers). The scheduler, node, -and quota must respect the effective pod request/limit. - -In the absence of a defined request or limit on a container, the effective -request/limit will be applied. For example, the following pod: - - pod: - spec: - initContainers: - - limits: - cpu: 100m - memory: 1GiB - containers: - - request: - cpu: 10m - memory: 1100MiB - -will have an effective request of `10m / 1100MiB`, and an effective limit -of `100m / 1GiB`, i.e.: - - pod: - spec: - initContainers: - - request: - cpu: 10m - memory: 1GiB - - limits: - cpu: 100m - memory: 1100MiB - containers: - - request: - cpu: 10m - memory: 1GiB - - limits: - cpu: 100m - memory: 1100MiB - -and thus have the QoS tier **Burstable** (because request is not equal to -limit). - -Quota and limits will be applied based on the effective pod request and -limit. - -Pod level cGroups will be based on the effective pod request and limit, the -same as the scheduler. - - -### Kubelet and container runtime details - -Container runtimes should treat the set of init and app containers as one -large pool. An individual init container execution should be identical to -an app container, including all standard container environment setup -(network, namespaces, hostnames, DNS, etc). - -All app container operations are permitted on init containers. The -logs for an init container should be available for the duration of the pod -lifetime or until the pod is restarted. - -During initialization, app container status should be shown with the reason -PodInitializing if any init containers are present. Each init container -should show appropriate container status, and all init containers that are -waiting for earlier init containers to finish should have the `reason` -PendingInitialization. - -The container runtime should aggressively prune failed init containers. 
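Stepping back to the resource rules above for a moment, the sketch below shows how the effective request (or limit) for a single resource could be derived. It is a simplified illustration only: plain integers stand in for the real resource quantities, and `effectiveResource` is a hypothetical helper rather than the scheduler's actual code.

```go
// effectiveResource computes a pod's effective request or limit for one
// resource (e.g. CPU millicores or memory bytes): the largest value across
// init containers is compared with the sum across app containers, and the
// larger of the two wins.
func effectiveResource(initValues, appValues []int64) int64 {
	var maxInit, sumApp int64
	for _, v := range initValues {
		if v > maxInit {
			maxInit = v
		}
	}
	for _, v := range appValues {
		sumApp += v
	}
	if maxInit > sumApp {
		return maxInit
	}
	return sumApp
}
```

Applied to the first example above, this yields `max(100m, 10m+10m) = 100m` of CPU and `max(2GiB, 1100MiB+1100MiB) = 2200MiB` of memory, matching the effective pod limit stated there.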
-The container runtime should record whether all init containers have -succeeded internally, and only invoke new init containers if a pod -restart is needed (for Docker, if all containers terminate or if the pod -infra container terminates). Init containers should follow backoff rules -as necessary. The Kubelet *must* preserve at least the most recent instance -of an init container to serve logs and data for end users and to track -failure states. The Kubelet *should* prefer to garbage collect completed -init containers over app containers, as long as the Kubelet is able to -track that initialization has been completed. In the future, container -state checkpointing in the Kubelet may remove or reduce the need to -preserve old init containers. - -For the initial implementation, the Kubelet will use the last termination -container state of the highest indexed init container to determine whether -the pod has completed initialization. During a pod restart, initialization -will be restarted from the beginning (all initializers will be rerun). - - -### API Behavior - -All APIs that access containers by name should operate on both init and -app containers. Because names are unique the addition of the init container -should be transparent to use cases. - -A client with no knowledge of init containers should see appropriate -container status `reason` and `message` fields while the pod is in the -`Pending` phase, and so be able to communicate that to end users. - - -### Example init containers - -* Wait for a service to be created - - pod: - spec: - initContainers: - - name: wait - image: centos:centos7 - command: ["/bin/sh", "-c", "for i in {1..100}; do sleep 1; if dig myservice; then exit 0; fi; exit 1"] - containers: - - name: run - image: application-image - command: ["/my_application_that_depends_on_myservice"] - -* Register this pod with a remote server - - pod: - spec: - initContainers: - - name: register - image: centos:centos7 - command: ["/bin/sh", "-c", "curl -X POST http://$MANAGEMENT_SERVICE_HOST:$MANAGEMENT_SERVICE_PORT/register -d 'instance=$(POD_NAME)&ip=$(POD_IP)'"] - env: - - name: POD_NAME - valueFrom: - field: metadata.name - - name: POD_IP - valueFrom: - field: status.podIP - containers: - - name: run - image: application-image - command: ["/my_application_that_depends_on_myservice"] - -* Wait for an arbitrary period of time - - pod: - spec: - initContainers: - - name: wait - image: centos:centos7 - command: ["/bin/sh", "-c", "sleep 60"] - containers: - - name: run - image: application-image - command: ["/static_binary_without_sleep"] - -* Clone a git repository into a volume (can be implemented by volume containers in the future): - - pod: - spec: - initContainers: - - name: download - image: image-with-git - command: ["git", "clone", "https://github.com/myrepo/myrepo.git", "/var/lib/data"] - volumeMounts: - - mountPath: /var/lib/data - volumeName: git - containers: - - name: run - image: centos:centos7 - command: ["/var/lib/data/binary"] - volumeMounts: - - mountPath: /var/lib/data - volumeName: git - volumes: - - emptyDir: {} - name: git - -* Execute a template transformation based on environment (can be implemented by volume containers in the future): - - pod: - spec: - initContainers: - - name: copy - image: application-image - command: ["/bin/cp", "mytemplate.j2", "/var/lib/data/"] - volumeMounts: - - mountPath: /var/lib/data - volumeName: data - - name: transform - image: image-with-jinja - command: ["/bin/sh", "-c", "jinja /var/lib/data/mytemplate.j2 > 
/var/lib/data/mytemplate.conf"] - volumeMounts: - - mountPath: /var/lib/data - volumeName: data - containers: - - name: run - image: application-image - command: ["/myapplication", "-conf", "/var/lib/data/mytemplate.conf"] - volumeMounts: - - mountPath: /var/lib/data - volumeName: data - volumes: - - emptyDir: {} - name: data - -* Perform a container build - - pod: - spec: - initContainers: - - name: copy - image: base-image - workingDir: /home/user/source-tree - command: ["make"] - containers: - - name: commit - image: image-with-docker - command: - - /bin/sh - - -c - - docker commit $(complex_bash_to_get_container_id_of_copy) \ - docker push $(commit_id) myrepo:latest - volumesMounts: - - mountPath: /var/run/docker.sock - volumeName: dockersocket - -## Backwards compatibilty implications - -Since this is a net new feature in the API and Kubelet, new API servers during upgrade may not -be able to rely on Kubelets implementing init containers. The management of feature skew between -master and Kubelet is tracked in issue [#4855](https://github.com/kubernetes/kubernetes/issues/4855). - - -## Future work - -* Unify pod QoS class with init containers -* Implement container / image volumes to make composition of runtime from images efficient - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/container-init.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/container-init.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/container-init.md) diff --git a/docs/proposals/container-runtime-interface-v1.md b/docs/proposals/container-runtime-interface-v1.md index 36592727db3..26443a7deb7 100644 --- a/docs/proposals/container-runtime-interface-v1.md +++ b/docs/proposals/container-runtime-interface-v1.md @@ -1,267 +1 @@ -# Redefine Container Runtime Interface - -The umbrella issue: [#22964](https://issues.k8s.io/22964) - -## Motivation - -Kubelet employs a declarative pod-level interface, which acts as the sole -integration point for container runtimes (e.g., `docker` and `rkt`). The -high-level, declarative interface has caused higher integration and maintenance -cost, and also slowed down feature velocity for the following reasons. - 1. **Not every container runtime supports the concept of pods natively**. - When integrating with Kubernetes, a significant amount of work needs to - go into implementing a shim of significant size to support all pod - features. This also adds maintenance overhead (e.g., `docker`). - 2. **High-level interface discourages code sharing and reuse among runtimes**. - E.g, each runtime today implements an all-encompassing `SyncPod()` - function, with the Pod Spec as the input argument. The runtime implements - logic to determine how to achieve the desired state based on the current - status, (re-)starts pods/containers and manages lifecycle hooks - accordingly. - 3. **Pod Spec is evolving rapidly**. New features are being added constantly. - Any pod-level change or addition requires changing of all container - runtime shims. E.g., init containers and volume containers. - -## Goals and Non-Goals - -The goals of defining the interface are to - - **improve extensibility**: Easier container runtime integration. - - **improve feature velocity** - - **improve code maintainability** - -The non-goals include - - proposing *how* to integrate with new runtimes, i.e., where the shim - resides. 
The discussion of adopting a client-server architecture is tracked - by [#13768](https://issues.k8s.io/13768), where benefits and shortcomings of - such an architecture is discussed. - - versioning the new interface/API. We intend to provide API versioning to - offer stability for runtime integrations, but the details are beyond the - scope of this proposal. - - adding support to Windows containers. Windows container support is a - parallel effort and is tracked by [#22623](https://issues.k8s.io/22623). - The new interface will not be augmented to support Windows containers, but - it will be made extensible such that the support can be added in the future. - - re-defining Kubelet's internal interfaces. These interfaces, though, may - affect Kubelet's maintainability, is not relevant to runtime integration. - - improving Kubelet's efficiency or performance, e.g., adopting event stream - from the container runtime [#8756](https://issues.k8s.io/8756), - [#16831](https://issues.k8s.io/16831). - -## Requirements - - * Support the already integrated container runtime: `docker` and `rkt` - * Support hypervisor-based container runtimes: `hyper`. - -The existing pod-level interface will remain as it is in the near future to -ensure supports of all existing runtimes are continued. Meanwhile, we will -work with all parties involved to switching to the proposed interface. - - -## Container Runtime Interface - -The main idea of this proposal is to adopt an imperative container-level -interface, which allows Kubelet to directly control the lifecycles of the -containers. - -Pod is composed of a group of containers in an isolated environment with -resource constraints. In Kubernetes, pod is also the smallest schedulable unit. -After a pod has been scheduled to the node, Kubelet will create the environment -for the pod, and add/update/remove containers in that environment to meet the -Pod Spec. To distinguish between the environment and the pod as a whole, we -will call the pod environment **PodSandbox.** - -The container runtimes may interpret the PodSandBox concept differently based -on how it operates internally. For runtimes relying on hypervisor, sandbox -represents a virtual machine naturally. For others, it can be Linux namespaces. - -In short, a PodSandbox should have the following features. - - * **Isolation**: E.g., Linux namespaces or a full virtual machine, or even - support additional security features. - * **Compute resource specifications**: A PodSandbox should implement pod-level - resource demands and restrictions. - -*NOTE: The resource specification does not include externalized costs to -container setup that are not currently trackable as Pod constraints, e.g., -filesystem setup, container image pulling, etc.* - -A container in a PodSandbox maps to an application in the Pod Spec. For Linux -containers, they are expected to share at least network and IPC namespaces, -with sharing more namespaces discussed in [#1615](https://issues.k8s.io/1615). - - -Below is an example of the proposed interfaces. - -```go -// PodSandboxManager contains basic operations for sandbox. -type PodSandboxManager interface { - Create(config *PodSandboxConfig) (string, error) - Delete(id string) (string, error) - List(filter PodSandboxFilter) []PodSandboxListItem - Status(id string) PodSandboxStatus -} - -// ContainerRuntime contains basic operations for containers. 
-type ContainerRuntime interface { - Create(config *ContainerConfig, sandboxConfig *PodSandboxConfig, PodSandboxID string) (string, error) - Start(id string) error - Stop(id string, timeout int) error - Remove(id string) error - List(filter ContainerFilter) ([]ContainerListItem, error) - Status(id string) (ContainerStatus, error) - Exec(id string, cmd []string, streamOpts StreamOptions) error -} - -// ImageService contains image-related operations. -type ImageService interface { - List() ([]Image, error) - Pull(image ImageSpec, auth AuthConfig) error - Remove(image ImageSpec) error - Status(image ImageSpec) (Image, error) - Metrics(image ImageSpec) (ImageMetrics, error) -} - -type ContainerMetricsGetter interface { - ContainerMetrics(id string) (ContainerMetrics, error) -} - -All functions listed above are expected to be thread-safe. -``` - -### Pod/Container Lifecycle - -The PodSandbox’s lifecycle is decoupled from the containers, i.e., a sandbox -is created before any containers, and can exist after all containers in it have -terminated. - -Assume there is a pod with a single container C. To start a pod: - -``` - create sandbox Foo --> create container C --> start container C -``` - -To delete a pod: - -``` - stop container C --> remove container C --> delete sandbox Foo -``` - -The container runtime must not apply any transition (such as starting a new -container) unless explicitly instructed by Kubelet. It is Kubelet's -responsibility to enforce garbage collection, restart policy, and otherwise -react to changes in lifecycle. - -The only transitions that are possible for a container are described below: - -``` -() -> Created // A container can only transition to created from the - // empty, nonexistent state. The ContainerRuntime.Create - // method causes this transition. -Created -> Running // The ContainerRuntime.Start method may be applied to a - // Created container to move it to Running -Running -> Exited // The ContainerRuntime.Stop method may be applied to a running - // container to move it to Exited. - // A container may also make this transition under its own volition -Exited -> () // An exited container can be moved to the terminal empty - // state via a ContainerRuntime.Remove call. -``` - - -Kubelet is also responsible for gracefully terminating all the containers -in the sandbox before deleting the sandbox. If Kubelet chooses to delete -the sandbox with running containers in it, those containers should be forcibly -deleted. - -Note that every PodSandbox/container lifecycle operation (create, start, -stop, delete) should either return an error or block until the operation -succeeds. A successful operation should include a state transition of the -PodSandbox/container. E.g., if a `Create` call for a container does not -return an error, the container state should be "created" when the runtime is -queried. - -### Updates to PodSandbox or Containers - -Kubernetes support updates only to a very limited set of fields in the Pod -Spec. These updates may require containers to be re-created by Kubelet. This -can be achieved through the proposed, imperative container-level interface. -On the other hand, PodSandbox update currently is not required. - - -### Container Lifecycle Hooks - -Kubernetes supports post-start and pre-stop lifecycle hooks, with ongoing -discussion for supporting pre-start and post-stop hooks in -[#140](https://issues.k8s.io/140). - -These lifecycle hooks will be implemented by Kubelet via `Exec` calls to the -container runtime. 
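As a rough sketch of this approach - building on the `ContainerRuntime` interface proposed above, and assuming `StreamOptions` is a plain options struct - a Kubelet-driven post-start hook might look like the following. The `runPostStartHook` helper is purely illustrative, not part of the proposed interface.

```go
// runPostStartHook starts a container and then runs its post-start hook
// command inside it through the runtime's Exec call; the runtime itself
// needs no notion of lifecycle hooks.
func runPostStartHook(rt ContainerRuntime, containerID string, hookCmd []string) error {
	if err := rt.Start(containerID); err != nil {
		return err
	}
	if len(hookCmd) == 0 {
		return nil // no hook configured
	}
	// Exec runs inside the container's namespaces, so the hook command has
	// access to the container's filesystem.
	return rt.Exec(containerID, hookCmd, StreamOptions{})
}
```

A pre-stop hook would be handled symmetrically, with the `Exec` call issued before `Stop`.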
This frees the runtimes from having to support hooks -natively. - -Illustration of the container lifecycle and hooks: - -``` - pre-start post-start pre-stop post-stop - | | | | - exec exec exec exec - | | | | - create --------> start ----------------> stop --------> remove -``` - -In order for the lifecycle hooks to function as expected, the `Exec` call -will need access to the container's filesystem (e.g., mount namespaces). - -### Extensibility - -There are several dimensions for container runtime extensibility. - - Host OS (e.g., Linux) - - PodSandbox isolation mechanism (e.g., namespaces or VM) - - PodSandbox OS (e.g., Linux) - -As mentioned previously, this proposal will only address the Linux based -PodSandbox and containers. All Linux-specific configuration will be grouped -into one field. A container runtime is required to enforce all configuration -applicable to its platform, and should return an error otherwise. - -### Keep it minimal - -The proposed interface is experimental, i.e., it will go through (many) changes -until it stabilizes. The principle is to to keep the interface minimal and -extend it later if needed. This includes a several features that are still in -discussion and may be achieved alternatively: - - * `AttachContainer`: [#23335](https://issues.k8s.io/23335) - * `PortForward`: [#25113](https://issues.k8s.io/25113) - -## Alternatives - -**[Status quo] Declarative pod-level interface** - - Pros: No changes needed. - - Cons: All the issues stated in #motivation - -**Allow integration at both pod- and container-level interfaces** - - Pros: Flexibility. - - Cons: All the issues stated in #motivation - -**Imperative pod-level interface** -The interface contains only CreatePod(), StartPod(), StopPod() and RemovePod(). -This implies that the runtime needs to take over container lifecycle -management (i.e., enforce restart policy), lifecycle hooks, liveness checks, -etc. Kubelet will mainly be responsible for interfacing with the apiserver, and -can potentially become a very thin daemon. - - Pros: Lower maintenance overhead for the Kubernetes maintainers if `Docker` - shim maintenance cost is discounted. - - Cons: This will incur higher integration cost because every new container - runtime needs to implement all the features and need to understand the - concept of pods. This would also lead to lower feature velocity because the - interface will need to be changed, and the new pod-level feature will need - to be supported in each runtime. 
- -## Related Issues - - * Metrics: [#27097](https://issues.k8s.io/27097) - * Log management: [#24677](https://issues.k8s.io/24677) - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/container-runtime-interface-v1.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/container-runtime-interface-v1.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/container-runtime-interface-v1.md) diff --git a/docs/proposals/controller-ref.md b/docs/proposals/controller-ref.md index 09dfd68402c..90275e3a1c9 100644 --- a/docs/proposals/controller-ref.md +++ b/docs/proposals/controller-ref.md @@ -1,102 +1 @@ -# ControllerRef proposal - -Author: gmarek@ -Last edit: 2016-05-11 -Status: raw - -Approvers: -- [ ] briangrant -- [ ] dbsmith - -**Table of Contents** - -- [Goal of ControllerReference](#goal-of-setreference) -- [Non goals](#non-goals) -- [API and semantic changes](#api-and-semantic-changes) -- [Upgrade/downgrade procedure](#upgradedowngrade-procedure) -- [Orphaning/adoption](#orphaningadoption) -- [Implementation plan (sketch)](#implementation-plan-sketch) -- [Considered alternatives](#considered-alternatives) - -# Goal of ControllerReference - -Main goal of `ControllerReference` effort is to solve a problem of overlapping controllers that fight over some resources (e.g. `ReplicaSets` fighting with `ReplicationControllers` over `Pods`), which cause serious [problems](https://github.com/kubernetes/kubernetes/issues/24433) such as exploding memory of Controller Manager. - -We don’t want to have (just) an in-memory solution, as we don’t want a Controller Manager crash to cause massive changes in object ownership in the system. I.e. we need to persist the information about "owning controller". - -Secondary goal of this effort is to improve performance of various controllers and schedulers, by removing the need for expensive lookup for all matching "controllers". - -# Non goals - -Cascading deletion is not a goal of this effort. Cascading deletion will use `ownerReferences`, which is a [separate effort](garbage-collection.md). - -`ControllerRef` will extend `OwnerReference` and reuse machinery written for it (GarbageCollector, adoption/orphaning logic). - -# API and semantic changes - -There will be a new API field in the `OwnerReference` in which we will store an information if given owner is a managing controller: - -``` -OwnerReference { - … - Controller bool - … -} -``` - -From now on by `ControllerRef` we mean an `OwnerReference` with `Controller=true`. - -Most controllers (all that manage collections of things defined by label selector) will have slightly changed semantics: currently controller owns an object if its selector matches object’s labels and if it doesn't notice an older controller of the same kind that also matches the object's labels, but after introduction of `ControllerReference` a controller will own an object iff selector matches labels and the `OwnerReference` with `Controller=true`points to it. - -If the owner's selector or owned object's labels change, the owning controller will be responsible for orphaning (clearing `Controller` field in the `OwnerReference` and/or deleting `OwnerReference` altogether) objects, after which adoption procedure (setting `Controller` field in one of `OwnerReferencec` and/or adding new `OwnerReferences`) might occur, if another controller has a selector matching. 
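To illustrate these semantics, here is a minimal sketch of how a controller could look up the `ControllerRef` on an object. The `OwnerReference` struct below is reduced to the fields relevant to this proposal (the real type has more fields), and `getControllerRef` is a hypothetical helper.

```go
// OwnerReference is reduced here to the fields relevant to this proposal.
type OwnerReference struct {
	Kind       string
	Name       string
	Controller bool
}

// getControllerRef returns the first owner reference marked as the managing
// controller, or nil if the object currently has no controller.
func getControllerRef(owners []OwnerReference) *OwnerReference {
	for i := range owners {
		if owners[i].Controller {
			return &owners[i]
		}
	}
	return nil
}
```

A controller would then consider itself the owner only when this reference points back to it and its selector matches the object's labels, per the rule above.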
- -For debugging purposes we want to add an `adoptionTime` annotation prefixed with `kubernetes.io/` which will keep the time of the last controller ownership transfer.
-
-# Upgrade/downgrade procedure
-
-Because `ControllerRef` will be part of the `OwnerReference` effort, it will have the same upgrade/downgrade procedures.
-
-# Orphaning/adoption
-
-Because `ControllerRef` will be part of the `OwnerReference` effort, it will have the same orphaning/adoption procedures.
-
-Controllers will orphan objects they own in two cases:
-* A change of labels/selector that causes the selector to stop matching the labels (executed by the controller)
-* Deletion of a controller with `Orphaning=true` (executed by the GarbageCollector)
-
-We will need a secondary orphaning mechanism in case of unclean controller deletion:
-* The GarbageCollector will remove `ControllerRef`s that no longer point to existing controllers
-
-A controller will adopt (set the `Controller` field in the `OwnerReference` that points to it) an object whose labels match its selector iff:
-* there is no `OwnerReference` with `Controller` set to true in the `OwnerReferences` array,
-* `DeletionTimestamp` is not set,
-and
-* it is the first controller, among all controllers that have a matching label selector and no `DeletionTimestamp` set, to adopt the object.
-
-By design there are possible races during adoption if multiple controllers can own a given object.
-
-To prevent re-adoption of an object during deletion, the `DeletionTimestamp` will be set when deletion starts. When a controller has a non-nil `DeletionTimestamp` it won’t take any actions except updating its `Status` (in particular it won’t adopt any objects).
-
-# Implementation plan (sketch):
-
-* Add an API field for `Controller`,
-* Extend the `OwnerReference` adoption procedure to set the `Controller` field in one of the owners,
-* Update all affected controllers to respect `ControllerRef`.
-
-Necessary related work:
-* `OwnerReferences` are correctly added/deleted,
-* GarbageCollector removes dangling references,
-* Controllers don't take any meaningful actions when `DeletionTimestamp` is set.
-
-# Considered alternatives
-
-* Generic "ReferenceController": a centralized component that manages adoption/orphaning
-  * Dropped because: it is hard to write something that will work for all imaginable 3rd party objects; adding hooks to the framework makes it possible for users to write their own logic
-* Separate API field for `ControllerRef` in the ObjectMeta.
-  * Dropped because: nontrivial relationship between `ControllerRef` and `OwnerReferences` when it comes to deletion/adoption.
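As a footnote to the orphaning/adoption rules above, the adoption precondition could be expressed roughly as follows. This is a sketch building on the reduced `OwnerReference` type from the earlier snippet; selector matching and the adoption race between multiple eligible controllers are left to the caller.

```go
// canAdopt reports whether an object whose labels match a controller's
// selector is eligible for adoption by that controller: the object must not
// already have a managing controller, and neither the object nor the
// would-be owner may be marked for deletion.
func canAdopt(owners []OwnerReference, objectDeleting, controllerDeleting bool) bool {
	if objectDeleting || controllerDeleting {
		return false
	}
	for _, ref := range owners {
		if ref.Controller {
			return false
		}
	}
	return true
}
```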
- - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/controller-ref.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/controller-ref.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/controller-ref.md) diff --git a/docs/proposals/deploy.md b/docs/proposals/deploy.md index a27fb01ff80..72a5ace14a9 100644 --- a/docs/proposals/deploy.md +++ b/docs/proposals/deploy.md @@ -1,147 +1 @@ - - -- [Deploy through CLI](#deploy-through-cli) - - [Motivation](#motivation) - - [Requirements](#requirements) - - [Related `kubectl` Commands](#related-kubectl-commands) - - [`kubectl run`](#kubectl-run) - - [`kubectl scale` and `kubectl autoscale`](#kubectl-scale-and-kubectl-autoscale) - - [`kubectl rollout`](#kubectl-rollout) - - [`kubectl set`](#kubectl-set) - - [Mutating Operations](#mutating-operations) - - [Example](#example) - - [Support in Deployment](#support-in-deployment) - - [Deployment Status](#deployment-status) - - [Deployment Version](#deployment-version) - - [Pause Deployments](#pause-deployments) - - [Perm-failed Deployments](#perm-failed-deployments) - - - -# Deploy through CLI - -## Motivation - -Users can use [Deployments](../user-guide/deployments.md) or [`kubectl rolling-update`](../user-guide/kubectl/kubectl_rolling-update.md) to deploy in their Kubernetes clusters. A Deployment provides declarative update for Pods and ReplicationControllers, whereas `rolling-update` allows the users to update their earlier deployment without worrying about schemas and configurations. Users need a way that's similar to `rolling-update` to manage their Deployments more easily. - -`rolling-update` expects ReplicationController as the only resource type it deals with. It's not trivial to support exactly the same behavior with Deployment, which requires: -- Print out scaling up/down events. -- Stop the deployment if users press Ctrl-c. -- The controller should not make any more changes once the process ends. (Delete the deployment when status.replicas=status.updatedReplicas=spec.replicas) - -So, instead, this document proposes another way to support easier deployment management via Kubernetes CLI (`kubectl`). - -## Requirements - -The followings are operations we need to support for the users to easily managing deployments: - -- **Create**: To create deployments. -- **Rollback**: To restore to an earlier version of deployment. -- **Watch the status**: To watch for the status update of deployments. -- **Pause/resume**: To pause a deployment mid-way, and to resume it. (A use case is to support canary deployment.) -- **Version information**: To record and show version information that's meaningful to users. This can be useful for rollback. - -## Related `kubectl` Commands - -### `kubectl run` - -`kubectl run` should support the creation of Deployment (already implemented) and DaemonSet resources. - -### `kubectl scale` and `kubectl autoscale` - -Users may use `kubectl scale` or `kubectl autoscale` to scale up and down Deployments (both already implemented). - -### `kubectl rollout` - -`kubectl rollout` supports both Deployment and DaemonSet. It has the following subcommands: -- `kubectl rollout undo` works like rollback; it allows the users to rollback to a previous version of deployment. -- `kubectl rollout pause` allows the users to pause a deployment. See [pause deployments](#pause-deployments). 
-- `kubectl rollout resume` allows the users to resume a paused deployment. -- `kubectl rollout status` shows the status of a deployment. -- `kubectl rollout history` shows meaningful version information of all previous deployments. See [development version](#deployment-version). -- `kubectl rollout retry` retries a failed deployment. See [perm-failed deployments](#perm-failed-deployments). - -### `kubectl set` - -`kubectl set` has the following subcommands: -- `kubectl set env` allows the users to set environment variables of Kubernetes resources. It should support any object that contains a single, primary PodTemplate (such as Pod, ReplicationController, ReplicaSet, Deployment, and DaemonSet). -- `kubectl set image` allows the users to update multiple images of Kubernetes resources. Users will use `--container` and `--image` flags to update the image of a container. It should support anything that has a PodTemplate. - -`kubectl set` should be used for things that are common and commonly modified. Other possible future commands include: -- `kubectl set volume` -- `kubectl set limits` -- `kubectl set security` -- `kubectl set port` - -### Mutating Operations - -Other means of mutating Deployments and DaemonSets, including `kubectl apply`, `kubectl edit`, `kubectl replace`, `kubectl patch`, `kubectl label`, and `kubectl annotate`, may trigger rollouts if they modify the pod template. - -`kubectl create` and `kubectl delete`, for creating and deleting Deployments and DaemonSets, are also relevant. - -### Example - -With the commands introduced above, here's an example of deployment management: - -```console -# Create a Deployment -$ kubectl run nginx --image=nginx --replicas=2 --generator=deployment/v1beta1 - -# Watch the Deployment status -$ kubectl rollout status deployment/nginx - -# Update the Deployment -$ kubectl set image deployment/nginx --container=nginx --image=nginx: - -# Pause the Deployment -$ kubectl rollout pause deployment/nginx - -# Resume the Deployment -$ kubectl rollout resume deployment/nginx - -# Check the change history (deployment versions) -$ kubectl rollout history deployment/nginx - -# Rollback to a previous version. -$ kubectl rollout undo deployment/nginx --to-version= -``` - -## Support in Deployment - -### Deployment Status - -Deployment status should summarize information about Pods, which includes: -- The number of pods of each version. -- The number of ready/not ready pods. - -See issue [#17164](https://github.com/kubernetes/kubernetes/issues/17164). - -### Deployment Version - -We store previous deployment version information in annotations `rollout.kubectl.kubernetes.io/change-source` and `rollout.kubectl.kubernetes.io/version` of replication controllers of the deployment, to support rolling back changes as well as for the users to view previous changes with `kubectl rollout history`. -- `rollout.kubectl.kubernetes.io/change-source`, which is optional, records the kubectl command of the last mutation made to this rollout. Users may use `--record` in `kubectl` to record current command in this annotation. -- `rollout.kubectl.kubernetes.io/version` records a version number to distinguish the change sequence of a deployment's -replication controllers. A deployment obtains the largest version number from its replication controllers and increments the number by 1 upon update or creation of the deployment, and update the version annotation of its new replication controller. - -When the users perform a rollback, i.e. 
`kubectl rollout undo`, the deployment first looks at its existing replication controllers, regardless of their number of replicas. Then it finds the one with annotation `rollout.kubectl.kubernetes.io/version` that either contains the specified rollback version number or contains the second largest version number among all the replication controllers (current new replication controller should obtain the largest version number) if the user didn't specify any version number (the user wants to rollback to the last change). Lastly, it -starts scaling up that replication controller it's rolling back to, and scaling down the current ones, and then update the version counter and the rollout annotations accordingly. - -Note that a deployment's replication controllers use PodTemplate hashes (i.e. the hash of `.spec.template`) to distinguish from each others. When doing rollout or rollback, a deployment reuses existing replication controller if it has the same PodTemplate, and its `rollout.kubectl.kubernetes.io/change-source` and `rollout.kubectl.kubernetes.io/version` annotations will be updated by the new rollout. At this point, the earlier state of this replication controller is lost in history. For example, if we had 3 replication controllers in -deployment history, and then we do a rollout with the same PodTemplate as version 1, then version 1 is lost and becomes version 4 after the rollout. - -To make deployment versions more meaningful and readable for the users, we can add more annotations in the future. For example, we can add the following flags to `kubectl` for the users to describe and record their current rollout: -- `--description`: adds `description` annotation to an object when it's created to describe the object. -- `--note`: adds `note` annotation to an object when it's updated to record the change. -- `--commit`: adds `commit` annotation to an object with the commit id. - -### Pause Deployments - -Users sometimes need to temporarily disable a deployment. See issue [#14516](https://github.com/kubernetes/kubernetes/issues/14516). - -### Perm-failed Deployments - -The deployment could be marked as "permanently failed" for a given spec hash so that the system won't continue thrashing on a doomed deployment. The users can retry a failed deployment with `kubectl rollout retry`. See issue [#14519](https://github.com/kubernetes/kubernetes/issues/14519). - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/deploy.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/deploy.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/deploy.md) diff --git a/docs/proposals/deployment.md b/docs/proposals/deployment.md index f12ffc9e0d0..c6339fc918c 100644 --- a/docs/proposals/deployment.md +++ b/docs/proposals/deployment.md @@ -1,229 +1 @@ -# Deployment - -## Abstract - -A proposal for implementing a new resource - Deployment - which will enable -declarative config updates for Pods and ReplicationControllers. - -Users will be able to create a Deployment, which will spin up -a ReplicationController to bring up the desired pods. -Users can also target the Deployment at existing ReplicationControllers, in -which case the new RC will replace the existing ones. The exact mechanics of -replacement depends on the DeploymentStrategy chosen by the user. -DeploymentStrategies are explained in detail in a later section. 
- -## Implementation - -### API Object - -The `Deployment` API object will have the following structure: - -```go -type Deployment struct { - TypeMeta - ObjectMeta - - // Specification of the desired behavior of the Deployment. - Spec DeploymentSpec - - // Most recently observed status of the Deployment. - Status DeploymentStatus -} - -type DeploymentSpec struct { - // Number of desired pods. This is a pointer to distinguish between explicit - // zero and not specified. Defaults to 1. - Replicas *int - - // Label selector for pods. Existing ReplicationControllers whose pods are - // selected by this will be scaled down. New ReplicationControllers will be - // created with this selector, with a unique label `pod-template-hash`. - // If Selector is empty, it is defaulted to the labels present on the Pod template. - Selector map[string]string - - // Describes the pods that will be created. - Template *PodTemplateSpec - - // The deployment strategy to use to replace existing pods with new ones. - Strategy DeploymentStrategy -} - -type DeploymentStrategy struct { - // Type of deployment. Can be "Recreate" or "RollingUpdate". - Type DeploymentStrategyType - - // TODO: Update this to follow our convention for oneOf, whatever we decide it - // to be. - // Rolling update config params. Present only if DeploymentStrategyType = - // RollingUpdate. - RollingUpdate *RollingUpdateDeploymentStrategy -} - -type DeploymentStrategyType string - -const ( - // Kill all existing pods before creating new ones. - RecreateDeploymentStrategyType DeploymentStrategyType = "Recreate" - - // Replace the old RCs by new one using rolling update i.e gradually scale down the old RCs and scale up the new one. - RollingUpdateDeploymentStrategyType DeploymentStrategyType = "RollingUpdate" -) - -// Spec to control the desired behavior of rolling update. -type RollingUpdateDeploymentStrategy struct { - // The maximum number of pods that can be unavailable during the update. - // Value can be an absolute number (ex: 5) or a percentage of total pods at the start of update (ex: 10%). - // Absolute number is calculated from percentage by rounding up. - // This can not be 0 if MaxSurge is 0. - // By default, a fixed value of 1 is used. - // Example: when this is set to 30%, the old RC can be scaled down by 30% - // immediately when the rolling update starts. Once new pods are ready, old RC - // can be scaled down further, followed by scaling up the new RC, ensuring - // that at least 70% of original number of pods are available at all times - // during the update. - MaxUnavailable IntOrString - - // The maximum number of pods that can be scheduled above the original number of - // pods. - // Value can be an absolute number (ex: 5) or a percentage of total pods at - // the start of the update (ex: 10%). This can not be 0 if MaxUnavailable is 0. - // Absolute number is calculated from percentage by rounding up. - // By default, a value of 1 is used. - // Example: when this is set to 30%, the new RC can be scaled up by 30% - // immediately when the rolling update starts. Once old pods have been killed, - // new RC can be scaled up further, ensuring that total number of pods running - // at any time during the update is atmost 130% of original pods. - MaxSurge IntOrString - - // Minimum number of seconds for which a newly created pod should be ready - // without any of its container crashing, for it to be considered available. 
- // Defaults to 0 (pod will be considered available as soon as it is ready) - MinReadySeconds int -} - -type DeploymentStatus struct { - // Total number of ready pods targeted by this deployment (this - // includes both the old and new pods). - Replicas int - - // Total number of new ready pods with the desired template spec. - UpdatedReplicas int -} - -``` - -### Controller - -#### Deployment Controller - -The DeploymentController will make Deployments happen. -It will watch Deployment objects in etcd. -For each pending deployment, it will: - -1. Find all RCs whose label selector is a superset of DeploymentSpec.Selector. - - For now, we will do this in the client - list all RCs and then filter the - ones we want. Eventually, we want to expose this in the API. -2. The new RC can have the same selector as the old RC and hence we add a unique - selector to all these RCs (and the corresponding label to their pods) to ensure - that they do not select the newly created pods (or old pods get selected by - new RC). - - The label key will be "pod-template-hash". - - The label value will be hash of the podTemplateSpec for that RC without - this label. This value will be unique for all RCs, since PodTemplateSpec should be unique. - - If the RCs and pods dont already have this label and selector: - - We will first add this to RC.PodTemplateSpec.Metadata.Labels for all RCs to - ensure that all new pods that they create will have this label. - - Then we will add this label to their existing pods and then add this as a selector - to that RC. -3. Find if there exists an RC for which value of "pod-template-hash" label - is same as hash of DeploymentSpec.PodTemplateSpec. If it exists already, then - this is the RC that will be ramped up. If there is no such RC, then we create - a new one using DeploymentSpec and then add a "pod-template-hash" label - to it. RCSpec.replicas = 0 for a newly created RC. -4. Scale up the new RC and scale down the olds ones as per the DeploymentStrategy. - - Raise an event if we detect an error, like new pods failing to come up. -5. Go back to step 1 unless the new RC has been ramped up to desired replicas - and the old RCs have been ramped down to 0. -6. Cleanup. - -DeploymentController is stateless so that it can recover in case it crashes during a deployment. - -### MinReadySeconds - -We will implement MinReadySeconds using the Ready condition in Pod. We will add -a LastTransitionTime to PodCondition and update kubelet to set Ready to false, -each time any container crashes. Kubelet will set Ready condition back to true once -all containers are ready. For containers without a readiness probe, we will -assume that they are ready as soon as they are up. -https://github.com/kubernetes/kubernetes/issues/11234 tracks updating kubelet -and https://github.com/kubernetes/kubernetes/issues/12615 tracks adding -LastTransitionTime to PodCondition. - -## Changing Deployment mid-way - -### Updating - -Users can update an ongoing deployment before it is completed. -In this case, the existing deployment will be stalled and the new one will -begin. -For ex: consider the following case: -- User creates a deployment to rolling-update 10 pods with image:v1 to - pods with image:v2. -- User then updates this deployment to create pods with image:v3, - when the image:v2 RC had been ramped up to 5 pods and the image:v1 RC - had been ramped down to 5 pods. -- When Deployment Controller observes the new deployment, it will create - a new RC for creating pods with image:v3. 
It will then start ramping up this - new RC to 10 pods and will ramp down both the existing RCs to 0. - -### Deleting - -Users can pause/cancel a deployment by deleting it before it is completed. -Recreating the same deployment will resume it. -For ex: consider the following case: -- User creates a deployment to rolling-update 10 pods with image:v1 to - pods with image:v2. -- User then deletes this deployment while the old and new RCs are at 5 replicas each. - User will end up with 2 RCs with 5 replicas each. -User can then create the same deployment again in which case, DeploymentController will -notice that the second RC exists already which it can ramp up while ramping down -the first one. - -### Rollback - -We want to allow the user to rollback a deployment. To rollback a -completed (or ongoing) deployment, user can create (or update) a deployment with -DeploymentSpec.PodTemplateSpec = oldRC.PodTemplateSpec. - -## Deployment Strategies - -DeploymentStrategy specifies how the new RC should replace existing RCs. -To begin with, we will support 2 types of deployment: -* Recreate: We kill all existing RCs and then bring up the new one. This results - in quick deployment but there is a downtime when old pods are down but - the new ones have not come up yet. -* Rolling update: We gradually scale down old RCs while scaling up the new one. - This results in a slower deployment, but there is no downtime. At all times - during the deployment, there are a few pods available (old or new). The number - of available pods and when is a pod considered "available" can be configured - using RollingUpdateDeploymentStrategy. - -In future, we want to support more deployment types. - -## Future - -Apart from the above, we want to add support for the following: -* Running the deployment process in a pod: In future, we can run the deployment process in a pod. Then users can define their own custom deployments and we can run it using the image name. -* More DeploymentStrategyTypes: https://github.com/openshift/origin/blob/master/examples/deployment/README.md#deployment-types lists most commonly used ones. -* Triggers: Deployment will have a trigger field to identify what triggered the deployment. Options are: Manual/UserTriggered, Autoscaler, NewImage. -* Automatic rollback on error: We want to support automatic rollback on error or timeout. - -## References - -- https://github.com/kubernetes/kubernetes/issues/1743 has most of the - discussion that resulted in this proposal. - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/deployment.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/deployment.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/deployment.md) diff --git a/docs/proposals/disk-accounting.md b/docs/proposals/disk-accounting.md index 1978235688d..0528ec00dbb 100755 --- a/docs/proposals/disk-accounting.md +++ b/docs/proposals/disk-accounting.md @@ -1,615 +1 @@ -**Author**: Vishnu Kannan - -**Last** **Updated**: 11/16/2015 - -**Status**: Pending Review - -This proposal is an attempt to come up with a means for accounting disk usage in Kubernetes clusters that are running docker as the container runtime. Some of the principles here might apply for other runtimes too. - -### Why is disk accounting necessary? - -As of kubernetes v1.1 clusters become unusable over time due to the local disk becoming full. 
The kubelets on the node attempt to perform garbage collection of old containers and images, but that doesn’t prevent running pods from using up all the available disk space. - -Kubernetes users have no insight into how the disk is being consumed. - -Large images and rapid logging can lead to temporary downtime on the nodes. The node has to free up disk space by deleting images and containers. During this cleanup, existing pods can fail and new pods cannot be started. The node will also transition into an `OutOfDisk` condition, preventing more pods from being scheduled to the node. - -Automated eviction of pods that are hogging the local disk is not possible since proper accounting isn’t available. - -Since local disk is a non-compressible resource, users need means to restrict usage of local disk by pods and containers. Proper disk accounting is a prerequisite. As of today, a misconfigured low QoS class pod can end up bringing down the entire cluster by taking up all the available disk space (misconfigured logging for example) - -### Goals - -1. Account for disk usage on the nodes. - -2. Compatibility with the most common docker storage backends - devicemapper, aufs and overlayfs - -3. Provide a roadmap for enabling disk as a schedulable resource in the future. - -4. Provide a plugin interface for extending support to non-default filesystems and storage drivers. - -### Non Goals - -1. Compatibility with all storage backends. The matrix is pretty large already and the priority is to get disk accounting to on most widely deployed platforms. - -2. Support for filesystems other than ext4 and xfs. - -### Introduction - -Disk accounting in Kubernetes cluster running with docker is complex because of the plethora of ways in which disk gets utilized by a container. - -Disk can be consumed for: - -1. Container images - -2. Container’s writable layer - -3. Container’s logs - when written to stdout/stderr and default logging backend in docker is used. - -4. Local volumes - hostPath, emptyDir, gitRepo, etc. - -As of Kubernetes v1.1, kubelet exposes disk usage for the entire node and the container’s writable layer for aufs docker storage driver. -This information is made available to end users via the heapster monitoring pipeline. - -#### Image layers - -Image layers are shared between containers (COW) and so accounting for images is complicated. - -Image layers will have to be accounted as system overhead. - -As of today, it is not possible to check if there is enough disk space available on the node before an image is pulled. - -#### Writable Layer - -Docker creates a writable layer for every container on the host. Depending on the storage driver, the location and the underlying filesystem of this layer will change. - -Any files that the container creates or updates (assuming there are no volumes) will be considered as writable layer usage. - -The underlying filesystem is whatever the docker storage directory resides on. It is ext4 by default on most distributions, and xfs on RHEL. - -#### Container logs - -Docker engine provides a pluggable logging interface. Kubernetes is currently using the default logging mode which is `local file`. In this mode, the docker daemon stores bytes written by containers to their stdout or stderr, to local disk. These log files are contained in a special directory that is managed by the docker daemon. These logs are exposed via `docker logs` interface which is then exposed via kubelet and apiserver APIs. 
Currently, there is a hard-requirement for persisting these log files on the disk. - -#### Local Volumes - -Volumes are slightly different from other local disk use cases. They are pod scoped. Their lifetime is tied to that of a pod. Due to this property accounting of volumes will also be at the pod level. - -As of now, the volume types that can use local disk directly are ‘HostPath’, ‘EmptyDir’, and ‘GitRepo’. Secretes and Downwards API volumes wrap these primitive volumes. -Everything else is a network based volume. - -‘HostPath’ volumes map in existing directories in the host filesystem into a pod. Kubernetes manages only the mapping. It does not manage the source on the host filesystem. - -In addition to this, the changes introduced by a pod on the source of a hostPath volume is not cleaned by kubernetes once the pod exits. Due to these limitations, we will have to account hostPath volumes to system overhead. We should explicitly discourage use of HostPath in read-write mode. - -`EmptyDir`, `GitRepo` and other local storage volumes map to a directory on the host root filesystem, that is managed by Kubernetes (kubelet). Their contents are erased as soon as the pod exits. Tracking and potentially restricting usage for volumes is possible. - -### Docker storage model - -Before we start exploring solutions, let’s get familiar with how docker handles storage for images, writable layer and logs. - -On all storage drivers, logs are stored under `/containers//` - -The default location of the docker root directory is `/var/lib/docker`. - -Volumes are handled by kubernetes. -*Caveat: Volumes specified as part of Docker images are not handled by Kubernetes currently.* - -Container images and writable layers are managed by docker and their location will change depending on the storage driver. Each image layer and writable layer is referred to by an ID. The image layers are read-only. Once saved, existing writable layers can be frozen. Saving feature is not of importance to kubernetes since it works only on immutable images. - -*Note: Image layer IDs can be obtained by running `docker history -q --no-trunc `* - -##### Aufs - -Image layers and writable layers are stored under `/var/lib/docker/aufs/diff/`. - -The writable layers ID is equivalent to that of the container ID. - -##### Devicemapper - -Each container and each image gets own block device. Since this driver works at the block level, it is not possible to access the layers directly without mounting them. Each container gets its own block device while running. - -##### Overlayfs - -Image layers and writable layers are stored under `/var/lib/docker/overlay/`. - -Identical files are hardlinked between images. - -The image layers contain all their data under a `root` subdirectory. - -Everything under `/var/lib/docker/overlay/` are files required for running the container, including its writable layer. - -### Improve disk accounting - -Disk accounting is dependent on the storage driver in docker. A common solution that works across all storage drivers isn't available. - -I’m listing a few possible solutions for disk accounting below along with their limitations. - -We need a plugin model for disk accounting. Some storage drivers in docker will require special plugins. - -#### Container Images - -As of today, the partition that is holding docker images is flagged by cadvisor, and it uses filesystem stats to identify the overall disk usage of that partition. - -Isolated usage of just image layers is available today using `docker history `. 
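As a concrete illustration, the unique set of layer IDs behind a group of images can be collected with the `docker history -q --no-trunc` invocation mentioned earlier, de-duplicating layers shared between images; per-layer sizes would then still have to come from the storage driver's layer directories (e.g. `/var/lib/docker/aufs/diff/` on aufs). The snippet below is a rough sketch under those assumptions, with the `diskaccounting` package name and `imageLayerIDs` helper being hypothetical.

```go
package diskaccounting

import (
	"os/exec"
	"strings"
)

// imageLayerIDs returns the set of unique layer IDs referenced by the given
// images, using `docker history -q --no-trunc`. Layers shared between images
// appear only once, which is exactly what makes per-image accounting hard.
func imageLayerIDs(images []string) (map[string]struct{}, error) {
	layers := map[string]struct{}{}
	for _, image := range images {
		out, err := exec.Command("docker", "history", "-q", "--no-trunc", image).Output()
		if err != nil {
			return nil, err
		}
		for _, id := range strings.Fields(string(out)) {
			// Skip the "<missing>" placeholders docker prints for layers it
			// has no local metadata for.
			if id != "<missing>" {
				layers[id] = struct{}{}
			}
		}
	}
	return layers, nil
}
```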
-But isolated usage isn't of much use because image layers are shared between containers and so it is not possible to charge a single pod for image disk usage. - -Continuing to use the entire partition availability for garbage collection purposes in kubelet, should not affect reliability. -We might garbage collect more often. -As long as we do not expose features that require persisting old containers, computing image layer usage wouldn’t be necessary. - -Main goals for images are -1. Capturing total image disk usage -2. Check if a new image will fit on disk. - -In case we choose to compute the size of image layers alone, the following are some of the ways to achieve that. - -*Note that some of the strategies mentioned below are applicable in general to other kinds of storage like volumes, etc.* - -##### Docker History - -It is possible to run `docker history` and then create a graph of all images and corresponding image layers. -This graph will let us figure out the disk usage of all the images. - -**Pros** -* Compatible across storage drivers. - -**Cons** -* Requires maintaining an internal representation of images. - -##### Enhance docker - -Docker handles the upload and download of image layers. It can embed enough information about each layer. If docker is enhanced to expose this information, we can statically identify space about to be occupied by read-only image layers, even before the image layers are downloaded. - -A new [docker feature](https://github.com/docker/docker/pull/16450) (docker pull --dry-run) is pending review, which outputs the disk space that will be consumed by new images. Once this feature lands, we can perform feasibility checks and reject pods that will consume more disk space that what is current availability on the node. - -Another option is to expose disk usage of all images together as a first-class feature. - -**Pros** - -* Works across all storage drivers since docker abstracts the storage drivers. - -* Less code to maintain in kubelet. - -**Cons** - -* Not available today. - -* Requires serialized image pulls. - -* Metadata files are not tracked. - -##### Overlayfs and Aufs - -####### `du` - -We can list all the image layer specific directories, excluding container directories, and run `du` on each of those directories. - -**Pros**: - -* This is the least-intrusive approach. - -* It will work off the box without requiring any additional configuration. - -**Cons**: - -* `du` can consume a lot of cpu and memory. There have been several issues reported against the kubelet in the past that were related to `du`. - -* It is time consuming. Cannot be run frequently. Requires special handling to constrain resource usage - setting lower nice value or running in a sub-container. - -* Can block container deletion by keeping file descriptors open. - - -####### Linux gid based Disk Quota - -[Disk quota](https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/ch-disk-quotas.html) feature provided by the linux kernel can be used to track the usage of image layers. Ideally, we need `project` support for disk quota, which lets us track usage of directory hierarchies using `project ids`. Unfortunately, that feature is only available for zfs filesystems. Since most of our distributions use `ext4` by default, we will have to use either `uid` or `gid` based quota tracking. - -Both `uids` and `gids` are meant for security. Overloading that concept for disk tracking is painful and ugly. But, that is what we have today. 
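Concretely, gid-based tracking relies on counters that the kernel keeps per group and that can be queried cheaply. A minimal sketch, using an arbitrary example gid of 9000 and zero limits so that only accounting is enabled (the Appendix has an end-to-end walkthrough):

```shell
# Enable quota accounting for gid 9000 on all quota-enabled filesystems,
# without enforcing any limits (all four limits set to 0).
setquota -g 9000 -a 0 0 0 0

# Report the blocks currently charged to gid 9000.
quota -g 9000 -v
```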
- -Kubelet needs to define a gid for tracking image layers and make that gid or group the owner of `/var/lib/docker/[aufs | overlayfs]` recursively. Once this is done, the quota sub-system in the kernel will report the blocks being consumed by the storage driver on the underlying partition. - -Since this number also includes the container’s writable layer, we will have to somehow subtract that usage from the overall usage of the storage driver directory. Luckily, we can use the same mechanism for tracking container’s writable layer. Once we apply a different `gid` to the container’s writable layer, which is located under `/var/lib/docker//diff/`, the quota subsystem will not include the container’s writable layer usage. - -Xfs on the other hand support project quota which lets us track disk usage of arbitrary directories using a project. Support for this feature in ext4 is being reviewed. So on xfs, we can use quota without having to clobber the writable layer's uid and gid. - -**Pros**: - -* Low overhead tracking provided by the kernel. - - -**Cons** - -* Requires updates to default ownership on docker’s internal storage driver directories. We will have to deal with storage driver implementation details in any approach that is not docker native. - -* Requires additional node configuration - quota subsystem needs to be setup on the node. This can either be automated or made a requirement for the node. - -* Kubelet needs to perform gid management. A range of gids have to allocated to the kubelet for the purposes of quota management. This range must not be used for any other purposes out of band. Not required if project quota is available. - -* Breaks `docker save` semantics. Since kubernetes assumes immutable images, this is not a blocker. To support quota in docker, we will need user-namespaces along with custom gid mapping for each container. This feature does not exist today. This is not an issue with project quota. - -*Note: Refer to the [Appendix](#appendix) section more real examples on using quota with docker.* - -**Project Quota** - -Project Quota support for ext4 is currently being reviewed upstream. If that feature lands in upstream sometime soon, project IDs will be used to disk tracking instead of uids and gids. - - -##### Devicemapper - -Devicemapper storage driver will setup two volumes, metadata and data, that will be used to store image layers and container writable layer. The volumes can be real devices or loopback. A Pool device is created which uses the underlying volume for real storage. - -A new thinly-provisioned volume, based on the pool, will be created for running container’s. - -The kernel tracks the usage of the pool device at the block device layer. The usage here includes image layers and container’s writable layers. - -Since the kubelet has to track the writable layer usage anyways, we can subtract the aggregated root filesystem usage from the overall pool device usage to get the image layer’s disk usage. - -Linux quota and `du` will not work with device mapper. - -A docker dry run option (mentioned above) is another possibility. - - -#### Container Writable Layer - -###### Overlayfs / Aufs - -Docker creates a separate directory for the container’s writable layer which is then overlayed on top of read-only image layers. - -Both the previously mentioned options of `du` and `Linux Quota` will work for this case as well. - -Kubelet can use `du` to track usage and enforce `limits` once disk becomes a schedulable resource. 
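A minimal sketch of such a measurement for the aufs driver, with the filesystem walk demoted to low CPU and idle I/O priority (the container ID is a placeholder, and the path differs per storage driver):

```shell
# Disk usage of one container's writable layer under aufs, in KiB,
# run at reduced CPU priority and idle I/O class to limit interference.
nice -n 19 ionice -c 3 du -sk /var/lib/docker/aufs/diff/CONTAINER_ID
```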
As mentioned earlier `du` is resource intensive. - -To use Disk quota, kubelet will have to allocate a separate gid per container. Kubelet can reuse the same gid for multiple instances of the same container (restart scenario). As and when kubelet garbage collects dead containers, the usage of the container will drop. - -If local disk becomes a schedulable resource, `linux quota` can be used to impose `request` and `limits` on the container writable layer. -`limits` can be enforced using hard limits. Enforcing `request` will be tricky. One option is to enforce `requests` only when the disk availability drops below a threshold (10%). Kubelet can at this point evict pods that are exceeding their requested space. Other options include using `soft limits` with grace periods, but this option is complex. - -###### Devicemapper - -FIXME: How to calculate writable layer usage with devicemapper? - -To enforce `limits` the volume created for the container’s writable layer filesystem can be dynamically [resized](https://jpetazzo.github.io/2014/01/29/docker-device-mapper-resize/), to not use more than `limit`. `request` will have to be enforced by the kubelet. - - -#### Container logs - -Container logs are not storage driver specific. We can use either `du` or `quota` to track log usage per container. Log files are stored under `/var/lib/docker/containers/`. - -In the case of quota, we can create a separate gid for tracking log usage. This will let users track log usage and writable layer’s usage individually. - -For the purposes of enforcing limits though, kubelet will use the sum of logs and writable layer. - -In the future, we can consider adding log rotation support for these log files either in kubelet or via docker. - - -#### Volumes - -The local disk based volumes map to a directory on the disk. We can use `du` or `quota` to track the usage of volumes. - -There exists a concept called `FsGroup` today in kubernetes, which lets users specify a gid for all volumes in a pod. If that is set, we can use the `FsGroup` gid for quota purposes. This requires `limits` for volumes to be a pod level resource though. - - -### Yet to be explored - -* Support for filesystems other than ext4 and xfs like `zfs` - -* Support for Btrfs - -It should be clear at this point that we need a plugin based model for disk accounting. Support for other filesystems both CoW and regular can be added as and when required. As we progress towards making accounting work on the above mentioned storage drivers, we can come up with an abstraction for storage plugins in general. - - -### Implementation Plan and Milestones - -#### Milestone 1 - Get accounting to just work! - -This milestone targets exposing the following categories of disk usage from the kubelet - infrastructure (images, sys daemons, etc), containers (log + writable layer) and volumes. - -* `du` works today. Use `du` for all the categories and ensure that it works on both on aufs and overlayfs. - -* Add device mapper support. - -* Define a storage driver based pluggable disk accounting interface in cadvisor. - -* Reuse that interface for accounting volumes in kubelet. - -* Define a disk manager module in kubelet that will serve as a source of disk usage information for the rest of the kubelet. - -* Ensure that the kubelet metrics APIs (/apis/metrics/v1beta1) exposes the disk usage information. Add an integration test. - - -#### Milestone 2 - node reliability - -Improve user experience by doing whatever is necessary to keep the node running. 
- -NOTE: [`Out of Resource Killing`](https://github.com/kubernetes/kubernetes/issues/17186) design is a prerequisite. - -* Disk manager will evict pods and containers based on QoS class whenever the disk availability is below a critical level. - -* Explore combining existing container and image garbage collection logic into disk manager. - -Ideally, this phase should be completed before v1.2. - - -#### Milestone 3 - Performance improvements - -In this milestone, we will add support for quota and make it opt-in. There should be no user visible changes in this phase. - -* Add gid allocation manager to kubelet - -* Reconcile gids allocated after restart. - -* Configure linux quota automatically on startup. Do not set any limits in this phase. - -* Allocate gids for pod volumes, container’s writable layer and logs, and also for image layers. - -* Update the docker runtime plugin in kubelet to perform the necessary `chown’s` and `chmod’s` between container creation and startup. - -* Pass the allocated gids as supplementary gids to containers. - -* Update disk manager in kubelet to use quota when configured. - - -#### Milestone 4 - Users manage local disks - -In this milestone, we will make local disk a schedulable resource. - -* Finalize volume accounting - is it at the pod level or per-volume. - -* Finalize multi-disk management policy. Will additional disks be handled as whole units? - -* Set aside some space for image layers and rest of the infra overhead - node allocable resources includes local disk. - -* `du` plugin triggers container or pod eviction whenever usage exceeds limit. - -* Quota plugin sets hard limits equal to user specified `limits`. - -* Devicemapper plugin resizes writable layer to not exceed the container’s disk `limit`. - -* Disk manager evicts pods based on `usage` - `request` delta instead of just QoS class. - -* Sufficient integration testing to this feature. - - -### Appendix - - -#### Implementation Notes - -The following is a rough outline of the testing I performed to corroborate by prior design ideas. - -Test setup information - -* Testing was performed on GCE virtual machines - -* All the test VMs were using ext4. - -* Distribution tested against is mentioned as part of each graph driver. - -##### AUFS testing notes: - -Tested on Debian jessie - -1. Setup Linux Quota following this [tutorial](https://www.google.com/url?q=https://www.howtoforge.com/tutorial/linux-quota-ubuntu-debian/&sa=D&ust=1446146816105000&usg=AFQjCNHThn4nwfj1YLoVmv5fJ6kqAQ9FlQ). - -2. Create a new group ‘x’ on the host and enable quota for that group - - 1. `groupadd -g 9000 x` - - 2. `setquota -g 9000 -a 0 100 0 100` // 100 blocks (4096 bytes each*) - - 3. `quota -g 9000 -v` // Check that quota is enabled - -3. Create a docker container - - 4. `docker create -it busybox /bin/sh -c "dd if=/dev/zero of=/file count=10 bs=1M"` - - 8d8c56dcfbf5cda9f9bfec7c6615577753292d9772ab455f581951d9a92d169d - -4. Change group on the writable layer directory for this container - - 5. `chmod a+s /var/lib/docker/aufs/diff/8d8c56dcfbf5cda9f9bfec7c6615577753292d9772ab455f581951d9a92d169d` - - 6. `chown :x /var/lib/docker/aufs/diff/8d8c56dcfbf5cda9f9bfec7c6615577753292d9772ab455f581951d9a92d169d` - -5. Start the docker container - - 7. `docker start 8d` - - 8. 
Check usage using quota and group ‘x’ - - ```shell - $ quota -g x -v - - Disk quotas for group x (gid 9000): - - Filesystem **blocks** quota limit grace files quota limit grace - - /dev/sda1 **10248** 0 0 3 0 0 - ``` - - Using the same workflow, we can add new sticky group IDs to emptyDir volumes and account for their usage against pods. - - Since each container requires a gid for the purposes of quota, we will have to reserve ranges of gids for use by the kubelet. Since kubelet does not checkpoint its state, recovery of group id allocations will be an interesting problem. More on this later. - -Track the space occupied by images after it has been pulled locally as follows. - -*Note: This approach requires serialized image pulls to be of any use to the kubelet.* - -1. Create a group specifically for the graph driver - - 1. `groupadd -g 9001 docker-images` - -2. Update group ownership on the ‘graph’ (tracks image metadata) and ‘storage driver’ directories. - - 2. `chown -R :9001 /var/lib/docker/[overlay | aufs]` - - 3. `chmod a+s /var/lib/docker/[overlay | aufs]` - - 4. `chown -R :9001 /var/lib/docker/graph` - - 5. `chmod a+s /var/lib/docker/graph` - -3. Any new images pulled or containers created will be accounted to the `docker-images` group by default. - -4. Once we update the group ownership on newly created containers to a different gid, the container writable layer’s specific disk usage gets dropped from this group. - -#### Overlayfs - -Tested on Ubuntu 15.10. - -Overlayfs works similar to Aufs. The path to the writable directory for container writable layer changes. - -* Setup Linux Quota following this [tutorial](https://www.google.com/url?q=https://www.howtoforge.com/tutorial/linux-quota-ubuntu-debian/&sa=D&ust=1446146816105000&usg=AFQjCNHThn4nwfj1YLoVmv5fJ6kqAQ9FlQ). - -* Create a new group ‘x’ on the host and enable quota for that group - - * `groupadd -g 9000 x` - - * `setquota -g 9000 -a 0 100 0 100` // 100 blocks (4096 bytes each*) - - * `quota -g 9000 -v` // Check that quota is enabled - -* Create a docker container - - * `docker create -it busybox /bin/sh -c "dd if=/dev/zero of=/file count=10 bs=1M"` - - * `b8cc9fae3851f9bcefe922952b7bca0eb33aa31e68e9203ce0639fc9d3f3c61` - -* Change group on the writable layer’s directory for this container - - * `chmod -R a+s /var/lib/docker/overlay/b8cc9fae3851f9bcefe922952b7bca0eb33aa31e68e9203ce0639fc9d3f3c61b/*` - - * `chown -R :9000 /var/lib/docker/overlay/b8cc9fae3851f9bcefe922952b7bca0eb33aa31e68e9203ce0639fc9d3f3c61b/*` - -* Check quota before and after running the container. - - ```shell - $ quota -g x -v - - Disk quotas for group x (gid 9000): - - Filesystem blocks quota limit grace files quota limit grace - - /dev/sda1 48 0 0 19 0 0 - ``` - - * Start the docker container - - * `docker start b8` - - * ```shell - quota -g x -v - - Disk quotas for group x (gid 9000): - - Filesystem **blocks** quota limit grace files quota limit grace - - /dev/sda1 **10288** 0 0 20 0 0 - - ``` - -##### Device mapper - -Usage of Linux Quota should be possible for the purposes of volumes and log files. - -Devicemapper storage driver in docker uses ["thin targets"](https://www.kernel.org/doc/Documentation/device-mapper/thin-provisioning.txt). Underneath there are two block devices devices - “data” and “metadata”, using which more block devices are created for containers. More information [here](http://www.projectatomic.io/docs/filesystems/). - -These devices can be loopback or real storage devices. - -The base device has a maximum storage capacity. 
This means that the sum total of storage space occupied by images and containers cannot exceed this capacity. - -By default, all images and containers are created from an initial filesystem with a 10GB limit. - -A separate filesystem is created for each container as part of start (not create). - -It is possible to [resize](https://jpetazzo.github.io/2014/01/29/docker-device-mapper-resize/) the container filesystem. - -For the purposes of image space tracking, we can - -####Testing notes: - -* ```shell -$ docker info - -... - -Storage Driver: devicemapper - - Pool Name: **docker-8:1-268480-pool** - - Pool Blocksize: 65.54 kB - - Backing Filesystem: extfs - - Data file: /dev/loop0 - - Metadata file: /dev/loop1 - - Data Space Used: 2.059 GB - - Data Space Total: 107.4 GB - - Data Space Available: 48.45 GB - - Metadata Space Used: 1.806 MB - - Metadata Space Total: 2.147 GB - - Metadata Space Available: 2.146 GB - - Udev Sync Supported: true - - Deferred Removal Enabled: false - - Data loop file: /var/lib/docker/devicemapper/devicemapper/data - - Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata - - Library Version: 1.02.99 (2015-06-20) -``` - -```shell -$ dmsetup table docker-8\:1-268480-pool - -0 209715200 thin-pool 7:1 7:0 **128** 32768 1 skip_block_zeroing -``` - -128 is the data block size - -Usage from kernel for the primary block device - -```shell -$ dmsetup status docker-8\:1-268480-pool - -0 209715200 thin-pool 37 441/524288 **31424/1638400** - rw discard_passdown queue_if_no_space - -``` - -Usage/Available - 31424/1638400 - -Usage in MB = 31424 * 512 * 128 (block size from above) bytes = 1964 MB - -Capacity in MB = 1638400 * 512 * 128 bytes = 100 GB - -#### Log file accounting - -* Setup Linux quota for a container as mentioned above. - -* Update group ownership on the following directories to that of the container group ID created for graphing. Adapting the examples above: - - * `chmod -R a+s /var/lib/docker/**containers**/b8cc9fae3851f9bcefe922952b7bca0eb33aa31e68e9203ce0639fc9d3f3c61b/*` - - * `chown -R :9000 /var/lib/docker/**container**/b8cc9fae3851f9bcefe922952b7bca0eb33aa31e68e9203ce0639fc9d3f3c61b/*` - -##### Testing titbits - -* Ubuntu 15.10 doesn’t ship with the quota module on virtual machines. [Install ‘linux-image-extra-virtual’](http://askubuntu.com/questions/109585/quota-format-not-supported-in-kernel) package to get quota to work. - -* Overlay storage driver needs kernels >= 3.18. I used Ubuntu 15.10 to test Overlayfs. - -* If you use a non-default location for docker storage, change `/var/lib/docker` in the examples to your storage location. 
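Finally, the devicemapper arithmetic shown earlier can be scripted directly from `dmsetup` output. A rough sketch, assuming the example pool name and 128-sector data block size from the trace above (field positions match that output and may differ on other setups):

```shell
POOL="docker-8:1-268480-pool"   # example pool name from `docker info` above
BLOCK_SECTORS=128               # data block size from `dmsetup table` above

# Field 6 of `dmsetup status` is "<used>/<total>" data blocks for the pool.
USED=$(dmsetup status "$POOL"  | awk '{print $6}' | cut -d/ -f1)
TOTAL=$(dmsetup status "$POOL" | awk '{print $6}' | cut -d/ -f2)

# Convert data blocks (BLOCK_SECTORS sectors of 512 bytes) to MiB.
echo "pool used:     $(( USED  * BLOCK_SECTORS * 512 / 1024 / 1024 )) MiB"
echo "pool capacity: $(( TOTAL * BLOCK_SECTORS * 512 / 1024 / 1024 )) MiB"
```

With the example numbers above this prints roughly 1964 MiB used out of a 102400 MiB (100 GiB) pool, matching the hand calculation.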
- - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/disk-accounting.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/disk-accounting.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/disk-accounting.md) diff --git a/docs/proposals/dramatically-simplify-cluster-creation.md b/docs/proposals/dramatically-simplify-cluster-creation.md index d5bc8a381c0..98f791ab461 100644 --- a/docs/proposals/dramatically-simplify-cluster-creation.md +++ b/docs/proposals/dramatically-simplify-cluster-creation.md @@ -1,266 +1 @@ -# Proposal: Dramatically Simplify Kubernetes Cluster Creation - -> ***Please note: this proposal doesn't reflect final implementation, it's here for the purpose of capturing the original ideas.*** -> ***You should probably [read `kubeadm` docs](http://kubernetes.io/docs/getting-started-guides/kubeadm/), to understand the end-result of this effor.*** - -Luke Marsden & many others in [SIG-cluster-lifecycle](https://github.com/kubernetes/community/tree/master/sig-cluster-lifecycle). - -17th August 2016 - -*This proposal aims to capture the latest consensus and plan of action of SIG-cluster-lifecycle. It should satisfy the first bullet point [required by the feature description](https://github.com/kubernetes/features/issues/11).* - -See also: [this presentation to community hangout on 4th August 2016](https://docs.google.com/presentation/d/17xrFxrTwqrK-MJk0f2XCjfUPagljG7togXHcC39p0sM/edit?ts=57a33e24#slide=id.g158d2ee41a_0_76) - -## Motivation - -Kubernetes is hard to install, and there are many different ways to do it today. None of them are excellent. We believe this is hindering adoption. - -## Goals - -Have one recommended, official, tested, "happy path" which will enable a majority of new and existing Kubernetes users to: - -* Kick the tires and easily turn up a new cluster on infrastructure of their choice - -* Get a reasonably secure, production-ready cluster, with reasonable defaults and a range of easily-installable add-ons - -We plan to do so by improving and simplifying Kubernetes itself, rather than building lots of tooling which "wraps" Kubernetes by poking all the bits into the right place. - -## Scope of project - -There are logically 3 steps to deploying a Kubernetes cluster: - -1. *Provisioning*: Getting some servers - these may be VMs on a developer's workstation, VMs in public clouds, or bare-metal servers in a user's data center. - -2. *Install & Discovery*: Installing the Kubernetes core components on those servers (kubelet, etc) - and bootstrapping the cluster to a state of basic liveness, including allowing each server in the cluster to discover other servers: for example teaching etcd servers about their peers, having TLS certificates provisioned, etc. - -3. *Add-ons*: Now that basic cluster functionality is working, installing add-ons such as DNS or a pod network (should be possible using kubectl apply). - -Notably, this project is *only* working on dramatically improving 2 and 3 from the perspective of users typing commands directly into root shells of servers. The reason for this is that there are a great many different ways of provisioning servers, and users will already have their own preferences. - -What's more, once we've radically improved the user experience of 2 and 3, it will make the job of tools that want to do all three much easier. 
- -## User stories - -### Phase I - -**_In time to be an alpha feature in Kubernetes 1.4._** - -Note: the current plan is to deliver `kubeadm` which implements these stories as "alpha" packages built from master (after the 1.4 feature freeze), but which are capable of installing a Kubernetes 1.4 cluster. - -* *Install*: As a potential Kubernetes user, I can deploy a Kubernetes 1.4 cluster on a handful of computers running Linux and Docker by typing two commands on each of those computers. The process is so simple that it becomes obvious to me how to easily automate it if I so wish. - -* *Pre-flight check*: If any of the computers don't have working dependencies installed (e.g. bad version of Docker, too-old Linux kernel), I am informed early on and given clear instructions on how to fix it so that I can keep trying until it works. - -* *Control*: Having provisioned a cluster, I can gain user credentials which allow me to remotely control it using kubectl. - -* *Install-addons*: I can select from a set of recommended add-ons to install directly after installing Kubernetes on my set of initial computers with kubectl apply. - -* *Add-node*: I can add another computer to the cluster. - -* *Secure*: As an attacker with (presumed) control of the network, I cannot add malicious nodes I control to the cluster created by the user. I also cannot remotely control the cluster. - -### Phase II - -**_In time for Kubernetes 1.5:_** -*Everything from Phase I as beta/stable feature, everything else below as beta feature in Kubernetes 1.5.* - -* *Upgrade*: Later, when Kubernetes 1.4.1 or any newer release is published, I can upgrade to it by typing one other command on each computer. - -* *HA*: If one of the computers in the cluster fails, the cluster carries on working. I can find out how to replace the failed computer, including if the computer was one of the masters. - -## Top-down view: UX for Phase I items - -We will introduce a new binary, kubeadm, which ships with the Kubernetes OS packages (and binary tarballs, for OSes without package managers). - -``` -laptop$ kubeadm --help -kubeadm: bootstrap a secure kubernetes cluster easily. - - /==========================================================\ - | KUBEADM IS ALPHA, DO NOT USE IT FOR PRODUCTION CLUSTERS! | - | | - | But, please try it out! Give us feedback at: | - | https://github.com/kubernetes/kubernetes/issues | - | and at-mention @kubernetes/sig-cluster-lifecycle | - \==========================================================/ - -Example usage: - - Create a two-machine cluster with one master (which controls the cluster), - and one node (where workloads, like pods and containers run). - - On the first machine - ==================== - master# kubeadm init master - Your token is: - - On the second machine - ===================== - node# kubeadm join node --token= - -Usage: - kubeadm [command] - -Available Commands: - init Run this on the first server you deploy onto. - join Run this on other servers to join an existing cluster. - user Get initial admin credentials for a cluster. - manual Advanced, less-automated functionality, for power users. - -Use "kubeadm [command] --help" for more information about a command. -``` - -### Install - -*On first machine:* - -``` -master# kubeadm init master -Initializing kubernetes master... [done] -Cluster token: 73R2SIPM739TNZOA -Run the following command on machines you want to become nodes: - kubeadm join node --token=73R2SIPM739TNZOA -You can now run kubectl here. 
-``` - -*On N "node" machines:* - -``` -node# kubeadm join node --token=73R2SIPM739TNZOA -Initializing kubernetes node... [done] -Bootstrapping certificates... [done] -Joined node to cluster, see 'kubectl get nodes' on master. -``` - -Note `[done]` would be colored green in all of the above. - -### Install: alternative for automated deploy - -*The user (or their config management system) creates a token and passes the same one to both init and join.* - -``` -master# kubeadm init master --token=73R2SIPM739TNZOA -Initializing kubernetes master... [done] -You can now run kubectl here. -``` - -### Pre-flight check - -``` -master# kubeadm init master -Error: socat not installed. Unable to proceed. -``` - -### Control - -*On master, after Install, kubectl is automatically able to talk to localhost:8080:* - -``` -master# kubectl get pods -[normal kubectl output] -``` - -*To mint new user credentials on the master:* - -``` -master# kubeadm user create -o kubeconfig-bob bob - -Waiting for cluster to become ready... [done] -Creating user certificate for user... [done] -Waiting for user certificate to be signed... [done] -Your cluster configuration file has been saved in kubeconfig. - -laptop# scp :/root/kubeconfig-bob ~/.kubeconfig -laptop# kubectl get pods -[normal kubectl output] -``` - -### Install-addons - -*Using CNI network as example:* - -``` -master# kubectl apply --purge -f \ - https://git.io/kubernetes-addons/.yaml -[normal kubectl apply output] -``` - -### Add-node - -*Same as Install – "on node machines".* - -### Secure - -``` -node# kubeadm join --token=GARBAGE node -Unable to join mesh network. Check your token. -``` - -## Work streams – critical path – must have in 1.4 before feature freeze - -1. [TLS bootstrapping](https://github.com/kubernetes/features/issues/43) - so that kubeadm can mint credentials for kubelets and users - - * Requires [#25764](https://github.com/kubernetes/kubernetes/pull/25764) and auto-signing [#30153](https://github.com/kubernetes/kubernetes/pull/30153) but does not require [#30094](https://github.com/kubernetes/kubernetes/pull/30094). - * @philips, @gtank & @yifan-gu - -1. Fix for [#30515](https://github.com/kubernetes/kubernetes/issues/30515) - so that kubeadm can install a kubeconfig which kubelet then picks up - - * @smarterclayton - -## Work streams – can land after 1.4 feature freeze - -1. [Debs](https://github.com/kubernetes/release/pull/35) and [RPMs](https://github.com/kubernetes/release/pull/50) (and binaries?) - so that kubernetes can be installed in the first place - - * @mikedanese & @dgoodwin - -1. [kubeadm implementation](https://github.com/lukemarsden/kubernetes/tree/kubeadm-scaffolding) - the kubeadm CLI itself, will get bundled into "alpha" kubeadm packages - - * @lukemarsden & @errordeveloper - -1. [Implementation of JWS server](https://github.com/jbeda/kubernetes/blob/discovery-api/docs/proposals/super-simple-discovery-api.md#method-jws-token) from [#30707](https://github.com/kubernetes/kubernetes/pull/30707) - so that we can implement the simple UX with no dependencies - - * @jbeda & @philips? - -1. Documentation - so that new users can see this in 1.4 (even if it’s caveated with alpha/experimental labels and flags all over it) - - * @lukemarsden - -1. `kubeadm` alpha packages - - * @lukemarsden, @mikedanese, @dgoodwin - -### Nice to have - -1. 
[Kubectl apply --purge](https://github.com/kubernetes/kubernetes/pull/29551) - so that addons can be maintained using k8s infrastructure - - * @lukemarsden & @errordeveloper - -## kubeadm implementation plan - -Based on [@philips' comment here](https://github.com/kubernetes/kubernetes/pull/30361#issuecomment-239588596). -The key point with this implementation plan is that it requires basically no changes to kubelet except [#30515](https://github.com/kubernetes/kubernetes/issues/30515). -It also doesn't require kubelet to do TLS bootstrapping - kubeadm handles that. - -### kubeadm init master - -1. User installs and configures kubelet to look for manifests in `/etc/kubernetes/manifests` -1. API server CA certs are generated by kubeadm -1. kubeadm generates pod manifests to launch API server and etcd -1. kubeadm pushes replica set for prototype jsw-server and the JWS into API server with host-networking so it is listening on the master node IP -1. kubeadm prints out the IP of JWS server and JWS token - -### kubeadm join node --token IP - -1. User installs and configures kubelet to have a kubeconfig at `/var/lib/kubelet/kubeconfig` but the kubelet is in a crash loop and is restarted by host init system -1. kubeadm talks to jws-server on IP with token and gets the cacert, then talks to the apiserver TLS bootstrap API to get client cert, etc and generates a kubelet kubeconfig -1. kubeadm places kubeconfig into `/var/lib/kubelet/kubeconfig` and waits for kubelet to restart -1. Mission accomplished, we think. - -## See also - -* [Joe Beda's "K8s the hard way easier"](https://docs.google.com/document/d/1lJ26LmCP-I_zMuqs6uloTgAnHPcuT7kOYtQ7XSgYLMA/edit#heading=h.ilgrv18sg5t) which combines Kelsey's "Kubernetes the hard way" with history of proposed UX at the end (scroll all the way down to the bottom). 
- - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/dramatically-simplify-cluster-creation.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/dramatically-simplify-cluster-creation.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/dramatically-simplify-cluster-creation.md) diff --git a/docs/proposals/external-lb-source-ip-preservation.md b/docs/proposals/external-lb-source-ip-preservation.md index e1450e641e7..f4d94cf91e8 100644 --- a/docs/proposals/external-lb-source-ip-preservation.md +++ b/docs/proposals/external-lb-source-ip-preservation.md @@ -1,238 +1 @@ - - -- [Overview](#overview) - - [Motivation](#motivation) -- [Alpha Design](#alpha-design) - - [Overview](#overview-1) - - [Traffic Steering using LB programming](#traffic-steering-using-lb-programming) - - [Traffic Steering using Health Checks](#traffic-steering-using-health-checks) - - [Choice of traffic steering approaches by individual Cloud Provider implementations](#choice-of-traffic-steering-approaches-by-individual-cloud-provider-implementations) - - [API Changes](#api-changes) - - [Local Endpoint Recognition Support](#local-endpoint-recognition-support) - - [Service Annotation to opt-in for new behaviour](#service-annotation-to-opt-in-for-new-behaviour) - - [NodePort allocation for HealthChecks](#nodeport-allocation-for-healthchecks) - - [Behavior Changes expected](#behavior-changes-expected) - - [External Traffic Blackholed on nodes with no local endpoints](#external-traffic-blackholed-on-nodes-with-no-local-endpoints) - - [Traffic Balancing Changes](#traffic-balancing-changes) - - [Cloud Provider support](#cloud-provider-support) - - [GCE 1.4](#gce-14) - - [GCE Expected Packet Source/Destination IP (Datapath)](#gce-expected-packet-sourcedestination-ip-datapath) - - [GCE Expected Packet Destination IP (HealthCheck path)](#gce-expected-packet-destination-ip-healthcheck-path) - - [AWS TBD](#aws-tbd) - - [Openstack TBD](#openstack-tbd) - - [Azure TBD](#azure-tbd) - - [Testing](#testing) -- [Beta Design](#beta-design) - - [API Changes from Alpha to Beta](#api-changes-from-alpha-to-beta) -- [Future work](#future-work) -- [Appendix](#appendix) - - - -# Overview - -Kubernetes provides an external loadbalancer service type which creates a virtual external ip -(in supported cloud provider environments) that can be used to load-balance traffic to -the pods matching the service pod-selector. - -## Motivation - -The current implementation requires that the cloud loadbalancer balances traffic across all -Kubernetes worker nodes, and this traffic is then equally distributed to all the backend -pods for that service. -Due to the DNAT required to redirect the traffic to its ultimate destination, the return -path for each session MUST traverse the same node again. To ensure this, the node also -performs a SNAT, replacing the source ip with its own. - -This causes the service endpoint to see the session as originating from a cluster local ip address. -*The original external source IP is lost* - -This is not a satisfactory solution - the original external source IP MUST be preserved for a -lot of applications and customer use-cases. - -# Alpha Design - -This section describes the proposed design for -[alpha-level](../../docs/devel/api_changes.md#alpha-beta-and-stable-versions) support, although -additional features are described in [future work](#future-work). 
- -## Overview - -The double hop must be prevented by programming the external load balancer to direct traffic -only to nodes that have local pods for the service. This can be accomplished in two ways, either -by API calls to add/delete nodes from the LB node pool or by adding health checking to the LB and -failing/passing health checks depending on the presence of local pods. - -## Traffic Steering using LB programming - -This approach requires that the Cloud LB be reprogrammed to be in sync with endpoint presence. -Whenever the first service endpoint is scheduled onto a node, the node is added to the LB pool. -Whenever the last service endpoint is unhealthy on a node, the node needs to be removed from the LB pool. - -This is a slow operation, on the order of 30-60 seconds, and involves the Cloud Provider API path. -If the API endpoint is temporarily unavailable, the datapath will be misprogrammed till the -reprogramming is successful and the API->datapath tables are updated by the cloud provider backend. - -## Traffic Steering using Health Checks - -This approach requires that all worker nodes in the cluster be programmed into the LB target pool. -To steer traffic only onto nodes that have endpoints for the service, we program the LB to perform -node healthchecks. The kube-proxy daemons running on each node will be responsible for responding -to these healthcheck requests (URL `/healthz`) from the cloud provider LB healthchecker. An additional nodePort -will be allocated for these health check for this purpose. -kube-proxy already watches for Service and Endpoint changes, it will maintain an in-memory lookup -table indicating the number of local endpoints for each service. -For a value of zero local endpoints, it responds with a health check failure (503 Service Unavailable), -and success (200 OK) for non-zero values. - -Healthchecks are programmable with a min period of 1 second on most cloud provider LBs, and min -failures to trigger node health state change can be configurable from 2 through 5. - -This will allow much faster transition times on the order of 1-5 seconds, and involve no -API calls to the cloud provider (and hence reduce the impact of API unreliability), keeping the -time window where traffic might get directed to nodes with no local endpoints to a minimum. - -## Choice of traffic steering approaches by individual Cloud Provider implementations - -The cloud provider package may choose either of these approaches. kube-proxy will provide these -healthcheck responder capabilities, regardless of the cloud provider configured on a cluster. - -## API Changes - -### Local Endpoint Recognition Support - -To allow kube-proxy to recognize if an endpoint is local requires that the EndpointAddress struct -should also contain the NodeName it resides on. This new string field will be read-only and -populated *only* by the Endpoints Controller. - -### Service Annotation to opt-in for new behaviour - -A new annotation `service.alpha.kubernetes.io/external-traffic` will be recognized -by the service controller only for services of Type LoadBalancer. Services that wish to opt-in to -the new LoadBalancer behaviour must annotate the Service to request the new ESIPP behavior. -Supported values for this annotation are OnlyLocal and Global. -- OnlyLocal activates the new logic (described in this proposal) and balances locally within a node. -- Global activates the old logic of balancing traffic across the entire cluster. 
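For illustration, an existing LoadBalancer Service could opt in from the command line (the service name is a placeholder; the annotation could equally be set in the Service manifest):

```shell
# Request node-local external traffic handling for this Service.
kubectl annotate service my-service \
    service.alpha.kubernetes.io/external-traffic=OnlyLocal

# Revert to the existing cluster-wide balancing behaviour.
kubectl annotate service my-service --overwrite \
    service.alpha.kubernetes.io/external-traffic=Global
```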
- -### NodePort allocation for HealthChecks - -An additional nodePort allocation will be necessary for services that are of type LoadBalancer and -have the new annotation specified. This additional nodePort is necessary for kube-proxy to listen for -healthcheck requests on all nodes. -This NodePort will be added as an annotation (`service.alpha.kubernetes.io/healthcheck-nodeport`) to -the Service after allocation (in the alpha release). The value of this annotation may also be -specified during the Create call and the allocator will reserve that specific nodePort. - - -## Behavior Changes expected - -### External Traffic Blackholed on nodes with no local endpoints - -When the last endpoint on the node has gone away and the LB has not marked the node as unhealthy, -worst-case window size = (N+1) * HCP, where N = minimum failed healthchecks and HCP = Health Check Period, -external traffic will still be steered to the node. This traffic will be blackholed and not forwarded -to other endpoints elsewhere in the cluster. - -Internal pod to pod traffic should behave as before, with equal probability across all pods. - -### Traffic Balancing Changes - -GCE/AWS load balancers do not provide weights for their target pools. This was not an issue with the old LB -kube-proxy rules which would correctly balance across all endpoints. - -With the new functionality, the external traffic will not be equally load balanced across pods, but rather -equally balanced at the node level (because GCE/AWS and other external LB implementations do not have the ability -for specifying the weight per node, they balance equally across all target nodes, disregarding the number of -pods on each node). - -We can, however, state that for NumServicePods << NumNodes or NumServicePods >> NumNodes, a fairly close-to-equal -distribution will be seen, even without weights. - -Once the external load balancers provide weights, this functionality can be added to the LB programming path. -*Future Work: No support for weights is provided for the 1.4 release, but may be added at a future date* - -## Cloud Provider support - -This feature is added as an opt-in annotation. -Default behaviour of LoadBalancer type services will be unchanged for all Cloud providers. -The annotation will be ignored by existing cloud provider libraries until they add support. - -### GCE 1.4 - -For the 1.4 release, this feature will be implemented for the GCE cloud provider. - -#### GCE Expected Packet Source/Destination IP (Datapath) - -- Node: On the node, we expect to see the real source IP of the client. Destination IP will be the Service Virtual External IP. - -- Pod: For processes running inside the Pod network namepsace, the source IP will be the real client source IP. The destination address will the be Pod IP. - -#### GCE Expected Packet Destination IP (HealthCheck path) - -kube-proxy listens on the health check node port for TCP health checks on :::. -This allow responding to health checks when the destination IP is either the VM IP or the Service Virtual External IP. -In practice, tcpdump traces on GCE show source IP is 169.254.169.254 and destination address is the Service Virtual External IP. - -### AWS TBD - -TBD *discuss timelines and feasibility with Kubernetes sig-aws team members* - -### Openstack TBD - -This functionality may not be introduced in Openstack in the near term. 
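A rough way to observe this from a node is to read the allocated healthcheck nodePort off the Service and probe it locally; a sketch, with the service name as a placeholder:

```shell
# Healthcheck nodePort allocated for the service (from its annotation).
HC_PORT=$(kubectl get service my-service -o \
    jsonpath='{.metadata.annotations.service\.alpha\.kubernetes\.io/healthcheck-nodeport}')

# Expect 200 OK if this node has local endpoints for the service,
# 503 Service Unavailable otherwise.
curl -i "http://localhost:${HC_PORT}/healthz"
```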
- -*Note from Openstack team member @anguslees* -Underlying vendor devices might be able to do this, but we only expose full-NAT/proxy loadbalancing through the OpenStack API (LBaaS v1/v2 and Octavia). So I'm afraid this will be unsupported on OpenStack, afaics. - -### Azure TBD - -*To be confirmed* For the 1.4 release, this feature will be implemented for the Azure cloud provider. - -## Testing - -The cases we should test are: - -1. Core Functionality Tests - -1.1 Source IP Preservation - -Test the main intent of this change, source ip preservation - use the all-in-one network tests container -with new functionality that responds with the client IP. Verify the container is seeing the external IP -of the test client. - -1.2 Health Check responses - -Testcases use pods explicitly pinned to nodes and delete/add to nodes randomly. Validate that healthchecks succeed -and fail on the expected nodes as endpoints move around. Gather LB response times (time from pod declares ready to -time for Cloud LB to declare node healthy and vice versa) to endpoint changes. - -2. Inter-Operability Tests - -Validate that internal cluster communications are still possible from nodes without local endpoints. This change -is only for externally sourced traffic. - -3. Backward Compatibility Tests - -Validate that old and new functionality can simultaneously exist in a single cluster. Create services with and without -the annotation, and validate datapath correctness. - -# Beta Design - -The only part of the design that changes for beta is the API, which is upgraded from -annotation-based to first class fields. - -## API Changes from Alpha to Beta - -Annotation `service.alpha.kubernetes.io/node-local-loadbalancer` will switch to a Service object field. - -# Future work - -Post-1.4 feature ideas. These are not fully-fleshed designs. - - - -# Appendix - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/external-lb-source-ip-preservation.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/external-lb-source-ip-preservation.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/external-lb-source-ip-preservation.md) diff --git a/docs/proposals/federated-api-servers.md b/docs/proposals/federated-api-servers.md index 1d2d5ba17b9..fe536802214 100644 --- a/docs/proposals/federated-api-servers.md +++ b/docs/proposals/federated-api-servers.md @@ -1,209 +1 @@ -# Federated API Servers - -## Abstract - -We want to divide the single monolithic API server into multiple federated -servers. Anyone should be able to write their own federated API server to expose APIs they want. -Cluster admins should be able to expose new APIs at runtime by bringing up new -federated servers. - -## Motivation - -* Extensibility: We want to allow community members to write their own API - servers to expose APIs they want. Cluster admins should be able to use these - servers without having to require any change in the core kubernetes - repository. -* Unblock new APIs from core kubernetes team review: A lot of new API proposals - are currently blocked on review from the core kubernetes team. By allowing - developers to expose their APIs as a separate server and enabling the cluster - admin to use it without any change to the core kubernetes repository, we - unblock these APIs. 
-* Place for staging experimental APIs: New APIs can remain in separate - federated servers until they become stable, at which point, they can be moved - to the core kubernetes master, if appropriate. -* Ensure that new APIs follow kubernetes conventions: Without the mechanism - proposed here, community members might be forced to roll their own thing which - may or may not follow kubernetes conventions. - -## Goal - -* Developers should be able to write their own API server and cluster admins - should be able to add them to their cluster, exposing new APIs at runtime. All - of this should not require any change to the core kubernetes API server. -* These new APIs should be seamless extension of the core kubernetes APIs (ex: - they should be operated upon via kubectl). - -## Non Goals - -The following are related but are not the goals of this specific proposal: -* Make it easy to write a kubernetes API server. - -## High Level Architecture - -There will be 2 new components in the cluster: -* A simple program to summarize discovery information from all the servers. -* A reverse proxy to proxy client requests to individual servers. - -The reverse proxy is optional. Clients can discover server URLs using the -summarized discovery information and contact them directly. Simple clients, can -always use the proxy. -The same program can provide both discovery summarization and reverse proxy. - -### Constraints - -* Unique API groups across servers: Each API server (and groups of servers, in HA) - should expose unique API groups. -* Follow API conventions: APIs exposed by every API server should adhere to [kubernetes API - conventions](../devel/api-conventions.md). -* Support discovery API: Each API server should support the kubernetes discovery API - (list the suported groupVersions at `/apis` and list the supported resources - at `/apis//`) -* No bootstrap problem: The core kubernetes server should not depend on any - other federated server to come up. Other servers can only depend on the core - kubernetes server. - -## Implementation Details - -### Summarizing discovery information - -We can have a very simple Go program to summarize discovery information from all -servers. Cluster admins will register each federated API server (its baseURL and swagger -spec path) with the proxy. The proxy will summarize the list of all group versions -exposed by all registered API servers with their individual URLs at `/apis`. - -### Reverse proxy - -We can use any standard reverse proxy server like nginx or extend the same Go program that -summarizes discovery information to act as reverse proxy for all federated servers. - -Cluster admins are also free to use any of the multiple open source API management tools -(for example, there is [Kong](https://getkong.org/), which is written in lua and there is -[Tyk](https://tyk.io/), which is written in Go). These API management tools -provide a lot more functionality like: rate-limiting, caching, logging, -transformations and authentication. -In future, we can also use ingress. That will give cluster admins the flexibility to -easily swap out the ingress controller by a Go reverse proxy, nginx, haproxy -or any other solution they might want. - -### Storage - -Each API server is responsible for storing their resources. They can have their -own etcd or can use kubernetes server's etcd using [third party -resources](../design/extending-api.md#adding-custom-resources-to-the-kubernetes-api-server). 
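For example, a client (or the summarizing program itself) can walk the discovery endpoints described above with plain HTTP. A sketch with placeholder server address, group, and version, and authentication omitted (`-k` simply skips TLS verification for the example):

```shell
APISERVER="https://my-apiserver:6443"   # placeholder address

# Group/version summary aggregated across all registered servers.
curl -k "${APISERVER}/apis"

# Resources served by one particular group/version.
curl -k "${APISERVER}/apis/mygroup/v1alpha1"
```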
- -### Health check - -Kubernetes server's `/api/v1/componentstatuses` will continue to report status -of master components that it depends on (scheduler and various controllers). -Since clients have access to server URLs, they can use that to do -health check of individual servers. -In future, if a global health check is required, we can expose a health check -endpoint in the proxy that will report the status of all federated api servers -in the cluster. - -### Auth - -Since the actual server which serves client's request can be opaque to the client, -all API servers need to have homogeneous authentication and authorisation mechanisms. -All API servers will handle authn and authz for their resources themselves. -In future, we can also have the proxy do the auth and then have apiservers trust -it (via client certs) to report the actual user in an X-something header. - -For now, we will trust system admins to configure homogeneous auth on all servers. -Future proposals will refine how auth is managed across the cluster. - -### kubectl - -kubectl will talk to the discovery endpoint (or proxy) and use the discovery API to -figure out the operations and resources supported in the cluster. -Today, it uses RESTMapper to determine that. We will update kubectl code to populate -RESTMapper using the discovery API so that we can add and remove resources -at runtime. -We will also need to make kubectl truly generic. Right now, a lot of operations -(like get, describe) are hardcoded in the binary for all resources. A future -proposal will provide details on moving those operations to server. - -Note that it is possible for kubectl to talk to individual servers directly in -which case proxy will not be required at all, but this requires a bit more logic -in kubectl. We can do this in future, if desired. - -### Handling global policies - -Now that we have resources spread across multiple API servers, we need to -be careful to ensure that global policies (limit ranges, resource quotas, etc) are enforced. -Future proposals will improve how this is done across the cluster. - -#### Namespaces - -When a namespaced resource is created in any of the federated server, that -server first needs to check with the kubernetes server that: - -* The namespace exists. -* User has authorization to create resources in that namespace. -* Resource quota for the namespace is not exceeded. - -To prevent race conditions, the kubernetes server might need to expose an atomic -API for all these operations. - -While deleting a namespace, kubernetes server needs to ensure that resources in -that namespace maintained by other servers are deleted as well. We can do this -using resource [finalizers](../design/namespaces.md#finalizers). Each server -will add themselves in the set of finalizers before they create a resource in -the corresponding namespace and delete all their resources in that namespace, -whenever it is to be deleted (kubernetes API server already has this code, we -will refactor it into a library to enable reuse). - -Future proposal will talk about this in more detail and provide a better -mechanism. - -#### Limit ranges and resource quotas - -kubernetes server maintains [resource quotas](../admin/resourcequota/README.md) and -[limit ranges](../admin/limitrange/README.md) for all resources. -Federated servers will need to check with the kubernetes server before creating any -resource. 
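For illustration only, the kinds of checks involved can be sketched with kubectl against the core server (the namespace name is a placeholder; a real federated server would issue the equivalent API calls, ideally atomically as noted above):

```shell
NS="my-namespace"   # placeholder

# Does the namespace exist, and which finalizers are currently set on it?
kubectl get namespace "$NS" -o jsonpath='{.spec.finalizers}'

# Current quota consumption and limit ranges in the namespace.
kubectl describe resourcequota --namespace="$NS"
kubectl describe limitrange    --namespace="$NS"
```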
- -## Running on hosted kubernetes cluster - -This proposal is not enough for hosted cluster users, but allows us to improve -that in the future. -On a hosted kubernetes cluster, for e.g. on GKE - where Google manages the kubernetes -API server, users will have to bring up and maintain the proxy and federated servers -themselves. -Other system components like the various controllers, will not be aware of the -proxy and will only talk to the kubernetes API server. - -One possible solution to fix this is to update kubernetes API server to detect when -there are federated servers in the cluster and then change its advertise address to -the IP address of the proxy. -Future proposal will talk about this in more detail. - -## Alternatives - -There were other alternatives that we had discussed. - -* Instead of adding a proxy in front, let the core kubernetes server provide an - API for other servers to register themselves. It can also provide a discovery - API which the clients can use to discover other servers and then talk to them - directly. But this would have required another server API a lot of client logic as well. -* Validating federated servers: We can validate new servers when they are registered - with the proxy, or keep validating them at regular intervals, or validate - them only when explicitly requested, or not validate at all. - We decided that the proxy will just assume that all the servers are valid - (conform to our api conventions). In future, we can provide conformance tests. - -## Future Work - -* Validate servers: We should have some conformance tests that validate that the - servers follow kubernetes api-conventions. -* Provide centralised auth service: It is very hard to ensure homogeneous auth - across multiple federated servers, especially in case of hosted clusters - (where different people control the different servers). We can fix it by - providing a centralised authentication and authorization service which all of - the servers can use. - - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/federated-api-servers.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federated-api-servers.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federated-api-servers.md) diff --git a/docs/proposals/federated-ingress.md b/docs/proposals/federated-ingress.md index 05d36e1d9b1..28ae7fbfd6f 100644 --- a/docs/proposals/federated-ingress.md +++ b/docs/proposals/federated-ingress.md @@ -1,223 +1 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - -# Kubernetes Federated Ingress - - Requirements and High Level Design - - Quinton Hoole - - July 17, 2016 - -## Overview/Summary - -[Kubernetes Ingress](https://github.com/kubernetes/kubernetes.github.io/blob/master/docs/user-guide/ingress.md) -provides an abstraction for sophisticated L7 load balancing through a -single IP address (and DNS name) across multiple pods in a single -Kubernetes cluster. Multiple alternative underlying implementations -are provided, including one based on GCE L7 load balancing and another -using an in-cluster nginx/HAProxy deployment (for non-GCE -environments). An AWS implementation, based on Elastic Load Balancers -and Route53 is under way by the community. - -To extend the above to cover multiple clusters, Kubernetes Federated -Ingress aims to provide a similar/identical API abstraction and, -again, multiple implementations to cover various -cloud-provider-specific as well as multi-cloud scenarios. The general -model is to allow the user to instantiate a single Ingress object via -the Federation API, and have it automatically provision all of the -necessary underlying resources (L7 cloud load balancers, in-cluster -proxies etc) to provide L7 load balancing across a service spanning -multiple clusters. - -Four options are outlined: - -1. GCP only -1. AWS only -1. Cross-cloud via GCP in-cluster proxies (i.e. clients get to AWS and on-prem via GCP). -1. Cross-cloud via AWS in-cluster proxies (i.e. clients get to GCP and on-prem via AWS). - -Option 1 is the: - -1. easiest/quickest, -1. most featureful - -Recommendations: - -+ Suggest tackling option 1 (GCP only) first (target beta in v1.4) -+ Thereafter option 3 (cross-cloud via GCP) -+ We should encourage/facilitate the community to tackle option 2 (AWS-only) - -## Options - -## Google Cloud Platform only - backed by GCE L7 Load Balancers - -This is an option for federations across clusters which all run on Google Cloud Platform (i.e. GCE and/or GKE) - -### Features - -In summary, all of [GCE L7 Load Balancer](https://cloud.google.com/compute/docs/load-balancing/http/) features: - -1. Single global virtual (a.k.a. "anycast") IP address ("VIP" - no dependence on dynamic DNS) -1. Geo-locality for both external and GCP-internal clients -1. Load-based overflow to next-closest geo-locality (i.e. cluster). Based on either queries per second, or CPU load (unfortunately on the first-hop target VM, not the final destination K8s Service). -1. URL-based request direction (different backend services can fulfill each different URL). -1. HTTPS request termination (at the GCE load balancer, with server SSL certs) - -### Implementation - -1. Federation user creates (federated) Ingress object (the services - backing the ingress object must share the same nodePort, as they - share a single GCP health check). -1. Federated Ingress Controller creates Ingress object in each cluster - in the federation (after [configuring each cluster ingress - controller to share the same ingress UID](https://gist.github.com/bprashanth/52648b2a0b6a5b637f843e7efb2abc97)). -1. 
Each cluster-level Ingress Controller ("GLBC") creates Google L7 - Load Balancer machinery (forwarding rules, target proxy, URL map, - backend service, health check) which ensures that traffic to the - Ingress (backed by a Service), is directed to the nodes in the cluster. -1. KubeProxy redirects to one of the backend Pods (currently round-robin, per KubeProxy instance) - -An alternative implementation approach involves lifting the current -Federated Ingress Controller functionality up into the Federation -control plane. This alternative is not considered any any further -detail in this document. - -### Outstanding work Items - -1. This should in theory all work out of the box. Need to confirm -with a manual setup. ([#29341](https://github.com/kubernetes/kubernetes/issues/29341)) -1. Implement Federated Ingress: - 1. API machinery (~1 day) - 1. Controller (~3 weeks) -1. Add DNS field to Ingress object (currently missing, but needs to be added, independent of federation) - 1. API machinery (~1 day) - 1. KubeDNS support (~ 1 week?) - -### Pros - -1. Global VIP is awesome - geo-locality, load-based overflow (but see caveats below) -1. Leverages existing K8s Ingress machinery - not too much to add. -1. Leverages existing Federated Service machinery - controller looks - almost identical, DNS provider also re-used. - -### Cons - -1. Only works across GCP clusters (but see below for a light at the end of the tunnel, for future versions). - -## Amazon Web Services only - backed by Route53 - -This is an option for AWS-only federations. Parts of this are -apparently work in progress, see e.g. -[AWS Ingress controller](https://github.com/kubernetes/contrib/issues/346) -[[WIP/RFC] Simple ingress -> DNS controller, using AWS -Route53](https://github.com/kubernetes/contrib/pull/841). - -### Features - -In summary, most of the features of [AWS Elastic Load Balancing](https://aws.amazon.com/elasticloadbalancing/) and [Route53 DNS](https://aws.amazon.com/route53/). - -1. Geo-aware DNS direction to closest regional elastic load balancer -1. DNS health checks to route traffic to only healthy elastic load -balancers -1. A variety of possible DNS routing types, including Latency Based Routing, Geo DNS, and Weighted Round Robin -1. Elastic Load Balancing automatically routes traffic across multiple - instances and multiple Availability Zones within the same region. -1. Health checks ensure that only healthy Amazon EC2 instances receive traffic. - -### Implementation - -1. Federation user creates (federated) Ingress object -1. Federated Ingress Controller creates Ingress object in each cluster in the federation -1. Each cluster-level AWS Ingress Controller creates/updates - 1. (regional) AWS Elastic Load Balancer machinery which ensures that traffic to the Ingress (backed by a Service), is directed to one of the nodes in one of the clusters in the region. - 1. (global) AWS Route53 DNS machinery which ensures that clients are directed to the closest non-overloaded (regional) elastic load balancer. -1. KubeProxy redirects to one of the backend Pods (currently round-robin, per KubeProxy instance) in the destination K8s cluster. - -### Outstanding Work Items - -Most of this remains is currently unimplemented ([AWS Ingress controller](https://github.com/kubernetes/contrib/issues/346) -[[WIP/RFC] Simple ingress -> DNS controller, using AWS -Route53](https://github.com/kubernetes/contrib/pull/841). - -1. K8s AWS Ingress Controller -1. 
Re-uses all of the non-GCE specific Federation machinery discussed above under "GCP-only...". - -### Pros - -1. Geo-locality (via geo-DNS, not VIP) -1. Load-based overflow -1. Real load balancing (same caveats as for GCP above). -1. L7 SSL connection termination. -1. Seems it can be made to work for hybrid with on-premise (using VPC). More research required. - -### Cons - -1. K8s Ingress Controller still needs to be developed. Lots of work. -1. geo-DNS based locality/failover is not as nice as VIP-based (but very useful, nonetheless) -1. Only works on AWS (initial version, at least). - -## Cross-cloud via GCP - -### Summary - -Use GCP Federated Ingress machinery described above, augmented with additional HA-proxy backends in all GCP clusters to proxy to non-GCP clusters (via either Service External IP's, or VPN directly to KubeProxy or Pods). - -### Features - -As per GCP-only above, except that geo-locality would be to the closest GCP cluster (and possibly onwards to the closest AWS/on-prem cluster). - -### Implementation - -TBD - see Summary above in the mean time. - -### Outstanding Work - -Assuming that GCP-only (see above) is complete: - -1. Wire-up the HA-proxy load balancers to redirect to non-GCP clusters -1. Probably some more - additional detailed research and design necessary. - -### Pros - -1. Works for cross-cloud. - -### Cons - -1. Traffic to non-GCP clusters proxies through GCP clusters. Additional bandwidth costs (3x?) in those cases. - -## Cross-cloud via AWS - -In theory the same approach as "Cross-cloud via GCP" above could be used, except that AWS infrastructure would be used to get traffic first to an AWS cluster, and then proxied onwards to non-AWS and/or on-prem clusters. -Detail docs TBD. - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/federated-ingress.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federated-ingress.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federated-ingress.md) diff --git a/docs/proposals/federation-lite.md b/docs/proposals/federation-lite.md index 549f98df476..5a3cdf37280 100644 --- a/docs/proposals/federation-lite.md +++ b/docs/proposals/federation-lite.md @@ -1,201 +1 @@ -# Kubernetes Multi-AZ Clusters - -## (previously nicknamed "Ubernetes-Lite") - -## Introduction - -Full Cluster Federation will offer sophisticated federation between multiple kubernetes -clusters, offering true high-availability, multiple provider support & -cloud-bursting, multiple region support etc. However, many users have -expressed a desire for a "reasonably" high-available cluster, that runs in -multiple zones on GCE or availability zones in AWS, and can tolerate the failure -of a single zone without the complexity of running multiple clusters. - -Multi-AZ Clusters aim to deliver exactly that functionality: to run a single -Kubernetes cluster in multiple zones. It will attempt to make reasonable -scheduling decisions, in particular so that a replication controller's pods are -spread across zones, and it will try to be aware of constraints - for example -that a volume cannot be mounted on a node in a different zone. - -Multi-AZ Clusters are deliberately limited in scope; for many advanced functions -the answer will be "use full Cluster Federation". For example, multiple-region -support is not in scope. Routing affinity (e.g. 
so that a webserver will -prefer to talk to a backend service in the same zone) is similarly not in -scope. - -## Design - -These are the main requirements: - -1. kube-up must allow bringing up a cluster that spans multiple zones. -1. pods in a replication controller should attempt to spread across zones. -1. pods which require volumes should not be scheduled onto nodes in a different zone. -1. load-balanced services should work reasonably - -### kube-up support - -kube-up support for multiple zones will initially be considered -advanced/experimental functionality, so the interface is not initially going to -be particularly user-friendly. As we design the evolution of kube-up, we will -make multiple zones better supported. - -For the initial implementation, kube-up must be run multiple times, once for -each zone. The first kube-up will take place as normal, but then for each -additional zone the user must run kube-up again, specifying -`KUBE_USE_EXISTING_MASTER=true` and `KUBE_SUBNET_CIDR=172.20.x.0/24`. This will then -create additional nodes in a different zone, but will register them with the -existing master. - -### Zone spreading - -This will be implemented by modifying the existing scheduler priority function -`SelectorSpread`. Currently this priority function aims to put pods in an RC -on different hosts, but it will be extended first to spread across zones, and -then to spread across hosts. - -So that the scheduler does not need to call out to the cloud provider on every -scheduling decision, we must somehow record the zone information for each node. -The implementation of this will be described in the implementation section. - -Note that zone spreading is 'best effort'; zones are just be one of the factors -in making scheduling decisions, and thus it is not guaranteed that pods will -spread evenly across zones. However, this is likely desirable: if a zone is -overloaded or failing, we still want to schedule the requested number of pods. - -### Volume affinity - -Most cloud providers (at least GCE and AWS) cannot attach their persistent -volumes across zones. Thus when a pod is being scheduled, if there is a volume -attached, that will dictate the zone. This will be implemented using a new -scheduler predicate (a hard constraint): `VolumeZonePredicate`. - -When `VolumeZonePredicate` observes a pod scheduling request that includes a -volume, if that volume is zone-specific, `VolumeZonePredicate` will exclude any -nodes not in that zone. - -Again, to avoid the scheduler calling out to the cloud provider, this will rely -on information attached to the volumes. This means that this will only support -PersistentVolumeClaims, because direct mounts do not have a place to attach -zone information. PersistentVolumes will then include zone information where -volumes are zone-specific. - -### Load-balanced services should operate reasonably - -For both AWS & GCE, Kubernetes creates a native cloud load-balancer for each -service of type LoadBalancer. The native cloud load-balancers on both AWS & -GCE are region-level, and support load-balancing across instances in multiple -zones (in the same region). For both clouds, the behaviour of the native cloud -load-balancer is reasonable in the face of failures (indeed, this is why clouds -provide load-balancing as a primitve). - -For multi-AZ clusters we will therefore simply rely on the native cloud provider -load balancer behaviour, and we do not anticipate substantial code changes. 
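To make the zone-aware scheduling described above a little more concrete, the sketch below shows the core check a predicate such as `VolumeZonePredicate` might perform: compare the zone recorded on the node against the zone recorded on each zone-scoped volume the pod uses, and reject the node on any mismatch. This is a minimal illustration only, not the actual scheduler code; the label key is the reserved `failure-domain.alpha.kubernetes.io/zone` label discussed under "Attaching zone information" below, and the function name and map-based inputs are simplified stand-ins for the real scheduler plumbing.

```go
package main

import "fmt"

// zoneLabel is the reserved label proposed later in this document for
// recording the failure domain (zone) of a node or volume.
const zoneLabel = "failure-domain.alpha.kubernetes.io/zone"

// fitsZone is a simplified stand-in for the VolumeZonePredicate check:
// a node is only a candidate if every zone-scoped volume used by the pod
// lives in the same zone as the node.
func fitsZone(nodeLabels map[string]string, volumeLabels []map[string]string) bool {
	nodeZone, ok := nodeLabels[zoneLabel]
	if !ok {
		// An unlabelled node cannot be shown to satisfy a zone-scoped volume.
		return false
	}
	for _, vl := range volumeLabels {
		if zone, ok := vl[zoneLabel]; ok && zone != nodeZone {
			return false
		}
	}
	return true
}

func main() {
	node := map[string]string{zoneLabel: "us-central1-a"}
	volumes := []map[string]string{{zoneLabel: "us-central1-b"}}
	fmt.Println(fitsZone(node, volumes)) // false: the volume lives in another zone
}
```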
- -One notable shortcoming here is that load-balanced traffic still goes through -kube-proxy controlled routing, and kube-proxy does not (currently) favor -targeting a pod running on the same instance or even the same zone. This will -likely produce a lot of unnecessary cross-zone traffic (which is likely slower -and more expensive). This might be sufficiently low-hanging fruit that we -choose to address it in kube-proxy / multi-AZ clusters, but this can be addressed -after the initial implementation. - - -## Implementation - -The main implementation points are: - -1. how to attach zone information to Nodes and PersistentVolumes -1. how nodes get zone information -1. how volumes get zone information - -### Attaching zone information - -We must attach zone information to Nodes and PersistentVolumes, and possibly to -other resources in future. There are two obvious alternatives: we can use -labels/annotations, or we can extend the schema to include the information. - -For the initial implementation, we propose to use labels. The reasoning is: - -1. It is considerably easier to implement. -1. We will reserve the two labels `failure-domain.alpha.kubernetes.io/zone` and -`failure-domain.alpha.kubernetes.io/region` for the two pieces of information -we need. By putting this under the `kubernetes.io` namespace there is no risk -of collision, and by putting it under `alpha.kubernetes.io` we clearly mark -this as an experimental feature. -1. We do not yet know whether these labels will be sufficient for all -environments, nor which entities will require zone information. Labels give us -more flexibility here. -1. Because the labels are reserved, we can move to schema-defined fields in -future using our cross-version mapping techniques. - -### Node labeling - -We do not want to require an administrator to manually label nodes. We instead -modify the kubelet to include the appropriate labels when it registers itself. -The information is easily obtained by the kubelet from the cloud provider. - -### Volume labeling - -As with nodes, we do not want to require an administrator to manually label -volumes. We will create an admission controller `PersistentVolumeLabel`. -`PersistentVolumeLabel` will intercept requests to create PersistentVolumes, -and will label them appropriately by calling in to the cloud provider. - -## AWS Specific Considerations - -The AWS implementation here is fairly straightforward. The AWS API is -region-wide, meaning that a single call will find instances and volumes in all -zones. In addition, instance ids and volume ids are unique per-region (and -hence also per-zone). I believe they are actually globally unique, but I do -not know if this is guaranteed; in any case we only need global uniqueness if -we are to span regions, which will not be supported by multi-AZ clusters (to do -that correctly requires a full Cluster Federation type approach). - -## GCE Specific Considerations - -The GCE implementation is more complicated than the AWS implementation because -GCE APIs are zone-scoped. To perform an operation, we must perform one REST -call per zone and combine the results, unless we can determine in advance that -an operation references a particular zone. For many operations, we can make -that determination, but in some cases - such as listing all instances, we must -combine results from calls in all relevant zones. - -A further complexity is that GCE volume names are scoped per-zone, not -per-region. 
Thus it is permitted to have two volumes both named `myvolume` in -two different GCE zones. (Instance names are currently unique per-region, and -thus are not a problem for multi-AZ clusters). - -The volume scoping leads to a (small) behavioural change for multi-AZ clusters on -GCE. If you had two volumes both named `myvolume` in two different GCE zones, -this would not be ambiguous when Kubernetes is operating only in a single zone. -But, when operating a cluster across multiple zones, `myvolume` is no longer -sufficient to specify a volume uniquely. Worse, the fact that a volume happens -to be unambigious at a particular time is no guarantee that it will continue to -be unambigious in future, because a volume with the same name could -subsequently be created in a second zone. While perhaps unlikely in practice, -we cannot automatically enable multi-AZ clusters for GCE users if this then causes -volume mounts to stop working. - -This suggests that (at least on GCE), multi-AZ clusters must be optional (i.e. -there must be a feature-flag). It may be that we can make this feature -semi-automatic in future, by detecting whether nodes are running in multiple -zones, but it seems likely that kube-up could instead simply set this flag. - -For the initial implementation, creating volumes with identical names will -yield undefined results. Later, we may add some way to specify the zone for a -volume (and possibly require that volumes have their zone specified when -running in multi-AZ cluster mode). We could add a new `zone` field to the -PersistentVolume type for GCE PD volumes, or we could use a DNS-style dotted -name for the volume name (.) - -Initially therefore, the GCE changes will be to: - -1. change kube-up to support creation of a cluster in multiple zones -1. pass a flag enabling multi-AZ clusters with kube-up -1. change the kubernetes cloud provider to iterate through relevant zones when resolving items -1. tag GCE PD volumes with the appropriate zone information - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/federation-lite.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federation-lite.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federation-lite.md) diff --git a/docs/proposals/federation.md b/docs/proposals/federation.md index fc595123302..276f8f3cd14 100644 --- a/docs/proposals/federation.md +++ b/docs/proposals/federation.md @@ -1,648 +1 @@ -# Kubernetes Cluster Federation - -## (previously nicknamed "Ubernetes") - -## Requirements Analysis and Product Proposal - -## _by Quinton Hoole ([quinton@google.com](mailto:quinton@google.com))_ - -_Initial revision: 2015-03-05_ -_Last updated: 2015-08-20_ -This doc: [tinyurl.com/ubernetesv2](http://tinyurl.com/ubernetesv2) -Original slides: [tinyurl.com/ubernetes-slides](http://tinyurl.com/ubernetes-slides) -Updated slides: [tinyurl.com/ubernetes-whereto](http://tinyurl.com/ubernetes-whereto) - -## Introduction - -Today, each Kubernetes cluster is a relatively self-contained unit, -which typically runs in a single "on-premise" data centre or single -availability zone of a cloud provider (Google's GCE, Amazon's AWS, -etc). - -Several current and potential Kubernetes users and customers have -expressed a keen interest in tying together ("federating") multiple -clusters in some sensible way in order to enable the following kinds -of use cases (intentionally vague): - -1. 
_"Preferentially run my workloads in my on-premise cluster(s), but - automatically overflow to my cloud-hosted cluster(s) if I run out - of on-premise capacity"_. -1. _"Most of my workloads should run in my preferred cloud-hosted - cluster(s), but some are privacy-sensitive, and should be - automatically diverted to run in my secure, on-premise - cluster(s)"_. -1. _"I want to avoid vendor lock-in, so I want my workloads to run - across multiple cloud providers all the time. I change my set of - such cloud providers, and my pricing contracts with them, - periodically"_. -1. _"I want to be immune to any single data centre or cloud - availability zone outage, so I want to spread my service across - multiple such zones (and ideally even across multiple cloud - providers)."_ - -The above use cases are by necessity left imprecisely defined. The -rest of this document explores these use cases and their implications -in further detail, and compares a few alternative high level -approaches to addressing them. The idea of cluster federation has -informally become known as _"Ubernetes"_. - -## Summary/TL;DR - -Four primary customer-driven use cases are explored in more detail. -The two highest priority ones relate to High Availability and -Application Portability (between cloud providers, and between -on-premise and cloud providers). - -Four primary federation primitives are identified (location affinity, -cross-cluster scheduling, service discovery and application -migration). Fortunately not all four of these primitives are required -for each primary use case, so incremental development is feasible. - -## What exactly is a Kubernetes Cluster? - -A central design concept in Kubernetes is that of a _cluster_. While -loosely speaking, a cluster can be thought of as running in a single -data center, or cloud provider availability zone, a more precise -definition is that each cluster provides: - -1. a single Kubernetes API entry point, -1. a consistent, cluster-wide resource naming scheme -1. a scheduling/container placement domain -1. a service network routing domain -1. an authentication and authorization model. - -The above in turn imply the need for a relatively performant, reliable -and cheap network within each cluster. - -There is also assumed to be some degree of failure correlation across -a cluster, i.e. whole clusters are expected to fail, at least -occasionally (due to cluster-wide power and network failures, natural -disasters etc). Clusters are often relatively homogeneous in that all -compute nodes are typically provided by a single cloud provider or -hardware vendor, and connected by a common, unified network fabric. -But these are not hard requirements of Kubernetes. - -Other classes of Kubernetes deployments than the one sketched above -are technically feasible, but come with some challenges of their own, -and are not yet common or explicitly supported. - -More specifically, having a Kubernetes cluster span multiple -well-connected availability zones within a single geographical region -(e.g. US North East, UK, Japan etc) is worthy of further -consideration, in particular because it potentially addresses -some of these requirements. - -## What use cases require Cluster Federation? 
- -Let's name a few concrete use cases to aid the discussion: - -## 1.Capacity Overflow - -_"I want to preferentially run my workloads in my on-premise cluster(s), but automatically "overflow" to my cloud-hosted cluster(s) when I run out of on-premise capacity."_ - -This idea is known in some circles as "[cloudbursting](http://searchcloudcomputing.techtarget.com/definition/cloud-bursting)". - -**Clarifying questions:** What is the unit of overflow? Individual - pods? Probably not always. Replication controllers and their - associated sets of pods? Groups of replication controllers - (a.k.a. distributed applications)? How are persistent disks - overflowed? Can the "overflowed" pods communicate with their - brethren and sistren pods and services in the other cluster(s)? - Presumably yes, at higher cost and latency, provided that they use - external service discovery. Is "overflow" enabled only when creating - new workloads/replication controllers, or are existing workloads - dynamically migrated between clusters based on fluctuating available - capacity? If so, what is the desired behaviour, and how is it - achieved? How, if at all, does this relate to quota enforcement - (e.g. if we run out of on-premise capacity, can all or only some - quotas transfer to other, potentially more expensive off-premise - capacity?) - -It seems that most of this boils down to: - -1. **location affinity** (pods relative to each other, and to other - stateful services like persistent storage - how is this expressed - and enforced?) -1. **cross-cluster scheduling** (given location affinity constraints - and other scheduling policy, which resources are assigned to which - clusters, and by what?) -1. **cross-cluster service discovery** (how do pods in one cluster - discover and communicate with pods in another cluster?) -1. **cross-cluster migration** (how do compute and storage resources, - and the distributed applications to which they belong, move from - one cluster to another) -1. **cross-cluster load-balancing** (how does is user traffic directed - to an appropriate cluster?) -1. **cross-cluster monitoring and auditing** (a.k.a. Unified Visibility) - -## 2. Sensitive Workloads - -_"I want most of my workloads to run in my preferred cloud-hosted -cluster(s), but some are privacy-sensitive, and should be -automatically diverted to run in my secure, on-premise cluster(s). The -list of privacy-sensitive workloads changes over time, and they're -subject to external auditing."_ - -**Clarifying questions:** -1. What kinds of rules determine which -workloads go where? - 1. Is there in fact a requirement to have these rules be - declaratively expressed and automatically enforced, or is it - acceptable/better to have users manually select where to run - their workloads when starting them? - 1. Is a static mapping from container (or more typically, - replication controller) to cluster maintained and enforced? - 1. If so, is it only enforced on startup, or are things migrated - between clusters when the mappings change? - -This starts to look quite similar to "1. Capacity Overflow", and again -seems to boil down to: - -1. location affinity -1. cross-cluster scheduling -1. cross-cluster service discovery -1. cross-cluster migration -1. cross-cluster monitoring and auditing -1. cross-cluster load balancing - -## 3. Vendor lock-in avoidance - -_"My CTO wants us to avoid vendor lock-in, so she wants our workloads -to run across multiple cloud providers at all times. 
She changes our -set of preferred cloud providers and pricing contracts with them -periodically, and doesn't want to have to communicate and manually -enforce these policy changes across the organization every time this -happens. She wants it centrally and automatically enforced, monitored -and audited."_ - -**Clarifying questions:** - -1. How does this relate to other use cases (high availability, -capacity overflow etc), as they may all be across multiple vendors. -It's probably not strictly speaking a separate -use case, but it's brought up so often as a requirement, that it's -worth calling out explicitly. -1. Is a useful intermediate step to make it as simple as possible to - migrate an application from one vendor to another in a one-off fashion? - -Again, I think that this can probably be - reformulated as a Capacity Overflow problem - the fundamental - principles seem to be the same or substantially similar to those - above. - -## 4. "High Availability" - -_"I want to be immune to any single data centre or cloud availability -zone outage, so I want to spread my service across multiple such zones -(and ideally even across multiple cloud providers), and have my -service remain available even if one of the availability zones or -cloud providers "goes down"_. - -It seems useful to split this into multiple sets of sub use cases: - -1. Multiple availability zones within a single cloud provider (across - which feature sets like private networks, load balancing, - persistent disks, data snapshots etc are typically consistent and - explicitly designed to inter-operate). - 1. within the same geographical region (e.g. metro) within which network - is fast and cheap enough to be almost analogous to a single data - center. - 1. across multiple geographical regions, where high network cost and - poor network performance may be prohibitive. -1. Multiple cloud providers (typically with inconsistent feature sets, - more limited interoperability, and typically no cheap inter-cluster - networking described above). - -The single cloud provider case might be easier to implement (although -the multi-cloud provider implementation should just work for a single -cloud provider). Propose high-level design catering for both, with -initial implementation targeting single cloud provider only. - -**Clarifying questions:** -**How does global external service discovery work?** In the steady - state, which external clients connect to which clusters? GeoDNS or - similar? What is the tolerable failover latency if a cluster goes - down? Maybe something like (make up some numbers, notwithstanding - some buggy DNS resolvers, TTL's, caches etc) ~3 minutes for ~90% of - clients to re-issue DNS lookups and reconnect to a new cluster when - their home cluster fails is good enough for most Kubernetes users - (or at least way better than the status quo), given that these sorts - of failure only happen a small number of times a year? - -**How does dynamic load balancing across clusters work, if at all?** - One simple starting point might be "it doesn't". i.e. if a service - in a cluster is deemed to be "up", it receives as much traffic as is - generated "nearby" (even if it overloads). If the service is deemed - to "be down" in a given cluster, "all" nearby traffic is redirected - to some other cluster within some number of seconds (failover could - be automatic or manual). Failover is essentially binary. 
An - improvement would be to detect when a service in a cluster reaches - maximum serving capacity, and dynamically divert additional traffic - to other clusters. But how exactly does all of this work, and how - much of it is provided by Kubernetes, as opposed to something else - bolted on top (e.g. external monitoring and manipulation of GeoDNS)? - -**How does this tie in with auto-scaling of services?** More - specifically, if I run my service across _n_ clusters globally, and - one (or more) of them fail, how do I ensure that the remaining _n-1_ - clusters have enough capacity to serve the additional, failed-over - traffic? Either: - -1. I constantly over-provision all clusters by 1/n (potentially expensive), or -1. I "manually" (or automatically) update my replica count configurations in the - remaining clusters by 1/n when the failure occurs, and Kubernetes - takes care of the rest for me, or -1. Auto-scaling in the remaining clusters takes - care of it for me automagically as the additional failed-over - traffic arrives (with some latency). Note that this implies that - the cloud provider keeps the necessary resources on hand to - accommodate such auto-scaling (e.g. via something similar to AWS reserved - and spot instances) - -Up to this point, this use case ("Unavailability Zones") seems materially different from all the others above. It does not require dynamic cross-cluster service migration (we assume that the service is already running in more than one cluster when the failure occurs). Nor does it necessarily involve cross-cluster service discovery or location affinity. As a result, I propose that we address this use case somewhat independently of the others (although I strongly suspect that it will become substantially easier once we've solved the others). - -All of the above (regarding "Unavailability Zones") refers primarily -to already-running user-facing services, and minimizing the impact on -end users of those services becoming unavailable in a given cluster. -What about the people and systems that deploy Kubernetes services -(devops etc)? Should they be automatically shielded from the impact -of the cluster outage? i.e. have their new resource creation requests -automatically diverted to another cluster during the outage? While -this specific requirement seems non-critical (manual fail-over seems -relatively non-arduous, ignoring the user-facing issues above), it -smells a lot like the first three use cases listed above ("Capacity -Overflow, Sensitive Services, Vendor lock-in..."), so if we address -those, we probably get this one free of charge. - -## Core Challenges of Cluster Federation - -As we saw above, a few common challenges fall out of most of the use -cases considered above, namely: - -## Location Affinity - -Can the pods comprising a single distributed application be -partitioned across more than one cluster? More generally, how far -apart, in network terms, can a given client and server within a -distributed application reasonably be? A server need not necessarily -be a pod, but could instead be a persistent disk housing data, or some -other stateful network service. What is tolerable is typically -application-dependent, primarily influenced by network bandwidth -consumption, latency requirements and cost sensitivity. - -For simplicity, let's assume that all Kubernetes distributed -applications fall into one of three categories with respect to relative -location affinity: - -1. 
**"Strictly Coupled"**: Those applications that strictly cannot be - partitioned between clusters. They simply fail if they are - partitioned. When scheduled, all pods _must_ be scheduled to the - same cluster. To move them, we need to shut the whole distributed - application down (all pods) in one cluster, possibly move some - data, and then bring the up all of the pods in another cluster. To - avoid downtime, we might bring up the replacement cluster and - divert traffic there before turning down the original, but the - principle is much the same. In some cases moving the data might be - prohibitively expensive or time-consuming, in which case these - applications may be effectively _immovable_. -1. **"Strictly Decoupled"**: Those applications that can be - indefinitely partitioned across more than one cluster, to no - disadvantage. An embarrassingly parallel YouTube porn detector, - where each pod repeatedly dequeues a video URL from a remote work - queue, downloads and chews on the video for a few hours, and - arrives at a binary verdict, might be one such example. The pods - derive no benefit from being close to each other, or anything else - (other than the source of YouTube videos, which is assumed to be - equally remote from all clusters in this example). Each pod can be - scheduled independently, in any cluster, and moved at any time. -1. **"Preferentially Coupled"**: Somewhere between Coupled and - Decoupled. These applications prefer to have all of their pods - located in the same cluster (e.g. for failure correlation, network - latency or bandwidth cost reasons), but can tolerate being - partitioned for "short" periods of time (for example while - migrating the application from one cluster to another). Most small - to medium sized LAMP stacks with not-very-strict latency goals - probably fall into this category (provided that they use sane - service discovery and reconnect-on-fail, which they need to do - anyway to run effectively, even in a single Kubernetes cluster). - -From a fault isolation point of view, there are also opposites of the -above. For example, a master database and its slave replica might -need to be in different availability zones. We'll refer to this a -anti-affinity, although it is largely outside the scope of this -document. - -Note that there is somewhat of a continuum with respect to network -cost and quality between any two nodes, ranging from two nodes on the -same L2 network segment (lowest latency and cost, highest bandwidth) -to two nodes on different continents (highest latency and cost, lowest -bandwidth). One interesting point on that continuum relates to -multiple availability zones within a well-connected metro or region -and single cloud provider. Despite being in different data centers, -or areas within a mega data center, network in this case is often very fast -and effectively free or very cheap. For the purposes of this network location -affinity discussion, this case is considered analogous to a single -availability zone. Furthermore, if a given application doesn't fit -cleanly into one of the above, shoe-horn it into the best fit, -defaulting to the "Strictly Coupled and Immovable" bucket if you're -not sure. - -And then there's what I'll call _absolute_ location affinity. Some -applications are required to run in bounded geographical or network -topology locations. 
The reasons for this are typically -political/legislative (data privacy laws etc), or driven by network -proximity to consumers (or data providers) of the application ("most -of our users are in Western Europe, U.S. West Coast" etc). - -**Proposal:** First tackle Strictly Decoupled applications (which can - be trivially scheduled, partitioned or moved, one pod at a time). - Then tackle Preferentially Coupled applications (which must be - scheduled in totality in a single cluster, and can be moved, but - ultimately in total, and necessarily within some bounded time). - Leave strictly coupled applications to be manually moved between - clusters as required for the foreseeable future. - -## Cross-cluster service discovery - -I propose having pods use standard discovery methods used by external -clients of Kubernetes applications (i.e. DNS). DNS might resolve to a -public endpoint in the local or a remote cluster. Other than Strictly -Coupled applications, software should be largely oblivious of which of -the two occurs. - -_Aside:_ How do we avoid "tromboning" through an external VIP when DNS -resolves to a public IP on the local cluster? Strictly speaking this -would be an optimization for some cases, and probably only matters to -high-bandwidth, low-latency communications. We could potentially -eliminate the trombone with some kube-proxy magic if necessary. More -detail to be added here, but feel free to shoot down the basic DNS -idea in the mean time. In addition, some applications rely on private -networking between clusters for security (e.g. AWS VPC or more -generally VPN). It should not be necessary to forsake this in -order to use Cluster Federation, for example by being forced to use public -connectivity between clusters. - -## Cross-cluster Scheduling - -This is closely related to location affinity above, and also discussed -there. The basic idea is that some controller, logically outside of -the basic Kubernetes control plane of the clusters in question, needs -to be able to: - -1. Receive "global" resource creation requests. -1. Make policy-based decisions as to which cluster(s) should be used - to fulfill each given resource request. In a simple case, the - request is just redirected to one cluster. In a more complex case, - the request is "demultiplexed" into multiple sub-requests, each to - a different cluster. Knowledge of the (albeit approximate) - available capacity in each cluster will be required by the - controller to sanely split the request. Similarly, knowledge of - the properties of the application (Location Affinity class -- - Strictly Coupled, Strictly Decoupled etc, privacy class etc) will - be required. It is also conceivable that knowledge of service - SLAs and monitoring thereof might provide an input into - scheduling/placement algorithms. -1. Multiplex the responses from the individual clusters into an - aggregate response. - -There is of course a lot of detail still missing from this section, -including discussion of: - -1. admission control -1. initial placement of instances of a new -service vs. scheduling new instances of an existing service in response -to auto-scaling -1. rescheduling pods due to failure (response might be -different depending on if it's failure of a node, rack, or whole AZ) -1. data placement relative to compute capacity, -etc. - -## Cross-cluster Migration - -Again this is closely related to location affinity discussed above, -and is in some sense an extension of Cross-cluster Scheduling. 
When -certain events occur, it becomes necessary or desirable for the -cluster federation system to proactively move distributed applications -(either in part or in whole) from one cluster to another. Examples of -such events include: - -1. A low capacity event in a cluster (or a cluster failure). -1. A change of scheduling policy ("we no longer use cloud provider X"). -1. A change of resource pricing ("cloud provider Y dropped their - prices - let's migrate there"). - -Strictly Decoupled applications can be trivially moved, in part or in -whole, one pod at a time, to one or more clusters (within applicable -policy constraints, for example "PrivateCloudOnly"). - -For Preferentially Decoupled applications, the federation system must -first locate a single cluster with sufficient capacity to accommodate -the entire application, then reserve that capacity, and incrementally -move the application, one (or more) resources at a time, over to the -new cluster, within some bounded time period (and possibly within a -predefined "maintenance" window). Strictly Coupled applications (with -the exception of those deemed completely immovable) require the -federation system to: - -1. start up an entire replica application in the destination cluster -1. copy persistent data to the new application instance (possibly - before starting pods) -1. switch user traffic across -1. tear down the original application instance - -It is proposed that support for automated migration of Strictly -Coupled applications be deferred to a later date. - -## Other Requirements - -These are often left implicit by customers, but are worth calling out explicitly: - -1. Software failure isolation between Kubernetes clusters should be - retained as far as is practically possible. The federation system - should not materially increase the failure correlation across - clusters. For this reason the federation control plane software - should ideally be completely independent of the Kubernetes cluster - control software, and look just like any other Kubernetes API - client, with no special treatment. If the federation control plane - software fails catastrophically, the underlying Kubernetes clusters - should remain independently usable. -1. Unified monitoring, alerting and auditing across federated Kubernetes clusters. -1. Unified authentication, authorization and quota management across - clusters (this is in direct conflict with failure isolation above, - so there are some tough trade-offs to be made here). - -## Proposed High-Level Architectures - -Two distinct potential architectural approaches have emerged from discussions -thus far: - -1. An explicitly decoupled and hierarchical architecture, where the - Federation Control Plane sits logically above a set of independent - Kubernetes clusters, each of which is (potentially) unaware of the - other clusters, and of the Federation Control Plane itself (other - than to the extent that it is an API client much like any other). - One possible example of this general architecture is illustrated - below, and will be referred to as the "Decoupled, Hierarchical" - approach. -1. A more monolithic architecture, where a single instance of the - Kubernetes control plane itself manages a single logical cluster - composed of nodes in multiple availability zones and cloud - providers. - -A very brief, non-exhaustive list of pro's and con's of the two -approaches follows. (In the interest of full disclosure, the author -prefers the Decoupled Hierarchical model for the reasons stated below). - -1. 
**Failure isolation:** The Decoupled Hierarchical approach provides - better failure isolation than the Monolithic approach, as each - underlying Kubernetes cluster, and the Federation Control Plane, - can operate and fail completely independently of each other. In - particular, their software and configurations can be updated - independently. Such updates are, in our experience, the primary - cause of control-plane failures, in general. -1. **Failure probability:** The Decoupled Hierarchical model incorporates - numerically more independent pieces of software and configuration - than the Monolithic one. But the complexity of each of these - decoupled pieces is arguably better contained in the Decoupled - model (per standard arguments for modular rather than monolithic - software design). Which of the two models presents higher - aggregate complexity and consequent failure probability remains - somewhat of an open question. -1. **Scalability:** Conceptually the Decoupled Hierarchical model wins - here, as each underlying Kubernetes cluster can be scaled - completely independently w.r.t. scheduling, node state management, - monitoring, network connectivity etc. It is even potentially - feasible to stack federations of clusters (i.e. create - federations of federations) should scalability of the independent - Federation Control Plane become an issue (although the author does - not envision this being a problem worth solving in the short - term). -1. **Code complexity:** I think that an argument can be made both ways - here. It depends on whether you prefer to weave the logic for - handling nodes in multiple availability zones and cloud providers - within a single logical cluster into the existing Kubernetes - control plane code base (which was explicitly not designed for - this), or separate it into a decoupled Federation system (with - possible code sharing between the two via shared libraries). The - author prefers the latter because it: - 1. Promotes better code modularity and interface design. - 1. Allows the code - bases of Kubernetes and the Federation system to progress - largely independently (different sets of developers, different - release schedules etc). -1. **Administration complexity:** Again, I think that this could be argued - both ways. Superficially it would seem that administration of a - single Monolithic multi-zone cluster might be simpler by virtue of - being only "one thing to manage", however in practise each of the - underlying availability zones (and possibly cloud providers) has - its own capacity, pricing, hardware platforms, and possibly - bureaucratic boundaries (e.g. "our EMEA IT department manages those - European clusters"). So explicitly allowing for (but not - mandating) completely independent administration of each - underlying Kubernetes cluster, and the Federation system itself, - in the Decoupled Hierarchical model seems to have real practical - benefits that outweigh the superficial simplicity of the - Monolithic model. -1. **Application development and deployment complexity:** It's not clear - to me that there is any significant difference between the two - models in this regard. Presumably the API exposed by the two - different architectures would look very similar, as would the - behavior of the deployed applications. It has even been suggested - to write the code in such a way that it could be run in either - configuration. It's not clear that this makes sense in practise - though. -1. 
**Control plane cost overhead:** There is a minimum per-cluster - overhead -- two possibly virtual machines, or more for redundant HA - deployments. For deployments of very small Kubernetes - clusters with the Decoupled Hierarchical approach, this cost can - become significant. - -### The Decoupled, Hierarchical Approach - Illustrated - -![image](federation-high-level-arch.png) - -## Cluster Federation API - -It is proposed that this look a lot like the existing Kubernetes API -but be explicitly multi-cluster. - -+ Clusters become first class objects, which can be registered, - listed, described, deregistered etc via the API. -+ Compute resources can be explicitly requested in specific clusters, - or automatically scheduled to the "best" cluster by the Cluster - Federation control system (by a - pluggable Policy Engine). -+ There is a federated equivalent of a replication controller type (or - perhaps a [deployment](deployment.md)), - which is multicluster-aware, and delegates to cluster-specific - replication controllers/deployments as required (e.g. a federated RC for n - replicas might simply spawn multiple replication controllers in - different clusters to do the hard work). - -## Policy Engine and Migration/Replication Controllers - -The Policy Engine decides which parts of each application go into each -cluster at any point in time, and stores this desired state in the -Desired Federation State store (an etcd or -similar). Migration/Replication Controllers reconcile this against the -desired states stored in the underlying Kubernetes clusters (by -watching both, and creating or updating the underlying Replication -Controllers and related Services accordingly). - -## Authentication and Authorization - -This should ideally be delegated to some external auth system, shared -by the underlying clusters, to avoid duplication and inconsistency. -Either that, or we end up with multilevel auth. Local readonly -eventually consistent auth slaves in each cluster and in the Cluster -Federation control system -could potentially cache auth, to mitigate an SPOF auth system. - -## Data consistency, failure and availability characteristics - -The services comprising the Cluster Federation control plane) have to run - somewhere. Several options exist here: -* For high availability Cluster Federation deployments, these - services may run in either: - * a dedicated Kubernetes cluster, not co-located in the same - availability zone with any of the federated clusters (for fault - isolation reasons). If that cluster/availability zone, and hence the Federation - system, fails catastrophically, the underlying pods and - applications continue to run correctly, albeit temporarily - without the Federation system. - * across multiple Kubernetes availability zones, probably with - some sort of cross-AZ quorum-based store. This provides - theoretically higher availability, at the cost of some - complexity related to data consistency across multiple - availability zones. - * For simpler, less highly available deployments, just co-locate the - Federation control plane in/on/with one of the underlying - Kubernetes clusters. The downside of this approach is that if - that specific cluster fails, all automated failover and scaling - logic which relies on the federation system will also be - unavailable at the same time (i.e. precisely when it is needed). - But if one of the other federated clusters fails, everything - should work just fine. 
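As a rough illustration of the federated replication controller behaviour sketched in the Cluster Federation API section above (a federated RC for n replicas spawning per-cluster replication controllers to do the hard work), the snippet below shows one naive way a controller might split a global replica count across the registered clusters. The function name and the even-split policy are illustrative assumptions, not part of this proposal; in practice the Policy Engine would weigh available capacity, location affinity and policy constraints when deciding the per-cluster counts.

```go
package main

import "fmt"

// splitReplicas divides a federated replica count across clusters.
// This is a naive even split for illustration only; a real Policy Engine
// would take cluster capacity, location affinity and policy into account.
func splitReplicas(total int, clusters []string) map[string]int {
	perCluster := map[string]int{}
	if len(clusters) == 0 {
		return perCluster
	}
	base := total / len(clusters)
	extra := total % len(clusters)
	for i, c := range clusters {
		perCluster[c] = base
		if i < extra {
			perCluster[c]++ // hand out the remainder one replica at a time
		}
	}
	return perCluster
}

func main() {
	// e.g. a federated RC asking for 10 replicas across three clusters.
	fmt.Println(splitReplicas(10, []string{"us-east", "eu-west", "asia-east"}))
}
```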
- -There is some further thinking to be done around the data consistency - model upon which the Federation system is based, and it's impact - on the detailed semantics, failure and availability - characteristics of the system. - -## Proposed Next Steps - -Identify concrete applications of each use case and configure a proof -of concept service that exercises the use case. For example, cluster -failure tolerance seems popular, so set up an apache frontend with -replicas in each of three availability zones with either an Amazon Elastic -Load Balancer or Google Cloud Load Balancer pointing at them? What -does the zookeeper config look like for N=3 across 3 AZs -- and how -does each replica find the other replicas and how do clients find -their primary zookeeper replica? And now how do I do a shared, highly -available redis database? Use a few common specific use cases like -this to flesh out the detailed API and semantics of Cluster Federation. - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/federation.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federation.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federation.md) diff --git a/docs/proposals/flannel-integration.md b/docs/proposals/flannel-integration.md index 465ee5e635f..9001c43482f 100644 --- a/docs/proposals/flannel-integration.md +++ b/docs/proposals/flannel-integration.md @@ -1,132 +1 @@ -# Flannel integration with Kubernetes - -## Why? - -* Networking works out of the box. -* Cloud gateway configuration is regulated by quota. -* Consistent bare metal and cloud experience. -* Lays foundation for integrating with networking backends and vendors. - -## How? - -Thus: - -``` -Master | Node1 ----------------------------------------------------------------------- -{192.168.0.0/16, 256 /24} | docker - | | | restart with podcidr -apiserver <------------------ kubelet (sends podcidr) - | | | here's podcidr, mtu -flannel-server:10253 <------------------ flannel-daemon -Allocates a /24 ------------------> [config iptables, VXLan] - <------------------ [watch subnet leases] -I just allocated ------------------> [config VXLan] -another /24 | -``` - -## Proposal - -Explaining vxlan is out of the scope of this document, however it does take some basic understanding to grok the proposal. Assume some pod wants to communicate across nodes with the above setup. Check the flannel vxlan devices: - -```console -node1 $ ip -d link show flannel.1 -4: flannel.1: mtu 1410 qdisc noqueue state UNKNOWN mode DEFAULT - link/ether a2:53:86:b5:5f:c1 brd ff:ff:ff:ff:ff:ff - vxlan -node1 $ ip -d link show eth0 -2: eth0: mtu 1460 qdisc mq state UP mode DEFAULT qlen 1000 - link/ether 42:01:0a:f0:00:04 brd ff:ff:ff:ff:ff:ff - -node2 $ ip -d link show flannel.1 -4: flannel.1: mtu 1410 qdisc noqueue state UNKNOWN mode DEFAULT - link/ether 56:71:35:66:4a:d8 brd ff:ff:ff:ff:ff:ff - vxlan -node2 $ ip -d link show eth0 -2: eth0: mtu 1460 qdisc mq state UP mode DEFAULT qlen 1000 - link/ether 42:01:0a:f0:00:03 brd ff:ff:ff:ff:ff:ff -``` - -Note that we're ignoring cbr0 for the sake of simplicity. Spin-up a container on each node. 
We're using raw docker for this example only because we want control over where the container lands: - -``` -node1 $ docker run -it radial/busyboxplus:curl /bin/sh -[ root@5ca3c154cde3:/ ]$ ip addr show -1: lo: mtu 65536 qdisc noqueue -8: eth0: mtu 1410 qdisc noqueue - link/ether 02:42:12:10:20:03 brd ff:ff:ff:ff:ff:ff - inet 192.168.32.3/24 scope global eth0 - valid_lft forever preferred_lft forever - -node2 $ docker run -it radial/busyboxplus:curl /bin/sh -[ root@d8a879a29f5d:/ ]$ ip addr show -1: lo: mtu 65536 qdisc noqueue -16: eth0: mtu 1410 qdisc noqueue - link/ether 02:42:12:10:0e:07 brd ff:ff:ff:ff:ff:ff - inet 192.168.14.7/24 scope global eth0 - valid_lft forever preferred_lft forever -[ root@d8a879a29f5d:/ ]$ ping 192.168.32.3 -PING 192.168.32.3 (192.168.32.3): 56 data bytes -64 bytes from 192.168.32.3: seq=0 ttl=62 time=1.190 ms -``` - -__What happened?__: - -From 1000 feet: -* vxlan device driver starts up on node1 and creates a udp tunnel endpoint on 8472 -* container 192.168.32.3 pings 192.168.14.7 - - what's the MAC of 192.168.14.0? - - L2 miss, flannel looks up MAC of subnet - - Stores `192.168.14.0 <-> 56:71:35:66:4a:d8` in neighbor table - - what's tunnel endpoint of this MAC? - - L3 miss, flannel looks up destination VM ip - - Stores `10.240.0.3 <-> 56:71:35:66:4a:d8` in bridge database -* Sends `[56:71:35:66:4a:d8, 10.240.0.3][vxlan: port, vni][02:42:12:10:20:03, 192.168.14.7][icmp]` - -__But will it blend?__ - -Kubernetes integration is fairly straight-forward once we understand the pieces involved, and can be prioritized as follows: -* Kubelet understands flannel daemon in client mode, flannel server manages independent etcd store on master, node controller backs off CIDR allocation -* Flannel server consults the Kubernetes master for everything network related -* Flannel daemon works through network plugins in a generic way without bothering the kubelet: needs CNI x Kubernetes standardization - -The first is accomplished in this PR, while a timeline for 2. and 3. is TDB. To implement the flannel api we can either run a proxy per node and get rid of the flannel server, or service all requests in the flannel server with something like a go-routine per node: -* `/network/config`: read network configuration and return -* `/network/leases`: - - Post: Return a lease as understood by flannel - - Lookip node by IP - - Store node metadata from [flannel request] (https://github.com/coreos/flannel/blob/master/subnet/subnet.go#L34) in annotations - - Return [Lease object] (https://github.com/coreos/flannel/blob/master/subnet/subnet.go#L40) reflecting node cidr - - Get: Handle a watch on leases -* `/network/leases/subnet`: - - Put: This is a request for a lease. If the nodecontroller is allocating CIDRs we can probably just no-op. -* `/network/reservations`: TDB, we can probably use this to accommodate node controller allocating CIDR instead of flannel requesting it - -The ick-iest part of this implementation is going to the `GET /network/leases`, i.e. the watch proxy. We can side-step by waiting for a more generic Kubernetes resource. 
However, we can also implement it as follows: -* Watch all nodes, ignore heartbeats -* On each change, figure out the lease for the node, construct a [lease watch result](https://github.com/coreos/flannel/blob/0bf263826eab1707be5262703a8092c7d15e0be4/subnet/subnet.go#L72), and send it down the watch with the RV from the node -* Implement a lease list that does a similar translation - -I say this is gross without an api object because for each node->lease translation one has to store and retrieve the node metadata sent by flannel (eg: VTEP) from node annotations. [Reference implementation](https://github.com/bprashanth/kubernetes/blob/network_vxlan/pkg/kubelet/flannel_server.go) and [watch proxy](https://github.com/bprashanth/kubernetes/blob/network_vxlan/pkg/kubelet/watch_proxy.go). - -# Limitations - -* Integration is experimental -* Flannel etcd not stored in persistent disk -* CIDR allocation does *not* flow from Kubernetes down to nodes anymore - -# Wishlist - -This proposal is really just a call for community help in writing a Kubernetes x flannel backend. - -* CNI plugin integration -* Flannel daemon in privileged pod -* Flannel server talks to apiserver, described in proposal above -* HTTPs between flannel daemon/server -* Investigate flannel server running on every node (as done in the reference implementation mentioned above) -* Use flannel reservation mode to support node controller podcidr allocation - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/flannel-integration.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/flannel-integration.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/flannel-integration.md) diff --git a/docs/proposals/garbage-collection.md b/docs/proposals/garbage-collection.md index b24a7f2104c..6b6d54077ea 100644 --- a/docs/proposals/garbage-collection.md +++ b/docs/proposals/garbage-collection.md @@ -1,357 +1 @@ -**Table of Contents** - -- [Overview](#overview) -- [Cascading deletion with Garbage Collector](#cascading-deletion-with-garbage-collector) -- [Orphaning the descendants with "orphan" finalizer](#orphaning-the-descendants-with-orphan-finalizer) - - [Part I. The finalizer framework](#part-i-the-finalizer-framework) - - [Part II. The "orphan" finalizer](#part-ii-the-orphan-finalizer) -- [Related issues](#related-issues) - - [Orphan adoption](#orphan-adoption) - - [Upgrading a cluster to support cascading deletion](#upgrading-a-cluster-to-support-cascading-deletion) -- [End-to-End Examples](#end-to-end-examples) - - [Life of a Deployment and its descendants](#life-of-a-deployment-and-its-descendants) -- [Open Questions](#open-questions) -- [Considered and Rejected Designs](#considered-and-rejected-designs) -- [1. Tombstone + GC](#1-tombstone--gc) -- [2. Recovering from abnormal cascading deletion](#2-recovering-from-abnormal-cascading-deletion) - - -# Overview - -Currently most cascading deletion logic is implemented at client-side. For example, when deleting a replica set, kubectl uses a reaper to delete the created pods and then delete the replica set. We plan to move the cascading deletion to the server to simplify the client-side logic. 
In this proposal, we present the garbage collector, which implements cascading deletion for all API resources in a generic way; we also present the finalizer framework, particularly the "orphan" finalizer, to enable flexible alternation between cascading deletion and orphaning.
-
-Goals of the design include:
-* Supporting cascading deletion on the server side.
-* Centralizing the cascading deletion logic, rather than spreading it across controllers.
-* Optionally allowing the dependent objects to be orphaned.
-
-Non-goals include:
-* Releasing the name of an object immediately, so it can be reused ASAP.
-* Propagating the grace period in cascading deletion.
-
-# Cascading deletion with Garbage Collector
-
-## API Changes
-
-```
-type ObjectMeta struct {
-    ...
-    OwnerReferences []OwnerReference
-}
-```
-
-**ObjectMeta.OwnerReferences**: List of objects that this object depends on, i.e. its owners. If ***all*** objects in the list have been deleted, this object will be garbage collected. For example, a replica set `R` created by a deployment `D` should have an entry in ObjectMeta.OwnerReferences pointing to `D`, set by the deployment controller when `R` is created. This field can be updated by any client that has the privilege to both update ***and*** delete the object. For safety reasons, we can add validation rules to restrict what resources can be set as owners. For example, Events will likely be banned from being owners.
-
-```
-type OwnerReference struct {
-    // Version of the referent.
-    APIVersion string
-    // Kind of the referent.
-    Kind string
-    // Name of the referent.
-    Name string
-    // UID of the referent.
-    UID types.UID
-}
-```
-
-**OwnerReference struct**: OwnerReference contains enough information to let you identify an owning object. Please refer to the inline comments for the meaning of each field. Currently, an owning object must be in the same namespace as the dependent object, so there is no namespace field.
-
-## New components: the Garbage Collector
-
-The Garbage Collector is responsible for deleting an object if none of the owners listed in the object's OwnerReferences exist.
-The Garbage Collector consists of a scanner, a garbage processor, and a propagator.
-* Scanner:
-  * Uses the discovery API to detect all the resources supported by the system.
-  * Periodically scans all resources in the system and adds each object to the *Dirty Queue*.
-
-* Garbage Processor:
-  * Consists of the *Dirty Queue* and workers.
-  * Each worker:
-    * Dequeues an item from the *Dirty Queue*.
-    * If the item's OwnerReferences is empty, continues to process the next item in the *Dirty Queue*.
-    * Otherwise checks each entry in the OwnerReferences:
-      * If at least one owner exists, does nothing.
-      * If none of the owners exist, requests the API server to delete the item.
-
-* Propagator:
-  * The Propagator is for optimization, not for correctness.
-  * Consists of an *Event Queue*, a single worker, and a DAG of owner-dependent relations.
-    * The DAG stores only name/uid/orphan triplets, not the entire body of every item.
-  * Watches for create/update/delete events for all resources and enqueues the events to the *Event Queue*.
-  * Worker:
-    * Dequeues an item from the *Event Queue*.
-    * If the item is a creation or update, then updates the DAG accordingly.
-      * If the object has an owner and the owner doesn't exist in the DAG yet, then apart from adding the object to the DAG, also enqueues the object to the *Dirty Queue*.
- * If the item is a deletion, then removes the object from the DAG, and enqueues all its dependent objects to the *Dirty Queue*. - * The propagator shouldn't need to do any RPCs, so a single worker should be sufficient. This makes locking easier. - * With the Propagator, we *only* need to run the Scanner when starting the GC to populate the DAG and the *Dirty Queue*. - -# Orphaning the descendants with "orphan" finalizer - -Users may want to delete an owning object (e.g., a replicaset) while orphaning the dependent object (e.g., pods), that is, leaving the dependent objects untouched. We support such use cases by introducing the "orphan" finalizer. Finalizer is a generic API that has uses other than supporting orphaning, so we first describe the generic finalizer framework, then describe the specific design of the "orphan" finalizer. - -## Part I. The finalizer framework - -## API changes - -``` -type ObjectMeta struct { - … - Finalizers []string -} -``` - -**ObjectMeta.Finalizers**: List of finalizers that need to run before deleting the object. This list must be empty before the object is deleted from the registry. Each string in the list is an identifier for the responsible component that will remove the entry from the list. If the deletionTimestamp of the object is non-nil, entries in this list can only be removed. For safety reasons, updating finalizers requires special privileges. To enforce the admission rules, we will expose finalizers as a subresource and disallow directly changing finalizers when updating the main resource. - -## New components - -* Finalizers: - * Like a controller, a finalizer is always running. - * A third party can develop and run their own finalizer in the cluster. A finalizer doesn't need to be registered with the API server. - * Watches for update events that meet two conditions: - 1. the updated object has the identifier of the finalizer in ObjectMeta.Finalizers; - 2. ObjectMeta.DeletionTimestamp is updated from nil to non-nil. - * Applies the finalizing logic to the object in the update event. - * After the finalizing logic is completed, removes itself from ObjectMeta.Finalizers. - * The API server deletes the object after the last finalizer removes itself from the ObjectMeta.Finalizers field. - * Because it's possible for the finalizing logic to be applied multiple times (e.g., the finalizer crashes after applying the finalizing logic but before being removed form ObjectMeta.Finalizers), the finalizing logic has to be idempotent. - * If a finalizer fails to act in a timely manner, users with proper privileges can manually remove the finalizer from ObjectMeta.Finalizers. We will provide a kubectl command to do this. - -## Changes to existing components - -* API server: - * Deletion handler: - * If the `ObjectMeta.Finalizers` of the object being deleted is non-empty, then updates the DeletionTimestamp, but does not delete the object. - * If the `ObjectMeta.Finalizers` is empty and the options.GracePeriod is zero, then deletes the object. If the options.GracePeriod is non-zero, then just updates the DeletionTimestamp. - * Update handler: - * If the update removes the last finalizer, and the DeletionTimestamp is non-nil, and the DeletionGracePeriodSeconds is zero, then deletes the object from the registry. - * If the update removes the last finalizer, and the DeletionTimestamp is non-nil, but the DeletionGracePeriodSeconds is non-zero, then just updates the object. - -## Part II. 
The "orphan" finalizer - -## API changes - -``` -type DeleteOptions struct { - … - OrphanDependents bool -} -``` - -**DeleteOptions.OrphanDependents**: allows a user to express whether the dependent objects should be orphaned. It defaults to true, because controllers before release 1.2 expect dependent objects to be orphaned. - -## Changes to existing components - -* API server: -When handling a deletion request, depending on if DeleteOptions.OrphanDependents is true, the API server updates the object to add/remove the "orphan" finalizer to/from the ObjectMeta.Finalizers map. - - -## New components - -Adding a fourth component to the Garbage Collector, the"orphan" finalizer: -* Watches for update events as described in [Part I](#part-i-the-finalizer-framework). -* Removes the object in the event from the `OwnerReferences` of its dependents. - * dependent objects can be found via the DAG kept by the GC, or by relisting the dependent resource and checking the OwnerReferences field of each potential dependent object. -* Also removes any dangling owner references the dependent objects have. -* At last, removes the itself from the `ObjectMeta.Finalizers` of the object. - -# Related issues - -## Orphan adoption - -Controllers are responsible for adopting orphaned dependent resources. To do so, controllers -* Checks a potential dependent object’s OwnerReferences to determine if it is orphaned. -* Fills the OwnerReferences if the object matches the controller’s selector and is orphaned. - -There is a potential race between the "orphan" finalizer removing an owner reference and the controllers adding it back during adoption. Imagining this case: a user deletes an owning object and intends to orphan the dependent objects, so the GC removes the owner from the dependent object's OwnerReferences list, but the controller of the owner resource hasn't observed the deletion yet, so it adopts the dependent again and adds the reference back, resulting in the mistaken deletion of the dependent object. This race can be avoided by implementing Status.ObservedGeneration in all resources. Before updating the dependent Object's OwnerReferences, the "orphan" finalizer checks Status.ObservedGeneration of the owning object to ensure its controller has already observed the deletion. - -## Upgrading a cluster to support cascading deletion - -For the master, after upgrading to a version that supports cascading deletion, the OwnerReferences of existing objects remain empty, so the controllers will regard them as orphaned and start the adoption procedures. After the adoptions are done, server-side cascading will be effective for these existing objects. - -For nodes, cascading deletion does not affect them. - -For kubectl, we will keep the kubectl’s cascading deletion logic for one more release. - -# End-to-End Examples - -This section presents an example of all components working together to enforce the cascading deletion or orphaning. - -## Life of a Deployment and its descendants - -1. User creates a deployment `D1`. -2. The Propagator of the GC observes the creation. It creates an entry of `D1` in the DAG. -3. The deployment controller observes the creation of `D1`. It creates the replicaset `R1`, whose OwnerReferences field contains a reference to `D1`, and has the "orphan" finalizer in its ObjectMeta.Finalizers map. -4. The Propagator of the GC observes the creation of `R1`. It creates an entry of `R1` in the DAG, with `D1` as its owner. -5. 
The replicaset controller observes the creation of `R1` and creates Pods `P1`~`Pn`, all with `R1` in their OwnerReferences. -6. The Propagator of the GC observes the creation of `P1`~`Pn`. It creates entries for them in the DAG, with `R1` as their owner. - - ***In case the user wants to cascadingly delete `D1`'s descendants, then*** - -7. The user deletes the deployment `D1`, with `DeleteOptions.OrphanDependents=false`. API server checks if `D1` has "orphan" finalizer in its Finalizers map, if so, it updates `D1` to remove the "orphan" finalizer. Then API server deletes `D1`. -8. The "orphan" finalizer does *not* take any action, because the observed deletion shows `D1` has an empty Finalizers map. -9. The Propagator of the GC observes the deletion of `D1`. It deletes `D1` from the DAG. It adds its dependent object, replicaset `R1`, to the *dirty queue*. -10. The Garbage Processor of the GC dequeues `R1` from the *dirty queue*. It finds `R1` has an owner reference pointing to `D1`, and `D1` no longer exists, so it requests API server to delete `R1`, with `DeleteOptions.OrphanDependents=false`. (The Garbage Processor should always set this field to false.) -11. The API server updates `R1` to remove the "orphan" finalizer if it's in the `R1`'s Finalizers map. Then the API server deletes `R1`, as `R1` has an empty Finalizers map. -12. The Propagator of the GC observes the deletion of `R1`. It deletes `R1` from the DAG. It adds its dependent objects, Pods `P1`~`Pn`, to the *Dirty Queue*. -13. The Garbage Processor of the GC dequeues `Px` (1 <= x <= n) from the *Dirty Queue*. It finds that `Px` have an owner reference pointing to `D1`, and `D1` no longer exists, so it requests API server to delete `Px`, with `DeleteOptions.OrphanDependents=false`. -14. API server deletes the Pods. - - ***In case the user wants to orphan `D1`'s descendants, then*** - -7. The user deletes the deployment `D1`, with `DeleteOptions.OrphanDependents=true`. -8. The API server first updates `D1`, with DeletionTimestamp=now and DeletionGracePeriodSeconds=0, increments the Generation by 1, and add the "orphan" finalizer to ObjectMeta.Finalizers if it's not present yet. The API server does not delete `D1`, because its Finalizers map is not empty. -9. The deployment controller observes the update, and acknowledges by updating the `D1`'s ObservedGeneration. The deployment controller won't create more replicasets on `D1`'s behalf. -10. The "orphan" finalizer observes the update, and notes down the Generation. It waits until the ObservedGeneration becomes equal to or greater than the noted Generation. Then it updates `R1` to remove `D1` from its OwnerReferences. At last, it updates `D1`, removing itself from `D1`'s Finalizers map. -11. The API server handles the update of `D1`, because *i)* DeletionTimestamp is non-nil, *ii)* the DeletionGracePeriodSeconds is zero, and *iii)* the last finalizer is removed from the Finalizers map, API server deletes `D1`. -12. The Propagator of the GC observes the deletion of `D1`. It deletes `D1` from the DAG. It adds its dependent, replicaset `R1`, to the *Dirty Queue*. -13. The Garbage Processor of the GC dequeues `R1` from the *Dirty Queue* and skips it, because its OwnerReferences is empty. - -# Open Questions - -1. In case an object has multiple owners, some owners are deleted with DeleteOptions.OrphanDependents=true, and some are deleted with DeleteOptions.OrphanDependents=false, what should happen to the object? 
- - The presented design will respect the setting in the deletion request of last owner. - -2. How to propagate the grace period in a cascading deletion? For example, when deleting a ReplicaSet with grace period of 5s, a user may expect the same grace period to be applied to the deletion of the Pods controlled the ReplicaSet. - - Propagating grace period in a cascading deletion is a ***non-goal*** of this proposal. Nevertheless, the presented design can be extended to support it. A tentative solution is letting the garbage collector to propagate the grace period when deleting dependent object. To persist the grace period set by the user, the owning object should not be deleted from the registry until all its dependent objects are in the graceful deletion state. This could be ensured by introducing another finalizer, tentatively named as the "populating graceful deletion" finalizer. Upon receiving the graceful deletion request, the API server adds this finalizer to the finalizers list of the owning object. Later the GC will remove it when all dependents are in the graceful deletion state. - - [#25055](https://github.com/kubernetes/kubernetes/issues/25055) tracks this problem. - -3. How can a client know when the cascading deletion is completed? - - A tentative solution is introducing a "completing cascading deletion" finalizer, which will be added to the finalizers list of the owning object, and removed by the GC when all dependents are deleted. The user can watch for the deletion event of the owning object to ensure the cascading deletion process has completed. - - ---- -***THE REST IS FOR ARCHIVAL PURPOSES*** ---- - -# Considered and Rejected Designs - -# 1. Tombstone + GC - -## Reasons of rejection - -* It likely would conflict with our plan in the future to use all resources as their own tombstones, once the registry supports multi-object transaction. -* The TTL of the tombstone is hand-waving, there is no guarantee that the value of the TTL is long enough. -* This design is essentially the same as the selected design, with the tombstone as an extra element. The benefit the extra complexity buys is that a parent object can be deleted immediately even if the user wants to orphan the children. The benefit doesn't justify the complexity. - - -## API Changes - -``` -type DeleteOptions struct { - … - OrphanChildren bool -} -``` - -**DeleteOptions.OrphanChildren**: allows a user to express whether the child objects should be orphaned. - -``` -type ObjectMeta struct { - ... - ParentReferences []ObjectReference -} -``` - -**ObjectMeta.ParentReferences**: links the resource to the parent resources. For example, a replica set `R` created by a deployment `D` should have an entry in ObjectMeta.ParentReferences pointing to `D`. The link should be set when the child object is created. It can be updated after the creation. - -``` -type Tombstone struct { - unversioned.TypeMeta - ObjectMeta - UID types.UID -} -``` - -**Tombstone**: a tombstone is created when an object is deleted and the user requires the children to be orphaned. -**Tombstone.UID**: the UID of the original object. - -## New components - -The only new component is the Garbage Collector, which consists of a scanner, a garbage processor, and a propagator. -* Scanner: - * Uses the discovery API to detect all the resources supported by the system. - * For performance reasons, resources can be marked as not participating cascading deletion in the discovery info, then the GC will not monitor them. 
- * Periodically scans all resources in the system and adds each object to the *Dirty Queue*. - -* Garbage Processor: - * Consists of the *Dirty Queue* and workers. - * Each worker: - * Dequeues an item from *Dirty Queue*. - * If the item's ParentReferences is empty, continues to process the next item in the *Dirty Queue*. - * Otherwise checks each entry in the ParentReferences: - * If a parent exists, continues to check the next parent. - * If a parent doesn't exist, checks if a tombstone standing for the parent exists. - * If the step above shows no parent nor tombstone exists, requests the API server to delete the item. That is, only if ***all*** parents are non-existent, and none of them have tombstones, the child object will be garbage collected. - * Otherwise removes the item's ParentReferences to non-existent parents. - -* Propagator: - * The Propagator is for optimization, not for correctness. - * Maintains a DAG of parent-child relations. This DAG stores only name/uid/orphan triplets, not the entire body of every item. - * Consists of an *Event Queue* and a single worker. - * Watches for create/update/delete events for all resources that participating cascading deletion, enqueues the events to the *Event Queue*. - * Worker: - * Dequeues an item from the *Event Queue*. - * If the item is an creation or update, then updates the DAG accordingly. - * If the object has a parent and the parent doesn’t exist in the DAG yet, then apart from adding the object to the DAG, also enqueues the object to the *Dirty Queue*. - * If the item is a deletion, then removes the object from the DAG, and enqueues all its children to the *Dirty Queue*. - * The propagator shouldn't need to do any RPCs, so a single worker should be sufficient. This makes locking easier. - * With the Propagator, we *only* need to run the Scanner when starting the Propagator to populate the DAG and the *Dirty Queue*. - -## Changes to existing components - -* Storage: we should add a REST storage for Tombstones. The index should be UID rather than namespace/name. - -* API Server: when handling a deletion request, if DeleteOptions.OrphanChildren is true, then the API Server either creates a tombstone with TTL if the tombstone doesn't exist yet, or updates the TTL of the existing tombstone. The API Server deletes the object after the tombstone is created. - -* Controllers: when creating child objects, controllers need to fill up their ObjectMeta.ParentReferences field. Objects that don’t have a parent should have the namespace object as the parent. - -## Comparison with the selected design - -The main difference between the two designs is when to update the ParentReferences. In design #1, because a tombstone is created to indicate "orphaning" is desired, the updates to ParentReferences can be deferred until the deletion of the tombstone. In design #2, the updates need to be done before the parent object is deleted from the registry. - -* Advantages of "Tombstone + GC" design - * Faster to free the resource name compared to using finalizers. The original object can be deleted to free the resource name once the tombstone is created, rather than waiting for the finalizers to update all children’s ObjectMeta.ParentReferences. -* Advantages of "Finalizer Framework + GC" - * The finalizer framework is needed for other purposes as well. - - -# 2. 
Recovering from abnormal cascading deletion - -## Reasons of rejection - -* Not a goal -* Tons of work, not feasible in the near future - -In case the garbage collector is mistakenly deleting objects, we should provide mechanism to stop the garbage collector and restore the objects. - -* Stopping the garbage collector - - We will add a "--enable-garbage-collector" flag to the controller manager binary to indicate if the garbage collector should be enabled. Admin can stop the garbage collector in a running cluster by restarting the kube-controller-manager with --enable-garbage-collector=false. - -* Restoring mistakenly deleted objects - * Guidelines - * The restoration should be implemented as a roll-forward rather than a roll-back, because likely the state of the cluster (e.g., available resources on a node) has changed since the object was deleted. - * Need to archive the complete specs of the deleted objects. - * The content of the archive is sensitive, so the access to the archive subjects to the same authorization policy enforced on the original resource. - * States should be stored in etcd. All components should remain stateless. - - * A preliminary design - - This is a generic design for “undoing a deletion”, not specific to undoing cascading deletion. - * Add a `/archive` sub-resource to every resource, it's used to store the spec of the deleted objects. - * Before an object is deleted from the registry, the API server clears fields like DeletionTimestamp, then creates the object in /archive and sets a TTL. - * Add a `kubectl restore` command, which takes a resource/name pair as input, creates the object with the spec stored in the /archive, and deletes the archived object. - - - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/garbage-collection.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/garbage-collection.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/garbage-collection.md) diff --git a/docs/proposals/gpu-support.md b/docs/proposals/gpu-support.md index 604f64bdf7a..f7c24439e14 100644 --- a/docs/proposals/gpu-support.md +++ b/docs/proposals/gpu-support.md @@ -1,279 +1 @@ - - -- [GPU support](#gpu-support) - - [Objective](#objective) - - [Background](#background) - - [Detailed discussion](#detailed-discussion) - - [Inventory](#inventory) - - [Scheduling](#scheduling) - - [The runtime](#the-runtime) - - [NVIDIA support](#nvidia-support) - - [Event flow](#event-flow) - - [Too complex for now: nvidia-docker](#too-complex-for-now-nvidia-docker) - - [Implementation plan](#implementation-plan) - - [V0](#v0) - - [Scheduling](#scheduling-1) - - [Runtime](#runtime) - - [Other](#other) - - [Future work](#future-work) - - [V1](#v1) - - [V2](#v2) - - [V3](#v3) - - [Undetermined](#undetermined) - - [Security considerations](#security-considerations) - - - -# GPU support - -Author: @therc - -Date: Apr 2016 - -Status: Design in progress, early implementation of requirements - -## Objective - -Users should be able to request GPU resources for their workloads, as easily as -for CPU or memory. Kubernetes should keep an inventory of machines with GPU -hardware, schedule containers on appropriate nodes and set up the container -environment with all that's necessary to access the GPU. All of this should -eventually be supported for clusters on either bare metal or cloud providers. 
- -## Background - -An increasing number of workloads, such as machine learning and seismic survey -processing, benefits from offloading computations to graphic hardware. While not -as tuned as traditional, dedicated high performance computing systems such as -MPI, a Kubernetes cluster can still be a great environment for organizations -that need a variety of additional, "classic" workloads, such as database, web -serving, etc. - -GPU support is hard to provide extensively and will thus take time to tame -completely, because - -- different vendors expose the hardware to users in different ways -- some vendors require fairly tight coupling between the kernel driver -controlling the GPU and the libraries/applications that access the hardware -- it adds more resource types (whole GPUs, GPU cores, GPU memory) -- it can introduce new security pitfalls -- for systems with multiple GPUs, affinity matters, similarly to NUMA -considerations for CPUs -- running GPU code in containers is still a relatively novel idea - -## Detailed discussion - -Currently, this document is mostly focused on the basic use case: run GPU code -on AWS `g2.2xlarge` EC2 machine instances using Docker. It constitutes a narrow -enough scenario that it does not require large amounts of generic code yet. GCE -doesn't support GPUs at all; bare metal systems throw a lot of extra variables -into the mix. - -Later sections will outline future work to support a broader set of hardware, -environments and container runtimes. - -### Inventory - -Before any scheduling can occur, we need to know what's available out there. In -v0, we'll hardcode capacity detected by the kubelet based on a flag, -`--experimental-nvidia-gpu`. This will result in the user-defined resource -`alpha.kubernetes.io/nvidia-gpu` to be reported for `NodeCapacity` and -`NodeAllocatable`, as well as as a node label. - -### Scheduling - -GPUs will be visible as first-class resources. In v0, we'll only assign whole -devices; sharing among multiple pods is left to future implementations. It's -probable that GPUs will exacerbate the need for [a rescheduler](rescheduler.md) -or pod priorities, especially if the nodes in a cluster are not homogeneous. -Consider these two cases: - -> Only half of the machines have a GPU and they're all busy with other -workloads. The other half of the cluster is doing very little work. A GPU -workload comes, but it can't schedule, because the devices are sitting idle on -nodes that are running something else and the nodes with little load lack the -hardware. - -> Some or all the machines have two graphic cards each. A number of jobs get -scheduled, requesting one device per pod. The scheduler puts them all on -different machines, spreading the load, perhaps by design. Then a new job comes -in, requiring two devices per pod, but it can't schedule anywhere, because all -we can find, at most, is one unused device per node. - -### The runtime - -Once we know where to run the container, it's time to set up its environment. At -a minimum, we'll need to map the host device(s) into the container. Because each -manufacturer exposes different device nodes (`/dev/ati/card0`, `/dev/nvidia0`, -but also the required `/dev/nvidiactl` and `/dev/nvidia-uvm`), some of the logic -needs to be hardware-specific, mapping from a logical device to a list of device -nodes necessary for software to talk to it. 
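-
-To make the hardware-specific part of this mapping concrete, here is a minimal
-illustrative sketch (not part of the proposal; the function name and the vendor
-strings are hypothetical, and only the NVIDIA case mirrors the device nodes
-mentioned above):
-
-```go
-package gpu
-
-import "fmt"
-
-// deviceNodes returns the /dev entries that must be exposed to a container in
-// order to drive logical device i for a given vendor.
-func deviceNodes(vendor string, i int) ([]string, error) {
-    switch vendor {
-    case "nvidia":
-        // nvidiactl and nvidia-uvm are shared control nodes; the card node is per device.
-        return []string{"/dev/nvidiactl", "/dev/nvidia-uvm", fmt.Sprintf("/dev/nvidia%d", i)}, nil
-    case "ati":
-        return []string{fmt.Sprintf("/dev/ati/card%d", i)}, nil
-    default:
-        return nil, fmt.Errorf("unknown GPU vendor %q", vendor)
-    }
-}
-```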
- -Support binaries and libraries are often versioned along with the kernel module, -so there should be further hooks to project those under `/bin` and some kind of -`/lib` before the application is started. This can be done for Docker with the -use of a versioned [Docker -volume](https://docs.docker.com/engine/tutorials/dockervolumes/) or -with upcoming Kubernetes-specific hooks such as init containers and volume -containers. In v0, images are expected to bundle everything they need. - -#### NVIDIA support - -The first implementation and testing ground will be for NVIDIA devices, by far -the most common setup. - -In v0, the `--experimental-nvidia-gpu` flag will also result in the host devices -(limited to those required to drive the first card, `nvidia0`) to be mapped into -the container by the dockertools library. - -### Event flow - -This is what happens before and after an user schedules a GPU pod. - -1. Administrator installs a number of Kubernetes nodes with GPUs. The correct -kernel modules and device nodes under `/dev/` are present. - -1. Administrator makes sure the latest CUDA/driver versions are installed. - -1. Administrator enables `--experimental-nvidia-gpu` on kubelets - -1. Kubelets update node status with information about the GPU device, in addition -to cAdvisor's usual data about CPU/memory/disk - -1. User creates a Docker image compiling their application for CUDA, bundling -the necessary libraries. We ignore any versioning requirements in the image -using labels based on [NVIDIA's -conventions](https://github.com/NVIDIA/nvidia-docker/blob/64510511e3fd0d00168eb076623854b0fcf1507d/tools/src/nvidia-docker/utils.go#L13). - -1. User creates a pod using the image, requiring -`alpha.kubernetes.io/nvidia-gpu: 1` - -1. Scheduler picks a node for the pod - -1. The kubelet notices the GPU requirement and maps the three devices. In -Docker's engine-api, this means it'll add them to the Resources.Devices list. - -1. Docker runs the container to completion - -1. The scheduler notices that the device is available again - -### Too complex for now: nvidia-docker - -For v0, we discussed at length, but decided to leave aside initially the -[nvidia-docker plugin](https://github.com/NVIDIA/nvidia-docker). The plugin is -an officially supported solution, thus avoiding a lot of new low level code, as -it takes care of functionality such as: - -- creating a Docker volume with binaries such as `nvidia-smi` and shared -libraries -- providing HTTP endpoints that monitoring tools can use to collect GPU metrics -- abstracting details such as `/dev` entry names for each device, as well as -control ones like `nvidiactl` - -The `nvidia-docker` wrapper also verifies that the CUDA version required by a -given image is supported by the host drivers, through inspection of well-known -image labels, if present. We should try to provide equivalent checks, either -for CUDA or OpenCL. - -This is current sample output from `nvidia-docker-plugin`, wrapped for -readability: - - $ curl -s localhost:3476/docker/cli - --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 - --volume-driver=nvidia-docker - --volume=nvidia_driver_352.68:/usr/local/nvidia:ro - -It runs as a daemon listening for HTTP requests on port 3476. The endpoint above -returns flags that need to be added to the Docker command line in order to -expose GPUs to the containers. 
There are optional URL arguments to request specific devices if more than one is
-present on the system, as well as specific versions of the support software. An
-obvious improvement is an additional endpoint for JSON output.
-
-The unresolved question is whether `nvidia-docker-plugin` would run standalone
-as it does today (called over HTTP, perhaps with endpoints for a new Kubernetes
-resource API) or whether the relevant code from its `nvidia` package should be
-linked directly into the kubelet. A partial list of tradeoffs:
-
-|                     | External binary | Linked in |
-|---------------------|-----------------|-----------|
-| Use of cgo | Confined to binary | Linked into kubelet, but with lazy binding |
-| Expandability | Limited if we run the plugin, increased if the library is used to build a Kubernetes-tailored daemon. | Can reuse the `nvidia` library as we prefer |
-| Bloat | None | Larger kubelet, even for systems without GPUs |
-| Reliability | Need to handle the binary disappearing at any time | Fewer headaches |
-| (Un)Marshalling | Need to talk over JSON | None |
-| Administration cost | One more daemon to install, configure and monitor | No extra work required, other than perhaps configuring flags |
-| Releases | Potentially on its own schedule | Tied to Kubernetes' |
-
-## Implementation plan
-
-### V0
-
-The first two tracks can progress in parallel.
-
-#### Scheduling
-
-1. Define the new resource `alpha.kubernetes.io/nvidia-gpu` in `pkg/api/types.go`
-and co.
-1. Plug the resource into the feasibility checks used by the kubelet, scheduler and
-schedulercache. Maybe gated behind a flag?
-1. Plug the resource into resource_helpers.go
-1. Plug the resource into the limitranger
-
-#### Runtime
-
-1. Add a kubelet config parameter to enable the resource
-1. Make kubelet's `setNodeStatusMachineInfo` report the resource
-1. Add a Devices list to container.RunContainerOptions
-1. Use it from DockerManager's runContainer
-1. Do the same for rkt (stretch goal)
-1. When a pod requests a GPU, add the devices to the container options
-
-#### Other
-
-1. Add the new resource to `kubectl describe` output. Optional for non-GPU users?
-1. Administrator documentation, with sample scripts
-1. User documentation
-
-## Future work
-
-Above all, we need to collect feedback from real users and use that to set
-priorities for any of the items below.
-
-### V1
-
-- Perform real detection of the installed hardware
-- Figure out a standard way to avoid bundling of shared libraries in images
-- Support fractional resources so multiple pods can share the same GPU
-- Support bare metal setups
-- Report resource usage
-
-### V2
-
-- Support multiple GPUs with resource hierarchies and affinities
-- Support versioning of resources (e.g. "CUDA v7.5+")
-- Build resource plugins into the kubelet?
-- Support other device vendors
-- Support Azure?
-- Support rkt?
-
-### V3
-
-- Support OpenCL (so images can be device-agnostic)
-
-### Undetermined
-
-It makes sense to turn the output of this project (external resource plugins,
-etc.) into a more generic abstraction at some point.
-
-
-## Security considerations
-
-There should be knobs for the cluster administrator to only allow certain users
-or roles to schedule GPU workloads. Overcommitting or sharing the same device
-across different pods is not considered safe. It should be possible to segregate
-such GPU-sharing pods by user, namespace or a combination thereof.
- - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/gpu-support.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/gpu-support.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/gpu-support.md) diff --git a/docs/proposals/high-availability.md b/docs/proposals/high-availability.md index da2f4fc9b04..908f9a05a00 100644 --- a/docs/proposals/high-availability.md +++ b/docs/proposals/high-availability.md @@ -1,8 +1 @@ -# High Availability of Scheduling and Controller Components in Kubernetes - -This document is deprecated. For more details about running a highly available -cluster master, please see the [admin instructions document](../../docs/admin/high-availability.md). - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/high-availability.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/high-availability.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/high-availability.md) diff --git a/docs/proposals/image-provenance.md b/docs/proposals/image-provenance.md index 7a5580d9ff9..a3e1a5147a5 100644 --- a/docs/proposals/image-provenance.md +++ b/docs/proposals/image-provenance.md @@ -1,331 +1 @@ - -# Overview - -Organizations wish to avoid running "unapproved" images. - -The exact nature of "approval" is beyond the scope of Kubernetes, but may include reasons like: - - - only run images that are scanned to confirm they do not contain vulnerabilities - - only run images that use a "required" base image - - only run images that contain binaries which were built from peer reviewed, checked-in source - by a trusted compiler toolchain. - - only allow images signed by certain public keys. - - - etc... - -Goals of the design include: -* Block creation of pods that would cause "unapproved" images to run. -* Make it easy for users or partners to build "image provenance checkers" which check whether images are "approved". - * We expect there will be multiple implementations. -* Allow users to request an "override" of the policy in a convenient way (subject to the override being allowed). - * "overrides" are needed to allow "emergency changes", but need to not happen accidentally, since they may - require tedious after-the-fact justification and affect audit controls. - -Non-goals include: -* Encoding image policy into Kubernetes code. -* Implementing objects in core kubernetes which describe complete policies for what images are approved. - * A third-party implementation of an image policy checker could optionally use ThirdPartyResource to store its policy. -* Kubernetes core code dealing with concepts of image layers, build processes, source repositories, etc. - * We expect there will be multiple PaaSes and/or de-facto programming environments, each with different takes on - these concepts. At any rate, Kubernetes is not ready to be opinionated on these concepts. -* Sending more information than strictly needed to a third-party service. - * Information sent by Kubernetes to a third-party service constitutes an API of Kubernetes, and we want to - avoid making these broader than necessary, as it restricts future evolution of Kubernetes, and makes - Kubernetes harder to reason about. Also, excessive information limits cache-ability of decisions. 
Caching - reduces latency and allows short outages of the backend to be tolerated. - - -Detailed discussion in [Ensuring only images are from approved sources are run]( -https://github.com/kubernetes/kubernetes/issues/22888). - -# Implementation - -A new admission controller will be added. That will be the only change. - -## Admission controller - -An `ImagePolicyWebhook` admission controller will be written. The admission controller examines all pod objects which are -created or updated. It can either admit the pod, or reject it. If it is rejected, the request sees a `403 FORBIDDEN` - -The admission controller code will go in `plugin/pkg/admission/imagepolicy`. - -There will be a cache of decisions in the admission controller. - -If the apiserver cannot reach the webhook backend, it will log a warning and either admit or deny the pod. -A flag will control whether it admits or denies on failure. -The rationale for deny is that an attacker could DoS the backend or wait for it to be down, and then sneak a -bad pod into the system. The rationale for allow here is that, if the cluster admin also does -after-the-fact auditing of what images were run (which we think will be common), this will catch -any bad images run during periods of backend failure. With default-allow, the availability of Kubernetes does -not depend on the availability of the backend. - -# Webhook Backend - -The admission controller code in that directory does not contain logic to make an admit/reject decision. Instead, it extracts -relevant fields from the Pod creation/update request and sends those fields to a Backend (which we have been loosely calling "WebHooks" -in Kubernetes). The request the admission controller sends to the backend is called a WebHook request to distinguish it from the -request being admission-controlled. The server that accepts the WebHook request from Kubernetes is called the "Backend" -to distinguish it from the WebHook request itself, and from the API server. - -The whole system will work similarly to the [Authentication WebHook]( -https://github.com/kubernetes/kubernetes/pull/24902 -) or the [AuthorizationWebHook]( -https://github.com/kubernetes/kubernetes/pull/20347). - -The WebHook request can optionally authenticate itself to its backend using a token from a `kubeconfig` file. - -The WebHook request and response are JSON, and correspond to the following `go` structures: - -```go -// Filename: pkg/apis/imagepolicy.k8s.io/register.go -package imagepolicy - -// ImageReview checks if the set of images in a pod are allowed. -type ImageReview struct { - unversioned.TypeMeta - - // Spec holds information about the pod being evaluated - Spec ImageReviewSpec - - // Status is filled in by the backend and indicates whether the pod should be allowed. - Status ImageReviewStatus - } - -// ImageReviewSpec is a description of the pod creation request. -type ImageReviewSpec struct { - // Containers is a list of a subset of the information in each container of the Pod being created. - Containers []ImageReviewContainerSpec - // Annotations is a list of key-value pairs extracted from the Pod's annotations. - // It only includes keys which match the pattern `*.image-policy.k8s.io/*`. - // It is up to each webhook backend to determine how to interpret these annotations, if at all. - Annotations map[string]string - // Namespace is the namespace the pod is being created in. - Namespace string -} - -// ImageReviewContainerSpec is a description of a container within the pod creation request. 
-type ImageReviewContainerSpec struct {
-    Image string
-    // In future, we may add command line overrides, exec health check command lines, and so on.
-}
-
-// ImageReviewStatus is the result of the image review request.
-type ImageReviewStatus struct {
-    // Allowed indicates that all images were allowed to be run.
-    Allowed bool
-    // Reason should be empty unless Allowed is false, in which case it
-    // may contain a short description of what is wrong. Kubernetes
-    // may truncate excessively long errors when displaying to the user.
-    Reason string
-}
-```
-
-## Extending with Annotations
-
-All annotations on a Pod that match `*.image-policy.k8s.io/*` are sent to the webhook.
-Sending annotations allows users who are aware of the image policy backend to send
-extra information to it, and allows different backend implementations to accept
-different information.
-
-Examples of information you might put here are:
-
-- a request to "break glass" to override a policy, in case of emergency.
-- a ticket number from a ticket system that documents the break-glass request.
-- a hint to the policy server as to the imageID of the image being provided, to save it a lookup.
-
-In any case, the annotations are provided by the user and are not validated by Kubernetes in any way. In the future, if an annotation is determined to be widely
-useful, we may promote it to a named field of ImageReviewSpec.
-
-In the case of a Pod update, Kubernetes may send the backend either all images in the updated pod, or only the ones that
-changed, at its discretion.
-
-## Interaction with Controllers
-
-In the case of a Deployment object, no image check is done when the Deployment object is created or updated.
-Likewise, no check happens when the Deployment controller creates a ReplicaSet. The check only happens
-when the ReplicaSet controller creates a Pod. Checking Pods is necessary since users can create pods directly,
-and since third parties can write their own controllers, which Kubernetes might not be aware of and which might not
-even contain pod templates.
-
-The ReplicaSet, or other controller, is responsible for recognizing when a 403 has happened
-(whether due to the user not having permission because of a bad image, or some other permission reason),
-throttling itself, and surfacing the error in a way that CLIs and UIs can show to the user.
-
-Issue [22298](https://github.com/kubernetes/kubernetes/issues/22298) needs to be resolved to
-propagate Pod creation errors up through a stack of controllers.
-
-## Changes in policy over time
-
-The Backend might change the policy over time. For example, yesterday `redis:v1` was allowed, but today `redis:v1` is not allowed
-due to a CVE that just came out (fictional scenario). In this scenario:
-
-- a newly created replicaSet will be unable to create Pods.
-- updating a deployment will be safe in the sense that it will detect that the new ReplicaSet is not scaling
-  up and not scale down the old one.
-- an existing replicaSet will be unable to create Pods that replace ones which are terminated. If this is due to
-  slow loss of nodes, then there should be time to react before significant loss of capacity.
-- for non-replicated things (size 1 ReplicaSet, StatefulSet), a single node failure may disable them.
-- a node rolling update will eventually check for liveness of replacements, and would be throttled in the case where
-  the image was no longer allowed and so replacements could not be started.
-- rapid node restarts will cause existing pod objects to be restarted by kubelet.
-- slow node restarts or network partitions will cause node controller to delete pods and there will be no replacement - -It is up to the Backend implementor, and the cluster administrator who decides to use that backend, to decide -whether the Backend should be allowed to change its mind. There is a tradeoff between responsiveness -to changes in policy, versus keeping existing services running. The two models that make sense are: - -- never change a policy, unless some external process has ensured no active objects depend on the to-be-forbidden - images. -- change a policy and assume that transition to new image happens faster than the existing pods decay. - -## Ubernetes - -If two clusters share an image policy backend, then they will have the same policies. - -The clusters can pass different tokens to the backend, and the backend can use this to distinguish -between different clusters. - -## Image tags and IDs - -Image tags are like: `myrepo/myimage:v1`. - -Image IDs are like: `myrepo/myimage@sha256:beb6bd6a68f114c1dc2ea4b28db81bdf91de202a9014972bec5e4d9171d90ed`. -You can see image IDs with `docker images --no-trunc`. - -The Backend needs to be able to resolve tags to IDs (by talking to the images repo). -If the Backend resolves tags to IDs, there is some risk that the tag-to-ID mapping will be -modified after approval by the Backend, but before Kubelet pulls the image. We will not address this -race condition at this time. - -We will wait and see how much demand there is for closing this hole. If the community demands a solution, -we may suggest one of these: - -1. Use a backend that refuses to accept images that are specified with tags, and require users to resolve to IDs - prior to creating a pod template. - - [kubectl could be modified to automate this process](https://github.com/kubernetes/kubernetes/issues/1697) - - a CI/CD system or templating system could be used that maps IDs to tags before Deployment modification/creation. -1. Audit logs from kubelets to see image IDs were actually run, to see if any unapproved images slipped through. -1. Monitor tag changes in image repository for suspicious activity, or restrict remapping of tags after initial application. - -If none of these works well, we could do the following: - -- Image Policy Admission Controller adds new field to Pod, e.g. `pod.spec.container[i].imageID` (or an annotation). - and kubelet will enforce that both the imageID and image match the image pulled. - -Since this adds complexity and interacts with imagePullPolicy, we avoid adding the above feature initially. - -### Caching - -There will be a cache of decisions in the admission controller. -TTL will be user-controllable, but default to 1 hour for allows and 30s for denies. -Low TTL for deny allows user to correct a setting on the backend and see the fix -rapidly. It is assumed that denies are infrequent. -Caching allows permits RC to scale up services even during short unavailability of the webhook backend. -The ImageReviewSpec is used as the key to the cache. - -In the case of a cache miss and timeout talking to the backend, the default is to allow Pod creation. -Keeping services running is more important than a hypothetical threat from an un-verified image. - - -### Post-pod-creation audit - -There are several cases where an image not currently allowed might still run. Users wanting a -complete audit solution are advised to also do after-the-fact auditing of what images -ran. 
This can catch:
-
-- images allowed due to the backend not being reachable
-- images that kept running after a policy change (e.g. a CVE was discovered)
-- images started via the local-file or http options of the kubelet
-- checking the SHA of images allowed by a tag which was remapped
-
-This proposal does not include post-pod-creation audit.
-
-## Alternatives considered
-
-### Admission Control on Controller Objects
-
-We could have done admission control on Deployments, Jobs, ReplicationControllers, and anything else that creates a Pod, directly or indirectly.
-This approach is good because it provides immediate feedback to the user that the image is not allowed. However, we do not expect disallowed images
-to be used often, and controllers need to be able to surface problems creating pods for a variety of other reasons anyway.
-
-Other good things about this alternative are:
-
-- Fewer calls to the Backend, once per controller rather than once per pod creation. Caching in the backend should be able to help with this, though.
-- The end user that created the object is seen, rather than the user of the controller process. This can be fixed by implementing `Impersonate-User` for controllers.
-
-Other problems are:
-
-- It works only with "core" controllers. We would need to update the admission controller if we add more "core" controllers. It won't work with "third party controllers", e.g. the way we run open-source distributed systems like Hadoop, Spark, ZooKeeper, etc. on Kubernetes, because those controllers don't have config that can be "admission controlled", or if they do, the schema is not known to the admission controller, which would have to "search" for pod templates in JSON. Yuck.
-- How would it work if a user created a pod directly, which is allowed and is the recommended way to run something at most once?
-
-### Sending User to Backend
-
-We could have sent the username of the pod creator to the backend. The username could be used to allow different users to run
-different categories of images. This would require propagating the username from e.g. Deployment creation through to
-Pod creation via, e.g., the `Impersonate-User:` header. This feature is [not ready](https://github.com/kubernetes/kubernetes/issues/27152).
-When it is, we will re-evaluate adding user as a field of `ImagePolicyRequest`.
-
-### Enforcement at Docker level
-
-Docker supports plugins which can check any container creation before it happens. For example, the [twistlock/authz](https://github.com/twistlock/authz)
-Docker plugin can audit the full request sent to the Docker daemon and approve or deny it. This could include checking if the image is allowed.
-
-We reject this option because:
-- it requires all nodes to be configured with how to reach the Backend, which complicates node setup.
-- it may not work with other runtimes.
-- propagating error messages back to the user is more difficult.
-- it requires plumbing additional information about requests to nodes (if we later want to consider `User` in policy).
-
-### Policy Stored in API
-
-We decided to store policy about what SecurityContexts a pod can have in the API, via PodSecurityPolicy.
-This is because Pods are a Kubernetes object, and the Policy is very closely tied to the definition of Pods,
-and grows in step as the Pods API grows.
-
-For Image policy, the connection is not as strong. To the Kubernetes API, an Image is just a string, and the API
-does not know any of the image metadata, which lives outside the API.
-
-Image policy may depend on the Dockerfile, the source code, the source repo, the source review tools,
-vulnerability databases, and so on. Kubernetes does not have these as built-in concepts or plans to add
-them anytime soon.
-
-### Registry whitelist/blacklist
-
-We considered a whitelist/blacklist of registries and/or repositories, basically a prefix match on image strings.
-The problem of approving images would then be pushed to a problem of controlling who has access to push to a
-trusted registry/repository. That approach is simple for Kubernetes. Problems with it are:
-
-- tricky to allow users to share a repository but have different image policies per user or per namespace.
-- tricky to do things after image push, such as scanning images for vulnerabilities (such as Docker Nautilus), and have those results considered by policy.
-- tricky to block "older" versions from running, whose interaction with the current system may not be well understood.
-- how to allow an emergency override?
-- hard to change policy decisions over time.
-
-We still want to use rkt trust, Docker content trust, etc. for any registries used. We just need additional
-image policy checks beyond what trust can provide.
-
-### Send every Request to a Generic Admission Control Backend
-
-Instead of just sending a subset of the PodSpec to an Image Provenance Backend, we could have sent every object
-that is created or updated (or deleted?) to one or more Generic Admission Control Backends.
-
-This might be a good idea, but it needs quite a bit more thought, and it will not be a generic webhook. A generic
-webhook would need a lot more discussion:
-
-- a generic webhook needs to touch all objects, not just pods, so it won't have a fixed schema. How do we express this in our IDL? It is harder to write clients
-  that interpret unstructured data rather than a fixed schema, and harder to version and to detect errors.
-- a generic webhook client needs to ignore kinds it does not care about, or the apiserver needs to know which backends care about which kinds. How
-  do we specify which backends see which requests? Sending all requests, including high-rate requests like events and pod-status updates, might be
-  too high a rate for some backends.
-
-Additionally, just sending all the fields of just the Pod kind also has problems:
-- it exposes our whole API to a webhook backend without giving us (the project) any chance to review or understand how it is being used.
-- because we do not know which fields of an object are inspected by the backend, caching of decisions is not effective. Sending fewer fields allows caching.
-- sending fewer fields makes it possible to rev the version of the webhook request more slowly than the version of our internal objects (e.g. pod v2 could still use imageReview v1).
-There are probably lots more reasons.
-
-
-
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/image-provenance.md?pixel)]()
-
+This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/image-provenance.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/image-provenance.md)
diff --git a/docs/proposals/initial-resources.md b/docs/proposals/initial-resources.md
index f383f14a830..d16e6895681 100644
--- a/docs/proposals/initial-resources.md
+++ b/docs/proposals/initial-resources.md
@@ -1,75 +1 @@
-## Abstract
-
-Initial Resources is a data-driven feature that, based on historical data, tries to estimate the resource usage of a container without Resources specified
-and to set them before the container is run. This document describes the design of the component.
-
-## Motivation
-
-Since we want to make Kubernetes as simple as possible for its users, we don't want to require setting [Resources](../design/resource-qos.md) for a container by its owner.
-On the other hand, having Resources filled in is critical for scheduling decisions.
-The current solution of setting Resources to a hardcoded value has obvious drawbacks.
-We need to implement a component which will set initial Resources to a reasonable value.
-
-## Design
-
-The InitialResources component will be implemented as an [admission plugin](../../plugin/pkg/admission/) and invoked right before
-[LimitRanger](https://github.com/kubernetes/kubernetes/blob/7c9bbef96ed7f2a192a1318aa312919b861aee00/cluster/gce/config-default.sh#L91).
-For every container without Resources specified it will try to predict the amount of resources that should be sufficient for it.
-So that a pod without specified resources will be treated as
-.
-
-In the first version InitialResources will set only the [request](../design/resource-qos.md#requests-and-limits) field (independently for each resource type: cpu, memory) to avoid killing containers due to OOM (however the container may still be killed if it exceeds the requested resources).
-To make the component work with LimitRanger, the estimated value will be capped by the min and max possible values if they are defined.
-This prevents the situation where the pod is rejected due to a too low or too high estimate.
-
-The container won't be marked as managed by this component in any way; however, an appropriate event will be exported.
-The predicting algorithm should have very low latency so that it does not significantly increase e2e pod startup latency
-[#3954](https://github.com/kubernetes/kubernetes/pull/3954).
-
-### Predicting algorithm details
-
-In the first version the estimation will be made based on historical data for the Docker image being run in the container (both the name and the tag matter).
-CPU/memory usage of each container is exported periodically (by default with 1 minute resolution) to the backend (see more in [Monitoring pipeline](#monitoring-pipeline)).
-
-InitialResources will set the Request for both cpu and memory as the 90th percentile of the first set of samples, checked in the following order, that is available:
-
-* 7 days, same image:tag, assuming there are at least 60 samples (1 hour)
-* 30 days, same image:tag, assuming there are at least 60 samples (1 hour)
-* 30 days, same image, assuming there is at least 1 sample
-
-If there is still no data, the default value will be set by LimitRanger. The same parameters will be configurable with appropriate flags.
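-
-For illustration only, a minimal sketch of the fallback and percentile selection described above (hypothetical names, not the actual implementation):
-
-```go
-package initialresources
-
-import "sort"
-
-// estimate returns the 90th percentile of the first sample set that satisfies
-// the order above, or false if no set qualifies (LimitRanger's default applies).
-func estimate(tag7d, tag30d, image30d []float64) (float64, bool) {
-    switch {
-    case len(tag7d) >= 60:
-        return percentile90(tag7d), true
-    case len(tag30d) >= 60:
-        return percentile90(tag30d), true
-    case len(image30d) >= 1:
-        return percentile90(image30d), true
-    }
-    return 0, false
-}
-
-// percentile90 is a simple nearest-rank 90th percentile over a copy of the samples.
-func percentile90(samples []float64) float64 {
-    s := append([]float64(nil), samples...)
-    sort.Float64s(s)
-    return s[int(0.9*float64(len(s)-1))]
-}
-```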
- -#### Example - -If we have at least 60 samples from image:tag over the past 7 days, we will use the 90th percentile of all of the samples of image:tag over the past 7 days. -Otherwise, if we have at least 60 samples from image:tag over the past 30 days, we will use the 90th percentile of all of the samples over of image:tag the past 30 days. -Otherwise, if we have at least 1 sample from image over the past 30 days, we will use that the 90th percentile of all of the samples of image over the past 30 days. -Otherwise we will use default value. - -### Monitoring pipeline - -In the first version there will be available 2 options for backend for predicting algorithm: - -* [InfluxDB](../../docs/user-guide/monitoring.md#influxdb-and-grafana) - aggregation will be made in SQL query -* [GCM](../../docs/user-guide/monitoring.md#google-cloud-monitoring) - since GCM is not as powerful as InfluxDB some aggregation will be made on the client side - -Both will be hidden under an abstraction layer, so it would be easy to add another option. -The code will be a part of Initial Resources component to not block development, however in the future it should be a part of Heapster. - - -## Next steps - -The first version will be quite simple so there is a lot of possible improvements. Some of them seem to have high priority -and should be introduced shortly after the first version is done: - -* observe OOM and then react to it by increasing estimation -* add possibility to specify if estimation should be made, possibly as ```InitialResourcesPolicy``` with options: *always*, *if-not-set*, *never* -* add other features to the model like *namespace* -* remember predefined values for the most popular images like *mysql*, *nginx*, *redis*, etc. -* dry mode, which allows to ask system for resource recommendation for a container without running it -* add estimation as annotations for those containers that already has resources set -* support for other data sources like [Hawkular](http://www.hawkular.org/) - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/initial-resources.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/initial-resources.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/initial-resources.md) diff --git a/docs/proposals/job.md b/docs/proposals/job.md index 160b38ddb4f..78408989b48 100644 --- a/docs/proposals/job.md +++ b/docs/proposals/job.md @@ -1,159 +1 @@ -# Job Controller - -## Abstract - -A proposal for implementing a new controller - Job controller - which will be responsible -for managing pod(s) that require running once to completion even if the machine -the pod is running on fails, in contrast to what ReplicationController currently offers. - -Several existing issues and PRs were already created regarding that particular subject: -* Job Controller [#1624](https://github.com/kubernetes/kubernetes/issues/1624) -* New Job resource [#7380](https://github.com/kubernetes/kubernetes/pull/7380) - - -## Use Cases - -1. Be able to start one or several pods tracked as a single entity. -1. Be able to run batch-oriented workloads on Kubernetes. -1. Be able to get the job status. -1. Be able to specify the number of instances performing a job at any one time. -1. Be able to specify the number of successfully finished instances required to finish a job. 
- - -## Motivation - -Jobs are needed for executing multi-pod computation to completion; a good example -here would be the ability to implement any type of batch oriented tasks. - - -## Implementation - -Job controller is similar to replication controller in that they manage pods. -This implies they will follow the same controller framework that replication -controllers already defined. The biggest difference between a `Job` and a -`ReplicationController` object is the purpose; `ReplicationController` -ensures that a specified number of Pods are running at any one time, whereas -`Job` is responsible for keeping the desired number of Pods to a completion of -a task. This difference will be represented by the `RestartPolicy` which is -required to always take value of `RestartPolicyNever` or `RestartOnFailure`. - - -The new `Job` object will have the following content: - -```go -// Job represents the configuration of a single job. -type Job struct { - TypeMeta - ObjectMeta - - // Spec is a structure defining the expected behavior of a job. - Spec JobSpec - - // Status is a structure describing current status of a job. - Status JobStatus -} - -// JobList is a collection of jobs. -type JobList struct { - TypeMeta - ListMeta - - Items []Job -} -``` - -`JobSpec` structure is defined to contain all the information how the actual job execution -will look like. - -```go -// JobSpec describes how the job execution will look like. -type JobSpec struct { - - // Parallelism specifies the maximum desired number of pods the job should - // run at any given time. The actual number of pods running in steady state will - // be less than this number when ((.spec.completions - .status.successful) < .spec.parallelism), - // i.e. when the work left to do is less than max parallelism. - Parallelism *int - - // Completions specifies the desired number of successfully finished pods the - // job should be run with. Defaults to 1. - Completions *int - - // Selector is a label query over pods running a job. - Selector map[string]string - - // Template is the object that describes the pod that will be created when - // executing a job. - Template *PodTemplateSpec -} -``` - -`JobStatus` structure is defined to contain information about pods executing -specified job. The structure holds information about pods currently executing -the job. - -```go -// JobStatus represents the current state of a Job. -type JobStatus struct { - Conditions []JobCondition - - // CreationTime represents time when the job was created - CreationTime unversioned.Time - - // StartTime represents time when the job was started - StartTime unversioned.Time - - // CompletionTime represents time when the job was completed - CompletionTime unversioned.Time - - // Active is the number of actively running pods. - Active int - - // Successful is the number of pods successfully completed their job. - Successful int - - // Unsuccessful is the number of pods failures, this applies only to jobs - // created with RestartPolicyNever, otherwise this value will always be 0. - Unsuccessful int -} - -type JobConditionType string - -// These are valid conditions of a job. -const ( - // JobComplete means the job has completed its execution. - JobComplete JobConditionType = "Complete" -) - -// JobCondition describes current state of a job. 
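// A condition pairs a Type (e.g. JobComplete) with a Status of True, False,
// or Unknown; Reason carries a brief machine-readable cause for the last
// transition and Message a human-readable explanation.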
-type JobCondition struct { - Type JobConditionType - Status ConditionStatus - LastHeartbeatTime unversioned.Time - LastTransitionTime unversioned.Time - Reason string - Message string -} -``` - -## Events - -Job controller will be emitting the following events: -* JobStart -* JobFinish - -## Future evolution - -Below are the possible future extensions to the Job controller: -* Be able to limit the execution time for a job, similarly to ActiveDeadlineSeconds for Pods. *now implemented* -* Be able to create a chain of jobs dependent one on another. *will be implemented in a separate type called Workflow* -* Be able to specify the work each of the workers should execute (see type 1 from - [this comment](https://github.com/kubernetes/kubernetes/issues/1624#issuecomment-97622142)) -* Be able to inspect Pods running a Job, especially after a Job has finished, e.g. - by providing pointers to Pods in the JobStatus ([see comment](https://github.com/kubernetes/kubernetes/pull/11746/files#r37142628)). -* help users avoid non-unique label selectors ([see this proposal](../../docs/design/selector-generation.md)) - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/job.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/job.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/job.md) diff --git a/docs/proposals/kubectl-login.md b/docs/proposals/kubectl-login.md index a333e9dc0dc..7fdd99f765e 100644 --- a/docs/proposals/kubectl-login.md +++ b/docs/proposals/kubectl-login.md @@ -1,220 +1 @@ -# Kubectl Login Subcommand - -**Authors**: Eric Chiang (@ericchiang) - -## Goals - -`kubectl login` is an entrypoint for any user attempting to connect to an -existing server. It should provide a more tailored experience than the existing -`kubectl config` including config validation, auth challenges, and discovery. - -Short term the subcommand should recognize and attempt to help: - -* New users with an empty configuration trying to connect to a server. -* Users with no credentials, by prompt for any required information. -* Fully configured users who want to validate credentials. -* Users trying to switch servers. -* Users trying to reauthenticate as the same user because credentials have expired. -* Authenticate as a different user to the same server. - -Long term `kubectl login` should enable authentication strategies to be -discoverable from a master to avoid the end-user having to know how their -sysadmin configured the Kubernetes cluster. - -## Design - -The "login" subcommand helps users move towards a fully functional kubeconfig by -evaluating the current state of the kubeconfig and trying to prompt the user for -and validate the necessary information to login to the kubernetes cluster. - -This is inspired by a similar tools such as: - - * [os login](https://docs.openshift.org/latest/cli_reference/get_started_cli.html#basic-setup-and-login) - * [gcloud auth login](https://cloud.google.com/sdk/gcloud/reference/auth/login) - * [aws configure](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html) - -The steps taken are: - -1. If no cluster configured, prompt user for cluster information. -2. If no user is configured, discover the authentication strategies supported by the API server. -3. Prompt the user for some information based on the authentication strategy they choose. -4. 
Attempt to login as a user, including authentication challenges such as OAuth2 flows, and display user info. - -Importantly, each step is skipped if the existing configuration is validated or -can be supplied without user interaction (refreshing an OAuth token, redeeming -a Kerberos ticket, etc.). Users with fully configured kubeconfigs will only see -the user they're logged in as, useful for opaque credentials such as X509 certs -or bearer tokens. - -The command differs from `kubectl config` by: - -* Communicating with the API server to determine if the user is supplying valid auth events. -* Validating input and being opinionated about the input it asks for. -* Triggering authentication challenges for example: - * Basic auth: Actually try to communicate with the API server. - * OpenID Connect: Create an OAuth2 redirect. - -However `kubectl login` should still be seen as a supplement to, not a -replacement for, `kubectl config` by helping validate any kubeconfig generated -by the latter command. - -## Credential validation - -When clusters utilize authorization plugins access decisions are based on the -correct configuration of an auth-N plugin, an auth-Z plugin, and client side -credentials. Being rejected then begs several questions. Is the user's -kubeconfig misconfigured? Is the authorization plugin setup wrong? Is the user -authenticating as a different user than the one they assume? - -To help `kubectl login` diagnose misconfigured credentials, responses from the -API server to authenticated requests SHOULD include the `Authentication-Info` -header as defined in [RFC 7615](https://tools.ietf.org/html/rfc7615). The value -will hold name value pairs for `username` and `uid`. Since usernames and IDs -can be arbitrary strings, these values will be escaped using the `quoted-string` -format noted in the RFC. - -``` -HTTP/1.1 200 OK -Authentication-Info: username="janedoe@example.com", uid="123456" -``` - -If the user successfully authenticates this header will be set, regardless of -auth-Z decisions. For example a 401 Unauthorized (user didn't provide valid -credentials) would lack this header, while a 403 Forbidden response would -contain it. - -## Authentication discovery - -A long term goal of `kubectl login` is to facilitate a customized experience -for clusters configured with different auth providers. This will require some -way for the API server to indicate to `kubectl` how a user is expected to -login. - -Currently, this document doesn't propose a specific implementation for -discovery. While it'd be preferable to utilize an existing standard (such as the -`WWW-Authenticate` HTTP header), discovery may require a solution custom to the -API server, such as an additional discovery endpoint with a custom type. - -## Use in non-interactive session - -For the initial implementation, if `kubectl login` requires prompting and is -called from a non-interactive session (determined by if the session is using a -TTY) it errors out, recommending using `kubectl config` instead. In future -updates `kubectl login` may include options for non-interactive sessions so -auth strategies which require custom behavior not built into `kubectl config`, -such as the exchanges in Kerberos or OpenID Connect, can be triggered from -scripts. - -## Examples - -If kubeconfig isn't configured, `kubectl login` will attempt to fully configure -and validate the client's credentials. 
- -``` -$ kubectl login -Cluster URL []: https://172.17.4.99:443 -Cluster CA [(defaults to host certs)]: ${PWD}/ssl/ca.pem -Cluster Name ["cluster-1"]: - -The kubernetes server supports the following methods: - - 1. Bearer token - 2. Username and password - 3. Keystone - 4. OpenID Connect - 5. TLS client certificate - -Enter login method [1]: 4 - -Logging in using OpenID Connect. - -Issuer ["valuefromdiscovery"]: https://accounts.google.com -Issuer CA [(defaults to host certs)]: -Scopes ["profile email"]: -Client ID []: client@localhost:foobar -Client Secret []: ***** - -Open the following address in a browser. - - https://accounts.google.com/o/oauth2/v2/auth?redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scopes=openid%20email&access_type=offline&... - -Enter security code: **** - -Logged in as "janedoe@gmail.com" -``` - -Human readable names are provided by a combination of the auth providers -understood by `kubectl login` and the authenticator discovery. For instance, -Keystone uses basic auth credentials in the same way as a static user file, but -if the discovery indicates that the Keystone plugin is being used it should be -presented to the user differently. - -Users with configured credentials will simply auth against the API server and see -who they are. Running this command again simply validates the user's credentials. - -``` -$ kubectl login -Logged in as "janedoe@gmail.com" -``` - -Users who are halfway through the flow will start where they left off. For -instance if a user has configured the cluster field but on a user field, they will -be prompted for credentials. - -``` -$ kubectl login -No auth type configured. The kubernetes server supports the following methods: - - 1. Bearer token - 2. Username and password - 3. Keystone - 4. OpenID Connect - 5. TLS client certificate - -Enter login method [1]: 2 - -Logging in with basic auth. Enter the following fields. - -Username: janedoe -Password: **** - -Logged in as "janedoe@gmail.com" -``` - -Users who wish to switch servers can provide the `--switch-cluster` flag which -will prompt the user for new cluster details and switch the current context. It -behaves identically to `kubectl login` when a cluster is not set. - -``` -$ kubectl login --switch-cluster -# ... -``` - -Switching users goes through a similar flow attempting to prompt the user for -new credentials to the same server. - -``` -$ kubectl login --switch-user -# ... -``` - -## Work to do - -Phase 1: - -* Provide a simple dialog for configuring authentication. -* Kubectl can trigger authentication actions such as trigging OAuth2 redirects. -* Validation of user credentials thought the `Authentication-Info` endpoint. - -Phase 2: - -* Update proposal with auth provider discovery mechanism. -* Customize dialog using discovery data. - -Further improvements will require adding more authentication providers, and -adapting existing plugins to take advantage of challenge based authentication. 
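Phase 1 includes validating credentials through the `Authentication-Info` header described earlier. As a rough illustration (not the kubectl implementation, and handling only the simple quoted-string case), the header value could be parsed like this:

```go
// Rough sketch of parsing the Authentication-Info header shown earlier; it
// handles the simple quoted-string case only and is not the kubectl code.
package main

import (
	"fmt"
	"regexp"
)

var authInfoPair = regexp.MustCompile(`(\w+)="((?:[^"\\]|\\.)*)"`)

// parseAuthInfo extracts name/value pairs such as username and uid from an
// Authentication-Info header value (RFC 7615).
func parseAuthInfo(header string) map[string]string {
	out := map[string]string{}
	for _, m := range authInfoPair.FindAllStringSubmatch(header, -1) {
		out[m[1]] = m[2]
	}
	return out
}

func main() {
	info := parseAuthInfo(`username="janedoe@example.com", uid="123456"`)
	fmt.Println(info["username"], info["uid"])
}
```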
- - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/kubectl-login.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubectl-login.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubectl-login.md) diff --git a/docs/proposals/kubelet-auth.md b/docs/proposals/kubelet-auth.md index c4d35dd966e..38ab08d48b3 100644 --- a/docs/proposals/kubelet-auth.md +++ b/docs/proposals/kubelet-auth.md @@ -1,106 +1 @@ -# Kubelet Authentication / Authorization - -Author: Jordan Liggitt (jliggitt@redhat.com) - -## Overview - -The kubelet exposes endpoints which give access to data of varying sensitivity, -and allow performing operations of varying power on the node and within containers. -There is no built-in way to limit or subdivide access to those endpoints, -so deployers must secure the kubelet API using external, ad-hoc methods. - -This document proposes a method for authenticating and authorizing access -to the kubelet API, using interfaces and methods that complement the existing -authentication and authorization used by the API server. - -## Preliminaries - -This proposal assumes the existence of: - -* a functioning API server -* the SubjectAccessReview and TokenReview APIs - -It also assumes each node is additionally provisioned with the following information: - -1. Location of the API server -2. Any CA certificates necessary to trust the API server's TLS certificate -3. Client credentials authorized to make SubjectAccessReview and TokenReview API calls - -## API Changes - -None - -## Kubelet Authentication - -Enable starting the kubelet with one or more of the following authentication methods: - -* x509 client certificate -* bearer token -* anonymous (current default) - -For backwards compatibility, the default is to enable anonymous authentication. - -### x509 client certificate - -Add a new `--client-ca-file=[file]` option to the kubelet. -When started with this option, the kubelet authenticates incoming requests using x509 -client certificates, validated against the root certificates in the provided bundle. -The kubelet will reuse the x509 authenticator already used by the API server. - -The master API server can already be started with `--kubelet-client-certificate` and -`--kubelet-client-key` options in order to make authenticated requests to the kubelet. - -### Bearer token - -Add a new `--authentication-token-webhook=[true|false]` option to the kubelet. -When true, the kubelet authenticates incoming requests with bearer tokens by making -`TokenReview` API calls to the API server. - -The kubelet will reuse the webhook authenticator already used by the API server, configured -to call the API server using the connection information already provided to the kubelet. - -To improve performance of repeated requests with the same bearer token, the -`--authentication-token-webhook-cache-ttl` option supported by the API server -would be supported. - -### Anonymous - -Add a new `--anonymous-auth=[true|false]` option to the kubelet. -When true, requests to the secure port that are not rejected by other configured -authentication methods are treated as anonymous requests, and given a username -of `system:anonymous` and a group of `system:unauthenticated`. 
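To illustrate how the methods above compose, here is a minimal sketch of an authenticator chain with an anonymous fallback. The `Authenticator` interface and types are stand-ins for this document only, not the kubelet's actual authentication plumbing:

```go
// Sketch only: the interface and types below are illustrative stand-ins.
package main

import (
	"fmt"
	"net/http"
)

type UserInfo struct {
	Name   string
	Groups []string
}

// Authenticator returns (user, true, nil) on success, (nil, false, nil) if it
// cannot handle the request, and an error on a definitive rejection.
type Authenticator interface {
	AuthenticateRequest(req *http.Request) (*UserInfo, bool, error)
}

// anonymous implements the --anonymous-auth=true behavior described above.
type anonymous struct{}

func (anonymous) AuthenticateRequest(*http.Request) (*UserInfo, bool, error) {
	return &UserInfo{Name: "system:anonymous", Groups: []string{"system:unauthenticated"}}, true, nil
}

type unionAuth []Authenticator

// AuthenticateRequest tries each configured method in order (e.g. x509, then
// the token webhook) and stops at the first success or definitive error.
func (u unionAuth) AuthenticateRequest(req *http.Request) (*UserInfo, bool, error) {
	for _, a := range u {
		if user, ok, err := a.AuthenticateRequest(req); err != nil || ok {
			return user, ok, err
		}
	}
	return nil, false, nil
}

func main() {
	chain := unionAuth{ /* x509, tokenWebhook, */ anonymous{}}
	user, _, _ := chain.AuthenticateRequest(&http.Request{})
	fmt.Println(user.Name)
}
```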
- -## Kubelet Authorization - -Add a new `--authorization-mode` option to the kubelet, specifying one of the following modes: -* `Webhook` -* `AlwaysAllow` (current default) - -For backwards compatibility, the authorization mode defaults to `AlwaysAllow`. - -### Webhook - -Webhook mode converts the request to authorization attributes, and makes a `SubjectAccessReview` -API call to check if the authenticated subject is allowed to make a request with those attributes. -This enables authorization policy to be centrally managed by the authorizer configured for the API server. - -The kubelet will reuse the webhook authorizer already used by the API server, configured -to call the API server using the connection information already provided to the kubelet. - -To improve performance of repeated requests with the same authenticated subject and request attributes, -the same webhook authorizer caching options supported by the API server would be supported: - -* `--authorization-webhook-cache-authorized-ttl` -* `--authorization-webhook-cache-unauthorized-ttl` - -### AlwaysAllow - -This mode allows any authenticated request. - -## Future Work - -* Add support for CRL revocation for x509 client certificate authentication (http://issue.k8s.io/18982) - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/kubelet-auth.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-auth.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-auth.md) diff --git a/docs/proposals/kubelet-cri-logging.md b/docs/proposals/kubelet-cri-logging.md index 8cc6fac111b..f8204e41ab9 100644 --- a/docs/proposals/kubelet-cri-logging.md +++ b/docs/proposals/kubelet-cri-logging.md @@ -1,269 +1 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - -# CRI: Log management for container stdout/stderr streams - - -## Goals and non-goals - -Container Runtime Interface (CRI) is an ongoing project to allow container -runtimes to integrate with kubernetes via a newly-defined API. The goal of this -proposal is to define how container's *stdout/stderr* log streams should be -handled in CRI. - -The explicit non-goal is to define how (non-stdout/stderr) application logs -should be handled. Collecting and managing arbitrary application logs is a -long-standing issue [1] in kubernetes and is worth a proposal of its own. Even -though this proposal does not touch upon these logs, the direction of -this proposal is aligned with one of the most-discussed solutions, logging -volumes [1], for general logging management. - -*In this proposal, “logs” refer to the stdout/stderr streams of the -containers, unless specified otherwise.* - -Previous CRI logging issues: - - Tracking issue: https://github.com/kubernetes/kubernetes/issues/30709 - - Proposal (by @tmrtfs): https://github.com/kubernetes/kubernetes/pull/33111 - -The scope of this proposal is narrower than the #33111 proposal, and hopefully -this will encourage a more focused discussion. - - -## Background - -Below is a brief overview of logging in kubernetes with docker, which is the -only container runtime with fully functional integration today. - -**Log lifecycle and management** - -Docker supports various logging drivers (e.g., syslog, journal, and json-file), -and allows users to configure the driver by passing flags to the docker daemon -at startup. Kubernetes defaults to the "json-file" logging driver, in which -docker writes the stdout/stderr streams to a file in the json format as shown -below. - -``` -{“log”: “The actual log line”, “stream”: “stderr”, “time”: “2016-10-05T00:00:30.082640485Z”} -``` - -Docker deletes the log files when the container is removed, and a cron-job (or -systemd timer-based job) on the node is responsible to rotate the logs (using -`logrotate`). To preserve the logs for introspection and debuggability, kubelet -keeps the terminated container until the pod object has been deleted from the -apiserver. - -**Container log retrieval** - -The kubernetes CLI tool, kubectl, allows users to access the container logs -using [`kubectl logs`] -(http://kubernetes.io/docs/user-guide/kubectl/kubectl_logs/) command. -`kubectl logs` supports flags such as `--since` that requires understanding of -the format and the metadata (i.e., timestamps) of the logs. In the current -implementation, kubelet calls `docker logs` with parameters to return the log -content. As of now, docker only supports `log` operations for the “journal” and -“json-file” drivers [2]. In other words, *the support of `kubectl logs` is not -universal in all kuernetes deployments*. - -**Cluster logging support** - -In a production cluster, logs are usually collected, aggregated, and shipped to -a remote store where advanced analysis/search/archiving functions are -supported. In kubernetes, the default cluster-addons includes a per-node log -collection daemon, `fluentd`. To facilitate the log collection, kubelet creates -symbolic links to all the docker containers logs under `/var/log/containers` -with pod and container metadata embedded in the filename. 
- -``` -/var/log/containers/__-.log` -``` - -The fluentd daemon watches the `/var/log/containers/` directory and extract the -metadata associated with the log from the path. Note that this integration -requires kubelet to know where the container runtime stores the logs, and will -not be directly applicable to CRI. - - -## Requirements - - 1. **Provide ways for CRI-compliant runtimes to support all existing logging - features, i.e., `kubectl logs`.** - - 2. **Allow kubelet to manage the lifecycle of the logs to pave the way for - better disk management in the future.** This implies that the lifecycle - of containers and their logs need to be decoupled. - - 3. **Allow log collectors to easily integrate with Kubernetes across - different container runtimes while preserving efficient storage and - retrieval.** - -Requirement (1) provides opportunities for runtimes to continue support -`kubectl logs --since` and related features. Note that even though such -features are only supported today for a limited set of log drivers, this is an -important usability tool for a fresh, basic kubernetes cluster, and should not -be overlooked. Requirement (2) stems from the fact that disk is managed by -kubelet as a node-level resource (not per-pod) today, hence it is difficult to -delegate to the runtime by enforcing per-pod disk quota policy. In addition, -container disk quota is not well supported yet, and such limitation may not -even be well-perceived by users. Requirement (1) is crucial to the kubernetes' -extensibility and usability across all deployments. - -## Proposed solution - -This proposal intends to satisfy the requirements by - - 1. Enforce where the container logs should be stored on the host - filesystem. Both kubelet and the log collector can interact with - the log files directly. - - 2. Ask the runtime to decorate the logs in a format that kubelet understands. - -**Log directories and structures** - -Kubelet will be configured with a root directory (e.g., `/var/log/pods` or -`/var/lib/kubelet/logs/) to store all container logs. Below is an example of a -path to the log of a container in a pod. - -``` -/var/log/pods//_.log -``` - -In CRI, this is implemented by setting the pod-level log directory when -creating the pod sandbox, and passing the relative container log path -when creating a container. - -``` -PodSandboxConfig.LogDirectory: /var/log/pods// -ContainerConfig.LogPath: _.log -``` - -Because kubelet determines where the logs are stores and can access them -directly, this meets requirement (1). As for requirement (2), the log collector -can easily extract basic pod metadata (e.g., pod UID, container name) from -the paths, and watch the directly for any changes. In the future, we can -extend this by maintaining a metada file in the pod directory. - -**Log format** - -The runtime should decorate each log entry with a RFC 3339Nano timestamp -prefix, the stream type (i.e., "stdout" or "stderr"), and ends with a newline. - -``` -2016-10-06T00:17:09.669794202Z stdout The content of the log entry 1 -2016-10-06T00:17:10.113242941Z stderr The content of the log entry 2 -``` - -With the knowledge, kubelet can parses the logs and serve them for `kubectl -logs` requests. This meets requirement (3). Note that the format is defined -deliberately simple to provide only information necessary to serve the requests. -We do not intend for kubelet to host various logging plugins. 
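As an illustration of how little is needed to consume this format, a minimal parser might look like the following (a sketch only, not the kubelet implementation):

```go
// Minimal sketch of parsing the log format described above.
package main

import (
	"fmt"
	"strings"
	"time"
)

type logEntry struct {
	Timestamp time.Time
	Stream    string // "stdout" or "stderr"
	Content   string
}

// parseLine splits a line into its RFC 3339Nano timestamp, stream tag, and content.
func parseLine(line string) (logEntry, error) {
	parts := strings.SplitN(line, " ", 3)
	if len(parts) != 3 {
		return logEntry{}, fmt.Errorf("malformed log line: %q", line)
	}
	ts, err := time.Parse(time.RFC3339Nano, parts[0])
	if err != nil {
		return logEntry{}, err
	}
	return logEntry{Timestamp: ts, Stream: parts[1], Content: parts[2]}, nil
}

func main() {
	e, _ := parseLine("2016-10-06T00:17:09.669794202Z stdout The content of the log entry 1")
	fmt.Println(e.Timestamp, e.Stream, e.Content)
}
```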
It is also worth -mentioning again that the scope of this proposal is restricted to stdout/stderr -streams of the container, and we impose no restriction to the logging format of -arbitrary container logs. - -**Who should rotate the logs?** - -We assume that a separate task (e.g., cron job) will be configured on the node -to rotate the logs periodically, similar to today’s implementation. - -We do not rule out the possibility of letting kubelet or a per-node daemon -(`DaemonSet`) to take up the responsibility, or even declare rotation policy -in the kubernetes API as part of the `PodSpec`, but it is beyond the scope of -the this proposal. - -**What about non-supported log formats?** - -If a runtime chooses to store logs in non-supported formats, it essentially -opts out of `kubectl logs` features, which is backed by kubelet today. It is -assumed that the user can rely on the advanced, cluster logging infrastructure -to examine the logs. - -It is also possible that in the future, `kubectl logs` can contact the cluster -logging infrastructure directly to serve logs [1a]. Note that this does not -eliminate the need to store the logs on the node locally for reliability. - - -**How can existing runtimes (docker/rkt) comply to the logging requirements?** - -In the short term, the ongoing docker-CRI integration [3] will support the -proposed solution only partially by (1) creating symbolic links for kubelet -to access, but not manage the logs, and (2) add support for json format in -kubelet. A more sophisticated solution that either involves using a custom -plugin or launching a separate process to copy and decorate the log will be -considered as a mid-term solution. - -For rkt, implementation will rely on providing external file-descriptors for -stdout/stderr to applications via systemd [4]. Those streams are currently -managed by a journald sidecar, which collects stream outputs and store them -in the journal file of the pod. This will replaced by a custom sidecar which -can produce logs in the format expected by this specification and can handle -clients attaching as well. - -## Alternatives - -There are ad-hoc solutions/discussions that addresses one or two of the -requirements, but no comprehensive solution for CRI specifically has been -proposed so far (with the excpetion of @tmrtfs's proposal -[#33111](https://github.com/kubernetes/kubernetes/pull/33111), which has a much -wider scope). It has come up in discussions that kubelet can delegate all the -logging management to the runtime to allow maximum flexibility. However, it is -difficult for this approach to meet either requirement (1) or (2), without -defining complex logging API. - -There are also possibilities to implement the current proposal by imposing the -log file paths, while leveraging the runtime to access and/or manage logs. This -requires the runtime to expose knobs in CRI to retrieve, remove, and examine -the disk usage of logs. The upside of this approach is that kubelet needs not -mandate the logging format, assuming runtime already includes plugins for -various logging formats. Unfortunately, this is not true for existing runtimes -such as docker, which supports log retrieval only for a very limited number of -log drivers [2]. On the other hand, the downside is that we would be enforcing -more requirements on the runtime through log storage location on the host, and -a potentially premature logging API that may change as the disk management -evolves. - -## References - -[1] Log management issues: - - a. 
https://github.com/kubernetes/kubernetes/issues/17183 - - b. https://github.com/kubernetes/kubernetes/issues/24677 - - c. https://github.com/kubernetes/kubernetes/pull/13010 - -[2] Docker logging drivers: - - https://docs.docker.com/engine/admin/logging/overview/ - -[3] Docker CRI integration: - - https://github.com/kubernetes/kubernetes/issues/31459 - -[4] rkt support: https://github.com/systemd/systemd/pull/4179 - - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/kubelet-cri-logging.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-cri-logging.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-cri-logging.md) diff --git a/docs/proposals/kubelet-eviction.md b/docs/proposals/kubelet-eviction.md index 233956b8271..a4d0fa12abc 100644 --- a/docs/proposals/kubelet-eviction.md +++ b/docs/proposals/kubelet-eviction.md @@ -1,462 +1 @@ -# Kubelet - Eviction Policy - -**Authors**: Derek Carr (@derekwaynecarr), Vishnu Kannan (@vishh) - -**Status**: Proposed (memory evictions WIP) - -This document presents a specification for how the `kubelet` evicts pods when compute resources are too low. - -## Goals - -The node needs a mechanism to preserve stability when available compute resources are low. - -This is especially important when dealing with incompressible compute resources such -as memory or disk. If either resource is exhausted, the node would become unstable. - -The `kubelet` has some support for influencing system behavior in response to a system OOM by -having the system OOM killer see higher OOM score adjust scores for containers that have consumed -the largest amount of memory relative to their request. System OOM events are very compute -intensive, and can stall the node until the OOM killing process has completed. In addition, -the system is prone to return to an unstable state since the containers that are killed due to OOM -are either restarted or a new pod is scheduled on to the node. - -Instead, we would prefer a system where the `kubelet` can pro-actively monitor for -and prevent against total starvation of a compute resource, and in cases of where it -could appear to occur, pro-actively fail one or more pods, so the workload can get -moved and scheduled elsewhere when/if its backing controller creates a new pod. - -## Scope of proposal - -This proposal defines a pod eviction policy for reclaiming compute resources. - -As of now, memory and disk based evictions are supported. -The proposal focuses on a simple default eviction strategy -intended to cover the broadest class of user workloads. - -## Eviction Signals - -The `kubelet` will support the ability to trigger eviction decisions on the following signals. - -| Eviction Signal | Description | -|------------------|---------------------------------------------------------------------------------| -| memory.available | memory.available := node.status.capacity[memory] - node.stats.memory.workingSet | -| nodefs.available | nodefs.available := node.stats.fs.available | -| nodefs.inodesFree | nodefs.inodesFree := node.stats.fs.inodesFree | -| imagefs.available | imagefs.available := node.stats.runtime.imagefs.available | -| imagefs.inodesFree | imagefs.inodesFree := node.stats.runtime.imagefs.inodesFree | - -Each of the above signals support either a literal or percentage based value. 
The percentage based value -is calculated relative to the total capacity associated with each signal. - -`kubelet` supports only two filesystem partitions. - -1. The `nodefs` filesystem that kubelet uses for volumes, daemon logs, etc. -1. The `imagefs` filesystem that container runtimes uses for storing images and container writable layers. - -`imagefs` is optional. `kubelet` auto-discovers these filesystems using cAdvisor. -`kubelet` does not care about any other filesystems. Any other types of configurations are not currently supported by the kubelet. For example, it is *not OK* to store volumes and logs in a dedicated `imagefs`. - -## Eviction Thresholds - -The `kubelet` will support the ability to specify eviction thresholds. - -An eviction threshold is of the following form: - -`` - -* valid `eviction-signal` tokens as defined above. -* valid `operator` tokens are `<` -* valid `quantity` tokens must match the quantity representation used by Kubernetes -* an eviction threshold can be expressed as a percentage if ends with `%` token. - -If threshold criteria are met, the `kubelet` will take pro-active action to attempt -to reclaim the starved compute resource associated with the eviction signal. - -The `kubelet` will support soft and hard eviction thresholds. - -For example, if a node has `10Gi` of memory, and the desire is to induce eviction -if available memory falls below `1Gi`, an eviction signal can be specified as either -of the following (but not both). - -* `memory.available<10%` -* `memory.available<1Gi` - -### Soft Eviction Thresholds - -A soft eviction threshold pairs an eviction threshold with a required -administrator specified grace period. No action is taken by the `kubelet` -to reclaim resources associated with the eviction signal until that grace -period has been exceeded. If no grace period is provided, the `kubelet` will -error on startup. - -In addition, if a soft eviction threshold has been met, an operator can -specify a maximum allowed pod termination grace period to use when evicting -pods from the node. If specified, the `kubelet` will use the lesser value among -the `pod.Spec.TerminationGracePeriodSeconds` and the max allowed grace period. -If not specified, the `kubelet` will kill pods immediately with no graceful -termination. - -To configure soft eviction thresholds, the following flags will be supported: - -``` ---eviction-soft="": A set of eviction thresholds (e.g. memory.available<1.5Gi) that if met over a corresponding grace period would trigger a pod eviction. ---eviction-soft-grace-period="": A set of eviction grace periods (e.g. memory.available=1m30s) that correspond to how long a soft eviction threshold must hold before triggering a pod eviction. ---eviction-max-pod-grace-period="0": Maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met. -``` - -### Hard Eviction Thresholds - -A hard eviction threshold has no grace period, and if observed, the `kubelet` -will take immediate action to reclaim the associated starved resource. If a -hard eviction threshold is met, the `kubelet` will kill the pod immediately -with no graceful termination. - -To configure hard eviction thresholds, the following flag will be supported: - -``` ---eviction-hard="": A set of eviction thresholds (e.g. memory.available<1Gi) that if met would trigger a pod eviction. 
-``` - -## Eviction Monitoring Interval - -The `kubelet` will initially evaluate eviction thresholds at the same -housekeeping interval as `cAdvisor` housekeeping. - -In Kubernetes 1.2, this was defaulted to `10s`. - -It is a goal to shrink the monitoring interval to a much shorter window. -This may require changes to `cAdvisor` to let alternate housekeeping intervals -be specified for selected data (https://github.com/google/cadvisor/issues/1247) - -For the purposes of this proposal, we expect the monitoring interval to be no -more than `10s` to know when a threshold has been triggered, but we will strive -to reduce that latency time permitting. - -## Node Conditions - -The `kubelet` will support a node condition that corresponds to each eviction signal. - -If a hard eviction threshold has been met, or a soft eviction threshold has been met -independent of its associated grace period, the `kubelet` will report a condition that -reflects the node is under pressure. - -The following node conditions are defined that correspond to the specified eviction signal. - -| Node Condition | Eviction Signal | Description | -|----------------|------------------|------------------------------------------------------------------| -| MemoryPressure | memory.available | Available memory on the node has satisfied an eviction threshold | -| DiskPressure | nodefs.available, nodefs.inodesFree, imagefs.available, or imagefs.inodesFree | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold | - -The `kubelet` will continue to report node status updates at the frequency specified by -`--node-status-update-frequency` which defaults to `10s`. - -### Oscillation of node conditions - -If a node is oscillating above and below a soft eviction threshold, but not exceeding -its associated grace period, it would cause the corresponding node condition to -constantly oscillate between true and false, and could cause poor scheduling decisions -as a consequence. - -To protect against this oscillation, the following flag is defined to control how -long the `kubelet` must wait before transitioning out of a pressure condition. - -``` ---eviction-pressure-transition-period=5m0s: Duration for which the kubelet has to wait -before transitioning out of an eviction pressure condition. -``` - -The `kubelet` would ensure that it has not observed an eviction threshold being met -for the specified pressure condition for the period specified before toggling the -condition back to `false`. - -## Eviction scenarios - -### Memory - -Let's assume the operator started the `kubelet` with the following: - -``` ---eviction-hard="memory.available<100Mi" ---eviction-soft="memory.available<300Mi" ---eviction-soft-grace-period="memory.available=30s" -``` - -The `kubelet` will run a sync loop that looks at the available memory -on the node as reported from `cAdvisor` by calculating (capacity - workingSet). -If available memory is observed to drop below 100Mi, the `kubelet` will immediately -initiate eviction. If available memory is observed as falling below `300Mi`, -it will record when that signal was observed internally in a cache. If at the next -sync, that criteria was no longer satisfied, the cache is cleared for that -signal. If that signal is observed as being satisfied for longer than the -specified period, the `kubelet` will initiate eviction to attempt to -reclaim the resource that has met its eviction threshold. 
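The bookkeeping described above (record the first observation, clear it when the signal recovers, evict once the grace period is exceeded) can be sketched as follows. This is illustrative only; the real kubelet logic differs in detail:

```go
// Illustrative sketch of soft-threshold grace-period tracking.
package main

import (
	"fmt"
	"time"
)

type thresholdTracker struct {
	firstObserved map[string]time.Time     // signal -> when it first crossed its threshold
	gracePeriod   map[string]time.Duration // signal -> configured soft grace period
}

// observe is called each sync loop with the set of signals currently below
// their soft thresholds; it returns the signals whose grace period elapsed.
func (t *thresholdTracker) observe(now time.Time, met map[string]bool) []string {
	var evict []string
	for signal, gp := range t.gracePeriod {
		if !met[signal] {
			delete(t.firstObserved, signal) // signal recovered: clear the cache
			continue
		}
		if _, ok := t.firstObserved[signal]; !ok {
			t.firstObserved[signal] = now
		}
		if now.Sub(t.firstObserved[signal]) >= gp {
			evict = append(evict, signal)
		}
	}
	return evict
}

func main() {
	tr := &thresholdTracker{
		firstObserved: map[string]time.Time{},
		gracePeriod:   map[string]time.Duration{"memory.available": 30 * time.Second},
	}
	start := time.Now()
	fmt.Println(tr.observe(start, map[string]bool{"memory.available": true}))                     // [] - grace period not yet exceeded
	fmt.Println(tr.observe(start.Add(31*time.Second), map[string]bool{"memory.available": true})) // [memory.available]
}
```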
- -### Disk - -Let's assume the operator started the `kubelet` with the following: - -``` ---eviction-hard="nodefs.available<1Gi,nodefs.inodesFree<1,imagefs.available<10Gi,imagefs.inodesFree<10" ---eviction-soft="nodefs.available<1.5Gi,nodefs.inodesFree<10,imagefs.available<20Gi,imagefs.inodesFree<100" ---eviction-soft-grace-period="nodefs.available=1m,imagefs.available=2m" -``` - -The `kubelet` will run a sync loop that looks at the available disk -on the node's supported partitions as reported from `cAdvisor`. -If available disk space on the node's primary filesystem is observed to drop below 1Gi -or the free inodes on the node's primary filesystem is less than 1, -the `kubelet` will immediately initiate eviction. -If available disk space on the node's image filesystem is observed to drop below 10Gi -or the free inodes on the node's primary image filesystem is less than 10, -the `kubelet` will immediately initiate eviction. - -If available disk space on the node's primary filesystem is observed as falling below `1.5Gi`, -or if the free inodes on the node's primary filesystem is less than 10, -or if available disk space on the node's image filesystem is observed as falling below `20Gi`, -or if the free inodes on the node's image filesystem is less than 100, -it will record when that signal was observed internally in a cache. If at the next -sync, that criterion was no longer satisfied, the cache is cleared for that -signal. If that signal is observed as being satisfied for longer than the -specified period, the `kubelet` will initiate eviction to attempt to -reclaim the resource that has met its eviction threshold. - -## Eviction of Pods - -If an eviction threshold has been met, the `kubelet` will initiate the -process of evicting pods until it has observed the signal has gone below -its defined threshold. - -The eviction sequence works as follows: - -* for each monitoring interval, if eviction thresholds have been met - * find candidate pod - * fail the pod - * block until pod is terminated on node - -If a pod is not terminated because a container does not happen to die -(i.e. processes stuck in disk IO for example), the `kubelet` may select -an additional pod to fail instead. The `kubelet` will invoke the `KillPod` -operation exposed on the runtime interface. If an error is returned, -the `kubelet` will select a subsequent pod. - -## Eviction Strategy - -The `kubelet` will implement a default eviction strategy oriented around -the pod quality of service class. - -It will target pods that are the largest consumers of the starved compute -resource relative to their scheduling request. It ranks pods within a -quality of service tier in the following order. - -* `BestEffort` pods that consume the most of the starved resource are failed -first. -* `Burstable` pods that consume the greatest amount of the starved resource -relative to their request for that resource are killed first. If no pod -has exceeded its request, the strategy targets the largest consumer of the -starved resource. -* `Guaranteed` pods that consume the greatest amount of the starved resource -relative to their request are killed first. If no pod has exceeded its request, -the strategy targets the largest consumer of the starved resource. - -A guaranteed pod is guaranteed to never be evicted because of another pod's -resource consumption. That said, guarantees are only as good as the underlying -foundation they are built upon. If a system daemon -(i.e. `kubelet`, `docker`, `journald`, etc.) 
is consuming more resources than -were reserved via `system-reserved` or `kube-reserved` allocations, and the node -only has guaranteed pod(s) remaining, then the node must choose to evict a -guaranteed pod in order to preserve node stability, and to limit the impact -of the unexpected consumption to other guaranteed pod(s). - -## Disk based evictions - -### With Imagefs - -If `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order: - -1. Delete logs -1. Evict Pods if required. - -If `imagefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order: - -1. Delete unused images -1. Evict Pods if required. - -### Without Imagefs - -If `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order: - -1. Delete logs -1. Delete unused images -1. Evict Pods if required. - -Let's explore the different options for freeing up disk space. - -### Delete logs of dead pods/containers - -As of today, logs are tied to a container's lifetime. `kubelet` keeps dead containers around, -to provide access to logs. -In the future, if we store logs of dead containers outside of the container itself, then -`kubelet` can delete these logs to free up disk space. -Once the lifetime of containers and logs are split, kubelet can support more user friendly policies -around log evictions. `kubelet` can delete logs of the oldest containers first. -Since logs from the first and the most recent incarnation of a container is the most important for most applications, -kubelet can try to preserve these logs and aggressively delete logs from other container incarnations. - -Until logs are split from container's lifetime, `kubelet` can delete dead containers to free up disk space. - -### Delete unused images - -`kubelet` performs image garbage collection based on thresholds today. It uses a high and a low watermark. -Whenever disk usage exceeds the high watermark, it removes images until the low watermark is reached. -`kubelet` employs a LRU policy when it comes to deleting images. - -The existing policy will be replaced with a much simpler policy. -Images will be deleted based on eviction thresholds. If kubelet can delete logs and keep disk space availability -above eviction thresholds, then kubelet will not delete any images. -If `kubelet` decides to delete unused images, it will delete *all* unused images. - -### Evict pods - -There is no ability to specify disk limits for pods/containers today. -Disk is a best effort resource. When necessary, `kubelet` can evict pods one at a time. -`kubelet` will follow the [Eviction Strategy](#eviction-strategy) mentioned above for making eviction decisions. -`kubelet` will evict the pod that will free up the maximum amount of disk space on the filesystem that has hit eviction thresholds. -Within each QoS bucket, `kubelet` will sort pods according to their disk usage. -`kubelet` will sort pods in each bucket as follows: - -#### Without Imagefs - -If `nodefs` is triggering evictions, `kubelet` will sort pods based on their total disk usage -- local volumes + logs & writable layer of all its containers. - -#### With Imagefs - -If `nodefs` is triggering evictions, `kubelet` will sort pods based on the usage on `nodefs` -- local volumes + logs of all its containers. - -If `imagefs` is triggering evictions, `kubelet` will sort pods based on the writable layer usage of all its containers. 
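The ranking described in the Eviction Strategy section can be sketched as a single sort. The types, and the "usage above request" tie-breaker below, are one possible reading of that strategy rather than kubelet internals:

```go
// Sketch of eviction candidate ranking; types and values are stand-ins.
package main

import (
	"fmt"
	"sort"
)

type qos int

const (
	BestEffort qos = iota
	Burstable
	Guaranteed
)

type podUsage struct {
	Name    string
	QoS     qos
	Usage   int64 // bytes of the starved resource currently used
	Request int64 // scheduling request for that resource (0 for BestEffort)
}

// rank orders candidates: lower QoS class first, then by usage above request,
// then by absolute usage of the starved resource.
func rank(pods []podUsage) {
	sort.Slice(pods, func(i, j int) bool {
		if pods[i].QoS != pods[j].QoS {
			return pods[i].QoS < pods[j].QoS
		}
		oi, oj := pods[i].Usage-pods[i].Request, pods[j].Usage-pods[j].Request
		if oi != oj {
			return oi > oj
		}
		return pods[i].Usage > pods[j].Usage
	})
}

func main() {
	pods := []podUsage{
		{"guaranteed-a", Guaranteed, 500, 500},
		{"burstable-b", Burstable, 800, 400},
		{"besteffort-c", BestEffort, 100, 0},
	}
	rank(pods)
	fmt.Println(pods[0].Name) // besteffort-c is selected first
}
```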
- -## Minimum eviction reclaim - -In certain scenarios, eviction of pods could result in reclamation of small amount of resources. This can result in -`kubelet` hitting eviction thresholds in repeated successions. In addition to that, eviction of resources like `disk`, - is time consuming. - -To mitigate these issues, `kubelet` will have a per-resource `minimum-reclaim`. Whenever `kubelet` observes -resource pressure, `kubelet` will attempt to reclaim at least `minimum-reclaim` amount of resource. - -Following are the flags through which `minimum-reclaim` can be configured for each evictable resource: - -`--eviction-minimum-reclaim="memory.available=0Mi,nodefs.available=500Mi,imagefs.available=2Gi"` - -The default `eviction-minimum-reclaim` is `0` for all resources. - -## Deprecation of existing features - -`kubelet` has been freeing up disk space on demand to keep the node stable. As part of this proposal, -some of the existing features/flags around disk space retrieval will be deprecated in-favor of this proposal. - -| Existing Flag | New Flag | Rationale | -| ------------- | -------- | --------- | -| `--image-gc-high-threshold` | `--eviction-hard` or `eviction-soft` | existing eviction signals can capture image garbage collection | -| `--image-gc-low-threshold` | `--eviction-minimum-reclaim` | eviction reclaims achieve the same behavior | -| `--maximum-dead-containers` | | deprecated once old logs are stored outside of container's context | -| `--maximum-dead-containers-per-container` | | deprecated once old logs are stored outside of container's context | -| `--minimum-container-ttl-duration` | | deprecated once old logs are stored outside of container's context | -| `--low-diskspace-threshold-mb` | `--eviction-hard` or `eviction-soft` | this use case is better handled by this proposal | -| `--outofdisk-transition-frequency` | `--eviction-pressure-transition-period` | make the flag generic to suit all compute resources | - -## Kubelet Admission Control - -### Feasibility checks during kubelet admission - -#### Memory - -The `kubelet` will reject `BestEffort` pods if any of the memory -eviction thresholds have been exceeded independent of the configured -grace period. - -Let's assume the operator started the `kubelet` with the following: - -``` ---eviction-soft="memory.available<256Mi" ---eviction-soft-grace-period="memory.available=30s" -``` - -If the `kubelet` sees that it has less than `256Mi` of memory available -on the node, but the `kubelet` has not yet initiated eviction since the -grace period criteria has not yet been met, the `kubelet` will still immediately -fail any incoming best effort pods. - -The reasoning for this decision is the expectation that the incoming pod is -likely to further starve the particular compute resource and the `kubelet` should -return to a steady state before accepting new workloads. - -#### Disk - -The `kubelet` will reject all pods if any of the disk eviction thresholds have been met. - -Let's assume the operator started the `kubelet` with the following: - -``` ---eviction-soft="nodefs.available<1500Mi" ---eviction-soft-grace-period="nodefs.available=30s" -``` - -If the `kubelet` sees that it has less than `1500Mi` of disk available -on the node, but the `kubelet` has not yet initiated eviction since the -grace period criteria has not yet been met, the `kubelet` will still immediately -fail any incoming pods. 
- -The rationale for failing **all** pods instead of just best effort is because disk is currently -a best effort resource for all QoS classes. - -Kubelet will apply the same policy even if there is a dedicated `image` filesystem. - -## Scheduler - -The node will report a condition when a compute resource is under pressure. The -scheduler should view that condition as a signal to dissuade placing additional -best effort pods on the node. - -In this case, the `MemoryPressure` condition if true should dissuade the scheduler -from placing new best effort pods on the node since they will be rejected by the `kubelet` in admission. - -On the other hand, the `DiskPressure` condition if true should dissuade the scheduler from -placing **any** new pods on the node since they will be rejected by the `kubelet` in admission. - -## Best Practices - -### DaemonSet - -It is never desired for a `kubelet` to evict a pod that was derived from -a `DaemonSet` since the pod will immediately be recreated and rescheduled -back to the same node. - -At the moment, the `kubelet` has no ability to distinguish a pod created -from `DaemonSet` versus any other object. If/when that information is -available, the `kubelet` could pro-actively filter those pods from the -candidate set of pods provided to the eviction strategy. - -In general, it should be strongly recommended that `DaemonSet` not -create `BestEffort` pods to avoid being identified as a candidate pod -for eviction. Instead `DaemonSet` should ideally include Guaranteed pods only. - -## Known issues - -### kubelet may evict more pods than needed - -The pod eviction may evict more pods than needed due to stats collection timing gap. This can be mitigated by adding -the ability to get root container stats on an on-demand basis (https://github.com/google/cadvisor/issues/1247) in the future. - -### How kubelet ranks pods for eviction in response to inode exhaustion - -At this time, it is not possible to know how many inodes were consumed by a particular container. If the `kubelet` observes -inode exhaustion, it will evict pods by ranking them by quality of service. The following issue has been opened in cadvisor -to track per container inode consumption (https://github.com/google/cadvisor/issues/1422) which would allow us to rank pods -by inode consumption. For example, this would let us identify a container that created large numbers of 0 byte files, and evict -that pod over others. - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/kubelet-eviction.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-eviction.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-eviction.md) diff --git a/docs/proposals/kubelet-hypercontainer-runtime.md b/docs/proposals/kubelet-hypercontainer-runtime.md index c3da7d9a0cc..55db1cd0bfc 100644 --- a/docs/proposals/kubelet-hypercontainer-runtime.md +++ b/docs/proposals/kubelet-hypercontainer-runtime.md @@ -1,45 +1 @@ -Kubelet HyperContainer Container Runtime -======================================= - -Authors: Pengfei Ni (@feiskyer), Harry Zhang (@resouer) - -## Abstract - -This proposal aims to support [HyperContainer](http://hypercontainer.io) container -runtime in Kubelet. - -## Motivation - -HyperContainer is a Hypervisor-agnostic Container Engine that allows you to run Docker images using -hypervisors (KVM, Xen, etc.). 
By running containers within separate VM instances, it offers a -hardware-enforced isolation, which is required in multi-tenant environments. - -## Goals - -1. Complete pod/container/image lifecycle management with HyperContainer. -2. Setup network by network plugins. -3. 100% Pass node e2e tests. -4. Easy to deploy for both local dev/test and production clusters. - -## Design - -The HyperContainer runtime will make use of the kubelet Container Runtime Interface. [Fakti](https://github.com/kubernetes/frakti) implements the CRI interface and exposes -a local endpoint to Kubelet. Fakti communicates with [hyperd](https://github.com/hyperhq/hyperd) -with its gRPC API to manage the lifecycle of sandboxes, containers and images. - -![frakti](https://cloud.githubusercontent.com/assets/676637/18940978/6e3e5384-863f-11e6-9132-b638d862fd09.png) - -## Limitations - -Since pods are running directly inside hypervisor, host network is not supported in HyperContainer -runtime. - -## Development - -The HyperContainer runtime is maintained by . - - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/kubelet-hypercontainer-runtime.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-hypercontainer-runtime.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-hypercontainer-runtime.md) diff --git a/docs/proposals/kubelet-rkt-runtime.md b/docs/proposals/kubelet-rkt-runtime.md index 84aac8cc4ab..e16d0b5793b 100644 --- a/docs/proposals/kubelet-rkt-runtime.md +++ b/docs/proposals/kubelet-rkt-runtime.md @@ -1,103 +1 @@ -Next generation rkt runtime integration -======================================= - -Authors: Euan Kemp (@euank), Yifan Gu (@yifan-gu) - -## Abstract - -This proposal describes the design and road path for integrating rkt with kubelet with the new container runtime interface. - -## Background - -Currently, the Kubernetes project supports rkt as a container runtime via an implementation under [pkg/kubelet/rkt package](https://github.com/kubernetes/kubernetes/tree/v1.5.0-alpha.0/pkg/kubelet/rkt). - -This implementation, for historical reasons, has required implementing a large amount of logic shared by the original Docker implementation. - -In order to make additional container runtime integrations easier, more clearly defined, and more consistent, a new [Container Runtime Interface](https://github.com/kubernetes/kubernetes/blob/v1.5.0-alpha.0/pkg/kubelet/api/v1alpha1/runtime/api.proto) (CRI) is being designed. -The existing runtimes, in order to both prove the correctness of the interface and reduce maintenance burden, are incentivized to move to this interface. - -This document proposes how the rkt runtime integration will transition to using the CRI. - -## Goals - -### Full-featured - -The CRI integration must work as well as the existing integration in terms of features. - -Until that's the case, the existing integration will continue to be maintained. - -### Easy to Deploy - -The new integration should not be any more difficult to deploy and configure than the existing integration. - -### Easy to Develop - -This iteration should be as easy to work and iterate on as the original one. - -It will be available in an initial usable form quickly in order to validate the CRI. 
-
-
-## Design
-
-In order to fulfill the above goals, the rkt CRI integration will make the following choices:
-
-### Remain in-process with Kubelet
-
-The current rkt container runtime integration can be deployed simply by deploying the kubelet binary.
-
-This is, in no small part, what makes it *Easy to Deploy*.
-
-Remaining in-process also helps this integration avoid regressing on performance, one axis of being *Full-Featured*.
-
-### Communicate through gRPC
-
-Although the kubelet and rktlet will be compiled together, the runtime and kubelet will still communicate through a gRPC interface for better API abstraction.
-
-In the near term, they will talk over a Unix socket until we implement a custom gRPC connection that skips the network stack.
-
-### Developed as a Separate Repository
-
-Brian Grant's discussion on splitting the Kubernetes project into [separate repos](https://github.com/kubernetes/kubernetes/issues/24343) is a compelling argument for why it makes sense to split this work into a separate repo.
-
-In order to be *Easy to Develop*, this iteration will be maintained as a separate repository and re-vendored back in.
-
-This choice will also allow better long-term growth in terms of issue management, testing pipelines, and so on.
-
-Unfortunately, in the short term, some aspects of this may also cause pain, and it is difficult to weigh each side precisely.
-
-### Exec the rkt binary (initially)
-
-While significant work on the rkt [api-service](https://coreos.com/rkt/docs/latest/subcommands/api-service.html) has been done,
-it has also been a source of problems and additional complexity,
-and the integration never transitioned to it entirely.
-
-In addition, the rkt CLI has historically been the primary interface to the rkt runtime.
-
-The initial integration will execute the rkt binary directly for app creation/start/stop/removal, as well as image pulling/removal.
-
-The creation of the pod sandbox is also done via the rkt command line, but it will run under `systemd-run` so that it is monitored by the init process (see the sketch after the roadmap below).
-
-In the future, some of these decisions are expected to change such that rkt is vendored as a library dependency for all operations, and other init systems will be supported as well.
-
-
-## Roadmap and Milestones
-
-1. rktlet integrates with the kubelet to support the basic pod/container lifecycle (pod creation, container creation/start/stop, pod stop/removal) [[Done]](https://github.com/kubernetes-incubator/rktlet/issues/9)
-2. rktlet integrates with the kubelet to support more advanced features:
-   - Support kubelet networking, host network
-   - Support mount / volumes [[#33526]](https://github.com/kubernetes/kubernetes/issues/33526)
-   - Support exposing ports
-   - Support privileged containers
-   - Support selinux options [[#33139]](https://github.com/kubernetes/kubernetes/issues/33139)
-   - Support attach [[#29579]](https://github.com/kubernetes/kubernetes/issues/29579)
-   - Support exec [[#29579]](https://github.com/kubernetes/kubernetes/issues/29579)
-   - Support logging [[#33111]](https://github.com/kubernetes/kubernetes/pull/33111)
-
-3. rktlet integrates with the kubelet and passes 100% of e2e and node e2e tests with the nspawn stage1.
-4. rktlet integrates with the kubelet and passes 100% of e2e and node e2e tests with the kvm stage1.
-5. Revendor rktlet into `pkg/kubelet/rktshim`, and start deprecating the `pkg/kubelet/rkt` package.
-6. Eventually replace the current `pkg/kubelet/rkt` package.
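-
-To make the "exec the rkt binary" approach above concrete, here is a minimal sketch of launching a pod sandbox under `systemd-run` so that the init process, rather than the kubelet, supervises it. The unit name, slice, and rkt arguments are illustrative placeholders, not the actual rktlet implementation:
-
-```go
-package main
-
-import (
-	"fmt"
-	"os/exec"
-)
-
-// launchSandbox starts a rkt pod under systemd-run so that systemd, not the
-// kubelet, supervises the resulting process. All flags and arguments below
-// are placeholders for illustration only.
-func launchSandbox(podUID string) error {
-	unit := fmt.Sprintf("k8s-rkt-%s.service", podUID) // hypothetical unit naming
-	cmd := exec.Command(
-		"systemd-run",
-		"--unit="+unit,          // name the transient unit after the pod
-		"--slice=machine.slice", // parent the sandbox under machine.slice
-		"rkt", "run", "docker://busybox",
-	)
-	return cmd.Run()
-}
-
-func main() {
-	if err := launchSandbox("example-pod-uid"); err != nil {
-		fmt.Println("failed to launch sandbox:", err)
-	}
-}
-```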
- - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/kubelet-rkt-runtime.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-rkt-runtime.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-rkt-runtime.md) diff --git a/docs/proposals/kubelet-systemd.md b/docs/proposals/kubelet-systemd.md index b4277cfa92b..b835563482a 100644 --- a/docs/proposals/kubelet-systemd.md +++ b/docs/proposals/kubelet-systemd.md @@ -1,407 +1 @@ -# Kubelet and systemd interaction - -**Author**: Derek Carr (@derekwaynecarr) - -**Status**: Proposed - -## Motivation - -Many Linux distributions have either adopted, or plan to adopt `systemd` as their init system. - -This document describes how the node should be configured, and a set of enhancements that should -be made to the `kubelet` to better integrate with these distributions independent of container -runtime. - -## Scope of proposal - -This proposal does not account for running the `kubelet` in a container. - -## Background on systemd - -To help understand this proposal, we first provide a brief summary of `systemd` behavior. - -### systemd units - -`systemd` manages a hierarchy of `slice`, `scope`, and `service` units. - -* `service` - application on the server that is launched by `systemd`; how it should start/stop; -when it should be started; under what circumstances it should be restarted; and any resource -controls that should be applied to it. -* `scope` - a process or group of processes which are not launched by `systemd` (i.e. fork), like -a service, resource controls may be applied -* `slice` - organizes a hierarchy in which `scope` and `service` units are placed. a `slice` may -contain `slice`, `scope`, or `service` units; processes are attached to `service` and `scope` -units only, not to `slices`. The hierarchy is intended to be unified, meaning a process may -only belong to a single leaf node. - -### cgroup hierarchy: split versus unified hierarchies - -Classical `cgroup` hierarchies were split per resource group controller, and a process could -exist in different parts of the hierarchy. - -For example, a process `p1` could exist in each of the following at the same time: - -* `/sys/fs/cgroup/cpu/important/` -* `/sys/fs/cgroup/memory/unimportant/` -* `/sys/fs/cgroup/cpuacct/unimportant/` - -In addition, controllers for one resource group could depend on another in ways that were not -always obvious. - -For example, the `cpu` controller depends on the `cpuacct` controller yet they were treated -separately. - -Many found it confusing for a single process to belong to different nodes in the `cgroup` hierarchy -across controllers. - -The Kernel direction for `cgroup` support is to move toward a unified `cgroup` hierarchy, where the -per-controller hierarchies are eliminated in favor of hierarchies like the following: - -* `/sys/fs/cgroup/important/` -* `/sys/fs/cgroup/unimportant/` - -In a unified hierarchy, a process may only belong to a single node in the `cgroup` tree. - -### cgroupfs single writer - -The Kernel direction for `cgroup` management is to promote a single-writer model rather than -allowing multiple processes to independently write to parts of the file-system. - -In distributions that run `systemd` as their init system, the cgroup tree is managed by `systemd` -by default since it implicitly interacts with the cgroup tree when starting units. 
Manual changes -made by other cgroup managers to the cgroup tree are not guaranteed to be preserved unless `systemd` -is made aware. `systemd` can be told to ignore sections of the cgroup tree by configuring the unit -to have the `Delegate=` option. - -See: http://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#Delegate= - -### cgroup management with systemd and container runtimes - -A `slice` corresponds to an inner-node in the `cgroup` file-system hierarchy. - -For example, the `system.slice` is represented as follows: - -`/sys/fs/cgroup//system.slice` - -A `slice` is nested in the hierarchy by its naming convention. - -For example, the `system-foo.slice` is represented as follows: - -`/sys/fs/cgroup//system.slice/system-foo.slice/` - -A `service` or `scope` corresponds to leaf nodes in the `cgroup` file-system hierarchy managed by -`systemd`. Services and scopes can have child nodes managed outside of `systemd` if they have been -delegated with the `Delegate=` option. - -For example, if the `docker.service` is associated with the `system.slice`, it is -represented as follows: - -`/sys/fs/cgroup//system.slice/docker.service/` - -To demonstrate the use of `scope` units using the `docker` container runtime, if a -user launches a container via `docker run -m 100M busybox`, a `scope` will be created -because the process was not launched by `systemd` itself. The `scope` is parented by -the `slice` associated with the launching daemon. - -For example: - -`/sys/fs/cgroup//system.slice/docker-.scope` - -`systemd` defines a set of slices. By default, service and scope units are placed in -`system.slice`, virtual machines and containers registered with `systemd-machined` are -found in `machine.slice`, and user sessions handled by `systemd-logind` in `user.slice`. - -## Node Configuration on systemd - -### kubelet cgroup driver - -The `kubelet` reads and writes to the `cgroup` tree during bootstrapping -of the node. In the future, it will write to the `cgroup` tree to satisfy other -purposes around quality of service, etc. - -The `kubelet` must cooperate with `systemd` in order to ensure proper function of the -system. The bootstrapping requirements for a `systemd` system are different than one -without it. - -The `kubelet` will accept a new flag to control how it interacts with the `cgroup` tree. - -* `--cgroup-driver=` - cgroup driver used by the kubelet. `cgroupfs` or `systemd`. - -By default, the `kubelet` should default `--cgroup-driver` to `systemd` on `systemd` distributions. - -The `kubelet` should associate node bootstrapping semantics to the configured -`cgroup driver`. - -### Node allocatable - -The proposal makes no changes to the definition as presented here: -https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/node-allocatable.md - -The node will report a set of allocatable compute resources defined as follows: - -`[Allocatable] = [Node Capacity] - [Kube-Reserved] - [System-Reserved]` - -### Node capacity - -The `kubelet` will continue to interface with `cAdvisor` to determine node capacity. - -### System reserved - -The node may set aside a set of designated resources for non-Kubernetes components. - -The `kubelet` accepts the followings flags that support this feature: - -* `--system-reserved=` - A set of `ResourceName`=`ResourceQuantity` pairs that -describe resources reserved for host daemons. -* `--system-container=` - Optional resource-only container in which to place all -non-kernel processes that are not already in a container. 
Empty for no container. -Rolling back the flag requires a reboot. (Default: ""). - -The current meaning of `system-container` is inadequate on `systemd` environments. -The `kubelet` should use the flag to know the location that has the processes that -are associated with `system-reserved`, but it should not modify the cgroups of -existing processes on the system during bootstrapping of the node. This is -because `systemd` is the `cgroup manager` on the host and it has not delegated -authority to the `kubelet` to change how it manages `units`. - -The following describes the type of things that can happen if this does not change: -https://bugzilla.redhat.com/show_bug.cgi?id=1202859 - -As a result, the `kubelet` needs to distinguish placement of non-kernel processes -based on the cgroup driver, and only do its current behavior when not on `systemd`. - -The flag should be modified as follows: - -* `--system-container=` - Name of resource-only container that holds all -non-kernel processes whose resource consumption is accounted under -system-reserved. The default value is cgroup driver specific. systemd -defaults to system, cgroupfs defines no default. Rolling back the flag -requires a reboot. - -The `kubelet` will error if the defined `--system-container` does not exist -on `systemd` environments. It will verify that the appropriate `cpu` and `memory` -controllers are enabled. - -### Kubernetes reserved - -The node may set aside a set of resources for Kubernetes components: - -* `--kube-reserved=:` - A set of `ResourceName`=`ResourceQuantity` pairs that -describe resources reserved for host daemons. - -The `kubelet` does not enforce `--kube-reserved` at this time, but the ability -to distinguish the static reservation from observed usage is important for node accounting. - -This proposal asserts that `kubernetes.slice` is the default slice associated with -the `kubelet` and `kube-proxy` service units defined in the project. Keeping it -separate from `system.slice` allows for accounting to be distinguished separately. - -The `kubelet` will detect its `cgroup` to track `kube-reserved` observed usage on `systemd`. -If the `kubelet` detects that its a child of the `system-container` based on the observed -`cgroup` hierarchy, it will warn. - -If the `kubelet` is launched directly from a terminal, it's most likely destination will -be in a `scope` that is a child of `user.slice` as follows: - -`/sys/fs/cgroup//user.slice/user-1000.slice/session-1.scope` - -In this context, the parent `scope` is what will be used to facilitate local developer -debugging scenarios for tracking `kube-reserved` usage. - -The `kubelet` has the following flag: - -* `--resource-container="/kubelet":` Absolute name of the resource-only container to create -and run the Kubelet in (Default: /kubelet). - -This flag will not be supported on `systemd` environments since the init system has already -spawned the process and placed it in the corresponding container associated with its unit. - -### Kubernetes container runtime reserved - -This proposal asserts that the reservation of compute resources for any associated -container runtime daemons is tracked by the operator under the `system-reserved` or -`kubernetes-reserved` values and any enforced limits are set by the -operator specific to the container runtime. - -**Docker** - -If the `kubelet` is configured with the `container-runtime` set to `docker`, the -`kubelet` will detect the `cgroup` associated with the `docker` daemon and use that -to do local node accounting. 
If an operator wants to impose runtime limits on the -`docker` daemon to control resource usage, the operator should set those explicitly in -the `service` unit that launches `docker`. The `kubelet` will not set any limits itself -at this time and will assume whatever budget was set aside for `docker` was included in -either `--kube-reserved` or `--system-reserved` reservations. - -Many OS distributions package `docker` by default, and it will often belong to the -`system.slice` hierarchy, and therefore operators will need to budget it for there -by default unless they explicitly move it. - -**rkt** - -rkt has no client/server daemon, and therefore has no explicit requirements on container-runtime -reservation. - -### kubelet cgroup enforcement - -The `kubelet` does not enforce the `system-reserved` or `kube-reserved` values by default. - -The `kubelet` should support an additional flag to turn on enforcement: - -* `--system-reserved-enforce=false` - Optional flag that if true tells the `kubelet` -to enforce the `system-reserved` constraints defined (if any) -* `--kube-reserved-enforce=false` - Optional flag that if true tells the `kubelet` -to enforce the `kube-reserved` constraints defined (if any) - -Usage of this flag requires that end-user containers are launched in a separate part -of cgroup hierarchy via `cgroup-root`. - -If this flag is enabled, the `kubelet` will continually validate that the configured -resource constraints are applied on the associated `cgroup`. - -### kubelet cgroup-root behavior under systemd - -The `kubelet` supports a `cgroup-root` flag which is the optional root `cgroup` to use for pods. - -This flag should be treated as a pass-through to the underlying configured container runtime. - -If `--cgroup-enforce=true`, this flag warrants special consideration by the operator depending -on how the node was configured. For example, if the container runtime is `docker` and its using -the `systemd` cgroup driver, then `docker` will take the daemon wide default and launch containers -in the same slice associated with the `docker.service`. By default, this would mean `system.slice` -which could cause end-user pods to be launched in the same part of the cgroup hierarchy as system daemons. - -In those environments, it is recommended that `cgroup-root` is configured to be a subtree of `machine.slice`. - -### Proposed cgroup hierarchy - -``` -$ROOT - | - +- system.slice - | | - | +- sshd.service - | +- docker.service (optional) - | +- ... - | - +- kubernetes.slice - | | - | +- kubelet.service - | +- docker.service (optional) - | - +- machine.slice (container runtime specific) - | | - | +- docker-.scope - | - +- user.slice - | +- ... -``` - -* `system.slice` corresponds to `--system-reserved`, and contains any services the -operator brought to the node as normal configuration. -* `kubernetes.slice` corresponds to the `--kube-reserved`, and contains kube specific -daemons. -* `machine.slice` should parent all end-user containers on the system and serve as the -root of the end-user cluster workloads run on the system. -* `user.slice` is not explicitly tracked by the `kubelet`, but it is possible that `ssh` -sessions to the node where the user launches actions directly. Any resource accounting -reserved for those actions should be part of `system-reserved`. - -The container runtime daemon, `docker` in this outline, must be accounted for in either -`system.slice` or `kubernetes.slice`. 
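-
-As a minimal sketch of how a process can discover which cgroup it was placed in on a `systemd` host (useful for the `kube-reserved` accounting described above), it can parse `/proc/self/cgroup`. This is illustrative only; it ignores the unified-hierarchy case and does no fine-grained controller matching:
-
-```go
-package main
-
-import (
-	"bufio"
-	"fmt"
-	"os"
-	"strings"
-)
-
-// selfCgroup returns the cgroup path of the current process for the given
-// controller (e.g. "memory"), as recorded in /proc/self/cgroup. Lines look
-// like "4:memory:/system.slice/docker.service".
-func selfCgroup(controller string) (string, error) {
-	f, err := os.Open("/proc/self/cgroup")
-	if err != nil {
-		return "", err
-	}
-	defer f.Close()
-	scanner := bufio.NewScanner(f)
-	for scanner.Scan() {
-		parts := strings.SplitN(scanner.Text(), ":", 3)
-		if len(parts) == 3 && strings.Contains(parts[1], controller) {
-			return parts[2], nil
-		}
-	}
-	return "", fmt.Errorf("controller %q not found", controller)
-}
-
-func main() {
-	path, err := selfCgroup("memory")
-	if err != nil {
-		fmt.Println(err)
-		return
-	}
-	fmt.Println("memory cgroup:", path) // e.g. /kubernetes.slice/kubelet.service
-}
-```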
- -In the future, the depth of the container hierarchy is not recommended to be rooted -more than 2 layers below the root as it historically has caused issues with node performance -in other `cgroup` aware systems (https://bugzilla.redhat.com/show_bug.cgi?id=850718). It -is anticipated that the `kubelet` will parent containers based on quality of service -in the future. In that environment, those changes will be relative to the configured -`cgroup-root`. - -### Linux Kernel Parameters - -The `kubelet` will set the following: - -* `sysctl -w vm.overcommit_memory=1` -* `sysctl -w vm.panic_on_oom=0` -* `sysctl -w kernel/panic=10` -* `sysctl -w kernel/panic_on_oops=1` - -### OOM Score Adjustment - -The `kubelet` at bootstrapping will set the `oom_score_adj` value for Kubernetes -daemons, and any dependent container-runtime daemons. - -If `container-runtime` is set to `docker`, then set its `oom_score_adj=-999` - -## Implementation concerns - -### kubelet block-level architecture - -``` -+----------+ +----------+ +----------+ -| | | | | Pod | -| Node <-------+ Container<----+ Lifecycle| -| Manager | | Manager | | Manager | -| +-------> | | | -+---+------+ +-----+----+ +----------+ - | | - | | - | +-----------------+ - | | | - | | | -+---v--v--+ +-----v----+ -| cgroups | | container| -| library | | runtimes | -+---+-----+ +-----+----+ - | | - | | - +---------+----------+ - | - | - +-----------v-----------+ - | Linux Kernel | - +-----------------------+ -``` - -The `kubelet` should move to an architecture that resembles the above diagram: - -* The `kubelet` should not interface directly with the `cgroup` file-system, but instead -should use a common `cgroups library` that has the proper abstraction in place to -work with either `cgroupfs` or `systemd`. The `kubelet` should just use `libcontainer` -abstractions to facilitate this requirement. The `libcontainer` abstractions as -currently defined only support an `Apply(pid)` pattern, and we need to separate that -abstraction to allow cgroup to be created and then later joined. -* The existing `ContainerManager` should separate node bootstrapping into a separate -`NodeManager` that is dependent on the configured `cgroup-driver`. -* The `kubelet` flags for cgroup paths will convert internally as part of cgroup library, -i.e. `/foo/bar` will just convert to `foo-bar.slice` - -### kubelet accounting for end-user pods - -This proposal re-enforces that it is inappropriate at this time to depend on `--cgroup-root` as the -primary mechanism to distinguish and account for end-user pod compute resource usage. - -Instead, the `kubelet` can and should sum the usage of each running `pod` on the node to account for -end-user pod usage separate from system-reserved and kubernetes-reserved accounting via `cAdvisor`. - -## Known issues - -### Docker runtime support for --cgroup-parent - -Docker versions <= 1.0.9 did not have proper support for `-cgroup-parent` flag on `systemd`. This -was fixed in this PR (https://github.com/docker/docker/pull/18612). As result, it's expected -that containers launched by the `docker` daemon may continue to go in the default `system.slice` and -appear to be counted under system-reserved node usage accounting. - -If operators run with later versions of `docker`, they can avoid this issue via the use of `cgroup-root` -flag on the `kubelet`, but this proposal makes no requirement on operators to do that at this time, and -this can be revisited if/when the project adopts docker 1.10. 
- -Some OS distributions will fix this bug in versions of docker <= 1.0.9, so operators should -be aware of how their version of `docker` was packaged when using this feature. - - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/kubelet-systemd.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-systemd.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-systemd.md) diff --git a/docs/proposals/kubelet-tls-bootstrap.md b/docs/proposals/kubelet-tls-bootstrap.md index fbd9841371d..07a045c98d3 100644 --- a/docs/proposals/kubelet-tls-bootstrap.md +++ b/docs/proposals/kubelet-tls-bootstrap.md @@ -1,243 +1 @@ -# Kubelet TLS bootstrap - -Author: George Tankersley (george.tankersley@coreos.com) - -## Preface - -This document describes a method for a kubelet to bootstrap itself -into a TLS-secured cluster. Crucially, it automates the provision and -distribution of signed certificates. - -## Overview - -When a kubelet runs for the first time, it must be given TLS assets -or generate them itself. In the first case, this is a burden on the cluster -admin and a significant logistical barrier to secure Kubernetes rollouts. In -the second, the kubelet must self-sign its certificate and forfeits many of the -advantages of a PKI system. Instead, we propose that the kubelet generate a -private key and a CSR for submission to a cluster-level certificate signing -process. - -## Preliminaries - -We assume the existence of a functioning control plane. The -apiserver should be configured for TLS initially or possess the ability to -generate valid TLS credentials for itself. If secret information is passed in -the request (e.g. auth tokens supplied with the request or included in -ExtraInfo) then all communications from the node to the apiserver must take -place over a verified TLS connection. - -Each node is additionally provisioned with the following information: - -1. Location of the apiserver -2. Any CA certificates necessary to trust the apiserver's TLS certificate -3. Access tokens (if needed) to communicate with the CSR endpoint - -These should not change often and are thus simple to include in a static -provisioning script. - -## API Changes - -### CertificateSigningRequest Object - -We introduce a new API object to represent PKCS#10 certificate signing -requests. It will be accessible under: - -`/apis/certificates/v1beta1/certificatesigningrequests/mycsr` - -It will have the following structure: - -```go -// Describes a certificate signing request -type CertificateSigningRequest struct { - unversioned.TypeMeta `json:",inline"` - api.ObjectMeta `json:"metadata,omitempty"` - - // The certificate request itself and any additional information. - Spec CertificateSigningRequestSpec `json:"spec,omitempty"` - - // Derived information about the request. - Status CertificateSigningRequestStatus `json:"status,omitempty"` -} - -// This information is immutable after the request is created. -type CertificateSigningRequestSpec struct { - // Base64-encoded PKCS#10 CSR data - Request string `json:"request"` - - // Any extra information the node wishes to send with the request. - ExtraInfo []string `json:"extrainfo,omitempty"` -} - -// This information is derived from the request by Kubernetes and cannot be -// modified by users. All information is optional since it might not be -// available in the underlying request. 
This is intended to aid approval -// decisions. -type CertificateSigningRequestStatus struct { - // Information about the requesting user (if relevant) - // See user.Info interface for details - Username string `json:"username,omitempty"` - UID string `json:"uid,omitempty"` - Groups []string `json:"groups,omitempty"` - - // Fingerprint of the public key in request - Fingerprint string `json:"fingerprint,omitempty"` - - // Subject fields from the request - Subject internal.Subject `json:"subject,omitempty"` - - // DNS SANs from the request - Hostnames []string `json:"hostnames,omitempty"` - - // IP SANs from the request - IPAddresses []string `json:"ipaddresses,omitempty"` - - Conditions []CertificateSigningRequestCondition `json:"conditions,omitempty"` -} - -type RequestConditionType string - -// These are the possible states for a certificate request. -const ( - Approved RequestConditionType = "Approved" - Denied RequestConditionType = "Denied" -) - -type CertificateSigningRequestCondition struct { - // request approval state, currently Approved or Denied. - Type RequestConditionType `json:"type"` - // brief reason for the request state - Reason string `json:"reason,omitempty"` - // human readable message with details about the request state - Message string `json:"message,omitempty"` - // If request was approved, the controller will place the issued certificate here. - Certificate []byte `json:"certificate,omitempty"` -} - -type CertificateSigningRequestList struct { - unversioned.TypeMeta `json:",inline"` - unversioned.ListMeta `json:"metadata,omitempty"` - - Items []CertificateSigningRequest `json:"items,omitempty"` -} -``` - -We also introduce CertificateSigningRequestList to allow listing all the CSRs in the cluster: - -```go -type CertificateSigningRequestList struct { - api.TypeMeta - api.ListMeta - - Items []CertificateSigningRequest -} -``` - -## Certificate Request Process - -### Node intialization - -When the kubelet executes it checks a location on disk for TLS assets -(currently `/var/run/kubernetes/kubelet.{key,crt}` by default). If it finds -them, it proceeds. If there are no TLS assets, the kubelet generates a keypair -and self-signed certificate. We propose the following optional behavior: - -1. Generate a keypair -2. Generate a CSR for that keypair with CN set to the hostname (or - `--hostname-override` value) and DNS/IP SANs supplied with whatever values - the host knows for itself. -3. Post the CSR to the CSR API endpoint. -4. Set a watch on the CSR object to be notified of approval or rejection. - -### Controller response - -The apiserver persists the CertificateSigningRequests and exposes the List of -all CSRs for an administrator to approve or reject. - -A new certificate controller watches for certificate requests. It must first -validate the signature on each CSR and add `Condition=Denied` on -any requests with invalid signatures (with Reason and Message incidicating -such). For valid requests, the controller will derive the information in -`CertificateSigningRequestStatus` and update that object. The controller should -watch for updates to the approval condition of any CertificateSigningRequest. -When a request is approved (signified by Conditions containing only Approved) -the controller should generate and sign a certificate based on that CSR, then -update the condition with the certificate data using the `/approval` -subresource. 
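-
-For reference, a minimal sketch of the keypair and CSR generation performed in steps 1–2 of the node initialization flow above, using only the Go standard library; the SAN values and the exact encoding placed in the `Request` field are illustrative assumptions:
-
-```go
-package main
-
-import (
-	"crypto/ecdsa"
-	"crypto/elliptic"
-	"crypto/rand"
-	"crypto/x509"
-	"crypto/x509/pkix"
-	"encoding/base64"
-	"encoding/pem"
-	"fmt"
-	"net"
-	"os"
-)
-
-func main() {
-	hostname, _ := os.Hostname()
-
-	// 1. Generate a keypair that stays on the node.
-	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
-	if err != nil {
-		panic(err)
-	}
-
-	// 2. Build a PKCS#10 CSR with CN set to the hostname and illustrative SANs.
-	tmpl := &x509.CertificateRequest{
-		Subject:     pkix.Name{CommonName: hostname},
-		DNSNames:    []string{hostname},
-		IPAddresses: []net.IP{net.ParseIP("10.0.0.5")}, // placeholder node IP
-	}
-	der, err := x509.CreateCertificateRequest(rand.Reader, tmpl, key)
-	if err != nil {
-		panic(err)
-	}
-	csrPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE REQUEST", Bytes: der})
-
-	// The PEM (or its base64 form) would then be placed in the Request field
-	// of a CertificateSigningRequest object and posted to the CSR endpoint.
-	fmt.Println(base64.StdEncoding.EncodeToString(csrPEM))
-}
-```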
- -### Manual CSR approval - -An administrator using `kubectl` or another API client can query the -CertificateSigningRequestList and update the approval condition of -CertificateSigningRequests. The default state is empty, indicating that there -has been no decision so far. A state of "Approved" indicates that the admin has -approved the request and the certificate controller should issue the -certificate. A state of "Denied" indicates that admin has denied the -request. An admin may also supply Reason and Message fields to explain the -rejection. - -## kube-apiserver support - -The apiserver will present the new endpoints mentioned above and support the -relevant object types. - -## kube-controller-manager support - -To handle certificate issuance, the controller-manager will need access to CA -signing assets. This could be as simple as a private key and a config file or -as complex as a PKCS#11 client and supplementary policy system. For now, we -will add flags for a signing key, a certificate, and a basic policy file. - -## kubectl support - -To support manual CSR inspection and approval, we will add support for listing, -inspecting, and approving or denying CertificateSigningRequests to kubectl. The -interaction will be similar to -[salt-key](https://docs.saltstack.com/en/latest/ref/cli/salt-key.html). - -Specifically, the admin will have the ability to retrieve the full list of -pending CSRs, inspect their contents, and set their approval conditions to one -of: - -1. **Approved** if the controller should issue the cert -2. **Denied** if the controller should not issue the cert - -The suggested command for listing is `kubectl get csrs`. The approve/deny -interactions can be accomplished with normal updates, but would be more -conveniently accessed by direct subresource updates. We leave this for future -updates to kubectl. - -## Security Considerations - -### Endpoint Access Control - -The ability to post CSRs to the signing endpoint should be controlled. As a -simple solution we propose that each node be provisioned with an auth token -(possibly static across the cluster) that is scoped via ABAC to only allow -access to the CSR endpoint. - -### Expiration & Revocation - -The node is responsible for monitoring its own certificate expiration date. -When the certificate is close to expiration, the kubelet should begin repeating -this flow until it successfully obtains a new certificate. If the expiring -certificate has not been revoked and the previous certificate request is still -approved, then it may do so using the same keypair unless the cluster policy -(see "Future Work") requires fresh keys. - -Revocation is for the most part an unhandled problem in Go, requiring each -application to produce its own logic around a variety of parsing functions. For -now, our suggested best practice is to issue only short-lived certificates. In -the future it may make sense to add CRL support to the apiserver's client cert -auth. - -## Future Work - -- revocation UI in kubectl and CRL support at the apiserver -- supplemental policy (e.g. cluster CA only issues 30-day certs for hostnames *.k8s.example.com, each new cert must have fresh keys, ...) 
-- fully automated provisioning (using a handshake protocol or external list of authorized machines) - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/kubelet-tls-bootstrap.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-tls-bootstrap.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-tls-bootstrap.md) diff --git a/docs/proposals/kubemark.md b/docs/proposals/kubemark.md index 1f28e2b01ac..13a3f348b0a 100644 --- a/docs/proposals/kubemark.md +++ b/docs/proposals/kubemark.md @@ -1,157 +1 @@ -# Kubemark proposal - -## Goal of this document - -This document describes a design of Kubemark - a system that allows performance testing of a Kubernetes cluster. It describes the -assumption, high level design and discusses possible solutions for lower-level problems. It is supposed to be a starting point for more -detailed discussion. - -## Current state and objective - -Currently performance testing happens on ‘live’ clusters of up to 100 Nodes. It takes quite a while to start such cluster or to push -updates to all Nodes, and it uses quite a lot of resources. At this scale the amount of wasted time and used resources is still acceptable. -In the next quarter or two we’re targeting 1000 Node cluster, which will push it way beyond ‘acceptable’ level. Additionally we want to -enable people without many resources to run scalability tests on bigger clusters than they can afford at given time. Having an ability to -cheaply run scalability tests will enable us to run some set of them on "normal" test clusters, which in turn would mean ability to run -them on every PR. - -This means that we need a system that will allow for realistic performance testing on (much) smaller number of “real” machines. First -assumption we make is that Nodes are independent, i.e. number of existing Nodes do not impact performance of a single Node. This is not -entirely true, as number of Nodes can increase latency of various components on Master machine, which in turn may increase latency of Node -operations, but we’re not interested in measuring this effect here. Instead we want to measure how number of Nodes and the load imposed by -Node daemons affects the performance of Master components. - -## Kubemark architecture overview - -The high-level idea behind Kubemark is to write library that allows running artificial "Hollow" Nodes that will be able to simulate a -behavior of real Kubelet and KubeProxy in a single, lightweight binary. Hollow components will need to correctly respond to Controllers -(via API server), and preferably, in the fullness of time, be able to ‘replay’ previously recorded real traffic (this is out of scope for -initial version). To teach Hollow components replaying recorded traffic they will need to store data specifying when given Pod/Container -should die (e.g. observed lifetime). Such data can be extracted e.g. from etcd Raft logs, or it can be reconstructed from Events. In the -initial version we only want them to be able to fool Master components and put some configurable (in what way TBD) load on them. - -When we have Hollow Node ready, we’ll be able to test performance of Master Components by creating a real Master Node, with API server, -Controllers, etcd and whatnot, and create number of Hollow Nodes that will register to the running Master. 
- -To make Kubemark easier to maintain when system evolves Hollow components will reuse real "production" code for Kubelet and KubeProxy, but -will mock all the backends with no-op or very simple mocks. We believe that this approach is better in the long run than writing special -"performance-test-aimed" separate version of them. This may take more time to create an initial version, but we think maintenance cost will -be noticeably smaller. - -### Option 1 - -For the initial version we will teach Master components to use port number to identify Kubelet/KubeProxy. This will allow running those -components on non-default ports, and in the same time will allow to run multiple Hollow Nodes on a single machine. During setup we will -generate credentials for cluster communication and pass them to HollowKubelet/HollowProxy to use. Master will treat all HollowNodes as -normal ones. - -![Kubmark architecture diagram for option 1](Kubemark_architecture.png?raw=true "Kubemark architecture overview") -*Kubmark architecture diagram for option 1* - -### Option 2 - -As a second (equivalent) option we will run Kubemark on top of 'real' Kubernetes cluster, where both Master and Hollow Nodes will be Pods. -In this option we'll be able to use Kubernetes mechanisms to streamline setup, e.g. by using Kubernetes networking to ensure unique IPs for -Hollow Nodes, or using Secrets to distribute Kubelet credentials. The downside of this configuration is that it's likely that some noise -will appear in Kubemark results from either CPU/Memory pressure from other things running on Nodes (e.g. FluentD, or Kubelet) or running -cluster over an overlay network. We believe that it'll be possible to turn off cluster monitoring for Kubemark runs, so that the impact -of real Node daemons will be minimized, but we don't know what will be the impact of using higher level networking stack. Running a -comparison will be an interesting test in itself. - -### Discussion - -Before taking a closer look at steps necessary to set up a minimal Hollow cluster it's hard to tell which approach will be simpler. It's -quite possible that the initial version will end up as hybrid between running the Hollow cluster directly on top of VMs and running the -Hollow cluster on top of a Kubernetes cluster that is running on top of VMs. E.g. running Nodes as Pods in Kubernetes cluster and Master -directly on top of VM. - -## Things to simulate - -In real Kubernetes on a single Node we run two daemons that communicate with Master in some way: Kubelet and KubeProxy. - -### KubeProxy - -As a replacement for KubeProxy we'll use HollowProxy, which will be a real KubeProxy with injected no-op mocks everywhere it makes sense. - -### Kubelet - -As a replacement for Kubelet we'll use HollowKubelet, which will be a real Kubelet with injected no-op or simple mocks everywhere it makes -sense. - -Kubelet also exposes cadvisor endpoint which is scraped by Heapster, healthz to be read by supervisord, and we have FluentD running as a -Pod on each Node that exports logs to Elasticsearch (or Google Cloud Logging). Both Heapster and Elasticsearch are running in Pods in the -cluster so do not add any load on a Master components by themselves. There can be other systems that scrape Heapster through proxy running -on Master, which adds additional load, but they're not the part of default setup, so in the first version we won't simulate this behavior. - -In the first version we’ll assume that all started Pods will run indefinitely if not explicitly deleted. 
In the future we can add a model
-of short-running batch jobs, but in the initial version we’ll assume only serving-like Pods.
-
-### Heapster
-
-In addition to the system components, we run Heapster as part of the cluster monitoring setup. Heapster currently watches Events, Pods and Nodes
-through the API server. In the test setup we can use the real Heapster for watching the API server, with the piece that scrapes cAdvisor
-data from Kubelets mocked out.
-
-### Elasticsearch and Fluentd
-
-Similarly to Heapster, Elasticsearch runs outside the Master machine but generates some traffic on it. The Fluentd “daemon” running on the Master
-periodically sends the Docker logs it gathered to the Elasticsearch instance running on one of the Nodes. In the initial version we omit Elasticsearch,
-as it produces only a constant small load on the Master Node that does not change with the size of the cluster.
-
-## Necessary work
-
-There are three more or less independent things that need to be worked on:
-- HollowNode implementation, creating a library/binary that will be able to listen to Watches and respond in a correct fashion with Status
-updates. This also involves creation of a CloudProvider that can produce such Hollow Nodes, or making sure that HollowNodes can correctly
-self-register with a no-provider Master.
-- Kubemark setup, including figuring out the networking model, the number of Hollow Nodes that will be allowed to run on a single “machine”, writing
-setup/run/teardown scripts (in [option 1](#option-1)), or figuring out how to run Master and Hollow Nodes on top of Kubernetes
-(in [option 2](#option-2))
-- Creating a Player component that will send requests to the API server, putting load on the cluster. This involves creating a way to
-specify the desired workload. This task is
-very well isolated from the rest, as it is about sending requests to the real API server. Because of that we can discuss its requirements
-separately.
-
-## Concerns
-
-Network performance most likely won't be a problem for the initial version if running directly on VMs rather than on top of a Kubernetes
-cluster, as Kubemark will be running on the standard networking stack (no cloud-provider software routes or overlay network are needed, as we
-don't need custom routing between Pods). Similarly, we don't think that running Kubemark on Kubernetes' virtualized cluster networking will
-cause a noticeable performance impact, but it requires testing.
-
-On the other hand, when adding additional features it may turn out that we need to simulate the Kubernetes Pod network. In that case, when running
-'pure' Kubemark we may try one of the following:
- - running an overlay network like Flannel or OVS instead of using cloud provider routes,
- - writing a simple network multiplexer to multiplex communications from the Hollow Kubelets/KubeProxies on the machine.
-
-In the case of Kubemark on Kubernetes, it may turn out that we run into a problem with adding yet another layer of network virtualization, but we
-don't need to solve this problem now.
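-
-Before the work plan, here is a minimal illustration of the "no-op mock" pattern the Hollow components rely on. The interface and type names below are purely illustrative; they are not the real kubelet or kube-proxy interfaces:
-
-```go
-package hollow
-
-// ContainerOps stands in for a backend interface that production code depends
-// on; the real kubelet interfaces are much larger, this is illustrative only.
-type ContainerOps interface {
-	Start(id string) error
-	Stop(id string) error
-}
-
-// noOpContainerOps accepts every call and does nothing, so the production
-// code paths above it (watches, status updates) still exercise the Master.
-type noOpContainerOps struct{}
-
-func (noOpContainerOps) Start(id string) error { return nil }
-func (noOpContainerOps) Stop(id string) error  { return nil }
-
-// NewHollowBackend returns the no-op implementation that a HollowKubelet
-// would inject in place of a real container runtime backend.
-func NewHollowBackend() ContainerOps { return noOpContainerOps{} }
-```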
-
-
-## Work plan
-
-- Teach/make sure that Master can talk to multiple Kubelets on the same machine [option 1](#option-1):
-  - make sure that Master can talk to a Kubelet on a non-default port,
-  - make sure that Master can talk to all Kubelets on different ports,
-- Write the HollowNode library:
-  - new HollowProxy,
-  - new HollowKubelet,
-  - new HollowNode combining the two,
-  - make sure that Master can talk to two HollowKubelets running on the same machine
-- Make sure that we can run a Hollow cluster on top of Kubernetes [option 2](#option-2)
-- Write a Player that will automatically put some predefined load on Master, <- this is the moment when it’s possible to play with it and it is useful by itself for
-scalability tests. Alternatively we can just use the current density/load tests,
-- Benchmark our machines - see how many Watch clients we can have before everything explodes,
-- See how many HollowNodes we can run on a single machine by attaching them to the real Master <- this is the moment it starts to be useful
-- Update kube-up/kube-down scripts to enable creating “HollowClusters”/write new scripts/something, and integrate the HollowCluster with Elasticsearch/Heapster equivalents,
-- Allow passing custom configuration to the Player
-
-## Future work
-
-In the future we want to add the following capabilities to the Kubemark system:
-- replaying real traffic reconstructed from the recorded Events stream,
-- simulating scraping of things running on Nodes through the Master proxy.
-
-
-
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/kubemark.md?pixel)]()
-
+This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubemark.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubemark.md)
diff --git a/docs/proposals/local-cluster-ux.md b/docs/proposals/local-cluster-ux.md
index c78a51b7ab8..2a405741e62 100644
--- a/docs/proposals/local-cluster-ux.md
+++ b/docs/proposals/local-cluster-ux.md
@@ -1,161 +1 @@
-# Kubernetes Local Cluster Experience
-
-This proposal attempts to improve the existing local cluster experience for Kubernetes.
-The current local cluster experience is sub-par and often not functional.
-There are several options to set up a local cluster (docker, vagrant, linux processes, etc.) and we do not test any of them continuously.
-Here are some highlighted issues:
-- The Docker-based solution breaks with Docker upgrades, does not support DNS, and many kubelet features are not functional yet inside a container.
-- Vagrant-based solutions are too heavy and have mostly failed on OS X.
-- The local Linux cluster is poorly documented and undiscoverable.
-End users simply want to run a Kubernetes cluster. They care less about *how* a cluster is set up locally and more about what they can do with a functional cluster.
-
-
-## Primary Goals
-
-At a high level, the goal is to make it easy for a new user to run a Kubernetes cluster and play with curated examples that require the least amount of knowledge about Kubernetes.
-These examples will only use kubectl, and only a subset of the available Kubernetes features will be exposed.
-
-- Works across multiple OSes - OS X, Linux and Windows primarily.
-- Single-command setup and teardown UX.
-- Unified UX across OSes.
-- Minimal dependencies on third-party software.
-- Minimal resource overhead.
-- Eliminate any other alternatives to local cluster deployment.
- -## Secondary Goals - -- Enable developers to use the local cluster for kubernetes development. - -## Non Goals - -- Simplifying kubernetes production deployment experience. [Kube-deploy](https://github.com/kubernetes/kube-deploy) is attempting to tackle this problem. -- Supporting all possible deployment configurations of Kubernetes like various types of storage, networking, etc. - - -## Local cluster requirements - -- Includes all the master components & DNS (Apiserver, scheduler, controller manager, etcd and kube dns) -- Basic auth -- Service accounts should be setup -- Kubectl should be auto-configured to use the local cluster -- Tested & maintained as part of Kubernetes core - -## Existing solutions - -Following are some of the existing solutions that attempt to simplify local cluster deployments. - -### [Spread](https://github.com/redspread/spread) - -Spread's UX is great! -It is adapted from monokube and includes DNS as well. -It satisfies almost all the requirements, excepting that of requiring docker to be pre-installed. -It has a loose dependency on docker. -New releases of docker might break this setup. - -### [Kmachine](https://github.com/skippbox/kmachine) - -Kmachine is adapted from docker-machine. -It exposes the entire docker-machine CLI. -It is possible to repurpose Kmachine to meet all our requirements. - -### [Monokube](https://github.com/polvi/monokube) - -Single binary that runs all kube master components. -Does not include DNS. -This is only a part of the overall local cluster solution. - -### Vagrant - -The kube-up.sh script included in Kubernetes release supports a few Vagrant based local cluster deployments. -kube-up.sh is not user friendly. -It typically takes a long time for the cluster to be set up using vagrant and often times is unsuccessful on OS X. -The [Core OS single machine guide](https://coreos.com/kubernetes/docs/latest/kubernetes-on-vagrant-single.html) uses Vagrant as well and it just works. -Since we are targeting a single command install/teardown experience, vagrant needs to be an implementation detail and not be exposed to our users. - -## Proposed Solution - -To avoid exposing users to third party software and external dependencies, we will build a toolbox that will be shipped with all the dependencies including all kubernetes components, hypervisor, base image, kubectl, etc. -*Note: Docker provides a [similar toolbox](https://www.docker.com/products/docker-toolbox).* -This "Localkube" tool will be referred to as "Minikube" in this proposal to avoid ambiguity against Spread's existing ["localkube"](https://github.com/redspread/localkube). -The final name of this tool is TBD. Suggestions are welcome! - -Minikube will provide a unified CLI to interact with the local cluster. -The CLI will support only a few operations: - - **Start** - creates & starts a local cluster along with setting up kubectl & networking (if necessary) - - **Stop** - suspends the local cluster & preserves cluster state - - **Delete** - deletes the local cluster completely - - **Upgrade** - upgrades internal components to the latest available version (upgrades are not guaranteed to preserve cluster state) - -For running and managing the kubernetes components themselves, we can re-use [Spread's localkube](https://github.com/redspread/localkube). -Localkube is a self-contained go binary that includes all the master components including DNS and runs them using multiple go threads. -Each Kubernetes release will include a localkube binary that has been tested exhaustively. 
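-
-As a rough illustration of the CLI operations listed above, the tool's surface could look like the sketch below. The use of the cobra library and the stub command bodies are assumptions made for illustration, not a prescribed implementation:
-
-```go
-package main
-
-import (
-	"fmt"
-
-	"github.com/spf13/cobra"
-)
-
-// A stub of the four proposed operations; each subcommand only prints a
-// message in this sketch.
-func main() {
-	root := &cobra.Command{Use: "minikube"}
-	for _, op := range []string{"start", "stop", "delete", "upgrade"} {
-		op := op // capture loop variable for the closure below
-		root.AddCommand(&cobra.Command{
-			Use:   op,
-			Short: op + " the local cluster",
-			Run: func(cmd *cobra.Command, args []string) {
-				fmt.Println(op, "is not implemented in this sketch")
-			},
-		})
-	}
-	if err := root.Execute(); err != nil {
-		fmt.Println(err)
-	}
-}
-```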
- -To support Windows and OS X, minikube will use [libmachine](https://github.com/docker/machine/tree/master/libmachine) internally to create and destroy virtual machines. -Minikube will be shipped with an hypervisor (virtualbox) in the case of OS X. -Minikube will include a base image that will be well tested. - -In the case of Linux, since the cluster can be run locally, we ideally want to avoid setting up a VM. -Since docker is the only fully supported runtime as of Kubernetes v1.2, we can initially use docker to run and manage localkube. -There is risk of being incompatible with the existing version of docker. -By using a VM, we can avoid such incompatibility issues though. -Feedback from the community will be helpful here. - -If the goal is to run outside of a VM, we can have minikube prompt the user if docker is unavailable or version is incompatible. -Alternatives to docker for running the localkube core includes using [rkt](https://coreos.com/rkt/docs/latest/), setting up systemd services, or a System V Init script depending on the distro. - -To summarize the pipeline is as follows: - -##### OS X / Windows - -minikube -> libmachine -> virtualbox/hyper V -> linux VM -> localkube - -##### Linux - -minikube -> docker -> localkube - -### Alternatives considered - -#### Bring your own docker - -##### Pros - -- Kubernetes users will probably already have it -- No extra work for us -- Only one VM/daemon, we can just reuse the existing one - -##### Cons - -- Not designed to be wrapped, may be unstable -- Might make configuring networking difficult on OS X and Windows -- Versioning and updates will be challenging. We can mitigate some of this with testing at HEAD, but we'll - inevitably hit situations where it's infeasible to work with multiple versions of docker. -- There are lots of different ways to install docker, networking might be challenging if we try to support many paths. - -#### Vagrant - -##### Pros - -- We control the entire experience -- Networking might be easier to build -- Docker can't break us since we'll include a pinned version of Docker -- Easier to support rkt or hyper in the future -- Would let us run some things outside of containers (kubelet, maybe ingress/load balancers) - -##### Cons - -- More work -- Extra resources (if the user is also running docker-machine) -- Confusing if there are two docker daemons (images built in one can't be run in another) -- Always needs a VM, even on Linux -- Requires installing and possibly understanding Vagrant. - -## Releases & Distribution - -- Minikube will be released independent of Kubernetes core in order to facilitate fixing of issues that are outside of Kubernetes core. -- The latest version of Minikube is guaranteed to support the latest release of Kubernetes, including documentation. -- The Google Cloud SDK will package minikube and provide utilities for configuring kubectl to use it, but will not in any other way wrap minikube. 
- - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/local-cluster-ux.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/local-cluster-ux.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/local-cluster-ux.md) diff --git a/docs/proposals/multi-platform.md b/docs/proposals/multi-platform.md index 36eacefa220..78010c4f6f9 100644 --- a/docs/proposals/multi-platform.md +++ b/docs/proposals/multi-platform.md @@ -1,532 +1 @@ -# Kubernetes for multiple platforms - -**Author**: Lucas Käldström ([@luxas](https://github.com/luxas)) - -**Status** (25th of August 2016): Some parts are already implemented; but still there quite a lot of work to be done. - -## Abstract - -We obviously want Kubernetes to run on as many platforms as possible, in order to make Kubernetes a even more powerful system. -This is a proposal that explains what should be done in order to achieve a true cross-platform container management system. - -Kubernetes is written in Go, and Go code is portable across platforms. -Docker and rkt are also written in Go, and it's already possible to use them on various platforms. -When it's possible to run containers on a specific architecture, people also want to use Kubernetes to manage the containers. - -In this proposal, a `platform` is defined as `operating system/architecture` or `${GOOS}/${GOARCH}` in Go terms. - -The following platforms are proposed to be built for in a Kubernetes release: - - linux/amd64 - - linux/arm (GOARM=6 initially, but we probably have to bump this to GOARM=7 due to that the most of other ARM things are ARMv7) - - linux/arm64 - - linux/ppc64le - -If there's interest in running Kubernetes on `linux/s390x` too, it won't require many changes to the source now when we've laid the ground for a multi-platform Kubernetes already. - -There is also work going on with porting Kubernetes to Windows (`windows/amd64`). See [this issue](https://github.com/kubernetes/kubernetes/issues/22623) for more details. - -But note that when porting to a new OS like windows, a lot of os-specific changes have to be implemented before cross-compiling, releasing and other concerns this document describes may apply. - -## Motivation - -Then the question probably is: Why? - -In fact, making it possible to run Kubernetes on other platforms will enable people to create customized and highly-optimized solutions that exactly fits their hardware needs. - -Example: [Paypal validates arm64 for real-time data analysis](http://www.datacenterdynamics.com/content-tracks/servers-storage/paypal-successfully-tests-arm-based-servers/93835.fullarticle) - -Also, by including other platforms to the Kubernetes party a healthy competition between platforms can/will take place. - -Every platform obviously has both pros and cons. By adding the option to make clusters of mixed platforms, the end user may take advantage of the good sides of every platform. - -## Use Cases - -For a large enterprise where computing power is the king, one may imagine the following combinations: - - `linux/amd64`: For running most of the general-purpose computing tasks, cluster addons, etc. - - `linux/ppc64le`: For running highly-optimized software; especially massive compute tasks - - `windows/amd64`: For running services that are only compatible on windows; e.g. 
business applications written in C# .NET - -For a mid-sized business where efficiency is most important, these could be combinations: - - `linux/amd64`: For running most of the general-purpose computing tasks, plus tasks that require very high single-core performance. - - `linux/arm64`: For running webservices and high-density tasks => the cluster could autoscale in a way that `linux/amd64` machines could hibernate at night in order to minimize power usage. - -For a small business or university, arm is often sufficient: - - `linux/arm`: Draws very little power, and can run web sites and app backends efficiently on Scaleway for example. - -And last but not least; Raspberry Pi's should be used for [education at universities](http://kubecloud.io/) and are great for **demoing Kubernetes' features at conferences.** - -## Main proposal - -### Release binaries for all platforms - -First and foremost, binaries have to be released for all platforms. -This affects the build-release tools. Fortunately, this is quite straightforward to implement, once you understand how Go cross-compilation works. - -Since Kubernetes' release and build jobs run on `linux/amd64`, binaries have to be cross-compiled and Docker images should be cross-built. -Builds should be run in a Docker container in order to get reproducible builds; and `gcc` should be installed for all platforms inside that image (`kube-cross`) - -All released binaries should be uploaded to `https://storage.googleapis.com/kubernetes-release/release/${version}/bin/${os}/${arch}/${binary}` - -This is a fairly long topic. If you're interested how to cross-compile, see [details about cross-compilation](#cross-compilation-details) - -### Support all platforms in a "run everywhere" deployment - -The easiest way of running Kubernetes on another architecture at the time of writing is probably by using the docker-multinode deployment. Of course, you may choose whatever deployment you want, the binaries are easily downloadable from the URL above. - -[docker-multinode](https://github.com/kubernetes/kube-deploy/tree/master/docker-multinode) is intended to be a "kick-the-tires" multi-platform solution with Docker as the only real dependency (but it's not production ready) - -But when we (`sig-cluster-lifecycle`) have standardized the deployments to about three and made them production ready; at least one deployment should support **all platforms**. - -### Set up a build and e2e CI's - -#### Build CI - -Kubernetes should always enforce that all binaries are compiling. -**On every PR, `make release` have to be run** in order to require the code proposed to be merged to be compatible for all architectures. - -For more information, see [conflicts](#conflicts) - -#### e2e CI - -To ensure all functionality really is working on all other platforms, the community should be able to setup a CI. -To be able to do that, all the test-specific images have to be ported to multiple architectures, and the test images should preferably be manifest lists. -If the test images aren't manifest lists, the test code should automatically choose the right image based on the image naming. - -IBM volunteered to run continuously running e2e tests for `linux/ppc64le`. -Still it's hard to set up a such CI (even on `linux/amd64`), but that work belongs to `kubernetes/test-infra` proposals. - -When it's possible to test Kubernetes using Kubernetes; volunteers should be given access to publish their results on `k8s-testgrid.appspot.com`. 
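-
-As a small illustration of choosing the right image based on the image naming, as mentioned in the e2e CI section above, test code could pick an arch-suffixed image with a helper along these lines (the registry layout and function are illustrative, not an existing helper in the test framework):
-
-```go
-package e2eimages
-
-import (
-	"fmt"
-	"runtime"
-)
-
-// ImageFor returns an arch-suffixed image name such as
-// "gcr.io/google_containers/busybox-arm64:1.0" when the test binary runs on
-// arm64. Once the images are manifest lists, the suffix becomes unnecessary.
-func ImageFor(registry, name, version string) string {
-	return fmt.Sprintf("%s/%s-%s:%s", registry, name, runtime.GOARCH, version)
-}
-```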
-
-### Official support level
-
-When all e2e tests are passing for a given platform, the platform should be officially supported by the Kubernetes team.
-At the time of writing, `amd64` is in the officially supported category.
-
-When a platform builds and it's possible to set up a cluster with the core functionality, the platform is supported on a "best-effort" and experimental basis.
-At the time of writing, `arm`, `arm64` and `ppc64le` are in the experimental category; the e2e tests aren't cross-platform yet.
-
-### Docker image naming and manifest lists
-
-#### Docker manifest lists
-
-Here's a good article about how the "manifest list" in the Docker image [manifest spec v2](https://github.com/docker/distribution/pull/1068) works: [A step towards multi-platform Docker images](https://integratedcode.us/2016/04/22/a-step-towards-multi-platform-docker-images/)
-
-A short summary: a manifest list is a list of Docker images with a single name (e.g. `busybox`) that holds layers for multiple platforms _when it's stored in a registry_.
-When the image is pulled by a client (`docker pull busybox`), only the layers for the target platform are downloaded.
-Right now we have to write `busybox-${ARCH}` instead, which leads to extra scripting and unnecessary logic.
-
-For reference see [docker/docker#24739](https://github.com/docker/docker/issues/24739) and [appc/docker2aci#193](https://github.com/appc/docker2aci/issues/193)
-
-#### Image naming
-
-There has been quite a lot of debate about how we should name non-amd64 docker images that are pushed to `gcr.io`. See [#23059](https://github.com/kubernetes/kubernetes/pull/23059) and [#23009](https://github.com/kubernetes/kubernetes/pull/23009).
-
-The outcome is that the `gcr.io/google_containers/${binary}:${version}` name should contain a _manifest list_ for future tags.
-The manifest list thereby becomes a wrapper that points to the `-${arch}` images.
-This requires `docker-1.10` or newer, which probably means Kubernetes v1.4 and higher.
-
-TL;DR:
- - `${binary}-${arch}:${version}` images should be pushed for all platforms
- - `${binary}:${version}` images should point to the `-${arch}`-specific ones, and docker will then download the right image.
-
-### Components should expose their platform
-
-It should be possible to run clusters with mixed platforms smoothly. After all, bringing heterogeneous machines together into a single unit (a cluster) is one of Kubernetes' greatest strengths. And since the Kubernetes components communicate over HTTP, two binaries of different architectures may talk to each other normally.
-
-The crucial thing here is that the components that handle platform-specific tasks (e.g. kubelet) should expose their platform. In the kubelet case, we've initially solved it by exposing the labels `beta.kubernetes.io/{os,arch}` on every node. This way a user may run binaries for different platforms on a multi-platform cluster, but it still requires manual work to apply the label to every manifest.
-
-Also, [the apiserver now exposes](https://github.com/kubernetes/kubernetes/pull/19905) its platform at `GET /version`. But note that the value exposed at `/version` is only the apiserver's platform; there might be kubelets of various other platforms.
-
-### Standardize all image Makefiles to follow the same pattern
-
-All Makefiles should push for all platforms when doing `make push`, and build for all platforms when doing `make build`.
-Under the hood; they should compile binaries in a container for reproducability, and use QEMU for emulating Dockerfile `RUN` commands if necessary. - -### Remove linux/amd64 hard-codings from the codebase - -All places where `linux/amd64` is hardcoded in the codebase should be rewritten. - -#### Make kubelet automatically use the right pause image - -The `pause` is used for connecting containers into Pods. It's a binary that just sleeps forever. -When Kubernetes starts up a Pod, it first starts a `pause` container, and let's all "real" containers join the same network by setting `--net=${pause_container_id}`. - -So in order to start Kubernetes Pods on any other architecture, an ever-sleeping image have to exist. - -Fortunately, `kubelet` has the `--pod-infra-container-image` option, and it has been used when running Kubernetes on other platforms. - -But relying on the deployment setup to specify the right image for the platform isn't great, the kubelet should be smarter than that. - -This specific problem has been fixed in [#23059](https://github.com/kubernetes/kubernetes/pull/23059). - -#### Vendored packages - -Here are two common problems that a vendored package might have when trying to add/update it: - - Including constants combined with build tags - -```go -//+ build linux,amd64 -const AnAmd64OnlyConstant = 123 -``` - - - Relying on platform-specific syscalls (e.g. `syscall.Dup2`) - -If someone tries to add a dependency that doesn't satisfy these requirements; the CI will catch it and block the PR until the author has updated the vendored repo and fixed the problem. - -### kubectl should be released for all platforms that are relevant - -kubectl is released for more platforms than the proposed server platforms, if you want to check out an up-to-date list of them, [see here](../../hack/lib/golang.sh). - -kubectl is trivial to cross-compile, so if there's interest in adding a new platform for it, it may be as easy as appending the platform to the list linked above. - -### Addons - -Addons like dns, heapster and ingress play a big role in a working Kubernetes cluster, and we should aim to be able to deploy these addons on multiple platforms too. - -`kube-dns`, `dashboard` and `addon-manager` are the most important images, and they are already ported for multiple platforms. - -These addons should also be converted to multiple platforms: - - heapster, influxdb + grafana - - nginx-ingress - - elasticsearch, fluentd + kibana - - registry - -### Conflicts - -What should we do if there's a conflict between keeping e.g. `linux/ppc64le` builds vs. merging a release blocker? - -In fact, we faced this problem while this proposal was being written; in [#25243](https://github.com/kubernetes/kubernetes/pull/25243). It is quite obvious that the release blocker is of higher priority. - -However, before temporarily [deactivating builds](https://github.com/kubernetes/kubernetes/commit/2c9b83f291e3e506acc3c08cd10652c255f86f79), the author of the breaking PR should first try to fix the problem. If it turns out being really hard to solve, builds for the affected platform may be deactivated and a P1 issue should be made to activate them again. - -## Cross-compilation details (for reference) - -### Go language details - -Go 1.5 introduced many changes. To name a few that are relevant to Kubernetes: - - C was eliminated from the tree (it was earlier used for the bootstrap runtime). 
- - All processors are used by default, which means we should be able to remove [lines like this one](https://github.com/kubernetes/kubernetes/blob/v1.2.0/cmd/kubelet/kubelet.go#L37) - - The garbage collector became more efficent (but also [confused our latency test](https://github.com/golang/go/issues/14396)). - - `linux/arm64` and `linux/ppc64le` were added as new ports. - - The `GO15VENDOREXPERIMENT` was started. We switched from `Godeps/_workspace` to the native `vendor/` in [this PR](https://github.com/kubernetes/kubernetes/pull/24242). - - It's not required to pre-build the whole standard library `std` when cross-compliling. [Details](#prebuilding-the-standard-library-std) - - Builds are approximately twice as slow as earlier. That affects the CI. [Details](#releasing) - - The native Go DNS resolver will suffice in the most situations. This makes static linking much easier. - -All release notes for Go 1.5 [are here](https://golang.org/doc/go1.5) - -Go 1.6 didn't introduce as many changes as Go 1.5 did, but here are some of note: - - It should perform a little bit better than Go 1.5. - - `linux/mips64` and `linux/mips64le` were added as new ports. - - Go < 1.6.2 for `ppc64le` had [bugs in it](https://github.com/kubernetes/kubernetes/issues/24922). - -All release notes for Go 1.6 [are here](https://golang.org/doc/go1.6) - -In Kubernetes 1.2, the only supported Go version was `1.4.2`, so `linux/arm` was the only possible extra architecture: [#19769](https://github.com/kubernetes/kubernetes/pull/19769). -In Kubernetes 1.3, [we upgraded to Go 1.6](https://github.com/kubernetes/kubernetes/pull/22149), which made it possible to build Kubernetes for even more architectures [#23931](https://github.com/kubernetes/kubernetes/pull/23931). - -#### The `sync/atomic` bug on 32-bit platforms - -From https://golang.org/pkg/sync/atomic/#pkg-note-BUG: -> On both ARM and x86-32, it is the caller's responsibility to arrange for 64-bit alignment of 64-bit words accessed atomically. The first word in a global variable or in an allocated struct or slice can be relied upon to be 64-bit aligned. - -`etcd` have had [issues](https://github.com/coreos/etcd/issues/2308) with this. See [how to fix it here](https://github.com/coreos/etcd/pull/3249) - -```go -// 32-bit-atomic-bug.go -package main -import "sync/atomic" - -type a struct { - b chan struct{} - c int64 -} - -func main(){ - d := a{} - atomic.StoreInt64(&d.c, 10 * 1000 * 1000 * 1000) -} -``` - -```console -$ GOARCH=386 go build 32-bit-atomic-bug.go -$ file 32-bit-atomic-bug -32-bit-atomic-bug: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, not stripped -$ ./32-bit-atomic-bug -panic: runtime error: invalid memory address or nil pointer dereference -[signal 0xb code=0x1 addr=0x0 pc=0x808cd9b] - -goroutine 1 [running]: -panic(0x8098de0, 0x1830a038) - /usr/local/go/src/runtime/panic.go:481 +0x326 -sync/atomic.StoreUint64(0x1830e0f4, 0x540be400, 0x2) - /usr/local/go/src/sync/atomic/asm_386.s:190 +0xb -main.main() - /tmp/32-bit-atomic-bug.go:11 +0x4b -``` - -This means that all structs should keep all `int64` and `uint64` fields at the top of the struct to be safe. If we would move `a.c` to the top of the `a` struct above, the operation would succeed. - -The bug affects `32-bit` platforms when a `(u)int64` field is accessed by an `atomic` method. 
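-
-For illustration, here is a minimal sketch of the fix described above: the failing example with the `int64`
-field moved to the top of the struct (and stored in a package-level variable, so the alignment guarantee
-quoted from the `sync/atomic` documentation applies). It is meant only to show the field ordering, not a
-general-purpose pattern:
-
-```go
-// 32-bit-atomic-fix.go
-package main
-
-import "sync/atomic"
-
-type a struct {
-	c int64         // 64-bit fields first, so they stay 64-bit aligned on 32-bit platforms
-	b chan struct{}
-}
-
-var d a
-
-func main() {
-	// With the original field order this store panics on 386/ARM; with c first it succeeds.
-	atomic.StoreInt64(&d.c, 10*1000*1000*1000)
-}
-```
-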
-It would be great to write a tool that checks so all `atomic` accessed fields are aligned at the top of the struct, but it's hard: [coreos/etcd#5027](https://github.com/coreos/etcd/issues/5027). - -## Prebuilding the Go standard library (`std`) - -A great blog post [that is describing this](https://medium.com/@rakyll/go-1-5-cross-compilation-488092ba44ec#.5jcd0owem) - -Before Go 1.5, the whole Go project had to be cross-compiled from source for **all** platforms that _might_ be used, and that was quite a slow process: - -```console -# From build-tools/build-image/cross/Dockerfile when we used Go 1.4 -$ cd /usr/src/go/src -$ for platform in ${PLATFORMS}; do GOOS=${platform%/*} GOARCH=${platform##*/} ./make.bash --no-clean; done -``` - -With Go 1.5+, cross-compiling the Go repository isn't required anymore. Go will automatically cross-compile the `std` packages that are being used by the code that is being compiled, _and throw it away after the compilation_. -If you cross-compile multiple times, Go will build parts of `std`, throw it away, compile parts of it again, throw that away and so on. - -However, there is an easy way of cross-compiling all `std` packages in advance with Go 1.5+: - -```console -# From build-tools/build-image/cross/Dockerfile when we're using Go 1.5+ -$ for platform in ${PLATFORMS}; do GOOS=${platform%/*} GOARCH=${platform##*/} go install std; done -``` - -### Static cross-compilation - -Static compilation with Go 1.5+ is dead easy: - -```go -// main.go -package main -import "fmt" -func main() { - fmt.Println("Hello Kubernetes!") -} -``` - -```console -$ go build main.go -$ file main -main: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped -$ GOOS=linux GOARCH=arm go build main.go -$ file main -main: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), statically linked, not stripped -``` - -The only thing you have to do is change the `GOARCH` and `GOOS` variables. Here's a list of valid values for [GOOS/GOARCH](https://golang.org/doc/install/source#environment) - -#### Static compilation with `net` - -Consider this: - -```go -// main-with-net.go -package main -import "net" -import "fmt" -func main() { - fmt.Println(net.ParseIP("10.0.0.10").String()) -} -``` - -```console -$ go build main-with-net.go -$ file main-with-net -main-with-net: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, - interpreter /lib64/ld-linux-x86-64.so.2, not stripped -$ GOOS=linux GOARCH=arm go build main-with-net.go -$ file main-with-net -main-with-net: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), statically linked, not stripped -``` - -Wait, what? Just because we included `net` from the `std` package, the binary defaults to being dynamically linked when the target platform equals to the host platform? -Let's take a look at `go env` to get a clue why this happens: - -```console -$ go env -GOARCH="amd64" -GOHOSTARCH="amd64" -GOHOSTOS="linux" -GOOS="linux" -GOPATH="/go" -GOROOT="/usr/local/go" -GO15VENDOREXPERIMENT="1" -CC="gcc" -CXX="g++" -CGO_ENABLED="1" -``` - -See the `CGO_ENABLED=1` at the end? That's where compilation for the host and cross-compilation differs. By default, Go will link statically if no `cgo` code is involved. `net` is one of the packages that prefers `cgo`, but doesn't depend on it. - -When cross-compiling on the other hand, `CGO_ENABLED` is set to `0` by default. 
-
-To always be safe, run this when compiling statically:
-
-```console
-$ CGO_ENABLED=0 go build -a -installsuffix cgo main-with-net.go
-$ file main-with-net
-main-with-net: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped
-```
-
-See [golang/go#9344](https://github.com/golang/go/issues/9344) for more details.
-
-### Dynamic cross-compilation
-
-In order to dynamically compile a Go binary with `cgo`, we need `gcc` installed at build time.
-
-The only Kubernetes binary that uses C code is the `kubelet`, or in fact `cAdvisor` on which `kubelet` depends. `hyperkube` is also dynamically linked as long as `kubelet` is. We should aim to make `kubelet` statically linked.
-
-The normal `x86_64-linux-gnu` gcc can't cross-compile binaries, so we have to install gcc cross-compilers for every platform. We do this in the [`kube-cross`](../../build-tools/build-image/cross/Dockerfile) image,
-and depend on the [`emdebian.org` repository](https://wiki.debian.org/CrossToolchains). Depending on `emdebian` isn't ideal, so we should consider using the latest `gcc` cross-compiler packages from the `ubuntu` main repositories in the future.
-
-Here's an example of cross-compiling plain C code:
-
-```c
-// main.c
-#include <stdio.h>
-
-int main(void)
-{
-    printf("Hello Kubernetes!\n");
-    return 0;
-}
-```
-
-```console
-$ arm-linux-gnueabi-gcc -o main-c main.c
-$ file main-c
-main-c: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked,
-    interpreter /lib/ld-linux.so.3, for GNU/Linux 2.6.32, not stripped
-```
-
-And here's an example of cross-compiling mixed Go and C code:
-
-```go
-// main-cgo.go
-package main
-
-/*
-char* sayhello(void) { return "Hello Kubernetes!"; }
-*/
-import "C"
-import "fmt"
-
-func main() {
-    fmt.Println(C.GoString(C.sayhello()))
-}
-```
-
-```console
-$ CGO_ENABLED=1 CC=arm-linux-gnueabi-gcc GOOS=linux GOARCH=arm go build main-cgo.go
-$ file main-cgo
-./main-cgo: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked,
-    interpreter /lib/ld-linux.so.3, for GNU/Linux 2.6.32, not stripped
-```
-
-The downside of dynamic compilation is that it adds an unnecessary dependency on `glibc` _at runtime_.
-
-### Static compilation with CGO code
-
-Lastly, it's even possible to cross-compile `cgo` code _statically_:
-
-```console
-$ CGO_ENABLED=1 CC=arm-linux-gnueabi-gcc GOOS=linux GOARCH=arm go build -ldflags '-extldflags "-static"' main-cgo.go
-$ file main-cgo
-./main-cgo: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), statically linked,
-    for GNU/Linux 2.6.32, not stripped
-```
-
-This is especially useful if we want to include the binary in a container.
-If the binary is statically compiled, we may use `busybox` or even `scratch` as the base image.
-This should be the preferred way of compiling binaries that strictly require C code to be part of them.
-
-#### GOARM
-
-32-bit ARM comes in several flavours; the main ones are ARMv5, ARMv6 and ARMv7. Go has the `GOARM` environment variable that controls which version of ARM Go should target. Here's a table of all ARM versions and how they play together:
-
-ARM Version | GOARCH | GOARM | GCC package | No. of bits
------------- | ------ | ----- | ----------- | -----------
-ARMv5 | arm | 5 | armel | 32-bit
-ARMv6 | arm | 6 | - | 32-bit
-ARMv7 | arm | 7 | armhf | 32-bit
-ARMv8 | arm64 | - | aarch64 | 64-bit
-
-The compatibility between the versions is straightforward: ARMv5 binaries may run on ARMv7 hosts, but not vice versa.
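-
-As a purely illustrative sketch (not the actual `kube-cross` build scripts), the relationship between the
-target platforms and the C cross-compilers discussed above could be expressed like this; the exact compiler
-names depend on which toolchain packages end up being used:
-
-```go
-// crosscompilers.go
-package main
-
-import "fmt"
-
-// crossCC maps "GOOS/GOARCH" to the C cross-compiler needed when cgo is involved.
-var crossCC = map[string]string{
-	"linux/amd64":   "gcc",                       // native build, no cross-compiler needed
-	"linux/arm":     "arm-linux-gnueabi-gcc",     // armel toolchain, as used in the examples above
-	"linux/arm64":   "aarch64-linux-gnu-gcc",
-	"linux/ppc64le": "powerpc64le-linux-gnu-gcc",
-}
-
-func main() {
-	for platform, cc := range crossCC {
-		fmt.Printf("%-14s CC=%s\n", platform, cc)
-	}
-}
-```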
- -## Cross-building docker images for linux - -After binaries have been cross-compiled, they should be distributed in some manner. - -The default and maybe the most intuitive way of doing this is by packaging it in a docker image. - -### Trivial Dockerfile - -All `Dockerfile` commands except for `RUN` works for any architecture without any modification. -The base image has to be switched to an arch-specific one, but except from that, a cross-built image is only a `docker build` away. - -```Dockerfile -FROM armel/busybox -ENV kubernetes=true -COPY kube-apiserver /usr/local/bin/ -CMD ["/usr/local/bin/kube-apiserver"] -``` - -```console -$ file kube-apiserver -kube-apiserver: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), statically linked, not stripped -$ docker build -t gcr.io/google_containers/kube-apiserver-arm:v1.x.y . -Step 1 : FROM armel/busybox - ---> 9bb1e6d4f824 -Step 2 : ENV kubernetes true - ---> Running in 8a1bfcb220ac - ---> e4ef9f34236e -Removing intermediate container 8a1bfcb220ac -Step 3 : COPY kube-apiserver /usr/local/bin/ - ---> 3f0c4633e5ac -Removing intermediate container b75a054ab53c -Step 4 : CMD /usr/local/bin/kube-apiserver - ---> Running in 4e6fe931a0a5 - ---> 28f50e58c909 -Removing intermediate container 4e6fe931a0a5 -Successfully built 28f50e58c909 -``` - -### Complex Dockerfile - -However, in the most cases, `RUN` statements are needed when building the image. - -The `RUN` statement invokes `/bin/sh` inside the container, but in this example, `/bin/sh` is an ARM binary, which can't execute on an `amd64` processor. - -#### QEMU to the rescue - -Here's a way to run ARM Docker images on an amd64 host by using `qemu`: - -```console -# Register other architectures` magic numbers in the binfmt_misc kernel module, so it`s possible to run foreign binaries -$ docker run --rm --privileged multiarch/qemu-user-static:register --reset -# Download qemu 2.5.0 -$ curl -sSL https://github.com/multiarch/qemu-user-static/releases/download/v2.5.0/x86_64_qemu-arm-static.tar.xz \ - | tar -xJ -# Run a foreign docker image, and inject the amd64 qemu binary for translating all syscalls -$ docker run -it -v $(pwd)/qemu-arm-static:/usr/bin/qemu-arm-static armel/busybox /bin/sh - -# Now we`re inside an ARM container although we`re running on an amd64 host -$ uname -a -Linux 0a7da80f1665 4.2.0-25-generic #30-Ubuntu SMP Mon Jan 18 12:31:50 UTC 2016 armv7l GNU/Linux -``` - -Here a linux module called `binfmt_misc` registered the "magic numbers" in the kernel, so the kernel may detect which architecture a binary is, and prepend the call with `/usr/bin/qemu-(arm|aarch64|ppc64le)-static`. For example, `/usr/bin/qemu-arm-static` is a statically linked `amd64` binary that translates all ARM syscalls to `amd64` syscalls. - -The multiarch guys have done a great job here, you may find the source for this and other images at [GitHub](https://github.com/multiarch) - - -## Implementation - -## History - -32-bit ARM (`linux/arm`) was the first platform Kubernetes was ported to, and luxas' project [`Kubernetes on ARM`](https://github.com/luxas/kubernetes-on-arm) (released on GitHub the 31st of September 2015) -served as a way of running Kubernetes on ARM devices easily. -The 30th of November 2015, a tracking issue about making Kubernetes run on ARM was opened: [#17981](https://github.com/kubernetes/kubernetes/issues/17981). It later shifted focus to how to make Kubernetes a more platform-independent system. 
- -The 27th of April 2016, Kubernetes `v1.3.0-alpha.3` was released, and it became the first release that was able to run the [docker getting started guide](http://kubernetes.io/docs/getting-started-guides/docker/) on `linux/amd64`, `linux/arm`, `linux/arm64` and `linux/ppc64le` without any modification. - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/multi-platform.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/multi-platform.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/multi-platform.md) diff --git a/docs/proposals/multiple-schedulers.md b/docs/proposals/multiple-schedulers.md index 4f675b1021a..248ca25f061 100644 --- a/docs/proposals/multiple-schedulers.md +++ b/docs/proposals/multiple-schedulers.md @@ -1,138 +1 @@ -# Multi-Scheduler in Kubernetes - -**Status**: Design & Implementation in progress. - -> Contact @HaiyangDING for questions & suggestions. - -## Motivation - -In current Kubernetes design, there is only one default scheduler in a Kubernetes cluster. -However it is common that multiple types of workload, such as traditional batch, DAG batch, streaming and user-facing production services, -are running in the same cluster and they need to be scheduled in different ways. For example, in -[Omega](http://research.google.com/pubs/pub41684.html) batch workload and service workload are scheduled by two types of schedulers: -the batch workload is scheduled by a scheduler which looks at the current usage of the cluster to improve the resource usage rate -and the service workload is scheduled by another one which considers the reserved resources in the -cluster and many other constraints since their performance must meet some higher SLOs. -[Mesos](http://mesos.apache.org/) has done a great work to support multiple schedulers by building a -two-level scheduling structure. This proposal describes how Kubernetes is going to support multi-scheduler -so that users could be able to run their user-provided scheduler(s) to enable some customized scheduling -behavior as they need. As previously discussed in [#11793](https://github.com/kubernetes/kubernetes/issues/11793), -[#9920](https://github.com/kubernetes/kubernetes/issues/9920) and [#11470](https://github.com/kubernetes/kubernetes/issues/11470), -the design of the multiple scheduler should be generic and includes adding a scheduler name annotation to separate the pods. -It is worth mentioning that the proposal does not address the question of how the scheduler name annotation gets -set although it is reasonable to anticipate that it would be set by a component like admission controller/initializer, -as the doc currently does. - -Before going to the details of this proposal, below lists a number of the methods to extend the scheduler: - -- Write your own scheduler and run it along with Kubernetes native scheduler. This is going to be detailed in this proposal -- Use the callout approach such as the one implemented in [#13580](https://github.com/kubernetes/kubernetes/issues/13580) -- Recompile the scheduler with a new policy -- Restart the scheduler with a new [scheduler policy config file](../../examples/scheduler-policy-config.json) -- Or maybe in future dynamically link a new policy into the running scheduler - -## Challenges in multiple schedulers - -- Separating the pods - - Each pod should be scheduled by only one scheduler. 
As for implementation, a pod should have an additional field that tells by which scheduler it wants to be scheduled.
-  Besides, each scheduler, including the default one, should have its own logic for adding unscheduled
-  pods to its to-be-scheduled pod queue. Details will be explained in later sections.
-
-- Dealing with conflicts
-
-  Different schedulers are essentially separate processes. When all schedulers try to schedule
-  their pods onto the nodes, there might be conflicts.
-
-  One example of such a conflict is resource racing: suppose there is a `pod1` scheduled by
-  `my-scheduler` with a CPU *request* of 1, and a `pod2` scheduled by `kube-scheduler` (the k8s native
-  scheduler, acting as the default scheduler) with a CPU *request* of 2, while `node-a` only has 2.5
-  free CPUs. If both schedulers try to put their pods on `node-a`, one of them will eventually
-  fail when the Kubelet on `node-a` performs the create action, due to insufficient CPU resources.
-
-  This kind of conflict is complex to deal with in the api-server and etcd. Our current solution is to let the Kubelet
-  do the conflict check; if a conflict happens, the affected pods are put back to the scheduler
-  and wait to be scheduled again. Implementation details are in later sections.
-
-## Where to start: initial design
-
-We definitely want the multi-scheduler design to be a generic mechanism. The following lists the changes
-we want to make in the first step.
-
-- Add an annotation to the pod template: `scheduler.alpha.kubernetes.io/name: scheduler-name`. This is used to
-separate pods between schedulers; `scheduler-name` should match the `scheduler-name` of one of the schedulers.
-- Add a `scheduler-name` to each scheduler, either hardcoded or passed as a command-line argument. The
-Kubernetes native scheduler (now the `kube-scheduler` process) would have the name `kube-scheduler`.
-- The `scheduler-name` plays an important part in separating the pods between different schedulers.
-Pods are statically dispatched to different schedulers based on the `scheduler.alpha.kubernetes.io/name: scheduler-name`
-annotation and there should not be any conflict between different schedulers handling their pods, i.e. one pod must
-NOT be claimed by more than one scheduler. To be specific, a scheduler can add a pod to its queue if and only if:
-    1. The pod has no nodeName, **AND**
-    2. The `scheduler-name` specified in the pod's annotation `scheduler.alpha.kubernetes.io/name: scheduler-name`
-    matches the `scheduler-name` of the scheduler.
-
-    The only exception is the default scheduler. Any pod that has no `scheduler.alpha.kubernetes.io/name: scheduler-name`
-    annotation is assumed to be handled by the "default scheduler". In the first version of the multi-scheduler feature,
-    the default scheduler would be the Kubernetes built-in scheduler with `scheduler-name` set to `kube-scheduler`.
-    The Kubernetes built-in scheduler will claim any pod which has no `scheduler.alpha.kubernetes.io/name: scheduler-name`
-    annotation or which has `scheduler.alpha.kubernetes.io/name: kube-scheduler`. In the future, it may be possible to
-    change which scheduler is the default for a given cluster.
-
-- Dealing with conflicts. All schedulers must use predicate functions that are at least as strict as
-the ones that Kubelet applies when deciding whether to accept a pod, otherwise Kubelet and scheduler
-may get into an infinite loop where Kubelet keeps rejecting a pod and scheduler keeps re-scheduling
-it back to the same node.
To make it easier for people who write new schedulers to obey this rule, we will -create a library containing the predicates Kubelet uses. (See issue [#12744](https://github.com/kubernetes/kubernetes/issues/12744).) - -In summary, in the initial version of this multi-scheduler design, we will achieve the following: - -- If a pod has the annotation `scheduler.alpha.kubernetes.io/name: kube-scheduler` or the user does not explicitly -sets this annotation in the template, it will be picked up by default scheduler -- If the annotation is set and refers to a valid `scheduler-name`, it will be picked up by the scheduler of -specified `scheduler-name` -- If the annotation is set but refers to an invalid `scheduler-name`, the pod will not be picked by any scheduler. -The pod will keep PENDING. - -### An example - -```yaml - kind: Pod - apiVersion: v1 - metadata: - name: pod-abc - labels: - foo: bar - annotations: - scheduler.alpha.kubernetes.io/name: my-scheduler -``` - -This pod will be scheduled by "my-scheduler" and ignored by "kube-scheduler". If there is no running scheduler -of name "my-scheduler", the pod will never be scheduled. - -## Next steps - -1. Use admission controller to add and verify the annotation, and do some modification if necessary. For example, the -admission controller might add the scheduler annotation based on the namespace of the pod, and/or identify if -there are conflicting rules, and/or set a default value for the scheduler annotation, and/or reject pods on -which the client has set a scheduler annotation that does not correspond to a running scheduler. -2. Dynamic launching scheduler(s) and registering to admission controller (as an external call). This also -requires some work on authorization and authentication to control what schedulers can write the /binding -subresource of which pods. -3. Optimize the behaviors of priority functions in multi-scheduler scenario. In the case where multiple schedulers have -the same predicate and priority functions (for example, when using multiple schedulers for parallelism rather than to -customize the scheduling policies), all schedulers would tend to pick the same node as "best" when scheduling identical -pods and therefore would be likely to conflict on the Kubelet. To solve this problem, we can pass -an optional flag such as `--randomize-node-selection=N` to scheduler, setting this flag would cause the scheduler to pick -randomly among the top N nodes instead of the one with the highest score. 
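-
-The randomized selection mentioned in step 3 above could, as a rough sketch (the types and function here are
-illustrative, not existing scheduler code), look something like this:
-
-```go
-// randomize_selection.go
-package main
-
-import (
-	"fmt"
-	"math/rand"
-	"sort"
-)
-
-// nodeScore pairs a node name with the score the priority functions gave it.
-type nodeScore struct {
-	Name  string
-	Score int
-}
-
-// selectHost picks randomly among the top n nodes instead of always returning
-// the single highest-scoring one, so identical schedulers running in parallel
-// are less likely to race for the same node.
-func selectHost(scores []nodeScore, n int) string {
-	sort.Slice(scores, func(i, j int) bool { return scores[i].Score > scores[j].Score })
-	if n < 1 {
-		n = 1
-	}
-	if n > len(scores) {
-		n = len(scores)
-	}
-	return scores[rand.Intn(n)].Name
-}
-
-func main() {
-	scores := []nodeScore{{"node-a", 10}, {"node-b", 9}, {"node-c", 3}}
-	fmt.Println(selectHost(scores, 2)) // prints node-a or node-b
-}
-```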
- -## Other issues/discussions related to scheduler design - -- [#13580](https://github.com/kubernetes/kubernetes/pull/13580): scheduler extension -- [#17097](https://github.com/kubernetes/kubernetes/issues/17097): policy config file in pod template -- [#16845](https://github.com/kubernetes/kubernetes/issues/16845): scheduling groups of pods -- [#17208](https://github.com/kubernetes/kubernetes/issues/17208): guide to writing a new scheduler - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/multiple-schedulers.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/multiple-schedulers.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/multiple-schedulers.md) diff --git a/docs/proposals/network-policy.md b/docs/proposals/network-policy.md index ff75aa57a29..24ec520149a 100644 --- a/docs/proposals/network-policy.md +++ b/docs/proposals/network-policy.md @@ -1,304 +1 @@ -# NetworkPolicy - -## Abstract - -A proposal for implementing a new resource - NetworkPolicy - which -will enable definition of ingress policies for selections of pods. - -The design for this proposal has been created by, and discussed -extensively within the Kubernetes networking SIG. It has been implemented -and tested using Kubernetes API extensions by various networking solutions already. - -In this design, users can create various NetworkPolicy objects which select groups of pods and -define how those pods should be allowed to communicate with each other. The -implementation of that policy at the network layer is left up to the -chosen networking solution. - -> Note that this proposal does not yet include egress / cidr-based policy, which is still actively undergoing discussion in the SIG. These are expected to augment this proposal in a backwards compatible way. - -## Implementation - -The implementation in Kubernetes consists of: -- A v1beta1 NetworkPolicy API object -- A structure on the `Namespace` object to control policy, to be developed as an annotation for now. - -### Namespace changes - -The following objects will be defined on a Namespace Spec. ->NOTE: In v1beta1 the Namespace changes will be implemented as an annotation. - -```go -type IngressIsolationPolicy string - -const ( - // Deny all ingress traffic to pods in this namespace. Ingress means - // any incoming traffic to pods, whether that be from other pods within this namespace - // or any source outside of this namespace. - DefaultDeny IngressIsolationPolicy = "DefaultDeny" -) - -// Standard NamespaceSpec object, modified to include a new -// NamespaceNetworkPolicy field. -type NamespaceSpec struct { - // This is a pointer so that it can be left undefined. - NetworkPolicy *NamespaceNetworkPolicy `json:"networkPolicy,omitempty"` -} - -type NamespaceNetworkPolicy struct { - // Ingress configuration for this namespace. This config is - // applied to all pods within this namespace. For now, only - // ingress is supported. This field is optional - if not - // defined, then the cluster default for ingress is applied. - Ingress *NamespaceIngressPolicy `json:"ingress,omitempty"` -} - -// Configuration for ingress to pods within this namespace. -// For now, this only supports specifying an isolation policy. -type NamespaceIngressPolicy struct { - // The isolation policy to apply to pods in this namespace. 
- // Currently this field only supports "DefaultDeny", but could - // be extended to support other policies in the future. When set to DefaultDeny, - // pods in this namespace are denied ingress traffic by default. When not defined, - // the cluster default ingress isolation policy is applied (currently allow all). - Isolation *IngressIsolationPolicy `json:"isolation,omitempty"` -} -``` - -```yaml -kind: Namespace -apiVersion: v1 -spec: - networkPolicy: - ingress: - isolation: DefaultDeny -``` - -The above structures will be represented in v1beta1 as a json encoded annotation like so: - -```yaml -kind: Namespace -apiVersion: v1 -metadata: - annotations: - net.beta.kubernetes.io/network-policy: | - { - "ingress": { - "isolation": "DefaultDeny" - } - } -``` - -### NetworkPolicy Go Definition - -For a namespace with ingress isolation, connections to pods in that namespace (from any source) are prevented. -The user needs a way to explicitly declare which connections are allowed into pods of that namespace. - -This is accomplished through ingress rules on `NetworkPolicy` -objects (of which there can be multiple in a single namespace). Pods selected by -one or more NetworkPolicy objects should allow any incoming connections that match any -ingress rule on those NetworkPolicy objects, per the network plugin’s capabilities. - -NetworkPolicy objects and the above namespace isolation both act on _connections_ rather than individual packets. That is to say that if traffic from pod A to pod B is allowed by the configured -policy, then the return packets for that connection from B -> A are also allowed, even if the policy in place would not allow B to initiate a connection to A. NetworkPolicy objects act on a broad definition of _connection_ which includes both TCP and UDP streams. If new network policy is applied that would block an existing connection between two endpoints, the enforcer of policy -should terminate and block the existing connection as soon as can be expected by the implementation. - -We propose adding the new NetworkPolicy object to the `extensions/v1beta1` API group for now. - -The SIG also considered the following while developing the proposed NetworkPolicy object: -- A per-pod policy field. We discounted this in favor of the loose coupling that labels provide, similar to Services. -- Per-Service policy. We chose not to attach network policy to services to avoid semantic overloading of a single object, and conflating the existing semantics of load-balancing and service discovery with those of network policy. - -```go -type NetworkPolicy struct { - TypeMeta - ObjectMeta - - // Specification of the desired behavior for this NetworkPolicy. - Spec NetworkPolicySpec -} - -type NetworkPolicySpec struct { - // Selects the pods to which this NetworkPolicy object applies. The array of ingress rules - // is applied to any pods selected by this field. Multiple network policies can select the - // same set of pods. In this case, the ingress rules for each are combined additively. - // This field is NOT optional and follows standard unversioned.LabelSelector semantics. - // An empty podSelector matches all pods in this namespace. - PodSelector unversioned.LabelSelector `json:"podSelector"` - - // List of ingress rules to be applied to the selected pods. 
- // Traffic is allowed to a pod if namespace.networkPolicy.ingress.isolation is undefined and cluster policy allows it, - // OR if the traffic source is the pod's local node, - // OR if the traffic matches at least one ingress rule across all of the NetworkPolicy - // objects whose podSelector matches the pod. - // If this field is empty then this NetworkPolicy does not affect ingress isolation. - // If this field is present and contains at least one rule, this policy allows any traffic - // which matches at least one of the ingress rules in this list. - Ingress []NetworkPolicyIngressRule `json:"ingress,omitempty"` -} - -// This NetworkPolicyIngressRule matches traffic if and only if the traffic matches both ports AND from. -type NetworkPolicyIngressRule struct { - // List of ports which should be made accessible on the pods selected for this rule. - // Each item in this list is combined using a logical OR. - // If this field is not provided, this rule matches all ports (traffic not restricted by port). - // If this field is empty, this rule matches no ports (no traffic matches). - // If this field is present and contains at least one item, then this rule allows traffic - // only if the traffic matches at least one port in the ports list. - Ports *[]NetworkPolicyPort `json:"ports,omitempty"` - - // List of sources which should be able to access the pods selected for this rule. - // Items in this list are combined using a logical OR operation. - // If this field is not provided, this rule matches all sources (traffic not restricted by source). - // If this field is empty, this rule matches no sources (no traffic matches). - // If this field is present and contains at least on item, this rule allows traffic only if the - // traffic matches at least one item in the from list. - From *[]NetworkPolicyPeer `json:"from,omitempty"` -} - -type NetworkPolicyPort struct { - // Optional. The protocol (TCP or UDP) which traffic must match. - // If not specified, this field defaults to TCP. - Protocol *api.Protocol `json:"protocol,omitempty"` - - // If specified, the port on the given protocol. This can - // either be a numerical or named port. If this field is not provided, - // this matches all port names and numbers. - // If present, only traffic on the specified protocol AND port - // will be matched. - Port *intstr.IntOrString `json:"port,omitempty"` -} - -type NetworkPolicyPeer struct { - // Exactly one of the following must be specified. - - // This is a label selector which selects Pods in this namespace. - // This field follows standard unversioned.LabelSelector semantics. - // If present but empty, this selector selects all pods in this namespace. - PodSelector *unversioned.LabelSelector `json:"podSelector,omitempty"` - - // Selects Namespaces using cluster scoped-labels. This - // matches all pods in all namespaces selected by this label selector. - // This field follows standard unversioned.LabelSelector semantics. - // If present but empty, this selector selects all namespaces. - NamespaceSelector *unversioned.LabelSelector `json:"namespaceSelector,omitempty"` -} -``` - -### Behavior - -The following pseudo-code attempts to define when traffic is allowed to a given pod when using this API. - -```python -def is_traffic_allowed(traffic, pod): - """ - Returns True if traffic is allowed to this pod, False otherwise. - """ - if not pod.Namespace.Spec.NetworkPolicy.Ingress.Isolation: - # If ingress isolation is disabled on the Namespace, use cluster default. 
- return clusterDefault(traffic, pod) - elif traffic.source == pod.node.kubelet: - # Traffic is from kubelet health checks. - return True - else: - # If namespace ingress isolation is enabled, only allow traffic - # that matches a network policy which selects this pod. - for network_policy in network_policies(pod.Namespace): - if not network_policy.Spec.PodSelector.selects(pod): - # This policy doesn't select this pod. Try the next one. - continue - - # This policy selects this pod. Check each ingress rule - # defined on this policy to see if it allows the traffic. - # If at least one does, then the traffic is allowed. - for ingress_rule in network_policy.Ingress or []: - if ingress_rule.matches(traffic): - return True - - # Ingress isolation is DefaultDeny and no policies match the given pod and traffic. - return false -``` - -### Potential Future Work / Questions - -- A single podSelector per NetworkPolicy may lead to managing a large number of NetworkPolicy objects, each of which is small and easy to understand on its own. However, this may lead for a policy change to require touching several policy objects. Allowing an optional podSelector per ingress rule additionally to the podSelector per NetworkPolicy object would allow the user to group rules into logical segments and define size/complexity ratio where it makes sense. This may lead to a smaller number of objects with more complexity if the user opts in to the additional podSelector. This increases the complexity of the NetworkPolicy object itself. This proposal has opted to favor a larger number of smaller objects that are easier to understand, with the understanding that additional podSelectors could be added to this design in the future should the requirement become apparent. - -- Is the `Namespaces` selector in the `NetworkPolicyPeer` struct too coarse? Do we need to support the AND combination of `Namespaces` and `Pods`? - -### Examples - -1) Only allow traffic from frontend pods on TCP port 6379 to backend pods in the same namespace. - -```yaml -kind: Namespace -apiVersion: v1 -metadata: - name: myns - annotations: - net.beta.kubernetes.io/network-policy: | - { - "ingress": { - "isolation": "DefaultDeny" - } - } ---- -kind: NetworkPolicy -apiVersion: extensions/v1beta1 -metadata: - name: allow-frontend - namespace: myns -spec: - podSelector: - matchLabels: - role: backend - ingress: - - from: - - podSelector: - matchLabels: - role: frontend - ports: - - protocol: TCP - port: 6379 -``` - -2) Allow TCP 443 from any source in Bob's namespaces. - -```yaml -kind: NetworkPolicy -apiVersion: extensions/v1beta1 -metadata: - name: allow-tcp-443 -spec: - podSelector: - matchLabels: - role: frontend - ingress: - - ports: - - protocol: TCP - port: 443 - from: - - namespaceSelector: - matchLabels: - user: bob -``` - -3) Allow all traffic to all pods in this namespace. - -```yaml -kind: NetworkPolicy -apiVersion: extensions/v1beta1 -metadata: - name: allow-all -spec: - podSelector: - ingress: - - {} -``` - -## References - -- https://github.com/kubernetes/kubernetes/issues/22469 tracks network policy in kubernetes. 
- - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/network-policy.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/network-policy.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/network-policy.md) diff --git a/docs/proposals/node-allocatable.md b/docs/proposals/node-allocatable.md index ac9f46c407a..af407ab1316 100644 --- a/docs/proposals/node-allocatable.md +++ b/docs/proposals/node-allocatable.md @@ -1,151 +1 @@ -# Node Allocatable Resources - -**Issue:** https://github.com/kubernetes/kubernetes/issues/13984 - -## Overview - -Currently Node.Status has Capacity, but no concept of node Allocatable. We need additional -parameters to serve several purposes: - -1. Kubernetes metrics provides "/docker-daemon", "/kubelet", - "/kube-proxy", "/system" etc. raw containers for monitoring system component resource usage - patterns and detecting regressions. Eventually we want to cap system component usage to a certain - limit / request. However this is not currently feasible due to a variety of reasons including: - 1. Docker still uses tons of computing resources (See - [#16943](https://github.com/kubernetes/kubernetes/issues/16943)) - 2. We have not yet defined the minimal system requirements, so we cannot control Kubernetes - nodes or know about arbitrary daemons, which can make the system resources - unmanageable. Even with a resource cap we cannot do a full resource management on the - node, but with the proposed parameters we can mitigate really bad resource over commits - 3. Usage scales with the number of pods running on the node -2. For external schedulers (such as mesos, hadoop, etc.) integration, they might want to partition - compute resources on a given node, limiting how much Kubelet can use. We should provide a - mechanism by which they can query kubelet, and reserve some resources for their own purpose. - -### Scope of proposal - -This proposal deals with resource reporting through the [`Allocatable` field](#allocatable) for more -reliable scheduling, and minimizing resource over commitment. This proposal *does not* cover -resource usage enforcement (e.g. limiting kubernetes component usage), pod eviction (e.g. when -reservation grows), or running multiple Kubelets on a single node. - -## Design - -### Definitions - -![image](node-allocatable.png) - -1. **Node Capacity** - Already provided as - [`NodeStatus.Capacity`](https://htmlpreview.github.io/?https://github.com/kubernetes/kubernetes/blob/HEAD/docs/api-reference/v1/definitions.html#_v1_nodestatus), - this is total capacity read from the node instance, and assumed to be constant. -2. **System-Reserved** (proposed) - Compute resources reserved for processes which are not managed by - Kubernetes. Currently this covers all the processes lumped together in the `/system` raw - container. -3. **Kubelet Allocatable** - Compute resources available for scheduling (including scheduled & - unscheduled resources). This value is the focus of this proposal. See [below](#api-changes) for - more details. -4. **Kube-Reserved** (proposed) - Compute resources reserved for Kubernetes components such as the - docker daemon, kubelet, kube proxy, etc. 
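-
-As a rough sketch of how the quantities defined above relate (the exact formula and flags are specified under
-"API changes" below), the computation is a simple per-resource subtraction. Real code would use
-`api.ResourceList`/`resource.Quantity`; plain integer milli-CPU and byte values are used here only to keep the
-example self-contained:
-
-```go
-// allocatable.go
-package main
-
-import "fmt"
-
-// resources is a stand-in for an api.ResourceList in this sketch.
-type resources struct {
-	MilliCPU    int64
-	MemoryBytes int64
-}
-
-// allocatable is what the scheduler may use on a node: capacity minus the
-// kube-reserved and system-reserved portions.
-func allocatable(capacity, kubeReserved, systemReserved resources) resources {
-	return resources{
-		MilliCPU:    capacity.MilliCPU - kubeReserved.MilliCPU - systemReserved.MilliCPU,
-		MemoryBytes: capacity.MemoryBytes - kubeReserved.MemoryBytes - systemReserved.MemoryBytes,
-	}
-}
-
-func main() {
-	capacity := resources{MilliCPU: 4000, MemoryBytes: 8 << 30} // 4 CPUs, 8Gi
-	kube := resources{MilliCPU: 500, MemoryBytes: 5 << 20}      // e.g. --kube-reserved=cpu=500m,memory=5Mi
-	system := resources{MilliCPU: 200, MemoryBytes: 256 << 20}  // example system reservation
-	fmt.Printf("allocatable: %+v\n", allocatable(capacity, kube, system))
-}
-```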
- -### API changes - -#### Allocatable - -Add `Allocatable` (4) to -[`NodeStatus`](https://htmlpreview.github.io/?https://github.com/kubernetes/kubernetes/blob/HEAD/docs/api-reference/v1/definitions.html#_v1_nodestatus): - -``` -type NodeStatus struct { - ... - // Allocatable represents schedulable resources of a node. - Allocatable ResourceList `json:"allocatable,omitempty"` - ... -} -``` - -Allocatable will be computed by the Kubelet and reported to the API server. It is defined to be: - -``` - [Allocatable] = [Node Capacity] - [Kube-Reserved] - [System-Reserved] -``` - -The scheduler will use `Allocatable` in place of `Capacity` when scheduling pods, and the Kubelet -will use it when performing admission checks. - -*Note: Since kernel usage can fluctuate and is out of kubernetes control, it will be reported as a - separate value (probably via the metrics API). Reporting kernel usage is out-of-scope for this - proposal.* - -#### Kube-Reserved - -`KubeReserved` is the parameter specifying resources reserved for kubernetes components (4). It is -provided as a command-line flag to the Kubelet at startup, and therefore cannot be changed during -normal Kubelet operation (this may change in the [future](#future-work)). - -The flag will be specified as a serialized `ResourceList`, with resources defined by the API -`ResourceName` and values specified in `resource.Quantity` format, e.g.: - -``` ---kube-reserved=cpu=500m,memory=5Mi -``` - -Initially we will only support CPU and memory, but will eventually support more resources. See -[#16889](https://github.com/kubernetes/kubernetes/pull/16889) for disk accounting. - -If KubeReserved is not set it defaults to a sane value (TBD) calculated from machine capacity. If it -is explicitly set to 0 (along with `SystemReserved`), then `Allocatable == Capacity`, and the system -behavior is equivalent to the 1.1 behavior with scheduling based on Capacity. - -#### System-Reserved - -In the initial implementation, `SystemReserved` will be functionally equivalent to -[`KubeReserved`](#system-reserved), but with a different semantic meaning. While KubeReserved -designates resources set aside for kubernetes components, SystemReserved designates resources set -aside for non-kubernetes components (currently this is reported as all the processes lumped -together in the `/system` raw container). - -## Issues - -### Kubernetes reservation is smaller than kubernetes component usage - -**Solution**: Initially, do nothing (best effort). Let the kubernetes daemons overflow the reserved -resources and hope for the best. If the node usage is less than Allocatable, there will be some room -for overflow and the node should continue to function. If the node has been scheduled to capacity -(worst-case scenario) it may enter an unstable state, which is the current behavior in this -situation. - -In the [future](#future-work) we may set a parent cgroup for kubernetes components, with limits set -according to `KubeReserved`. - -### Version discrepancy - -**API server / scheduler is not allocatable-resources aware:** If the Kubelet rejects a Pod but the - scheduler expects the Kubelet to accept it, the system could get stuck in an infinite loop - scheduling a Pod onto the node only to have Kubelet repeatedly reject it. To avoid this situation, - we will do a 2-stage rollout of `Allocatable`. 
In stage 1 (targeted for 1.2), `Allocatable` will - be reported by the Kubelet and the scheduler will be updated to use it, but Kubelet will continue - to do admission checks based on `Capacity` (same as today). In stage 2 of the rollout (targeted - for 1.3 or later), the Kubelet will start doing admission checks based on `Allocatable`. - -**API server expects `Allocatable` but does not receive it:** If the kubelet is older and does not - provide `Allocatable` in the `NodeStatus`, then `Allocatable` will be - [defaulted](../../pkg/api/v1/defaults.go) to - `Capacity` (which will yield today's behavior of scheduling based on capacity). - -### 3rd party schedulers - -The community should be notified that an update to schedulers is recommended, but if a scheduler is -not updated it falls under the above case of "scheduler is not allocatable-resources aware". - -## Future work - -1. Convert kubelet flags to Config API - Prerequisite to (2). See - [#12245](https://github.com/kubernetes/kubernetes/issues/12245). -2. Set cgroup limits according KubeReserved - as described in the [overview](#overview) -3. Report kernel usage to be considered with scheduling decisions. - - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/node-allocatable.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md) diff --git a/docs/proposals/performance-related-monitoring.md b/docs/proposals/performance-related-monitoring.md index f70da39bda7..e5ff81f7314 100644 --- a/docs/proposals/performance-related-monitoring.md +++ b/docs/proposals/performance-related-monitoring.md @@ -1,116 +1 @@ -# Performance Monitoring - -## Reason for this document - -This document serves as a place to gather information about past performance regressions, their reason and impact and discuss ideas to avoid similar regressions in the future. -Main reason behind doing this is to understand what kind of monitoring needs to be in place to keep Kubernetes fast. - -## Known past and present performance issues - -### Higher logging level causing scheduler stair stepping - -Issue https://github.com/kubernetes/kubernetes/issues/14216 was opened because @spiffxp observed a regression in scheduler performance in 1.1 branch in comparison to `old` 1.0 -cut. In the end it turned out the be caused by `--v=4` (instead of default `--v=2`) flag in the scheduler together with the flag `--logtostderr` which disables batching of -log lines and a number of logging without explicit V level. This caused weird behavior of the whole component. - -Because we now know that logging may have big performance impact we should consider instrumenting logging mechanism and compute statistics such as number of logged messages, -total and average size of them. Each binary should be responsible for exposing its metrics. An unaccounted but way too big number of days, if not weeks, of engineering time was -lost because of this issue. - -### Adding per-pod probe-time, which increased the number of PodStatus updates, causing major slowdown - -In September 2015 we tried to add per-pod probe times to the PodStatus. It caused (https://github.com/kubernetes/kubernetes/issues/14273) a massive increase in both number and -total volume of object (PodStatus) changes. 
It drastically increased the load on the API server, which wasn't able to handle the new number of requests
-quickly enough, violating our response time SLO. We had to revert this change.
-
-### Late Ready->Running PodPhase transition caused test failures as it seemed like slowdown
-
-In late September we encountered a strange problem (https://github.com/kubernetes/kubernetes/issues/14554): we observed increased latencies in small clusters (a few
-Nodes). It turned out to be caused by an added latency between the PodRunning and PodReady phases. This was not a real regression, but our tests thought it was, which shows
-how careful we need to be.
-
-### Huge number of handshakes slows down API server
-
-This was a long-standing performance issue and is/was an important bottleneck for scalability (https://github.com/kubernetes/kubernetes/issues/13671). The bug directly
-causing this problem was incorrect (from Go's standpoint) handling of TCP connections. A secondary issue was that elliptic curve encryption (the only option available in Go 1.4)
-is unbelievably slow.
-
-## Proposed metrics/statistics to gather/compute to avoid problems
-
-### Cluster-level metrics
-
-Basic ideas:
-- number of Pods/ReplicationControllers/Services in the cluster
-- number of running replicas of master components (if they are replicated)
-- current elected master of the etcd cluster (if running the distributed version)
-- number of master component restarts
-- number of lost Nodes
-
-### Logging monitoring
-
-Log spam is a serious problem and we need to keep it under control. The simplest way to check for regressions, suggested by @brendandburns, is to compute the rate at which log files
-grow in e2e tests.
-
-Basic ideas:
-- log generation rate (B/s)
-
-### REST call monitoring
-
-We do measure REST call duration in the Density test, but we need API server monitoring as well, to avoid false failures caused e.g. by network traffic. We already have
-some metrics in place (https://github.com/kubernetes/kubernetes/blob/master/pkg/apiserver/metrics/metrics.go), but we need to revisit the list and add some more.
-
-Basic ideas:
-- number of calls per verb, client, resource type
-- latency distribution per verb, client, resource type
-- number of calls that were rejected per client, resource type and reason (invalid version number, already at maximum number of requests in flight)
-- number of relists in various watchers
-
-### Rate limit monitoring
-
-The reverse of the REST call monitoring done in the API server. We need to know when a given component increases the pressure it puts on the API server. As a proxy for the number of
-requests sent, we can track how saturated the rate limiters are. This has the additional advantage of giving us the data needed to fine-tune rate limiter constants.
-
-Because we have rate limiting on both ends (client and API server) we should monitor the number of inflight requests in the API server and how it relates to `max-requests-inflight`.
-
-Basic ideas:
-- percentage of the non-burst limit used,
-- amount of time in the last hour with depleted burst tokens,
-- number of inflight requests in the API server.
-
-### Network connection monitoring
-
-During development we have already observed incorrect use/reuse of HTTP connections multiple times. We should at least monitor the number of created connections.
-
-### ETCD monitoring
-
-@xiang-90 and @hongchaodeng - you probably have way more experience on what'd be good to look at from the ETCD perspective.
- -Basic ideas: -- ETCD memory footprint -- number of objects per kind -- read/write latencies per kind -- number of requests from the API server -- read/write counts per key (it may be too heavy though) - -### Resource consumption - -On top of all things mentioned above we need to monitor changes in resource usage in both: cluster components (API server, Kubelet, Scheduler, etc.) and system add-ons -(Heapster, L7 load balancer, etc.). Monitoring memory usage is tricky, because if no limits are set, system won't apply memory pressure to processes, which makes their memory -footprint constantly grow. We argue that monitoring usage in tests still makes sense, as tests should be repeatable, and if memory usage will grow drastically between two runs -it most likely can be attributed to some kind of regression (assuming that nothing else has changed in the environment). - -Basic ideas: -- CPU usage -- memory usage - -### Other saturation metrics - -We should monitor other aspects of the system, which may indicate saturation of some component. - -Basic ideas: -- queue length for queues in the system, -- wait time for WaitGroups. - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/performance-related-monitoring.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/performance-related-monitoring.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/performance-related-monitoring.md) diff --git a/docs/proposals/pod-lifecycle-event-generator.md b/docs/proposals/pod-lifecycle-event-generator.md index 207d6a17c04..b49e8922bcd 100644 --- a/docs/proposals/pod-lifecycle-event-generator.md +++ b/docs/proposals/pod-lifecycle-event-generator.md @@ -1,201 +1 @@ -# Kubelet: Pod Lifecycle Event Generator (PLEG) - -In Kubernetes, Kubelet is a per-node daemon that manages the pods on the node, -driving the pod states to match their pod specifications (specs). To achieve -this, Kubelet needs to react to changes in both (1) pod specs and (2) the -container states. For the former, Kubelet watches the pod specs changes from -multiple sources; for the latter, Kubelet polls the container runtime -periodically (e.g., 10s) for the latest states for all containers. - -Polling incurs non-negligible overhead as the number of pods/containers increases, -and is exacerbated by Kubelet's parallelism -- one worker (goroutine) per pod, which -queries the container runtime individually. Periodic, concurrent, large number -of requests causes high CPU usage spikes (even when there is no spec/state -change), poor performance, and reliability problems due to overwhelmed container -runtime. Ultimately, it limits Kubelet's scalability. - -(Related issues reported by users: [#10451](https://issues.k8s.io/10451), -[#12099](https://issues.k8s.io/12099), [#12082](https://issues.k8s.io/12082)) - -## Goals and Requirements - -The goal of this proposal is to improve Kubelet's scalability and performance -by lowering the pod management overhead. - - Reduce unnecessary work during inactivity (no spec/state changes) - - Lower the concurrent requests to the container runtime. - -The design should be generic so that it can support different container runtimes -(e.g., Docker and rkt). - -## Overview - -This proposal aims to replace the periodic polling with a pod lifecycle event -watcher. 
- -![pleg](pleg.png) - -## Pod Lifecycle Event - -A pod lifecycle event interprets the underlying container state change at the -pod-level abstraction, making it container-runtime-agnostic. The abstraction -shields Kubelet from the runtime specifics. - -```go -type PodLifeCycleEventType string - -const ( - ContainerStarted PodLifeCycleEventType = "ContainerStarted" - ContainerStopped PodLifeCycleEventType = "ContainerStopped" - NetworkSetupCompleted PodLifeCycleEventType = "NetworkSetupCompleted" - NetworkFailed PodLifeCycleEventType = "NetworkFailed" -) - -// PodLifecycleEvent is an event reflects the change of the pod state. -type PodLifecycleEvent struct { - // The pod ID. - ID types.UID - // The type of the event. - Type PodLifeCycleEventType - // The accompanied data which varies based on the event type. - Data interface{} -} -``` - -Using Docker as an example, starting of a POD infra container would be -translated to a NetworkSetupCompleted`pod lifecycle event. - - -## Detect Changes in Container States Via Relisting - -In order to generate pod lifecycle events, PLEG needs to detect changes in -container states. We can achieve this by periodically relisting all containers -(e.g., docker ps). Although this is similar to Kubelet's polling today, it will -only be performed by a single thread (PLEG). This means that we still -benefit from not having all pod workers hitting the container runtime -concurrently. Moreover, only the relevant pod worker would be woken up -to perform a sync. - -The upside of relying on relisting is that it is container runtime-agnostic, -and requires no external dependency. - -### Relist period - -The shorter the relist period is, the sooner that Kubelet can detect the -change. Shorter relist period also implies higher cpu usage. Moreover, the -relist latency depends on the underlying container runtime, and usually -increases as the number of containers/pods grows. We should set a default -relist period based on measurements. Regardless of what period we set, it will -likely be significantly shorter than the current pod sync period (10s), i.e., -Kubelet will detect container changes sooner. - - -## Impact on the Pod Worker Control Flow - -Kubelet is responsible for dispatching an event to the appropriate pod -worker based on the pod ID. Only one pod worker would be woken up for -each event. - -Today, the pod syncing routine in Kubelet is idempotent as it always -examines the pod state and the spec, and tries to drive to state to -match the spec by performing a series of operations. It should be -noted that this proposal does not intend to change this property -- -the sync pod routine would still perform all necessary checks, -regardless of the event type. This trades some efficiency for -reliability and eliminate the need to build a state machine that is -compatible with different runtimes. - -## Leverage Upstream Container Events - -Instead of relying on relisting, PLEG can leverage other components which -provide container events, and translate these events into pod lifecycle -events. This will further improve Kubelet's responsiveness and reduce the -resource usage caused by frequent relisting. - -The upstream container events can come from: - -(1). *Event stream provided by each container runtime* - -Docker's API exposes an [event -stream](https://docs.docker.com/reference/api/docker_remote_api_v1.17/#monitor-docker-s-events). 
-Nonetheless, rkt does not support this yet, but they will eventually support it -(see [coreos/rkt#1193](https://github.com/coreos/rkt/issues/1193)). - -(2). *cgroups event stream by cAdvisor* - -cAdvisor is integrated in Kubelet to provide container stats. It watches cgroups -containers using inotify and exposes an event stream. Even though it does not -support rkt yet, it should be straightforward to add such a support. - -Option (1) may provide richer sets of events, but option (2) has the advantage -to be more universal across runtimes, as long as the container runtime uses -cgroups. Regardless of what one chooses to implement now, the container event -stream should be easily swappable with a clearly defined interface. - -Note that we cannot solely rely on the upstream container events due to the -possibility of missing events. PLEG should relist infrequently to ensure no -events are missed. - -## Generate Expected Events - -*This is optional for PLEGs which performs only relisting, but required for -PLEGs that watch upstream events.* - -A pod worker's actions could lead to pod lifecycle events (e.g., -create/kill a container), which the worker would not observe until -later. The pod worker should ignore such events to avoid unnecessary -work. - -For example, assume a pod has two containers, A and B. The worker - - - Creates container A - - Receives an event `(ContainerStopped, B)` - - Receives an event `(ContainerStarted, A)` - - -The worker should ignore the `(ContainerStarted, A)` event since it is -expected. Arguably, the worker could process `(ContainerStopped, B)` -as soon as it receives the event, before observing the creation of -A. However, it is desirable to wait until the expected event -`(ContainerStarted, A)` is observed to keep a consistent per-pod view -at the worker. Therefore, the control flow of a single pod worker -should adhere to the following rules: - -1. Pod worker should process the events sequentially. -2. Pod worker should not start syncing until it observes the outcome of its own - actions in the last sync to maintain a consistent view. - -In other words, a pod worker should record the expected events, and -only wake up to perform the next sync until all expectations are met. - - - Creates container A, records an expected event `(ContainerStarted, A)` - - Receives `(ContainerStopped, B)`; stores the event and goes back to sleep. - - Receives `(ContainerStarted, A)`; clears the expectation. Proceeds to handle - `(ContainerStopped, B)`. - -We should set an expiration time for each expected events to prevent the worker -from being stalled indefinitely by missing events. - -## TODOs for v1.2 - -For v1.2, we will add a generic PLEG which relists periodically, and leave -adopting container events for future work. We will also *not* implement the -optimization that generate and filters out expected events to minimize -redundant syncs. - -- Add a generic PLEG using relisting. Modify the container runtime interface - to provide all necessary information to detect container state changes - in `GetPods()` (#13571). - -- Benchmark docker to adjust relising frequency. - -- Fix/adapt features that rely on frequent, periodic pod syncing. - * Liveness/Readiness probing: Create a separate probing manager using - explicitly container probing period [#10878](https://issues.k8s.io/10878). - * Instruct pod workers to set up a wake-up call if syncing failed, so that - it can retry. 
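
Although the expected-event optimization is explicitly deferred past v1.2 in the TODOs above, a small sketch may help make the rules from the "Generate Expected Events" section concrete. The following is an illustrative Go snippet only, under the assumption of simplified types (pod IDs as strings, no real runtime hookup); it is not the Kubelet's actual implementation.

```go
// Minimal sketch of the per-pod expectation bookkeeping described in
// "Generate Expected Events": the worker records events caused by its own
// actions, buffers other observed events, and only wakes up to sync once all
// unexpired expectations have been met.
package main

import (
	"fmt"
	"time"
)

type PodLifeCycleEventType string

type PodLifecycleEvent struct {
	ID   string // pod ID (simplified to a string here)
	Type PodLifeCycleEventType
	Data interface{}
}

type expectation struct {
	event    PodLifecycleEvent
	deadline time.Time // expiration so a missing event cannot stall the worker forever
}

type podWorker struct {
	expected []expectation       // events the worker caused itself and still waits for
	pending  []PodLifecycleEvent // observed events buffered until expectations clear
}

// Expect is called right after the worker performs an action (e.g. creating a
// container) so that the resulting event is not treated as new work.
func (w *podWorker) Expect(e PodLifecycleEvent, ttl time.Duration) {
	w.expected = append(w.expected, expectation{event: e, deadline: time.Now().Add(ttl)})
}

// Observe records an incoming event and reports whether the worker should wake
// up now, i.e. no unexpired expectations remain and there is buffered work.
func (w *podWorker) Observe(e PodLifecycleEvent) bool {
	matched := false
	remaining := w.expected[:0]
	for _, exp := range w.expected {
		switch {
		case !matched && exp.event.ID == e.ID && exp.event.Type == e.Type:
			matched = true // expectation met; drop it and ignore the event
		case time.Now().After(exp.deadline):
			// expired expectation: drop it instead of waiting indefinitely
		default:
			remaining = append(remaining, exp)
		}
	}
	w.expected = remaining
	if !matched {
		w.pending = append(w.pending, e)
	}
	return len(w.expected) == 0 && len(w.pending) > 0
}

func main() {
	w := &podWorker{}
	w.Expect(PodLifecycleEvent{ID: "pod-1", Type: "ContainerStarted"}, time.Minute)
	fmt.Println(w.Observe(PodLifecycleEvent{ID: "pod-1", Type: "ContainerStopped"})) // false: still waiting on A
	fmt.Println(w.Observe(PodLifecycleEvent{ID: "pod-1", Type: "ContainerStarted"})) // true: sync the buffered event
}
```

When the worker finally syncs, it would drain `pending` and record fresh expectations for whatever actions the sync performs, repeating the cycle.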
- - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/pod-lifecycle-event-generator.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-lifecycle-event-generator.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-lifecycle-event-generator.md) diff --git a/docs/proposals/pod-resource-management.md b/docs/proposals/pod-resource-management.md index 39f939e3add..0f0b09b4a61 100644 --- a/docs/proposals/pod-resource-management.md +++ b/docs/proposals/pod-resource-management.md @@ -1,416 +1 @@ -# Pod level resource management in Kubelet - -**Author**: Buddha Prakash (@dubstack), Vishnu Kannan (@vishh) - -**Last Updated**: 06/23/2016 - -**Status**: Draft Proposal (WIP) - -This document proposes a design for introducing pod level resource accounting to Kubernetes, and outlines the implementation and rollout plan. - - - -- [Pod level resource management in Kubelet](#pod-level-resource-management-in-kubelet) - - [Introduction](#introduction) - - [Non Goals](#non-goals) - - [Motivations](#motivations) - - [Design](#design) - - [Proposed cgroup hierarchy:](#proposed-cgroup-hierarchy) - - [QoS classes](#qos-classes) - - [Guaranteed](#guaranteed) - - [Burstable](#burstable) - - [Best Effort](#best-effort) - - [With Systemd](#with-systemd) - - [Hierarchy Outline](#hierarchy-outline) - - [QoS Policy Design Decisions](#qos-policy-design-decisions) - - [Implementation Plan](#implementation-plan) - - [Top level Cgroups for QoS tiers](#top-level-cgroups-for-qos-tiers) - - [Pod level Cgroup creation and deletion (Docker runtime)](#pod-level-cgroup-creation-and-deletion-docker-runtime) - - [Container level cgroups](#container-level-cgroups) - - [Rkt runtime](#rkt-runtime) - - [Add Pod level metrics to Kubelet's metrics provider](#add-pod-level-metrics-to-kubelets-metrics-provider) - - [Rollout Plan](#rollout-plan) - - [Implementation Status](#implementation-status) - - - -## Introduction - -As of now [Quality of Service(QoS)](../../docs/design/resource-qos.md) is not enforced at a pod level. Excepting pod evictions, all the other QoS features are not applicable at the pod level. -To better support QoS, there is a need to add support for pod level resource accounting in Kubernetes. - -We propose to have a unified cgroup hierarchy with pod level cgroups for better resource management. We will have a cgroup hierarchy with top level cgroups for the three QoS classes Guaranteed, Burstable and BestEffort. Pods (and their containers) belonging to a QoS class will be grouped under these top level QoS cgroups. And all containers in a pod are nested under the pod cgroup. - -The proposed cgroup hierarchy would allow for more efficient resource management and lead to improvements in node reliability. -This would also allow for significant latency optimizations in terms of pod eviction on nodes with the use of pod level resource usage metrics. -This document provides a basic outline of how we plan to implement and rollout this feature. - - -## Non Goals - -- Pod level disk accounting will not be tackled in this proposal. -- Pod level resource specification in the Kubernetes API will not be tackled in this proposal. - -## Motivations - -Kubernetes currently supports container level isolation only and lets users specify resource requests/limits on the containers [Compute Resources](../../docs/design/resources.md). 
The `kubelet` creates a cgroup sandbox (via its container runtime) for each container. - - -There are a few shortcomings to the current model: - Existing QoS support does not apply to pods as a whole. On-going work to support pod level eviction using QoS requires all containers in a pod to belong to the same class. By having pod level cgroups, it is easy to track pod level usage and make eviction decisions. - Infrastructure overhead per pod is currently charged to the node. The overhead of setting up and managing the pod sandbox is currently accounted to the node. If the pod sandbox is a bit expensive, like in the case of hyper, having pod level accounting becomes critical. - For the docker runtime we have a containerd-shim which is a small library that sits in front of a runtime implementation allowing it to be reparented to init, handle reattach from the caller etc. With pod level cgroups containerd-shim can be charged to the pod instead of the machine. - If a container exits, all its anonymous pages (tmpfs) get accounted to the machine (root). With pod level cgroups, that usage can also be attributed to the pod. - Let containers share resources - with pod level limits, a pod with a Burstable container and a BestEffort container is classified as a Burstable pod. The BestEffort container is able to consume slack resources not used by the Burstable container, and still be capped by the overall pod level limits. - -## Design - -High level requirements for the design are as follows: - Do not break existing users. Ideally, there should be no changes to the Kubernetes API semantics. - Support multiple cgroup managers - systemd, cgroupfs, etc. - -How we intend to achieve these high level goals is covered in greater detail in the Implementation Plan. - -We use the following denotations in the sections below: - -For the three QoS classes -`G⇒ Guaranteed QoS, Bu⇒ Burstable QoS, BE⇒ BestEffort QoS` - -For the value specified for the --qos-memory-overcommitment flag -`qom⇒ qos-memory-overcommitment` - -Currently the Kubelet highly prioritizes resource utilization and thus allows BE pods to use as many resources as they want. In case of OOM, the BE pods are the first to be killed. We follow this policy because G pods often don't use the full amount of resources they request. By overcommitting the node, the BE pods are able to utilize these leftover resources. In case of OOM, the BE pods are evicted by the eviction manager. But there is some latency involved in the pod eviction process, which can be a cause of concern for latency-sensitive servers. On such servers we would want to avoid OOM conditions on the node. Pod level cgroups allow us to restrict the amount of resources available to the BE pods. So reserving the requested resources for the G and Bu pods would allow us to avoid invoking the OOM killer. - - -We add a flag `qos-memory-overcommitment` to the kubelet, which allows users to configure the percentage of memory overcommitment on the node. The default is 100, so by default we allow complete overcommitment on the node, let the BE pods use as much memory as they want, and do not reserve any resources for the G and Bu pods. As expected, if there is an OOM in such a case, we first kill the BE pods before the G and Bu pods. -On the other hand, if a user wants to ensure very predictable tail latency for latency-sensitive servers, they would need to set qos-memory-overcommitment to a very low value (preferably 0). 
In this case memory resources would be reserved for the G and Bu pods and BE pods would be able to use only the left over memory resource. - -Examples in the next section. - -### Proposed cgroup hierarchy: - -For the initial implementation we will only support limits for cpu and memory resources. - -#### QoS classes - -A pod can belong to one of the following 3 QoS classes: Guaranteed, Burstable, and BestEffort, in decreasing order of priority. - -#### Guaranteed - -`G` pods will be placed at the `$Root` cgroup by default. `$Root` is the system root i.e. "/" by default and if `--cgroup-root` flag is used then we use the specified cgroup-root as the `$Root`. To ensure Kubelet's idempotent behaviour we follow a pod cgroup naming format which is opaque and deterministic. Say we have a pod with UID: `5f9b19c9-3a30-11e6-8eea-28d2444e470d` the pod cgroup PodUID would be named: `pod-5f9b19c93a3011e6-8eea28d2444e470d`. - - -__Note__: The cgroup-root flag would allow the user to configure the root of the QoS cgroup hierarchy. Hence cgroup-root would be redefined as the root of QoS cgroup hierarchy and not containers. - -``` -/PodUID/cpu.quota = cpu limit of Pod -/PodUID/cpu.shares = cpu request of Pod -/PodUID/memory.limit_in_bytes = memory limit of Pod -``` - -Example: -We have two pods Pod1 and Pod2 having Pod Spec given below - -```yaml -kind: Pod -metadata: - name: Pod1 -spec: - containers: - name: foo - resources: - limits: - cpu: 10m - memory: 1Gi - name: bar - resources: - limits: - cpu: 100m - memory: 2Gi -``` - -```yaml -kind: Pod -metadata: - name: Pod2 -spec: - containers: - name: foo - resources: - limits: - cpu: 20m - memory: 2Gii -``` - -Pod1 and Pod2 are both classified as `G` and are nested under the `Root` cgroup. - -``` -/Pod1/cpu.quota = 110m -/Pod1/cpu.shares = 110m -/Pod2/cpu.quota = 20m -/Pod2/cpu.shares = 20m -/Pod1/memory.limit_in_bytes = 3Gi -/Pod2/memory.limit_in_bytes = 2Gi -``` - -#### Burstable - -We have the following resource parameters for the `Bu` cgroup. - -``` -/Bu/cpu.shares = summation of cpu requests of all Bu pods -/Bu/PodUID/cpu.quota = Pod Cpu Limit -/Bu/PodUID/cpu.shares = Pod Cpu Request -/Bu/memory.limit_in_bytes = Allocatable - {(summation of memory requests/limits of `G` pods)*(1-qom/100)} -/Bu/PodUID/memory.limit_in_bytes = Pod memory limit -``` - -`Note: For the `Bu` QoS when limits are not specified for any one of the containers, the Pod limit defaults to the node resource allocatable quantity.` - -Example: -We have two pods Pod3 and Pod4 having Pod Spec given below: - -```yaml -kind: Pod -metadata: - name: Pod3 -spec: - containers: - name: foo - resources: - limits: - cpu: 50m - memory: 2Gi - requests: - cpu: 20m - memory: 1Gi - name: bar - resources: - limits: - cpu: 100m - memory: 1Gi -``` - -```yaml -kind: Pod -metadata: - name: Pod4 -spec: - containers: - name: foo - resources: - limits: - cpu: 20m - memory: 2Gi - requests: - cpu: 10m - memory: 1Gi -``` - -Pod3 and Pod4 are both classified as `Bu` and are hence nested under the Bu cgroup -And for `qom` = 0 - -``` -/Bu/cpu.shares = 30m -/Bu/Pod3/cpu.quota = 150m -/Bu/Pod3/cpu.shares = 20m -/Bu/Pod4/cpu.quota = 20m -/Bu/Pod4/cpu.shares = 10m -/Bu/memory.limit_in_bytes = Allocatable - 5Gi -/Bu/Pod3/memory.limit_in_bytes = 3Gi -/Bu/Pod4/memory.limit_in_bytes = 2Gi -``` - -#### Best Effort - -For pods belonging to the `BE` QoS we don't set any quota. 
- -``` -/BE/cpu.shares = 2 -/BE/cpu.quota= not set -/BE/memory.limit_in_bytes = Allocatable - {(summation of memory requests of all `G` and `Bu` pods)*(1-qom/100)} -/BE/PodUID/memory.limit_in_bytes = no limit -``` - -Example: -We have a pod 'Pod5' having Pod Spec given below: - -```yaml -kind: Pod -metadata: - name: Pod5 -spec: - containers: - name: foo - resources: - name: bar - resources: -``` - -Pod5 is classified as `BE` and is hence nested under the BE cgroup -And for `qom` = 0 - -``` -/BE/cpu.shares = 2 -/BE/cpu.quota= not set -/BE/memory.limit_in_bytes = Allocatable - 7Gi -/BE/Pod5/memory.limit_in_bytes = no limit -``` - -### With Systemd - -In systemd we have slices for the three top level QoS class. Further each pod is a subslice of exactly one of the three QoS slices. Each container in a pod belongs to a scope nested under the qosclass-pod slice. - -Example: We plan to have the following cgroup hierarchy on systemd systems - -``` -/memory/G-PodUID.slice/containerUID.scope -/cpu,cpuacct/G-PodUID.slice/containerUID.scope -/memory/Bu.slice/Bu-PodUID.slice/containerUID.scope -/cpu,cpuacct/Bu.slice/Bu-PodUID.slice/containerUID.scope -/memory/BE.slice/BE-PodUID.slice/containerUID.scope -/cpu,cpuacct/BE.slice/BE-PodUID.slice/containerUID.scope -``` - -### Hierarchy Outline - -- "$Root" is the system root of the node i.e. "/" by default and if `--cgroup-root` is specified then the specified cgroup-root is used as "$Root". -- We have a top level QoS cgroup for the `Bu` and `BE` QoS classes. -- But we __dont__ have a separate cgroup for the `G` QoS class. `G` pod cgroups are brought up directly under the `Root` cgroup. -- Each pod has its own cgroup which is nested under the cgroup matching the pod's QoS class. -- All containers brought up by the pod are nested under the pod's cgroup. -- system-reserved cgroup contains the system specific processes. -- kube-reserved cgroup contains the kubelet specific daemons. - -``` -$ROOT - | - +- Pod1 - | | - | +- Container1 - | +- Container2 - | ... - +- Pod2 - | +- Container3 - | ... - +- ... - | - +- Bu - | | - | +- Pod3 - | | | - | | +- Container4 - | | ... - | +- Pod4 - | | +- Container5 - | | ... - | +- ... - | - +- BE - | | - | +- Pod5 - | | | - | | +- Container6 - | | +- Container7 - | | ... - | +- ... - | - +- System-reserved - | | - | +- system - | +- docker (optional) - | +- ... - | - +- Kube-reserved - | | - | +- kubelet - | +- docker (optional) - | +- ... - | -``` - -#### QoS Policy Design Decisions - -- This hierarchy highly prioritizes resource guarantees to the `G` over `Bu` and `BE` pods. -- By not having a separate cgroup for the `G` class, the hierarchy allows the `G` pods to burst and utilize all of Node's Allocatable capacity. -- The `BE` and `Bu` pods are strictly restricted from bursting and hogging resources and thus `G` Pods are guaranteed resource isolation. -- `BE` pods are treated as lowest priority. So for the `BE` QoS cgroup we set cpu shares to the lowest possible value ie.2. This ensures that the `BE` containers get a relatively small share of cpu time. -- Also we don't set any quota on the cpu resources as the containers on the `BE` pods can use any amount of free resources on the node. -- Having memory limit of `BE` cgroup as (Allocatable - summation of memory requests of `G` and `Bu` pods) would result in `BE` pods becoming more susceptible to being OOM killed. 
As more `G` and `Bu` pods are scheduled kubelet will more likely kill `BE` pods, even if the `G` and `Bu` pods are using less than their request since we will be dynamically reducing the size of `BE` m.limit_in_bytes. But this allows for better memory guarantees to the `G` and `Bu` pods. - -## Implementation Plan - -The implementation plan is outlined in the next sections. -We will have a 'experimental-cgroups-per-qos' flag to specify if the user wants to use the QoS based cgroup hierarchy. The flag would be set to false by default at least in v1.5. - -#### Top level Cgroups for QoS tiers - -Two top level cgroups for `Bu` and `BE` QoS classes are created when Kubelet starts to run on a node. All `G` pods cgroups are by default nested under the `Root`. So we dont create a top level cgroup for the `G` class. For raw cgroup systems we would use libcontainers cgroups manager for general cgroup management(cgroup creation/destruction). But for systemd we don't have equivalent support for slice management in libcontainer yet. So we will be adding support for the same in the Kubelet. These cgroups are only created once on Kubelet initialization as a part of node setup. Also on systemd these cgroups are transient units and will not survive reboot. - -#### Pod level Cgroup creation and deletion (Docker runtime) - -- When a new pod is brought up, its QoS class is firstly determined. -- We add an interface to Kubelet’s ContainerManager to create and delete pod level cgroups under the cgroup that matches the pod’s QoS class. -- This interface will be pluggable. Kubelet will support both systemd and raw cgroups based __cgroup__ drivers. We will be using the --cgroup-driver flag proposed in the [Systemd Node Spec](kubelet-systemd.md) to specify the cgroup driver. -- We inject creation and deletion of pod level cgroups into the pod workers. -- As new pods are added QoS class cgroup parameters are updated to match the resource requests by the Pod. - -#### Container level cgroups - -Have docker manager create container cgroups under pod level cgroups. With the docker runtime, we will pass --cgroup-parent using the syntax expected for the corresponding cgroup-driver the runtime was configured to use. - -#### Rkt runtime - -We want to have rkt create pods under a root QoS class that kubelet specifies, and set pod level cgroup parameters mentioned in this proposal by itself. - -#### Add Pod level metrics to Kubelet's metrics provider - -Update Kubelet’s metrics provider to include Pod level metrics. Use cAdvisor's cgroup subsystem information to determine various Pod level usage metrics. - -`Note: Changes to cAdvisor might be necessary.` - -## Rollout Plan - -This feature will be opt-in in v1.4 and an opt-out in v1.5. We recommend users to drain their nodes and opt-in, before switching to v1.5, which will result in a no-op when v1.5 kubelet is rolled out. - -## Implementation Status - -The implementation goals of the first milestone are outlined below. -- [x] Finalize and submit Pod Resource Management proposal for the project #26751 -- [x] Refactor qos package to be used globally throughout the codebase #27749 #28093 -- [x] Add interfaces for CgroupManager and CgroupManagerImpl which implements the CgroupManager interface and creates, destroys/updates cgroups using the libcontainer cgroupfs driver. #27755 #28566 -- [x] Inject top level QoS Cgroup creation in the Kubelet and add e2e tests to test that behaviour. 
#27853 -- [x] Add PodContainerManagerImpl Create and Destroy methods which implements the respective PodContainerManager methods using a cgroupfs driver. #28017 -- [x] Have docker manager create container cgroups under pod level cgroups. Inject creation and deletion of pod cgroups into the pod workers. Add e2e tests to test this behaviour. #29049 -- [x] Add support for updating policy for the pod cgroups. Add e2e tests to test this behaviour. #29087 -- [ ] Enabling 'cgroup-per-qos' flag in Kubelet: The user is expected to drain the node and restart it before enabling this feature, but as a fallback we also want to allow the user to just restart the kubelet with the cgroup-per-qos flag enabled to use this feature. As a part of this we need to figure out a policy for pods having Restart Policy: Never. More details in this [issue](https://github.com/kubernetes/kubernetes/issues/29946). -- [ ] Removing terminated pod's Cgroup : We need to cleanup the pod's cgroup once the pod is terminated. More details in this [issue](https://github.com/kubernetes/kubernetes/issues/29927). -- [ ] Kubelet needs to ensure that the cgroup settings are what the kubelet expects them to be. If security is not of concern, one can assume that once kubelet applies cgroups setting successfully, the values will never change unless kubelet changes it. If security is of concern, then kubelet will have to ensure that the cgroup values meet its requirements and then continue to watch for updates to cgroups via inotify and re-apply cgroup values if necessary. -Updating QoS limits needs to happen before pod cgroups values are updated. When pod cgroups are being deleted, QoS limits have to be updated after pod cgroup values have been updated for deletion or pod cgroups have been removed. Given that kubelet doesn't have any checkpoints and updates to QoS and pod cgroups are not atomic, kubelet needs to reconcile cgroups status whenever it restarts to ensure that the cgroups values match kubelet's expectation. -- [ ] [TEST] Opting in for this feature and rollbacks should be accompanied by detailed error message when killing pod intermittently. -- [ ] Add a systemd implementation for Cgroup Manager interface - - -Other smaller work items that we would be good to have before the release of this feature. -- [ ] Add Pod UID to the downward api which will help simplify the e2e testing logic. -- [ ] Check if parent cgroup exist and error out if they don’t. -- [ ] Set top level cgroup limit to resource allocatable until we support QoS level cgroup updates. If cgroup root is not `/` then set node resource allocatable as the cgroup resource limits on cgroup root. -- [ ] Add a NodeResourceAllocatableProvider which returns the amount of allocatable resources on the nodes. This interface would be used both by the Kubelet and ContainerManager. -- [ ] Add top level feasibility check to ensure that pod can be admitted on the node by estimating left over resources on the node. -- [ ] Log basic cgroup management ie. creation/deletion metrics - - -To better support our requirements we needed to make some changes/add features to Libcontainer as well - -- [x] Allowing or denying all devices by writing 'a' to devices.allow or devices.deny is -not possible once the device cgroups has children. Libcontainer doesn’t have the option of skipping updates on parent devices cgroup. 
opencontainers/runc/pull/958 -- [x] To use libcontainer for creating and managing cgroups in the Kubelet, I would like to just create a cgroup with no pid attached and if need be apply a pid to the cgroup later on. But libcontainer did not support cgroup creation without attaching a pid. opencontainers/runc/pull/956 - - - - - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/pod-resource-management.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-resource-management.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-resource-management.md) diff --git a/docs/proposals/pod-security-context.md b/docs/proposals/pod-security-context.md index bfaffa595c7..be82b4c63b5 100644 --- a/docs/proposals/pod-security-context.md +++ b/docs/proposals/pod-security-context.md @@ -1,374 +1 @@ -## Abstract - -A proposal for refactoring `SecurityContext` to have pod-level and container-level attributes in -order to correctly model pod- and container-level security concerns. - -## Motivation - -Currently, containers have a `SecurityContext` attribute which contains information about the -security settings the container uses. In practice, many of these attributes are uniform across all -containers in a pod. Simultaneously, there is also a need to apply the security context pattern -at the pod level to correctly model security attributes that apply only at a pod level. - -Users should be able to: - -1. Express security settings that are applicable to the entire pod -2. Express base security settings that apply to all containers -3. Override only the settings that need to be differentiated from the base in individual - containers - -This proposal is a dependency for other changes related to security context: - -1. [Volume ownership management in the Kubelet](https://github.com/kubernetes/kubernetes/pull/12944) -2. [Generic SELinux label management in the Kubelet](https://github.com/kubernetes/kubernetes/pull/14192) - -Goals of this design: - -1. Describe the use cases for which a pod-level security context is necessary -2. Thoroughly describe the API backward compatibility issues that arise from the introduction of - a pod-level security context -3. Describe all implementation changes necessary for the feature - -## Constraints and assumptions - -1. We will not design for intra-pod security; we are not currently concerned about isolating - containers in the same pod from one another -1. We will design for backward compatibility with the current V1 API - -## Use Cases - -1. As a developer, I want to correctly model security attributes which belong to an entire pod -2. As a user, I want to be able to specify container attributes that apply to all containers - without repeating myself -3. As an existing user, I want to be able to use the existing container-level security API - -### Use Case: Pod level security attributes - -Some security attributes make sense only to model at the pod level. For example, it is a -fundamental property of pods that all containers in a pod share the same network namespace. -Therefore, using the host namespace makes sense to model at the pod level only, and indeed, today -it is part of the `PodSpec`. Other host namespace support is currently being added and these will -also be pod-level settings; it makes sense to model them as a pod-level collection of security -attributes. 
- -## Use Case: Override pod security context for container - -Some use cases require the containers in a pod to run with different security settings. As an -example, a user may want to have a pod with two containers, one of which runs as root with the -privileged setting, and one that runs as a non-root UID. To support use cases like this, it should -be possible to override appropriate (i.e., not intrinsically pod-level) security settings for -individual containers. - -## Proposed Design - -### SecurityContext - -For posterity and ease of reading, note the current state of `SecurityContext`: - -```go -package api - -type Container struct { - // Other fields omitted - - // Optional: SecurityContext defines the security options the pod should be run with - SecurityContext *SecurityContext `json:"securityContext,omitempty"` -} - -type SecurityContext struct { - // Capabilities are the capabilities to add/drop when running the container - Capabilities *Capabilities `json:"capabilities,omitempty"` - - // Run the container in privileged mode - Privileged *bool `json:"privileged,omitempty"` - - // SELinuxOptions are the labels to be applied to the container - // and volumes - SELinuxOptions *SELinuxOptions `json:"seLinuxOptions,omitempty"` - - // RunAsUser is the UID to run the entrypoint of the container process. - RunAsUser *int64 `json:"runAsUser,omitempty"` - - // RunAsNonRoot indicates that the container should be run as a non-root user. If the RunAsUser - // field is not explicitly set then the kubelet may check the image for a specified user or - // perform defaulting to specify a user. - RunAsNonRoot bool `json:"runAsNonRoot,omitempty"` -} - -// SELinuxOptions contains the fields that make up the SELinux context of a container. -type SELinuxOptions struct { - // SELinux user label - User string `json:"user,omitempty"` - - // SELinux role label - Role string `json:"role,omitempty"` - - // SELinux type label - Type string `json:"type,omitempty"` - - // SELinux level label. - Level string `json:"level,omitempty"` -} -``` - -### PodSecurityContext - -`PodSecurityContext` specifies two types of security attributes: - -1. Attributes that apply to the pod itself -2. Attributes that apply to the containers of the pod - -In the internal API, fields of the `PodSpec` controlling the use of the host PID, IPC, and network -namespaces are relocated to this type: - -```go -package api - -type PodSpec struct { - // Other fields omitted - - // Optional: SecurityContext specifies pod-level attributes and container security attributes - // that apply to all containers. - SecurityContext *PodSecurityContext `json:"securityContext,omitempty"` -} - -// PodSecurityContext specifies security attributes of the pod and container attributes that apply -// to all containers of the pod. -type PodSecurityContext struct { - // Use the host's network namespace. If this option is set, the ports that will be - // used must be specified. - // Optional: Default to false. 
- HostNetwork bool - // Use the host's IPC namespace - HostIPC bool - - // Use the host's PID namespace - HostPID bool - - // Capabilities are the capabilities to add/drop when running containers - Capabilities *Capabilities `json:"capabilities,omitempty"` - - // Run the container in privileged mode - Privileged *bool `json:"privileged,omitempty"` - - // SELinuxOptions are the labels to be applied to the container - // and volumes - SELinuxOptions *SELinuxOptions `json:"seLinuxOptions,omitempty"` - - // RunAsUser is the UID to run the entrypoint of the container process. - RunAsUser *int64 `json:"runAsUser,omitempty"` - - // RunAsNonRoot indicates that the container should be run as a non-root user. If the RunAsUser - // field is not explicitly set then the kubelet may check the image for a specified user or - // perform defaulting to specify a user. - RunAsNonRoot bool -} - -// Comments and generated docs will change for the container.SecurityContext field to indicate -// the precedence of these fields over the pod-level ones. - -type Container struct { - // Other fields omitted - - // Optional: SecurityContext defines the security options the pod should be run with. - // Settings specified in this field take precedence over the settings defined in - // pod.Spec.SecurityContext. - SecurityContext *SecurityContext `json:"securityContext,omitempty"` -} -``` - -In the V1 API, the pod-level security attributes which are currently fields of the `PodSpec` are -retained on the `PodSpec` for backward compatibility purposes: - -```go -package v1 - -type PodSpec struct { - // Other fields omitted - - // Use the host's network namespace. If this option is set, the ports that will be - // used must be specified. - // Optional: Default to false. - HostNetwork bool `json:"hostNetwork,omitempty"` - // Use the host's pid namespace. - // Optional: Default to false. - HostPID bool `json:"hostPID,omitempty"` - // Use the host's ipc namespace. - // Optional: Default to false. - HostIPC bool `json:"hostIPC,omitempty"` - - // Optional: SecurityContext specifies pod-level attributes and container security attributes - // that apply to all containers. - SecurityContext *PodSecurityContext `json:"securityContext,omitempty"` -} -``` - -The `pod.Spec.SecurityContext` specifies the security context of all containers in the pod. -The containers' `securityContext` field is overlaid on the base security context to determine the -effective security context for the container. - -The new V1 API should be backward compatible with the existing API. Backward compatibility is -defined as: - -> 1. Any API call (e.g. a structure POSTed to a REST endpoint) that worked before your change must -> work the same after your change. -> 2. Any API call that uses your change must not cause problems (e.g. crash or degrade behavior) when -> issued against servers that do not include your change. -> 3. It must be possible to round-trip your change (convert to different API versions and back) with -> no loss of information. - -Previous versions of this proposal attempted to deal with backward compatibility by defining -the affect of setting the pod-level fields on the container-level fields. While trying to find -consensus on this design, it became apparent that this approach was going to be extremely complex -to implement, explain, and support. Instead, we will approach backward compatibility as follows: - -1. Pod-level and container-level settings will not affect one another -2. 
Old clients will be able to use container-level settings in the exact same way -3. Container level settings always override pod-level settings if they are set - -#### Examples - -1. Old client using `pod.Spec.Containers[x].SecurityContext` - - An old client creates a pod: - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: test-pod - spec: - containers: - - name: a - securityContext: - runAsUser: 1001 - - name: b - securityContext: - runAsUser: 1002 - ``` - - looks to old clients like: - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: test-pod - spec: - containers: - - name: a - securityContext: - runAsUser: 1001 - - name: b - securityContext: - runAsUser: 1002 - ``` - - looks to new clients like: - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: test-pod - spec: - containers: - - name: a - securityContext: - runAsUser: 1001 - - name: b - securityContext: - runAsUser: 1002 - ``` - -2. New client using `pod.Spec.SecurityContext` - - A new client creates a pod using a field of `pod.Spec.SecurityContext`: - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: test-pod - spec: - securityContext: - runAsUser: 1001 - containers: - - name: a - - name: b - ``` - - appears to new clients as: - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: test-pod - spec: - securityContext: - runAsUser: 1001 - containers: - - name: a - - name: b - ``` - - old clients will see: - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: test-pod - spec: - containers: - - name: a - - name: b - ``` - -3. Pods created using `pod.Spec.SecurityContext` and `pod.Spec.Containers[x].SecurityContext` - - If a field is set in both `pod.Spec.SecurityContext` and - `pod.Spec.Containers[x].SecurityContext`, the value in `pod.Spec.Containers[x].SecurityContext` - wins. In the following pod: - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: test-pod - spec: - securityContext: - runAsUser: 1001 - containers: - - name: a - securityContext: - runAsUser: 1002 - - name: b - ``` - - The effective setting for `runAsUser` for container A is `1002`. - -#### Testing - -A backward compatibility test suite will be established for the v1 API. The test suite will -verify compatibility by converting objects into the internal API and back to the version API and -examining the results. - -All of the examples here will be used as test-cases. As more test cases are added, the proposal will -be updated. - -An example of a test like this can be found in the -[OpenShift API package](https://github.com/openshift/origin/blob/master/pkg/api/compatibility_test.go) - -E2E test cases will be added to test the correct determination of the security context for containers. - -### Kubelet changes - -1. The Kubelet will use the new fields on the `PodSecurityContext` for host namespace control -2. 
The Kubelet will be modified to correctly implement the backward compatibility and effective - security context determination defined here - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/pod-security-context.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-security-context.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-security-context.md) diff --git a/docs/proposals/protobuf.md b/docs/proposals/protobuf.md index 6741bbab9ef..9e2c06263a4 100644 --- a/docs/proposals/protobuf.md +++ b/docs/proposals/protobuf.md @@ -1,480 +1 @@ -# Protobuf serialization and internal storage - -@smarterclayton - -March 2016 - -## Proposal and Motivation - -The Kubernetes API server is a "dumb server" which offers storage, versioning, -validation, update, and watch semantics on API resources. In a large cluster -the API server must efficiently retrieve, store, and deliver large numbers -of coarse-grained objects to many clients. In addition, Kubernetes traffic is -heavily biased towards intra-cluster traffic - as much as 90% of the requests -served by the APIs are for internal cluster components like nodes, controllers, -and proxies. The primary format for intercluster API communication is JSON -today for ease of client construction. - -At the current time, the latency of reaction to change in the cluster is -dominated by the time required to load objects from persistent store (etcd), -convert them to an output version, serialize them JSON over the network, and -then perform the reverse operation in clients. The cost of -serialization/deserialization and the size of the bytes on the wire, as well -as the memory garbage created during those operations, dominate the CPU and -network usage of the API servers. - -In order to reach clusters of 10k nodes, we need roughly an order of magnitude -efficiency improvement in a number of areas of the cluster, starting with the -masters but also including API clients like controllers, kubelets, and node -proxies. - -We propose to introduce a Protobuf serialization for all common API objects -that can optionally be used by intra-cluster components. Experiments have -demonstrated a 10x reduction in CPU use during serialization and deserialization, -a 2x reduction in size in bytes on the wire, and a 6-9x reduction in the amount -of objects created on the heap during serialization. The Protobuf schema -for each object will be automatically generated from the external API Go structs -we use to serialize to JSON. - -Benchmarking showed that the time spent on the server in a typical GET -resembles: - - etcd -> decode -> defaulting -> convert to internal -> - JSON 50us 5us 15us - Proto 5us - JSON 150allocs 80allocs - Proto 100allocs - - process -> convert to external -> encode -> client - JSON 15us 40us - Proto 5us - JSON 80allocs 100allocs - Proto 4allocs - - Protobuf has a huge benefit on encoding because it does not need to allocate - temporary objects, just one large buffer. Changing to protobuf moves our - hotspot back to conversion, not serialization. - - -## Design Points - -* Generate Protobuf schema from Go structs (like we do for JSON) to avoid - manual schema update and drift -* Generate Protobuf schema that is field equivalent to the JSON fields (no - special types or enumerations), reducing drift for clients across formats. 
-* Follow our existing API versioning rules (backwards compatible in major - API versions, breaking changes across major versions) by creating one - Protobuf schema per API type. -* Continue to use the existing REST API patterns but offer an alternative - serialization, which means existing client and server tooling can remain - the same while benefiting from faster decoding. -* Protobuf objects on disk or in etcd will need to be self identifying at - rest, like JSON, in order for backwards compatibility in storage to work, - so we must add an envelope with apiVersion and kind to wrap the nested - object, and make the data format recognizable to clients. -* Use the [gogo-protobuf](https://github.com/gogo/protobuf) Golang library to generate marshal/unmarshal - operations, allowing us to bypass the expensive reflection used by the - golang JSOn operation - - -## Alternatives - -* We considered JSON compression to reduce size on wire, but that does not - reduce the amount of memory garbage created during serialization and - deserialization. -* More efficient formats like Msgpack were considered, but they only offer - 2x speed up vs. the 10x observed for Protobuf -* gRPC was considered, but is a larger change that requires more core - refactoring. This approach does not eliminate the possibility of switching - to gRPC in the future. -* We considered attempting to improve JSON serialization, but the cost of - implementing a more efficient serializer library than ugorji is - significantly higher than creating a protobuf schema from our Go structs. - - -## Schema - -The Protobuf schema for each API group and version will be generated from -the objects in that API group and version. The schema will be named using -the package identifier of the Go package, i.e. - - k8s.io/kubernetes/pkg/api/v1 - -Each top level object will be generated as a Protobuf message, i.e.: - - type Pod struct { ... } - - message Pod {} - -Since the Go structs are designed to be serialized to JSON (with only the -int, string, bool, map, and array primitive types), we will use the -canonical JSON serialization as the protobuf field type wherever possible, -i.e.: - - JSON Protobuf - string -> string - int -> varint - bool -> bool - array -> repeating message|primitive - -We disallow the use of the Go `int` type in external fields because it is -ambiguous depending on compiler platform, and instead always use `int32` or -`int64`. - -We will use maps (a protobuf 3 extension that can serialize to protobuf 2) -to represent JSON maps: - - JSON Protobuf Wire (proto2) - map -> map -> repeated Message { key string; value bytes } - -We will not convert known string constants to enumerations, since that -would require extra logic we do not already have in JSOn. - -To begin with, we will use Protobuf 3 to generate a Protobuf 2 schema, and -in the future investigate a Protobuf 3 serialization. We will introduce -abstractions that let us have more than a single protobuf serialization if -necessary. Protobuf 3 would require us to support message types for -pointer primitive (nullable) fields, which is more complex than Protobuf 2's -support for pointers. 
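
Before the generated IDL examples below, here is a hedged illustration of the Go side of this mapping: a hypothetical struct (not a real API type) whose fields use only the JSON-compatible primitives described above, annotated with the kind of `protobuf` struct tags that the generation process (described later in this proposal) writes back. The type name, field names, and tag numbers are assumptions for illustration only.

```go
// Package v1example is a hypothetical API package used only to illustrate the
// JSON-to-protobuf field-type mapping and the generated protobuf struct tags.
package v1example

type ExampleSpec struct {
	// string -> protobuf string
	Name string `json:"name,omitempty" protobuf:"bytes,1,opt,name=name"`
	// int64 -> protobuf varint (plain Go `int` is disallowed in external types)
	Replicas int64 `json:"replicas,omitempty" protobuf:"varint,2,opt,name=replicas"`
	// bool -> protobuf bool (varint on the wire)
	Paused bool `json:"paused,omitempty" protobuf:"varint,3,opt,name=paused"`
	// map[string]string -> protobuf map (repeated key/value messages on the wire)
	Labels map[string]string `json:"labels,omitempty" protobuf:"bytes,4,rep,name=labels"`
	// []string -> repeated string
	Args []string `json:"args,omitempty" protobuf:"bytes,5,rep,name=args"`
	// *int32 -> optional (nullable) field, distinguishing unset from zero
	MaxRetries *int32 `json:"maxRetries,omitempty" protobuf:"varint,6,opt,name=maxRetries"`
}
```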
- -### Example of generated proto IDL - -Without gogo extensions: - -``` -syntax = 'proto2'; - -package k8s.io.kubernetes.pkg.api.v1; - -import "k8s.io/kubernetes/pkg/api/resource/generated.proto"; -import "k8s.io/kubernetes/pkg/api/unversioned/generated.proto"; -import "k8s.io/kubernetes/pkg/runtime/generated.proto"; -import "k8s.io/kubernetes/pkg/util/intstr/generated.proto"; - -// Package-wide variables from generator "generated". -option go_package = "v1"; - -// Represents a Persistent Disk resource in AWS. -// -// An AWS EBS disk must exist before mounting to a container. The disk -// must also be in the same AWS zone as the kubelet. An AWS EBS disk -// can only be mounted as read/write once. AWS EBS volumes support -// ownership management and SELinux relabeling. -message AWSElasticBlockStoreVolumeSource { - // Unique ID of the persistent disk resource in AWS (Amazon EBS volume). - // More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore - optional string volumeID = 1; - - // Filesystem type of the volume that you want to mount. - // Tip: Ensure that the filesystem type is supported by the host operating system. - // Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified. - // More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore - // TODO: how do we prevent errors in the filesystem from compromising the machine - optional string fsType = 2; - - // The partition in the volume that you want to mount. - // If omitted, the default is to mount by volume name. - // Examples: For volume /dev/sda1, you specify the partition as "1". - // Similarly, the volume partition for /dev/sda is "0" (or you can leave the property empty). - optional int32 partition = 3; - - // Specify "true" to force and set the ReadOnly property in VolumeMounts to "true". - // If omitted, the default is "false". - // More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore - optional bool readOnly = 4; -} - -// Affinity is a group of affinity scheduling rules, currently -// only node affinity, but in the future also inter-pod affinity. -message Affinity { - // Describes node affinity scheduling rules for the pod. - optional NodeAffinity nodeAffinity = 1; -} -``` - -With extensions: - -``` -syntax = 'proto2'; - -package k8s.io.kubernetes.pkg.api.v1; - -import "github.com/gogo/protobuf/gogoproto/gogo.proto"; -import "k8s.io/kubernetes/pkg/api/resource/generated.proto"; -import "k8s.io/kubernetes/pkg/api/unversioned/generated.proto"; -import "k8s.io/kubernetes/pkg/runtime/generated.proto"; -import "k8s.io/kubernetes/pkg/util/intstr/generated.proto"; - -// Package-wide variables from generator "generated". -option (gogoproto.marshaler_all) = true; -option (gogoproto.sizer_all) = true; -option (gogoproto.unmarshaler_all) = true; -option (gogoproto.goproto_unrecognized_all) = false; -option (gogoproto.goproto_enum_prefix_all) = false; -option (gogoproto.goproto_getters_all) = false; -option go_package = "v1"; - -// Represents a Persistent Disk resource in AWS. -// -// An AWS EBS disk must exist before mounting to a container. The disk -// must also be in the same AWS zone as the kubelet. An AWS EBS disk -// can only be mounted as read/write once. AWS EBS volumes support -// ownership management and SELinux relabeling. -message AWSElasticBlockStoreVolumeSource { - // Unique ID of the persistent disk resource in AWS (Amazon EBS volume). 
- // More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore - optional string volumeID = 1 [(gogoproto.customname) = "VolumeID", (gogoproto.nullable) = false]; - - // Filesystem type of the volume that you want to mount. - // Tip: Ensure that the filesystem type is supported by the host operating system. - // Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified. - // More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore - // TODO: how do we prevent errors in the filesystem from compromising the machine - optional string fsType = 2 [(gogoproto.customname) = "FSType", (gogoproto.nullable) = false]; - - // The partition in the volume that you want to mount. - // If omitted, the default is to mount by volume name. - // Examples: For volume /dev/sda1, you specify the partition as "1". - // Similarly, the volume partition for /dev/sda is "0" (or you can leave the property empty). - optional int32 partition = 3 [(gogoproto.customname) = "Partition", (gogoproto.nullable) = false]; - - // Specify "true" to force and set the ReadOnly property in VolumeMounts to "true". - // If omitted, the default is "false". - // More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore - optional bool readOnly = 4 [(gogoproto.customname) = "ReadOnly", (gogoproto.nullable) = false]; -} - -// Affinity is a group of affinity scheduling rules, currently -// only node affinity, but in the future also inter-pod affinity. -message Affinity { - // Describes node affinity scheduling rules for the pod. - optional NodeAffinity nodeAffinity = 1 [(gogoproto.customname) = "NodeAffinity"]; -} -``` - -## Wire format - -In order to make Protobuf serialized objects recognizable in a binary form, -the encoded object must be prefixed by a magic number, and then wrap the -non-self-describing Protobuf object in a Protobuf object that contains -schema information. The protobuf object is referred to as the `raw` object -and the encapsulation is referred to as `wrapper` object. - -The simplest serialization is the raw Protobuf object with no identifying -information. In some use cases, we may wish to have the server identify the -raw object type on the wire using a protocol dependent format (gRPC uses -a type HTTP header). This works when all objects are of the same type, but -we occasionally have reasons to encode different object types in the same -context (watches, lists of objects on disk, and API calls that may return -errors). - -To identify the type of a wrapped Protobuf object, we wrap it in a message -in package `k8s.io/kubernetes/pkg/runtime` with message name `Unknown` -having the following schema: - - message Unknown { - optional TypeMeta typeMeta = 1; - optional bytes value = 2; - optional string contentEncoding = 3; - optional string contentType = 4; - } - - message TypeMeta { - optional string apiVersion = 1; - optional string kind = 2; - } - -The `value` field is an encoded protobuf object that matches the schema -defined in `typeMeta` and has optional `contentType` and `contentEncoding` -fields. `contentType` and `contentEncoding` have the same meaning as in -HTTP, if unspecified `contentType` means "raw protobuf object", and -`contentEncoding` defaults to no encoding. If `contentEncoding` is -specified, the defined transformation should be applied to `value` before -attempting to decode the value. - -The `contentType` field is required to support objects without a defined -protobuf schema, like the ThirdPartyResource or templates. 
Those objects
-would have to be encoded as JSON or another structurally compatible form
-when used with Protobuf. Generic clients must deal with the possibility
-that the returned value is not in a known type.
-
-We add the `contentEncoding` field here to preserve room for future
-optimizations like encryption-at-rest or compression of the nested content.
-Clients should error when receiving an encoding they do not support.
-Negotiating encoding is not defined here, but introducing new encodings
-is similar to introducing a schema change or new API version.
-
-A client should use the `kind` and `apiVersion` fields to identify the
-correct protobuf IDL for that message and version, and then decode the
-`bytes` field into that Protobuf message.
-
-Any Unknown value written to stable storage will be given a 4-byte prefix
-`0x6b, 0x38, 0x73, 0x00`, which corresponds to `k8s` followed by a zero byte.
-The content-type `application/vnd.kubernetes.protobuf` is defined as
-representing the following schema:
-
-    MESSAGE = '0x6b 0x38 0x73 0x00' UNKNOWN
-    UNKNOWN = <protobuf serialized runtime.Unknown>
-
-A client should check for the first four bytes, then perform a protobuf
-deserialization of the remaining bytes into the `runtime.Unknown` type.
-
-## Streaming wire format
-
-While the majority of Kubernetes APIs return single objects that can vary
-in type (Pod vs. Status, PodList vs. Status), the watch APIs return a stream
-of identically typed objects (Events). At the time of this writing, this is the only
-current or anticipated streaming RESTful protocol (logging, port-forwarding,
-and exec protocols use a binary protocol over Websockets or SPDY).
-
-In JSON, this API is implemented as a stream of JSON objects that are
-separated by their syntax (the closing `}` brace is followed by whitespace
-and the opening `{` brace starts the next object). There is no formal
-specification covering this pattern, nor a unique content-type. Each object
-is expected to be of type `watch.Event`, and is currently not self describing.
-
-For expediency and consistency, we define a format for Protobuf watch Events
-that is similar. Since protobuf messages are not self describing, we must
-identify the boundaries between Events (a `frame`). We do that by prefixing
-each frame of N bytes with a 4-byte, big-endian, unsigned integer with the
-value N.
-
-    frame = length body
-    length = 32-bit unsigned integer in big-endian order, denoting the length
-      of the body in bytes
-    body = <protobuf serialized versioned.Event>
-
-    # frame containing a single byte 0a
-    frame = 00 00 00 01 0a
-
-    # equivalent JSON
-    frame = {"type": "added", ...}
-
-The body of each frame is a serialized Protobuf message `Event` in package
-`k8s.io/kubernetes/pkg/watch/versioned`. The content type used for this
-format is `application/vnd.kubernetes.protobuf;type=watch`.
-
-## Negotiation
-
-To allow clients to request protobuf serialization optionally, the `Accept`
-HTTP header is used by callers to indicate which serialization they wish
-returned in the response, and the `Content-Type` header is used to tell the
-server how to decode the bytes sent in the request (for DELETE/POST/PUT/PATCH
-requests). The server will return 406 if the `Accept` header is not
-recognized or 415 if the `Content-Type` is not recognized (as defined in
-RFC2616).
-
-To be backwards compatible, clients must allow for the possibility that the
-server does not support protobuf serialization. A number of options are possible:
-
-### Preconfigured
-
-Clients can have a configuration setting that instructs them which
-serialization to use. 
This is the simplest option, but requires intervention when the -component upgrades to protobuf. - -### Include serialization information in api-discovery - -Servers can define the list of content types they accept and return in -their API discovery docs, and clients can use protobuf if they support it. -Allows dynamic configuration during upgrade if the client is already using -API-discovery. - -### Optimistically attempt to send and receive requests using protobuf - -Using multiple `Accept` values: - - Accept: application/vnd.kubernetes.protobuf, application/json - -clients can indicate their preferences and handle the returned -`Content-Type` using whatever the server responds. On update operations, -clients can try protobuf and if they receive a 415 error, record that and -fall back to JSON. Allows the client to be backwards compatible with -any server, but comes at the cost of some implementation complexity. - - -## Generation process - -Generation proceeds in five phases: - -1. Generate a gogo-protobuf annotated IDL from the source Go struct. -2. Generate temporary Go structs from the IDL using gogo-protobuf. -3. Generate marshaller/unmarshallers based on the IDL using gogo-protobuf. -4. Take all tag numbers generated for the IDL and apply them as struct tags - to the original Go types. -5. Generate a final IDL without gogo-protobuf annotations as the canonical IDL. - -The output is a `generated.proto` file in each package containing a standard -proto2 IDL, and a `generated.pb.go` file in each package that contains the -generated marshal/unmarshallers. - -The Go struct generated by gogo-protobuf from the first IDL must be identical -to the origin struct - a number of changes have been made to gogo-protobuf -to ensure exact 1-1 conversion. A small number of additions may be necessary -in the future if we introduce more exotic field types (Go type aliases, maps -with aliased Go types, and embedded fields were fixed). If they are identical, -the output marshallers/unmarshallers can then work on the origin struct. - -Whenever a new field is added, generation will assign that field a unique tag -and the 4th phase will write that tag back to the origin Go struct as a `protobuf` -struct tag. This ensures subsequent generation passes are stable, even in the -face of internal refactors. The first time a field is added, the author will -need to check in both the new IDL AND the protobuf struct tag changes. - -The second IDL is generated without gogo-protobuf annotations to allow clients -in other languages to generate easily. - -Any errors in the generation process are considered fatal and must be resolved -early (being unable to identify a field type for conversion, duplicate fields, -duplicate tags, protoc errors, etc). The conversion fuzzer is used to ensure -that a Go struct can be round-tripped to protobuf and back, as we do for JSON -and conversion testing. - - -## Changes to development process - -All existing API change rules would still apply. New fields added would be -automatically assigned a tag by the generation process. New API versions will -have a new proto IDL, and field name and changes across API versions would be -handled using our existing API change rules. Tags cannot change within an -API version. - -Generation would be done by developers and then checked into source control, -like conversions and ugorji JSON codecs. 
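-
-As an illustration of what phase 4 produces (the type and fields below are
-hypothetical, not taken from the real API), a Go struct carrying the
-generated `protobuf` struct tags might look like:
-
-```go
-// ExampleSpec is a hypothetical API type used only to illustrate the tags
-// written back by the generator; the tag numbers are assigned automatically
-// and must never change within an API version.
-type ExampleSpec struct {
-	// Tag 1 was assigned when the field was first added.
-	Replicas int32 `json:"replicas" protobuf:"varint,1,opt,name=replicas"`
-
-	// Tag 2 was assigned later; the author checked in both the regenerated
-	// IDL and this struct tag change together.
-	Paused bool `json:"paused,omitempty" protobuf:"varint,2,opt,name=paused"`
-}
-```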
- -Because protoc is not packaged well across all platforms, we will add it to -the `kube-cross` Docker image and developers can use that to generate -updated protobufs. Protobuf 3 beta is required. - -The generated protobuf will be checked with a verify script before merging. - - -## Implications - -* The generated marshal code is large and will increase build times and binary - size. We may be able to remove ugorji after protobuf is added, since the - bulk of our decoding would switch to protobuf. -* The protobuf schema is naive, which means it may not be as a minimal as - possible. -* Debugging of protobuf related errors is harder due to the binary nature of - the format. -* Migrating API object storage from JSON to protobuf will require that all - API servers are upgraded before beginning to write protobuf to disk, since - old servers won't recognize protobuf. -* Transport of protobuf between etcd and the api server will be less efficient - in etcd2 than etcd3 (since etcd2 must encode binary values returned as JSON). - Should still be smaller than current JSON request. -* Third-party API objects must be stored as JSON inside of a protobuf wrapper - in etcd, and the API endpoints will not benefit from clients that speak - protobuf. Clients will have to deal with some API objects not supporting - protobuf. - - -## Open Questions - -* Is supporting stored protobuf files on disk in the kubectl client worth it? - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/protobuf.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/protobuf.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/protobuf.md) diff --git a/docs/proposals/release-notes.md b/docs/proposals/release-notes.md index f602eead50b..f1f7d7cac43 100644 --- a/docs/proposals/release-notes.md +++ b/docs/proposals/release-notes.md @@ -1,194 +1 @@ - -# Kubernetes Release Notes - -[djmm@google.com](mailto:djmm@google.com)
-Last Updated: 2016-04-06 - - - -- [Kubernetes Release Notes](#kubernetes-release-notes) - - [Objective](#objective) - - [Background](#background) - - [The Problem](#the-problem) - - [The (general) Solution](#the-general-solution) - - [Then why not just list *every* change that was submitted, CHANGELOG-style?](#then-why-not-just-list-every-change-that-was-submitted-changelog-style) - - [Options](#options) - - [Collection Design](#collection-design) - - [Publishing Design](#publishing-design) - - [Location](#location) - - [Layout](#layout) - - [Alpha/Beta/Patch Releases](#alphabetapatch-releases) - - [Major/Minor Releases](#majorminor-releases) - - [Work estimates](#work-estimates) - - [Caveats / Considerations](#caveats--considerations) - - - -## Objective - -Define a process and design tooling for collecting, arranging and publishing -release notes for Kubernetes releases, automating as much of the process as -possible. - -The goal is to introduce minor changes to the development workflow -in a way that is mostly frictionless and allows for the capture of release notes -as PRs are submitted to the repository. - -This direct association of release notes to PRs captures the intention of -release visibility of the PR at the point an idea is submitted upstream. -The release notes can then be more easily collected and published when the -release is ready. - -## Background - -### The Problem - -Release notes are often an afterthought and clarifying and finalizing them -is often left until the very last minute at the time the release is made. -This is usually long after the feature or bug fix was added and is no longer on -the mind of the author. Worse, the collecting and summarizing of the -release is often left to those who may know little or nothing about these -individual changes! - -Writing and editing release notes at the end of the cycle can be a rushed, -interrupt-driven and often stressful process resulting in incomplete, -inconsistent release notes often with errors and omissions. - -### The (general) Solution - -Like most things in the development/release pipeline, the earlier you do it, -the easier it is for everyone and the better the outcome. Gather your release -notes earlier in the development cycle, at the time the features and fixes are -added. - -#### Then why not just list *every* change that was submitted, CHANGELOG-style? - -On larger projects like Kubernetes, showing every single change (PR) would mean -hundreds of entries. The goal is to highlight the major changes for a release. - -## Options - -1. Use of pre-commit and other local git hooks - * Experiments here using `prepare-commit-msg` and `commit-msg` git hook files - were promising but less than optimal due to the fact that they would - require input/confirmation with each commit and there may be multiple - commits in a push and eventual PR. -1. Use of [github templates](https://github.com/blog/2111-issue-and-pull-request-templates) - * Templates provide a great way to pre-fill PR comments, but there are no - server-side hooks available to parse and/or easily check the contents of - those templates to ensure that checkboxes were checked or forms were filled - in. -1. Use of labels enforced by mungers/bots - * We already make great use of mungers/bots to manage labels on PRs and it - fits very nicely in the existing workflow - -## Collection Design - -The munger/bot option fits most cleanly into the existing workflow. - -All `release-note-*` labeling is managed on the master branch PR only. 
-No `release-note-*` labels are needed on cherry-pick PRs and no information -will be collected from that cherry-pick PR. - -The only exception to this rule is when a PR is not a cherry-pick and is -targeted directly to the non-master branch. In this case, a `release-note-*` -label is required for that non-master PR. - -1. New labels added to github: `release-note-none`, maybe others for new release note categories - see Layout section below -1. A [new munger](https://github.com/kubernetes/kubernetes/issues/23409) that will: - * Add a `release-note-label-needed` label to all new master branch PRs - * Block merge by the submit queue on all PRs labeled as `release-note-label-needed` - * Auto-remove `release-note-label-needed` when one of the `release-note-*` labels is added - -## Publishing Design - -### Location - -With v1.2.0, the release notes were moved from their previous [github releases](https://github.com/kubernetes/kubernetes/releases) -location to [CHANGELOG.md](../../CHANGELOG.md). Going forward this seems like a good plan. -Other projects do similarly. - -The kubernetes.tar.gz download link is also displayed along with the release notes -in [CHANGELOG.md](../../CHANGELOG.md). - -Is there any reason to continue publishing anything to github releases if -the complete release story is published in [CHANGELOG.md](../../CHANGELOG.md)? - -### Layout - -Different types of releases will generally have different requirements in -terms of layout. As expected, major releases like v1.2.0 are going -to require much more detail than the automated release notes will provide. - -The idea is that these mechanisms will provide 100% of the release note -content for alpha, beta and most minor releases and bootstrap the content -with a release note 'template' for the authors of major releases like v1.2.0. - -The authors can then collaborate and edit the higher level sections of the -release notes in a PR, updating [CHANGELOG.md](../../CHANGELOG.md) as needed. - -v1.2.0 demonstrated the need, at least for major releases like v1.2.0, for -several sections in the published release notes. -In order to provide a basic layout for release notes in the future, -new releases can bootstrap [CHANGELOG.md](../../CHANGELOG.md) with the following template types: - -#### Alpha/Beta/Patch Releases - -These are automatically generated from `release-note*` labels, but can be modified as needed. - -``` -Action Required -* PR titles from the release-note-action-required label - -Other notable changes -* PR titles from the release-note label -``` - -#### Major/Minor Releases - -``` -Major Themes -* Add to or delete this section - -Other notable improvements -* Add to or delete this section - -Experimental Features -* Add to or delete this section - -Action Required -* PR titles from the release-note-action-required label - -Known Issues -* Add to or delete this section - -Provider-specific Notes -* Add to or delete this section - -Other notable changes -* PR titles from the release-note label -``` - -## Work estimates - -* The [new munger](https://github.com/kubernetes/kubernetes/issues/23409) - * Owner: @eparis - * Time estimate: Mostly done -* Updates to the tool that collects, organizes, publishes and sends release - notifications. - * Owner: @david-mcmahon - * Time estimate: A few days - - -## Caveats / Considerations - -* As part of the planning and development workflow how can we capture - release notes for bigger features? 
- [#23070](https://github.com/kubernetes/kubernetes/issues/23070) - * For now contributors should simply use the first PR that enables a new - feature by default. We'll revisit if this does not work well. - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/release-notes.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/release-notes.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/release-notes.md) diff --git a/docs/proposals/rescheduler.md b/docs/proposals/rescheduler.md index faf535642f9..aeb8ed642a4 100644 --- a/docs/proposals/rescheduler.md +++ b/docs/proposals/rescheduler.md @@ -1,123 +1 @@ -# Rescheduler design space - -@davidopp, @erictune, @briangrant - -July 2015 - -## Introduction and definition - -A rescheduler is an agent that proactively causes currently-running -Pods to be moved, so as to optimize some objective function for -goodness of the layout of Pods in the cluster. (The objective function -doesn't have to be expressed mathematically; it may just be a -collection of ad-hoc rules, but in principle there is an objective -function. Implicitly an objective function is described by the -scheduler's predicate and priority functions.) It might be triggered -to run every N minutes, or whenever some event happens that is known -to make the objective function worse (for example, whenever any Pod goes -PENDING for a long time.) - -## Motivation and use cases - -A rescheduler is useful because without a rescheduler, scheduling -decisions are only made at the time Pods are created. But later on, -the state of the cell may have changed in some way such that it would -be better to move the Pod to another node. - -There are two categories of movements a rescheduler might trigger: coalescing -and spreading. - -### Coalesce Pods - -This is the most common use case. Cluster layout changes over time. For -example, run-to-completion Pods terminate, producing free space in their wake, but that space -is fragmented. This fragmentation might prevent a PENDING Pod from scheduling -(there are enough free resource for the Pod in aggregate across the cluster, -but not on any single node). A rescheduler can coalesce free space like a -disk defragmenter, thereby producing enough free space on a node for a PENDING -Pod to schedule. In some cases it can do this just by moving Pods into existing -holes, but often it will need to evict (and reschedule) running Pods in order to -create a large enough hole. - -A second use case for a rescheduler to coalesce pods is when it becomes possible -to support the running Pods on a fewer number of nodes. The rescheduler can -gradually move Pods off of some set of nodes to make those nodes empty so -that they can then be shut down/removed. More specifically, -the system could do a simulation to see whether after removing a node from the -cluster, will the Pods that were on that node be able to reschedule, -either directly or with the help of the rescheduler; if the answer is -yes, then you can safely auto-scale down (assuming services will still -meeting their application-level SLOs). - -### Spread Pods - -The main use cases for spreading Pods revolve around relieving congestion on (a) highly -utilized node(s). For example, some process might suddenly start receiving a significantly -above-normal amount of external requests, leading to starvation of best-effort -Pods on the node. 
We can use the rescheduler to move the best-effort Pods off of the -node. (They are likely to have generous eviction SLOs, so are more likely to be movable -than the Pod that is experiencing the higher load, but in principle we might move either.) -Or even before any node becomes overloaded, we might proactively re-spread Pods from nodes -with high-utilization, to give them some buffer against future utilization spikes. In either -case, the nodes we move the Pods onto might have been in the system for a long time or might -have been added by the cluster auto-scaler specifically to allow the rescheduler to -rebalance utilization. - -A second spreading use case is to separate antagonists. -Sometimes the processes running in two different Pods on the same node -may have unexpected antagonistic -behavior towards one another. A system component might monitor for such -antagonism and ask the rescheduler to move one of the antagonists to a new node. - -### Ranking the use cases - -The vast majority of users probably only care about rescheduling for three scenarios: - -1. Move Pods around to get a PENDING Pod to schedule -1. Redistribute Pods onto new nodes added by a cluster auto-scaler when there are no PENDING Pods -1. Move Pods around when CPU starvation is detected on a node - -## Design considerations and design space - -Because rescheduling is disruptive--it causes one or more -already-running Pods to die when they otherwise wouldn't--a key -constraint on rescheduling is that it must be done subject to -disruption SLOs. There are a number of ways to specify these SLOs--a -global rate limit across all Pods, a rate limit across a set of Pods -defined by some particular label selector, a maximum number of Pods -that can be down at any one time among a set defined by some -particular label selector, etc. These policies are presumably part of -the Rescheduler's configuration. - -There are a lot of design possibilities for a rescheduler. To explain -them, it's easiest to start with the description of a baseline -rescheduler, and then describe possible modifications. The Baseline -rescheduler -* only kicks in when there are one or more PENDING Pods for some period of time; its objective function is binary: completely happy if there are no PENDING Pods, and completely unhappy if there are PENDING Pods; it does not try to optimize for any other aspect of cluster layout -* is not a scheduler -- it simply identifies a node where a PENDING Pod could fit if one or more Pods on that node were moved out of the way, and then kills those Pods to make room for the PENDING Pod, which will then be scheduled there by the regular scheduler(s). [obviously this killing operation must be able to specify "don't allow the killed Pod to reschedule back to whence it was killed" otherwise the killing is pointless] Of course it should only do this if it is sure the killed Pods will be able to reschedule into already-free space in the cluster. Note that although it is not a scheduler, the Rescheduler needs to be linked with the predicate functions of the scheduling algorithm(s) so that it can know (1) that the PENDING Pod would actually schedule into the hole it has identified once the hole is created, and (2) that the evicted Pod(s) will be able to schedule somewhere else in the cluster. - -Possible variations on this Baseline rescheduler are - -1. 
it can kill the Pod(s) whose space it wants **and also schedule the Pod that will take that space and reschedule the Pod(s) that were killed**, rather than just killing the Pod(s) whose space it wants and relying on the regular scheduler(s) to schedule the Pod that will take that space (and to reschedule the Pod(s) that were evicted)
-1. it can run continuously in the background to optimize general cluster layout instead of just trying to get a PENDING Pod to schedule
-1. it can try to move groups of Pods instead of using a one-at-a-time / greedy approach
-1. it can formulate multi-hop plans instead of single-hop
-
-A key design question for a Rescheduler is how much knowledge it needs about the scheduling policies used by the cluster's scheduler(s).
-* For the Baseline rescheduler, it needs to know the predicate functions used by the cluster's scheduler(s), else it can't know how to create a hole that the PENDING Pod will fit into, nor be sure that the evicted Pod(s) will be able to reschedule elsewhere.
-* If it is going to run continuously in the background to optimize cluster layout but is still only going to kill Pods, then it still needs to know the predicate functions for the reason mentioned above. In principle it doesn't need to know the priority functions; it could just randomly kill Pods and rely on the regular scheduler to put them back in better places. However, this is a rather inexact approach. Thus it is useful for the rescheduler to know the priority functions, or at least some subset of them, so it can be sure that an action it takes will actually improve the cluster layout.
-* If it is going to run continuously in the background to optimize cluster layout and is going to act as a scheduler rather than just killing Pods, then it needs to know the predicate functions and some compatible (but not necessarily identical) priority functions. One example of a case where "compatible but not identical" might be useful is if the main scheduler(s) has a very simple scheduling policy optimized for low scheduling latency, and the Rescheduler has a more sophisticated/optimal scheduling policy that requires more computation time. The main thing to avoid is for the scheduler(s) and rescheduler to have incompatible priority functions, as this will cause them to "fight" (though it still can't lead to an infinite loop, since the scheduler(s) only ever touches a Pod once).
-
-## Appendix: Integrating rescheduler with cluster auto-scaler (scale up)
-
-For scaling up the cluster, a reasonable workflow might be:
-
-1. pod horizontal auto-scaler decides to add one or more Pods to a service, based on the metrics it is observing
-1. the Pod goes PENDING due to lack of a suitable node with sufficient resources
-1. rescheduler notices the PENDING Pod and determines that the Pod cannot schedule just by rearranging existing Pods (while respecting SLOs)
-1. rescheduler triggers cluster auto-scaler to add a node of the appropriate type for the PENDING Pod
-1. 
the PENDING Pod schedules onto the new node (and possibly the rescheduler also moves other Pods onto that node)
-
-
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/rescheduler.md?pixel)]()
-
+This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/rescheduler.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/rescheduler.md)
diff --git a/docs/proposals/rescheduling-for-critical-pods.md b/docs/proposals/rescheduling-for-critical-pods.md
index 1d2d80ee58a..0b999c0483d 100644
--- a/docs/proposals/rescheduling-for-critical-pods.md
+++ b/docs/proposals/rescheduling-for-critical-pods.md
@@ -1,88 +1 @@
-# Rescheduler: guaranteed scheduling of critical addons
-
-## Motivation
-
-In addition to Kubernetes core components like the api-server, scheduler, and controller-manager running on a master machine,
-there are a number of addons which, for various reasons, have to run on a regular cluster node rather than the master.
-Some of them are critical to having a fully functional cluster: Heapster, DNS, and the UI. Users can break their cluster
-by evicting a critical addon (either manually or as a side effect of another operation like an upgrade),
-which may then become pending (for example, when the cluster is highly utilized).
-To avoid such a situation we want to have a mechanism which guarantees that
-critical addons are scheduled, assuming the cluster is big enough.
-This may affect other pods (including users’ production applications).
-
-## Design
-
-The Rescheduler will ensure that critical addons are always scheduled.
-In the first version it will implement only this policy, but later we may want to introduce other policies.
-It will be a standalone component running on the master machine, similar to the scheduler.
-The two components will share common logic (initially the rescheduler will in fact import some scheduler packages).
-
-### Guaranteed scheduling of critical addons
-
-The Rescheduler will observe critical addons
-(with the annotation `scheduler.alpha.kubernetes.io/critical-pod`).
-If one of them is marked by the scheduler as unschedulable (pod condition `PodScheduled` set to `false`, the reason set to `Unschedulable`),
-the component will try to find space for the addon by evicting some pods; the scheduler will then schedule the addon.
-
-#### Scoring nodes
-
-Initially we want to choose a random node with enough capacity
-(chosen as described in [Evicting pods](rescheduling-for-critical-pods.md#evicting-pods)) to schedule the given addon.
-Later we may want to introduce some heuristics:
-* minimize the number of evicted pods whose disruption budget is violated or whose termination grace period is shortened
-* minimize the number of affected pods by choosing a node on which we have to evict fewer pods
-* increase the probability that evicted pods reschedule by preferring the set of pods with the smallest total sum of requests
-* avoid nodes which are ‘non-drainable’ (according to drain logic), for example nodes on which there is a pod which doesn’t belong to any RC/RS/Deployment
-
-#### Evicting pods
-
-There are two mechanisms which can delay a pod eviction: Disruption Budget and Termination Grace Period.
-
-While removing a pod we will try to avoid violating its Disruption Budget, though we can’t guarantee this,
-since respecting the budget could block the operation for a long period of time.
-We will also try to respect the Termination Grace Period, though without any guarantee.
-In case we have to remove a pod with a termination grace period longer than 10s, it will be shortened to 10s.
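-
-A minimal sketch of that capping rule, written against a hypothetical
-pod-deletion interface rather than any existing client API, might look like:
-
-```go
-// maxEvictionGraceSeconds is the cap described above.
-const maxEvictionGraceSeconds int64 = 10
-
-// podDeleter is a hypothetical interface standing in for whatever client the
-// rescheduler uses to delete pods with an explicit grace period.
-type podDeleter interface {
-	Delete(namespace, name string, gracePeriodSeconds int64) error
-}
-
-// evictVictim removes a pod that blocks a critical addon, shortening its
-// termination grace period to at most 10 seconds.
-func evictVictim(c podDeleter, namespace, name string, podGraceSeconds int64) error {
-	grace := podGraceSeconds
-	if grace > maxEvictionGraceSeconds {
-		grace = maxEvictionGraceSeconds
-	}
-	return c.Delete(namespace, name, grace)
-}
-```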
-
-The proposed order while choosing a node to schedule a critical addon and pods to remove is:
-1. a node where the critical addon pod can fit after evicting only pods satisfying both of the following:
-(1) their disruption budget will not be violated by the eviction, and (2) they have a grace period <= 10 seconds
-1. a node where the critical addon pod can fit after evicting only pods whose disruption budget will not be violated by the eviction
-1. any node where the critical addon pod can fit after evicting some pods
-
-### Interaction with Scheduler
-
-To avoid a situation in which the Scheduler schedules another pod into the space prepared for the critical addon,
-the chosen node has to be temporarily excluded from the list of nodes the Scheduler considers while making decisions.
-For this purpose the node will get a temporary
-[Taint](../../docs/design/taint-toleration-dedicated.md) “CriticalAddonsOnly”,
-and each critical addon has to have a toleration defined for this taint.
-Once the Rescheduler has no more work to do (all critical addons are scheduled, or the cluster is too small for them),
-all taints will be removed.
-
-### Interaction with Cluster Autoscaler
-
-The Rescheduler can partially duplicate the responsibility of the Cluster Autoscaler:
-both components take action when there is an unschedulable pod.
-This may lead to a situation in which CA adds an extra node for a pending critical addon
-while the Rescheduler evicts some running pods to make space for the addon.
-This situation would be rare, and usually an extra node would be needed for the evicted pods anyway.
-In the worst case CA will add and then remove the node.
-To avoid complicating the architecture by introducing interaction between these two components, we accept this overlap.
-
-We want to ensure that CA won’t remove nodes with critical addons by adding appropriate logic there.
-
-### Rescheduler control loop
-
-The rescheduler control loop will be as follows:
-* while there is an unschedulable critical addon, do the following:
-  * choose a node on which the addon should be scheduled (as described in Evicting pods)
-  * add a taint to the node to prevent the scheduler from using it
-  * delete the pods which block the addon from being scheduled
-  * wait until the scheduler schedules the critical addon
-* if there are no more critical addons we can help, ensure there is no node with the taint
-
-
-
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/rescheduling-for-critical-pods.md?pixel)]()
-
+This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/rescheduling-for-critical-pods.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/rescheduling-for-critical-pods.md)
diff --git a/docs/proposals/rescheduling.md b/docs/proposals/rescheduling.md
index b1bdb937082..59c30c6b1da 100644
--- a/docs/proposals/rescheduling.md
+++ b/docs/proposals/rescheduling.md
@@ -1,493 +1 @@
-# Controlled Rescheduling in Kubernetes
-
-## Overview
-
-Although the Kubernetes scheduler(s) try to make good placement decisions for pods,
-conditions in the cluster change over time (e.g. jobs finish and new pods arrive, nodes
-are removed due to failures or planned maintenance or auto-scaling down, nodes appear due
-to recovery after a failure or re-joining after maintenance or auto-scaling up or adding
-new hardware to a bare-metal cluster), and schedulers are not omniscient (e.g. 
there are
-some interactions between pods, or between pods and nodes, that they cannot predict). As
-a result, the initial node selected for a pod may turn out to be a bad match, from the
-perspective of the pod and/or the cluster as a whole, at some point after the pod has
-started running.
-
-Today (Kubernetes version 1.2), once a pod is scheduled to a node, it never moves unless
-it terminates on its own, is deleted by the user, or experiences some unplanned event
-(e.g. the node where it is running dies). Thus in a cluster with long-running pods, the
-assignment of pods to nodes degrades over time, no matter how good an initial scheduling
-decision the scheduler makes. This observation motivates "controlled rescheduling," a
-mechanism by which Kubernetes will "move" already-running pods over time to improve their
-placement. Controlled rescheduling is the subject of this proposal.
-
-Note that the term "move" is not technically accurate -- the mechanism used is that
-Kubernetes will terminate a pod that is managed by a controller, and the controller will
-create a replacement pod that is then scheduled by the pod's scheduler. The terminated
-pod and replacement pod are completely separate pods, and no pod migration is
-implied. However, describing the process as "moving" the pod is approximately accurate
-and easier to understand, so we will use this terminology in the document.
-
-We use the term "rescheduling" to describe any action the system takes to move an
-already-running pod. The decision may be made and executed by any component; we will
-introduce the concept of a "rescheduler" component later, but it is not the only
-component that can do rescheduling.
-
-This proposal primarily focuses on the architecture and features/mechanisms used to
-achieve rescheduling, and only briefly discusses example policies. We expect that community
-experimentation will lead to a significantly better understanding of the range, potential,
-and limitations of rescheduling policies.
-
-## Example use cases
-
-Example use cases for rescheduling are:
-
-* moving a running pod onto a node that better satisfies its scheduling criteria
-  * moving a pod onto an under-utilized node
-  * moving a pod onto a node that meets more of the pod's affinity/anti-affinity preferences
-* moving a running pod off of a node in anticipation of a known or speculated future event
-  * draining a node in preparation for maintenance, decommissioning, auto-scale-down, etc.
-  * "preempting" a running pod to make room for a pending pod to schedule
-  * proactively/speculatively make room for large and/or exclusive pods to facilitate
-    fast scheduling in the future (often called "defragmentation")
-  * (note that these last two cases are the only use cases where the first-order intent
-    is to move a pod specifically for the benefit of another pod)
-* moving a running pod off of a node from which it is receiving poor service
-  * anomalous crashlooping or other mysterious incompatibility between the pod and the node
-  * repeated out-of-resource killing (see #18724)
-  * repeated attempts by the scheduler to schedule the pod onto some node, but it is
-    rejected by Kubelet admission control due to incomplete scheduler knowledge
-  * poor performance due to interference from other containers on the node (CPU hogs,
-    cache thrashers, etc.) 
(note that in this case there is a choice of moving the victim
-    or the aggressor)
-
-## Some axes of the design space
-
-Among the key design decisions are
-
-* how does a pod specify its tolerance for these system-generated disruptions, and how
-  does the system enforce such disruption limits
-* for each use case, where is the decision made about when and which pods to reschedule
-  (controllers, schedulers, an entirely new component e.g. "rescheduler", etc.)
-* rescheduler design issues: how much does a rescheduler need to know about pods'
-  schedulers' policies, how does the rescheduler specify its rescheduling
-  requests/decisions (e.g. just as an eviction, an eviction with a hint about where to
-  reschedule, or as an eviction paired with a specific binding), how does the system
-  implement these requests, does the rescheduler take into account the second-order
-  effects of decisions (e.g. whether an evicted pod will reschedule, will cause
-  a preemption when it reschedules, etc.), does the rescheduler execute multi-step plans
-  (e.g. evict two pods at the same time with the intent of moving one into the space
-  vacated by the other, or even more complex plans)
-
-Additional musings on the rescheduling design space can be found [here](rescheduler.md).
-
-## Design proposal
-
-The key mechanisms and components of the proposed design are priority, preemption,
-disruption budgets, the `/evict` subresource, and the rescheduler.
-
-### Priority
-
-#### Motivation
-
-
-Just as it is useful to overcommit nodes to increase node-level utilization, it is useful
-to overcommit clusters to increase cluster-level utilization. Scheduling priority (which
-we abbreviate as *priority*), in combination with disruption budgets (described in the
-next section), allows Kubernetes to safely overcommit clusters much as QoS levels allow
-it to safely overcommit nodes.
-
-Today, cluster sharing among users, workload types, etc. is regulated via the
-[quota](../admin/resourcequota/README.md) mechanism. When allocating quota, a cluster
-administrator has two choices: (1) the sum of the quotas is less than or equal to the
-capacity of the cluster, or (2) the sum of the quotas is greater than the capacity of the
-cluster (that is, the cluster is overcommitted). (1) is likely to lead to cluster
-under-utilization, while (2) is unsafe in the sense that someone's pods may go pending
-indefinitely even though they are still within their quota. Priority makes cluster
-overcommitment (i.e. case (2)) safe by allowing users and/or administrators to identify
-which pods should be allowed to run, and which should go pending, when demand for cluster
-resources exceeds supply due to cluster overcommitment.
-
-Priority is also useful in some special-case scenarios, such as ensuring that system
-DaemonSets can always schedule and reschedule onto every node where they want to run
-(assuming they are given the highest priority), e.g. see #21767.
-
-#### Specifying priorities
-
-We propose to add a required `Priority` field to `PodSpec`. Its value type is string, and
-the cluster administrator defines a total ordering on these strings (for example
-`Critical`, `Normal`, `Preemptible`). We choose string instead of integer so that it is
-easy for an administrator to add new priority levels in between existing levels, to
-encourage thinking about priority in terms of user intent and avoid magic numbers, and to
-make the internal implementation more flexible.
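-
-For concreteness, a minimal sketch of the proposed addition (the comment and
-exact field name are illustrative, not settled by this proposal) is:
-
-```go
-// Sketch only: the proposed PodSpec addition. The set of valid values and
-// their total ordering (for example Critical > Normal > Preemptible) are
-// defined by the cluster administrator, not by the API itself.
-type PodSpec struct {
-	// ... existing fields ...
-
-	// Priority names the scheduling priority level for this pod.
-	Priority string `json:"priority"`
-}
-```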
-
-When a scheduler is scheduling a new pod P and cannot find any node that meets all of P's
-scheduling predicates, it is allowed to evict ("preempt") one or more pods that are at
-the same or a lower priority than P (subject to disruption budgets, see next section) from
-a node in order to make room for P, i.e. in order to make the scheduling predicates
-satisfied for P on that node. (Note that when we add cluster-level resources (#19080),
-it might be necessary to preempt from multiple nodes, but that scenario is outside the
-scope of this document.) The preempted pod(s) may or may not be able to reschedule. The
-net effect of this process is that when demand for cluster resources exceeds supply, the
-higher-priority pods will be able to run while the lower-priority pods will be forced to
-wait. The detailed mechanics of preemption are described in a later section.
-
-In addition to taking disruption budget into account, for equal-priority preemptions the
-scheduler will try to enforce fairness (across victim controllers, services, etc.).
-
-Priorities could be specified directly by users in the podTemplate, or assigned by an
-admission controller using
-properties of the pod. Either way, all schedulers must be configured to understand the
-same priorities (names and ordering). This could be done by making them constants in the
-API, or using ConfigMap to configure the schedulers with the information. The advantage of
-the former (at least making the names, if not the ordering, constants in the API) is that
-it allows the API server to do validation (e.g. to catch misspellings).
-
-In the future, which priorities are usable for a given namespace and pods with certain
-attributes may be configurable, similar to ResourceQuota, LimitRange, or security policy.
-
-Priority and resource QoS are independent.
-
-The priority we have described here might be used to prioritize the scheduling queue
-(i.e. the order in which a scheduler examines pods in its scheduling loop), but the two
-priority concepts do not have to be connected. It is somewhat logical to tie them
-together, since a higher priority generally indicates that a pod is more urgent to get
-running. Also, scheduling low-priority pods before high-priority pods might lead to
-avoidable preemptions if the high-priority pods end up preempting the low-priority pods
-that were just scheduled.
-
-TODO: Are priority and preemption global or namespace-relative? See
-[this discussion thread](https://github.com/kubernetes/kubernetes/pull/22217#discussion_r55737389).
-
-#### Relationship of priority to quota
-
-Of course, if the decision of what priority to give a pod is solely up to the user, then
-users have no incentive to ever request any priority less than the maximum. Thus
-priority is intimately related to quota, in the sense that resource quotas must be
-allocated on a per-priority-level basis (X amount of RAM at priority A, Y amount of RAM
-at priority B, etc.). The "guarantee" that highest-priority pods will always be able to
-schedule can only be achieved if the sum of the quotas at the top priority level is less
-than or equal to the cluster capacity. This is analogous to QoS, where safety can only be
-achieved if the sum of the limits of the top QoS level ("Guaranteed") is less than or
-equal to the node capacity. In terms of incentives, an organization could "charge"
-an amount proportional to the priority of the resources.
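-
-To illustrate the per-priority-level bookkeeping this implies (the names below
-are hypothetical, not proposed API), the safety condition for the top priority
-level can be sketched as:
-
-```go
-// PriorityName and quotaByPriority are illustrative types only.
-type PriorityName string
-
-// quotaByPriority records how much of a resource (say RAM, in bytes) one
-// namespace may request at each priority level.
-type quotaByPriority map[PriorityName]int64
-
-// topPrioritySafe reports whether the "guarantee" above can hold: the sum of
-// all namespaces' quotas at the top priority level must not exceed the
-// cluster capacity for that resource.
-func topPrioritySafe(quotas []quotaByPriority, top PriorityName, clusterCapacity int64) bool {
-	var sum int64
-	for _, q := range quotas {
-		sum += q[top]
-	}
-	return sum <= clusterCapacity
-}
-```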
-
-The topic of how to allocate quota at different priority levels to achieve a desired
-balance between utilization and probability of schedulability is an extremely complex
-topic that is outside the scope of this document. For example, resource fragmentation and
-RequiredDuringScheduling node and pod affinity and anti-affinity mean that even if the
-sum of the quotas at the top priority level is less than or equal to the total aggregate
-capacity of the cluster, some pods at the top priority level might still go pending. In
-general, priority provides a *probabilistic* guarantee of pod schedulability in the face
-of overcommitment, by allowing prioritization of which pods should be allowed to run
-when demand for cluster resources exceeds supply.
-
-### Disruption budget
-
-While priority can protect pods from one source of disruption (preemption by a
-lower-priority pod), *disruption budgets* limit disruptions from all Kubernetes-initiated
-causes, including preemption by an equal or higher-priority pod, or being evicted to
-achieve other rescheduling goals. In particular, each pod is optionally associated with a
-"disruption budget," a new API resource that limits Kubernetes-initiated terminations
-across a set of pods (e.g. the pods of a particular Service might all point to the same
-disruption budget object), regardless of cause. Initially we expect disruption budget
-(e.g. `DisruptionBudgetSpec`) to consist of:
-
-* a rate limit on disruptions (preemption and other evictions) across the corresponding
-  set of pods, e.g. no more than one disruption per hour across the pods of a particular Service
-* a minimum number of pods that must be up simultaneously (sometimes called "shard
-  strength") (of course this can also be expressed as the inverse, i.e. the number of
-  pods of the collection that can be down simultaneously)
-
-The second item merits a bit more explanation. One use case is to specify a quorum size,
-e.g. to ensure that at least 3 replicas of a quorum-based service with 5 replicas are up
-at the same time. In practice, a service should ideally create enough replicas to survive
-at least one planned and one unplanned outage. So in our quorum example, we would specify
-that at least 4 replicas must be up at the same time; this allows for one intentional
-disruption (bringing the number of live replicas down from 5 to 4 and consuming one unit
-of shard strength budget) and one unplanned disruption (bringing the number of live
-replicas down from 4 to 3) while still maintaining a quorum. Shard strength is also
-useful for simpler replicated services; for example, you might not want more than 10% of
-your front-ends to be down at the same time, so as to avoid overloading the remaining
-replicas.
-
-Initially, disruption budgets will be specified by the user. Thus as with priority,
-disruption budgets need to be tied into quota, to prevent users from saying none of their
-pods can ever be disrupted. The exact way of expressing and enforcing this quota is TBD,
-though a simple starting point would be to have an admission controller assign a default
-disruption budget based on priority level (more liberal with decreasing priority).
-We also likely need a quota that applies to Kubernetes *components*, to limit the rate
-at which any one component is allowed to consume disruption budget.
-
-Of course there should also be a `DisruptionBudgetStatus` that indicates the current
-disruption rate that the collection of pods is experiencing, and the number of pods that
-are up.
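-
-To make the two limits concrete, one possible (purely illustrative) shape for
-the spec and status is sketched below; field names are placeholders, not a
-finalized API:
-
-```go
-// DisruptionBudgetSpec limits Kubernetes-initiated terminations across the
-// selected set of pods. Field names and types are illustrative only.
-type DisruptionBudgetSpec struct {
-	// Selector identifies the pods (e.g. the pods of a Service) that share
-	// this budget.
-	Selector map[string]string `json:"selector"`
-	// MaxDisruptionsPerHour rate-limits preemptions and other evictions
-	// across the selected pods.
-	MaxDisruptionsPerHour int32 `json:"maxDisruptionsPerHour"`
-	// MinAvailable is the minimum number of selected pods that must be up
-	// simultaneously (the "shard strength").
-	MinAvailable int32 `json:"minAvailable"`
-}
-
-// DisruptionBudgetStatus reports what the selected pods are experiencing.
-type DisruptionBudgetStatus struct {
-	// CurrentAvailable is the number of selected pods currently up.
-	CurrentAvailable int32 `json:"currentAvailable"`
-	// DisruptionsInLastHour is the disruption rate currently observed.
-	DisruptionsInLastHour int32 `json:"disruptionsInLastHour"`
-}
-```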
- -For the purposes of disruption budget, a pod is considered to be disrupted as soon as its -graceful termination period starts. - -A pod that is not covered by a disruption budget but is managed by a controller, -gets an implicit disruption budget of infinite (though the system should try to not -unduly victimize such pods). How a pod that is not managed by a controller is -handled is TBD. - -TBD: In addition to `PodSpec`, where do we store pointer to disruption budget -(podTemplate in controller that managed the pod?)? Do we auto-generate a disruption -budget (e.g. when instantiating a Service), or require the user to create it manually -before they create a controller? Which objects should return the disruption budget object -as part of the output on `kubectl get` other than (obviously) `kubectl get` for the -disruption budget itself? - -TODO: Clean up distinction between "down due to voluntary action taken by Kubernetes" -and "down due to unplanned outage" in spec and status. - -For now, there is nothing to prevent clients from circumventing the disruption budget -protections. Of course, clients that do this are not being "good citizens." In the next -section we describe a mechanism that at least makes it easy for well-behaved clients to -obey the disruption budgets. - -See #12611 for additional discussion of disruption budgets. - -### /evict subresource and PreferAvoidPods - -Although we could put the responsibility for checking and updating disruption budgets -solely on the client, it is safer and more convenient if we implement that functionality -in the API server. Thus we will introduce a new `/evict` subresource on pod. It is similar to -today's "delete" on pod except - - * It will be rejected if the deletion would violate disruption budget. (See how - Deployment handles failure of /rollback for ideas on how clients could handle failure - of `/evict`.) There are two possible ways to implement this: - - * For the initial implementation, this will be accomplished by the API server just - looking at the `DisruptionBudgetStatus` and seeing if the disruption would violate the - `DisruptionBudgetSpec`. In this approach, we assume a disruption budget controller - keeps the `DisruptionBudgetStatus` up-to-date by observing all pod deletions and - creations in the cluster, so that an approved disruption is quickly reflected in the - `DisruptionBudgetStatus`. Of course this approach does allow a race in which one or - more additional disruptions could be approved before the first one is reflected in the - `DisruptionBudgetStatus`. - - * Thus a subsequent implementation will have the API server explicitly debit the - `DisruptionBudgetStatus` when it accepts an `/evict`. (There still needs to be a - controller, to keep the shard strength status up-to-date when replacement pods are - created after an eviction; the controller may also be necessary for the rate status - depending on how rate is represented, e.g. adding tokens to a bucket at a fixed rate.) - Once etcd support multi-object transactions (etcd v3), the debit and pod deletion will - be placed in the same transaction. - - * Note: For the purposes of disruption budget, a pod is considered to be disrupted as soon as its - graceful termination period starts (so when we say "delete" here we do not mean - "deleted from etcd" but rather "graceful termination period has started"). - - * It will allow clients to communicate additional parameters when they wish to delete a - pod. 
(In the absence of the `/evict` subresource, we would have to create a pod-specific - type analogous to `api.DeleteOptions`.) - -We will make `kubectl delete pod` use `/evict` by default, and require a command-line -flag to delete the pod directly. - -We will add to `NodeStatus` a bounded-sized list of signatures of pods that should avoid -that node (provisionally called `PreferAvoidPods`). One of the pieces of information -specified in the `/evict` subresource is whether the eviction should add the evicted -pod's signature to the corresponding node's `PreferAvoidPods`. Initially the pod -signature will be a -[controllerRef](https://github.com/kubernetes/kubernetes/issues/14961#issuecomment-183431648), -i.e. a reference to the pod's controller. Controllers are responsible for garbage -collecting, after some period of time, `PreferAvoidPods` entries that point to them, but the API -server will also enforce a bounded size on the list. All schedulers will have a -highest-weighted priority function that gives a node the worst priority if the pod it is -scheduling appears in that node's `PreferAvoidPods` list. Thus appearing in -`PreferAvoidPods` is similar to -[RequiredDuringScheduling node anti-affinity](../../docs/user-guide/node-selection/README.md) -but it takes precedence over all other priority criteria and is not explicitly listed in -the `NodeAffinity` of the pod. - -`PreferAvoidPods` is useful for the "moving a running pod off of a node from which it is -receiving poor service" use case, as it reduces the chance that the replacement pod will -end up on the same node (keep in mind that most of those cases are situations that the -scheduler does not have explicit priority functions for, for example it cannot know in -advance that a pod will be starved). Also, though we do not intend to implement any such -policies in the first version of the rescheduler, it is useful whenever the rescheduler evicts -two pods A and B with the intention of moving A into the space vacated by B (it prevents -B from rescheduling back into the space it vacated before A's scheduler has a chance to -reschedule A there). Note that these two uses are subtly different; in the first -case we want the avoidance to last a relatively long time, whereas in the second case we -may only need it to last until A schedules. - -See #20699 for more discussion. - -### Preemption mechanics - -**NOTE: We expect a fuller design doc to be written on preemption before it is implemented. -However, a sketch of some ideas are presented here, since preemption is closely related to the -concepts discussed in this doc.** - -Pod schedulers will decide and enact preemptions, subject to the priority and disruption -budget rules described earlier. (Though note that we currently do not have any mechanism -to prevent schedulers from bypassing either the priority or disruption budget rules.) -The scheduler does not concern itself with whether the evicted pod(s) can reschedule. The -eviction(s) use(s) the `/evict` subresource so that it is subject to the disruption -budget(s) of the victim(s), but it does not request to add the victim pod(s) to the -nodes' `PreferAvoidPods`. - -Evicting victim(s) and binding the pending pod that the evictions are intended to enable -to schedule, are not transactional. 
We expect the scheduler to issue the operations in -sequence, but it is still possible that another scheduler could schedule its pod in -between the eviction(s) and the binding, or that the set of pods running on the node in -question changed between the time the scheduler made its decision and the time it sent -the operations to the API server thereby causing the eviction(s) to be not sufficient to get the -pending pod to schedule. In general there are a number of race conditions that cannot be -avoided without (1) making the evictions and binding be part of a single transaction, and -(2) making the binding preconditioned on a version number that is associated with the -node and is incremented on every binding. We may or may not implement those mechanisms in -the future. - -Given a choice between a node where scheduling a pod requires preemption and one where it -does not, all other things being equal, a scheduler should choose the one where -preemption is not required. (TBD: Also, if the selected node does require preemption, the -scheduler should preempt lower-priority pods before higher-priority pods (e.g. if the -scheduler needs to free up 4 GB of RAM, and the node has two 2 GB low-priority pods and -one 4 GB high-priority pod, all of which have sufficient disruption budget, it should -preempt the two low-priority pods). This is debatable, since all have sufficient -disruption budget. But still better to err on the side of giving better disruption SLO to -higher-priority pods when possible?) - -Preemption victims must be given their termination grace period. One possible sequence -of events is - -1. The API server binds the preemptor to the node (i.e. sets `nodeName` on the -preempting pod) and sets `deletionTimestamp` on the victims -2. Kubelet sees that `deletionTimestamp` has been set on the victims; they enter their -graceful termination period -3. Kubelet sees the preempting pod. It runs the admission checks on the new pod -assuming all pods that are in their graceful termination period are gone and that -all pods that are in the waiting state (see (4)) are running. -4. If (3) fails, then the new pod is rejected. If (3) passes, then Kubelet holds the -new pod in a waiting state, and does not run it until the pod passes passes the -admission checks using the set of actually running pods. - -Note that there are a lot of details to be figured out here; above is just a very -hand-wavy sketch of one general approach that might work. - -See #22212 for additional discussion. - -### Node drain - -Node drain will be handled by one or more components not described in this document. They -will respect disruption budgets. Initially, we will just make `kubectl drain` -respect disruption budgets. See #17393 for other discussion. - -### Rescheduler - -All rescheduling other than preemption and node drain will be decided and enacted by a -new component called the *rescheduler*. It runs continuously in the background, looking -for opportunities to move pods to better locations. It acts when the degree of -improvement meets some threshold and is allowed by the pod's disruption budget. The -action is eviction of a pod using the `/evict` subresource, with the pod's signature -enqueued in the node's `PreferAvoidPods`. It does not force the pod to reschedule to any -particular node. Thus it is really an "unscheduler"; only in combination with the evicted -pod's scheduler, which schedules the replacement pod, do we get true "rescheduling." 
-See the "Example use cases" section earlier for examples.
-
-The rescheduler is a best-effort service that makes no guarantees about how quickly (or
-whether) it will resolve a suboptimal pod placement.
-
-The first version of the rescheduler will not take into consideration where or whether an
-evicted pod will reschedule. The evicted pod may go pending, consuming one unit of the
-corresponding disruption budget (i.e. reducing its shard strength by one) indefinitely. By
-using the `/evict` subresource, the rescheduler ensures that there is sufficient budget for the
-evicted pod to go and stay pending. We expect that future versions of the rescheduler may be
-linked with the "mandatory" predicate functions (currently, the ones that constitute the
-Kubelet admission criteria), and will only evict if the rescheduler determines that the
-pod can reschedule somewhere according to those criteria. (Note that this still does not
-guarantee that the pod actually will be able to reschedule, for at least two reasons: (1)
-the state of the cluster may change between the time the rescheduler evaluates it and
-when the evicted pod's scheduler tries to schedule the replacement pod, and (2) the
-evicted pod's scheduler may have additional predicate functions in addition to the
-mandatory ones).
-
-(Note: see [this comment](https://github.com/kubernetes/kubernetes/pull/22217#discussion_r54527968)).
-
-The first version of the rescheduler will only implement two objectives: moving a pod
-onto an under-utilized node, and moving a pod onto a node that meets more of the pod's
-affinity/anti-affinity preferences than wherever it is currently running. (We assume that
-nodes that are intentionally under-utilized, e.g. because they are being drained, are
-marked unschedulable, thus the first objective will not cause the rescheduler to "fight"
-a system that is draining nodes.) We assume that all schedulers sufficiently weight the
-priority functions for affinity/anti-affinity and avoiding very packed nodes,
-otherwise evicted pods may not actually move onto a node that is better according to
-the criteria that caused them to be evicted. (But note that in all cases an evicted pod will
-move to a node that is better according to the totality of its scheduler's priority functions,
-except when the node where it was already running is the only node
-where it can run.) As a general rule, the rescheduler should only act when it sees
-particularly bad situations, since (1) an eviction for a marginal improvement is likely
-not worth the disruption--just because there is sufficient budget for an eviction doesn't
-mean an eviction is painless to the application, and (2) rescheduling the pod might not
-actually mitigate the identified problem if it is minor enough that other scheduling
-factors dominate the decision of where the replacement pod is scheduled.
-
-We assume schedulers' priority functions are at least vaguely aligned with the
-rescheduler's policies; otherwise the rescheduler will never accomplish anything useful,
-given that it relies on the schedulers to actually reschedule the evicted pods. (Even if
-the rescheduler acted as a scheduler, explicitly rebinding evicted pods, we'd still want
-this to be true, to prevent the schedulers and rescheduler from "fighting" one another.)
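-
-To make the eviction flow described above more concrete, here is a minimal Go sketch of a
-single rescheduler pass. Everything in it (the `Evictor` interface, the scoring functions,
-the threshold value) is hypothetical and hedged; the real component would rely on the
-schedulers' priority functions and the server-side `/evict` subresource rather than these
-placeholders.
-
-```go
-// Hypothetical sketch only; none of these names are existing Kubernetes APIs.
-type Pod struct{ Name, Node string }
-
-// Evictor stands in for "use the /evict subresource, requesting that the pod's
-// signature be added to the node's PreferAvoidPods".
-type Evictor interface {
-	Evict(pod Pod, addToPreferAvoidPods bool) error
-}
-
-const improvementThreshold = 0.25 // the "aggressiveness" knob discussed below
-
-// reschedulePass evicts at most one pod whose placement could be improved by
-// more than the threshold; the pod's own scheduler then places the replacement
-// ("unscheduler" behavior).
-func reschedulePass(pods []Pod, score func(Pod, string) float64,
-	bestOtherNode func(Pod) string, e Evictor) {
-	for _, p := range pods {
-		current := score(p, p.Node)
-		best := score(p, bestOtherNode(p))
-		if best-current > improvementThreshold {
-			// Disruption-budget enforcement is delegated to the /evict
-			// subresource on the API server side.
-			_ = e.Evict(p, true)
-			return // act conservatively: one eviction per pass
-		}
-	}
-}
-```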
- -The rescheduler will be configured using ConfigMap; the cluster administrator can enable -or disable policies and can tune the rescheduler's aggressiveness (aggressive means it -will use a relatively low threshold for triggering an eviction and may consume a lot of -disruption budget, while non-aggressive means it will use a relatively high threshold for -triggering an eviction and will try to leave plenty of buffer in disruption budgets). The -first version of the rescheduler will not be extensible or pluggable, since we want to -keep the code simple while we gain experience with the overall concept. In the future, we -anticipate a version that will be extensible and pluggable. - -We might want some way to force the evicted pod to the front of the scheduler queue, -independently of its priority. - -See #12140 for additional discussion. - -### Final comments - -In general, the design space for this topic is huge. This document describes some of the -design considerations and proposes one particular initial implementation. We expect -certain aspects of the design to be "permanent" (e.g. the notion and use of priorities, -preemption, disruption budgets, and the `/evict` subresource) while others may change over time -(e.g. the partitioning of functionality between schedulers, controllers, rescheduler, -horizontal pod autoscaler, and cluster autoscaler; the policies the rescheduler implements; -the factors the rescheduler takes into account when making decisions (e.g. knowledge of -schedulers' predicate and priority functions, second-order effects like whether and where -evicted pod will be able to reschedule, etc.); the way the rescheduler enacts its -decisions; and the complexity of the plans the rescheduler attempts to implement). - -## Implementation plan - -The highest-priority feature to implement is the rescheduler with the two use cases -highlighted earlier: moving a pod onto an under-utilized node, and moving a pod onto a -node that meets more of the pod's affinity/anti-affinity preferences. The former is -useful to rebalance pods after cluster auto-scale-up, and the latter is useful for -Ubernetes. This requires implementing disruption budgets and the `/evict` subresource, -but not priority or preemption. - -Because the general topic of rescheduling is very speculative, we have intentionally -proposed that the first version of the rescheduler be very simple -- only uses eviction -(no attempt to guide replacement pod to any particular node), doesn't know schedulers' -predicate or priority functions, doesn't try to move two pods at the same time, and only -implements two use cases. As alluded to in the previous subsection, we expect the design -and implementation to evolve over time, and we encourage members of the community to -experiment with more sophisticated policies and to report their results from using them -on real workloads. - -## Alternative implementations - -TODO. - -## Additional references - -TODO. 
-
-TODO: Add reference to this doc from docs/proposals/rescheduler.md
-
-
-
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/rescheduling.md?pixel)]()
-
+This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/rescheduling.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/rescheduling.md)
diff --git a/docs/proposals/resource-metrics-api.md b/docs/proposals/resource-metrics-api.md
index fee416e053a..8f63fb74d59 100644
--- a/docs/proposals/resource-metrics-api.md
+++ b/docs/proposals/resource-metrics-api.md
@@ -1,151 +1 @@
-# Resource Metrics API
-
-This document describes the API part of the MVP version of the Resource Metrics API effort in Kubernetes.
-Once agreement is reached, the document will be extended to also cover implementation details.
-The shape of the effort may also change once we have more well-defined use cases.
-
-## Goal
-
-The goal of the effort is to provide resource usage metrics for pods and nodes through the API server.
-This will be a stable, versioned API which core Kubernetes components can rely on.
-In the first version only the well-defined use cases will be handled,
-although the API should be easily extensible for potential future use cases.
-
-## Main use cases
-
-This section describes the well-defined use cases which should be handled in the first version.
-Use cases which are not listed below are out of the scope of the MVP version of the Resource Metrics API.
-
-#### Horizontal Pod Autoscaler
-
-HPA uses the latest value of CPU usage as an average aggregated across 1 minute
-(the window may change in the future). The data for a given set of pods
-(defined either by a pod list or a label selector) should be accessible in one request
-for performance reasons.
-
-#### Scheduler
-
-In order to schedule best-effort pods, the scheduler requires node-level resource usage metrics
-as an average aggregated across 1 minute (the window may change in the future).
-The metrics should be available for all resources supported in the scheduler.
-Currently the scheduler does not need this information, because it schedules best-effort pods
-without considering node usage. But having the metrics available in the API server is a blocker
-for adding the ability to take node usage into account when scheduling best-effort pods.
-
-## Other considered use cases
-
-This section describes the other considered use cases and explains why they are out
-of the scope of the MVP version.
-
-#### Custom metrics in HPA
-
-HPA requires the latest value of application-level metrics.
-
-The design of the pipeline for collecting application-level metrics needs to
-be revisited, and it is not yet clear whether application-level metrics should be
-available in the API server, so this use case will not be supported initially.
-
-#### Cluster Federation
-
-The Cluster Federation control system might want to consider cluster-level usage (in addition to cluster-level request)
-of running pods when choosing where to schedule new pods. Although
-Cluster Federation is still in design,
-we expect the metrics API described here to be sufficient. Cluster-level usage can be
-obtained by summing over the usage of all nodes in the cluster.
-
-#### kubectl top
-
-This feature is not yet specified/implemented, although it seems reasonable to provide users with information
-about resource usage at the pod/node level.
- -Since this feature has not been fully specified yet it will be not supported initially in the API although -it will be probably possible to provide a reasonable implementation of the feature anyway. - -#### Kubernetes dashboard - -[Kubernetes dashboard](https://github.com/kubernetes/dashboard) in order to draw graphs requires resource usage -in timeseries format from relatively long period of time. The aggregations should be also possible on various levels -including replication controllers, deployments, services, etc. - -Since the use case is complicated it will not be supported initially in the API and they will query Heapster -directly using some custom API there. - -## Proposed API - -Initially the metrics API will be in a separate [API group](api-group.md) called ```metrics```. -Later if we decided to have Node and Pod in different API groups also -NodeMetrics and PodMetrics should be in different API groups. - -#### Schema - -The proposed schema is as follow. Each top-level object has `TypeMeta` and `ObjectMeta` fields -to be compatible with Kubernetes API standards. - -```go -type NodeMetrics struct { - unversioned.TypeMeta - ObjectMeta - - // The following fields define time interval from which metrics were - // collected in the following format [Timestamp-Window, Timestamp]. - Timestamp unversioned.Time - Window unversioned.Duration - - // The memory usage is the memory working set. - Usage v1.ResourceList -} - -type PodMetrics struct { - unversioned.TypeMeta - ObjectMeta - - // The following fields define time interval from which metrics were - // collected in the following format [Timestamp-Window, Timestamp]. - Timestamp unversioned.Time - Window unversioned.Duration - - // Metrics for all containers are collected within the same time window. - Containers []ContainerMetrics -} - -type ContainerMetrics struct { - // Container name corresponding to the one from v1.Pod.Spec.Containers. - Name string - // The memory usage is the memory working set. - Usage v1.ResourceList -} -``` - -By default `Usage` is the mean from samples collected within the returned time window. -The default time window is 1 minute. - -#### Endpoints - -All endpoints are GET endpoints, rooted at `/apis/metrics/v1alpha1/`. -There won't be support for the other REST methods. - -The list of supported endpoints: -- `/nodes` - all node metrics; type `[]NodeMetrics` -- `/nodes/{node}` - metrics for a specified node; type `NodeMetrics` -- `/namespaces/{namespace}/pods` - all pod metrics within namespace with support for `all-namespaces`; type `[]PodMetrics` -- `/namespaces/{namespace}/pods/{pod}` - metrics for a specified pod; type `PodMetrics` - -The following query parameters are supported: -- `labelSelector` - restrict the list of returned objects by labels (list endpoints only) - -In the future we may want to introduce the following params: -`aggregator` (`max`, `min`, `95th`, etc.) and `window` (`1h`, `1d`, `1w`, etc.) -which will allow to get the other aggregates over the custom time window. 
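-
-To make the endpoint shapes above concrete, here is a hedged Go sketch of a client reading
-node metrics. It assumes the API server is reachable through `kubectl proxy` on
-`127.0.0.1:8001`, and it uses a trimmed-down local struct rather than the full schema above;
-both are assumptions for illustration only.
-
-```go
-import (
-	"encoding/json"
-	"fmt"
-	"net/http"
-	"net/url"
-	"time"
-)
-
-// nodeMetrics is a simplified stand-in for the NodeMetrics type defined above.
-type nodeMetrics struct {
-	Timestamp time.Time         `json:"timestamp"`
-	Window    string            `json:"window"`
-	Usage     map[string]string `json:"usage"` // e.g. "cpu": "250m", "memory": "1Gi"
-}
-
-// listNodeMetrics GETs /apis/metrics/v1alpha1/nodes, optionally restricted by
-// a labelSelector, and decodes the returned []NodeMetrics.
-func listNodeMetrics(labelSelector string) ([]nodeMetrics, error) {
-	u := "http://127.0.0.1:8001/apis/metrics/v1alpha1/nodes"
-	if labelSelector != "" {
-		u += "?labelSelector=" + url.QueryEscape(labelSelector)
-	}
-	resp, err := http.Get(u)
-	if err != nil {
-		return nil, err
-	}
-	defer resp.Body.Close()
-
-	var metrics []nodeMetrics
-	if err := json.NewDecoder(resp.Body).Decode(&metrics); err != nil {
-		return nil, err
-	}
-	for _, m := range metrics {
-		fmt.Printf("window=%s usage=%v\n", m.Window, m.Usage)
-	}
-	return metrics, nil
-}
-```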
- -## Further improvements - -Depending on the further requirements the following features may be added: -- support for more metrics -- support for application level metrics -- watch for metrics -- possibility to query for window sizes and aggregation functions (though single window size/aggregation function per request) -- cluster level metrics - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/resource-metrics-api.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-metrics-api.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-metrics-api.md) diff --git a/docs/proposals/resource-quota-scoping.md b/docs/proposals/resource-quota-scoping.md index ac977d4eb11..e1a9cdd7b71 100644 --- a/docs/proposals/resource-quota-scoping.md +++ b/docs/proposals/resource-quota-scoping.md @@ -1,333 +1 @@ -# Resource Quota - Scoping resources - -## Problem Description - -### Ability to limit compute requests and limits - -The existing `ResourceQuota` API object constrains the total amount of compute -resource requests. This is useful when a cluster-admin is interested in -controlling explicit resource guarantees such that there would be a relatively -strong guarantee that pods created by users who stay within their quota will find -enough free resources in the cluster to be able to schedule. The end-user creating -the pod is expected to have intimate knowledge on their minimum required resource -as well as their potential limits. - -There are many environments where a cluster-admin does not extend this level -of trust to their end-user because user's often request too much resource, and -they have trouble reasoning about what they hope to have available for their -application versus what their application actually needs. In these environments, -the cluster-admin will often just expose a single value (the limit) to the end-user. -Internally, they may choose a variety of other strategies for setting the request. -For example, some cluster operators are focused on satisfying a particular over-commit -ratio and may choose to set the request as a factor of the limit to control for -over-commit. Other cluster operators may defer to a resource estimation tool that -sets the request based on known historical trends. In this environment, the -cluster-admin is interested in exposing a quota to their end-users that maps -to their desired limit instead of their request since that is the value the user -manages. - -### Ability to limit impact to node and promote fair-use - -The current `ResourceQuota` API object does not allow the ability -to quota best-effort pods separately from pods with resource guarantees. -For example, if a cluster-admin applies a quota that caps requested -cpu at 10 cores and memory at 10Gi, all pods in the namespace must -make an explicit resource request for cpu and memory to satisfy -quota. This prevents a namespace with a quota from supporting best-effort -pods. - -In practice, the cluster-admin wants to control the impact of best-effort -pods to the cluster, but not restrict the ability to run best-effort pods -altogether. - -As a result, the cluster-admin requires the ability to control the -max number of active best-effort pods. In addition, the cluster-admin -requires the ability to scope a quota that limits compute resources to -exclude best-effort pods. - -### Ability to quota long-running vs. 
bounded-duration compute resources - -The cluster-admin may want to quota end-users separately -based on long-running vs. bounded-duration compute resources. - -For example, a cluster-admin may offer more compute resources -for long running pods that are expected to have a more permanent residence -on the node than bounded-duration pods. Many batch style workloads -tend to consume as much resource as they can until something else applies -the brakes. As a result, these workloads tend to operate at their limit, -while many traditional web applications may often consume closer to their -request if there is no active traffic. An operator that wants to control -density will offer lower quota limits for batch workloads than web applications. - -A classic example is a PaaS deployment where the cluster-admin may -allow a separate budget for pods that run their web application vs. pods that -build web applications. - -Another example is providing more quota to a database pod than a -pod that performs a database migration. - -## Use Cases - -* As a cluster-admin, I want the ability to quota - * compute resource requests - * compute resource limits - * compute resources for terminating vs. non-terminating workloads - * compute resources for best-effort vs. non-best-effort pods - -## Proposed Change - -### New quota tracked resources - -Support the following resources that can be tracked by quota. - -| Resource Name | Description | -| ------------- | ----------- | -| cpu | total cpu requests (backwards compatibility) | -| memory | total memory requests (backwards compatibility) | -| requests.cpu | total cpu requests | -| requests.memory | total memory requests | -| limits.cpu | total cpu limits | -| limits.memory | total memory limits | - -### Resource Quota Scopes - -Add the ability to associate a set of `scopes` to a quota. - -A quota will only measure usage for a `resource` if it matches -the intersection of enumerated `scopes`. - -Adding a `scope` to a quota limits the number of resources -it supports to those that pertain to the `scope`. Specifying -a resource on the quota object outside of the allowed set -would result in a validation error. - -| Scope | Description | -| ----- | ----------- | -| Terminating | Match `kind=Pod` where `spec.activeDeadlineSeconds >= 0` | -| NotTerminating | Match `kind=Pod` where `spec.activeDeadlineSeconds = nil` | -| BestEffort | Match `kind=Pod` where `status.qualityOfService in (BestEffort)` | -| NotBestEffort | Match `kind=Pod` where `status.qualityOfService not in (BestEffort)` | - -A `BestEffort` scope restricts a quota to tracking the following resources: - -* pod - -A `Terminating`, `NotTerminating`, `NotBestEffort` scope restricts a quota to -tracking the following resources: - -* pod -* memory, requests.memory, limits.memory -* cpu, requests.cpu, limits.cpu - -## Data Model Impact - -``` -// The following identify resource constants for Kubernetes object types -const ( - // CPU request, in cores. (500m = .5 cores) - ResourceRequestsCPU ResourceName = "requests.cpu" - // Memory request, in bytes. (500Gi = 500GiB = 500 * 1024 * 1024 * 1024) - ResourceRequestsMemory ResourceName = "requests.memory" - // CPU limit, in cores. (500m = .5 cores) - ResourceLimitsCPU ResourceName = "limits.cpu" - // Memory limit, in bytes. 
(500Gi = 500GiB = 500 * 1024 * 1024 * 1024) - ResourceLimitsMemory ResourceName = "limits.memory" -) - -// A scope is a filter that matches an object -type ResourceQuotaScope string -const ( - ResourceQuotaScopeTerminating ResourceQuotaScope = "Terminating" - ResourceQuotaScopeNotTerminating ResourceQuotaScope = "NotTerminating" - ResourceQuotaScopeBestEffort ResourceQuotaScope = "BestEffort" - ResourceQuotaScopeNotBestEffort ResourceQuotaScope = "NotBestEffort" -) - -// ResourceQuotaSpec defines the desired hard limits to enforce for Quota -// The quota matches by default on all objects in its namespace. -// The quota can optionally match objects that satisfy a set of scopes. -type ResourceQuotaSpec struct { - // Hard is the set of desired hard limits for each named resource - Hard ResourceList `json:"hard,omitempty"` - // A collection of filters that must match each object tracked by a quota. - // If not specified, the quota matches all objects. - Scopes []ResourceQuotaScope `json:"scopes,omitempty"` -} -``` - -## Rest API Impact - -None. - -## Security Impact - -None. - -## End User Impact - -The `kubectl` commands that render quota should display its scopes. - -## Performance Impact - -This feature will make having more quota objects in a namespace -more common in certain clusters. This impacts the number of quota -objects that need to be incremented during creation of an object -in admission control. It impacts the number of quota objects -that need to be updated during controller loops. - -## Developer Impact - -None. - -## Alternatives - -This proposal initially enumerated a solution that leveraged a -`FieldSelector` on a `ResourceQuota` object. A `FieldSelector` -grouped an `APIVersion` and `Kind` with a selector over its -fields that supported set-based requirements. It would have allowed -a quota to track objects based on cluster defined attributes. - -For example, a quota could do the following: - -* match `Kind=Pod` where `spec.restartPolicy in (Always)` -* match `Kind=Pod` where `spec.restartPolicy in (Never, OnFailure)` -* match `Kind=Pod` where `status.qualityOfService in (BestEffort)` -* match `Kind=Service` where `spec.type in (LoadBalancer)` - * see [#17484](https://github.com/kubernetes/kubernetes/issues/17484) - -Theoretically, it would enable support for fine-grained tracking -on a variety of resource types. While extremely flexible, there -are cons to to this approach that make it premature to pursue -at this time. - -* Generic field selectors are not yet settled art - * see [#1362](https://github.com/kubernetes/kubernetes/issues/1362) - * see [#19084](https://github.com/kubernetes/kubernetes/pull/19804) -* Discovery API Limitations - * Not possible to discover the set of field selectors supported by kind. - * Not possible to discover if a field is readonly, readwrite, or immutable - post-creation. - -The quota system would want to validate that a field selector is valid, -and it would only want to select on those fields that are readonly/immutable -post creation to make resource tracking work during update operations. - -The current proposal could grow to support a `FieldSelector` on a -`ResourceQuotaSpec` and support a simple migration path to convert -`scopes` to the matching `FieldSelector` once the project has identified -how it wants to handle `fieldSelector` requirements longer term. - -This proposal previously discussed a solution that leveraged a -`LabelSelector` as a mechanism to partition quota. 
This is potentially -interesting to explore in the future to allow `namespace-admins` to -quota workloads based on local knowledge. For example, a quota -could match all kinds that match the selector -`tier=cache, environment in (dev, qa)` separately from quota that -matched `tier=cache, environment in (prod)`. This is interesting to -explore in the future, but labels are insufficient selection targets -for `cluster-administrators` to control footprint. In those instances, -you need fields that are cluster controlled and not user-defined. - -## Example - -### Scenario 1 - -The cluster-admin wants to restrict the following: - -* limit 2 best-effort pods -* limit 2 terminating pods that can not use more than 1Gi of memory, and 2 cpu cores -* limit 4 long-running pods that can not use more than 4Gi of memory, and 4 cpu cores -* limit 6 pods in total, 10 replication controllers - -This would require the following quotas to be added to the namespace: - -``` -$ cat quota-best-effort -apiVersion: v1 -kind: ResourceQuota -metadata: - name: quota-best-effort -spec: - hard: - pods: "2" - scopes: - - BestEffort - -$ cat quota-terminating -apiVersion: v1 -kind: ResourceQuota -metadata: - name: quota-terminating -spec: - hard: - pods: "2" - memory.limit: 1Gi - cpu.limit: 2 - scopes: - - Terminating - - NotBestEffort - -$ cat quota-longrunning -apiVersion: v1 -kind: ResourceQuota -metadata: - name: quota-longrunning -spec: - hard: - pods: "2" - memory.limit: 4Gi - cpu.limit: 4 - scopes: - - NotTerminating - - NotBestEffort - -$ cat quota -apiVersion: v1 -kind: ResourceQuota -metadata: - name: quota -spec: - hard: - pods: "6" - replicationcontrollers: "10" -``` - -In the above scenario, every pod creation will result in its usage being -tracked by `quota` since it has no additional scoping. The pod will then -be tracked by at 1 additional quota object based on the scope it -matches. In order for the pod creation to succeed, it must not violate -the constraint of any matching quota. So for example, a best-effort pod -would only be created if there was available quota in `quota-best-effort` -and `quota`. - -## Implementation - -### Assignee - -@derekwaynecarr - -### Work Items - -* Add support for requests and limits -* Add support for scopes in quota-related admission and controller code - -## Dependencies - -None. - -Longer term, we should evaluate what we want to do with `fieldSelector` as -the requests around different quota semantics will continue to grow. - -## Testing - -Appropriate unit and e2e testing will be authored. - -## Documentation Impact - -Existing resource quota documentation and examples will be updated. - - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/resource-quota-scoping.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-quota-scoping.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-quota-scoping.md) diff --git a/docs/proposals/runtime-client-server.md b/docs/proposals/runtime-client-server.md index 16cc677c970..3059176ac38 100644 --- a/docs/proposals/runtime-client-server.md +++ b/docs/proposals/runtime-client-server.md @@ -1,206 +1 @@ -# Client/Server container runtime - -## Abstract - -A proposal of client/server implementation of kubelet container runtime interface. - -## Motivation - -Currently, any container runtime has to be linked into the kubelet. 
This makes -experimentation difficult, and prevents users from landing an alternate -container runtime without landing code in core kubernetes. - -To facilitate experimentation and to enable user choice, this proposal adds a -client/server implementation of the [new container runtime interface](https://github.com/kubernetes/kubernetes/pull/25899). The main goal -of this proposal is: - -- make it easy to integrate new container runtimes -- improve code maintainability - -## Proposed design - -**Design of client/server container runtime** - -The main idea of client/server container runtime is to keep main control logic in kubelet while letting remote runtime only do dedicated actions. An alpha [container runtime API](../../pkg/kubelet/api/v1alpha1/runtime/api.proto) is introduced for integrating new container runtimes. The API is based on [protobuf](https://developers.google.com/protocol-buffers/) and [gRPC](http://www.grpc.io) for a number of benefits: - -- Perform faster than json -- Get client bindings for free: gRPC supports ten languages -- No encoding/decoding codes needed -- Manage api interfaces easily: server and client interfaces are generated automatically - -A new container runtime manager `KubeletGenericRuntimeManager` will be introduced to kubelet, which will - -- conforms to kubelet's [Runtime](../../pkg/kubelet/container/runtime.go#L58) interface -- manage Pods and Containers lifecycle according to kubelet policies -- call remote runtime's API to perform specific pod, container or image operations - -A simple workflow of invoking remote runtime API on starting a Pod with two containers can be shown: - -``` -Kubelet KubeletGenericRuntimeManager RemoteRuntime - + + + - | | | - +---------SyncPod------------->+ | - | | | - | +---- Create PodSandbox ------->+ - | +<------------------------------+ - | | | - | XXXXXXXXXXXX | - | | X | - | | NetworkPlugin. | - | | SetupPod | - | | X | - | XXXXXXXXXXXX | - | | | - | +<------------------------------+ - | +---- Pull image1 -------->+ - | +<------------------------------+ - | +---- Create container1 ------->+ - | +<------------------------------+ - | +---- Start container1 -------->+ - | +<------------------------------+ - | | | - | +<------------------------------+ - | +---- Pull image2 -------->+ - | +<------------------------------+ - | +---- Create container2 ------->+ - | +<------------------------------+ - | +---- Start container2 -------->+ - | +<------------------------------+ - | | | - | <-------Success--------------+ | - | | | - + + + -``` - -And deleting a pod can be shown: - -``` -Kubelet KubeletGenericRuntimeManager RemoteRuntime - + + + - | | | - +---------SyncPod------------->+ | - | | | - | +---- Stop container1 ----->+ - | +<------------------------------+ - | +---- Delete container1 ----->+ - | +<------------------------------+ - | | | - | +---- Stop container2 ------>+ - | +<------------------------------+ - | +---- Delete container2 ------>+ - | +<------------------------------+ - | | | - | XXXXXXXXXXXX | - | | X | - | | NetworkPlugin. | - | | TeardownPod | - | | X | - | XXXXXXXXXXXX | - | | | - | | | - | +---- Delete PodSandbox ------>+ - | +<------------------------------+ - | | | - | <-------Success--------------+ | - | | | - + + + -``` - -**API definition** - -Since we are going to introduce more image formats and want to separate image management from containers and pods, this proposal introduces two services `RuntimeService` and `ImageService`. 
Both services are defined at [pkg/kubelet/api/v1alpha1/runtime/api.proto](../../pkg/kubelet/api/v1alpha1/runtime/api.proto): - -```proto -// Runtime service defines the public APIs for remote container runtimes -service RuntimeService { - // Version returns the runtime name, runtime version and runtime API version - rpc Version(VersionRequest) returns (VersionResponse) {} - - // CreatePodSandbox creates a pod-level sandbox. - // The definition of PodSandbox is at https://github.com/kubernetes/kubernetes/pull/25899 - rpc CreatePodSandbox(CreatePodSandboxRequest) returns (CreatePodSandboxResponse) {} - // StopPodSandbox stops the sandbox. If there are any running containers in the - // sandbox, they should be force terminated. - rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse) {} - // DeletePodSandbox deletes the sandbox. If there are any running containers in the - // sandbox, they should be force deleted. - rpc DeletePodSandbox(DeletePodSandboxRequest) returns (DeletePodSandboxResponse) {} - // PodSandboxStatus returns the Status of the PodSandbox. - rpc PodSandboxStatus(PodSandboxStatusRequest) returns (PodSandboxStatusResponse) {} - // ListPodSandbox returns a list of SandBox. - rpc ListPodSandbox(ListPodSandboxRequest) returns (ListPodSandboxResponse) {} - - // CreateContainer creates a new container in specified PodSandbox - rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse) {} - // StartContainer starts the container. - rpc StartContainer(StartContainerRequest) returns (StartContainerResponse) {} - // StopContainer stops a running container with a grace period (i.e., timeout). - rpc StopContainer(StopContainerRequest) returns (StopContainerResponse) {} - // RemoveContainer removes the container. If the container is running, the container - // should be force removed. - rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse) {} - // ListContainers lists all containers by filters. - rpc ListContainers(ListContainersRequest) returns (ListContainersResponse) {} - // ContainerStatus returns status of the container. - rpc ContainerStatus(ContainerStatusRequest) returns (ContainerStatusResponse) {} - - // Exec executes the command in the container. - rpc Exec(stream ExecRequest) returns (stream ExecResponse) {} -} - -// Image service defines the public APIs for managing images -service ImageService { - // ListImages lists existing images. - rpc ListImages(ListImagesRequest) returns (ListImagesResponse) {} - // ImageStatus returns the status of the image. - rpc ImageStatus(ImageStatusRequest) returns (ImageStatusResponse) {} - // PullImage pulls a image with authentication config. - rpc PullImage(PullImageRequest) returns (PullImageResponse) {} - // RemoveImage removes the image. - rpc RemoveImage(RemoveImageRequest) returns (RemoveImageResponse) {} -} -``` - -Note that some types in [pkg/kubelet/api/v1alpha1/runtime/api.proto](../../pkg/kubelet/api/v1alpha1/runtime/api.proto) are already defined at [Container runtime interface/integration](https://github.com/kubernetes/kubernetes/pull/25899). -We should decide how to integrate the types in [#25899](https://github.com/kubernetes/kubernetes/pull/25899) with gRPC services: - -* Auto-generate those types into protobuf by [go2idl](../../cmd/libs/go2idl/) - - Pros: - - trace type changes automatically, all type changes in Go will be automatically generated into proto files - - Cons: - - type change may break existing API implementations, e.g. 
-    new fields added automatically may not be noticed by the remote runtime
-    - needs to convert Go types to gRPC-generated types, and vice versa
-    - needs to process attribute ordering carefully so as not to break generated protobufs (this could be done by using a [protobuf tag](https://developers.google.com/protocol-buffers/docs/gotutorial))
-    - go2idl doesn't support gRPC; [protoc-gen-gogo](https://github.com/gogo/protobuf) is still required for generating the gRPC client
-* Embed those types as raw protobuf definitions and generate Go files by [protoc-gen-gogo](https://github.com/gogo/protobuf)
-  - Pros:
-    - decouples type definitions; all type changes in Go will be added to the proto file manually, so it is easier to track gRPC API version changes
-    - Kubelet could reuse Go types generated by `protoc-gen-gogo` to avoid type conversions
-  - Cons:
-    - duplicate definitions of the same types
-    - hard to track type changes automatically
-    - need to manage proto files manually
-
-For better version control and faster iteration, this proposal embeds all those types in `api.proto` directly.
-
-## Implementation
-
-Each new runtime should implement the [gRPC](http://www.grpc.io) server based on [pkg/kubelet/api/v1alpha1/runtime/api.proto](../../pkg/kubelet/api/v1alpha1/runtime/api.proto). For version control, `KubeletGenericRuntimeManager` will call `RemoteRuntime`'s `Version()` interface with the runtime API version. To keep backward compatibility, the API follows the standard [protobuf guide](https://developers.google.com/protocol-buffers/docs/proto) when deprecating or adding interfaces.
-
-A new kubelet flag `--container-runtime-endpoint` (which overrides `--container-runtime`) will identify the unix socket file of the remote runtime service, and a new flag `--image-service-endpoint` will identify the unix socket file of the image service.
-
-To facilitate switching the current container runtimes (e.g. `docker` or `rkt`) to the new runtime API, `KubeletGenericRuntimeManager` will provide a plugin mechanism that allows specifying either a local implementation or a gRPC implementation.
-
-## Community Discussion
-
-This proposal was first filed by [@brendandburns](https://github.com/brendandburns) at [kubernetes/13768](https://github.com/kubernetes/kubernetes/issues/13768):
-
-* [kubernetes/13768](https://github.com/kubernetes/kubernetes/issues/13768)
-* [kubernetes/13709](https://github.com/kubernetes/kubernetes/pull/13079)
-* [New container runtime interface](https://github.com/kubernetes/kubernetes/pull/25899)
-
-
-
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/runtime-client-server.md?pixel)]()
-
+This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/runtime-client-server.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/runtime-client-server.md)
diff --git a/docs/proposals/runtime-pod-cache.md b/docs/proposals/runtime-pod-cache.md
index d4926c3e879..b04b0b47776 100644
--- a/docs/proposals/runtime-pod-cache.md
+++ b/docs/proposals/runtime-pod-cache.md
@@ -1,173 +1 @@
-# Kubelet: Runtime Pod Cache
-
-This proposal builds on top of the Pod Lifecycle Event Generator (PLEG) proposed
-in [#12802](https://issues.k8s.io/12802). It assumes that Kubelet subscribes to
-the pod lifecycle event stream to eliminate periodic polling of pod
-states. Please see [#12802](https://issues.k8s.io/12802) for the motivation and
-design concept for PLEG.
- -Runtime pod cache is an in-memory cache which stores the *status* of -all pods, and is maintained by PLEG. It serves as a single source of -truth for internal pod status, freeing Kubelet from querying the -container runtime. - -## Motivation - -With PLEG, Kubelet no longer needs to perform comprehensive state -checking for all pods periodically. It only instructs a pod worker to -start syncing when there is a change of its pod status. Nevertheless, -during each sync, a pod worker still needs to construct the pod status -by examining all containers (whether dead or alive) in the pod, due to -the lack of the caching of previous states. With the integration of -pod cache, we can further improve Kubelet's CPU usage by - - 1. Lowering the number of concurrent requests to the container - runtime since pod workers no longer have to query the runtime - individually. - 2. Lowering the total number of inspect requests because there is no - need to inspect containers with no state changes. - -***Don't we already have a [container runtime cache] -(https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/container/runtime_cache.go)?*** - -The runtime cache is an optimization that reduces the number of `GetPods()` -calls from the workers. However, - - * The cache does not store all information necessary for a worker to - complete a sync (e.g., `docker inspect`); workers still need to inspect - containers individually to generate `api.PodStatus`. - * Workers sometimes need to bypass the cache in order to retrieve the - latest pod state. - -This proposal generalizes the cache and instructs PLEG to populate the cache, so -that the content is always up-to-date. - -**Why can't each worker cache its own pod status?** - -The short answer is yes, they can. The longer answer is that localized -caching limits the use of the cache content -- other components cannot -access it. This often leads to caching at multiple places and/or passing -objects around, complicating the control flow. - -## Runtime Pod Cache - -![pod cache](pod-cache.png) - -Pod cache stores the `PodStatus` for all pods on the node. `PodStatus` encompasses -all the information required from the container runtime to generate -`api.PodStatus` for a pod. - -```go -// PodStatus represents the status of the pod and its containers. -// api.PodStatus can be derived from examining PodStatus and api.Pod. -type PodStatus struct { - ID types.UID - Name string - Namespace string - IP string - ContainerStatuses []*ContainerStatus -} - -// ContainerStatus represents the status of a container. -type ContainerStatus struct { - ID ContainerID - Name string - State ContainerState - CreatedAt time.Time - StartedAt time.Time - FinishedAt time.Time - ExitCode int - Image string - ImageID string - Hash uint64 - RestartCount int - Reason string - Message string -} -``` - -`PodStatus` is defined in the container runtime interface, hence is -runtime-agnostic. - -PLEG is responsible for updating the entries pod cache, hence always keeping -the cache up-to-date. - -1. Detect change of container state -2. Inspect the pod for details -3. Update the pod cache with the new PodStatus - - If there is no real change of the pod entry, do nothing - - Otherwise, generate and send out the corresponding pod lifecycle event - -Note that in (3), PLEG can check if there is any disparity between the old -and the new pod entry to filter out duplicated events if needed. 
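-
-The following is a hedged Go sketch of what such a cache might look like, using the
-`PodStatus` type defined above. The type and method names are illustrative only, not the
-actual kubelet implementation.
-
-```go
-import (
-	"reflect"
-	"sync"
-	"time"
-)
-
-type UID string
-
-// podCache maps pod UID to the latest PodStatus populated by PLEG.
-type podCache struct {
-	lock    sync.RWMutex
-	entries map[UID]*PodStatus // PodStatus as defined earlier in this proposal
-	stamps  map[UID]time.Time  // when each entry was last refreshed by PLEG
-}
-
-func newPodCache() *podCache {
-	return &podCache{entries: map[UID]*PodStatus{}, stamps: map[UID]time.Time{}}
-}
-
-// Set is called by PLEG after inspecting a pod; it reports whether the status
-// actually changed, which PLEG can use to filter out duplicated events.
-func (c *podCache) Set(id UID, status *PodStatus, now time.Time) bool {
-	c.lock.Lock()
-	defer c.lock.Unlock()
-	changed := !reflect.DeepEqual(c.entries[id], status)
-	c.entries[id] = status
-	c.stamps[id] = now
-	return changed
-}
-
-// Get is called by pod workers instead of querying the container runtime.
-func (c *podCache) Get(id UID) (*PodStatus, bool) {
-	c.lock.RLock()
-	defer c.lock.RUnlock()
-	s, ok := c.entries[id]
-	return s, ok
-}
-
-// Delete is called by PLEG when the runtime no longer knows about the pod.
-func (c *podCache) Delete(id UID) {
-	c.lock.Lock()
-	defer c.lock.Unlock()
-	delete(c.entries, id)
-	delete(c.stamps, id)
-}
-```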
- -### Evict cache entries - -Note that the cache represents all the pods/containers known by the container -runtime. A cache entry should only be evicted if the pod is no longer visible -by the container runtime. PLEG is responsible for deleting entries in the -cache. - -### Generate `api.PodStatus` - -Because pod cache stores the up-to-date `PodStatus` of the pods, Kubelet can -generate the `api.PodStatus` by interpreting the cache entry at any -time. To avoid sending intermediate status (e.g., while a pod worker -is restarting a container), we will instruct the pod worker to generate a new -status at the beginning of each sync. - -### Cache contention - -Cache contention should not be a problem when the number of pods is -small. When Kubelet scales, we can always shard the pods by ID to -reduce contention. - -### Disk management - -The pod cache is not capable to fulfill the needs of container/image garbage -collectors as they may demand more than pod-level information. These components -will still need to query the container runtime directly at times. We may -consider extending the cache for these use cases, but they are beyond the scope -of this proposal. - - -## Impact on Pod Worker Control Flow - -A pod worker may perform various operations (e.g., start/kill a container) -during a sync. They will expect to see the results of such operations reflected -in the cache in the next sync. Alternately, they can bypass the cache and -query the container runtime directly to get the latest status. However, this -is not desirable since the cache is introduced exactly to eliminate unnecessary, -concurrent queries. Therefore, a pod worker should be blocked until all expected -results have been updated to the cache by PLEG. - -Depending on the type of PLEG (see [#12802](https://issues.k8s.io/12802)) in -use, the methods to check whether a requirement is met can differ. For a -PLEG that solely relies on relisting, a pod worker can simply wait until the -relist timestamp is newer than the end of the worker's last sync. On the other -hand, if pod worker knows what events to expect, they can also block until the -events are observed. - -It should be noted that `api.PodStatus` will only be generated by the pod -worker *after* the cache has been updated. This means that the perceived -responsiveness of Kubelet (from querying the API server) will be affected by -how soon the cache can be populated. For the pure-relisting PLEG, the relist -period can become the bottleneck. On the other hand, A PLEG which watches the -upstream event stream (and knows how what events to expect) is not restricted -by such periods and should improve Kubelet's perceived responsiveness. - -## TODOs for v1.2 - - - Redefine container runtime types ([#12619](https://issues.k8s.io/12619)): - and introduce `PodStatus`. Refactor dockertools and rkt to use the new type. - - - Add cache and instruct PLEG to populate it. - - - Refactor Kubelet to use the cache. - - - Deprecate the old runtime cache. 
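-
-To illustrate the "pod worker blocks until the cache has caught up" behavior described in
-the pod worker control flow section above, here is a hedged sketch that extends the earlier
-cache idea with a blocking read. The condition-variable approach and the `GetNewerThan`
-name are illustrative assumptions, not the actual implementation.
-
-```go
-import (
-	"sync"
-	"time"
-)
-
-type timedStatus struct {
-	status    *PodStatus // PodStatus as defined earlier in this proposal
-	timestamp time.Time
-}
-
-type blockingCache struct {
-	lock    sync.Mutex
-	cond    *sync.Cond
-	entries map[UID]timedStatus
-}
-
-func newBlockingCache() *blockingCache {
-	c := &blockingCache{entries: map[UID]timedStatus{}}
-	c.cond = sync.NewCond(&c.lock)
-	return c
-}
-
-// Set is called by PLEG (or a global relist) and wakes up any waiting workers.
-func (c *blockingCache) Set(id UID, s *PodStatus, now time.Time) {
-	c.lock.Lock()
-	defer c.lock.Unlock()
-	c.entries[id] = timedStatus{status: s, timestamp: now}
-	c.cond.Broadcast()
-}
-
-// GetNewerThan blocks the calling pod worker until the cached entry for id is
-// newer than minTime (e.g. the end of the worker's last sync), then returns it.
-func (c *blockingCache) GetNewerThan(id UID, minTime time.Time) *PodStatus {
-	c.lock.Lock()
-	defer c.lock.Unlock()
-	for {
-		if e, ok := c.entries[id]; ok && e.timestamp.After(minTime) {
-			return e.status
-		}
-		c.cond.Wait()
-	}
-}
-```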
-
-
-
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/runtime-pod-cache.md?pixel)]()
-
+This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/runtime-pod-cache.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/runtime-pod-cache.md)
diff --git a/docs/proposals/runtimeconfig.md b/docs/proposals/runtimeconfig.md
index 896ca13056a..c3e2ef77692 100644
--- a/docs/proposals/runtimeconfig.md
+++ b/docs/proposals/runtimeconfig.md
@@ -1,69 +1 @@
-# Overview
-
-Proposes adding a `--feature-config` flag to core kube system components:
-apiserver, scheduler, controller-manager, kube-proxy, and selected addons.
-This flag will be used to enable/disable alpha features on a per-component basis.
-
-## Motivation
-
-The motivation is enabling/disabling features that are not tied to
-an API group. API groups can be selectively enabled/disabled in the
-apiserver via the existing `--runtime-config` flag on the apiserver, but there is
-currently no mechanism to toggle alpha features that are controlled by
-e.g. annotations. This means the burden of controlling whether such
-features are enabled in a particular cluster is on feature implementors;
-they must either define some ad hoc mechanism for toggling (e.g. a flag
-on the component binary) or else toggle the feature on/off at compile time.
-
-By adding a `--feature-config` flag to all kube-system components, alpha features
-can be toggled on a per-component basis by passing `enableAlphaFeature=true|false`
-to `--feature-config` for each component that the feature touches.
-
-## Design
-
-The following components will all get a `--feature-config` flag,
-which loads a `config.ConfigurationMap`:
-
-- kube-apiserver
-- kube-scheduler
-- kube-controller-manager
-- kube-proxy
-- kube-dns
-
-(Note that the kubelet is omitted; its dynamic config story is being addressed
-by #29459.) Alpha features that are not accessed via an alpha API
-group should define an `enableFeatureName` flag and use it to toggle
-activation of the feature in each system component that the feature
-uses.
-
-## Suggested conventions
-
-This proposal only covers adding a mechanism to toggle features in
-system components. Implementation details will still depend on the alpha
-feature's owner(s). The following are suggested conventions:
-
-- Naming for feature config entries should follow the pattern
-  "enable<FeatureName>=true".
-- Features that touch multiple components should reserve the same key
-  in each component to toggle on/off.
-- Alpha features should be disabled by default. Beta features may
-  be enabled by default. Refer to docs/devel/api_changes.md#alpha-beta-and-stable-versions
-  for more detailed guidance on alpha vs. beta.
-
-## Upgrade support
-
-As the primary motivation for cluster config is toggling alpha
-features, upgrade support is not in scope. Enabling or disabling
-a feature is necessarily a breaking change, so config should
-not be altered in a running cluster.
-
-## Future work
-
-1. The eventual plan is for component config to be managed by versioned
-APIs and not flags (#12245). When that is added, toggling of features
-could be handled by versioned component config and the component flags
-deprecated.
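-
-As a concrete illustration of the convention above, here is a hedged Go sketch of how a
-component might gate an alpha feature on a `--feature-config` entry. The parsing shown is
-simplified, and the feature key used in the usage comment is hypothetical; real components
-would use `config.ConfigurationMap` and the normal flag plumbing.
-
-```go
-import (
-	"strconv"
-	"strings"
-)
-
-// parseFeatureConfig turns "enableFoo=true,enableBar=false" style values into a map.
-func parseFeatureConfig(value string) map[string]bool {
-	features := map[string]bool{}
-	for _, kv := range strings.Split(value, ",") {
-		parts := strings.SplitN(kv, "=", 2)
-		if len(parts) != 2 {
-			continue
-		}
-		enabled, err := strconv.ParseBool(strings.TrimSpace(parts[1]))
-		if err != nil {
-			continue
-		}
-		features[strings.TrimSpace(parts[0])] = enabled
-	}
-	return features
-}
-
-// featureEnabled defaults to false, matching the "alpha features are disabled
-// by default" convention above.
-func featureEnabled(features map[string]bool, name string) bool {
-	return features[name]
-}
-
-// Usage (hypothetical key):
-//   features := parseFeatureConfig("enableNodeAffinity=true")
-//   if featureEnabled(features, "enableNodeAffinity") { /* enable the alpha code path */ }
-```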
-
-
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/runtimeconfig.md?pixel)]()
-
+This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/runtimeconfig.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/runtimeconfig.md)
diff --git a/docs/proposals/scalability-testing.md b/docs/proposals/scalability-testing.md
index d0fcd1bef01..0898be72b84 100644
--- a/docs/proposals/scalability-testing.md
+++ b/docs/proposals/scalability-testing.md
@@ -1,72 +1 @@
-
-## Background
-
-We have a goal to be able to scale to 1000-node clusters by the end of 2015.
-As a result, we need to be able to run some kind of regression tests and deliver
-a mechanism so that developers can test their changes with respect to performance.
-
-Ideally, we would also like to run performance tests on PRs - although it might
-be impossible to run them on every single PR, we may give a reviewer the ability
-to trigger them if a change has a non-obvious impact on performance
-(something like "k8s-bot run scalability tests please" should be feasible).
-
-However, running performance tests on 1000-node clusters (or even bigger ones in the
-future) is a non-starter. Thus, we need some more sophisticated infrastructure
-to simulate big clusters on a relatively small number of machines and/or cores.
-
-This document describes two approaches to tackling this problem.
-Once we have a better understanding of their consequences, we may want to
-decide to drop one of them, but we are not yet in that position.
-
-
-## Proposal 1 - Kubemark
-
-In this proposal we are focusing on scalability testing of master components.
-We do NOT focus on node scalability - this issue should be handled separately.
-
-Since we do not focus on node performance, we don't need a real Kubelet or
-KubeProxy - in fact, we don't even need to start real containers.
-All we actually need is some Kubelet-like and KubeProxy-like components
-that simulate the load on the apiserver that their real equivalents
-generate (e.g. sending NodeStatus updates, watching for pods, watching for
-endpoints (KubeProxy), etc.).
-
-What needs to be done:
-
-1. Determine what requests both KubeProxy and Kubelet are sending to the apiserver.
-2. Create a KubeletSim that generates the same load on the apiserver as the
-   real Kubelet, but does not start any containers (see the sketch below). In the initial version we
-   can assume that pods never die, so it is enough to just react to the state
-   changes read from the apiserver.
-   TBD: Maybe we can reuse a real Kubelet for it by just injecting some "fake"
-   interfaces into it?
-3. Similarly, create a KubeProxySim that generates the same load on the apiserver
-   as a real KubeProxy. Again, since we are not planning to talk to those
-   containers, it basically doesn't need to do anything apart from that.
-   TBD: Maybe we can reuse a real KubeProxy for it by just injecting some "fake"
-   interfaces into it?
-4. Refactor kube-up/kube-down scripts (or create new ones) to allow starting
-   a cluster with KubeletSim and KubeProxySim instead of the real ones, and to pack
-   many of them onto a single machine.
-5. Create a load generator for it (initially it would probably be enough to
-   reuse the tests that we use in the gce-scalability suite).
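-
-The following is a rough, hypothetical Go sketch of the KubeletSim idea from step 2.
-The `NodeStatusUpdater` interface and the pod-event channel are stand-ins for the real
-apiserver client plumbing; the point is only to mimic the kubelet's traffic pattern
-(periodic node status updates plus reacting to pod changes), not to propose a design.
-
-```go
-import "time"
-
-type PodEvent struct {
-	Name    string
-	Deleted bool
-}
-
-// NodeStatusUpdater stands in for whatever client is used to post node status.
-type NodeStatusUpdater interface {
-	UpdateNodeStatus(nodeName string) error
-}
-
-type KubeletSim struct {
-	nodeName string
-	updater  NodeStatusUpdater
-	podCh    <-chan PodEvent // fed by a watch on pods bound to this node
-	interval time.Duration   // node status update period, e.g. 10s
-}
-
-// Run never starts containers; it only generates apiserver load.
-func (s *KubeletSim) Run(stop <-chan struct{}) {
-	ticker := time.NewTicker(s.interval)
-	defer ticker.Stop()
-	running := map[string]bool{} // pods we pretend to run
-	for {
-		select {
-		case <-ticker.C:
-			_ = s.updater.UpdateNodeStatus(s.nodeName)
-		case ev := <-s.podCh:
-			if ev.Deleted {
-				delete(running, ev.Name)
-			} else {
-				running[ev.Name] = true // pretend the pod started successfully
-			}
-		case <-stop:
-			return
-		}
-	}
-}
-```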
- - -## Proposal 2 - Oversubscribing - -The other method we are proposing is to oversubscribe the resource, -or in essence enable a single node to look like many separate nodes even though -they reside on a single host. This is a well established pattern in many different -cluster managers (for more details see -http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/doc.prd/index.html ). -There are a couple of different ways to accomplish this, but the most viable method -is to run privileged kubelet pods under a hosts kubelet process. These pods then -register back with the master via the introspective service using modified names -as not to collide. - -Complications may currently exist around container tracking and ownership in docker. - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/scalability-testing.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scalability-testing.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scalability-testing.md) diff --git a/docs/proposals/scheduledjob.md b/docs/proposals/scheduledjob.md index 9c7e8d9fe76..ca89f9a71eb 100644 --- a/docs/proposals/scheduledjob.md +++ b/docs/proposals/scheduledjob.md @@ -1,335 +1 @@ -# ScheduledJob Controller - -## Abstract - -A proposal for implementing a new controller - ScheduledJob controller - which -will be responsible for managing time based jobs, namely: -* once at a specified point in time, -* repeatedly at a specified point in time. - -There is already a discussion regarding this subject: -* Distributed CRON jobs [#2156](https://issues.k8s.io/2156) - -There are also similar solutions available, already: -* [Mesos Chronos](https://github.com/mesos/chronos) -* [Quartz](http://quartz-scheduler.org/) - - -## Use Cases - -1. Be able to schedule a job execution at a given point in time. -1. Be able to create a periodic job, e.g. database backup, sending emails. - - -## Motivation - -ScheduledJobs are needed for performing all time-related actions, namely backups, -report generation and the like. Each of these tasks should be allowed to run -repeatedly (once a day/month, etc.) or once at a given point in time. - - -## Design Overview - -Users create a ScheduledJob object. One ScheduledJob object -is like one line of a crontab file. It has a schedule of when to run, -in [Cron](https://en.wikipedia.org/wiki/Cron) format. - - -The ScheduledJob controller creates a Job object [Job](job.md) -about once per execution time of the scheduled (e.g. once per -day for a daily schedule.) We say "about" because there are certain -circumstances where two jobs might be created, or no job might be -created. We attempt to make these rare, but do not completely prevent -them. Therefore, Jobs should be idempotent. - -The Job object is responsible for any retrying of Pods, and any parallelism -among pods it creates, and determining the success or failure of the set of -pods. The ScheduledJob does not examine pods at all. - - -### ScheduledJob resource - -The new `ScheduledJob` object will have the following contents: - -```go -// ScheduledJob represents the configuration of a single scheduled job. -type ScheduledJob struct { - TypeMeta - ObjectMeta - - // Spec is a structure defining the expected behavior of a job, including the schedule. - Spec ScheduledJobSpec - - // Status is a structure describing current status of a job. 
- Status ScheduledJobStatus -} - -// ScheduledJobList is a collection of scheduled jobs. -type ScheduledJobList struct { - TypeMeta - ListMeta - - Items []ScheduledJob -} -``` - -The `ScheduledJobSpec` structure is defined to contain all the information how the actual -job execution will look like, including the `JobSpec` from [Job API](job.md) -and the schedule in [Cron](https://en.wikipedia.org/wiki/Cron) format. This implies -that each ScheduledJob execution will be created from the JobSpec actual at a point -in time when the execution will be started. This also implies that any changes -to ScheduledJobSpec will be applied upon subsequent execution of a job. - -```go -// ScheduledJobSpec describes how the job execution will look like and when it will actually run. -type ScheduledJobSpec struct { - - // Schedule contains the schedule in Cron format, see https://en.wikipedia.org/wiki/Cron. - Schedule string - - // Optional deadline in seconds for starting the job if it misses scheduled - // time for any reason. Missed jobs executions will be counted as failed ones. - StartingDeadlineSeconds *int64 - - // ConcurrencyPolicy specifies how to treat concurrent executions of a Job. - ConcurrencyPolicy ConcurrencyPolicy - - // Suspend flag tells the controller to suspend subsequent executions, it does - // not apply to already started executions. Defaults to false. - Suspend bool - - // JobTemplate is the object that describes the job that will be created when - // executing a ScheduledJob. - JobTemplate *JobTemplateSpec -} - -// JobTemplateSpec describes of the Job that will be created when executing -// a ScheduledJob, including its standard metadata. -type JobTemplateSpec struct { - ObjectMeta - - // Specification of the desired behavior of the job. - Spec JobSpec -} - -// ConcurrencyPolicy describes how the job will be handled. -// Only one of the following concurrent policies may be specified. -// If none of the following policies is specified, the default one -// is AllowConcurrent. -type ConcurrencyPolicy string - -const ( - // AllowConcurrent allows ScheduledJobs to run concurrently. - AllowConcurrent ConcurrencyPolicy = "Allow" - - // ForbidConcurrent forbids concurrent runs, skipping next run if previous - // hasn't finished yet. - ForbidConcurrent ConcurrencyPolicy = "Forbid" - - // ReplaceConcurrent cancels currently running job and replaces it with a new one. - ReplaceConcurrent ConcurrencyPolicy = "Replace" -) -``` - -`ScheduledJobStatus` structure is defined to contain information about scheduled -job executions. The structure holds a list of currently running job instances -and additional information about overall successful and unsuccessful job executions. - -```go -// ScheduledJobStatus represents the current state of a Job. -type ScheduledJobStatus struct { - // Active holds pointers to currently running jobs. - Active []ObjectReference - - // Successful tracks the overall amount of successful completions of this job. - Successful int64 - - // Failed tracks the overall amount of failures of this job. - Failed int64 - - // LastScheduleTime keeps information of when was the last time the job was successfully scheduled. - LastScheduleTime Time -} -``` - -Users must use a generated selector for the job. - -## Modifications to Job resource - -TODO for beta: forbid manual selector since that could cause confusing between -subsequent jobs. 
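-
-To make the API above concrete, here is a hedged example of constructing a ScheduledJob in
-Go using the types defined in this proposal. It is an illustrative fragment: the
-ScheduledJob types above and `ObjectMeta`/`JobSpec` are assumed to be in scope, and the
-JobSpec body (pod template, parallelism, etc.) is elided.
-
-```go
-// Illustrative fragment, not a tested manifest: a report job that runs nightly
-// at 02:00 and skips a run if the previous night's run is still going.
-func exampleScheduledJob() ScheduledJob {
-	startingDeadline := int64(600) // give up if we miss the scheduled time by 10 minutes
-	return ScheduledJob{
-		ObjectMeta: ObjectMeta{
-			Name:      "nightly-earnings-report",
-			Namespace: "ns1",
-		},
-		Spec: ScheduledJobSpec{
-			Schedule:                "0 2 * * *", // standard cron format
-			StartingDeadlineSeconds: &startingDeadline,
-			ConcurrencyPolicy:       ForbidConcurrent,
-			Suspend:                 false,
-			JobTemplate: &JobTemplateSpec{
-				Spec: JobSpec{
-					// Pod template for the report goes here. Jobs should be
-					// idempotent, since a schedule point may occasionally
-					// produce zero or two Jobs (see the overview above).
-				},
-			},
-		},
-	}
-}
-```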
-
-### Running ScheduledJobs using kubectl
-
-A user should be able to easily start a ScheduledJob using `kubectl` (similarly
-to running regular jobs). For example, to run a job with a specified schedule,
-a user should be able to type something simple like:
-
-```
-kubectl run pi --image=perl --restart=OnFailure --runAt="0 14 21 7 *" -- perl -Mbignum=bpi -wle 'print bpi(2000)'
-```
-
-In the above example:
-
-* `--restart=OnFailure` implies creating a job instead of a replicationController.
-* `--runAt="0 14 21 7 *"` implies the schedule with which the job should be run, here
-  July 21, 2pm. This value will be validated according to the same rules which
-  apply to `.spec.schedule`.
-
-## Fields Added to Job Template
-
-When the controller creates a Job from the JobTemplateSpec in the ScheduledJob, it
-adds the following fields to the Job:
-
-- a name, based on the ScheduledJob's name, but with a suffix to distinguish
-  multiple executions, which may overlap.
-- the standard created-by annotation on the Job, pointing to the SJ that created it.
-  The standard key is `kubernetes.io/created-by`. The value is a serialized JSON object, like
-  `{ "kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ScheduledJob","namespace":"default",`
-  `"name":"nightly-earnings-report","uid":"5ef034e0-1890-11e6-8935-42010af0003e","apiVersion":...`
-  This serialization contains the UID of the parent. It is used to match the Job to the SJ that created
-  it.
-
-## Updates to ScheduledJobs
-
-If the schedule is updated on a ScheduledJob, it will:
-- continue to use the Status.Active list of jobs to detect conflicts.
-- try to fulfill all recently-passed times for the new schedule, by starting
-  new jobs. But it will not try to fulfill times prior to the
-  Status.LastScheduleTime.
-  - Example: If you have a schedule to run every 30 minutes, and change that to hourly, then the previously started
-    top-of-the-hour run, in Status.Active, will be seen and no new job started.
-  - Example: If you have a schedule to run every hour, change that to 30-minutely, at 31 minutes past the hour,
-    one run will be started immediately for the starting time that has just passed.
-
-If the job template of a ScheduledJob is updated, then future executions use the new template
-but old ones still satisfy the schedule and are not re-run just because the template changed.
-
-If you delete and replace a ScheduledJob with one of the same name, it will:
-- not use any old Status.Active, and not consider any existing running or terminated jobs from the previous
-  ScheduledJob (with a different UID) at all when determining conflicts, what needs to be started, etc.
-- not be able to create new instances of a job if an existing Job with the same time-based hash in its
-  name (see below) is still present. So, delete the old Job if you want to re-run that execution.
-- not "re-run" jobs for "start times" before the creation time of the new ScheduledJob object.
-- not consider executions from the previous UID when making decisions about what executions to
-  start, or status, etc.
-- lose the history of the old SJ.
-
-To preserve status, you can suspend the old one, and make one with a new name, or make a note of the old status.
-
-
-## Fault-Tolerance
-
-### Starting Jobs in the face of controller failures
-
-If the process running the scheduledJob controller fails,
-and takes a while to restart, the scheduledJob controller
-may miss the time window and it will be too late to start a job.
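-
-To illustrate how `StartingDeadlineSeconds` bounds that window, here is a minimal, hedged
-sketch (illustrative only; the helper name is invented and `time` is the standard library
-package) of the check a controller could perform for a missed start time once it is back up:
-
-```go
-// canStillStart reports whether a Job for the nominal start time t may still be
-// started at time now, given the optional StartingDeadlineSeconds from the spec.
-// A nil deadline means there is no limit; past the deadline the missed execution
-// is counted as failed instead of being started late.
-func canStillStart(t, now time.Time, startingDeadlineSeconds *int64) bool {
-	if startingDeadlineSeconds == nil {
-		return true
-	}
-	deadline := t.Add(time.Duration(*startingDeadlineSeconds) * time.Second)
-	return !now.After(deadline)
-}
-```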
-
-With a single scheduledJob controller process, we cannot give
-very strong assurances about not missing job starts.
-
-With the suggested HA configuration, there are multiple controller
-processes, and they use master election to determine which one
-is active at any time.
-
-If the Job's StartingDeadlineSeconds is long enough, and the
-lease for the master lock is short enough, and other controller
-processes are running, then a Job will be started.
-
-TODO: consider hard-coding the minimum StartingDeadlineSeconds
-at, say, 1 minute. Then we can offer a clearer guarantee,
-assuming we know what the setting of the lock lease duration is.
-
-### Ensuring jobs are run at most once
-
-There are three problems here:
-
-- ensure at most one Job is created per "start time" of a schedule.
-- ensure that at most one Pod is created per Job.
-- ensure at most one container start occurs per Pod.
-
-#### Ensuring one Job
-
-Multiple jobs might be created in the following sequence:
-
-1. the scheduled job controller sends a request to start Job J1 to fulfill start time T.
-1. the create request is accepted by the apiserver and enqueued but not yet written to etcd.
-1. the scheduled job controller crashes.
-1. a new scheduled job controller starts, lists the existing jobs, and does not see one created.
-1. it creates a new one.
-1. the first one eventually gets written to etcd.
-1. there are now two jobs for the same start time.
-
-We can solve this in several ways:
-
-1. with a three-phase protocol, e.g.:
-  1. the controller creates a "suspended" job.
-  1. the controller writes an annotation in the SJ saying that it created a job for this time.
-  1. the controller unsuspends that job.
-1. by picking a deterministic name, so that at most one object create can succeed.
-
-#### Ensuring one Pod
-
-The Job object does not currently have a way to ask for this.
-Even if it did, the controller is not written to support it.
-Same problem as above.
-
-#### Ensuring one container invocation per Pod
-
-The Kubelet is not written to ensure at-most-one-container-start per pod.
-
-#### Decision
-
-This is too hard to do for the alpha version. We will await user
-feedback to see if the "at most once" property is needed in the beta version.
-
-This is awkward but possible for a containerized application to ensure on its own, as it needs
-to know which ScheduledJob name and Start Time it is from, and then record the attempt
-in a shared storage system. We should ensure it can extract this data from its annotations
-using the downward API.
-
-## Name of Jobs
-
-A ScheduledJob creates one Job each time a Job should run.
-Since there may be concurrent jobs, and since we might want to keep failed
-non-overlapping Jobs around as a debugging record, each Job created by the same ScheduledJob
-needs a distinct name.
-
-To make the Jobs from the same ScheduledJob distinct, we could use a random string,
-in the way that pods have a `generateName`. For example, a scheduledJob named `nightly-earnings-report`
-in namespace `ns1` might create a job `nightly-earnings-report-3m4d3`, and later create
-a job called `nightly-earnings-report-6k7ts`. This is consistent with pods, but
-does not give the user much information.
-
-Alternatively, we can use time as a uniquifier. For example, the same scheduledJob could
-create a job called `nightly-earnings-report-2016-May-19`.
-However, for Jobs that run more than once per day, we would need to represent
-time as well as date. Standard date formats (e.g. RFC 3339) use colons for time.
-Kubernetes names cannot include colons, and using a non-standard date format without colons
-will annoy some users.
-
-Also, date strings are much longer than random suffixes, which means that
-the pods will also have long names, and that we are more likely to exceed the
-253 character name limit when combining the scheduled-job name,
-the time suffix, and the pod random suffix.
-
-One option would be to compute a hash of the nominal start time of the job,
-and use that as a suffix. This would not provide the user with an indication
-of the start time, but it would prevent creation of the same execution
-by two instances (replicated or restarting) of the controller process.
-
-We chose to use the hashed-date suffix approach.
-
-## Future evolution
-
-Below are the possible future extensions to the ScheduledJob controller:
-* Be able to specify a workflow template in the `.spec` field. This relates to the work
-  happening in [#18827](https://issues.k8s.io/18827).
-* Be able to specify a more general template in the `.spec` field, to create arbitrary
-  types of resources. This relates to the work happening in [#18215](https://issues.k8s.io/18215).
-
-
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/scheduledjob.md?pixel)]()
-
+This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduledjob.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduledjob.md)
diff --git a/docs/proposals/secret-configmap-downwarapi-file-mode.md b/docs/proposals/secret-configmap-downwarapi-file-mode.md
index 42def9bf092..1ae9ade30d5 100644
--- a/docs/proposals/secret-configmap-downwarapi-file-mode.md
+++ b/docs/proposals/secret-configmap-downwarapi-file-mode.md
@@ -1,186 +1 @@
-# Secrets, configmaps and downwardAPI file mode bits
-
-Author: Rodrigo Campos (@rata), Tim Hockin (@thockin)
-
-Date: July 2016
-
-Status: Design in progress
-
-# Goal
-
-Allow users to specify permission mode bits for a secret/configmap/downwardAPI
-file mounted as a volume. For example, if a secret has several keys, a user
-should be able to specify the permission mode bits for any file, and they may
-all have different modes.
-
-Note that by "permission" I refer only to the file mode here, and I may use the
-two terms interchangeably. This is not about the file owners, although let me
-know if you prefer to discuss that here too.
-
-
-# Motivation
-
-There is currently no way to set permissions on secret files mounted as volumes.
-This can be a problem for applications that require files to be readable only by
-the owner (like fetchmail, ssh, the pgpass file in postgres[1], etc.), and it's
-just not possible to run them without changing the file mode. In-house
-applications may have this restriction too.
-
-It doesn't seem unreasonable for someone to want a secret, which is sensitive
-information, not to be world-readable (or group-readable) as it is by default.
-Granted, it's already in a container that is (hopefully) running only one
-process, so it might not be so bad; but people running more than one process in
-a container have asked for this too[2].
-
-For example, my use case is that we are migrating to kubernetes (the migration
-is in progress and will take a while) and we have migrated our deployment web
-interface to kubernetes. But this interface connects to the servers via ssh, so
-it needs the ssh keys, and ssh will only work if the ssh key file mode is the
-one it expects.
-
-This was asked on the mailing list here[2] and here[3], too.
-
-[1]: https://www.postgresql.org/docs/9.1/static/libpq-pgpass.html
-[2]: https://groups.google.com/forum/#!topic/kubernetes-dev/eTnfMJSqmaM
-[3]: https://groups.google.com/forum/#!topic/google-containers/EcaOPq4M758
-
-# Alternatives considered
-
-Several alternatives have been considered:
-
- * Add a mode to the API definition when using secrets: this is backward
-   compatible as described in (docs/devel/api_changes.md) IIUC and seems like the
-   way to go. Also, @thockin said in the ML that he would consider such an
-   approach. It might be worth considering whether we want to do the same for
-   configmaps or for owners, but there is no need to do that now either.
-
- * Change the default file mode for secrets: I think this is unacceptable, as
-   stated in the api_changes doc, and besides it doesn't feel correct IMHO,
-   although it is technically an option. The argument for this might be that
-   world- and group-readable is not a nice default for a secret: we already take
-   care of not writing it to disk, etc., but the file is created world-readable
-   anyway. Such a default change has been done recently: the default was 0444 in
-   kubernetes <= 1.2 and is now 0644 in kubernetes >= 1.3 (and the file is not a
-   regular file, it's a symlink now). That change was done to minimize differences
-   between configmaps and secrets: https://github.com/kubernetes/kubernetes/pull/25285.
-   But doing it again, and changing to something more restrictive (it is now 0644
-   and would have to be 0400 to work with ssh and most apps), seems too risky: it
-   would be even more restrictive than in k8s 1.2, especially if there is no way
-   to revert to the old permissions and some use case is broken by this. And if we
-   are adding a way to change it, as in the option above, there is no need to rush
-   changing the default. So I would discard this.
-
- * Don't let people change this, at least for now, and suggest that those who
-   need it do it in a "postStart" command. This is acceptable if we don't want to
-   change the kubernetes core for some reason, although there seem to be valid use
-   cases. But if the user wants to use "postStart" for something else as well, it
-   becomes more cumbersome to do both things (either keep a script in the docker
-   image that deals with this, which is probably not a concern of this project and
-   so is not nice, or specify several commands by using "sh").
-
-# Proposed implementation
-
-The proposed implementation goes with the first alternative: adding a `mode`
-to the API.
-
-There will be a `defaultMode`, of type `int`, in `type SecretVolumeSource`, `type
-ConfigMapVolumeSource` and `type DownwardAPIVolumeSource`, and a `mode`, also of
-type `int`, in `type KeyToPath` and `DownwardAPIVolumeFile`.
-
-The mask provided in any of these fields will be ANDed with 0777 to disallow
-setting the sticky and setuid bits; it's not clear that that use case is needed
-or well understood. Directories within the volume will be created as before
-and are not affected by this setting.
-
-In other words, the fields will look like this:
-
-```
-type SecretVolumeSource struct {
-	// Name of the secret in the pod's namespace to use.
-	SecretName string `json:"secretName,omitempty"`
-	// If unspecified, each key-value pair in the Data field of the referenced
-	// Secret will be projected into the volume as a file whose name is the
-	// key and content is the value.
-	// If specified, the listed keys will be
-	// projected into the specified paths, and unlisted keys will not be
-	// present. If a key is specified which is not present in the Secret,
-	// the volume setup will error. Paths must be relative and may not contain
-	// the '..' path or start with '..'.
-	Items []KeyToPath `json:"items,omitempty"`
-	// Mode bits to use on created files by default. The used mode bits will
-	// be the provided AND 0777.
-	// Directories within the path are not affected by this setting.
-	DefaultMode int32 `json:"defaultMode,omitempty"`
-}
-
-type ConfigMapVolumeSource struct {
-	LocalObjectReference `json:",inline"`
-	// If unspecified, each key-value pair in the Data field of the referenced
-	// ConfigMap will be projected into the volume as a file whose name is the
-	// key and content is the value. If specified, the listed keys will be
-	// projected into the specified paths, and unlisted keys will not be
-	// present. If a key is specified which is not present in the ConfigMap,
-	// the volume setup will error. Paths must be relative and may not contain
-	// the '..' path or start with '..'.
-	Items []KeyToPath `json:"items,omitempty"`
-	// Mode bits to use on created files by default. The used mode bits will
-	// be the provided AND 0777.
-	// Directories within the path are not affected by this setting.
-	DefaultMode int32 `json:"defaultMode,omitempty"`
-}
-
-type KeyToPath struct {
-	// The key to project.
-	Key string `json:"key"`
-
-	// The relative path of the file to map the key to.
-	// May not be an absolute path.
-	// May not contain the path element '..'.
-	// May not start with the string '..'.
-	Path string `json:"path"`
-	// Mode bits to use on this file. The used mode bits will be the
-	// provided AND 0777.
-	Mode int32 `json:"mode,omitempty"`
-}
-
-type DownwardAPIVolumeSource struct {
-	// Items is a list of DownwardAPIVolume files.
-	Items []DownwardAPIVolumeFile `json:"items,omitempty"`
-	// Mode bits to use on created files by default. The used mode bits will
-	// be the provided AND 0777.
-	// Directories within the path are not affected by this setting.
-	DefaultMode int32 `json:"defaultMode,omitempty"`
-}
-
-type DownwardAPIVolumeFile struct {
-	// Required: Path is the relative path name of the file to be created. Must not be absolute or contain the '..' path. Must be utf-8 encoded. The first item of the relative path must not start with '..'
-	Path string `json:"path"`
-	// Required: Selects a field of the pod: only annotations, labels, name and namespace are supported.
-	FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"`
-	// Selects a resource of the container: only resource limits and requests
-	// (limits.cpu, limits.memory, requests.cpu and requests.memory) are currently supported.
-	ResourceFieldRef *ResourceFieldSelector `json:"resourceFieldRef,omitempty"`
-	// Mode bits to use on this file. The used mode bits will be the
-	// provided AND 0777.
-	Mode int32 `json:"mode,omitempty"`
-}
-```
-
-Adding the fields there allows the user to change the mode bits of every file in the
-object, which achieves the goal, while still having the option of setting a default
-instead of specifying every file in the object.
-
-There are two downsides:
-
- * The files are symlinks pointing to the real file, and only the real file's
-   permissions are set; the symlink has the classic symlink permissions.
-   This is something already present in 1.3, and it seems applications like ssh
-   work just fine with that. It is worth mentioning, but doesn't seem to be
-   an issue.
- * If the secret/configMap/downwardAPI is mounted in more than one container, - the file permissions will be the same on all. This is already the case for - Key mappings and doesn't seem like a big issue either. - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/secret-configmap-downwarapi-file-mode.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/secret-configmap-downwarapi-file-mode.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/secret-configmap-downwarapi-file-mode.md) diff --git a/docs/proposals/security-context-constraints.md b/docs/proposals/security-context-constraints.md index ae966e215ab..826b6076128 100644 --- a/docs/proposals/security-context-constraints.md +++ b/docs/proposals/security-context-constraints.md @@ -1,348 +1 @@ -## Abstract - -PodSecurityPolicy allows cluster administrators to control the creation and validation of a security -context for a pod and containers. - -## Motivation - -Administration of a multi-tenant cluster requires the ability to provide varying sets of permissions -among the tenants, the infrastructure components, and end users of the system who may themselves be -administrators within their own isolated namespace. - -Actors in a cluster may include infrastructure that is managed by administrators, infrastructure -that is exposed to end users (builds, deployments), the isolated end user namespaces in the cluster, and -the individual users inside those namespaces. Infrastructure components that operate on behalf of a -user (builds, deployments) should be allowed to run at an elevated level of permissions without -granting the user themselves an elevated set of permissions. - -## Goals - -1. Associate [service accounts](../design/service_accounts.md), groups, and users with -a set of constraints that dictate how a security context is established for a pod and the pod's containers. -1. Provide the ability for users and infrastructure components to run pods with elevated privileges -on behalf of another user or within a namespace where privileges are more restrictive. -1. Secure the ability to reference elevated permissions or to change the constraints under which -a user runs. - -## Use Cases - -Use case 1: -As an administrator, I can create a namespace for a person that can't create privileged containers -AND enforce that the UID of the containers is set to a certain value - -Use case 2: -As a cluster operator, an infrastructure component should be able to create a pod with elevated -privileges in a namespace where regular users cannot create pods with these privileges or execute -commands in that pod. - -Use case 3: -As a cluster administrator, I can allow a given namespace (or service account) to create privileged -pods or to run root pods - -Use case 4: -As a cluster administrator, I can allow a project administrator to control the security contexts of -pods and service accounts within a project - - -## Requirements - -1. Provide a set of restrictions that controls how a security context is created for pods and containers -as a new cluster-scoped object called `PodSecurityPolicy`. -1. User information in `user.Info` must be available to admission controllers. (Completed in -https://github.com/GoogleCloudPlatform/kubernetes/pull/8203) -1. Some authorizers may restrict a user’s ability to reference a service account. 
Systems requiring -the ability to secure service accounts on a user level must be able to add a policy that enables -referencing specific service accounts themselves. -1. Admission control must validate the creation of Pods against the allowed set of constraints. - -## Design - -### Model - -PodSecurityPolicy objects exist in the root scope, outside of a namespace. The -PodSecurityPolicy will reference users and groups that are allowed -to operate under the constraints. In order to support this, `ServiceAccounts` must be mapped -to a user name or group list by the authentication/authorization layers. This allows the security -context to treat users, groups, and service accounts uniformly. - -Below is a list of PodSecurityPolicies which will likely serve most use cases: - -1. A default policy object. This object is permissioned to something which covers all actors, such -as a `system:authenticated` group, and will likely be the most restrictive set of constraints. -1. A default constraints object for service accounts. This object can be identified as serving -a group identified by `system:service-accounts`, which can be imposed by the service account authenticator / token generator. -1. Cluster admin constraints identified by `system:cluster-admins` group - a set of constraints with elevated privileges that can be used -by an administrative user or group. -1. Infrastructure components constraints which can be identified either by a specific service -account or by a group containing all service accounts. - -```go -// PodSecurityPolicy governs the ability to make requests that affect the SecurityContext -// that will be applied to a pod and container. -type PodSecurityPolicy struct { - unversioned.TypeMeta `json:",inline"` - api.ObjectMeta `json:"metadata,omitempty"` - - // Spec defines the policy enforced. - Spec PodSecurityPolicySpec `json:"spec,omitempty"` -} - -// PodSecurityPolicySpec defines the policy enforced. -type PodSecurityPolicySpec struct { - // Privileged determines if a pod can request to be run as privileged. - Privileged bool `json:"privileged,omitempty"` - // Capabilities is a list of capabilities that can be added. - Capabilities []api.Capability `json:"capabilities,omitempty"` - // Volumes allows and disallows the use of different types of volume plugins. - Volumes VolumeSecurityPolicy `json:"volumes,omitempty"` - // HostNetwork determines if the policy allows the use of HostNetwork in the pod spec. - HostNetwork bool `json:"hostNetwork,omitempty"` - // HostPorts determines which host port ranges are allowed to be exposed. - HostPorts []HostPortRange `json:"hostPorts,omitempty"` - // HostPID determines if the policy allows the use of HostPID in the pod spec. - HostPID bool `json:"hostPID,omitempty"` - // HostIPC determines if the policy allows the use of HostIPC in the pod spec. - HostIPC bool `json:"hostIPC,omitempty"` - // SELinuxContext is the strategy that will dictate the allowable labels that may be set. - SELinuxContext SELinuxContextStrategyOptions `json:"seLinuxContext,omitempty"` - // RunAsUser is the strategy that will dictate the allowable RunAsUser values that may be set. - RunAsUser RunAsUserStrategyOptions `json:"runAsUser,omitempty"` - - // The users who have permissions to use this policy - Users []string `json:"users,omitempty"` - // The groups that have permission to use this policy - Groups []string `json:"groups,omitempty"` -} - -// HostPortRange defines a range of host ports that will be enabled by a policy -// for pods to use. 
It requires both the start and end to be defined. -type HostPortRange struct { - // Start is the beginning of the port range which will be allowed. - Start int `json:"start"` - // End is the end of the port range which will be allowed. - End int `json:"end"` -} - -// VolumeSecurityPolicy allows and disallows the use of different types of volume plugins. -type VolumeSecurityPolicy struct { - // HostPath allows or disallows the use of the HostPath volume plugin. - // More info: http://kubernetes.io/docs/user-guide/volumes#hostpath - HostPath bool `json:"hostPath,omitempty"` - // EmptyDir allows or disallows the use of the EmptyDir volume plugin. - // More info: http://kubernetes.io/docs/user-guide/volumes#emptydir - EmptyDir bool `json:"emptyDir,omitempty"` - // GCEPersistentDisk allows or disallows the use of the GCEPersistentDisk volume plugin. - // More info: http://kubernetes.io/docs/user-guide/volumes#gcepersistentdisk - GCEPersistentDisk bool `json:"gcePersistentDisk,omitempty"` - // AWSElasticBlockStore allows or disallows the use of the AWSElasticBlockStore volume plugin. - // More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore - AWSElasticBlockStore bool `json:"awsElasticBlockStore,omitempty"` - // GitRepo allows or disallows the use of the GitRepo volume plugin. - GitRepo bool `json:"gitRepo,omitempty"` - // Secret allows or disallows the use of the Secret volume plugin. - // More info: http://kubernetes.io/docs/user-guide/volumes#secrets - Secret bool `json:"secret,omitempty"` - // NFS allows or disallows the use of the NFS volume plugin. - // More info: http://kubernetes.io/docs/user-guide/volumes#nfs - NFS bool `json:"nfs,omitempty"` - // ISCSI allows or disallows the use of the ISCSI volume plugin. - // More info: http://releases.k8s.io/HEAD/examples/volumes/iscsi/README.md - ISCSI bool `json:"iscsi,omitempty"` - // Glusterfs allows or disallows the use of the Glusterfs volume plugin. - // More info: http://releases.k8s.io/HEAD/examples/volumes/glusterfs/README.md - Glusterfs bool `json:"glusterfs,omitempty"` - // PersistentVolumeClaim allows or disallows the use of the PersistentVolumeClaim volume plugin. - // More info: http://kubernetes.io/docs/user-guide/persistent-volumes#persistentvolumeclaims - PersistentVolumeClaim bool `json:"persistentVolumeClaim,omitempty"` - // RBD allows or disallows the use of the RBD volume plugin. - // More info: http://releases.k8s.io/HEAD/examples/volumes/rbd/README.md - RBD bool `json:"rbd,omitempty"` - // Cinder allows or disallows the use of the Cinder volume plugin. - // More info: http://releases.k8s.io/HEAD/examples/mysql-cinder-pd/README.md - Cinder bool `json:"cinder,omitempty"` - // CephFS allows or disallows the use of the CephFS volume plugin. - CephFS bool `json:"cephfs,omitempty"` - // DownwardAPI allows or disallows the use of the DownwardAPI volume plugin. - DownwardAPI bool `json:"downwardAPI,omitempty"` - // FC allows or disallows the use of the FC volume plugin. - FC bool `json:"fc,omitempty"` -} - -// SELinuxContextStrategyOptions defines the strategy type and any options used to create the strategy. -type SELinuxContextStrategyOptions struct { - // Type is the strategy that will dictate the allowable labels that may be set. 
- Type SELinuxContextStrategy `json:"type"` - // seLinuxOptions required to run as; required for MustRunAs - // More info: http://releases.k8s.io/HEAD/docs/design/security_context.md#security-context - SELinuxOptions *api.SELinuxOptions `json:"seLinuxOptions,omitempty"` -} - -// SELinuxContextStrategyType denotes strategy types for generating SELinux options for a -// SecurityContext. -type SELinuxContextStrategy string - -const ( - // container must have SELinux labels of X applied. - SELinuxStrategyMustRunAs SELinuxContextStrategy = "MustRunAs" - // container may make requests for any SELinux context labels. - SELinuxStrategyRunAsAny SELinuxContextStrategy = "RunAsAny" -) - -// RunAsUserStrategyOptions defines the strategy type and any options used to create the strategy. -type RunAsUserStrategyOptions struct { - // Type is the strategy that will dictate the allowable RunAsUser values that may be set. - Type RunAsUserStrategy `json:"type"` - // UID is the user id that containers must run as. Required for the MustRunAs strategy if not using - // a strategy that supports pre-allocated uids. - UID *int64 `json:"uid,omitempty"` - // UIDRangeMin defines the min value for a strategy that allocates by a range based strategy. - UIDRangeMin *int64 `json:"uidRangeMin,omitempty"` - // UIDRangeMax defines the max value for a strategy that allocates by a range based strategy. - UIDRangeMax *int64 `json:"uidRangeMax,omitempty"` -} - -// RunAsUserStrategyType denotes strategy types for generating RunAsUser values for a -// SecurityContext. -type RunAsUserStrategy string - -const ( - // container must run as a particular uid. - RunAsUserStrategyMustRunAs RunAsUserStrategy = "MustRunAs" - // container must run as a particular uid. - RunAsUserStrategyMustRunAsRange RunAsUserStrategy = "MustRunAsRange" - // container must run as a non-root uid - RunAsUserStrategyMustRunAsNonRoot RunAsUserStrategy = "MustRunAsNonRoot" - // container may make requests for any uid. - RunAsUserStrategyRunAsAny RunAsUserStrategy = "RunAsAny" -) -``` - -### PodSecurityPolicy Lifecycle - -As reusable objects in the root scope, PodSecurityPolicy follows the lifecycle of the -cluster itself. Maintenance of constraints such as adding, assigning, or changing them is the -responsibility of the cluster administrator. - -Creating a new user within a namespace should not require the cluster administrator to -define the user's PodSecurityPolicy. They should receive the default set of policies -that the administrator has defined for the groups they are assigned. - - -## Default PodSecurityPolicy And Overrides - -In order to establish policy for service accounts and users, there must be a way -to identify the default set of constraints that is to be used. This is best accomplished by using -groups. As mentioned above, groups may be used by the authentication/authorization layer to ensure -that every user maps to at least one group (with a default example of `system:authenticated`) and it -is up to the cluster administrator to ensure that a `PodSecurityPolicy` object exists that -references the group. - -If an administrator would like to provide a user with a changed set of security context permissions, -they may do the following: - -1. Create a new `PodSecurityPolicy` object and add a reference to the user or a group -that the user belongs to. -1. Add the user (or group) to an existing `PodSecurityPolicy` object with the proper -elevated privileges. 
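-
-For concreteness, the following is a hedged, illustrative example (the object name and the
-exact settings are assumptions, not prescribed defaults) of a policy built from the types
-defined above that grants elevated privileges only to the `system:cluster-admins` group:
-
-```go
-// Illustrative only: a permissive policy usable solely by cluster administrators.
-var clusterAdminPolicy = PodSecurityPolicy{
-	ObjectMeta: api.ObjectMeta{Name: "cluster-admins"},
-	Spec: PodSecurityPolicySpec{
-		Privileged:  true,
-		HostNetwork: true,
-		HostPID:     true,
-		HostIPC:     true,
-		HostPorts:   []HostPortRange{{Start: 1, End: 65535}},
-		// Allow any SELinux context and any UID to be requested.
-		SELinuxContext: SELinuxContextStrategyOptions{Type: SELinuxStrategyRunAsAny},
-		RunAsUser:      RunAsUserStrategyOptions{Type: RunAsUserStrategyRunAsAny},
-		// Only members of this group may use the policy.
-		Groups: []string{"system:cluster-admins"},
-	},
-}
-```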
- -## Admission - -Admission control using an authorizer provides the ability to control the creation of resources -based on capabilities granted to a user. In terms of the `PodSecurityPolicy`, it means -that an admission controller may inspect the user info made available in the context to retrieve -an appropriate set of policies for validation. - -The appropriate set of PodSecurityPolicies is defined as all of the policies -available that have reference to the user or groups that the user belongs to. - -Admission will use the PodSecurityPolicy to ensure that any requests for a -specific security context setting are valid and to generate settings using the following approach: - -1. Determine all the available `PodSecurityPolicy` objects that are allowed to be used -1. Sort the `PodSecurityPolicy` objects in a most restrictive to least restrictive order. -1. For each `PodSecurityPolicy`, generate a `SecurityContext` for each container. The generation phase will not override -any user requested settings in the `SecurityContext`, and will rely on the validation phase to ensure that -the user requests are valid. -1. Validate the generated `SecurityContext` to ensure it falls within the boundaries of the `PodSecurityPolicy` -1. If all containers validate under a single `PodSecurityPolicy` then the pod will be admitted -1. If all containers DO NOT validate under the `PodSecurityPolicy` then try the next `PodSecurityPolicy` -1. If no `PodSecurityPolicy` validates for the pod then the pod will not be admitted - - -## Creation of a SecurityContext Based on PodSecurityPolicy - -The creation of a `SecurityContext` based on a `PodSecurityPolicy` is based upon the configured -settings of the `PodSecurityPolicy`. - -There are three scenarios under which a `PodSecurityPolicy` field may fall: - -1. Governed by a boolean: fields of this type will be defaulted to the most restrictive value. -For instance, `AllowPrivileged` will always be set to false if unspecified. - -1. Governed by an allowable set: fields of this type will be checked against the set to ensure -their value is allowed. For example, `AllowCapabilities` will ensure that only capabilities -that are allowed to be requested are considered valid. `HostNetworkSources` will ensure that -only pods created from source X are allowed to request access to the host network. -1. Governed by a strategy: Items that have a strategy to generate a value will provide a -mechanism to generate the value as well as a mechanism to ensure that a specified value falls into -the set of allowable values. See the Types section for the description of the interfaces that -strategies must implement. - -Strategies have the ability to become dynamic. In order to support a dynamic strategy it should be -possible to make a strategy that has the ability to either be pre-populated with dynamic data by -another component (such as an admission controller) or has the ability to retrieve the information -itself based on the data in the pod. An example of this would be a pre-allocated UID for the namespace. -A dynamic `RunAsUser` strategy could inspect the namespace of the pod in order to find the required pre-allocated -UID and generate or validate requests based on that information. - - -```go -// SELinuxStrategy defines the interface for all SELinux constraint strategies. -type SELinuxStrategy interface { - // Generate creates the SELinuxOptions based on constraint rules. 
- Generate(pod *api.Pod, container *api.Container) (*api.SELinuxOptions, error) - // Validate ensures that the specified values fall within the range of the strategy. - Validate(pod *api.Pod, container *api.Container) fielderrors.ValidationErrorList -} - -// RunAsUserStrategy defines the interface for all uid constraint strategies. -type RunAsUserStrategy interface { - // Generate creates the uid based on policy rules. - Generate(pod *api.Pod, container *api.Container) (*int64, error) - // Validate ensures that the specified values fall within the range of the strategy. - Validate(pod *api.Pod, container *api.Container) fielderrors.ValidationErrorList -} -``` - -## Escalating Privileges by an Administrator - -An administrator may wish to create a resource in a namespace that runs with -escalated privileges. By allowing security context -constraints to operate on both the requesting user and the pod's service account, administrators are able to -create pods in namespaces with elevated privileges based on the administrator's security context -constraints. - -This also allows the system to guard commands being executed in the non-conforming container. For -instance, an `exec` command can first check the security context of the pod against the security -context constraints of the user or the user's ability to reference a service account. -If it does not validate then it can block users from executing the command. Since the validation -will be user aware, administrators would still be able to run the commands that are restricted to normal users. - -## Interaction with the Kubelet - -In certain cases, the Kubelet may need provide information about -the image in order to validate the security context. An example of this is a cluster -that is configured to run with a UID strategy of `MustRunAsNonRoot`. - -In this case the admission controller can set the existing `MustRunAsNonRoot` flag on the `SecurityContext` -based on the UID strategy of the `SecurityPolicy`. It should still validate any requests on the pod -for a specific UID and fail early if possible. However, if the `RunAsUser` is not set on the pod -it should still admit the pod and allow the Kubelet to ensure that the image does not run as -`root` with the existing non-root checks. - - - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/security-context-constraints.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/security-context-constraints.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/security-context-constraints.md) diff --git a/docs/proposals/self-hosted-kubelet.md b/docs/proposals/self-hosted-kubelet.md index d2318bea0b6..81f81bb2f47 100644 --- a/docs/proposals/self-hosted-kubelet.md +++ b/docs/proposals/self-hosted-kubelet.md @@ -1,135 +1 @@ -# Proposal: Self-hosted kubelet - -## Abstract - -In a self-hosted Kubernetes deployment (see [this -comment](https://github.com/kubernetes/kubernetes/issues/246#issuecomment-64533959) -for background on self hosted kubernetes), we have the initial bootstrap problem. -When running self-hosted components, there needs to be a mechanism for pivoting -from the initial bootstrap state to the kubernetes-managed (self-hosted) state. -In the case of a self-hosted kubelet, this means pivoting from the initial -kubelet defined and run on the host, to the kubelet pod which has been scheduled -to the node. 
- -This proposal presents a solution to the kubelet bootstrap, and assumes a -functioning control plane (e.g. an apiserver, controller-manager, scheduler, and -etcd cluster), and a kubelet that can securely contact the API server. This -functioning control plane can be temporary, and not necessarily the "production" -control plane that will be used after the initial pivot / bootstrap. - -## Background and Motivation - -In order to understand the goals of this proposal, one must understand what -"self-hosted" means. This proposal defines "self-hosted" as a kubernetes cluster -that is installed and managed by the kubernetes installation itself. This means -that each kubernetes component is described by a kubernetes manifest (Daemonset, -Deployment, etc) and can be updated via kubernetes. - -The overall goal of this proposal is to make kubernetes easier to install and -upgrade. We can then treat kubernetes itself just like any other application -hosted in a kubernetes cluster, and have access to easy upgrades, monitoring, -and durability for core kubernetes components themselves. - -We intend to achieve this by using kubernetes to manage itself. However, in -order to do that we must first "bootstrap" the cluster, by using kubernetes to -install kubernetes components. This is where this proposal fits in, by -describing the necessary modifications, and required procedures, needed to run a -self-hosted kubelet. - -The approach being proposed for a self-hosted kubelet is a "pivot" style -installation. This procedure assumes a short-lived “bootstrap” kubelet will run -and start a long-running “self-hosted” kubelet. Once the self-hosted kubelet is -running the bootstrap kubelet will exit. As part of this, we propose introducing -a new `--bootstrap` flag to the kubelet. The behaviour of that flag will be -explained in detail below. - -## Proposal - -We propose adding a new flag to the kubelet, the `--bootstrap` flag, which is -assumed to be used in conjunction with the `--lock-file` flag. The `--lock-file` -flag is used to ensure only a single kubelet is running at any given time during -this pivot process. When the `--bootstrap` flag is provided, after the kubelet -acquires the file lock, it will begin asynchronously waiting on -[inotify](http://man7.org/linux/man-pages/man7/inotify.7.html) events. Once an -"open" event is received, the kubelet will assume another kubelet is attempting -to take control and will exit by calling `exit(0)`. - -Thus, the initial bootstrap becomes: - -1. "bootstrap" kubelet is started by $init system. -1. "bootstrap" kubelet pulls down "self-hosted" kubelet as a pod from a - daemonset -1. "self-hosted" kubelet attempts to acquire the file lock, causing "bootstrap" - kubelet to exit -1. "self-hosted" kubelet acquires lock and takes over -1. "bootstrap" kubelet is restarted by $init system and blocks on acquiring the - file lock - -During an upgrade of the kubelet, for simplicity we will consider 3 kubelets, -namely "bootstrap", "v1", and "v2". We imagine the following scenario for -upgrades: - -1. Cluster administrator introduces "v2" kubelet daemonset -1. "v1" kubelet pulls down and starts "v2" -1. Cluster administrator removes "v1" kubelet daemonset -1. "v1" kubelet is killed -1. Both "bootstrap" and "v2" kubelets race for file lock -1. If "v2" kubelet acquires lock, process has completed -1. If "bootstrap" kubelet acquires lock, it is assumed that "v2" kubelet will - fail a health check and be killed. 
Once restarted, it will try to acquire the - lock, triggering the "bootstrap" kubelet to exit. - -Alternatively, it would also be possible via this mechanism to delete the "v1" -daemonset first, allow the "bootstrap" kubelet to take over, and then introduce -the "v2" kubelet daemonset, effectively eliminating the race between "bootstrap" -and "v2" for lock acquisition, and the reliance on the failing health check -procedure. - -Eventually this could be handled by a DaemonSet upgrade policy. - -This will allow a "self-hosted" kubelet with minimal new concepts introduced -into the core Kubernetes code base, and remains flexible enough to work well -with future [bootstrapping -services](https://github.com/kubernetes/kubernetes/issues/5754). - -## Production readiness considerations / Out of scope issues - -* Deterministically pulling and running kubelet pod: we would prefer not to have - to loop until we finally get a kubelet pod. -* It is possible that the bootstrap kubelet version is incompatible with the - newer versions that were run in the node. For example, the cgroup - configurations might be incompatible. In the beginning, we will require - cluster admins to keep the configuration in sync. Since we want the bootstrap - kubelet to come up and run even if the API server is not available, we should - persist the configuration for bootstrap kubelet on the node. Once we have - checkpointing in kubelet, we will checkpoint the updated config and have the - bootstrap kubelet use the updated config, if it were to take over. -* Currently best practice when upgrading the kubelet on a node is to drain all - pods first. Automatically draining of the node during kubelet upgrade is out - of scope for this proposal. It is assumed that either the cluster - administrator or the daemonset upgrade policy will handle this. - -## Other discussion - -Various similar approaches have been discussed -[here](https://github.com/kubernetes/kubernetes/issues/246#issuecomment-64533959) -and -[here](https://github.com/kubernetes/kubernetes/issues/23073#issuecomment-198478997). -Other discussion around the kubelet being able to be run inside a container is -[here](https://github.com/kubernetes/kubernetes/issues/4869). Note this isn't a -strict requirement as the kubelet could be run in a chroot jail via rkt fly or -other such similar approach. - -Additionally, [Taints and -Tolerations](../../docs/design/taint-toleration-dedicated.md), whose design has -already been accepted, would make the overall kubelet bootstrap more -deterministic. With this, we would also need the ability for a kubelet to -register itself with a given taint when it first contacts the API server. Given -that, a kubelet could register itself with a given taint such as -“component=kubelet”, and a kubelet pod could exist that has a toleration to that -taint, ensuring it is the only pod the “bootstrap” kubelet runs. 
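-
-To illustrate the `--bootstrap` behaviour described in the Proposal section above, here is a
-minimal, hedged sketch (an assumption of how it could be wired up, not the actual kubelet
-implementation; it uses raw inotify via `golang.org/x/sys/unix` because the kubelet needs the
-"open" event, and the package and function names are invented):
-
-```go
-package bootstrap
-
-import (
-	"os"
-	"syscall"
-
-	"golang.org/x/sys/unix"
-)
-
-// runWithLock blocks until the lock file can be acquired, then (after the real
-// kubelet work would be started) waits for another process to open the lock
-// file and exits, letting the self-hosted kubelet take over.
-func runWithLock(lockPath string) error {
-	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0600)
-	if err != nil {
-		return err
-	}
-	// The "bootstrap" kubelet blocks here while a self-hosted kubelet holds the lock.
-	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
-		return err
-	}
-
-	// ... start the kubelet's normal work here ...
-
-	// Watch the lock file for "open" events from a competing kubelet.
-	ifd, err := unix.InotifyInit()
-	if err != nil {
-		return err
-	}
-	if _, err := unix.InotifyAddWatch(ifd, lockPath, unix.IN_OPEN); err != nil {
-		return err
-	}
-	buf := make([]byte, unix.SizeofInotifyEvent+unix.NAME_MAX+1)
-	if _, err := unix.Read(ifd, buf); err != nil { // blocks until an event arrives
-		return err
-	}
-	// Another kubelet is attempting to take control; step aside.
-	os.Exit(0)
-	return nil
-}
-```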
- - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/self-hosted-kubelet.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/self-hosted-kubelet.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/self-hosted-kubelet.md) diff --git a/docs/proposals/selinux-enhancements.md b/docs/proposals/selinux-enhancements.md index 3b3e168a380..e200945e5c0 100644 --- a/docs/proposals/selinux-enhancements.md +++ b/docs/proposals/selinux-enhancements.md @@ -1,209 +1 @@ -## Abstract - -Presents a proposal for enhancing the security of Kubernetes clusters using -SELinux and simplifying the implementation of SELinux support within the -Kubelet by removing the need to label the Kubelet directory with an SELinux -context usable from a container. - -## Motivation - -The current Kubernetes codebase relies upon the Kubelet directory being -labeled with an SELinux context usable from a container. This means that a -container escaping namespace isolation will be able to use any file within the -Kubelet directory without defeating kernel -[MAC (mandatory access control)](https://en.wikipedia.org/wiki/Mandatory_access_control). -In order to limit the attack surface, we should enhance the Kubelet to relabel -any bind-mounts into containers into a usable SELinux context without depending -on the Kubelet directory's SELinux context. - -## Constraints and Assumptions - -1. No API changes allowed -2. Behavior must be fully backward compatible -3. No new admission controllers - make incremental improvements without huge - refactorings - -## Use Cases - -1. As a cluster operator, I want to avoid having to label the Kubelet - directory with a label usable from a container, so that I can limit the - attack surface available to a container escaping its namespace isolation -2. As a user, I want to run a pod without an SELinux context explicitly - specified and be isolated using MCS (multi-category security) on systems - where SELinux is enabled, so that the pods on each host are isolated from - one another -3. As a user, I want to run a pod that uses the host IPC or PID namespace and - want the system to do the right thing with regard to SELinux, so that no - unnecessary relabel actions are performed - -### Labeling the Kubelet directory - -As previously stated, the current codebase relies on the Kubelet directory -being labeled with an SELinux context usable from a container. The Kubelet -uses the SELinux context of this directory to determine what SELinux context -`tmpfs` mounts (provided by the EmptyDir memory-medium option) should receive. -The problem with this is that it opens an attack surface to a container that -escapes its namespace isolation; such a container would be able to use any -file in the Kubelet directory without defeating kernel MAC. - -### SELinux when no context is specified - -When no SELinux context is specified, Kubernetes should just do the right -thing, where doing the right thing is defined as isolating pods with a node- -unique set of categories. Node-uniqueness means unique among the pods -scheduled onto the node. Long-term, we want to have a cluster-wide allocator -for MCS labels. Node-unique MCS labels are a good middle ground that is -possible without a new, large, feature. - -### SELinux and host IPC and PID namespaces - -Containers in pods that use the host IPC or PID namespaces need access to -other processes and IPC mechanisms on the host. 
Therefore, these containers -should be run with the `spc_t` SELinux type by the container runtime. The -`spc_t` type is an unconfined type that other SELinux domains are allowed to -connect to. In the case where a pod uses one of these host namespaces, it -should be unnecessary to relabel the pod's volumes. - -## Analysis - -### Libcontainer SELinux library - -Docker and rkt both use the libcontainer SELinux library. This library -provides a method, `GetLxcContexts`, that returns the a unique SELinux -contexts for container processes and files used by them. `GetLxcContexts` -reads the base SELinux context information from a file at `/etc/selinux//contexts/lxc_contexts` and then adds a process-unique MCS label. - -Docker and rkt both leverage this call to determine the 'starting' SELinux -contexts for containers. - -### Docker - -Docker's behavior when no SELinux context is defined for a container is to -give the container a node-unique MCS label. - -#### Sharing IPC namespaces - -On the Docker runtime, the containers in a Kubernetes pod share the IPC and -PID namespaces of the pod's infra container. - -Docker's behavior for containers sharing these namespaces is as follows: if a -container B shares the IPC namespace of another container A, container B is -given the SELinux context of container A. Therefore, for Kubernetes pods -running on docker, in a vacuum the containers in a pod should have the same -SELinux context. - -[**Known issue**](https://bugzilla.redhat.com/show_bug.cgi?id=1377869): When -the seccomp profile is set on a docker container that shares the IPC namespace -of another container, that container will not receive the other container's -SELinux context. - -#### Host IPC and PID namespaces - -In the case of a pod that shares the host IPC or PID namespace, this flag is -simply ignored and the container receives the `spc_t` SELinux type. The -`spc_t` type is unconfined, and so no relabeling needs to be done for volumes -for these pods. Currently, however, there is code which relabels volumes into -explicitly specified SELinux contexts for these pods. This code is unnecessary -and should be removed. - -#### Relabeling bind-mounts - -Docker is capable of relabeling bind-mounts into containers using the `:Z` -bind-mount flag. However, in the current implementation of the docker runtime -in Kubernetes, the `:Z` option is only applied when the pod's SecurityContext -contains an SELinux context. We could easily implement the correct behaviors -by always setting `:Z` on systems where SELinux is enabled. - -### rkt - -rkt's behavior when no SELinux context is defined for a pod is similar to -Docker's -- an SELinux context with a node-unique MCS label is given to the -containers of a pod. - -#### Sharing IPC namespaces - -Containers (apps, in rkt terminology) in rkt pods share an IPC and PID -namespace by default. - -#### Relabeling bind-mounts - -Bind-mounts into rkt pods are automatically relabeled into the pod's SELinux -context. - -#### Host IPC and PID namespaces - -Using the host IPC and PID namespaces is not currently supported by rkt. - -## Proposed Changes - -### Refactor `pkg/util/selinux` - -1. The `selinux` package should provide a method `SELinuxEnabled` that returns - whether SELinux is enabled, and is built for all platforms (the - libcontainer SELinux is only built on linux) -2. 
The `SelinuxContextRunner` interface should be renamed to `SELinuxRunner` - and be changed to have the same method names and signatures as the - libcontainer methods its implementations wrap -3. The `SELinuxRunner` interface only needs `Getfilecon`, which is used by - the rkt code - -```go -package selinux - -// Note: the libcontainer SELinux package is only built for Linux, so it is -// necessary to have a NOP wrapper which is built for non-Linux platforms to -// allow code that links to this package not to differentiate its own methods -// for Linux and non-Linux platforms. -// -// SELinuxRunner wraps certain libcontainer SELinux calls. For more -// information, see: -// -// https://github.com/opencontainers/runc/blob/master/libcontainer/selinux/selinux.go -type SELinuxRunner interface { - // Getfilecon returns the SELinux context for the given path or returns an - // error. - Getfilecon(path string) (string, error) -} -``` - -### Kubelet Changes - -1. The `relabelVolumes` method in `kubelet_volumes.go` is not needed and can - be removed -2. The `GenerateRunContainerOptions` method in `kubelet_pods.go` should no - longer call `relabelVolumes` -3. The `makeHostsMount` method in `kubelet_pods.go` should set the - `SELinuxRelabel` attribute of the mount for the pod's hosts file to `true` - -### Changes to `pkg/kubelet/dockertools/` - -1. The `makeMountBindings` should be changed to: - 1. No longer accept the `podHasSELinuxLabel` parameter - 2. Always use the `:Z` bind-mount flag when SELinux is enabled and the mount - has the `SELinuxRelabel` attribute set to `true` -2. The `runContainer` method should be changed to always use the `:Z` - bind-mount flag on the termination message mount when SELinux is enabled - -### Changes to `pkg/kubelet/rkt` - -The should not be any required changes for the rkt runtime; we should test to -ensure things work as expected under rkt. - -### Changes to volume plugins and infrastructure - -1. The `VolumeHost` interface contains a method called `GetRootContext`; this - is an artifact of the old assumptions about the Kubelet directory's SELinux - context and can be removed -2. The `empty_dir.go` file should be changed to be completely agnostic of - SELinux; no behavior in this plugin needs to be differentiated when SELinux - is enabled - -### Changes to `pkg/controller/...` - -The `VolumeHost` abstraction is used in a couple of PV controllers as NOP -implementations. These should be altered to no longer include `GetRootContext`. - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/selinux-enhancements.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/selinux-enhancements.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/selinux-enhancements.md) diff --git a/docs/proposals/service-discovery.md b/docs/proposals/service-discovery.md index 28d1f8d4ad3..ead4f178a28 100644 --- a/docs/proposals/service-discovery.md +++ b/docs/proposals/service-discovery.md @@ -1,69 +1 @@ -# Service Discovery Proposal - -## Goal of this document - -To consume a service, a developer needs to know the full URL and a description of the API. Kubernetes contains the host and port information of a service, but it lacks the scheme and the path information needed if the service is not bound at the root. In this document we propose some standard kubernetes service annotations to fix these gaps. 
-It is important that these annotations be standardized to allow for consistent service discovery across Kubernetes implementations. Note that the examples largely speak to consuming WebServices, but the same concepts apply to other types of services.
-
-## Endpoint URL, Service Type
-
-A URL can accurately describe the location of a Service. A generic URL is of the following form:
-
-    scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]
-
-However, for the purpose of service discovery we can simplify this to the following form:
-
-    scheme:[//host[:port]][/]path
-
-If a user and/or password is required, then this information can be passed using Kubernetes Secrets. Kubernetes contains the host and port of each service, but it lacks the scheme and path.
-
-`Service Path` - Every Service has one or more endpoints. As a rule the endpoint should be located at the root "/" of the location URL, i.e. `http://172.100.1.52/`. There are cases where this is not possible and the actual service endpoint could be located at `http://172.100.1.52/cxfcdi`. The Kubernetes metadata for a service does not capture the path part, making it hard to consume such a service.
-
-`Service Scheme` - Services can be deployed using different schemes. Some popular schemes include `http`, `https`, `file`, `ftp` and `jdbc`.
-
-`Service Protocol` - Services use different protocols that clients need to speak in order to communicate with the service; some examples of service-level protocols are SOAP and REST (yes, technically REST isn't a protocol but an architectural style). For service consumers it can be hard to tell what protocol is expected.
-
-## Service Description
-
-The API of a service is the point of interaction with a service consumer. The description of the API is an essential piece of information when creating a service consumer. It has become common to publish a service definition document at a known location on the service itself. This 'well known' place is not very standard, so it is proposed that the service developer provide the service description path and the type of Definition Language (DL) used.
-
-`Service Description Path` - To facilitate consumption of the service by a client, the location of this document is very helpful to the service consumer. In some cases the client-side code can be generated from such a document. It is assumed that the service description document is published somewhere on the service endpoint itself.
-
-`Service Description Language` - A number of Definition Languages (DL) have been developed to describe services. Some examples are `WSDL`, `WADL` and `Swagger`. In order to consume a description document it is good to know the type of DL used.
-
-## Standard Service Annotations
-
-Kubernetes allows the creation of Service Annotations. Here we propose the use of the following standard annotations:
-
-* `api.service.kubernetes.io/path` - the path part of the service endpoint url. An example value could be `cxfcdi`.
-* `api.service.kubernetes.io/scheme` - the scheme part of the service endpoint url. Some values could be `http` or `https`.
-* `api.service.kubernetes.io/protocol` - the protocol of the service. Known values are `SOAP`, `XML-RPC` and `REST`.
-* `api.service.kubernetes.io/description-path` - the path part of the service description document's endpoint. It is a pretty safe assumption that the service self-documents.
An example value for a swagger 2.0 document can be `cxfcdi/swagger.json`, -* `api.kubernetes.io/description-language` - the type of Description Language used. Known values are `WSDL`, `WADL`, `SwaggerJSON`, `SwaggerYAML`. - -The fragment below is taken from the service section of the kubernetes.json were these annotations are used - - ... - "objects" : [ { - "apiVersion" : "v1", - "kind" : "Service", - "metadata" : { - "annotations" : { - "api.service.kubernetes.io/protocol" : "REST", - "api.service.kubernetes.io/scheme" "http", - "api.service.kubernetes.io/path" : "cxfcdi", - "api.service.kubernetes.io/description-path" : "cxfcdi/swagger.json", - "api.service.kubernetes.io/description-language" : "SwaggerJSON" - }, - ... - -## Conclusion - -Five service annotations are proposed as a standard way to describe a service endpoint. These five annotation are promoted as a Kubernetes standard, so that services can be discovered and a service catalog can be build to facilitate service consumers. - - - - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/service-discovery.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/service-discovery.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/service-discovery.md) diff --git a/docs/proposals/service-external-name.md b/docs/proposals/service-external-name.md index 798da87fdf6..d7937cfa232 100644 --- a/docs/proposals/service-external-name.md +++ b/docs/proposals/service-external-name.md @@ -1,161 +1 @@ -# Service externalName - -Author: Tim Hockin (@thockin), Rodrigo Campos (@rata), Rudi C (@therc) - -Date: August 2016 - -Status: Implementation in progress - -# Goal - -Allow a service to have a CNAME record in the cluster internal DNS service. For -example, the lookup for a `db` service could return a CNAME that points to the -RDS resource `something.rds.aws.amazon.com`. No proxying is involved. - -# Motivation - -There were many related issues, but we'll try to summarize them here. More info -is on GitHub issues/PRs: #13748, #11838, #13358, #23921 - -One motivation is to present as native cluster services, services that are -hosted externally. Some cloud providers, like AWS, hand out hostnames (IPs are -not static) and the user wants to refer to these services using regular -Kubernetes tools. This was requested in bugs, at least for AWS, for RedShift, -RDS, Elasticsearch Service, ELB, etc. - -Other users just want to use an external service, for example `oracle`, with dns -name `oracle-1.testdev.mycompany.com`, without having to keep DNS in sync, and -are fine with a CNAME. - -Another use case is to "integrate" some services for local development. For -example, consider a search service running in Kubernetes in staging, let's say -`search-1.stating.mycompany.com`. It's running on AWS, so it resides behind an -ELB (which has no static IP, just a hostname). A developer is building an app -that consumes `search-1`, but doesn't want to run it on their machine (before -Kubernetes, they didn't, either). They can just create a service that has a -CNAME to the `search-1` endpoint in staging and be happy as before. - -Also, Openshift needs this for "service refs". Service ref is really just the -three use cases mentioned above, but in the future a way to automatically inject -"service ref"s into namespaces via "service catalog"[1] might be considered. 
And -service ref is the natural way to integrate an external service, since it takes -advantage of native DNS capabilities already in wide use. - -[1]: https://github.com/kubernetes/kubernetes/pull/17543 - -# Alternatives considered - -In the issues linked above, some alternatives were also considered. A partial -summary of them follows. - -One option is to add the hostname to endpoints, as proposed in -https://github.com/kubernetes/kubernetes/pull/11838. This is problematic, as -endpoints are used in many places and users assume the required fields (such as -IP address) are always present and valid (and check that, too). If the field is -not required anymore or if there is just a hostname instead of the IP, -applications could break. Even assuming those cases could be solved, the -hostname will have to be resolved, which presents further questions and issues: -the timeout to use, whether the lookup is synchronous or asynchronous, dealing -with DNS TTL and more. One imperfect approach was to only resolve the hostname -upon creation, but this was considered not a great idea. A better approach -would be at a higher level, maybe a service type. - -There are more ideas described in #13748, but all raised further issues, -ranging from using another upstream DNS server to creating a Name object -associated with DNSs. - -# Proposed solution - -The proposed solution works at the service layer, by adding a new `externalName` -type for services. This will create a CNAME record in the internal cluster DNS -service. No virtual IP or proxying is involved. - -Using a CNAME gets rid of unnecessary DNS lookups. There's no need for the -Kubernetes control plane to issue them, to pick a timeout for them and having to -refresh them when the TTL for a record expires. It's way simpler to implement, -while solving the right problem. And addressing it at the service layer avoids -all the complications mentioned above about doing it at the endpoints layer. - -The solution was outlined by Tim Hockin in -https://github.com/kubernetes/kubernetes/issues/13748#issuecomment-230397975 - -Currently a ServiceSpec looks like this, with comments edited for clarity: - -``` -type ServiceSpec struct { - Ports []ServicePort - - // If not specified, the associated Endpoints object is not automatically managed - Selector map[string]string - - // "", a real IP, or "None". If not specified, this is default allocated. If "None", this Service is not load-balanced - ClusterIP string - - // ClusterIP, NodePort, LoadBalancer. Only applies if clusterIP != "None" - Type ServiceType - - // Only applies if clusterIP != "None" - ExternalIPs []string - SessionAffinity ServiceAffinity - - // Only applies to type=LoadBalancer - LoadBalancerIP string - LoadBalancerSourceRanges []string -``` - -The proposal is to change it to: - -``` -type ServiceSpec struct { - Ports []ServicePort - - // If not specified, the associated Endpoints object is not automatically managed -+ // Only applies if type is ClusterIP, NodePort, or LoadBalancer. If type is ExternalName, this is ignored. - Selector map[string]string - - // "", a real IP, or "None". If not specified, this is default allocated. If "None", this Service is not load-balanced. -+ // Only applies if type is ClusterIP, NodePort, or LoadBalancer. If type is ExternalName, this is ignored. - ClusterIP string - -- // ClusterIP, NodePort, LoadBalancer. Only applies if clusterIP != "None" -+ // ExternalName, ClusterIP, NodePort, LoadBalancer. 
Only applies if clusterIP != "None" - Type ServiceType - -+ // Only applies if type is ExternalName -+ ExternalName string - - // Only applies if clusterIP != "None" - ExternalIPs []string - SessionAffinity ServiceAffinity - - // Only applies to type=LoadBalancer - LoadBalancerIP string - LoadBalancerSourceRanges []string -``` - -For example, it can be used like this: - -``` -apiVersion: v1 -kind: Service -metadata: - name: my-rds -spec: - ports: - - port: 12345 -type: ExternalName -externalName: myapp.rds.whatever.aws.says -``` - -There is one issue to take into account, that no other alternative considered -fixes, either: TLS. If the service is a CNAME for an endpoint that uses TLS, -connecting with the Kubernetes name `my-service.my-ns.svc.cluster.local` may -result in a failure during server certificate validation. This is acknowledged -and left for future consideration. For the time being, users and administrators -might need to ensure that the server certificates also mentions the Kubernetes -name as an alternate host name. - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/service-external-name.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/service-external-name.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/service-external-name.md) diff --git a/docs/proposals/stateful-apps.md b/docs/proposals/stateful-apps.md index c5196f2a6df..dc8f0b9deb4 100644 --- a/docs/proposals/stateful-apps.md +++ b/docs/proposals/stateful-apps.md @@ -1,363 +1 @@ -# StatefulSets: Running pods which need strong identity and storage - -## Motivation - -Many examples of clustered software systems require stronger guarantees per instance than are provided -by the Replication Controller (aka Replication Controllers). Instances of these systems typically require: - -1. Data per instance which should not be lost even if the pod is deleted, typically on a persistent volume - * Some cluster instances may have tens of TB of stored data - forcing new instances to replicate data - from other members over the network is onerous -2. A stable and unique identity associated with that instance of the storage - such as a unique member id -3. A consistent network identity that allows other members to locate the instance even if the pod is deleted -4. A predictable number of instances to ensure that systems can form a quorum - * This may be necessary during initialization -5. Ability to migrate from node to node with stable network identity (DNS name) -6. The ability to scale up in a controlled fashion, but are very rarely scaled down without human - intervention - -Kubernetes should expose a pod controller (a StatefulSet) that satisfies these requirements in a flexible -manner. It should be easy for users to manage and reason about the behavior of this set. An administrator -with familiarity in a particular cluster system should be able to leverage this controller and its -supporting documentation to run that clustered system on Kubernetes. It is expected that some adaptation -is required to support each new cluster. - -This resource is **stateful** because it offers an easy way to link a pod's network identity to its storage -identity and because it is intended to be used to run software that is the holders of state for other -components. 
That does not mean that all stateful applications *must* use StatefulSets, but the tradeoffs -in this resource are intended to facilitate holding state in the cluster. - - -## Use Cases - -The software listed below forms the primary use-cases for a StatefulSet on the cluster - problems encountered -while adapting these for Kubernetes should be addressed in a final design. - -* Quorum with Leader Election - * MongoDB - in replica set mode forms a quorum with an elected leader, but instances must be preconfigured - and have stable network identities. - * ZooKeeper - forms a quorum with an elected leader, but is sensitive to cluster membership changes and - replacement instances *must* present consistent identities - * etcd - forms a quorum with an elected leader, can alter cluster membership in a consistent way, and - requires stable network identities -* Decentralized Quorum - * Cassandra - allows flexible consistency and distributes data via innate hash ring sharding, is also - flexible to scaling, more likely to support members that come and go. Scale down may trigger massive - rebalances. -* Active-active - * Galera - has multiple active masters which must remain in sync -* Leader-followers - * Spark in standalone mode - A single unilateral leader and a set of workers - - -## Background - -Replica sets are designed with a weak guarantee - that there should be N replicas of a particular -pod template. Each pod instance varies only by name, and the replication controller errs on the side of -ensuring that N replicas exist as quickly as possible (by creating new pods as soon as old ones begin graceful -deletion, for instance, or by being able to pick arbitrary pods to scale down). In addition, pods by design -have no stable network identity other than their assigned pod IP, which can change over the lifetime of a pod -resource. ReplicaSets are best leveraged for stateless, shared-nothing, zero-coordination, -embarassingly-parallel, or fungible software. - -While it is possible to emulate the guarantees described above by leveraging multiple replication controllers -(for distinct pod templates and pod identities) and multiple services (for stable network identity), the -resulting objects are hard to maintain and must be copied manually in order to scale a cluster. - -By constrast, a DaemonSet *can* offer some of the guarantees above, by leveraging Nodes as stable, long-lived -entities. An administrator might choose a set of nodes, label them a particular way, and create a -DaemonSet that maps pods to each node. The storage of the node itself (which could be network attached -storage, or a local SAN) is the persistent storage. The network identity of the node is the stable -identity. However, while there are examples of clustered software that benefit from close association to -a node, this creates an undue burden on administrators to design their cluster to satisfy these -constraints, when a goal of Kubernetes is to decouple system administration from application management. - - -## Design Assumptions - -* **Specialized Controller** - Rather than increase the complexity of the ReplicaSet to satisfy two distinct - use cases, create a new resource that assists users in solving this particular problem. -* **Safety first** - Running a clustered system on Kubernetes should be no harder - than running a clustered system off Kube. Authors should be given tools to guard against common cluster - failure modes (split brain, phantom member) to prevent introducing more failure modes. 
Sophisticated - distributed systems designers can implement more sophisticated solutions than StatefulSet if necessary - - new users should not become vulnerable to additional failure modes through an overly flexible design. -* **Controlled scaling** - While flexible scaling is important for some clusters, other examples of clusters - do not change scale without significant external intervention. Human intervention may be required after - scaling. Changing scale during cluster operation can lead to split brain in quorum systems. It should be - possible to scale, but there may be responsibilities on the set author to correctly manage the scale. -* **No generic cluster lifecycle** - Rather than design a general purpose lifecycle for clustered software, - focus on ensuring the information necessary for the software to function is available. For example, - rather than providing a "post-creation" hook invoked when the cluster is complete, provide the necessary - information to the "first" (or last) pod to determine the identity of the remaining cluster members and - allow it to manage its own initialization. - - -## Proposed Design - -Add a new resource to Kubernetes to represent a set of pods that are individually distinct but each -individual can safely be replaced-- the name **StatefulSet** is chosen to convey that the individual members of -the set are themselves "stateful" and thus each one is preserved. Each member has an identity, and there will -always be a member that thinks it is the "first" one. - -The StatefulSet is responsible for creating and maintaining a set of **identities** and ensuring that there is -one pod and zero or more **supporting resources** for each identity. There should never be more than one pod -or unique supporting resource per identity at any one time. A new pod can be created for an identity only -if a previous pod has been fully terminated (reached its graceful termination limit or cleanly exited). - -A StatefulSet has 0..N **members**, each with a unique **identity** which is a name that is unique within the -set. - -``` -type StatefulSet struct { - ObjectMeta - - Spec StatefulSetSpec - ... -} - -type StatefulSetSpec struct { - // Replicas is the desired number of replicas of the given template. - // Each replica is assigned a unique name of the form `name-$replica` - // where replica is in the range `0 - (replicas-1)`. - Replicas int - - // A label selector that "owns" objects created under this set - Selector *LabelSelector - - // Template is the object describing the pod that will be created - each - // pod created by this set will match the template, but have a unique identity. - Template *PodTemplateSpec - - // VolumeClaimTemplates is a list of claims that members are allowed to reference. - // The StatefulSet controller is responsible for mapping network identities to - // claims in a way that maintains the identity of a member. Every claim in - // this list must have at least one matching (by name) volumeMount in one - // container in the template. A claim in this list takes precedence over - // any volumes in the template, with the same name. - VolumeClaimTemplates []PersistentVolumeClaim - - // ServiceName is the name of the service that governs this StatefulSet. - // This service must exist before the StatefulSet, and is responsible for - // the network identity of the set. 
Members get DNS/hostnames that follow the - // pattern: member-specific-string.serviceName.default.svc.cluster.local - // where "member-specific-string" is managed by the StatefulSet controller. - ServiceName string -} -``` - -Like a replication controller, a StatefulSet may be targeted by an autoscaler. The StatefulSet makes no assumptions -about upgrading or altering the pods in the set for now - instead, the user can trigger graceful deletion -and the StatefulSet will replace the terminated member with the newer template once it exits. Future proposals -may offer update capabilities. A StatefulSet requires RestartAlways pods. The addition of forgiveness may be -necessary in the future to increase the safety of the controller recreating pods. - - -### How identities are managed - -A key question is whether scaling down a StatefulSet and then scaling it back up should reuse identities. If not, -scaling down becomes a destructive action (an admin cannot recover by scaling back up). Given the safety -first assumption, identity reuse seems the correct default. This implies that identity assignment should -be deterministic and not subject to controller races (a controller that has crashed during scale up should -assign the same identities on restart, and two concurrent controllers should decide on the same outcome -identities). - -The simplest way to manage identities, and easiest to understand for users, is a numeric identity system -starting at I=0 that ranges up to the current replica count and is contiguous. - -Future work: - -* Cover identity reclamation - cleaning up resources for identities that are no longer in use. -* Allow more sophisticated identity assignment - instead of `{name}-{0 - replicas-1}`, allow subsets and - complex indexing. - -### Controller behavior. - -When a StatefulSet is scaled up, the controller must create both pods and supporting resources for -each new identity. The controller must create supporting resources for the pod before creating the -pod. If a supporting resource with the appropriate name already exists, the controller should treat that as -creation succeeding. If a supporting resource cannot be created, the controller should flag an error to -status, back-off (like a scheduler or replication controller), and try again later. Each resource created -by a StatefulSet controller must have a set of labels that match the selector, support orphaning, and have a -controller back reference annotation identifying the owning StatefulSet by name and UID. - -When a StatefulSet is scaled down, the pod for the removed indentity should be deleted. It is less clear what the -controller should do to supporting resources. If every pod requires a PV, and a user accidentally scales -up to N=200 and then back down to N=3, leaving 197 PVs lying around may be undesirable (potential for -abuse). On the other hand, a cluster of 5 that is accidentally scaled down to 3 might irreparably destroy -the cluster if the PV for identities 4 and 5 are deleted (may not be recoverable). For the initial proposal, -leaving the supporting resources is the safest path (safety first) with a potential future policy applied -to the StatefulSet for how to manage supporting resources (DeleteImmediately, GarbageCollect, Preserve). - -The controller should reflect summary counts of resources on the StatefulSet status to enable clients to easily -understand the current state of the set. 
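-
-To make the naming and ordering described above concrete, the sketch below walks the identity space of a
-set and ensures each member's supporting claims before its pod. It is illustrative only: the claim-name
-composition used here (`claimTemplate-identity`) and the helper functions are assumptions of this sketch,
-not part of the proposal.
-
-```go
-package main
-
-import "fmt"
-
-// memberIdentity is the stable name of the i-th member of a StatefulSet,
-// following the numeric scheme above: "mysql-0", "mysql-1", ..., "mysql-(replicas-1)".
-func memberIdentity(setName string, ordinal int) string {
-	return fmt.Sprintf("%s-%d", setName, ordinal)
-}
-
-// claimName derives a supporting PersistentVolumeClaim name for a claim
-// template and a member identity. The exact composition is an assumption
-// made for illustration.
-func claimName(templateName, identity string) string {
-	return fmt.Sprintf("%s-%s", templateName, identity)
-}
-
-func main() {
-	setName, replicas := "mysql", 3
-	claimTemplates := []string{"data"}
-
-	for ordinal := 0; ordinal < replicas; ordinal++ {
-		id := memberIdentity(setName, ordinal)
-		// Supporting resources are ensured before the pod; an already-existing
-		// claim with the right name counts as success, so a controller that
-		// crashed and restarted converges on the same set of objects.
-		for _, tmpl := range claimTemplates {
-			fmt.Printf("ensure claim %s for member %s\n", claimName(tmpl, id), id)
-		}
-		fmt.Printf("ensure pod %s\n", id)
-	}
-}
-```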
- -### Parameterizing pod templates and supporting resources - -Since each pod needs a unique and distinct identity, and the pod needs to know its own identity, the -StatefulSet must allow a pod template to be parameterized by the identity assigned to the pod. The pods that -are created should be easily identified by their cluster membership. - -Because that pod needs access to stable storage, the StatefulSet may specify a template for one or more -**persistent volume claims** that can be used for each distinct pod. The name of the volume claim must -match a volume mount within the pod template. - -Future work: - -* In the future other resources may be added that must also be templated - for instance, secrets (unique secret per member), config data (unique config per member), and in the futher future, arbitrary extension resources. -* Consider allowing the identity value itself to be passed as an environment variable via the downward API -* Consider allowing per identity values to be specified that are passed to the pod template or volume claim. - - -### Accessing pods by stable network identity - -In order to provide stable network identity, given that pods may not assume pod IP is constant over the -lifetime of a pod, it must be possible to have a resolvable DNS name for the pod that is tied to the -pod identity. There are two broad classes of clustered services - those that require clients to know -all members of the cluster (load balancer intolerant) and those that are amenable to load balancing. -For the former, clients must also be able to easily enumerate the list of DNS names that represent the -member identities and access them inside the cluster. Within a pod, it must be possible for containers -to find and access that DNS name for identifying itself to the cluster. - -Since a pod is expected to be controlled by a single controller at a time, it is reasonable for a pod to -have a single identity at a time. Therefore, a service can expose a pod by its identity in a unique -fashion via DNS by leveraging information written to the endpoints by the endpoints controller. - -The end result might be DNS resolution as follows: - -``` -# service mongo pointing to pods created by StatefulSet mdb, with identities mdb-1, mdb-2, mdb-3 - -dig mongodb.namespace.svc.cluster.local +short A -172.130.16.50 - -dig mdb-1.mongodb.namespace.svc.cluster.local +short A -# IP of pod created for mdb-1 - -dig mdb-2.mongodb.namespace.svc.cluster.local +short A -# IP of pod created for mdb-2 - -dig mdb-3.mongodb.namespace.svc.cluster.local +short A -# IP of pod created for mdb-3 -``` - -This is currently implemented via an annotation on pods, which is surfaced to endpoints, and finally -surfaced as DNS on the service that exposes those pods. - -``` -// The pods created by this StatefulSet will have the DNS names "mysql-0.NAMESPACE.svc.cluster.local" -// and "mysql-1.NAMESPACE.svc.cluster.local" -kind: StatefulSet -metadata: - name: mysql -spec: - replicas: 2 - serviceName: db - template: - spec: - containers: - - image: mysql:latest - -// Example pod created by stateful set -kind: Pod -metadata: - name: mysql-0 - annotations: - pod.beta.kubernetes.io/hostname: "mysql-0" - pod.beta.kubernetes.io/subdomain: db -spec: - ... -``` - - -### Preventing duplicate identities - -The StatefulSet controller is expected to execute like other controllers, as a single writer. 
However, when considering designing for safety first, the possibility of the controller running concurrently cannot
-be overlooked, and so it is important to ensure that duplicate pod identities cannot occur.
-
-There are two mechanisms to achieve this at the current time. One is to leverage unique names for pods
-that carry the identity of the pod - this prevents duplication because etcd 2 can guarantee single
-key transactionality. The other is to use the status field of the StatefulSet to coordinate membership
-information. It is possible to leverage both at this time, and to encourage users not to assume the pod
-name is significant, but users are likely to take what they can get. A downside of using unique names
-is that it complicates pre-warming of pods and pod migration - on the other hand, those are also
-advanced use cases that might be better solved by another, more specialized controller (a
-MigratableStatefulSet).
-
-
-### Managing lifecycle of members
-
-The most difficult aspect of managing a member set is ensuring that all members see a consistent configuration
-state of the set. Without a strongly consistent view of cluster state, most clustered software is
-vulnerable to split brain. For example, a new set is created with 3 members. If the node containing the
-first member is partitioned from the cluster, it may not observe the other two members, and thus create its
-own cluster of size 1. The other two members do see the first member, so they form a cluster of size 3.
-Both clusters appear to have quorum, which can lead to data loss if not detected.
-
-StatefulSets should provide basic mechanisms that enable a consistent view of cluster state to be possible,
-and in the future provide more tools to reduce the amount of work necessary to monitor and update that
-state.
-
-The first mechanism is that the StatefulSet controller blocks creation of new pods until all previous pods
-are reporting a healthy status. The StatefulSet controller uses the strong serializability of the underlying
-etcd storage to ensure that it acts on a consistent view of the cluster membership (the pods and their
-status), and serializes the creation of pods based on the health state of other pods. This simplifies
-reasoning about how to initialize a StatefulSet, but is not sufficient to guarantee split brain does not
-occur. (A sketch of this ordered, readiness-gated creation appears after the beta criteria below.)
-
-The second mechanism is having each "member" use the state of the cluster and transform that into cluster
-configuration or decisions about membership. This is currently implemented using a sidecar container
-that watches the master (via DNS today, although in the future this may be to endpoints directly) to
-receive an ordered history of events and then apply those events safely to the configuration. Note that
-for this to be safe, the history received must be strongly consistent (it must be the same order of
-events from all observers) and the config change must be bounded (an old config version may not
-be allowed to exist forever). For now, this is known as a 'babysitter' (working name) and is intended
-to help identify abstractions that can be provided by the StatefulSet controller in the future.
-
-
-## Future Evolution
-
-Criteria for advancing to beta:
-
-* StatefulSets do not accidentally lose data due to cluster design - the pod safety proposal will
-  help ensure StatefulSets can guarantee **at most one** instance of a pod identity is running at
-  any time.
-* A design consensus is reached on StatefulSet upgrades.
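-
-As a minimal sketch of the first mechanism (ordered, readiness-gated creation) referenced above: the
-`member` type and `nextAction` helper are inventions of this sketch, and a real controller would drive the
-same decision from pod status observed through the API server.
-
-```go
-package main
-
-import "fmt"
-
-// member is a simplified view of one StatefulSet member as the controller
-// sees it: whether its pod exists and whether that pod reports Ready.
-type member struct {
-	name   string
-	exists bool
-	ready  bool
-}
-
-// nextAction decides what a single sync pass should do: create the first
-// missing member, but only if every member before it exists and is Ready.
-func nextAction(members []member) string {
-	for _, m := range members {
-		if !m.exists {
-			return "create " + m.name
-		}
-		if !m.ready {
-			return "wait for " + m.name + " to report Ready"
-		}
-	}
-	return "all members exist and are Ready"
-}
-
-func main() {
-	members := []member{
-		{name: "mdb-0", exists: true, ready: true},
-		{name: "mdb-1", exists: true, ready: false},
-		{name: "mdb-2", exists: false},
-	}
-	fmt.Println(nextAction(members)) // "wait for mdb-1 to report Ready"
-}
-```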
- -Criteria for advancing to GA: - -* StatefulSets solve 80% of clustered software configuraton with minimal input from users and are safe from common split brain problems - * Several representative examples of StatefulSets from the community have been proven/tested to be "correct" for a variety of partition problems (possibly via Jepsen or similar) - * Sufficient testing and soak time has been in place (like for Deployments) to ensure the necessary features are in place. -* StatefulSets are considered easy to use for deploying clustered software for common cases - -Requested features: - -* IPs per member for clustered software like Cassandra that cache resolved DNS addresses that can be used outside the cluster - * Individual services can potentially be used to solve this in some cases. -* Send more / simpler events to each pod from a central spot via the "signal API" -* Persistent local volumes that can leverage local storage -* Allow pods within the StatefulSet to identify "leader" in a way that can direct requests from a service to a particular member. -* Provide upgrades of a StatefulSet in a controllable way (like Deployments). - - -## Overlap with other proposals - -* Jobs can be used to perform a run-once initialization of the cluster -* Init containers can be used to prime PVs and config with the identity of the pod. -* Templates and how fields are overriden in the resulting object should have broad alignment -* DaemonSet defines the core model for how new controllers sit alongside replication controller and - how upgrades can be implemented outside of Deployment objects. - - -## History - -StatefulSets were formerly known as PetSets and were renamed to be less "cutesy" and more descriptive as a -prerequisite to moving to beta. No animals were harmed in the making of this proposal. - - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/stateful-apps.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/stateful-apps.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/stateful-apps.md) diff --git a/docs/proposals/synchronous-garbage-collection.md b/docs/proposals/synchronous-garbage-collection.md index c5157408f2a..d5f68abb779 100644 --- a/docs/proposals/synchronous-garbage-collection.md +++ b/docs/proposals/synchronous-garbage-collection.md @@ -1,175 +1 @@ -**Table of Contents** - - -- [Overview](#overview) -- [API Design](#api-design) - - [Standard Finalizers](#standard-finalizers) - - [OwnerReference](#ownerreference) - - [DeleteOptions](#deleteoptions) -- [Components changes](#components-changes) - - [API Server](#api-server) - - [Garbage Collector](#garbage-collector) - - [Controllers](#controllers) -- [Handling circular dependencies](#handling-circular-dependencies) -- [Unhandled cases](#unhandled-cases) -- [Implications to existing clients](#implications-to-existing-clients) - - - -# Overview - -Users of the server-side garbage collection need to determine if the garbage collection is done. For example: -* Currently `kubectl delete rc` blocks until all the pods are terminating. To convert to use server-side garbage collection, kubectl has to be able to determine if the garbage collection is done. 
-* [#19701](https://github.com/kubernetes/kubernetes/issues/19701#issuecomment-236997077) is a use case where the user needs to wait for all service dependencies garbage collected and their names released, before she recreates the dependencies. - -We define the garbage collection as "done" when all the dependents are deleted from the key-value store, rather than merely in the terminating state. There are two reasons: *i)* for `Pod`s, the most usual garbage, only when they are deleted from the key-value store, we know kubelet has released resources they occupy; *ii)* some users need to recreate objects with the same names, they need to wait for the old objects to be deleted from the key-value store. (This limitation is because we index objects by their names in the key-value store today.) - -Synchronous Garbage Collection is a best-effort (see [unhandled cases](#unhandled-cases)) mechanism that allows user to determine if the garbage collection is done: after the API server receives a deletion request of an owning object, the object keeps existing in the key-value store until all its dependents are deleted from the key-value store by the garbage collector. - -Tracking issue: https://github.com/kubernetes/kubernetes/issues/29891 - -# API Design - -## Standard Finalizers - -We will introduce a new standard finalizer: - -```go -const GCFinalizer string = “DeletingDependents” -``` - -This finalizer indicates the object is terminating and is waiting for its dependents whose `OwnerReference.BlockOwnerDeletion` is true get deleted. - -## OwnerReference - -```go -OwnerReference { - ... - // If true, AND if the owner has the "DeletingDependents" finalizer, then the owner cannot be deleted from the key-value store until this reference is removed. - // Defaults to false. - // To set this field, a user needs "delete" permission of the owner, otherwise 422 (Unprocessable Entity) will be returned. - BlockOwnerDeletion *bool -} -``` - -The initial draft of the proposal did not include this field and it had a security loophole: a user who is only authorized to update one resource can set ownerReference to block the synchronous GC of other resources. Requiring users to explicitly set `BlockOwnerDeletion` allows the master to properly authorize the request. - -## DeleteOptions - -```go -DeleteOptions { - … - // Whether and how garbage collection will be performed. - // Defaults to DeletePropagationDefault - // Either this field or OrphanDependents may be set, but not both. - PropagationPolicy *DeletePropagationPolicy -} - -type DeletePropagationPolicy string - -const ( - // The default depends on the existing finalizers on the object and the type of the object. - DeletePropagationDefault DeletePropagationPolicy = "DeletePropagationDefault" - // Orphans the dependents - DeletePropagationOrphan DeletePropagationPolicy = "DeletePropagationOrphan" - // Deletes the object from the key-value store, the garbage collector will delete the dependents in the background. - DeletePropagationBackground DeletePropagationPolicy = "DeletePropagationBackground" - // The object exists in the key-value store until the garbage collector deletes all the dependents whose ownerReference.blockOwnerDeletion=true from the key-value store. - // API sever will put the "DeletingDependents" finalizer on the object, and sets its deletionTimestamp. - // This policy is cascading, i.e., the dependents will be deleted with GarbageCollectionSynchronous. 
- DeletePropagationForeground DeletePropagationPolicy = "DeletePropagationForeground" -) -``` - -The `DeletePropagationForeground` policy represents the synchronous GC mode. - -`DeleteOptions.OrphanDependents *bool` will be marked as deprecated and will be removed in 1.7. Validation code will make sure only one of `OrphanDependents` and `PropagationPolicy` may be set. We decided not to add another `DeleteAfterDependentsDeleted *bool`, because together with `OrphanDependents`, it will result in 9 possible combinations and is thus confusing. - -The conversion rules are described in the following table: - -| 1.5 | pre 1.4/1.4 | -|------------------------------------------|--------------------------| -| DeletePropagationDefault | OrphanDependents==nil | -| DeletePropagationOrphan | *OrphanDependents==true | -| DeletePropagationBackground | *OrphanDependents==false | -| DeletePropagationForeground | N/A | - -# Components changes - -## API Server - -`Delete()` function checks `DeleteOptions.PropagationPolicy`. If the policy is `DeletePropagationForeground`, the API server will update the object instead of deleting it, add the "DeletingDependents" finalizer, remove the "OrphanDependents" finalizer if it's present, and set the `ObjectMeta.DeletionTimestamp`. - -When validating the ownerReference, API server needs to query the `Authorizer` to check if the user has "delete" permission of the owner object. It returns 422 if the user does not have the permissions but intends to set `OwnerReference.BlockOwnerDeletion` to true. - -## Garbage Collector - -**Modifications to processEvent()** - -Currently `processEvent()` manages GC’s internal owner-dependency relationship graph, `uidToNode`. It updates `uidToNode` according to the Add/Update/Delete events in the cluster. To support synchronous GC, it has to: - -* handle Add or Update events where `obj.Finalizers.Has(GCFinalizer) && obj.DeletionTimestamp != nil`. The object will be added into the `dirtyQueue`. The object will be marked as “GC in progress” in `uidToNode`. -* Upon receiving the deletion event of an object, put its owner into the `dirtyQueue` if the owner node is marked as "GC in progress". This is to force the `processItem()` (described next) to re-check if all dependents of the owner is deleted. - -**Modifications to processItem()** - -Currently `processItem()` consumes the `dirtyQueue`, requests the API server to delete an item if all of its owners do not exist. To support synchronous GC, it has to: - -* treat an owner as "not exist" if `owner.DeletionTimestamp != nil && !owner.Finalizers.Has(OrphanFinalizer)`, otherwise synchronous GC will not progress because the owner keeps existing in the key-value store. -* when deleting dependents, if the owner's finalizers include `DeletingDependents`, it should use the `GarbageCollectionSynchronous` as GC policy. -* if an object has multiple owners, some owners still exist while other owners are in the synchronous GC stage, then according to the existing logic of GC, the object wouldn't be deleted. To unblock the synchronous GC of owners, `processItem()` has to remove the ownerReferences pointing to them. - -In addition, if an object popped from `dirtyQueue` is marked as "GC in progress", `processItem()` treats it specially: - -* To avoid racing with another controller, it requeues the object if `observedGeneration < Generation`. This is best-effort, see [unhandled cases](#unhandled-cases). 
-* Checks if the object has dependents - * If not, send a PUT request to remove the `GCFinalizer`; - * If so, then add all dependents to the `dirtryQueue`; we need bookkeeping to avoid adding the dependents repeatedly if the owner gets in the `synchronousGC queue` multiple times. - -## Controllers - -To utilize the synchronous garbage collection feature, controllers (e.g., the replicaset controller) need to set `OwnerReference.BlockOwnerDeletion` when creating dependent objects (e.g. pods). - -# Handling circular dependencies - -SynchronousGC will enter a deadlock in the presence of circular dependencies. The garbage collector can break the circle by lazily breaking circular dependencies: when `processItem()` processes an object, if it finds the object and all of its owners have the `GCFinalizer`, it removes the `GCFinalizer` from the object. - -Note that the approach is not rigorous and thus having false positives. For example, if a user first sends a SynchronousGC delete request for an object, then sends the delete request for its owner, then `processItem()` will be fooled to believe there is a circle. We expect user not to do this. We can make the circle detection more rigorous if needed. - -Circular dependencies are regarded as user error. If needed, we can add more guarantees to handle such cases later. - -# Unhandled cases - -* If the GC observes the owning object with the `GCFinalizer` before it observes the creation of all the dependents, GC will remove the finalizer from the owning object before all dependents are gone. Hence, synchronous GC is best-effort, though we guarantee that the dependents will be deleted eventually. We face a similar case when handling OrphanFinalizer, see [GC known issues](https://github.com/kubernetes/kubernetes/issues/26120). - -# Implications to existing clients - -Finalizer breaks an assumption that many Kubernetes components have: a deletion request with `grace period=0` will immediately remove the object from the key-value store. This is not true if an object has pending finalizers, the object will continue to exist, and currently the API server will not return an error in this case. - -**Namespace controller** suffered from this [problem](https://github.com/kubernetes/kubernetes/issues/32519) and was fixed in [#32524](https://github.com/kubernetes/kubernetes/pull/32524) by retrying every 15s if there are objects with pending finalizers to be removed from the key-value store. Object with pending `GCFinalizer` might take arbitrary long time be deleted, so namespace deletion might time out. - -**kubelet** deletes the pod from the key-value store after all its containers are terminated ([code](../../pkg/kubelet/status/status_manager.go#L441-L443)). It also assumes that if the API server does not return an error, the pod is removed from the key-value store. Breaking the assumption will not break `kubelet` though, because the `pod` must have already been in the terminated phase, `kubelet` will not care to manage it. - -**Node controller** forcefully deletes pod if the pod is scheduled to a node that does not exist ([code](../../pkg/controller/node/nodecontroller.go#L474)). The pod will continue to exist if it has pending finalizers. The node controller will futilely retry the deletion. Also, the `node controller` forcefully deletes pods before deleting the node ([code](../../pkg/controller/node/nodecontroller.go#L592)). If the pods have pending finalizers, the `node controller` will go ahead deleting the node, leaving those pods behind. 
These pods will be deleted from the key-value store when the pending finalizers are removed. - -**Podgc** deletes terminated pods if there are too many of them in the cluster. We need to make sure finalizers on Pods are taken off quickly enough so that the progress of `Podgc` is not affected. - -**Deployment controller** adopts existing `ReplicaSet` (RS) if its template matches. If a matching RS has a pending `GCFinalizer`, deployment should adopt it, take its pods into account, but shouldn't try to mutate it, because the RS controller will ignore a RS that's being deleted. Hence, `deployment controller` should wait for the RS to be deleted, and then create a new one. - -**Replication controller manager**, **Job controller**, and **ReplicaSet controller** ignore pods in terminated phase, so pods with pending finalizers will not block these controllers. - -**StatefulSet controller** will be blocked by a pod with pending finalizers, so synchronous GC might slow down its progress. - -**kubectl**: synchronous GC can simplify the **kubectl delete** reapers. Let's take the `deployment reaper` as an example, since it's the most complicated one. Currently, the reaper finds all `RS` with matching labels, scales them down, polls until `RS.Status.Replica` reaches 0, deletes the `RS`es, and finally deletes the `deployment`. If using synchronous GC, `kubectl delete deployment` is as easy as sending a synchronous GC delete request for the deployment, and polls until the deployment is deleted from the key-value store. - -Note that this **changes the behavior** of `kubectl delete`. The command will be blocked until all pods are deleted from the key-value store, instead of being blocked until pods are in the terminating state. This means `kubectl delete` blocks for longer time, but it has the benefit that the resources used by the pods are released when the `kubectl delete` returns. To allow kubectl user not waiting for the cleanup, we will add a `--wait` flag. It defaults to true; if it's set to `false`, `kubectl delete` will send the delete request with `PropagationPolicy=DeletePropagationBackground` and return immediately. - -To make the new kubectl compatible with the 1.4 and earlier masters, kubectl needs to switch to use the old reaper logic if it finds synchronous GC is not supported by the master. - -1.4 `kubectl delete rc/rs` uses `DeleteOptions.OrphanDependents=true`, which is going to be converted to `DeletePropagationBackground` (see [API Design](#api-changes)) by a 1.5 master, so its behavior keeps the same. - -Pre 1.4 `kubectl delete` uses `DeleteOptions.OrphanDependents=nil`, so does the 1.4 `kubectl delete` for resources other than rc and rs. The option is going to be converted to `DeletePropagationDefault` (see [API Design](#api-changes)) by a 1.5 master, so these commands behave the same as when working with a 1.4 master. 
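-
-For illustration, the flow described above for `kubectl delete` with `--wait` looks roughly like the
-following when driven directly through a client. The sketch uses today's client-go and the apps/v1 API
-group, both of which post-date this proposal, so treat it as an approximation of the flow rather than part
-of the design.
-
-```go
-package main
-
-import (
-	"context"
-	"fmt"
-	"time"
-
-	apierrors "k8s.io/apimachinery/pkg/api/errors"
-	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
-	"k8s.io/apimachinery/pkg/util/wait"
-	"k8s.io/client-go/kubernetes"
-)
-
-// deleteAndWait issues a foreground-propagation delete for a Deployment and
-// polls until the object is gone from the key-value store.
-func deleteAndWait(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
-	policy := metav1.DeletePropagationForeground
-	err := client.AppsV1().Deployments(namespace).Delete(ctx, name, metav1.DeleteOptions{
-		PropagationPolicy: &policy,
-	})
-	if err != nil && !apierrors.IsNotFound(err) {
-		return err
-	}
-	// With foreground propagation the owner keeps existing (carrying the
-	// "DeletingDependents" finalizer and a deletionTimestamp) until its
-	// blocking dependents are deleted, so polling the owner is sufficient.
-	return wait.PollImmediate(2*time.Second, 5*time.Minute, func() (bool, error) {
-		_, err := client.AppsV1().Deployments(namespace).Get(ctx, name, metav1.GetOptions{})
-		if apierrors.IsNotFound(err) {
-			return true, nil
-		}
-		return false, err
-	})
-}
-
-func main() {
-	// Constructing a real clientset (e.g. via client-go's clientcmd package)
-	// is omitted here; deleteAndWait is the interesting part.
-	fmt.Println("see deleteAndWait")
-}
-```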
- - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/synchronous-garbage-collection.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/synchronous-garbage-collection.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/synchronous-garbage-collection.md) diff --git a/docs/proposals/templates.md b/docs/proposals/templates.md index 2d58fbd59a5..0805d8e9780 100644 --- a/docs/proposals/templates.md +++ b/docs/proposals/templates.md @@ -1,569 +1 @@ -# Templates+Parameterization: Repeatedly instantiating user-customized application topologies. - -## Motivation - -Addresses https://github.com/kubernetes/kubernetes/issues/11492 - -There are two main motivators for Template functionality in Kubernetes: Controller Instantiation and Application Definition - -### Controller Instantiation - -Today the replication controller defines a PodTemplate which allows it to instantiate multiple pods with identical characteristics. -This is useful but limited. Stateful applications have a need to instantiate multiple instances of a more sophisticated topology -than just a single pod (e.g. they also need Volume definitions). A Template concept would allow a Controller to stamp out multiple -instances of a given Template definition. This capability would be immediately useful to the [StatefulSet](https://github.com/kubernetes/kubernetes/pull/18016) proposal. - -Similarly the [Service Catalog proposal](https://github.com/kubernetes/kubernetes/pull/17543) could leverage template instantiation as a mechanism for claiming service instances. - - -### Application Definition - -Kubernetes gives developers a platform on which to run images and many configuration objects to control those images, but -constructing a cohesive application made up of images and configuration objects is currently difficult. Applications -require: - -* Information sharing between images (e.g. one image provides a DB service, another consumes it) -* Configuration/tuning settings (memory sizes, queue limits) -* Unique/customizable identifiers (service names, routes) - -Application authors know which values should be tunable and what information must be shared, but there is currently no -consistent way for an application author to define that set of information so that application consumers can easily deploy -an application and make appropriate decisions about the tunable parameters the author intended to expose. - -Furthermore, even if an application author provides consumers with a set of API object definitions (e.g. a set of yaml files) -it is difficult to build a UI around those objects that would allow the deployer to modify names in one place without -potentially breaking assumed linkages to other pieces. There is also no prescriptive way to define which configuration -values are appropriate for a deployer to tune or what the parameters control. - -## Use Cases - -### Use cases for templates in general - -* Providing a full baked application experience in a single portable object that can be repeatably deployed in different environments. - * e.g. 
Wordpress deployment with separate database pod/replica controller - * Complex service/replication controller/volume topologies -* Bulk object creation -* Provide a management mechanism for deleting/uninstalling an entire set of components related to a single deployed application -* Providing a library of predefined application definitions that users can select from -* Enabling the creation of user interfaces that can guide an application deployer through the deployment process with descriptive help about the configuration value decisions they are making, and useful default values where appropriate -* Exporting a set of objects in a namespace as a template so the topology can be inspected/visualized or recreated in another environment -* Controllers that need to instantiate multiple instances of identical objects (e.g. StatefulSets). - - -### Use cases for parameters within templates - -* Share passwords between components (parameter value is provided to each component as an environment variable or as a Secret reference, with the Secret value being parameterized or produced by an [initializer](https://github.com/kubernetes/kubernetes/issues/3585)) -* Allow for simple deployment-time customization of “app” configuration via environment values or api objects, e.g. memory - tuning parameters to a MySQL image, Docker image registry prefix for image strings, pod resource requests and limits, default - scale size. -* Allow simple, declarative defaulting of parameter values and expose them to end users in an approachable way - a parameter - like “MySQL table space” can be parameterized in images as an env var - the template parameters declare the parameter, give - it a friendly name, give it a reasonable default, and informs the user what tuning options are available. -* Customization of component names to avoid collisions and ensure matched labeling (e.g. replica selector value and pod label are - user provided and in sync). -* Customize cross-component references (e.g. user provides the name of a secret that already exists in their namespace, to use in - a pod as a TLS cert). -* Provide guidance to users for parameters such as default values, descriptions, and whether or not a particular parameter value - is required or can be left blank. -* Parameterize the replica count of a deployment or [StatefulSet](https://github.com/kubernetes/kubernetes/pull/18016) -* Parameterize part of the labels and selector for a DaemonSet -* Parameterize quota/limit values for a pod -* Parameterize a secret value so a user can provide a custom password or other secret at deployment time - - -## Design Assumptions - -The goal for this proposal is a simple schema which addresses a few basic challenges: - -* Allow application authors to expose configuration knobs for application deployers, with suggested defaults and -descriptions of the purpose of each knob -* Allow application deployers to easily customize exposed values like object names while maintaining referential integrity - between dependent pieces (for example ensuring a pod's labels always match the corresponding selector definition of the service) -* Support maintaining a library of templates within Kubernetes that can be accessed and instantiated by end users -* Allow users to quickly and repeatedly deploy instances of well-defined application patterns produced by the community -* Follow established Kubernetes API patterns by defining new template related APIs which consume+return first class Kubernetes - API (and therefore json conformant) objects. 
- -We do not wish to invent a new Turing-complete templating language. There are good options available -(e.g. https://github.com/mustache/mustache) for developers who want a completely flexible and powerful solution for creating -arbitrarily complex templates with parameters, and tooling can be built around such schemes. - -This desire for simplicity also intentionally excludes template composability/embedding as a supported use case. - -Allowing templates to reference other templates presents versioning+consistency challenges along with making the template -no longer a self-contained portable object. Scenarios necessitating multiple templates can be handled in one of several -alternate ways: - -* Explicitly constructing a new template that merges the existing templates (tooling can easily be constructed to perform this - operation since the templates are first class api objects). -* Manually instantiating each template and utilizing [service linking](https://github.com/kubernetes/kubernetes/pull/17543) to share - any necessary configuration data. - -This document will also refrain from proposing server APIs or client implementations. This has been a point of debate, and it makes -more sense to focus on the template/parameter specification/syntax than to worry about the tooling that will process or manage the -template objects. However since there is a desire to at least be able to support a server side implementation, this proposal -does assume the specification will be k8s API friendly. - -## Desired characteristics - -* Fully k8s object json-compliant syntax. This allows server side apis that align with existing k8s apis to be constructed - which consume templates and existing k8s tooling to work with them. It also allows for api versioning/migration to be managed by - the existing k8s codec scheme rather than having to define/introduce a new syntax evolution mechanism. - * (Even if they are not part of the k8s core, it would still be good if a server side template processing+managing api supplied - as an ApiGroup consumed the same k8s object schema as the peer k8s apis rather than introducing a new one) -* Self-contained parameter definitions. This allows a template to be a portable object which includes metadata that describe - the inputs it expects, making it easy to wrapper a user interface around the parameterization flow. -* Object field primitive types include string, int, boolean, byte[]. The substitution scheme should support all of those types. - * complex types (struct/map/list) can be defined in terms of the available primitives, so it's preferred to avoid the complexity - of allowing for full complex-type substitution. -* Parameter metadata. Parameters should include at a minimum, information describing the purpose of the parameter, whether it is - required/optional, and a default/suggested value. Type information could also be required to enable more intelligent client interfaces. -* Template metadata. Templates should be able to include metadata describing their purpose or links to further documentation and - versioning information. Annotations on the Template's metadata field can fulfill this requirement. - - -## Proposed Implementation - -### Overview - -We began by looking at the List object which allows a user to easily group a set of objects together for easy creation via a -single CLI invocation. It also provides a portable format which requires only a single file to represent an application. 
- -From that starting point, we propose a Template API object which can encapsulate the definition of all components of an -application to be created. The application definition is encapsulated in the form of an array of API objects (identical to -List), plus a parameterization section. Components reference the parameter by name and the value of the parameter is -substituted during a processing step, prior to submitting each component to the appropriate API endpoint for creation. - -The primary capability provided is that parameter values can easily be shared between components, such as a database password -that is provided by the user once, but then attached as an environment variable to both a database pod and a web frontend pod. - -In addition, the template can be repeatedly instantiated for a consistent application deployment experience in different -namespaces or Kubernetes clusters. - -Lastly, we propose the Template API object include a “Labels” section in which the template author can define a set of labels -to be applied to all objects created from the template. This will give the template deployer an easy way to manage all the -components created from a given template. These labels will also be applied to selectors defined by Objects within the template, -allowing a combination of templates and labels to be used to scope resources within a namespace. That is, a given template -can be instantiated multiple times within the same namespace, as long as a different label value is used each for each -instantiation. The resulting objects will be independent from a replica/load-balancing perspective. - -Generation of parameter values for fields such as Secrets will be delegated to an [admission controller/initializer/finalizer](https://github.com/kubernetes/kubernetes/issues/3585) rather than being solved by the template processor. Some discussion about a generation -service is occurring [here](https://github.com/kubernetes/kubernetes/issues/12732) - -Labels to be assigned to all objects could also be generated in addition to, or instead of, allowing labels to be supplied in the -Template definition. - -### API Objects - -**Template Object** - -``` -// Template contains the inputs needed to produce a Config. -type Template struct { - unversioned.TypeMeta - kapi.ObjectMeta - - // Optional: Parameters is an array of Parameters used during the - // Template to Config transformation. - Parameters []Parameter - - // Required: A list of resources to create - Objects []runtime.Object - - // Optional: ObjectLabels is a set of labels that are applied to every - // object during the Template to Config transformation - // These labels are also be applied to selectors defined by objects in the template - ObjectLabels map[string]string -} -``` - -**Parameter Object** - -``` -// Parameter defines a name/value variable that is to be processed during -// the Template to Config transformation. -type Parameter struct { - // Required: Parameter name must be set and it can be referenced in Template - // Items using $(PARAMETER_NAME) - Name string - - // Optional: The name that will show in UI instead of parameter 'Name' - DisplayName string - - // Optional: Parameter can have description - Description string - - // Optional: Value holds the Parameter data. - // The value replaces all occurrences of the Parameter $(Name) or - // $((Name)) expression during the Template to Config transformation. 
- Value string
-
- // Optional: Indicates the parameter must have a non-empty value either provided by the user or provided by a default. Defaults to false.
- Required bool
-
- // Optional: Type-value of the parameter (one of string, int, bool, or base64)
- // Used by clients to provide validation of user input and guide users.
- Type ParameterType
-}
-```
-
-As seen above, parameters allow for metadata which can be fed into client implementations to display information about the
-parameter’s purpose and whether a value is required. In lieu of full type information, two reference styles are offered: `$(PARAM)`
-and `$((PARAM))`. When the single-parens style is used, the result of the substitution remains quoted. When the double-parens
-style is used, the result of the substitution is not quoted. For example, given a parameter `FOO` defined with a value
-of "BAR", the following behavior will be observed:
-
-```
-somefield: "$(FOO)" -> somefield: "BAR"
-somefield: "$((FOO))" -> somefield: BAR
-```
-
-For concatenation, the result value reflects the type of substitution (quoted or unquoted):
-
-```
-somefield: "prefix_$(FOO)_suffix" -> somefield: "prefix_BAR_suffix"
-somefield: "prefix_$((FOO))_suffix" -> somefield: prefix_BAR_suffix
-```
-
-If both types of substitution exist, quoting is performed:
-
-```
-somefield: "prefix_$((FOO))_$(FOO)_suffix" -> somefield: "prefix_BAR_BAR_suffix"
-```
-
-This mechanism allows integer and boolean values to be substituted properly.
-
-The value of the parameter can be explicitly defined in the template. This should be considered a default value for the parameter;
-clients which process templates are free to override this value based on user input.
-
-
-**Example Template**
-
-Illustration of a template which defines a service and replication controller, with parameters to specialize
-the name of the top-level objects, the number of replicas, and several environment variables defined on the
-pod template.
- -``` -{ - "kind": "Template", - "apiVersion": "v1", - "metadata": { - "name": "mongodb-ephemeral", - "annotations": { - "description": "Provides a MongoDB database service" - } - }, - "labels": { - "template": "mongodb-ephemeral-template" - }, - "objects": [ - { - "kind": "Service", - "apiVersion": "v1", - "metadata": { - "name": "$(DATABASE_SERVICE_NAME)" - }, - "spec": { - "ports": [ - { - "name": "mongo", - "protocol": "TCP", - "targetPort": 27017 - } - ], - "selector": { - "name": "$(DATABASE_SERVICE_NAME)" - } - } - }, - { - "kind": "ReplicationController", - "apiVersion": "v1", - "metadata": { - "name": "$(DATABASE_SERVICE_NAME)" - }, - "spec": { - "replicas": "$((REPLICA_COUNT))", - "selector": { - "name": "$(DATABASE_SERVICE_NAME)" - }, - "template": { - "metadata": { - "creationTimestamp": null, - "labels": { - "name": "$(DATABASE_SERVICE_NAME)" - } - }, - "spec": { - "containers": [ - { - "name": "mongodb", - "image": "docker.io/centos/mongodb-26-centos7", - "ports": [ - { - "containerPort": 27017, - "protocol": "TCP" - } - ], - "env": [ - { - "name": "MONGODB_USER", - "value": "$(MONGODB_USER)" - }, - { - "name": "MONGODB_PASSWORD", - "value": "$(MONGODB_PASSWORD)" - }, - { - "name": "MONGODB_DATABASE", - "value": "$(MONGODB_DATABASE)" - } - ] - } - ] - } - } - } - } - ], - "parameters": [ - { - "name": "DATABASE_SERVICE_NAME", - "description": "Database service name", - "value": "mongodb", - "required": true - }, - { - "name": "MONGODB_USER", - "description": "Username for MongoDB user that will be used for accessing the database", - "value": "username", - "required": true - }, - { - "name": "MONGODB_PASSWORD", - "description": "Password for the MongoDB user", - "required": true - }, - { - "name": "MONGODB_DATABASE", - "description": "Database name", - "value": "sampledb", - "required": true - }, - { - "name": "REPLICA_COUNT", - "description": "Number of mongo replicas to run", - "value": "1", - "required": true - } - ] -} -``` - -### API Endpoints - -* **/processedtemplates** - when a template is POSTed to this endpoint, all parameters in the template are processed and -substituted into appropriate locations in the object definitions. Validation is performed to ensure required parameters have -a value supplied. In addition labels defined in the template are applied to the object definitions. Finally the customized -template (still a `Template` object) is returned to the caller. (The possibility of returning a List instead has -also been discussed and will be considered for implementation). - -The client is then responsible for iterating the objects returned and POSTing them to the appropriate resource api endpoint to -create each object, if that is the desired end goal for the client. - -Performing parameter substitution on the server side has the benefit of centralizing the processing so that new clients of -k8s, such as IDEs, CI systems, Web consoles, etc, do not need to reimplement template processing or embed the k8s binary. -Instead they can invoke the k8s api directly. - -* **/templates** - the REST storage resource for storing and retrieving template objects, scoped within a namespace. - -Storing templates within k8s has the benefit of enabling template sharing and securing via the same roles/resources -that are used to provide access control to other cluster resources. It also enables sophisticated service catalog -flows in which selecting a service from a catalog results in a new instantiation of that service. 
(This is not the
-only way to implement such a flow, but it does provide a useful level of integration).
-
-Creating a new template (POST to the /templates api endpoint) simply stores the template definition; it has no side
-effects (no other objects are created).
-
-This resource can also support a subresource "/templates/templatename/processed". This resource would accept just a
-Parameters object and would process the template stored in the cluster as "templatename". The processed result would be
-returned in the same form as `/processedtemplates`.
-
-### Workflow
-
-#### Template Instantiation
-
-Given a well-formed template, a client will:
-
-1. Optionally set an explicit `value` for any parameters the user wishes to override
-2. Submit the new template object to the `/processedtemplates` api endpoint
-
-The api endpoint will then:
-
-1. Validate the template, including confirming that “required” parameters have an explicit value.
-2. Walk each api object in the template.
-3. Add all labels defined in the template’s ObjectLabels field.
-4. For each field, check if the value matches a parameter name and if so, set the value of the field to the value of the parameter
-   (a sketch of this substitution step appears after the Tooling discussion below).
-   * Partial substitutions are accepted, such as `SOME_$(PARAM)`, which would be transformed into `SOME_XXXX` where `XXXX` is the value
-     of the `$(PARAM)` parameter.
-   * If a given $(VAL) could be resolved to either a parameter or an environment variable/downward api reference, an error will be
-     returned.
-5. Return the processed template object (or a List, depending on the choice made when this is implemented).
-
-The client can now either return the processed template to the user in a desired form (e.g. json or yaml), or directly iterate the
-api objects within the template, invoking the appropriate object creation api endpoint for each element. (If the api returns
-a List, the client would simply iterate the list to create the objects.)
-
-The result is a consistently recreatable application configuration, including well-defined labels for grouping objects created by
-the template, with end-user customizations as enabled by the template author.
-
-#### Template Authoring
-
-To aid application authors in the creation of new templates, it should be possible to export existing objects from a project
-in template form. A user should be able to export all or a filtered subset of objects from a namespace, wrapped into a
-Template API object. The user will still need to customize the resulting object to enable parameterization and labeling,
-though sophisticated export logic could attempt to auto-parameterize well-understood api fields. Such logic is not considered
-in this proposal.
-
-#### Tooling
-
-As described above, templates can be instantiated by posting them to a template processing endpoint. CLI tools should
-exist which can collect parameter values from the user as part of the template instantiation flow.
-
-More sophisticated UI implementations should also guide the user through which parameters the template expects, the descriptions
-of those parameters, and the collection of user-provided values.
-
-In addition, as described above, existing objects in a namespace can be exported in template form, making it easy to recreate a
-set of objects in a new namespace or a new cluster.
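-
-To make the substitution step in the instantiation workflow above concrete, the following is a minimal, illustrative
-sketch of such a walk over decoded objects. It is not the proposed implementation: it operates on generic decoded JSON,
-and it deliberately ignores the quoted vs. unquoted distinction between `$(PARAM)` and `$((PARAM))`, which requires
-operating on the serialized form.
-
-```go
-package main
-
-import (
-    "fmt"
-    "regexp"
-)
-
-// Matches $(NAME) and $((NAME)) references.
-var paramRef = regexp.MustCompile(`\$\(\(?([A-Za-z0-9_]+)\)?\)`)
-
-// substitute walks a decoded JSON object and replaces parameter references
-// found in string values with the corresponding parameter value.
-func substitute(node interface{}, params map[string]string) interface{} {
-    switch v := node.(type) {
-    case map[string]interface{}:
-        for k, val := range v {
-            v[k] = substitute(val, params)
-        }
-        return v
-    case []interface{}:
-        for i, val := range v {
-            v[i] = substitute(val, params)
-        }
-        return v
-    case string:
-        return paramRef.ReplaceAllStringFunc(v, func(ref string) string {
-            name := paramRef.FindStringSubmatch(ref)[1]
-            if val, ok := params[name]; ok {
-                return val
-            }
-            return ref // unknown references are left untouched in this sketch
-        })
-    default:
-        return v
-    }
-}
-
-func main() {
-    obj := map[string]interface{}{
-        "kind":     "Service",
-        "metadata": map[string]interface{}{"name": "$(DATABASE_SERVICE_NAME)"},
-    }
-    params := map[string]string{"DATABASE_SERVICE_NAME": "mongodb"}
-    fmt.Println(substitute(obj, params)) // map[kind:Service metadata:map[name:mongodb]]
-}
-```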
- - -## Examples - -### Example Templates - -These examples reflect the current OpenShift template schema, not the exact schema proposed in this document, however this -proposal, if accepted, provides sufficient capability to support the examples defined here, with the exception of -automatic generation of passwords. - -* [Jenkins template](https://github.com/openshift/origin/blob/master/examples/jenkins/jenkins-persistent-template.json) -* [MySQL DB service template](https://github.com/openshift/origin/blob/master/examples/db-templates/mysql-persistent-template.json) - -### Examples of OpenShift Parameter Usage - -(mapped to use cases described above) - -* [Share passwords](https://github.com/jboss-openshift/application-templates/blob/master/eap/eap64-mongodb-s2i.json#L146-L152) -* [Simple deployment-time customization of “app” configuration via environment values](https://github.com/jboss-openshift/application-templates/blob/master/eap/eap64-mongodb-s2i.json#L108-L126) (e.g. memory tuning, resource limits, etc) -* [Customization of component names with referential integrity](https://github.com/jboss-openshift/application-templates/blob/master/eap/eap64-mongodb-s2i.json#L199-L207) -* [Customize cross-component references](https://github.com/jboss-openshift/application-templates/blob/master/eap/eap64-mongodb-s2i.json#L78-L83) (e.g. user provides the name of a secret that already exists in their namespace, to use in a pod as a TLS cert) - -## Requirements analysis - -There has been some discussion of desired goals for a templating/parameterization solution [here](https://github.com/kubernetes/kubernetes/issues/11492#issuecomment-160853594). This section will attempt to address each of those points. - -*The primary goal is that parameterization should facilitate reuse of declarative configuration templates in different environments in - a "significant number" of common cases without further expansion, substitution, or other static preprocessing.* - -* This solution provides for templates that can be reused as is (assuming parameters are not used or provide sane default values) across - different environments, they are a self-contained description of a topology. - -*Parameterization should not impede the ability to use kubectl commands with concrete resource specifications.* - -* The parameterization proposal here does not extend beyond Template objects. That is both a strength and limitation of this proposal. - Parameterizable objects must be wrapped into a Template object, rather than existing on their own. - -*Parameterization should work with all kubectl commands that accept --filename, and should work on templates comprised of multiple resources.* - -* Same as above. - -*The parameterization mechanism should not prevent the ability to wrap kubectl with workflow/orchestration tools, such as Deployment manager.* - -* Since this proposal uses standard API objects, a DM or Helm flow could still be constructed around a set of templates, just as those flows are - constructed around other API objects today. - -*Any parameterization mechanism we add should not preclude the use of a different parameterization mechanism, it should be possible -to use different mechanisms for different resources, and, ideally, the transformation should be composable with other -substitution/decoration passes.* - -* This templating scheme does not preclude layering an additional templating mechanism over top of it. 
For example, it would be - possible to write a Mustache template which, after Mustache processing, resulted in a Template which could then be instantiated - through the normal template instantiating process. - -*Parameterization should not compromise reproducibility. For instance, it should be possible to manage template arguments as well as -templates under version control.* - -* Templates are a single file, including default or chosen values for parameters. They can easily be managed under version control. - -*It should be possible to specify template arguments (i.e., parameter values) declaratively, in a way that is "self-describing" -(i.e., naming the parameters and the template to which they correspond). It should be possible to write generic commands to -process templates.* - -* Parameter definitions include metadata which describes the purpose of the parameter. Since parameter definitions are part of the template, - there is no need to indicate which template they correspond to. - -*It should be possible to validate templates and template parameters, both values and the schema.* - -* Template objects are subject to standard api validation. - -*It should also be possible to validate and view the output of the substitution process.* - -* The `/processedtemplates` api returns the result of the substitution process, which is itself a Template object that can be validated. - -*It should be possible to generate forms for parameterized templates, as discussed in #4210 and #6487.* - -* Parameter definitions provide metadata that allows for the construction of form-based UIs to gather parameter values from users. - -*It shouldn't be inordinately difficult to evolve templates. Thus, strategies such as versioning and encapsulation should be -encouraged, at least by convention.* - -* Templates can be versioned via annotations on the template object. - -## Key discussion points - -The preceding document is opinionated about each of these topics, however they have been popular topics of discussion so they are called out explicitly below. - -### Where to define parameters - -There has been some discussion around where to define parameters that are being injected into a Template - -1. In a separate standalone file -2. Within the Template itself - -This proposal suggests including the parameter definitions within the Template, which provides a self-contained structure that -can be easily versioned, transported, and instantiated without risk of mismatching content. In addition, a Template can easily -be validated to confirm that all parameter references are resolveable. - -Separating the parameter definitions makes for a more complex process with respect to -* Editing a template (if/when first class editing tools are created) -* Storing/retrieving template objects with a central store - -Note that the `/templates/sometemplate/processed` subresource would accept a standalone set of parameters to be applied to `sometemplate`. - -### How to define parameters - -There has also been debate about how a parameter should be referenced from within a template. This proposal suggests that -fields to be substituted by a parameter value use the "$(parameter)" syntax which is already used elsewhere within k8s. The -value of `parameter` should be matched to a parameter with that name, and the value of the matched parameter substituted into -the field value. - -Other suggestions include a path/map approach in which a list of field paths (e.g. json path expressions) and corresponding -parameter names are provided. 
The substitution process would walk the map, replacing fields with the appropriate
-parameter value. This approach makes templates more fragile from the perspective of editing/refactoring, as field paths
-may change, thus breaking the map. There is, of course, also a risk of breaking references with the previous scheme, but
-renaming parameters seems less likely than changing field paths.
-
-### Storing templates in k8s
-
-OpenShift defines templates as a first-class resource so they can be created/retrieved/etc via standard tools. This allows client tools to list available templates (available in the OpenShift cluster), allows existing resource security controls to be applied to templates, and generally provides a more integrated feel to templates. However, there is no explicit requirement that, in order for k8s to adopt templates, it must also adopt storing them in the cluster.
-
-### Processing templates (server vs. client)
-
-OpenShift handles template processing via a server endpoint which consumes a template object from the client and returns the list of objects
-produced by processing the template. It is also possible to handle the entire template processing flow via the client, but this was deemed
-undesirable as it would force each client tool to reimplement template processing (e.g. the standard CLI tool, an Eclipse plugin, a plugin for a CI system like Jenkins, etc). The assumption in this proposal is that server side template processing is the preferred implementation approach for
-this reason.
-
-
-
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/templates.md?pixel)]()
-
+This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/templates.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/templates.md)
diff --git a/docs/proposals/volume-hostpath-qualifiers.md b/docs/proposals/volume-hostpath-qualifiers.md
index cd0902ec58d..247e6be5aae 100644
--- a/docs/proposals/volume-hostpath-qualifiers.md
+++ b/docs/proposals/volume-hostpath-qualifiers.md
@@ -1,150 +1 @@
-# Support HostPath volume existence qualifiers
-
-## Introduction
-
-A Host volume source is probably the simplest volume type to define, needing
-only a single path. However, that simplicity comes with many assumptions and
-caveats.
-
-This proposal describes one of the issues associated with Host volumes —
-their silent and implicit creation of directories on the host — and
-proposes a solution.
-
-## Problem
-
-Right now, under Docker, when a bindmount references a hostPath, that path will
-be created as an empty directory, owned by root, if it does not already exist.
-This is rarely what the user actually wants, because hostPath volumes are
-typically used to express a dependency on an existing external file or
-directory.
-This concern was raised during the [initial
-implementation](https://github.com/docker/docker/issues/1279#issuecomment-22965058)
-of this behavior in Docker, and it was suggested that orchestration systems
-could better manage volume creation than Docker, but Docker creates the
-directory anyway.
-
-To fix this problem, I propose allowing a pod to specify whether a given
-hostPath should exist prior to the pod running, whether it should be created,
-and what it should exist as.
-I also propose the inclusion of a default value which matches the current
-behavior to ensure backwards compatibility.
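-
-For illustration, the kind of per-type check being proposed could be sketched as follows. This is a hypothetical
-helper, not an existing or proposed Kubelet API; the concrete field and its allowed values are defined under
-"Proposed API Change" below, and an empty type here models the backwards-compatible default behavior.
-
-```go
-package main
-
-import (
-    "fmt"
-    "os"
-)
-
-// checkHostPath sketches the validation the kubelet could perform before
-// starting a pod. An empty hostPathType models the backwards-compatible
-// default: create an empty directory if nothing exists at the path.
-func checkHostPath(path, hostPathType string) error {
-    info, err := os.Stat(path)
-    if os.IsNotExist(err) {
-        if hostPathType == "" {
-            return os.MkdirAll(path, 0755)
-        }
-        return fmt.Errorf("hostPath %q does not exist (required type %q)", path, hostPathType)
-    }
-    if err != nil {
-        return err
-    }
-    mode := info.Mode()
-    switch hostPathType {
-    case "", "exists":
-        return nil
-    case "file":
-        if !mode.IsRegular() {
-            return fmt.Errorf("hostPath %q is not a regular file", path)
-        }
-    case "directory":
-        if !mode.IsDir() {
-            return fmt.Errorf("hostPath %q is not a directory", path)
-        }
-    case "socket":
-        if mode&os.ModeSocket == 0 {
-            return fmt.Errorf("hostPath %q is not a socket", path)
-        }
-    case "device":
-        if mode&os.ModeDevice == 0 {
-            return fmt.Errorf("hostPath %q is not a device", path)
-        }
-    default:
-        return fmt.Errorf("unknown hostPath type %q", hostPathType)
-    }
-    return nil
-}
-
-func main() {
-    fmt.Println(checkHostPath("/var/log", "directory"))
-}
-```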
- -To understand exactly when this behavior will or won't be correct, it's -important to look at the use-cases of Host Volumes. -The table below broadly classifies the use-case of Host Volumes and asserts -whether this change would be of benefit to that use-case. - -### HostPath volume Use-cases - -| Use-case | Description | Examples | Benefits from this change? | Why? | -|:---------|:------------|:---------|:--------------------------:|:-----| -| Accessing an external system, data, or configuration | Data or a unix socket is created by a process on the host, and a pod within kubernetes consumes it | [fluentd-es-addon](https://github.com/kubernetes/kubernetes/blob/74b01041cc3feb2bb731cc243ab0e4515bef9a84/cluster/saltbase/salt/fluentd-es/fluentd-es.yaml#L30), [addon-manager](https://github.com/kubernetes/kubernetes/blob/808f3ecbe673b4127627a457dc77266ede49905d/cluster/gce/coreos/kube-manifests/kube-addon-manager.yaml#L23), [kube-proxy](https://github.com/kubernetes/kubernetes/blob/010c976ce8dd92904a7609483c8e794fd8e94d4e/cluster/saltbase/salt/kube-proxy/kube-proxy.manifest#L65), etc | :white_check_mark: | Fails faster and with more useful messages, and won't run when basic assumptions are false (e.g. that docker is the runtime and the docker.sock exists) | -| Providing data to external systems | Some pods wish to publish data to the host for other systems to consume, sometimes to a generic directory and sometimes to more component-specific ones | Kubelet core components which bindmount their logs out to `/var/log/*.log` so logrotate and other tools work with them | :white_check_mark: | Sometimes, but not always. It's directory-specific whether it not existing will be a problem. | -| Communicating between instances and versions of yourself | A pod can use a hostPath directory as a sort of cache and, as opposed to an emptyDir, persist the directory between versions of itself | [etcd](https://github.com/kubernetes/kubernetes/blob/fac54c9b22eff5c5052a8e3369cf8416a7827d36/cluster/saltbase/salt/etcd/etcd.manifest#L84), caches | :x: | It's pretty much always okay to create them | - - -### Other motivating factors - -One additional motivating factor for this change is that under the rkt runtime -paths are not created when they do not exist. This change moves the management -of these volumes into the Kubelet to the benefit of the rkt container runtime. - - -## Proposed API Change - -### Host Volume - -I propose that the -[`v1.HostPathVolumeSource`](https://github.com/kubernetes/kubernetes/blob/d26b4ca2859aa667ad520fb9518e0db67b74216a/pkg/api/types.go#L447-L451) -object be changed to include the following additional field: - -`Type` - An optional string of `exists|file|device|socket|directory` - If not -set, it will default to a backwards-compatible default behavior described -below. - -| Value | Behavior | -|:------|:---------| -| *unset* | If nothing exists at the given path, an empty directory will be created there. 
Otherwise, behaves like `exists`. (This default behavior is referred to as `auto` below.) |
-| `exists` | If nothing exists at the given path, the pod will fail to run and provide an informative error message |
-| `file` | If a file does not exist at the given path, the pod will fail to run and provide an informative error message |
-| `device` | If a block or character device does not exist at the given path, the pod will fail to run and provide an informative error message |
-| `socket` | If a socket does not exist at the given path, the pod will fail to run and provide an informative error message |
-| `directory` | If a directory does not exist at the given path, the pod will fail to run and provide an informative error message |
-
-Additional possible values, which are proposed to be excluded:
-
-|Value | Behavior | Reason for exclusion |
-|:-----|:---------|:---------------------|
-| `new-directory` | Like `auto`, but the given path must be a directory if it exists | `auto` mostly fills this use-case |
-| `character-device` | | Granularity beyond `device` shouldn't matter often |
-| `block-device` | | Granularity beyond `device` shouldn't matter often |
-| `new-file` | Like `file`, but if nothing exists an empty file is created instead | In general, bindmounting the parent directory of the file you intend to create addresses this use case |
-| `optional` | If a path does not exist, then do not create any container-mount at all | This would better be handled by a new field entirely if this behavior is desirable |
-
-
-### Why not as part of any other volume types?
-
-This feature does not make sense for any of the other volume types simply
-because all of the other types are already fully qualified. For example, NFS
-volumes must already exist or they will not mount.
-Similarly, EmptyDir volumes will always exist as a directory.
-
-Only the HostVolume and SubPath means of referencing a path have the potential
-to reference arbitrarily incorrect or nonexistent things without erroring out.
-
-### Alternatives
-
-One alternative is to augment Host Volumes with a `MustExist` bool and provide
-no further granularity. This would allow toggling between the `auto` and
-`exists` behaviors described above. This would likely cover the "90%" use-case
-and would be a simpler API. It would be sufficient for all of the examples
-linked above, in my opinion.
-
-## Kubelet implementation
-
-It's proposed that prior to starting a pod, the Kubelet validates that the
-given path meets the qualifications of its type. Namely, if the type is `auto`
-the Kubelet will create an empty directory if none exists there, and for each
-of the other types the Kubelet will perform the given validation prior to running
-the pod. This validation might be done by a volume plugin, but further
-technical consideration (out of scope of this proposal) is needed.
-
-
-## Possible concerns
-
-### Permissions
-
-This proposal does not attempt to change the state of volume permissions. Currently, a HostPath volume is created with `root` ownership and `755` permissions. This behavior will be retained. An argument for this behavior is given [here](volumes.md#shared-storage-hostpath).
-
-### SELinux
-
-This proposal should not impact SELinux relabeling. Verifying the presence and
-type of a given path will be logically separate from SELinux labeling.
-Similarly, creating the directory when it doesn't exist will happen before any
-SELinux operations and should not impact it.
-
-
-### Containerized Kubelet
-
-A containerized kubelet would have difficulty creating directories.
The
-implementation will likely respect the `containerized` flag, or similar,
-allowing it to either break out or be "/rootfs/" aware and thus operate as
-desired.
-
-### Racy Validation
-
-Ideally the validation would be done at the time the bindmounts are created;
-otherwise it's possible for a given path or directory to change between when
-it's validated and when the container runtime attempts to create the mount.
-
-The only way to solve this problem is to integrate these sorts of qualifications
-into container runtimes themselves.
-
-I don't think this problem is severe enough that we need to push to solve it;
-rather I think we can simply accept this minor race, and if runtimes eventually
-support these checks we can begin to leverage them.
-
-
-
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/volume-hostpath-qualifiers.md?pixel)]()
-
+This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/volume-hostpath-qualifiers.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/volume-hostpath-qualifiers.md)
diff --git a/docs/proposals/volume-ownership-management.md b/docs/proposals/volume-ownership-management.md
index d08c491c8fa..143ac58837d 100644
--- a/docs/proposals/volume-ownership-management.md
+++ b/docs/proposals/volume-ownership-management.md
@@ -1,108 +1 @@
-## Volume plugins and idempotency
-
-Currently, volume plugins have a `SetUp` method which is called in the context of a higher-level
-workflow within the kubelet that has externalized the problem of managing the ownership of volumes.
-This design has a number of drawbacks that can be mitigated by completely internalizing all concerns
-of volume setup behind the volume plugin `SetUp` method.
-
-### Known issues with current externalized design
-
-1. The ownership management is currently repeatedly applied, which breaks packages that require
-   special permissions in order to work correctly
-2. There is a gap between files being mounted/created by volume plugins and when their ownership
-   is set correctly; race conditions exist around this
-3. Solving the correct application of ownership management in an externalized model is difficult
-   and makes it clear that a transaction boundary is being broken by the externalized design
-
-### Additional issues with externalization
-
-Fully externalizing any one concern of volumes is difficult for a number of reasons:
-
-1. Many types of idempotence checks exist, and are used in a variety of combinations and orders
-2. Workflow in the kubelet becomes much more complex to handle:
-   1. composition of plugins
-   2. correct timing of application of ownership management
-   3. callback to volume plugins when we know the whole `SetUp` flow is complete and correct
-   4. callback to touch sentinel files
-   5. etc.
-3. We want to support fully external volume plugins -- this would require complex orchestration and a
-   chatty remote API
-
-## Proposed implementation
-
-Since all of the ownership information is known in advance of the call to the volume plugin `SetUp`
-method, we can easily internalize these concerns into the volume plugins and pass the ownership
-information to `SetUp`.
-
-The volume `Builder` interface's `SetUp` method changes to accept the group that should own the
-volume. Plugins become responsible for ensuring that the correct group is applied. The volume
-`Attributes` struct can be modified to remove the `SupportsOwnershipManagement` field.
-
-```go
-package volume
-
-type Builder interface {
-    // other methods omitted
-
-    // SetUp prepares and mounts/unpacks the volume to a self-determined
-    // directory path and returns an error. The group ID that should own the volume
-    // is passed as a parameter. Plugins may choose to ignore the group ID directive
-    // in the event that they do not support it (example: NFS). A group ID of -1
-    // indicates that the group ownership of the volume should not be modified by the plugin.
-    //
-    // SetUp will be called multiple times and should be idempotent.
-    SetUp(gid int64) error
-}
-```
-
-Each volume plugin will have to change to support the new `SetUp` signature. The existing
-ownership management code will be refactored into a library that volume plugins can use:
-
-```go
-package volume
-
-func ManageOwnership(path string, fsGroup int64) error {
-    // 1. recursive chown of path
-    // 2. make path +setgid
-    return nil // implementation details omitted in this proposal
-}
-```
-
-The workflow from the Kubelet's perspective for handling volume setup and refresh becomes:
-
-```go
-// go-ish pseudocode
-func mountExternalVolumes(pod *api.Pod) error {
-    podVolumes := make(kubecontainer.VolumeMap)
-    for i := range pod.Spec.Volumes {
-        volSpec := &pod.Spec.Volumes[i]
-        // Default of -1 means "do not modify group ownership".
-        var fsGroup int64 = -1
-        if pod.Spec.SecurityContext != nil &&
-            pod.Spec.SecurityContext.FSGroup != nil {
-            fsGroup = *pod.Spec.SecurityContext.FSGroup
-        }
-
-        // Try to use a plugin for this volume.
-        plugin := volume.NewSpecFromVolume(volSpec)
-        builder, err := kl.newVolumeBuilderFromPlugins(plugin, pod)
-        if err != nil {
-            return err
-        }
-        if builder == nil {
-            return errUnsupportedVolumeType
-        }
-
-        if err = builder.SetUp(fsGroup); err != nil {
-            return err
-        }
-    }
-
-    return nil
-}
-```
-
-
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/volume-ownership-management.md?pixel)]()
-
+This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/volume-ownership-management.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/volume-ownership-management.md)
diff --git a/docs/proposals/volume-provisioning.md b/docs/proposals/volume-provisioning.md
index f8202fbef5c..b28f68dd9e2 100644
--- a/docs/proposals/volume-provisioning.md
+++ b/docs/proposals/volume-provisioning.md
@@ -1,500 +1 @@
-## Abstract
-
-Real Kubernetes clusters have a variety of volumes which differ widely in
-size, iops performance, retention policy, and other characteristics.
-Administrators need a way to dynamically provision volumes of these different
-types to automatically meet user demand.
-
-A new mechanism called 'storage classes' is proposed to provide this
-capability.
-
-## Motivation
-
-In Kubernetes 1.2, an alpha form of limited dynamic provisioning was added
-that allows a single volume type to be provisioned in clouds that offer
-special volume types.
-
-In Kubernetes 1.3, a label selector was added to persistent volume claims to
-allow administrators to create a taxonomy of volumes based on the
-characteristics important to them, and to allow users to make claims on those
-volumes based on those characteristics. This allows flexibility when claiming
-existing volumes; the same flexibility is needed when dynamically provisioning
-volumes.
- -After gaining experience with dynamic provisioning after the 1.2 release, we -want to create a more flexible feature that allows configuration of how -different storage classes are provisioned and supports provisioning multiple -types of volumes within a single cloud. - -### Out-of-tree provisioners - -One of our goals is to enable administrators to create out-of-tree -provisioners, that is, provisioners whose code does not live in the Kubernetes -project. - -## Design - -This design represents the minimally viable changes required to provision based on storage class configuration. Additional incremental features may be added as a separate effort. - -We propose that: - -1. Both for in-tree and out-of-tree storage provisioners, the PV created by the - provisioners must match the PVC that led to its creations. If a provisioner - is unable to provision such a matching PV, it reports an error to the - user. - -2. The above point applies also to PVC label selector. If user submits a PVC - with a label selector, the provisioner must provision a PV with matching - labels. This directly implies that the provisioner understands meaning - behind these labels - if user submits a claim with selector that wants - a PV with label "region" not in "[east,west]", the provisioner must - understand what label "region" means, what available regions are there and - choose e.g. "north". - - In other words, provisioners should either refuse to provision a volume for - a PVC that has a selector, or select few labels that are allowed in - selectors (such as the "region" example above), implement necessary logic - for their parsing, document them and refuse any selector that references - unknown labels. - -3. An api object will be incubated in storage.k8s.io/v1beta1 to hold the a `StorageClass` - API resource. Each StorageClass object contains parameters required by the provisioner to provision volumes of that class. These parameters are opaque to the user. - -4. `PersistentVolume.Spec.Class` attribute is added to volumes. This attribute - is optional and specifies which `StorageClass` instance represents - storage characteristics of a particular PV. - - During incubation, `Class` is an annotation and not - actual attribute. - -5. `PersistentVolume` instances do not require labels by the provisioner. - -6. `PersistentVolumeClaim.Spec.Class` attribute is added to claims. This - attribute specifies that only a volume with equal - `PersistentVolume.Spec.Class` value can satisfy a claim. - - During incubation, `Class` is just an annotation and not - actual attribute. - -7. The existing provisioner plugin implementations be modified to accept - parameters as specified via `StorageClass`. - -8. The persistent volume controller modified to invoke provisioners using `StorageClass` configuration and bind claims with `PersistentVolumeClaim.Spec.Class` to volumes with equivalent `PersistentVolume.Spec.Class` - -9. The existing alpha dynamic provisioning feature be phased out in the - next release. - -### Controller workflow for provisioning volumes - -0. Kubernetes administator can configure name of a default StorageClass. This - StorageClass instance is then used when user requests a dynamically - provisioned volume, but does not specify a StorageClass. In other words, - `claim.Spec.Class == ""` - (or annotation `volume.beta.kubernetes.io/storage-class == ""`). - -1. When a new claim is submitted, the controller attempts to find an existing - volume that will fulfill the claim. - - 1. 
If the claim has non-empty `claim.Spec.Class`, only PVs with the same - `pv.Spec.Class` are considered. - - 2. If the claim has empty `claim.Spec.Class`, only PVs with an unset `pv.Spec.Class` are considered. - - All "considered" volumes are evaluated and the - smallest matching volume is bound to the claim. - -2. If no volume is found for the claim and `claim.Spec.Class` is not set or is - empty string dynamic provisioning is disabled. - -3. If `claim.Spec.Class` is set the controller tries to find instance of StorageClass with this name. If no - such StorageClass is found, the controller goes back to step 1. and - periodically retries finding a matching volume or storage class again until - a match is found. The claim is `Pending` during this period. - -4. With StorageClass instance, the controller updates the claim: - * `claim.Annotations["volume.beta.kubernetes.io/storage-provisioner"] = storageClass.Provisioner` - -* **In-tree provisioning** - - The controller tries to find an internal volume plugin referenced by - `storageClass.Provisioner`. If it is found: - - 5. The internal provisioner implements interface`ProvisionableVolumePlugin`, - which has a method called `NewProvisioner` that returns a new provisioner. - - 6. The controller calls volume plugin `Provision` with Parameters - from the `StorageClass` configuration object. - - 7. If `Provision` returns an error, the controller generates an event on the - claim and goes back to step 1., i.e. it will retry provisioning - periodically. - - 8. If `Provision` returns no error, the controller creates the returned - `api.PersistentVolume`, fills its `Class` attribute with `claim.Spec.Class` - and makes it already bound to the claim - - 1. If the create operation for the `api.PersistentVolume` fails, it is - retried - - 2. If the create operation does not succeed in reasonable time, the - controller attempts to delete the provisioned volume and creates an event - on the claim - -Existing behavior is unchanged for claims that do not specify -`claim.Spec.Class`. - -* **Out of tree provisioning** - - Following step 4. above, the controller tries to find internal plugin for the - `StorageClass`. If it is not found, it does not do anything, it just - periodically goes to step 1., i.e. tries to find available matching PV. - - The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", - "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be - interpreted as described in RFC 2119. - - External provisioner must have these features: - - * It MUST have a distinct name, following Kubernetenes plugin naming scheme - `/`, e.g. `gluster.org/gluster-volume`. - - * The provisioner SHOULD send events on a claim to report any errors - related to provisioning a volume for the claim. This way, users get the same - experience as with internal provisioners. - - * The provisioner MUST implement also a deleter. It must be able to delete - storage assets it created. It MUST NOT assume that any other internal or - external plugin is present. - - The external provisioner runs in a separate process which watches claims, be - it an external storage appliance, a daemon or a Kubernetes pod. For every - claim creation or update, it implements these steps: - - 1. The provisioner inspects if - `claim.Annotations["volume.beta.kubernetes.io/storage-provisioner"] == `. - All other claims MUST be ignored. - - 2. The provisioner MUST check that the claim is unbound, i.e. its - `claim.Spec.VolumeName` is empty. 
Bound volumes MUST be ignored. - - *Race condition when the provisioner provisions a new PV for a claim and - at the same time Kubernetes binds the same claim to another PV that was - just created by admin is discussed below.* - - 3. It tries to find a StorageClass instance referenced by annotation - `claim.Annotations["volume.beta.kubernetes.io/storage-class"]`. If not - found, it SHOULD report an error (by sending an event to the claim) and it - SHOULD retry periodically with step i. - - 4. The provisioner MUST parse arguments in the `StorageClass` and - `claim.Spec.Selector` and provisions appropriate storage asset that matches - both the parameters and the selector. - When it encounters unknown parameters in `storageClass.Parameters` or - `claim.Spec.Selector` or the combination of these parameters is impossible - to achieve, it SHOULD report an error and it MUST NOT provision a volume. - All errors found during parsing or provisioning SHOULD be send as events - on the claim and the provisioner SHOULD retry periodically with step i. - - As parsing (and understanding) claim selectors is hard, the sentence - "MUST parse ... `claim.Spec.Selector`" will in typical case lead to simple - refusal of claims that have any selector: - - ```go - if pvc.Spec.Selector != nil { - return Error("can't parse PVC selector!") - } - ``` - - 5. When the volume is provisioned, the provisioner MUST create a new PV - representing the storage asset and save it in Kubernetes. When this fails, - it SHOULD retry creating the PV again few times. If all attempts fail, it - MUST delete the storage asset. All errors SHOULD be sent as events to the - claim. - - The created PV MUST have these properties: - - * `pv.Spec.ClaimRef` MUST point to the claim that led to its creation - (including the claim UID). - - *This way, the PV will be bound to the claim.* - - * `pv.Annotations["pv.kubernetes.io/provisioned-by"]` MUST be set to name - of the external provisioner. This provisioner will be used to delete the - volume. - - *The provisioner/delete should not assume there is any other - provisioner/deleter available that would delete the volume.* - - * `pv.Annotations["volume.beta.kubernetes.io/storage-class"]` MUST be set - to name of the storage class requested by the claim. - - *So the created PV matches the claim.* - - * The provisioner MAY store any other information to the created PV as - annotations. It SHOULD save any information that is needed to delete the - storage asset there, as appropriate StorageClass instance may not exist - when the volume will be deleted. However, references to Secret instance - or direct username/password to a remote storage appliance MUST NOT be - stored there, see issue #34822. - - * `pv.Labels` MUST be set to match `claim.spec.selector`. The provisioner - MAY add additional labels. - - *So the created PV matches the claim.* - - * `pv.Spec` MUST be set to match requirements in `claim.Spec`, especially - access mode and PV size. The provisioned volume size MUST NOT be smaller - than size requested in the claim, however it MAY be larger. - - *So the created PV matches the claim.* - - * `pv.Spec.PersistentVolumeSource` MUST be set to point to the created - storage asset. - - * `pv.Spec.PersistentVolumeReclaimPolicy` SHOULD be set to `Delete` unless - user manually configures other reclaim policy. - - * `pv.Name` MUST be unique. 
Internal provisioners use name based on - `claim.UID` to produce conflicts when two provisioners accidentally - provision a PV for the same claim, however external provisioners can use - any mechanism to generate an unique PV name. - - Example of a claim that is to be provisioned by an external provisioner for - `foo.org/foo-volume`: - - ```yaml - apiVersion: v1 - kind: PersistentVolumeClaim - metadata: - annotations: - volume.beta.kubernetes.io/storage-class: myClass - volume.beta.kubernetes.io/storage-provisioner: foo.org/foo-volume - name: fooclaim - namespace: default - resourceVersion: "53" - uid: 5a294561-7e5b-11e6-a20e-0eb6048532a3 - spec: - accessModes: - - ReadWriteOnce - resources: - requests: - storage: 4Gi - # volumeName: must be empty! - ``` - - Example of the created PV: - - ```yaml - apiVersion: v1 - kind: PersistentVolume - metadata: - annotations: - pv.kubernetes.io/provisioned-by: foo.org/foo-volume - volume.beta.kubernetes.io/storage-class: myClass - foo.org/provisioner: "any other annotations as needed" - labels: - foo.org/my-label: "any labels as needed" - generateName: "foo-volume-" - spec: - accessModes: - - ReadWriteOnce - awsElasticBlockStore: - fsType: ext4 - volumeID: aws://us-east-1d/vol-de401a79 - capacity: - storage: 4Gi - claimRef: - apiVersion: v1 - kind: PersistentVolumeClaim - name: fooclaim - namespace: default - resourceVersion: "53" - uid: 5a294561-7e5b-11e6-a20e-0eb6048532a3 - persistentVolumeReclaimPolicy: Delete - ``` - - As result, Kubernetes has a PV that represents the storage asset and is bound - to the claim. When everything went well, Kubernetes completed binding of the - claim to the PV. - - Kubernetes was not blocked in any way during the provisioning and could - either bound the claim to another PV that was created by user or even the - claim may have been deleted by the user. In both cases, Kubernetes will mark - the PV to be delete using the protocol below. - - The external provisioner MAY save any annotations to the claim that is - provisioned, however the claim may be modified or even deleted by the user at - any time. - - -### Controller workflow for deleting volumes - -When the controller decides that a volume should be deleted it performs these -steps: - -1. The controller changes `pv.Status.Phase` to `Released`. - -2. The controller looks for `pv.Annotations["pv.kubernetes.io/provisioned-by"]`. - If found, it uses this provisioner/deleter to delete the volume. - -3. If the volume is not annotated by `pv.kubernetes.io/provisioned-by`, the - controller inspects `pv.Spec` and finds in-tree deleter for the volume. - -4. If the deleter found by steps 2. or 3. is internal, it calls it and deletes - the storage asset together with the PV that represents it. - -5. If the deleter is not known to Kubernetes, it does not do anything. - -6. External deleters MUST watch for PV changes. When - `pv.Status.Phase == Released && pv.Annotations['pv.kubernetes.io/provisioned-by'] == `, - the deleter: - - * It MUST check reclaim policy of the PV and ignore all PVs whose - `Spec.PersistentVolumeReclaimPolicy` is not `Delete`. - - * It MUST delete the storage asset. - - * Only after the storage asset was successfully deleted, it MUST delete the - PV object in Kubernetes. - - * Any error SHOULD be sent as an event on the PV being deleted and the - deleter SHOULD retry to delete the volume periodically. - - * The deleter SHOULD NOT use any information from StorageClass instance - referenced by the PV. 
This is different to internal deleters, which - need to be StorageClass instance present at the time of deletion to read - Secret instances (see Gluster provisioner for example), however we would - like to phase out this behavior. - - Note that watching `pv.Status` has been frowned upon in the past, however in - this particular case we could use it quite reliably to trigger deletion. - It's not trivial to find out if a PV is not needed and should be deleted. - *Alternatively, an annotation could be used.* - -### Security considerations - -Both internal and external provisioners and deleters may need access to -credentials (e.g. username+password) of an external storage appliance to -provision and delete volumes. - -* For internal provisioners, a Secret instance in a well secured namespace -should be used. Pointer to the Secret instance shall be parameter of the -StorageClass and it MUST NOT be copied around the system e.g. in annotations -of PVs. See issue #34822. - -* External provisioners running in pod should have appropriate credentials -mouted as Secret inside pods that run the provisioner. Namespace with the pods -and Secret instance should be well secured. - -### `StorageClass` API - -A new API group should hold the API for storage classes, following the pattern -of autoscaling, metrics, etc. To allow for future storage-related APIs, we -should call this new API group `storage.k8s.io` and incubate in storage.k8s.io/v1beta1. - -Storage classes will be represented by an API object called `StorageClass`: - -```go -package storage - -// StorageClass describes the parameters for a class of storage for -// which PersistentVolumes can be dynamically provisioned. -// -// StorageClasses are non-namespaced; the name of the storage class -// according to etcd is in ObjectMeta.Name. -type StorageClass struct { - unversioned.TypeMeta `json:",inline"` - ObjectMeta `json:"metadata,omitempty"` - - // Provisioner indicates the type of the provisioner. - Provisioner string `json:"provisioner,omitempty"` - - // Parameters for dynamic volume provisioner. - Parameters map[string]string `json:"parameters,omitempty"` -} - -``` - -`PersistentVolumeClaimSpec` and `PersistentVolumeSpec` both get Class attribute -(the existing annotation is used during incubation): - -```go -type PersistentVolumeClaimSpec struct { - // Name of requested storage class. If non-empty, only PVs with this - // pv.Spec.Class will be considered for binding and if no such PV is - // available, StorageClass with this name will be used to dynamically - // provision the volume. - Class string -... -} - -type PersistentVolumeSpec struct { - // Name of StorageClass instance that this volume belongs to. - Class string -... -} -``` - -Storage classes are natural to think of as a global resource, since they: - -1. Align with PersistentVolumes, which are a global resource -2. Are administrator controlled - -### Provisioning configuration - -With the scheme outlined above the provisioner creates PVs using parameters specified in the `StorageClass` object. - -### Provisioner interface changes - -`struct volume.VolumeOptions` (containing parameters for a provisioner plugin) -will be extended to contain StorageClass.Parameters. - -The existing provisioner implementations will be modified to accept the StorageClass configuration object. - -### PV Controller Changes - -The persistent volume controller will be modified to implement the new -workflow described in this proposal. 
The changes will be limited to the -`provisionClaimOperation` method, which is responsible for invoking the -provisioner and to favor existing volumes before provisioning a new one. - -## Examples - -### AWS provisioners with distinct QoS - -This example shows two storage classes, "aws-fast" and "aws-slow". - -``` -apiVersion: v1 -kind: StorageClass -metadata: - name: aws-fast -provisioner: kubernetes.io/aws-ebs -parameters: - zone: us-east-1b - type: ssd - - -apiVersion: v1 -kind: StorageClass -metadata: - name: aws-slow -provisioner: kubernetes.io/aws-ebs -parameters: - zone: us-east-1b - type: spinning -``` - -# Additional Implementation Details - -0. Annotation `volume.alpha.kubernetes.io/storage-class` is used instead of `claim.Spec.Class` and `volume.Spec.Class` during incubation. - -1. `claim.Spec.Selector` and `claim.Spec.Class` are mutually exclusive for now (1.4). User can either match existing volumes with `Selector` XOR match existing volumes with `Class` and get dynamic provisioning by using `Class`. This simplifies initial PR and also provisioners. This limitation may be lifted in future releases. - -# Cloud Providers - -Since the `volume.alpha.kubernetes.io/storage-class` is in use a `StorageClass` must be defined to support provisioning. No default is assumed as before. - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/volume-provisioning.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/volume-provisioning.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/volume-provisioning.md) diff --git a/docs/proposals/volume-selectors.md b/docs/proposals/volume-selectors.md index c1915f99c2b..d32a9ad2b25 100644 --- a/docs/proposals/volume-selectors.md +++ b/docs/proposals/volume-selectors.md @@ -1,268 +1 @@ -## Abstract - -Real Kubernetes clusters have a variety of volumes which differ widely in -size, iops performance, retention policy, and other characteristics. A -mechanism is needed to enable administrators to describe the taxonomy of these -volumes, and for users to make claims on these volumes based on their -attributes within this taxonomy. - -A label selector mechanism is proposed to enable flexible selection of volumes -by persistent volume claims. - -## Motivation - -Currently, users of persistent volumes have the ability to make claims on -those volumes based on some criteria such as the access modes the volume -supports and minimum resources offered by a volume. In an organization, there -are often more complex requirements for the storage volumes needed by -different groups of users. A mechanism is needed to model these different -types of volumes and to allow users to select those different types without -being intimately familiar with their underlying characteristics. - -As an example, many cloud providers offer a range of performance -characteristics for storage, with higher performing storage being more -expensive. Cluster administrators want the ability to: - -1. Invent a taxonomy of logical storage classes using the attributes - important to them -2. Allow users to make claims on volumes using these attributes - -## Constraints and Assumptions - -The proposed design should: - -1. Deal with manually-created volumes -2. 
Not necessarily require users to know or understand the differences between - volumes (ie, Kubernetes should not dictate any particular set of - characteristics to administrators to think in terms of) - -We will focus **only** on the barest mechanisms to describe and implement -label selectors in this proposal. We will address the following topics in -future proposals: - -1. An extension resource or third party resource for storage classes -1. Dynamically provisioning new volumes for based on storage class - -## Use Cases - -1. As a user, I want to be able to make a claim on a persistent volume by - specifying a label selector as well as the currently available attributes - -### Use Case: Taxonomy of Persistent Volumes - -Kubernetes offers volume types for a variety of storage systems. Within each -of those storage systems, there are numerous ways in which volume instances -may differ from one another: iops performance, retention policy, etc. -Administrators of real clusters typically need to manage a variety of -different volumes with different characteristics for different groups of -users. - -Kubernetes should make it possible for administrators to flexibly model the -taxonomy of volumes in their clusters and to label volumes with their storage -class. This capability must be optional and fully backward-compatible with -the existing API. - -Let's look at an example. This example is *purely fictitious* and the -taxonomies presented here are not a suggestion of any sort. In the case of -AWS EBS there are four different types of volume (in ascending order of cost): - -1. Cold HDD -2. Throughput optimized HDD -3. General purpose SSD -4. Provisioned IOPS SSD - -Currently, there is no way to distinguish between a group of 4 PVs where each -volume is of one of these different types. Administrators need the ability to -distinguish between instances of these types. An administrator might decide -to think of these volumes as follows: - -1. Cold HDD - `tin` -2. Throughput optimized HDD - `bronze` -3. General purpose SSD - `silver` -4. Provisioned IOPS SSD - `gold` - -This is not the only dimension that EBS volumes can differ in. Let's simplify -things and imagine that AWS has two availability zones, `east` and `west`. Our -administrators want to differentiate between volumes of the same type in these -two zones, so they create a taxonomy of volumes like so: - -1. `tin-west` -2. `tin-east` -3. `bronze-west` -4. `bronze-east` -5. `silver-west` -6. `silver-east` -7. `gold-west` -8. `gold-east` - -Another administrator of the same cluster might label things differently, -choosing to focus on the business role of volumes. Say that the data -warehouse department is the sole consumer of the cold HDD type, and the DB as -a service offering is the sole consumer of provisioned IOPS volumes. The -administrator might decide on the following taxonomy of volumes: - -1. `warehouse-east` -2. `warehouse-west` -3. `dbaas-east` -4. `dbaas-west` - -There are any number of ways an administrator may choose to distinguish -between volumes. Labels are used in Kubernetes to express the user-defined -properties of API objects and are a good fit to express this information for -volumes. In the examples above, administrators might differentiate between -the classes of volumes using the labels `business-unit`, `volume-type`, or -`region`. - -Label selectors are used through the Kubernetes API to describe relationships -between API objects using flexible, user-defined criteria. 
It makes sense to -use the same mechanism with persistent volumes and storage claims to provide -the same functionality for these API objects. - -## Proposed Design - -We propose that: - -1. A new field called `Selector` be added to the `PersistentVolumeClaimSpec` - type -2. The persistent volume controller be modified to account for this selector - when determining the volume to bind to a claim - -### Persistent Volume Selector - -Label selectors are used throughout the API to allow users to express -relationships in a flexible manner. The problem of selecting a volume to -match a claim fits perfectly within this metaphor. Adding a label selector to -`PersistentVolumeClaimSpec` will allow users to label their volumes with -criteria important to them and select volumes based on these criteria. - -```go -// PersistentVolumeClaimSpec describes the common attributes of storage devices -// and allows a Source for provider-specific attributes -type PersistentVolumeClaimSpec struct { - // Contains the types of access modes required - AccessModes []PersistentVolumeAccessMode `json:"accessModes,omitempty"` - // Selector is a selector which must be true for the claim to bind to a volume - Selector *unversioned.Selector `json:"selector,omitempty"` - // Resources represents the minimum resources required - Resources ResourceRequirements `json:"resources,omitempty"` - // VolumeName is the binding reference to the PersistentVolume backing this claim - VolumeName string `json:"volumeName,omitempty"` -} -``` - -### Labeling volumes - -Volumes can already be labeled: - -```yaml -apiVersion: v1 -kind: PersistentVolume -metadata: - name: ebs-pv-1 - labels: - ebs-volume-type: iops - aws-availability-zone: us-east-1 -spec: - capacity: - storage: 100Gi - accessModes: - - ReadWriteMany - persistentVolumeReclaimPolicy: Retain - awsElasticBlockStore: - volumeID: vol-12345 - fsType: xfs -``` - -### Controller Changes - -At the time of this writing, the various controllers for persistent volumes -are in the process of being refactored into a single controller (see -[kubernetes/24331](https://github.com/kubernetes/kubernetes/pull/24331)). - -The resulting controller should be modified to use the new -`selector` field to match a claim to a volume. In order to -match to a volume, all criteria must be satisfied; ie, if a label selector is -specified on a claim, a volume must match both the label selector and any -specified access modes and resource requirements to be considered a match. 
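-
-To make that matching rule concrete, here is a minimal sketch of the predicate using simplified stand-in types rather
-than the real API structs. It is illustrative only: it handles just the `matchLabels` form of a selector, while full
-label selectors also support expression-based requirements.
-
-```go
-package main
-
-import "fmt"
-
-// Simplified stand-ins for the API types; illustrative only.
-type Claim struct {
-    Selector    map[string]string // matchLabels portion of the proposed selector
-    AccessModes []string
-    RequestGi   int
-}
-
-type Volume struct {
-    Labels      map[string]string
-    AccessModes []string
-    CapacityGi  int
-}
-
-// matches applies the rule described above: every criterion the claim
-// specifies must be satisfied by the volume.
-func matches(c Claim, v Volume) bool {
-    for k, want := range c.Selector {
-        if v.Labels[k] != want {
-            return false
-        }
-    }
-    for _, mode := range c.AccessModes {
-        found := false
-        for _, m := range v.AccessModes {
-            if m == mode {
-                found = true
-                break
-            }
-        }
-        if !found {
-            return false
-        }
-    }
-    return v.CapacityGi >= c.RequestGi
-}
-
-func main() {
-    claim := Claim{
-        Selector:    map[string]string{"ebs-volume-type": "iops-ssd", "aws-availability-zone": "us-west-1"},
-        AccessModes: []string{"ReadWriteMany"},
-        RequestGi:   1,
-    }
-    vol := Volume{
-        Labels:      map[string]string{"ebs-volume-type": "iops-ssd", "aws-availability-zone": "us-west-1"},
-        AccessModes: []string{"ReadWriteMany"},
-        CapacityGi:  150,
-    }
-    fmt.Println(matches(claim, vol)) // true
-}
-```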
-
-## Examples
-
-Let's take a look at a few examples, revisiting the taxonomy of EBS volumes and regions.
-
-Volumes of the different types might be labeled as follows:
-
-```yaml
-apiVersion: v1
-kind: PersistentVolume
-metadata:
-  name: ebs-pv-west
-  labels:
-    ebs-volume-type: iops-ssd
-    aws-availability-zone: us-west-1
-spec:
-  capacity:
-    storage: 150Gi
-  accessModes:
-    - ReadWriteMany
-  persistentVolumeReclaimPolicy: Retain
-  awsElasticBlockStore:
-    volumeID: vol-23456
-    fsType: xfs
----
-apiVersion: v1
-kind: PersistentVolume
-metadata:
-  name: ebs-pv-east
-  labels:
-    ebs-volume-type: gp-ssd
-    aws-availability-zone: us-east-1
-spec:
-  capacity:
-    storage: 150Gi
-  accessModes:
-    - ReadWriteMany
-  persistentVolumeReclaimPolicy: Retain
-  awsElasticBlockStore:
-    volumeID: vol-34567
-    fsType: xfs
-```
-
-...and claims on these volumes would look like:
-
-```yaml
-kind: PersistentVolumeClaim
-apiVersion: v1
-metadata:
-  name: ebs-claim-west
-spec:
-  accessModes:
-    - ReadWriteMany
-  resources:
-    requests:
-      storage: 1Gi
-  selector:
-    matchLabels:
-      ebs-volume-type: iops-ssd
-      aws-availability-zone: us-west-1
----
-kind: PersistentVolumeClaim
-apiVersion: v1
-metadata:
-  name: ebs-claim-east
-spec:
-  accessModes:
-    - ReadWriteMany
-  resources:
-    requests:
-      storage: 1Gi
-  selector:
-    matchLabels:
-      ebs-volume-type: gp-ssd
-      aws-availability-zone: us-east-1
-```
-
-
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/volume-selectors.md?pixel)]()
-
+This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/volume-selectors.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/volume-selectors.md)
diff --git a/docs/proposals/volumes.md b/docs/proposals/volumes.md
index 874dc2af965..c36adc37609 100644
--- a/docs/proposals/volumes.md
+++ b/docs/proposals/volumes.md
@@ -1,482 +1 @@
-## Abstract
-
-A proposal for sharing volumes between containers in a pod using a special supplemental group.
-
-## Motivation
-
-Kubernetes volumes should be usable regardless of the UID a container runs as. This concern cuts
-across all volume types, so the system should handle ownership and permissions in a generalized way
-to provide uniform functionality across all volume types and lower the barrier to new plugins.
-
-Goals of this design:
-
-1. Enumerate the different use-cases for volume usage in pods
-2. Define the desired goal state for ownership and permission management in Kubernetes
-3. Describe the changes necessary to achieve the desired state
-
-## Constraints and Assumptions
-
-1. When writing permissions in this proposal, `D` represents a don't-care value; example: `07D0`
-   represents permissions where the owner has `7` permissions, the group has a don't-care value,
-   and all others have `0` permissions
-2. Read-write usability of a volume from a container is defined as one of:
-   1. The volume is owned by the container's effective UID and has permissions `07D0`
-   2. The volume is owned by the container's effective GID or one of its supplemental groups and
-      has permissions `0D70`
-3. Volume plugins should not have to handle setting permissions on volumes
-4. Preventing two containers within a pod from reading and writing to the same volume (by choosing
-   different container UIDs) is not something we intend to support today
-5. We will not design to support multiple processes running in a single container as different
-   UIDs; use cases that require work by different UIDs should be divided into different pods for
-   each UID
-
-## Current State Overview
-
-### Kubernetes
-
-Kubernetes volumes can be divided into two broad categories:
-
-1. Unshared storage:
-   1. Volumes created by the kubelet in a directory on the host: empty directory, git repo, secret,
-      downward API. All volumes in this category delegate to `EmptyDir` for their underlying
-      storage. These volumes are created with ownership `root:root`.
-   2. Volumes based on network block devices: AWS EBS, iSCSI, RBD, etc., *when used exclusively
-      by a single pod*.
-2. Shared storage:
-   1. `hostPath` is shared storage because it is necessarily used by a container and the host
-   2. Network file systems such as NFS, Glusterfs, Cephfs, etc. For these volumes, the ownership
-      is determined by the configuration of the shared storage system.
-   3. Block device based volumes in `ReadOnlyMany` or `ReadWriteMany` modes are shared because
-      they may be used simultaneously by multiple pods.
-
-The `EmptyDir` volume was recently modified to create the volume directory with `0777` permissions
-instead of `0750`, to support basic usability of that volume by a non-root UID.
-
-### Docker
-
-Docker recently added supplemental group support. This adds the ability to specify additional
-groups that a container should be part of, and will be released with Docker 1.8.
-
-There is a [proposal](https://github.com/docker/docker/pull/14632) to add a bind-mount flag to tell
-Docker to change the ownership of a volume to the effective UID and GID of a container, but this has
-not yet been accepted.
-
-### rkt
-
-rkt
-[image manifests](https://github.com/appc/spec/blob/master/spec/aci.md#image-manifest-schema) can
-specify users and groups, similarly to how a Docker image can. A rkt
-[pod manifest](https://github.com/appc/spec/blob/master/spec/pods.md#pod-manifest-schema) can also
-override the default user and group specified by the image manifest.
-
-rkt does not currently support supplemental groups or changing the owning UID or
-group of a volume, but it has been [requested](https://github.com/coreos/rkt/issues/1309).
-
-## Use Cases
-
-1. As a user, I want the system to set ownership and permissions on volumes correctly to enable
-   reads and writes in the following scenarios:
-   1. All containers running as root
-   2. All containers running as the same non-root user
-   3. Multiple containers running as a mix of root and non-root users
-
-### All containers running as root
-
-For volumes that only need to be used by root, no action needs to be taken to change ownership or
-permissions, but setting the ownership based on the supplemental group shared by all containers in a
-pod will also work. For situations where read-only access to a shared volume is required from one
-or more containers, the `VolumeMount`s in those containers should have the `readOnly` field set.
-
-### All containers running as a single non-root user
-
-In use cases where a volume is used by a single non-root UID, the volume ownership and permissions
-should be set to enable read/write access.
-
-Currently, a non-root UID will not have permissions to write to any but an `EmptyDir` volume.
-Today, users that need this case to work can:
-
-1. Grant the container the necessary capabilities to `chown` and `chmod` the volume:
-   - `CAP_FOWNER`
-   - `CAP_CHOWN`
-   - `CAP_DAC_OVERRIDE`
-2. Run a wrapper script that runs `chown` and `chmod` commands to set the desired ownership and
-   permissions on the volume before starting the main process
-
-This workaround has significant drawbacks:
-
-1. It grants powerful kernel capabilities to the code in the image and thus is insecure, defeating
-   the purpose of running containers as non-root users
-2. The user experience is poor; it requires changing the Dockerfile, adding a layer, or modifying
-   the container's command
-
-Some cluster operators manage the ownership of shared storage volumes on the server side.
-In this scenario, the UID of the container using the volume is known in advance. The ownership of
-the volume is set to match the container's UID on the server side.
-
-### Containers running as a mix of root and non-root users
-
-If the list of UIDs that need to use a volume includes both root and non-root users, supplemental
-groups can be applied to enable sharing volumes between containers. Setting the ownership to
-`root:<supplemental group>` and the permissions to `2770` will make a volume usable both from
-containers running as root and from containers running as a non-root UID with that supplemental
-group. The setgid bit is used to ensure that files created in the volume will inherit the owning
-GID of the volume.
-
-## Community Design Discussion
-
-- [kubernetes/2630](https://github.com/kubernetes/kubernetes/issues/2630)
-- [kubernetes/11319](https://github.com/kubernetes/kubernetes/issues/11319)
-- [kubernetes/9384](https://github.com/kubernetes/kubernetes/pull/9384)
-
-## Analysis
-
-The system needs to be able to:
-
-1. Model correctly which volumes require ownership management
-1. Determine the correct ownership of each volume in a pod if required
-1. Set the ownership and permissions on volumes when required
-
-### Modeling whether a volume requires ownership management
-
-#### Unshared storage: volumes derived from `EmptyDir`
-
-Since Kubernetes creates `EmptyDir` volumes, it should ensure the ownership is set to enable the
-volumes to be usable in all of the above scenarios.
-
-#### Unshared storage: network block devices
-
-Volume plugins based on network block devices such as AWS EBS and RBD can be treated the same way
-as local volumes. Since inodes are written to these block devices in the same way as for `EmptyDir`
-volumes, permissions and ownership can be managed on the client side by the Kubelet when the volume
-is used exclusively by one pod. When these volumes are used outside of a persistent volume, or with
-the `ReadWriteOnce` mode, they are effectively unshared storage.
-
-When used by multiple pods, there are many additional use-cases to analyze before we can be
-confident that we can support ownership management robustly with these file systems. The right
-design is one that makes it easy to experiment and develop support for ownership management with
-volume plugins, so that developers and cluster operators can continue exploring these issues.
-
-#### Shared storage: hostPath
-
-The `hostPath` volume should only be used by effective-root users, and the permissions of paths
-exposed into containers via `hostPath` volumes should always be managed by the cluster operator. If
-the Kubelet managed the ownership of `hostPath` volumes, a user who could create a `hostPath` volume
-could change the ownership and permissions of arbitrary paths within the host's filesystem. This
-would be a severe security risk, so we will treat `hostPath` as a corner case for which the kubelet
-never performs ownership management.
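-
-Before turning to shared storage, here is a minimal sketch of the `root:<group>`/`2770` layout
-described in the mixed root and non-root case above, which is also roughly what a cluster operator
-does by hand today for externally managed storage. The path and GID are arbitrary examples, the
-`chown` call requires sufficient privileges, and this is not part of the proposed Kubelet changes.
-
-```go
-package main
-
-import (
-    "log"
-    "os"
-)
-
-func main() {
-    const (
-        volumePath = "/tmp/example-shared-volume" // arbitrary example path
-        fsGroup    = 1001                         // example supplemental group
-    )
-
-    if err := os.MkdirAll(volumePath, 0770); err != nil {
-        log.Fatal(err)
-    }
-
-    // Owner stays root (UID 0); the group is set to the pod-level supplemental group.
-    if err := os.Chown(volumePath, 0, fsGroup); err != nil {
-        log.Fatal(err)
-    }
-
-    // 2770: rwx for owner and group, no access for others, plus the setgid bit so
-    // that files created inside the volume inherit the owning GID.
-    if err := os.Chmod(volumePath, 0770|os.ModeSetgid); err != nil {
-        log.Fatal(err)
-    }
-}
-```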
- -#### Shared storage - -Ownership management of shared storage is a complex topic. Ownership for existing shared storage -will be managed externally from Kubernetes. For this case, our API should make it simple to express -whether a particular volume should have these concerns managed by Kubernetes. - -We will not attempt to address the ownership and permissions concerns of new shared storage -in this proposal. - -When a network block device is used as a persistent volume in `ReadWriteMany` or `ReadOnlyMany` -modes, it is shared storage, and thus outside the scope of this proposal. - -#### Plugin API requirements - -From the above, we know that some volume plugins will 'want' ownership management from the Kubelet -and others will not. Plugins should be able to opt in to ownership management from the Kubelet. To -facilitate this, there should be a method added to the `volume.Plugin` interface that the Kubelet -uses to determine whether to perform ownership management for a volume. - -### Determining correct ownership of a volume - -Using the approach of a pod-level supplemental group to own volumes solves the problem in any of the -cases of UID/GID combinations within a pod. Since this is the simplest approach that handles all -use-cases, our solution will be made in terms of it. - -Eventually, Kubernetes should allocate a unique group for each pod so that a pod's volumes are -usable by that pod's containers, but not by containers of another pod. The supplemental group used -to share volumes must be unique in a multitenant cluster. If uniqueness is enforced at the host -level, pods from one host may be able to use shared filesystems meant for pods on another host. - -Eventually, Kubernetes should integrate with external identity management systems to populate pod -specs with the right supplemental groups necessary to use shared volumes. In the interim until the -identity management story is far enough along to implement this type of integration, we will rely -on being able to set arbitrary groups. (Note: as of this writing, a PR is being prepared for -setting arbitrary supplemental groups). - -An admission controller could handle allocating groups for each pod and setting the group in the -pod's security context. - -#### A note on the root group - -Today, by default, all docker containers are run in the root group (GID 0). This is relied on by -image authors that make images to run with a range of UIDs: they set the group ownership for -important paths to be the root group, so that containers running as GID 0 *and* an arbitrary UID -can read and write to those paths normally. - -It is important to note that the changes proposed here will not affect the primary GID of -containers in pods. Setting the `pod.Spec.SecurityContext.FSGroup` field will not -override the primary GID and should be safe to use in images that expect GID 0. - -### Setting ownership and permissions on volumes - -For `EmptyDir`-based volumes and unshared storage, `chown` and `chmod` on the node are sufficient to -set ownership and permissions. Shared storage is different because: - -1. Shared storage may not live on the node a pod that uses it runs on -2. Shared storage may be externally managed - -## Proposed design: - -Our design should minimize code for handling ownership required in the Kubelet and volume plugins. - -### API changes - -We should not interfere with images that need to run as a particular UID or primary GID. 
A pod -level supplemental group allows us to express a group that all containers in a pod run as in a way -that is orthogonal to the primary UID and GID of each container process. - -```go -package api - -type PodSecurityContext struct { - // FSGroup is a supplemental group that all containers in a pod run under. This group will own - // volumes that the Kubelet manages ownership for. If this is not specified, the Kubelet will - // not set the group ownership of any volumes. - FSGroup *int64 `json:"fsGroup,omitempty"` -} -``` - -The V1 API will be extended with the same field: - -```go -package v1 - -type PodSecurityContext struct { - // FSGroup is a supplemental group that all containers in a pod run under. This group will own - // volumes that the Kubelet manages ownership for. If this is not specified, the Kubelet will - // not set the group ownership of any volumes. - FSGroup *int64 `json:"fsGroup,omitempty"` -} -``` - -The values that can be specified for the `pod.Spec.SecurityContext.FSGroup` field are governed by -[pod security policy](https://github.com/kubernetes/kubernetes/pull/7893). - -#### API backward compatibility - -Pods created by old clients will have the `pod.Spec.SecurityContext.FSGroup` field unset; -these pods will not have their volumes managed by the Kubelet. Old clients will not be able to set -or read the `pod.Spec.SecurityContext.FSGroup` field. - -### Volume changes - -The `volume.Mounter` interface should have a new method added that indicates whether the plugin -supports ownership management: - -```go -package volume - -type Mounter interface { - // other methods omitted - - // SupportsOwnershipManagement indicates that this volume supports having ownership - // and permissions managed by the Kubelet; if true, the caller may manipulate UID - // or GID of this volume. - SupportsOwnershipManagement() bool -} -``` - -In the first round of work, only `hostPath` and `emptyDir` and its derivations will be tested with -ownership management support: - -| Plugin Name | SupportsOwnershipManagement | -|-------------------------|-------------------------------| -| `hostPath` | false | -| `emptyDir` | true | -| `gitRepo` | true | -| `secret` | true | -| `downwardAPI` | true | -| `gcePersistentDisk` | false | -| `awsElasticBlockStore` | false | -| `nfs` | false | -| `iscsi` | false | -| `glusterfs` | false | -| `persistentVolumeClaim` | depends on underlying volume and PV mode | -| `rbd` | false | -| `cinder` | false | -| `cephfs` | false | - -Ultimately, the matrix will theoretically look like: - -| Plugin Name | SupportsOwnershipManagement | -|-------------------------|-------------------------------| -| `hostPath` | false | -| `emptyDir` | true | -| `gitRepo` | true | -| `secret` | true | -| `downwardAPI` | true | -| `gcePersistentDisk` | true | -| `awsElasticBlockStore` | true | -| `nfs` | false | -| `iscsi` | true | -| `glusterfs` | false | -| `persistentVolumeClaim` | depends on underlying volume and PV mode | -| `rbd` | true | -| `cinder` | false | -| `cephfs` | false | - -### Kubelet changes - -The Kubelet should be modified to perform ownership and label management when required for a volume. - -For ownership management the criteria are: - -1. The `pod.Spec.SecurityContext.FSGroup` field is populated -2. 
The volume builder returns `true` from `SupportsOwnershipManagement`
-
-Logic should be added to the `mountExternalVolumes` method that runs a local `chgrp` and `chmod` if
-the pod-level supplemental group is set and the volume supports ownership management:
-
-```go
-package kubelet
-
-type ChgrpRunner interface {
-    Chgrp(path string, gid int) error
-}
-
-type ChmodRunner interface {
-    Chmod(path string, mode os.FileMode) error
-}
-
-type Kubelet struct {
-    chgrpRunner ChgrpRunner
-    chmodRunner ChmodRunner
-}
-
-func (kl *Kubelet) mountExternalVolumes(pod *api.Pod) (kubecontainer.VolumeMap, error) {
-    // FSGroup is a *int64 and is only set when the pod requests ownership management.
-    var podFSGroup *int64
-    if pod.Spec.SecurityContext != nil {
-        podFSGroup = pod.Spec.SecurityContext.FSGroup
-    }
-    podFSGroupSet := podFSGroup != nil
-
-    podVolumes := make(kubecontainer.VolumeMap)
-
-    for i := range pod.Spec.Volumes {
-        volSpec := &pod.Spec.Volumes[i]
-
-        rootContext, err := kl.getRootDirContext()
-        if err != nil {
-            return nil, err
-        }
-
-        // Try to use a plugin for this volume.
-        internal := volume.NewSpecFromVolume(volSpec)
-        builder, err := kl.newVolumeMounterFromPlugins(internal, pod, volume.VolumeOptions{RootContext: rootContext}, kl.mounter)
-        if err != nil {
-            glog.Errorf("Could not create volume builder for pod %s: %v", pod.UID, err)
-            return nil, err
-        }
-        if builder == nil {
-            return nil, errUnsupportedVolumeType
-        }
-        err = builder.SetUp()
-        if err != nil {
-            return nil, err
-        }
-
-        if podFSGroupSet && builder.SupportsOwnershipManagement() {
-            // chgrp the volume to the pod-level supplemental group and chmod it to 0770
-            // so that every container in the pod can read and write it.
-            err = kl.chgrpRunner.Chgrp(builder.GetPath(), int(*podFSGroup))
-            if err != nil {
-                return nil, err
-            }
-
-            err = kl.chmodRunner.Chmod(builder.GetPath(), os.FileMode(0770))
-            if err != nil {
-                return nil, err
-            }
-        }
-
-        podVolumes[volSpec.Name] = builder
-    }
-
-    return podVolumes, nil
-}
-```
-
-This allows the volume plugins to determine when they do and don't want this type of support from
-the Kubelet, and allows the criteria each plugin uses to evolve without changing the Kubelet.
-
-The Docker runtime will be modified to set the supplemental group of each container based on the
-`pod.Spec.SecurityContext.FSGroup` field. Theoretically, the `rkt` runtime could support this
-feature in a similar way.
-
-### Examples
-
-#### EmptyDir
-
-For a pod that has two containers sharing an `EmptyDir` volume:
-
-```yaml
-apiVersion: v1
-kind: Pod
-metadata:
-  name: test-pod
-spec:
-  securityContext:
-    fsGroup: 1001
-  containers:
-  - name: a
-    securityContext:
-      runAsUser: 1009
-    volumeMounts:
-    - mountPath: "/example/emptydir/a"
-      name: empty-vol
-  - name: b
-    securityContext:
-      runAsUser: 1010
-    volumeMounts:
-    - mountPath: "/example/emptydir/b"
-      name: empty-vol
-  volumes:
-  - name: empty-vol
-    emptyDir: {}
-```
-
-When the Kubelet runs this pod, the `empty-vol` volume will have ownership `root:1001` and
-permissions `0770`. It will be usable from both containers, `a` and `b`.
-
-#### HostPath
-
-For a pod that uses a `hostPath` volume with containers running as different UIDs:
-
-```yaml
-apiVersion: v1
-kind: Pod
-metadata:
-  name: test-pod
-spec:
-  securityContext:
-    fsGroup: 1001
-  containers:
-  - name: a
-    securityContext:
-      runAsUser: 1009
-    volumeMounts:
-    - mountPath: "/example/hostpath/a"
-      name: host-vol
-  - name: b
-    securityContext:
-      runAsUser: 1010
-    volumeMounts:
-    - mountPath: "/example/hostpath/b"
-      name: host-vol
-  volumes:
-  - name: host-vol
-    hostPath:
-      path: "/tmp/example-pod"
-```
-
-The cluster operator would need to manually `chgrp` and `chmod` `/tmp/example-pod` on the host
-in order for the volume to be usable from the pod.
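-
-Both examples follow the decision rule from the Kubelet changes section: ownership is managed only
-when the pod sets `fsGroup` *and* the volume plugin opts in, which is why `empty-vol` is managed
-and `host-vol` is not. The sketch below encodes that predicate; the `needsOwnershipManagement`
-helper and `fakeMounter` type are hypothetical names used for illustration, not proposed code.
-
-```go
-package main
-
-import "fmt"
-
-// ownershipOptIn is a stand-in for the SupportsOwnershipManagement method proposed
-// on volume.Mounter above.
-type ownershipOptIn interface {
-    SupportsOwnershipManagement() bool
-}
-
-type fakeMounter struct{ supports bool }
-
-func (f fakeMounter) SupportsOwnershipManagement() bool { return f.supports }
-
-// needsOwnershipManagement encodes the two criteria: the pod-level fsGroup is set,
-// and the volume plugin opts in to ownership management.
-func needsOwnershipManagement(fsGroup *int64, m ownershipOptIn) bool {
-    return fsGroup != nil && m.SupportsOwnershipManagement()
-}
-
-func main() {
-    group := int64(1001)
-
-    emptyDir := fakeMounter{supports: true}  // emptyDir and its derivations opt in
-    hostPath := fakeMounter{supports: false} // hostPath never opts in
-
-    fmt.Println(needsOwnershipManagement(&group, emptyDir)) // true
-    fmt.Println(needsOwnershipManagement(&group, hostPath)) // false
-    fmt.Println(needsOwnershipManagement(nil, emptyDir))    // false: fsGroup not set
-}
-```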
- - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/volumes.md?pixel)]() - +This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/volumes.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/volumes.md)