Merge pull request #39123 from michelleN/docs-proposals-stubs

replace contents of docs/proposals with stubs
This commit is contained in:
Brian Grant 2016-12-21 21:31:55 -08:00 committed by GitHub
commit 41e6357a07
68 changed files with 68 additions and 16836 deletions


@ -1,119 +1 @@
# Supporting multiple API groups

This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/api-group.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/api-group.md)
## Goal
1. Breaking the monolithic v1 API into modular groups and allowing groups to be enabled/disabled individually. This allows us to break the monolithic API server to smaller components in the future.
2. Supporting different versions in different groups. This allows different groups to evolve at different speeds.
3. Supporting identically named kinds to exist in different groups. This is useful when we experiment with new features of an API in the experimental group while supporting the stable API in the original group at the same time.
4. Exposing the API groups and versions supported by the server. This is required to develop a dynamic client.
5. Laying the basis for [API Plugin](../../docs/design/extending-api.md).
6. Keeping the user interaction easy. For example, we should allow users to omit group name when using kubectl if there is no ambiguity.
## Bookkeeping for groups
1. No changes to TypeMeta:
Currently many internal structures, such as RESTMapper and Scheme, are indexed and retrieved by APIVersion. For a fast implementation targeting the v1.1 deadline, we will concatenate group with version, in the form of "group/version", and use it where a version string is expected, so that much of the existing code can be reused. This implies we will not add a new field to TypeMeta; we will use TypeMeta.APIVersion to hold "group/version".
For backward compatibility, v1 objects belong to the group with an empty name, so existing v1 config files will remain valid.
2. /pkg/conversion#Scheme:
The key of /pkg/conversion#Scheme.versionMap for versioned types will be "group/version". For now, the internal version types of all groups will be registered to versionMap[""], as we don't have any identically named kinds in different groups yet. In the near future, internal version types will be registered to versionMap["group/"], and pkg/conversion#Scheme.InternalVersion will have type []string.
We will need a mechanism to express if two kinds in different groups (e.g., compute/pods and experimental/pods) are convertible, and auto-generate the conversions if they are.
3. meta.RESTMapper:
Each group will have its own RESTMapper (of type DefaultRESTMapper), and these mappers will be registered to pkg/api#RESTMapper (of type MultiRESTMapper).
To support identically named kinds in different groups, we need to expand the input of RESTMapper.VersionAndKindForResource from (resource string) to (group, resource string). If group is not specified and there is ambiguity (i.e., the resource exists in multiple groups), an error should be returned to force the user to specify the group.
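A minimal sketch of what such a group-aware lookup could look like follows; the type and method names below are illustrative assumptions, not the actual RESTMapper/MultiRESTMapper code:

```go
// Illustrative only: a toy group-aware resource lookup; the real RESTMapper
// interface and its DefaultRESTMapper/MultiRESTMapper implementations differ.
package main

import "fmt"

// mapping records the group/version/kind registered for a resource name.
type mapping struct {
	Group, Version, Kind string
}

// multiMapper indexes mappings by resource name; a resource may exist in several groups.
type multiMapper map[string][]mapping

// VersionAndKindForResource resolves (group, resource). An empty group is only
// accepted when the resource name is unambiguous across groups.
func (m multiMapper) VersionAndKindForResource(group, resource string) (mapping, error) {
	candidates := m[resource]
	if len(candidates) == 0 {
		return mapping{}, fmt.Errorf("no match for resource %q", resource)
	}
	if group == "" {
		if len(candidates) > 1 {
			return mapping{}, fmt.Errorf("resource %q exists in multiple groups; please specify a group", resource)
		}
		return candidates[0], nil
	}
	for _, c := range candidates {
		if c.Group == group {
			return c, nil
		}
	}
	return mapping{}, fmt.Errorf("resource %q not found in group %q", resource, group)
}

func main() {
	m := multiMapper{"pods": {
		{Group: "", Version: "v1", Kind: "Pod"},
		{Group: "experimental", Version: "v1alpha1", Kind: "Pod"},
	}}
	if _, err := m.VersionAndKindForResource("", "pods"); err != nil {
		fmt.Println(err) // ambiguity forces the user to specify a group
	}
}
```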
## Server-side implementation
1. resource handlers' URL:
We will force the URL to be in the form of prefix/group/version/...
Prefix is used to differentiate API paths from other paths like /healthz. All groups will use the same prefix="apis", except when backward compatibility requires otherwise. No "/" is allowed in prefix, group, or version. Specifically,
* for /api/v1, we set the prefix="api" (which is populated from cmd/kube-apiserver/app#APIServer.APIPrefix), group="", version="v1", so the URL remains /api/v1.
* for new kube API groups, we will set the prefix="apis" (we will add a field in type APIServer to hold this prefix), group=GROUP_NAME, version=VERSION. For example, the URL of the experimental resources will be /apis/experimental/v1alpha1.
* for OpenShift v1 API, because it's currently registered at /oapi/v1, to be backward compatible, OpenShift may set prefix="oapi", group="".
* for other new third-party API, they should also use the prefix="apis" and choose the group and version. This can be done through the thirdparty API plugin mechanism in [13000](http://pr.k8s.io/13000).
2. supporting API discovery:
* At /prefix (e.g., /apis), the API server will return the supported groups and their versions using the pkg/api/unversioned#APIVersions type, setting the Versions field to "group/version". This is backward compatible, because currently the API server does return "v1" encoded in pkg/api/unversioned#APIVersions at /api. (We will also rename the JSON field name from `versions` to `apiVersions`, to be consistent with the pkg/api#TypeMeta.APIVersion field.)
* At /prefix/group, the API server will return all supported versions of the group. We will create a new type VersionList (name is open to discussion) in pkg/api/unversioned as the API.
* At /prefix/group/version, the API server will return all supported resources in this group, and whether each resource is namespaced. We will create a new type APIResourceList (name is open to discussion) in pkg/api/unversioned as the API. A sketch of these discovery types is given at the end of this section.
We will design how to handle deeper paths in other proposals.
* At /swaggerapi/swagger-version/prefix/group/version, API server will return the Swagger spec of that group/version in `swagger-version` (e.g. we may support both Swagger v1.2 and v2.0).
3. handling common API objects:
* top-level common API objects:
To handle the top-level API objects that are used by all groups, we either have to register them to all schemes, or we can choose not to encode them to a version. We plan to take the latter approach and place such types in a new package called `unversioned`: many of the common top-level objects, such as APIVersions, VersionList, and APIResourceList (used in API discovery) and pkg/api#Status, are part of the protocol between client and server rather than part of the domain-specific API, which will evolve independently over time.
Types in the unversioned package will not have the APIVersion field, but may retain the Kind field.
For backward compatibility, when handling the Status, the server will encode it to v1 if the client expects the Status to be encoded in v1, otherwise the server will send the unversioned#Status. If an error occurs before the version can be determined, the server will send the unversioned#Status.
* non-top-level common API objects:
Assuming object o belonging to group X is used as a field in an object belonging to group Y, currently genconversion will generate the conversion functions for o in package Y. Hence, we don't need any special treatment for non-top-level common API objects.
TypeMeta is an exception, because it is a common object that is used by objects in all groups but does not logically belong to any group. We plan to move it to the package `unversioned`.
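To make the discovery API sketched above more concrete, here is a minimal sketch of what the unversioned discovery types might look like; only the type names (APIVersions, VersionList, APIResourceList) come from this proposal, and the exact field shapes are assumptions:

```go
// Sketch of possible types in pkg/api/unversioned; field shapes are illustrative.
// Per this proposal, unversioned types carry no APIVersion field but may keep Kind.
package unversioned

// APIVersions is returned at /prefix (e.g. /apis) and lists "group/version" strings.
type APIVersions struct {
	Kind        string   `json:"kind,omitempty"`
	APIVersions []string `json:"apiVersions"`
}

// VersionList is returned at /prefix/group and lists the supported versions of one group.
type VersionList struct {
	Kind     string   `json:"kind,omitempty"`
	Versions []string `json:"versions"`
}

// APIResource describes one resource exposed by a group/version.
type APIResource struct {
	Name       string `json:"name"`
	Namespaced bool   `json:"namespaced"`
}

// APIResourceList is returned at /prefix/group/version.
type APIResourceList struct {
	Kind      string        `json:"kind,omitempty"`
	Resources []APIResource `json:"resources"`
}
```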
## Client-side implementation
1. clients:
Currently we have structured (pkg/client/unversioned#ExperimentalClient, pkg/client/unversioned#Client) and unstructured (pkg/kubectl/resource#Helper) clients. The structured clients are not scalable because each of them implements a specific interface (e.g., [here](../../pkg/client/unversioned/client.go#L32)). Only the unstructured clients are scalable. We should either auto-generate the code for structured clients or migrate to using the unstructured clients as much as possible.
We should also move the unstructured client to pkg/client/.
2. Spelling the URL:
The URL is in the form of prefix/group/version/. The prefix is hard-coded in the client/unversioned.Config. The client should be able to figure out `group` and `version` using the RESTMapper. A third-party client which does not have access to the RESTMapper should discover the mapping of `group`, `version` and `kind` by querying the server as described in point 2 of [Server-side implementation](#server-side-implementation).
3. kubectl:
kubectl should accept arguments like `group/resource` and `group/resource/name` (a toy sketch of this argument handling is given after this list). The user can omit the `group`, in which case kubectl will rely on RESTMapper.VersionAndKindForResource() to figure out the default group/version of the resource. For example, for resources (like `node`) that exist in both the k8s v1 API and a k8s modularized API (like `infra/v2`), we should set a kubectl default to use one of them. If there is no default group, kubectl should return an error for the ambiguity.
When kubectl is used with a single resource type, the --api-version and --output-version flag of kubectl should accept values in the form of `group/version`, and they should work as they do today. For multi-resource operations, we will disable these two flags initially.
Currently, by setting pkg/client/unversioned/clientcmd/api/v1#Config.NamedCluster[x].Cluster.APIVersion ([here](../../pkg/client/unversioned/clientcmd/api/v1/types.go#L58)), the user can configure the default apiVersion used by kubectl to talk to the server. It does not make sense to set a global version used by kubectl when there are multiple groups, so we plan to deprecate this field. We may extend the version negotiation function to negotiate the preferred version of each group. Details will be in another proposal.
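As referenced above, a toy sketch of how kubectl might split such an argument before consulting the RESTMapper; the function and type names are hypothetical, and the real argument handling must also disambiguate the existing `resource/name` form:

```go
// Illustrative only: splitting "group/resource[/name]" style kubectl arguments.
package main

import (
	"fmt"
	"strings"
)

type target struct {
	Group, Resource, Name string
}

// parseArg accepts "resource", "group/resource", or "group/resource/name".
// When the group is omitted it stays empty and must be resolved via
// RESTMapper.VersionAndKindForResource, which errors out on ambiguity.
func parseArg(arg string) (target, error) {
	switch parts := strings.Split(arg, "/"); len(parts) {
	case 1:
		return target{Resource: parts[0]}, nil
	case 2:
		return target{Group: parts[0], Resource: parts[1]}, nil
	case 3:
		return target{Group: parts[0], Resource: parts[1], Name: parts[2]}, nil
	default:
		return target{}, fmt.Errorf("unrecognized argument %q", arg)
	}
}

func main() {
	t, _ := parseArg("experimental/pods/my-pod")
	fmt.Printf("%+v\n", t) // {Group:experimental Resource:pods Name:my-pod}
}
```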
## OpenShift integration
OpenShift can take a similar approach to break up its monolithic v1 API: keeping the v1 objects where they are, and gradually adding groups.
The v1 objects in OpenShift should keep doing what they do now: they should remain registered to the Scheme.versionMap["v1"] scheme and keep being added to the originMapper.
New OpenShift groups should do the same as native Kubernetes groups: each group should register to Scheme.versionMap["group/version"], and each should have a separate RESTMapper registered with the MultiRESTMapper.
To expose a list of the supported OpenShift groups to clients, OpenShift just has to call pkg/cmd/server/origin#initAPIVersionRoute() as it does now, passing in the supported "group/versions" instead of "versions".
## Future work
1. Dependencies between groups: we need an interface to register the dependencies between groups. It is not our priority now as the use cases are not clear yet.


@ -1,145 +1 @@
## Abstract

This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/apiserver-watch.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/apiserver-watch.md)
In the current system, most watch requests sent to apiserver are redirected to
etcd. This means that for every watch request the apiserver opens a watch on
etcd.
The purpose of the proposal is to improve the overall performance of the system
by solving the following problems:
- having too many open watches on etcd
- avoiding deserializing/converting the same objects multiple times in different
watch results
In the future, we would also like to add an indexing mechanism to the watch.
Although Indexer is not part of this proposal, it is supposed to be compatible
with it - in the future Indexer should be incorporated into the proposed new
watch solution in apiserver without requiring any redesign.
## High level design
We are going to solve those problems by allowing many clients to watch the same
storage in the apiserver, without being redirected to etcd.
At the high level, apiserver will have a single watch open to etcd, watching all
the objects (of a given type) without any filtering. The changes delivered from
etcd will then be stored in a cache in apiserver. This cache is in fact a
"rolling history window" that will support clients having some amount of latency
between their list and watch calls. Thus it will have a limited capacity and
whenever a new change comes from etcd while the cache is full, the oldest change
will be removed to make room for the new one.
When a client sends a watch request to apiserver, instead of redirecting it to
etcd, it will cause:
- registering a handler to receive all new changes coming from etcd
- iterating through the watch window, from the requested resourceVersion
to the head, and sending filtered changes directly to the client, blocking
delivery of new changes until this iteration has caught up
This will be done by creating a go-routine per watcher that will be responsible
for performing the above.
The following section describes the proposal in more detail, analyzes some
corner cases and divides the whole design into more fine-grained steps.
## Proposal details
We would like the cache to be __per-resource-type__ and __optional__. Thanks to
it we will be able to:
- have different cache sizes for different resources (e.g. bigger cache
[= longer history] for pods, which can significantly affect performance)
- avoid any overhead for objects that are watched very rarely (e.g. events
are almost never watched, but there are a lot of them)
- filter the cache for each watcher more effectively
If we decide to support watches spanning different resources in the future and
we have an efficient indexing mechanism, it should be relatively simple to unify
the cache to be common for all the resources.
The rest of this section describes the concrete steps that need to be done
to implement the proposal.
1. Since we want the watch in apiserver to be optional for different resource
types, this needs to be self-contained and hidden behind a well defined API.
This should be a layer very close to etcd - in particular all registries:
"pkg/registry/generic/registry" should be built on top of it.
We will solve this by extracting the interface of tools.EtcdHelper
and treating this interface as that API - the whole watch mechanism in
the apiserver will be hidden behind that interface.
Thanks to this we will get an initial implementation for free and will just
need to reimplement a few relevant functions (probably just Watch and List).
Moreover, this will not require any changes in other parts of the code.
This step is about extracting the interface of tools.EtcdHelper.
2. Create a FIFO cache with a given capacity. In its "rolling history window"
we will store two things:
- the resourceVersion of the object (being an etcdIndex)
- the object watched from etcd itself (in a deserialized form)
This should be as simple as having an array and treating it as a cyclic buffer
(a toy sketch of such a buffer is given after this list of steps).
Obviously the resourceVersions of objects watched from etcd will be increasing, and
they are necessary for registering a new watcher that is interested in all the
changes since a given etcdIndex.
Additionally, we should support the LIST operation, otherwise clients can never
start watching from "now". We may consider passing lists through etcd, however
this will not work once we have Indexer, so we will need that information
in memory anyway.
Thus, we should support LIST operation from the "end of the history" - i.e.
from the moment just after the newest cached watched event. It should be
pretty simple to do, because we can incrementally update this list whenever
a new watch event is received from etcd.
We may consider reusing existing structures cache.Store or cache.Indexer
("pkg/client/cache") but this is not a hard requirement.
3. Create the new implementation of the API that will internally have a
single watch open to etcd and will store the data received from etcd in
the FIFO cache - this includes implementing registration of a new watcher,
which will start a new go-routine responsible for iterating over the cache
and sending all the objects the watcher is interested in (by applying the
filtering function) to the watcher.
4. Add support for processing "error too old" from etcd, which will require:
- disconnect all the watchers
- clear the internal cache and relist all objects from etcd
- start accepting watchers again
5. Enable watch in apiserver for some of the existing resource types - this
should require only changes at the initialization level.
6. The next step will be to incorporate some indexing mechanism, but details
of it are TBD.
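As referenced in step 2, a toy sketch of a cyclic buffer holding (resourceVersion, object) pairs follows; all names are hypothetical and the real apiserver cache is considerably more involved:

```go
// Illustrative cyclic-buffer cache for watch events, per step 2 above.
// All names are hypothetical; the real apiserver code differs.
package main

import "fmt"

type event struct {
	resourceVersion uint64
	object          interface{} // deserialized object from etcd
}

type watchCache struct {
	buf      []event
	capacity int
	start    int // index of the oldest event
	size     int
}

func newWatchCache(capacity int) *watchCache {
	return &watchCache{buf: make([]event, capacity), capacity: capacity}
}

// add appends a new event; when the window is full the oldest event is dropped.
func (c *watchCache) add(e event) {
	if c.size == c.capacity {
		c.buf[c.start] = e
		c.start = (c.start + 1) % c.capacity
		return
	}
	c.buf[(c.start+c.size)%c.capacity] = e
	c.size++
}

// since returns all cached events newer than rv, or an error if rv has already
// fallen out of the window (the caller must then relist, i.e. "error too old").
func (c *watchCache) since(rv uint64) ([]event, error) {
	if c.size > 0 && rv < c.buf[c.start].resourceVersion-1 {
		return nil, fmt.Errorf("resource version %d is too old", rv)
	}
	var out []event
	for i := 0; i < c.size; i++ {
		e := c.buf[(c.start+i)%c.capacity]
		if e.resourceVersion > rv {
			out = append(out, e)
		}
	}
	return out, nil
}

func main() {
	c := newWatchCache(3)
	for rv := uint64(1); rv <= 5; rv++ {
		c.add(event{resourceVersion: rv})
	}
	evs, _ := c.since(3)
	fmt.Println(len(evs)) // 2: events 4 and 5 remain in the window
}
```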
### Future optimizations:
1. The implementation of watch in apiserver internally will open a single
watch to etcd, responsible for watching all the changes of objects of a given
resource type. However, this watch can potentially expire at any time and
reconnecting can return "too old resource version", in which case relisting is
necessary. To avoid LIST requests coming from all watchers at
the same time, we can introduce an additional etcd event type:
[EtcdResync](../../pkg/storage/etcd/etcd_watcher.go#L36)
Whenever relisting is done to refresh the internal watch to etcd, an
EtcdResync event will be sent to all the watchers. It will contain the
full list of all the objects the watcher is interested in (appropriately
filtered) as the parameter of this watch event.
Thus, we need to create the EtcdResync event, extend watch.Interface and
its implementations to support it and handle those events appropriately
in places like
[Reflector](../../pkg/client/cache/reflector.go)
However, this might turn out to be an unnecessary optimization if the apiserver
always keeps up (which is possible in the new design). We will work
out all necessary details at that point.


@ -1,310 +1 @@
<!-- BEGIN MUNGE: GENERATED_TOC -->

This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/apparmor.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/apparmor.md)
- [Overview](#overview)
- [Motivation](#motivation)
- [Related work](#related-work)
- [Alpha Design](#alpha-design)
- [Overview](#overview-1)
- [Prerequisites](#prerequisites)
- [API Changes](#api-changes)
- [Pod Security Policy](#pod-security-policy)
- [Deploying profiles](#deploying-profiles)
- [Testing](#testing)
- [Beta Design](#beta-design)
- [API Changes](#api-changes-1)
- [Future work](#future-work)
- [System component profiles](#system-component-profiles)
- [Deploying profiles](#deploying-profiles-1)
- [Custom app profiles](#custom-app-profiles)
- [Security plugins](#security-plugins)
- [Container Runtime Interface](#container-runtime-interface)
- [Alerting](#alerting)
- [Profile authoring](#profile-authoring)
- [Appendix](#appendix)
<!-- END MUNGE: GENERATED_TOC -->
# Overview
AppArmor is a [mandatory access control](https://en.wikipedia.org/wiki/Mandatory_access_control)
(MAC) system for Linux that supplements the standard Linux user and group based
permissions. AppArmor can be configured for any application to reduce the potential attack surface
and provide greater [defense in depth](https://en.wikipedia.org/wiki/Defense_in_depth_(computing)).
It is configured through profiles tuned to whitelist the access needed by a specific program or
container, such as Linux capabilities, network access, file permissions, etc. Each profile can be
run in either enforcing mode, which blocks access to disallowed resources, or complain mode, which
only reports violations.
AppArmor is similar to SELinux. Both are MAC systems implemented as a Linux security module (LSM),
and are mutually exclusive. SELinux offers a lot of power and very fine-grained controls, but is
generally considered very difficult to understand and maintain. AppArmor sacrifices some of that
flexibility in favor of ease of use. Seccomp-bpf is another Linux kernel security feature for
limiting attack surface, and can (and should!) be used alongside AppArmor.
## Motivation
AppArmor can enable users to run a more secure deployment, and / or provide better auditing and
monitoring of their systems. Although it is not the only solution, we should enable AppArmor for
users that want a simpler alternative to SELinux, or are already maintaining a set of AppArmor
profiles. We have heard from multiple Kubernetes users already that AppArmor support is important to
them. The [seccomp proposal](../../docs/design/seccomp.md#use-cases) details several use cases that
also apply to AppArmor.
## Related work
Much of this design is drawn from the work already done to support seccomp profiles in Kubernetes,
which is outlined in the [seccomp design doc](../../docs/design/seccomp.md). The designs should be
kept close to apply lessons learned, and reduce cognitive and maintenance overhead.
Docker has supported AppArmor profiles since version 1.3, and maintains a default profile which is
applied to all containers on supported systems.
AppArmor was upstreamed into the Linux kernel in version 2.6.36. It is currently maintained by
[Canonical](http://www.canonical.com/), is shipped by default on all Ubuntu and openSUSE systems,
and is supported on several
[other distributions](http://wiki.apparmor.net/index.php/Main_Page#Distributions_and_Ports).
# Alpha Design
This section describes the proposed design for
[alpha-level](../../docs/devel/api_changes.md#alpha-beta-and-stable-versions) support, although
additional features are described in [future work](#future-work). For AppArmor alpha support
(targeted for Kubernetes 1.4) we will enable:
- Specifying a pre-loaded profile to apply to a pod container
- Restricting pod containers to a set of profiles (admin use case)
We will also provide a reference implementation of a pod for loading profiles on nodes, but an
officially supported mechanism for deploying profiles is out of scope for alpha.
## Overview
An AppArmor profile can be specified for a container through the Kubernetes API with a pod
annotation. If a profile is specified, the Kubelet will verify that the node meets the required
[prerequisites](#prerequisites) (e.g. the profile is already configured on the node) before starting
the container, and will not run the container if the profile cannot be applied. If the requirements
are met, the container runtime will configure the appropriate options to apply the profile. Profile
requirements and defaults can be specified on the
[PodSecurityPolicy](security-context-constraints.md).
## Prerequisites
When an AppArmor profile is specified, the Kubelet will verify the prerequisites for applying the
profile to the container. In order to [fail
securely](https://www.owasp.org/index.php/Fail_securely), a container **will not be run** if any of
the prerequisites are not met. The prerequisites are:
1. **Kernel support** - The AppArmor kernel module is loaded. Can be checked by
[libcontainer](https://github.com/opencontainers/runc/blob/4dedd0939638fc27a609de1cb37e0666b3cf2079/libcontainer/apparmor/apparmor.go#L17).
2. **Runtime support** - For the initial implementation, Docker will be required (rkt does not
currently have AppArmor support). All supported Docker versions include AppArmor support. See
[Container Runtime Interface](#container-runtime-interface) for other runtimes.
3. **Installed profile** - The target profile must be loaded prior to starting the container. Loaded
profiles can be found in the AppArmor securityfs \[1\].
If any of the prerequisites are not met an event will be generated to report the error and the pod
will be
[rejected](https://github.com/kubernetes/kubernetes/blob/cdfe7b7b42373317ecd83eb195a683e35db0d569/pkg/kubelet/kubelet.go#L2201)
by the Kubelet.
*[1] The securityfs can be found in `/proc/mounts`, and defaults to `/sys/kernel/security` on my
Ubuntu system. The profiles can be found at `{securityfs}/apparmor/profiles`
([example](http://bazaar.launchpad.net/~apparmor-dev/apparmor/master/view/head:/utils/aa-status#L137)).*
## API Changes
The initial alpha support of AppArmor will follow the pattern
[used by seccomp](https://github.com/kubernetes/kubernetes/pull/25324) and specify profiles through
annotations. Profiles can be specified per-container through pod annotations. The annotation format
is a key matching the container, and a profile name value:
```
container.apparmor.security.alpha.kubernetes.io/<container_name>=<profile_name>
```
The profiles can be specified in the following formats (following the convention used by [seccomp](../../docs/design/seccomp.md#api-changes)):
1. `runtime/default` - Applies the default profile for the runtime. For docker, the profile is
generated from a template
[here](https://github.com/docker/docker/blob/master/profiles/apparmor/template.go). If no
AppArmor annotations are provided, this profile is enabled by default if AppArmor is enabled in
the kernel. Runtimes may define this to be unconfined, as Docker does for privileged pods.
2. `localhost/<profile_name>` - The profile name specifies the profile to load.
*Note: There is no way to explicitly specify an "unconfined" profile, since it is discouraged. If
this is truly needed, the user can load an "allow-all" profile.*
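For illustration, a sketch of how per-container profiles could be extracted from the annotations above; the key prefix comes from this proposal, but the helper itself is a hypothetical example rather than the actual Kubelet code:

```go
// Illustrative only: extracting per-container AppArmor profiles from pod
// annotations using the alpha key format above; helper names are hypothetical.
package main

import (
	"fmt"
	"strings"
)

const containerAnnotationPrefix = "container.apparmor.security.alpha.kubernetes.io/"

// profilesFromAnnotations returns a map of container name -> profile name.
func profilesFromAnnotations(annotations map[string]string) map[string]string {
	profiles := map[string]string{}
	for key, value := range annotations {
		if strings.HasPrefix(key, containerAnnotationPrefix) {
			container := strings.TrimPrefix(key, containerAnnotationPrefix)
			profiles[container] = value
		}
	}
	return profiles
}

func main() {
	ann := map[string]string{
		"container.apparmor.security.alpha.kubernetes.io/nginx": "localhost/docker-nginx",
	}
	fmt.Println(profilesFromAnnotations(ann)["nginx"]) // localhost/docker-nginx
}
```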
### Pod Security Policy
The [PodSecurityPolicy](security-context-constraints.md) allows cluster administrators to control
the security context for a pod and its containers. An annotation can be specified on the
PodSecurityPolicy to restrict which AppArmor profiles can be used, and specify a default if no
profile is specified.
The annotation key is `apparmor.security.alpha.kubernetes.io/allowedProfileNames`. The value is a
comma-delimited list, with each item following the format described [above](#api-changes). If a list
of profiles is provided and a pod does not have an AppArmor annotation, the first profile in the
list will be used by default.
Enforcement of the policy is standard. See the
[seccomp implementation](https://github.com/kubernetes/kubernetes/pull/28300) as an example.
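A hedged sketch of the enforcement behavior described above; the function name and signature are assumptions, and the linked seccomp implementation shows the real pattern:

```go
// Sketch of the policy resolution described above; names are illustrative.
package main

import "fmt"

// resolveProfile validates the requested profile against the allowed list and
// substitutes the first allowed profile when no profile was requested.
func resolveProfile(requested string, allowed []string) (string, error) {
	if requested == "" {
		if len(allowed) > 0 {
			return allowed[0], nil // first entry acts as the default
		}
		return "", nil // nothing requested and no policy restriction
	}
	if len(allowed) == 0 {
		return requested, nil // policy does not restrict profiles
	}
	for _, a := range allowed {
		if a == requested {
			return requested, nil
		}
	}
	return "", fmt.Errorf("AppArmor profile %q is not allowed by the PodSecurityPolicy", requested)
}

func main() {
	p, err := resolveProfile("", []string{"runtime/default", "localhost/nginx"})
	fmt.Println(p, err) // runtime/default <nil>
}
```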
## Deploying profiles
We will provide a reference implementation of a DaemonSet pod for loading profiles on nodes, but
there will not be an official mechanism or API in the initial version (see
[future work](#deploying-profiles-1)). The reference container will contain the `apparmor_parser`
tool and a script for using the tool to load all profiles in a set of (configurable)
directories. The initial implementation will poll (with a configurable interval) the directories for
additions, but will not update or unload existing profiles. The pod can be run in a DaemonSet to
load the profiles onto all nodes. The pod will need to be run in privileged mode.
This simple design should be sufficient to deploy AppArmor profiles from any volume source, such as
a ConfigMap or PersistentDisk. Users seeking more advanced features should be able to extend this
design easily.
## Testing
Our e2e testing framework does not currently run nodes with AppArmor enabled, but we can run a node
e2e test suite on an AppArmor enabled node. The cases we should test are:
- *PodSecurityPolicy* - These tests can be run on a cluster even if AppArmor is not enabled on the
nodes.
- No AppArmor policy allows pods with arbitrary profiles
- With a policy a default is selected
- With a policy arbitrary profiles are prevented
- With a policy allowed profiles are allowed
- *Node AppArmor enforcement* - These tests need to run on AppArmor enabled nodes, in the node e2e
suite.
- A valid container profile gets applied
- An unloaded profile will be rejected
# Beta Design
The only part of the design that changes for beta is the API, which is upgraded from
annotation-based to first class fields.
## API Changes
AppArmor profiles will be specified in the container's SecurityContext, as part of an
`AppArmorOptions` struct. The options struct makes the API more flexible to future additions.
```go
type SecurityContext struct {
...
// The AppArmor options to be applied to the container.
AppArmorOptions *AppArmorOptions `json:"appArmorOptions,omitempty"`
...
}
// Reference to an AppArmor profile loaded on the host.
type AppArmorProfileName string
// Options specifying how to run Containers with AppArmor.
type AppArmorOptions struct {
// The profile the Container must be run with.
Profile AppArmorProfileName `json:"profile"`
}
```
The `AppArmorProfileName` format matches the format for the profile annotation values described
[above](#api-changes).
The `PodSecurityPolicySpec` receives a similar treatment with the addition of an
`AppArmorStrategyOptions` struct. Here the `DefaultProfile` is separated from the `AllowedProfiles`
in the interest of making the behavior more explicit.
```go
type PodSecurityPolicySpec struct {
...
AppArmorStrategyOptions *AppArmorStrategyOptions `json:"appArmorStrategyOptions,omitempty"`
...
}
// AppArmorStrategyOptions specifies AppArmor restrictions and requirements for pods and containers.
type AppArmorStrategyOptions struct {
// If non-empty, all pod containers must be run with one of the profiles in this list.
AllowedProfiles []AppArmorProfileName `json:"allowedProfiles,omitempty"`
// The default profile to use if a profile is not specified for a container.
// Defaults to "runtime/default". Must be allowed by AllowedProfiles.
DefaultProfile AppArmorProfileName `json:"defaultProfile,omitempty"`
}
```
# Future work
Post-1.4 feature ideas. These are not fully-fleshed designs.
## System component profiles
We should publish (to GitHub) AppArmor profiles for all Kubernetes system components, including core
components like the API server and controller manager, as well as addons like influxDB and
Grafana. `kube-up.sh` and its successor should have an option to apply the profiles, if AppArmor
is supported by the nodes. Distros that support AppArmor and provide a Kubernetes package should
include the profiles out of the box.
## Deploying profiles
We could provide an officially supported solution for loading profiles on the nodes. One option is to
extend the reference implementation described [above](#deploying-profiles) into a DaemonSet that
watches the directory sources to sync changes, or to watch a ConfigMap object directly. Another
option is to add an official API for this purpose, and load the profiles on-demand in the Kubelet.
## Custom app profiles
[Profile stacking](http://wiki.apparmor.net/index.php/AppArmorStacking) is an AppArmor feature
currently in development that will enable multiple profiles to be applied to the same object. If
profiles are stacked, the allowed set of operations is the "intersection" of both profiles
(i.e. stacked profiles are never more permissive). Taking advantage of this feature, the cluster
administrator could restrict the allowed profiles on a PodSecurityPolicy to a few broad profiles,
and then individual apps could apply more app specific profiles on top.
## Security plugins
AppArmor, SELinux, TOMOYO, grsecurity, SMACK, etc. are all Linux MAC implementations with similar
requirements and features. At the very least, the AppArmor implementation should be factored in a
way that makes it easy to add alternative systems. A more advanced approach would be to extract a
set of interfaces for plugins implementing the alternatives. An even higher level approach would be
to define a common API or profile interface for all of them. Work towards this last option is
already underway for Docker, called
[Docker Security Profiles](https://github.com/docker/docker/issues/17142#issuecomment-148974642).
## Container Runtime Interface
Other container runtimes will likely add AppArmor support eventually, so the
[Container Runtime Interface](container-runtime-interface-v1.md) (CRI) needs to be made compatible
with this design. The two important pieces are a way to report whether AppArmor is supported by the
runtime, and a way to specify the profile to load (likely through the `LinuxContainerConfig`).
## Alerting
Whether AppArmor is running in enforcing or complain mode, it generates logs of policy
violations. These logs can be important cues for intrusion detection, or at the very least can point to a bug in
the profile. Violations should almost always generate alerts in production systems. We should
provide reference documentation for setting up alerts.
## Profile authoring
A common method for writing AppArmor profiles is to start with a restrictive profile in complain
mode, and then use the `aa-logprof` tool to build a profile from the logs. We should provide
documentation for following this process in a Kubernetes environment.
# Appendix
- [What is AppArmor](https://askubuntu.com/questions/236381/what-is-apparmor)
- [Debugging AppArmor on Docker](https://github.com/docker/docker/blob/master/docs/security/apparmor.md#debug-apparmor)
- Load an AppArmor profile with `apparmor_parser` (required by Docker so it should be available):
```
$ apparmor_parser --replace --write-cache /path/to/profile
```
- Unload with:
```
$ apparmor_parser --remove /path/to/profile
```


@ -1,316 +1 @@
<!-- BEGIN MUNGE: GENERATED_TOC -->

This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/client-package-structure.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/client-package-structure.md)
- [Client: layering and package structure](#client-layering-and-package-structure)
- [Desired layers](#desired-layers)
- [Transport](#transport)
- [RESTClient/request.go](#restclientrequestgo)
- [Mux layer](#mux-layer)
- [High-level: Individual typed](#high-level-individual-typed)
- [High-level, typed: Discovery](#high-level-typed-discovery)
- [High-level: Dynamic](#high-level-dynamic)
- [High-level: Client Sets](#high-level-client-sets)
- [Package Structure](#package-structure)
- [Client Guarantees (and testing)](#client-guarantees-and-testing)
<!-- END MUNGE: GENERATED_TOC -->
# Client: layering and package structure
## Desired layers
### Transport
The transport layer is concerned with round-tripping requests to an apiserver
somewhere. It consumes a Config object with options appropriate for this.
(That's most of the current client.Config structure.)
Transport delivers an object that implements http's RoundTripper interface
and/or can be used in place of http.DefaultTransport to route requests.
Transport objects are safe for concurrent use, and are cached and reused by
subsequent layers.
Tentative name: "Transport".
It's expected that the transport config will be general enough that third
parties (e.g., OpenShift) will not need their own implementation; rather, they
can change the certs, token, etc., to be appropriate for their own servers.
Action items:
* Split out of current client package into a new package. (@krousey)
### RESTClient/request.go
RESTClient consumes a Transport and a Codec (and optionally a group/version),
and produces something that implements the interface currently in request.go.
That is, with a RESTClient, you can write chains of calls like:
`c.Get().Path(p).Param("name", "value").Do()`
RESTClient is generically usable by any client for servers exposing REST-like
semantics. It provides helpers that benefit those following api-conventions.md,
but does not mandate them. It provides a higher level http interface that
abstracts transport, wire serialization, retry logic, and error handling.
Kubernetes-like constructs that deviate from standard HTTP should be bypassable.
Every non-trivial call made to a remote restful API from Kubernetes code should
go through a rest client.
The group and version may be empty when constructing a RESTClient. This is valid
for executing discovery commands. The group and version may be overridable with
a chained function call.
Ideally, no semantic behavior is built into RESTClient, and RESTClient will use
the Codec it was constructed with for all semantic operations, including turning
options objects into URL query parameters. Unfortunately, that is not true of
today's RESTClient, which may have some semantic information built in. We will
remove this.
RESTClient should not make assumptions about the format of data produced or
consumed by the Codec. Currently, it is JSON, but we want to support binary
protocols in the future.
The Codec would look something like this:
```go
type Codec interface {
Encode(runtime.Object) ([]byte, error)
Decode([]byte) (runtime.Object, error)
// Used to version-control query parameters
EncodeParameters(optionsObject runtime.Object) (url.Values, error)
// Not included here since the client doesn't need it, but a corresponding
// DecodeParametersInto method would be available on the server.
}
```
There should be one codec per version. RESTClient is *not* responsible for
converting between versions; if a client wishes, they can supply a Codec that
does that. But RESTClient will make the assumption that it's talking to a single
group/version, and will not contain any conversion logic. (This is a slight
change from the current state.)
As with Transport, it is expected that 3rd party providers following the api
conventions should be able to use RESTClient, and will not need to implement
their own.
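As a toy illustration of the chained call style above (not the real RESTClient, which also handles transport, codecs, retries, and error handling; all names here are illustrative):

```go
// Toy request builder illustrating the chained style above.
package main

import (
	"fmt"
	"net/url"
)

type Request struct {
	verb   string
	path   string
	params url.Values
}

type Client struct{}

func (c *Client) Get() *Request { return &Request{verb: "GET", params: url.Values{}} }

func (r *Request) Path(p string) *Request { r.path = p; return r }

func (r *Request) Param(name, value string) *Request { r.params.Set(name, value); return r }

// Do would hand the assembled request to the Transport layer for round-tripping
// and decode the response via the Codec; here it only renders the request line.
func (r *Request) Do() string {
	u := url.URL{Path: r.path, RawQuery: r.params.Encode()}
	return r.verb + " " + u.String()
}

func main() {
	c := &Client{}
	fmt.Println(c.Get().Path("/api/v1/pods").Param("labelSelector", "app=web").Do())
	// Output: GET /api/v1/pods?labelSelector=app%3Dweb
}
```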
Action items:
* Split out of the current client package. (@krousey)
* Possibly, convert to an interface (currently, it's a struct). This will allow
extending the error-checking monad that's currently in request.go up an
additional layer.
* Switch from ParamX("x") functions to using types representing the collection
of parameters and the Codec for query parameter serialization.
* Any other Kubernetes group specific behavior should also be removed from
RESTClient.
### Mux layer
(See TODO at end; this can probably be merged with the "client set" concept.)
The client muxer layer has a map of group/version to cached RESTClient, and
knows how to construct a new RESTClient in case of a cache miss (using the
discovery client mentioned below). The ClientMux may need to deal with multiple
transports pointing at differing destinations (e.g. OpenShift or other 3rd party
provider API may be at a different location).
When constructing a RESTClient generically, the muxer will just use the Codec
the high-level dynamic client would use. Alternatively, the user should be able
to pass in a Codec, for the case where the correct types are compiled in.
Tentative name: ClientMux
Action items:
* Move client cache out of kubectl libraries into a more general home.
* TODO: a mux layer may not be necessary, depending on what needs to be cached.
If transports are cached already, and RESTClients are extremely light-weight,
there may not need to be much code at all in this layer.
### High-level: Individual typed
Our current high-level client allows you to write things like
`c.Pods("namespace").Create(p)`; we will insert a level for the group.
That is, the system will be:
`clientset.GroupName().NamespaceSpecifier().Action()`
Where:
* `clientset` is a thing that holds multiple individually typed clients (see
below).
* `GroupName()` returns the generated client that this section is about.
* `NamespaceSpecifier()` may take a namespace parameter or nothing.
* `Action` is one of Create/Get/Update/Delete/Watch, or appropriate actions
from the type's subresources.
* It is TBD how we'll represent subresources and their actions. This is
inconsistent in the current clients, so we'll need to define a consistent
format. Possible choices:
* Insert a `.Subresource()` before the `.Action()`
* Flatten subresources, such that they become special Actions on the parent
resource.
The types returned/consumed by such functions will be e.g. api/v1, NOT the
current version-unspecific internal types. The current internal-versioned client is
inconvenient for users, as it does not protect them from having to recompile
their code with every minor update. (We may continue to generate an
internal-versioned client for our own use for a while, but even for our own
components it probably makes sense to switch to specifically versioned clients.)
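For illustration, the generated surface might read roughly like the toy interfaces below; all names are hypothetical, and the fake implementation exists only to make the example runnable:

```go
// Illustrative shape of a generated typed client; the real generated code is far more complete.
package main

import "fmt"

type Pod struct{ Name string }

// PodInterface carries the per-resource actions (Create/Get/Update/Delete/Watch, etc.).
type PodInterface interface {
	Create(p *Pod) (*Pod, error)
	Get(name string) (*Pod, error)
}

// CoreV1Interface is the per-group/version client returned by the client set.
type CoreV1Interface interface {
	Pods(namespace string) PodInterface
}

// fakePods is a toy in-memory implementation, just for this example.
type fakePods struct{ store map[string]*Pod }

func (f *fakePods) Create(p *Pod) (*Pod, error) { f.store[p.Name] = p; return p, nil }
func (f *fakePods) Get(name string) (*Pod, error) {
	if p, ok := f.store[name]; ok {
		return p, nil
	}
	return nil, fmt.Errorf("pod %q not found", name)
}

type fakeCoreV1 struct{ pods *fakePods }

func (f *fakeCoreV1) Pods(namespace string) PodInterface { return f.pods }

func main() {
	var core CoreV1Interface = &fakeCoreV1{pods: &fakePods{store: map[string]*Pod{}}}
	// clientset.GroupName().NamespaceSpecifier().Action() reads as:
	core.Pods("default").Create(&Pod{Name: "web"})
	p, _ := core.Pods("default").Get("web")
	fmt.Println(p.Name) // web
}
```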
We will provide this structure for each version of each group. It is infeasible
to do this manually, so we will generate this. The generator will accept both
swagger and the ordinary go types. The generator should operate on out-of-tree
sources AND out-of-tree destinations, so it will be useful for consuming
out-of-tree APIs and for others to build custom clients into their own
repositories.
Typed clients will be constructable given a ClientMux; the typed constructor will use
the ClientMux to find or construct an appropriate RESTClient. Alternatively, a
typed client should be constructable individually given a config, from which it
will be able to construct the appropriate RESTClient.
Typed clients do not require any version negotiation. The server either supports
the client's group/version, or it does not. However, there are ways around this:
* If you want to use a typed client against a server's API endpoint and the
server's API version doesn't match the client's API version, you can construct
the client with a RESTClient using a Codec that does the conversion (this is
basically what our client does now).
* Alternatively, you could use the dynamic client.
Action items:
* Move current typed clients into new directory structure (described below)
* Finish client generation logic. (@caesarxuchao, @lavalamp)
#### High-level, typed: Discovery
A `DiscoveryClient` is necessary to discover the api groups, versions, and
resources a server supports. It's constructable given a RESTClient. It is
consumed by both the ClientMux and users who want to iterate over groups,
versions, or resources. (Example: namespace controller.)
The DiscoveryClient is *not* required if you already know the group/version of
the resource you want to use: you can simply try the operation without checking
first, which is lower-latency anyway as it avoids an extra round-trip.
Action items:
* Refactor existing functions to present a sane interface, as close to that
offered by the other typed clients as possible. (@caesarxuchao)
* Use a RESTClient to make the necessary API calls.
* Make sure that no discovery happens unless it is explicitly requested. (Make
sure SetKubeDefaults doesn't call it, for example.)
### High-level: Dynamic
The dynamic client lets users consume apis which are not compiled into their
binary. It will provide the same interface as the typed client, but will take
and return `runtime.Object`s instead of typed objects. There is only one dynamic
client, so it's not necessary to generate it, although optionally we may do so
depending on whether the typed client generator makes it easy.
A dynamic client is constructable given a config, group, and version. It will
use this to construct a RESTClient with a Codec which encodes/decodes to
'Unstructured' `runtime.Object`s. The group and version may be from a previous
invocation of a DiscoveryClient, or they may be known by other means.
For now, the dynamic client will assume that a JSON encoding is allowed. In the
future, if we have binary-only APIs (unlikely?), we can add that to the
discovery information and construct an appropriate dynamic Codec.
Action items:
* A rudimentary version of this exists in kubectl's builder. It needs to be
moved to a more general place.
* Produce a useful 'Unstructured' runtime.Object, which allows for easy
Object/ListMeta introspection.
### High-level: Client Sets
Because there will be multiple groups with multiple versions, we will provide an
aggregation layer that combines multiple typed clients in a single object.
We do this to:
* Deliver a concrete thing for users to consume, construct, and pass around. We
don't want people making 10 typed clients and making a random system to keep
track of them.
* Constrain the testing matrix. Users can generate a client set at their whim
against their cluster, but we need to make guarantees that the clients we
shipped with v1.X.0 will work with v1.X+1.0, and vice versa. That's not
practical unless we "bless" a particular version of each API group and ship an
official client set with each release. (If the server supports 15 groups with
2 versions each, that's 2^15 different possible client sets. We don't want to
test all of them.)
A client set is generated into its own package. The generator will take the list
of group/versions to be included. Only one version from each group will be in
the client set.
A client set is constructable at runtime from either a ClientMux or a transport
config (for easy one-stop-shopping).
An example:
```go
import (
api_v1 "k8s.io/kubernetes/pkg/client/typed/generated/v1"
ext_v1beta1 "k8s.io/kubernetes/pkg/client/typed/generated/extensions/v1beta1"
net_v1beta1 "k8s.io/kubernetes/pkg/client/typed/generated/net/v1beta1"
"k8s.io/kubernetes/pkg/client/typed/dynamic"
)
type Client interface {
API() api_v1.Client
Extensions() ext_v1beta1.Client
Net() net_v1beta1.Client
// ... other typed clients here.
// Included in every set
Discovery() discovery.Client
GroupVersion(group, version string) dynamic.Client
}
```
Note that a particular version is chosen for each group. It is a general rule
for our API structure that no client need care about more than one version of
each group at a time.
This is the primary deliverable that people would consume. It is also generated.
Action items:
* This needs to be built. It will replace the ClientInterface that everyone
passes around right now.
## Package Structure
```
pkg/client/
----------/transport/ # transport & associated config
----------/restclient/
----------/clientmux/
----------/typed/
----------------/discovery/
----------------/generated/
--------------------------/<group>/
----------------------------------/<version>/
--------------------------------------------/<resource>.go
----------------/dynamic/
----------/clientsets/
---------------------/release-1.1/
---------------------/release-1.2/
---------------------/the-test-set-you-just-generated/
```
`/clientsets/` will retain their contents until they reach their expiration date.
e.g., when we release v1.N, we'll remove clientset v1.(N-3). Clients from old
releases live on and continue to work (i.e., are tested) without any interface
changes for multiple releases, to give users time to transition.
## Client Guarantees (and testing)
Once we release a clientset, we will not make interface changes to it. Users of
that client will not have to change their code until they are deliberately
upgrading their import. We probably will want to generate some sort of stub test
with a clientset, to ensure that we don't change the interface.


@ -1,171 +1 @@
# Objective

This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/cluster-deployment.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/cluster-deployment.md)
Simplify the cluster provisioning process for a cluster with one master and multiple worker nodes.
It should be secured with SSL and have all the default add-ons. There should not be significant
differences in the provisioning process across deployment targets (cloud provider + OS distribution)
once machines meet the node specification.
# Overview
Cluster provisioning can be broken into a number of phases, each with their own exit criteria.
In some cases, multiple phases will be combined together to more seamlessly automate the cluster setup,
but in all cases the phases can be run sequentially to provision a functional cluster.
It is possible that for some platforms we will provide an optimized flow that combines some of the steps
together, but that is out of scope of this document.
# Deployment flow
**Note**: _Exit criteria_ in the following sections are not intended to list all tests that should pass,
but rather those that must pass.
## Step 1: Provision cluster
**Objective**: Create a set of machines (master + nodes) where we will deploy Kubernetes.
For this phase to be completed successfully, the following requirements must be completed for all nodes:
- Basic connectivity between nodes (i.e. nodes can all ping each other)
- Docker installed (and in production setups should be monitored to be always running)
- One of the supported OSes
We will provide a node specification conformance test that will verify if provisioning has been successful.
This step is provider specific and will be implemented for each cloud provider + OS distribution separately
using provider specific technology (cloud formation, deployment manager, PXE boot, etc).
Some OS distributions may meet the provisioning criteria without needing to run any post-boot steps as they
ship with all of the requirements for the node specification by default.
**Substeps** (on the GCE example):
1. Create network
2. Create firewall rules to allow communication inside the cluster
3. Create firewall rule to allow ```ssh``` to all machines
4. Create firewall rule to allow ```https``` to master
5. Create persistent disk for master
6. Create static IP address for master
7. Create master machine
8. Create node machines
9. Install docker on all machines
**Exit criteria**:
1. Can ```ssh``` to all machines and run a test docker image
2. Can ```ssh``` to master and nodes and ping other machines
## Step 2: Generate certificates
**Objective**: Generate security certificates used to configure secure communication between client, master and nodes
TODO: Enumerate certificates which have to be generated.
## Step 3: Deploy master
**Objective**: Run kubelet and all the required components (e.g. etcd, apiserver, scheduler, controllers) on the master machine.
**Substeps**:
1. copy certificates
2. copy manifests for static pods:
1. etcd
2. apiserver, controller manager, scheduler
3. run kubelet in docker container (configuration is read from apiserver Config object)
4. run kubelet-checker in docker container
**v1.2 simplifications**:
1. kubelet-runner.sh - we will provide a custom docker image to run kubelet; it will contain
the kubelet binary and will run it using ```nsenter``` to work around a problem with mount propagation
1. kubelet config file - we will read kubelet configuration file from disk instead of apiserver; it will
be generated locally and copied to all nodes.
**Exit criteria**:
1. Can run basic API calls (e.g. create, list and delete pods) from the client side (e.g. replication
controller works - user can create RC object and RC manager can create pods based on that)
2. Critical master components work:
1. scheduler
2. controller manager
## Step 4: Deploy nodes
**Objective**: Start kubelet on all nodes and configure kubernetes network.
Each node can be deployed separately and the implementation should make it ~impossible to change this assumption.
### Step 4.1: Run kubelet
**Substeps**:
1. copy certificates
2. run kubelet in docker container (configuration is read from apiserver Config object)
3. run kubelet-checker in docker container
**v1.2 simplifications**:
1. kubelet config file - we will read kubelet configuration file from disk instead of apiserver; it will
be generated locally and copied to all nodes.
**Exit criteria**:
1. All nodes are registered, but not ready due to lack of kubernetes networking.
### Step 4.2: Setup kubernetes networking
**Objective**: Configure the Kubernetes networking to allow routing requests to pods and services.
To keep the default setup consistent across open source deployments we will use Flannel to configure
kubernetes networking. However, the implementation of this step will make it easy to plug in different
network solutions.
**Substeps**:
1. copy manifest for flannel server to master machine
2. create a daemonset with flannel daemon (it will read assigned CIDR and configure network appropriately).
**v1.2 simplifications**:
1. flannel daemon will run as a standalone binary (not in docker container)
2. flannel server will assign CIDRs to nodes outside of kubernetes; this will require restarting kubelet
after reconfiguring the network bridge on the local machine; this will also require running the master and nodes differently
(```--configure-cbr0=false``` on node and ```--allocate-node-cidrs=false``` on master), which breaks encapsulation
between nodes
**Exit criteria**:
1. Pods correctly created, scheduled, run and accessible from all nodes.
## Step 5: Add daemons
**Objective:** Start all system daemons (e.g. kube-proxy)
**Substeps**:
1. Create daemonset for kube-proxy
**Exit criteria**:
1. Services work correctly on all nodes.
## Step 6: Add add-ons
**Objective**: Add default add-ons (e.g. dns, dashboard)
**Substeps**:
1. Create Deployments (and daemonsets if needed) for all add-ons
## Deployment technology
We will use Ansible as the default technology for deployment orchestration. It has low requirements on the cluster machines
and seems to be popular in the kubernetes community, which will help us maintain it.
For simpler UX we will provide simple bash scripts that will wrap all basic commands for deployment (e.g. ```up``` or ```down```).
One disadvantage of using Ansible is that it adds a dependency on the machine which runs the deployment scripts. We will work around
this by distributing the deployment scripts via a docker image, so that the user will run the following command to create a cluster:
```docker run gcr.io/google_containers/deploy_kubernetes:v1.2 up --num-nodes=3 --provider=aws```


@ -1,444 +1 @@
# Pod initialization

This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/container-init.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/container-init.md)
@smarterclayton
March 2016
## Proposal and Motivation
Within a pod there is a need to initialize local data or adapt to the current
cluster environment that is not easily achieved in the current container model.
Containers start in parallel after volumes are mounted, leaving no opportunity
for coordination between containers without specialization of the image. If
two containers need to share common initialization data, both images must
be altered to cooperate using filesystem or network semantics, which introduces
coupling between images. Likewise, if an image requires configuration in order
to start and that configuration is environment dependent, the image must be
altered to add the necessary templating or retrieval.
This proposal introduces the concept of an **init container**, one or more
containers started in sequence before the pod's normal containers are started.
These init containers may share volumes, perform network operations, and perform
computation prior to the start of the remaining containers. They may also, by
virtue of their sequencing, block or delay the startup of application containers
until some precondition is met. In this document we refer to the existing pod
containers as **app containers**.
This proposal also provides a high level design of **volume containers**, which
initialize a particular volume, as a feature that specializes some of the tasks
defined for init containers. The init container design anticipates the existence
of volume containers and highlights where they will require future work.
## Design Points
* Init containers should be able to:
* Perform initialization of shared volumes
* Download binaries that will be used in app containers as execution targets
* Inject configuration or extension capability to generic images at startup
* Perform complex templating of information available in the local environment
* Initialize a database by starting a temporary execution process and applying
schema info.
* Delay the startup of application containers until preconditions are met
* Register the pod with other components of the system
* Reduce coupling:
* Between application images, eliminating the need to customize those images for
Kubernetes generally or specific roles
* Inside of images, by specializing which containers perform which tasks
(install git into init container, use filesystem contents
in web container)
* Between initialization steps, by supporting multiple sequential init containers
* Init containers allow simple start preconditions to be implemented that are
decoupled from application code
* The order init containers start should be predictable and allow users to easily
reason about the startup of a container
* Complex ordering and failure will not be supported - all complex workflows can
if necessary be implemented inside of a single init container, and this proposal
aims to enable that ordering without adding undue complexity to the system.
Pods in general are not intended to support DAG workflows.
* Both run-once and run-forever pods should be able to use init containers
* As much as possible, an init container should behave like an app container
to reduce complexity for end users, for clients, and for divergent use cases.
An init container is a container with the minimum alterations to accomplish
its goal.
* Volume containers should be able to:
* Perform initialization of a single volume
* Start in parallel
* Perform computation to initialize a volume, and delay start until that
volume is initialized successfully.
* Using a volume container that does not populate a volume to delay pod start
(in the absence of init containers) would be an abuse of the goal of volume
containers.
* Container pre-start hooks are not sufficient for all initialization cases:
* They cannot easily coordinate complex conditions across containers
* They can only function with code in the image or code in a shared volume,
which would have to be statically linked (not a common pattern in wide use)
* They cannot be implemented with the current Docker implementation - see
[#140](https://github.com/kubernetes/kubernetes/issues/140)
## Alternatives
* Any mechanism that runs user code on a node before regular pod containers
should itself be a container and modeled as such - we explicitly reject
creating new mechanisms for running user processes.
* The container pre-start hook (not yet implemented) requires execution within
the container's image and so cannot adapt existing images. It also cannot
block startup of containers
* Running a "pre-pod" would defeat the purpose of the pod being an atomic
unit of scheduling.
## Design
Each pod may have 0..N init containers defined along with the existing
1..M app containers.
On startup of the pod, after the network and volumes are initialized, the
init containers are started in order. Each container must exit successfully
before the next is invoked. If a container fails to start (due to the runtime)
or exits with failure, it is retried according to the pod RestartPolicy.
RestartPolicyNever pods will immediately fail and exit. RestartPolicyAlways
pods will retry the failing init container with increasing backoff until it
succeeds. To align with the design of application containers, init containers
will only support "infinite retries" (RestartPolicyAlways) or "no retries"
(RestartPolicyNever).
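To illustrate the retry semantics described above, here is a minimal sketch (not Kubelet code; the `run` callback, the one-second starting backoff, and the five-minute cap are assumptions made for the example):

```go
package main

import (
	"fmt"
	"time"
)

type RestartPolicy string

const (
	RestartPolicyAlways RestartPolicy = "Always"
	RestartPolicyNever  RestartPolicy = "Never"
)

// runInitContainers runs each init container to completion, in order. A failed
// init container either fails the pod (RestartPolicyNever) or is retried with
// increasing backoff (RestartPolicyAlways).
func runInitContainers(policy RestartPolicy, names []string, run func(name string) error) error {
	for _, name := range names {
		backoff := time.Second // assumed starting backoff
		for {
			err := run(name)
			if err == nil {
				break // success: move on to the next init container
			}
			if policy == RestartPolicyNever {
				return fmt.Errorf("init container %q failed: %v", name, err)
			}
			time.Sleep(backoff)
			if backoff < 5*time.Minute { // assumed cap on the backoff
				backoff *= 2
			}
		}
	}
	return nil
}

func main() {
	err := runInitContainers(RestartPolicyNever, []string{"init-container1", "init-container2"},
		func(name string) error { fmt.Println("running", name); return nil })
	fmt.Println("initialization complete, err =", err)
}
```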
A pod cannot be ready until all init containers have succeeded. The ports
on an init container are not aggregated under a service. A pod that is
being initialized is in the `Pending` phase but should have a distinct
condition. Each app container and all future init containers should have
the reason `PodInitializing`. The pod should have a condition `Initializing`
set to `false` until all init containers have succeeded, and `true` thereafter.
If the pod is restarted, the `Initializing` condition should be set to `false`.
If the pod is "restarted" all containers stopped and started due to
a node restart, change to the pod definition, or admin interaction, all
init containers must execute again. Restartable conditions are defined as:
* An init container image is changed
* The pod infrastructure container is restarted (shared namespaces are lost)
* The Kubelet detects that all containers in a pod are terminated AND
no record of init container completion is available on disk (due to GC)
Changes to the init container spec are limited to the container image field.
Altering the container image field is equivalent to restarting the pod.
Because init containers can be restarted, retried, or reexecuted, container
authors should make their init behavior idempotent by handling volumes that
are already populated or the possibility that this instance of the pod has
already contacted a remote system.
Each init container has all of the fields of an app container. The following
fields are prohibited from being used on init containers by validation:
* `readinessProbe` - init containers must exit for pod startup to continue,
are not included in rotation, and so cannot define readiness distinct from
completion.
Init container authors may use `activeDeadlineSeconds` on the pod and
`livenessProbe` on the container to prevent init containers from failing
forever. The active deadline includes init containers.
Because init containers are semantically different in lifecycle from app
containers (they are run serially, rather than in parallel), for backwards
compatibility and design clarity they will be identified as distinct fields
in the API:

    pod:
      spec:
        containers: ...
        initContainers:
        - name: init-container1
          image: ...
          ...
        - name: init-container2
          ...
      status:
        containerStatuses: ...
        initContainerStatuses:
        - name: init-container1
          ...
        - name: init-container2
          ...
This separation also serves to make the order of container initialization
clear - init containers are executed in the order that they appear, then all
app containers are started at once.
The name of each app and init container in a pod must be unique - it is a
validation error for any container to share a name.
While init containers are in alpha state, they will be serialized as an annotation
on the pod with the name `pod.alpha.kubernetes.io/init-containers`, and the status
of the init containers will be stored in `pod.alpha.kubernetes.io/init-container-statuses`.
Mutation of these annotations is prohibited on existing pods.
### Resources
Given the ordering and execution for init containers, the following rules
for resource usage apply:
* The highest of any particular resource request or limit defined on all init
containers is the **effective init request/limit**
* The pod's **effective request/limit** for a resource is the higher of:
* sum of all app containers request/limit for a resource
* effective init request/limit for a resource
* Scheduling is done based on effective requests/limits, which means
init containers can reserve resources for initialization that are not used
during the life of the pod.
* The lowest QoS tier of init containers per resource is the **effective init QoS tier**,
and the highest QoS tier of both init containers and regular containers is the
**effective pod QoS tier**.
So the following pod:

    pod:
      spec:
        initContainers:
        - limits:
            cpu: 100m
            memory: 1GiB
        - limits:
            cpu: 50m
            memory: 2GiB
        containers:
        - limits:
            cpu: 10m
            memory: 1100MiB
        - limits:
            cpu: 10m
            memory: 1100MiB
has an effective pod limit of `cpu: 100m`, `memory: 2200MiB` (highest init
container cpu is larger than sum of all app containers, sum of container
memory is larger than the max of all init containers). The scheduler, node,
and quota must respect the effective pod request/limit.
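A minimal sketch of that computation (simplified: quantities are plain int64 millicores/bytes rather than the real resource quantity type, and only limits are shown):

```go
package main

import "fmt"

// ResourceList maps a resource name (e.g. "cpu", "memory") to a quantity in
// integer units (millicores for cpu, bytes for memory).
type ResourceList map[string]int64

// effectiveLimits returns the pod's effective limits: for each resource, the
// higher of (a) the largest limit across init containers and (b) the sum of
// limits across app containers.
func effectiveLimits(initContainers, appContainers []ResourceList) ResourceList {
	out := ResourceList{}
	for _, c := range initContainers {
		for name, v := range c {
			if v > out[name] {
				out[name] = v
			}
		}
	}
	appSum := ResourceList{}
	for _, c := range appContainers {
		for name, v := range c {
			appSum[name] += v
		}
	}
	for name, v := range appSum {
		if v > out[name] {
			out[name] = v
		}
	}
	return out
}

func main() {
	init := []ResourceList{{"cpu": 100, "memory": 1 << 30}, {"cpu": 50, "memory": 2 << 30}}
	apps := []ResourceList{{"cpu": 10, "memory": 1100 << 20}, {"cpu": 10, "memory": 1100 << 20}}
	// Prints cpu: 100 (millicores) and memory: 2306867200 bytes (2200MiB),
	// matching the example above.
	fmt.Println(effectiveLimits(init, apps))
}
```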
In the absence of a defined request or limit on a container, the effective
request/limit will be applied. For example, the following pod:

    pod:
      spec:
        initContainers:
        - limits:
            cpu: 100m
            memory: 1GiB
        containers:
        - request:
            cpu: 10m
            memory: 1100MiB
will have an effective request of `10m / 1100MiB`, and an effective limit
of `100m / 1GiB`, i.e.:

    pod:
      spec:
        initContainers:
        - request:
            cpu: 10m
            memory: 1GiB
          limits:
            cpu: 100m
            memory: 1100MiB
        containers:
        - request:
            cpu: 10m
            memory: 1GiB
          limits:
            cpu: 100m
            memory: 1100MiB
and thus has the QoS tier **Burstable** (because the request is not equal to
the limit).
Quota and limits will be applied based on the effective pod request and
limit.
Pod-level cgroups will be based on the effective pod request and limit, the
same values the scheduler uses.
### Kubelet and container runtime details
Container runtimes should treat the set of init and app containers as one
large pool. An individual init container execution should be identical to
an app container, including all standard container environment setup
(network, namespaces, hostnames, DNS, etc).
All app container operations are permitted on init containers. The
logs for an init container should be available for the duration of the pod
lifetime or until the pod is restarted.
During initialization, app container status should be shown with the reason
PodInitializing if any init containers are present. Each init container
should show appropriate container status, and all init containers that are
waiting for earlier init containers to finish should have the `reason`
PendingInitialization.
The container runtime should aggressively prune failed init containers.
The container runtime should record whether all init containers have
succeeded internally, and only invoke new init containers if a pod
restart is needed (for Docker, if all containers terminate or if the pod
infra container terminates). Init containers should follow backoff rules
as necessary. The Kubelet *must* preserve at least the most recent instance
of an init container to serve logs and data for end users and to track
failure states. The Kubelet *should* prefer to garbage collect completed
init containers over app containers, as long as the Kubelet is able to
track that initialization has been completed. In the future, container
state checkpointing in the Kubelet may remove or reduce the need to
preserve old init containers.
For the initial implementation, the Kubelet will use the last termination
container state of the highest indexed init container to determine whether
the pod has completed initialization. During a pod restart, initialization
will be restarted from the beginning (all initializers will be rerun).
### API Behavior
All APIs that access containers by name should operate on both init and
app containers. Because names are unique the addition of the init container
should be transparent to use cases.
A client with no knowledge of init containers should see appropriate
container status `reason` and `message` fields while the pod is in the
`Pending` phase, and so be able to communicate that to end users.
### Example init containers
* Wait for a service to be created

    pod:
      spec:
        initContainers:
        - name: wait
          image: centos:centos7
          command: ["/bin/sh", "-c", "for i in {1..100}; do sleep 1; if dig myservice; then exit 0; fi; done; exit 1"]
        containers:
        - name: run
          image: application-image
          command: ["/my_application_that_depends_on_myservice"]
* Register this pod with a remote server

    pod:
      spec:
        initContainers:
        - name: register
          image: centos:centos7
          command: ["/bin/sh", "-c", "curl -X POST http://$MANAGEMENT_SERVICE_HOST:$MANAGEMENT_SERVICE_PORT/register -d 'instance=$(POD_NAME)&ip=$(POD_IP)'"]
          env:
          - name: POD_NAME
            valueFrom:
              field: metadata.name
          - name: POD_IP
            valueFrom:
              field: status.podIP
        containers:
        - name: run
          image: application-image
          command: ["/my_application_that_depends_on_myservice"]
* Wait for an arbitrary period of time

    pod:
      spec:
        initContainers:
        - name: wait
          image: centos:centos7
          command: ["/bin/sh", "-c", "sleep 60"]
        containers:
        - name: run
          image: application-image
          command: ["/static_binary_without_sleep"]
* Clone a git repository into a volume (can be implemented by volume containers in the future):

    pod:
      spec:
        initContainers:
        - name: download
          image: image-with-git
          command: ["git", "clone", "https://github.com/myrepo/myrepo.git", "/var/lib/data"]
          volumeMounts:
          - mountPath: /var/lib/data
            volumeName: git
        containers:
        - name: run
          image: centos:centos7
          command: ["/var/lib/data/binary"]
          volumeMounts:
          - mountPath: /var/lib/data
            volumeName: git
        volumes:
        - emptyDir: {}
          name: git
* Execute a template transformation based on environment (can be implemented by volume containers in the future):

    pod:
      spec:
        initContainers:
        - name: copy
          image: application-image
          command: ["/bin/cp", "mytemplate.j2", "/var/lib/data/"]
          volumeMounts:
          - mountPath: /var/lib/data
            volumeName: data
        - name: transform
          image: image-with-jinja
          command: ["/bin/sh", "-c", "jinja /var/lib/data/mytemplate.j2 > /var/lib/data/mytemplate.conf"]
          volumeMounts:
          - mountPath: /var/lib/data
            volumeName: data
        containers:
        - name: run
          image: application-image
          command: ["/myapplication", "-conf", "/var/lib/data/mytemplate.conf"]
          volumeMounts:
          - mountPath: /var/lib/data
            volumeName: data
        volumes:
        - emptyDir: {}
          name: data
* Perform a container build

    pod:
      spec:
        initContainers:
        - name: copy
          image: base-image
          workingDir: /home/user/source-tree
          command: ["make"]
        containers:
        - name: commit
          image: image-with-docker
          command:
          - /bin/sh
          - -c
          - docker commit $(complex_bash_to_get_container_id_of_copy) &&
            docker push $(commit_id) myrepo:latest
          volumeMounts:
          - mountPath: /var/run/docker.sock
            volumeName: dockersocket
## Backwards compatibility implications
Since this is a net new feature in the API and Kubelet, new API servers during upgrade may not
be able to rely on Kubelets implementing init containers. The management of feature skew between
master and Kubelet is tracked in issue [#4855](https://github.com/kubernetes/kubernetes/issues/4855).
## Future work
* Unify pod QoS class with init containers
* Implement container / image volumes to make composition of runtime from images efficient
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/container-init.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->


@@ -1,267 +1 @@
# Redefine Container Runtime Interface This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/container-runtime-interface-v1.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/container-runtime-interface-v1.md)
The umbrella issue: [#22964](https://issues.k8s.io/22964)
## Motivation
Kubelet employs a declarative pod-level interface, which acts as the sole
integration point for container runtimes (e.g., `docker` and `rkt`). The
high-level, declarative interface has caused higher integration and maintenance
cost, and also slowed down feature velocity for the following reasons.
1. **Not every container runtime supports the concept of pods natively**.
When integrating with Kubernetes, a significant amount of work needs to
go into implementing a shim of significant size to support all pod
features. This also adds maintenance overhead (e.g., `docker`).
2. **High-level interface discourages code sharing and reuse among runtimes**.
E.g., each runtime today implements an all-encompassing `SyncPod()`
function, with the Pod Spec as the input argument. The runtime implements
logic to determine how to achieve the desired state based on the current
status, (re-)starts pods/containers and manages lifecycle hooks
accordingly.
3. **Pod Spec is evolving rapidly**. New features are being added constantly.
Any pod-level change or addition requires changing of all container
runtime shims. E.g., init containers and volume containers.
## Goals and Non-Goals
The goals of defining the interface are to
- **improve extensibility**: Easier container runtime integration.
- **improve feature velocity**
- **improve code maintainability**
The non-goals include
- proposing *how* to integrate with new runtimes, i.e., where the shim
resides. The discussion of adopting a client-server architecture is tracked
by [#13768](https://issues.k8s.io/13768), where benefits and shortcomings of
such an architecture is discussed.
- versioning the new interface/API. We intend to provide API versioning to
offer stability for runtime integrations, but the details are beyond the
scope of this proposal.
- adding support to Windows containers. Windows container support is a
parallel effort and is tracked by [#22623](https://issues.k8s.io/22623).
The new interface will not be augmented to support Windows containers, but
it will be made extensible such that the support can be added in the future.
- re-defining Kubelet's internal interfaces. These interfaces, though they may
affect Kubelet's maintainability, are not relevant to runtime integration.
- improving Kubelet's efficiency or performance, e.g., adopting event stream
from the container runtime [#8756](https://issues.k8s.io/8756),
[#16831](https://issues.k8s.io/16831).
## Requirements
* Support the already integrated container runtimes: `docker` and `rkt`
* Support hypervisor-based container runtimes: `hyper`.
The existing pod-level interface will remain as it is in the near future to
ensure that support for all existing runtimes continues. Meanwhile, we will
work with all parties involved on switching to the proposed interface.
## Container Runtime Interface
The main idea of this proposal is to adopt an imperative container-level
interface, which allows Kubelet to directly control the lifecycles of the
containers.
A pod is composed of a group of containers in an isolated environment with
resource constraints. In Kubernetes, the pod is also the smallest schedulable unit.
After a pod has been scheduled to the node, Kubelet will create the environment
for the pod, and add/update/remove containers in that environment to meet the
Pod Spec. To distinguish between the environment and the pod as a whole, we
will call the pod environment **PodSandbox.**
Container runtimes may interpret the PodSandbox concept differently based
on how they operate internally. For runtimes relying on a hypervisor, the sandbox
naturally maps to a virtual machine. For others, it can be Linux namespaces.
In short, a PodSandbox should have the following features.
* **Isolation**: E.g., Linux namespaces or a full virtual machine, or even
support additional security features.
* **Compute resource specifications**: A PodSandbox should implement pod-level
resource demands and restrictions.
*NOTE: The resource specification does not include externalized costs to
container setup that are not currently trackable as Pod constraints, e.g.,
filesystem setup, container image pulling, etc.*
A container in a PodSandbox maps to an application in the Pod Spec. For Linux
containers, they are expected to share at least network and IPC namespaces,
with sharing more namespaces discussed in [#1615](https://issues.k8s.io/1615).
Below is an example of the proposed interfaces.
```go
// PodSandboxManager contains basic operations for sandbox.
type PodSandboxManager interface {
	Create(config *PodSandboxConfig) (string, error)
	Delete(id string) (string, error)
	List(filter PodSandboxFilter) []PodSandboxListItem
	Status(id string) PodSandboxStatus
}

// ContainerRuntime contains basic operations for containers.
type ContainerRuntime interface {
	Create(config *ContainerConfig, sandboxConfig *PodSandboxConfig, PodSandboxID string) (string, error)
	Start(id string) error
	Stop(id string, timeout int) error
	Remove(id string) error
	List(filter ContainerFilter) ([]ContainerListItem, error)
	Status(id string) (ContainerStatus, error)
	Exec(id string, cmd []string, streamOpts StreamOptions) error
}

// ImageService contains image-related operations.
type ImageService interface {
	List() ([]Image, error)
	Pull(image ImageSpec, auth AuthConfig) error
	Remove(image ImageSpec) error
	Status(image ImageSpec) (Image, error)
	Metrics(image ImageSpec) (ImageMetrics, error)
}

type ContainerMetricsGetter interface {
	ContainerMetrics(id string) (ContainerMetrics, error)
}
```

All functions listed above are expected to be thread-safe.
### Pod/Container Lifecycle
The PodSandbox's lifecycle is decoupled from that of the containers, i.e., a sandbox
is created before any containers, and can exist after all containers in it have
terminated.
Assume there is a pod with a single container C. To start a pod:
```
create sandbox Foo --> create container C --> start container C
```
To delete a pod:
```
stop container C --> remove container C --> delete sandbox Foo
```
The container runtime must not apply any transition (such as starting a new
container) unless explicitly instructed by Kubelet. It is Kubelet's
responsibility to enforce garbage collection, restart policy, and otherwise
react to changes in lifecycle.
The only transitions that are possible for a container are described below:
```
() -> Created // A container can only transition to created from the
// empty, nonexistent state. The ContainerRuntime.Create
// method causes this transition.
Created -> Running // The ContainerRuntime.Start method may be applied to a
// Created container to move it to Running
Running -> Exited // The ContainerRuntime.Stop method may be applied to a running
// container to move it to Exited.
// A container may also make this transition under its own volition
Exited -> () // An exited container can be moved to the terminal empty
// state via a ContainerRuntime.Remove call.
```
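As a sketch (reusing the interface definitions above; the error handling and the 30-second stop timeout are illustrative only, not prescribed by this proposal), Kubelet could drive these transitions as follows:

```go
// startPod sketches: create sandbox Foo --> create container C --> start container C.
func startPod(sm PodSandboxManager, rt ContainerRuntime,
	sandboxCfg *PodSandboxConfig, containerCfg *ContainerConfig) (sandboxID, containerID string, err error) {
	sandboxID, err = sm.Create(sandboxCfg)
	if err != nil {
		return "", "", err
	}
	containerID, err = rt.Create(containerCfg, sandboxCfg, sandboxID)
	if err != nil {
		return "", "", err
	}
	return sandboxID, containerID, rt.Start(containerID)
}

// deletePod sketches: stop container C --> remove container C --> delete sandbox Foo.
func deletePod(sm PodSandboxManager, rt ContainerRuntime, sandboxID, containerID string) error {
	if err := rt.Stop(containerID, 30); err != nil { // 30s grace period, illustrative
		return err
	}
	if err := rt.Remove(containerID); err != nil {
		return err
	}
	_, err := sm.Delete(sandboxID)
	return err
}
```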
Kubelet is also responsible for gracefully terminating all the containers
in the sandbox before deleting the sandbox. If Kubelet chooses to delete
the sandbox with running containers in it, those containers should be forcibly
deleted.
Note that every PodSandbox/container lifecycle operation (create, start,
stop, delete) should either return an error or block until the operation
succeeds. A successful operation should include a state transition of the
PodSandbox/container. E.g., if a `Create` call for a container does not
return an error, the container state should be "created" when the runtime is
queried.
### Updates to PodSandbox or Containers
Kubernetes supports updates only to a very limited set of fields in the Pod
Spec. These updates may require containers to be re-created by Kubelet. This
can be achieved through the proposed, imperative container-level interface.
On the other hand, PodSandbox update currently is not required.
### Container Lifecycle Hooks
Kubernetes supports post-start and pre-stop lifecycle hooks, with ongoing
discussion for supporting pre-start and post-stop hooks in
[#140](https://issues.k8s.io/140).
These lifecycle hooks will be implemented by Kubelet via `Exec` calls to the
container runtime. This frees the runtimes from having to support hooks
natively.
Illustration of the container lifecycle and hooks:
```
      pre-start        post-start   pre-stop   post-stop
          |                |           |           |
        exec             exec        exec        exec
          |                |           |           |
create --------> start ----------------> stop --------> remove
```
In order for the lifecycle hooks to function as expected, the `Exec` call
will need access to the container's filesystem (e.g., mount namespaces).
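For illustration, a post-start hook could then be a thin wrapper over `Exec` (a sketch using the proposed interface above; the empty `StreamOptions` value is a placeholder):

```go
// runPostStartHook executes a post-start lifecycle hook by exec-ing the hook
// command inside the already-started container, so the runtime needs no
// native hook support.
func runPostStartHook(rt ContainerRuntime, containerID string, hookCmd []string) error {
	return rt.Exec(containerID, hookCmd, StreamOptions{})
}
```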
### Extensibility
There are several dimensions for container runtime extensibility.
- Host OS (e.g., Linux)
- PodSandbox isolation mechanism (e.g., namespaces or VM)
- PodSandbox OS (e.g., Linux)
As mentioned previously, this proposal will only address the Linux based
PodSandbox and containers. All Linux-specific configuration will be grouped
into one field. A container runtime is required to enforce all configuration
applicable to its platform, and should return an error otherwise.
### Keep it minimal
The proposed interface is experimental, i.e., it will go through (many) changes
until it stabilizes. The principle is to keep the interface minimal and
extend it later if needed. This leaves out several features that are still
under discussion and may be achieved by other means:
* `AttachContainer`: [#23335](https://issues.k8s.io/23335)
* `PortForward`: [#25113](https://issues.k8s.io/25113)
## Alternatives
**[Status quo] Declarative pod-level interface**
- Pros: No changes needed.
- Cons: All the issues stated in #motivation
**Allow integration at both pod- and container-level interfaces**
- Pros: Flexibility.
- Cons: All the issues stated in #motivation
**Imperative pod-level interface**
The interface contains only CreatePod(), StartPod(), StopPod() and RemovePod().
This implies that the runtime needs to take over container lifecycle
management (i.e., enforce restart policy), lifecycle hooks, liveness checks,
etc. Kubelet will mainly be responsible for interfacing with the apiserver, and
can potentially become a very thin daemon.
- Pros: Lower maintenance overhead for the Kubernetes maintainers if `Docker`
shim maintenance cost is discounted.
- Cons: This will incur higher integration cost because every new container
runtime needs to implement all the features and need to understand the
concept of pods. This would also lead to lower feature velocity because the
interface will need to be changed, and the new pod-level feature will need
to be supported in each runtime.
## Related Issues
* Metrics: [#27097](https://issues.k8s.io/27097)
* Log management: [#24677](https://issues.k8s.io/24677)
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/container-runtime-interface-v1.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->


@@ -1,102 +1 @@
# ControllerRef proposal This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/controller-ref.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/controller-ref.md)
Author: gmarek@
Last edit: 2016-05-11
Status: raw
Approvers:
- [ ] briangrant
- [ ] dbsmith
**Table of Contents**
- [Goal of ControllerReference](#goal-of-controllerreference)
- [Non goals](#non-goals)
- [API and semantic changes](#api-and-semantic-changes)
- [Upgrade/downgrade procedure](#upgradedowngrade-procedure)
- [Orphaning/adoption](#orphaningadoption)
- [Implementation plan (sketch)](#implementation-plan-sketch)
- [Considered alternatives](#considered-alternatives)
# Goal of ControllerReference
The main goal of the `ControllerReference` effort is to solve the problem of overlapping controllers that fight over some resources (e.g. `ReplicaSets` fighting with `ReplicationControllers` over `Pods`), which causes serious [problems](https://github.com/kubernetes/kubernetes/issues/24433) such as exploding memory usage in the Controller Manager.
We don't want to have (just) an in-memory solution, as we don't want a Controller Manager crash to cause massive changes in object ownership in the system. I.e. we need to persist the information about the "owning controller".
Secondary goal of this effort is to improve performance of various controllers and schedulers, by removing the need for expensive lookup for all matching "controllers".
# Non goals
Cascading deletion is not a goal of this effort. Cascading deletion will use `ownerReferences`, which is a [separate effort](garbage-collection.md).
`ControllerRef` will extend `OwnerReference` and reuse machinery written for it (GarbageCollector, adoption/orphaning logic).
# API and semantic changes
There will be a new API field in the `OwnerReference` in which we will store information about whether the given owner is the managing controller:
```
OwnerReference {
Controller bool
}
```
From now on by `ControllerRef` we mean an `OwnerReference` with `Controller=true`.
Most controllers (all that manage collections of things defined by label selector) will have slightly changed semantics: currently a controller owns an object if its selector matches the object's labels and it doesn't notice an older controller of the same kind that also matches the object's labels; after the introduction of `ControllerReference`, a controller will own an object iff its selector matches the labels and an `OwnerReference` with `Controller=true` points to it.
If the owner's selector or the owned object's labels change, the owning controller will be responsible for orphaning (clearing the `Controller` field in the `OwnerReference` and/or deleting the `OwnerReference` altogether) objects, after which an adoption procedure (setting the `Controller` field in one of the `OwnerReferences` and/or adding new `OwnerReferences`) might occur if another controller has a matching selector.
For debugging purposes we want to add an `adoptionTime` annotation prefixed with `kubernetes.io/` which will keep the time of last controller ownership transfer.
# Upgrade/downgrade procedure
Because `ControllerRef` will be a part of `OwnerReference` effort it will have the same upgrade/downgrade procedures.
# Orphaning/adoption
Because `ControllerRef` will be a part of `OwnerReference` effort it will have the same orphaning/adoption procedures.
Controllers will orphan objects they own in two cases:
* Change of label/selector causing selector to stop matching labels (executed by the controller)
* Deletion of a controller with `Orphaning=true` (executed by the GarbageCollector)
We will need a secondary orphaning mechanism in case of unclean controller deletion:
* GarbageCollector will remove a `ControllerRef` from an object when the reference no longer points to an existing controller
A controller will adopt (set the `Controller` field in the `OwnerReference` that points to it) an object whose labels match its selector iff:
* there are no `OwnerReferences` with `Controller` set to true in `OwnerReferences` array
* `DeletionTimestamp` is not set
and
* Controller is the first controller, among all controllers that have a matching label selector and don't have `DeletionTimestamp` set, that manages to adopt the object.
By design there are possible races during adoption if multiple controllers can own a given object.
To prevent re-adoption of an object during deletion, the `DeletionTimestamp` will be set when deletion is starting. When a controller has a non-nil `DeletionTimestamp` it won't take any actions except updating its `Status` (in particular it won't adopt any objects).
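A minimal sketch of the adoption check described above (the types below are simplified stand-ins for the real API objects, and exact-match label comparison stands in for full selector semantics):

```go
package adoption

import "time"

type OwnerReference struct {
	Name       string
	Controller bool
}

type ObjectMeta struct {
	Labels            map[string]string
	OwnerReferences   []OwnerReference
	DeletionTimestamp *time.Time
}

// canAdopt reports whether a controller with the given selector may become the
// managing controller of obj, per the rules above: the object must match the
// selector, must not already have a managing controller, and neither the
// object nor the adopting controller may be marked for deletion.
func canAdopt(selector map[string]string, controllerDeleting bool, obj ObjectMeta) bool {
	if controllerDeleting || obj.DeletionTimestamp != nil {
		return false
	}
	for _, ref := range obj.OwnerReferences {
		if ref.Controller {
			return false // already adopted by another controller
		}
	}
	for k, v := range selector {
		if obj.Labels[k] != v {
			return false
		}
	}
	return true
}
```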
# Implementation plan (sketch):
* Add API field for `Controller`,
* Extend `OwnerReference` adoption procedure to set a `Controller` field in one of the owners,
* Update all affected controllers to respect `ControllerRef`.
Necessary related work:
* `OwnerReferences` are correctly added/deleted,
* GarbageCollector removes dangling references,
* Controllers don't take any meaningful actions when `DeletionTimestamp` is set.
# Considered alternatives
* Generic "ReferenceController": centralized component that managed adoption/orphaning
* Dropped because: hard to write something that will work for all imaginable 3rd party objects, adding hooks to framework makes it possible for users to write their own logic
* Separate API field for `ControllerRef` in the ObjectMeta.
* Dropped because: nontrivial relationship between `ControllerRef` and `OwnerReferences` when it comes to deletion/adoption.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/controller-ref.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->


@@ -1,147 +1 @@
<!-- BEGIN MUNGE: GENERATED_TOC --> This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/deploy.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/deploy.md)
- [Deploy through CLI](#deploy-through-cli)
- [Motivation](#motivation)
- [Requirements](#requirements)
- [Related `kubectl` Commands](#related-kubectl-commands)
- [`kubectl run`](#kubectl-run)
- [`kubectl scale` and `kubectl autoscale`](#kubectl-scale-and-kubectl-autoscale)
- [`kubectl rollout`](#kubectl-rollout)
- [`kubectl set`](#kubectl-set)
- [Mutating Operations](#mutating-operations)
- [Example](#example)
- [Support in Deployment](#support-in-deployment)
- [Deployment Status](#deployment-status)
- [Deployment Version](#deployment-version)
- [Pause Deployments](#pause-deployments)
- [Perm-failed Deployments](#perm-failed-deployments)
<!-- END MUNGE: GENERATED_TOC -->
# Deploy through CLI
## Motivation
Users can use [Deployments](../user-guide/deployments.md) or [`kubectl rolling-update`](../user-guide/kubectl/kubectl_rolling-update.md) to deploy in their Kubernetes clusters. A Deployment provides declarative update for Pods and ReplicationControllers, whereas `rolling-update` allows the users to update their earlier deployment without worrying about schemas and configurations. Users need a way that's similar to `rolling-update` to manage their Deployments more easily.
`rolling-update` expects ReplicationController as the only resource type it deals with. It's not trivial to support exactly the same behavior with Deployment, which requires:
- Print out scaling up/down events.
- Stop the deployment if users press Ctrl-c.
- The controller should not make any more changes once the process ends. (Delete the deployment when status.replicas=status.updatedReplicas=spec.replicas)
So, instead, this document proposes another way to support easier deployment management via Kubernetes CLI (`kubectl`).
## Requirements
The following are the operations we need to support so that users can easily manage deployments:
- **Create**: To create deployments.
- **Rollback**: To restore to an earlier version of deployment.
- **Watch the status**: To watch for the status update of deployments.
- **Pause/resume**: To pause a deployment mid-way, and to resume it. (A use case is to support canary deployment.)
- **Version information**: To record and show version information that's meaningful to users. This can be useful for rollback.
## Related `kubectl` Commands
### `kubectl run`
`kubectl run` should support the creation of Deployment (already implemented) and DaemonSet resources.
### `kubectl scale` and `kubectl autoscale`
Users may use `kubectl scale` or `kubectl autoscale` to scale up and down Deployments (both already implemented).
### `kubectl rollout`
`kubectl rollout` supports both Deployment and DaemonSet. It has the following subcommands:
- `kubectl rollout undo` works like rollback; it allows the users to rollback to a previous version of deployment.
- `kubectl rollout pause` allows the users to pause a deployment. See [pause deployments](#pause-deployments).
- `kubectl rollout resume` allows the users to resume a paused deployment.
- `kubectl rollout status` shows the status of a deployment.
- `kubectl rollout history` shows meaningful version information of all previous deployments. See [deployment version](#deployment-version).
- `kubectl rollout retry` retries a failed deployment. See [perm-failed deployments](#perm-failed-deployments).
### `kubectl set`
`kubectl set` has the following subcommands:
- `kubectl set env` allows the users to set environment variables of Kubernetes resources. It should support any object that contains a single, primary PodTemplate (such as Pod, ReplicationController, ReplicaSet, Deployment, and DaemonSet).
- `kubectl set image` allows the users to update multiple images of Kubernetes resources. Users will use `--container` and `--image` flags to update the image of a container. It should support anything that has a PodTemplate.
`kubectl set` should be used for things that are common and commonly modified. Other possible future commands include:
- `kubectl set volume`
- `kubectl set limits`
- `kubectl set security`
- `kubectl set port`
### Mutating Operations
Other means of mutating Deployments and DaemonSets, including `kubectl apply`, `kubectl edit`, `kubectl replace`, `kubectl patch`, `kubectl label`, and `kubectl annotate`, may trigger rollouts if they modify the pod template.
`kubectl create` and `kubectl delete`, for creating and deleting Deployments and DaemonSets, are also relevant.
### Example
With the commands introduced above, here's an example of deployment management:
```console
# Create a Deployment
$ kubectl run nginx --image=nginx --replicas=2 --generator=deployment/v1beta1
# Watch the Deployment status
$ kubectl rollout status deployment/nginx
# Update the Deployment
$ kubectl set image deployment/nginx --container=nginx --image=nginx:<some-version>
# Pause the Deployment
$ kubectl rollout pause deployment/nginx
# Resume the Deployment
$ kubectl rollout resume deployment/nginx
# Check the change history (deployment versions)
$ kubectl rollout history deployment/nginx
# Rollback to a previous version.
$ kubectl rollout undo deployment/nginx --to-version=<version>
```
## Support in Deployment
### Deployment Status
Deployment status should summarize information about Pods, which includes:
- The number of pods of each version.
- The number of ready/not ready pods.
See issue [#17164](https://github.com/kubernetes/kubernetes/issues/17164).
### Deployment Version
We store previous deployment version information in annotations `rollout.kubectl.kubernetes.io/change-source` and `rollout.kubectl.kubernetes.io/version` of replication controllers of the deployment, to support rolling back changes as well as for the users to view previous changes with `kubectl rollout history`.
- `rollout.kubectl.kubernetes.io/change-source`, which is optional, records the kubectl command of the last mutation made to this rollout. Users may use `--record` in `kubectl` to record current command in this annotation.
- `rollout.kubectl.kubernetes.io/version` records a version number to distinguish the change sequence of a deployment's
replication controllers. A deployment obtains the largest version number from its replication controllers, increments the number by 1 upon update or creation of the deployment, and updates the version annotation of its new replication controller.
When the users perform a rollback, i.e. `kubectl rollout undo`, the deployment first looks at its existing replication controllers, regardless of their number of replicas. Then it finds the one whose `rollout.kubectl.kubernetes.io/version` annotation either contains the specified rollback version number or, if the user didn't specify any version number (i.e. wants to roll back to the last change), contains the second largest version number among all the replication controllers (the current new replication controller should hold the largest version number). Lastly, it
starts scaling up the replication controller it's rolling back to, scaling down the current ones, and then updates the version counter and the rollout annotations accordingly.
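A sketch of that selection (simplified: `RC` is a stand-in type with the version annotation already parsed into an int, and `toVersion == 0` means the user did not specify a version):

```go
package rollback

import "sort"

// RC is a simplified stand-in for a replication controller with its
// rollout.kubectl.kubernetes.io/version annotation parsed into Version.
type RC struct {
	Name    string
	Version int
}

// findRollbackTarget returns the replication controller to scale back up:
// the one with the requested version, or, when no version was requested,
// the one with the second-largest version (the largest belongs to the
// current new replication controller).
func findRollbackTarget(rcs []RC, toVersion int) (RC, bool) {
	if toVersion != 0 {
		for _, rc := range rcs {
			if rc.Version == toVersion {
				return rc, true
			}
		}
		return RC{}, false
	}
	sorted := append([]RC(nil), rcs...) // avoid mutating the caller's slice
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Version > sorted[j].Version })
	if len(sorted) < 2 {
		return RC{}, false
	}
	return sorted[1], true
}
```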
Note that a deployment's replication controllers use PodTemplate hashes (i.e. the hash of `.spec.template`) to distinguish themselves from one another. When doing a rollout or rollback, a deployment reuses an existing replication controller if it has the same PodTemplate, and its `rollout.kubectl.kubernetes.io/change-source` and `rollout.kubectl.kubernetes.io/version` annotations will be updated by the new rollout. At this point, the earlier state of this replication controller is lost in history. For example, if we had 3 replication controllers in
deployment history, and then we do a rollout with the same PodTemplate as version 1, then version 1 is lost and becomes version 4 after the rollout.
To make deployment versions more meaningful and readable for the users, we can add more annotations in the future. For example, we can add the following flags to `kubectl` for the users to describe and record their current rollout:
- `--description`: adds `description` annotation to an object when it's created to describe the object.
- `--note`: adds `note` annotation to an object when it's updated to record the change.
- `--commit`: adds `commit` annotation to an object with the commit id.
### Pause Deployments
Users sometimes need to temporarily disable a deployment. See issue [#14516](https://github.com/kubernetes/kubernetes/issues/14516).
### Perm-failed Deployments
The deployment could be marked as "permanently failed" for a given spec hash so that the system won't continue thrashing on a doomed deployment. The users can retry a failed deployment with `kubectl rollout retry`. See issue [#14519](https://github.com/kubernetes/kubernetes/issues/14519).
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/deploy.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->


@@ -1,229 +1 @@
# Deployment This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/deployment.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/deployment.md)
## Abstract
A proposal for implementing a new resource - Deployment - which will enable
declarative config updates for Pods and ReplicationControllers.
Users will be able to create a Deployment, which will spin up
a ReplicationController to bring up the desired pods.
Users can also target the Deployment at existing ReplicationControllers, in
which case the new RC will replace the existing ones. The exact mechanics of
replacement depends on the DeploymentStrategy chosen by the user.
DeploymentStrategies are explained in detail in a later section.
## Implementation
### API Object
The `Deployment` API object will have the following structure:
```go
type Deployment struct {
	TypeMeta
	ObjectMeta

	// Specification of the desired behavior of the Deployment.
	Spec DeploymentSpec

	// Most recently observed status of the Deployment.
	Status DeploymentStatus
}

type DeploymentSpec struct {
	// Number of desired pods. This is a pointer to distinguish between explicit
	// zero and not specified. Defaults to 1.
	Replicas *int

	// Label selector for pods. Existing ReplicationControllers whose pods are
	// selected by this will be scaled down. New ReplicationControllers will be
	// created with this selector, with a unique label `pod-template-hash`.
	// If Selector is empty, it is defaulted to the labels present on the Pod template.
	Selector map[string]string

	// Describes the pods that will be created.
	Template *PodTemplateSpec

	// The deployment strategy to use to replace existing pods with new ones.
	Strategy DeploymentStrategy
}

type DeploymentStrategy struct {
	// Type of deployment. Can be "Recreate" or "RollingUpdate".
	Type DeploymentStrategyType

	// TODO: Update this to follow our convention for oneOf, whatever we decide it
	// to be.
	// Rolling update config params. Present only if DeploymentStrategyType =
	// RollingUpdate.
	RollingUpdate *RollingUpdateDeploymentStrategy
}

type DeploymentStrategyType string

const (
	// Kill all existing pods before creating new ones.
	RecreateDeploymentStrategyType DeploymentStrategyType = "Recreate"

	// Replace the old RCs by new one using rolling update, i.e. gradually scale
	// down the old RCs and scale up the new one.
	RollingUpdateDeploymentStrategyType DeploymentStrategyType = "RollingUpdate"
)

// Spec to control the desired behavior of rolling update.
type RollingUpdateDeploymentStrategy struct {
	// The maximum number of pods that can be unavailable during the update.
	// Value can be an absolute number (ex: 5) or a percentage of total pods at the start of update (ex: 10%).
	// Absolute number is calculated from percentage by rounding up.
	// This can not be 0 if MaxSurge is 0.
	// By default, a fixed value of 1 is used.
	// Example: when this is set to 30%, the old RC can be scaled down by 30%
	// immediately when the rolling update starts. Once new pods are ready, old RC
	// can be scaled down further, followed by scaling up the new RC, ensuring
	// that at least 70% of original number of pods are available at all times
	// during the update.
	MaxUnavailable IntOrString

	// The maximum number of pods that can be scheduled above the original number of
	// pods.
	// Value can be an absolute number (ex: 5) or a percentage of total pods at
	// the start of the update (ex: 10%). This can not be 0 if MaxUnavailable is 0.
	// Absolute number is calculated from percentage by rounding up.
	// By default, a value of 1 is used.
	// Example: when this is set to 30%, the new RC can be scaled up by 30%
	// immediately when the rolling update starts. Once old pods have been killed,
	// new RC can be scaled up further, ensuring that total number of pods running
	// at any time during the update is at most 130% of original pods.
	MaxSurge IntOrString

	// Minimum number of seconds for which a newly created pod should be ready
	// without any of its container crashing, for it to be considered available.
	// Defaults to 0 (pod will be considered available as soon as it is ready)
	MinReadySeconds int
}

type DeploymentStatus struct {
	// Total number of ready pods targeted by this deployment (this
	// includes both the old and new pods).
	Replicas int

	// Total number of new ready pods with the desired template spec.
	UpdatedReplicas int
}
```
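A sketch of how a percentage value for `MaxUnavailable`/`MaxSurge` could be resolved to an absolute pod count, following the rounding-up rule stated in the comments above (representing `IntOrString` as a plain string here is a simplification):

```go
package deployment

import (
	"math"
	"strconv"
	"strings"
)

// resolveIntOrPercent turns a value such as "5" or "10%" into an absolute pod
// count; percentages are taken of the pod count at the start of the update
// and rounded up, as described above.
func resolveIntOrPercent(value string, startingPods int) (int, error) {
	if strings.HasSuffix(value, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(value, "%"))
		if err != nil {
			return 0, err
		}
		return int(math.Ceil(float64(startingPods) * float64(pct) / 100)), nil
	}
	return strconv.Atoi(value)
}
```

For example, with 10 pods at the start of the update, `resolveIntOrPercent("30%", 10)` yields 3.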
### Controller
#### Deployment Controller
The DeploymentController will make Deployments happen.
It will watch Deployment objects in etcd.
For each pending deployment, it will:
1. Find all RCs whose label selector is a superset of DeploymentSpec.Selector.
- For now, we will do this in the client - list all RCs and then filter the
ones we want. Eventually, we want to expose this in the API.
2. The new RC can have the same selector as the old RC and hence we add a unique
selector to all these RCs (and the corresponding label to their pods) to ensure
that they do not select the newly created pods (or old pods get selected by
new RC).
- The label key will be "pod-template-hash".
- The label value will be the hash of the podTemplateSpec for that RC without
this label (see the sketch below). This value will be unique for all RCs, since PodTemplateSpec should be unique.
- If the RCs and pods don't already have this label and selector:
- We will first add this to RC.PodTemplateSpec.Metadata.Labels for all RCs to
ensure that all new pods that they create will have this label.
- Then we will add this label to their existing pods and then add this as a selector
to that RC.
3. Find if there exists an RC for which value of "pod-template-hash" label
is same as hash of DeploymentSpec.PodTemplateSpec. If it exists already, then
this is the RC that will be ramped up. If there is no such RC, then we create
a new one using DeploymentSpec and then add a "pod-template-hash" label
to it. RCSpec.replicas = 0 for a newly created RC.
4. Scale up the new RC and scale down the old ones as per the DeploymentStrategy.
- Raise an event if we detect an error, like new pods failing to come up.
5. Go back to step 1 unless the new RC has been ramped up to desired replicas
and the old RCs have been ramped down to 0.
6. Cleanup.
DeploymentController is stateless so that it can recover in case it crashes during a deployment.
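A sketch of computing the `pod-template-hash` value referenced in step 2 (hashing the JSON-serialized template with FNV-32a is an assumption for illustration; the controller may use a different serialization or hash function):

```go
package deployment

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// podTemplateHash returns a label-safe hash of the pod template. The template
// passed in must already exclude the pod-template-hash label itself, as
// described in step 2 above.
func podTemplateHash(template interface{}) (string, error) {
	data, err := json.Marshal(template)
	if err != nil {
		return "", err
	}
	h := fnv.New32a()
	h.Write(data)
	return fmt.Sprintf("%d", h.Sum32()), nil
}
```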
### MinReadySeconds
We will implement MinReadySeconds using the Ready condition in Pod. We will add
a LastTransitionTime to PodCondition and update kubelet to set Ready to false,
each time any container crashes. Kubelet will set Ready condition back to true once
all containers are ready. For containers without a readiness probe, we will
assume that they are ready as soon as they are up.
https://github.com/kubernetes/kubernetes/issues/11234 tracks updating kubelet
and https://github.com/kubernetes/kubernetes/issues/12615 tracks adding
LastTransitionTime to PodCondition.
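A sketch of the availability check this implies (a minimal illustration, not the controller's actual code): a pod counts as available once its Ready condition has held for at least MinReadySeconds.

```go
package deployment

import "time"

// podAvailable reports whether a pod should be counted as available: it must
// be Ready, and must have been Ready for at least minReadySeconds
// (0 means available as soon as it is ready).
func podAvailable(ready bool, readyLastTransition time.Time, minReadySeconds int, now time.Time) bool {
	if !ready {
		return false
	}
	return now.Sub(readyLastTransition) >= time.Duration(minReadySeconds)*time.Second
}
```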
## Changing Deployment mid-way
### Updating
Users can update an ongoing deployment before it is completed.
In this case, the existing deployment will be stalled and the new one will
begin.
For example, consider the following case:
- User creates a deployment to rolling-update 10 pods with image:v1 to
pods with image:v2.
- User then updates this deployment to create pods with image:v3,
when the image:v2 RC had been ramped up to 5 pods and the image:v1 RC
had been ramped down to 5 pods.
- When Deployment Controller observes the new deployment, it will create
a new RC for creating pods with image:v3. It will then start ramping up this
new RC to 10 pods and will ramp down both the existing RCs to 0.
### Deleting
Users can pause/cancel a deployment by deleting it before it is completed.
Recreating the same deployment will resume it.
For example, consider the following case:
- User creates a deployment to rolling-update 10 pods with image:v1 to
pods with image:v2.
- User then deletes this deployment while the old and new RCs are at 5 replicas each.
User will end up with 2 RCs with 5 replicas each.
User can then create the same deployment again, in which case DeploymentController will
notice that the second RC already exists and can ramp it up while ramping down
the first one.
### Rollback
We want to allow the user to rollback a deployment. To rollback a
completed (or ongoing) deployment, user can create (or update) a deployment with
DeploymentSpec.PodTemplateSpec = oldRC.PodTemplateSpec.
## Deployment Strategies
DeploymentStrategy specifies how the new RC should replace existing RCs.
To begin with, we will support 2 types of deployment:
* Recreate: We kill all existing RCs and then bring up the new one. This results
in quick deployment but there is a downtime when old pods are down but
the new ones have not come up yet.
* Rolling update: We gradually scale down old RCs while scaling up the new one.
This results in a slower deployment, but there is no downtime. At all times
during the deployment, there are a few pods available (old or new). The number
of available pods and when is a pod considered "available" can be configured
using RollingUpdateDeploymentStrategy.
In future, we want to support more deployment types.
## Future
Apart from the above, we want to add support for the following:
* Running the deployment process in a pod: In future, we can run the deployment process in a pod. Then users can define their own custom deployments and we can run it using the image name.
* More DeploymentStrategyTypes: https://github.com/openshift/origin/blob/master/examples/deployment/README.md#deployment-types lists most commonly used ones.
* Triggers: Deployment will have a trigger field to identify what triggered the deployment. Options are: Manual/UserTriggered, Autoscaler, NewImage.
* Automatic rollback on error: We want to support automatic rollback on error or timeout.
## References
- https://github.com/kubernetes/kubernetes/issues/1743 has most of the
discussion that resulted in this proposal.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/deployment.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->


@@ -1,615 +1 @@
**Author**: Vishnu Kannan This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/disk-accounting.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/disk-accounting.md)
**Last Updated**: 11/16/2015
**Status**: Pending Review
This proposal is an attempt to come up with a means for accounting disk usage in Kubernetes clusters that are running docker as the container runtime. Some of the principles here might apply for other runtimes too.
### Why is disk accounting necessary?
As of Kubernetes v1.1, clusters become unusable over time due to the local disk becoming full. The kubelets on the nodes attempt to perform garbage collection of old containers and images, but that doesn't prevent running pods from using up all the available disk space.
Kubernetes users have no insight into how the disk is being consumed.
Large images and rapid logging can lead to temporary downtime on the nodes. The node has to free up disk space by deleting images and containers. During this cleanup, existing pods can fail and new pods cannot be started. The node will also transition into an `OutOfDisk` condition, preventing more pods from being scheduled to the node.
Automated eviction of pods that are hogging the local disk is not possible since proper accounting isn't available.
Since local disk is a non-compressible resource, users need means to restrict usage of local disk by pods and containers. Proper disk accounting is a prerequisite. As of today, a misconfigured low QoS class pod can end up bringing down the entire cluster by taking up all the available disk space (misconfigured logging, for example).
### Goals
1. Account for disk usage on the nodes.
2. Compatibility with the most common docker storage backends - devicemapper, aufs and overlayfs
3. Provide a roadmap for enabling disk as a schedulable resource in the future.
4. Provide a plugin interface for extending support to non-default filesystems and storage drivers.
### Non Goals
1. Compatibility with all storage backends. The matrix is pretty large already and the priority is to get disk accounting working on the most widely deployed platforms.
2. Support for filesystems other than ext4 and xfs.
### Introduction
Disk accounting in a Kubernetes cluster running docker is complex because of the plethora of ways in which disk gets utilized by a container.
Disk can be consumed for:
1. Container images
2. Containers writable layer
3. Containers logs - when written to stdout/stderr and default logging backend in docker is used.
4. Local volumes - hostPath, emptyDir, gitRepo, etc.
As of Kubernetes v1.1, kubelet exposes disk usage for the entire node and the container's writable layer for the aufs docker storage driver.
This information is made available to end users via the heapster monitoring pipeline.
#### Image layers
Image layers are shared between containers (COW) and so accounting for images is complicated.
Image layers will have to be accounted as system overhead.
As of today, it is not possible to check if there is enough disk space available on the node before an image is pulled.
#### Writable Layer
Docker creates a writable layer for every container on the host. Depending on the storage driver, the location and the underlying filesystem of this layer will change.
Any files that the container creates or updates (assuming there are no volumes) will be considered as writable layer usage.
The underlying filesystem is whatever the docker storage directory resides on. It is ext4 by default on most distributions, and xfs on RHEL.
#### Container logs
Docker engine provides a pluggable logging interface. Kubernetes is currently using the default logging mode which is `local file`. In this mode, the docker daemon stores bytes written by containers to their stdout or stderr, to local disk. These log files are contained in a special directory that is managed by the docker daemon. These logs are exposed via `docker logs` interface which is then exposed via kubelet and apiserver APIs. Currently, there is a hard-requirement for persisting these log files on the disk.
#### Local Volumes
Volumes are slightly different from other local disk use cases. They are pod scoped. Their lifetime is tied to that of a pod. Due to this property accounting of volumes will also be at the pod level.
As of now, the volume types that can use local disk directly are HostPath, EmptyDir, and GitRepo. Secrets and Downward API volumes wrap these primitive volumes.
Everything else is a network based volume.
HostPath volumes map in existing directories in the host filesystem into a pod. Kubernetes manages only the mapping. It does not manage the source on the host filesystem.
In addition to this, the changes introduced by a pod to the source of a hostPath volume are not cleaned up by Kubernetes once the pod exits. Due to these limitations, we will have to account hostPath volumes as system overhead. We should explicitly discourage use of HostPath in read-write mode.
`EmptyDir`, `GitRepo` and other local storage volumes map to a directory on the host root filesystem, that is managed by Kubernetes (kubelet). Their contents are erased as soon as the pod exits. Tracking and potentially restricting usage for volumes is possible.
### Docker storage model
Before we start exploring solutions, let's get familiar with how docker handles storage for images, writable layers, and logs.
On all storage drivers, logs are stored under `<docker root dir>/containers/<container-id>/`
The default location of the docker root directory is `/var/lib/docker`.
Volumes are handled by kubernetes.
*Caveat: Volumes specified as part of Docker images are not handled by Kubernetes currently.*
Container images and writable layers are managed by docker and their location will change depending on the storage driver. Each image layer and writable layer is referred to by an ID. The image layers are read-only. Once saved, existing writable layers can be frozen. The save feature is not important to Kubernetes since it works only with immutable images.
*Note: Image layer IDs can be obtained by running `docker history -q --no-trunc <imagename>`*
##### Aufs
Image layers and writable layers are stored under `/var/lib/docker/aufs/diff/<id>`.
The writable layer's ID is the same as the container's ID.
##### Devicemapper
Each container and each image gets its own block device. Since this driver works at the block level, it is not possible to access the layers directly without mounting them. Each container gets its own block device while running.
##### Overlayfs
Image layers and writable layers are stored under `/var/lib/docker/overlay/<id>`.
Identical files are hardlinked between images.
The image layers contain all their data under a `root` subdirectory.
Everything under `/var/lib/docker/overlay/<id>` are files required for running the container, including its writable layer.
### Improve disk accounting
Disk accounting is dependent on the storage driver in docker. A common solution that works across all storage drivers isn't available.
I'm listing a few possible solutions for disk accounting below along with their limitations.
We need a plugin model for disk accounting. Some storage drivers in docker will require special plugins.
#### Container Images
As of today, the partition that is holding docker images is flagged by cadvisor, and it uses filesystem stats to identify the overall disk usage of that partition.
Isolated usage of just image layers is available today using `docker history <image name>`.
But isolated usage isn't of much use because image layers are shared between containers and so it is not possible to charge a single pod for image disk usage.
Continuing to use the entire partition's availability for garbage collection purposes in kubelet should not affect reliability.
We might garbage collect more often.
As long as we do not expose features that require persisting old containers, computing image layer usage wouldn't be necessary.
The main goals for images are:
1. Capturing total image disk usage.
2. Checking whether a new image will fit on the disk.
In case we choose to compute the size of image layers alone, the following are some of the ways to achieve that.
*Note that some of the strategies mentioned below are applicable in general to other kinds of storage like volumes, etc.*
##### Docker History
It is possible to run `docker history` and then create a graph of all images and corresponding image layers.
This graph will let us figure out the disk usage of all the images.
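A rough sketch of this approach (assuming a docker version of this era, where the layer IDs returned by `docker history -q --no-trunc` can be inspected like images and report a `.Size` field):

```shell
#!/bin/bash
# Sketch only: build the image -> layer mapping with `docker history` and sum
# each distinct layer once, so that layers shared between images are not
# double counted.
declare -A layer_size
for img in $(docker images -q --no-trunc | sort -u); do
  for layer in $(docker history -q --no-trunc "$img" | grep -v '<missing>'); do
    layer_size[$layer]=$(docker inspect -f '{{.Size}}' "$layer")
  done
done
total=0
for sz in "${layer_size[@]}"; do
  total=$((total + sz))
done
echo "Total image layer usage: ${total} bytes"
```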
**Pros**
* Compatible across storage drivers.
**Cons**
* Requires maintaining an internal representation of images.
##### Enhance docker
Docker handles the upload and download of image layers, so it can embed enough information about each layer. If docker is enhanced to expose this information, we can statically identify the space about to be occupied by read-only image layers, even before the image layers are downloaded.
A new [docker feature](https://github.com/docker/docker/pull/16450) (docker pull --dry-run) is pending review, which outputs the disk space that will be consumed by new images. Once this feature lands, we can perform feasibility checks and reject pods that will consume more disk space than what is currently available on the node.
Another option is to expose disk usage of all images together as a first-class feature.
**Pros**
* Works across all storage drivers since docker abstracts the storage drivers.
* Less code to maintain in kubelet.
**Cons**
* Not available today.
* Requires serialized image pulls.
* Metadata files are not tracked.
##### Overlayfs and Aufs
###### `du`
We can list all the image layer specific directories, excluding container directories, and run `du` on each of those directories.
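A minimal sketch, assuming the aufs driver and GNU `du` (for overlayfs, substitute `/var/lib/docker/overlay`):

```shell
# Measure only image layers: skip directories whose IDs belong to containers,
# including their "-init" layers.
DRIVER_DIR=/var/lib/docker/aufs/diff
CONTAINER_IDS=$(docker ps -aq --no-trunc)
for layer in "$DRIVER_DIR"/*; do
  id=$(basename "$layer")
  [[ "$id" == *-init ]] && continue
  echo "$CONTAINER_IDS" | grep -q "^${id}$" && continue
  du -s --block-size=1 "$layer"
done | awk '{total += $1} END {print total " bytes used by image layers"}'
```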
**Pros**:
* This is the least-intrusive approach.
* It will work out of the box without requiring any additional configuration.
**Cons**:
* `du` can consume a lot of CPU and memory. There have been several issues reported against the kubelet in the past that were related to `du`.
* It is time consuming and cannot be run frequently. It requires special handling to constrain resource usage, such as setting a lower nice value or running it in a sub-container.
* Can block container deletion by keeping file descriptors open.
###### Linux gid based Disk Quota
[Disk quota](https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/ch-disk-quotas.html) feature provided by the linux kernel can be used to track the usage of image layers. Ideally, we need `project` support for disk quota, which lets us track usage of directory hierarchies using `project ids`. Unfortunately, that feature is only available for xfs filesystems today. Since most of our distributions use `ext4` by default, we will have to use either `uid` or `gid` based quota tracking.
Both `uids` and `gids` are meant for security. Overloading that concept for disk tracking is painful and ugly. But, that is what we have today.
Kubelet needs to define a gid for tracking image layers and make that gid or group the owner of `/var/lib/docker/[aufs | overlayfs]` recursively. Once this is done, the quota sub-system in the kernel will report the blocks being consumed by the storage driver on the underlying partition.
Since this number also includes the container's writable layer, we will have to somehow subtract that usage from the overall usage of the storage driver directory. Luckily, we can use the same mechanism for tracking the container's writable layer. Once we apply a different `gid` to the container's writable layer, which is located under `/var/lib/docker/<storage_driver>/diff/<container_id>`, the quota subsystem will no longer include the container's writable layer usage.
Xfs, on the other hand, supports project quota, which lets us track disk usage of arbitrary directories using a project. Support for this feature in ext4 is being reviewed. So on xfs, we can use quota without having to clobber the writable layer's uid and gid.
**Pros**:
* Low overhead tracking provided by the kernel.
**Cons**
* Requires updates to default ownership on docker's internal storage driver directories. We will have to deal with storage driver implementation details in any approach that is not docker native.
* Requires additional node configuration - quota subsystem needs to be setup on the node. This can either be automated or made a requirement for the node.
* Kubelet needs to perform gid management. A range of gids has to be allocated to the kubelet for the purposes of quota management. This range must not be used for any other purpose out of band. Not required if project quota is available.
* Breaks `docker save` semantics. Since kubernetes assumes immutable images, this is not a blocker. To support quota in docker, we will need user-namespaces along with custom gid mapping for each container. This feature does not exist today. This is not an issue with project quota.
*Note: Refer to the [Appendix](#appendix) section for more real-world examples of using quota with docker.*
**Project Quota**
Project quota support for ext4 is currently being reviewed upstream. If that feature lands upstream soon, project IDs will be used for disk tracking instead of uids and gids.
##### Devicemapper
The devicemapper storage driver sets up two volumes, metadata and data, which are used to store image layers and container writable layers. The volumes can be real devices or loopback devices. A pool device is created which uses the underlying volumes for real storage.
A new thinly-provisioned volume, based on the pool, will be created for running containers.
The kernel tracks the usage of the pool device at the block device layer. The usage here includes image layers and containers' writable layers.
Since the kubelet has to track the writable layer usage anyway, we can subtract the aggregated root filesystem usage from the overall pool device usage to get the image layers' disk usage.
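As a rough sketch, the pool usage reported by the kernel can be read and converted to bytes as follows (the pool name and block size are taken from the Appendix example and will differ per node):

```shell
POOL="docker-8:1-268480-pool"
BLOCK_SECTORS=128   # data block size, in 512-byte sectors, from `dmsetup table`
# field 6 of the thin-pool status line is "<used>/<total>" data blocks
read -r used total < <(dmsetup status "$POOL" | awk '{split($6, a, "/"); print a[1], a[2]}')
echo "pool used:  $((used  * BLOCK_SECTORS * 512)) bytes"
echo "pool total: $((total * BLOCK_SECTORS * 512)) bytes"
```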
Linux quota and `du` will not work with device mapper.
A docker dry run option (mentioned above) is another possibility.
#### Container Writable Layer
##### Overlayfs / Aufs
Docker creates a separate directory for the container's writable layer, which is then overlayed on top of the read-only image layers.
Both the previously mentioned options of `du` and `Linux Quota` will work for this case as well.
Kubelet can use `du` to track usage and enforce `limits` once disk becomes a schedulable resource. As mentioned earlier `du` is resource intensive.
To use Disk quota, kubelet will have to allocate a separate gid per container. Kubelet can reuse the same gid for multiple instances of the same container (restart scenario). As and when kubelet garbage collects dead containers, the usage of the container will drop.
If local disk becomes a schedulable resource, `linux quota` can be used to impose `request` and `limits` on the container writable layer.
`limits` can be enforced using hard limits. Enforcing `request` will be tricky. One option is to enforce `requests` only when disk availability drops below a threshold (e.g. 10%). Kubelet can at that point evict pods that exceed their requested space. Other options include using `soft limits` with grace periods, but this option is complex.
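A minimal sketch of the quota-based enforcement path, assuming quota is already enabled on the partition, the aufs driver, and that gid 9000 has been allocated to this container (the container ID is a placeholder):

```shell
CONTAINER_GID=9000
LIMIT_BLOCKS=25600   # hard limit in 4096-byte blocks (~100 MB)
chown -R :"$CONTAINER_GID" /var/lib/docker/aufs/diff/<container-id>
chmod -R a+s /var/lib/docker/aufs/diff/<container-id>
# soft limits of 0 disable the soft threshold; only the hard block limit is set
setquota -g "$CONTAINER_GID" 0 "$LIMIT_BLOCKS" 0 0 -a
repquota -g -a | grep "$CONTAINER_GID"   # report current usage against the limit
```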
##### Devicemapper
FIXME: How to calculate writable layer usage with devicemapper?
To enforce `limits`, the volume created for the container's writable layer filesystem can be dynamically [resized](https://jpetazzo.github.io/2014/01/29/docker-device-mapper-resize/) so that it cannot use more than `limit`. `request` will have to be enforced by the kubelet.
#### Container logs
Container logs are not storage driver specific. We can use either `du` or `quota` to track log usage per container. Log files are stored under `/var/lib/docker/containers/<container-id>`.
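For example, a `du` based sketch of per-container log usage (assuming the default json-file logging driver, which writes `<container-id>-json.log`):

```shell
for dir in /var/lib/docker/containers/*; do
  id=$(basename "$dir")
  du -s --block-size=1 "$dir"/*-json.log 2>/dev/null | awk -v id="$id" '{print id, $1 " bytes of logs"}'
done
```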
In the case of quota, we can create a separate gid for tracking log usage. This will let users track log usage and writable layer usage individually.
For the purposes of enforcing limits though, kubelet will use the sum of log and writable layer usage.
In the future, we can consider adding log rotation support for these log files either in kubelet or via docker.
#### Volumes
The local disk based volumes map to a directory on the disk. We can use `du` or `quota` to track the usage of volumes.
There exists a concept called `FsGroup` in kubernetes today, which lets users specify a gid for all the volumes in a pod. If that is set, we can use the `FsGroup` gid for quota purposes. This requires `limits` for volumes to be a pod-level resource though.
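A sketch of how that could look, assuming the kubelet's conventional on-disk layout for emptyDir volumes (the pod UID, volume name and gid are placeholders):

```shell
FS_GROUP=9002
VOL=/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~empty-dir/<volume-name>
chown -R :"$FS_GROUP" "$VOL"
chmod -R g+s "$VOL"          # new files inherit the FsGroup gid
quota -g "$FS_GROUP" -v      # combined usage of all volumes owned by this FsGroup
```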
### Yet to be explored
* Support for filesystems other than ext4 and xfs like `zfs`
* Support for Btrfs
It should be clear at this point that we need a plugin-based model for disk accounting. Support for other filesystems, both CoW and regular, can be added as and when required. As we progress towards making accounting work on the above-mentioned storage drivers, we can come up with an abstraction for storage plugins in general.
### Implementation Plan and Milestones
#### Milestone 1 - Get accounting to just work!
This milestone targets exposing the following categories of disk usage from the kubelet - infrastructure (images, sys daemons, etc), containers (log + writable layer) and volumes.
* `du` works today. Use `du` for all the categories and ensure that it works on both aufs and overlayfs.
* Add device mapper support.
* Define a storage driver based pluggable disk accounting interface in cadvisor.
* Reuse that interface for accounting volumes in kubelet.
* Define a disk manager module in kubelet that will serve as a source of disk usage information for the rest of the kubelet.
* Ensure that the kubelet metrics APIs (/apis/metrics/v1beta1) exposes the disk usage information. Add an integration test.
#### Milestone 2 - node reliability
Improve user experience by doing whatever is necessary to keep the node running.
NOTE: [`Out of Resource Killing`](https://github.com/kubernetes/kubernetes/issues/17186) design is a prerequisite.
* Disk manager will evict pods and containers based on QoS class whenever the disk availability is below a critical level.
* Explore combining existing container and image garbage collection logic into disk manager.
Ideally, this phase should be completed before v1.2.
#### Milestone 3 - Performance improvements
In this milestone, we will add support for quota and make it opt-in. There should be no user visible changes in this phase.
* Add gid allocation manager to kubelet
* Reconcile gids allocated after restart.
* Configure linux quota automatically on startup. Do not set any limits in this phase.
* Allocate gids for pod volumes, container writable layers and logs, and also for image layers.
* Update the docker runtime plugin in kubelet to perform the necessary `chown`s and `chmod`s between container creation and startup.
* Pass the allocated gids as supplementary gids to containers.
* Update disk manager in kubelet to use quota when configured.
#### Milestone 4 - Users manage local disks
In this milestone, we will make local disk a schedulable resource.
* Finalize volume accounting - is it at the pod level or per-volume.
* Finalize multi-disk management policy. Will additional disks be handled as whole units?
* Set aside some space for image layers and the rest of the infra overhead - node allocatable resources include local disk.
* `du` plugin triggers container or pod eviction whenever usage exceeds limit.
* Quota plugin sets hard limits equal to user specified `limits`.
* Devicemapper plugin resizes the writable layer so that it does not exceed the container's disk `limit`.
* Disk manager evicts pods based on `usage` - `request` delta instead of just QoS class.
* Sufficient integration testing for this feature.
### Appendix
#### Implementation Notes
The following is a rough outline of the testing I performed to corroborate my prior design ideas.
Test setup information
* Testing was performed on GCE virtual machines
* All the test VMs were using ext4.
* Distribution tested against is mentioned as part of each graph driver.
##### AUFS testing notes:
Tested on Debian jessie
1. Setup Linux Quota following this [tutorial](https://www.google.com/url?q=https://www.howtoforge.com/tutorial/linux-quota-ubuntu-debian/&sa=D&ust=1446146816105000&usg=AFQjCNHThn4nwfj1YLoVmv5fJ6kqAQ9FlQ).
2. Create a new group x on the host and enable quota for that group
1. `groupadd -g 9000 x`
2. `setquota -g 9000 -a 0 100 0 100` // 100 blocks (4096 bytes each*)
3. `quota -g 9000 -v` // Check that quota is enabled
3. Create a docker container
4. `docker create -it busybox /bin/sh -c "dd if=/dev/zero of=/file count=10 bs=1M"`
8d8c56dcfbf5cda9f9bfec7c6615577753292d9772ab455f581951d9a92d169d
4. Change group on the writable layer directory for this container
5. `chmod a+s /var/lib/docker/aufs/diff/8d8c56dcfbf5cda9f9bfec7c6615577753292d9772ab455f581951d9a92d169d`
6. `chown :x /var/lib/docker/aufs/diff/8d8c56dcfbf5cda9f9bfec7c6615577753292d9772ab455f581951d9a92d169d`
5. Start the docker container
7. `docker start 8d`
8. Check usage using quota and group x
```shell
$ quota -g x -v
Disk quotas for group x (gid 9000):
Filesystem **blocks** quota limit grace files quota limit grace
/dev/sda1 **10248** 0 0 3 0 0
```
Using the same workflow, we can add new sticky group IDs to emptyDir volumes and account for their usage against pods.
Since each container requires a gid for the purposes of quota, we will have to reserve ranges of gids for use by the kubelet. Since kubelet does not checkpoint its state, recovery of group id allocations will be an interesting problem. More on this later.
Track the space occupied by images after they have been pulled locally as follows.
*Note: This approach requires serialized image pulls to be of any use to the kubelet.*
1. Create a group specifically for the graph driver
1. `groupadd -g 9001 docker-images`
2. Update group ownership on the graph (tracks image metadata) and storage driver directories.
2. `chown -R :9001 /var/lib/docker/[overlay | aufs]`
3. `chmod a+s /var/lib/docker/[overlay | aufs]`
4. `chown -R :9001 /var/lib/docker/graph`
5. `chmod a+s /var/lib/docker/graph`
3. Any new images pulled or containers created will be accounted to the `docker-images` group by default.
4. Once we update the group ownership on newly created containers to a different gid, the container writable layer's disk usage gets dropped from this group.
##### Overlayfs
Tested on Ubuntu 15.10.
Overlayfs works similarly to Aufs; only the path to the container's writable layer directory changes.
* Setup Linux Quota following this [tutorial](https://www.google.com/url?q=https://www.howtoforge.com/tutorial/linux-quota-ubuntu-debian/&sa=D&ust=1446146816105000&usg=AFQjCNHThn4nwfj1YLoVmv5fJ6kqAQ9FlQ).
* Create a new group x on the host and enable quota for that group
* `groupadd -g 9000 x`
* `setquota -g 9000 -a 0 100 0 100` // 100 blocks (4096 bytes each*)
* `quota -g 9000 -v` // Check that quota is enabled
* Create a docker container
* `docker create -it busybox /bin/sh -c "dd if=/dev/zero of=/file count=10 bs=1M"`
* `b8cc9fae3851f9bcefe922952b7bca0eb33aa31e68e9203ce0639fc9d3f3c61`
* Change group on the writable layers directory for this container
* `chmod -R a+s /var/lib/docker/overlay/b8cc9fae3851f9bcefe922952b7bca0eb33aa31e68e9203ce0639fc9d3f3c61b/*`
* `chown -R :9000 /var/lib/docker/overlay/b8cc9fae3851f9bcefe922952b7bca0eb33aa31e68e9203ce0639fc9d3f3c61b/*`
* Check quota before and after running the container.
```shell
$ quota -g x -v
Disk quotas for group x (gid 9000):
Filesystem blocks quota limit grace files quota limit grace
/dev/sda1 48 0 0 19 0 0
```
* Start the docker container
* `docker start b8`
* ```shell
quota -g x -v
Disk quotas for group x (gid 9000):
Filesystem **blocks** quota limit grace files quota limit grace
/dev/sda1 **10288** 0 0 20 0 0
```
##### Device mapper
Usage of Linux Quota should be possible for the purposes of volumes and log files.
Devicemapper storage driver in docker uses ["thin targets"](https://www.kernel.org/doc/Documentation/device-mapper/thin-provisioning.txt). Underneath there are two block devices, “data” and “metadata”, from which more block devices are created for containers. More information [here](http://www.projectatomic.io/docs/filesystems/).
These devices can be loopback or real storage devices.
The base device has a maximum storage capacity. This means that the sum total of storage space occupied by images and containers cannot exceed this capacity.
By default, all images and containers are created from an initial filesystem with a 10GB limit.
A separate filesystem is created for each container as part of start (not create).
It is possible to [resize](https://jpetazzo.github.io/2014/01/29/docker-device-mapper-resize/) the container filesystem.
For the purposes of image space tracking, we can read the pool device usage reported by the kernel (via `dmsetup status`, as shown in the testing notes below) and subtract the aggregated container writable layer usage.
###### Testing notes
* ```shell
$ docker info
...
Storage Driver: devicemapper
Pool Name: **docker-8:1-268480-pool**
Pool Blocksize: 65.54 kB
Backing Filesystem: extfs
Data file: /dev/loop0
Metadata file: /dev/loop1
Data Space Used: 2.059 GB
Data Space Total: 107.4 GB
Data Space Available: 48.45 GB
Metadata Space Used: 1.806 MB
Metadata Space Total: 2.147 GB
Metadata Space Available: 2.146 GB
Udev Sync Supported: true
Deferred Removal Enabled: false
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Library Version: 1.02.99 (2015-06-20)
```
```shell
$ dmsetup table docker-8\:1-268480-pool
0 209715200 thin-pool 7:1 7:0 **128** 32768 1 skip_block_zeroing
```
128 is the data block size (in 512-byte sectors)
Usage from the kernel for the pool device:
```shell
$ dmsetup status docker-8\:1-268480-pool
0 209715200 thin-pool 37 441/524288 **31424/1638400** - rw discard_passdown queue_if_no_space -
```
Used/total data blocks - 31424/1638400
Usage in MB = 31424 * 512 * 128 (block size from above) bytes = 1964 MB
Capacity in MB = 1638400 * 512 * 128 bytes = 100 GB
##### Log file accounting
* Setup Linux quota for a container as mentioned above.
* Update group ownership on the following directories to the container group ID created earlier. Adapting the examples above:
* `chmod -R a+s /var/lib/docker/**containers**/b8cc9fae3851f9bcefe922952b7bca0eb33aa31e68e9203ce0639fc9d3f3c61b/*`
* `chown -R :9000 /var/lib/docker/**containers**/b8cc9fae3851f9bcefe922952b7bca0eb33aa31e68e9203ce0639fc9d3f3c61b/*`
##### Testing titbits
* Ubuntu 15.10 doesn't ship with the quota module on virtual machines. [Install the linux-image-extra-virtual](http://askubuntu.com/questions/109585/quota-format-not-supported-in-kernel) package to get quota to work.
* Overlay storage driver needs kernels >= 3.18. I used Ubuntu 15.10 to test Overlayfs.
* If you use a non-default location for docker storage, change `/var/lib/docker` in the examples to your storage location.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/disk-accounting.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->
View File
@ -1,266 +1 @@
# Proposal: Dramatically Simplify Kubernetes Cluster Creation This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/dramatically-simplify-cluster-creation.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/dramatically-simplify-cluster-creation.md)
> ***Please note: this proposal doesn't reflect final implementation, it's here for the purpose of capturing the original ideas.***
> ***You should probably [read `kubeadm` docs](http://kubernetes.io/docs/getting-started-guides/kubeadm/) to understand the end result of this effort.***
Luke Marsden & many others in [SIG-cluster-lifecycle](https://github.com/kubernetes/community/tree/master/sig-cluster-lifecycle).
17th August 2016
*This proposal aims to capture the latest consensus and plan of action of SIG-cluster-lifecycle. It should satisfy the first bullet point [required by the feature description](https://github.com/kubernetes/features/issues/11).*
See also: [this presentation to community hangout on 4th August 2016](https://docs.google.com/presentation/d/17xrFxrTwqrK-MJk0f2XCjfUPagljG7togXHcC39p0sM/edit?ts=57a33e24#slide=id.g158d2ee41a_0_76)
## Motivation
Kubernetes is hard to install, and there are many different ways to do it today. None of them are excellent. We believe this is hindering adoption.
## Goals
Have one recommended, official, tested, "happy path" which will enable a majority of new and existing Kubernetes users to:
* Kick the tires and easily turn up a new cluster on infrastructure of their choice
* Get a reasonably secure, production-ready cluster, with reasonable defaults and a range of easily-installable add-ons
We plan to do so by improving and simplifying Kubernetes itself, rather than building lots of tooling which "wraps" Kubernetes by poking all the bits into the right place.
## Scope of project
There are logically 3 steps to deploying a Kubernetes cluster:
1. *Provisioning*: Getting some servers - these may be VMs on a developer's workstation, VMs in public clouds, or bare-metal servers in a user's data center.
2. *Install & Discovery*: Installing the Kubernetes core components on those servers (kubelet, etc) - and bootstrapping the cluster to a state of basic liveness, including allowing each server in the cluster to discover other servers: for example teaching etcd servers about their peers, having TLS certificates provisioned, etc.
3. *Add-ons*: Now that basic cluster functionality is working, installing add-ons such as DNS or a pod network (should be possible using kubectl apply).
Notably, this project is *only* working on dramatically improving 2 and 3 from the perspective of users typing commands directly into root shells of servers. The reason for this is that there are a great many different ways of provisioning servers, and users will already have their own preferences.
What's more, once we've radically improved the user experience of 2 and 3, it will make the job of tools that want to do all three much easier.
## User stories
### Phase I
**_In time to be an alpha feature in Kubernetes 1.4._**
Note: the current plan is to deliver `kubeadm` which implements these stories as "alpha" packages built from master (after the 1.4 feature freeze), but which are capable of installing a Kubernetes 1.4 cluster.
* *Install*: As a potential Kubernetes user, I can deploy a Kubernetes 1.4 cluster on a handful of computers running Linux and Docker by typing two commands on each of those computers. The process is so simple that it becomes obvious to me how to easily automate it if I so wish.
* *Pre-flight check*: If any of the computers don't have working dependencies installed (e.g. bad version of Docker, too-old Linux kernel), I am informed early on and given clear instructions on how to fix it so that I can keep trying until it works.
* *Control*: Having provisioned a cluster, I can gain user credentials which allow me to remotely control it using kubectl.
* *Install-addons*: I can select from a set of recommended add-ons to install directly after installing Kubernetes on my set of initial computers with kubectl apply.
* *Add-node*: I can add another computer to the cluster.
* *Secure*: As an attacker with (presumed) control of the network, I cannot add malicious nodes I control to the cluster created by the user. I also cannot remotely control the cluster.
### Phase II
**_In time for Kubernetes 1.5:_**
*Everything from Phase I as beta/stable feature, everything else below as beta feature in Kubernetes 1.5.*
* *Upgrade*: Later, when Kubernetes 1.4.1 or any newer release is published, I can upgrade to it by typing one other command on each computer.
* *HA*: If one of the computers in the cluster fails, the cluster carries on working. I can find out how to replace the failed computer, including if the computer was one of the masters.
## Top-down view: UX for Phase I items
We will introduce a new binary, kubeadm, which ships with the Kubernetes OS packages (and binary tarballs, for OSes without package managers).
```
laptop$ kubeadm --help
kubeadm: bootstrap a secure kubernetes cluster easily.
/==========================================================\
| KUBEADM IS ALPHA, DO NOT USE IT FOR PRODUCTION CLUSTERS! |
| |
| But, please try it out! Give us feedback at: |
| https://github.com/kubernetes/kubernetes/issues |
| and at-mention @kubernetes/sig-cluster-lifecycle |
\==========================================================/
Example usage:
Create a two-machine cluster with one master (which controls the cluster),
and one node (where workloads, like pods and containers run).
On the first machine
====================
master# kubeadm init master
Your token is: <token>
On the second machine
=====================
node# kubeadm join node --token=<token> <ip-of-master>
Usage:
kubeadm [command]
Available Commands:
init Run this on the first server you deploy onto.
join Run this on other servers to join an existing cluster.
user Get initial admin credentials for a cluster.
manual Advanced, less-automated functionality, for power users.
Use "kubeadm [command] --help" for more information about a command.
```
### Install
*On first machine:*
```
master# kubeadm init master
Initializing kubernetes master... [done]
Cluster token: 73R2SIPM739TNZOA
Run the following command on machines you want to become nodes:
kubeadm join node --token=73R2SIPM739TNZOA <master-ip>
You can now run kubectl here.
```
*On N "node" machines:*
```
node# kubeadm join node --token=73R2SIPM739TNZOA <master-ip>
Initializing kubernetes node... [done]
Bootstrapping certificates... [done]
Joined node to cluster, see 'kubectl get nodes' on master.
```
Note `[done]` would be colored green in all of the above.
### Install: alternative for automated deploy
*The user (or their config management system) creates a token and passes the same one to both init and join.*
```
master# kubeadm init master --token=73R2SIPM739TNZOA
Initializing kubernetes master... [done]
You can now run kubectl here.
```
### Pre-flight check
```
master# kubeadm init master
Error: socat not installed. Unable to proceed.
```
### Control
*On master, after Install, kubectl is automatically able to talk to localhost:8080:*
```
master# kubectl get pods
[normal kubectl output]
```
*To mint new user credentials on the master:*
```
master# kubeadm user create -o kubeconfig-bob bob
Waiting for cluster to become ready... [done]
Creating user certificate for user... [done]
Waiting for user certificate to be signed... [done]
Your cluster configuration file has been saved in kubeconfig.
laptop# scp <master-ip>:/root/kubeconfig-bob ~/.kubeconfig
laptop# kubectl get pods
[normal kubectl output]
```
### Install-addons
*Using CNI network as example:*
```
master# kubectl apply --purge -f \
https://git.io/kubernetes-addons/<X>.yaml
[normal kubectl apply output]
```
### Add-node
*Same as Install, “on node machines”.*
### Secure
```
node# kubeadm join --token=GARBAGE node <master-ip>
Unable to join mesh network. Check your token.
```
## Work streams: critical path, must have in 1.4 before feature freeze
1. [TLS bootstrapping](https://github.com/kubernetes/features/issues/43) - so that kubeadm can mint credentials for kubelets and users
* Requires [#25764](https://github.com/kubernetes/kubernetes/pull/25764) and auto-signing [#30153](https://github.com/kubernetes/kubernetes/pull/30153) but does not require [#30094](https://github.com/kubernetes/kubernetes/pull/30094).
* @philips, @gtank & @yifan-gu
1. Fix for [#30515](https://github.com/kubernetes/kubernetes/issues/30515) - so that kubeadm can install a kubeconfig which kubelet then picks up
* @smarterclayton
## Work streams: can land after 1.4 feature freeze
1. [Debs](https://github.com/kubernetes/release/pull/35) and [RPMs](https://github.com/kubernetes/release/pull/50) (and binaries?) - so that kubernetes can be installed in the first place
* @mikedanese & @dgoodwin
1. [kubeadm implementation](https://github.com/lukemarsden/kubernetes/tree/kubeadm-scaffolding) - the kubeadm CLI itself, will get bundled into "alpha" kubeadm packages
* @lukemarsden & @errordeveloper
1. [Implementation of JWS server](https://github.com/jbeda/kubernetes/blob/discovery-api/docs/proposals/super-simple-discovery-api.md#method-jws-token) from [#30707](https://github.com/kubernetes/kubernetes/pull/30707) - so that we can implement the simple UX with no dependencies
* @jbeda & @philips?
1. Documentation - so that new users can see this in 1.4 (even if it's caveated with alpha/experimental labels and flags all over it)
* @lukemarsden
1. `kubeadm` alpha packages
* @lukemarsden, @mikedanese, @dgoodwin
### Nice to have
1. [Kubectl apply --purge](https://github.com/kubernetes/kubernetes/pull/29551) - so that addons can be maintained using k8s infrastructure
* @lukemarsden & @errordeveloper
## kubeadm implementation plan
Based on [@philips' comment here](https://github.com/kubernetes/kubernetes/pull/30361#issuecomment-239588596).
The key point with this implementation plan is that it requires basically no changes to kubelet except [#30515](https://github.com/kubernetes/kubernetes/issues/30515).
It also doesn't require kubelet to do TLS bootstrapping - kubeadm handles that.
### kubeadm init master
1. User installs and configures kubelet to look for manifests in `/etc/kubernetes/manifests`
1. API server CA certs are generated by kubeadm
1. kubeadm generates pod manifests to launch API server and etcd
1. kubeadm pushes a replica set for the prototype jws-server and the JWS into the API server with host networking so it is listening on the master node IP
1. kubeadm prints out the IP of the JWS server and the JWS token
### kubeadm join node --token IP
1. User installs and configures kubelet to have a kubeconfig at `/var/lib/kubelet/kubeconfig` but the kubelet is in a crash loop and is restarted by host init system
1. kubeadm talks to jws-server on IP with token and gets the cacert, then talks to the apiserver TLS bootstrap API to get client cert, etc and generates a kubelet kubeconfig
1. kubeadm places kubeconfig into `/var/lib/kubelet/kubeconfig` and waits for kubelet to restart
1. Mission accomplished, we think.
## See also
* [Joe Beda's "K8s the hard way easier"](https://docs.google.com/document/d/1lJ26LmCP-I_zMuqs6uloTgAnHPcuT7kOYtQ7XSgYLMA/edit#heading=h.ilgrv18sg5t) which combines Kelsey's "Kubernetes the hard way" with history of proposed UX at the end (scroll all the way down to the bottom).
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/dramatically-simplify-cluster-creation.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->
View File
@ -1,238 +1 @@
<!-- BEGIN MUNGE: GENERATED_TOC --> This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/external-lb-source-ip-preservation.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/external-lb-source-ip-preservation.md)
- [Overview](#overview)
- [Motivation](#motivation)
- [Alpha Design](#alpha-design)
- [Overview](#overview-1)
- [Traffic Steering using LB programming](#traffic-steering-using-lb-programming)
- [Traffic Steering using Health Checks](#traffic-steering-using-health-checks)
- [Choice of traffic steering approaches by individual Cloud Provider implementations](#choice-of-traffic-steering-approaches-by-individual-cloud-provider-implementations)
- [API Changes](#api-changes)
- [Local Endpoint Recognition Support](#local-endpoint-recognition-support)
- [Service Annotation to opt-in for new behaviour](#service-annotation-to-opt-in-for-new-behaviour)
- [NodePort allocation for HealthChecks](#nodeport-allocation-for-healthchecks)
- [Behavior Changes expected](#behavior-changes-expected)
- [External Traffic Blackholed on nodes with no local endpoints](#external-traffic-blackholed-on-nodes-with-no-local-endpoints)
- [Traffic Balancing Changes](#traffic-balancing-changes)
- [Cloud Provider support](#cloud-provider-support)
- [GCE 1.4](#gce-14)
- [GCE Expected Packet Source/Destination IP (Datapath)](#gce-expected-packet-sourcedestination-ip-datapath)
- [GCE Expected Packet Destination IP (HealthCheck path)](#gce-expected-packet-destination-ip-healthcheck-path)
- [AWS TBD](#aws-tbd)
- [Openstack TBD](#openstack-tbd)
- [Azure TBD](#azure-tbd)
- [Testing](#testing)
- [Beta Design](#beta-design)
- [API Changes from Alpha to Beta](#api-changes-from-alpha-to-beta)
- [Future work](#future-work)
- [Appendix](#appendix)
<!-- END MUNGE: GENERATED_TOC -->
# Overview
Kubernetes provides an external loadbalancer service type which creates a virtual external ip
(in supported cloud provider environments) that can be used to load-balance traffic to
the pods matching the service pod-selector.
## Motivation
The current implementation requires that the cloud loadbalancer balances traffic across all
Kubernetes worker nodes, and this traffic is then equally distributed to all the backend
pods for that service.
Due to the DNAT required to redirect the traffic to its ultimate destination, the return
path for each session MUST traverse the same node again. To ensure this, the node also
performs a SNAT, replacing the source ip with its own.
This causes the service endpoint to see the session as originating from a cluster local ip address.
*The original external source IP is lost*
This is not a satisfactory solution - the original external source IP MUST be preserved for a
lot of applications and customer use-cases.
# Alpha Design
This section describes the proposed design for
[alpha-level](../../docs/devel/api_changes.md#alpha-beta-and-stable-versions) support, although
additional features are described in [future work](#future-work).
## Overview
The double hop must be prevented by programming the external load balancer to direct traffic
only to nodes that have local pods for the service. This can be accomplished in two ways, either
by API calls to add/delete nodes from the LB node pool or by adding health checking to the LB and
failing/passing health checks depending on the presence of local pods.
## Traffic Steering using LB programming
This approach requires that the Cloud LB be reprogrammed to be in sync with endpoint presence.
Whenever the first service endpoint is scheduled onto a node, the node is added to the LB pool.
Whenever the last service endpoint is unhealthy on a node, the node needs to be removed from the LB pool.
This is a slow operation, on the order of 30-60 seconds, and involves the Cloud Provider API path.
If the API endpoint is temporarily unavailable, the datapath will be misprogrammed till the
reprogramming is successful and the API->datapath tables are updated by the cloud provider backend.
## Traffic Steering using Health Checks
This approach requires that all worker nodes in the cluster be programmed into the LB target pool.
To steer traffic only onto nodes that have endpoints for the service, we program the LB to perform
node healthchecks. The kube-proxy daemons running on each node will be responsible for responding
to these healthcheck requests (URL `/healthz`) from the cloud provider LB healthchecker. An additional nodePort
will be allocated for these health checks.
kube-proxy already watches for Service and Endpoint changes; it will maintain an in-memory lookup
table indicating the number of local endpoints for each service.
For a value of zero local endpoints, it responds with a health check failure (503 Service Unavailable),
and success (200 OK) for non-zero values.
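For illustration, the probe performed by the cloud provider health checker is roughly equivalent to the following (the node IP and the allocated healthcheck nodePort are placeholders):

```shell
NODE_IP=10.240.0.5
HC_PORT=31313
curl -s -o /dev/null -w "%{http_code}\n" "http://${NODE_IP}:${HC_PORT}/healthz"
# 200 -> the node has local endpoints for the service; 503 -> no local endpoints
```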
Healthchecks are programmable with a minimum period of 1 second on most cloud provider LBs, and the minimum
number of failures needed to trigger a node health state change can be configured from 2 through 5.
This will allow much faster transition times on the order of 1-5 seconds, and involve no
API calls to the cloud provider (and hence reduce the impact of API unreliability), keeping the
time window where traffic might get directed to nodes with no local endpoints to a minimum.
## Choice of traffic steering approaches by individual Cloud Provider implementations
The cloud provider package may choose either of these approaches. kube-proxy will provide these
healthcheck responder capabilities, regardless of the cloud provider configured on a cluster.
## API Changes
### Local Endpoint Recognition Support
To allow kube-proxy to recognize if an endpoint is local requires that the EndpointAddress struct
should also contain the NodeName it resides on. This new string field will be read-only and
populated *only* by the Endpoints Controller.
### Service Annotation to opt-in for new behaviour
A new annotation `service.alpha.kubernetes.io/external-traffic` will be recognized
by the service controller only for services of Type LoadBalancer. Services that wish to opt-in to
the new LoadBalancer behaviour must annotate the Service to request the new ESIPP behavior.
Supported values for this annotation are OnlyLocal and Global.
- OnlyLocal activates the new logic (described in this proposal) and balances locally within a node.
- Global activates the old logic of balancing traffic across the entire cluster.
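As a sketch (the service name is a placeholder), opting an existing LoadBalancer service in to the new behaviour and then finding the allocated healthcheck nodePort might look like:

```shell
kubectl annotate service my-service service.alpha.kubernetes.io/external-traffic=OnlyLocal
# the service controller records the allocated port in the healthcheck-nodeport annotation
kubectl get service my-service -o yaml | grep healthcheck-nodeport
```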
### NodePort allocation for HealthChecks
An additional nodePort allocation will be necessary for services that are of type LoadBalancer and
have the new annotation specified. This additional nodePort is necessary for kube-proxy to listen for
healthcheck requests on all nodes.
This NodePort will be added as an annotation (`service.alpha.kubernetes.io/healthcheck-nodeport`) to
the Service after allocation (in the alpha release). The value of this annotation may also be
specified during the Create call and the allocator will reserve that specific nodePort.
## Behavior Changes expected
### External Traffic Blackholed on nodes with no local endpoints
When the last endpoint on a node has gone away but the LB has not yet marked the node as unhealthy
(worst-case window size = (N+1) * HCP, where N = minimum failed healthchecks and HCP = health check period),
external traffic will still be steered to the node. This traffic will be blackholed and not forwarded
to other endpoints elsewhere in the cluster.
Internal pod to pod traffic should behave as before, with equal probability across all pods.
### Traffic Balancing Changes
GCE/AWS load balancers do not provide weights for their target pools. This was not an issue with the old LB
kube-proxy rules which would correctly balance across all endpoints.
With the new functionality, the external traffic will not be equally load balanced across pods, but rather
equally balanced at the node level (because GCE/AWS and other external LB implementations do not have the ability
for specifying the weight per node, they balance equally across all target nodes, disregarding the number of
pods on each node).
We can, however, state that for NumServicePods << NumNodes or NumServicePods >> NumNodes, a fairly close-to-equal
distribution will be seen, even without weights.
Once the external load balancers provide weights, this functionality can be added to the LB programming path.
*Future Work: No support for weights is provided for the 1.4 release, but may be added at a future date*
## Cloud Provider support
This feature is added as an opt-in annotation.
Default behaviour of LoadBalancer type services will be unchanged for all Cloud providers.
The annotation will be ignored by existing cloud provider libraries until they add support.
### GCE 1.4
For the 1.4 release, this feature will be implemented for the GCE cloud provider.
#### GCE Expected Packet Source/Destination IP (Datapath)
- Node: On the node, we expect to see the real source IP of the client. Destination IP will be the Service Virtual External IP.
- Pod: For processes running inside the Pod network namespace, the source IP will be the real client source IP. The destination address will be the Pod IP.
#### GCE Expected Packet Destination IP (HealthCheck path)
kube-proxy listens for TCP health checks on the health check node port on all addresses (`:::`).
This allows responding to health checks when the destination IP is either the VM IP or the Service Virtual External IP.
In practice, tcpdump traces on GCE show source IP is 169.254.169.254 and destination address is the Service Virtual External IP.
### AWS TBD
TBD *discuss timelines and feasibility with Kubernetes sig-aws team members*
### Openstack TBD
This functionality may not be introduced in Openstack in the near term.
*Note from Openstack team member @anguslees*
Underlying vendor devices might be able to do this, but we only expose full-NAT/proxy loadbalancing through the OpenStack API (LBaaS v1/v2 and Octavia). So I'm afraid this will be unsupported on OpenStack, afaics.
### Azure TBD
*To be confirmed* For the 1.4 release, this feature will be implemented for the Azure cloud provider.
## Testing
The cases we should test are:
1. Core Functionality Tests
1.1 Source IP Preservation
Test the main intent of this change, source ip preservation - use the all-in-one network tests container
with new functionality that responds with the client IP. Verify the container is seeing the external IP
of the test client.
1.2 Health Check responses
Testcases use pods explicitly pinned to nodes and delete/add to nodes randomly. Validate that healthchecks succeed
and fail on the expected nodes as endpoints move around. Gather LB response times (time from pod declares ready to
time for Cloud LB to declare node healthy and vice versa) to endpoint changes.
2. Inter-Operability Tests
Validate that internal cluster communications are still possible from nodes without local endpoints. This change
is only for externally sourced traffic.
3. Backward Compatibility Tests
Validate that old and new functionality can simultaneously exist in a single cluster. Create services with and without
the annotation, and validate datapath correctness.
# Beta Design
The only part of the design that changes for beta is the API, which is upgraded from
annotation-based to first class fields.
## API Changes from Alpha to Beta
Annotation `service.alpha.kubernetes.io/node-local-loadbalancer` will switch to a Service object field.
# Future work
Post-1.4 feature ideas. These are not fully-fleshed designs.
# Appendix
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/external-lb-source-ip-preservation.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->
View File
@ -1,209 +1 @@
# Federated API Servers This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federated-api-servers.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federated-api-servers.md)
## Abstract
We want to divide the single monolithic API server into multiple federated
servers. Anyone should be able to write their own federated API server to expose APIs they want.
Cluster admins should be able to expose new APIs at runtime by bringing up new
federated servers.
## Motivation
* Extensibility: We want to allow community members to write their own API
servers to expose APIs they want. Cluster admins should be able to use these
servers without having to require any change in the core kubernetes
repository.
* Unblock new APIs from core kubernetes team review: A lot of new API proposals
are currently blocked on review from the core kubernetes team. By allowing
developers to expose their APIs as a separate server and enabling the cluster
admin to use it without any change to the core kubernetes repository, we
unblock these APIs.
* Place for staging experimental APIs: New APIs can remain in separate
federated servers until they become stable, at which point, they can be moved
to the core kubernetes master, if appropriate.
* Ensure that new APIs follow kubernetes conventions: Without the mechanism
proposed here, community members might be forced to roll their own thing which
may or may not follow kubernetes conventions.
## Goal
* Developers should be able to write their own API server and cluster admins
should be able to add them to their cluster, exposing new APIs at runtime. All
of this should not require any change to the core kubernetes API server.
* These new APIs should be seamless extension of the core kubernetes APIs (ex:
they should be operated upon via kubectl).
## Non Goals
The following are related but are not the goals of this specific proposal:
* Make it easy to write a kubernetes API server.
## High Level Architecture
There will be 2 new components in the cluster:
* A simple program to summarize discovery information from all the servers.
* A reverse proxy to proxy client requests to individual servers.
The reverse proxy is optional. Clients can discover server URLs using the
summarized discovery information and contact them directly. Simple clients, can
always use the proxy.
The same program can provide both discovery summarization and reverse proxy.
### Constraints
* Unique API groups across servers: Each API server (and groups of servers, in HA)
should expose unique API groups.
* Follow API conventions: APIs exposed by every API server should adhere to [kubernetes API
conventions](../devel/api-conventions.md).
* Support discovery API: Each API server should support the kubernetes discovery API
(list the supported groupVersions at `/apis` and list the supported resources
at `/apis/<groupVersion>/`)
* No bootstrap problem: The core kubernetes server should not depend on any
other federated server to come up. Other servers can only depend on the core
kubernetes server.
## Implementation Details
### Summarizing discovery information
We can have a very simple Go program to summarize discovery information from all
servers. Cluster admins will register each federated API server (its baseURL and swagger
spec path) with the proxy. The proxy will summarize the list of all group versions
exposed by all registered API servers with their individual URLs at `/apis`.
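For illustration, a dynamic client could consume the summarized discovery information roughly as follows (the proxy address and group version are placeholders; `jq` is used only for readability):

```shell
PROXY=https://federation-proxy.example.com
curl -s "$PROXY/apis" | jq '.groups[].name'   # all group names, across all registered servers
curl -s "$PROXY/apis/mygroup/v1"              # resources served for one groupVersion
```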
### Reverse proxy
We can use any standard reverse proxy server like nginx or extend the same Go program that
summarizes discovery information to act as reverse proxy for all federated servers.
Cluster admins are also free to use any of the multiple open source API management tools
(for example, there is [Kong](https://getkong.org/), which is written in lua and there is
[Tyk](https://tyk.io/), which is written in Go). These API management tools
provide a lot more functionality like: rate-limiting, caching, logging,
transformations and authentication.
In future, we can also use ingress. That will give cluster admins the flexibility to
easily swap out the ingress controller by a Go reverse proxy, nginx, haproxy
or any other solution they might want.
### Storage
Each API server is responsible for storing their resources. They can have their
own etcd or can use kubernetes server's etcd using [third party
resources](../design/extending-api.md#adding-custom-resources-to-the-kubernetes-api-server).
### Health check
Kubernetes server's `/api/v1/componentstatuses` will continue to report status
of master components that it depends on (scheduler and various controllers).
Since clients have access to server URLs, they can use that to do
health check of individual servers.
In future, if a global health check is required, we can expose a health check
endpoint in the proxy that will report the status of all federated api servers
in the cluster.
### Auth
Since the actual server which serves client's request can be opaque to the client,
all API servers need to have homogeneous authentication and authorisation mechanisms.
All API servers will handle authn and authz for their resources themselves.
In future, we can also have the proxy do the auth and then have apiservers trust
it (via client certs) to report the actual user in an X-something header.
For now, we will trust system admins to configure homogeneous auth on all servers.
Future proposals will refine how auth is managed across the cluster.
### kubectl
kubectl will talk to the discovery endpoint (or proxy) and use the discovery API to
figure out the operations and resources supported in the cluster.
Today, it uses RESTMapper to determine that. We will update kubectl code to populate
RESTMapper using the discovery API so that we can add and remove resources
at runtime.
We will also need to make kubectl truly generic. Right now, a lot of operations
(like get, describe) are hardcoded in the binary for all resources. A future
proposal will provide details on moving those operations to server.
Note that it is possible for kubectl to talk to individual servers directly in
which case proxy will not be required at all, but this requires a bit more logic
in kubectl. We can do this in future, if desired.
### Handling global policies
Now that we have resources spread across multiple API servers, we need to
be careful to ensure that global policies (limit ranges, resource quotas, etc) are enforced.
Future proposals will improve how this is done across the cluster.
#### Namespaces
When a namespaced resource is created in any of the federated server, that
server first needs to check with the kubernetes server that:
* The namespace exists.
* User has authorization to create resources in that namespace.
* Resource quota for the namespace is not exceeded.
To prevent race conditions, the kubernetes server might need to expose an atomic
API for all these operations.
While deleting a namespace, kubernetes server needs to ensure that resources in
that namespace maintained by other servers are deleted as well. We can do this
using resource [finalizers](../design/namespaces.md#finalizers). Each server
will add themselves in the set of finalizers before they create a resource in
the corresponding namespace and delete all their resources in that namespace,
whenever it is to be deleted (kubernetes API server already has this code, we
will refactor it into a library to enable reuse).
Future proposal will talk about this in more detail and provide a better
mechanism.
#### Limit ranges and resource quotas
kubernetes server maintains [resource quotas](../admin/resourcequota/README.md) and
[limit ranges](../admin/limitrange/README.md) for all resources.
Federated servers will need to check with the kubernetes server before creating any
resource.
## Running on hosted kubernetes cluster
This proposal is not enough for hosted cluster users, but allows us to improve
that in the future.
On a hosted kubernetes cluster, for e.g. on GKE - where Google manages the kubernetes
API server, users will have to bring up and maintain the proxy and federated servers
themselves.
Other system components like the various controllers, will not be aware of the
proxy and will only talk to the kubernetes API server.
One possible solution to fix this is to update kubernetes API server to detect when
there are federated servers in the cluster and then change its advertise address to
the IP address of the proxy.
Future proposal will talk about this in more detail.
## Alternatives
There were other alternatives that we had discussed.
* Instead of adding a proxy in front, let the core kubernetes server provide an
API for other servers to register themselves. It can also provide a discovery
API which the clients can use to discover other servers and then talk to them
directly. But this would have required another server API and a lot of client logic as well.
* Validating federated servers: We can validate new servers when they are registered
with the proxy, or keep validating them at regular intervals, or validate
them only when explicitly requested, or not validate at all.
We decided that the proxy will just assume that all the servers are valid
(conform to our api conventions). In future, we can provide conformance tests.
## Future Work
* Validate servers: We should have some conformance tests that validate that the
servers follow kubernetes api-conventions.
* Provide centralised auth service: It is very hard to ensure homogeneous auth
across multiple federated servers, especially in case of hosted clusters
(where different people control the different servers). We can fix it by
providing a centralised authentication and authorization service which all of
the servers can use.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/federated-api-servers.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->
View File
@ -1,223 +1 @@
<!-- BEGIN MUNGE: UNVERSIONED_WARNING --> This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federated-ingress.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federated-ingress.md)
<!-- BEGIN STRIP_FOR_RELEASE -->
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
width="25" height="25">
<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>
If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.
Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--
<!-- END STRIP_FOR_RELEASE -->
<!-- END MUNGE: UNVERSIONED_WARNING -->
# Kubernetes Federated Ingress
Requirements and High Level Design
Quinton Hoole
July 17, 2016
## Overview/Summary
[Kubernetes Ingress](https://github.com/kubernetes/kubernetes.github.io/blob/master/docs/user-guide/ingress.md)
provides an abstraction for sophisticated L7 load balancing through a
single IP address (and DNS name) across multiple pods in a single
Kubernetes cluster. Multiple alternative underlying implementations
are provided, including one based on GCE L7 load balancing and another
using an in-cluster nginx/HAProxy deployment (for non-GCE
environments). An AWS implementation, based on Elastic Load Balancers
and Route53 is under way by the community.
To extend the above to cover multiple clusters, Kubernetes Federated
Ingress aims to provide a similar/identical API abstraction and,
again, multiple implementations to cover various
cloud-provider-specific as well as multi-cloud scenarios. The general
model is to allow the user to instantiate a single Ingress object via
the Federation API, and have it automatically provision all of the
necessary underlying resources (L7 cloud load balancers, in-cluster
proxies etc) to provide L7 load balancing across a service spanning
multiple clusters.
Four options are outlined:
1. GCP only
1. AWS only
1. Cross-cloud via GCP in-cluster proxies (i.e. clients get to AWS and on-prem via GCP).
1. Cross-cloud via AWS in-cluster proxies (i.e. clients get to GCP and on-prem via AWS).
Option 1 is the:
1. easiest/quickest,
1. most featureful
Recommendations:
+ Suggest tackling option 1 (GCP only) first (target beta in v1.4)
+ Thereafter option 3 (cross-cloud via GCP)
+ We should encourage/facilitate the community to tackle option 2 (AWS-only)
## Options
## Google Cloud Platform only - backed by GCE L7 Load Balancers
This is an option for federations across clusters which all run on Google Cloud Platform (i.e. GCE and/or GKE)
### Features
In summary, all of [GCE L7 Load Balancer](https://cloud.google.com/compute/docs/load-balancing/http/) features:
1. Single global virtual (a.k.a. "anycast") IP address ("VIP" - no dependence on dynamic DNS)
1. Geo-locality for both external and GCP-internal clients
1. Load-based overflow to next-closest geo-locality (i.e. cluster). Based on either queries per second, or CPU load (unfortunately on the first-hop target VM, not the final destination K8s Service).
1. URL-based request direction (different backend services can fulfill each different URL).
1. HTTPS request termination (at the GCE load balancer, with server SSL certs)
### Implementation
1. Federation user creates (federated) Ingress object (the services backing the ingress object must share the same nodePort, as they share a single GCP health check); a user-facing sketch of this step follows this list.
1. Federated Ingress Controller creates Ingress object in each cluster
in the federation (after [configuring each cluster ingress
controller to share the same ingress UID](https://gist.github.com/bprashanth/52648b2a0b6a5b637f843e7efb2abc97)).
1. Each cluster-level Ingress Controller ("GLBC") creates Google L7
Load Balancer machinery (forwarding rules, target proxy, URL map,
backend service, health check) which ensures that traffic to the
Ingress (backed by a Service), is directed to the nodes in the cluster.
1. KubeProxy redirects to one of the backend Pods (currently round-robin, per KubeProxy instance)
An alternative implementation approach involves lifting the current
Federated Ingress Controller functionality up into the Federation
control plane. This alternative is not considered in any further
detail in this document.
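As a rough illustration of step 2 above (not the actual controller code), the sketch below shows the propagation loop: for every federated cluster, create the Ingress if it is missing. The types and cluster names are simplified stand-ins; a real controller would use the generated Kubernetes clients and also reconcile updates and deletions.

```go
package main

import "fmt"

// Simplified stand-ins for the real API types and per-cluster clients.
type Ingress struct {
	Name string
	Host string // placeholder for the full IngressSpec
}

type cluster struct {
	name      string
	ingresses map[string]Ingress
}

// reconcile ensures every federated cluster carries a copy of the Ingress
// created through the Federation API.
func reconcile(fed Ingress, clusters []*cluster) {
	for _, c := range clusters {
		if _, ok := c.ingresses[fed.Name]; ok {
			continue // already present; a real controller would also diff and update
		}
		c.ingresses[fed.Name] = fed
		fmt.Printf("created ingress %q in cluster %s\n", fed.Name, c.name)
	}
}

func main() {
	clusters := []*cluster{
		{name: "gce-us-central1", ingresses: map[string]Ingress{}},
		{name: "gce-europe-west1", ingresses: map[string]Ingress{}},
	}
	reconcile(Ingress{Name: "web", Host: "web.example.com"}, clusters)
}
```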
### Outstanding Work Items
1. This should in theory all work out of the box. Need to confirm
with a manual setup. ([#29341](https://github.com/kubernetes/kubernetes/issues/29341))
1. Implement Federated Ingress:
1. API machinery (~1 day)
1. Controller (~3 weeks)
1. Add DNS field to Ingress object (currently missing, but needs to be added, independent of federation)
1. API machinery (~1 day)
1. KubeDNS support (~ 1 week?)
### Pros
1. Global VIP is awesome - geo-locality, load-based overflow (but see caveats below)
1. Leverages existing K8s Ingress machinery - not too much to add.
1. Leverages existing Federated Service machinery - controller looks
almost identical, DNS provider also re-used.
### Cons
1. Only works across GCP clusters (but see below for a light at the end of the tunnel, for future versions).
## Amazon Web Services only - backed by Route53
This is an option for AWS-only federations. Parts of this are
apparently work in progress; see e.g.
[AWS Ingress controller](https://github.com/kubernetes/contrib/issues/346) and
[[WIP/RFC] Simple ingress -> DNS controller, using AWS
Route53](https://github.com/kubernetes/contrib/pull/841).
### Features
In summary, most of the features of [AWS Elastic Load Balancing](https://aws.amazon.com/elasticloadbalancing/) and [Route53 DNS](https://aws.amazon.com/route53/).
1. Geo-aware DNS direction to closest regional elastic load balancer
1. DNS health checks to route traffic to only healthy elastic load
balancers
1. A variety of possible DNS routing types, including Latency Based Routing, Geo DNS, and Weighted Round Robin
1. Elastic Load Balancing automatically routes traffic across multiple
instances and multiple Availability Zones within the same region.
1. Health checks ensure that only healthy Amazon EC2 instances receive traffic.
### Implementation
1. Federation user creates (federated) Ingress object
1. Federated Ingress Controller creates Ingress object in each cluster in the federation
1. Each cluster-level AWS Ingress Controller creates/updates
1. (regional) AWS Elastic Load Balancer machinery which ensures that traffic to the Ingress (backed by a Service), is directed to one of the nodes in one of the clusters in the region.
1. (global) AWS Route53 DNS machinery which ensures that clients are directed to the closest non-overloaded (regional) elastic load balancer.
1. KubeProxy redirects to one of the backend Pods (currently round-robin, per KubeProxy instance) in the destination K8s cluster.
### Outstanding Work Items
Most of this is currently unimplemented (see [AWS Ingress controller](https://github.com/kubernetes/contrib/issues/346) and
[[WIP/RFC] Simple ingress -> DNS controller, using AWS
Route53](https://github.com/kubernetes/contrib/pull/841)).
1. K8s AWS Ingress Controller
1. Re-uses all of the non-GCE specific Federation machinery discussed above under "GCP-only...".
### Pros
1. Geo-locality (via geo-DNS, not VIP)
1. Load-based overflow
1. Real load balancing (same caveats as for GCP above).
1. L7 SSL connection termination.
1. It seems it can be made to work for hybrid deployments with on-premise clusters (using VPC). More research is required.
### Cons
1. K8s Ingress Controller still needs to be developed. Lots of work.
1. geo-DNS based locality/failover is not as nice as VIP-based (but very useful, nonetheless)
1. Only works on AWS (initial version, at least).
## Cross-cloud via GCP
### Summary
Use GCP Federated Ingress machinery described above, augmented with additional HA-proxy backends in all GCP clusters to proxy to non-GCP clusters (via either Service External IP's, or VPN directly to KubeProxy or Pods).
### Features
As per GCP-only above, except that geo-locality would be to the closest GCP cluster (and possibly onwards to the closest AWS/on-prem cluster).
### Implementation
TBD - see the Summary above in the meantime.
### Outstanding Work
Assuming that GCP-only (see above) is complete:
1. Wire up the HA-proxy load balancers to redirect to non-GCP clusters.
1. Probably more - additional detailed research and design is necessary.
### Pros
1. Works for cross-cloud.
### Cons
1. Traffic to non-GCP clusters is proxied through GCP clusters, incurring additional bandwidth costs (3x?) in those cases.
## Cross-cloud via AWS
In theory the same approach as "Cross-cloud via GCP" above could be used, except that AWS infrastructure would be used to get traffic to an AWS cluster first, and then proxy it onwards to non-AWS and/or on-prem clusters.
Detailed docs TBD.

View File

@ -1,201 +1 @@
# Kubernetes Multi-AZ Clusters This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federation-lite.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federation-lite.md)
## (previously nicknamed "Ubernetes-Lite")
## Introduction
Full Cluster Federation will offer sophisticated federation between multiple Kubernetes
clusters, offering true high availability, multiple provider support &
cloud-bursting, multiple region support etc. However, many users have
expressed a desire for a "reasonably" highly available cluster that runs in
multiple zones on GCE or availability zones in AWS, and can tolerate the failure
of a single zone without the complexity of running multiple clusters.
Multi-AZ Clusters aim to deliver exactly that functionality: to run a single
Kubernetes cluster in multiple zones. It will attempt to make reasonable
scheduling decisions, in particular so that a replication controller's pods are
spread across zones, and it will try to be aware of constraints - for example
that a volume cannot be mounted on a node in a different zone.
Multi-AZ Clusters are deliberately limited in scope; for many advanced functions
the answer will be "use full Cluster Federation". For example, multiple-region
support is not in scope. Routing affinity (e.g. so that a webserver will
prefer to talk to a backend service in the same zone) is similarly not in
scope.
## Design
These are the main requirements:
1. kube-up must allow bringing up a cluster that spans multiple zones.
1. pods in a replication controller should attempt to spread across zones.
1. pods which require volumes should not be scheduled onto nodes in a different zone.
1. load-balanced services should work reasonably
### kube-up support
kube-up support for multiple zones will initially be considered
advanced/experimental functionality, so the interface is not initially going to
be particularly user-friendly. As we design the evolution of kube-up, we will
make multiple zones better supported.
For the initial implementation, kube-up must be run multiple times, once for
each zone. The first kube-up will take place as normal, but then for each
additional zone the user must run kube-up again, specifying
`KUBE_USE_EXISTING_MASTER=true` and `KUBE_SUBNET_CIDR=172.20.x.0/24`. This will then
create additional nodes in a different zone, but will register them with the
existing master.
### Zone spreading
This will be implemented by modifying the existing scheduler priority function
`SelectorSpread`. Currently this priority function aims to put pods in an RC
on different hosts, but it will be extended first to spread across zones, and
then to spread across hosts.
So that the scheduler does not need to call out to the cloud provider on every
scheduling decision, we must somehow record the zone information for each node.
The implementation of this will be described in the implementation section.
Note that zone spreading is 'best effort'; zones are just one of the factors
in making scheduling decisions, and thus it is not guaranteed that pods will
spread evenly across zones. However, this is likely desirable: if a zone is
overloaded or failing, we still want to schedule the requested number of pods.
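As a rough illustration of the intended behaviour (not the actual `SelectorSpread` code), the toy priority below scores nodes higher when their zone currently runs fewer of the controller's pods, using the reserved zone label introduced later in this document; spreading across hosts within a zone is omitted.

```go
package main

import "fmt"

// Label key reserved by this proposal for zone information on nodes.
const zoneLabel = "failure-domain.alpha.kubernetes.io/zone"

// zoneSpreadScore is a toy version of the extended spreading priority: nodes
// in zones that already run many of the controller's pods score lower, so new
// pods drift toward under-represented zones.
func zoneSpreadScore(nodeLabels map[string]string, podsPerZone map[string]int, maxScore int) int {
	zone, ok := nodeLabels[zoneLabel]
	if !ok {
		return 0 // unlabelled node: no zone preference either way
	}
	maxPods := 0
	for _, n := range podsPerZone {
		if n > maxPods {
			maxPods = n
		}
	}
	if maxPods == 0 {
		return maxScore
	}
	// Fewer pods already in this zone => higher score.
	return maxScore * (maxPods - podsPerZone[zone]) / maxPods
}

func main() {
	podsPerZone := map[string]int{"us-central1-a": 4, "us-central1-b": 1}
	for _, zone := range []string{"us-central1-a", "us-central1-b"} {
		node := map[string]string{zoneLabel: zone}
		fmt.Printf("%s -> score %d\n", zone, zoneSpreadScore(node, podsPerZone, 10))
	}
}
```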
### Volume affinity
Most cloud providers (at least GCE and AWS) cannot attach their persistent
volumes across zones. Thus when a pod is being scheduled, if there is a volume
attached, that will dictate the zone. This will be implemented using a new
scheduler predicate (a hard constraint): `VolumeZonePredicate`.
When `VolumeZonePredicate` observes a pod scheduling request that includes a
volume, if that volume is zone-specific, `VolumeZonePredicate` will exclude any
nodes not in that zone.
Again, to avoid the scheduler calling out to the cloud provider, this will rely
on information attached to the volumes. This means that this will only support
PersistentVolumeClaims, because direct mounts do not have a place to attach
zone information. PersistentVolumes will then include zone information where
volumes are zone-specific.
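A minimal sketch of the predicate's core check follows, using plain label maps as stand-ins for the real Node and PersistentVolume objects, and the reserved failure-domain labels described later in this document.

```go
package main

import "fmt"

const (
	zoneLabel   = "failure-domain.alpha.kubernetes.io/zone"
	regionLabel = "failure-domain.alpha.kubernetes.io/region"
)

// volumeZonePredicate is a toy version of the proposed hard constraint: a pod
// that references a zone-labelled PersistentVolume may only land on nodes in
// the same failure domain.
func volumeZonePredicate(nodeLabels, pvLabels map[string]string) bool {
	for _, key := range []string{zoneLabel, regionLabel} {
		want, ok := pvLabels[key]
		if !ok {
			continue // volume is not zone-specific for this key
		}
		if nodeLabels[key] != want {
			return false // node is in a different failure domain; exclude it
		}
	}
	return true
}

func main() {
	pv := map[string]string{zoneLabel: "us-central1-a"}
	fmt.Println(volumeZonePredicate(map[string]string{zoneLabel: "us-central1-a"}, pv)) // true
	fmt.Println(volumeZonePredicate(map[string]string{zoneLabel: "us-central1-b"}, pv)) // false
}
```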
### Load-balanced services should operate reasonably
For both AWS & GCE, Kubernetes creates a native cloud load-balancer for each
service of type LoadBalancer. The native cloud load-balancers on both AWS &
GCE are region-level, and support load-balancing across instances in multiple
zones (in the same region). For both clouds, the behaviour of the native cloud
load-balancer is reasonable in the face of failures (indeed, this is why clouds
provide load-balancing as a primitive).
For multi-AZ clusters we will therefore simply rely on the native cloud provider
load balancer behaviour, and we do not anticipate substantial code changes.
One notable shortcoming here is that load-balanced traffic still goes through
kube-proxy controlled routing, and kube-proxy does not (currently) favor
targeting a pod running on the same instance or even the same zone. This will
likely produce a lot of unnecessary cross-zone traffic (which is likely slower
and more expensive). This might be sufficiently low-hanging fruit that we
choose to address it in kube-proxy / multi-AZ clusters, but this can be addressed
after the initial implementation.
## Implementation
The main implementation points are:
1. how to attach zone information to Nodes and PersistentVolumes
1. how nodes get zone information
1. how volumes get zone information
### Attaching zone information
We must attach zone information to Nodes and PersistentVolumes, and possibly to
other resources in future. There are two obvious alternatives: we can use
labels/annotations, or we can extend the schema to include the information.
For the initial implementation, we propose to use labels. The reasoning is:
1. It is considerably easier to implement.
1. We will reserve the two labels `failure-domain.alpha.kubernetes.io/zone` and
`failure-domain.alpha.kubernetes.io/region` for the two pieces of information
we need. By putting this under the `kubernetes.io` namespace there is no risk
of collision, and by putting it under `alpha.kubernetes.io` we clearly mark
this as an experimental feature.
1. We do not yet know whether these labels will be sufficient for all
environments, nor which entities will require zone information. Labels give us
more flexibility here.
1. Because the labels are reserved, we can move to schema-defined fields in
future using our cross-version mapping techniques.
### Node labeling
We do not want to require an administrator to manually label nodes. We instead
modify the kubelet to include the appropriate labels when it registers itself.
The information is easily obtained by the kubelet from the cloud provider.
### Volume labeling
As with nodes, we do not want to require an administrator to manually label
volumes. We will create an admission controller `PersistentVolumeLabel`.
`PersistentVolumeLabel` will intercept requests to create PersistentVolumes,
and will label them appropriately by calling in to the cloud provider.
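The sketch below illustrates the shape of that admission step; the `zoneLookup` callback is a made-up stand-in for the cloud provider query, and the real admission controller would of course operate on the actual PersistentVolume API object.

```go
package main

import "fmt"

const (
	zoneLabel   = "failure-domain.alpha.kubernetes.io/zone"
	regionLabel = "failure-domain.alpha.kubernetes.io/region"
)

// PersistentVolume and zoneLookup are simplified stand-ins for the API object
// and the cloud provider query used by the real admission controller.
type PersistentVolume struct {
	Name     string
	VolumeID string
	Labels   map[string]string
}

type zoneLookup func(volumeID string) (region, zone string, err error)

// admitPersistentVolume mirrors the proposed PersistentVolumeLabel admission
// controller: on create, ask the cloud provider where the volume lives and
// attach the reserved failure-domain labels.
func admitPersistentVolume(pv *PersistentVolume, lookup zoneLookup) error {
	region, zone, err := lookup(pv.VolumeID)
	if err != nil {
		return fmt.Errorf("labelling %s: %v", pv.Name, err)
	}
	if pv.Labels == nil {
		pv.Labels = map[string]string{}
	}
	pv.Labels[regionLabel] = region
	pv.Labels[zoneLabel] = zone
	return nil
}

func main() {
	lookup := func(string) (string, string, error) { return "us-central1", "us-central1-a", nil }
	pv := &PersistentVolume{Name: "pv-1", VolumeID: "gce-pd-123"}
	if err := admitPersistentVolume(pv, lookup); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(pv.Labels)
}
```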
## AWS Specific Considerations
The AWS implementation here is fairly straightforward. The AWS API is
region-wide, meaning that a single call will find instances and volumes in all
zones. In addition, instance ids and volume ids are unique per-region (and
hence also per-zone). I believe they are actually globally unique, but I do
not know if this is guaranteed; in any case we only need global uniqueness if
we are to span regions, which will not be supported by multi-AZ clusters (to do
that correctly requires a full Cluster Federation type approach).
## GCE Specific Considerations
The GCE implementation is more complicated than the AWS implementation because
GCE APIs are zone-scoped. To perform an operation, we must perform one REST
call per zone and combine the results, unless we can determine in advance that
an operation references a particular zone. For many operations, we can make
that determination, but in some cases - such as listing all instances - we must
combine results from calls in all relevant zones.
A further complexity is that GCE volume names are scoped per-zone, not
per-region. Thus it is permitted to have two volumes both named `myvolume` in
two different GCE zones. (Instance names are currently unique per-region, and
thus are not a problem for multi-AZ clusters).
The volume scoping leads to a (small) behavioural change for multi-AZ clusters on
GCE. If you had two volumes both named `myvolume` in two different GCE zones,
this would not be ambiguous when Kubernetes is operating only in a single zone.
But, when operating a cluster across multiple zones, `myvolume` is no longer
sufficient to specify a volume uniquely. Worse, the fact that a volume happens
to be unambiguous at a particular time is no guarantee that it will continue to
be unambiguous in future, because a volume with the same name could
subsequently be created in a second zone. While perhaps unlikely in practice,
we cannot automatically enable multi-AZ clusters for GCE users if this then causes
volume mounts to stop working.
This suggests that (at least on GCE), multi-AZ clusters must be optional (i.e.
there must be a feature-flag). It may be that we can make this feature
semi-automatic in future, by detecting whether nodes are running in multiple
zones, but it seems likely that kube-up could instead simply set this flag.
For the initial implementation, creating volumes with identical names will
yield undefined results. Later, we may add some way to specify the zone for a
volume (and possibly require that volumes have their zone specified when
running in multi-AZ cluster mode). We could add a new `zone` field to the
PersistentVolume type for GCE PD volumes, or we could use a DNS-style dotted
name for the volume name (`<name>.<zone>`).
Initially therefore, the GCE changes will be to:
1. change kube-up to support creation of a cluster in multiple zones
1. pass a flag enabling multi-AZ clusters with kube-up
1. change the kubernetes cloud provider to iterate through relevant zones when resolving items
1. tag GCE PD volumes with the appropriate zone information

View File

@ -1,648 +1 @@
# Kubernetes Cluster Federation This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federation.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federation.md)
## (previously nicknamed "Ubernetes")
## Requirements Analysis and Product Proposal
## _by Quinton Hoole ([quinton@google.com](mailto:quinton@google.com))_
_Initial revision: 2015-03-05_
_Last updated: 2015-08-20_
This doc: [tinyurl.com/ubernetesv2](http://tinyurl.com/ubernetesv2)
Original slides: [tinyurl.com/ubernetes-slides](http://tinyurl.com/ubernetes-slides)
Updated slides: [tinyurl.com/ubernetes-whereto](http://tinyurl.com/ubernetes-whereto)
## Introduction
Today, each Kubernetes cluster is a relatively self-contained unit,
which typically runs in a single "on-premise" data centre or single
availability zone of a cloud provider (Google's GCE, Amazon's AWS,
etc).
Several current and potential Kubernetes users and customers have
expressed a keen interest in tying together ("federating") multiple
clusters in some sensible way in order to enable the following kinds
of use cases (intentionally vague):
1. _"Preferentially run my workloads in my on-premise cluster(s), but
automatically overflow to my cloud-hosted cluster(s) if I run out
of on-premise capacity"_.
1. _"Most of my workloads should run in my preferred cloud-hosted
cluster(s), but some are privacy-sensitive, and should be
automatically diverted to run in my secure, on-premise
cluster(s)"_.
1. _"I want to avoid vendor lock-in, so I want my workloads to run
across multiple cloud providers all the time. I change my set of
such cloud providers, and my pricing contracts with them,
periodically"_.
1. _"I want to be immune to any single data centre or cloud
availability zone outage, so I want to spread my service across
multiple such zones (and ideally even across multiple cloud
providers)."_
The above use cases are by necessity left imprecisely defined. The
rest of this document explores these use cases and their implications
in further detail, and compares a few alternative high level
approaches to addressing them. The idea of cluster federation has
informally become known as _"Ubernetes"_.
## Summary/TL;DR
Four primary customer-driven use cases are explored in more detail.
The two highest priority ones relate to High Availability and
Application Portability (between cloud providers, and between
on-premise and cloud providers).
Four primary federation primitives are identified (location affinity,
cross-cluster scheduling, service discovery and application
migration). Fortunately not all four of these primitives are required
for each primary use case, so incremental development is feasible.
## What exactly is a Kubernetes Cluster?
A central design concept in Kubernetes is that of a _cluster_. While
loosely speaking, a cluster can be thought of as running in a single
data center, or cloud provider availability zone, a more precise
definition is that each cluster provides:
1. a single Kubernetes API entry point,
1. a consistent, cluster-wide resource naming scheme
1. a scheduling/container placement domain
1. a service network routing domain
1. an authentication and authorization model.
The above in turn imply the need for a relatively performant, reliable
and cheap network within each cluster.
There is also assumed to be some degree of failure correlation across
a cluster, i.e. whole clusters are expected to fail, at least
occasionally (due to cluster-wide power and network failures, natural
disasters etc). Clusters are often relatively homogeneous in that all
compute nodes are typically provided by a single cloud provider or
hardware vendor, and connected by a common, unified network fabric.
But these are not hard requirements of Kubernetes.
Other classes of Kubernetes deployments than the one sketched above
are technically feasible, but come with some challenges of their own,
and are not yet common or explicitly supported.
More specifically, having a Kubernetes cluster span multiple
well-connected availability zones within a single geographical region
(e.g. US North East, UK, Japan etc) is worthy of further
consideration, in particular because it potentially addresses
some of these requirements.
## What use cases require Cluster Federation?
Let's name a few concrete use cases to aid the discussion:
## 1. Capacity Overflow
_"I want to preferentially run my workloads in my on-premise cluster(s), but automatically "overflow" to my cloud-hosted cluster(s) when I run out of on-premise capacity."_
This idea is known in some circles as "[cloudbursting](http://searchcloudcomputing.techtarget.com/definition/cloud-bursting)".
**Clarifying questions:** What is the unit of overflow? Individual
pods? Probably not always. Replication controllers and their
associated sets of pods? Groups of replication controllers
(a.k.a. distributed applications)? How are persistent disks
overflowed? Can the "overflowed" pods communicate with their
brethren and sistren pods and services in the other cluster(s)?
Presumably yes, at higher cost and latency, provided that they use
external service discovery. Is "overflow" enabled only when creating
new workloads/replication controllers, or are existing workloads
dynamically migrated between clusters based on fluctuating available
capacity? If so, what is the desired behaviour, and how is it
achieved? How, if at all, does this relate to quota enforcement
(e.g. if we run out of on-premise capacity, can all or only some
quotas transfer to other, potentially more expensive off-premise
capacity?)
It seems that most of this boils down to:
1. **location affinity** (pods relative to each other, and to other
stateful services like persistent storage - how is this expressed
and enforced?)
1. **cross-cluster scheduling** (given location affinity constraints
and other scheduling policy, which resources are assigned to which
clusters, and by what?)
1. **cross-cluster service discovery** (how do pods in one cluster
discover and communicate with pods in another cluster?)
1. **cross-cluster migration** (how do compute and storage resources,
and the distributed applications to which they belong, move from
one cluster to another)
1. **cross-cluster load-balancing** (how is user traffic directed
to an appropriate cluster?)
1. **cross-cluster monitoring and auditing** (a.k.a. Unified Visibility)
## 2. Sensitive Workloads
_"I want most of my workloads to run in my preferred cloud-hosted
cluster(s), but some are privacy-sensitive, and should be
automatically diverted to run in my secure, on-premise cluster(s). The
list of privacy-sensitive workloads changes over time, and they're
subject to external auditing."_
**Clarifying questions:**
1. What kinds of rules determine which
workloads go where?
1. Is there in fact a requirement to have these rules be
declaratively expressed and automatically enforced, or is it
acceptable/better to have users manually select where to run
their workloads when starting them?
1. Is a static mapping from container (or more typically,
replication controller) to cluster maintained and enforced?
1. If so, is it only enforced on startup, or are things migrated
between clusters when the mappings change?
This starts to look quite similar to "1. Capacity Overflow", and again
seems to boil down to:
1. location affinity
1. cross-cluster scheduling
1. cross-cluster service discovery
1. cross-cluster migration
1. cross-cluster monitoring and auditing
1. cross-cluster load balancing
## 3. Vendor lock-in avoidance
_"My CTO wants us to avoid vendor lock-in, so she wants our workloads
to run across multiple cloud providers at all times. She changes our
set of preferred cloud providers and pricing contracts with them
periodically, and doesn't want to have to communicate and manually
enforce these policy changes across the organization every time this
happens. She wants it centrally and automatically enforced, monitored
and audited."_
**Clarifying questions:**
1. How does this relate to other use cases (high availability,
capacity overflow etc.), given that they may all span multiple vendors?
It's probably not strictly speaking a separate
use case, but it's brought up so often as a requirement, that it's
worth calling out explicitly.
1. Is a useful intermediate step to make it as simple as possible to
migrate an application from one vendor to another in a one-off fashion?
Again, I think that this can probably be
reformulated as a Capacity Overflow problem - the fundamental
principles seem to be the same or substantially similar to those
above.
## 4. "High Availability"
_"I want to be immune to any single data centre or cloud availability
zone outage, so I want to spread my service across multiple such zones
(and ideally even across multiple cloud providers), and have my
service remain available even if one of the availability zones or
cloud providers "goes down"_.
It seems useful to split this into multiple sets of sub use cases:
1. Multiple availability zones within a single cloud provider (across
which feature sets like private networks, load balancing,
persistent disks, data snapshots etc are typically consistent and
explicitly designed to inter-operate).
1. within the same geographical region (e.g. metro) within which network
is fast and cheap enough to be almost analogous to a single data
center.
1. across multiple geographical regions, where high network cost and
poor network performance may be prohibitive.
1. Multiple cloud providers (typically with inconsistent feature sets,
more limited interoperability, and typically no cheap inter-cluster
networking described above).
The single cloud provider case might be easier to implement (although
the multi-cloud provider implementation should just work for a single
cloud provider). We propose a high-level design catering for both, with the
initial implementation targeting a single cloud provider only.
**Clarifying questions:**
**How does global external service discovery work?** In the steady
state, which external clients connect to which clusters? GeoDNS or
similar? What is the tolerable failover latency if a cluster goes
down? Maybe something like (make up some numbers, notwithstanding
some buggy DNS resolvers, TTLs, caches etc) ~3 minutes for ~90% of
clients to re-issue DNS lookups and reconnect to a new cluster when
their home cluster fails is good enough for most Kubernetes users
(or at least way better than the status quo), given that these sorts
of failure only happen a small number of times a year?
**How does dynamic load balancing across clusters work, if at all?**
One simple starting point might be "it doesn't". i.e. if a service
in a cluster is deemed to be "up", it receives as much traffic as is
generated "nearby" (even if it overloads). If the service is deemed
to "be down" in a given cluster, "all" nearby traffic is redirected
to some other cluster within some number of seconds (failover could
be automatic or manual). Failover is essentially binary. An
improvement would be to detect when a service in a cluster reaches
maximum serving capacity, and dynamically divert additional traffic
to other clusters. But how exactly does all of this work, and how
much of it is provided by Kubernetes, as opposed to something else
bolted on top (e.g. external monitoring and manipulation of GeoDNS)?
**How does this tie in with auto-scaling of services?** More
specifically, if I run my service across _n_ clusters globally, and
one (or more) of them fail, how do I ensure that the remaining _n-1_
clusters have enough capacity to serve the additional, failed-over
traffic? Either:
1. I constantly over-provision all clusters by 1/n (potentially expensive), or
1. I "manually" (or automatically) update my replica count configurations in the
remaining clusters by 1/n when the failure occurs, and Kubernetes
takes care of the rest for me, or
1. Auto-scaling in the remaining clusters takes
care of it for me automagically as the additional failed-over
traffic arrives (with some latency). Note that this implies that
the cloud provider keeps the necessary resources on hand to
accommodate such auto-scaling (e.g. via something similar to AWS reserved
and spot instances)
Up to this point, this use case ("Unavailability Zones") seems materially different from all the others above. It does not require dynamic cross-cluster service migration (we assume that the service is already running in more than one cluster when the failure occurs). Nor does it necessarily involve cross-cluster service discovery or location affinity. As a result, I propose that we address this use case somewhat independently of the others (although I strongly suspect that it will become substantially easier once we've solved the others).
All of the above (regarding "Unavailability Zones") refers primarily
to already-running user-facing services, and minimizing the impact on
end users of those services becoming unavailable in a given cluster.
What about the people and systems that deploy Kubernetes services
(devops etc)? Should they be automatically shielded from the impact
of the cluster outage? i.e. have their new resource creation requests
automatically diverted to another cluster during the outage? While
this specific requirement seems non-critical (manual fail-over seems
relatively non-arduous, ignoring the user-facing issues above), it
smells a lot like the first three use cases listed above ("Capacity
Overflow, Sensitive Services, Vendor lock-in..."), so if we address
those, we probably get this one free of charge.
## Core Challenges of Cluster Federation
As we saw above, a few common challenges fall out of most of the use
cases considered above, namely:
## Location Affinity
Can the pods comprising a single distributed application be
partitioned across more than one cluster? More generally, how far
apart, in network terms, can a given client and server within a
distributed application reasonably be? A server need not necessarily
be a pod, but could instead be a persistent disk housing data, or some
other stateful network service. What is tolerable is typically
application-dependent, primarily influenced by network bandwidth
consumption, latency requirements and cost sensitivity.
For simplicity, let's assume that all Kubernetes distributed
applications fall into one of three categories with respect to relative
location affinity:
1. **"Strictly Coupled"**: Those applications that strictly cannot be
partitioned between clusters. They simply fail if they are
partitioned. When scheduled, all pods _must_ be scheduled to the
same cluster. To move them, we need to shut the whole distributed
application down (all pods) in one cluster, possibly move some
data, and then bring up all of the pods in another cluster. To
avoid downtime, we might bring up the replacement cluster and
divert traffic there before turning down the original, but the
principle is much the same. In some cases moving the data might be
prohibitively expensive or time-consuming, in which case these
applications may be effectively _immovable_.
1. **"Strictly Decoupled"**: Those applications that can be
indefinitely partitioned across more than one cluster, to no
disadvantage. An embarrassingly parallel YouTube porn detector,
where each pod repeatedly dequeues a video URL from a remote work
queue, downloads and chews on the video for a few hours, and
arrives at a binary verdict, might be one such example. The pods
derive no benefit from being close to each other, or anything else
(other than the source of YouTube videos, which is assumed to be
equally remote from all clusters in this example). Each pod can be
scheduled independently, in any cluster, and moved at any time.
1. **"Preferentially Coupled"**: Somewhere between Coupled and
Decoupled. These applications prefer to have all of their pods
located in the same cluster (e.g. for failure correlation, network
latency or bandwidth cost reasons), but can tolerate being
partitioned for "short" periods of time (for example while
migrating the application from one cluster to another). Most small
to medium sized LAMP stacks with not-very-strict latency goals
probably fall into this category (provided that they use sane
service discovery and reconnect-on-fail, which they need to do
anyway to run effectively, even in a single Kubernetes cluster).
From a fault isolation point of view, there are also opposites of the
above. For example, a master database and its slave replica might
need to be in different availability zones. We'll refer to this as
anti-affinity, although it is largely outside the scope of this
document.
Note that there is somewhat of a continuum with respect to network
cost and quality between any two nodes, ranging from two nodes on the
same L2 network segment (lowest latency and cost, highest bandwidth)
to two nodes on different continents (highest latency and cost, lowest
bandwidth). One interesting point on that continuum relates to
multiple availability zones within a well-connected metro or region
and single cloud provider. Despite being in different data centers,
or areas within a mega data center, network in this case is often very fast
and effectively free or very cheap. For the purposes of this network location
affinity discussion, this case is considered analogous to a single
availability zone. Furthermore, if a given application doesn't fit
cleanly into one of the above, shoe-horn it into the best fit,
defaulting to the "Strictly Coupled and Immovable" bucket if you're
not sure.
And then there's what I'll call _absolute_ location affinity. Some
applications are required to run in bounded geographical or network
topology locations. The reasons for this are typically
political/legislative (data privacy laws etc), or driven by network
proximity to consumers (or data providers) of the application ("most
of our users are in Western Europe, U.S. West Coast" etc).
**Proposal:** First tackle Strictly Decoupled applications (which can
be trivially scheduled, partitioned or moved, one pod at a time).
Then tackle Preferentially Coupled applications (which must be
scheduled in totality in a single cluster, and can be moved, but
ultimately in total, and necessarily within some bounded time).
Leave strictly coupled applications to be manually moved between
clusters as required for the foreseeable future.
## Cross-cluster service discovery
I propose having pods use standard discovery methods used by external
clients of Kubernetes applications (i.e. DNS). DNS might resolve to a
public endpoint in the local or a remote cluster. Other than Strictly
Coupled applications, software should be largely oblivious of which of
the two occurs.
_Aside:_ How do we avoid "tromboning" through an external VIP when DNS
resolves to a public IP on the local cluster? Strictly speaking this
would be an optimization for some cases, and probably only matters to
high-bandwidth, low-latency communications. We could potentially
eliminate the trombone with some kube-proxy magic if necessary. More
detail to be added here, but feel free to shoot down the basic DNS
idea in the mean time. In addition, some applications rely on private
networking between clusters for security (e.g. AWS VPC or more
generally VPN). It should not be necessary to forsake this in
order to use Cluster Federation, for example by being forced to use public
connectivity between clusters.
## Cross-cluster Scheduling
This is closely related to location affinity above, and also discussed
there. The basic idea is that some controller, logically outside of
the basic Kubernetes control plane of the clusters in question, needs
to be able to:
1. Receive "global" resource creation requests.
1. Make policy-based decisions as to which cluster(s) should be used
to fulfill each given resource request. In a simple case, the
request is just redirected to one cluster. In a more complex case,
the request is "demultiplexed" into multiple sub-requests, each to
a different cluster. Knowledge of the (albeit approximate)
available capacity in each cluster will be required by the
controller to sanely split the request. Similarly, knowledge of
the properties of the application (Location Affinity class --
Strictly Coupled, Strictly Decoupled etc, privacy class etc) will
be required. It is also conceivable that knowledge of service
SLAs and monitoring thereof might provide an input into
scheduling/placement algorithms.
1. Multiplex the responses from the individual clusters into an
aggregate response.
There is of course a lot of detail still missing from this section,
including discussion of:
1. admission control
1. initial placement of instances of a new
service vs. scheduling new instances of an existing service in response
to auto-scaling
1. rescheduling pods due to failure (response might be
different depending on if it's failure of a node, rack, or whole AZ)
1. data placement relative to compute capacity,
etc.
## Cross-cluster Migration
Again this is closely related to location affinity discussed above,
and is in some sense an extension of Cross-cluster Scheduling. When
certain events occur, it becomes necessary or desirable for the
cluster federation system to proactively move distributed applications
(either in part or in whole) from one cluster to another. Examples of
such events include:
1. A low capacity event in a cluster (or a cluster failure).
1. A change of scheduling policy ("we no longer use cloud provider X").
1. A change of resource pricing ("cloud provider Y dropped their
prices - let's migrate there").
Strictly Decoupled applications can be trivially moved, in part or in
whole, one pod at a time, to one or more clusters (within applicable
policy constraints, for example "PrivateCloudOnly").
For Preferentially Coupled applications, the federation system must
first locate a single cluster with sufficient capacity to accommodate
the entire application, then reserve that capacity, and incrementally
move the application, one (or more) resources at a time, over to the
new cluster, within some bounded time period (and possibly within a
predefined "maintenance" window). Strictly Coupled applications (with
the exception of those deemed completely immovable) require the
federation system to:
1. start up an entire replica application in the destination cluster
1. copy persistent data to the new application instance (possibly
before starting pods)
1. switch user traffic across
1. tear down the original application instance
It is proposed that support for automated migration of Strictly
Coupled applications be deferred to a later date.
## Other Requirements
These are often left implicit by customers, but are worth calling out explicitly:
1. Software failure isolation between Kubernetes clusters should be
retained as far as is practically possible. The federation system
should not materially increase the failure correlation across
clusters. For this reason the federation control plane software
should ideally be completely independent of the Kubernetes cluster
control software, and look just like any other Kubernetes API
client, with no special treatment. If the federation control plane
software fails catastrophically, the underlying Kubernetes clusters
should remain independently usable.
1. Unified monitoring, alerting and auditing across federated Kubernetes clusters.
1. Unified authentication, authorization and quota management across
clusters (this is in direct conflict with failure isolation above,
so there are some tough trade-offs to be made here).
## Proposed High-Level Architectures
Two distinct potential architectural approaches have emerged from discussions
thus far:
1. An explicitly decoupled and hierarchical architecture, where the
Federation Control Plane sits logically above a set of independent
Kubernetes clusters, each of which is (potentially) unaware of the
other clusters, and of the Federation Control Plane itself (other
than to the extent that it is an API client much like any other).
One possible example of this general architecture is illustrated
below, and will be referred to as the "Decoupled, Hierarchical"
approach.
1. A more monolithic architecture, where a single instance of the
Kubernetes control plane itself manages a single logical cluster
composed of nodes in multiple availability zones and cloud
providers.
A very brief, non-exhaustive list of pro's and con's of the two
approaches follows. (In the interest of full disclosure, the author
prefers the Decoupled Hierarchical model for the reasons stated below).
1. **Failure isolation:** The Decoupled Hierarchical approach provides
better failure isolation than the Monolithic approach, as each
underlying Kubernetes cluster, and the Federation Control Plane,
can operate and fail completely independently of each other. In
particular, their software and configurations can be updated
independently. Such updates are, in our experience, the primary
cause of control-plane failures, in general.
1. **Failure probability:** The Decoupled Hierarchical model incorporates
numerically more independent pieces of software and configuration
than the Monolithic one. But the complexity of each of these
decoupled pieces is arguably better contained in the Decoupled
model (per standard arguments for modular rather than monolithic
software design). Which of the two models presents higher
aggregate complexity and consequent failure probability remains
somewhat of an open question.
1. **Scalability:** Conceptually the Decoupled Hierarchical model wins
here, as each underlying Kubernetes cluster can be scaled
completely independently w.r.t. scheduling, node state management,
monitoring, network connectivity etc. It is even potentially
feasible to stack federations of clusters (i.e. create
federations of federations) should scalability of the independent
Federation Control Plane become an issue (although the author does
not envision this being a problem worth solving in the short
term).
1. **Code complexity:** I think that an argument can be made both ways
here. It depends on whether you prefer to weave the logic for
handling nodes in multiple availability zones and cloud providers
within a single logical cluster into the existing Kubernetes
control plane code base (which was explicitly not designed for
this), or separate it into a decoupled Federation system (with
possible code sharing between the two via shared libraries). The
author prefers the latter because it:
1. Promotes better code modularity and interface design.
1. Allows the code
bases of Kubernetes and the Federation system to progress
largely independently (different sets of developers, different
release schedules etc).
1. **Administration complexity:** Again, I think that this could be argued
both ways. Superficially it would seem that administration of a
single Monolithic multi-zone cluster might be simpler by virtue of
being only "one thing to manage", however in practise each of the
underlying availability zones (and possibly cloud providers) has
its own capacity, pricing, hardware platforms, and possibly
bureaucratic boundaries (e.g. "our EMEA IT department manages those
European clusters"). So explicitly allowing for (but not
mandating) completely independent administration of each
underlying Kubernetes cluster, and the Federation system itself,
in the Decoupled Hierarchical model seems to have real practical
benefits that outweigh the superficial simplicity of the
Monolithic model.
1. **Application development and deployment complexity:** It's not clear
to me that there is any significant difference between the two
models in this regard. Presumably the API exposed by the two
different architectures would look very similar, as would the
behavior of the deployed applications. It has even been suggested
to write the code in such a way that it could be run in either
configuration. It's not clear that this makes sense in practice
though.
1. **Control plane cost overhead:** There is a minimum per-cluster
overhead -- two (possibly virtual) machines, or more for redundant HA
deployments. For deployments of very small Kubernetes
clusters with the Decoupled Hierarchical approach, this cost can
become significant.
### The Decoupled, Hierarchical Approach - Illustrated
![image](federation-high-level-arch.png)
## Cluster Federation API
It is proposed that this look a lot like the existing Kubernetes API
but be explicitly multi-cluster.
+ Clusters become first class objects, which can be registered,
listed, described, deregistered etc via the API.
+ Compute resources can be explicitly requested in specific clusters,
or automatically scheduled to the "best" cluster by the Cluster
Federation control system (by a
pluggable Policy Engine).
+ There is a federated equivalent of a replication controller type (or
perhaps a [deployment](deployment.md)),
which is multicluster-aware, and delegates to cluster-specific
replication controllers/deployments as required (e.g. a federated RC for n
replicas might simply spawn multiple replication controllers in
different clusters to do the hard work).
## Policy Engine and Migration/Replication Controllers
The Policy Engine decides which parts of each application go into each
cluster at any point in time, and stores this desired state in the
Desired Federation State store (an etcd or
similar). Migration/Replication Controllers reconcile this against the
desired states stored in the underlying Kubernetes clusters (by
watching both, and creating or updating the underlying Replication
Controllers and related Services accordingly).
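As a toy illustration of the kind of placement decision the Policy Engine would make (not a proposed algorithm), the sketch below splits a federated replica count across clusters in proportion to invented free-capacity figures; the Migration/Replication Controllers would then create or resize the per-cluster replication controllers to match.

```go
package main

import "fmt"

// splitReplicas divides a federated replication controller's replica count
// across clusters in proportion to their (approximate) free capacity. The
// capacity figures and cluster names below are invented for the example.
func splitReplicas(total int, freeCapacity map[string]int) map[string]int {
	sum := 0
	for _, c := range freeCapacity {
		sum += c
	}
	out := map[string]int{}
	if sum == 0 {
		return out
	}
	assigned := 0
	for name, c := range freeCapacity {
		n := total * c / sum
		out[name] = n
		assigned += n
	}
	// Hand any remainder from integer division to an arbitrary cluster.
	for name := range out {
		out[name] += total - assigned
		break
	}
	return out
}

func main() {
	fmt.Println(splitReplicas(10, map[string]int{"us-east": 300, "eu-west": 100}))
}
```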
## Authentication and Authorization
This should ideally be delegated to some external auth system, shared
by the underlying clusters, to avoid duplication and inconsistency.
Either that, or we end up with multilevel auth. Local readonly
eventually consistent auth slaves in each cluster and in the Cluster
Federation control system
could potentially cache auth, to mitigate an SPOF auth system.
## Data consistency, failure and availability characteristics
The services comprising the Cluster Federation control plane have to run
somewhere. Several options exist here:
* For high availability Cluster Federation deployments, these
services may run in either:
* a dedicated Kubernetes cluster, not co-located in the same
availability zone with any of the federated clusters (for fault
isolation reasons). If that cluster/availability zone, and hence the Federation
system, fails catastrophically, the underlying pods and
applications continue to run correctly, albeit temporarily
without the Federation system.
* across multiple Kubernetes availability zones, probably with
some sort of cross-AZ quorum-based store. This provides
theoretically higher availability, at the cost of some
complexity related to data consistency across multiple
availability zones.
* For simpler, less highly available deployments, just co-locate the
Federation control plane in/on/with one of the underlying
Kubernetes clusters. The downside of this approach is that if
that specific cluster fails, all automated failover and scaling
logic which relies on the federation system will also be
unavailable at the same time (i.e. precisely when it is needed).
But if one of the other federated clusters fails, everything
should work just fine.
There is some further thinking to be done around the data consistency
model upon which the Federation system is based, and its impact
on the detailed semantics, failure and availability
characteristics of the system.
## Proposed Next Steps
Identify concrete applications of each use case and configure a proof
of concept service that exercises the use case. For example, cluster
failure tolerance seems popular, so set up an apache frontend with
replicas in each of three availability zones with either an Amazon Elastic
Load Balancer or Google Cloud Load Balancer pointing at them? What
does the zookeeper config look like for N=3 across 3 AZs -- and how
does each replica find the other replicas and how do clients find
their primary zookeeper replica? And now how do I do a shared, highly
available redis database? Use a few common specific use cases like
this to flesh out the detailed API and semantics of Cluster Federation.

View File

@ -1,132 +1 @@
# Flannel integration with Kubernetes This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/flannel-integration.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/flannel-integration.md)
## Why?
* Networking works out of the box.
* Cloud gateway configuration is regulated by quota.
* Consistent bare metal and cloud experience.
* Lays foundation for integrating with networking backends and vendors.
## How?
Thus:
```
Master | Node1
----------------------------------------------------------------------
{192.168.0.0/16, 256 /24} | docker
| | | restart with podcidr
apiserver <------------------ kubelet (sends podcidr)
| | | here's podcidr, mtu
flannel-server:10253 <------------------ flannel-daemon
Allocates a /24 ------------------> [config iptables, VXLan]
<------------------ [watch subnet leases]
I just allocated ------------------> [config VXLan]
another /24 |
```
## Proposal
Explaining vxlan is out of scope for this document; however, it does take some basic understanding to grok the proposal. Assume some pod wants to communicate across nodes with the above setup. Check the flannel vxlan devices:
```console
node1 $ ip -d link show flannel.1
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UNKNOWN mode DEFAULT
link/ether a2:53:86:b5:5f:c1 brd ff:ff:ff:ff:ff:ff
vxlan
node1 $ ip -d link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT qlen 1000
link/ether 42:01:0a:f0:00:04 brd ff:ff:ff:ff:ff:ff
node2 $ ip -d link show flannel.1
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UNKNOWN mode DEFAULT
link/ether 56:71:35:66:4a:d8 brd ff:ff:ff:ff:ff:ff
vxlan
node2 $ ip -d link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT qlen 1000
link/ether 42:01:0a:f0:00:03 brd ff:ff:ff:ff:ff:ff
```
Note that we're ignoring cbr0 for the sake of simplicity. Spin up a container on each node. We're using raw docker for this example only because we want control over where the container lands:
```
node1 $ docker run -it radial/busyboxplus:curl /bin/sh
[ root@5ca3c154cde3:/ ]$ ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue
8: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue
link/ether 02:42:12:10:20:03 brd ff:ff:ff:ff:ff:ff
inet 192.168.32.3/24 scope global eth0
valid_lft forever preferred_lft forever
node2 $ docker run -it radial/busyboxplus:curl /bin/sh
[ root@d8a879a29f5d:/ ]$ ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue
16: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue
link/ether 02:42:12:10:0e:07 brd ff:ff:ff:ff:ff:ff
inet 192.168.14.7/24 scope global eth0
valid_lft forever preferred_lft forever
[ root@d8a879a29f5d:/ ]$ ping 192.168.32.3
PING 192.168.32.3 (192.168.32.3): 56 data bytes
64 bytes from 192.168.32.3: seq=0 ttl=62 time=1.190 ms
```
__What happened?__:
From 1000 feet:
* vxlan device driver starts up on node1 and creates a udp tunnel endpoint on 8472
* container 192.168.32.3 pings 192.168.14.7
- what's the MAC of 192.168.14.0?
- L2 miss, flannel looks up MAC of subnet
- Stores `192.168.14.0 <-> 56:71:35:66:4a:d8` in neighbor table
- what's tunnel endpoint of this MAC?
- L3 miss, flannel looks up destination VM ip
- Stores `10.240.0.3 <-> 56:71:35:66:4a:d8` in bridge database
* Sends `[56:71:35:66:4a:d8, 10.240.0.3][vxlan: port, vni][02:42:12:10:20:03, 192.168.14.7][icmp]`
__But will it blend?__
Kubernetes integration is fairly straightforward once we understand the pieces involved, and can be prioritized as follows:
* Kubelet understands flannel daemon in client mode, flannel server manages independent etcd store on master, node controller backs off CIDR allocation
* Flannel server consults the Kubernetes master for everything network related
* Flannel daemon works through network plugins in a generic way without bothering the kubelet: needs CNI x Kubernetes standardization
The first is accomplished in this PR, while a timeline for 2. and 3. is TBD. To implement the flannel API we can either run a proxy per node and get rid of the flannel server, or service all requests in the flannel server with something like a goroutine per node:
* `/network/config`: read network configuration and return
* `/network/leases`:
- Post: Return a lease as understood by flannel
- Look up the node by IP
- Store node metadata from the [flannel request](https://github.com/coreos/flannel/blob/master/subnet/subnet.go#L34) in annotations
- Return a [Lease object](https://github.com/coreos/flannel/blob/master/subnet/subnet.go#L40) reflecting the node CIDR
- Get: Handle a watch on leases
* `/network/leases/subnet`:
- Put: This is a request for a lease. If the nodecontroller is allocating CIDRs we can probably just no-op.
* `/network/reservations`: TBD, we can probably use this to accommodate the node controller allocating CIDRs instead of flannel requesting them
The ick-iest part of this implementation is going to be the `GET /network/leases`, i.e. the watch proxy. We can side-step this by waiting for a more generic Kubernetes resource. However, we can also implement it as follows:
* Watch all nodes, ignore heartbeats
* On each change, figure out the lease for the node, construct a [lease watch result](https://github.com/coreos/flannel/blob/0bf263826eab1707be5262703a8092c7d15e0be4/subnet/subnet.go#L72), and send it down the watch with the RV from the node
* Implement a lease list that does a similar translation
I say this is gross without an API object because for each node->lease translation one has to store and retrieve the node metadata sent by flannel (e.g. VTEP) from node annotations. [Reference implementation](https://github.com/bprashanth/kubernetes/blob/network_vxlan/pkg/kubelet/flannel_server.go) and [watch proxy](https://github.com/bprashanth/kubernetes/blob/network_vxlan/pkg/kubelet/watch_proxy.go).
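For illustration, here is a minimal sketch of that node -> lease translation, using simplified stand-in types and a made-up annotation key for the flannel backend data; real code would read the Kubernetes Node object and produce flannel's lease type.

```go
package main

import "fmt"

// vtepAnnotation is a hypothetical key under which the flannel daemon's
// backend data (e.g. the VTEP MAC) would be stored on the Node.
const vtepAnnotation = "flannel.alpha.kubernetes.io/backend-data"

// Node and Lease are simplified stand-ins for the Kubernetes Node object and
// flannel's subnet lease.
type Node struct {
	Name        string
	PodCIDR     string
	PublicIP    string
	Annotations map[string]string
}

type Lease struct {
	Subnet      string // the node's pod CIDR, e.g. 192.168.14.0/24
	PublicIP    string // VM IP used as the VXLAN tunnel endpoint
	BackendData string // opaque metadata (e.g. VTEP MAC) stored by the flannel daemon
}

// nodeToLease is the translation the watch proxy would perform for every node
// event before sending a watch result down to the flannel daemons.
func nodeToLease(n Node) Lease {
	return Lease{
		Subnet:      n.PodCIDR,
		PublicIP:    n.PublicIP,
		BackendData: n.Annotations[vtepAnnotation],
	}
}

func main() {
	n := Node{
		Name:        "node2",
		PodCIDR:     "192.168.14.0/24",
		PublicIP:    "10.240.0.3",
		Annotations: map[string]string{vtepAnnotation: `{"VtepMAC":"56:71:35:66:4a:d8"}`},
	}
	fmt.Printf("%+v\n", nodeToLease(n))
}
```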
# Limitations
* Integration is experimental
* Flannel etcd data is not stored on a persistent disk
* CIDR allocation does *not* flow from Kubernetes down to nodes anymore
# Wishlist
This proposal is really just a call for community help in writing a Kubernetes x flannel backend.
* CNI plugin integration
* Flannel daemon in privileged pod
* Flannel server talks to apiserver, described in proposal above
* HTTPs between flannel daemon/server
* Investigate flannel server running on every node (as done in the reference implementation mentioned above)
* Use flannel reservation mode to support node controller podcidr allocation

View File

@ -1,357 +1 @@
**Table of Contents** This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/garbage-collection.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/garbage-collection.md)
- [Overview](#overview)
- [Cascading deletion with Garbage Collector](#cascading-deletion-with-garbage-collector)
- [Orphaning the descendants with "orphan" finalizer](#orphaning-the-descendants-with-orphan-finalizer)
- [Part I. The finalizer framework](#part-i-the-finalizer-framework)
- [Part II. The "orphan" finalizer](#part-ii-the-orphan-finalizer)
- [Related issues](#related-issues)
- [Orphan adoption](#orphan-adoption)
- [Upgrading a cluster to support cascading deletion](#upgrading-a-cluster-to-support-cascading-deletion)
- [End-to-End Examples](#end-to-end-examples)
- [Life of a Deployment and its descendants](#life-of-a-deployment-and-its-descendants)
- [Open Questions](#open-questions)
- [Considered and Rejected Designs](#considered-and-rejected-designs)
- [1. Tombstone + GC](#1-tombstone--gc)
- [2. Recovering from abnormal cascading deletion](#2-recovering-from-abnormal-cascading-deletion)
# Overview
Currently most cascading deletion logic is implemented on the client side. For example, when deleting a replica set, kubectl uses a reaper to delete the created pods and then deletes the replica set. We plan to move cascading deletion to the server to simplify the client-side logic. In this proposal, we present the garbage collector, which implements cascading deletion for all API resources in a generic way; we also present the finalizer framework, particularly the "orphan" finalizer, to enable flexible alternation between cascading deletion and orphaning.
Goals of the design include:
* Supporting cascading deletion on the server side.
* Centralizing the cascading deletion logic, rather than spreading it across controllers.
* Optionally allowing the dependent objects to be orphaned.
Non-goals include:
* Releasing the name of an object immediately, so it can be reused ASAP.
* Propagating the grace period in cascading deletion.
# Cascading deletion with Garbage Collector
## API Changes
```
type ObjectMeta struct {
...
OwnerReferences []OwnerReference
}
```
**ObjectMeta.OwnerReferences**:
**ObjectMeta.OwnerReferences**:
List of objects depended on by this object. If ***all*** objects in the list have been deleted, this object will be garbage collected. For example, a replica set `R` created by a deployment `D` should have an entry in ObjectMeta.OwnerReferences pointing to `D`, set by the deployment controller when `R` is created. This field can be updated by any client that has the privilege to both update ***and*** delete the object. For safety reasons, we can add validation rules to restrict what resources could be set as owners. For example, Events will likely be banned from being owners.
```
type OwnerReference struct {
// Version of the referent.
APIVersion string
// Kind of the referent.
Kind string
// Name of the referent.
Name string
// UID of the referent.
UID types.UID
}
```
**OwnerReference struct**: OwnerReference contains enough information to let you identify an owning object. Please refer to the inline comments for the meaning of each field. Currently, an owning object must be in the same namespace as the dependent object, so there is no namespace field.
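For illustration, a controller creating a dependent object would populate this field roughly as follows. This is a sketch using simplified local copies of the structs above (plain strings instead of types.UID; the names and UID are hypothetical):

```go
package main

import "fmt"

// OwnerReference mirrors the struct proposed above, with the UID simplified to a string.
type OwnerReference struct {
	APIVersion string
	Kind       string
	Name       string
	UID        string
}

// ObjectMeta is trimmed to the fields relevant here.
type ObjectMeta struct {
	Name            string
	OwnerReferences []OwnerReference
}

func main() {
	// A deployment controller creating replica set R on behalf of deployment D
	// records D as an owner, so the GC can collect R once D is gone.
	r := ObjectMeta{
		Name: "R",
		OwnerReferences: []OwnerReference{{
			APIVersion: "extensions/v1beta1",
			Kind:       "Deployment",
			Name:       "D",
			UID:        "3c1c57fd-example-uid",
		}},
	}
	fmt.Printf("%+v\n", r)
}
```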
## New components: the Garbage Collector
The Garbage Collector is responsible for deleting an object if none of the owners listed in the object's OwnerReferences exist.
The Garbage Collector consists of a scanner, a garbage processor, and a propagator.
* Scanner:
* Uses the discovery API to detect all the resources supported by the system.
* Periodically scans all resources in the system and adds each object to the *Dirty Queue*.
* Garbage Processor:
* Consists of the *Dirty Queue* and workers.
* Each worker:
* Dequeues an item from *Dirty Queue*.
* If the item's OwnerReferences is empty, continues to process the next item in the *Dirty Queue*.
* Otherwise checks each entry in the OwnerReferences:
* If at least one owner exists, do nothing.
* If none of the owners exist, requests the API server to delete the item (a minimal sketch of this worker logic follows the list).
* Propagator:
* The Propagator is for optimization, not for correctness.
* Consists of an *Event Queue*, a single worker, and a DAG of owner-dependent relations.
* The DAG stores only name/uid/orphan triplets, not the entire body of every item.
* Watches for create/update/delete events for all resources, enqueues the events to the *Event Queue*.
* Worker:
* Dequeues an item from the *Event Queue*.
* If the item is a creation or update, then updates the DAG accordingly.
* If the object has an owner and the owner doesn't exist in the DAG yet, then apart from adding the object to the DAG, also enqueues the object to the *Dirty Queue*.
* If the item is a deletion, then removes the object from the DAG, and enqueues all its dependent objects to the *Dirty Queue*.
* The propagator shouldn't need to do any RPCs, so a single worker should be sufficient. This makes locking easier.
* With the Propagator, we *only* need to run the Scanner when starting the GC to populate the DAG and the *Dirty Queue*.
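A minimal sketch of the Garbage Processor worker logic described above, with the owner lookup stubbed out (in the real component it would consult the Propagator's DAG or the API server); types and names are illustrative:

```go
package main

import "fmt"

type OwnerReference struct {
	Kind string
	Name string
	UID  string
}

type Item struct {
	Name            string
	OwnerReferences []OwnerReference
}

// ownerExists would normally be a lookup against the Propagator's DAG or the API
// server; here it is a stub so the control flow below is runnable.
func ownerExists(ref OwnerReference) bool { return false }

// processItem is the per-item logic of a Garbage Processor worker: an object is
// deleted only when it has owner references and none of the owners still exist.
func processItem(it Item) {
	if len(it.OwnerReferences) == 0 {
		return // nothing to do, move on to the next item in the Dirty Queue
	}
	for _, ref := range it.OwnerReferences {
		if ownerExists(ref) {
			return // at least one owner is alive, keep the object
		}
	}
	fmt.Printf("deleting %s: all owners are gone\n", it.Name)
}

func main() {
	processItem(Item{
		Name:            "R1",
		OwnerReferences: []OwnerReference{{Kind: "Deployment", Name: "D1", UID: "uid-d1"}},
	})
}
```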
# Orphaning the descendants with "orphan" finalizer
Users may want to delete an owning object (e.g., a replicaset) while orphaning the dependent object (e.g., pods), that is, leaving the dependent objects untouched. We support such use cases by introducing the "orphan" finalizer. Finalizer is a generic API that has uses other than supporting orphaning, so we first describe the generic finalizer framework, then describe the specific design of the "orphan" finalizer.
## Part I. The finalizer framework
## API changes
```
type ObjectMeta struct {
Finalizers []string
}
```
**ObjectMeta.Finalizers**: List of finalizers that need to run before deleting the object. This list must be empty before the object is deleted from the registry. Each string in the list is an identifier for the responsible component that will remove the entry from the list. If the deletionTimestamp of the object is non-nil, entries in this list can only be removed. For safety reasons, updating finalizers requires special privileges. To enforce the admission rules, we will expose finalizers as a subresource and disallow directly changing finalizers when updating the main resource.
## New components
* Finalizers:
* Like a controller, a finalizer is always running.
* A third party can develop and run their own finalizer in the cluster. A finalizer doesn't need to be registered with the API server.
* Watches for update events that meet two conditions:
1. the updated object has the identifier of the finalizer in ObjectMeta.Finalizers;
2. ObjectMeta.DeletionTimestamp is updated from nil to non-nil.
* Applies the finalizing logic to the object in the update event.
* After the finalizing logic is completed, removes itself from ObjectMeta.Finalizers (a minimal sketch of this loop follows the list).
* The API server deletes the object after the last finalizer removes itself from the ObjectMeta.Finalizers field.
* Because it's possible for the finalizing logic to be applied multiple times (e.g., the finalizer crashes after applying the finalizing logic but before being removed from ObjectMeta.Finalizers), the finalizing logic has to be idempotent.
* If a finalizer fails to act in a timely manner, users with proper privileges can manually remove the finalizer from ObjectMeta.Finalizers. We will provide a kubectl command to do this.
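A minimal sketch of the finalizer loop described in the list above, using a hypothetical finalizer identifier and a simplified object shape; the finalizing logic itself is elided and, as noted, must be idempotent:

```go
package main

import "fmt"

const myFinalizer = "example.com/my-finalizer" // hypothetical finalizer identifier

type Object struct {
	Name              string
	DeletionTimestamp *string
	Finalizers        []string
}

// handleUpdate is the reaction of a finalizer to an update event: it acts only
// when the object is being deleted and still lists this finalizer, runs its
// (idempotent) finalizing logic, then removes its own entry.
func handleUpdate(obj *Object) {
	if obj.DeletionTimestamp == nil || !contains(obj.Finalizers, myFinalizer) {
		return
	}
	fmt.Printf("running finalizing logic for %s\n", obj.Name) // must be idempotent

	obj.Finalizers = remove(obj.Finalizers, myFinalizer)
	// The API server deletes the object once the last finalizer is removed.
}

func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

func remove(list []string, s string) []string {
	out := list[:0]
	for _, v := range list {
		if v != s {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	ts := "2016-05-01T00:00:00Z"
	obj := &Object{Name: "D1", DeletionTimestamp: &ts, Finalizers: []string{myFinalizer}}
	handleUpdate(obj)
	fmt.Printf("remaining finalizers: %v\n", obj.Finalizers)
}
```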
## Changes to existing components
* API server:
* Deletion handler:
* If the `ObjectMeta.Finalizers` of the object being deleted is non-empty, then updates the DeletionTimestamp, but does not delete the object.
* If the `ObjectMeta.Finalizers` is empty and the options.GracePeriod is zero, then deletes the object. If the options.GracePeriod is non-zero, then just updates the DeletionTimestamp.
* Update handler:
* If the update removes the last finalizer, and the DeletionTimestamp is non-nil, and the DeletionGracePeriodSeconds is zero, then deletes the object from the registry.
* If the update removes the last finalizer, and the DeletionTimestamp is non-nil, but the DeletionGracePeriodSeconds is non-zero, then just updates the object.
## Part II. The "orphan" finalizer
## API changes
```
type DeleteOptions struct {
OrphanDependents bool
}
```
**DeleteOptions.OrphanDependents**: allows a user to express whether the dependent objects should be orphaned. It defaults to true, because controllers before release 1.2 expect dependent objects to be orphaned.
## Changes to existing components
* API server:
When handling a deletion request, depending on whether DeleteOptions.OrphanDependents is true, the API server updates the object to add or remove the "orphan" finalizer in the ObjectMeta.Finalizers map (sketched below).
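A sketch of that API-server-side adjustment, under the simplifying assumption that the finalizer is identified by the plain string "orphan" and using a trimmed object shape:

```go
package main

import "fmt"

const orphanFinalizer = "orphan" // illustrative identifier for the "orphan" finalizer

type DeleteOptions struct {
	OrphanDependents bool
}

type Object struct {
	Name       string
	Finalizers []string
}

// reconcileOrphanFinalizer is the API-server-side adjustment described above:
// the "orphan" finalizer is added when orphaning is requested and removed when
// the caller asks for cascading deletion.
func reconcileOrphanFinalizer(obj *Object, opts DeleteOptions) {
	has := false
	for _, f := range obj.Finalizers {
		if f == orphanFinalizer {
			has = true
		}
	}
	switch {
	case opts.OrphanDependents && !has:
		obj.Finalizers = append(obj.Finalizers, orphanFinalizer)
	case !opts.OrphanDependents && has:
		kept := obj.Finalizers[:0]
		for _, f := range obj.Finalizers {
			if f != orphanFinalizer {
				kept = append(kept, f)
			}
		}
		obj.Finalizers = kept
	}
}

func main() {
	d := &Object{Name: "D1"}
	reconcileOrphanFinalizer(d, DeleteOptions{OrphanDependents: true})
	fmt.Printf("finalizers after orphaning delete: %v\n", d.Finalizers)
}
```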
## New components
Adding a fourth component to the Garbage Collector, the "orphan" finalizer:
* Watches for update events as described in [Part I](#part-i-the-finalizer-framework).
* Removes the object in the event from the `OwnerReferences` of its dependents.
* Dependent objects can be found via the DAG kept by the GC, or by relisting the dependent resource and checking the OwnerReferences field of each potential dependent object.
* Also removes any dangling owner references the dependent objects have.
* Finally, removes itself from the `ObjectMeta.Finalizers` of the object.
# Related issues
## Orphan adoption
Controllers are responsible for adopting orphaned dependent resources. To do so, a controller:
* Checks a potential dependent object's OwnerReferences to determine if it is orphaned.
* Fills in the OwnerReferences if the object matches the controller's selector and is orphaned.
There is a potential race between the "orphan" finalizer removing an owner reference and the controllers adding it back during adoption. Imagine this case: a user deletes an owning object and intends to orphan the dependent objects, so the GC removes the owner from the dependent object's OwnerReferences list, but the controller of the owner resource hasn't observed the deletion yet, so it adopts the dependent again and adds the reference back, resulting in the mistaken deletion of the dependent object. This race can be avoided by implementing Status.ObservedGeneration in all resources. Before updating the dependent object's OwnerReferences, the "orphan" finalizer checks Status.ObservedGeneration of the owning object to ensure its controller has already observed the deletion.
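The guard the "orphan" finalizer applies can be expressed very simply; the field names below mirror the proposal, with plain integers standing in for the real generation types:

```go
package main

import "fmt"

type Owner struct {
	Generation         int64 // bumped when the deletion (with orphaning) is requested
	ObservedGeneration int64 // updated by the owner's controller once it sees the change
}

// safeToOrphan implements the race-avoidance check described above: the "orphan"
// finalizer only starts stripping owner references once the owning controller has
// observed the generation at which deletion was requested.
func safeToOrphan(o Owner) bool {
	return o.ObservedGeneration >= o.Generation
}

func main() {
	fmt.Println(safeToOrphan(Owner{Generation: 4, ObservedGeneration: 3})) // false: wait
	fmt.Println(safeToOrphan(Owner{Generation: 4, ObservedGeneration: 4})) // true: proceed
}
```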
## Upgrading a cluster to support cascading deletion
For the master, after upgrading to a version that supports cascading deletion, the OwnerReferences of existing objects remain empty, so the controllers will regard them as orphaned and start the adoption procedures. After the adoptions are done, server-side cascading will be effective for these existing objects.
For nodes, cascading deletion does not affect them.
For kubectl, we will keep kubectl's cascading deletion logic for one more release.
# End-to-End Examples
This section presents an example of all components working together to enforce the cascading deletion or orphaning.
## Life of a Deployment and its descendants
1. User creates a deployment `D1`.
2. The Propagator of the GC observes the creation. It creates an entry of `D1` in the DAG.
3. The deployment controller observes the creation of `D1`. It creates the replicaset `R1`, whose OwnerReferences field contains a reference to `D1`, and has the "orphan" finalizer in its ObjectMeta.Finalizers map.
4. The Propagator of the GC observes the creation of `R1`. It creates an entry of `R1` in the DAG, with `D1` as its owner.
5. The replicaset controller observes the creation of `R1` and creates Pods `P1`~`Pn`, all with `R1` in their OwnerReferences.
6. The Propagator of the GC observes the creation of `P1`~`Pn`. It creates entries for them in the DAG, with `R1` as their owner.
***In case the user wants to cascadingly delete `D1`'s descendants, then***
7. The user deletes the deployment `D1`, with `DeleteOptions.OrphanDependents=false`. The API server checks if `D1` has the "orphan" finalizer in its Finalizers map; if so, it updates `D1` to remove the "orphan" finalizer. Then the API server deletes `D1`.
8. The "orphan" finalizer does *not* take any action, because the observed deletion shows `D1` has an empty Finalizers map.
9. The Propagator of the GC observes the deletion of `D1`. It deletes `D1` from the DAG. It adds its dependent object, replicaset `R1`, to the *dirty queue*.
10. The Garbage Processor of the GC dequeues `R1` from the *dirty queue*. It finds `R1` has an owner reference pointing to `D1`, and `D1` no longer exists, so it requests API server to delete `R1`, with `DeleteOptions.OrphanDependents=false`. (The Garbage Processor should always set this field to false.)
11. The API server updates `R1` to remove the "orphan" finalizer if it's in the `R1`'s Finalizers map. Then the API server deletes `R1`, as `R1` has an empty Finalizers map.
12. The Propagator of the GC observes the deletion of `R1`. It deletes `R1` from the DAG. It adds its dependent objects, Pods `P1`~`Pn`, to the *Dirty Queue*.
13. The Garbage Processor of the GC dequeues `Px` (1 <= x <= n) from the *Dirty Queue*. It finds that `Px` has an owner reference pointing to `D1`, and `D1` no longer exists, so it requests the API server to delete `Px`, with `DeleteOptions.OrphanDependents=false`.
14. API server deletes the Pods.
***In case the user wants to orphan `D1`'s descendants, then***
7. The user deletes the deployment `D1`, with `DeleteOptions.OrphanDependents=true`.
8. The API server first updates `D1`, with DeletionTimestamp=now and DeletionGracePeriodSeconds=0, increments the Generation by 1, and adds the "orphan" finalizer to ObjectMeta.Finalizers if it's not present yet. The API server does not delete `D1`, because its Finalizers map is not empty.
9. The deployment controller observes the update, and acknowledges it by updating `D1`'s ObservedGeneration. The deployment controller won't create more replicasets on `D1`'s behalf.
10. The "orphan" finalizer observes the update, and notes down the Generation. It waits until the ObservedGeneration becomes equal to or greater than the noted Generation. Then it updates `R1` to remove `D1` from its OwnerReferences. At last, it updates `D1`, removing itself from `D1`'s Finalizers map.
11. The API server handles the update of `D1`; because *i)* DeletionTimestamp is non-nil, *ii)* the DeletionGracePeriodSeconds is zero, and *iii)* the last finalizer is removed from the Finalizers map, the API server deletes `D1`.
12. The Propagator of the GC observes the deletion of `D1`. It deletes `D1` from the DAG. It adds its dependent, replicaset `R1`, to the *Dirty Queue*.
13. The Garbage Processor of the GC dequeues `R1` from the *Dirty Queue* and skips it, because its OwnerReferences is empty.
# Open Questions
1. In case an object has multiple owners, some owners are deleted with DeleteOptions.OrphanDependents=true, and some are deleted with DeleteOptions.OrphanDependents=false, what should happen to the object?
The presented design will respect the setting in the deletion request of the last owner.
2. How to propagate the grace period in a cascading deletion? For example, when deleting a ReplicaSet with a grace period of 5s, a user may expect the same grace period to be applied to the deletion of the Pods controlled by the ReplicaSet.
Propagating the grace period in a cascading deletion is a ***non-goal*** of this proposal. Nevertheless, the presented design can be extended to support it. A tentative solution is letting the garbage collector propagate the grace period when deleting dependent objects. To persist the grace period set by the user, the owning object should not be deleted from the registry until all its dependent objects are in the graceful deletion state. This could be ensured by introducing another finalizer, tentatively named the "populating graceful deletion" finalizer. Upon receiving the graceful deletion request, the API server adds this finalizer to the finalizers list of the owning object. Later the GC will remove it when all dependents are in the graceful deletion state.
[#25055](https://github.com/kubernetes/kubernetes/issues/25055) tracks this problem.
3. How can a client know when the cascading deletion is completed?
A tentative solution is introducing a "completing cascading deletion" finalizer, which will be added to the finalizers list of the owning object, and removed by the GC when all dependents are deleted. The user can watch for the deletion event of the owning object to ensure the cascading deletion process has completed.
---
***THE REST IS FOR ARCHIVAL PURPOSES***
---
# Considered and Rejected Designs
# 1. Tombstone + GC
## Reasons of rejection
* It likely would conflict with our plan in the future to use all resources as their own tombstones, once the registry supports multi-object transactions.
* The TTL of the tombstone is hand-wavy; there is no guarantee that the chosen TTL value is long enough.
* This design is essentially the same as the selected design, with the tombstone as an extra element. The benefit the extra complexity buys is that a parent object can be deleted immediately even if the user wants to orphan the children. The benefit doesn't justify the complexity.
## API Changes
```
type DeleteOptions struct {
OrphanChildren bool
}
```
**DeleteOptions.OrphanChildren**: allows a user to express whether the child objects should be orphaned.
```
type ObjectMeta struct {
...
ParentReferences []ObjectReference
}
```
**ObjectMeta.ParentReferences**: links the resource to the parent resources. For example, a replica set `R` created by a deployment `D` should have an entry in ObjectMeta.ParentReferences pointing to `D`. The link should be set when the child object is created. It can be updated after the creation.
```
type Tombstone struct {
unversioned.TypeMeta
ObjectMeta
UID types.UID
}
```
**Tombstone**: a tombstone is created when an object is deleted and the user requires the children to be orphaned.
**Tombstone.UID**: the UID of the original object.
## New components
The only new component is the Garbage Collector, which consists of a scanner, a garbage processor, and a propagator.
* Scanner:
* Uses the discovery API to detect all the resources supported by the system.
* For performance reasons, resources can be marked in the discovery info as not participating in cascading deletion; the GC will then not monitor them.
* Periodically scans all resources in the system and adds each object to the *Dirty Queue*.
* Garbage Processor:
* Consists of the *Dirty Queue* and workers.
* Each worker:
* Dequeues an item from *Dirty Queue*.
* If the item's ParentReferences is empty, continues to process the next item in the *Dirty Queue*.
* Otherwise checks each entry in the ParentReferences:
* If a parent exists, continues to check the next parent.
* If a parent doesn't exist, checks if a tombstone standing for the parent exists.
* If the steps above show that neither a parent nor a tombstone exists, requests the API server to delete the item. That is, the child object will be garbage collected only if ***all*** parents are non-existent and none of them have tombstones.
* Otherwise removes the item's ParentReferences to non-existent parents.
* Propagator:
* The Propagator is for optimization, not for correctness.
* Maintains a DAG of parent-child relations. This DAG stores only name/uid/orphan triplets, not the entire body of every item.
* Consists of an *Event Queue* and a single worker.
* Watches for create/update/delete events for all resources that participate in cascading deletion, and enqueues the events to the *Event Queue*.
* Worker:
* Dequeues an item from the *Event Queue*.
* If the item is a creation or update, then updates the DAG accordingly.
* If the object has a parent and the parent doesn't exist in the DAG yet, then apart from adding the object to the DAG, also enqueues the object to the *Dirty Queue*.
* If the item is a deletion, then removes the object from the DAG, and enqueues all its children to the *Dirty Queue*.
* The propagator shouldn't need to do any RPCs, so a single worker should be sufficient. This makes locking easier.
* With the Propagator, we *only* need to run the Scanner when starting the Propagator to populate the DAG and the *Dirty Queue*.
## Changes to existing components
* Storage: we should add a REST storage for Tombstones. The index should be UID rather than namespace/name.
* API Server: when handling a deletion request, if DeleteOptions.OrphanChildren is true, then the API Server either creates a tombstone with TTL if the tombstone doesn't exist yet, or updates the TTL of the existing tombstone. The API Server deletes the object after the tombstone is created.
* Controllers: when creating child objects, controllers need to fill in their ObjectMeta.ParentReferences field. Objects that don't have a parent should have the namespace object as the parent.
## Comparison with the selected design
The main difference between the two designs is when to update the ParentReferences. In design #1, because a tombstone is created to indicate "orphaning" is desired, the updates to ParentReferences can be deferred until the deletion of the tombstone. In design #2, the updates need to be done before the parent object is deleted from the registry.
* Advantages of "Tombstone + GC" design
* Faster to free the resource name compared to using finalizers. The original object can be deleted to free the resource name once the tombstone is created, rather than waiting for the finalizers to update all children's ObjectMeta.ParentReferences.
* Advantages of "Finalizer Framework + GC"
* The finalizer framework is needed for other purposes as well.
# 2. Recovering from abnormal cascading deletion
## Reasons of rejection
* Not a goal
* Tons of work, not feasible in the near future
In case the garbage collector is mistakenly deleting objects, we should provide a mechanism to stop the garbage collector and restore the objects.
* Stopping the garbage collector
We will add a "--enable-garbage-collector" flag to the controller manager binary to indicate if the garbage collector should be enabled. Admin can stop the garbage collector in a running cluster by restarting the kube-controller-manager with --enable-garbage-collector=false.
* Restoring mistakenly deleted objects
* Guidelines
* The restoration should be implemented as a roll-forward rather than a roll-back, because likely the state of the cluster (e.g., available resources on a node) has changed since the object was deleted.
* Need to archive the complete specs of the deleted objects.
* The content of the archive is sensitive, so access to the archive is subject to the same authorization policy enforced on the original resource.
* States should be stored in etcd. All components should remain stateless.
* A preliminary design
This is a generic design for "undoing a deletion", not specific to undoing cascading deletion.
* Add a `/archive` sub-resource to every resource, it's used to store the spec of the deleted objects.
* Before an object is deleted from the registry, the API server clears fields like DeletionTimestamp, then creates the object in /archive and sets a TTL.
* Add a `kubectl restore` command, which takes a resource/name pair as input, creates the object with the spec stored in the /archive, and deletes the archived object.

@ -1,279 +1 @@
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/gpu-support.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/gpu-support.md)
<!-- BEGIN MUNGE: GENERATED_TOC -->
- [GPU support](#gpu-support)
- [Objective](#objective)
- [Background](#background)
- [Detailed discussion](#detailed-discussion)
- [Inventory](#inventory)
- [Scheduling](#scheduling)
- [The runtime](#the-runtime)
- [NVIDIA support](#nvidia-support)
- [Event flow](#event-flow)
- [Too complex for now: nvidia-docker](#too-complex-for-now-nvidia-docker)
- [Implementation plan](#implementation-plan)
- [V0](#v0)
- [Scheduling](#scheduling-1)
- [Runtime](#runtime)
- [Other](#other)
- [Future work](#future-work)
- [V1](#v1)
- [V2](#v2)
- [V3](#v3)
- [Undetermined](#undetermined)
- [Security considerations](#security-considerations)
<!-- END MUNGE: GENERATED_TOC -->
# GPU support
Author: @therc
Date: Apr 2016
Status: Design in progress, early implementation of requirements
## Objective
Users should be able to request GPU resources for their workloads, as easily as
for CPU or memory. Kubernetes should keep an inventory of machines with GPU
hardware, schedule containers on appropriate nodes and set up the container
environment with all that's necessary to access the GPU. All of this should
eventually be supported for clusters on either bare metal or cloud providers.
## Background
An increasing number of workloads, such as machine learning and seismic survey
processing, benefit from offloading computations to graphics hardware. While not
as tuned as traditional, dedicated high performance computing systems such as
MPI, a Kubernetes cluster can still be a great environment for organizations
that need a variety of additional, "classic" workloads, such as database, web
serving, etc.
GPU support is hard to provide extensively and will thus take time to tame
completely, because
- different vendors expose the hardware to users in different ways
- some vendors require fairly tight coupling between the kernel driver
controlling the GPU and the libraries/applications that access the hardware
- it adds more resource types (whole GPUs, GPU cores, GPU memory)
- it can introduce new security pitfalls
- for systems with multiple GPUs, affinity matters, similarly to NUMA
considerations for CPUs
- running GPU code in containers is still a relatively novel idea
## Detailed discussion
Currently, this document is mostly focused on the basic use case: run GPU code
on AWS `g2.2xlarge` EC2 machine instances using Docker. It constitutes a narrow
enough scenario that it does not require large amounts of generic code yet. GCE
doesn't support GPUs at all; bare metal systems throw a lot of extra variables
into the mix.
Later sections will outline future work to support a broader set of hardware,
environments and container runtimes.
### Inventory
Before any scheduling can occur, we need to know what's available out there. In
v0, we'll hardcode capacity detected by the kubelet based on a flag,
`--experimental-nvidia-gpu`. This will result in the user-defined resource
`alpha.kubernetes.io/nvidia-gpu` being reported for `NodeCapacity` and
`NodeAllocatable`, as well as exposed as a node label.
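Roughly, the kubelet-side effect of the flag could look like the following sketch; the field shapes are simplified (the real node status uses resource.Quantity values), and only the resource name comes from this proposal:

```go
package main

import "fmt"

// NvidiaGPUResource is the experimental resource name proposed above.
const NvidiaGPUResource = "alpha.kubernetes.io/nvidia-gpu"

// Node is a trimmed stand-in for the node status fields involved.
type Node struct {
	Capacity    map[string]int64
	Allocatable map[string]int64
	Labels      map[string]string
}

// reportGPU is roughly what the kubelet would do when started with
// --experimental-nvidia-gpu: advertise one whole device in capacity and
// allocatable, and mirror it as a node label.
func reportGPU(n *Node, enabled bool) {
	if !enabled {
		return
	}
	n.Capacity[NvidiaGPUResource] = 1
	n.Allocatable[NvidiaGPUResource] = 1
	n.Labels[NvidiaGPUResource] = "1"
}

func main() {
	n := &Node{Capacity: map[string]int64{}, Allocatable: map[string]int64{}, Labels: map[string]string{}}
	reportGPU(n, true)
	fmt.Printf("%+v\n", *n)
}
```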
### Scheduling
GPUs will be visible as first-class resources. In v0, we'll only assign whole
devices; sharing among multiple pods is left to future implementations. It's
probable that GPUs will exacerbate the need for [a rescheduler](rescheduler.md)
or pod priorities, especially if the nodes in a cluster are not homogeneous.
Consider these two cases:
> Only half of the machines have a GPU and they're all busy with other
workloads. The other half of the cluster is doing very little work. A GPU
workload comes, but it can't schedule, because the devices are sitting idle on
nodes that are running something else and the nodes with little load lack the
hardware.
> Some or all the machines have two graphic cards each. A number of jobs get
scheduled, requesting one device per pod. The scheduler puts them all on
different machines, spreading the load, perhaps by design. Then a new job comes
in, requiring two devices per pod, but it can't schedule anywhere, because all
we can find, at most, is one unused device per node.
### The runtime
Once we know where to run the container, it's time to set up its environment. At
a minimum, we'll need to map the host device(s) into the container. Because each
manufacturer exposes different device nodes (`/dev/ati/card0`, `/dev/nvidia0`,
but also the required `/dev/nvidiactl` and `/dev/nvidia-uvm`), some of the logic
needs to be hardware-specific, mapping from a logical device to a list of device
nodes necessary for software to talk to it.
Support binaries and libraries are often versioned along with the kernel module,
so there should be further hooks to project those under `/bin` and some kind of
`/lib` before the application is started. This can be done for Docker with the
use of a versioned [Docker
volume](https://docs.docker.com/engine/tutorials/dockervolumes/) or
with upcoming Kubernetes-specific hooks such as init containers and volume
containers. In v0, images are expected to bundle everything they need.
#### NVIDIA support
The first implementation and testing ground will be for NVIDIA devices, by far
the most common setup.
In v0, the `--experimental-nvidia-gpu` flag will also result in the host devices
(limited to those required to drive the first card, `nvidia0`) to be mapped into
the container by the dockertools library.
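A sketch of that device mapping, with local stand-ins for the container-options and device types (the real integration would go through dockertools and Docker's engine-api Resources.Devices list):

```go
package main

import "fmt"

// nvidiaDevices lists the host device nodes needed to drive the first NVIDIA
// card, as described above; the control devices are required alongside nvidia0.
var nvidiaDevices = []string{"/dev/nvidia0", "/dev/nvidiactl", "/dev/nvidia-uvm"}

// Device and RunContainerOptions are simplified stand-ins for the kubelet types.
type Device struct {
	PathOnHost      string
	PathInContainer string
	Permissions     string
}

type RunContainerOptions struct {
	Devices []Device
}

// addGPUDevices is a sketch of what the dockertools integration would do when a
// pod requests the GPU resource: map each required host device node into the
// container at the same path, with read/write/mknod permissions.
func addGPUDevices(opts *RunContainerOptions) {
	for _, d := range nvidiaDevices {
		opts.Devices = append(opts.Devices, Device{PathOnHost: d, PathInContainer: d, Permissions: "mrw"})
	}
}

func main() {
	var opts RunContainerOptions
	addGPUDevices(&opts)
	fmt.Printf("%+v\n", opts)
}
```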
### Event flow
This is what happens before and after a user schedules a GPU pod.
1. Administrator installs a number of Kubernetes nodes with GPUs. The correct
kernel modules and device nodes under `/dev/` are present.
1. Administrator makes sure the latest CUDA/driver versions are installed.
1. Administrator enables `--experimental-nvidia-gpu` on kubelets
1. Kubelets update node status with information about the GPU device, in addition
to cAdvisor's usual data about CPU/memory/disk
1. User creates a Docker image compiling their application for CUDA, bundling
the necessary libraries. We ignore any versioning requirements expressed in the
image through labels based on [NVIDIA's
conventions](https://github.com/NVIDIA/nvidia-docker/blob/64510511e3fd0d00168eb076623854b0fcf1507d/tools/src/nvidia-docker/utils.go#L13).
1. User creates a pod using the image, requiring
`alpha.kubernetes.io/nvidia-gpu: 1`
1. Scheduler picks a node for the pod
1. The kubelet notices the GPU requirement and maps the three devices. In
Docker's engine-api, this means it'll add them to the Resources.Devices list.
1. Docker runs the container to completion
1. The scheduler notices that the device is available again
### Too complex for now: nvidia-docker
For v0, we discussed at length, but decided to leave aside initially the
[nvidia-docker plugin](https://github.com/NVIDIA/nvidia-docker). The plugin is
an officially supported solution, thus avoiding a lot of new low level code, as
it takes care of functionality such as:
- creating a Docker volume with binaries such as `nvidia-smi` and shared
libraries
- providing HTTP endpoints that monitoring tools can use to collect GPU metrics
- abstracting details such as `/dev` entry names for each device, as well as
control ones like `nvidiactl`
The `nvidia-docker` wrapper also verifies that the CUDA version required by a
given image is supported by the host drivers, through inspection of well-known
image labels, if present. We should try to provide equivalent checks, either
for CUDA or OpenCL.
This is current sample output from `nvidia-docker-plugin`, wrapped for
readability:
```
$ curl -s localhost:3476/docker/cli
--device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0
--volume-driver=nvidia-docker
--volume=nvidia_driver_352.68:/usr/local/nvidia:ro
```
It runs as a daemon listening for HTTP requests on port 3476. The endpoint above
returns flags that need to be added to the Docker command line in order to
expose GPUs to the containers. There are optional URL arguments to request
specific devices if more than one are present on the system, as well as specific
versions of the support software. An obvious improvement is an additional
endpoint for JSON output.
The unresolved question is whether `nvidia-docker-plugin` would run standalone
as it does today (called over HTTP, perhaps with endpoints for a new Kubernetes
resource API) or whether the relevant code from its `nvidia` package should be
linked directly into kubelet. A partial list of tradeoffs:
| | External binary | Linked in |
|---------------------|---------------------------------------------------------------------------------------------------|--------------------------------------------------------------|
| Use of cgo | Confined to binary | Linked into kubelet, but with lazy binding |
| Expandability | Limited if we run the plugin, increased if the library is used to build a Kubernetes-tailored daemon. | Can reuse the `nvidia` library as we prefer |
| Bloat | None | Larger kubelet, even for systems without GPUs |
| Reliability | Need to handle the binary disappearing at any time | Fewer headaches |
| (Un)Marshalling | Need to talk over JSON | None |
| Administration cost | One more daemon to install, configure and monitor | No extra work required, other than perhaps configuring flags |
| Releases | Potentially on its own schedule | Tied to Kubernetes' |
## Implementation plan
### V0
The first two tracks can progress in parallel.
#### Scheduling
1. Define the new resource `alpha.kubernetes.io/nvidia-gpu` in `pkg/api/types.go`
and co.
1. Plug resource into feasibility checks used by kubelet, scheduler and
schedulercache. Maybe gated behind a flag?
1. Plug resource into resource_helpers.go
1. Plug resource into the limitranger
#### Runtime
1. Add kubelet config parameter to enable the resource
1. Make kubelet's `setNodeStatusMachineInfo` report the resource
1. Add a Devices list to container.RunContainerOptions
1. Use it from DockerManager's runContainer
1. Do the same for rkt (stretch goal)
1. When a pod requests a GPU, add the devices to the container options
#### Other
1. Add new resource to `kubectl describe` output. Optional for non-GPU users?
1. Administrator documentation, with sample scripts
1. User documentation
## Future work
Above all, we need to collect feedback from real users and use that to set
priorities for any of the items below.
### V1
- Perform real detection of the installed hardware
- Figure out a standard way to avoid bundling shared libraries in images
- Support fractional resources so multiple pods can share the same GPU
- Support bare metal setups
- Report resource usage
### V2
- Support multiple GPUs with resource hierarchies and affinities
- Support versioning of resources (e.g. "CUDA v7.5+")
- Build resource plugins into the kubelet?
- Support other device vendors
- Support Azure?
- Support rkt?
### V3
- Support OpenCL (so images can be device-agnostic)
### Undetermined
It makes sense to turn the output of this project (external resource plugins,
etc.) into a more generic abstraction at some point.
## Security considerations
There should be knobs for the cluster administrator to only allow certain users
or roles to schedule GPU workloads. Overcommitting or sharing the same device
across different pods is not considered safe. It should be possible to segregate
such GPU-sharing pods by user, namespace or a combination thereof.

@ -1,8 +1 @@
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/high-availability.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/high-availability.md)
# High Availability of Scheduling and Controller Components in Kubernetes
This document is deprecated. For more details about running a highly available
cluster master, please see the [admin instructions document](../../docs/admin/high-availability.md).

@ -1,331 +1 @@
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/image-provenance.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/image-provenance.md)
# Overview
Organizations wish to avoid running "unapproved" images.
The exact nature of "approval" is beyond the scope of Kubernetes, but may include reasons like:
- only run images that are scanned to confirm they do not contain vulnerabilities
- only run images that use a "required" base image
- only run images that contain binaries which were built from peer reviewed, checked-in source
by a trusted compiler toolchain.
- only allow images signed by certain public keys.
- etc...
Goals of the design include:
* Block creation of pods that would cause "unapproved" images to run.
* Make it easy for users or partners to build "image provenance checkers" which check whether images are "approved".
* We expect there will be multiple implementations.
* Allow users to request an "override" of the policy in a convenient way (subject to the override being allowed).
* "overrides" are needed to allow "emergency changes", but need to not happen accidentally, since they may
require tedious after-the-fact justification and affect audit controls.
Non-goals include:
* Encoding image policy into Kubernetes code.
* Implementing objects in core kubernetes which describe complete policies for what images are approved.
* A third-party implementation of an image policy checker could optionally use ThirdPartyResource to store its policy.
* Kubernetes core code dealing with concepts of image layers, build processes, source repositories, etc.
* We expect there will be multiple PaaSes and/or de-facto programming environments, each with different takes on
these concepts. At any rate, Kubernetes is not ready to be opinionated on these concepts.
* Sending more information than strictly needed to a third-party service.
* Information sent by Kubernetes to a third-party service constitutes an API of Kubernetes, and we want to
avoid making these broader than necessary, as it restricts future evolution of Kubernetes, and makes
Kubernetes harder to reason about. Also, excessive information limits cache-ability of decisions. Caching
reduces latency and allows short outages of the backend to be tolerated.
Detailed discussion in [Ensuring only images are from approved sources are run](
https://github.com/kubernetes/kubernetes/issues/22888).
# Implementation
A new admission controller will be added. That will be the only change.
## Admission controller
An `ImagePolicyWebhook` admission controller will be written. The admission controller examines all pod objects which are
created or updated. It can either admit the pod, or reject it. If it is rejected, the request sees a `403 FORBIDDEN`.
The admission controller code will go in `plugin/pkg/admission/imagepolicy`.
There will be a cache of decisions in the admission controller.
If the apiserver cannot reach the webhook backend, it will log a warning and either admit or deny the pod.
A flag will control whether it admits or denies on failure.
The rationale for deny is that an attacker could DoS the backend or wait for it to be down, and then sneak a
bad pod into the system. The rationale for allow here is that, if the cluster admin also does
after-the-fact auditing of what images were run (which we think will be common), this will catch
any bad images run during periods of backend failure. With default-allow, the availability of Kubernetes does
not depend on the availability of the backend.
# Webhook Backend
The admission controller code in that directory does not contain logic to make an admit/reject decision. Instead, it extracts
relevant fields from the Pod creation/update request and sends those fields to a Backend (which we have been loosely calling "WebHooks"
in Kubernetes). The request the admission controller sends to the backend is called a WebHook request to distinguish it from the
request being admission-controlled. The server that accepts the WebHook request from Kubernetes is called the "Backend"
to distinguish it from the WebHook request itself, and from the API server.
The whole system will work similarly to the [Authentication WebHook](
https://github.com/kubernetes/kubernetes/pull/24902
) or the [AuthorizationWebHook](
https://github.com/kubernetes/kubernetes/pull/20347).
The WebHook request can optionally authenticate itself to its backend using a token from a `kubeconfig` file.
The WebHook request and response are JSON, and correspond to the following `go` structures:
```go
// Filename: pkg/apis/imagepolicy.k8s.io/register.go
package imagepolicy
// ImageReview checks if the set of images in a pod are allowed.
type ImageReview struct {
unversioned.TypeMeta
// Spec holds information about the pod being evaluated
Spec ImageReviewSpec
// Status is filled in by the backend and indicates whether the pod should be allowed.
Status ImageReviewStatus
}
// ImageReviewSpec is a description of the pod creation request.
type ImageReviewSpec struct {
// Containers is a list of a subset of the information in each container of the Pod being created.
Containers []ImageReviewContainerSpec
// Annotations is a list of key-value pairs extracted from the Pod's annotations.
// It only includes keys which match the pattern `*.image-policy.k8s.io/*`.
// It is up to each webhook backend to determine how to interpret these annotations, if at all.
Annotations map[string]string
// Namespace is the namespace the pod is being created in.
Namespace string
}
// ImageReviewContainerSpec is a description of a container within the pod creation request.
type ImageReviewContainerSpec struct {
Image string
// In future, we may add command line overrides, exec health check command lines, and so on.
}
// ImageReviewStatus is the result of the image review request.
type ImageReviewStatus struct {
// Allowed indicates that all images were allowed to be run.
Allowed bool
// Reason should be empty unless Allowed is false in which case it
// may contain a short description of what is wrong. Kubernetes
// may truncate excessively long errors when displaying to the user.
Reason string
}
```
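For concreteness, this sketch marshals the kind of payload the admission controller would POST to the backend, using simplified local copies of the structs above (the TypeMeta embedding and status are omitted, and the annotation key is hypothetical):

```go
package main

import (
	"encoding/json"
	"fmt"
)

type ImageReviewContainerSpec struct {
	Image string `json:"image"`
}

type ImageReviewSpec struct {
	Containers  []ImageReviewContainerSpec `json:"containers"`
	Annotations map[string]string          `json:"annotations,omitempty"`
	Namespace   string                     `json:"namespace"`
}

type ImageReview struct {
	Spec ImageReviewSpec `json:"spec"`
}

func main() {
	// The admission controller extracts only the image names, matching
	// annotations and namespace from the pod being created.
	review := ImageReview{Spec: ImageReviewSpec{
		Containers: []ImageReviewContainerSpec{{Image: "myrepo/myimage:v1"}},
		Annotations: map[string]string{
			// Hypothetical key; anything matching *.image-policy.k8s.io/* is forwarded.
			"break-glass.image-policy.k8s.io/ticket": "TICKET-1234",
		},
		Namespace: "default",
	}}
	body, _ := json.MarshalIndent(review, "", "  ")
	fmt.Println(string(body))
}
```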
## Extending with Annotations
All annotations on a Pod that match `*.image-policy.k8s.io/*` are sent to the webhook.
Sending annotations allows users who are aware of the image policy backend to send
extra information to it, and for different backends implementations to accept
different information.
Examples of information you might put here are
- request to "break glass" to override a policy, in case of emergency.
- a ticket number from a ticket system that documents the break-glass request
- provide a hint to the policy server as to the imageID of the image being provided, to save it a lookup
In any case, the annotations are provided by the user and are not validated by Kubernetes in any way. In the future, if an annotation is determined to be widely
useful, we may promote it to a named field of ImageReviewSpec.
In the case of a Pod update, Kubernetes may send the backend either all images in the updated image, or only the ones that
changed, at its discretion.
## Interaction with Controllers
In the case of a Deployment object, no image check is done when the Deployment object is created or updated.
Likewise, no check happens when the Deployment controller creates a ReplicaSet. The check only happens
when the ReplicaSet controller creates a Pod. Checking Pod is necessary since users can directly create pods,
and since third-parties can write their own controllers, which kubernetes might not be aware of or even contain
pod templates.
The ReplicaSet, or other controller, is responsible for recognizing when a 403 has happened
(whether due to user not having permission due to bad image, or some other permission reason)
and throttling itself and surfacing the error in a way that CLIs and UIs can show to the user.
Issue [22298](https://github.com/kubernetes/kubernetes/issues/22298) needs to be resolved to
propagate Pod creation errors up through a stack of controllers.
## Changes in policy over time
The Backend might change the policy over time. For example, yesterday `redis:v1` was allowed, but today `redis:v1` is not allowed
due to a CVE that just came out (fictional scenario). In this scenario:
- a newly created replicaSet will be unable to create Pods.
- updating a deployment will be safe in the sense that it will detect that the new ReplicaSet is not scaling
up and not scale down the old one.
- an existing replicaSet will be unable to create Pods that replace ones which are terminated. If this is due to
slow loss of nodes, then there should be time to react before significant loss of capacity.
- For non-replicated things (size 1 ReplicaSet, StatefulSet), a single node failure may disable it.
- a node rolling update will eventually check for liveness of replacements, and would be throttled
in the case where the image was no longer allowed and so replacements could not be started.
- rapid node restarts will cause existing pod objects to be restarted by kubelet.
- slow node restarts or network partitions will cause the node controller to delete pods and there will be no replacements
It is up to the Backend implementor, and the cluster administrator who decides to use that backend, to decide
whether the Backend should be allowed to change its mind. There is a tradeoff between responsiveness
to changes in policy, versus keeping existing services running. The two models that make sense are:
- never change a policy, unless some external process has ensured no active objects depend on the to-be-forbidden
images.
- change a policy and assume that transition to new image happens faster than the existing pods decay.
## Ubernetes
If two clusters share an image policy backend, then they will have the same policies.
The clusters can pass different tokens to the backend, and the backend can use this to distinguish
between different clusters.
## Image tags and IDs
Image tags are like: `myrepo/myimage:v1`.
Image IDs are like: `myrepo/myimage@sha256:beb6bd6a68f114c1dc2ea4b28db81bdf91de202a9014972bec5e4d9171d90ed`.
You can see image IDs with `docker images --no-trunc`.
The Backend needs to be able to resolve tags to IDs (by talking to the images repo).
If the Backend resolves tags to IDs, there is some risk that the tag-to-ID mapping will be
modified after approval by the Backend, but before Kubelet pulls the image. We will not address this
race condition at this time.
We will wait and see how much demand there is for closing this hole. If the community demands a solution,
we may suggest one of these:
1. Use a backend that refuses to accept images that are specified with tags, and require users to resolve to IDs
prior to creating a pod template.
- [kubectl could be modified to automate this process](https://github.com/kubernetes/kubernetes/issues/1697)
- a CI/CD system or templating system could be used that maps IDs to tags before Deployment modification/creation.
1. Audit logs from kubelets to see image IDs were actually run, to see if any unapproved images slipped through.
1. Monitor tag changes in image repository for suspicious activity, or restrict remapping of tags after initial application.
If none of these works well, we could do the following:
- Image Policy Admission Controller adds a new field to Pod, e.g. `pod.spec.container[i].imageID` (or an annotation),
and kubelet will enforce that both the imageID and image match the image pulled.
Since this adds complexity and interacts with imagePullPolicy, we avoid adding the above feature initially.
### Caching
There will be a cache of decisions in the admission controller.
TTL will be user-controllable, but default to 1 hour for allows and 30s for denies.
The low TTL for denies allows a user to correct a setting on the backend and see the fix
rapidly. It is assumed that denies are infrequent.
Caching permits an RC to scale up services even during short unavailability of the webhook backend.
The ImageReviewSpec is used as the key to the cache.
In the case of a cache miss and timeout talking to the backend, the default is to allow Pod creation.
Keeping services running is more important than a hypothetical threat from an un-verified image.
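A minimal sketch of such a decision cache with the asymmetric TTLs described above; the key format and structure are illustrative, not the actual implementation:

```go
package main

import (
	"fmt"
	"time"
)

type decision struct {
	allowed bool
	expires time.Time
}

// reviewCache keys decisions by a string form of the ImageReviewSpec and applies
// the TTLs discussed above: long for allows, short for denies so a corrected
// backend policy is picked up quickly.
type reviewCache struct {
	allowTTL, denyTTL time.Duration
	entries           map[string]decision
}

func newReviewCache() *reviewCache {
	return &reviewCache{allowTTL: time.Hour, denyTTL: 30 * time.Second, entries: map[string]decision{}}
}

func (c *reviewCache) put(key string, allowed bool) {
	ttl := c.allowTTL
	if !allowed {
		ttl = c.denyTTL
	}
	c.entries[key] = decision{allowed: allowed, expires: time.Now().Add(ttl)}
}

func (c *reviewCache) get(key string) (allowed, ok bool) {
	d, found := c.entries[key]
	if !found || time.Now().After(d.expires) {
		return false, false // miss: caller must ask the backend (or apply the default on timeout)
	}
	return d.allowed, true
}

func main() {
	c := newReviewCache()
	c.put("default/myrepo/myimage:v1", true)
	fmt.Println(c.get("default/myrepo/myimage:v1"))
}
```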
### Post-pod-creation audit
There are several cases where an image not currently allowed might still run. Users wanting a
complete audit solution are advised to also do after-the-fact auditing of what images
ran. This can catch:
- images allowed due to backend not reachable
- images that kept running after policy change (e.g. CVE discovered)
- images started via local files or http option of kubelet
- checking SHA of images allowed by a tag which was remapped
This proposal does not include post-pod-creation audit.
## Alternatives considered
### Admission Control on Controller Objects
We could have done admission control on Deployments, Jobs, ReplicationControllers, and anything else that creates a Pod, directly or indirectly.
This approach is good because it provides immediate feedback to the user that the image is not allowed. However, we do not expect disallowed images
to be used often. And controllers need to be able to surface problems creating pods for a variety of other reasons anyways.
Other good things about this alternative are:
- Fewer calls to Backend, once per controller rather than once per pod creation. Caching in backend should be able to help with this, though.
- End user that created the object is seen, rather than the user of the controller process. This can be fixed by implementing `Impersonate-User` for controllers.
Other problems are:
- Works only with "core" controllers. Need to update the admission controller if we add more "core" controllers. Won't work with "third party controllers", e.g. the way we run open-source distributed systems like hadoop, spark, zookeeper, etc. on kubernetes. Because those controllers don't have config that can be "admission controlled", or if they do, the schema is not known to the admission controller, which would have to "search" for pod templates in json. Yuck.
- How would it work if a user created a pod directly, which is allowed, and is the recommended way to run something at most once?
### Sending User to Backend
We could have sent the username of the pod creator to the backend. The username could be used to allow different users to run
different categories of images. This would require propagating the username from e.g. Deployment creation, through to
Pod creation via, e.g. the `Impersonate-User:` header. This feature is [not ready](https://github.com/kubernetes/kubernetes/issues/27152).
When it is, we will re-evaluate adding user as a field of `ImagePolicyRequest`.
### Enforcement at Docker level
Docker supports plugins which can check any container creation before it happens. For example the [twistlock/authz](https://github.com/twistlock/authz)
Docker plugin can audit the full request sent to the Docker daemon and approve or deny it. This could include checking if the image is allowed.
We reject this option because:
- it requires all nodes to be configured with how to reach the Backend, which complicates node setup.
- it may not work with other runtimes
- propagating error messages back to the user is more difficult
- it requires plumbing additional information about requests to nodes (if we later want to consider `User` in policy).
### Policy Stored in API
We decided to store policy about what SecurityContexts a pod can have in the API, via PodSecurityPolicy.
This is because Pods are a Kubernetes object, and the Policy is very closely tied to the definition of Pods,
and grows in step as the Pods API grows.
For Image policy, the connection is not as strong. To the Kubernetes API, an Image is just a string, and it
does not know any of the image metadata, which lives outside the API.
Image policy may depend on the Dockerfile, the source code, the source repo, the source review tools,
vulnerability databases, and so on. Kubernetes does not have these as built-in concepts or have plans to add
them anytime soon.
### Registry whitelist/blacklist
We considered a whitelist/blacklist of registries and/or repositories. Basically, a prefix match on image strings.
The problem of approving images would be then pushed to a problem of controlling who has access to push to a
trusted registry/repository. That approach is simple for kubernetes. Problems with it are:
- tricky to allow users to share a repository but have different image policies per user or per namespace.
- tricky to do things after image push, such as scan image for vulnerabilities (such as Docker Nautilus), and have those results considered by policy
- tricky to block "older" versions from running, whose interaction with current system may not be well understood.
- how to allow emergency override?
- hard to change policy decision over time.
We still want to use rkt trust, docker content trust, etc for any registries used. We just need additional
image policy checks beyond what trust can provide.
### Send every Request to a Generic Admission Control Backend
Instead of just sending a subset of PodSpec to an Image Provenance backed, we could have sent every object
that is created or updated (or deleted?) to one or more Generic Admission Control Backends.
This might be a good idea, but needs quite a bit more thought. Some questions with that approach are:
It will not be a generic webhook. A generic webhook would need a lot more discussion:
- a generic webhook needs to touch all objects, not just pods. So it won't have a fixed schema. How to express this in our IDL? Harder to write clients
that interpret unstructured data rather than a fixed schema. Harder to version, and to detect errors.
- a generic webhook client needs to ignore kinds it does not care about, or the apiserver needs to know which backends care about which kinds. How
to specify which backends see which requests? Sending all requests, including high-rate requests like events and pod-status updates, might be
too high a rate for some backends.
Additionally, just sending all the fields of just the Pod kind also has problems:
- it exposes our whole API to a webhook backend without giving us (the project) any chance to review or understand how it is being used.
- because we do not know which fields of an object are inspected by the backend, caching of decisions is not effective. Sending fewer fields allows caching.
- sending fewer fields makes it possible to rev the version of the webhook request slower than the version of our internal objects (e.g. pod v2 could still use imageReview v1.)
There are probably lots more reasons.

@ -1,75 +1 @@
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/initial-resources.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/initial-resources.md)
## Abstract
Initial Resources is a data-driven feature that, based on historical data, tries to estimate the resource usage of a container without Resources specified,
and sets them before the container is run. This document describes the design of the component.
## Motivation
Since we want to make Kubernetes as simple as possible for its users, we don't want to require setting [Resources](../design/resource-qos.md) for a container by its owner.
On the other hand, having Resources filled in is critical for scheduling decisions.
The current solution of setting Resources to a hardcoded value has obvious drawbacks.
We need to implement a component which will set initial Resources to a reasonable value.
## Design
InitialResources component will be implemented as an [admission plugin](../../plugin/pkg/admission/) and invoked right before
[LimitRanger](https://github.com/kubernetes/kubernetes/blob/7c9bbef96ed7f2a192a1318aa312919b861aee00/cluster/gce/config-default.sh#L91).
For every container without Resources specified it will try to predict the amount of resources that should be sufficient for it.
So that a pod without specified resources will be treated as
.
InitialResources will set only the [request](../design/resource-qos.md#requests-and-limits) field (independently for each resource type: cpu, memory) in the first version to avoid killing containers due to OOM (however the container still may be killed if it exceeds requested resources).
To make the component work with LimitRanger, the estimated value will be capped by the min and max possible values if defined.
This will prevent the situation where the pod is rejected due to a too low or too high estimate.
The container won't be marked as managed by this component in any way; however, an appropriate event will be exported.
The predicting algorithm should have very low latency so as not to significantly increase e2e pod startup latency
[#3954](https://github.com/kubernetes/kubernetes/pull/3954).
### Predicting algorithm details
In the first version estimation will be made based on historical data for the Docker image being run in the container (both the name and the tag matter).
CPU/memory usage of each container is exported periodically (by default with 1 minute resolution) to the backend (see more in [Monitoring pipeline](#monitoring-pipeline)).
InitialResources will set the Request for both cpu and memory to the 90th percentile of the first non-empty set of samples, chosen in the following order:
* 7 days same image:tag, assuming there is at least 60 samples (1 hour)
* 30 days same image:tag, assuming there is at least 60 samples (1 hour)
* 30 days same image, assuming there is at least 1 sample
If there is still no data, the default value will be set by LimitRanger. The same parameters will be configurable with appropriate flags.
#### Example
If we have at least 60 samples from image:tag over the past 7 days, we will use the 90th percentile of all of the samples of image:tag over the past 7 days.
Otherwise, if we have at least 60 samples from image:tag over the past 30 days, we will use the 90th percentile of all of the samples of image:tag over the past 30 days.
Otherwise, if we have at least 1 sample from image over the past 30 days, we will use the 90th percentile of all of the samples of image over the past 30 days.
Otherwise we will use the default value.
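A sketch of the fallback selection and percentile computation described above, with the monitoring-backend query stubbed out; the sample units, the nearest-rank percentile method, and the function names are assumptions for illustration:

```go
package main

import (
	"fmt"
	"sort"
)

// source describes one candidate sample set in the fallback order above.
type source struct {
	window     string // e.g. "7d" or "30d"
	exactTag   bool   // whether samples are restricted to the exact image:tag
	minSamples int
}

var fallbackOrder = []source{
	{"7d", true, 60},
	{"30d", true, 60},
	{"30d", false, 1},
}

// percentile90 returns the 90th percentile of the samples (nearest-rank method).
func percentile90(samples []float64) float64 {
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	idx := int(float64(len(sorted))*0.9+0.5) - 1
	if idx < 0 {
		idx = 0
	}
	if idx >= len(sorted) {
		idx = len(sorted) - 1
	}
	return sorted[idx]
}

// estimate picks the first source with enough samples and returns its 90th
// percentile; ok=false means fall back to the LimitRanger default.
// lookup is a stand-in for the monitoring backend query (InfluxDB or GCM).
func estimate(lookup func(source) []float64) (value float64, ok bool) {
	for _, s := range fallbackOrder {
		samples := lookup(s)
		if len(samples) >= s.minSamples {
			return percentile90(samples), true
		}
	}
	return 0, false
}

func main() {
	fake := func(s source) []float64 {
		if s.window == "30d" && !s.exactTag {
			return []float64{100, 120, 250} // e.g. cpu usage in millicores
		}
		return nil
	}
	fmt.Println(estimate(fake))
}
```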
### Monitoring pipeline
In the first version there will be 2 backend options for the predicting algorithm:
* [InfluxDB](../../docs/user-guide/monitoring.md#influxdb-and-grafana) - aggregation will be made in SQL query
* [GCM](../../docs/user-guide/monitoring.md#google-cloud-monitoring) - since GCM is not as powerful as InfluxDB some aggregation will be made on the client side
Both will be hidden under an abstraction layer, so it would be easy to add another option.
The code will be a part of the Initial Resources component so as not to block development; however, in the future it should be a part of Heapster.
## Next steps
The first version will be quite simple, so there are many possible improvements. Some of them seem to have high priority
and should be introduced shortly after the first version is done:
* observe OOMs and then react to them by increasing the estimate
* add the possibility to specify whether an estimate should be made, possibly as an ```InitialResourcesPolicy``` with options: *always*, *if-not-set*, *never*
* add other features to the model, like *namespace*
* remember predefined values for the most popular images like *mysql*, *nginx*, *redis*, etc.
* dry mode, which allows asking the system for a resource recommendation for a container without running it
* add the estimate as an annotation for those containers that already have resources set
* support for other data sources like [Hawkular](http://www.hawkular.org/)

View File

@ -1,159 +1 @@
# Job Controller This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/job.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/job.md)
## Abstract
A proposal for implementing a new controller - Job controller - which will be responsible
for managing pod(s) that require running once to completion even if the machine
the pod is running on fails, in contrast to what ReplicationController currently offers.
Several existing issues and PRs were already created regarding that particular subject:
* Job Controller [#1624](https://github.com/kubernetes/kubernetes/issues/1624)
* New Job resource [#7380](https://github.com/kubernetes/kubernetes/pull/7380)
## Use Cases
1. Be able to start one or several pods tracked as a single entity.
1. Be able to run batch-oriented workloads on Kubernetes.
1. Be able to get the job status.
1. Be able to specify the number of instances performing a job at any one time.
1. Be able to specify the number of successfully finished instances required to finish a job.
## Motivation
Jobs are needed for executing multi-pod computation to completion; a good example
here would be the ability to implement any type of batch-oriented task.
## Implementation
The Job controller is similar to the replication controller in that both manage pods.
This implies the Job controller will follow the same controller framework that replication
controllers have already defined. The biggest difference between a `Job` and a
`ReplicationController` object is the purpose; `ReplicationController`
ensures that a specified number of Pods are running at any one time, whereas
`Job` is responsible for running the desired number of Pods to completion of
a task. This difference will be represented by the `RestartPolicy`, which is
required to always take the value `RestartPolicyNever` or `RestartPolicyOnFailure`.
The new `Job` object will have the following content:
```go
// Job represents the configuration of a single job.
type Job struct {
TypeMeta
ObjectMeta
// Spec is a structure defining the expected behavior of a job.
Spec JobSpec
// Status is a structure describing current status of a job.
Status JobStatus
}
// JobList is a collection of jobs.
type JobList struct {
TypeMeta
ListMeta
Items []Job
}
```
The `JobSpec` structure is defined to contain all the information about how the actual job execution
will look.
```go
// JobSpec describes how the job execution will look like.
type JobSpec struct {
// Parallelism specifies the maximum desired number of pods the job should
// run at any given time. The actual number of pods running in steady state will
// be less than this number when ((.spec.completions - .status.successful) < .spec.parallelism),
// i.e. when the work left to do is less than max parallelism.
Parallelism *int
// Completions specifies the desired number of successfully finished pods the
// job should be run with. Defaults to 1.
Completions *int
// Selector is a label query over pods running a job.
Selector map[string]string
// Template is the object that describes the pod that will be created when
// executing a job.
Template *PodTemplateSpec
}
```
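For illustration only, here is how the two main knobs might be combined, using a stripped-down stand-in for the `JobSpec` type above (the real object also carries `TypeMeta`, `ObjectMeta`, and a pod template). With `Completions = 5` and `Parallelism = 2`, at most two pods run at once; once only one successful completion is still needed, only a single pod would be kept running.

```go
package main

import "fmt"

// Stripped-down stand-in for JobSpec above, just to show how the knobs relate.
type JobSpec struct {
	Parallelism *int
	Completions *int
	Selector    map[string]string
}

func newInt(v int) *int { return &v }

func main() {
	// A job that needs 5 successful completions, running at most 2 pods at once.
	spec := JobSpec{
		Parallelism: newInt(2),
		Completions: newInt(5),
		Selector:    map[string]string{"job": "image-resize"},
	}
	fmt.Printf("parallelism=%d completions=%d selector=%v\n",
		*spec.Parallelism, *spec.Completions, spec.Selector)
}
```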
The `JobStatus` structure is defined to contain information about the pods currently executing
the specified job.
```go
// JobStatus represents the current state of a Job.
type JobStatus struct {
Conditions []JobCondition
// CreationTime represents time when the job was created
CreationTime unversioned.Time
// StartTime represents time when the job was started
StartTime unversioned.Time
// CompletionTime represents time when the job was completed
CompletionTime unversioned.Time
// Active is the number of actively running pods.
Active int
// Successful is the number of pods that successfully completed their job.
Successful int
// Unsuccessful is the number of pod failures; this applies only to jobs
// created with RestartPolicyNever, otherwise this value will always be 0.
Unsuccessful int
}
type JobConditionType string
// These are valid conditions of a job.
const (
// JobComplete means the job has completed its execution.
JobComplete JobConditionType = "Complete"
)
// JobCondition describes current state of a job.
type JobCondition struct {
Type JobConditionType
Status ConditionStatus
LastHeartbeatTime unversioned.Time
LastTransitionTime unversioned.Time
Reason string
Message string
}
```
## Events
Job controller will be emitting the following events:
* JobStart
* JobFinish
## Future evolution
Below are the possible future extensions to the Job controller:
* Be able to limit the execution time for a job, similarly to ActiveDeadlineSeconds for Pods. *now implemented*
* Be able to create a chain of jobs dependent one on another. *will be implemented in a separate type called Workflow*
* Be able to specify the work each of the workers should execute (see type 1 from
[this comment](https://github.com/kubernetes/kubernetes/issues/1624#issuecomment-97622142))
* Be able to inspect Pods running a Job, especially after a Job has finished, e.g.
by providing pointers to Pods in the JobStatus ([see comment](https://github.com/kubernetes/kubernetes/pull/11746/files#r37142628)).
* help users avoid non-unique label selectors ([see this proposal](../../docs/design/selector-generation.md))

View File

@ -1,220 +1 @@
# Kubectl Login Subcommand This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubectl-login.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubectl-login.md)
**Authors**: Eric Chiang (@ericchiang)
## Goals
`kubectl login` is an entrypoint for any user attempting to connect to an
existing server. It should provide a more tailored experience than the existing
`kubectl config`, including config validation, auth challenges, and discovery.
Short term, the subcommand should recognize and attempt to help:
* New users with an empty configuration trying to connect to a server.
* Users with no credentials, by prompting for any required information.
* Fully configured users who want to validate credentials.
* Users trying to switch servers.
* Users trying to reauthenticate as the same user because credentials have expired.
* Users trying to authenticate as a different user against the same server.
Long term `kubectl login` should enable authentication strategies to be
discoverable from a master to avoid the end-user having to know how their
sysadmin configured the Kubernetes cluster.
## Design
The "login" subcommand helps users move towards a fully functional kubeconfig by
evaluating the current state of the kubeconfig and trying to prompt the user for
and validate the necessary information to login to the kubernetes cluster.
This is inspired by similar tools such as:
* [os login](https://docs.openshift.org/latest/cli_reference/get_started_cli.html#basic-setup-and-login)
* [gcloud auth login](https://cloud.google.com/sdk/gcloud/reference/auth/login)
* [aws configure](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html)
The steps taken are:
1. If no cluster configured, prompt user for cluster information.
2. If no user is configured, discover the authentication strategies supported by the API server.
3. Prompt the user for some information based on the authentication strategy they choose.
4. Attempt to login as a user, including authentication challenges such as OAuth2 flows, and display user info.
Importantly, each step is skipped if the existing configuration is validated or
can be supplied without user interaction (refreshing an OAuth token, redeeming
a Kerberos ticket, etc.). Users with fully configured kubeconfigs will only see
the user they're logged in as, useful for opaque credentials such as X509 certs
or bearer tokens.
The command differs from `kubectl config` by:
* Communicating with the API server to determine if the user is supplying valid credentials.
* Validating input and being opinionated about the input it asks for.
* Triggering authentication challenges, for example:
* Basic auth: Actually try to communicate with the API server.
* OpenID Connect: Create an OAuth2 redirect.
However `kubectl login` should still be seen as a supplement to, not a
replacement for, `kubectl config` by helping validate any kubeconfig generated
by the latter command.
## Credential validation
When clusters utilize authorization plugins, access decisions are based on the
correct configuration of an auth-N plugin, an auth-Z plugin, and client side
credentials. Being rejected then raises several questions. Is the user's
kubeconfig misconfigured? Is the authorization plugin setup wrong? Is the user
authenticating as a different user than the one they assume?
To help `kubectl login` diagnose misconfigured credentials, responses from the
API server to authenticated requests SHOULD include the `Authentication-Info`
header as defined in [RFC 7615](https://tools.ietf.org/html/rfc7615). The value
will hold name value pairs for `username` and `uid`. Since usernames and IDs
can be arbitrary strings, these values will be escaped using the `quoted-string`
format noted in the RFC.
```
HTTP/1.1 200 OK
Authentication-Info: username="janedoe@example.com", uid="123456"
```
If the user successfully authenticates this header will be set, regardless of
auth-Z decisions. For example a 401 Unauthorized (user didn't provide valid
credentials) would lack this header, while a 403 Forbidden response would
contain it.
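As a rough sketch (not actual kubectl code) of how a client could consume that header, the snippet below pulls out the `username` and `uid` pairs; a production parser would also unescape `\"` and `\\` inside the quoted-string values.

```go
package main

import (
	"fmt"
	"regexp"
)

// pair matches name="quoted value" fragments of an Authentication-Info header.
var pair = regexp.MustCompile(`(\w+)="((?:[^"\\]|\\.)*)"`)

func parseAuthInfo(header string) map[string]string {
	out := map[string]string{}
	for _, m := range pair.FindAllStringSubmatch(header, -1) {
		out[m[1]] = m[2]
	}
	return out
}

func main() {
	h := `username="janedoe@example.com", uid="123456"`
	info := parseAuthInfo(h)
	fmt.Printf("logged in as %q (uid %s)\n", info["username"], info["uid"])
}
```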
## Authentication discovery
A long term goal of `kubectl login` is to facilitate a customized experience
for clusters configured with different auth providers. This will require some
way for the API server to indicate to `kubectl` how a user is expected to
login.
Currently, this document doesn't propose a specific implementation for
discovery. While it'd be preferable to utilize an existing standard (such as the
`WWW-Authenticate` HTTP header), discovery may require a solution custom to the
API server, such as an additional discovery endpoint with a custom type.
## Use in non-interactive session
For the initial implementation, if `kubectl login` requires prompting and is
called from a non-interactive session (determined by whether the session is using a
TTY), it errors out, recommending `kubectl config` instead. In future
updates `kubectl login` may include options for non-interactive sessions so
auth strategies which require custom behavior not built into `kubectl config`,
such as the exchanges in Kerberos or OpenID Connect, can be triggered from
scripts.
## Examples
If kubeconfig isn't configured, `kubectl login` will attempt to fully configure
and validate the client's credentials.
```
$ kubectl login
Cluster URL []: https://172.17.4.99:443
Cluster CA [(defaults to host certs)]: ${PWD}/ssl/ca.pem
Cluster Name ["cluster-1"]:
The kubernetes server supports the following methods:
1. Bearer token
2. Username and password
3. Keystone
4. OpenID Connect
5. TLS client certificate
Enter login method [1]: 4
Logging in using OpenID Connect.
Issuer ["valuefromdiscovery"]: https://accounts.google.com
Issuer CA [(defaults to host certs)]:
Scopes ["profile email"]:
Client ID []: client@localhost:foobar
Client Secret []: *****
Open the following address in a browser.
https://accounts.google.com/o/oauth2/v2/auth?redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scopes=openid%20email&access_type=offline&...
Enter security code: ****
Logged in as "janedoe@gmail.com"
```
Human readable names are provided by a combination of the auth providers
understood by `kubectl login` and the authenticator discovery. For instance,
Keystone uses basic auth credentials in the same way as a static user file, but
if the discovery indicates that the Keystone plugin is being used it should be
presented to the user differently.
Users with configured credentials will simply authenticate against the API server and see
who they are. Running this command again simply validates the user's credentials.
```
$ kubectl login
Logged in as "janedoe@gmail.com"
```
Users who are halfway through the flow will start where they left off. For
instance, if a user has configured the cluster field but not the user field, they will
be prompted for credentials.
```
$ kubectl login
No auth type configured. The kubernetes server supports the following methods:
1. Bearer token
2. Username and password
3. Keystone
4. OpenID Connect
5. TLS client certificate
Enter login method [1]: 2
Logging in with basic auth. Enter the following fields.
Username: janedoe
Password: ****
Logged in as "janedoe@gmail.com"
```
Users who wish to switch servers can provide the `--switch-cluster` flag which
will prompt the user for new cluster details and switch the current context. It
behaves identically to `kubectl login` when a cluster is not set.
```
$ kubectl login --switch-cluster
# ...
```
Switching users goes through a similar flow attempting to prompt the user for
new credentials to the same server.
```
$ kubectl login --switch-user
# ...
```
## Work to do
Phase 1:
* Provide a simple dialog for configuring authentication.
* Kubectl can trigger authentication actions such as triggering OAuth2 redirects.
* Validation of user credentials through the `Authentication-Info` header.
Phase 2:
* Update proposal with auth provider discovery mechanism.
* Customize dialog using discovery data.
Further improvements will require adding more authentication providers and
adapting existing plugins to take advantage of challenge-based authentication.

View File

@ -1,106 +1 @@
# Kubelet Authentication / Authorization This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-auth.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-auth.md)
Author: Jordan Liggitt (jliggitt@redhat.com)
## Overview
The kubelet exposes endpoints which give access to data of varying sensitivity,
and allow performing operations of varying power on the node and within containers.
There is no built-in way to limit or subdivide access to those endpoints,
so deployers must secure the kubelet API using external, ad-hoc methods.
This document proposes a method for authenticating and authorizing access
to the kubelet API, using interfaces and methods that complement the existing
authentication and authorization used by the API server.
## Preliminaries
This proposal assumes the existence of:
* a functioning API server
* the SubjectAccessReview and TokenReview APIs
It also assumes each node is additionally provisioned with the following information:
1. Location of the API server
2. Any CA certificates necessary to trust the API server's TLS certificate
3. Client credentials authorized to make SubjectAccessReview and TokenReview API calls
## API Changes
None
## Kubelet Authentication
Enable starting the kubelet with one or more of the following authentication methods:
* x509 client certificate
* bearer token
* anonymous (current default)
For backwards compatibility, the default is to enable anonymous authentication.
### x509 client certificate
Add a new `--client-ca-file=[file]` option to the kubelet.
When started with this option, the kubelet authenticates incoming requests using x509
client certificates, validated against the root certificates in the provided bundle.
The kubelet will reuse the x509 authenticator already used by the API server.
The master API server can already be started with `--kubelet-client-certificate` and
`--kubelet-client-key` options in order to make authenticated requests to the kubelet.
### Bearer token
Add a new `--authentication-token-webhook=[true|false]` option to the kubelet.
When true, the kubelet authenticates incoming requests with bearer tokens by making
`TokenReview` API calls to the API server.
The kubelet will reuse the webhook authenticator already used by the API server, configured
to call the API server using the connection information already provided to the kubelet.
To improve performance of repeated requests with the same bearer token, the
`--authentication-token-webhook-cache-ttl` option already supported by the API server
would also be supported by the kubelet.
### Anonymous
Add a new `--anonymous-auth=[true|false]` option to the kubelet.
When true, requests to the secure port that are not rejected by other configured
authentication methods are treated as anonymous requests, and given a username
of `system:anonymous` and a group of `system:unauthenticated`.
## Kubelet Authorization
Add a new `--authorization-mode` option to the kubelet, specifying one of the following modes:
* `Webhook`
* `AlwaysAllow` (current default)
For backwards compatibility, the authorization mode defaults to `AlwaysAllow`.
### Webhook
Webhook mode converts the request to authorization attributes, and makes a `SubjectAccessReview`
API call to check if the authenticated subject is allowed to make a request with those attributes.
This enables authorization policy to be centrally managed by the authorizer configured for the API server.
The kubelet will reuse the webhook authorizer already used by the API server, configured
to call the API server using the connection information already provided to the kubelet.
To improve performance of repeated requests with the same authenticated subject and request attributes,
the same webhook authorizer caching options supported by the API server would be supported:
* `--authorization-webhook-cache-authorized-ttl`
* `--authorization-webhook-cache-unauthorized-ttl`
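For illustration, the sketch below shows the general shape of a `SubjectAccessReview` body such a webhook authorizer could send. The hand-rolled struct definitions, the mapping of a hypothetical kubelet request to `verb`/`resource`/`subresource`, and the chosen API version are assumptions for this example rather than an authoritative schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hand-rolled stand-ins for the review types; field names mimic the
// authorization API but should not be taken as the exact schema.
type resourceAttributes struct {
	Verb        string `json:"verb"`
	Resource    string `json:"resource"`
	Subresource string `json:"subresource,omitempty"`
}

type subjectAccessReviewSpec struct {
	User               string              `json:"user"`
	ResourceAttributes *resourceAttributes `json:"resourceAttributes,omitempty"`
}

type subjectAccessReview struct {
	APIVersion string                  `json:"apiVersion"`
	Kind       string                  `json:"kind"`
	Spec       subjectAccessReviewSpec `json:"spec"`
}

func main() {
	// Hypothetical mapping of an incoming kubelet request to attributes.
	review := subjectAccessReview{
		APIVersion: "authorization.k8s.io/v1beta1",
		Kind:       "SubjectAccessReview",
		Spec: subjectAccessReviewSpec{
			User: "system:anonymous",
			ResourceAttributes: &resourceAttributes{
				Verb:        "get",
				Resource:    "nodes",
				Subresource: "stats",
			},
		},
	}
	b, _ := json.MarshalIndent(review, "", "  ")
	fmt.Println(string(b))
}
```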
### AlwaysAllow
This mode allows any authenticated request.
## Future Work
* Add support for CRL revocation for x509 client certificate authentication (http://issue.k8s.io/18982)

View File

@ -1,269 +1 @@
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-cri-logging.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-cri-logging.md)
# CRI: Log management for container stdout/stderr streams
## Goals and non-goals
Container Runtime Interface (CRI) is an ongoing project to allow container
runtimes to integrate with kubernetes via a newly-defined API. The goal of this
proposal is to define how a container's *stdout/stderr* log streams should be
handled in CRI.
The explicit non-goal is to define how (non-stdout/stderr) application logs
should be handled. Collecting and managing arbitrary application logs is a
long-standing issue [1] in kubernetes and is worth a proposal of its own. Even
though this proposal does not touch upon these logs, the direction of
this proposal is aligned with one of the most-discussed solutions, logging
volumes [1], for general logging management.
*In this proposal, “logs” refer to the stdout/stderr streams of the
containers, unless specified otherwise.*
Previous CRI logging issues:
- Tracking issue: https://github.com/kubernetes/kubernetes/issues/30709
- Proposal (by @tmrtfs): https://github.com/kubernetes/kubernetes/pull/33111
The scope of this proposal is narrower than the #33111 proposal, and hopefully
this will encourage a more focused discussion.
## Background
Below is a brief overview of logging in kubernetes with docker, which is the
only container runtime with fully functional integration today.
**Log lifecycle and management**
Docker supports various logging drivers (e.g., syslog, journal, and json-file),
and allows users to configure the driver by passing flags to the docker daemon
at startup. Kubernetes defaults to the "json-file" logging driver, in which
docker writes the stdout/stderr streams to a file in the json format as shown
below.
```
{“log”: “The actual log line”, “stream”: “stderr”, “time”: “2016-10-05T00:00:30.082640485Z”}
```
Docker deletes the log files when the container is removed, and a cron-job (or
systemd timer-based job) on the node is responsible for rotating the logs (using
`logrotate`). To preserve the logs for introspection and debuggability, kubelet
keeps the terminated container until the pod object has been deleted from the
apiserver.
**Container log retrieval**
The kubernetes CLI tool, kubectl, allows users to access the container logs
using the [`kubectl logs`](http://kubernetes.io/docs/user-guide/kubectl/kubectl_logs/) command.
`kubectl logs` supports flags such as `--since` that require understanding of
the format and the metadata (i.e., timestamps) of the logs. In the current
implementation, kubelet calls `docker logs` with parameters to return the log
content. As of now, docker only supports `log` operations for the “journal” and
“json-file” drivers [2]. In other words, *the support of `kubectl logs` is not
universal in all kubernetes deployments*.
**Cluster logging support**
In a production cluster, logs are usually collected, aggregated, and shipped to
a remote store where advanced analysis/search/archiving functions are
supported. In kubernetes, the default cluster-addons includes a per-node log
collection daemon, `fluentd`. To facilitate the log collection, kubelet creates
symbolic links to all the docker container logs under `/var/log/containers`
with pod and container metadata embedded in the filename.
```
/var/log/containers/<pod_name>_<pod_namespace>_<container_name>-<container_id>.log
```
The fluentd daemon watches the `/var/log/containers/` directory and extracts the
metadata associated with the log from the path. Note that this integration
requires kubelet to know where the container runtime stores the logs, and will
not be directly applicable to CRI.
## Requirements
1. **Provide ways for CRI-compliant runtimes to support all existing logging
features, i.e., `kubectl logs`.**
2. **Allow kubelet to manage the lifecycle of the logs to pave the way for
better disk management in the future.** This implies that the lifecycle
of containers and their logs need to be decoupled.
3. **Allow log collectors to easily integrate with Kubernetes across
different container runtimes while preserving efficient storage and
retrieval.**
Requirement (1) provides opportunities for runtimes to continue supporting
`kubectl logs --since` and related features. Note that even though such
features are only supported today for a limited set of log drivers, this is an
important usability tool for a fresh, basic kubernetes cluster, and should not
be overlooked. Requirement (2) stems from the fact that disk is managed by
kubelet as a node-level resource (not per-pod) today, hence it is difficult to
delegate to the runtime by enforcing per-pod disk quota policy. In addition,
container disk quota is not well supported yet, and such limitation may not
even be well-perceived by users. Requirement (3) is crucial to the kubernetes'
extensibility and usability across all deployments.
## Proposed solution
This proposal intends to satisfy the requirements by
1. Enforce where the container logs should be stored on the host
filesystem. Both kubelet and the log collector can interact with
the log files directly.
2. Ask the runtime to decorate the logs in a format that kubelet understands.
**Log directories and structures**
Kubelet will be configured with a root directory (e.g., `/var/log/pods` or
`/var/lib/kubelet/logs/`) to store all container logs. Below is an example of a
path to the log of a container in a pod.
```
/var/log/pods/<podUID>/<containerName>_<instance#>.log
```
In CRI, this is implemented by setting the pod-level log directory when
creating the pod sandbox, and passing the relative container log path
when creating a container.
```
PodSandboxConfig.LogDirectory: /var/log/pods/<podUID>/
ContainerConfig.LogPath: <containerName>_<instance#>.log
```
Because kubelet determines where the logs are stored and can access them
directly, this meets requirement (2). As for requirement (3), the log collector
can easily extract basic pod metadata (e.g., pod UID, container name) from
the paths, and watch the directory for any changes. In the future, we can
extend this by maintaining a metadata file in the pod directory.
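As a sketch of how little runtime-specific knowledge a collector would need, the snippet below recovers the pod UID, container name, and instance number purely from the proposed path layout; the helper and its error handling are illustrative, not part of any real collector.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// parseLogPath recovers metadata from a path shaped like
// /var/log/pods/<podUID>/<containerName>_<instance#>.log.
func parseLogPath(p string) (podUID, container, instance string, err error) {
	dir, file := filepath.Split(p)
	podUID = filepath.Base(filepath.Clean(dir))
	name := strings.TrimSuffix(file, ".log")
	i := strings.LastIndex(name, "_")
	if i < 0 {
		return "", "", "", fmt.Errorf("unexpected log file name: %q", file)
	}
	return podUID, name[:i], name[i+1:], nil
}

func main() {
	uid, container, instance, err := parseLogPath("/var/log/pods/7d8f3b1a/nginx_0.log")
	if err != nil {
		panic(err)
	}
	fmt.Println(uid, container, instance) // 7d8f3b1a nginx 0
}
```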
**Log format**
The runtime should decorate each log entry with an RFC 3339Nano timestamp
prefix and the stream type (i.e., "stdout" or "stderr"), and end each entry with a newline.
```
2016-10-06T00:17:09.669794202Z stdout The content of the log entry 1
2016-10-06T00:17:10.113242941Z stderr The content of the log entry 2
```
With this knowledge, kubelet can parse the logs and serve them for `kubectl
logs` requests. This meets requirement (1). Note that the format is deliberately kept
simple to provide only the information necessary to serve the requests.
We do not intend for kubelet to host various logging plugins. It is also worth
mentioning again that the scope of this proposal is restricted to stdout/stderr
streams of the container, and we impose no restriction on the logging format of
arbitrary container logs.
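A small sketch of parsing one such log line into its timestamp, stream, and content; this is illustrative only and not kubelet code.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// logEntry holds one decoded line of the form
// "<RFC3339Nano timestamp> <stream> <content>".
type logEntry struct {
	Timestamp time.Time
	Stream    string // "stdout" or "stderr"
	Content   string
}

func parseLine(line string) (logEntry, error) {
	parts := strings.SplitN(line, " ", 3)
	if len(parts) != 3 {
		return logEntry{}, fmt.Errorf("malformed log line: %q", line)
	}
	ts, err := time.Parse(time.RFC3339Nano, parts[0])
	if err != nil {
		return logEntry{}, err
	}
	return logEntry{Timestamp: ts, Stream: parts[1], Content: parts[2]}, nil
}

func main() {
	e, err := parseLine("2016-10-06T00:17:09.669794202Z stdout The content of the log entry 1")
	if err != nil {
		panic(err)
	}
	fmt.Println(e.Stream, e.Timestamp.Format(time.RFC3339Nano), e.Content)
}
```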
**Who should rotate the logs?**
We assume that a separate task (e.g., a cron job) will be configured on the node
to rotate the logs periodically, similar to today's implementation.
We do not rule out the possibility of letting kubelet or a per-node daemon
(`DaemonSet`) take up the responsibility, or even declaring the rotation policy
in the kubernetes API as part of the `PodSpec`, but it is beyond the scope of
this proposal.
**What about non-supported log formats?**
If a runtime chooses to store logs in non-supported formats, it essentially
opts out of `kubectl logs` features, which is backed by kubelet today. It is
assumed that the user can rely on the advanced, cluster logging infrastructure
to examine the logs.
It is also possible that in the future, `kubectl logs` can contact the cluster
logging infrastructure directly to serve logs [1a]. Note that this does not
eliminate the need to store the logs on the node locally for reliability.
**How can existing runtimes (docker/rkt) comply with the logging requirements?**
In the short term, the ongoing docker-CRI integration [3] will support the
proposed solution only partially by (1) creating symbolic links for kubelet
to access, but not manage, the logs, and (2) adding support for the json format in
kubelet. A more sophisticated solution that either involves using a custom
plugin or launching a separate process to copy and decorate the log will be
considered as a mid-term solution.
For rkt, implementation will rely on providing external file-descriptors for
stdout/stderr to applications via systemd [4]. Those streams are currently
managed by a journald sidecar, which collects stream outputs and stores them
in the journal file of the pod. This will be replaced by a custom sidecar which
can produce logs in the format expected by this specification and can handle
clients attaching as well.
## Alternatives
There are ad-hoc solutions/discussions that address one or two of the
requirements, but no comprehensive solution for CRI specifically has been
proposed so far (with the exception of @tmrtfs's proposal
[#33111](https://github.com/kubernetes/kubernetes/pull/33111), which has a much
wider scope). It has come up in discussions that kubelet could delegate all the
logging management to the runtime to allow maximum flexibility. However, it is
difficult for this approach to meet either requirement (1) or (2) without
defining a complex logging API.
There are also possibilities to implement the current proposal by imposing the
log file paths, while leveraging the runtime to access and/or manage logs. This
requires the runtime to expose knobs in CRI to retrieve, remove, and examine
the disk usage of logs. The upside of this approach is that kubelet need not
mandate the logging format, assuming runtime already includes plugins for
various logging formats. Unfortunately, this is not true for existing runtimes
such as docker, which supports log retrieval only for a very limited number of
log drivers [2]. On the other hand, the downside is that we would be enforcing
more requirements on the runtime through log storage location on the host, and
a potentially premature logging API that may change as the disk management
evolves.
## References
[1] Log management issues:
- a. https://github.com/kubernetes/kubernetes/issues/17183
- b. https://github.com/kubernetes/kubernetes/issues/24677
- c. https://github.com/kubernetes/kubernetes/pull/13010
[2] Docker logging drivers:
- https://docs.docker.com/engine/admin/logging/overview/
[3] Docker CRI integration:
- https://github.com/kubernetes/kubernetes/issues/31459
[4] rkt support: https://github.com/systemd/systemd/pull/4179

View File

@ -1,462 +1 @@
# Kubelet - Eviction Policy This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-eviction.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-eviction.md)
**Authors**: Derek Carr (@derekwaynecarr), Vishnu Kannan (@vishh)
**Status**: Proposed (memory evictions WIP)
This document presents a specification for how the `kubelet` evicts pods when compute resources are too low.
## Goals
The node needs a mechanism to preserve stability when available compute resources are low.
This is especially important when dealing with incompressible compute resources such
as memory or disk. If either resource is exhausted, the node would become unstable.
The `kubelet` has some support for influencing system behavior in response to a system OOM by
having the system OOM killer see higher OOM score adjust scores for containers that have consumed
the largest amount of memory relative to their request. System OOM events are very compute
intensive, and can stall the node until the OOM killing process has completed. In addition,
the system is prone to return to an unstable state since the containers that are killed due to OOM
are either restarted or a new pod is scheduled on to the node.
Instead, we would prefer a system where the `kubelet` can pro-actively monitor for
and prevent total starvation of a compute resource, and in cases where it
appears imminent, pro-actively fail one or more pods, so the workload can get
moved and scheduled elsewhere when/if its backing controller creates a new pod.
## Scope of proposal
This proposal defines a pod eviction policy for reclaiming compute resources.
As of now, memory and disk based evictions are supported.
The proposal focuses on a simple default eviction strategy
intended to cover the broadest class of user workloads.
## Eviction Signals
The `kubelet` will support the ability to trigger eviction decisions on the following signals.
| Eviction Signal | Description |
|------------------|---------------------------------------------------------------------------------|
| memory.available | memory.available := node.status.capacity[memory] - node.stats.memory.workingSet |
| nodefs.available | nodefs.available := node.stats.fs.available |
| nodefs.inodesFree | nodefs.inodesFree := node.stats.fs.inodesFree |
| imagefs.available | imagefs.available := node.stats.runtime.imagefs.available |
| imagefs.inodesFree | imagefs.inodesFree := node.stats.runtime.imagefs.inodesFree |
Each of the above signals supports either a literal or percentage-based value. The percentage-based value
is calculated relative to the total capacity associated with each signal.
`kubelet` supports only two filesystem partitions.
1. The `nodefs` filesystem that kubelet uses for volumes, daemon logs, etc.
1. The `imagefs` filesystem that container runtimes use for storing images and container writable layers.
`imagefs` is optional. `kubelet` auto-discovers these filesystems using cAdvisor.
`kubelet` does not care about any other filesystems. Any other types of configurations are not currently supported by the kubelet. For example, it is *not OK* to store volumes and logs in a dedicated `imagefs`.
## Eviction Thresholds
The `kubelet` will support the ability to specify eviction thresholds.
An eviction threshold is of the following form:
`<eviction-signal><operator><quantity | int%>`
* valid `eviction-signal` tokens are as defined above.
* valid `operator` tokens are `<`
* valid `quantity` tokens must match the quantity representation used by Kubernetes
* an eviction threshold can be expressed as a percentage if it ends with the `%` token.
If threshold criteria are met, the `kubelet` will take pro-active action to attempt
to reclaim the starved compute resource associated with the eviction signal.
The `kubelet` will support soft and hard eviction thresholds.
For example, if a node has `10Gi` of memory, and the desire is to induce eviction
if available memory falls below `1Gi`, an eviction signal can be specified as either
of the following (but not both).
* `memory.available<10%`
* `memory.available<1Gi`
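A minimal sketch of parsing a threshold expression of this form; quantities are kept as plain strings here, whereas the kubelet would use its resource quantity machinery, so treat the helper as illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// threshold captures the grammar <eviction-signal><operator><quantity|int%>,
// where the only valid operator is "<".
type threshold struct {
	Signal     string
	Value      string
	Percentage bool
}

func parseThreshold(expr string) (threshold, error) {
	parts := strings.SplitN(expr, "<", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return threshold{}, fmt.Errorf("invalid eviction threshold %q", expr)
	}
	return threshold{
		Signal:     parts[0],
		Value:      strings.TrimSuffix(parts[1], "%"),
		Percentage: strings.HasSuffix(parts[1], "%"),
	}, nil
}

func main() {
	for _, e := range []string{"memory.available<1Gi", "memory.available<10%"} {
		t, err := parseThreshold(e)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%+v\n", t)
	}
}
```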
### Soft Eviction Thresholds
A soft eviction threshold pairs an eviction threshold with a required
administrator specified grace period. No action is taken by the `kubelet`
to reclaim resources associated with the eviction signal until that grace
period has been exceeded. If no grace period is provided, the `kubelet` will
error on startup.
In addition, if a soft eviction threshold has been met, an operator can
specify a maximum allowed pod termination grace period to use when evicting
pods from the node. If specified, the `kubelet` will use the lesser value among
the `pod.Spec.TerminationGracePeriodSeconds` and the max allowed grace period.
If not specified, the `kubelet` will kill pods immediately with no graceful
termination.
To configure soft eviction thresholds, the following flags will be supported:
```
--eviction-soft="": A set of eviction thresholds (e.g. memory.available<1.5Gi) that if met over a corresponding grace period would trigger a pod eviction.
--eviction-soft-grace-period="": A set of eviction grace periods (e.g. memory.available=1m30s) that correspond to how long a soft eviction threshold must hold before triggering a pod eviction.
--eviction-max-pod-grace-period="0": Maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
```
### Hard Eviction Thresholds
A hard eviction threshold has no grace period, and if observed, the `kubelet`
will take immediate action to reclaim the associated starved resource. If a
hard eviction threshold is met, the `kubelet` will kill the pod immediately
with no graceful termination.
To configure hard eviction thresholds, the following flag will be supported:
```
--eviction-hard="": A set of eviction thresholds (e.g. memory.available<1Gi) that if met would trigger a pod eviction.
```
## Eviction Monitoring Interval
The `kubelet` will initially evaluate eviction thresholds at the same
housekeeping interval as `cAdvisor` housekeeping.
In Kubernetes 1.2, this was defaulted to `10s`.
It is a goal to shrink the monitoring interval to a much shorter window.
This may require changes to `cAdvisor` to let alternate housekeeping intervals
be specified for selected data (https://github.com/google/cadvisor/issues/1247)
For the purposes of this proposal, we expect the monitoring interval to be no
more than `10s` to know when a threshold has been triggered, but we will strive
to reduce that latency time permitting.
## Node Conditions
The `kubelet` will support a node condition that corresponds to each eviction signal.
If a hard eviction threshold has been met, or a soft eviction threshold has been met
independent of its associated grace period, the `kubelet` will report a condition that
reflects the node is under pressure.
The following node conditions are defined that correspond to the specified eviction signal.
| Node Condition | Eviction Signal | Description |
|----------------|------------------|------------------------------------------------------------------|
| MemoryPressure | memory.available | Available memory on the node has satisfied an eviction threshold |
| DiskPressure | nodefs.available, nodefs.inodesFree, imagefs.available, or imagefs.inodesFree | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold |
The `kubelet` will continue to report node status updates at the frequency specified by
`--node-status-update-frequency` which defaults to `10s`.
### Oscillation of node conditions
If a node is oscillating above and below a soft eviction threshold, but not exceeding
its associated grace period, it would cause the corresponding node condition to
constantly oscillate between true and false, and could cause poor scheduling decisions
as a consequence.
To protect against this oscillation, the following flag is defined to control how
long the `kubelet` must wait before transitioning out of a pressure condition.
```
--eviction-pressure-transition-period=5m0s: Duration for which the kubelet has to wait
before transitioning out of an eviction pressure condition.
```
The `kubelet` would ensure that it has not observed an eviction threshold being met
for the specified pressure condition for the period specified before toggling the
condition back to `false`.
## Eviction scenarios
### Memory
Let's assume the operator started the `kubelet` with the following:
```
--eviction-hard="memory.available<100Mi"
--eviction-soft="memory.available<300Mi"
--eviction-soft-grace-period="memory.available=30s"
```
The `kubelet` will run a sync loop that looks at the available memory
on the node as reported from `cAdvisor` by calculating (capacity - workingSet).
If available memory is observed to drop below 100Mi, the `kubelet` will immediately
initiate eviction. If available memory is observed as falling below `300Mi`,
it will record when that signal was observed internally in a cache. If at the next
sync, that criteria was no longer satisfied, the cache is cleared for that
signal. If that signal is observed as being satisfied for longer than the
specified period, the `kubelet` will initiate eviction to attempt to
reclaim the resource that has met its eviction threshold.
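The per-sync decision for this scenario can be sketched as follows, with the thresholds and grace period hard-coded to the example flag values above; the `observedAt` argument plays the role of the kubelet's internal cache of when the soft signal was first seen. This is a simplified illustration, not the actual eviction manager code.

```go
package main

import (
	"fmt"
	"time"
)

const (
	hardMiB         = 100              // --eviction-hard="memory.available<100Mi"
	softMiB         = 300              // --eviction-soft="memory.available<300Mi"
	softGracePeriod = 30 * time.Second // --eviction-soft-grace-period
)

// shouldEvict is called once per sync with the observed available memory.
func shouldEvict(availableMiB int, observedAt *time.Time, now time.Time) bool {
	if availableMiB < hardMiB {
		return true // hard threshold: act immediately
	}
	if availableMiB >= softMiB {
		*observedAt = time.Time{} // soft signal cleared: reset the cache
		return false
	}
	if observedAt.IsZero() {
		*observedAt = now // first observation of the soft signal
		return false
	}
	return now.Sub(*observedAt) > softGracePeriod
}

func main() {
	var seen time.Time
	start := time.Now()
	fmt.Println(shouldEvict(250, &seen, start))                     // false: grace period starts
	fmt.Println(shouldEvict(250, &seen, start.Add(45*time.Second))) // true: signal held for >30s
}
```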
### Disk
Let's assume the operator started the `kubelet` with the following:
```
--eviction-hard="nodefs.available<1Gi,nodefs.inodesFree<1,imagefs.available<10Gi,imagefs.inodesFree<10"
--eviction-soft="nodefs.available<1.5Gi,nodefs.inodesFree<10,imagefs.available<20Gi,imagefs.inodesFree<100"
--eviction-soft-grace-period="nodefs.available=1m,imagefs.available=2m"
```
The `kubelet` will run a sync loop that looks at the available disk
on the node's supported partitions as reported from `cAdvisor`.
If available disk space on the node's primary filesystem is observed to drop below 1Gi
or the free inodes on the node's primary filesystem is less than 1,
the `kubelet` will immediately initiate eviction.
If available disk space on the node's image filesystem is observed to drop below 10Gi
or the free inodes on the node's primary image filesystem is less than 10,
the `kubelet` will immediately initiate eviction.
If available disk space on the node's primary filesystem is observed as falling below `1.5Gi`,
or if the free inodes on the node's primary filesystem is less than 10,
or if available disk space on the node's image filesystem is observed as falling below `20Gi`,
or if the free inodes on the node's image filesystem is less than 100,
it will record when that signal was observed internally in a cache. If at the next
sync, that criterion was no longer satisfied, the cache is cleared for that
signal. If that signal is observed as being satisfied for longer than the
specified period, the `kubelet` will initiate eviction to attempt to
reclaim the resource that has met its eviction threshold.
## Eviction of Pods
If an eviction threshold has been met, the `kubelet` will initiate the
process of evicting pods until it has observed the signal has gone below
its defined threshold.
The eviction sequence works as follows:
* for each monitoring interval, if eviction thresholds have been met
* find candidate pod
* fail the pod
* block until pod is terminated on node
If a pod is not terminated because a container does not happen to die
(e.g., processes stuck in disk IO), the `kubelet` may select
an additional pod to fail instead. The `kubelet` will invoke the `KillPod`
operation exposed on the runtime interface. If an error is returned,
the `kubelet` will select a subsequent pod.
## Eviction Strategy
The `kubelet` will implement a default eviction strategy oriented around
the pod quality of service class.
It will target pods that are the largest consumers of the starved compute
resource relative to their scheduling request. It ranks pods within a
quality of service tier in the following order.
* `BestEffort` pods that consume the most of the starved resource are failed
first.
* `Burstable` pods that consume the greatest amount of the starved resource
relative to their request for that resource are killed first. If no pod
has exceeded its request, the strategy targets the largest consumer of the
starved resource.
* `Guaranteed` pods that consume the greatest amount of the starved resource
relative to their request are killed first. If no pod has exceeded its request,
the strategy targets the largest consumer of the starved resource.
A guaranteed pod is guaranteed to never be evicted because of another pod's
resource consumption. That said, guarantees are only as good as the underlying
foundation they are built upon. If a system daemon
(i.e. `kubelet`, `docker`, `journald`, etc.) is consuming more resources than
were reserved via `system-reserved` or `kube-reserved` allocations, and the node
only has guaranteed pod(s) remaining, then the node must choose to evict a
guaranteed pod in order to preserve node stability, and to limit the impact
of the unexpected consumption to other guaranteed pod(s).
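As an illustration of the ranking within a single QoS tier, the sketch below orders `Burstable` pods by how far their memory usage exceeds their request, which is one plausible reading of "relative to their request"; `BestEffort` pods would simply be ranked by absolute usage. The types and numbers are made up for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// podUsage is a made-up view of one pod's memory request and current usage.
type podUsage struct {
	Name    string
	Request int64 // bytes requested
	Usage   int64 // bytes in use
}

// rank orders pods so that the one exceeding its request by the largest
// amount comes first and is therefore the first eviction candidate.
func rank(pods []podUsage) {
	sort.Slice(pods, func(i, j int) bool {
		return pods[i].Usage-pods[i].Request > pods[j].Usage-pods[j].Request
	})
}

func main() {
	pods := []podUsage{
		{"web", 256 << 20, 300 << 20},   // 44Mi over its request
		{"batch", 128 << 20, 400 << 20}, // 272Mi over its request
	}
	rank(pods)
	fmt.Println("evict first:", pods[0].Name) // batch
}
```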
## Disk based evictions
### With Imagefs
If `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:
1. Delete logs
1. Evict Pods if required.
If `imagefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:
1. Delete unused images
1. Evict Pods if required.
### Without Imagefs
If `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:
1. Delete logs
1. Delete unused images
1. Evict Pods if required.
Let's explore the different options for freeing up disk space.
### Delete logs of dead pods/containers
As of today, logs are tied to a container's lifetime. `kubelet` keeps dead containers around,
to provide access to logs.
In the future, if we store logs of dead containers outside of the container itself, then
`kubelet` can delete these logs to free up disk space.
Once the lifetimes of containers and logs are split, kubelet can support more user-friendly policies
around log evictions. `kubelet` can delete logs of the oldest containers first.
Since logs from the first and the most recent incarnations of a container are the most important for most applications,
kubelet can try to preserve these logs and aggressively delete logs from other container incarnations.
Until logs are split from the container's lifetime, `kubelet` can delete dead containers to free up disk space.
### Delete unused images
`kubelet` performs image garbage collection based on thresholds today. It uses a high and a low watermark.
Whenever disk usage exceeds the high watermark, it removes images until the low watermark is reached.
`kubelet` employs an LRU policy when it comes to deleting images.
The existing policy will be replaced with a much simpler policy.
Images will be deleted based on eviction thresholds. If kubelet can delete logs and keep disk space availability
above eviction thresholds, then kubelet will not delete any images.
If `kubelet` decides to delete unused images, it will delete *all* unused images.
### Evict pods
There is no ability to specify disk limits for pods/containers today.
Disk is a best effort resource. When necessary, `kubelet` can evict pods one at a time.
`kubelet` will follow the [Eviction Strategy](#eviction-strategy) mentioned above for making eviction decisions.
`kubelet` will evict the pod that will free up the maximum amount of disk space on the filesystem that has hit eviction thresholds.
Within each QoS bucket, `kubelet` will sort pods according to their disk usage.
`kubelet` will sort pods in each bucket as follows:
#### Without Imagefs
If `nodefs` is triggering evictions, `kubelet` will sort pods based on their total disk usage
- local volumes + logs & writable layer of all its containers.
#### With Imagefs
If `nodefs` is triggering evictions, `kubelet` will sort pods based on the usage on `nodefs`
- local volumes + logs of all its containers.
If `imagefs` is triggering evictions, `kubelet` will sort pods based on the writable layer usage of all its containers.
## Minimum eviction reclaim
In certain scenarios, eviction of pods could result in reclamation of only a small amount of resources. This can result in
`kubelet` hitting eviction thresholds in repeated succession. In addition to that, evicting to reclaim a resource like `disk`
is time consuming.
To mitigate these issues, `kubelet` will have a per-resource `minimum-reclaim`. Whenever `kubelet` observes
resource pressure, `kubelet` will attempt to reclaim at least `minimum-reclaim` amount of resource.
Following are the flags through which `minimum-reclaim` can be configured for each evictable resource:
`--eviction-minimum-reclaim="memory.available=0Mi,nodefs.available=500Mi,imagefs.available=2Gi"`
The default `eviction-minimum-reclaim` is `0` for all resources.
## Deprecation of existing features
`kubelet` has been freeing up disk space on demand to keep the node stable. As part of this proposal,
some of the existing features/flags around freeing disk space will be deprecated in favor of this proposal.
| Existing Flag | New Flag | Rationale |
| ------------- | -------- | --------- |
| `--image-gc-high-threshold` | `--eviction-hard` or `eviction-soft` | existing eviction signals can capture image garbage collection |
| `--image-gc-low-threshold` | `--eviction-minimum-reclaim` | eviction reclaims achieve the same behavior |
| `--maximum-dead-containers` | | deprecated once old logs are stored outside of container's context |
| `--maximum-dead-containers-per-container` | | deprecated once old logs are stored outside of container's context |
| `--minimum-container-ttl-duration` | | deprecated once old logs are stored outside of container's context |
| `--low-diskspace-threshold-mb` | `--eviction-hard` or `eviction-soft` | this use case is better handled by this proposal |
| `--outofdisk-transition-frequency` | `--eviction-pressure-transition-period` | make the flag generic to suit all compute resources |
## Kubelet Admission Control
### Feasibility checks during kubelet admission
#### Memory
The `kubelet` will reject `BestEffort` pods if any of the memory
eviction thresholds have been exceeded independent of the configured
grace period.
Let's assume the operator started the `kubelet` with the following:
```
--eviction-soft="memory.available<256Mi"
--eviction-soft-grace-period="memory.available=30s"
```
If the `kubelet` sees that it has less than `256Mi` of memory available
on the node, but the `kubelet` has not yet initiated eviction since the
grace period criteria has not yet been met, the `kubelet` will still immediately
fail any incoming best effort pods.
The reasoning for this decision is the expectation that the incoming pod is
likely to further starve the particular compute resource and the `kubelet` should
return to a steady state before accepting new workloads.
#### Disk
The `kubelet` will reject all pods if any of the disk eviction thresholds have been met.
Let's assume the operator started the `kubelet` with the following:
```
--eviction-soft="nodefs.available<1500Mi"
--eviction-soft-grace-period="nodefs.available=30s"
```
If the `kubelet` sees that it has less than `1500Mi` of disk available
on the node, but the `kubelet` has not yet initiated eviction since the
grace period criteria has not yet been met, the `kubelet` will still immediately
fail any incoming pods.
The rationale for failing **all** pods instead of just best effort is that disk is currently
a best effort resource for all QoS classes.
Kubelet will apply the same policy even if there is a dedicated `image` filesystem.
## Scheduler
The node will report a condition when a compute resource is under pressure. The
scheduler should view that condition as a signal to dissuade placing additional
best effort pods on the node.
In this case, the `MemoryPressure` condition if true should dissuade the scheduler
from placing new best effort pods on the node since they will be rejected by the `kubelet` in admission.
On the other hand, the `DiskPressure` condition if true should dissuade the scheduler from
placing **any** new pods on the node since they will be rejected by the `kubelet` in admission.
## Best Practices
### DaemonSet
It is never desired for a `kubelet` to evict a pod that was derived from
a `DaemonSet` since the pod will immediately be recreated and rescheduled
back to the same node.
At the moment, the `kubelet` has no ability to distinguish a pod created
from `DaemonSet` versus any other object. If/when that information is
available, the `kubelet` could pro-actively filter those pods from the
candidate set of pods provided to the eviction strategy.
In general, it should be strongly recommended that `DaemonSet` not
create `BestEffort` pods to avoid being identified as a candidate pod
for eviction. Instead, `DaemonSet`s should ideally include `Guaranteed` pods only.
## Known issues
### kubelet may evict more pods than needed
Pod eviction may evict more pods than needed due to the stats collection timing gap. This can be mitigated by adding
the ability to get root container stats on an on-demand basis (https://github.com/google/cadvisor/issues/1247) in the future.
### How kubelet ranks pods for eviction in response to inode exhaustion
At this time, it is not possible to know how many inodes were consumed by a particular container. If the `kubelet` observes
inode exhaustion, it will evict pods by ranking them by quality of service. The following issue has been opened in cadvisor
to track per container inode consumption (https://github.com/google/cadvisor/issues/1422) which would allow us to rank pods
by inode consumption. For example, this would let us identify a container that created large numbers of 0 byte files, and evict
that pod over others.

View File

@ -1,45 +1 @@
Kubelet HyperContainer Container Runtime This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-hypercontainer-runtime.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-hypercontainer-runtime.md)
=======================================
Authors: Pengfei Ni (@feiskyer), Harry Zhang (@resouer)
## Abstract
This proposal aims to support [HyperContainer](http://hypercontainer.io) container
runtime in Kubelet.
## Motivation
HyperContainer is a Hypervisor-agnostic Container Engine that allows you to run Docker images using
hypervisors (KVM, Xen, etc.). By running containers within separate VM instances, it offers
hardware-enforced isolation, which is required in multi-tenant environments.
## Goals
1. Complete pod/container/image lifecycle management with HyperContainer.
2. Setup network by network plugins.
3. 100% Pass node e2e tests.
4. Easy to deploy for both local dev/test and production clusters.
## Design
The HyperContainer runtime will make use of the kubelet Container Runtime Interface. [Frakti](https://github.com/kubernetes/frakti) implements the CRI interface and exposes
a local endpoint to Kubelet. Frakti communicates with [hyperd](https://github.com/hyperhq/hyperd)
via its gRPC API to manage the lifecycle of sandboxes, containers and images.
![frakti](https://cloud.githubusercontent.com/assets/676637/18940978/6e3e5384-863f-11e6-9132-b638d862fd09.png)
## Limitations
Since pods run directly inside a hypervisor, the host network is not supported by the HyperContainer
runtime.
## Development
The HyperContainer runtime is maintained at <https://github.com/kubernetes/frakti>.

View File

@ -1,103 +1 @@
Next generation rkt runtime integration This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-rkt-runtime.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-rkt-runtime.md)
=======================================
Authors: Euan Kemp (@euank), Yifan Gu (@yifan-gu)
## Abstract
This proposal describes the design and road path for integrating rkt with kubelet with the new container runtime interface.
## Background
Currently, the Kubernetes project supports rkt as a container runtime via an implementation under [pkg/kubelet/rkt package](https://github.com/kubernetes/kubernetes/tree/v1.5.0-alpha.0/pkg/kubelet/rkt).
This implementation, for historical reasons, has required implementing a large amount of logic shared by the original Docker implementation.
In order to make additional container runtime integrations easier, more clearly defined, and more consistent, a new [Container Runtime Interface](https://github.com/kubernetes/kubernetes/blob/v1.5.0-alpha.0/pkg/kubelet/api/v1alpha1/runtime/api.proto) (CRI) is being designed.
The existing runtimes, in order to both prove the correctness of the interface and reduce maintenance burden, are incentivized to move to this interface.
This document proposes how the rkt runtime integration will transition to using the CRI.
## Goals
### Full-featured
The CRI integration must work as well as the existing integration in terms of features.
Until that's the case, the existing integration will continue to be maintained.
### Easy to Deploy
The new integration should not be any more difficult to deploy and configure than the existing integration.
### Easy to Develop
This iteration should be as easy to work on and iterate on as the original one.
It will be available in an initial usable form quickly in order to validate the CRI.
## Design
In order to fulfill the above goals, the rkt CRI integration will make the following choices:
### Remain in-process with Kubelet
The current rkt container runtime integration is able to be deployed simply by deploying the kubelet binary.
This is, in no small part, to make it *Easy to Deploy*.
Remaining in-process also helps this integration not regress on performance, one axis of being *Full-Featured*.
### Communicate through gRPC
Although the kubelet and rktlet will be compiled together, the runtime and kubelet will still communicate through a gRPC interface for better API abstraction.
For the near term, they will talk over a unix socket until we implement a custom gRPC connection that skips the network stack.
### Developed as a Separate Repository
Brian Grant's discussion on splitting the Kubernetes project into [separate repos](https://github.com/kubernetes/kubernetes/issues/24343) is a compelling argument for why it makes sense to split this work into a separate repo.
In order to be *Easy to Develop*, this iteration will be maintained as a separate repository, and re-vendored back in.
This choice will also allow better long-term growth in terms of better issue-management, testing pipelines, and so on.
Unfortunately, in the short term, it's possible that some aspects of this will also cause pain, and it's very difficult to weigh each side correctly.
### Exec the rkt binary (initially)
While significant work on the rkt [api-service](https://coreos.com/rkt/docs/latest/subcommands/api-service.html) has been made,
it has also been a source of problems and additional complexity,
and was never transitioned to entirely.
In addition, the rkt cli has historically been the primary interface to the rkt runtime.
The initial integration will execute the rkt binary directly for app creation/start/stop/removal, as well as image pulling/removal.
The creation of the pod sandbox is also done via the rkt command line, but it will run under `systemd-run` so it is monitored by the init process.
In the future, some of these decisions are expected to be changed such that rkt is vendored as a library dependency for all operations, and other init systems will be supported as well.
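The following is a minimal Go sketch of the pattern described above: launching the pod sandbox under `systemd-run` so the init process supervises it. The exact rkt subcommand and flags are placeholders; the real rktlet builds the argument list from the CRI request.
```go
package main

import (
	"fmt"
	"os/exec"
)

// createSandbox wraps a rkt invocation in systemd-run so that systemd
// supervises the sandbox process as a transient unit.
func createSandbox(unitName string, rktArgs ...string) error {
	args := append([]string{"--unit=" + unitName, "--slice=machine", "rkt"}, rktArgs...)
	out, err := exec.Command("systemd-run", args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("systemd-run failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	// Hypothetical invocation; real argument construction lives in rktlet.
	if err := createSandbox("k8s-example-pod", "run", "--net=default", "docker://busybox"); err != nil {
		fmt.Println(err)
	}
}
```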
## Roadmap and Milestones
1. rktlet integrates with the kubelet to support the basic pod/container lifecycle (pod creation, container creation/start/stop, pod stop/removal) [[Done]](https://github.com/kubernetes-incubator/rktlet/issues/9)
2. rktlet integrates with the kubelet to support more advanced features:
- Support kubelet networking, host network
- Support mount / volumes [[#33526]](https://github.com/kubernetes/kubernetes/issues/33526)
- Support exposing ports
- Support privileged containers
- Support selinux options [[#33139]](https://github.com/kubernetes/kubernetes/issues/33139)
- Support attach [[#29579]](https://github.com/kubernetes/kubernetes/issues/29579)
- Support exec [[#29579]](https://github.com/kubernetes/kubernetes/issues/29579)
- Support logging [[#33111]](https://github.com/kubernetes/kubernetes/pull/33111)
3. rktlet integrates with the kubelet and passes 100% of e2e and node e2e tests, with the nspawn stage1.
4. rktlet integrates with the kubelet and passes 100% of e2e and node e2e tests, with the kvm stage1.
5. Revendor rktlet into `pkg/kubelet/rktshim`, and start deprecating the `pkg/kubelet/rkt` package.
6. Eventually replace the current `pkg/kubelet/rkt` package.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/kubelet-rkt-runtime.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

View File

@ -1,407 +1 @@
# Kubelet and systemd interaction This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-systemd.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-systemd.md)
**Author**: Derek Carr (@derekwaynecarr)
**Status**: Proposed
## Motivation
Many Linux distributions have either adopted, or plan to adopt `systemd` as their init system.
This document describes how the node should be configured, and a set of enhancements that should
be made to the `kubelet` to better integrate with these distributions independent of container
runtime.
## Scope of proposal
This proposal does not account for running the `kubelet` in a container.
## Background on systemd
To help understand this proposal, we first provide a brief summary of `systemd` behavior.
### systemd units
`systemd` manages a hierarchy of `slice`, `scope`, and `service` units.
* `service` - application on the server that is launched by `systemd`; how it should start/stop;
when it should be started; under what circumstances it should be restarted; and any resource
controls that should be applied to it.
* `scope` - a process or group of processes which are not launched by `systemd` itself (i.e. forked);
like a service, resource controls may be applied to it.
* `slice` - organizes a hierarchy in which `scope` and `service` units are placed. A `slice` may
contain `slice`, `scope`, or `service` units; processes are attached to `service` and `scope`
units only, not to `slices`. The hierarchy is intended to be unified, meaning a process may
only belong to a single leaf node.
### cgroup hierarchy: split versus unified hierarchies
Classical `cgroup` hierarchies were split per resource group controller, and a process could
exist in different parts of the hierarchy.
For example, a process `p1` could exist in each of the following at the same time:
* `/sys/fs/cgroup/cpu/important/`
* `/sys/fs/cgroup/memory/unimportant/`
* `/sys/fs/cgroup/cpuacct/unimportant/`
In addition, controllers for one resource group could depend on another in ways that were not
always obvious.
For example, the `cpu` controller depends on the `cpuacct` controller yet they were treated
separately.
Many found it confusing for a single process to belong to different nodes in the `cgroup` hierarchy
across controllers.
The Kernel direction for `cgroup` support is to move toward a unified `cgroup` hierarchy, where the
per-controller hierarchies are eliminated in favor of hierarchies like the following:
* `/sys/fs/cgroup/important/`
* `/sys/fs/cgroup/unimportant/`
In a unified hierarchy, a process may only belong to a single node in the `cgroup` tree.
### cgroupfs single writer
The Kernel direction for `cgroup` management is to promote a single-writer model rather than
allowing multiple processes to independently write to parts of the file-system.
In distributions that run `systemd` as their init system, the cgroup tree is managed by `systemd`
by default since it implicitly interacts with the cgroup tree when starting units. Manual changes
made by other cgroup managers to the cgroup tree are not guaranteed to be preserved unless `systemd`
is made aware. `systemd` can be told to ignore sections of the cgroup tree by configuring the unit
to have the `Delegate=` option.
See: http://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#Delegate=
### cgroup management with systemd and container runtimes
A `slice` corresponds to an inner-node in the `cgroup` file-system hierarchy.
For example, the `system.slice` is represented as follows:
`/sys/fs/cgroup/<controller>/system.slice`
A `slice` is nested in the hierarchy by its naming convention.
For example, the `system-foo.slice` is represented as follows:
`/sys/fs/cgroup/<controller>/system.slice/system-foo.slice/`
A `service` or `scope` corresponds to leaf nodes in the `cgroup` file-system hierarchy managed by
`systemd`. Services and scopes can have child nodes managed outside of `systemd` if they have been
delegated with the `Delegate=` option.
For example, if the `docker.service` is associated with the `system.slice`, it is
represented as follows:
`/sys/fs/cgroup/<controller>/system.slice/docker.service/`
To demonstrate the use of `scope` units using the `docker` container runtime, if a
user launches a container via `docker run -m 100M busybox`, a `scope` will be created
because the process was not launched by `systemd` itself. The `scope` is parented by
the `slice` associated with the launching daemon.
For example:
`/sys/fs/cgroup/<controller>/system.slice/docker-<container-id>.scope`
`systemd` defines a set of slices. By default, service and scope units are placed in
`system.slice`, virtual machines and containers registered with `systemd-machined` are
found in `machine.slice`, and user sessions handled by `systemd-logind` in `user.slice`.
## Node Configuration on systemd
### kubelet cgroup driver
The `kubelet` reads and writes to the `cgroup` tree during bootstrapping
of the node. In the future, it will write to the `cgroup` tree to satisfy other
purposes around quality of service, etc.
The `kubelet` must cooperate with `systemd` in order to ensure proper function of the
system. The bootstrapping requirements for a `systemd` system are different than one
without it.
The `kubelet` will accept a new flag to control how it interacts with the `cgroup` tree.
* `--cgroup-driver=` - cgroup driver used by the kubelet. `cgroupfs` or `systemd`.
The `kubelet` should default `--cgroup-driver` to `systemd` on `systemd` distributions.
The `kubelet` should associate node bootstrapping semantics to the configured
`cgroup driver`.
### Node allocatable
The proposal makes no changes to the definition as presented here:
https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/node-allocatable.md
The node will report a set of allocatable compute resources defined as follows:
`[Allocatable] = [Node Capacity] - [Kube-Reserved] - [System-Reserved]`
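As a trivial illustration of the formula above, the sketch below computes allocatable for a single resource expressed as an integer quantity (e.g. millicores or bytes); the function name and the clamping to zero are assumptions, not kubelet code.
```go
package main

import "fmt"

// allocatable applies [Allocatable] = [Node Capacity] - [Kube-Reserved] - [System-Reserved]
// for one resource quantity; a negative result is clamped to zero.
func allocatable(capacity, kubeReserved, systemReserved int64) int64 {
	a := capacity - kubeReserved - systemReserved
	if a < 0 {
		return 0
	}
	return a
}

func main() {
	// e.g. 4000 millicores capacity, 200m kube-reserved, 300m system-reserved -> 3500m allocatable.
	fmt.Println(allocatable(4000, 200, 300))
}
```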
### Node capacity
The `kubelet` will continue to interface with `cAdvisor` to determine node capacity.
### System reserved
The node may set aside a set of designated resources for non-Kubernetes components.
The `kubelet` accepts the following flags that support this feature:
* `--system-reserved=` - A set of `ResourceName`=`ResourceQuantity` pairs that
describe resources reserved for host daemons.
* `--system-container=` - Optional resource-only container in which to place all
non-kernel processes that are not already in a container. Empty for no container.
Rolling back the flag requires a reboot. (Default: "").
The current meaning of `system-container` is inadequate on `systemd` environments.
The `kubelet` should use the flag to know the location that has the processes that
are associated with `system-reserved`, but it should not modify the cgroups of
existing processes on the system during bootstrapping of the node. This is
because `systemd` is the `cgroup manager` on the host and it has not delegated
authority to the `kubelet` to change how it manages `units`.
The following describes the type of things that can happen if this does not change:
https://bugzilla.redhat.com/show_bug.cgi?id=1202859
As a result, the `kubelet` needs to distinguish placement of non-kernel processes
based on the cgroup driver, and only do its current behavior when not on `systemd`.
The flag should be modified as follows:
* `--system-container=` - Name of resource-only container that holds all
non-kernel processes whose resource consumption is accounted under
system-reserved. The default value is cgroup driver specific. systemd
defaults to system, cgroupfs defines no default. Rolling back the flag
requires a reboot.
The `kubelet` will error if the defined `--system-container` does not exist
on `systemd` environments. It will verify that the appropriate `cpu` and `memory`
controllers are enabled.
### Kubernetes reserved
The node may set aside a set of resources for Kubernetes components:
* `--kube-reserved=` - A set of `ResourceName`=`ResourceQuantity` pairs that
describe resources reserved for Kubernetes components.
The `kubelet` does not enforce `--kube-reserved` at this time, but the ability
to distinguish the static reservation from observed usage is important for node accounting.
This proposal asserts that `kubernetes.slice` is the default slice associated with
the `kubelet` and `kube-proxy` service units defined in the project. Keeping it
separate from `system.slice` allows for accounting to be distinguished separately.
The `kubelet` will detect its `cgroup` to track `kube-reserved` observed usage on `systemd`.
If the `kubelet` detects that it is a child of the `system-container` based on the observed
`cgroup` hierarchy, it will warn.
If the `kubelet` is launched directly from a terminal, its most likely destination will
be in a `scope` that is a child of `user.slice` as follows:
`/sys/fs/cgroup/<controller>/user.slice/user-1000.slice/session-1.scope`
In this context, the parent `scope` is what will be used to facilitate local developer
debugging scenarios for tracking `kube-reserved` usage.
The `kubelet` has the following flag:
* `--resource-container="/kubelet":` Absolute name of the resource-only container to create
and run the Kubelet in (Default: /kubelet).
This flag will not be supported on `systemd` environments since the init system has already
spawned the process and placed it in the corresponding container associated with its unit.
### Kubernetes container runtime reserved
This proposal asserts that the reservation of compute resources for any associated
container runtime daemons is tracked by the operator under the `system-reserved` or
`kubernetes-reserved` values and any enforced limits are set by the
operator specific to the container runtime.
**Docker**
If the `kubelet` is configured with the `container-runtime` set to `docker`, the
`kubelet` will detect the `cgroup` associated with the `docker` daemon and use that
to do local node accounting. If an operator wants to impose runtime limits on the
`docker` daemon to control resource usage, the operator should set those explicitly in
the `service` unit that launches `docker`. The `kubelet` will not set any limits itself
at this time and will assume whatever budget was set aside for `docker` was included in
either `--kube-reserved` or `--system-reserved` reservations.
Many OS distributions package `docker` by default, and it will often belong to the
`system.slice` hierarchy, and therefore operators will need to budget it for there
by default unless they explicitly move it.
**rkt**
rkt has no client/server daemon, and therefore has no explicit requirements on container-runtime
reservation.
### kubelet cgroup enforcement
The `kubelet` does not enforce the `system-reserved` or `kube-reserved` values by default.
The `kubelet` should support an additional flag to turn on enforcement:
* `--system-reserved-enforce=false` - Optional flag that if true tells the `kubelet`
to enforce the `system-reserved` constraints defined (if any)
* `--kube-reserved-enforce=false` - Optional flag that if true tells the `kubelet`
to enforce the `kube-reserved` constraints defined (if any)
Usage of this flag requires that end-user containers are launched in a separate part
of cgroup hierarchy via `cgroup-root`.
If this flag is enabled, the `kubelet` will continually validate that the configured
resource constraints are applied on the associated `cgroup`.
### kubelet cgroup-root behavior under systemd
The `kubelet` supports a `cgroup-root` flag which is the optional root `cgroup` to use for pods.
This flag should be treated as a pass-through to the underlying configured container runtime.
If enforcement is enabled via the flags above, this flag warrants special consideration by the operator depending
on how the node was configured. For example, if the container runtime is `docker` and it is using
the `systemd` cgroup driver, then `docker` will take the daemon-wide default and launch containers
in the same slice associated with the `docker.service`. By default, this would mean `system.slice`,
which could cause end-user pods to be launched in the same part of the cgroup hierarchy as system daemons.
In those environments, it is recommended that `cgroup-root` is configured to be a subtree of `machine.slice`.
### Proposed cgroup hierarchy
```
$ROOT
|
+- system.slice
|  |
|  +- sshd.service
|  +- docker.service (optional)
|  +- ...
|
+- kubernetes.slice
|  |
|  +- kubelet.service
|  +- docker.service (optional)
|
+- machine.slice (container runtime specific)
|  |
|  +- docker-<container-id>.scope
|
+- user.slice
   +- ...
```
* `system.slice` corresponds to `--system-reserved`, and contains any services the
operator brought to the node as normal configuration.
* `kubernetes.slice` corresponds to the `--kube-reserved`, and contains kube specific
daemons.
* `machine.slice` should parent all end-user containers on the system and serve as the
root of the end-user cluster workloads run on the system.
* `user.slice` is not explicitly tracked by the `kubelet`, but it is possible that users open `ssh`
sessions to the node and launch actions directly. Any resource accounting
reserved for those actions should be part of `system-reserved`.
The container runtime daemon, `docker` in this outline, must be accounted for in either
`system.slice` or `kubernetes.slice`.
Going forward, the container hierarchy should not be rooted
more than 2 layers below the root, as deeper nesting has historically caused issues with node performance
in other `cgroup`-aware systems (https://bugzilla.redhat.com/show_bug.cgi?id=850718). It
is anticipated that the `kubelet` will parent containers based on quality of service
in the future. In that environment, those changes will be relative to the configured
`cgroup-root`.
### Linux Kernel Parameters
The `kubelet` will set the following:
* `sysctl -w vm.overcommit_memory=1`
* `sysctl -w vm.panic_on_oom=0`
* `sysctl -w kernel.panic=10`
* `sysctl -w kernel.panic_on_oops=1`
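A minimal Go sketch of how the parameters above could be applied by writing under `/proc/sys`; the helper name and error handling are illustrative, not the kubelet's actual implementation.
```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// setSysctl writes a value under /proc/sys, e.g. "vm.overcommit_memory" -> /proc/sys/vm/overcommit_memory.
func setSysctl(name, value string) error {
	path := filepath.Join("/proc/sys", strings.Replace(name, ".", "/", -1))
	return os.WriteFile(path, []byte(value), 0644)
}

func main() {
	params := map[string]string{
		"vm.overcommit_memory": "1",
		"vm.panic_on_oom":      "0",
		"kernel.panic":         "10",
		"kernel.panic_on_oops": "1",
	}
	for name, value := range params {
		if err := setSysctl(name, value); err != nil {
			fmt.Fprintf(os.Stderr, "failed to set %s: %v\n", name, err)
		}
	}
}
```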
### OOM Score Adjustment
The `kubelet` at bootstrapping will set the `oom_score_adj` value for Kubernetes
daemons, and any dependent container-runtime daemons.
If `container-runtime` is set to `docker`, then the `kubelet` will set the daemon's `oom_score_adj` to `-999`.
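A companion sketch of the adjustment itself: the score is applied by writing to `/proc/<pid>/oom_score_adj`. The value `-999` matches the docker case above; applying it to the current process's own PID is shown only as an example.
```go
package main

import (
	"fmt"
	"os"
)

// setOOMScoreAdj writes the given score to /proc/<pid>/oom_score_adj.
func setOOMScoreAdj(pid, score int) error {
	path := fmt.Sprintf("/proc/%d/oom_score_adj", pid)
	return os.WriteFile(path, []byte(fmt.Sprintf("%d", score)), 0644)
}

func main() {
	// Example: protect this process (a stand-in for a runtime daemon) from the OOM killer.
	if err := setOOMScoreAdj(os.Getpid(), -999); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```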
## Implementation concerns
### kubelet block-level architecture
```
+-----------+      +-----------+      +-----------+
|           |      |           |      |    Pod    |
|   Node    <------+ Container <------+ Lifecycle |
|  Manager  |      |  Manager  |      |  Manager  |
|           +------>           |      |           |
+-----+-----+      +-----+-----+      +-----------+
      |                  |
      |            +-----+-----------+
      |            |                 |
+-----v------------v-----+     +-----v------+
|    cgroups library     |     |  container |
|                        |     |  runtimes  |
+-----------+------------+     +-----+------+
            |                        |
            +-----------+------------+
                        |
            +-----------v-----------+
            |      Linux Kernel     |
            +-----------------------+
```
The `kubelet` should move to an architecture that resembles the above diagram:
* The `kubelet` should not interface directly with the `cgroup` file-system, but instead
should use a common `cgroups library` that has the proper abstraction in place to
work with either `cgroupfs` or `systemd`. The `kubelet` should just use `libcontainer`
abstractions to facilitate this requirement. The `libcontainer` abstractions as
currently defined only support an `Apply(pid)` pattern, and we need to separate that
abstraction to allow a cgroup to be created and then later joined.
* The existing `ContainerManager` should separate node bootstrapping into a separate
`NodeManager` that is dependent on the configured `cgroup-driver`.
* The `kubelet` flags for cgroup paths will be converted internally by the cgroup library,
i.e. `/foo/bar` will convert to `foo-bar.slice`, as sketched below.
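A small, assumed sketch of the conversion mentioned in the last bullet; the escaping rules systemd applies to unusual characters are omitted.
```go
package main

import (
	"fmt"
	"strings"
)

// pathToSlice converts a cgroupfs-style path such as "/foo/bar" into the
// systemd slice name "foo-bar.slice"; "/" maps to the root slice "-.slice".
func pathToSlice(path string) string {
	trimmed := strings.Trim(path, "/")
	if trimmed == "" {
		return "-.slice"
	}
	return strings.Replace(trimmed, "/", "-", -1) + ".slice"
}

func main() {
	fmt.Println(pathToSlice("/foo/bar")) // foo-bar.slice
}
```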
### kubelet accounting for end-user pods
This proposal re-enforces that it is inappropriate at this time to depend on `--cgroup-root` as the
primary mechanism to distinguish and account for end-user pod compute resource usage.
Instead, the `kubelet` can and should sum the usage of each running `pod` on the node to account for
end-user pod usage separate from system-reserved and kubernetes-reserved accounting via `cAdvisor`.
## Known issues
### Docker runtime support for --cgroup-parent
Docker versions <= 1.0.9 did not have proper support for the `--cgroup-parent` flag on `systemd`. This
was fixed in this PR (https://github.com/docker/docker/pull/18612). As a result, it's expected
that containers launched by the `docker` daemon may continue to go in the default `system.slice` and
appear to be counted under system-reserved node usage accounting.
If operators run with later versions of `docker`, they can avoid this issue via the use of `cgroup-root`
flag on the `kubelet`, but this proposal makes no requirement on operators to do that at this time, and
this can be revisited if/when the project adopts docker 1.10.
Some OS distributions will fix this bug in versions of docker <= 1.0.9, so operators should
be aware of how their version of `docker` was packaged when using this feature.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/kubelet-systemd.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

View File

@ -1,243 +1 @@
# Kubelet TLS bootstrap This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-tls-bootstrap.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-tls-bootstrap.md)
Author: George Tankersley (george.tankersley@coreos.com)
## Preface
This document describes a method for a kubelet to bootstrap itself
into a TLS-secured cluster. Crucially, it automates the provision and
distribution of signed certificates.
## Overview
When a kubelet runs for the first time, it must be given TLS assets
or generate them itself. In the first case, this is a burden on the cluster
admin and a significant logistical barrier to secure Kubernetes rollouts. In
the second, the kubelet must self-sign its certificate and forfeits many of the
advantages of a PKI system. Instead, we propose that the kubelet generate a
private key and a CSR for submission to a cluster-level certificate signing
process.
## Preliminaries
We assume the existence of a functioning control plane. The
apiserver should be configured for TLS initially or possess the ability to
generate valid TLS credentials for itself. If secret information is passed in
the request (e.g. auth tokens supplied with the request or included in
ExtraInfo) then all communications from the node to the apiserver must take
place over a verified TLS connection.
Each node is additionally provisioned with the following information:
1. Location of the apiserver
2. Any CA certificates necessary to trust the apiserver's TLS certificate
3. Access tokens (if needed) to communicate with the CSR endpoint
These should not change often and are thus simple to include in a static
provisioning script.
## API Changes
### CertificateSigningRequest Object
We introduce a new API object to represent PKCS#10 certificate signing
requests. It will be accessible under:
`/apis/certificates/v1beta1/certificatesigningrequests/mycsr`
It will have the following structure:
```go
// Describes a certificate signing request
type CertificateSigningRequest struct {
    unversioned.TypeMeta `json:",inline"`
    api.ObjectMeta       `json:"metadata,omitempty"`
    // The certificate request itself and any additional information.
    Spec CertificateSigningRequestSpec `json:"spec,omitempty"`
    // Derived information about the request.
    Status CertificateSigningRequestStatus `json:"status,omitempty"`
}
// This information is immutable after the request is created.
type CertificateSigningRequestSpec struct {
    // Base64-encoded PKCS#10 CSR data
    Request string `json:"request"`
    // Any extra information the node wishes to send with the request.
    ExtraInfo []string `json:"extrainfo,omitempty"`
}
// This information is derived from the request by Kubernetes and cannot be
// modified by users. All information is optional since it might not be
// available in the underlying request. This is intended to aid approval
// decisions.
type CertificateSigningRequestStatus struct {
    // Information about the requesting user (if relevant)
    // See user.Info interface for details
    Username string   `json:"username,omitempty"`
    UID      string   `json:"uid,omitempty"`
    Groups   []string `json:"groups,omitempty"`
    // Fingerprint of the public key in request
    Fingerprint string `json:"fingerprint,omitempty"`
    // Subject fields from the request
    Subject internal.Subject `json:"subject,omitempty"`
    // DNS SANs from the request
    Hostnames []string `json:"hostnames,omitempty"`
    // IP SANs from the request
    IPAddresses []string `json:"ipaddresses,omitempty"`
    Conditions []CertificateSigningRequestCondition `json:"conditions,omitempty"`
}
type RequestConditionType string
// These are the possible states for a certificate request.
const (
    Approved RequestConditionType = "Approved"
    Denied   RequestConditionType = "Denied"
)
type CertificateSigningRequestCondition struct {
    // request approval state, currently Approved or Denied.
    Type RequestConditionType `json:"type"`
    // brief reason for the request state
    Reason string `json:"reason,omitempty"`
    // human readable message with details about the request state
    Message string `json:"message,omitempty"`
    // If request was approved, the controller will place the issued certificate here.
    Certificate []byte `json:"certificate,omitempty"`
}
type CertificateSigningRequestList struct {
    unversioned.TypeMeta `json:",inline"`
    unversioned.ListMeta `json:"metadata,omitempty"`
    Items []CertificateSigningRequest `json:"items,omitempty"`
}
```
We also introduce CertificateSigningRequestList to allow listing all the CSRs in the cluster:
```go
type CertificateSigningRequestList struct {
    api.TypeMeta
    api.ListMeta
    Items []CertificateSigningRequest
}
```
## Certificate Request Process
### Node initialization
When the kubelet executes, it checks a location on disk for TLS assets
(currently `/var/run/kubernetes/kubelet.{key,crt}` by default). If it finds
them, it proceeds. If there are no TLS assets, the kubelet generates a keypair
and self-signed certificate. We propose the following optional behavior:
1. Generate a keypair
2. Generate a CSR for that keypair with CN set to the hostname (or
`--hostname-override` value) and DNS/IP SANs supplied with whatever values
the host knows for itself.
3. Post the CSR to the CSR API endpoint.
4. Set a watch on the CSR object to be notified of approval or rejection.
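Steps 1 and 2 above can be sketched with Go's standard `crypto` packages; the SAN values and output handling below are placeholders, and posting the CSR to the API plus watching for approval are omitted.
```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"net"
	"os"
)

func main() {
	host, _ := os.Hostname()

	// Step 1: generate a keypair.
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}

	// Step 2: generate a CSR with CN set to the hostname and DNS/IP SANs.
	template := x509.CertificateRequest{
		Subject:     pkix.Name{CommonName: host},
		DNSNames:    []string{host},
		IPAddresses: []net.IP{net.ParseIP("10.0.0.5")}, // placeholder node IP
	}
	der, err := x509.CreateCertificateRequest(rand.Reader, &template, key)
	if err != nil {
		panic(err)
	}

	// The encoded CSR is what would be placed in Spec.Request before posting.
	fmt.Printf("%s", pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE REQUEST", Bytes: der}))
}
```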
### Controller response
The apiserver persists the CertificateSigningRequests and exposes the List of
all CSRs for an administrator to approve or reject.
A new certificate controller watches for certificate requests. It must first
validate the signature on each CSR and add `Condition=Denied` on
any requests with invalid signatures (with Reason and Message indicating
such). For valid requests, the controller will derive the information in
`CertificateSigningRequestStatus` and update that object. The controller should
watch for updates to the approval condition of any CertificateSigningRequest.
When a request is approved (signified by Conditions containing only Approved)
the controller should generate and sign a certificate based on that CSR, then
update the condition with the certificate data using the `/approval`
subresource.
### Manual CSR approval
An administrator using `kubectl` or another API client can query the
CertificateSigningRequestList and update the approval condition of
CertificateSigningRequests. The default state is empty, indicating that there
has been no decision so far. A state of "Approved" indicates that the admin has
approved the request and the certificate controller should issue the
certificate. A state of "Denied" indicates that admin has denied the
request. An admin may also supply Reason and Message fields to explain the
rejection.
## kube-apiserver support
The apiserver will present the new endpoints mentioned above and support the
relevant object types.
## kube-controller-manager support
To handle certificate issuance, the controller-manager will need access to CA
signing assets. This could be as simple as a private key and a config file or
as complex as a PKCS#11 client and supplementary policy system. For now, we
will add flags for a signing key, a certificate, and a basic policy file.
## kubectl support
To support manual CSR inspection and approval, we will add support for listing,
inspecting, and approving or denying CertificateSigningRequests to kubectl. The
interaction will be similar to
[salt-key](https://docs.saltstack.com/en/latest/ref/cli/salt-key.html).
Specifically, the admin will have the ability to retrieve the full list of
pending CSRs, inspect their contents, and set their approval conditions to one
of:
1. **Approved** if the controller should issue the cert
2. **Denied** if the controller should not issue the cert
The suggested command for listing is `kubectl get csrs`. The approve/deny
interactions can be accomplished with normal updates, but would be more
conveniently accessed by direct subresource updates. We leave this for future
updates to kubectl.
## Security Considerations
### Endpoint Access Control
The ability to post CSRs to the signing endpoint should be controlled. As a
simple solution we propose that each node be provisioned with an auth token
(possibly static across the cluster) that is scoped via ABAC to only allow
access to the CSR endpoint.
### Expiration & Revocation
The node is responsible for monitoring its own certificate expiration date.
When the certificate is close to expiration, the kubelet should begin repeating
this flow until it successfully obtains a new certificate. If the expiring
certificate has not been revoked and the previous certificate request is still
approved, then it may do so using the same keypair unless the cluster policy
(see "Future Work") requires fresh keys.
Revocation is for the most part an unhandled problem in Go, requiring each
application to produce its own logic around a variety of parsing functions. For
now, our suggested best practice is to issue only short-lived certificates. In
the future it may make sense to add CRL support to the apiserver's client cert
auth.
## Future Work
- revocation UI in kubectl and CRL support at the apiserver
- supplemental policy (e.g. cluster CA only issues 30-day certs for hostnames *.k8s.example.com, each new cert must have fresh keys, ...)
- fully automated provisioning (using a handshake protocol or external list of authorized machines)
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/kubelet-tls-bootstrap.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

View File

@ -1,157 +1 @@
# Kubemark proposal This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubemark.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubemark.md)
## Goal of this document
This document describes the design of Kubemark - a system that allows performance testing of a Kubernetes cluster. It describes the
assumptions, the high-level design, and discusses possible solutions for lower-level problems. It is supposed to be a starting point for more
detailed discussion.
## Current state and objective
Currently performance testing happens on live clusters of up to 100 Nodes. It takes quite a while to start such a cluster or to push
updates to all Nodes, and it uses quite a lot of resources. At this scale the amount of wasted time and used resources is still acceptable.
In the next quarter or two we're targeting a 1000-Node cluster, which will push it way beyond the acceptable level. Additionally, we want to
enable people without many resources to run scalability tests on bigger clusters than they can afford at a given time. Having the ability to
cheaply run scalability tests will enable us to run some set of them on "normal" test clusters, which in turn would mean the ability to run
them on every PR.
This means that we need a system that will allow for realistic performance testing on a (much) smaller number of “real” machines. The first
assumption we make is that Nodes are independent, i.e. the number of existing Nodes does not impact the performance of a single Node. This is not
entirely true, as the number of Nodes can increase the latency of various components on the Master machine, which in turn may increase the latency of Node
operations, but we're not interested in measuring this effect here. Instead we want to measure how the number of Nodes and the load imposed by
Node daemons affect the performance of Master components.
## Kubemark architecture overview
The high-level idea behind Kubemark is to write a library that allows running artificial "Hollow" Nodes that will be able to simulate the
behavior of a real Kubelet and KubeProxy in a single, lightweight binary. Hollow components will need to correctly respond to Controllers
(via API server), and preferably, in the fullness of time, be able to replay previously recorded real traffic (this is out of scope for the
initial version). To teach Hollow components to replay recorded traffic, they will need to store data specifying when a given Pod/Container
should die (e.g. observed lifetime). Such data can be extracted e.g. from etcd Raft logs, or it can be reconstructed from Events. In the
initial version we only want them to be able to fool Master components and put some configurable (in what way TBD) load on them.
When we have the Hollow Node ready, we'll be able to test the performance of Master components by creating a real Master Node, with API server,
Controllers, etcd and whatnot, and creating a number of Hollow Nodes that will register with the running Master.
To make Kubemark easier to maintain as the system evolves, Hollow components will reuse real "production" code for Kubelet and KubeProxy, but
will mock all the backends with no-op or very simple mocks. We believe that this approach is better in the long run than writing a special
"performance-test-aimed" separate version of them. This may take more time to create an initial version, but we think the maintenance cost will
be noticeably smaller.
### Option 1
For the initial version we will teach Master components to use the port number to identify Kubelet/KubeProxy. This will allow running those
components on non-default ports, and at the same time will allow running multiple Hollow Nodes on a single machine. During setup we will
generate credentials for cluster communication and pass them to HollowKubelet/HollowProxy to use. Master will treat all HollowNodes as
normal ones.
![Kubemark architecture diagram for option 1](Kubemark_architecture.png?raw=true "Kubemark architecture overview")
*Kubemark architecture diagram for option 1*
### Option 2
As a second (equivalent) option we will run Kubemark on top of 'real' Kubernetes cluster, where both Master and Hollow Nodes will be Pods.
In this option we'll be able to use Kubernetes mechanisms to streamline setup, e.g. by using Kubernetes networking to ensure unique IPs for
Hollow Nodes, or using Secrets to distribute Kubelet credentials. The downside of this configuration is that it's likely that some noise
will appear in Kubemark results from either CPU/Memory pressure from other things running on Nodes (e.g. FluentD, or Kubelet) or running
cluster over an overlay network. We believe that it'll be possible to turn off cluster monitoring for Kubemark runs, so that the impact
of real Node daemons will be minimized, but we don't know what will be the impact of using higher level networking stack. Running a
comparison will be an interesting test in itself.
### Discussion
Before taking a closer look at steps necessary to set up a minimal Hollow cluster it's hard to tell which approach will be simpler. It's
quite possible that the initial version will end up as hybrid between running the Hollow cluster directly on top of VMs and running the
Hollow cluster on top of a Kubernetes cluster that is running on top of VMs. E.g. running Nodes as Pods in Kubernetes cluster and Master
directly on top of VM.
## Things to simulate
In real Kubernetes on a single Node we run two daemons that communicate with Master in some way: Kubelet and KubeProxy.
### KubeProxy
As a replacement for KubeProxy we'll use HollowProxy, which will be a real KubeProxy with injected no-op mocks everywhere it makes sense.
### Kubelet
As a replacement for Kubelet we'll use HollowKubelet, which will be a real Kubelet with injected no-op or simple mocks everywhere it makes
sense.
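A hypothetical Go sketch of the no-op mock pattern described above; the `containerRuntime` interface here is a stand-in and is far smaller than the real runtime interface the Kubelet uses.
```go
package main

import "fmt"

// containerRuntime stands in for the runtime backend the Kubelet drives.
type containerRuntime interface {
	RunPod(name string) error
	KillPod(name string) error
}

// noopRuntime pretends to manage pods, so the "production" Kubelet code
// paths can run without starting any real containers.
type noopRuntime struct{}

func (noopRuntime) RunPod(name string) error  { fmt.Println("hollow: running", name); return nil }
func (noopRuntime) KillPod(name string) error { fmt.Println("hollow: killing", name); return nil }

func main() {
	var rt containerRuntime = noopRuntime{}
	_ = rt.RunPod("nginx")
	_ = rt.KillPod("nginx")
}
```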
Kubelet also exposes a cadvisor endpoint, which is scraped by Heapster, and a healthz endpoint read by supervisord, and we have FluentD running as a
Pod on each Node that exports logs to Elasticsearch (or Google Cloud Logging). Both Heapster and Elasticsearch are running in Pods in the
cluster, so they do not add any load on the Master components by themselves. There can be other systems that scrape Heapster through the proxy running
on the Master, which adds additional load, but they're not part of the default setup, so in the first version we won't simulate this behavior.
In the first version we'll assume that all started Pods will run indefinitely if not explicitly deleted. In the future we can add a model
of short-running batch jobs, but in the initial version we'll assume only serving-like Pods.
### Heapster
In addition to system components we run Heapster as a part of cluster monitoring setup. Heapster currently watches Events, Pods and Nodes
through the API server. In the test setup we can use real heapster for watching API server, with mocked out piece that scrapes cAdvisor
data from Kubelets.
### Elasticsearch and Fluentd
Similarly to Heapster, Elasticsearch runs outside the Master machine but generates some traffic on it. The Fluentd “daemon” running on the Master
periodically sends Docker logs it gathered to the Elasticsearch running on one of the Nodes. In the initial version we omit Elasticsearch,
as it produces only a constant small load on Master Node that does not change with the size of the cluster.
## Necessary work
There are three more or less independent things that need to be worked on:
- HollowNode implementation, creating a library/binary that will be able to listen to Watches and respond in a correct fashion with Status
updates. This also involves creation of a CloudProvider that can produce such Hollow Nodes, or making sure that HollowNodes can correctly
self-register with a no-provider Master.
- Kubemark setup, including figuring out the networking model, the number of Hollow Nodes that will be allowed to run on a single “machine”, writing
setup/run/teardown scripts (in [option 1](#option-1)), or figuring out how to run Master and Hollow Nodes on top of Kubernetes
(in [option 2](#option-2))
- Creating a Player component that will send requests to the API server putting a load on a cluster. This involves creating a way to
specify desired workload. This task is
very well isolated from the rest, as it is about sending requests to the real API server. Because of that we can discuss requirements
separately.
## Concerns
Network performance most likely won't be a problem for the initial version if running directly on VMs rather than on top of a Kubernetes
cluster, as Kubemark will be running on a standard networking stack (no cloud-provider software routes or overlay network is needed, as we
don't need custom routing between Pods). Similarly we don't think that running Kubemark on Kubernetes virtualized cluster networking will
cause a noticeable performance impact, but it requires testing.
On the other hand, when adding additional features it may turn out that we need to simulate the Kubernetes Pod network. In such a case, when running
'pure' Kubemark we may try one of the following:
- running overlay network like Flannel or OVS instead of using cloud providers routes,
- write simple network multiplexer to multiplex communications from the Hollow Kubelets/KubeProxies on the machine.
In the case of Kubemark on Kubernetes it may turn out that we run into a problem with adding yet another layer of network virtualization, but we
don't need to solve this problem now.
## Work plan
- Teach/make sure that Master can talk to multiple Kubelets on the same Machine [option 1](#option-1):
- make sure that Master can talk to a Kubelet on non-default port,
- make sure that Master can talk to all Kubelets on different ports,
- Write HollowNode library:
- new HollowProxy,
- new HollowKubelet,
- new HollowNode combining the two,
- make sure that Master can talk to two HollowKubelets running on the same machine
- Make sure that we can run Hollow cluster on top of Kubernetes [option 2](#option-2)
- Write a player that will automatically put some predefined load on the Master, <- this is the moment when it's possible to play with it, and it is useful by itself for
scalability tests. Alternatively we can just use the current density/load tests,
- Benchmark our machines - see how many Watch clients we can have before everything explodes,
- See how many HollowNodes we can run on a single machine by attaching them to the real master <- this is the moment it starts to be useful
- Update kube-up/kube-down scripts to enable creating “HollowClusters”/write new scripts/something, integrate HollowCluster with Elasticsearch/Heapster equivalents,
- Allow passing custom configuration to the Player
## Future work
In the future we want to add following capabilities to the Kubemark system:
- replaying real traffic reconstructed from the recorded Events stream,
- simulating scraping things running on Nodes through Master proxy.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/kubemark.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

View File

@ -1,161 +1 @@
# Kubernetes Local Cluster Experience This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/local-cluster-ux.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/local-cluster-ux.md)
This proposal attempts to improve the existing local cluster experience for kubernetes.
The current local cluster experience is sub-par and often not functional.
There are several options to setup a local cluster (docker, vagrant, linux processes, etc) and we do not test any of them continuously.
Here are some highlighted issues:
- Docker based solution breaks with docker upgrades, does not support DNS, and many kubelet features are not functional yet inside a container.
- Vagrant based solutions are too heavy and have mostly failed on OS X.
- Local linux cluster is poorly documented and is undiscoverable.
From an end user's perspective, what matters is running a Kubernetes cluster. Users care less about *how* a cluster is set up locally and more about what they can do with a functional cluster.
## Primary Goals
From a high level, the goal is to make it easy for a new user to run a Kubernetes cluster and play with curated examples that require the least amount of knowledge about Kubernetes.
These examples will use only kubectl, and only a subset of the available Kubernetes features will be exposed.
- Works across multiple OSes - OS X, Linux and Windows primarily.
- Single command setup and teardown UX.
- Unified UX across OSes
- Minimal dependencies on third party software.
- Minimal resource overhead.
- Eliminate any other alternatives to local cluster deployment.
## Secondary Goals
- Enable developers to use the local cluster for kubernetes development.
## Non Goals
- Simplifying kubernetes production deployment experience. [Kube-deploy](https://github.com/kubernetes/kube-deploy) is attempting to tackle this problem.
- Supporting all possible deployment configurations of Kubernetes like various types of storage, networking, etc.
## Local cluster requirements
- Includes all the master components & DNS (Apiserver, scheduler, controller manager, etcd and kube dns)
- Basic auth
- Service accounts should be setup
- Kubectl should be auto-configured to use the local cluster
- Tested & maintained as part of Kubernetes core
## Existing solutions
Following are some of the existing solutions that attempt to simplify local cluster deployments.
### [Spread](https://github.com/redspread/spread)
Spread's UX is great!
It is adapted from monokube and includes DNS as well.
It satisfies almost all the requirements, except that it requires docker to be pre-installed.
It has a loose dependency on docker.
New releases of docker might break this setup.
### [Kmachine](https://github.com/skippbox/kmachine)
Kmachine is adapted from docker-machine.
It exposes the entire docker-machine CLI.
It is possible to repurpose Kmachine to meet all our requirements.
### [Monokube](https://github.com/polvi/monokube)
Single binary that runs all kube master components.
Does not include DNS.
This is only a part of the overall local cluster solution.
### Vagrant
The kube-up.sh script included in Kubernetes release supports a few Vagrant based local cluster deployments.
kube-up.sh is not user friendly.
It typically takes a long time for the cluster to be set up using vagrant, and it is often unsuccessful on OS X.
The [Core OS single machine guide](https://coreos.com/kubernetes/docs/latest/kubernetes-on-vagrant-single.html) uses Vagrant as well and it just works.
Since we are targeting a single command install/teardown experience, vagrant needs to be an implementation detail and not be exposed to our users.
## Proposed Solution
To avoid exposing users to third party software and external dependencies, we will build a toolbox that will be shipped with all the dependencies including all kubernetes components, hypervisor, base image, kubectl, etc.
*Note: Docker provides a [similar toolbox](https://www.docker.com/products/docker-toolbox).*
This "Localkube" tool will be referred to as "Minikube" in this proposal to avoid ambiguity against Spread's existing ["localkube"](https://github.com/redspread/localkube).
The final name of this tool is TBD. Suggestions are welcome!
Minikube will provide a unified CLI to interact with the local cluster.
The CLI will support only a few operations:
- **Start** - creates & starts a local cluster along with setting up kubectl & networking (if necessary)
- **Stop** - suspends the local cluster & preserves cluster state
- **Delete** - deletes the local cluster completely
- **Upgrade** - upgrades internal components to the latest available version (upgrades are not guaranteed to preserve cluster state)
For running and managing the kubernetes components themselves, we can re-use [Spread's localkube](https://github.com/redspread/localkube).
Localkube is a self-contained go binary that includes all the master components including DNS and runs them using multiple go threads.
Each Kubernetes release will include a localkube binary that has been tested exhaustively.
To support Windows and OS X, minikube will use [libmachine](https://github.com/docker/machine/tree/master/libmachine) internally to create and destroy virtual machines.
Minikube will be shipped with a hypervisor (virtualbox) in the case of OS X.
Minikube will include a base image that will be well tested.
In the case of Linux, since the cluster can be run locally, we ideally want to avoid setting up a VM.
Since docker is the only fully supported runtime as of Kubernetes v1.2, we can initially use docker to run and manage localkube.
There is risk of being incompatible with the existing version of docker.
By using a VM, we can avoid such incompatibility issues though.
Feedback from the community will be helpful here.
If the goal is to run outside of a VM, we can have minikube prompt the user if docker is unavailable or version is incompatible.
Alternatives to docker for running the localkube core includes using [rkt](https://coreos.com/rkt/docs/latest/), setting up systemd services, or a System V Init script depending on the distro.
To summarize the pipeline is as follows:
##### OS X / Windows
minikube -> libmachine -> virtualbox/hyper V -> linux VM -> localkube
##### Linux
minikube -> docker -> localkube
### Alternatives considered
#### Bring your own docker
##### Pros
- Kubernetes users will probably already have it
- No extra work for us
- Only one VM/daemon, we can just reuse the existing one
##### Cons
- Not designed to be wrapped, may be unstable
- Might make configuring networking difficult on OS X and Windows
- Versioning and updates will be challenging. We can mitigate some of this with testing at HEAD, but we'll inevitably hit situations where it's infeasible to work with multiple versions of docker.
- There are lots of different ways to install docker, networking might be challenging if we try to support many paths.
#### Vagrant
##### Pros
- We control the entire experience
- Networking might be easier to build
- Docker can't break us since we'll include a pinned version of Docker
- Easier to support rkt or hyper in the future
- Would let us run some things outside of containers (kubelet, maybe ingress/load balancers)
##### Cons
- More work
- Extra resources (if the user is also running docker-machine)
- Confusing if there are two docker daemons (images built in one can't be run in another)
- Always needs a VM, even on Linux
- Requires installing and possibly understanding Vagrant.
## Releases & Distribution
- Minikube will be released independent of Kubernetes core in order to facilitate fixing of issues that are outside of Kubernetes core.
- The latest version of Minikube is guaranteed to support the latest release of Kubernetes, including documentation.
- The Google Cloud SDK will package minikube and provide utilities for configuring kubectl to use it, but will not in any other way wrap minikube.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/local-cluster-ux.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

View File

@ -1,532 +1 @@
# Kubernetes for multiple platforms This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/multi-platform.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/multi-platform.md)
**Author**: Lucas Käldström ([@luxas](https://github.com/luxas))
**Status** (25th of August 2016): Some parts are already implemented; but still there quite a lot of work to be done.
## Abstract
We obviously want Kubernetes to run on as many platforms as possible, in order to make Kubernetes an even more powerful system.
This is a proposal that explains what should be done in order to achieve a true cross-platform container management system.
Kubernetes is written in Go, and Go code is portable across platforms.
Docker and rkt are also written in Go, and it's already possible to use them on various platforms.
When it's possible to run containers on a specific architecture, people also want to use Kubernetes to manage the containers.
In this proposal, a `platform` is defined as `operating system/architecture` or `${GOOS}/${GOARCH}` in Go terms.
The following platforms are proposed to be built for in a Kubernetes release:
- linux/amd64
- linux/arm (GOARM=6 initially, but we probably have to bump this to GOARM=7 because most other ARM platforms are ARMv7)
- linux/arm64
- linux/ppc64le
If there's interest in running Kubernetes on `linux/s390x` too, it won't require many changes to the source now that we've laid the groundwork for a multi-platform Kubernetes.
There is also work going on with porting Kubernetes to Windows (`windows/amd64`). See [this issue](https://github.com/kubernetes/kubernetes/issues/22623) for more details.
But note that when porting to a new OS like Windows, a lot of OS-specific changes have to be implemented before the cross-compiling, releasing, and other concerns this document describes apply.
## Motivation
Then the question probably is: Why?
In fact, making it possible to run Kubernetes on other platforms will enable people to create customized and highly-optimized solutions that exactly fit their hardware needs.
Example: [Paypal validates arm64 for real-time data analysis](http://www.datacenterdynamics.com/content-tracks/servers-storage/paypal-successfully-tests-arm-based-servers/93835.fullarticle)
Also, by bringing other platforms to the Kubernetes party, a healthy competition between platforms can/will take place.
Every platform obviously has both pros and cons. By adding the option to make clusters of mixed platforms, the end user may take advantage of the good sides of every platform.
## Use Cases
For a large enterprise where computing power is the king, one may imagine the following combinations:
- `linux/amd64`: For running most of the general-purpose computing tasks, cluster addons, etc.
- `linux/ppc64le`: For running highly-optimized software; especially massive compute tasks
- `windows/amd64`: For running services that are only compatible on windows; e.g. business applications written in C# .NET
For a mid-sized business where efficiency is most important, these could be combinations:
- `linux/amd64`: For running most of the general-purpose computing tasks, plus tasks that require very high single-core performance.
- `linux/arm64`: For running webservices and high-density tasks => the cluster could autoscale in a way that `linux/amd64` machines could hibernate at night in order to minimize power usage.
For a small business or university, arm is often sufficient:
- `linux/arm`: Draws very little power, and can run web sites and app backends efficiently on Scaleway for example.
And last but not least; Raspberry Pi's should be used for [education at universities](http://kubecloud.io/) and are great for **demoing Kubernetes' features at conferences.**
## Main proposal
### Release binaries for all platforms
First and foremost, binaries have to be released for all platforms.
This affects the build-release tools. Fortunately, this is quite straightforward to implement, once you understand how Go cross-compilation works.
Since Kubernetes' release and build jobs run on `linux/amd64`, binaries have to be cross-compiled and Docker images should be cross-built.
Builds should be run in a Docker container in order to get reproducible builds; and `gcc` should be installed for all platforms inside that image (`kube-cross`)
All released binaries should be uploaded to `https://storage.googleapis.com/kubernetes-release/release/${version}/bin/${os}/${arch}/${binary}`
This is a fairly long topic. If you're interested how to cross-compile, see [details about cross-compilation](#cross-compilation-details)
### Support all platforms in a "run everywhere" deployment
The easiest way of running Kubernetes on another architecture at the time of writing is probably by using the docker-multinode deployment. Of course, you may choose whatever deployment you want, the binaries are easily downloadable from the URL above.
[docker-multinode](https://github.com/kubernetes/kube-deploy/tree/master/docker-multinode) is intended to be a "kick-the-tires" multi-platform solution with Docker as the only real dependency (but it's not production ready)
But when we (`sig-cluster-lifecycle`) have standardized the deployments to about three and made them production ready; at least one deployment should support **all platforms**.
### Set up build and e2e CIs
#### Build CI
Kubernetes should always enforce that all binaries are compiling.
**On every PR, `make release` has to be run** in order to require the code proposed to be merged to be compatible with all architectures.
For more information, see [conflicts](#conflicts)
#### e2e CI
To ensure all functionality really is working on all other platforms, the community should be able to set up a CI.
To be able to do that, all the test-specific images have to be ported to multiple architectures, and the test images should preferably be manifest lists.
If the test images aren't manifest lists, the test code should automatically choose the right image based on the image naming.
IBM volunteered to run continuously running e2e tests for `linux/ppc64le`.
Still, it's hard to set up such a CI (even on `linux/amd64`), but that work belongs to `kubernetes/test-infra` proposals.
When it's possible to test Kubernetes using Kubernetes; volunteers should be given access to publish their results on `k8s-testgrid.appspot.com`.
### Official support level
When all e2e tests are passing for a given platform; the platform should be officially supported by the Kubernetes team.
At the time of writing, `amd64` is in the officially supported category.
When a platform is building and it's possible to set up a cluster with the core functionality, the platform is supported on a "best-effort" and experimental basis.
At the time of writing, `arm`, `arm64` and `ppc64le` are in the experimental category; the e2e tests aren't cross-platform yet.
### Docker image naming and manifest lists
#### Docker manifest lists
Here's a good article about how the "manifest list" in the Docker image [manifest spec v2](https://github.com/docker/distribution/pull/1068) works: [A step towards multi-platform Docker images](https://integratedcode.us/2016/04/22/a-step-towards-multi-platform-docker-images/)
A short summary: A manifest list is a list of Docker images with a single name (e.g. `busybox`), that holds layers for multiple platforms _when it's stored in a registry_.
When the image is pulled by a client (`docker pull busybox`), only layers for the target platforms are downloaded.
Right now we have to write `busybox-${ARCH}` for example instead, but that leads to extra scripting and unnecessary logic.
For reference see [docker/docker#24739](https://github.com/docker/docker/issues/24739) and [appc/docker2aci#193](https://github.com/appc/docker2aci/issues/193)
#### Image naming
There has been quite a lot of debate about how we should name non-amd64 Docker images that are pushed to `gcr.io`. See [#23059](https://github.com/kubernetes/kubernetes/pull/23059) and [#23009](https://github.com/kubernetes/kubernetes/pull/23009).
This means that the naming `gcr.io/google_containers/${binary}:${version}` should contain a _manifest list_ for future tags.
The manifest list thereby becomes a wrapper that is pointing to the `-${arch}` images.
This requires `docker-1.10` or newer, which probably means Kubernetes v1.4 and higher.
TL;DR;
- `${binary}-${arch}:${version}` images should be pushed for all platforms
- `${binary}:${version}` images should point to the `-${arch}`-specific ones, and docker will then download the right image.
### Components should expose their platform
It should be possible to run clusters with mixed platforms smoothly. After all, bringing heterogeneous machines together into a single unit (a cluster) is one of Kubernetes' greatest strengths. And since Kubernetes' components communicate over HTTP, two binaries of different architectures may talk to each other normally.
The crucial thing here is that the components that handle platform-specific tasks (e.g. the kubelet) should expose their platform. In the kubelet case, we've initially solved it by exposing the labels `beta.kubernetes.io/{os,arch}` on every node. This way a user may run binaries for different platforms on a multi-platform cluster, but it still requires manual work to apply the label to every manifest.
Also, [the apiserver now exposes](https://github.com/kubernetes/kubernetes/pull/19905) its platform at `GET /version`. But note that the value exposed at `/version` is only the apiserver's platform; there might be kubelets of various other platforms.
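As a rough sketch (not part of this proposal's implementation), a client could read the apiserver's platform from `GET /version` like this, assuming the endpoint is reachable without authentication, e.g. through `kubectl proxy` on its default port:
```go
package main
import (
	"encoding/json"
	"fmt"
	"net/http"
)
// versionInfo mirrors the fields of interest in the JSON served at /version.
type versionInfo struct {
	GitVersion string `json:"gitVersion"`
	Platform   string `json:"platform"` // e.g. "linux/amd64"
}
func main() {
	resp, err := http.Get("http://127.0.0.1:8001/version")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	var v versionInfo
	if err := json.NewDecoder(resp.Body).Decode(&v); err != nil {
		panic(err)
	}
	// Note: this is only the apiserver's platform; kubelets may run on other platforms.
	fmt.Printf("apiserver %s runs on %s\n", v.GitVersion, v.Platform)
}
```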
### Standardize all image Makefiles to follow the same pattern
All Makefiles should push for all platforms when doing `make push`, and build for all platforms when doing `make build`.
Under the hood, they should compile binaries in a container for reproducibility, and use QEMU for emulating Dockerfile `RUN` commands if necessary.
### Remove linux/amd64 hard-codings from the codebase
All places where `linux/amd64` is hardcoded in the codebase should be rewritten.
#### Make kubelet automatically use the right pause image
The `pause` container is used for connecting containers into Pods. It's a binary that just sleeps forever.
When Kubernetes starts up a Pod, it first starts a `pause` container, and lets all "real" containers join the same network by setting `--net=${pause_container_id}`.
So in order to start Kubernetes Pods on any other architecture, an ever-sleeping image has to exist for that architecture.
Fortunately, `kubelet` has the `--pod-infra-container-image` option, and it has been used when running Kubernetes on other platforms.
But relying on the deployment setup to specify the right image for the platform isn't great; the kubelet should be smarter than that.
This specific problem has been fixed in [#23059](https://github.com/kubernetes/kubernetes/pull/23059).
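A minimal sketch of the idea, assuming the image follows the `-${arch}` naming convention described above (the exact default image name and tag are illustrative, not the kubelet's real defaults):
```go
package main
import (
	"fmt"
	"runtime"
)
// defaultPauseImage is used when --pod-infra-container-image isn't set.
// The registry and tag are illustrative placeholders.
func defaultPauseImage() string {
	return fmt.Sprintf("gcr.io/google_containers/pause-%s:3.0", runtime.GOARCH)
}
func main() {
	fmt.Println(defaultPauseImage()) // e.g. gcr.io/google_containers/pause-amd64:3.0
}
```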
#### Vendored packages
Here are two common problems that a vendored package might have when trying to add/update it:
- Including constants combined with build tags
```go
// +build linux,amd64
const AnAmd64OnlyConstant = 123
```
- Relying on platform-specific syscalls (e.g. `syscall.Dup2`)
If someone tries to add or update a dependency that has one of these problems, the CI will catch it and block the PR until the author has updated the vendored repo and fixed the problem.
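As an illustration of the usual fix, platform-specific syscalls can be isolated in files guarded by build tags. The package and helper below are hypothetical; the point is that `syscall.Dup2` isn't available on `linux/arm64`, where `syscall.Dup3` with zero flags is the equivalent:
```go
// dup_other.go
// +build linux,!arm64

package fdutil
import "syscall"
func dupFD(oldfd, newfd int) error {
	return syscall.Dup2(oldfd, newfd)
}
```
```go
// dup_arm64.go
// +build linux,arm64

package fdutil
import "syscall"
func dupFD(oldfd, newfd int) error {
	// arm64 has no dup2 syscall; dup3 with flags=0 behaves the same.
	return syscall.Dup3(oldfd, newfd, 0)
}
```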
### kubectl should be released for all platforms that are relevant
kubectl is released for more platforms than the proposed server platforms; if you want to check out an up-to-date list of them, [see here](../../hack/lib/golang.sh).
kubectl is trivial to cross-compile, so if there's interest in adding a new platform for it, it may be as easy as appending the platform to the list linked above.
### Addons
Addons like dns, heapster and ingress play a big role in a working Kubernetes cluster, and we should aim to be able to deploy these addons on multiple platforms too.
`kube-dns`, `dashboard` and `addon-manager` are the most important images, and they are already ported for multiple platforms.
These addons should also be converted to multiple platforms:
- heapster, influxdb + grafana
- nginx-ingress
- elasticsearch, fluentd + kibana
- registry
### Conflicts
What should we do if there's a conflict between keeping e.g. `linux/ppc64le` builds vs. merging a release blocker?
In fact, we faced this problem while this proposal was being written; in [#25243](https://github.com/kubernetes/kubernetes/pull/25243). It is quite obvious that the release blocker is of higher priority.
However, before temporarily [deactivating builds](https://github.com/kubernetes/kubernetes/commit/2c9b83f291e3e506acc3c08cd10652c255f86f79), the author of the breaking PR should first try to fix the problem. If it turns out being really hard to solve, builds for the affected platform may be deactivated and a P1 issue should be made to activate them again.
## Cross-compilation details (for reference)
### Go language details
Go 1.5 introduced many changes. To name a few that are relevant to Kubernetes:
- C was eliminated from the tree (it was earlier used for the bootstrap runtime).
- All processors are used by default, which means we should be able to remove [lines like this one](https://github.com/kubernetes/kubernetes/blob/v1.2.0/cmd/kubelet/kubelet.go#L37)
- The garbage collector became more efficient (but also [confused our latency test](https://github.com/golang/go/issues/14396)).
- `linux/arm64` and `linux/ppc64le` were added as new ports.
- The `GO15VENDOREXPERIMENT` was started. We switched from `Godeps/_workspace` to the native `vendor/` in [this PR](https://github.com/kubernetes/kubernetes/pull/24242).
- It's not required to pre-build the whole standard library `std` when cross-compiling. [Details](#prebuilding-the-standard-library-std)
- Builds are approximately twice as slow as earlier. That affects the CI. [Details](#releasing)
- The native Go DNS resolver will suffice in most situations. This makes static linking much easier.
All release notes for Go 1.5 [are here](https://golang.org/doc/go1.5)
Go 1.6 didn't introduce as many changes as Go 1.5 did, but here are some of note:
- It should perform a little bit better than Go 1.5.
- `linux/mips64` and `linux/mips64le` were added as new ports.
- Go < 1.6.2 for `ppc64le` had [bugs in it](https://github.com/kubernetes/kubernetes/issues/24922).
All release notes for Go 1.6 [are here](https://golang.org/doc/go1.6)
In Kubernetes 1.2, the only supported Go version was `1.4.2`, so `linux/arm` was the only possible extra architecture: [#19769](https://github.com/kubernetes/kubernetes/pull/19769).
In Kubernetes 1.3, [we upgraded to Go 1.6](https://github.com/kubernetes/kubernetes/pull/22149), which made it possible to build Kubernetes for even more architectures [#23931](https://github.com/kubernetes/kubernetes/pull/23931).
#### The `sync/atomic` bug on 32-bit platforms
From https://golang.org/pkg/sync/atomic/#pkg-note-BUG:
> On both ARM and x86-32, it is the caller's responsibility to arrange for 64-bit alignment of 64-bit words accessed atomically. The first word in a global variable or in an allocated struct or slice can be relied upon to be 64-bit aligned.
`etcd` has had [issues](https://github.com/coreos/etcd/issues/2308) with this. See [how to fix it here](https://github.com/coreos/etcd/pull/3249).
```go
// 32-bit-atomic-bug.go
package main
import "sync/atomic"
type a struct {
b chan struct{}
c int64
}
func main(){
d := a{}
atomic.StoreInt64(&d.c, 10 * 1000 * 1000 * 1000)
}
```
```console
$ GOARCH=386 go build 32-bit-atomic-bug.go
$ file 32-bit-atomic-bug
32-bit-atomic-bug: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, not stripped
$ ./32-bit-atomic-bug
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x808cd9b]
goroutine 1 [running]:
panic(0x8098de0, 0x1830a038)
/usr/local/go/src/runtime/panic.go:481 +0x326
sync/atomic.StoreUint64(0x1830e0f4, 0x540be400, 0x2)
/usr/local/go/src/sync/atomic/asm_386.s:190 +0xb
main.main()
/tmp/32-bit-atomic-bug.go:11 +0x4b
```
This means that, to be safe, all structs should keep their atomically accessed `int64` and `uint64` fields at the top of the struct. If we moved `a.c` to the top of the `a` struct above, the operation would succeed.
The bug affects 32-bit platforms when a `(u)int64` field is accessed by an `atomic` method.
It would be great to write a tool that checks that all atomically accessed fields are aligned at the top of the struct, but it's hard: [coreos/etcd#5027](https://github.com/coreos/etcd/issues/5027).
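For reference, a sketch of the safe layout for the example above, with the atomically accessed field moved to the front of the struct:
```go
package main
import "sync/atomic"
// a keeps its atomically accessed int64 first, so it is guaranteed to be
// 64-bit aligned even on 32-bit platforms.
type a struct {
	c int64 // accessed with sync/atomic
	b chan struct{}
}
func main() {
	d := a{}
	atomic.StoreInt64(&d.c, 10*1000*1000*1000) // no longer panics with GOARCH=386
}
```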
## Prebuilding the Go standard library (`std`)
There is a great blog post [describing this](https://medium.com/@rakyll/go-1-5-cross-compilation-488092ba44ec#.5jcd0owem).
Before Go 1.5, the whole Go project had to be cross-compiled from source for **all** platforms that _might_ be used, and that was quite a slow process:
```console
# From build-tools/build-image/cross/Dockerfile when we used Go 1.4
$ cd /usr/src/go/src
$ for platform in ${PLATFORMS}; do GOOS=${platform%/*} GOARCH=${platform##*/} ./make.bash --no-clean; done
```
With Go 1.5+, cross-compiling the Go repository isn't required anymore. Go will automatically cross-compile the `std` packages that are being used by the code that is being compiled, _and throw it away after the compilation_.
If you cross-compile multiple times, Go will build parts of `std`, throw it away, compile parts of it again, throw that away and so on.
However, there is an easy way of cross-compiling all `std` packages in advance with Go 1.5+:
```console
# From build-tools/build-image/cross/Dockerfile when we're using Go 1.5+
$ for platform in ${PLATFORMS}; do GOOS=${platform%/*} GOARCH=${platform##*/} go install std; done
```
### Static cross-compilation
Static compilation with Go 1.5+ is dead easy:
```go
// main.go
package main
import "fmt"
func main() {
fmt.Println("Hello Kubernetes!")
}
```
```console
$ go build main.go
$ file main
main: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped
$ GOOS=linux GOARCH=arm go build main.go
$ file main
main: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), statically linked, not stripped
```
The only thing you have to do is change the `GOARCH` and `GOOS` variables. Here's a list of valid values for [GOOS/GOARCH](https://golang.org/doc/install/source#environment)
#### Static compilation with `net`
Consider this:
```go
// main-with-net.go
package main
import "net"
import "fmt"
func main() {
fmt.Println(net.ParseIP("10.0.0.10").String())
}
```
```console
$ go build main-with-net.go
$ file main-with-net
main-with-net: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked,
interpreter /lib64/ld-linux-x86-64.so.2, not stripped
$ GOOS=linux GOARCH=arm go build main-with-net.go
$ file main-with-net
main-with-net: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), statically linked, not stripped
```
Wait, what? Just because we included `net` from the `std` package, the binary defaults to being dynamically linked when the target platform equals the host platform?
Let's take a look at `go env` to get a clue why this happens:
```console
$ go env
GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/go"
GOROOT="/usr/local/go"
GO15VENDOREXPERIMENT="1"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
```
See the `CGO_ENABLED=1` at the end? That's where compilation for the host and cross-compilation differ. By default, Go will link statically if no `cgo` code is involved. `net` is one of the packages that prefers `cgo`, but doesn't depend on it.
When cross-compiling on the other hand, `CGO_ENABLED` is set to `0` by default.
To always be safe, run this when compiling statically:
```console
$ CGO_ENABLED=0 go build -a -installsuffix cgo main-with-net.go
$ file main-with-net
main-with-net: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped
```
See [golang/go#9344](https://github.com/golang/go/issues/9344) for more details.
### Dynamic cross-compilation
In order to dynamically compile a go binary with `cgo`, we need `gcc` installed at build time.
The only Kubernetes binary that uses C code is the `kubelet`, or in fact `cAdvisor` on which `kubelet` depends. `hyperkube` is also dynamically linked as long as `kubelet` is. We should aim to make `kubelet` statically linked.
The normal `x86_64-linux-gnu` gcc can't cross-compile binaries, so we have to install gcc cross-compilers for every platform. We do this in the [`kube-cross`](../../build-tools/build-image/cross/Dockerfile) image,
and depend on the [`emdebian.org` repository](https://wiki.debian.org/CrossToolchains). Depending on `emdebian` isn't ideal, so we should consider using the latest `gcc` cross-compiler packages from the `ubuntu` main repositories in the future.
Here's an example when cross-compiling plain C code:
```c
// main.c
#include <stdio.h>
int main()
{
printf("Hello Kubernetes!\n");
}
```
```console
$ arm-linux-gnueabi-gcc -o main-c main.c
$ file main-c
main-c: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked,
interpreter /lib/ld-linux.so.3, for GNU/Linux 2.6.32, not stripped
```
And here's an example when cross-compiling `go` and `c`:
```go
// main-cgo.go
package main
/*
char* sayhello(void) { return "Hello Kubernetes!"; }
*/
import "C"
import "fmt"
func main() {
fmt.Println(C.GoString(C.sayhello()))
}
```
```console
$ CGO_ENABLED=1 CC=arm-linux-gnueabi-gcc GOOS=linux GOARCH=arm go build main-cgo.go
$ file main-cgo
./main-cgo: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked,
interpreter /lib/ld-linux.so.3, for GNU/Linux 2.6.32, not stripped
```
The bad thing about dynamic compilation is that it adds an unnecessary dependency on `glibc` _at runtime_.
### Static compilation with CGO code
Lastly, it's even possible to cross-compile `cgo` code _statically_:
```console
$ CGO_ENABLED=1 CC=arm-linux-gnueabi-gcc GOARCH=arm go build -ldflags '-extldflags "-static"' main-cgo.go
$ file main-cgo
./main-cgo: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), statically linked,
for GNU/Linux 2.6.32, not stripped
```
This is especially useful if we want to include the binary in a container.
If the binary is statically compiled, we may use `busybox` or even `scratch` as the base image.
This should be the preferred way of compiling binaries that strictly require C code to be a part of it.
#### GOARM
32-bit ARM comes in two main flavours: ARMv5 and ARMv7. Go has the `GOARM` environment variable that controls which version of ARM Go should target. Here's a table of all ARM versions and how they play together:
ARM Version | GOARCH | GOARM | GCC package | No. of bits
----------- | ------ | ----- | ----------- | -----------
ARMv5 | arm | 5 | armel | 32-bit
ARMv6 | arm | 6 | - | 32-bit
ARMv7 | arm | 7 | armhf | 32-bit
ARMv8 | arm64 | - | aarch64 | 64-bit
The compatibility between the versions is pretty straightforward: ARMv5 binaries may run on ARMv7 hosts, but not vice versa.
## Cross-building docker images for linux
After binaries have been cross-compiled, they should be distributed in some manner.
The default and maybe the most intuitive way of doing this is by packaging it in a docker image.
### Trivial Dockerfile
All `Dockerfile` commands except for `RUN` work for any architecture without any modification.
The base image has to be switched to an arch-specific one, but apart from that, a cross-built image is only a `docker build` away.
```Dockerfile
FROM armel/busybox
ENV kubernetes=true
COPY kube-apiserver /usr/local/bin/
CMD ["/usr/local/bin/kube-apiserver"]
```
```console
$ file kube-apiserver
kube-apiserver: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), statically linked, not stripped
$ docker build -t gcr.io/google_containers/kube-apiserver-arm:v1.x.y .
Step 1 : FROM armel/busybox
---> 9bb1e6d4f824
Step 2 : ENV kubernetes true
---> Running in 8a1bfcb220ac
---> e4ef9f34236e
Removing intermediate container 8a1bfcb220ac
Step 3 : COPY kube-apiserver /usr/local/bin/
---> 3f0c4633e5ac
Removing intermediate container b75a054ab53c
Step 4 : CMD /usr/local/bin/kube-apiserver
---> Running in 4e6fe931a0a5
---> 28f50e58c909
Removing intermediate container 4e6fe931a0a5
Successfully built 28f50e58c909
```
### Complex Dockerfile
However, in the most cases, `RUN` statements are needed when building the image.
The `RUN` statement invokes `/bin/sh` inside the container, but in this example, `/bin/sh` is an ARM binary, which can't execute on an `amd64` processor.
#### QEMU to the rescue
Here's a way to run ARM Docker images on an amd64 host by using `qemu`:
```console
# Register other architectures' magic numbers in the binfmt_misc kernel module, so it's possible to run foreign binaries
$ docker run --rm --privileged multiarch/qemu-user-static:register --reset
# Download qemu 2.5.0
$ curl -sSL https://github.com/multiarch/qemu-user-static/releases/download/v2.5.0/x86_64_qemu-arm-static.tar.xz \
| tar -xJ
# Run a foreign docker image, and inject the amd64 qemu binary for translating all syscalls
$ docker run -it -v $(pwd)/qemu-arm-static:/usr/bin/qemu-arm-static armel/busybox /bin/sh
# Now we're inside an ARM container although we're running on an amd64 host
$ uname -a
Linux 0a7da80f1665 4.2.0-25-generic #30-Ubuntu SMP Mon Jan 18 12:31:50 UTC 2016 armv7l GNU/Linux
```
Here, a Linux kernel module called `binfmt_misc` registers the "magic numbers" of the other architectures in the kernel, so the kernel can detect which architecture a binary was built for and prepend the call with `/usr/bin/qemu-(arm|aarch64|ppc64le)-static`. For example, `/usr/bin/qemu-arm-static` is a statically linked `amd64` binary that translates all ARM syscalls to `amd64` syscalls.
The multiarch guys have done a great job here; you may find the source for this and other images on [GitHub](https://github.com/multiarch)
## Implementation
## History
32-bit ARM (`linux/arm`) was the first platform Kubernetes was ported to, and luxas' project [`Kubernetes on ARM`](https://github.com/luxas/kubernetes-on-arm) (released on GitHub the 31st of September 2015)
served as a way of running Kubernetes on ARM devices easily.
On the 30th of November 2015, a tracking issue about making Kubernetes run on ARM was opened: [#17981](https://github.com/kubernetes/kubernetes/issues/17981). It later shifted focus to how to make Kubernetes a more platform-independent system.
On the 27th of April 2016, Kubernetes `v1.3.0-alpha.3` was released, and it became the first release that was able to run the [docker getting started guide](http://kubernetes.io/docs/getting-started-guides/docker/) on `linux/amd64`, `linux/arm`, `linux/arm64` and `linux/ppc64le` without any modification.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/multi-platform.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

View File

@ -1,138 +1 @@
# Multi-Scheduler in Kubernetes This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/multiple-schedulers.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/multiple-schedulers.md)
**Status**: Design & Implementation in progress.
> Contact @HaiyangDING for questions & suggestions.
## Motivation
In the current Kubernetes design, there is only one default scheduler in a Kubernetes cluster.
However, it is common that multiple types of workload, such as traditional batch, DAG batch, streaming and user-facing production services,
are running in the same cluster and they need to be scheduled in different ways. For example, in
[Omega](http://research.google.com/pubs/pub41684.html) batch workload and service workload are scheduled by two types of schedulers:
the batch workload is scheduled by a scheduler which looks at the current usage of the cluster to improve the resource usage rate
and the service workload is scheduled by another one which considers the reserved resources in the
cluster and many other constraints since their performance must meet some higher SLOs.
[Mesos](http://mesos.apache.org/) has done great work to support multiple schedulers by building a
two-level scheduling structure. This proposal describes how Kubernetes is going to support multiple schedulers
so that users are able to run their own scheduler(s) to enable the customized scheduling
behavior they need. As previously discussed in [#11793](https://github.com/kubernetes/kubernetes/issues/11793),
[#9920](https://github.com/kubernetes/kubernetes/issues/9920) and [#11470](https://github.com/kubernetes/kubernetes/issues/11470),
the design of the multiple scheduler should be generic and includes adding a scheduler name annotation to separate the pods.
It is worth mentioning that the proposal does not address the question of how the scheduler name annotation gets
set although it is reasonable to anticipate that it would be set by a component like admission controller/initializer,
as the doc currently does.
Before going into the details of this proposal, below is a list of methods to extend the scheduler:
- Write your own scheduler and run it along with Kubernetes native scheduler. This is going to be detailed in this proposal
- Use the callout approach such as the one implemented in [#13580](https://github.com/kubernetes/kubernetes/issues/13580)
- Recompile the scheduler with a new policy
- Restart the scheduler with a new [scheduler policy config file](../../examples/scheduler-policy-config.json)
- Or maybe in future dynamically link a new policy into the running scheduler
## Challenges in multiple schedulers
- Separating the pods
Each pod should be scheduled by only one scheduler. As for implementation, a pod should
have an additional field to tell by which scheduler it wants to be scheduled. Besides,
each scheduler, including the default one, should have a unique logic of how to add unscheduled
pods to its to-be-scheduled pod queue. Details will be explained in later sections.
- Dealing with conflicts
Different schedulers are essentially separated processes. When all schedulers try to schedule
their pods onto the nodes, there might be conflicts.
One example of such a conflict is resource racing: Suppose there is a `pod1` scheduled by
`my-scheduler` with a 1 CPU *request*, and a `pod2` scheduled by `kube-scheduler` (the k8s native
scheduler, acting as default scheduler) with a 2 CPU *request*, while `node-a` only has 2.5
free CPUs. If both schedulers try to put their pods on `node-a`, then one of them would eventually
fail when the Kubelet on `node-a` performs the create action, due to insufficient CPU resources.
This conflict is complex to deal with in the api-server and etcd. Our current solution is to let the Kubelet
do the conflict check; if a conflict happens, affected pods are put back to their scheduler
and wait to be scheduled again. Implementation details are in later sections.
## Where to start: initial design
We definitely want the multi-scheduler design to be a generic mechanism. The following lists the changes
we want to make in the first step.
- Add an annotation in pod template: `scheduler.alpha.kubernetes.io/name: scheduler-name`, this is used to
separate pods between schedulers. `scheduler-name` should match one of the schedulers' `scheduler-name`
- Add a `scheduler-name` to each scheduler. It is done by hardcoding it or passing it as a command-line argument. The
Kubernetes native scheduler (now `kube-scheduler` process) would have the name as `kube-scheduler`
- The `scheduler-name` plays an important part in separating the pods between different schedulers.
Pods are statically dispatched to different schedulers based on `scheduler.alpha.kubernetes.io/name: scheduler-name`
annotation and there should not be any conflicts between different schedulers handling their pods, i.e. one pod must
NOT be claimed by more than one scheduler. To be specific, a scheduler can add a pod to its queue if and only if:
1. The pod has no nodeName, **AND**
2. The `scheduler-name` specified in the pod's annotation `scheduler.alpha.kubernetes.io/name: scheduler-name`
matches the `scheduler-name` of the scheduler.
The only one exception is the default scheduler. Any pod that has no `scheduler.alpha.kubernetes.io/name: scheduler-name`
annotation is assumed to be handled by the "default scheduler". In the first version of the multi-scheduler feature,
the default scheduler would be the Kubernetes built-in scheduler with `scheduler-name` as `kube-scheduler`.
The Kubernetes built-in scheduler will claim any pod which has no `scheduler.alpha.kubernetes.io/name: scheduler-name`
annotation or which has `scheduler.alpha.kubernetes.io/name: kube-scheduler`. In the future, it may be possible to
change which scheduler is the default for a given cluster.
- Dealing with conflicts. All schedulers must use predicate functions that are at least as strict as
the ones that Kubelet applies when deciding whether to accept a pod, otherwise Kubelet and scheduler
may get into an infinite loop where Kubelet keeps rejecting a pod and scheduler keeps re-scheduling
it back to the same node. To make it easier for people who write new schedulers to obey this rule, we will
create a library containing the predicates Kubelet uses. (See issue [#12744](https://github.com/kubernetes/kubernetes/issues/12744).)
In summary, in the initial version of this multi-scheduler design, we will achieve the following:
- If a pod has the annotation `scheduler.alpha.kubernetes.io/name: kube-scheduler` or the user does not explicitly
set this annotation in the template, it will be picked up by the default scheduler
- If the annotation is set and refers to a valid `scheduler-name`, it will be picked up by the scheduler of
specified `scheduler-name`
- If the annotation is set but refers to an invalid `scheduler-name`, the pod will not be picked up by any scheduler.
The pod will remain PENDING.
### An example
```yaml
kind: Pod
apiVersion: v1
metadata:
name: pod-abc
labels:
foo: bar
annotations:
scheduler.alpha.kubernetes.io/name: my-scheduler
```
This pod will be scheduled by "my-scheduler" and ignored by "kube-scheduler". If there is no running scheduler
of name "my-scheduler", the pod will never be scheduled.
## Next steps
1. Use admission controller to add and verify the annotation, and do some modification if necessary. For example, the
admission controller might add the scheduler annotation based on the namespace of the pod, and/or identify if
there are conflicting rules, and/or set a default value for the scheduler annotation, and/or reject pods on
which the client has set a scheduler annotation that does not correspond to a running scheduler.
2. Dynamically launching scheduler(s) and registering them with the admission controller (as an external call). This also
requires some work on authorization and authentication to control what schedulers can write the /binding
subresource of which pods.
3. Optimize the behavior of priority functions in the multi-scheduler scenario. In the case where multiple schedulers have
the same predicate and priority functions (for example, when using multiple schedulers for parallelism rather than to
customize the scheduling policies), all schedulers would tend to pick the same node as "best" when scheduling identical
pods and therefore would be likely to conflict on the Kubelet. To solve this problem, we can pass
an optional flag such as `--randomize-node-selection=N` to scheduler, setting this flag would cause the scheduler to pick
randomly among the top N nodes instead of the one with the highest score.
## Other issues/discussions related to scheduler design
- [#13580](https://github.com/kubernetes/kubernetes/pull/13580): scheduler extension
- [#17097](https://github.com/kubernetes/kubernetes/issues/17097): policy config file in pod template
- [#16845](https://github.com/kubernetes/kubernetes/issues/16845): scheduling groups of pods
- [#17208](https://github.com/kubernetes/kubernetes/issues/17208): guide to writing a new scheduler
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/multiple-schedulers.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

View File

@ -1,304 +1 @@
# NetworkPolicy This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/network-policy.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/network-policy.md)
## Abstract
A proposal for implementing a new resource - NetworkPolicy - which
will enable definition of ingress policies for selections of pods.
The design for this proposal has been created by, and discussed
extensively within the Kubernetes networking SIG. It has been implemented
and tested using Kubernetes API extensions by various networking solutions already.
In this design, users can create various NetworkPolicy objects which select groups of pods and
define how those pods should be allowed to communicate with each other. The
implementation of that policy at the network layer is left up to the
chosen networking solution.
> Note that this proposal does not yet include egress / cidr-based policy, which is still actively undergoing discussion in the SIG. These are expected to augment this proposal in a backwards compatible way.
## Implementation
The implementation in Kubernetes consists of:
- A v1beta1 NetworkPolicy API object
- A structure on the `Namespace` object to control policy, to be developed as an annotation for now.
### Namespace changes
The following objects will be defined on a Namespace Spec.
>NOTE: In v1beta1 the Namespace changes will be implemented as an annotation.
```go
type IngressIsolationPolicy string
const (
// Deny all ingress traffic to pods in this namespace. Ingress means
// any incoming traffic to pods, whether that be from other pods within this namespace
// or any source outside of this namespace.
DefaultDeny IngressIsolationPolicy = "DefaultDeny"
)
// Standard NamespaceSpec object, modified to include a new
// NamespaceNetworkPolicy field.
type NamespaceSpec struct {
// This is a pointer so that it can be left undefined.
NetworkPolicy *NamespaceNetworkPolicy `json:"networkPolicy,omitempty"`
}
type NamespaceNetworkPolicy struct {
// Ingress configuration for this namespace. This config is
// applied to all pods within this namespace. For now, only
// ingress is supported. This field is optional - if not
// defined, then the cluster default for ingress is applied.
Ingress *NamespaceIngressPolicy `json:"ingress,omitempty"`
}
// Configuration for ingress to pods within this namespace.
// For now, this only supports specifying an isolation policy.
type NamespaceIngressPolicy struct {
// The isolation policy to apply to pods in this namespace.
// Currently this field only supports "DefaultDeny", but could
// be extended to support other policies in the future. When set to DefaultDeny,
// pods in this namespace are denied ingress traffic by default. When not defined,
// the cluster default ingress isolation policy is applied (currently allow all).
Isolation *IngressIsolationPolicy `json:"isolation,omitempty"`
}
```
```yaml
kind: Namespace
apiVersion: v1
spec:
networkPolicy:
ingress:
isolation: DefaultDeny
```
The above structures will be represented in v1beta1 as a json encoded annotation like so:
```yaml
kind: Namespace
apiVersion: v1
metadata:
annotations:
net.beta.kubernetes.io/network-policy: |
{
"ingress": {
"isolation": "DefaultDeny"
}
}
```
### NetworkPolicy Go Definition
For a namespace with ingress isolation, connections to pods in that namespace (from any source) are prevented.
The user needs a way to explicitly declare which connections are allowed into pods of that namespace.
This is accomplished through ingress rules on `NetworkPolicy`
objects (of which there can be multiple in a single namespace). Pods selected by
one or more NetworkPolicy objects should allow any incoming connections that match any
ingress rule on those NetworkPolicy objects, per the network plugin's capabilities.
NetworkPolicy objects and the above namespace isolation both act on _connections_ rather than individual packets. That is to say that if traffic from pod A to pod B is allowed by the configured
policy, then the return packets for that connection from B -> A are also allowed, even if the policy in place would not allow B to initiate a connection to A. NetworkPolicy objects act on a broad definition of _connection_ which includes both TCP and UDP streams. If a new network policy is applied that would block an existing connection between two endpoints, the enforcer of policy
should terminate and block the existing connection as soon as can be expected by the implementation.
We propose adding the new NetworkPolicy object to the `extensions/v1beta1` API group for now.
The SIG also considered the following while developing the proposed NetworkPolicy object:
- A per-pod policy field. We discounted this in favor of the loose coupling that labels provide, similar to Services.
- Per-Service policy. We chose not to attach network policy to services to avoid semantic overloading of a single object, and conflating the existing semantics of load-balancing and service discovery with those of network policy.
```go
type NetworkPolicy struct {
TypeMeta
ObjectMeta
// Specification of the desired behavior for this NetworkPolicy.
Spec NetworkPolicySpec
}
type NetworkPolicySpec struct {
// Selects the pods to which this NetworkPolicy object applies. The array of ingress rules
// is applied to any pods selected by this field. Multiple network policies can select the
// same set of pods. In this case, the ingress rules for each are combined additively.
// This field is NOT optional and follows standard unversioned.LabelSelector semantics.
// An empty podSelector matches all pods in this namespace.
PodSelector unversioned.LabelSelector `json:"podSelector"`
// List of ingress rules to be applied to the selected pods.
// Traffic is allowed to a pod if namespace.networkPolicy.ingress.isolation is undefined and cluster policy allows it,
// OR if the traffic source is the pod's local node,
// OR if the traffic matches at least one ingress rule across all of the NetworkPolicy
// objects whose podSelector matches the pod.
// If this field is empty then this NetworkPolicy does not affect ingress isolation.
// If this field is present and contains at least one rule, this policy allows any traffic
// which matches at least one of the ingress rules in this list.
Ingress []NetworkPolicyIngressRule `json:"ingress,omitempty"`
}
// This NetworkPolicyIngressRule matches traffic if and only if the traffic matches both ports AND from.
type NetworkPolicyIngressRule struct {
// List of ports which should be made accessible on the pods selected for this rule.
// Each item in this list is combined using a logical OR.
// If this field is not provided, this rule matches all ports (traffic not restricted by port).
// If this field is empty, this rule matches no ports (no traffic matches).
// If this field is present and contains at least one item, then this rule allows traffic
// only if the traffic matches at least one port in the ports list.
Ports *[]NetworkPolicyPort `json:"ports,omitempty"`
// List of sources which should be able to access the pods selected for this rule.
// Items in this list are combined using a logical OR operation.
// If this field is not provided, this rule matches all sources (traffic not restricted by source).
// If this field is empty, this rule matches no sources (no traffic matches).
// If this field is present and contains at least one item, this rule allows traffic only if the
// traffic matches at least one item in the from list.
From *[]NetworkPolicyPeer `json:"from,omitempty"`
}
type NetworkPolicyPort struct {
// Optional. The protocol (TCP or UDP) which traffic must match.
// If not specified, this field defaults to TCP.
Protocol *api.Protocol `json:"protocol,omitempty"`
// If specified, the port on the given protocol. This can
// either be a numerical or named port. If this field is not provided,
// this matches all port names and numbers.
// If present, only traffic on the specified protocol AND port
// will be matched.
Port *intstr.IntOrString `json:"port,omitempty"`
}
type NetworkPolicyPeer struct {
// Exactly one of the following must be specified.
// This is a label selector which selects Pods in this namespace.
// This field follows standard unversioned.LabelSelector semantics.
// If present but empty, this selector selects all pods in this namespace.
PodSelector *unversioned.LabelSelector `json:"podSelector,omitempty"`
// Selects Namespaces using cluster-scoped labels. This
// matches all pods in all namespaces selected by this label selector.
// This field follows standard unversioned.LabelSelector semantics.
// If present but empty, this selector selects all namespaces.
NamespaceSelector *unversioned.LabelSelector `json:"namespaceSelector,omitempty"`
}
```
### Behavior
The following pseudo-code attempts to define when traffic is allowed to a given pod when using this API.
```python
def is_traffic_allowed(traffic, pod):
"""
Returns True if traffic is allowed to this pod, False otherwise.
"""
if not pod.Namespace.Spec.NetworkPolicy.Ingress.Isolation:
# If ingress isolation is disabled on the Namespace, use cluster default.
return clusterDefault(traffic, pod)
elif traffic.source == pod.node.kubelet:
# Traffic is from kubelet health checks.
return True
else:
# If namespace ingress isolation is enabled, only allow traffic
# that matches a network policy which selects this pod.
for network_policy in network_policies(pod.Namespace):
if not network_policy.Spec.PodSelector.selects(pod):
# This policy doesn't select this pod. Try the next one.
continue
# This policy selects this pod. Check each ingress rule
# defined on this policy to see if it allows the traffic.
# If at least one does, then the traffic is allowed.
for ingress_rule in network_policy.Ingress or []:
if ingress_rule.matches(traffic):
return True
# Ingress isolation is DefaultDeny and no policies match the given pod and traffic.
return False
```
### Potential Future Work / Questions
- A single podSelector per NetworkPolicy may lead to managing a large number of NetworkPolicy objects, each of which is small and easy to understand on its own. However, this may mean that a single policy change requires touching several policy objects. Allowing an optional podSelector per ingress rule, in addition to the podSelector per NetworkPolicy object, would allow the user to group rules into logical segments and define the size/complexity ratio where it makes sense. This may lead to a smaller number of objects with more complexity if the user opts in to the additional podSelector. This increases the complexity of the NetworkPolicy object itself. This proposal has opted to favor a larger number of smaller objects that are easier to understand, with the understanding that additional podSelectors could be added to this design in the future should the requirement become apparent.
- Is the `Namespaces` selector in the `NetworkPolicyPeer` struct too coarse? Do we need to support the AND combination of `Namespaces` and `Pods`?
### Examples
1) Only allow traffic from frontend pods on TCP port 6379 to backend pods in the same namespace.
```yaml
kind: Namespace
apiVersion: v1
metadata:
name: myns
annotations:
net.beta.kubernetes.io/network-policy: |
{
"ingress": {
"isolation": "DefaultDeny"
}
}
---
kind: NetworkPolicy
apiVersion: extensions/v1beta1
metadata:
name: allow-frontend
namespace: myns
spec:
podSelector:
matchLabels:
role: backend
ingress:
- from:
- podSelector:
matchLabels:
role: frontend
ports:
- protocol: TCP
port: 6379
```
2) Allow TCP 443 from any source in Bob's namespaces.
```yaml
kind: NetworkPolicy
apiVersion: extensions/v1beta1
metadata:
name: allow-tcp-443
spec:
podSelector:
matchLabels:
role: frontend
ingress:
- ports:
- protocol: TCP
port: 443
from:
- namespaceSelector:
matchLabels:
user: bob
```
3) Allow all traffic to all pods in this namespace.
```yaml
kind: NetworkPolicy
apiVersion: extensions/v1beta1
metadata:
name: allow-all
spec:
podSelector:
ingress:
- {}
```
## References
- https://github.com/kubernetes/kubernetes/issues/22469 tracks network policy in kubernetes.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/network-policy.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

View File

@ -1,151 +1 @@
# Node Allocatable Resources This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md)
**Issue:** https://github.com/kubernetes/kubernetes/issues/13984
## Overview
Currently Node.Status has Capacity, but no concept of node Allocatable. We need additional
parameters to serve several purposes:
1. Kubernetes metrics provides "/docker-daemon", "/kubelet",
"/kube-proxy", "/system" etc. raw containers for monitoring system component resource usage
patterns and detecting regressions. Eventually we want to cap system component usage to a certain
limit / request. However this is not currently feasible due to a variety of reasons including:
1. Docker still uses tons of computing resources (See
[#16943](https://github.com/kubernetes/kubernetes/issues/16943))
2. We have not yet defined the minimal system requirements, so we cannot control Kubernetes
nodes or know about arbitrary daemons, which can make the system resources
unmanageable. Even with a resource cap we cannot do full resource management on the
node, but with the proposed parameters we can mitigate severe resource overcommits
3. Usage scales with the number of pods running on the node
2. External schedulers (such as Mesos, Hadoop, etc.) might want to partition
compute resources on a given node, limiting how much the Kubelet can use. We should provide a
mechanism by which they can query the kubelet, and reserve some resources for their own purpose.
### Scope of proposal
This proposal deals with resource reporting through the [`Allocatable` field](#allocatable) for more
reliable scheduling, and minimizing resource over commitment. This proposal *does not* cover
resource usage enforcement (e.g. limiting kubernetes component usage), pod eviction (e.g. when
reservation grows), or running multiple Kubelets on a single node.
## Design
### Definitions
![image](node-allocatable.png)
1. **Node Capacity** - Already provided as
[`NodeStatus.Capacity`](https://htmlpreview.github.io/?https://github.com/kubernetes/kubernetes/blob/HEAD/docs/api-reference/v1/definitions.html#_v1_nodestatus),
this is total capacity read from the node instance, and assumed to be constant.
2. **System-Reserved** (proposed) - Compute resources reserved for processes which are not managed by
Kubernetes. Currently this covers all the processes lumped together in the `/system` raw
container.
3. **Kubelet Allocatable** - Compute resources available for scheduling (including scheduled &
unscheduled resources). This value is the focus of this proposal. See [below](#api-changes) for
more details.
4. **Kube-Reserved** (proposed) - Compute resources reserved for Kubernetes components such as the
docker daemon, kubelet, kube proxy, etc.
### API changes
#### Allocatable
Add `Allocatable` (4) to
[`NodeStatus`](https://htmlpreview.github.io/?https://github.com/kubernetes/kubernetes/blob/HEAD/docs/api-reference/v1/definitions.html#_v1_nodestatus):
```
type NodeStatus struct {
...
// Allocatable represents schedulable resources of a node.
Allocatable ResourceList `json:"allocatable,omitempty"`
...
}
```
Allocatable will be computed by the Kubelet and reported to the API server. It is defined to be:
```
[Allocatable] = [Node Capacity] - [Kube-Reserved] - [System-Reserved]
```
The scheduler will use `Allocatable` in place of `Capacity` when scheduling pods, and the Kubelet
will use it when performing admission checks.
*Note: Since kernel usage can fluctuate and is out of Kubernetes' control, it will be reported as a
separate value (probably via the metrics API). Reporting kernel usage is out-of-scope for this
proposal.*
#### Kube-Reserved
`KubeReserved` is the parameter specifying resources reserved for kubernetes components (4). It is
provided as a command-line flag to the Kubelet at startup, and therefore cannot be changed during
normal Kubelet operation (this may change in the [future](#future-work)).
The flag will be specified as a serialized `ResourceList`, with resources defined by the API
`ResourceName` and values specified in `resource.Quantity` format, e.g.:
```
--kube-reserved=cpu=500m,memory=5Mi
```
Initially we will only support CPU and memory, but will eventually support more resources. See
[#16889](https://github.com/kubernetes/kubernetes/pull/16889) for disk accounting.
If KubeReserved is not set it defaults to a sane value (TBD) calculated from machine capacity. If it
is explicitly set to 0 (along with `SystemReserved`), then `Allocatable == Capacity`, and the system
behavior is equivalent to the 1.1 behavior with scheduling based on Capacity.
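A rough sketch of how the flag format above could be parsed and how `Allocatable` would then be derived; this simplified version only understands the `m` (millicores) and `Mi` suffixes, whereas the real Kubelet parses full `resource.Quantity` values:
```go
package main
import (
	"fmt"
	"strconv"
	"strings"
)
// parseReserved parses a value like "cpu=500m,memory=5Mi" into millicores and bytes.
func parseReserved(flag string) (cpuMilli, memBytes int64) {
	for _, kv := range strings.Split(flag, ",") {
		parts := strings.SplitN(kv, "=", 2)
		if len(parts) != 2 {
			continue // ignore malformed entries in this sketch
		}
		switch parts[0] {
		case "cpu":
			v, _ := strconv.ParseInt(strings.TrimSuffix(parts[1], "m"), 10, 64)
			cpuMilli = v
		case "memory":
			v, _ := strconv.ParseInt(strings.TrimSuffix(parts[1], "Mi"), 10, 64)
			memBytes = v * 1024 * 1024
		}
	}
	return cpuMilli, memBytes
}
func main() {
	capCPU, capMem := int64(4000), int64(8)*1024*1024*1024 // node capacity: 4 CPUs, 8Gi
	kubeCPU, kubeMem := parseReserved("cpu=500m,memory=5Mi")
	sysCPU, sysMem := parseReserved("cpu=200m,memory=100Mi")
	// [Allocatable] = [Node Capacity] - [Kube-Reserved] - [System-Reserved]
	fmt.Printf("allocatable: cpu=%dm memory=%d bytes\n", capCPU-kubeCPU-sysCPU, capMem-kubeMem-sysMem)
}
```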
#### System-Reserved
In the initial implementation, `SystemReserved` will be functionally equivalent to
[`KubeReserved`](#kube-reserved), but with a different semantic meaning. While KubeReserved
designates resources set aside for kubernetes components, SystemReserved designates resources set
aside for non-kubernetes components (currently this is reported as all the processes lumped
together in the `/system` raw container).
## Issues
### Kubernetes reservation is smaller than kubernetes component usage
**Solution**: Initially, do nothing (best effort). Let the kubernetes daemons overflow the reserved
resources and hope for the best. If the node usage is less than Allocatable, there will be some room
for overflow and the node should continue to function. If the node has been scheduled to capacity
(worst-case scenario) it may enter an unstable state, which is the current behavior in this
situation.
In the [future](#future-work) we may set a parent cgroup for kubernetes components, with limits set
according to `KubeReserved`.
### Version discrepancy
**API server / scheduler is not allocatable-resources aware:** If the Kubelet rejects a Pod but the
scheduler expects the Kubelet to accept it, the system could get stuck in an infinite loop
scheduling a Pod onto the node only to have Kubelet repeatedly reject it. To avoid this situation,
we will do a 2-stage rollout of `Allocatable`. In stage 1 (targeted for 1.2), `Allocatable` will
be reported by the Kubelet and the scheduler will be updated to use it, but Kubelet will continue
to do admission checks based on `Capacity` (same as today). In stage 2 of the rollout (targeted
for 1.3 or later), the Kubelet will start doing admission checks based on `Allocatable`.
**API server expects `Allocatable` but does not receive it:** If the kubelet is older and does not
provide `Allocatable` in the `NodeStatus`, then `Allocatable` will be
[defaulted](../../pkg/api/v1/defaults.go) to
`Capacity` (which will yield today's behavior of scheduling based on capacity).
### 3rd party schedulers
The community should be notified that an update to schedulers is recommended, but if a scheduler is
not updated it falls under the above case of "scheduler is not allocatable-resources aware".
## Future work
1. Convert kubelet flags to Config API - Prerequisite to (2). See
[#12245](https://github.com/kubernetes/kubernetes/issues/12245).
2. Set cgroup limits according to KubeReserved - as described in the [overview](#overview)
3. Report kernel usage to be considered with scheduling decisions.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/node-allocatable.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

View File

@ -1,116 +1 @@
# Performance Monitoring This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/performance-related-monitoring.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/performance-related-monitoring.md)
## Reason for this document
This document serves as a place to gather information about past performance regressions, their reason and impact and discuss ideas to avoid similar regressions in the future.
Main reason behind doing this is to understand what kind of monitoring needs to be in place to keep Kubernetes fast.
## Known past and present performance issues
### Higher logging level causing scheduler stair stepping
Issue https://github.com/kubernetes/kubernetes/issues/14216 was opened because @spiffxp observed a regression in scheduler performance in the 1.1 branch in comparison to the `old` 1.0
cut. In the end it turned out to be caused by the `--v=4` flag (instead of the default `--v=2`) in the scheduler, together with the `--logtostderr` flag, which disables batching of
log lines, and a number of logging calls without an explicit V level. This caused weird behavior of the whole component.
Because we now know that logging may have a big performance impact, we should consider instrumenting the logging mechanism and computing statistics such as the number of logged messages and
their total and average size. Each binary should be responsible for exposing its metrics. An unaccounted-for but far too large number of days, if not weeks, of engineering time was
lost because of this issue.
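As a toy sketch of the kind of instrumentation meant here (the type and counter names are made up for illustration), each component could wrap its logger and export how many lines and bytes it has written:
```go
package main
import (
	"fmt"
	"sync/atomic"
)
// countingLogger wraps a log sink and tracks how many messages and bytes have
// been written, so the totals can be exposed as metrics.
type countingLogger struct {
	messages uint64 // kept first for 64-bit alignment on 32-bit platforms
	bytes    uint64
}
func (l *countingLogger) Infof(format string, args ...interface{}) {
	line := fmt.Sprintf(format, args...)
	atomic.AddUint64(&l.messages, 1)
	atomic.AddUint64(&l.bytes, uint64(len(line)))
	fmt.Println(line)
}
func main() {
	log := &countingLogger{}
	log.Infof("synced pod %s", "pod-abc")
	fmt.Printf("logged %d messages, %d bytes\n", atomic.LoadUint64(&log.messages), atomic.LoadUint64(&log.bytes))
}
```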
### Adding per-pod probe-time, which increased the number of PodStatus updates, causing major slowdown
In September 2015 we tried to add per-pod probe times to the PodStatus. It caused (https://github.com/kubernetes/kubernetes/issues/14273) a massive increase in both the number and the
total volume of object (PodStatus) changes. It drastically increased the load on the API server, which wasn't able to handle the new number of requests quickly enough, violating our
response time SLO. We had to revert this change.
### Late Ready->Running PodPhase transition caused test failures as it seemed like slowdown
In late September we encountered a strange problem (https://github.com/kubernetes/kubernetes/issues/14554): we observed increased latencies in small clusters (a few
Nodes). It turned out to be caused by an added latency between the PodRunning and PodReady phases. This was not a real regression, but our tests thought it was, which shows
how careful we need to be.
### Huge number of handshakes slows down API server
It was a long-standing issue for performance and is/was an important bottleneck for scalability (https://github.com/kubernetes/kubernetes/issues/13671). The bug directly
causing this problem was incorrect (from the Go standpoint) handling of TCP connections. A secondary issue was that elliptic curve encryption (the only one available in Go 1.4)
is unbelievably slow.
## Proposed metrics/statistics to gather/compute to avoid problems
### Cluster-level metrics
Basic ideas:
- number of Pods/ReplicationControllers/Services in the cluster
- number of running replicas of master components (if they are replicated)
- currently elected master of the etcd cluster (if running a distributed version)
- number of master component restarts
- number of lost Nodes
### Logging monitoring
Log spam is a serious problem and we need to keep it under control. The simplest way to check for regressions, suggested by @brendandburns, is to compute the rate at which log files
grow in e2e tests.
Basic ideas:
- log generation rate (B/s)
### REST call monitoring
We do measure REST call duration in the Density test, but we need API server monitoring as well, to avoid false failures caused e.g. by network traffic. We already have
some metrics in place (https://github.com/kubernetes/kubernetes/blob/master/pkg/apiserver/metrics/metrics.go), but we need to revisit the list and add some more.
Basic ideas:
- number of calls per verb, client, resource type
- latency distribution per verb, client, resource type
- number of calls that were rejected per client, resource type and reason (invalid version number, already at maximum number of requests in flight)
- number of relists in various watchers
### Rate limit monitoring
The reverse of REST call monitoring done in the API server. We need to know when a given component increases the pressure it puts on the API server. As a proxy for the number of
requests sent, we can track how saturated the rate limiters are. This has the additional advantage of giving us the data needed to fine-tune rate limiter constants.
Because we have rate limiting on both ends (client and API server), we should monitor the number of in-flight requests in the API server and how it relates to `max-requests-inflight`.
Basic ideas:
- percentage of used non-burst limit,
- amount of time in last hour with depleted burst tokens,
- number of inflight requests in API server.
### Network connection monitoring
During development we observed incorrect use/reuse of HTTP connections multiple times already. We should at least monitor the number of created connections.
### ETCD monitoring
@xiang-90 and @hongchaodeng - you probably have way more experience on what'd be good to look at from the ETCD perspective.
Basic ideas:
- ETCD memory footprint
- number of objects per kind
- read/write latencies per kind
- number of requests from the API server
- read/write counts per key (it may be too heavy though)
### Resource consumption
On top of all the things mentioned above we need to monitor changes in resource usage in both cluster components (API server, Kubelet, Scheduler, etc.) and system add-ons
(Heapster, L7 load balancer, etc.). Monitoring memory usage is tricky, because if no limits are set, the system won't apply memory pressure to processes, which makes their memory
footprint grow constantly. We argue that monitoring usage in tests still makes sense, as tests should be repeatable, and if memory usage grows drastically between two runs
it most likely can be attributed to some kind of regression (assuming that nothing else has changed in the environment).
Basic ideas:
- CPU usage
- memory usage
### Other saturation metrics
We should monitor other aspects of the system, which may indicate saturation of some component.
Basic ideas:
- queue length for queues in the system,
- wait time for WaitGroups.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/performance-related-monitoring.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

View File

@ -1,201 +1 @@
# Kubelet: Pod Lifecycle Event Generator (PLEG) This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-lifecycle-event-generator.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-lifecycle-event-generator.md)
In Kubernetes, Kubelet is a per-node daemon that manages the pods on the node,
driving the pod states to match their pod specifications (specs). To achieve
this, Kubelet needs to react to changes in both (1) pod specs and (2) the
container states. For the former, Kubelet watches the pod specs changes from
multiple sources; for the latter, Kubelet polls the container runtime
periodically (e.g., 10s) for the latest states for all containers.
Polling incurs non-negligible overhead as the number of pods/containers increases,
and is exacerbated by Kubelet's parallelism -- one worker (goroutine) per pod, which
queries the container runtime individually. Periodic, concurrent, large numbers
of requests cause high CPU usage spikes (even when there is no spec/state
change), poor performance, and reliability problems due to an overwhelmed container
runtime. Ultimately, it limits Kubelet's scalability.
(Related issues reported by users: [#10451](https://issues.k8s.io/10451),
[#12099](https://issues.k8s.io/12099), [#12082](https://issues.k8s.io/12082))
## Goals and Requirements
The goal of this proposal is to improve Kubelet's scalability and performance
by lowering the pod management overhead.
- Reduce unnecessary work during inactivity (no spec/state changes)
- Lower the concurrent requests to the container runtime.
The design should be generic so that it can support different container runtimes
(e.g., Docker and rkt).
## Overview
This proposal aims to replace the periodic polling with a pod lifecycle event
watcher.
![pleg](pleg.png)
## Pod Lifecycle Event
A pod lifecycle event interprets the underlying container state change at the
pod-level abstraction, making it container-runtime-agnostic. The abstraction
shields Kubelet from the runtime specifics.
```go
type PodLifeCycleEventType string
const (
ContainerStarted PodLifeCycleEventType = "ContainerStarted"
ContainerStopped PodLifeCycleEventType = "ContainerStopped"
NetworkSetupCompleted PodLifeCycleEventType = "NetworkSetupCompleted"
NetworkFailed PodLifeCycleEventType = "NetworkFailed"
)
// PodLifecycleEvent is an event that reflects the change of the pod state.
type PodLifecycleEvent struct {
// The pod ID.
ID types.UID
// The type of the event.
Type PodLifeCycleEventType
// The accompanied data which varies based on the event type.
Data interface{}
}
```
Using Docker as an example, the start of a pod infra container would be
translated into a `NetworkSetupCompleted` pod lifecycle event.
## Detect Changes in Container States Via Relisting
In order to generate pod lifecycle events, PLEG needs to detect changes in
container states. We can achieve this by periodically relisting all containers
(e.g., `docker ps`). Although this is similar to Kubelet's polling today, it will
only be performed by a single thread (PLEG). This means that we still
benefit from not having all pod workers hitting the container runtime
concurrently. Moreover, only the relevant pod worker would be woken up
to perform a sync.
The upside of relying on relisting is that it is container runtime-agnostic,
and requires no external dependency.
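To make this concrete, a minimal relisting sketch follows; the `GenericPLEG` fields, the `lister` abstraction, and the state-to-event mapping are simplifying assumptions for illustration, not the actual Kubelet code:

```go
package pleg

import "time"

type ContainerState string

type PodLifecycleEventType string

const (
	ContainerStarted PodLifecycleEventType = "ContainerStarted"
	ContainerStopped PodLifecycleEventType = "ContainerStopped"
)

// PodLifecycleEvent is a simplified version of the event type defined above.
type PodLifecycleEvent struct {
	ID   string // pod UID
	Type PodLifecycleEventType
	Data interface{}
}

// container is a simplified view of what a runtime listing would return.
type container struct {
	ID    string
	PodID string
	State ContainerState // e.g. "running", "exited"
}

// lister abstracts the container runtime listing (docker ps, rkt list, ...).
type lister func() ([]container, error)

// GenericPLEG relists periodically from a single goroutine and emits a
// pod-level event for every container whose state changed since last relist.
type GenericPLEG struct {
	list      lister
	lastState map[string]ContainerState // container ID -> last observed state
	Events    chan *PodLifecycleEvent
}

func (g *GenericPLEG) Run(period time.Duration) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for range ticker.C {
		containers, err := g.list()
		if err != nil {
			continue // assumed: simply retried on the next relist
		}
		for _, c := range containers {
			if old, seen := g.lastState[c.ID]; !seen || old != c.State {
				g.lastState[c.ID] = c.State
				eventType := ContainerStopped
				if c.State == "running" {
					eventType = ContainerStarted
				}
				g.Events <- &PodLifecycleEvent{ID: c.PodID, Type: eventType, Data: c.ID}
			}
		}
	}
}
```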
### Relist period
The shorter the relist period is, the sooner Kubelet can detect the
change. A shorter relist period also implies higher CPU usage. Moreover, the
relist latency depends on the underlying container runtime, and usually
increases as the number of containers/pods grows. We should set a default
relist period based on measurements. Regardless of what period we set, it will
likely be significantly shorter than the current pod sync period (10s), i.e.,
Kubelet will detect container changes sooner.
## Impact on the Pod Worker Control Flow
Kubelet is responsible for dispatching an event to the appropriate pod
worker based on the pod ID. Only one pod worker would be woken up for
each event.
Today, the pod syncing routine in Kubelet is idempotent as it always
examines the pod state and the spec, and tries to drive the state to
match the spec by performing a series of operations. It should be
noted that this proposal does not intend to change this property --
the sync pod routine would still perform all necessary checks,
regardless of the event type. This trades some efficiency for
reliability and eliminates the need to build a state machine that is
compatible with different runtimes.
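A minimal sketch of this dispatch step follows; the `dispatcher` type, the simplified `event`, and the channel wiring are assumptions, and a real implementation also has to decide what to do with events for unknown pods and busy workers:

```go
package kubelet

// event is a simplified stand-in for the PodLifecycleEvent described above.
type event struct {
	PodID string
}

// dispatcher routes each incoming event to the channel of the worker that
// owns the pod, so only that worker wakes up to perform a sync.
type dispatcher struct {
	workers map[string]chan event // pod UID -> per-pod worker channel
}

func (d *dispatcher) dispatch(events <-chan event) {
	for ev := range events {
		ch, ok := d.workers[ev.PodID]
		if !ok {
			continue // assumed: events for unknown pods are handled elsewhere
		}
		select {
		case ch <- ev:
		default:
			// The worker is already busy; because the sync routine is
			// idempotent and re-examines the full pod state, dropping the
			// extra wake-up here is assumed to be safe.
		}
	}
}
```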
## Leverage Upstream Container Events
Instead of relying on relisting, PLEG can leverage other components which
provide container events, and translate these events into pod lifecycle
events. This will further improve Kubelet's responsiveness and reduce the
resource usage caused by frequent relisting.
The upstream container events can come from:
(1). *Event stream provided by each container runtime*
Docker's API exposes an [event
stream](https://docs.docker.com/reference/api/docker_remote_api_v1.17/#monitor-docker-s-events).
Nonetheless, rkt does not support this yet, but they will eventually support it
(see [coreos/rkt#1193](https://github.com/coreos/rkt/issues/1193)).
(2). *cgroups event stream by cAdvisor*
cAdvisor is integrated in Kubelet to provide container stats. It watches cgroups
containers using inotify and exposes an event stream. Even though it does not
support rkt yet, it should be straightforward to add such a support.
Option (1) may provide richer sets of events, but option (2) has the advantage
to be more universal across runtimes, as long as the container runtime uses
cgroups. Regardless of what one chooses to implement now, the container event
stream should be easily swappable with a clearly defined interface.
Note that we cannot solely rely on the upstream container events due to the
possibility of missing events. PLEG should relist infrequently to ensure no
events are missed.
## Generate Expected Events
*This is optional for PLEGs which perform only relisting, but required for
PLEGs that watch upstream events.*
A pod worker's actions could lead to pod lifecycle events (e.g.,
create/kill a container), which the worker would not observe until
later. The pod worker should ignore such events to avoid unnecessary
work.
For example, assume a pod has two containers, A and B. The worker
- Creates container A
- Receives an event `(ContainerStopped, B)`
- Receives an event `(ContainerStarted, A)`
The worker should ignore the `(ContainerStarted, A)` event since it is
expected. Arguably, the worker could process `(ContainerStopped, B)`
as soon as it receives the event, before observing the creation of
A. However, it is desirable to wait until the expected event
`(ContainerStarted, A)` is observed to keep a consistent per-pod view
at the worker. Therefore, the control flow of a single pod worker
should adhere to the following rules:
1. Pod worker should process the events sequentially.
2. Pod worker should not start syncing until it observes the outcome of its own
actions in the last sync to maintain a consistent view.
In other words, a pod worker should record the expected events, and
only wake up to perform the next sync until all expectations are met.
- Creates container A, records an expected event `(ContainerStarted, A)`
- Receives `(ContainerStopped, B)`; stores the event and goes back to sleep.
- Receives `(ContainerStarted, A)`; clears the expectation. Proceeds to handle
`(ContainerStopped, B)`.
We should set an expiration time for each expected event to prevent the worker
from being stalled indefinitely by missing events.
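A minimal sketch of this expectation bookkeeping follows; the types, event keys, and expiry handling are assumptions for illustration, not the actual Kubelet implementation:

```go
package worker

import "time"

// expectation records an event the worker caused itself and has not yet observed.
type expectation struct {
	key     string    // e.g. "ContainerStarted/A"
	expires time.Time // give up waiting after this time to avoid stalling forever
}

// podWorker buffers incoming events and only syncs once all of its own
// expected events have been observed (or have expired).
type podWorker struct {
	expected []expectation
	pending  []string // events received while waiting, handled at the next sync
}

// observe records an incoming event; it returns true when the worker should
// wake up and perform a sync.
func (w *podWorker) observe(eventKey string, now time.Time) bool {
	matched := false
	remaining := w.expected[:0]
	for _, e := range w.expected {
		switch {
		case e.key == eventKey && !matched:
			matched = true // expectation met; do not queue the event as new work
		case now.After(e.expires):
			// Expired expectation: stop waiting for it.
		default:
			remaining = append(remaining, e)
		}
	}
	w.expected = remaining
	if !matched {
		w.pending = append(w.pending, eventKey)
	}
	// Sync only when no expectations are outstanding and there is work queued.
	return len(w.expected) == 0 && len(w.pending) > 0
}
```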
## TODOs for v1.2
For v1.2, we will add a generic PLEG which relists periodically, and leave
adopting container events for future work. We will also *not* implement the
optimization that generates and filters out expected events to minimize
redundant syncs.
- Add a generic PLEG using relisting. Modify the container runtime interface
to provide all necessary information to detect container state changes
in `GetPods()` (#13571).
- Benchmark docker to adjust relisting frequency.
- Fix/adapt features that rely on frequent, periodic pod syncing.
* Liveness/Readiness probing: Create a separate probing manager using
an explicit container probing period [#10878](https://issues.k8s.io/10878).
* Instruct pod workers to set up a wake-up call if syncing failed, so that
it can retry.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/pod-lifecycle-event-generator.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->


@ -1,416 +1 @@
# Pod level resource management in Kubelet This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-resource-management.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-resource-management.md)
**Author**: Buddha Prakash (@dubstack), Vishnu Kannan (@vishh)
**Last Updated**: 06/23/2016
**Status**: Draft Proposal (WIP)
This document proposes a design for introducing pod level resource accounting to Kubernetes, and outlines the implementation and rollout plan.
<!-- BEGIN MUNGE: GENERATED_TOC -->
- [Pod level resource management in Kubelet](#pod-level-resource-management-in-kubelet)
- [Introduction](#introduction)
- [Non Goals](#non-goals)
- [Motivations](#motivations)
- [Design](#design)
- [Proposed cgroup hierarchy:](#proposed-cgroup-hierarchy)
- [QoS classes](#qos-classes)
- [Guaranteed](#guaranteed)
- [Burstable](#burstable)
- [Best Effort](#best-effort)
- [With Systemd](#with-systemd)
- [Hierarchy Outline](#hierarchy-outline)
- [QoS Policy Design Decisions](#qos-policy-design-decisions)
- [Implementation Plan](#implementation-plan)
- [Top level Cgroups for QoS tiers](#top-level-cgroups-for-qos-tiers)
- [Pod level Cgroup creation and deletion (Docker runtime)](#pod-level-cgroup-creation-and-deletion-docker-runtime)
- [Container level cgroups](#container-level-cgroups)
- [Rkt runtime](#rkt-runtime)
- [Add Pod level metrics to Kubelet's metrics provider](#add-pod-level-metrics-to-kubelets-metrics-provider)
- [Rollout Plan](#rollout-plan)
- [Implementation Status](#implementation-status)
<!-- END MUNGE: GENERATED_TOC -->
## Introduction
As of now, [Quality of Service (QoS)](../../docs/design/resource-qos.md) is not enforced at the pod level. Except for pod evictions, QoS features are not applicable at the pod level.
To better support QoS, there is a need to add support for pod level resource accounting in Kubernetes.
We propose to have a unified cgroup hierarchy with pod level cgroups for better resource management. We will have a cgroup hierarchy with top level cgroups for the three QoS classes Guaranteed, Burstable and BestEffort. Pods (and their containers) belonging to a QoS class will be grouped under these top level QoS cgroups. And all containers in a pod are nested under the pod cgroup.
The proposed cgroup hierarchy would allow for more efficient resource management and lead to improvements in node reliability.
This would also allow for significant latency optimizations in terms of pod eviction on nodes with the use of pod level resource usage metrics.
This document provides a basic outline of how we plan to implement and rollout this feature.
## Non Goals
- Pod level disk accounting will not be tackled in this proposal.
- Pod level resource specification in the Kubernetes API will not be tackled in this proposal.
## Motivations
Kubernetes currently supports container level isolation only and lets users specify resource requests/limits on the containers [Compute Resources](../../docs/design/resources.md). The `kubelet` creates a cgroup sandbox (via its container runtime) for each container.
There are a few shortcomings to the current model.
- Existing QoS support does not apply to pods as a whole. On-going work to support pod level eviction using QoS requires all containers in a pod to belong to the same class. By having pod level cgroups, it is easy to track pod level usage and make eviction decisions.
- Infrastructure overhead per pod is currently charged to the node. The overhead of setting up and managing the pod sandbox is currently accounted to the node. If the pod sandbox is a bit expensive, like in the case of hyper, having pod level accounting becomes critical.
- For the docker runtime we have a containerd-shim which is a small library that sits in front of a runtime implementation allowing it to be reparented to init, handle reattach from the caller etc. With pod level cgroups containerd-shim can be charged to the pod instead of the machine.
- If a container exits, all its anonymous pages (tmpfs) gets accounted to the machine (root). With pod level cgroups, that usage can also be attributed to the pod.
- Let containers share resources - with pod level limits, a pod with a Burstable container and a BestEffort container is classified as Burstable pod. The BestEffort container is able to consume slack resources not used by the Burstable container, and still be capped by the overall pod level limits.
## Design
High level requirements for the design are as follows:
- Do not break existing users. Ideally, there should be no changes to the Kubernetes API semantics.
- Support multiple cgroup managers - systemd, cgroupfs, etc.
How we intend to achieve these high level goals is covered in greater detail in the Implementation Plan.
We use the following denotations in the sections below:
For the three QoS classes
`G⇒ Guaranteed QoS, Bu⇒ Burstable QoS, BE⇒ BestEffort QoS`
For the value specified for the `--qos-memory-overcommitment` flag
`qom ⇒ qos-memory-overcommitment`
Currently the Kubelet highly prioritizes resource utilization and thus allows `BE` pods to use as much of a resource as they want, and in case of OOM the `BE` pods are the first to be killed. We follow this policy because `G` pods often don't use the full amount of resources they request, and by overcommitting the node the `BE` pods are able to utilize these leftover resources. In case of OOM the `BE` pods are evicted by the eviction manager, but there is some latency involved in the pod eviction process, which can be a cause of concern for latency-sensitive servers. On such servers we would want to avoid OOM conditions on the node. Pod level cgroups allow us to restrict the amount of resources available to the `BE` pods, so reserving the requested resources for the `G` and `Bu` pods lets us avoid invoking the OOM killer.
We add a flag `qos-memory-overcommitment` to the kubelet which allows users to configure the percentage of memory overcommitment on the node. The default is 100, so by default we allow complete overcommitment on the node, let the `BE` pods use as much memory as they want, and do not reserve any resources for the `G` and `Bu` pods. As expected, if there is an OOM in such a case we first kill the `BE` pods before the `G` and `Bu` pods.
On the other hand, if users want to ensure very predictable tail latency for latency-sensitive servers, they would need to set qos-memory-overcommitment to a very low value (preferably 0). In this case memory resources would be reserved for the `G` and `Bu` pods, and `BE` pods would only be able to use the leftover memory.
Examples in the next section.
### Proposed cgroup hierarchy:
For the initial implementation we will only support limits for cpu and memory resources.
#### QoS classes
A pod can belong to one of the following 3 QoS classes: Guaranteed, Burstable, and BestEffort, in decreasing order of priority.
#### Guaranteed
`G` pods are placed at the `$Root` cgroup by default. `$Root` is the system root, i.e. "/" by default, and if the `--cgroup-root` flag is used then the specified cgroup-root is used as `$Root`. To ensure Kubelet's idempotent behaviour we follow a pod cgroup naming format which is opaque and deterministic. For example, for a pod with UID `5f9b19c9-3a30-11e6-8eea-28d2444e470d`, the pod cgroup would be named `pod-5f9b19c93a3011e6-8eea28d2444e470d`.
__Note__: The cgroup-root flag would allow the user to configure the root of the QoS cgroup hierarchy. Hence cgroup-root would be redefined as the root of the QoS cgroup hierarchy rather than of containers.
```
/PodUID/cpu.quota = cpu limit of Pod
/PodUID/cpu.shares = cpu request of Pod
/PodUID/memory.limit_in_bytes = memory limit of Pod
```
Example:
We have two pods Pod1 and Pod2 having Pod Spec given below
```yaml
kind: Pod
metadata:
  name: Pod1
spec:
  containers:
  - name: foo
    resources:
      limits:
        cpu: 10m
        memory: 1Gi
  - name: bar
    resources:
      limits:
        cpu: 100m
        memory: 2Gi
```
```yaml
kind: Pod
metadata:
  name: Pod2
spec:
  containers:
  - name: foo
    resources:
      limits:
        cpu: 20m
        memory: 2Gi
```
Pod1 and Pod2 are both classified as `G` and are nested under the `Root` cgroup.
```
/Pod1/cpu.quota = 110m
/Pod1/cpu.shares = 110m
/Pod2/cpu.quota = 20m
/Pod2/cpu.shares = 20m
/Pod1/memory.limit_in_bytes = 3Gi
/Pod2/memory.limit_in_bytes = 2Gi
```
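A minimal sketch of how these pod-level values could be derived from the container limits (hypothetical helper; millicores and bytes chosen as units for illustration):

```go
package qos

// containerLimits holds the CPU (in millicores) and memory (in bytes)
// limits of a single container in a Guaranteed pod.
type containerLimits struct {
	cpuMilli    int64
	memoryBytes int64
}

// guaranteedPodCgroupValues sums the container limits of a Guaranteed pod.
// For G pods requests equal limits, so cpu.shares follows the same sum as
// cpu.quota, and memory.limit_in_bytes is the sum of the memory limits.
func guaranteedPodCgroupValues(containers []containerLimits) (cpuQuotaMilli, cpuSharesMilli, memoryLimitBytes int64) {
	for _, c := range containers {
		cpuQuotaMilli += c.cpuMilli
		cpuSharesMilli += c.cpuMilli
		memoryLimitBytes += c.memoryBytes
	}
	return
}
```

For Pod1 above this yields cpu.quota = cpu.shares = 110m and memory.limit_in_bytes = 3Gi, matching the values shown above.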
#### Burstable
We have the following resource parameters for the `Bu` cgroup.
```
/Bu/cpu.shares = summation of cpu requests of all Bu pods
/Bu/PodUID/cpu.quota = Pod Cpu Limit
/Bu/PodUID/cpu.shares = Pod Cpu Request
/Bu/memory.limit_in_bytes = Allocatable - {(summation of memory requests/limits of `G` pods)*(1-qom/100)}
/Bu/PodUID/memory.limit_in_bytes = Pod memory limit
```
__Note__: For the `Bu` QoS class, when limits are not specified for any one of the containers, the pod limit defaults to the node resource allocatable quantity.
Example:
We have two pods Pod3 and Pod4 having Pod Spec given below:
```yaml
kind: Pod
metadata:
  name: Pod3
spec:
  containers:
  - name: foo
    resources:
      limits:
        cpu: 50m
        memory: 2Gi
      requests:
        cpu: 20m
        memory: 1Gi
  - name: bar
    resources:
      limits:
        cpu: 100m
        memory: 1Gi
```
```yaml
kind: Pod
metadata:
  name: Pod4
spec:
  containers:
  - name: foo
    resources:
      limits:
        cpu: 20m
        memory: 2Gi
      requests:
        cpu: 10m
        memory: 1Gi
```
Pod3 and Pod4 are both classified as `Bu` and are hence nested under the Bu cgroup
And for `qom` = 0
```
/Bu/cpu.shares = 30m
/Bu/Pod3/cpu.quota = 150m
/Bu/Pod3/cpu.shares = 20m
/Bu/Pod4/cpu.quota = 20m
/Bu/Pod4/cpu.shares = 10m
/Bu/memory.limit_in_bytes = Allocatable - 5Gi
/Bu/Pod3/memory.limit_in_bytes = 3Gi
/Bu/Pod4/memory.limit_in_bytes = 2Gi
```
#### Best Effort
For pods belonging to the `BE` QoS we don't set any quota.
```
/BE/cpu.shares = 2
/BE/cpu.quota= not set
/BE/memory.limit_in_bytes = Allocatable - {(summation of memory requests of all `G` and `Bu` pods)*(1-qom/100)}
/BE/PodUID/memory.limit_in_bytes = no limit
```
Example:
We have a pod 'Pod5' having Pod Spec given below:
```yaml
kind: Pod
metadata:
  name: Pod5
spec:
  containers:
  - name: foo
    resources:
  - name: bar
    resources:
```
Pod5 is classified as `BE` and is hence nested under the BE cgroup
And for `qom` = 0
```
/BE/cpu.shares = 2
/BE/cpu.quota= not set
/BE/memory.limit_in_bytes = Allocatable - 7Gi
/BE/Pod5/memory.limit_in_bytes = no limit
```
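A minimal sketch of the QoS-level memory limit formulas used above (hypothetical helper; all quantities in bytes, `qom` as a percentage):

```go
package qos

// qosMemoryLimits computes memory.limit_in_bytes for the Bu and BE cgroups
// given the node's allocatable memory, the summed memory requests/limits of
// G pods, the summed memory requests of Bu pods, and the
// --qos-memory-overcommitment percentage (qom):
//
//	Bu limit = Allocatable - guaranteedRequests*(1 - qom/100)
//	BE limit = Allocatable - (guaranteedRequests+burstableRequests)*(1 - qom/100)
func qosMemoryLimits(allocatable, guaranteedRequests, burstableRequests, qom int64) (buLimit, beLimit int64) {
	factor := float64(100-qom) / 100.0
	buLimit = allocatable - int64(float64(guaranteedRequests)*factor)
	beLimit = allocatable - int64(float64(guaranteedRequests+burstableRequests)*factor)
	return
}
```

With `qom` = 0 and the example pods above this gives `Bu` = Allocatable - 5Gi and `BE` = Allocatable - 7Gi; with `qom` = 100 both limits equal Allocatable, i.e. full overcommitment.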
### With Systemd
In systemd we have slices for the three top level QoS classes. Further, each pod is a subslice of exactly one of the three QoS slices. Each container in a pod belongs to a scope nested under the qosclass-pod slice.
Example: We plan to have the following cgroup hierarchy on systemd systems
```
/memory/G-PodUID.slice/containerUID.scope
/cpu,cpuacct/G-PodUID.slice/containerUID.scope
/memory/Bu.slice/Bu-PodUID.slice/containerUID.scope
/cpu,cpuacct/Bu.slice/Bu-PodUID.slice/containerUID.scope
/memory/BE.slice/BE-PodUID.slice/containerUID.scope
/cpu,cpuacct/BE.slice/BE-PodUID.slice/containerUID.scope
```
### Hierarchy Outline
- "$Root" is the system root of the node i.e. "/" by default and if `--cgroup-root` is specified then the specified cgroup-root is used as "$Root".
- We have a top level QoS cgroup for the `Bu` and `BE` QoS classes.
- But we __don't__ have a separate cgroup for the `G` QoS class. `G` pod cgroups are brought up directly under the `Root` cgroup.
- Each pod has its own cgroup which is nested under the cgroup matching the pod's QoS class.
- All containers brought up by the pod are nested under the pod's cgroup.
- system-reserved cgroup contains the system specific processes.
- kube-reserved cgroup contains the kubelet specific daemons.
```
$ROOT
|
+- Pod1
| |
| +- Container1
| +- Container2
| ...
+- Pod2
| +- Container3
| ...
+- ...
|
+- Bu
| |
| +- Pod3
| | |
| | +- Container4
| | ...
| +- Pod4
| | +- Container5
| | ...
| +- ...
|
+- BE
| |
| +- Pod5
| | |
| | +- Container6
| | +- Container7
| | ...
| +- ...
|
+- System-reserved
| |
| +- system
| +- docker (optional)
| +- ...
|
+- Kube-reserved
| |
| +- kubelet
| +- docker (optional)
| +- ...
|
```
#### QoS Policy Design Decisions
- This hierarchy highly prioritizes resource guarantees to the `G` over `Bu` and `BE` pods.
- By not having a separate cgroup for the `G` class, the hierarchy allows the `G` pods to burst and utilize all of Node's Allocatable capacity.
- The `BE` and `Bu` pods are strictly restricted from bursting and hogging resources and thus `G` Pods are guaranteed resource isolation.
- `BE` pods are treated as lowest priority. So for the `BE` QoS cgroup we set cpu shares to the lowest possible value, i.e. 2. This ensures that the `BE` containers get a relatively small share of cpu time.
- Also we don't set any quota on the cpu resources as the containers on the `BE` pods can use any amount of free resources on the node.
- Having the memory limit of the `BE` cgroup as (Allocatable - summation of memory requests of `G` and `Bu` pods) would result in `BE` pods becoming more susceptible to being OOM killed. As more `G` and `Bu` pods are scheduled, the kubelet will more likely kill `BE` pods, even if the `G` and `Bu` pods are using less than their request, since we will be dynamically reducing the size of the `BE` memory.limit_in_bytes. But this allows for better memory guarantees to the `G` and `Bu` pods.
## Implementation Plan
The implementation plan is outlined in the next sections.
We will have an `experimental-cgroups-per-qos` flag to specify whether the user wants to use the QoS based cgroup hierarchy. The flag will be set to false by default, at least in v1.5.
#### Top level Cgroups for QoS tiers
Two top level cgroups for the `Bu` and `BE` QoS classes are created when Kubelet starts to run on a node. All `G` pod cgroups are by default nested under the `Root`, so we don't create a top level cgroup for the `G` class. For raw cgroup systems we would use libcontainer's cgroup manager for general cgroup management (cgroup creation/destruction). For systemd we don't have equivalent support for slice management in libcontainer yet, so we will be adding it in the Kubelet. These cgroups are created only once, on Kubelet initialization, as part of node setup. Also, on systemd these cgroups are transient units and will not survive a reboot.
#### Pod level Cgroup creation and deletion (Docker runtime)
- When a new pod is brought up, its QoS class is first determined.
- We add an interface to Kubelet's ContainerManager to create and delete pod level cgroups under the cgroup that matches the pod's QoS class.
- This interface will be pluggable. Kubelet will support both systemd and raw cgroups based __cgroup__ drivers. We will be using the --cgroup-driver flag proposed in the [Systemd Node Spec](kubelet-systemd.md) to specify the cgroup driver.
- We inject creation and deletion of pod level cgroups into the pod workers.
- As new pods are added, the QoS class cgroup parameters are updated to match the pods' resource requests.
#### Container level cgroups
Have the docker manager create container cgroups under pod level cgroups. With the docker runtime, we will pass `--cgroup-parent` using the syntax expected for the corresponding cgroup driver the runtime was configured to use.
#### Rkt runtime
We want to have rkt create pods under a root QoS class that kubelet specifies, and set pod level cgroup parameters mentioned in this proposal by itself.
#### Add Pod level metrics to Kubelet's metrics provider
Update Kubelet's metrics provider to include pod level metrics. Use cAdvisor's cgroup subsystem information to determine various pod level usage metrics.
`Note: Changes to cAdvisor might be necessary.`
## Rollout Plan
This feature will be opt-in in v1.4 and opt-out in v1.5. We recommend that users drain their nodes and opt in before switching to v1.5, so that rolling out the v1.5 kubelet becomes a no-op.
## Implementation Status
The implementation goals of the first milestone are outlined below.
- [x] Finalize and submit Pod Resource Management proposal for the project #26751
- [x] Refactor qos package to be used globally throughout the codebase #27749 #28093
- [x] Add interfaces for CgroupManager and CgroupManagerImpl which implements the CgroupManager interface and creates, destroys/updates cgroups using the libcontainer cgroupfs driver. #27755 #28566
- [x] Inject top level QoS Cgroup creation in the Kubelet and add e2e tests to test that behaviour. #27853
- [x] Add PodContainerManagerImpl Create and Destroy methods which implements the respective PodContainerManager methods using a cgroupfs driver. #28017
- [x] Have docker manager create container cgroups under pod level cgroups. Inject creation and deletion of pod cgroups into the pod workers. Add e2e tests to test this behaviour. #29049
- [x] Add support for updating policy for the pod cgroups. Add e2e tests to test this behaviour. #29087
- [ ] Enabling 'cgroup-per-qos' flag in Kubelet: The user is expected to drain the node and restart it before enabling this feature, but as a fallback we also want to allow the user to just restart the kubelet with the cgroup-per-qos flag enabled to use this feature. As a part of this we need to figure out a policy for pods having Restart Policy: Never. More details in this [issue](https://github.com/kubernetes/kubernetes/issues/29946).
- [ ] Removing terminated pod's Cgroup : We need to cleanup the pod's cgroup once the pod is terminated. More details in this [issue](https://github.com/kubernetes/kubernetes/issues/29927).
- [ ] Kubelet needs to ensure that the cgroup settings are what the kubelet expects them to be. If security is not of concern, one can assume that once kubelet applies cgroups setting successfully, the values will never change unless kubelet changes it. If security is of concern, then kubelet will have to ensure that the cgroup values meet its requirements and then continue to watch for updates to cgroups via inotify and re-apply cgroup values if necessary.
Updating QoS limits needs to happen before pod cgroups values are updated. When pod cgroups are being deleted, QoS limits have to be updated after pod cgroup values have been updated for deletion or pod cgroups have been removed. Given that kubelet doesn't have any checkpoints and updates to QoS and pod cgroups are not atomic, kubelet needs to reconcile cgroups status whenever it restarts to ensure that the cgroups values match kubelet's expectation.
- [ ] [TEST] Opting in to this feature and rolling back should be accompanied by detailed error messages when pods are killed intermittently.
- [ ] Add a systemd implementation for Cgroup Manager interface
Other smaller work items that would be good to have before the release of this feature.
- [ ] Add Pod UID to the downward api which will help simplify the e2e testing logic.
- [ ] Check if the parent cgroups exist and error out if they don't.
- [ ] Set top level cgroup limit to resource allocatable until we support QoS level cgroup updates. If cgroup root is not `/` then set node resource allocatable as the cgroup resource limits on cgroup root.
- [ ] Add a NodeResourceAllocatableProvider which returns the amount of allocatable resources on the nodes. This interface would be used both by the Kubelet and ContainerManager.
- [ ] Add top level feasibility check to ensure that pod can be admitted on the node by estimating left over resources on the node.
- [ ] Log basic cgroup management metrics, i.e. creation/deletion.
To better support our requirements we needed to make some changes/add features to Libcontainer as well:
- [x] Allowing or denying all devices by writing 'a' to devices.allow or devices.deny is
not possible once the device cgroup has children. Libcontainer doesn't have the option of skipping updates on the parent devices cgroup. opencontainers/runc/pull/958
- [x] To use libcontainer for creating and managing cgroups in the Kubelet, I would like to just create a cgroup with no pid attached and if need be apply a pid to the cgroup later on. But libcontainer did not support cgroup creation without attaching a pid. opencontainers/runc/pull/956
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/pod-resource-management.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->


@ -1,374 +1 @@
## Abstract This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-security-context.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-security-context.md)
A proposal for refactoring `SecurityContext` to have pod-level and container-level attributes in
order to correctly model pod- and container-level security concerns.
## Motivation
Currently, containers have a `SecurityContext` attribute which contains information about the
security settings the container uses. In practice, many of these attributes are uniform across all
containers in a pod. Simultaneously, there is also a need to apply the security context pattern
at the pod level to correctly model security attributes that apply only at a pod level.
Users should be able to:
1. Express security settings that are applicable to the entire pod
2. Express base security settings that apply to all containers
3. Override only the settings that need to be differentiated from the base in individual
containers
This proposal is a dependency for other changes related to security context:
1. [Volume ownership management in the Kubelet](https://github.com/kubernetes/kubernetes/pull/12944)
2. [Generic SELinux label management in the Kubelet](https://github.com/kubernetes/kubernetes/pull/14192)
Goals of this design:
1. Describe the use cases for which a pod-level security context is necessary
2. Thoroughly describe the API backward compatibility issues that arise from the introduction of
a pod-level security context
3. Describe all implementation changes necessary for the feature
## Constraints and assumptions
1. We will not design for intra-pod security; we are not currently concerned about isolating
containers in the same pod from one another
1. We will design for backward compatibility with the current V1 API
## Use Cases
1. As a developer, I want to correctly model security attributes which belong to an entire pod
2. As a user, I want to be able to specify container attributes that apply to all containers
without repeating myself
3. As an existing user, I want to be able to use the existing container-level security API
### Use Case: Pod level security attributes
Some security attributes make sense only to model at the pod level. For example, it is a
fundamental property of pods that all containers in a pod share the same network namespace.
Therefore, using the host namespace makes sense to model at the pod level only, and indeed, today
it is part of the `PodSpec`. Other host namespace support is currently being added and these will
also be pod-level settings; it makes sense to model them as a pod-level collection of security
attributes.
### Use Case: Override pod security context for container
Some use cases require the containers in a pod to run with different security settings. As an
example, a user may want to have a pod with two containers, one of which runs as root with the
privileged setting, and one that runs as a non-root UID. To support use cases like this, it should
be possible to override appropriate (i.e., not intrinsically pod-level) security settings for
individual containers.
## Proposed Design
### SecurityContext
For posterity and ease of reading, note the current state of `SecurityContext`:
```go
package api
type Container struct {
// Other fields omitted
// Optional: SecurityContext defines the security options the pod should be run with
SecurityContext *SecurityContext `json:"securityContext,omitempty"`
}
type SecurityContext struct {
// Capabilities are the capabilities to add/drop when running the container
Capabilities *Capabilities `json:"capabilities,omitempty"`
// Run the container in privileged mode
Privileged *bool `json:"privileged,omitempty"`
// SELinuxOptions are the labels to be applied to the container
// and volumes
SELinuxOptions *SELinuxOptions `json:"seLinuxOptions,omitempty"`
// RunAsUser is the UID to run the entrypoint of the container process.
RunAsUser *int64 `json:"runAsUser,omitempty"`
// RunAsNonRoot indicates that the container should be run as a non-root user. If the RunAsUser
// field is not explicitly set then the kubelet may check the image for a specified user or
// perform defaulting to specify a user.
RunAsNonRoot bool `json:"runAsNonRoot,omitempty"`
}
// SELinuxOptions contains the fields that make up the SELinux context of a container.
type SELinuxOptions struct {
// SELinux user label
User string `json:"user,omitempty"`
// SELinux role label
Role string `json:"role,omitempty"`
// SELinux type label
Type string `json:"type,omitempty"`
// SELinux level label.
Level string `json:"level,omitempty"`
}
```
### PodSecurityContext
`PodSecurityContext` specifies two types of security attributes:
1. Attributes that apply to the pod itself
2. Attributes that apply to the containers of the pod
In the internal API, fields of the `PodSpec` controlling the use of the host PID, IPC, and network
namespaces are relocated to this type:
```go
package api
type PodSpec struct {
// Other fields omitted
// Optional: SecurityContext specifies pod-level attributes and container security attributes
// that apply to all containers.
SecurityContext *PodSecurityContext `json:"securityContext,omitempty"`
}
// PodSecurityContext specifies security attributes of the pod and container attributes that apply
// to all containers of the pod.
type PodSecurityContext struct {
// Use the host's network namespace. If this option is set, the ports that will be
// used must be specified.
// Optional: Default to false.
HostNetwork bool
// Use the host's IPC namespace
HostIPC bool
// Use the host's PID namespace
HostPID bool
// Capabilities are the capabilities to add/drop when running containers
Capabilities *Capabilities `json:"capabilities,omitempty"`
// Run the container in privileged mode
Privileged *bool `json:"privileged,omitempty"`
// SELinuxOptions are the labels to be applied to the container
// and volumes
SELinuxOptions *SELinuxOptions `json:"seLinuxOptions,omitempty"`
// RunAsUser is the UID to run the entrypoint of the container process.
RunAsUser *int64 `json:"runAsUser,omitempty"`
// RunAsNonRoot indicates that the container should be run as a non-root user. If the RunAsUser
// field is not explicitly set then the kubelet may check the image for a specified user or
// perform defaulting to specify a user.
RunAsNonRoot bool
}
// Comments and generated docs will change for the container.SecurityContext field to indicate
// the precedence of these fields over the pod-level ones.
type Container struct {
// Other fields omitted
// Optional: SecurityContext defines the security options the pod should be run with.
// Settings specified in this field take precedence over the settings defined in
// pod.Spec.SecurityContext.
SecurityContext *SecurityContext `json:"securityContext,omitempty"`
}
```
In the V1 API, the pod-level security attributes which are currently fields of the `PodSpec` are
retained on the `PodSpec` for backward compatibility purposes:
```go
package v1
type PodSpec struct {
// Other fields omitted
// Use the host's network namespace. If this option is set, the ports that will be
// used must be specified.
// Optional: Default to false.
HostNetwork bool `json:"hostNetwork,omitempty"`
// Use the host's pid namespace.
// Optional: Default to false.
HostPID bool `json:"hostPID,omitempty"`
// Use the host's ipc namespace.
// Optional: Default to false.
HostIPC bool `json:"hostIPC,omitempty"`
// Optional: SecurityContext specifies pod-level attributes and container security attributes
// that apply to all containers.
SecurityContext *PodSecurityContext `json:"securityContext,omitempty"`
}
```
The `pod.Spec.SecurityContext` specifies the security context of all containers in the pod.
The containers' `securityContext` field is overlaid on the base security context to determine the
effective security context for the container.
The new V1 API should be backward compatible with the existing API. Backward compatibility is
defined as:
> 1. Any API call (e.g. a structure POSTed to a REST endpoint) that worked before your change must
> work the same after your change.
> 2. Any API call that uses your change must not cause problems (e.g. crash or degrade behavior) when
> issued against servers that do not include your change.
> 3. It must be possible to round-trip your change (convert to different API versions and back) with
> no loss of information.
Previous versions of this proposal attempted to deal with backward compatibility by defining
the effect of setting the pod-level fields on the container-level fields. While trying to find
consensus on this design, it became apparent that this approach was going to be extremely complex
to implement, explain, and support. Instead, we will approach backward compatibility as follows:
1. Pod-level and container-level settings will not affect one another
2. Old clients will be able to use container-level settings in the exact same way
3. Container level settings always override pod-level settings if they are set
#### Examples
1. Old client using `pod.Spec.Containers[x].SecurityContext`
An old client creates a pod:
```yaml
apiVersion: v1
kind: Pod
metadata:
name: test-pod
spec:
containers:
- name: a
securityContext:
runAsUser: 1001
- name: b
securityContext:
runAsUser: 1002
```
looks to old clients like:
```yaml
apiVersion: v1
kind: Pod
metadata:
name: test-pod
spec:
containers:
- name: a
securityContext:
runAsUser: 1001
- name: b
securityContext:
runAsUser: 1002
```
looks to new clients like:
```yaml
apiVersion: v1
kind: Pod
metadata:
name: test-pod
spec:
containers:
- name: a
securityContext:
runAsUser: 1001
- name: b
securityContext:
runAsUser: 1002
```
2. New client using `pod.Spec.SecurityContext`
A new client creates a pod using a field of `pod.Spec.SecurityContext`:
```yaml
apiVersion: v1
kind: Pod
metadata:
name: test-pod
spec:
securityContext:
runAsUser: 1001
containers:
- name: a
- name: b
```
appears to new clients as:
```yaml
apiVersion: v1
kind: Pod
metadata:
name: test-pod
spec:
securityContext:
runAsUser: 1001
containers:
- name: a
- name: b
```
old clients will see:
```yaml
apiVersion: v1
kind: Pod
metadata:
name: test-pod
spec:
containers:
- name: a
- name: b
```
3. Pods created using `pod.Spec.SecurityContext` and `pod.Spec.Containers[x].SecurityContext`
If a field is set in both `pod.Spec.SecurityContext` and
`pod.Spec.Containers[x].SecurityContext`, the value in `pod.Spec.Containers[x].SecurityContext`
wins. In the following pod:
```yaml
apiVersion: v1
kind: Pod
metadata:
name: test-pod
spec:
securityContext:
runAsUser: 1001
containers:
- name: a
securityContext:
runAsUser: 1002
- name: b
```
The effective setting for `runAsUser` for container A is `1002`.
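A minimal sketch of this "container overrides pod" rule (simplified types, only `runAsUser` shown; not the actual Kubelet logic):

```go
package securitycontext

// Simplified views of the pod- and container-level security contexts.
type PodSecurityContext struct {
	RunAsUser *int64
}

type SecurityContext struct {
	RunAsUser *int64
}

// effectiveRunAsUser overlays the container-level setting on the pod-level
// one: a field set on the container always wins; otherwise the pod-level
// value (if any) applies.
func effectiveRunAsUser(pod *PodSecurityContext, container *SecurityContext) *int64 {
	if container != nil && container.RunAsUser != nil {
		return container.RunAsUser
	}
	if pod != nil {
		return pod.RunAsUser
	}
	return nil
}
```

Applied to the pod above, the pod-level value 1001 is overridden by container A's 1002, while container B inherits 1001.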
#### Testing
A backward compatibility test suite will be established for the v1 API. The test suite will
verify compatibility by converting objects into the internal API and back to the version API and
examining the results.
All of the examples here will be used as test-cases. As more test cases are added, the proposal will
be updated.
An example of a test like this can be found in the
[OpenShift API package](https://github.com/openshift/origin/blob/master/pkg/api/compatibility_test.go)
E2E test cases will be added to test the correct determination of the security context for containers.
### Kubelet changes
1. The Kubelet will use the new fields on the `PodSecurityContext` for host namespace control
2. The Kubelet will be modified to correctly implement the backward compatibility and effective
security context determination defined here
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/pod-security-context.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->


@ -1,480 +1 @@
# Protobuf serialization and internal storage This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/protobuf.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/protobuf.md)
@smarterclayton
March 2016
## Proposal and Motivation
The Kubernetes API server is a "dumb server" which offers storage, versioning,
validation, update, and watch semantics on API resources. In a large cluster
the API server must efficiently retrieve, store, and deliver large numbers
of coarse-grained objects to many clients. In addition, Kubernetes traffic is
heavily biased towards intra-cluster traffic - as much as 90% of the requests
served by the APIs are for internal cluster components like nodes, controllers,
and proxies. The primary format for intercluster API communication is JSON
today for ease of client construction.
At the current time, the latency of reaction to change in the cluster is
dominated by the time required to load objects from persistent store (etcd),
convert them to an output version, serialize them to JSON over the network, and
then perform the reverse operation in clients. The cost of
serialization/deserialization and the size of the bytes on the wire, as well
as the memory garbage created during those operations, dominate the CPU and
network usage of the API servers.
In order to reach clusters of 10k nodes, we need roughly an order of magnitude
efficiency improvement in a number of areas of the cluster, starting with the
masters but also including API clients like controllers, kubelets, and node
proxies.
We propose to introduce a Protobuf serialization for all common API objects
that can optionally be used by intra-cluster components. Experiments have
demonstrated a 10x reduction in CPU use during serialization and deserialization,
a 2x reduction in size in bytes on the wire, and a 6-9x reduction in the amount
of objects created on the heap during serialization. The Protobuf schema
for each object will be automatically generated from the external API Go structs
we use to serialize to JSON.
Benchmarking showed that the time spent on the server in a typical GET
resembles:
    etcd -> decode     -> defaulting -> convert to internal ->
      JSON    50us        5us           15us
      Proto   5us
      JSON    150allocs                 80allocs
      Proto   100allocs

    process -> convert to external -> encode -> client
      JSON                 15us          40us
      Proto                              5us
      JSON                 80allocs      100allocs
      Proto                              4allocs
Protobuf has a huge benefit on encoding because it does not need to allocate
temporary objects, just one large buffer. Changing to protobuf moves our
hotspot back to conversion, not serialization.
## Design Points
* Generate Protobuf schema from Go structs (like we do for JSON) to avoid
manual schema update and drift
* Generate Protobuf schema that is field equivalent to the JSON fields (no
special types or enumerations), reducing drift for clients across formats.
* Follow our existing API versioning rules (backwards compatible in major
API versions, breaking changes across major versions) by creating one
Protobuf schema per API type.
* Continue to use the existing REST API patterns but offer an alternative
serialization, which means existing client and server tooling can remain
the same while benefiting from faster decoding.
* Protobuf objects on disk or in etcd will need to be self identifying at
rest, like JSON, in order for backwards compatibility in storage to work,
so we must add an envelope with apiVersion and kind to wrap the nested
object, and make the data format recognizable to clients.
* Use the [gogo-protobuf](https://github.com/gogo/protobuf) Golang library to generate marshal/unmarshal
operations, allowing us to bypass the expensive reflection used by the
golang JSON operation
## Alternatives
* We considered JSON compression to reduce size on wire, but that does not
reduce the amount of memory garbage created during serialization and
deserialization.
* More efficient formats like Msgpack were considered, but they only offer
2x speed up vs. the 10x observed for Protobuf
* gRPC was considered, but is a larger change that requires more core
refactoring. This approach does not eliminate the possibility of switching
to gRPC in the future.
* We considered attempting to improve JSON serialization, but the cost of
implementing a more efficient serializer library than ugorji is
significantly higher than creating a protobuf schema from our Go structs.
## Schema
The Protobuf schema for each API group and version will be generated from
the objects in that API group and version. The schema will be named using
the package identifier of the Go package, i.e.
k8s.io/kubernetes/pkg/api/v1
Each top level object will be generated as a Protobuf message, i.e.:
type Pod struct { ... }
message Pod {}
Since the Go structs are designed to be serialized to JSON (with only the
int, string, bool, map, and array primitive types), we will use the
canonical JSON serialization as the protobuf field type wherever possible,
i.e.:
    JSON        Protobuf
    string   -> string
    int      -> varint
    bool     -> bool
    array    -> repeating message|primitive
We disallow the use of the Go `int` type in external fields because it is
ambiguous depending on compiler platform, and instead always use `int32` or
`int64`.
We will use maps (a protobuf 3 extension that can serialize to protobuf 2)
to represent JSON maps:
    JSON        Protobuf             Wire (proto2)
    map      -> map<string, ...>  -> repeated Message { key string; value bytes }
We will not convert known string constants to enumerations, since that
would require extra logic we do not already have in JSON.
To begin with, we will use Protobuf 3 to generate a Protobuf 2 schema, and
in the future investigate a Protobuf 3 serialization. We will introduce
abstractions that let us have more than a single protobuf serialization if
necessary. Protobuf 3 would require us to support message types for
pointer primitive (nullable) fields, which is more complex than Protobuf 2's
support for pointers.
### Example of generated proto IDL
Without gogo extensions:
```
syntax = 'proto2';
package k8s.io.kubernetes.pkg.api.v1;
import "k8s.io/kubernetes/pkg/api/resource/generated.proto";
import "k8s.io/kubernetes/pkg/api/unversioned/generated.proto";
import "k8s.io/kubernetes/pkg/runtime/generated.proto";
import "k8s.io/kubernetes/pkg/util/intstr/generated.proto";
// Package-wide variables from generator "generated".
option go_package = "v1";
// Represents a Persistent Disk resource in AWS.
//
// An AWS EBS disk must exist before mounting to a container. The disk
// must also be in the same AWS zone as the kubelet. An AWS EBS disk
// can only be mounted as read/write once. AWS EBS volumes support
// ownership management and SELinux relabeling.
message AWSElasticBlockStoreVolumeSource {
// Unique ID of the persistent disk resource in AWS (Amazon EBS volume).
// More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore
optional string volumeID = 1;
// Filesystem type of the volume that you want to mount.
// Tip: Ensure that the filesystem type is supported by the host operating system.
// Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.
// More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore
// TODO: how do we prevent errors in the filesystem from compromising the machine
optional string fsType = 2;
// The partition in the volume that you want to mount.
// If omitted, the default is to mount by volume name.
// Examples: For volume /dev/sda1, you specify the partition as "1".
// Similarly, the volume partition for /dev/sda is "0" (or you can leave the property empty).
optional int32 partition = 3;
// Specify "true" to force and set the ReadOnly property in VolumeMounts to "true".
// If omitted, the default is "false".
// More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore
optional bool readOnly = 4;
}
// Affinity is a group of affinity scheduling rules, currently
// only node affinity, but in the future also inter-pod affinity.
message Affinity {
// Describes node affinity scheduling rules for the pod.
optional NodeAffinity nodeAffinity = 1;
}
```
With extensions:
```
syntax = 'proto2';
package k8s.io.kubernetes.pkg.api.v1;
import "github.com/gogo/protobuf/gogoproto/gogo.proto";
import "k8s.io/kubernetes/pkg/api/resource/generated.proto";
import "k8s.io/kubernetes/pkg/api/unversioned/generated.proto";
import "k8s.io/kubernetes/pkg/runtime/generated.proto";
import "k8s.io/kubernetes/pkg/util/intstr/generated.proto";
// Package-wide variables from generator "generated".
option (gogoproto.marshaler_all) = true;
option (gogoproto.sizer_all) = true;
option (gogoproto.unmarshaler_all) = true;
option (gogoproto.goproto_unrecognized_all) = false;
option (gogoproto.goproto_enum_prefix_all) = false;
option (gogoproto.goproto_getters_all) = false;
option go_package = "v1";
// Represents a Persistent Disk resource in AWS.
//
// An AWS EBS disk must exist before mounting to a container. The disk
// must also be in the same AWS zone as the kubelet. An AWS EBS disk
// can only be mounted as read/write once. AWS EBS volumes support
// ownership management and SELinux relabeling.
message AWSElasticBlockStoreVolumeSource {
// Unique ID of the persistent disk resource in AWS (Amazon EBS volume).
// More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore
optional string volumeID = 1 [(gogoproto.customname) = "VolumeID", (gogoproto.nullable) = false];
// Filesystem type of the volume that you want to mount.
// Tip: Ensure that the filesystem type is supported by the host operating system.
// Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.
// More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore
// TODO: how do we prevent errors in the filesystem from compromising the machine
optional string fsType = 2 [(gogoproto.customname) = "FSType", (gogoproto.nullable) = false];
// The partition in the volume that you want to mount.
// If omitted, the default is to mount by volume name.
// Examples: For volume /dev/sda1, you specify the partition as "1".
// Similarly, the volume partition for /dev/sda is "0" (or you can leave the property empty).
optional int32 partition = 3 [(gogoproto.customname) = "Partition", (gogoproto.nullable) = false];
// Specify "true" to force and set the ReadOnly property in VolumeMounts to "true".
// If omitted, the default is "false".
// More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore
optional bool readOnly = 4 [(gogoproto.customname) = "ReadOnly", (gogoproto.nullable) = false];
}
// Affinity is a group of affinity scheduling rules, currently
// only node affinity, but in the future also inter-pod affinity.
message Affinity {
// Describes node affinity scheduling rules for the pod.
optional NodeAffinity nodeAffinity = 1 [(gogoproto.customname) = "NodeAffinity"];
}
```
## Wire format
In order to make Protobuf serialized objects recognizable in a binary form,
the encoded object must be prefixed by a magic number, and then wrap the
non-self-describing Protobuf object in a Protobuf object that contains
schema information. The protobuf object is referred to as the `raw` object
and the encapsulation is referred to as `wrapper` object.
The simplest serialization is the raw Protobuf object with no identifying
information. In some use cases, we may wish to have the server identify the
raw object type on the wire using a protocol dependent format (gRPC uses
a type HTTP header). This works when all objects are of the same type, but
we occasionally have reasons to encode different object types in the same
context (watches, lists of objects on disk, and API calls that may return
errors).
To identify the type of a wrapped Protobuf object, we wrap it in a message
in package `k8s.io/kubernetes/pkg/runtime` with message name `Unknown`
having the following schema:
    message Unknown {
      optional TypeMeta typeMeta = 1;
      optional bytes value = 2;
      optional string contentEncoding = 3;
      optional string contentType = 4;
    }

    message TypeMeta {
      optional string apiVersion = 1;
      optional string kind = 2;
    }
The `value` field is an encoded protobuf object that matches the schema
defined in `typeMeta` and has optional `contentType` and `contentEncoding`
fields. `contentType` and `contentEncoding` have the same meaning as in
HTTP, if unspecified `contentType` means "raw protobuf object", and
`contentEncoding` defaults to no encoding. If `contentEncoding` is
specified, the defined transformation should be applied to `value` before
attempting to decode the value.
The `contentType` field is required to support objects without a defined
protobuf schema, like the ThirdPartyResource or templates. Those objects
would have to be encoded as JSON or another structure compatible form
when used with Protobuf. Generic clients must deal with the possibility
that the returned value is not in the known type.
We add the `contentEncoding` field here to preserve room for future
optimizations like encryption-at-rest or compression of the nested content.
Clients should error when receiving an encoding they do not support.
Negotiating encoding is not defined here, but introducing new encodings
is similar to introducing a schema change or new API version.
A client should use the `kind` and `apiVersion` fields to identify the
correct protobuf IDL for that message and version, and then decode the
`bytes` field into that Protobuf message.
Any Unknown value written to stable storage will be given a 4 byte prefix
`0x6b, 0x38, 0x73, 0x00`, which corresponds to `k8s` followed by a zero byte.
The content-type `application/vnd.kubernetes.protobuf` is defined as
representing the following schema:
    MESSAGE = '0x6b 0x38 0x73 0x00' UNKNOWN
    UNKNOWN = <protobuf serialization of k8s.io/kubernetes/pkg/runtime#Unknown>
A client should check for the first four bytes, then perform a protobuf
deserialization of the remaining bytes into the `runtime.Unknown` type.
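A minimal sketch of wrapping and unwrapping this envelope (the serialization of `runtime.Unknown` itself is assumed to come from the gogo-protobuf generated marshalers; this is an illustration, not the actual Kubernetes serializer):

```go
package protobuf

import (
	"bytes"
	"fmt"
)

// protoEncodingPrefix is the 4-byte magic that marks protobuf-encoded
// Kubernetes payloads: "k8s" followed by a zero byte.
var protoEncodingPrefix = []byte{0x6b, 0x38, 0x73, 0x00}

// wrap prefixes an already-serialized runtime.Unknown message with the magic bytes.
func wrap(serializedUnknown []byte) []byte {
	return append(append([]byte{}, protoEncodingPrefix...), serializedUnknown...)
}

// unwrap checks for the magic prefix and returns the remaining bytes, which
// the caller then deserializes into runtime.Unknown to learn the apiVersion
// and kind before decoding the nested value.
func unwrap(data []byte) ([]byte, error) {
	if !bytes.HasPrefix(data, protoEncodingPrefix) {
		return nil, fmt.Errorf("data is not in application/vnd.kubernetes.protobuf format")
	}
	return data[len(protoEncodingPrefix):], nil
}
```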
## Streaming wire format
While the majority of Kubernetes APIs return single objects that can vary
in type (Pod vs. Status, PodList vs. Status), the watch APIs return a stream
of identical objects (Events). At the time of this writing, this is the only
current or anticipated streaming RESTful protocol (logging, port-forwarding,
and exec protocols use a binary protocol over Websockets or SPDY).
In JSON, this API is implemented as a stream of JSON objects that are
separated by their syntax (the closing `}` brace is followed by whitespace
and the opening `{` brace starts the next object). There is no formal
specification covering this pattern, nor a unique content-type. Each object
is expected to be of type `watch.Event`, and is currently not self describing.
For expediency and consistency, we define a format for Protobuf watch Events
that is similar. Since protobuf messages are not self describing, we must
identify the boundaries between Events (a `frame`). We do that by prefixing
each frame of N bytes with a 4-byte, big-endian, unsigned integer with the
value N.
    frame  = length body
    length = 32-bit unsigned integer in big-endian order, denoting length of
             bytes of body
    body   = <bytes>

    # frame containing a single byte 0a
    frame = 00 00 00 01 0a
    # equivalent JSON
    frame = {"type": "added", ...}
The body of each frame is a serialized Protobuf message `Event` in package
`k8s.io/kubernetes/pkg/watch/versioned`. The content type used for this
format is `application/vnd.kubernetes.protobuf;type=watch`.
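A minimal sketch of this framing using only the standard library (the caller supplies the serialized `Event` bytes):

```go
package framing

import (
	"encoding/binary"
	"io"
)

// writeFrame writes one frame: a 4-byte big-endian length followed by the body.
func writeFrame(w io.Writer, body []byte) error {
	var length [4]byte
	binary.BigEndian.PutUint32(length[:], uint32(len(body)))
	if _, err := w.Write(length[:]); err != nil {
		return err
	}
	_, err := w.Write(body)
	return err
}

// readFrame reads one frame and returns its body, e.g. a serialized watch Event.
func readFrame(r io.Reader) ([]byte, error) {
	var length [4]byte
	if _, err := io.ReadFull(r, length[:]); err != nil {
		return nil, err
	}
	body := make([]byte, binary.BigEndian.Uint32(length[:]))
	if _, err := io.ReadFull(r, body); err != nil {
		return nil, err
	}
	return body, nil
}
```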
## Negotiation
To allow clients to request protobuf serialization optionally, the `Accept`
HTTP header is used by callers to indicate which serialization they wish
returned in the response, and the `Content-Type` header is used to tell the
server how to decode the bytes sent in the request (for DELETE/POST/PUT/PATCH
requests). The server will return 406 if the `Accept` header is not
recognized or 415 if the `Content-Type` is not recognized (as defined in
RFC2616).
To be backwards compatible, clients must consider that the server does not
support protobuf serialization. A number of options are possible:
### Preconfigured
Clients can have a configuration setting that instructs them which version
to use. This is the simplest option, but requires intervention when the
component upgrades to protobuf.
### Include serialization information in api-discovery
Servers can define the list of content types they accept and return in
their API discovery docs, and clients can use protobuf if they support it.
Allows dynamic configuration during upgrade if the client is already using
API-discovery.
### Optimistically attempt to send and receive requests using protobuf
Using multiple `Accept` values:
    Accept: application/vnd.kubernetes.protobuf, application/json
clients can indicate their preferences and handle the returned
`Content-Type` using whatever the server responds. On update operations,
clients can try protobuf and if they receive a 415 error, record that and
fall back to JSON. Allows the client to be backwards compatible with
any server, but comes at the cost of some implementation complexity.
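A minimal sketch of the optimistic approach on top of `net/http` (hypothetical helper; not the actual client library negotiation code):

```go
package negotiation

import (
	"bytes"
	"net/http"
)

const (
	protobufType = "application/vnd.kubernetes.protobuf"
	jsonType     = "application/json"
)

// putOptimistic tries a protobuf-encoded update first and falls back to JSON
// when the server answers 415 Unsupported Media Type. The second return value
// reports whether the fallback happened, so the caller can remember it and
// avoid retrying protobuf on every request.
func putOptimistic(client *http.Client, url string, protoBody, jsonBody []byte) (*http.Response, bool, error) {
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(protoBody))
	if err != nil {
		return nil, false, err
	}
	req.Header.Set("Content-Type", protobufType)
	req.Header.Set("Accept", protobufType+", "+jsonType)

	resp, err := client.Do(req)
	if err != nil {
		return nil, false, err
	}
	if resp.StatusCode != http.StatusUnsupportedMediaType {
		return resp, false, nil // server accepted protobuf (or failed for another reason)
	}
	resp.Body.Close()

	req, err = http.NewRequest(http.MethodPut, url, bytes.NewReader(jsonBody))
	if err != nil {
		return nil, true, err
	}
	req.Header.Set("Content-Type", jsonType)
	req.Header.Set("Accept", jsonType)
	resp, err = client.Do(req)
	return resp, true, err
}
```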
## Generation process
Generation proceeds in five phases:
1. Generate a gogo-protobuf annotated IDL from the source Go struct.
2. Generate temporary Go structs from the IDL using gogo-protobuf.
3. Generate marshaller/unmarshallers based on the IDL using gogo-protobuf.
4. Take all tag numbers generated for the IDL and apply them as struct tags
to the original Go types.
5. Generate a final IDL without gogo-protobuf annotations as the canonical IDL.
The output is a `generated.proto` file in each package containing a standard
proto2 IDL, and a `generated.pb.go` file in each package that contains the
generated marshal/unmarshallers.
The Go struct generated by gogo-protobuf from the first IDL must be identical
to the origin struct - a number of changes have been made to gogo-protobuf
to ensure exact 1-1 conversion. A small number of additions may be necessary
in the future if we introduce more exotic field types (Go type aliases, maps
with aliased Go types, and embedded fields were fixed). If they are identical,
the output marshallers/unmarshallers can then work on the origin struct.
Whenever a new field is added, generation will assign that field a unique tag
and the 4th phase will write that tag back to the origin Go struct as a `protobuf`
struct tag. This ensures subsequent generation passes are stable, even in the
face of internal refactors. The first time a field is added, the author will
need to check in both the new IDL AND the protobuf struct tag changes.
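For illustration, a field on a hypothetical versioned type might look roughly like this after the fourth phase has written the assigned tags back as `protobuf` struct tags; the type, field names, and tag numbers here are invented for the example.

```go
package example

// ExampleSpec is a hypothetical versioned API type.
type ExampleSpec struct {
	// An existing field keeps the tag number it was assigned when it was first added.
	Replicas int32 `json:"replicas" protobuf:"varint,1,opt,name=replicas"`

	// A newly added field receives the next unused tag number; the regenerated
	// IDL and this struct tag change must be checked in together.
	MinReadySeconds int32 `json:"minReadySeconds,omitempty" protobuf:"varint,2,opt,name=minReadySeconds"`
}
```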
The second IDL is generated without gogo-protobuf annotations so that clients
in other languages can easily generate code from it.
Any errors in the generation process are considered fatal and must be resolved
early (being unable to identify a field type for conversion, duplicate fields,
duplicate tags, protoc errors, etc). The conversion fuzzer is used to ensure
that a Go struct can be round-tripped to protobuf and back, as we do for JSON
and conversion testing.
## Changes to development process
All existing API change rules would still apply. New fields added would be
automatically assigned a tag by the generation process. New API versions will
have a new proto IDL, and field name and type changes across API versions would be
handled using our existing API change rules. Tags cannot change within an
API version.
Generation would be done by developers and then checked into source control,
like conversions and ugorji JSON codecs.
Because protoc is not packaged well across all platforms, we will add it to
the `kube-cross` Docker image and developers can use that to generate
updated protobufs. Protobuf 3 beta is required.
The generated protobuf will be checked with a verify script before merging.
## Implications
* The generated marshal code is large and will increase build times and binary
size. We may be able to remove ugorji after protobuf is added, since the
bulk of our decoding would switch to protobuf.
* The protobuf schema is naive, which means it may not be as minimal as possible.
* Debugging of protobuf related errors is harder due to the binary nature of
the format.
* Migrating API object storage from JSON to protobuf will require that all
API servers are upgraded before beginning to write protobuf to disk, since
old servers won't recognize protobuf.
* Transport of protobuf between etcd and the api server will be less efficient
in etcd2 than etcd3 (since etcd2 must encode binary values returned as JSON).
It should still be smaller than the current JSON requests.
* Third-party API objects must be stored as JSON inside of a protobuf wrapper
in etcd, and the API endpoints will not benefit from clients that speak
protobuf. Clients will have to deal with some API objects not supporting
protobuf.
## Open Questions
* Is supporting stored protobuf files on disk in the kubectl client worth it?

View File

@ -1,194 +1 @@
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/release-notes.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/release-notes.md)
# Kubernetes Release Notes
[djmm@google.com](mailto:djmm@google.com)<BR>
Last Updated: 2016-04-06
<!-- BEGIN MUNGE: GENERATED_TOC -->
- [Kubernetes Release Notes](#kubernetes-release-notes)
- [Objective](#objective)
- [Background](#background)
- [The Problem](#the-problem)
- [The (general) Solution](#the-general-solution)
- [Then why not just list *every* change that was submitted, CHANGELOG-style?](#then-why-not-just-list-every-change-that-was-submitted-changelog-style)
- [Options](#options)
- [Collection Design](#collection-design)
- [Publishing Design](#publishing-design)
- [Location](#location)
- [Layout](#layout)
- [Alpha/Beta/Patch Releases](#alphabetapatch-releases)
- [Major/Minor Releases](#majorminor-releases)
- [Work estimates](#work-estimates)
- [Caveats / Considerations](#caveats--considerations)
<!-- END MUNGE: GENERATED_TOC -->
## Objective
Define a process and design tooling for collecting, arranging and publishing
release notes for Kubernetes releases, automating as much of the process as
possible.
The goal is to introduce minor changes to the development workflow
in a way that is mostly frictionless and allows for the capture of release notes
as PRs are submitted to the repository.
This direct association of release notes to PRs captures the intention of
release visibility of the PR at the point an idea is submitted upstream.
The release notes can then be more easily collected and published when the
release is ready.
## Background
### The Problem
Release notes are often an afterthought; clarifying and finalizing them
is frequently left until the very last minute, when the release is made.
This is usually long after the feature or bug fix was added and it is no longer on
the mind of the author. Worse, collecting and summarizing the
release notes is often left to those who may know little or nothing about the
individual changes!
Writing and editing release notes at the end of the cycle can be a rushed,
interrupt-driven and often stressful process resulting in incomplete,
inconsistent release notes often with errors and omissions.
### The (general) Solution
Like most things in the development/release pipeline, the earlier you do it,
the easier it is for everyone and the better the outcome. Gather your release
notes earlier in the development cycle, at the time the features and fixes are
added.
#### Then why not just list *every* change that was submitted, CHANGELOG-style?
On larger projects like Kubernetes, showing every single change (PR) would mean
hundreds of entries. The goal is to highlight the major changes for a release.
## Options
1. Use of pre-commit and other local git hooks
* Experiments here using `prepare-commit-msg` and `commit-msg` git hook files
were promising but less than optimal due to the fact that they would
require input/confirmation with each commit and there may be multiple
commits in a push and eventual PR.
1. Use of [github templates](https://github.com/blog/2111-issue-and-pull-request-templates)
* Templates provide a great way to pre-fill PR comments, but there are no
server-side hooks available to parse and/or easily check the contents of
those templates to ensure that checkboxes were checked or forms were filled
in.
1. Use of labels enforced by mungers/bots
* We already make great use of mungers/bots to manage labels on PRs and it
fits very nicely in the existing workflow
## Collection Design
The munger/bot option fits most cleanly into the existing workflow.
All `release-note-*` labeling is managed on the master branch PR only.
No `release-note-*` labels are needed on cherry-pick PRs and no information
will be collected from that cherry-pick PR.
The only exception to this rule is when a PR is not a cherry-pick and is
targeted directly to the non-master branch. In this case, a `release-note-*`
label is required for that non-master PR.
1. New labels added to github: `release-note-none`, maybe others for new release note categories - see Layout section below
1. A [new munger](https://github.com/kubernetes/kubernetes/issues/23409) that will:
* Add a `release-note-label-needed` label to all new master branch PRs
* Block merge by the submit queue on all PRs labeled as `release-note-label-needed`
* Auto-remove `release-note-label-needed` when one of the `release-note-*` labels is added
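For illustration only, a hedged Go sketch of that labeling rule follows; the types and helper are hypothetical, since the real munger works against the GitHub API rather than these structs.

```go
package munger

import "strings"

// pullRequest is a hypothetical, simplified view of a PR for this sketch.
type pullRequest struct {
	TargetBranch string
	Labels       map[string]bool
}

const needsLabel = "release-note-label-needed"

// reconcile applies the collection rules above: master-branch PRs must carry
// some release-note-* label before the submit queue may merge them.
func reconcile(pr *pullRequest) (mergeBlocked bool) {
	if pr.TargetBranch != "master" {
		// Cherry-pick PRs are exempt; direct-to-branch PRs would need the
		// same treatment, which is omitted here for brevity.
		return false
	}
	hasReleaseNoteLabel := false
	for l := range pr.Labels {
		if l != needsLabel && strings.HasPrefix(l, "release-note") {
			hasReleaseNoteLabel = true
		}
	}
	if hasReleaseNoteLabel {
		delete(pr.Labels, needsLabel) // auto-remove once a release-note-* label exists
		return false
	}
	pr.Labels[needsLabel] = true // added to all new master branch PRs
	return true                  // block merge by the submit queue
}
```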
## Publishing Design
### Location
With v1.2.0, the release notes were moved from their previous [github releases](https://github.com/kubernetes/kubernetes/releases)
location to [CHANGELOG.md](../../CHANGELOG.md). Going forward this seems like a good plan.
Other projects do similarly.
The kubernetes.tar.gz download link is also displayed along with the release notes
in [CHANGELOG.md](../../CHANGELOG.md).
Is there any reason to continue publishing anything to github releases if
the complete release story is published in [CHANGELOG.md](../../CHANGELOG.md)?
### Layout
Different types of releases will generally have different requirements in
terms of layout. As expected, major releases like v1.2.0 are going
to require much more detail than the automated release notes will provide.
The idea is that these mechanisms will provide 100% of the release note
content for alpha, beta and most minor releases and bootstrap the content
with a release note 'template' for the authors of major releases like v1.2.0.
The authors can then collaborate and edit the higher level sections of the
release notes in a PR, updating [CHANGELOG.md](../../CHANGELOG.md) as needed.
v1.2.0 demonstrated the need, at least for major releases like v1.2.0, for
several sections in the published release notes.
In order to provide a basic layout for release notes in the future,
new releases can bootstrap [CHANGELOG.md](../../CHANGELOG.md) with the following template types:
#### Alpha/Beta/Patch Releases
These are automatically generated from `release-note*` labels, but can be modified as needed.
```
Action Required
* PR titles from the release-note-action-required label
Other notable changes
* PR titles from the release-note label
```
#### Major/Minor Releases
```
Major Themes
* Add to or delete this section
Other notable improvements
* Add to or delete this section
Experimental Features
* Add to or delete this section
Action Required
* PR titles from the release-note-action-required label
Known Issues
* Add to or delete this section
Provider-specific Notes
* Add to or delete this section
Other notable changes
* PR titles from the release-note label
```
## Work estimates
* The [new munger](https://github.com/kubernetes/kubernetes/issues/23409)
* Owner: @eparis
* Time estimate: Mostly done
* Updates to the tool that collects, organizes, publishes and sends release
notifications.
* Owner: @david-mcmahon
* Time estimate: A few days
## Caveats / Considerations
* As part of the planning and development workflow, how can we capture
release notes for bigger features?
[#23070](https://github.com/kubernetes/kubernetes/issues/23070)
* For now contributors should simply use the first PR that enables a new
feature by default. We'll revisit if this does not work well.

View File

@ -1,123 +1 @@
# Rescheduler design space This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/rescheduler.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/rescheduler.md)
@davidopp, @erictune, @briangrant
July 2015
## Introduction and definition
A rescheduler is an agent that proactively causes currently-running
Pods to be moved, so as to optimize some objective function for
goodness of the layout of Pods in the cluster. (The objective function
doesn't have to be expressed mathematically; it may just be a
collection of ad-hoc rules, but in principle there is an objective
function. Implicitly an objective function is described by the
scheduler's predicate and priority functions.) It might be triggered
to run every N minutes, or whenever some event happens that is known
to make the objective function worse (for example, whenever any Pod goes
PENDING for a long time.)
## Motivation and use cases
A rescheduler is useful because without a rescheduler, scheduling
decisions are only made at the time Pods are created. But later on,
the state of the cluster may have changed in some way such that it would
be better to move the Pod to another node.
There are two categories of movements a rescheduler might trigger: coalescing
and spreading.
### Coalesce Pods
This is the most common use case. Cluster layout changes over time. For
example, run-to-completion Pods terminate, producing free space in their wake, but that space
is fragmented. This fragmentation might prevent a PENDING Pod from scheduling
(there are enough free resources for the Pod in aggregate across the cluster,
but not on any single node). A rescheduler can coalesce free space like a
disk defragmenter, thereby producing enough free space on a node for a PENDING
Pod to schedule. In some cases it can do this just by moving Pods into existing
holes, but often it will need to evict (and reschedule) running Pods in order to
create a large enough hole.
A second use case for a rescheduler to coalesce pods is when it becomes possible
to support the running Pods on a fewer number of nodes. The rescheduler can
gradually move Pods off of some set of nodes to make those nodes empty so
that they can then be shut down/removed. More specifically,
the system could do a simulation to see whether, after removing a node from the
cluster, the Pods that were on that node would be able to reschedule,
either directly or with the help of the rescheduler; if the answer is
yes, then you can safely auto-scale down (assuming services will still
meet their application-level SLOs).
### Spread Pods
The main use cases for spreading Pods revolve around relieving congestion on (a) highly
utilized node(s). For example, some process might suddenly start receiving a significantly
above-normal amount of external requests, leading to starvation of best-effort
Pods on the node. We can use the rescheduler to move the best-effort Pods off of the
node. (They are likely to have generous eviction SLOs, so are more likely to be movable
than the Pod that is experiencing the higher load, but in principle we might move either.)
Or even before any node becomes overloaded, we might proactively re-spread Pods from nodes
with high-utilization, to give them some buffer against future utilization spikes. In either
case, the nodes we move the Pods onto might have been in the system for a long time or might
have been added by the cluster auto-scaler specifically to allow the rescheduler to
rebalance utilization.
A second spreading use case is to separate antagonists.
Sometimes the processes running in two different Pods on the same node
may have unexpected antagonistic
behavior towards one another. A system component might monitor for such
antagonism and ask the rescheduler to move one of the antagonists to a new node.
### Ranking the use cases
The vast majority of users probably only care about rescheduling for three scenarios:
1. Move Pods around to get a PENDING Pod to schedule
1. Redistribute Pods onto new nodes added by a cluster auto-scaler when there are no PENDING Pods
1. Move Pods around when CPU starvation is detected on a node
## Design considerations and design space
Because rescheduling is disruptive--it causes one or more
already-running Pods to die when they otherwise wouldn't--a key
constraint on rescheduling is that it must be done subject to
disruption SLOs. There are a number of ways to specify these SLOs--a
global rate limit across all Pods, a rate limit across a set of Pods
defined by some particular label selector, a maximum number of Pods
that can be down at any one time among a set defined by some
particular label selector, etc. These policies are presumably part of
the Rescheduler's configuration.
There are a lot of design possibilities for a rescheduler. To explain
them, it's easiest to start with the description of a baseline
rescheduler, and then describe possible modifications. The Baseline
rescheduler
* only kicks in when there are one or more PENDING Pods for some period of time; its objective function is binary: completely happy if there are no PENDING Pods, and completely unhappy if there are PENDING Pods; it does not try to optimize for any other aspect of cluster layout
* is not a scheduler -- it simply identifies a node where a PENDING Pod could fit if one or more Pods on that node were moved out of the way, and then kills those Pods to make room for the PENDING Pod, which will then be scheduled there by the regular scheduler(s). [obviously this killing operation must be able to specify "don't allow the killed Pod to reschedule back to whence it was killed" otherwise the killing is pointless] Of course it should only do this if it is sure the killed Pods will be able to reschedule into already-free space in the cluster. Note that although it is not a scheduler, the Rescheduler needs to be linked with the predicate functions of the scheduling algorithm(s) so that it can know (1) that the PENDING Pod would actually schedule into the hole it has identified once the hole is created, and (2) that the evicted Pod(s) will be able to schedule somewhere else in the cluster.
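To make the Baseline concrete, a rough Go sketch of its core step follows; `Pod`, `Node`, and the `Predicates` hooks are placeholders invented for this sketch, standing in for the schedulers' predicate functions the Baseline must be linked with.

```go
package rescheduler

// Simplified placeholder types for this sketch.
type Pod struct{ Name string }
type Node struct {
	Name string
	Pods []Pod
}

// Predicates stands in for the schedulers' predicate functions.
type Predicates interface {
	Fits(pending Pod, node Node, evicted []Pod) bool
	CanRescheduleElsewhere(p Pod, nodes []Node) bool
}

// makeRoom finds a node where evicting some Pods would let the pending Pod
// fit, and returns the victims. It never binds the pending Pod itself; the
// regular scheduler does that once the hole exists.
func makeRoom(pending Pod, nodes []Node, pred Predicates) (*Node, []Pod) {
	for i := range nodes {
		node := &nodes[i]
		for k := 1; k <= len(node.Pods); k++ {
			victims := node.Pods[:k] // naive choice; a real implementation would search
			safe := true
			for _, v := range victims {
				if !pred.CanRescheduleElsewhere(v, nodes) {
					safe = false
					break
				}
			}
			if safe && pred.Fits(pending, *node, victims) {
				return node, victims
			}
		}
	}
	return nil, nil
}
```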
Possible variations on this Baseline rescheduler are
1. it can kill the Pod(s) whose space it wants **and also schedule the Pod that will take that space and reschedule the Pod(s) that were killed**, rather than just killing the Pod(s) whose space it wants and relying on the regular scheduler(s) to schedule the Pod that will take that space (and to reschedule the Pod(s) that were evicted)
1. it can run continuously in the background to optimize general cluster layout instead of just trying to get a PENDING Pod to schedule
1. it can try to move groups of Pods instead of using a one-at-a-time / greedy approach
1. it can formulate multi-hop plans instead of single-hop
A key design question for a Rescheduler is how much knowledge it needs about the scheduling policies used by the cluster's scheduler(s).
* For the Baseline rescheduler, it needs to know the predicate functions used by the cluster's scheduler(s) else it can't know how to create a hole that the PENDING Pod will fit into, nor be sure that the evicted Pod(s) will be able to reschedule elsewhere.
* If it is going to run continuously in the background to optimize cluster layout but is still only going to kill Pods, then it still needs to know the predicate functions for the reason mentioned above. In principle it doesn't need to know the priority functions; it could just randomly kill Pods and rely on the regular scheduler to put them back in better places. However, this is a rather inexact approach. Thus it is useful for the rescheduler to know the priority functions, or at least some subset of them, so it can be sure that an action it takes will actually improve the cluster layout.
* If it is going to run continuously in the background to optimize cluster layout and is going to act as a scheduler rather than just killing Pods, then it needs to know the predicate functions and some compatible (but not necessarily identical) priority functions. One example of a case where "compatible but not identical" might be useful is if the main scheduler(s) has a very simple scheduling policy optimized for low scheduling latency, and the Rescheduler has a more sophisticated/optimal scheduling policy that requires more computation time. The main thing to avoid is for the scheduler(s) and rescheduler to have incompatible priority functions, as this will cause them to "fight" (though it still can't lead to an infinite loop, since the scheduler(s) only ever touches a Pod once).
## Appendix: Integrating rescheduler with cluster auto-scaler (scale up)
For scaling up the cluster, a reasonable workflow might be:
1. pod horizontal auto-scaler decides to add one or more Pods to a service, based on the metrics it is observing
1. the Pod goes PENDING due to lack of a suitable node with sufficient resources
1. rescheduler notices the PENDING Pod and determines that the Pod cannot schedule just by rearranging existing Pods (while respecting SLOs)
1. rescheduler triggers cluster auto-scaler to add a node of the appropriate type for the PENDING Pod
1. the PENDING Pod schedules onto the new node (and possibly the rescheduler also moves other Pods onto that node)

View File

@ -1,88 +1 @@
# Rescheduler: guaranteed scheduling of critical addons This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/rescheduling-for-critical-pods.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/rescheduling-for-critical-pods.md)
## Motivation
In addition to Kubernetes core components like the api-server, scheduler, and controller-manager running on a master machine,
there are a number of addons which, for various reasons, have to run on a regular cluster node rather than the master.
Some of them are critical to having a fully functional cluster: Heapster, DNS, UI. Users can break their cluster
by evicting a critical addon (either manually or as a side effect of another operation like an upgrade),
which can then become pending (for example, when the cluster is highly utilized).
To avoid such a situation we want to have a mechanism which guarantees that
critical addons are scheduled, assuming the cluster is big enough.
This may affect other pods (including production users' applications).
## Design
Rescheduler will ensure that critical addons are always scheduled.
In the first version it will implement only this policy, but later we may want to introduce other policies.
It will be a standalone component running on the master machine, similar to the scheduler.
Those components will share common logic (initially the rescheduler will in fact import some of the scheduler's packages).
### Guaranteed scheduling of critical addons
Rescheduler will observe critical addons
(with annotation `scheduler.alpha.kubernetes.io/critical-pod`).
If one of them is marked by the scheduler as unschedulable (pod condition `PodScheduled` set to `false`, with reason `Unschedulable`),
the component will try to make room for the addon by evicting some pods; the scheduler will then schedule the addon.
#### Scoring nodes
Initially we want to choose a random node with enough capacity
(chosen as described in [Evicting pods](rescheduling-for-critical-pods.md#evicting-pods)) to schedule given addons.
Later we may want to introduce some heuristic:
* minimize number of evicted pods with violation of disruption budget or shortened termination grace period
* minimize number of affected pods by choosing a node on which we have to evict less pods
* increase probability of scheduling of evicted pods by preferring a set of pods with the smallest total sum of requests
* avoid nodes which are non-drainable (according to drain logic), for example on which there is a pod which doesn't belong to any RC/RS/Deployment
#### Evicting pods
There are 2 mechanisms which can delay a pod eviction: Disruption Budget and Termination Grace Period.
While removing a pod we will try to avoid violating its Disruption Budget, though we can't guarantee it,
since there is a chance that doing so would block the operation for a long period of time.
We will also try to respect the Termination Grace Period, though without any guarantee.
In case we have to remove a pod with a termination grace period longer than 10s, it will be shortened to 10s.
The proposed order while choosing a node to schedule a critical addon and pods to remove:
1. a node where the critical addon pod can fit after evicting only pods satisfying both
(1) their disruption budget will not be violated by such eviction and (2) they have grace period <= 10 seconds
1. a node where the critical addon pod can fit after evicting only pods whose disruption budget will not be violated by such eviction
1. any node where the critical addon pod can fit after evicting some pods
### Interaction with Scheduler
To avoid a situation where the Scheduler schedules another pod into the space prepared for the critical addon,
the chosen node has to be temporarily excluded from the list of nodes the Scheduler considers while making decisions.
For this purpose the node will get a temporary
[Taint](../../docs/design/taint-toleration-dedicated.md) “CriticalAddonsOnly”,
and each critical addon has to have a toleration defined for this taint.
Once the Rescheduler has no more work to do (all critical addons are scheduled, or the cluster is too small for them),
all such taints will be removed.
### Interaction with Cluster Autoscaler
The Rescheduler can partly duplicate the responsibility of the Cluster Autoscaler:
both components take action when there is an unschedulable pod.
This may lead to a situation where CA adds an extra node for a pending critical addon
while the Rescheduler evicts some running pods to make space for the addon.
This situation would be rare, and usually the extra node would be needed for the evicted pods anyway.
In the worst case CA will add and then remove the node.
To avoid complicating the architecture by introducing interaction between those 2 components, we accept this overlap.
We want to ensure that CA won't remove nodes with critical addons by adding appropriate logic there.
### Rescheduler control loop
The rescheduler control loop will be as follows:
* while there is an unschedulable critical addon, do the following:
* choose a node on which the addon should be scheduled (as described in Evicting pods)
* add a taint to the node to prevent the scheduler from using it
* delete the pods which block the addon from being scheduled
* wait until the scheduler schedules the critical addon
* if there are no more critical addons we can help, ensure there is no node with the taint
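A hedged Go sketch of that loop follows; all helper functions and types here are placeholders, since the real component talks to the API server and uses the taint/toleration machinery described above.

```go
package rescheduler

// Placeholder types and helpers; the real implementation uses the Kubernetes API.
type criticalAddon struct{ Name string }
type node struct{ Name string }

func unschedulableCriticalAddons() []criticalAddon { return nil }
func chooseNode(a criticalAddon) (node, bool)      { return node{}, false }
func addTaint(n node, taint string)                {}
func removeTaintFromAllNodes(taint string)         {}
func evictBlockingPods(n node, a criticalAddon)    {}
func waitUntilScheduled(a criticalAddon)           {}

const criticalAddonsOnlyTaint = "CriticalAddonsOnly"

// runOnce performs one pass of the control loop described above.
func runOnce() {
	for _, addon := range unschedulableCriticalAddons() {
		n, ok := chooseNode(addon)
		if !ok {
			continue // cluster is too small for this addon; nothing we can do
		}
		addTaint(n, criticalAddonsOnlyTaint) // keep the scheduler away from the freed space
		evictBlockingPods(n, addon)          // delete the pods blocking the addon
		waitUntilScheduled(addon)            // the regular scheduler places the addon
	}
	// No more critical addons we can help: make sure no node keeps the taint.
	removeTaintFromAllNodes(criticalAddonsOnlyTaint)
}
```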

View File

@ -1,493 +1 @@
# Controlled Rescheduling in Kubernetes This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/rescheduling.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/rescheduling.md)
## Overview
Although the Kubernetes scheduler(s) try to make good placement decisions for pods,
conditions in the cluster change over time (e.g. jobs finish and new pods arrive, nodes
are removed due to failures or planned maintenance or auto-scaling down, nodes appear due
to recovery after a failure or re-joining after maintenance or auto-scaling up or adding
new hardware to a bare-metal cluster), and schedulers are not omniscient (e.g. there are
some interactions between pods, or between pods and nodes, that they cannot predict). As
a result, the initial node selected for a pod may turn out to be a bad match, from the
perspective of the pod and/or the cluster as a whole, at some point after the pod has
started running.
Today (Kubernetes version 1.2) once a pod is scheduled to a node, it never moves unless
it terminates on its own, is deleted by the user, or experiences some unplanned event
(e.g. the node where it is running dies). Thus in a cluster with long-running pods, the
assignment of pods to nodes degrades over time, no matter how good an initial scheduling
decision the scheduler makes. This observation motivates "controlled rescheduling," a
mechanism by which Kubernetes will "move" already-running pods over time to improve their
placement. Controlled rescheduling is the subject of this proposal.
Note that the term "move" is not technically accurate -- the mechanism used is that
Kubernetes will terminate a pod that is managed by a controller, and the controller will
create a replacement pod that is then scheduled by the pod's scheduler. The terminated
pod and replacement pod are completely separate pods, and no pod migration is
implied. However, describing the process as "moving" the pod is approximately accurate
and easier to understand, so we will use this terminology in the document.
We use the term "rescheduling" to describe any action the system takes to move an
already-running pod. The decision may be made and executed by any component; we will
introduce the concept of a "rescheduler" component later, but it is not the only
component that can do rescheduling.
This proposal primarily focuses on the architecture and features/mechanisms used to
achieve rescheduling, and only briefly discusses example policies. We expect that community
experimentation will lead to a significantly better understanding of the range, potential,
and limitations of rescheduling policies.
## Example use cases
Example use cases for rescheduling are
* moving a running pod onto a node that better satisfies its scheduling criteria
* moving a pod onto an under-utilized node
* moving a pod onto a node that meets more of the pod's affinity/anti-affinity preferences
* moving a running pod off of a node in anticipation of a known or speculated future event
* draining a node in preparation for maintenance, decommissioning, auto-scale-down, etc.
* "preempting" a running pod to make room for a pending pod to schedule
* proactively/speculatively make room for large and/or exclusive pods to facilitate
fast scheduling in the future (often called "defragmentation")
* (note that these last two cases are the only use cases where the first-order intent
is to move a pod specifically for the benefit of another pod)
* moving a running pod off of a node from which it is receiving poor service
* anomalous crashlooping or other mysterious incompatibility between the pod and the node
* repeated out-of-resource killing (see #18724)
* repeated attempts by the scheduler to schedule the pod onto some node, but it is
rejected by Kubelet admission control due to incomplete scheduler knowledge
* poor performance due to interference from other containers on the node (CPU hogs,
cache thrashers, etc.) (note that in this case there is a choice of moving the victim
or the aggressor)
## Some axes of the design space
Among the key design decisions are
* how does a pod specify its tolerance for these system-generated disruptions, and how
does the system enforce such disruption limits
* for each use case, where is the decision made about when and which pods to reschedule
(controllers, schedulers, an entirely new component e.g. "rescheduler", etc.)
* rescheduler design issues: how much does a rescheduler need to know about pods'
schedulers' policies, how does the rescheduler specify its rescheduling
requests/decisions (e.g. just as an eviction, an eviction with a hint about where to
reschedule, or as an eviction paired with a specific binding), how does the system
implement these requests, does the rescheduler take into account the second-order
effects of decisions (e.g. whether an evicted pod will reschedule, will cause
a preemption when it reschedules, etc.), does the rescheduler execute multi-step plans
(e.g. evict two pods at the same time with the intent of moving one into the space
vacated by the other, or even more complex plans)
Additional musings on the rescheduling design space can be found [here](rescheduler.md).
## Design proposal
The key mechanisms and components of the proposed design are priority, preemption,
disruption budgets, the `/evict` subresource, and the rescheduler.
### Priority
#### Motivation
Just as it is useful to overcommit nodes to increase node-level utilization, it is useful
to overcommit clusters to increase cluster-level utilization. Scheduling priority (which
we abbreviate as *priority*, in combination with disruption budgets (described in the
next section), allows Kubernetes to safely overcommit clusters much as QoS levels allow
it to safely overcommit nodes.
Today, cluster sharing among users, workload types, etc. is regulated via the
[quota](../admin/resourcequota/README.md) mechanism. When allocating quota, a cluster
administrator has two choices: (1) the sum of the quotas is less than or equal to the
capacity of the cluster, or (2) the sum of the quotas is greater than the capacity of the
cluster (that is, the cluster is overcommitted). (1) is likely to lead to cluster
under-utilization, while (2) is unsafe in the sense that someone's pods may go pending
indefinitely even though they are still within their quota. Priority makes cluster
overcommitment (i.e. case (2)) safe by allowing users and/or administrators to identify
which pods should be allowed to run, and which should go pending, when demand for cluster
resources exceeds supply due to cluster overcommitment.
Priority is also useful in some special-case scenarios, such as ensuring that system
DaemonSets can always schedule and reschedule onto every node where they want to run
(assuming they are given the highest priority), e.g. see #21767.
#### Specifying priorities
We propose to add a required `Priority` field to `PodSpec`. Its value type is string, and
the cluster administrator defines a total ordering on these strings (for example
`Critical`, `Normal`, `Preemptible`). We choose string instead of integer so that it is
easy for an administrator to add new priority levels in between existing levels, to
encourage thinking about priority in terms of user intent and avoid magic numbers, and to
make the internal implementation more flexible.
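For illustration, a hedged sketch of how an administrator-defined total ordering over priority names could be represented follows; the names are just the examples from the text, and the map-based representation is an assumption of the sketch.

```go
package priority

// rank maps each administrator-defined priority name to its position in the
// total ordering; a higher rank means a more important priority.
var rank = map[string]int{
	"Preemptible": 0,
	"Normal":      1,
	"Critical":    2,
}

// atLeast reports whether priority a is the same as or higher than priority b.
func atLeast(a, b string) bool {
	return rank[a] >= rank[b]
}
```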
When a scheduler is scheduling a new pod P and cannot find any node that meets all of P's
scheduling predicates, it is allowed to evict ("preempt") one or more pods that are at
the same or lower priority than P (subject to disruption budgets, see next section) from
a node in order to make room for P, i.e. in order to make the scheduling predicates
satisfied for P on that node. (Note that when we add cluster-level resources (#19080),
it might be necessary to preempt from multiple nodes, but that scenario is outside the
scope of this document.) The preempted pod(s) may or may not be able to reschedule. The
net effect of this process is that when demand for cluster resources exceeds supply, the
higher-priority pods will be able to run while the lower-priority pods will be forced to
wait. The detailed mechanics of preemption are described in a later section.
In addition to taking disruption budget into account, for equal-priority preemptions the
scheduler will try to enforce fairness (across victim controllers, services, etc.)
Priorities could be specified directly by users in the podTemplate, or assigned by an
admission controller using
properties of the pod. Either way, all schedulers must be configured to understand the
same priorities (names and ordering). This could be done by making them constants in the
API, or using ConfigMap to configure the schedulers with the information. The advantage of
the former (at least making the names, if not the ordering, constants in the API) is that
it allows the API server to do validation (e.g. to catch mis-spelling).
In the future, which priorities are usable for a given namespace and pods with certain
attributes may be configurable, similar to ResourceQuota, LimitRange, or security policy.
Priority and resource QoS are independent.
The priority we have described here might be used to prioritize the scheduling queue
(i.e. the order in which a scheduler examines pods in its scheduling loop), but the two
priority concepts do not have to be connected. It is somewhat logical to tie them
together, since a higher priority generally indicates that a pod is more urgent to get
running. Also, scheduling low-priority pods before high-priority pods might lead to
avoidable preemptions if the high-priority pods end up preempting the low-priority pods
that were just scheduled.
TODO: Priority and preemption are global or namespace-relative? See
[this discussion thread](https://github.com/kubernetes/kubernetes/pull/22217#discussion_r55737389).
#### Relationship of priority to quota
Of course, if the decision of what priority to give a pod is solely up to the user, then
users have no incentive to ever request any priority less than the maximum. Thus
priority is intimately related to quota, in the sense that resource quotas must be
allocated on a per-priority-level basis (X amount of RAM at priority A, Y amount of RAM
at priority B, etc.). The "guarantee" that highest-priority pods will always be able to
schedule can only be achieved if the sum of the quotas at the top priority level is less
than or equal to the cluster capacity. This is analogous to QoS, where safety can only be
achieved if the sum of the limits of the top QoS level ("Guaranteed") is less than or
equal to the node capacity. In terms of incentives, an organization could "charge"
an amount proportional to the priority of the resources.
The topic of how to allocate quota at different priority levels to achieve a desired
balance between utilization and probability of schedulability is an extremely complex
topic that is outside the scope of this document. For example, resource fragmentation and
RequiredDuringScheduling node and pod affinity and anti-affinity means that even if the
sum of the quotas at the top priority level is less than or equal to the total aggregate
capacity of the cluster, some pods at the top priority level might still go pending. In
general, priority provides a *probabilistic* guarantee of pod schedulability in the face
of overcommitment, by allowing prioritization of which pods should be allowed to run
when demand for cluster resources exceeds supply.
### Disruption budget
While priority can protect pods from one source of disruption (preemption by a
lower-priority pod), *disruption budgets* limit disruptions from all Kubernetes-initiated
causes, including preemption by an equal or higher-priority pod, or being evicted to
achieve other rescheduling goals. In particular, each pod is optionally associated with a
"disruption budget," a new API resource that limits Kubernetes-initiated terminations
across a set of pods (e.g. the pods of a particular Service might all point to the same
disruption budget object), regardless of cause. Initially we expect disruption budget
(e.g. `DisruptionBudgetSpec`) to consist of
* a rate limit on disruptions (preemption and other evictions) across the corresponding
set of pods, e.g. no more than one disruption per hour across the pods of a particular Service
* a minimum number of pods that must be up simultaneously (sometimes called "shard
strength") (of course this can also be expressed as the inverse, i.e. the number of
pods of the collection that can be down simultaneously)
The second item merits a bit more explanation. One use case is to specify a quorum size,
e.g. to ensure that at least 3 replicas of a quorum-based service with 5 replicas are up
at the same time. In practice, a service should ideally create enough replicas to survive
at least one planned and one unplanned outage. So in our quorum example, we would specify
that at least 4 replicas must be up at the same time; this allows for one intentional
disruption (bringing the number of live replicas down from 5 to 4 and consuming one unit
of shard strength budget) and one unplanned disruption (bringing the number of live
replicas down from 4 to 3) while still maintaining a quorum. Shard strength is also
useful for simpler replicated services; for example, you might not want more than 10% of
your front-ends to be down at the same time, so as to avoid overloading the remaining
replicas.
Initially, disruption budgets will be specified by the user. Thus as with priority,
disruption budgets need to be tied into quota, to prevent users from saying none of their
pods can ever be disrupted. The exact way of expressing and enforcing this quota is TBD,
though a simple starting point would be to have an admission controller assign a default
disruption budget based on priority level (more liberal with decreasing priority).
We also likely need a quota that applies to Kubernetes *components*, to limit the rate
at which any one component is allowed to consume disruption budget.
Of course there should also be a `DisruptionBudgetStatus` that indicates the current
disruption rate that the collection of pods is experiencing, and the number of pods that
are up.
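A hedged Go sketch of the shape such an object might take under this proposal follows; the field names are illustrative only and are not the eventual API.

```go
package disruption

import "time"

// DisruptionBudgetSpec is an illustrative shape for the budget described above.
type DisruptionBudgetSpec struct {
	// MaxDisruptionsPerWindow and Window express a rate limit on
	// Kubernetes-initiated disruptions across the covered set of pods.
	MaxDisruptionsPerWindow int
	Window                  time.Duration

	// MinUp is the "shard strength": the minimum number of covered pods
	// that must be up simultaneously.
	MinUp int
}

// DisruptionBudgetStatus tracks observed state so that the API server or a
// controller can decide whether another disruption is currently allowed.
type DisruptionBudgetStatus struct {
	DisruptionsInWindow int
	CurrentlyUp         int
}

// allows reports whether one more disruption would stay within the budget.
func allows(spec DisruptionBudgetSpec, status DisruptionBudgetStatus) bool {
	return status.DisruptionsInWindow < spec.MaxDisruptionsPerWindow &&
		status.CurrentlyUp-1 >= spec.MinUp
}
```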
For the purposes of disruption budget, a pod is considered to be disrupted as soon as its
graceful termination period starts.
A pod that is not covered by a disruption budget but is managed by a controller
gets an implicitly infinite disruption budget (though the system should try not to
unduly victimize such pods). How a pod that is not managed by a controller is
handled is TBD.
TBD: In addition to `PodSpec`, where do we store pointer to disruption budget
(podTemplate in controller that managed the pod?)? Do we auto-generate a disruption
budget (e.g. when instantiating a Service), or require the user to create it manually
before they create a controller? Which objects should return the disruption budget object
as part of the output on `kubectl get` other than (obviously) `kubectl get` for the
disruption budget itself?
TODO: Clean up distinction between "down due to voluntary action taken by Kubernetes"
and "down due to unplanned outage" in spec and status.
For now, there is nothing to prevent clients from circumventing the disruption budget
protections. Of course, clients that do this are not being "good citizens." In the next
section we describe a mechanism that at least makes it easy for well-behaved clients to
obey the disruption budgets.
See #12611 for additional discussion of disruption budgets.
### /evict subresource and PreferAvoidPods
Although we could put the responsibility for checking and updating disruption budgets
solely on the client, it is safer and more convenient if we implement that functionality
in the API server. Thus we will introduce a new `/evict` subresource on pod. It is similar to
today's "delete" on pod except
* It will be rejected if the deletion would violate disruption budget. (See how
Deployment handles failure of /rollback for ideas on how clients could handle failure
of `/evict`.) There are two possible ways to implement this:
* For the initial implementation, this will be accomplished by the API server just
looking at the `DisruptionBudgetStatus` and seeing if the disruption would violate the
`DisruptionBudgetSpec`. In this approach, we assume a disruption budget controller
keeps the `DisruptionBudgetStatus` up-to-date by observing all pod deletions and
creations in the cluster, so that an approved disruption is quickly reflected in the
`DisruptionBudgetStatus`. Of course this approach does allow a race in which one or
more additional disruptions could be approved before the first one is reflected in the
`DisruptionBudgetStatus`.
* Thus a subsequent implementation will have the API server explicitly debit the
`DisruptionBudgetStatus` when it accepts an `/evict`. (There still needs to be a
controller, to keep the shard strength status up-to-date when replacement pods are
created after an eviction; the controller may also be necessary for the rate status
depending on how rate is represented, e.g. adding tokens to a bucket at a fixed rate.)
Once etcd supports multi-object transactions (etcd v3), the debit and pod deletion will
be placed in the same transaction.
* Note: For the purposes of disruption budget, a pod is considered to be disrupted as soon as its
graceful termination period starts (so when we say "delete" here we do not mean
"deleted from etcd" but rather "graceful termination period has started").
* It will allow clients to communicate additional parameters when they wish to delete a
pod. (In the absence of the `/evict` subresource, we would have to create a pod-specific
type analogous to `api.DeleteOptions`.)
We will make `kubectl delete pod` use `/evict` by default, and require a command-line
flag to delete the pod directly.
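An illustrative sketch of the server-side handling of the subresource follows; the option names and the plain-function shape are assumptions of the sketch, not the actual API server registry code.

```go
package evict

import "errors"

// EvictOptions carries the extra parameters the /evict subresource lets a
// client communicate; the field names here are hypothetical.
type EvictOptions struct {
	// AddToPreferAvoidPods asks that the evicted pod's signature be appended
	// to the node's PreferAvoidPods list (described below).
	AddToPreferAvoidPods bool
	GracePeriodSeconds   *int64
}

var errBudgetExhausted = errors.New("eviction would violate the pod's disruption budget")

// handleEvict sketches the initial implementation: consult the budget before
// starting graceful deletion of the pod.
func handleEvict(budgetAllowsOneMore bool, opts EvictOptions, startGracefulDelete func() error) error {
	if !budgetAllowsOneMore {
		return errBudgetExhausted // the client may retry later or handle the failure like a /rollback failure
	}
	if opts.AddToPreferAvoidPods {
		// Record the pod's signature on the node's PreferAvoidPods list
		// (omitted in this sketch).
	}
	return startGracefulDelete()
}
```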
We will add to `NodeStatus` a bounded-sized list of signatures of pods that should avoid
that node (provisionally called `PreferAvoidPods`). One of the pieces of information
specified in the `/evict` subresource is whether the eviction should add the evicted
pod's signature to the corresponding node's `PreferAvoidPods`. Initially the pod
signature will be a
[controllerRef](https://github.com/kubernetes/kubernetes/issues/14961#issuecomment-183431648),
i.e. a reference to the pod's controller. Controllers are responsible for garbage
collecting, after some period of time, `PreferAvoidPods` entries that point to them, but the API
server will also enforce a bounded size on the list. All schedulers will have a
highest-weighted priority function that gives a node the worst priority if the pod it is
scheduling appears in that node's `PreferAvoidPods` list. Thus appearing in
`PreferAvoidPods` is similar to
[RequiredDuringScheduling node anti-affinity](../../docs/user-guide/node-selection/README.md)
but it takes precedence over all other priority criteria and is not explicitly listed in
the `NodeAffinity` of the pod.
`PreferAvoidPods` is useful for the "moving a running pod off of a node from which it is
receiving poor service" use case, as it reduces the chance that the replacement pod will
end up on the same node (keep in mind that most of those cases are situations that the
scheduler does not have explicit priority functions for, for example it cannot know in
advance that a pod will be starved). Also, though we do not intend to implement any such
policies in the first version of the rescheduler, it is useful whenever the rescheduler evicts
two pods A and B with the intention of moving A into the space vacated by B (it prevents
B from rescheduling back into the space it vacated before A's scheduler has a chance to
reschedule A there). Note that these two uses are subtly different; in the first
case we want the avoidance to last a relatively long time, whereas in the second case we
may only need it to last until A schedules.
See #20699 for more discussion.
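A hedged sketch of the scheduler-side priority function follows, with simplified stand-in types; the real scheduler uses its own plugin framework and the controllerRef signature described above.

```go
package priorities

// podSignature is a simplified stand-in for the pod signature (initially a controllerRef).
type podSignature string

type nodeInfo struct {
	Name            string
	PreferAvoidPods []podSignature
}

// avoidScore is the highest-weighted priority function described above: it
// gives a node the worst possible score when the pod being scheduled appears
// in that node's PreferAvoidPods list, and a neutral score otherwise.
func avoidScore(pod podSignature, node nodeInfo, worst, neutral int) int {
	for _, sig := range node.PreferAvoidPods {
		if sig == pod {
			return worst
		}
	}
	return neutral
}
```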
### Preemption mechanics
**NOTE: We expect a fuller design doc to be written on preemption before it is implemented.
However, a sketch of some ideas are presented here, since preemption is closely related to the
concepts discussed in this doc.**
Pod schedulers will decide and enact preemptions, subject to the priority and disruption
budget rules described earlier. (Though note that we currently do not have any mechanism
to prevent schedulers from bypassing either the priority or disruption budget rules.)
The scheduler does not concern itself with whether the evicted pod(s) can reschedule. The
eviction(s) use(s) the `/evict` subresource so that it is subject to the disruption
budget(s) of the victim(s), but it does not request to add the victim pod(s) to the
nodes' `PreferAvoidPods`.
Evicting victim(s) and binding the pending pod that the evictions are intended to enable
to schedule, are not transactional. We expect the scheduler to issue the operations in
sequence, but it is still possible that another scheduler could schedule its pod in
between the eviction(s) and the binding, or that the set of pods running on the node in
question changed between the time the scheduler made its decision and the time it sent
the operations to the API server thereby causing the eviction(s) to be not sufficient to get the
pending pod to schedule. In general there are a number of race conditions that cannot be
avoided without (1) making the evictions and binding be part of a single transaction, and
(2) making the binding preconditioned on a version number that is associated with the
node and is incremented on every binding. We may or may not implement those mechanisms in
the future.
Given a choice between a node where scheduling a pod requires preemption and one where it
does not, all other things being equal, a scheduler should choose the one where
preemption is not required. (TBD: Also, if the selected node does require preemption, the
scheduler should preempt lower-priority pods before higher-priority pods (e.g. if the
scheduler needs to free up 4 GB of RAM, and the node has two 2 GB low-priority pods and
one 4 GB high-priority pod, all of which have sufficient disruption budget, it should
preempt the two low-priority pods). This is debatable, since all have sufficient
disruption budget. But still better to err on the side of giving better disruption SLO to
higher-priority pods when possible?)
Preemption victims must be given their termination grace period. One possible sequence
of events is
1. The API server binds the preemptor to the node (i.e. sets `nodeName` on the
preempting pod) and sets `deletionTimestamp` on the victims
2. Kubelet sees that `deletionTimestamp` has been set on the victims; they enter their
graceful termination period
3. Kubelet sees the preempting pod. It runs the admission checks on the new pod
assuming all pods that are in their graceful termination period are gone and that
all pods that are in the waiting state (see (4)) are running.
4. If (3) fails, then the new pod is rejected. If (3) passes, then Kubelet holds the
new pod in a waiting state, and does not run it until the pod passes the
admission checks using the set of actually running pods.
Note that there are a lot of details to be figured out here; above is just a very
hand-wavy sketch of one general approach that might work.
See #22212 for additional discussion.
### Node drain
Node drain will be handled by one or more components not described in this document. They
will respect disruption budgets. Initially, we will just make `kubectl drain`
respect disruption budgets. See #17393 for other discussion.
### Rescheduler
All rescheduling other than preemption and node drain will be decided and enacted by a
new component called the *rescheduler*. It runs continuously in the background, looking
for opportunities to move pods to better locations. It acts when the degree of
improvement meets some threshold and is allowed by the pod's disruption budget. The
action is eviction of a pod using the `/evict` subresource, with the pod's signature
enqueued in the node's `PreferAvoidPods`. It does not force the pod to reschedule to any
particular node. Thus it is really an "unscheduler"; only in combination with the evicted
pod's scheduler, which schedules the replacement pod, do we get true "rescheduling." See
the "Example use cases" section earlier for some example use cases.
The rescheduler is a best-effort service that makes no guarantees about how quickly (or
whether) it will resolve a suboptimal pod placement.
The first version of the rescheduler will not take into consideration where or whether an
evicted pod will reschedule. The evicted pod may go pending, consuming one unit of the
corresponding shard-strength disruption budget indefinitely. By using the `/evict`
subresource, the rescheduler ensures that there is sufficient budget for the
evicted pod to go and stay pending. We expect future versions of the rescheduler may be
linked with the "mandatory" predicate functions (currently, the ones that constitute the
Kubelet admission criteria), and will only evict if the rescheduler determines that the
pod can reschedule somewhere according to those criteria. (Note that this still does not
guarantee that the pod actually will be able to reschedule, for at least two reasons: (1)
the state of the cluster may change between the time the rescheduler evaluates it and
when the evicted pod's scheduler tries to schedule the replacement pod, and (2) the
evicted pod's scheduler may have additional predicate functions in addition to the
mandatory ones).
(Note: see [this comment](https://github.com/kubernetes/kubernetes/pull/22217#discussion_r54527968)).
The first version of the rescheduler will only implement two objectives: moving a pod
onto an under-utilized node, and moving a pod onto a node that meets more of the pod's
affinity/anti-affinity preferences than wherever it is currently running. (We assume that
nodes that are intentionally under-utilized, e.g. because they are being drained, are
marked unschedulable, thus the first objective will not cause the rescheduler to "fight"
a system that is draining nodes.) We assume that all schedulers sufficiently weight the
priority functions for affinity/anti-affinity and avoiding very packed nodes,
otherwise evicted pods may not actually move onto a node that is better according to
the criteria that caused it to be evicted. (But note that in all cases it will move to a
node that is better according to the totality of its scheduler's priority functions,
except in the case where the node where it was already running was the only node
where it can run.) As a general rule, the rescheduler should only act when it sees
particularly bad situations, since (1) an eviction for a marginal improvement is likely
not worth the disruption--just because there is sufficient budget for an eviction doesn't
mean an eviction is painless to the application, and (2) rescheduling the pod might not
actually mitigate the identified problem if it is minor enough that other scheduling
factors dominate the decision of where the replacement pod is scheduled.
We assume schedulers' priority functions are at least vaguely aligned with the
rescheduler's policies; otherwise the rescheduler will never accomplish anything useful,
given that it relies on the schedulers to actually reschedule the evicted pods. (Even if
the rescheduler acted as a scheduler, explicitly rebinding evicted pods, we'd still want
this to be true, to prevent the schedulers and rescheduler from "fighting" one another.)
The rescheduler will be configured using ConfigMap; the cluster administrator can enable
or disable policies and can tune the rescheduler's aggressiveness (aggressive means it
will use a relatively low threshold for triggering an eviction and may consume a lot of
disruption budget, while non-aggressive means it will use a relatively high threshold for
triggering an eviction and will try to leave plenty of buffer in disruption budgets). The
first version of the rescheduler will not be extensible or pluggable, since we want to
keep the code simple while we gain experience with the overall concept. In the future, we
anticipate a version that will be extensible and pluggable.
We might want some way to force the evicted pod to the front of the scheduler queue,
independently of its priority.
See #12140 for additional discussion.
### Final comments
In general, the design space for this topic is huge. This document describes some of the
design considerations and proposes one particular initial implementation. We expect
certain aspects of the design to be "permanent" (e.g. the notion and use of priorities,
preemption, disruption budgets, and the `/evict` subresource) while others may change over time
(e.g. the partitioning of functionality between schedulers, controllers, rescheduler,
horizontal pod autoscaler, and cluster autoscaler; the policies the rescheduler implements;
the factors the rescheduler takes into account when making decisions (e.g. knowledge of
schedulers' predicate and priority functions, second-order effects like whether and where
evicted pod will be able to reschedule, etc.); the way the rescheduler enacts its
decisions; and the complexity of the plans the rescheduler attempts to implement).
## Implementation plan
The highest-priority feature to implement is the rescheduler with the two use cases
highlighted earlier: moving a pod onto an under-utilized node, and moving a pod onto a
node that meets more of the pod's affinity/anti-affinity preferences. The former is
useful to rebalance pods after cluster auto-scale-up, and the latter is useful for
Ubernetes. This requires implementing disruption budgets and the `/evict` subresource,
but not priority or preemption.
Because the general topic of rescheduling is very speculative, we have intentionally
proposed that the first version of the rescheduler be very simple -- only uses eviction
(no attempt to guide replacement pod to any particular node), doesn't know schedulers'
predicate or priority functions, doesn't try to move two pods at the same time, and only
implements two use cases. As alluded to in the previous subsection, we expect the design
and implementation to evolve over time, and we encourage members of the community to
experiment with more sophisticated policies and to report their results from using them
on real workloads.
## Alternative implementations
TODO.
## Additional references
TODO.
TODO: Add reference to this doc from docs/proposals/rescheduler.md

View File

@ -1,151 +1 @@
# Resource Metrics API This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-metrics-api.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-metrics-api.md)
This document describes the API part of the MVP version of the Resource Metrics API effort in Kubernetes.
Once agreement is reached, the document will be extended to also cover implementation details.
The shape of the effort may also change once we have more well-defined use cases.
## Goal
The goal for the effort is to provide resource usage metrics for pods and nodes through the API server.
This will be a stable, versioned API which core Kubernetes components can rely on.
In the first version only the well-defined use cases will be handled,
although the API should be easily extensible for potential future use cases.
## Main use cases
This section describes well-defined use cases which should be handled in the first version.
Use cases which are not listed below are out of the scope of the MVP version of the Resource Metrics API.
#### Horizontal Pod Autoscaler
HPA uses the latest value of cpu usage as an average aggregated across 1 minute
(the window may change in the future). The data for a given set of pods
(defined either by a pod list or a label selector) should be accessible in one request
for performance reasons.
#### Scheduler
In order to schedule best-effort pods, the scheduler requires node-level resource usage metrics
as an average aggregated across 1 minute (the window may change in the future).
The metrics should be available for all resources supported in the scheduler.
Currently the scheduler does not need this information, because it schedules best-effort pods
without considering node usage. But having the metrics available in the API server is a blocker
for adding the ability to take node usage into account when scheduling best-effort pods.
## Other considered use cases
This section describes the other considered use cases and explains why they are out
of the scope of the MVP version.
#### Custom metrics in HPA
HPA requires the latest value of application-level metrics.
The design of the pipeline for collecting application-level metrics should
be revisited, and it's not clear whether application-level metrics should be
available in the API server, so this use case initially won't be supported.
#### Cluster Federation
The Cluster Federation control system might want to consider cluster-level usage (in addition to cluster-level request)
of running pods when choosing where to schedule new pods. Although
Cluster Federation is still in design,
we expect the metrics API described here to be sufficient. Cluster-level usage can be
obtained by summing over usage of all nodes in the cluster.
#### kubectl top
This feature is not yet specified/implemented, although it seems reasonable to provide users with information
about resource usage at the pod/node level.
Since this feature has not been fully specified yet, it will not be supported initially in the API, although
it will probably be possible to provide a reasonable implementation of the feature anyway.
#### Kubernetes dashboard
[Kubernetes dashboard](https://github.com/kubernetes/dashboard), in order to draw graphs, requires resource usage
in time-series format over a relatively long period of time. Aggregations should also be possible at various levels,
including replication controllers, deployments, services, etc.
Since this use case is more complicated, it will not be supported initially in the API; the dashboard will query Heapster
directly using a custom API there.
## Proposed API
Initially the metrics API will be in a separate [API group](api-group.md) called ```metrics```.
Later, if we decide to put Node and Pod in different API groups,
NodeMetrics and PodMetrics should also be placed in different API groups.
#### Schema
The proposed schema is as follows. Each top-level object has `TypeMeta` and `ObjectMeta` fields
to be compatible with Kubernetes API standards.
```go
type NodeMetrics struct {
	unversioned.TypeMeta
	ObjectMeta

	// The following fields define time interval from which metrics were
	// collected in the following format [Timestamp-Window, Timestamp].
	Timestamp unversioned.Time
	Window    unversioned.Duration

	// The memory usage is the memory working set.
	Usage v1.ResourceList
}

type PodMetrics struct {
	unversioned.TypeMeta
	ObjectMeta

	// The following fields define time interval from which metrics were
	// collected in the following format [Timestamp-Window, Timestamp].
	Timestamp unversioned.Time
	Window    unversioned.Duration

	// Metrics for all containers are collected within the same time window.
	Containers []ContainerMetrics
}

type ContainerMetrics struct {
	// Container name corresponding to the one from v1.Pod.Spec.Containers.
	Name string

	// The memory usage is the memory working set.
	Usage v1.ResourceList
}
```
By default `Usage` is the mean from samples collected within the returned time window.
The default time window is 1 minute.
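For illustration, a consumer could aggregate the per-container numbers above into pod-level usage as in the sketch below; it uses simplified stand-in types (string keys and int64 quantities) rather than the real `v1.ResourceList`, so it is only meant to show the shape of the computation.
```go
package main

import "fmt"

// Simplified stand-ins for v1.ResourceList and ContainerMetrics; real
// quantities are resource.Quantity values, not plain int64s.
type ResourceList map[string]int64

type ContainerMetrics struct {
	Name  string
	Usage ResourceList
}

// podUsage sums container usage into pod-level usage for each resource name.
func podUsage(containers []ContainerMetrics) ResourceList {
	total := ResourceList{}
	for _, c := range containers {
		for name, quantity := range c.Usage {
			total[name] += quantity
		}
	}
	return total
}

func main() {
	containers := []ContainerMetrics{
		{Name: "app", Usage: ResourceList{"cpu": 250, "memory": 128 << 20}},    // 250 millicores, 128Mi
		{Name: "sidecar", Usage: ResourceList{"cpu": 50, "memory": 32 << 20}},  // 50 millicores, 32Mi
	}
	fmt.Println(podUsage(containers)) // map[cpu:300 memory:167772160]
}
```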
#### Endpoints
All endpoints are GET endpoints, rooted at `/apis/metrics/v1alpha1/`.
There won't be support for the other REST methods.
The list of supported endpoints:
- `/nodes` - all node metrics; type `[]NodeMetrics`
- `/nodes/{node}` - metrics for a specified node; type `NodeMetrics`
- `/namespaces/{namespace}/pods` - all pod metrics within namespace with support for `all-namespaces`; type `[]PodMetrics`
- `/namespaces/{namespace}/pods/{pod}` - metrics for a specified pod; type `PodMetrics`
The following query parameters are supported:
- `labelSelector` - restrict the list of returned objects by labels (list endpoints only)
In the future we may want to introduce the following params:
`aggregator` (`max`, `min`, `95th`, etc.) and `window` (`1h`, `1d`, `1w`, etc.)
which will allow getting other aggregates over a custom time window.
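As a usage sketch, a client could reach these endpoints through `kubectl proxy` and decode the response as below; the proxy address and the trimmed-down struct are assumptions made for the example, not part of the API definition.
```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Trimmed-down view of NodeMetrics, just enough to decode the fields shown above.
type nodeMetrics struct {
	Metadata  struct{ Name string } `json:"metadata"`
	Timestamp time.Time             `json:"timestamp"`
	Window    string                `json:"window"`
	Usage     map[string]string     `json:"usage"`
}

func main() {
	// Assumes `kubectl proxy` is running locally on its default port.
	resp, err := http.Get("http://127.0.0.1:8001/apis/metrics/v1alpha1/nodes")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var nodes []nodeMetrics
	if err := json.NewDecoder(resp.Body).Decode(&nodes); err != nil {
		panic(err)
	}
	for _, n := range nodes {
		fmt.Printf("%s: %v (window %s)\n", n.Metadata.Name, n.Usage, n.Window)
	}
}
```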
## Further improvements
Depending on the further requirements the following features may be added:
- support for more metrics
- support for application level metrics
- watch for metrics
- possibility to query for window sizes and aggregation functions (though single window size/aggregation function per request)
- cluster level metrics
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/resource-metrics-api.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

View File

@ -1,333 +1 @@
# Resource Quota - Scoping resources This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-quota-scoping.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-quota-scoping.md)
## Problem Description
### Ability to limit compute requests and limits
The existing `ResourceQuota` API object constrains the total amount of compute
resource requests. This is useful when a cluster-admin is interested in
controlling explicit resource guarantees so that pods created by users who stay
within their quota have a relatively strong guarantee of finding
enough free resources in the cluster to schedule. The end-user creating
the pod is expected to have intimate knowledge of their minimum required resources
as well as their potential limits.
There are many environments where a cluster-admin does not extend this level
of trust to their end-users because users often request too much resource, and
they have trouble reasoning about what they hope to have available for their
application versus what their application actually needs. In these environments,
the cluster-admin will often just expose a single value (the limit) to the end-user.
Internally, they may choose a variety of other strategies for setting the request.
For example, some cluster operators are focused on satisfying a particular over-commit
ratio and may choose to set the request as a factor of the limit to control for
over-commit. Other cluster operators may defer to a resource estimation tool that
sets the request based on known historical trends. In this environment, the
cluster-admin is interested in exposing a quota to their end-users that maps
to their desired limit instead of their request since that is the value the user
manages.
### Ability to limit impact to node and promote fair-use
The current `ResourceQuota` API object does not provide the ability
to quota best-effort pods separately from pods with resource guarantees.
For example, if a cluster-admin applies a quota that caps requested
cpu at 10 cores and memory at 10Gi, all pods in the namespace must
make an explicit resource request for cpu and memory to satisfy
quota. This prevents a namespace with a quota from supporting best-effort
pods.
In practice, the cluster-admin wants to control the impact of best-effort
pods to the cluster, but not restrict the ability to run best-effort pods
altogether.
As a result, the cluster-admin requires the ability to control the
max number of active best-effort pods. In addition, the cluster-admin
requires the ability to scope a quota that limits compute resources to
exclude best-effort pods.
### Ability to quota long-running vs. bounded-duration compute resources
The cluster-admin may want to quota end-users separately
based on long-running vs. bounded-duration compute resources.
For example, a cluster-admin may offer more compute resources
for long running pods that are expected to have a more permanent residence
on the node than bounded-duration pods. Many batch style workloads
tend to consume as much resource as they can until something else applies
the brakes. As a result, these workloads tend to operate at their limit,
while many traditional web applications may often consume closer to their
request if there is no active traffic. An operator that wants to control
density will offer lower quota limits for batch workloads than web applications.
A classic example is a PaaS deployment where the cluster-admin may
allow a separate budget for pods that run their web application vs. pods that
build web applications.
Another example is providing more quota to a database pod than a
pod that performs a database migration.
## Use Cases
* As a cluster-admin, I want the ability to quota
* compute resource requests
* compute resource limits
* compute resources for terminating vs. non-terminating workloads
* compute resources for best-effort vs. non-best-effort pods
## Proposed Change
### New quota tracked resources
Support the following resources that can be tracked by quota.
| Resource Name | Description |
| ------------- | ----------- |
| cpu | total cpu requests (backwards compatibility) |
| memory | total memory requests (backwards compatibility) |
| requests.cpu | total cpu requests |
| requests.memory | total memory requests |
| limits.cpu | total cpu limits |
| limits.memory | total memory limits |
### Resource Quota Scopes
Add the ability to associate a set of `scopes` to a quota.
A quota will only measure usage for a `resource` if it matches
the intersection of enumerated `scopes`.
Adding a `scope` to a quota limits the number of resources
it supports to those that pertain to the `scope`. Specifying
a resource on the quota object outside of the allowed set
would result in a validation error.
| Scope | Description |
| ----- | ----------- |
| Terminating | Match `kind=Pod` where `spec.activeDeadlineSeconds >= 0` |
| NotTerminating | Match `kind=Pod` where `spec.activeDeadlineSeconds = nil` |
| BestEffort | Match `kind=Pod` where `status.qualityOfService in (BestEffort)` |
| NotBestEffort | Match `kind=Pod` where `status.qualityOfService not in (BestEffort)` |
A `BestEffort` scope restricts a quota to tracking the following resources:
* pod
A `Terminating`, `NotTerminating`, `NotBestEffort` scope restricts a quota to
tracking the following resources:
* pod
* memory, requests.memory, limits.memory
* cpu, requests.cpu, limits.cpu
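The matching semantics in the scope table above can be illustrated with the following sketch, which uses a simplified pod representation rather than the real `api.Pod` type.
```go
package main

import "fmt"

// Simplified stand-in for the pod fields the scopes inspect.
type pod struct {
	activeDeadlineSeconds *int64 // spec.activeDeadlineSeconds
	qualityOfService      string // status.qualityOfService
}

// matchesScope mirrors the scope table: Terminating/NotTerminating look at
// activeDeadlineSeconds, BestEffort/NotBestEffort look at the QoS class.
func matchesScope(scope string, p pod) bool {
	switch scope {
	case "Terminating":
		return p.activeDeadlineSeconds != nil
	case "NotTerminating":
		return p.activeDeadlineSeconds == nil
	case "BestEffort":
		return p.qualityOfService == "BestEffort"
	case "NotBestEffort":
		return p.qualityOfService != "BestEffort"
	}
	return false
}

// matchesAllScopes implements the intersection rule: a quota tracks a pod only
// if the pod matches every scope enumerated on the quota.
func matchesAllScopes(scopes []string, p pod) bool {
	for _, s := range scopes {
		if !matchesScope(s, p) {
			return false
		}
	}
	return true
}

func main() {
	deadline := int64(600)
	batch := pod{activeDeadlineSeconds: &deadline, qualityOfService: "Burstable"}
	fmt.Println(matchesAllScopes([]string{"Terminating", "NotBestEffort"}, batch)) // true
}
```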
## Data Model Impact
```
// The following identify resource constants for Kubernetes object types
const (
	// CPU request, in cores. (500m = .5 cores)
	ResourceRequestsCPU ResourceName = "requests.cpu"
	// Memory request, in bytes. (500Gi = 500GiB = 500 * 1024 * 1024 * 1024)
	ResourceRequestsMemory ResourceName = "requests.memory"
	// CPU limit, in cores. (500m = .5 cores)
	ResourceLimitsCPU ResourceName = "limits.cpu"
	// Memory limit, in bytes. (500Gi = 500GiB = 500 * 1024 * 1024 * 1024)
	ResourceLimitsMemory ResourceName = "limits.memory"
)

// A scope is a filter that matches an object
type ResourceQuotaScope string

const (
	ResourceQuotaScopeTerminating    ResourceQuotaScope = "Terminating"
	ResourceQuotaScopeNotTerminating ResourceQuotaScope = "NotTerminating"
	ResourceQuotaScopeBestEffort     ResourceQuotaScope = "BestEffort"
	ResourceQuotaScopeNotBestEffort  ResourceQuotaScope = "NotBestEffort"
)

// ResourceQuotaSpec defines the desired hard limits to enforce for Quota
// The quota matches by default on all objects in its namespace.
// The quota can optionally match objects that satisfy a set of scopes.
type ResourceQuotaSpec struct {
	// Hard is the set of desired hard limits for each named resource
	Hard ResourceList `json:"hard,omitempty"`
	// A collection of filters that must match each object tracked by a quota.
	// If not specified, the quota matches all objects.
	Scopes []ResourceQuotaScope `json:"scopes,omitempty"`
}
```
## Rest API Impact
None.
## Security Impact
None.
## End User Impact
The `kubectl` commands that render quota should display its scopes.
## Performance Impact
This feature will make having more quota objects in a namespace
more common in certain clusters. This impacts the number of quota
objects that need to be incremented during creation of an object
in admission control. It impacts the number of quota objects
that need to be updated during controller loops.
## Developer Impact
None.
## Alternatives
This proposal initially enumerated a solution that leveraged a
`FieldSelector` on a `ResourceQuota` object. A `FieldSelector`
grouped an `APIVersion` and `Kind` with a selector over its
fields that supported set-based requirements. It would have allowed
a quota to track objects based on cluster defined attributes.
For example, a quota could do the following:
* match `Kind=Pod` where `spec.restartPolicy in (Always)`
* match `Kind=Pod` where `spec.restartPolicy in (Never, OnFailure)`
* match `Kind=Pod` where `status.qualityOfService in (BestEffort)`
* match `Kind=Service` where `spec.type in (LoadBalancer)`
* see [#17484](https://github.com/kubernetes/kubernetes/issues/17484)
Theoretically, it would enable support for fine-grained tracking
on a variety of resource types. While extremely flexible, there
are cons to this approach that make it premature to pursue
at this time.
* Generic field selectors are not yet settled art
* see [#1362](https://github.com/kubernetes/kubernetes/issues/1362)
* see [#19804](https://github.com/kubernetes/kubernetes/pull/19804)
* Discovery API Limitations
* Not possible to discover the set of field selectors supported by kind.
* Not possible to discover if a field is readonly, readwrite, or immutable
post-creation.
The quota system would want to validate that a field selector is valid,
and it would only want to select on those fields that are readonly/immutable
post creation to make resource tracking work during update operations.
The current proposal could grow to support a `FieldSelector` on a
`ResourceQuotaSpec` and support a simple migration path to convert
`scopes` to the matching `FieldSelector` once the project has identified
how it wants to handle `fieldSelector` requirements longer term.
This proposal previously discussed a solution that leveraged a
`LabelSelector` as a mechanism to partition quota. This is potentially
interesting to explore in the future to allow `namespace-admins` to
quota workloads based on local knowledge. For example, a quota
could match all kinds that match the selector
`tier=cache, environment in (dev, qa)` separately from quota that
matched `tier=cache, environment in (prod)`. This is interesting to
explore in the future, but labels are insufficient selection targets
for `cluster-administrators` to control footprint. In those instances,
you need fields that are cluster controlled and not user-defined.
## Example
### Scenario 1
The cluster-admin wants to restrict the following:
* limit 2 best-effort pods
* limit 2 terminating pods that cannot use more than 1Gi of memory, and 2 cpu cores
* limit 4 long-running pods that cannot use more than 4Gi of memory, and 4 cpu cores
* limit 6 pods in total, 10 replication controllers
This would require the following quotas to be added to the namespace:
```
$ cat quota-best-effort
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota-best-effort
spec:
  hard:
    pods: "2"
  scopes:
  - BestEffort

$ cat quota-terminating
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota-terminating
spec:
  hard:
    pods: "2"
    limits.memory: 1Gi
    limits.cpu: 2
  scopes:
  - Terminating
  - NotBestEffort

$ cat quota-longrunning
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota-longrunning
spec:
  hard:
    pods: "2"
    limits.memory: 4Gi
    limits.cpu: 4
  scopes:
  - NotTerminating
  - NotBestEffort

$ cat quota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota
spec:
  hard:
    pods: "6"
    replicationcontrollers: "10"
```
In the above scenario, every pod creation will result in its usage being
tracked by `quota` since it has no additional scoping. The pod will then
be tracked by 1 additional quota object based on the scopes it
matches. In order for the pod creation to succeed, it must not violate
the constraint of any matching quota. So for example, a best-effort pod
would only be created if there was available quota in `quota-best-effort`
and `quota`.
## Implementation
### Assignee
@derekwaynecarr
### Work Items
* Add support for requests and limits
* Add support for scopes in quota-related admission and controller code
## Dependencies
None.
Longer term, we should evaluate what we want to do with `fieldSelector` as
the requests around different quota semantics will continue to grow.
## Testing
Appropriate unit and e2e testing will be authored.
## Documentation Impact
Existing resource quota documentation and examples will be updated.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/resource-quota-scoping.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

View File

@ -1,206 +1 @@
# Client/Server container runtime This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/runtime-client-server.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/runtime-client-server.md)
## Abstract
A proposal of client/server implementation of kubelet container runtime interface.
## Motivation
Currently, any container runtime has to be linked into the kubelet. This makes
experimentation difficult, and prevents users from landing an alternate
container runtime without landing code in core kubernetes.
To facilitate experimentation and to enable user choice, this proposal adds a
client/server implementation of the [new container runtime interface](https://github.com/kubernetes/kubernetes/pull/25899). The main goal
of this proposal is:
- make it easy to integrate new container runtimes
- improve code maintainability
## Proposed design
**Design of client/server container runtime**
The main idea of the client/server container runtime is to keep the main control logic in the kubelet while letting the remote runtime perform only dedicated actions. An alpha [container runtime API](../../pkg/kubelet/api/v1alpha1/runtime/api.proto) is introduced for integrating new container runtimes. The API is based on [protobuf](https://developers.google.com/protocol-buffers/) and [gRPC](http://www.grpc.io) for a number of benefits:
- Performs faster than JSON
- Client bindings come for free: gRPC supports ten languages
- No encoding/decoding code needed
- API interfaces are easy to manage: server and client interfaces are generated automatically
A new container runtime manager `KubeletGenericRuntimeManager` will be introduced to the kubelet, which will
- conform to kubelet's [Runtime](../../pkg/kubelet/container/runtime.go#L58) interface
- manage Pod and Container lifecycle according to kubelet policies
- call the remote runtime's API to perform specific pod, container or image operations
A simple workflow of invoking the remote runtime API when starting a Pod with two containers is shown below:
```
Kubelet KubeletGenericRuntimeManager RemoteRuntime
+ + +
| | |
+---------SyncPod------------->+ |
| | |
| +---- Create PodSandbox ------->+
| +<------------------------------+
| | |
| XXXXXXXXXXXX |
| | X |
| | NetworkPlugin. |
| | SetupPod |
| | X |
| XXXXXXXXXXXX |
| | |
| +<------------------------------+
| +---- Pull image1 -------->+
| +<------------------------------+
| +---- Create container1 ------->+
| +<------------------------------+
| +---- Start container1 -------->+
| +<------------------------------+
| | |
| +<------------------------------+
| +---- Pull image2 -------->+
| +<------------------------------+
| +---- Create container2 ------->+
| +<------------------------------+
| +---- Start container2 -------->+
| +<------------------------------+
| | |
| <-------Success--------------+ |
| | |
+ + +
```
And deleting a pod can be shown:
```
Kubelet KubeletGenericRuntimeManager RemoteRuntime
+ + +
| | |
+---------SyncPod------------->+ |
| | |
| +---- Stop container1 ----->+
| +<------------------------------+
| +---- Delete container1 ----->+
| +<------------------------------+
| | |
| +---- Stop container2 ------>+
| +<------------------------------+
| +---- Delete container2 ------>+
| +<------------------------------+
| | |
| XXXXXXXXXXXX |
| | X |
| | NetworkPlugin. |
| | TeardownPod |
| | X |
| XXXXXXXXXXXX |
| | |
| | |
| +---- Delete PodSandbox ------>+
| +<------------------------------+
| | |
| <-------Success--------------+ |
| | |
+ + +
```
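In Go, the manager side of the pod-creation flow above might be sketched as follows; `RuntimeClient` here stands in for a client of the runtime API described below, the request and response messages are elided, and the fake implementation exists only to make the sketch self-contained.
```go
package main

import "fmt"

// RuntimeClient stands in for a gRPC client generated from the runtime API;
// the real request/response messages are elided to keep the sketch short.
type RuntimeClient interface {
	CreatePodSandbox(pod string) (string, error)
	PullImage(image string) error
	CreateContainer(sandboxID, name, image string) (string, error)
	StartContainer(containerID string) error
}

// syncPod mirrors the sequence diagram: create the sandbox, set up networking,
// then pull, create, and start each container in turn.
func syncPod(c RuntimeClient, pod string, containers map[string]string) error {
	sandboxID, err := c.CreatePodSandbox(pod)
	if err != nil {
		return err
	}
	// NetworkPlugin.SetupPod would be invoked here.
	for name, image := range containers {
		if err := c.PullImage(image); err != nil {
			return err
		}
		id, err := c.CreateContainer(sandboxID, name, image)
		if err != nil {
			return err
		}
		if err := c.StartContainer(id); err != nil {
			return err
		}
	}
	return nil
}

// fakeRuntime is a trivial in-memory implementation used only to make the
// sketch executable.
type fakeRuntime struct{ n int }

func (f *fakeRuntime) CreatePodSandbox(pod string) (string, error) { return pod + "-sandbox", nil }
func (f *fakeRuntime) PullImage(image string) error                { return nil }
func (f *fakeRuntime) CreateContainer(s, n, i string) (string, error) {
	f.n++
	return fmt.Sprintf("ctr-%d", f.n), nil
}
func (f *fakeRuntime) StartContainer(id string) error { fmt.Println("started", id); return nil }

func main() {
	_ = syncPod(&fakeRuntime{}, "nginx", map[string]string{"web": "nginx:1.11", "log": "busybox"})
}
```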
**API definition**
Since we are going to introduce more image formats and want to separate image management from containers and pods, this proposal introduces two services `RuntimeService` and `ImageService`. Both services are defined at [pkg/kubelet/api/v1alpha1/runtime/api.proto](../../pkg/kubelet/api/v1alpha1/runtime/api.proto):
```proto
// Runtime service defines the public APIs for remote container runtimes
service RuntimeService {
    // Version returns the runtime name, runtime version and runtime API version
    rpc Version(VersionRequest) returns (VersionResponse) {}

    // CreatePodSandbox creates a pod-level sandbox.
    // The definition of PodSandbox is at https://github.com/kubernetes/kubernetes/pull/25899
    rpc CreatePodSandbox(CreatePodSandboxRequest) returns (CreatePodSandboxResponse) {}
    // StopPodSandbox stops the sandbox. If there are any running containers in the
    // sandbox, they should be force terminated.
    rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse) {}
    // DeletePodSandbox deletes the sandbox. If there are any running containers in the
    // sandbox, they should be force deleted.
    rpc DeletePodSandbox(DeletePodSandboxRequest) returns (DeletePodSandboxResponse) {}
    // PodSandboxStatus returns the Status of the PodSandbox.
    rpc PodSandboxStatus(PodSandboxStatusRequest) returns (PodSandboxStatusResponse) {}
    // ListPodSandbox returns a list of SandBox.
    rpc ListPodSandbox(ListPodSandboxRequest) returns (ListPodSandboxResponse) {}

    // CreateContainer creates a new container in specified PodSandbox
    rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse) {}
    // StartContainer starts the container.
    rpc StartContainer(StartContainerRequest) returns (StartContainerResponse) {}
    // StopContainer stops a running container with a grace period (i.e., timeout).
    rpc StopContainer(StopContainerRequest) returns (StopContainerResponse) {}
    // RemoveContainer removes the container. If the container is running, the container
    // should be force removed.
    rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse) {}
    // ListContainers lists all containers by filters.
    rpc ListContainers(ListContainersRequest) returns (ListContainersResponse) {}
    // ContainerStatus returns status of the container.
    rpc ContainerStatus(ContainerStatusRequest) returns (ContainerStatusResponse) {}
    // Exec executes the command in the container.
    rpc Exec(stream ExecRequest) returns (stream ExecResponse) {}
}

// Image service defines the public APIs for managing images
service ImageService {
    // ListImages lists existing images.
    rpc ListImages(ListImagesRequest) returns (ListImagesResponse) {}
    // ImageStatus returns the status of the image.
    rpc ImageStatus(ImageStatusRequest) returns (ImageStatusResponse) {}
    // PullImage pulls an image with authentication config.
    rpc PullImage(PullImageRequest) returns (PullImageResponse) {}
    // RemoveImage removes the image.
    rpc RemoveImage(RemoveImageRequest) returns (RemoveImageResponse) {}
}
```
Note that some types in [pkg/kubelet/api/v1alpha1/runtime/api.proto](../../pkg/kubelet/api/v1alpha1/runtime/api.proto) are already defined at [Container runtime interface/integration](https://github.com/kubernetes/kubernetes/pull/25899).
We should decide how to integrate the types in [#25899](https://github.com/kubernetes/kubernetes/pull/25899) with gRPC services:
* Auto-generate those types into protobuf by [go2idl](../../cmd/libs/go2idl/)
- Pros:
- trace type changes automatically, all type changes in Go will be automatically generated into proto files
- Cons:
- type changes may break existing API implementations, e.g. new fields added automatically may not be noticed by the remote runtime
- needs to convert Go types to gRPC generated types, and vice versa
- needs to process attribute order carefully so as not to break generated protobufs (this could be done by using [protobuf tag](https://developers.google.com/protocol-buffers/docs/gotutorial))
- go2idl doesn't support gRPC, [protoc-gen-gogo](https://github.com/gogo/protobuf) is still required for generating gRPC client
* Embed those types as raw protobuf definitions and generate Go files by [protoc-gen-gogo](https://github.com/gogo/protobuf)
- Pros:
- decouple type definitions, all type changes in Go will be added to proto manually, so it's easier to track gRPC API version changes
- Kubelet could reuse Go types generated by `protoc-gen-gogo` to avoid type conversions
- Cons:
- duplicate definition of same types
- hard to track type changes automatically
- need to manage proto files manually
For better version control and faster iteration, this proposal embeds all those types in `api.proto` directly.
## Implementation
Each new runtime should implement the [gRPC](http://www.grpc.io) server based on [pkg/kubelet/api/v1alpha1/runtime/api.proto](../../pkg/kubelet/api/v1alpha1/runtime/api.proto). For version controlling, `KubeletGenericRuntimeManager` will request `RemoteRuntime`'s `Version()` interface with the runtime api version. To keep backward compatibility, the API follows standard [protobuf guide](https://developers.google.com/protocol-buffers/docs/proto) to deprecate or add new interfaces.
A new flag `--container-runtime-endpoint` (overrides `--container-runtime`) will be introduced to kubelet which identifies the unix socket file of the remote runtime service. And new flag `--image-service-endpoint` will be introduced to kubelet which identifies the unix socket file of the image service.
To facilitate switching the current container runtime (e.g. `docker` or `rkt`) to the new runtime API, `KubeletGenericRuntimeManager` will provide a plugin mechanism allowing either a local implementation or a gRPC implementation to be specified.
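A minimal sketch of how the kubelet side might dial such an endpoint and negotiate the version is shown below; the generated package name and import path (`runtimeapi`), the socket path, and the empty `VersionRequest` are assumptions for the example, not settled details.
```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"

	"google.golang.org/grpc"

	// Assumed to be the Go package generated from api.proto.
	runtimeapi "k8s.io/kubernetes/pkg/kubelet/api/v1alpha1/runtime"
)

func main() {
	// --container-runtime-endpoint: a unix socket exposed by the remote runtime.
	endpoint := "/var/run/remote-runtime.sock"

	conn, err := grpc.Dial(endpoint,
		grpc.WithInsecure(),
		grpc.WithDialer(func(addr string, timeout time.Duration) (net.Conn, error) {
			return net.DialTimeout("unix", addr, timeout)
		}))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)

	// Version() is the first call the kubelet makes, to negotiate the runtime API version.
	resp, err := client.Version(context.Background(), &runtimeapi.VersionRequest{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("remote runtime version info: %+v\n", resp)
}
```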
## Community Discussion
This proposal is first filed by [@brendandburns](https://github.com/brendandburns) at [kubernetes/13768](https://github.com/kubernetes/kubernetes/issues/13768):
* [kubernetes/13768](https://github.com/kubernetes/kubernetes/issues/13768)
* [kubernetes/13079](https://github.com/kubernetes/kubernetes/pull/13079)
* [New container runtime interface](https://github.com/kubernetes/kubernetes/pull/25899)
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/runtime-client-server.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

View File

@ -1,173 +1 @@
# Kubelet: Runtime Pod Cache This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/runtime-pod-cache.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/runtime-pod-cache.md)
This proposal builds on top of the Pod Lifecycle Event Generator (PLEG) proposed
in [#12802](https://issues.k8s.io/12802). It assumes that Kubelet subscribes to
the pod lifecycle event stream to eliminate periodic polling of pod
states. Please see [#12802](https://issues.k8s.io/12802) for the motivation and
design concept for PLEG.
Runtime pod cache is an in-memory cache which stores the *status* of
all pods, and is maintained by PLEG. It serves as a single source of
truth for internal pod status, freeing Kubelet from querying the
container runtime.
## Motivation
With PLEG, Kubelet no longer needs to perform comprehensive state
checking for all pods periodically. It only instructs a pod worker to
start syncing when there is a change of its pod status. Nevertheless,
during each sync, a pod worker still needs to construct the pod status
by examining all containers (whether dead or alive) in the pod, due to
the lack of the caching of previous states. With the integration of
pod cache, we can further improve Kubelet's CPU usage by
1. Lowering the number of concurrent requests to the container
runtime since pod workers no longer have to query the runtime
individually.
2. Lowering the total number of inspect requests because there is no
need to inspect containers with no state changes.
***Don't we already have a [container runtime cache](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/container/runtime_cache.go)?***
The runtime cache is an optimization that reduces the number of `GetPods()`
calls from the workers. However,
* The cache does not store all information necessary for a worker to
complete a sync (e.g., `docker inspect`); workers still need to inspect
containers individually to generate `api.PodStatus`.
* Workers sometimes need to bypass the cache in order to retrieve the
latest pod state.
This proposal generalizes the cache and instructs PLEG to populate the cache, so
that the content is always up-to-date.
**Why can't each worker cache its own pod status?**
The short answer is yes, they can. The longer answer is that localized
caching limits the use of the cache content -- other components cannot
access it. This often leads to caching at multiple places and/or passing
objects around, complicating the control flow.
## Runtime Pod Cache
![pod cache](pod-cache.png)
Pod cache stores the `PodStatus` for all pods on the node. `PodStatus` encompasses
all the information required from the container runtime to generate
`api.PodStatus` for a pod.
```go
// PodStatus represents the status of the pod and its containers.
// api.PodStatus can be derived from examining PodStatus and api.Pod.
type PodStatus struct {
	ID                types.UID
	Name              string
	Namespace         string
	IP                string
	ContainerStatuses []*ContainerStatus
}

// ContainerStatus represents the status of a container.
type ContainerStatus struct {
	ID           ContainerID
	Name         string
	State        ContainerState
	CreatedAt    time.Time
	StartedAt    time.Time
	FinishedAt   time.Time
	ExitCode     int
	Image        string
	ImageID      string
	Hash         uint64
	RestartCount int
	Reason       string
	Message      string
}
```
`PodStatus` is defined in the container runtime interface, hence is
runtime-agnostic.
PLEG is responsible for updating the entries in the pod cache, hence always keeping
the cache up-to-date. To do so, it will:
1. Detect change of container state
2. Inspect the pod for details
3. Update the pod cache with the new PodStatus
- If there is no real change of the pod entry, do nothing
- Otherwise, generate and send out the corresponding pod lifecycle event
Note that in (3), PLEG can check if there is any disparity between the old
and the new pod entry to filter out duplicated events if needed.
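A minimal sketch of such a PLEG-maintained cache is shown below; it reuses the `PodStatus` type defined above, uses `reflect.DeepEqual` as a stand-in for whatever comparison the real implementation would use, and omits package and import declarations (`sync`, `reflect`, `types`) in the style of the snippet above.
```go
// Sketch of a PLEG-maintained cache keyed by pod UID.
type podCache struct {
	lock    sync.RWMutex
	entries map[types.UID]*PodStatus
}

func newPodCache() *podCache {
	return &podCache{entries: map[types.UID]*PodStatus{}}
}

// Set stores the latest status and reports whether it actually changed, so
// that PLEG can decide whether to emit a pod lifecycle event.
func (c *podCache) Set(id types.UID, status *PodStatus) (changed bool) {
	c.lock.Lock()
	defer c.lock.Unlock()
	if old, ok := c.entries[id]; ok && reflect.DeepEqual(old, status) {
		return false
	}
	c.entries[id] = status
	return true
}

// Get returns the cached status; pod workers read from here instead of
// querying the container runtime directly.
func (c *podCache) Get(id types.UID) (*PodStatus, bool) {
	c.lock.RLock()
	defer c.lock.RUnlock()
	status, ok := c.entries[id]
	return status, ok
}

// Delete evicts a pod that is no longer visible to the container runtime.
func (c *podCache) Delete(id types.UID) {
	c.lock.Lock()
	defer c.lock.Unlock()
	delete(c.entries, id)
}
```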
### Evict cache entries
Note that the cache represents all the pods/containers known by the container
runtime. A cache entry should only be evicted if the pod is no longer visible
to the container runtime. PLEG is responsible for deleting entries in the
cache.
### Generate `api.PodStatus`
Because pod cache stores the up-to-date `PodStatus` of the pods, Kubelet can
generate the `api.PodStatus` by interpreting the cache entry at any
time. To avoid sending intermediate status (e.g., while a pod worker
is restarting a container), we will instruct the pod worker to generate a new
status at the beginning of each sync.
### Cache contention
Cache contention should not be a problem when the number of pods is
small. When Kubelet scales, we can always shard the pods by ID to
reduce contention.
### Disk management
The pod cache is not capable of fulfilling the needs of container/image garbage
collectors as they may demand more than pod-level information. These components
will still need to query the container runtime directly at times. We may
consider extending the cache for these use cases, but they are beyond the scope
of this proposal.
## Impact on Pod Worker Control Flow
A pod worker may perform various operations (e.g., start/kill a container)
during a sync. They will expect to see the results of such operations reflected
in the cache in the next sync. Alternately, they can bypass the cache and
query the container runtime directly to get the latest status. However, this
is not desirable since the cache is introduced exactly to eliminate unnecessary,
concurrent queries. Therefore, a pod worker should be blocked until all expected
results have been updated to the cache by PLEG.
Depending on the type of PLEG (see [#12802](https://issues.k8s.io/12802)) in
use, the methods to check whether a requirement is met can differ. For a
PLEG that solely relies on relisting, a pod worker can simply wait until the
relist timestamp is newer than the end of the worker's last sync. On the other
hand, if pod worker knows what events to expect, they can also block until the
events are observed.
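For a relist-based PLEG, the blocking behavior described above can be sketched with a condition variable: the worker waits until the global relist timestamp passes the end of its last sync. This is a standalone illustration, not the actual kubelet code.
```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cacheClock tracks the time of the most recent PLEG relist and lets pod
// workers block until the cache is at least as new as a given instant.
type cacheClock struct {
	lock      sync.Mutex
	cond      *sync.Cond
	timestamp time.Time
}

func newCacheClock() *cacheClock {
	c := &cacheClock{}
	c.cond = sync.NewCond(&c.lock)
	return c
}

// UpdateTime is called by PLEG after it finishes a relist and has populated the cache.
func (c *cacheClock) UpdateTime(t time.Time) {
	c.lock.Lock()
	defer c.lock.Unlock()
	c.timestamp = t
	c.cond.Broadcast()
}

// WaitNewerThan blocks the pod worker until the cache has been refreshed after minTime.
func (c *cacheClock) WaitNewerThan(minTime time.Time) {
	c.lock.Lock()
	defer c.lock.Unlock()
	for !c.timestamp.After(minTime) {
		c.cond.Wait()
	}
}

func main() {
	c := newCacheClock()
	lastSyncEnd := time.Now()
	go func() {
		time.Sleep(100 * time.Millisecond) // simulated relist period
		c.UpdateTime(time.Now())
	}()
	c.WaitNewerThan(lastSyncEnd) // worker blocks here until the relist completes
	fmt.Println("cache is fresh enough; worker can generate api.PodStatus")
}
```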
It should be noted that `api.PodStatus` will only be generated by the pod
worker *after* the cache has been updated. This means that the perceived
responsiveness of Kubelet (from querying the API server) will be affected by
how soon the cache can be populated. For the pure-relisting PLEG, the relist
period can become the bottleneck. On the other hand, a PLEG which watches the
upstream event stream (and knows what events to expect) is not restricted
by such periods and should improve Kubelet's perceived responsiveness.
## TODOs for v1.2
- Redefine container runtime types ([#12619](https://issues.k8s.io/12619)):
and introduce `PodStatus`. Refactor dockertools and rkt to use the new type.
- Add cache and instruct PLEG to populate it.
- Refactor Kubelet to use the cache.
- Deprecate the old runtime cache.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/runtime-pod-cache.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

View File

@ -1,69 +1 @@
# Overview This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/runtimeconfig.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/runtimeconfig.md)
Proposes adding a `--feature-config` to core kube system components:
apiserver, scheduler, controller-manager, kube-proxy, and selected addons.
This flag will be used to enable/disable alpha features on a per-component basis.
## Motivation
The motivation is enabling/disabling features that are not tied to
an API group. API groups can be selectively enabled/disabled in the
apiserver via existing `--runtime-config` flag on apiserver, but there is
currently no mechanism to toggle alpha features that are controlled by
e.g. annotations. This means the burden of controlling whether such
features are enabled in a particular cluster is on feature implementors;
they must either define some ad hoc mechanism for toggling (e.g. flag
on component binary) or else toggle the feature on/off at compile time.
By adding a `--feature-config` to all kube-system components, alpha features
can be toggled on a per-component basis by passing `enableAlphaFeature=true|false`
to `--feature-config` for each component that the feature touches.
## Design
The following components will all get a `--feature-config` flag,
which loads a `config.ConfigurationMap`:
- kube-apiserver
- kube-scheduler
- kube-controller-manager
- kube-proxy
- kube-dns
(Note kubelet is omitted; its dynamic config story is being addressed
by #29459). Alpha features that are not accessed via an alpha API
group should define an `enableFeatureName` flag and use it to toggle
activation of the feature in each system component that the feature
uses.
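For illustration, parsing such a flag value and checking a feature key could look like the sketch below; the key name `enableNewVolumePlugin` is made up, and the parsing is a simplified stand-in for `config.ConfigurationMap` rather than its actual implementation.
```go
package main

import (
	"fmt"
	"strings"
)

// parseFeatureConfig turns a flag value like
// "enableNewVolumePlugin=true,enableOtherThing=false" into a map,
// loosely mimicking how a ConfigurationMap-style flag is populated.
func parseFeatureConfig(value string) map[string]string {
	cfg := map[string]string{}
	for _, pair := range strings.Split(value, ",") {
		if pair == "" {
			continue
		}
		kv := strings.SplitN(pair, "=", 2)
		if len(kv) == 2 {
			cfg[strings.TrimSpace(kv[0])] = strings.TrimSpace(kv[1])
		}
	}
	return cfg
}

// featureEnabled treats anything other than "true" as disabled, matching the
// convention that alpha features are off by default.
func featureEnabled(cfg map[string]string, key string) bool {
	return cfg[key] == "true"
}

func main() {
	cfg := parseFeatureConfig("enableNewVolumePlugin=true,enableOtherThing=false")
	fmt.Println(featureEnabled(cfg, "enableNewVolumePlugin")) // true
	fmt.Println(featureEnabled(cfg, "enableOtherThing"))      // false
}
```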
## Suggested conventions
This proposal only covers adding a mechanism to toggle features in
system components. Implementation details will still depend on the alpha
feature's owner(s). The following are suggested conventions:
- Naming for feature config entries should follow the pattern
"enable<FeatureName>=true".
- Features that touch multiple components should reserve the same key
in each component to toggle on/off.
- Alpha features should be disabled by default. Beta features may
be enabled by default. Refer to docs/devel/api_changes.md#alpha-beta-and-stable-versions
for more detailed guidance on alpha vs. beta.
## Upgrade support
As the primary motivation for cluster config is toggling alpha
features, upgrade support is not in scope. Enabling or disabling
a feature is necessarily a breaking change, so config should
not be altered in a running cluster.
## Future work
1. The eventual plan is for component config to be managed by versioned
APIs and not flags (#12245). When that is added, toggling of features
could be handled by versioned component config and the component flags
deprecated.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/runtimeconfig.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

View File

@ -1,72 +1 @@
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scalability-testing.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scalability-testing.md)
## Background
We have a goal to be able to scale to 1000-node clusters by end of 2015.
As a result, we need to be able to run some kind of regression tests and deliver
a mechanism so that developers can test their changes with respect to performance.
Ideally, we would like to run performance tests also on PRs - although it might
be impossible to run them on every single PR, we may introduce a possibility for
a reviewer to trigger them if the change has non obvious impact on the performance
(something like "k8s-bot run scalability tests please" should be feasible).
However, running performance tests on 1000-node clusters (or even bigger ones in the
future) is a non-starter. Thus, we need some more sophisticated infrastructure
to simulate big clusters on a relatively small number of machines and/or cores.
This document describes two approaches to tackling this problem.
Once we have a better understanding of their consequences, we may want to
decide to drop one of them, but we are not yet in that position.
## Proposal 1 - Kubemark
In this proposal we are focusing on scalability testing of master components.
We do NOT focus on node-scalability - this issue should be handled separately.
Since we do not focus on node performance, we don't need a real Kubelet or
KubeProxy - in fact we don't even need to start real containers.
All we actually need is to have some Kubelet-like and KubeProxy-like components
that will simulate the load on the apiserver that their real equivalents are
generating (e.g. sending NodeStatus updates, watching for pods, watching for
endpoints (KubeProxy), etc.).
What needs to be done:
1. Determine what requests both KubeProxy and Kubelet are sending to apiserver.
2. Create a KubeletSim that generates the same load on the apiserver as the
real Kubelet, but does not start any containers. In the initial version we
can assume that pods never die, so it is enough to just react to the state
changes read from the apiserver.
TBD: Maybe we can reuse a real Kubelet for it by just injecting some "fake"
interfaces to it?
3. Similarly create a KubeProxySim that generates the same load on the apiserver
as a real KubeProxy. Again, since we are not planning to talk to those
containers, it basically doesn't need to do anything apart from that.
TBD: Maybe we can reuse a real KubeProxy for it by just injecting some "fake"
interfaces to it?
4. Refactor kube-up/kube-down scripts (or create new ones) to allow starting
a cluster with KubeletSim and KubeProxySim instead of real ones and put
a bunch of them on a single machine.
5. Create a load generator for it (probably initially it would be enough to
reuse tests that we use in gce-scalability suite).
## Proposal 2 - Oversubscribing
The other method we are proposing is to oversubscribe the resource,
or in essence enable a single node to look like many separate nodes even though
they reside on a single host. This is a well established pattern in many different
cluster managers (for more details see
http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/doc.prd/index.html ).
There are a couple of different ways to accomplish this, but the most viable method
is to run privileged kubelet pods under a host's kubelet process. These pods then
register back with the master via the introspective service using modified names
so as not to collide.
Complications may currently exist around container tracking and ownership in docker.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/scalability-testing.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

View File

@ -1,335 +1 @@
# ScheduledJob Controller This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduledjob.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduledjob.md)
## Abstract
A proposal for implementing a new controller - ScheduledJob controller - which
will be responsible for managing time based jobs, namely:
* once at a specified point in time,
* repeatedly at a specified point in time.
There is already a discussion regarding this subject:
* Distributed CRON jobs [#2156](https://issues.k8s.io/2156)
There are also similar solutions available, already:
* [Mesos Chronos](https://github.com/mesos/chronos)
* [Quartz](http://quartz-scheduler.org/)
## Use Cases
1. Be able to schedule a job execution at a given point in time.
1. Be able to create a periodic job, e.g. database backup, sending emails.
## Motivation
ScheduledJobs are needed for performing all time-related actions, namely backups,
report generation and the like. Each of these tasks should be allowed to run
repeatedly (once a day/month, etc.) or once at a given point in time.
## Design Overview
Users create a ScheduledJob object. One ScheduledJob object
is like one line of a crontab file. It has a schedule of when to run,
in [Cron](https://en.wikipedia.org/wiki/Cron) format.
The ScheduledJob controller creates a [Job](job.md) object
about once per execution time of the schedule (e.g. once per
day for a daily schedule). We say "about" because there are certain
circumstances where two jobs might be created, or no job might be
created. We attempt to make these rare, but do not completely prevent
them. Therefore, Jobs should be idempotent.
The Job object is responsible for any retrying of Pods, and any parallelism
among pods it creates, and determining the success or failure of the set of
pods. The ScheduledJob does not examine pods at all.
### ScheduledJob resource
The new `ScheduledJob` object will have the following contents:
```go
// ScheduledJob represents the configuration of a single scheduled job.
type ScheduledJob struct {
	TypeMeta
	ObjectMeta

	// Spec is a structure defining the expected behavior of a job, including the schedule.
	Spec ScheduledJobSpec
	// Status is a structure describing current status of a job.
	Status ScheduledJobStatus
}

// ScheduledJobList is a collection of scheduled jobs.
type ScheduledJobList struct {
	TypeMeta
	ListMeta

	Items []ScheduledJob
}
```
The `ScheduledJobSpec` structure is defined to contain all the information about how the actual
job execution will look, including the `JobSpec` from the [Job API](job.md)
and the schedule in [Cron](https://en.wikipedia.org/wiki/Cron) format. This implies
that each ScheduledJob execution will be created from the JobSpec actual at a point
in time when the execution will be started. This also implies that any changes
to ScheduledJobSpec will be applied upon subsequent execution of a job.
```go
// ScheduledJobSpec describes how the job execution will look and when it will actually run.
type ScheduledJobSpec struct {
	// Schedule contains the schedule in Cron format, see https://en.wikipedia.org/wiki/Cron.
	Schedule string

	// Optional deadline in seconds for starting the job if it misses scheduled
	// time for any reason. Missed jobs executions will be counted as failed ones.
	StartingDeadlineSeconds *int64

	// ConcurrencyPolicy specifies how to treat concurrent executions of a Job.
	ConcurrencyPolicy ConcurrencyPolicy

	// Suspend flag tells the controller to suspend subsequent executions, it does
	// not apply to already started executions. Defaults to false.
	Suspend bool

	// JobTemplate is the object that describes the job that will be created when
	// executing a ScheduledJob.
	JobTemplate *JobTemplateSpec
}

// JobTemplateSpec describes the Job that will be created when executing
// a ScheduledJob, including its standard metadata.
type JobTemplateSpec struct {
	ObjectMeta

	// Specification of the desired behavior of the job.
	Spec JobSpec
}

// ConcurrencyPolicy describes how the job will be handled.
// Only one of the following concurrent policies may be specified.
// If none of the following policies is specified, the default one
// is AllowConcurrent.
type ConcurrencyPolicy string

const (
	// AllowConcurrent allows ScheduledJobs to run concurrently.
	AllowConcurrent ConcurrencyPolicy = "Allow"

	// ForbidConcurrent forbids concurrent runs, skipping next run if previous
	// hasn't finished yet.
	ForbidConcurrent ConcurrencyPolicy = "Forbid"

	// ReplaceConcurrent cancels currently running job and replaces it with a new one.
	ReplaceConcurrent ConcurrencyPolicy = "Replace"
)
```
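For illustration, computing when the next Job should be created from `Schedule` could look like the sketch below, assuming a cron-parsing library such as github.com/robfig/cron; the proposal does not mandate a particular library, and the deadline/conflict checks mentioned in the comment are left out.
```go
package main

import (
	"fmt"
	"time"

	"github.com/robfig/cron"
)

func main() {
	// Spec.Schedule in standard 5-field cron format: 02:00 every day.
	schedule := "0 2 * * *"

	sched, err := cron.ParseStandard(schedule)
	if err != nil {
		panic(err)
	}

	// The controller would compare the next run time against
	// LastScheduleTime and StartingDeadlineSeconds before creating a Job.
	now := time.Now()
	next := sched.Next(now)
	fmt.Printf("next execution after %s is %s\n", now.Format(time.RFC3339), next.Format(time.RFC3339))
}
```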
`ScheduledJobStatus` structure is defined to contain information about scheduled
job executions. The structure holds a list of currently running job instances
and additional information about overall successful and unsuccessful job executions.
```go
// ScheduledJobStatus represents the current state of a Job.
type ScheduledJobStatus struct {
	// Active holds pointers to currently running jobs.
	Active []ObjectReference

	// Successful tracks the overall number of successful completions of this job.
	Successful int64

	// Failed tracks the overall number of failures of this job.
	Failed int64

	// LastScheduleTime keeps information of when was the last time the job was successfully scheduled.
	LastScheduleTime Time
}
```
Users must use a generated selector for the job.
## Modifications to Job resource
TODO for beta: forbid manual selector since that could cause confusion between
subsequent jobs.
### Running ScheduledJobs using kubectl
A user should be able to easily start a Scheduled Job using `kubectl` (similarly
to running regular jobs). For example to run a job with a specified schedule,
a user should be able to type something simple like:
```
kubectl run pi --image=perl --restart=OnFailure --runAt="0 14 21 7 *" -- perl -Mbignum=bpi -wle 'print bpi(2000)'
```
In the above example:
* `--restart=OnFailure` implies creating a job instead of replicationController.
* `--runAt="0 14 21 7 *"` implies the schedule with which the job should be run, here
July 21, 2pm. This value will be validated according to the same rules which
apply to `.spec.schedule`.
## Fields Added to Job Template
When the controller creates a Job from the JobTemplateSpec in the ScheduledJob, it
adds the following fields to the Job:
- a name, based on the ScheduledJob's name, but with a suffix to distinguish
multiple executions, which may overlap.
- the standard created-by annotation on the Job, pointing to the SJ that created it
The standard key is `kubernetes.io/created-by`. The value is a serialized JSON object, like
`{ "kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ScheduledJob","namespace":"default",`
`"name":"nightly-earnings-report","uid":"5ef034e0-1890-11e6-8935-42010af0003e","apiVersion":...`
This serialization contains the UID of the parent. This is used to match the Job to the SJ that created
it.
## Updates to ScheduledJobs
If the schedule is updated on a ScheduledJob, it will:
- continue to use the Status.Active list of jobs to detect conflicts.
- try to fulfill all recently-passed times for the new schedule, by starting
new jobs. But it will not try to fulfill times prior to the
Status.LastScheduledTime.
- Example: If you have a schedule to run every 30 minutes, and change that to hourly, then the previously started
top-of-the-hour run, in Status.Active, will be seen and no new job started.
- Example: If you have a schedule to run every hour, change that to 30-minutely, at 31 minutes past the hour,
one run will be started immediately for the starting time that has just passed.
If the job template of a ScheduledJob is updated, then future executions use the new template
but old ones still satisfy the schedule and are not re-run just because the template changed.
If you delete and replace a ScheduledJob with one of the same name, it will:
- not use any old Status.Active, and not consider any existing running or terminated jobs from the previous
ScheduledJob (with a different UID) at all when determining conflicts, what needs to be started, etc.
- If there is an existing Job with the same time-based hash in its name (see below), then
new instances of that job will not be able to be created, since existing Jobs with the
same name are treated as conflicts. So, delete it if you want to re-run.
- not "re-run" jobs for "start times" before the creation time of the new ScheduledJob object.
- not consider executions from the previous UID when making decisions about what executions to
start, or status, etc.
- lose the history of the old SJ.
To preserve status, you can suspend the old one, and make one with a new name, or make a note of the old status.
## Fault-Tolerance
### Starting Jobs in the face of controller failures
If the process with the scheduledJob controller in it fails,
and takes a while to restart, the scheduledJob controller
may miss the time window and it is too late to start a job.
With a single scheduledJob controller process, we cannot give
very strong assurances about not missing starting jobs.
With a suggested HA configuration, there are multiple controller
processes, and they use master election to determine which one
is active at any time.
If the Job's StartingDeadlineSeconds is long enough, and the
lease for the master lock is short enough, and other controller
processes are running, then a Job will be started.
TODO: consider hard-coding the minimum StartingDeadlineSeconds
at say 1 minute. Then we can offer a clearer guarantee,
assuming we know what the setting of the lock lease duration is.
### Ensuring jobs are run at most once
There are three problems here:
- ensure at most one Job created per "start time" of a schedule.
- ensure that at most one Pod is created per Job
- ensure at most one container start occurs per Pod
#### Ensuring one Job
Multiple jobs might be created in the following sequence:
1. scheduled job controller sends request to start Job J1 to fulfill start time T.
1. the create request is accepted by the apiserver and enqueued but not yet written to etcd.
1. scheduled job controller crashes
1. new scheduled job controller starts, and lists the existing jobs, and does not see one created.
1. it creates a new one.
1. the first one eventually gets written to etcd.
1. there are now two jobs for the same start time.
We can solve this in several ways:
1. with three-phase protocol, e.g.:
1. controller creates a "suspended" job.
1. controller writes an annotation in the SJ saying that it created a job for this time.
1. controller unsuspends that job.
1. by picking a deterministic name, so that at most one object create can succeed.
#### Ensuring one Pod
Job object does not currently have a way to ask for this.
Even if it did, controller is not written to support it.
Same problem as above.
#### Ensuring one container invocation per Pod
Kubelet is not written to ensure at-most-one-container-start per pod.
#### Decision
This is too hard to do for the alpha version. We will await user
feedback to see if the "at most once" property is needed in the beta version.
This is awkward but possible for a containerized application to ensure on its own, as it needs
to know what ScheduledJob name and Start Time it is from, and then record the attempt
in a shared storage system. We should ensure it could extract this data from its annotations
using the downward API.
## Name of Jobs
A ScheduledJob creates one Job at each time when a Job should run.
Since there may be concurrent jobs, and since we might want to keep failed
non-overlapping Jobs around as a debugging record, each Job created by the same ScheduledJob
needs a distinct name.
To make the Jobs from the same ScheduledJob distinct, we could use a random string,
in the way that pods have a `generateName`. For example, a scheduledJob named `nightly-earnings-report`
in namespace `ns1` might create a job `nightly-earnings-report-3m4d3`, and later create
a job called `nightly-earnings-report-6k7ts`. This is consistent with pods, but
does not give the user much information.
Alternatively, we can use time as a uniquifier. For example, the same scheduledJob could
create a job called `nightly-earnings-report-2016-May-19`.
However, for Jobs that run more than once per day, we would need to represent
time as well as date. Standard date formats (e.g. RFC 3339) use colons for time.
Kubernetes names cannot include colons. Using a non-standard date format without colons
will annoy some users.
Also, date strings are much longer than random suffixes, which means that
the pods will also have long names, and that we are more likely to exceed the
253 character name limit when combining the scheduled-job name,
the time suffix, and pod random suffix.
One option would be to compute a hash of the nominal start time of the job,
and use that as a suffix. This would not provide the user with an indication
of the start time, but it would prevent creation of the same execution
by two instances (replicated or restarting) of the controller process.
We chose to use the hashed-date suffix approach.
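A possible implementation of the hashed-date suffix, purely as an illustration of the idea rather than the exact encoding the controller will use, is sketched below.
```go
package main

import (
	"fmt"
	"hash/fnv"
	"strconv"
	"time"
)

// jobNameForRun derives a deterministic Job name from the ScheduledJob name and
// the nominal start time, so two controller instances computing a name for the
// same execution arrive at the same result.
func jobNameForRun(scheduledJobName string, nominalStart time.Time) string {
	h := fnv.New32a()
	h.Write([]byte(nominalStart.UTC().Format(time.RFC3339)))
	// base-36 keeps the suffix short and within the allowed name characters
	suffix := strconv.FormatUint(uint64(h.Sum32()), 36)
	return scheduledJobName + "-" + suffix
}

func main() {
	start := time.Date(2016, time.May, 19, 2, 0, 0, 0, time.UTC)
	fmt.Println(jobNameForRun("nightly-earnings-report", start))
}
```
Because the suffix is a pure function of the nominal start time, a restarted or replicated controller that tries to create the same execution will compute the same name and the duplicate create will fail, which is the property the section above relies on.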
## Future evolution
Below are the possible future extensions to the Job controller:
* Be able to specify workflow template in `.spec` field. This relates to the work
happening in [#18827](https://issues.k8s.io/18827).
* Be able to specify more general template in `.spec` field, to create arbitrary
types of resources. This relates to the work happening in [#18215](https://issues.k8s.io/18215).
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/scheduledjob.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

View File

@ -1,186 +1 @@
# Secrets, configmaps and downwardAPI file mode bits This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/secret-configmap-downwarapi-file-mode.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/secret-configmap-downwarapi-file-mode.md)
Author: Rodrigo Campos (@rata), Tim Hockin (@thockin)
Date: July 2016
Status: Design in progress
# Goal
Allow users to specify permission mode bits for a secret/configmap/downwardAPI
file mounted as a volume. For example, if a secret has several keys, a user
should be able to specify the permission mode bits for any file, and they may
all have different modes.
To be clear, with "permission" I refer only to the file mode here, and I may use
the two terms interchangeably. This is not about file owners, although let me
know if you prefer to discuss that here too.
# Motivation
There is currently no way to set permissions on secret files mounted as volumes.
This can be a problem for applications that enforce files to have permissions
only for the owner (like fetchmail, ssh, pgpass file in postgres[1], etc.) and
it's just not possible to run them without changing the file mode. Also,
in-house applications may have this restriction too.
It also doesn't seem unreasonable to want a secret, which is sensitive information,
not to be world-readable (or group-readable) as it is by default. Granted, the secret
already lives in a container that is (hopefully) running only one process, so the
default might not be so bad, but people running more than one process in a container
have asked for this too[2].
For example, my use case is that we are migrating to kubernetes; the migration
is in progress (and will take a while), and we have already migrated our deployment web
interface. This interface connects to the servers via ssh, so
it needs the ssh keys, and ssh will only work if the ssh key file mode is the
one it expects.
This was asked on the mailing list here[2] and here[3], too.
[1]: https://www.postgresql.org/docs/9.1/static/libpq-pgpass.html
[2]: https://groups.google.com/forum/#!topic/kubernetes-dev/eTnfMJSqmaM
[3]: https://groups.google.com/forum/#!topic/google-containers/EcaOPq4M758
# Alternatives considered
Several alternatives have been considered:
* Add a mode to the API definition when using secrets: this is backward
compatible as described in docs/devel/api_changes.md (IIUC) and seems like the
way to go. @thockin also said on the mailing list that he would consider such an
approach. It might be worth considering whether we want to do the same for
configmaps or for owners, but there is no need to do that now either.
* Change the default file mode for secrets: I think this is unacceptable per the
api_changes doc, and besides, it doesn't feel correct IMHO, even though it is
technically an option. The argument for it would be that world- and group-readable
is not a nice default for a secret: we already take care of not
writing it to disk, etc., yet the file is created world-readable anyway. Such a
default change was done recently: the default was 0444 in kubernetes <= 1.2
and is now 0644 in kubernetes >= 1.3 (and the file is no longer a regular file,
it's a symlink now). That change was made to minimize differences between
configmaps and secrets: https://github.com/kubernetes/kubernetes/pull/25285. But
doing it again, and changing to something more restrictive (it is now 0644 and
would need to be 0400 to work with ssh and most apps), seems too risky; it would be even more
restrictive than in k8s 1.2, especially if there is no way to revert to the old
permissions and some use case is broken by it. And if we are adding a way to
change the mode, as in the option above, there is no need to rush changing the
default. So I would discard this.
* Don't let people change this, at least for now, and suggest that those who
need it do so in a "postStart" command. This is acceptable
if we don't want to change kubernetes core for some reason, although there
seem to be valid use cases. But if the user wants to use "postStart" for
something else as well, it becomes more awkward to do both things (either ship a script
in the docker image that handles the permissions, which is probably not a concern of the
project and so not nice, or specify several commands by using "sh").
# Proposed implementation
The proposed implementation goes with the first alternative: adding a `mode`
to the API.
There will be a `defaultMode`, type `int`, in: `type SecretVolumeSource`, `type
ConfigMapVolumeSource` and `type DownwardAPIVolumeSource`. And a `mode`, type
`int` too, in `type KeyToPath` and `DownwardAPIVolumeFile`.
The mode provided in any of these fields will be ANDed with 0777 to disallow
setting the sticky and setuid bits, since it's not clear that that use case is needed
or well understood. Directories within the volume will be created as before
and are not affected by this setting.
In other words, the fields will look like this:
```go
type SecretVolumeSource struct {
// Name of the secret in the pod's namespace to use.
SecretName string `json:"secretName,omitempty"`
// If unspecified, each key-value pair in the Data field of the referenced
// Secret will be projected into the volume as a file whose name is the
// key and content is the value. If specified, the listed keys will be
// projected into the specified paths, and unlisted keys will not be
// present. If a key is specified which is not present in the Secret,
// the volume setup will error. Paths must be relative and may not contain
// the '..' path or start with '..'.
Items []KeyToPath `json:"items,omitempty"`
// Mode bits to use on created files by default. The used mode bits will
// be the provided AND 0777.
// Directories within the path are not affected by this setting
DefaultMode int32 `json:"defaultMode,omitempty"`
}
type ConfigMapVolumeSource struct {
LocalObjectReference `json:",inline"`
// If unspecified, each key-value pair in the Data field of the referenced
// ConfigMap will be projected into the volume as a file whose name is the
// key and content is the value. If specified, the listed keys will be
// projected into the specified paths, and unlisted keys will not be
// present. If a key is specified which is not present in the ConfigMap,
// the volume setup will error. Paths must be relative and may not contain
// the '..' path or start with '..'.
Items []KeyToPath `json:"items,omitempty"`
// Mode bits to use on created files by default. The used mode bits will
// be the provided AND 0777.
// Directories within the path are not affected by this setting
DefaultMode int32 `json:"defaultMode,omitempty"`
}
type KeyToPath struct {
// The key to project.
Key string `json:"key"`
// The relative path of the file to map the key to.
// May not be an absolute path.
// May not contain the path element '..'.
// May not start with the string '..'.
Path string `json:"path"`
// Mode bits to use on this file. The used mode bits will be the
// provided AND 0777.
Mode int32 `json:"mode,omitempty"`
}
type DownwardAPIVolumeSource struct {
// Items is a list of DownwardAPIVolume file
Items []DownwardAPIVolumeFile `json:"items,omitempty"`
// Mode bits to use on created files by default. The used mode bits will
// be the provided AND 0777.
// Directories within the path are not affected by this setting
DefaultMode int32 `json:"defaultMode,omitempty"`
}
type DownwardAPIVolumeFile struct {
// Required: Path is the relative path name of the file to be created. Must not be absolute or contain the '..' path. Must be utf-8 encoded. The first item of the relative path must not start with '..'
Path string `json:"path"`
// Required: Selects a field of the pod: only annotations, labels, name and namespace are supported.
FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"`
// Selects a resource of the container: only resources limits and requests
// (limits.cpu, limits.memory, requests.cpu and requests.memory) are currently supported.
ResourceFieldRef *ResourceFieldSelector `json:"resourceFieldRef,omitempty"`
// Mode bits to use on this file. The used mode bits will be the
// provided AND 0777.
Mode int32 `json:"mode,omitempty"`
}
```
Adding it there allows the user to change the mode bits of every file in the
object, so it achieves the goal, while having the option to have a default and
not specify all files in the object.
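For illustration, here is a minimal sketch (not the actual volume plugin code) of how the
effective mode for a single projected file could be resolved under this proposal: a per-file
`mode` overrides `defaultMode`, and the result is ANDed with 0777.

```go
// Sketch only: resolve the effective mode of one projected file.
package main

import (
	"fmt"
	"os"
)

func effectiveMode(defaultMode int32, fileMode *int32) os.FileMode {
	m := defaultMode
	if fileMode != nil {
		m = *fileMode // a per-file mode wins over the volume default
	}
	return os.FileMode(m & 0777) // setuid/setgid/sticky bits can never be set
}

func main() {
	sshKeyMode := int32(0400)
	fmt.Printf("%o\n", effectiveMode(0644, &sshKeyMode)) // 400
	fmt.Printf("%o\n", effectiveMode(0644, nil))         // 644
}
```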
There are two downsides:
* The files are symlinks pointing to the real file, and only the real file's
permissions are set; the symlink keeps the usual symlink permissions.
This is already the case in 1.3, and applications like ssh seem to
work just fine with it. It is worth mentioning, but doesn't seem to be
an issue.
* If the secret/configMap/downwardAPI volume is mounted in more than one container,
the file permissions will be the same in all of them. This is already the case for
key mappings and doesn't seem like a big issue either.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/secret-configmap-downwarapi-file-mode.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

View File

@ -1,348 +1 @@
## Abstract This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/security-context-constraints.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/security-context-constraints.md)
PodSecurityPolicy allows cluster administrators to control the creation and validation of a security
context for a pod and containers.
## Motivation
Administration of a multi-tenant cluster requires the ability to provide varying sets of permissions
among the tenants, the infrastructure components, and end users of the system who may themselves be
administrators within their own isolated namespace.
Actors in a cluster may include infrastructure that is managed by administrators, infrastructure
that is exposed to end users (builds, deployments), the isolated end user namespaces in the cluster, and
the individual users inside those namespaces. Infrastructure components that operate on behalf of a
user (builds, deployments) should be allowed to run at an elevated level of permissions without
granting the user themselves an elevated set of permissions.
## Goals
1. Associate [service accounts](../design/service_accounts.md), groups, and users with
a set of constraints that dictate how a security context is established for a pod and the pod's containers.
1. Provide the ability for users and infrastructure components to run pods with elevated privileges
on behalf of another user or within a namespace where privileges are more restrictive.
1. Secure the ability to reference elevated permissions or to change the constraints under which
a user runs.
## Use Cases
Use case 1:
As an administrator, I can create a namespace for a person that can't create privileged containers
AND enforce that the UID of the containers is set to a certain value
Use case 2:
As a cluster operator, an infrastructure component should be able to create a pod with elevated
privileges in a namespace where regular users cannot create pods with these privileges or execute
commands in that pod.
Use case 3:
As a cluster administrator, I can allow a given namespace (or service account) to create privileged
pods or to run root pods
Use case 4:
As a cluster administrator, I can allow a project administrator to control the security contexts of
pods and service accounts within a project
## Requirements
1. Provide a set of restrictions that controls how a security context is created for pods and containers
as a new cluster-scoped object called `PodSecurityPolicy`.
1. User information in `user.Info` must be available to admission controllers. (Completed in
https://github.com/GoogleCloudPlatform/kubernetes/pull/8203)
1. Some authorizers may restrict a user's ability to reference a service account. Systems requiring
the ability to secure service accounts on a user level must be able to add a policy that enables
referencing specific service accounts themselves.
1. Admission control must validate the creation of Pods against the allowed set of constraints.
## Design
### Model
PodSecurityPolicy objects exist in the root scope, outside of a namespace. The
PodSecurityPolicy will reference users and groups that are allowed
to operate under the constraints. In order to support this, `ServiceAccounts` must be mapped
to a user name or group list by the authentication/authorization layers. This allows the security
context to treat users, groups, and service accounts uniformly.
Below is a list of PodSecurityPolicies which will likely serve most use cases:
1. A default policy object. This object is permissioned to something which covers all actors, such
as a `system:authenticated` group, and will likely be the most restrictive set of constraints.
1. A default constraints object for service accounts. This object can be identified as serving
a group identified by `system:service-accounts`, which can be imposed by the service account authenticator / token generator.
1. Cluster admin constraints identified by `system:cluster-admins` group - a set of constraints with elevated privileges that can be used
by an administrative user or group.
1. Infrastructure components constraints which can be identified either by a specific service
account or by a group containing all service accounts.
```go
// PodSecurityPolicy governs the ability to make requests that affect the SecurityContext
// that will be applied to a pod and container.
type PodSecurityPolicy struct {
unversioned.TypeMeta `json:",inline"`
api.ObjectMeta `json:"metadata,omitempty"`
// Spec defines the policy enforced.
Spec PodSecurityPolicySpec `json:"spec,omitempty"`
}
// PodSecurityPolicySpec defines the policy enforced.
type PodSecurityPolicySpec struct {
// Privileged determines if a pod can request to be run as privileged.
Privileged bool `json:"privileged,omitempty"`
// Capabilities is a list of capabilities that can be added.
Capabilities []api.Capability `json:"capabilities,omitempty"`
// Volumes allows and disallows the use of different types of volume plugins.
Volumes VolumeSecurityPolicy `json:"volumes,omitempty"`
// HostNetwork determines if the policy allows the use of HostNetwork in the pod spec.
HostNetwork bool `json:"hostNetwork,omitempty"`
// HostPorts determines which host port ranges are allowed to be exposed.
HostPorts []HostPortRange `json:"hostPorts,omitempty"`
// HostPID determines if the policy allows the use of HostPID in the pod spec.
HostPID bool `json:"hostPID,omitempty"`
// HostIPC determines if the policy allows the use of HostIPC in the pod spec.
HostIPC bool `json:"hostIPC,omitempty"`
// SELinuxContext is the strategy that will dictate the allowable labels that may be set.
SELinuxContext SELinuxContextStrategyOptions `json:"seLinuxContext,omitempty"`
// RunAsUser is the strategy that will dictate the allowable RunAsUser values that may be set.
RunAsUser RunAsUserStrategyOptions `json:"runAsUser,omitempty"`
// The users who have permissions to use this policy
Users []string `json:"users,omitempty"`
// The groups that have permission to use this policy
Groups []string `json:"groups,omitempty"`
}
// HostPortRange defines a range of host ports that will be enabled by a policy
// for pods to use. It requires both the start and end to be defined.
type HostPortRange struct {
// Start is the beginning of the port range which will be allowed.
Start int `json:"start"`
// End is the end of the port range which will be allowed.
End int `json:"end"`
}
// VolumeSecurityPolicy allows and disallows the use of different types of volume plugins.
type VolumeSecurityPolicy struct {
// HostPath allows or disallows the use of the HostPath volume plugin.
// More info: http://kubernetes.io/docs/user-guide/volumes#hostpath
HostPath bool `json:"hostPath,omitempty"`
// EmptyDir allows or disallows the use of the EmptyDir volume plugin.
// More info: http://kubernetes.io/docs/user-guide/volumes#emptydir
EmptyDir bool `json:"emptyDir,omitempty"`
// GCEPersistentDisk allows or disallows the use of the GCEPersistentDisk volume plugin.
// More info: http://kubernetes.io/docs/user-guide/volumes#gcepersistentdisk
GCEPersistentDisk bool `json:"gcePersistentDisk,omitempty"`
// AWSElasticBlockStore allows or disallows the use of the AWSElasticBlockStore volume plugin.
// More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore
AWSElasticBlockStore bool `json:"awsElasticBlockStore,omitempty"`
// GitRepo allows or disallows the use of the GitRepo volume plugin.
GitRepo bool `json:"gitRepo,omitempty"`
// Secret allows or disallows the use of the Secret volume plugin.
// More info: http://kubernetes.io/docs/user-guide/volumes#secrets
Secret bool `json:"secret,omitempty"`
// NFS allows or disallows the use of the NFS volume plugin.
// More info: http://kubernetes.io/docs/user-guide/volumes#nfs
NFS bool `json:"nfs,omitempty"`
// ISCSI allows or disallows the use of the ISCSI volume plugin.
// More info: http://releases.k8s.io/HEAD/examples/volumes/iscsi/README.md
ISCSI bool `json:"iscsi,omitempty"`
// Glusterfs allows or disallows the use of the Glusterfs volume plugin.
// More info: http://releases.k8s.io/HEAD/examples/volumes/glusterfs/README.md
Glusterfs bool `json:"glusterfs,omitempty"`
// PersistentVolumeClaim allows or disallows the use of the PersistentVolumeClaim volume plugin.
// More info: http://kubernetes.io/docs/user-guide/persistent-volumes#persistentvolumeclaims
PersistentVolumeClaim bool `json:"persistentVolumeClaim,omitempty"`
// RBD allows or disallows the use of the RBD volume plugin.
// More info: http://releases.k8s.io/HEAD/examples/volumes/rbd/README.md
RBD bool `json:"rbd,omitempty"`
// Cinder allows or disallows the use of the Cinder volume plugin.
// More info: http://releases.k8s.io/HEAD/examples/mysql-cinder-pd/README.md
Cinder bool `json:"cinder,omitempty"`
// CephFS allows or disallows the use of the CephFS volume plugin.
CephFS bool `json:"cephfs,omitempty"`
// DownwardAPI allows or disallows the use of the DownwardAPI volume plugin.
DownwardAPI bool `json:"downwardAPI,omitempty"`
// FC allows or disallows the use of the FC volume plugin.
FC bool `json:"fc,omitempty"`
}
// SELinuxContextStrategyOptions defines the strategy type and any options used to create the strategy.
type SELinuxContextStrategyOptions struct {
// Type is the strategy that will dictate the allowable labels that may be set.
Type SELinuxContextStrategy `json:"type"`
// seLinuxOptions required to run as; required for MustRunAs
// More info: http://releases.k8s.io/HEAD/docs/design/security_context.md#security-context
SELinuxOptions *api.SELinuxOptions `json:"seLinuxOptions,omitempty"`
}
// SELinuxContextStrategyType denotes strategy types for generating SELinux options for a
// SecurityContext.
type SELinuxContextStrategy string
const (
// container must have SELinux labels of X applied.
SELinuxStrategyMustRunAs SELinuxContextStrategy = "MustRunAs"
// container may make requests for any SELinux context labels.
SELinuxStrategyRunAsAny SELinuxContextStrategy = "RunAsAny"
)
// RunAsUserStrategyOptions defines the strategy type and any options used to create the strategy.
type RunAsUserStrategyOptions struct {
// Type is the strategy that will dictate the allowable RunAsUser values that may be set.
Type RunAsUserStrategy `json:"type"`
// UID is the user id that containers must run as. Required for the MustRunAs strategy if not using
// a strategy that supports pre-allocated uids.
UID *int64 `json:"uid,omitempty"`
// UIDRangeMin defines the min value for a strategy that allocates by a range based strategy.
UIDRangeMin *int64 `json:"uidRangeMin,omitempty"`
// UIDRangeMax defines the max value for a strategy that allocates by a range based strategy.
UIDRangeMax *int64 `json:"uidRangeMax,omitempty"`
}
// RunAsUserStrategyType denotes strategy types for generating RunAsUser values for a
// SecurityContext.
type RunAsUserStrategy string
const (
// container must run as a particular uid.
RunAsUserStrategyMustRunAs RunAsUserStrategy = "MustRunAs"
// container must run as a particular uid.
RunAsUserStrategyMustRunAsRange RunAsUserStrategy = "MustRunAsRange"
// container must run as a non-root uid
RunAsUserStrategyMustRunAsNonRoot RunAsUserStrategy = "MustRunAsNonRoot"
// container may make requests for any uid.
RunAsUserStrategyRunAsAny RunAsUserStrategy = "RunAsAny"
)
```
### PodSecurityPolicy Lifecycle
As reusable objects in the root scope, PodSecurityPolicy follows the lifecycle of the
cluster itself. Maintenance of constraints such as adding, assigning, or changing them is the
responsibility of the cluster administrator.
Creating a new user within a namespace should not require the cluster administrator to
define the user's PodSecurityPolicy. They should receive the default set of policies
that the administrator has defined for the groups they are assigned.
## Default PodSecurityPolicy And Overrides
In order to establish policy for service accounts and users, there must be a way
to identify the default set of constraints that is to be used. This is best accomplished by using
groups. As mentioned above, groups may be used by the authentication/authorization layer to ensure
that every user maps to at least one group (with a default example of `system:authenticated`) and it
is up to the cluster administrator to ensure that a `PodSecurityPolicy` object exists that
references the group.
If an administrator would like to provide a user with a changed set of security context permissions,
they may do the following:
1. Create a new `PodSecurityPolicy` object and add a reference to the user or a group
that the user belongs to.
1. Add the user (or group) to an existing `PodSecurityPolicy` object with the proper
elevated privileges.
## Admission
Admission control using an authorizer provides the ability to control the creation of resources
based on capabilities granted to a user. In terms of the `PodSecurityPolicy`, it means
that an admission controller may inspect the user info made available in the context to retrieve
an appropriate set of policies for validation.
The appropriate set of PodSecurityPolicies is defined as all of the policies
available that have reference to the user or groups that the user belongs to.
Admission will use the PodSecurityPolicy to ensure that any requests for a
specific security context setting are valid and to generate settings using the following approach:
1. Determine all the available `PodSecurityPolicy` objects that are allowed to be used
1. Sort the `PodSecurityPolicy` objects in a most restrictive to least restrictive order.
1. For each `PodSecurityPolicy`, generate a `SecurityContext` for each container. The generation phase will not override
any user requested settings in the `SecurityContext`, and will rely on the validation phase to ensure that
the user requests are valid.
1. Validate the generated `SecurityContext` to ensure it falls within the boundaries of the `PodSecurityPolicy`
1. If all containers validate under a single `PodSecurityPolicy` then the pod will be admitted
1. If all containers DO NOT validate under the `PodSecurityPolicy` then try the next `PodSecurityPolicy`
1. If no `PodSecurityPolicy` validates for the pod then the pod will not be admitted
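The ordering logic above might look roughly like the following self-contained sketch; the `policy`
type and its validation hook are simplified stand-ins, not the real `PodSecurityPolicy` admission
plugin:

```go
// Sketch of most-restrictive-first policy selection during admission.
package main

import (
	"errors"
	"fmt"
	"sort"
)

type policy struct {
	name         string
	restrictions int // illustrative ranking: higher = more restrictive
	validates    func(container string) bool
}

// admit returns the name of the first (most restrictive) policy under which
// every container validates, or an error if no policy admits the pod.
func admit(containers []string, policies []policy) (string, error) {
	sort.Slice(policies, func(i, j int) bool {
		return policies[i].restrictions > policies[j].restrictions
	})
	for _, p := range policies {
		ok := true
		for _, c := range containers {
			if !p.validates(c) {
				ok = false
				break
			}
		}
		if ok {
			return p.name, nil
		}
	}
	return "", errors.New("no PodSecurityPolicy validates for the pod")
}

func main() {
	policies := []policy{
		{name: "restricted", restrictions: 10, validates: func(c string) bool { return c != "privileged-sidecar" }},
		{name: "privileged", restrictions: 1, validates: func(string) bool { return true }},
	}
	fmt.Println(admit([]string{"app", "privileged-sidecar"}, policies))
}
```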
## Creation of a SecurityContext Based on PodSecurityPolicy
The creation of a `SecurityContext` based on a `PodSecurityPolicy` is based upon the configured
settings of the `PodSecurityPolicy`.
There are three scenarios under which a `PodSecurityPolicy` field may fall:
1. Governed by a boolean: fields of this type will be defaulted to the most restrictive value.
For instance, `AllowPrivileged` will always be set to false if unspecified.
1. Governed by an allowable set: fields of this type will be checked against the set to ensure
their value is allowed. For example, `AllowCapabilities` will ensure that only capabilities
that are allowed to be requested are considered valid. `HostNetworkSources` will ensure that
only pods created from source X are allowed to request access to the host network.
1. Governed by a strategy: Items that have a strategy to generate a value will provide a
mechanism to generate the value as well as a mechanism to ensure that a specified value falls into
the set of allowable values. See the Types section for the description of the interfaces that
strategies must implement.
Strategies have the ability to become dynamic. In order to support a dynamic strategy it should be
possible to make a strategy that has the ability to either be pre-populated with dynamic data by
another component (such as an admission controller) or has the ability to retrieve the information
itself based on the data in the pod. An example of this would be a pre-allocated UID for the namespace.
A dynamic `RunAsUser` strategy could inspect the namespace of the pod in order to find the required pre-allocated
UID and generate or validate requests based on that information.
```go
// SELinuxStrategy defines the interface for all SELinux constraint strategies.
type SELinuxStrategy interface {
// Generate creates the SELinuxOptions based on constraint rules.
Generate(pod *api.Pod, container *api.Container) (*api.SELinuxOptions, error)
// Validate ensures that the specified values fall within the range of the strategy.
Validate(pod *api.Pod, container *api.Container) fielderrors.ValidationErrorList
}
// RunAsUserStrategy defines the interface for all uid constraint strategies.
type RunAsUserStrategy interface {
// Generate creates the uid based on policy rules.
Generate(pod *api.Pod, container *api.Container) (*int64, error)
// Validate ensures that the specified values fall within the range of the strategy.
Validate(pod *api.Pod, container *api.Container) fielderrors.ValidationErrorList
}
```
## Escalating Privileges by an Administrator
An administrator may wish to create a resource in a namespace that runs with
escalated privileges. By allowing security context
constraints to operate on both the requesting user and the pod's service account, administrators are able to
create pods in namespaces with elevated privileges based on the administrator's security context
constraints.
This also allows the system to guard commands being executed in the non-conforming container. For
instance, an `exec` command can first check the security context of the pod against the security
context constraints of the user or the user's ability to reference a service account.
If it does not validate then it can block users from executing the command. Since the validation
will be user aware, administrators would still be able to run the commands that are restricted to normal users.
## Interaction with the Kubelet
In certain cases, the Kubelet may need to provide information about
the image in order to validate the security context. An example of this is a cluster
that is configured to run with a UID strategy of `MustRunAsNonRoot`.
In this case the admission controller can set the existing `MustRunAsNonRoot` flag on the `SecurityContext`
based on the UID strategy of the `SecurityPolicy`. It should still validate any requests on the pod
for a specific UID and fail early if possible. However, if the `RunAsUser` is not set on the pod
it should still admit the pod and allow the Kubelet to ensure that the image does not run as
`root` with the existing non-root checks.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/security-context-constraints.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

View File

@ -1,135 +1 @@
# Proposal: Self-hosted kubelet This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/self-hosted-kubelet.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/self-hosted-kubelet.md)
## Abstract
In a self-hosted Kubernetes deployment (see [this
comment](https://github.com/kubernetes/kubernetes/issues/246#issuecomment-64533959)
for background on self hosted kubernetes), we have the initial bootstrap problem.
When running self-hosted components, there needs to be a mechanism for pivoting
from the initial bootstrap state to the kubernetes-managed (self-hosted) state.
In the case of a self-hosted kubelet, this means pivoting from the initial
kubelet defined and run on the host, to the kubelet pod which has been scheduled
to the node.
This proposal presents a solution to the kubelet bootstrap, and assumes a
functioning control plane (e.g. an apiserver, controller-manager, scheduler, and
etcd cluster), and a kubelet that can securely contact the API server. This
functioning control plane can be temporary, and not necessarily the "production"
control plane that will be used after the initial pivot / bootstrap.
## Background and Motivation
In order to understand the goals of this proposal, one must understand what
"self-hosted" means. This proposal defines "self-hosted" as a kubernetes cluster
that is installed and managed by the kubernetes installation itself. This means
that each kubernetes component is described by a kubernetes manifest (Daemonset,
Deployment, etc) and can be updated via kubernetes.
The overall goal of this proposal is to make kubernetes easier to install and
upgrade. We can then treat kubernetes itself just like any other application
hosted in a kubernetes cluster, and have access to easy upgrades, monitoring,
and durability for core kubernetes components themselves.
We intend to achieve this by using kubernetes to manage itself. However, in
order to do that we must first "bootstrap" the cluster, by using kubernetes to
install kubernetes components. This is where this proposal fits in, by
describing the necessary modifications, and required procedures, needed to run a
self-hosted kubelet.
The approach being proposed for a self-hosted kubelet is a "pivot" style
installation. This procedure assumes a short-lived “bootstrap” kubelet will run
and start a long-running “self-hosted” kubelet. Once the self-hosted kubelet is
running the bootstrap kubelet will exit. As part of this, we propose introducing
a new `--bootstrap` flag to the kubelet. The behaviour of that flag will be
explained in detail below.
## Proposal
We propose adding a new flag to the kubelet, the `--bootstrap` flag, which is
assumed to be used in conjunction with the `--lock-file` flag. The `--lock-file`
flag is used to ensure only a single kubelet is running at any given time during
this pivot process. When the `--bootstrap` flag is provided, after the kubelet
acquires the file lock, it will begin asynchronously waiting on
[inotify](http://man7.org/linux/man-pages/man7/inotify.7.html) events. Once an
"open" event is received, the kubelet will assume another kubelet is attempting
to take control and will exit by calling `exit(0)`.
Thus, the initial bootstrap becomes:
1. "bootstrap" kubelet is started by $init system.
1. "bootstrap" kubelet pulls down "self-hosted" kubelet as a pod from a
daemonset
1. "self-hosted" kubelet attempts to acquire the file lock, causing "bootstrap"
kubelet to exit
1. "self-hosted" kubelet acquires lock and takes over
1. "bootstrap" kubelet is restarted by $init system and blocks on acquiring the
file lock
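A minimal sketch of the lock-file/inotify handshake described above, assuming Linux and the
`golang.org/x/sys/unix` package (illustrative only, not the kubelet's actual `--bootstrap`
implementation):

```go
// Sketch: hold the --lock-file exclusively, and exit when another kubelet
// opens it (i.e. tries to take over).
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	lockFile := "/var/run/kubelet.lock" // would come from --lock-file

	f, err := os.OpenFile(lockFile, os.O_CREATE|os.O_RDWR, 0600)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Block until no other kubelet holds the lock.
	if err := unix.Flock(int(f.Fd()), unix.LOCK_EX); err != nil {
		log.Fatal(err)
	}

	// Asynchronously watch for "open" events on the lock file: another
	// kubelet attempting to acquire it means this one should step aside.
	ifd, err := unix.InotifyInit()
	if err != nil {
		log.Fatal(err)
	}
	if _, err := unix.InotifyAddWatch(ifd, lockFile, unix.IN_OPEN); err != nil {
		log.Fatal(err)
	}
	go func() {
		buf := make([]byte, unix.SizeofInotifyEvent+unix.NAME_MAX+1)
		if _, err := unix.Read(ifd, buf); err == nil {
			log.Print("another kubelet is taking over; exiting")
			os.Exit(0)
		}
	}()

	runKubelet() // placeholder for the normal kubelet main loop
}

func runKubelet() { select {} }
```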
During an upgrade of the kubelet, for simplicity we will consider 3 kubelets,
namely "bootstrap", "v1", and "v2". We imagine the following scenario for
upgrades:
1. Cluster administrator introduces "v2" kubelet daemonset
1. "v1" kubelet pulls down and starts "v2"
1. Cluster administrator removes "v1" kubelet daemonset
1. "v1" kubelet is killed
1. Both "bootstrap" and "v2" kubelets race for file lock
1. If "v2" kubelet acquires lock, process has completed
1. If "bootstrap" kubelet acquires lock, it is assumed that "v2" kubelet will
fail a health check and be killed. Once restarted, it will try to acquire the
lock, triggering the "bootstrap" kubelet to exit.
Alternatively, it would also be possible via this mechanism to delete the "v1"
daemonset first, allow the "bootstrap" kubelet to take over, and then introduce
the "v2" kubelet daemonset, effectively eliminating the race between "bootstrap"
and "v2" for lock acquisition, and the reliance on the failing health check
procedure.
Eventually this could be handled by a DaemonSet upgrade policy.
This will allow a "self-hosted" kubelet with minimal new concepts introduced
into the core Kubernetes code base, and remains flexible enough to work well
with future [bootstrapping
services](https://github.com/kubernetes/kubernetes/issues/5754).
## Production readiness considerations / Out of scope issues
* Deterministically pulling and running kubelet pod: we would prefer not to have
to loop until we finally get a kubelet pod.
* It is possible that the bootstrap kubelet version is incompatible with the
newer versions that were run in the node. For example, the cgroup
configurations might be incompatible. In the beginning, we will require
cluster admins to keep the configuration in sync. Since we want the bootstrap
kubelet to come up and run even if the API server is not available, we should
persist the configuration for bootstrap kubelet on the node. Once we have
checkpointing in kubelet, we will checkpoint the updated config and have the
bootstrap kubelet use the updated config, if it were to take over.
* Currently best practice when upgrading the kubelet on a node is to drain all
pods first. Automatically draining the node during kubelet upgrade is out
of scope for this proposal. It is assumed that either the cluster
administrator or the daemonset upgrade policy will handle this.
## Other discussion
Various similar approaches have been discussed
[here](https://github.com/kubernetes/kubernetes/issues/246#issuecomment-64533959)
and
[here](https://github.com/kubernetes/kubernetes/issues/23073#issuecomment-198478997).
Other discussion around the kubelet being able to be run inside a container is
[here](https://github.com/kubernetes/kubernetes/issues/4869). Note this isn't a
strict requirement as the kubelet could be run in a chroot jail via rkt fly or
other such similar approach.
Additionally, [Taints and
Tolerations](../../docs/design/taint-toleration-dedicated.md), whose design has
already been accepted, would make the overall kubelet bootstrap more
deterministic. With this, we would also need the ability for a kubelet to
register itself with a given taint when it first contacts the API server. Given
that, a kubelet could register itself with a given taint such as
“component=kubelet”, and a kubelet pod could exist that has a toleration to that
taint, ensuring it is the only pod the “bootstrap” kubelet runs.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/self-hosted-kubelet.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

View File

@ -1,209 +1 @@
## Abstract This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/selinux-enhancements.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/selinux-enhancements.md)
Presents a proposal for enhancing the security of Kubernetes clusters using
SELinux and simplifying the implementation of SELinux support within the
Kubelet by removing the need to label the Kubelet directory with an SELinux
context usable from a container.
## Motivation
The current Kubernetes codebase relies upon the Kubelet directory being
labeled with an SELinux context usable from a container. This means that a
container escaping namespace isolation will be able to use any file within the
Kubelet directory without defeating kernel
[MAC (mandatory access control)](https://en.wikipedia.org/wiki/Mandatory_access_control).
In order to limit the attack surface, we should enhance the Kubelet to relabel
any bind-mounts into containers into a usable SELinux context without depending
on the Kubelet directory's SELinux context.
## Constraints and Assumptions
1. No API changes allowed
2. Behavior must be fully backward compatible
3. No new admission controllers - make incremental improvements without huge
refactorings
## Use Cases
1. As a cluster operator, I want to avoid having to label the Kubelet
directory with a label usable from a container, so that I can limit the
attack surface available to a container escaping its namespace isolation
2. As a user, I want to run a pod without an SELinux context explicitly
specified and be isolated using MCS (multi-category security) on systems
where SELinux is enabled, so that the pods on each host are isolated from
one another
3. As a user, I want to run a pod that uses the host IPC or PID namespace and
want the system to do the right thing with regard to SELinux, so that no
unnecessary relabel actions are performed
### Labeling the Kubelet directory
As previously stated, the current codebase relies on the Kubelet directory
being labeled with an SELinux context usable from a container. The Kubelet
uses the SELinux context of this directory to determine what SELinux context
`tmpfs` mounts (provided by the EmptyDir memory-medium option) should receive.
The problem with this is that it opens an attack surface to a container that
escapes its namespace isolation; such a container would be able to use any
file in the Kubelet directory without defeating kernel MAC.
### SELinux when no context is specified
When no SELinux context is specified, Kubernetes should just do the right
thing, where doing the right thing is defined as isolating pods with a node-unique
set of categories. Node-uniqueness means unique among the pods
scheduled onto the node. Long-term, we want to have a cluster-wide allocator
for MCS labels. Node-unique MCS labels are a good middle ground that is
possible without a new, large, feature.
### SELinux and host IPC and PID namespaces
Containers in pods that use the host IPC or PID namespaces need access to
other processes and IPC mechanisms on the host. Therefore, these containers
should be run with the `spc_t` SELinux type by the container runtime. The
`spc_t` type is an unconfined type that other SELinux domains are allowed to
connect to. In the case where a pod uses one of these host namespaces, it
should be unnecessary to relabel the pod's volumes.
## Analysis
### Libcontainer SELinux library
Docker and rkt both use the libcontainer SELinux library. This library
provides a method, `GetLxcContexts`, that returns a unique SELinux
context for container processes and the files used by them. `GetLxcContexts`
reads the base SELinux context information from a file at
`/etc/selinux/<policy-name>/contexts/lxc_contexts` and then adds a process-unique MCS label.
Docker and rkt both leverage this call to determine the 'starting' SELinux
contexts for containers.
### Docker
Docker's behavior when no SELinux context is defined for a container is to
give the container a node-unique MCS label.
#### Sharing IPC namespaces
On the Docker runtime, the containers in a Kubernetes pod share the IPC and
PID namespaces of the pod's infra container.
Docker's behavior for containers sharing these namespaces is as follows: if a
container B shares the IPC namespace of another container A, container B is
given the SELinux context of container A. Therefore, for Kubernetes pods
running on docker, in a vacuum the containers in a pod should have the same
SELinux context.
[**Known issue**](https://bugzilla.redhat.com/show_bug.cgi?id=1377869): When
the seccomp profile is set on a docker container that shares the IPC namespace
of another container, that container will not receive the other container's
SELinux context.
#### Host IPC and PID namespaces
In the case of a pod that shares the host IPC or PID namespace, this flag is
simply ignored and the container receives the `spc_t` SELinux type. The
`spc_t` type is unconfined, and so no relabeling needs to be done for volumes
for these pods. Currently, however, there is code which relabels volumes into
explicitly specified SELinux contexts for these pods. This code is unnecessary
and should be removed.
#### Relabeling bind-mounts
Docker is capable of relabeling bind-mounts into containers using the `:Z`
bind-mount flag. However, in the current implementation of the docker runtime
in Kubernetes, the `:Z` option is only applied when the pod's SecurityContext
contains an SELinux context. We could easily implement the correct behaviors
by always setting `:Z` on systems where SELinux is enabled.
### rkt
rkt's behavior when no SELinux context is defined for a pod is similar to
Docker's -- an SELinux context with a node-unique MCS label is given to the
containers of a pod.
#### Sharing IPC namespaces
Containers (apps, in rkt terminology) in rkt pods share an IPC and PID
namespace by default.
#### Relabeling bind-mounts
Bind-mounts into rkt pods are automatically relabeled into the pod's SELinux
context.
#### Host IPC and PID namespaces
Using the host IPC and PID namespaces is not currently supported by rkt.
## Proposed Changes
### Refactor `pkg/util/selinux`
1. The `selinux` package should provide a method `SELinuxEnabled` that returns
whether SELinux is enabled, and is built for all platforms (the
libcontainer SELinux is only built on linux)
2. The `SelinuxContextRunner` interface should be renamed to `SELinuxRunner`
and be changed to have the same method names and signatures as the
libcontainer methods its implementations wrap
3. The `SELinuxRunner` interface only needs `Getfilecon`, which is used by
the rkt code
```go
package selinux
// Note: the libcontainer SELinux package is only built for Linux, so it is
// necessary to have a NOP wrapper which is built for non-Linux platforms to
// allow code that links to this package not to differentiate its own methods
// for Linux and non-Linux platforms.
//
// SELinuxRunner wraps certain libcontainer SELinux calls. For more
// information, see:
//
// https://github.com/opencontainers/runc/blob/master/libcontainer/selinux/selinux.go
type SELinuxRunner interface {
// Getfilecon returns the SELinux context for the given path or returns an
// error.
Getfilecon(path string) (string, error)
}
```
### Kubelet Changes
1. The `relabelVolumes` method in `kubelet_volumes.go` is not needed and can
be removed
2. The `GenerateRunContainerOptions` method in `kubelet_pods.go` should no
longer call `relabelVolumes`
3. The `makeHostsMount` method in `kubelet_pods.go` should set the
`SELinuxRelabel` attribute of the mount for the pod's hosts file to `true`
### Changes to `pkg/kubelet/dockertools/`
1. The `makeMountBindings` should be changed to:
1. No longer accept the `podHasSELinuxLabel` parameter
2. Always use the `:Z` bind-mount flag when SELinux is enabled and the mount
has the `SELinuxRelabel` attribute set to `true`
2. The `runContainer` method should be changed to always use the `:Z`
bind-mount flag on the termination message mount when SELinux is enabled
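As a rough sketch of the bind-mount formatting change above (function and field names here are
illustrative, not the actual `dockertools` code), the `:Z` option would be appended whenever
SELinux is enabled and the mount is marked for relabeling:

```go
// Sketch: build a docker bind-mount string, adding ":Z" when appropriate.
package main

import (
	"fmt"
	"strings"
)

type mount struct {
	HostPath       string
	ContainerPath  string
	ReadOnly       bool
	SELinuxRelabel bool
}

func bindString(m mount, selinuxEnabled bool) string {
	var opts []string
	if m.ReadOnly {
		opts = append(opts, "ro")
	}
	if selinuxEnabled && m.SELinuxRelabel {
		opts = append(opts, "Z") // ask docker to relabel the bind-mount
	}
	bind := m.HostPath + ":" + m.ContainerPath
	if len(opts) > 0 {
		bind += ":" + strings.Join(opts, ",")
	}
	return bind
}

func main() {
	m := mount{
		HostPath:       "/var/lib/kubelet/pods/uid/volumes/kubernetes.io~secret/creds",
		ContainerPath:  "/etc/creds",
		ReadOnly:       true,
		SELinuxRelabel: true,
	}
	fmt.Println(bindString(m, true)) // ...:/etc/creds:ro,Z
}
```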
### Changes to `pkg/kubelet/rkt`
There should not be any required changes for the rkt runtime; we should test to
ensure things work as expected under rkt.
### Changes to volume plugins and infrastructure
1. The `VolumeHost` interface contains a method called `GetRootContext`; this
is an artifact of the old assumptions about the Kubelet directory's SELinux
context and can be removed
2. The `empty_dir.go` file should be changed to be completely agnostic of
SELinux; no behavior in this plugin needs to be differentiated when SELinux
is enabled
### Changes to `pkg/controller/...`
The `VolumeHost` abstraction is used in a couple of PV controllers as NOP
implementations. These should be altered to no longer include `GetRootContext`.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/selinux-enhancements.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

View File

@ -1,69 +1 @@
# Service Discovery Proposal This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/service-discovery.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/service-discovery.md)
## Goal of this document
To consume a service, a developer needs to know the full URL and a description of the API. Kubernetes contains the host and port information of a service, but it lacks the scheme and the path information needed if the service is not bound at the root. In this document we propose some standard kubernetes service annotations to fix these gaps. It is important that these annotations are a standard to allow for standard service discovery across Kubernetes implementations. Note that the example largely speaks to consuming WebServices but that the same concepts apply to other types of services.
## Endpoint URL, Service Type
A URL can accurately describe the location of a Service. A generic URL is of the following form
scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]
however for the purpose of service discovery we can simplify this to the following form
scheme:[//host[:port]][/]path
If a user and/or password is required then this information can be passed using Kubernetes Secrets. Kubernetes contains the host and port of each service but it lacks the scheme and path.
`Service Path` - Every Service has one or more endpoints. As a rule the endpoint should be located at the root "/" of the location URL, i.e. `http://172.100.1.52/`. There are cases where this is not possible and the actual service endpoint could be located at `http://172.100.1.52/cxfcdi`. The Kubernetes metadata for a service does not capture the path part, making it hard to consume this service.
`Service Scheme` - Services can be deployed using different schemes. Some popular schemes include `http`,`https`,`file`,`ftp` and `jdbc`.
`Service Protocol` - Services use different protocols that clients need to speak in order to communicate with the service; some examples of service-level protocols are SOAP and REST (yes, technically REST isn't a protocol but an architectural style). For service consumers it can be hard to tell what protocol is expected.
## Service Description
The API of a service is the point of interaction with a service consumer. The description of the API is an essential piece of information when building the service consumer. It has become common to publish a service definition document at a known location on the service itself. This 'well known' place is not very standard, so it is proposed that the service developer provide the service description path and the type of Definition Language (DL) used.
`Service Description Path` - To facilitate the consumption of the service by clients, the location of this document is greatly helpful to the service consumer. In some cases the client-side code can be generated from such a document. It is assumed that the service description document is published somewhere on the service endpoint itself.
`Service Description Language` - A number of Definition Languages (DL) have been developed to describe services. Some examples are `WSDL`, `WADL` and `Swagger`. In order to consume a description document it is good to know the type of DL used.
## Standard Service Annotations
Kubernetes allows the creation of Service Annotations. Here we propose the use of the following standard annotations
* `api.service.kubernetes.io/path` - the path part of the service endpoint url. An example value could be `cxfcdi`,
* `api.service.kubernetes.io/scheme` - the scheme part of the service endpoint url. Some values could be `http` or `https`.
* `api.service.kubernetes.io/protocol` - the protocol of the service. Known values are `SOAP`, `XML-RPC` and `REST`,
* `api.service.kubernetes.io/description-path` - the path part of the service description documents endpoint. It is a pretty safe assumption that the service self-documents. An example value for a swagger 2.0 document can be `cxfcdi/swagger.json`,
* `api.kubernetes.io/description-language` - the type of Description Language used. Known values are `WSDL`, `WADL`, `SwaggerJSON`, `SwaggerYAML`.
The fragment below is taken from the service section of the kubernetes.json where these annotations are used:
...
"objects" : [ {
"apiVersion" : "v1",
"kind" : "Service",
"metadata" : {
"annotations" : {
"api.service.kubernetes.io/protocol" : "REST",
"api.service.kubernetes.io/scheme" "http",
"api.service.kubernetes.io/path" : "cxfcdi",
"api.service.kubernetes.io/description-path" : "cxfcdi/swagger.json",
"api.service.kubernetes.io/description-language" : "SwaggerJSON"
},
...
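To show how a consumer could use these annotations, here is a hypothetical client-side sketch (not
part of the proposal itself) that assembles the full endpoint URL from the proposed annotations plus
the host and port Kubernetes already knows:

```go
// Sketch: build the service endpoint URL from the proposed annotations.
package main

import (
	"fmt"
	"strings"
)

func endpointURL(annotations map[string]string, host string, port int) string {
	scheme := annotations["api.service.kubernetes.io/scheme"]
	if scheme == "" {
		scheme = "http" // assumed default when the annotation is absent
	}
	path := strings.TrimPrefix(annotations["api.service.kubernetes.io/path"], "/")
	return fmt.Sprintf("%s://%s:%d/%s", scheme, host, port, path)
}

func main() {
	ann := map[string]string{
		"api.service.kubernetes.io/scheme": "http",
		"api.service.kubernetes.io/path":   "cxfcdi",
	}
	fmt.Println(endpointURL(ann, "172.100.1.52", 80))
	// http://172.100.1.52:80/cxfcdi
}
```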
## Conclusion
Five service annotations are proposed as a standard way to describe a service endpoint. These five annotations are promoted as a Kubernetes standard, so that services can be discovered and a service catalog can be built to facilitate service consumers.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/service-discovery.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

View File

@ -1,161 +1 @@
# Service externalName This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/service-external-name.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/service-external-name.md)
Author: Tim Hockin (@thockin), Rodrigo Campos (@rata), Rudi C (@therc)
Date: August 2016
Status: Implementation in progress
# Goal
Allow a service to have a CNAME record in the cluster internal DNS service. For
example, the lookup for a `db` service could return a CNAME that points to the
RDS resource `something.rds.aws.amazon.com`. No proxying is involved.
# Motivation
There were many related issues, but we'll try to summarize them here. More info
is on GitHub issues/PRs: #13748, #11838, #13358, #23921
One motivation is to present as native cluster services, services that are
hosted externally. Some cloud providers, like AWS, hand out hostnames (IPs are
not static) and the user wants to refer to these services using regular
Kubernetes tools. This was requested in bugs, at least for AWS, for RedShift,
RDS, Elasticsearch Service, ELB, etc.
Other users just want to use an external service, for example `oracle`, with dns
name `oracle-1.testdev.mycompany.com`, without having to keep DNS in sync, and
are fine with a CNAME.
Another use case is to "integrate" some services for local development. For
example, consider a search service running in Kubernetes in staging, let's say
`search-1.staging.mycompany.com`. It's running on AWS, so it resides behind an
ELB (which has no static IP, just a hostname). A developer is building an app
that consumes `search-1`, but doesn't want to run it on their machine (before
Kubernetes, they didn't, either). They can just create a service that has a
CNAME to the `search-1` endpoint in staging and be happy as before.
Also, Openshift needs this for "service refs". Service ref is really just the
three use cases mentioned above, but in the future a way to automatically inject
"service ref"s into namespaces via "service catalog"[1] might be considered. And
service ref is the natural way to integrate an external service, since it takes
advantage of native DNS capabilities already in wide use.
[1]: https://github.com/kubernetes/kubernetes/pull/17543
# Alternatives considered
In the issues linked above, some alternatives were also considered. A partial
summary of them follows.
One option is to add the hostname to endpoints, as proposed in
https://github.com/kubernetes/kubernetes/pull/11838. This is problematic, as
endpoints are used in many places and users assume the required fields (such as
IP address) are always present and valid (and check that, too). If the field is
not required anymore or if there is just a hostname instead of the IP,
applications could break. Even assuming those cases could be solved, the
hostname will have to be resolved, which presents further questions and issues:
the timeout to use, whether the lookup is synchronous or asynchronous, dealing
with DNS TTL and more. One imperfect approach was to only resolve the hostname
upon creation, but this was considered not a great idea. A better approach
would be at a higher level, maybe a service type.
There are more ideas described in #13748, but all raised further issues,
ranging from using another upstream DNS server to creating a Name object
associated with DNSs.
# Proposed solution
The proposed solution works at the service layer, by adding a new `externalName`
type for services. This will create a CNAME record in the internal cluster DNS
service. No virtual IP or proxying is involved.
Using a CNAME gets rid of unnecessary DNS lookups. There's no need for the
Kubernetes control plane to issue them, to pick a timeout for them, or to
refresh them when the TTL for a record expires. It's way simpler to implement,
while solving the right problem. And addressing it at the service layer avoids
all the complications mentioned above about doing it at the endpoints layer.
The solution was outlined by Tim Hockin in
https://github.com/kubernetes/kubernetes/issues/13748#issuecomment-230397975
Currently a ServiceSpec looks like this, with comments edited for clarity:
```
type ServiceSpec struct {
Ports []ServicePort
// If not specified, the associated Endpoints object is not automatically managed
Selector map[string]string
// "", a real IP, or "None". If not specified, this is default allocated. If "None", this Service is not load-balanced
ClusterIP string
// ClusterIP, NodePort, LoadBalancer. Only applies if clusterIP != "None"
Type ServiceType
// Only applies if clusterIP != "None"
ExternalIPs []string
SessionAffinity ServiceAffinity
// Only applies to type=LoadBalancer
LoadBalancerIP string
LoadBalancerSourceRanges []string
```
The proposal is to change it to:
```
type ServiceSpec struct {
Ports []ServicePort
// If not specified, the associated Endpoints object is not automatically managed
+ // Only applies if type is ClusterIP, NodePort, or LoadBalancer. If type is ExternalName, this is ignored.
Selector map[string]string
// "", a real IP, or "None". If not specified, this is default allocated. If "None", this Service is not load-balanced.
+ // Only applies if type is ClusterIP, NodePort, or LoadBalancer. If type is ExternalName, this is ignored.
ClusterIP string
- // ClusterIP, NodePort, LoadBalancer. Only applies if clusterIP != "None"
+ // ExternalName, ClusterIP, NodePort, LoadBalancer. Only applies if clusterIP != "None"
Type ServiceType
+ // Only applies if type is ExternalName
+ ExternalName string
// Only applies if clusterIP != "None"
ExternalIPs []string
SessionAffinity ServiceAffinity
// Only applies to type=LoadBalancer
LoadBalancerIP string
LoadBalancerSourceRanges []string
```
For example, it can be used like this:
```
apiVersion: v1
kind: Service
metadata:
name: my-rds
spec:
ports:
- port: 12345
type: ExternalName
externalName: myapp.rds.whatever.aws.says
```
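As a hypothetical check of the resulting DNS behaviour (assuming the service above lives in the
`default` namespace), a client inside the cluster should observe a CNAME rather than a cluster IP:

```go
// Sketch: verify from inside the cluster that the ExternalName service
// resolves as a CNAME to the external hostname.
package main

import (
	"fmt"
	"net"
)

func main() {
	cname, err := net.LookupCNAME("my-rds.default.svc.cluster.local")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(cname) // expected: myapp.rds.whatever.aws.says.
}
```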
There is one issue to take into account, that no other alternative considered
fixes, either: TLS. If the service is a CNAME for an endpoint that uses TLS,
connecting with the Kubernetes name `my-service.my-ns.svc.cluster.local` may
result in a failure during server certificate validation. This is acknowledged
and left for future consideration. For the time being, users and administrators
might need to ensure that the server certificates also mention the Kubernetes
name as an alternate host name.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/service-external-name.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

View File

@ -1,363 +1 @@
# StatefulSets: Running pods which need strong identity and storage This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/stateful-apps.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/stateful-apps.md)
## Motivation
Many examples of clustered software systems require stronger guarantees per instance than are provided
by Replica Sets (aka Replication Controllers). Instances of these systems typically require:
1. Data per instance which should not be lost even if the pod is deleted, typically on a persistent volume
* Some cluster instances may have tens of TB of stored data - forcing new instances to replicate data
from other members over the network is onerous
2. A stable and unique identity associated with that instance of the storage - such as a unique member id
3. A consistent network identity that allows other members to locate the instance even if the pod is deleted
4. A predictable number of instances to ensure that systems can form a quorum
* This may be necessary during initialization
5. Ability to migrate from node to node with stable network identity (DNS name)
6. The ability to scale up in a controlled fashion, but are very rarely scaled down without human
intervention
Kubernetes should expose a pod controller (a StatefulSet) that satisfies these requirements in a flexible
manner. It should be easy for users to manage and reason about the behavior of this set. An administrator
with familiarity in a particular cluster system should be able to leverage this controller and its
supporting documentation to run that clustered system on Kubernetes. It is expected that some adaptation
is required to support each new cluster.
This resource is **stateful** because it offers an easy way to link a pod's network identity to its storage
identity and because it is intended to be used to run software that holds state for other
components. That does not mean that all stateful applications *must* use StatefulSets, but the tradeoffs
in this resource are intended to facilitate holding state in the cluster.
## Use Cases
The software listed below forms the primary use-cases for a StatefulSet on the cluster - problems encountered
while adapting these for Kubernetes should be addressed in a final design.
* Quorum with Leader Election
* MongoDB - in replica set mode forms a quorum with an elected leader, but instances must be preconfigured
and have stable network identities.
* ZooKeeper - forms a quorum with an elected leader, but is sensitive to cluster membership changes and
replacement instances *must* present consistent identities
* etcd - forms a quorum with an elected leader, can alter cluster membership in a consistent way, and
requires stable network identities
* Decentralized Quorum
* Cassandra - allows flexible consistency and distributes data via innate hash ring sharding, is also
flexible to scaling, more likely to support members that come and go. Scale down may trigger massive
rebalances.
* Active-active
* Galera - has multiple active masters which must remain in sync
* Leader-followers
* Spark in standalone mode - A single unilateral leader and a set of workers
## Background
Replica sets are designed with a weak guarantee - that there should be N replicas of a particular
pod template. Each pod instance varies only by name, and the replication controller errs on the side of
ensuring that N replicas exist as quickly as possible (by creating new pods as soon as old ones begin graceful
deletion, for instance, or by being able to pick arbitrary pods to scale down). In addition, pods by design
have no stable network identity other than their assigned pod IP, which can change over the lifetime of a pod
resource. ReplicaSets are best leveraged for stateless, shared-nothing, zero-coordination,
embarrassingly-parallel, or fungible software.
While it is possible to emulate the guarantees described above by leveraging multiple replication controllers
(for distinct pod templates and pod identities) and multiple services (for stable network identity), the
resulting objects are hard to maintain and must be copied manually in order to scale a cluster.
By contrast, a DaemonSet *can* offer some of the guarantees above by leveraging Nodes as stable, long-lived
entities. An administrator might choose a set of nodes, label them a particular way, and create a
DaemonSet that maps pods to each node. The storage of the node itself (which could be network attached
storage, or a local SAN) is the persistent storage. The network identity of the node is the stable
identity. However, while there are examples of clustered software that benefit from close association to
a node, this creates an undue burden on administrators to design their cluster to satisfy these
constraints, when a goal of Kubernetes is to decouple system administration from application management.
## Design Assumptions
* **Specialized Controller** - Rather than increase the complexity of the ReplicaSet to satisfy two distinct
use cases, create a new resource that assists users in solving this particular problem.
* **Safety first** - Running a clustered system on Kubernetes should be no harder
than running a clustered system off Kube. Authors should be given tools to guard against common cluster
failure modes (split brain, phantom member) to prevent introducing more failure modes. Sophisticated
distributed systems designers can implement more sophisticated solutions than StatefulSet if necessary -
new users should not become vulnerable to additional failure modes through an overly flexible design.
* **Controlled scaling** - While flexible scaling is important for some clusters, other examples of clusters
do not change scale without significant external intervention. Human intervention may be required after
scaling. Changing scale during cluster operation can lead to split brain in quorum systems. It should be
possible to scale, but there may be responsibilities on the set author to correctly manage the scale.
* **No generic cluster lifecycle** - Rather than design a general purpose lifecycle for clustered software,
focus on ensuring the information necessary for the software to function is available. For example,
rather than providing a "post-creation" hook invoked when the cluster is complete, provide the necessary
information to the "first" (or last) pod to determine the identity of the remaining cluster members and
allow it to manage its own initialization.
## Proposed Design
Add a new resource to Kubernetes to represent a set of pods that are individually distinct but each
individual can safely be replaced. The name **StatefulSet** is chosen to convey that the individual members of
the set are themselves "stateful" and thus each one is preserved. Each member has an identity, and there will
always be a member that thinks it is the "first" one.
The StatefulSet is responsible for creating and maintaining a set of **identities** and ensuring that there is
one pod and zero or more **supporting resources** for each identity. There should never be more than one pod
or unique supporting resource per identity at any one time. A new pod can be created for an identity only
if a previous pod has been fully terminated (reached its graceful termination limit or cleanly exited).
A StatefulSet has 0..N **members**, each with a unique **identity** which is a name that is unique within the
set.
```
type StatefulSet struct {
ObjectMeta
Spec StatefulSetSpec
...
}
type StatefulSetSpec struct {
// Replicas is the desired number of replicas of the given template.
// Each replica is assigned a unique name of the form `name-$replica`
// where replica is in the range `0 - (replicas-1)`.
Replicas int
// A label selector that "owns" objects created under this set
Selector *LabelSelector
// Template is the object describing the pod that will be created - each
// pod created by this set will match the template, but have a unique identity.
Template *PodTemplateSpec
// VolumeClaimTemplates is a list of claims that members are allowed to reference.
// The StatefulSet controller is responsible for mapping network identities to
// claims in a way that maintains the identity of a member. Every claim in
// this list must have at least one matching (by name) volumeMount in one
// container in the template. A claim in this list takes precedence over
// any volumes in the template, with the same name.
VolumeClaimTemplates []PersistentVolumeClaim
// ServiceName is the name of the service that governs this StatefulSet.
// This service must exist before the StatefulSet, and is responsible for
// the network identity of the set. Members get DNS/hostnames that follow the
// pattern: member-specific-string.serviceName.default.svc.cluster.local
// where "member-specific-string" is managed by the StatefulSet controller.
ServiceName string
}
```
Like a replication controller, a StatefulSet may be targeted by an autoscaler. The StatefulSet makes no assumptions
about upgrading or altering the pods in the set for now - instead, the user can trigger graceful deletion
and the StatefulSet will replace the terminated member with the newer template once it exits. Future proposals
may offer update capabilities. A StatefulSet requires RestartAlways pods. The addition of forgiveness may be
necessary in the future to increase the safety of the controller recreating pods.
### How identities are managed
A key question is whether scaling down a StatefulSet and then scaling it back up should reuse identities. If not,
scaling down becomes a destructive action (an admin cannot recover by scaling back up). Given the safety
first assumption, identity reuse seems the correct default. This implies that identity assignment should
be deterministic and not subject to controller races (a controller that has crashed during scale up should
assign the same identities on restart, and two concurrent controllers should decide on the same outcome
identities).
The simplest way to manage identities, and easiest to understand for users, is a numeric identity system
starting at I=0 that ranges up to the current replica count and is contiguous.
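A minimal sketch of that numbering scheme (illustrative only, following the `name-$index` form from the spec above; assumes `fmt` is imported):
```go
// Deterministic, contiguous identities: the same inputs always produce the same
// names, so a restarted or concurrent controller reaches the same decision.
func memberIdentities(setName string, replicas int) []string {
	ids := make([]string, replicas)
	for i := range ids {
		ids[i] = fmt.Sprintf("%s-%d", setName, i)
	}
	return ids
}
```
Scaling down and back up with the same replica count then reproduces exactly the same identities, which is what makes identity reuse the safe default.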
Future work:
* Cover identity reclamation - cleaning up resources for identities that are no longer in use.
* Allow more sophisticated identity assignment - instead of `{name}-{0 - replicas-1}`, allow subsets and
complex indexing.
### Controller behavior
When a StatefulSet is scaled up, the controller must create both pods and supporting resources for
each new identity. The controller must create supporting resources for the pod before creating the
pod. If a supporting resource with the appropriate name already exists, the controller should treat that as
creation succeeding. If a supporting resource cannot be created, the controller should flag an error to
status, back-off (like a scheduler or replication controller), and try again later. Each resource created
by a StatefulSet controller must have a set of labels that match the selector, support orphaning, and have a
controller back reference annotation identifying the owning StatefulSet by name and UID.
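A rough sketch of that per-identity ordering (hypothetical helper names; only the control flow is taken from the description above):
```go
// ensureClaims, createPod and isAlreadyExists are hypothetical helpers standing
// in for the real API calls; "already exists" counts as creation succeeding.
func ensureIdentity(identity string) error {
	if err := ensureClaims(identity); err != nil && !isAlreadyExists(err) {
		return err // record the error in status and back off; retried later
	}
	return createPod(identity)
}
```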
When a StatefulSet is scaled down, the pod for the removed identity should be deleted. It is less clear what the
controller should do to supporting resources. If every pod requires a PV, and a user accidentally scales
up to N=200 and then back down to N=3, leaving 197 PVs lying around may be undesirable (potential for
abuse). On the other hand, a cluster of 5 that is accidentally scaled down to 3 might irreparably destroy
the cluster if the PV for identities 4 and 5 are deleted (may not be recoverable). For the initial proposal,
leaving the supporting resources is the safest path (safety first) with a potential future policy applied
to the StatefulSet for how to manage supporting resources (DeleteImmediately, GarbageCollect, Preserve).
The controller should reflect summary counts of resources on the StatefulSet status to enable clients to easily
understand the current state of the set.
### Parameterizing pod templates and supporting resources
Since each pod needs a unique and distinct identity, and the pod needs to know its own identity, the
StatefulSet must allow a pod template to be parameterized by the identity assigned to the pod. The pods that
are created should be easily identified by their cluster membership.
Because that pod needs access to stable storage, the StatefulSet may specify a template for one or more
**persistent volume claims** that can be used for each distinct pod. The name of the volume claim must
match a volume mount within the pod template.
Future work:
* In the future other resources may be added that must also be templated - for instance, secrets (unique secret per member), config data (unique config per member), and, further in the future, arbitrary extension resources.
* Consider allowing the identity value itself to be passed as an environment variable via the downward API
* Consider allowing per identity values to be specified that are passed to the pod template or volume claim.
### Accessing pods by stable network identity
In order to provide stable network identity, given that pods may not assume pod IP is constant over the
lifetime of a pod, it must be possible to have a resolvable DNS name for the pod that is tied to the
pod identity. There are two broad classes of clustered services - those that require clients to know
all members of the cluster (load balancer intolerant) and those that are amenable to load balancing.
For the former, clients must also be able to easily enumerate the list of DNS names that represent the
member identities and access them inside the cluster. Within a pod, it must be possible for containers
to find and access that DNS name for identifying itself to the cluster.
Since a pod is expected to be controlled by a single controller at a time, it is reasonable for a pod to
have a single identity at a time. Therefore, a service can expose a pod by its identity in a unique
fashion via DNS by leveraging information written to the endpoints by the endpoints controller.
The end result might be DNS resolution as follows:
```
# service mongodb pointing to pods created by StatefulSet mdb, with identities mdb-1, mdb-2, mdb-3
dig mongodb.namespace.svc.cluster.local +short A
172.130.16.50
dig mdb-1.mongodb.namespace.svc.cluster.local +short A
# IP of pod created for mdb-1
dig mdb-2.mongodb.namespace.svc.cluster.local +short A
# IP of pod created for mdb-2
dig mdb-3.mongodb.namespace.svc.cluster.local +short A
# IP of pod created for mdb-3
```
This is currently implemented via an annotation on pods, which is surfaced to endpoints, and finally
surfaced as DNS on the service that exposes those pods.
```
# The pods created by this StatefulSet will have the DNS names "mysql-0.db.NAMESPACE.svc.cluster.local"
# and "mysql-1.db.NAMESPACE.svc.cluster.local"
kind: StatefulSet
metadata:
name: mysql
spec:
replicas: 2
serviceName: db
template:
spec:
containers:
- image: mysql:latest
# Example pod created by the StatefulSet
kind: Pod
metadata:
name: mysql-0
annotations:
pod.beta.kubernetes.io/hostname: "mysql-0"
pod.beta.kubernetes.io/subdomain: db
spec:
...
```
### Preventing duplicate identities
The StatefulSet controller is expected to execute like other controllers, as a single writer. However, when
considering designing for safety first, the possibility of the controller running concurrently cannot
be overlooked, and so it is important to ensure that duplicate pod identities are not achieved.
There are two mechanisms to achieve this at the current time. One is to leverage unique names for pods
that carry the identity of the pod - this prevents duplication because etcd 2 can guarantee single
key transactionality. The other is to use the status field of the StatefulSet to coordinate membership
information. It is possible to leverage both at this time, and encourage users to not assume pod
name is significant, but users are likely to take what they can get. A downside of using unique names
is that it complicates pre-warming of pods and pod migration - on the other hand, those are also
advanced use cases that might be better solved by another, more specialized controller (a
MigratableStatefulSet).
### Managing lifecycle of members
The most difficult aspect of managing a member set is ensuring that all members see a consistent configuration
state of the set. Without a strongly consistent view of cluster state, most clustered software is
vulnerable to split brain. For example, a new set is created with 3 members. If the node containing the
first member is partitioned from the cluster, it may not observe the other two members, and thus create its
own cluster of size 1. The other two members do see the first member, so they form a cluster of size 3.
Both clusters appear to have quorum, which can lead to data loss if not detected.
StatefulSets should provide basic mechanisms that enable a consistent view of cluster state to be possible,
and in the future provide more tools to reduce the amount of work necessary to monitor and update that
state.
The first mechanism is that the StatefulSet controller blocks creation of new pods until all previous pods
are reporting a healthy status. The StatefulSet controller uses the strong serializability of the underlying
etcd storage to ensure that it acts on a consistent view of the cluster membership (the pods and their
status), and serializes the creation of pods based on the health state of other pods. This simplifies
reasoning about how to initialize a StatefulSet, but is not sufficient to guarantee split brain does not
occur.
The second mechanism is having each "member" use the state of the cluster and transform that into cluster
configuration or decisions about membership. This is currently implemented using a side car container
that watches the master (via DNS today, although in the future this may be to endpoints directly) to
receive an ordered history of events, and then applying those safely to the configuration. Note that
for this to be safe, the history received must be strongly consistent (must be the same order of
events from all observers) and the config change must be bounded (an old config version may not
be allowed to exist forever). For now, this is known as a 'babysitter' (working name) and is intended
to help identify abstractions that can be provided by the StatefulSet controller in the future.
## Future Evolution
Criteria for advancing to beta:
* StatefulSets do not accidentally lose data due to cluster design - the pod safety proposal will
help ensure StatefulSets can guarantee **at most one** instance of a pod identity is running at
any time.
* A design consensus is reached on StatefulSet upgrades.
Criteria for advancing to GA:
* StatefulSets solve 80% of clustered software configuration with minimal input from users and are safe from common split brain problems
* Several representative examples of StatefulSets from the community have been proven/tested to be "correct" for a variety of partition problems (possibly via Jepsen or similar)
* Sufficient testing and soak time have occurred (as for Deployments) to ensure the necessary features are in place.
* StatefulSets are considered easy to use for deploying clustered software for common cases
Requested features:
* IPs per member for clustered software like Cassandra that cache resolved DNS addresses that can be used outside the cluster
* Individual services can potentially be used to solve this in some cases.
* Send more / simpler events to each pod from a central spot via the "signal API"
* Persistent local volumes that can leverage local storage
* Allow pods within the StatefulSet to identify "leader" in a way that can direct requests from a service to a particular member.
* Provide upgrades of a StatefulSet in a controllable way (like Deployments).
## Overlap with other proposals
* Jobs can be used to perform a run-once initialization of the cluster
* Init containers can be used to prime PVs and config with the identity of the pod.
* Templates and how fields are overridden in the resulting object should have broad alignment
* DaemonSet defines the core model for how new controllers sit alongside replication controller and
how upgrades can be implemented outside of Deployment objects.
## History
StatefulSets were formerly known as PetSets and were renamed to be less "cutesy" and more descriptive as a
prerequisite to moving to beta. No animals were harmed in the making of this proposal.


@ -1,175 +1 @@
**Table of Contents**

This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/synchronous-garbage-collection.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/synchronous-garbage-collection.md)
<!-- BEGIN MUNGE: GENERATED_TOC -->
- [Overview](#overview)
- [API Design](#api-design)
- [Standard Finalizers](#standard-finalizers)
- [OwnerReference](#ownerreference)
- [DeleteOptions](#deleteoptions)
- [Components changes](#components-changes)
- [API Server](#api-server)
- [Garbage Collector](#garbage-collector)
- [Controllers](#controllers)
- [Handling circular dependencies](#handling-circular-dependencies)
- [Unhandled cases](#unhandled-cases)
- [Implications to existing clients](#implications-to-existing-clients)
<!-- END MUNGE: GENERATED_TOC -->
# Overview
Users of the server-side garbage collection need to determine if the garbage collection is done. For example:
* Currently `kubectl delete rc` blocks until all the pods are terminating. To convert to use server-side garbage collection, kubectl has to be able to determine if the garbage collection is done.
* [#19701](https://github.com/kubernetes/kubernetes/issues/19701#issuecomment-236997077) is a use case where the user needs to wait for all service dependencies garbage collected and their names released, before she recreates the dependencies.
We define the garbage collection as "done" when all the dependents are deleted from the key-value store, rather than merely in the terminating state. There are two reasons: *i)* for `Pod`s, the most common garbage, only when they are deleted from the key-value store do we know that the kubelet has released the resources they occupy; *ii)* some users need to recreate objects with the same names, so they need to wait for the old objects to be deleted from the key-value store. (This limitation is because we index objects by their names in the key-value store today.)
Synchronous Garbage Collection is a best-effort (see [unhandled cases](#unhandled-cases)) mechanism that allows the user to determine whether the garbage collection is done: after the API server receives a deletion request for an owning object, the object keeps existing in the key-value store until all its dependents are deleted from the key-value store by the garbage collector.
Tracking issue: https://github.com/kubernetes/kubernetes/issues/29891
# API Design
## Standard Finalizers
We will introduce a new standard finalizer:
```go
const GCFinalizer string = "DeletingDependents"
```
This finalizer indicates the object is terminating and is waiting for its dependents whose `OwnerReference.BlockOwnerDeletion` is true to be deleted.
## OwnerReference
```go
OwnerReference {
...
// If true, AND if the owner has the "DeletingDependents" finalizer, then the owner cannot be deleted from the key-value store until this reference is removed.
// Defaults to false.
// To set this field, a user needs "delete" permission of the owner, otherwise 422 (Unprocessable Entity) will be returned.
BlockOwnerDeletion *bool
}
```
The initial draft of the proposal did not include this field and it had a security loophole: a user who is only authorized to update one resource can set ownerReference to block the synchronous GC of other resources. Requiring users to explicitly set `BlockOwnerDeletion` allows the master to properly authorize the request.
## DeleteOptions
```go
DeleteOptions {
// Whether and how garbage collection will be performed.
// Defaults to DeletePropagationDefault
// Either this field or OrphanDependents may be set, but not both.
PropagationPolicy *DeletePropagationPolicy
}
type DeletePropagationPolicy string
const (
// The default depends on the existing finalizers on the object and the type of the object.
DeletePropagationDefault DeletePropagationPolicy = "DeletePropagationDefault"
// Orphans the dependents
DeletePropagationOrphan DeletePropagationPolicy = "DeletePropagationOrphan"
// Deletes the object from the key-value store, the garbage collector will delete the dependents in the background.
DeletePropagationBackground DeletePropagationPolicy = "DeletePropagationBackground"
// The object exists in the key-value store until the garbage collector deletes all the dependents whose ownerReference.blockOwnerDeletion=true from the key-value store.
// API server will put the "DeletingDependents" finalizer on the object and set its deletionTimestamp.
// This policy is cascading, i.e., the dependents will also be deleted with DeletePropagationForeground.
DeletePropagationForeground DeletePropagationPolicy = "DeletePropagationForeground"
)
```
The `DeletePropagationForeground` policy represents the synchronous GC mode.
`DeleteOptions.OrphanDependents *bool` will be marked as deprecated and will be removed in 1.7. Validation code will make sure only one of `OrphanDependents` and `PropagationPolicy` may be set. We decided not to add another `DeleteAfterDependentsDeleted *bool`, because together with `OrphanDependents`, it will result in 9 possible combinations and is thus confusing.
The conversion rules are described in the following table:
| 1.5 `PropagationPolicy` | 1.4 and earlier `OrphanDependents` |
|------------------------------------------|--------------------------|
| DeletePropagationDefault | OrphanDependents==nil |
| DeletePropagationOrphan | *OrphanDependents==true |
| DeletePropagationBackground | *OrphanDependents==false |
| DeletePropagationForeground | N/A |
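A sketch of that conversion (illustrative only; it reuses the `DeletePropagationPolicy` constants declared above):
```go
// toPropagationPolicy mirrors the table: a nil OrphanDependents maps to the
// default policy, true to orphaning, and false to background deletion.
func toPropagationPolicy(orphanDependents *bool) DeletePropagationPolicy {
	switch {
	case orphanDependents == nil:
		return DeletePropagationDefault
	case *orphanDependents:
		return DeletePropagationOrphan
	default:
		return DeletePropagationBackground
	}
}
```
`DeletePropagationForeground` has no legacy equivalent, so it can only be requested through the new field.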
# Components changes
## API Server
`Delete()` function checks `DeleteOptions.PropagationPolicy`. If the policy is `DeletePropagationForeground`, the API server will update the object instead of deleting it, add the "DeletingDependents" finalizer, remove the "OrphanDependents" finalizer if it's present, and set the `ObjectMeta.DeletionTimestamp`.
When validating the ownerReference, API server needs to query the `Authorizer` to check if the user has "delete" permission of the owner object. It returns 422 if the user does not have the permissions but intends to set `OwnerReference.BlockOwnerDeletion` to true.
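A sketch of that branch (the storage interface and helper names below are hypothetical; only the control flow follows the description above):
```go
// For a foreground delete, the object is updated rather than removed: the
// "DeletingDependents" finalizer is added, "OrphanDependents" is dropped if
// present, and the deletionTimestamp is set.
if options.PropagationPolicy != nil && *options.PropagationPolicy == DeletePropagationForeground {
	obj.Finalizers = addFinalizer(removeFinalizer(obj.Finalizers, "OrphanDependents"), GCFinalizer)
	setDeletionTimestamp(obj)
	return storage.Update(ctx, obj) // object remains until its dependents are gone
}
return storage.Delete(ctx, name, options)
```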
## Garbage Collector
**Modifications to processEvent()**
Currently `processEvent()` manages GCs internal owner-dependency relationship graph, `uidToNode`. It updates `uidToNode` according to the Add/Update/Delete events in the cluster. To support synchronous GC, it has to:
* handle Add or Update events where `obj.Finalizers.Has(GCFinalizer) && obj.DeletionTimestamp != nil`. The object will be added into the `dirtyQueue`. The object will be marked as "GC in progress" in `uidToNode`.
* Upon receiving the deletion event of an object, put its owner into the `dirtyQueue` if the owner node is marked as "GC in progress". This is to force the `processItem()` (described next) to re-check if all dependents of the owner are deleted.
**Modifications to processItem()**
Currently `processItem()` consumes the `dirtyQueue` and requests the API server to delete an item if none of its owners exists. To support synchronous GC, it has to:
* treat an owner as nonexistent if `owner.DeletionTimestamp != nil && !owner.Finalizers.Has(OrphanFinalizer)`, otherwise synchronous GC will not progress because the owner keeps existing in the key-value store.
* when deleting dependents, if the owner's finalizers include `DeletingDependents`, it should use `DeletePropagationForeground` as the GC policy.
* if an object has multiple owners, some owners still exist while other owners are in the synchronous GC stage, then according to the existing logic of GC, the object wouldn't be deleted. To unblock the synchronous GC of owners, `processItem()` has to remove the ownerReferences pointing to them.
In addition, if an object popped from `dirtyQueue` is marked as "GC in progress", `processItem()` treats it specially:
* To avoid racing with another controller, it requeues the object if `observedGeneration < Generation`. This is best-effort, see [unhandled cases](#unhandled-cases).
* Checks if the object has dependents
* If not, send a PUT request to remove the `GCFinalizer`;
* If so, then add all dependents to the `dirtyQueue`; we need bookkeeping to avoid adding the dependents repeatedly if the owner gets in the `synchronousGC queue` multiple times.
## Controllers
To utilize the synchronous garbage collection feature, controllers (e.g., the replicaset controller) need to set `OwnerReference.BlockOwnerDeletion` when creating dependent objects (e.g. pods).
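For illustration, a dependent created by such a controller might carry an owner reference like the following; all field names other than `BlockOwnerDeletion` come from the existing `OwnerReference` type, and the literal values are hypothetical:
```go
blockOwnerDeletion := true
isController := true
pod.OwnerReferences = []OwnerReference{{
	APIVersion:         "extensions/v1beta1",
	Kind:               "ReplicaSet",
	Name:               "frontend",
	UID:                ownerUID, // UID of the owning ReplicaSet
	Controller:         &isController,
	BlockOwnerDeletion: &blockOwnerDeletion,
}}
```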
# Handling circular dependencies
SynchronousGC will enter a deadlock in the presence of circular dependencies. The garbage collector can break the cycle lazily: when `processItem()` processes an object, if it finds that the object and all of its owners have the `GCFinalizer`, it removes the `GCFinalizer` from the object.
Note that the approach is not rigorous and can thus produce false positives. For example, if a user first sends a SynchronousGC delete request for an object, then sends the delete request for its owner, then `processItem()` will be fooled into believing there is a circle. We expect users not to do this. We can make the circle detection more rigorous if needed.
Circular dependencies are regarded as user error. If needed, we can add more guarantees to handle such cases later.
# Unhandled cases
* If the GC observes the owning object with the `GCFinalizer` before it observes the creation of all the dependents, GC will remove the finalizer from the owning object before all dependents are gone. Hence, synchronous GC is best-effort, though we guarantee that the dependents will be deleted eventually. We face a similar case when handling OrphanFinalizer, see [GC known issues](https://github.com/kubernetes/kubernetes/issues/26120).
# Implications to existing clients
Finalizers break an assumption that many Kubernetes components have: that a deletion request with `grace period=0` immediately removes the object from the key-value store. This is not true if an object has pending finalizers: the object will continue to exist, and currently the API server will not return an error in this case.
**Namespace controller** suffered from this [problem](https://github.com/kubernetes/kubernetes/issues/32519) and was fixed in [#32524](https://github.com/kubernetes/kubernetes/pull/32524) by retrying every 15s if there are objects with pending finalizers to be removed from the key-value store. An object with a pending `GCFinalizer` might take an arbitrarily long time to be deleted, so namespace deletion might time out.
**kubelet** deletes the pod from the key-value store after all its containers are terminated ([code](../../pkg/kubelet/status/status_manager.go#L441-L443)). It also assumes that if the API server does not return an error, the pod is removed from the key-value store. Breaking the assumption will not break `kubelet` though, because the pod must already be in the terminated phase, so `kubelet` no longer needs to manage it.
**Node controller** forcefully deletes pod if the pod is scheduled to a node that does not exist ([code](../../pkg/controller/node/nodecontroller.go#L474)). The pod will continue to exist if it has pending finalizers. The node controller will futilely retry the deletion. Also, the `node controller` forcefully deletes pods before deleting the node ([code](../../pkg/controller/node/nodecontroller.go#L592)). If the pods have pending finalizers, the `node controller` will go ahead deleting the node, leaving those pods behind. These pods will be deleted from the key-value store when the pending finalizers are removed.
**Podgc** deletes terminated pods if there are too many of them in the cluster. We need to make sure finalizers on Pods are taken off quickly enough so that the progress of `Podgc` is not affected.
**Deployment controller** adopts existing `ReplicaSet` (RS) if its template matches. If a matching RS has a pending `GCFinalizer`, deployment should adopt it, take its pods into account, but shouldn't try to mutate it, because the RS controller will ignore a RS that's being deleted. Hence, `deployment controller` should wait for the RS to be deleted, and then create a new one.
**Replication controller manager**, **Job controller**, and **ReplicaSet controller** ignore pods in terminated phase, so pods with pending finalizers will not block these controllers.
**StatefulSet controller** will be blocked by a pod with pending finalizers, so synchronous GC might slow down its progress.
**kubectl**: synchronous GC can simplify the **kubectl delete** reapers. Let's take the `deployment reaper` as an example, since it's the most complicated one. Currently, the reaper finds all `RS` with matching labels, scales them down, polls until `RS.Status.Replicas` reaches 0, deletes the `RS`es, and finally deletes the `deployment`. If using synchronous GC, `kubectl delete deployment` is as easy as sending a synchronous GC delete request for the deployment and polling until the deployment is deleted from the key-value store.
Note that this **changes the behavior** of `kubectl delete`. The command will be blocked until all pods are deleted from the key-value store, instead of being blocked until pods are in the terminating state. This means `kubectl delete` blocks for a longer time, but it has the benefit that the resources used by the pods are released when `kubectl delete` returns. To allow kubectl users to skip waiting for the cleanup, we will add a `--wait` flag. It defaults to true; if it's set to `false`, `kubectl delete` will send the delete request with `PropagationPolicy=DeletePropagationBackground` and return immediately.
To make the new kubectl compatible with the 1.4 and earlier masters, kubectl needs to switch to use the old reaper logic if it finds synchronous GC is not supported by the master.
1.4 `kubectl delete rc/rs` uses `DeleteOptions.OrphanDependents=true`, which is going to be converted to `DeletePropagationBackground` (see [API Design](#api-changes)) by a 1.5 master, so its behavior keeps the same.
Pre 1.4 `kubectl delete` uses `DeleteOptions.OrphanDependents=nil`, so does the 1.4 `kubectl delete` for resources other than rc and rs. The option is going to be converted to `DeletePropagationDefault` (see [API Design](#api-changes)) by a 1.5 master, so these commands behave the same as when working with a 1.4 master.


@ -1,569 +1 @@
# Templates+Parameterization: Repeatedly instantiating user-customized application topologies

This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/templates.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/templates.md)
## Motivation
Addresses https://github.com/kubernetes/kubernetes/issues/11492
There are two main motivators for Template functionality in Kubernetes: Controller Instantiation and Application Definition
### Controller Instantiation
Today the replication controller defines a PodTemplate which allows it to instantiate multiple pods with identical characteristics.
This is useful but limited. Stateful applications have a need to instantiate multiple instances of a more sophisticated topology
than just a single pod (e.g. they also need Volume definitions). A Template concept would allow a Controller to stamp out multiple
instances of a given Template definition. This capability would be immediately useful to the [StatefulSet](https://github.com/kubernetes/kubernetes/pull/18016) proposal.
Similarly the [Service Catalog proposal](https://github.com/kubernetes/kubernetes/pull/17543) could leverage template instantiation as a mechanism for claiming service instances.
### Application Definition
Kubernetes gives developers a platform on which to run images and many configuration objects to control those images, but
constructing a cohesive application made up of images and configuration objects is currently difficult. Applications
require:
* Information sharing between images (e.g. one image provides a DB service, another consumes it)
* Configuration/tuning settings (memory sizes, queue limits)
* Unique/customizable identifiers (service names, routes)
Application authors know which values should be tunable and what information must be shared, but there is currently no
consistent way for an application author to define that set of information so that application consumers can easily deploy
an application and make appropriate decisions about the tunable parameters the author intended to expose.
Furthermore, even if an application author provides consumers with a set of API object definitions (e.g. a set of yaml files)
it is difficult to build a UI around those objects that would allow the deployer to modify names in one place without
potentially breaking assumed linkages to other pieces. There is also no prescriptive way to define which configuration
values are appropriate for a deployer to tune or what the parameters control.
## Use Cases
### Use cases for templates in general
* Providing a full baked application experience in a single portable object that can be repeatably deployed in different environments.
* e.g. Wordpress deployment with separate database pod/replica controller
* Complex service/replication controller/volume topologies
* Bulk object creation
* Provide a management mechanism for deleting/uninstalling an entire set of components related to a single deployed application
* Providing a library of predefined application definitions that users can select from
* Enabling the creation of user interfaces that can guide an application deployer through the deployment process with descriptive help about the configuration value decisions they are making, and useful default values where appropriate
* Exporting a set of objects in a namespace as a template so the topology can be inspected/visualized or recreated in another environment
* Controllers that need to instantiate multiple instances of identical objects (e.g. StatefulSets).
### Use cases for parameters within templates
* Share passwords between components (parameter value is provided to each component as an environment variable or as a Secret reference, with the Secret value being parameterized or produced by an [initializer](https://github.com/kubernetes/kubernetes/issues/3585))
* Allow for simple deployment-time customization of “app” configuration via environment values or api objects, e.g. memory
tuning parameters to a MySQL image, Docker image registry prefix for image strings, pod resource requests and limits, default
scale size.
* Allow simple, declarative defaulting of parameter values and expose them to end users in an approachable way - a parameter
like “MySQL table space” can be parameterized in images as an env var - the template parameters declare the parameter, give
it a friendly name, give it a reasonable default, and inform the user what tuning options are available.
* Customization of component names to avoid collisions and ensure matched labeling (e.g. replica selector value and pod label are
user provided and in sync).
* Customize cross-component references (e.g. user provides the name of a secret that already exists in their namespace, to use in
a pod as a TLS cert).
* Provide guidance to users for parameters such as default values, descriptions, and whether or not a particular parameter value
is required or can be left blank.
* Parameterize the replica count of a deployment or [StatefulSet](https://github.com/kubernetes/kubernetes/pull/18016)
* Parameterize part of the labels and selector for a DaemonSet
* Parameterize quota/limit values for a pod
* Parameterize a secret value so a user can provide a custom password or other secret at deployment time
## Design Assumptions
The goal for this proposal is a simple schema which addresses a few basic challenges:
* Allow application authors to expose configuration knobs for application deployers, with suggested defaults and
descriptions of the purpose of each knob
* Allow application deployers to easily customize exposed values like object names while maintaining referential integrity
between dependent pieces (for example ensuring a pod's labels always match the corresponding selector definition of the service)
* Support maintaining a library of templates within Kubernetes that can be accessed and instantiated by end users
* Allow users to quickly and repeatedly deploy instances of well-defined application patterns produced by the community
* Follow established Kubernetes API patterns by defining new template related APIs which consume+return first class Kubernetes
API (and therefore json conformant) objects.
We do not wish to invent a new Turing-complete templating language. There are good options available
(e.g. https://github.com/mustache/mustache) for developers who want a completely flexible and powerful solution for creating
arbitrarily complex templates with parameters, and tooling can be built around such schemes.
This desire for simplicity also intentionally excludes template composability/embedding as a supported use case.
Allowing templates to reference other templates presents versioning+consistency challenges along with making the template
no longer a self-contained portable object. Scenarios necessitating multiple templates can be handled in one of several
alternate ways:
* Explicitly constructing a new template that merges the existing templates (tooling can easily be constructed to perform this
operation since the templates are first class api objects).
* Manually instantiating each template and utilizing [service linking](https://github.com/kubernetes/kubernetes/pull/17543) to share
any necessary configuration data.
This document will also refrain from proposing server APIs or client implementations. This has been a point of debate, and it makes
more sense to focus on the template/parameter specification/syntax than to worry about the tooling that will process or manage the
template objects. However since there is a desire to at least be able to support a server side implementation, this proposal
does assume the specification will be k8s API friendly.
## Desired characteristics
* Fully k8s object json-compliant syntax. This allows server side apis that align with existing k8s apis to be constructed
which consume templates and existing k8s tooling to work with them. It also allows for api versioning/migration to be managed by
the existing k8s codec scheme rather than having to define/introduce a new syntax evolution mechanism.
* (Even if they are not part of the k8s core, it would still be good if a server side template processing+managing api supplied
as an ApiGroup consumed the same k8s object schema as the peer k8s apis rather than introducing a new one)
* Self-contained parameter definitions. This allows a template to be a portable object which includes metadata describing
the inputs it expects, making it easy to wrap a user interface around the parameterization flow.
* Object field primitive types include string, int, boolean, byte[]. The substitution scheme should support all of those types.
* complex types (struct/map/list) can be defined in terms of the available primitives, so it's preferred to avoid the complexity
of allowing for full complex-type substitution.
* Parameter metadata. Parameters should include at a minimum, information describing the purpose of the parameter, whether it is
required/optional, and a default/suggested value. Type information could also be required to enable more intelligent client interfaces.
* Template metadata. Templates should be able to include metadata describing their purpose or links to further documentation and
versioning information. Annotations on the Template's metadata field can fulfill this requirement.
## Proposed Implementation
### Overview
We began by looking at the List object which allows a user to easily group a set of objects together for easy creation via a
single CLI invocation. It also provides a portable format which requires only a single file to represent an application.
From that starting point, we propose a Template API object which can encapsulate the definition of all components of an
application to be created. The application definition is encapsulated in the form of an array of API objects (identical to
List), plus a parameterization section. Components reference the parameter by name and the value of the parameter is
substituted during a processing step, prior to submitting each component to the appropriate API endpoint for creation.
The primary capability provided is that parameter values can easily be shared between components, such as a database password
that is provided by the user once, but then attached as an environment variable to both a database pod and a web frontend pod.
In addition, the template can be repeatedly instantiated for a consistent application deployment experience in different
namespaces or Kubernetes clusters.
Lastly, we propose the Template API object include a “Labels” section in which the template author can define a set of labels
to be applied to all objects created from the template. This will give the template deployer an easy way to manage all the
components created from a given template. These labels will also be applied to selectors defined by Objects within the template,
allowing a combination of templates and labels to be used to scope resources within a namespace. That is, a given template
can be instantiated multiple times within the same namespace, as long as a different label value is used for each
instantiation. The resulting objects will be independent from a replica/load-balancing perspective.
Generation of parameter values for fields such as Secrets will be delegated to an [admission controller/initializer/finalizer](https://github.com/kubernetes/kubernetes/issues/3585) rather than being solved by the template processor. Some discussion about a generation
service is occurring [here](https://github.com/kubernetes/kubernetes/issues/12732)
Labels to be assigned to all objects could also be generated in addition to, or instead of, allowing labels to be supplied in the
Template definition.
### API Objects
**Template Object**
```
// Template contains the inputs needed to produce a Config.
type Template struct {
unversioned.TypeMeta
kapi.ObjectMeta
// Optional: Parameters is an array of Parameters used during the
// Template to Config transformation.
Parameters []Parameter
// Required: A list of resources to create
Objects []runtime.Object
// Optional: ObjectLabels is a set of labels that are applied to every
// object during the Template to Config transformation
// These labels are also applied to selectors defined by objects in the template
ObjectLabels map[string]string
}
```
**Parameter Object**
```
// Parameter defines a name/value variable that is to be processed during
// the Template to Config transformation.
type Parameter struct {
// Required: Parameter name must be set and it can be referenced in Template
// Items using $(PARAMETER_NAME)
Name string
// Optional: The name that will show in UI instead of parameter 'Name'
DisplayName string
// Optional: Parameter can have description
Description string
// Optional: Value holds the Parameter data.
// The value replaces all occurrences of the Parameter $(Name) or
// $((Name)) expression during the Template to Config transformation.
Value string
// Optional: Indicates the parameter must have a non-empty value either provided by the user or provided by a default. Defaults to false.
Required bool
// Optional: Type-value of the parameter (one of string, int, bool, or base64)
// Used by clients to provide validation of user input and guide users.
Type ParameterType
}
```
As seen above, parameters allow for metadata which can be fed into client implementations to display information about the
parameter's purpose and whether a value is required. In lieu of type information, two reference styles are offered: `$(PARAM)`
and `$((PARAM))`. When the single parens option is used, the result of the substitution will remain quoted. When the double
parens option is used, the result of the substitution will not be quoted. For example, given a parameter defined with a value
of "BAR", the following behavior will be observed:
```
somefield: "$(FOO)" -> somefield: "BAR"
somefield: "$((FOO))" -> somefield: BAR
```
For concatenation, the result value reflects the type of substitution (quoted or unquoted):
```
somefield: "prefix_$(FOO)_suffix" -> somefield: "prefix_BAR_suffix"
somefield: "prefix_$((FOO))_suffix" -> somefield: prefix_BAR_suffix
```
If both types of substitution exist, quoting is performed:
```
somefield: "prefix_$((FOO))_$(FOO)_suffix" -> somefield: "prefix_BAR_BAR_suffix"
```
This mechanism allows for integer/boolean values to be substituted properly.
The value of the parameter can be explicitly defined in the template. This should be considered a default value for the parameter; clients
which process templates are free to override this value based on user input.
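A minimal sketch of these substitution rules, operating on one serialized (JSON) field value at a time; this is illustrative only (it assumes the standard `strings` package) and is not the processor's actual implementation:
```go
// substituteField replaces $(NAME) and $((NAME)) references in a serialized
// field value such as `"$(FOO)"`.
func substituteField(raw string, params map[string]string) string {
	hasDouble := strings.Contains(raw, "$((")
	hasSingle := strings.Contains(strings.ReplaceAll(raw, "$((", ""), "$(")
	for name, value := range params {
		raw = strings.ReplaceAll(raw, "$(("+name+"))", value)
		raw = strings.ReplaceAll(raw, "$("+name+")", value)
	}
	// Quotes are dropped only when $((NAME)) is used and no $(NAME) appears;
	// mixed usage keeps the quotes, per the rules above.
	if hasDouble && !hasSingle && strings.HasPrefix(raw, `"`) && strings.HasSuffix(raw, `"`) {
		raw = strings.Trim(raw, `"`)
	}
	return raw
}
```
With `FOO=BAR`, a field serialized as `"$(FOO)"` becomes `"BAR"`, while `"$((FOO))"` becomes the unquoted `BAR`.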
**Example Template**
Illustration of a template which defines a service and replication controller with parameters to specialize
the name of the top level objects, the number of replicas, and several environment variables defined on the
pod template.
```
{
"kind": "Template",
"apiVersion": "v1",
"metadata": {
"name": "mongodb-ephemeral",
"annotations": {
"description": "Provides a MongoDB database service"
}
},
"labels": {
"template": "mongodb-ephemeral-template"
},
"objects": [
{
"kind": "Service",
"apiVersion": "v1",
"metadata": {
"name": "$(DATABASE_SERVICE_NAME)"
},
"spec": {
"ports": [
{
"name": "mongo",
"protocol": "TCP",
"targetPort": 27017
}
],
"selector": {
"name": "$(DATABASE_SERVICE_NAME)"
}
}
},
{
"kind": "ReplicationController",
"apiVersion": "v1",
"metadata": {
"name": "$(DATABASE_SERVICE_NAME)"
},
"spec": {
"replicas": "$((REPLICA_COUNT))",
"selector": {
"name": "$(DATABASE_SERVICE_NAME)"
},
"template": {
"metadata": {
"creationTimestamp": null,
"labels": {
"name": "$(DATABASE_SERVICE_NAME)"
}
},
"spec": {
"containers": [
{
"name": "mongodb",
"image": "docker.io/centos/mongodb-26-centos7",
"ports": [
{
"containerPort": 27017,
"protocol": "TCP"
}
],
"env": [
{
"name": "MONGODB_USER",
"value": "$(MONGODB_USER)"
},
{
"name": "MONGODB_PASSWORD",
"value": "$(MONGODB_PASSWORD)"
},
{
"name": "MONGODB_DATABASE",
"value": "$(MONGODB_DATABASE)"
}
]
}
]
}
}
}
}
],
"parameters": [
{
"name": "DATABASE_SERVICE_NAME",
"description": "Database service name",
"value": "mongodb",
"required": true
},
{
"name": "MONGODB_USER",
"description": "Username for MongoDB user that will be used for accessing the database",
"value": "username",
"required": true
},
{
"name": "MONGODB_PASSWORD",
"description": "Password for the MongoDB user",
"required": true
},
{
"name": "MONGODB_DATABASE",
"description": "Database name",
"value": "sampledb",
"required": true
},
{
"name": "REPLICA_COUNT",
"description": "Number of mongo replicas to run",
"value": "1",
"required": true
}
]
}
```
### API Endpoints
* **/processedtemplates** - when a template is POSTed to this endpoint, all parameters in the template are processed and
substituted into appropriate locations in the object definitions. Validation is performed to ensure required parameters have
a value supplied. In addition labels defined in the template are applied to the object definitions. Finally the customized
template (still a `Template` object) is returned to the caller. (The possibility of returning a List instead has
also been discussed and will be considered for implementation).
The client is then responsible for iterating the objects returned and POSTing them to the appropriate resource api endpoint to
create each object, if that is the desired end goal for the client.
Performing parameter substitution on the server side has the benefit of centralizing the processing so that new clients of
k8s, such as IDEs, CI systems, Web consoles, etc, do not need to reimplement template processing or embed the k8s binary.
Instead they can invoke the k8s api directly.
* **/templates** - the REST storage resource for storing and retrieving template objects, scoped within a namespace.
Storing templates within k8s has the benefit of enabling template sharing and securing via the same roles/resources
that are used to provide access control to other cluster resources. It also enables sophisticated service catalog
flows in which selecting a service from a catalog results in a new instantiation of that service. (This is not the
only way to implement such a flow, but it does provide a useful level of integration).
Creating a new template (POST to the /templates api endpoint) simply stores the template definition; it has no side
effects (no other objects are created).
This resource can also support a subresource "/templates/templatename/processed". This resource would accept just a
Parameters object and would process the template stored in the cluster as "templatename". The processed result would be
returned in the same form as `/processedtemplates`
### Workflow
#### Template Instantiation
Given a well-formed template, a client will
1. Optionally set an explicit `value` for any parameters the user wishes to override
2. Submit the new template object to the `/processedtemplates` api endpoint
The api endpoint will then:
1. Validate the template, including confirming that "required" parameters have an explicit value.
2. Walk each api object in the template.
3. Add all labels defined in the template's ObjectLabels field.
4. For each field, check if the value matches a parameter name and if so, set the value of the field to the value of the parameter.
* Partial substitutions are accepted, such as `SOME_$(PARAM)` which would be transformed into `SOME_XXXX` where `XXXX` is the value
of the `$(PARAM)` parameter.
* If a given $(VAL) could be resolved to either a parameter or an environment variable/downward api reference, an error will be
returned.
5. Return the processed template object. (or List, depending on the choice made when this is implemented)
The client can now either return the processed template to the user in a desired form (e.g. json or yaml), or directly iterate the
api objects within the template, invoking the appropriate object creation api endpoint for each element. (If the api returns
a List, the client would simply iterate the list to create the objects).
The result is a consistently recreatable application configuration, including well-defined labels for grouping objects created by
the template, with end-user customizations as enabled by the template author.
#### Template Authoring
To aid application authors in the creation of new templates, it should be possible to export existing objects from a project
in template form. A user should be able to export all or a filtered subset of objects from a namespace, wrapped into a
Template API object. The user will still need to customize the resulting object to enable parameterization and labeling,
though sophisticated export logic could attempt to auto-parameterize well understood api fields. Such logic is not considered
in this proposal.
#### Tooling
As described above, templates can be instantiated by posting them to a template processing endpoint. CLI tools should
exist which can input parameter values from the user as part of the template instantiation flow.
More sophisticated UI implementations should also guide the user through which parameters the template expects, the description
of those parameters, and the collection of user provided values.
In addition, as described above, existing objects in a namespace can be exported in template form, making it easy to recreate a
set of objects in a new namespace or a new cluster.
## Examples
### Example Templates
These examples reflect the current OpenShift template schema, not the exact schema proposed in this document, however this
proposal, if accepted, provides sufficient capability to support the examples defined here, with the exception of
automatic generation of passwords.
* [Jenkins template](https://github.com/openshift/origin/blob/master/examples/jenkins/jenkins-persistent-template.json)
* [MySQL DB service template](https://github.com/openshift/origin/blob/master/examples/db-templates/mysql-persistent-template.json)
### Examples of OpenShift Parameter Usage
(mapped to use cases described above)
* [Share passwords](https://github.com/jboss-openshift/application-templates/blob/master/eap/eap64-mongodb-s2i.json#L146-L152)
* [Simple deployment-time customization of “app” configuration via environment values](https://github.com/jboss-openshift/application-templates/blob/master/eap/eap64-mongodb-s2i.json#L108-L126) (e.g. memory tuning, resource limits, etc)
* [Customization of component names with referential integrity](https://github.com/jboss-openshift/application-templates/blob/master/eap/eap64-mongodb-s2i.json#L199-L207)
* [Customize cross-component references](https://github.com/jboss-openshift/application-templates/blob/master/eap/eap64-mongodb-s2i.json#L78-L83) (e.g. user provides the name of a secret that already exists in their namespace, to use in a pod as a TLS cert)
## Requirements analysis
There has been some discussion of desired goals for a templating/parameterization solution [here](https://github.com/kubernetes/kubernetes/issues/11492#issuecomment-160853594). This section will attempt to address each of those points.
*The primary goal is that parameterization should facilitate reuse of declarative configuration templates in different environments in
a "significant number" of common cases without further expansion, substitution, or other static preprocessing.*
* This solution provides for templates that can be reused as is (assuming parameters are not used or provide sane default values) across
different environments; they are a self-contained description of a topology.
*Parameterization should not impede the ability to use kubectl commands with concrete resource specifications.*
* The parameterization proposal here does not extend beyond Template objects. That is both a strength and limitation of this proposal.
Parameterizable objects must be wrapped into a Template object, rather than existing on their own.
*Parameterization should work with all kubectl commands that accept --filename, and should work on templates comprised of multiple resources.*
* Same as above.
*The parameterization mechanism should not prevent the ability to wrap kubectl with workflow/orchestration tools, such as Deployment manager.*
* Since this proposal uses standard API objects, a DM or Helm flow could still be constructed around a set of templates, just as those flows are
constructed around other API objects today.
*Any parameterization mechanism we add should not preclude the use of a different parameterization mechanism, it should be possible
to use different mechanisms for different resources, and, ideally, the transformation should be composable with other
substitution/decoration passes.*
* This templating scheme does not preclude layering an additional templating mechanism over top of it. For example, it would be
possible to write a Mustache template which, after Mustache processing, resulted in a Template which could then be instantiated
through the normal template instantiating process.
*Parameterization should not compromise reproducibility. For instance, it should be possible to manage template arguments as well as
templates under version control.*
* Templates are a single file, including default or chosen values for parameters. They can easily be managed under version control.
*It should be possible to specify template arguments (i.e., parameter values) declaratively, in a way that is "self-describing"
(i.e., naming the parameters and the template to which they correspond). It should be possible to write generic commands to
process templates.*
* Parameter definitions include metadata which describes the purpose of the parameter. Since parameter definitions are part of the template,
there is no need to indicate which template they correspond to.
*It should be possible to validate templates and template parameters, both values and the schema.*
* Template objects are subject to standard api validation.
*It should also be possible to validate and view the output of the substitution process.*
* The `/processedtemplates` api returns the result of the substitution process, which is itself a Template object that can be validated.
*It should be possible to generate forms for parameterized templates, as discussed in #4210 and #6487.*
* Parameter definitions provide metadata that allows for the construction of form-based UIs to gather parameter values from users.
*It shouldn't be inordinately difficult to evolve templates. Thus, strategies such as versioning and encapsulation should be
encouraged, at least by convention.*
* Templates can be versioned via annotations on the template object.
## Key discussion points
The preceding document is opinionated about each of these topics; however, they have been popular topics of discussion, so they are called out explicitly below.
### Where to define parameters
There has been some discussion around where to define the parameters that are being injected into a Template:
1. In a separate standalone file
2. Within the Template itself
This proposal suggests including the parameter definitions within the Template, which provides a self-contained structure that
can be easily versioned, transported, and instantiated without risk of mismatching content. In addition, a Template can easily
be validated to confirm that all parameter references are resolvable.
Separating the parameter definitions makes for a more complex process with respect to
* Editing a template (if/when first class editing tools are created)
* Storing/retrieving template objects with a central store
Note that the `/templates/sometemplate/processed` subresource would accept a standalone set of parameters to be applied to `sometemplate`.
### How to define parameters
There has also been debate about how a parameter should be referenced from within a template. This proposal suggests that
fields to be substituted by a parameter value use the "$(parameter)" syntax which is already used elsewhere within k8s. The
value of `parameter` should be matched to a parameter with that name, and the value of the matched parameter substituted into
the field value.
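As a purely illustrative sketch (loosely following the OpenShift template schema referenced in the examples above, not the exact schema proposed here; all names are hypothetical), a parameter reference might look like:

```yaml
apiVersion: v1
kind: Template
metadata:
  name: example-template
parameters:
# The parameter definition lives alongside the objects that reference it.
- name: MYSQL_PASSWORD
  description: Password for the MySQL root user
  value: changeme
objects:
- apiVersion: v1
  kind: Pod
  metadata:
    name: mysql
  spec:
    containers:
    - name: mysql
      image: mysql
      env:
      - name: MYSQL_ROOT_PASSWORD
        # Substituted with the value of the MYSQL_PASSWORD parameter at
        # template instantiation time.
        value: $(MYSQL_PASSWORD)
```

Renaming `MYSQL_PASSWORD` in a sketch like this only requires touching the parameter definition and its references, not a separate path map.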
Other suggestions include a path/map approach in which a list of field paths (e.g. json path expressions) and corresponding
parameter names are provided. The substitution process would walk the map, replacing fields with the appropriate
parameter value. This approach makes templates more fragile from the perspective of editing/refactoring as field paths
may change, thus breaking the map. There is of course also risk of breaking references with the previous scheme, but
renaming parameters seems less likely than changing field paths.
### Storing templates in k8s
OpenShift defines templates as a first class resource so they can be created/retrieved/etc via standard tools. This allows client tools to list available templates (available in the OpenShift cluster), allows existing resource security controls to be applied to templates, and generally provides a more integrated feel to templates. However, there is no explicit requirement that adopting templates in k8s also means adopting in-cluster storage for them.
### Processing templates (server vs. client)
OpenShift handles template processing via a server endpoint which consumes a template object from the client and returns the list of objects
produced by processing the template. It is also possible to handle the entire template processing flow via the client, but this was deemed
undesirable as it would force each client tool to reimplement template processing (e.g. the standard CLI tool, an Eclipse plugin, a plugin for a CI system like Jenkins, etc). The assumption in this proposal is that server-side template processing is the preferred implementation approach for
this reason.

View File

@@ -1,150 +1 @@
# Support HostPath volume existence qualifiers This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/volume-hostpath-qualifiers.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/volume-hostpath-qualifiers.md)
## Introduction
A Host volume source is probably the simplest volume type to define, needing
only a single path. However, that simplicity comes with many assumptions and
caveats.
This proposal describes one of the issues associated with Host volumes &mdash;
their silent and implicit creation of directories on the host &mdash; and
proposes a solution.
## Problem
Right now, under Docker, when a bindmount references a hostPath, that path will
be created as an empty directory, owned by root, if it does not already exist.
This is rarely what the user actually wants because hostPath volumes are
typically used to express a dependency on an existing external file or
directory.
This concern was raised during the [initial
implementation](https://github.com/docker/docker/issues/1279#issuecomment-22965058)
of this behavior in Docker, and it was suggested that orchestration systems
could better manage volume creation than Docker; nevertheless, Docker still creates
the directory itself.
To fix this problem, I propose allowing a pod to specify whether a given
hostPath should exist prior to the pod running, whether it should be created,
and what it should exist as.
I also propose the inclusion of a default value which matches the current
behavior to ensure backwards compatibility.
To understand exactly when this behavior will or won't be correct, it's
important to look at the use-cases of Host Volumes.
The table below broadly classifies the use-case of Host Volumes and asserts
whether this change would be of benefit to that use-case.
### HostPath volume Use-cases
| Use-case | Description | Examples | Benefits from this change? | Why? |
|:---------|:------------|:---------|:--------------------------:|:-----|
| Accessing an external system, data, or configuration | Data or a unix socket is created by a process on the host, and a pod within kubernetes consumes it | [fluentd-es-addon](https://github.com/kubernetes/kubernetes/blob/74b01041cc3feb2bb731cc243ab0e4515bef9a84/cluster/saltbase/salt/fluentd-es/fluentd-es.yaml#L30), [addon-manager](https://github.com/kubernetes/kubernetes/blob/808f3ecbe673b4127627a457dc77266ede49905d/cluster/gce/coreos/kube-manifests/kube-addon-manager.yaml#L23), [kube-proxy](https://github.com/kubernetes/kubernetes/blob/010c976ce8dd92904a7609483c8e794fd8e94d4e/cluster/saltbase/salt/kube-proxy/kube-proxy.manifest#L65), etc | :white_check_mark: | Fails faster and with more useful messages, and won't run when basic assumptions are false (e.g. that docker is the runtime and the docker.sock exists) |
| Providing data to external systems | Some pods wish to publish data to the host for other systems to consume, sometimes to a generic directory and sometimes to more component-specific ones | Kubelet core components which bindmount their logs out to `/var/log/*.log` so logrotate and other tools work with them | :white_check_mark: | Sometimes, but not always; whether a missing directory is a problem depends on the specific directory. |
| Communicating between instances and versions of yourself | A pod can use a hostPath directory as a sort of cache and, as opposed to an emptyDir, persist the directory between versions of itself | [etcd](https://github.com/kubernetes/kubernetes/blob/fac54c9b22eff5c5052a8e3369cf8416a7827d36/cluster/saltbase/salt/etcd/etcd.manifest#L84), caches | :x: | It's pretty much always okay to create them |
### Other motivating factors
One additional motivating factor for this change is that under the rkt runtime
paths are not created when they do not exist. This change moves the management
of these volumes into the Kubelet to the benefit of the rkt container runtime.
## Proposed API Change
### Host Volume
I propose that the
[`v1.HostPathVolumeSource`](https://github.com/kubernetes/kubernetes/blob/d26b4ca2859aa667ad520fb9518e0db67b74216a/pkg/api/types.go#L447-L451)
object be changed to include the following additional field:
`Type` - An optional string of `exists|file|device|socket|directory` - If not
set, it defaults to the backwards-compatible behavior described
below.
| Value | Behavior |
|:------|:---------|
| *unset* | If nothing exists at the given path, an empty directory will be created there. Otherwise, behaves like `exists`. (This default behavior is referred to as `auto` elsewhere in this proposal.) |
| `exists` | If nothing exists at the given path, the pod will fail to run and provide an informative error message |
| `file` | If a file does not exist at the given path, the pod will fail to run and provide an informative error message |
| `device` | If a block or character device does not exist at the given path, the pod will fail to run and provide an informative error message |
| `socket` | If a socket does not exist at the given path, the pod will fail to run and provide an informative error message |
| `directory` | If a directory does not exist at the given path, the pod will fail to run and provide an informative error message |
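For illustration, a pod that depends on the Docker socket might declare that dependency as follows (a hedged sketch; the exact field name and casing in the versioned API, and the image name, are assumptions of this example):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: log-collector
spec:
  containers:
  - name: collector
    image: example/log-collector
    volumeMounts:
    - name: docker-socket
      mountPath: /var/run/docker.sock
  volumes:
  - name: docker-socket
    hostPath:
      path: /var/run/docker.sock
      # With type "socket", the pod fails fast with an informative error if no
      # socket exists at this path, instead of getting an empty directory.
      type: socket
```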
Additional possible values, which are proposed to be excluded:
|Value | Behavior | Reason for exclusion |
|:-----|:---------|:---------------------|
| `new-directory` | Like `auto`, but the given path must be a directory if it exists | `auto` mostly fills this use-case |
| `character-device` | | Granularity beyond `device` shouldn't matter often |
| `block-device` | | Granularity beyond `device` shouldn't matter often |
| `new-file` | Like `file`, but if nothing exists an empty file is created instead | In general, bindmounting the parent directory of the file you intend to create addresses this use-case |
| `optional` | If a path does not exist, then do not create any container-mount at all | This would better be handled by a new field entirely if this behavior is desirable |
### Why not as part of any other volume types?
This feature does not make sense for any of the other volume types simply
because all of the other types are already fully qualified. For example, NFS
volumes must already exist or they will not mount at all.
Similarly, EmptyDir volumes will always exist as a directory.
Only the HostVolume and SubPath means of referencing a path have the potential
to reference arbitrary incorrect or nonexistent things without erroring out.
### Alternatives
One alternative is to augment Host Volumes with a `MustExist` bool and provide
no further granularity. This would allow toggling between the `auto` and
`exists` behaviors described above. This would likely cover the "90%" use-case
and would be a simpler API. It would be sufficient for all of the examples
linked above, in my opinion.
## Kubelet implementation
It's proposed that prior to starting a pod, the Kubelet validates that the
given path meets the qualifications of its type. Namely, if the type is `auto`
the Kubelet will create an empty directory if none exists there, and for each
of the others the Kubelet will perform the given validation prior to running
the pod. This validation might be done by a volume plugin, but further
technical consideration (out of scope of this proposal) is needed.
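A minimal sketch of what that check could look like (illustrative only; the package and function names are not the actual Kubelet code):

```go
package hostpath

import (
	"fmt"
	"os"
)

// validateHostPath checks path against the proposed type qualifier. An empty
// qualifier reproduces today's behavior: create an empty directory if needed.
func validateHostPath(path, pathType string) error {
	info, err := os.Lstat(path)
	if os.IsNotExist(err) {
		if pathType == "" {
			return os.MkdirAll(path, 0755)
		}
		return fmt.Errorf("hostPath %q does not exist (required type %q)", path, pathType)
	}
	if err != nil {
		return err
	}
	mode := info.Mode()
	switch pathType {
	case "", "exists":
		return nil
	case "file":
		if mode.IsRegular() {
			return nil
		}
	case "directory":
		if mode.IsDir() {
			return nil
		}
	case "socket":
		if mode&os.ModeSocket != 0 {
			return nil
		}
	case "device":
		if mode&(os.ModeDevice|os.ModeCharDevice) != 0 {
			return nil
		}
	}
	return fmt.Errorf("hostPath %q is not of type %q", path, pathType)
}
```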
## Possible concerns
### Permissions
This proposal does not attempt to change the state of volume permissions. Currently, a HostPath volume is created with `root` ownership and `755` permissions. This behavior will be retained. An argument for this behavior is given [here](volumes.md#shared-storage-hostpath).
### SELinux
This proposal should not impact SELinux relabeling. Verifying the presence and
type of a given path will be logically separate from SELinux labeling.
Similarly, creating the directory when it doesn't exist will happen before any
SELinux operations and should not impact it.
### Containerized Kubelet
A containerized kubelet would have difficulty creating directories. The
implementation will likely respect the `containerized` flag, or similar,
allowing it to either break out or be "/rootfs/" aware and thus operate as
desired.
### Racy Validation
Ideally the validation would be done at the time the bindmounts are created;
otherwise it's possible for a given path or directory to change between the time
it's validated and the time the container runtime attempts to create said mount.
The only way to solve this problem is to integrate these sorts of qualification
into container runtimes themselves.
I don't think this problem is severe enough that we need to push to solve it;
rather I think we can simply accept this minor race, and if runtimes eventually
allow this we can begin to leverage them.

View File

@@ -1,108 +1 @@
## Volume plugins and idempotency This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/volume-ownership-management.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/volume-ownership-management.md)
Currently, volume plugins have a `SetUp` method which is called in the context of a higher-level
workflow within the kubelet which has externalized the problem of managing the ownership of volumes.
This design has a number of drawbacks that can be mitigated by completely internalizing all concerns
of volume setup behind the volume plugin `SetUp` method.
### Known issues with current externalized design
1. The ownership management is currently repeatedly applied, which breaks packages that require
special permissions in order to work correctly
2. There is a gap between files being mounted/created by volume plugins and when their ownership
is set correctly; race conditions exist around this
3. Solving the correct application of ownership management in an externalized model is difficult
and makes it clear that a transaction boundary is being broken by the externalized design
### Additional issues with externalization
Fully externalizing any one concern of volumes is difficult for a number of reasons:
1. Many types of idempotence checks exist, and are used in a variety of combinations and orders
2. Workflow in the kubelet becomes much more complex to handle:
1. composition of plugins
2. correct timing of application of ownership management
3. callback to volume plugins when we know the whole `SetUp` flow is complete and correct
4. callback to touch sentinel files
5. etc etc
3. We want to support fully external volume plugins -- this would require complex orchestration / a chatty
remote API
## Proposed implementation
Since all of the ownership information is known in advance of the call to the volume plugin `SetUp`
method, we can easily internalize these concerns into the volume plugins and pass the ownership
information to `SetUp`.
The volume `Builder` interface's `SetUp` method changes to accept the group that should own the
volume. Plugins become responsible for ensuring that the correct group is applied. The volume
`Attributes` struct can be modified to remove the `SupportsOwnershipManagement` field.
```go
package volume
type Builder interface {
	// other methods omitted

	// SetUp prepares and mounts/unpacks the volume to a self-determined
	// directory path and returns an error. The group ID that should own the volume
	// is passed as a parameter. Plugins may choose to ignore the group ID directive
	// in the event that they do not support it (example: NFS). A group ID of -1
	// indicates that the group ownership of the volume should not be modified by the plugin.
	//
	// SetUp will be called multiple times and should be idempotent.
	SetUp(gid int64) error
}
```
Each volume plugin will have to change to support the new `SetUp` signature. The existing
ownership management code will be refactored into a library that volume plugins can use:
```go
package volume

// ManageOwnership recursively chowns path to fsGroup and marks it setgid so
// that files created in the volume inherit the owning GID.
func ManageOwnership(path string, fsGroup int64) error {
	// 1. recursive chown of path
	// 2. make path +setgid
	return nil
}
```
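A rough sketch of how such a library function might be implemented (the recursive walk and permission bits below are assumptions for illustration, not the final refactoring; the name is deliberately distinct from the stub above):

```go
package volume

import (
	"os"
	"path/filepath"
)

// manageOwnershipSketch hands group ownership of every file under path to
// fsGroup and marks directories setgid so new files inherit that group.
func manageOwnershipSketch(path string, fsGroup int64) error {
	return filepath.Walk(path, func(p string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		// Keep the owning UID; change only the group.
		if err := os.Chown(p, -1, int(fsGroup)); err != nil {
			return err
		}
		mode := info.Mode().Perm() | 0070 // ensure group rwx
		if info.IsDir() {
			mode |= os.ModeSetgid
		}
		return os.Chmod(p, mode)
	})
}
```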
The workflow from the Kubelet's perspective for handling volume setup and refresh becomes:
```go
// go-ish pseudocode
func mountExternalVolumes(pod) error {
	podVolumes := make(kubecontainer.VolumeMap)
	for i := range pod.Spec.Volumes {
		volSpec := &pod.Spec.Volumes[i]
		var fsGroup int64 = 0
		if pod.Spec.SecurityContext != nil &&
			pod.Spec.SecurityContext.FSGroup != nil {
			fsGroup = *pod.Spec.SecurityContext.FSGroup
		} else {
			fsGroup = -1
		}

		// Try to use a plugin for this volume.
		plugin := volume.NewSpecFromVolume(volSpec)
		builder, err := kl.newVolumeBuilderFromPlugins(plugin, pod)
		if err != nil {
			return err
		}
		if builder == nil {
			return errUnsupportedVolumeType
		}

		// Pass the pod-level fsGroup down to the plugin; propagate any error.
		err = builder.SetUp(fsGroup)
		if err != nil {
			return err
		}
	}
	return nil
}
```

View File

@@ -1,500 +1 @@
## Abstract This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/volume-provisioning.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/volume-provisioning.md)
Real Kubernetes clusters have a variety of volumes which differ widely in
size, iops performance, retention policy, and other characteristics.
Administrators need a way to dynamically provision volumes of these different
types to automatically meet user demand.
A new mechanism called 'storage classes' is proposed to provide this
capability.
## Motivation
In Kubernetes 1.2, an alpha form of limited dynamic provisioning was added
that allows a single volume type to be provisioned in clouds that offer
special volume types.
In Kubernetes 1.3, a label selector was added to persistent volume claims to
allow administrators to create a taxonomy of volumes based on the
characteristics important to them, and to allow users to make claims on those
volumes based on those characteristics. This allows flexibility when claiming
existing volumes; the same flexibility is needed when dynamically provisioning
volumes.
After gaining experience with dynamic provisioning after the 1.2 release, we
want to create a more flexible feature that allows configuration of how
different storage classes are provisioned and supports provisioning multiple
types of volumes within a single cloud.
### Out-of-tree provisioners
One of our goals is to enable administrators to create out-of-tree
provisioners, that is, provisioners whose code does not live in the Kubernetes
project.
## Design
This design represents the minimally viable changes required to provision based on storage class configuration. Additional incremental features may be added as a separate effort.
We propose that:
1. Both for in-tree and out-of-tree storage provisioners, the PV created by the
provisioners must match the PVC that led to its creation. If a provisioner
is unable to provision such a matching PV, it reports an error to the
user.
2. The above point also applies to the PVC label selector. If a user submits a PVC
with a label selector, the provisioner must provision a PV with matching
labels. This directly implies that the provisioner understands the meaning
behind these labels - if a user submits a claim with a selector that wants
a PV with label "region" not in "[east,west]", the provisioner must
understand what label "region" means, what regions are available, and
choose e.g. "north".
In other words, provisioners should either refuse to provision a volume for
a PVC that has a selector, or select a few labels that are allowed in
selectors (such as the "region" example above), implement the necessary logic
for their parsing, document them, and refuse any selector that references
unknown labels.
3. An API object will be incubated in storage.k8s.io/v1beta1 to hold the `StorageClass`
API resource. Each StorageClass object contains parameters required by the provisioner to provision volumes of that class. These parameters are opaque to the user.
4. `PersistentVolume.Spec.Class` attribute is added to volumes. This attribute
is optional and specifies which `StorageClass` instance represents
storage characteristics of a particular PV.
During incubation, `Class` is an annotation and not an
actual attribute.
5. `PersistentVolume` instances are not required to be labeled by the provisioner.
6. `PersistentVolumeClaim.Spec.Class` attribute is added to claims. This
attribute specifies that only a volume with equal
`PersistentVolume.Spec.Class` value can satisfy a claim.
During incubation, `Class` is just an annotation and not an
actual attribute.
7. The existing provisioner plugin implementations be modified to accept
parameters as specified via `StorageClass`.
8. The persistent volume controller be modified to invoke provisioners using the `StorageClass` configuration and to bind claims with `PersistentVolumeClaim.Spec.Class` to volumes with an equivalent `PersistentVolume.Spec.Class`.
9. The existing alpha dynamic provisioning feature be phased out in the
next release.
### Controller workflow for provisioning volumes
0. The Kubernetes administrator can configure the name of a default StorageClass. This
StorageClass instance is then used when a user requests a dynamically
provisioned volume but does not specify a StorageClass. In other words,
`claim.Spec.Class == ""`
(or annotation `volume.beta.kubernetes.io/storage-class == ""`).
1. When a new claim is submitted, the controller attempts to find an existing
volume that will fulfill the claim.
1. If the claim has non-empty `claim.Spec.Class`, only PVs with the same
`pv.Spec.Class` are considered.
2. If the claim has empty `claim.Spec.Class`, only PVs with an unset `pv.Spec.Class` are considered.
All "considered" volumes are evaluated and the
smallest matching volume is bound to the claim.
2. If no volume is found for the claim and `claim.Spec.Class` is not set or is
an empty string, dynamic provisioning is disabled.
3. If `claim.Spec.Class` is set, the controller tries to find an instance of StorageClass with this name. If no
such StorageClass is found, the controller goes back to step 1 and
periodically retries finding a matching volume or storage class until
a match is found. The claim is `Pending` during this period.
4. With the StorageClass instance, the controller updates the claim:
* `claim.Annotations["volume.beta.kubernetes.io/storage-provisioner"] = storageClass.Provisioner`
* **In-tree provisioning**
The controller tries to find an internal volume plugin referenced by
`storageClass.Provisioner`. If it is found:
5. The internal provisioner implements the interface `ProvisionableVolumePlugin`,
which has a method called `NewProvisioner` that returns a new provisioner.
6. The controller calls the volume plugin's `Provision` with the parameters
from the `StorageClass` configuration object.
7. If `Provision` returns an error, the controller generates an event on the
claim and goes back to step 1., i.e. it will retry provisioning
periodically.
8. If `Provision` returns no error, the controller creates the returned
`api.PersistentVolume`, fills its `Class` attribute with `claim.Spec.Class`,
and makes it already bound to the claim.
1. If the create operation for the `api.PersistentVolume` fails, it is
retried
2. If the create operation does not succeed in a reasonable time, the
controller attempts to delete the provisioned volume and creates an event
on the claim
Existing behavior is unchanged for claims that do not specify
`claim.Spec.Class`.
* **Out of tree provisioning**
Following step 4. above, the controller tries to find an internal plugin for the
`StorageClass`. If none is found, the controller does not do anything; it just
periodically goes back to step 1, i.e. tries to find an available matching PV.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
interpreted as described in RFC 2119.
External provisioner must have these features:
* It MUST have a distinct name, following the Kubernetes plugin naming scheme
`<vendor name>/<provisioner name>`, e.g. `gluster.org/gluster-volume`.
* The provisioner SHOULD send events on a claim to report any errors
related to provisioning a volume for the claim. This way, users get the same
experience as with internal provisioners.
* The provisioner MUST also implement a deleter. It must be able to delete
storage assets it created. It MUST NOT assume that any other internal or
external plugin is present.
The external provisioner runs in a separate process which watches claims, be
it an external storage appliance, a daemon or a Kubernetes pod. For every
claim creation or update, it implements these steps:
1. The provisioner checks whether
`claim.Annotations["volume.beta.kubernetes.io/storage-provisioner"] == <provisioner name>`.
All other claims MUST be ignored.
2. The provisioner MUST check that the claim is unbound, i.e. its
`claim.Spec.VolumeName` is empty. Bound volumes MUST be ignored (a rough sketch of this filtering appears after this list).
*The race condition where the provisioner provisions a new PV for a claim
while, at the same time, Kubernetes binds the same claim to another PV that was
just created by an admin is discussed below.*
3. It tries to find a StorageClass instance referenced by annotation
`claim.Annotations["volume.beta.kubernetes.io/storage-class"]`. If not
found, it SHOULD report an error (by sending an event to the claim) and it
SHOULD retry periodically from step 1.
4. The provisioner MUST parse arguments in the `StorageClass` and
`claim.Spec.Selector` and provision an appropriate storage asset that matches
both the parameters and the selector.
When it encounters unknown parameters in `storageClass.Parameters` or
`claim.Spec.Selector`, or the combination of these parameters is impossible
to achieve, it SHOULD report an error and it MUST NOT provision a volume.
All errors found during parsing or provisioning SHOULD be sent as events
on the claim, and the provisioner SHOULD retry periodically from step 1.
As parsing (and understanding) claim selectors is hard, the sentence
"MUST parse ... `claim.Spec.Selector`" will in the typical case lead to simple
refusal of claims that have any selector:
```go
if pvc.Spec.Selector != nil {
return Error("can't parse PVC selector!")
}
```
5. When the volume is provisioned, the provisioner MUST create a new PV
representing the storage asset and save it in Kubernetes. When this fails,
it SHOULD retry creating the PV a few times. If all attempts fail, it
MUST delete the storage asset. All errors SHOULD be sent as events to the
claim.
The created PV MUST have these properties:
* `pv.Spec.ClaimRef` MUST point to the claim that led to its creation
(including the claim UID).
*This way, the PV will be bound to the claim.*
* `pv.Annotations["pv.kubernetes.io/provisioned-by"]` MUST be set to name
of the external provisioner. This provisioner will be used to delete the
volume.
*The provisioner/deleter should not assume there is any other
provisioner/deleter available that would delete the volume.*
* `pv.Annotations["volume.beta.kubernetes.io/storage-class"]` MUST be set
to name of the storage class requested by the claim.
*So the created PV matches the claim.*
* The provisioner MAY store any other information to the created PV as
annotations. It SHOULD save any information that is needed to delete the
storage asset there, as the appropriate StorageClass instance may not exist
when the volume is deleted. However, references to a Secret instance
or direct username/password to a remote storage appliance MUST NOT be
stored there, see issue #34822.
* `pv.Labels` MUST be set to match `claim.spec.selector`. The provisioner
MAY add additional labels.
*So the created PV matches the claim.*
* `pv.Spec` MUST be set to match requirements in `claim.Spec`, especially
access mode and PV size. The provisioned volume size MUST NOT be smaller
than size requested in the claim, however it MAY be larger.
*So the created PV matches the claim.*
* `pv.Spec.PersistentVolumeSource` MUST be set to point to the created
storage asset.
* `pv.Spec.PersistentVolumeReclaimPolicy` SHOULD be set to `Delete` unless
user manually configures other reclaim policy.
* `pv.Name` MUST be unique. Internal provisioners use a name based on
`claim.UID` to produce conflicts when two provisioners accidentally
provision a PV for the same claim; external provisioners, however, can use
any mechanism to generate a unique PV name.
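Condensing steps 1 and 2 above, an external provisioner's claim filter might look roughly like this (the reduced claim type and helper name are illustrative; a real provisioner would receive the full PersistentVolumeClaim object):

```go
package externalprovisioner

// claim carries only the fields the filter looks at.
type claim struct {
	Annotations map[string]string
	VolumeName  string // claim.Spec.VolumeName
}

// wantsProvisioning reports whether this provisioner should act on the claim.
func wantsProvisioning(c claim, provisionerName string) bool {
	if c.Annotations["volume.beta.kubernetes.io/storage-provisioner"] != provisionerName {
		return false // step 1: some other provisioner is responsible
	}
	return c.VolumeName == "" // step 2: ignore claims that are already bound
}
```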
Example of a claim that is to be provisioned by an external provisioner for
`foo.org/foo-volume`:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    volume.beta.kubernetes.io/storage-class: myClass
    volume.beta.kubernetes.io/storage-provisioner: foo.org/foo-volume
  name: fooclaim
  namespace: default
  resourceVersion: "53"
  uid: 5a294561-7e5b-11e6-a20e-0eb6048532a3
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 4Gi
  # volumeName: must be empty!
```
Example of the created PV:
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: foo.org/foo-volume
    volume.beta.kubernetes.io/storage-class: myClass
    foo.org/provisioner: "any other annotations as needed"
  labels:
    foo.org/my-label: "any labels as needed"
  generateName: "foo-volume-"
spec:
  accessModes:
  - ReadWriteOnce
  awsElasticBlockStore:
    fsType: ext4
    volumeID: aws://us-east-1d/vol-de401a79
  capacity:
    storage: 4Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: fooclaim
    namespace: default
    resourceVersion: "53"
    uid: 5a294561-7e5b-11e6-a20e-0eb6048532a3
  persistentVolumeReclaimPolicy: Delete
```
As a result, Kubernetes has a PV that represents the storage asset and is bound
to the claim. When everything goes well, Kubernetes completes the binding of the
claim to the PV.
Kubernetes was not blocked in any way during the provisioning and could have
bound the claim to another PV created by the user, or the claim may even have
been deleted by the user. In both cases, Kubernetes will mark the PV to be
deleted using the protocol below.
The external provisioner MAY save any annotations to the claim that is
provisioned; however, the claim may be modified or even deleted by the user at
any time.
### Controller workflow for deleting volumes
When the controller decides that a volume should be deleted it performs these
steps:
1. The controller changes `pv.Status.Phase` to `Released`.
2. The controller looks for `pv.Annotations["pv.kubernetes.io/provisioned-by"]`.
If found, it uses this provisioner/deleter to delete the volume.
3. If the volume is not annotated by `pv.kubernetes.io/provisioned-by`, the
controller inspects `pv.Spec` and finds in-tree deleter for the volume.
4. If the deleter found by steps 2 or 3 is internal, the controller calls it and deletes
the storage asset together with the PV that represents it.
5. If the deleter is not known to Kubernetes, the controller does not do anything.
6. External deleters MUST watch for PV changes. When
`pv.Status.Phase == Released && pv.Annotations['pv.kubernetes.io/provisioned-by'] == <deleter name>`,
the deleter:
* It MUST check reclaim policy of the PV and ignore all PVs whose
`Spec.PersistentVolumeReclaimPolicy` is not `Delete`.
* It MUST delete the storage asset.
* Only after the storage asset was successfully deleted, it MUST delete the
PV object in Kubernetes.
* Any error SHOULD be sent as an event on the PV being deleted and the
deleter SHOULD retry to delete the volume periodically.
* The deleter SHOULD NOT use any information from the StorageClass instance
referenced by the PV. This is different from internal deleters, which
need the StorageClass instance to be present at the time of deletion to read
Secret instances (see the Gluster provisioner for example); however, we would
like to phase out this behavior.
Note that watching `pv.Status` has been frowned upon in the past; however, in
this particular case we could use it quite reliably to trigger deletion.
It's not trivial to find out if a PV is not needed and should be deleted.
*Alternatively, an annotation could be used.*
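Condensed into code, the external deleter's eligibility check amounts to something like the following (the reduced type and field names are illustrative, not an actual client implementation):

```go
package externaldeleter

// persistentVolume carries only the fields the check looks at.
type persistentVolume struct {
	Phase         string            // pv.Status.Phase
	Annotations   map[string]string // pv.Annotations
	ReclaimPolicy string            // pv.Spec.PersistentVolumeReclaimPolicy
}

// shouldDelete reports whether this deleter is responsible for the PV and
// whether its reclaim policy actually asks for deletion.
func shouldDelete(pv persistentVolume, deleterName string) bool {
	return pv.Phase == "Released" &&
		pv.Annotations["pv.kubernetes.io/provisioned-by"] == deleterName &&
		pv.ReclaimPolicy == "Delete"
}
```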
### Security considerations
Both internal and external provisioners and deleters may need access to
credentials (e.g. username+password) of an external storage appliance to
provision and delete volumes.
* For internal provisioners, a Secret instance in a well-secured namespace
should be used. A pointer to the Secret instance shall be a parameter of the
StorageClass, and it MUST NOT be copied around the system, e.g. in annotations
of PVs. See issue #34822.
* External provisioners running in a pod should have the appropriate credentials
mounted as a Secret inside the pods that run the provisioner. The namespace with
the pods and the Secret instance should be well secured.
### `StorageClass` API
A new API group should hold the API for storage classes, following the pattern
of autoscaling, metrics, etc. To allow for future storage-related APIs, we
should call this new API group `storage.k8s.io` and incubate in storage.k8s.io/v1beta1.
Storage classes will be represented by an API object called `StorageClass`:
```go
package storage
// StorageClass describes the parameters for a class of storage for
// which PersistentVolumes can be dynamically provisioned.
//
// StorageClasses are non-namespaced; the name of the storage class
// according to etcd is in ObjectMeta.Name.
type StorageClass struct {
	unversioned.TypeMeta `json:",inline"`
	ObjectMeta           `json:"metadata,omitempty"`

	// Provisioner indicates the type of the provisioner.
	Provisioner string `json:"provisioner,omitempty"`

	// Parameters for dynamic volume provisioner.
	Parameters map[string]string `json:"parameters,omitempty"`
}
```
`PersistentVolumeClaimSpec` and `PersistentVolumeSpec` both get Class attribute
(the existing annotation is used during incubation):
```go
type PersistentVolumeClaimSpec struct {
	// Name of requested storage class. If non-empty, only PVs with this
	// pv.Spec.Class will be considered for binding and if no such PV is
	// available, StorageClass with this name will be used to dynamically
	// provision the volume.
	Class string
	...
}

type PersistentVolumeSpec struct {
	// Name of StorageClass instance that this volume belongs to.
	Class string
	...
}
```
Storage classes are natural to think of as a global resource, since they:
1. Align with PersistentVolumes, which are a global resource
2. Are administrator controlled
### Provisioning configuration
With the scheme outlined above the provisioner creates PVs using parameters specified in the `StorageClass` object.
### Provisioner interface changes
`struct volume.VolumeOptions` (containing parameters for a provisioner plugin)
will be extended to contain StorageClass.Parameters.
The existing provisioner implementations will be modified to accept the StorageClass configuration object.
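Sketched roughly, the extension could look like this (only the new `Parameters` field is the point of the sketch; all other fields are elided and the comments are assumptions):

```go
package volume

// VolumeOptions carries the inputs a provisioner plugin needs. The new
// Parameters field is copied from the StorageClass selected for the claim.
type VolumeOptions struct {
	// Capacity, access modes, reclaim policy, etc. omitted for brevity.

	// Parameters holds the opaque, class-specific configuration from
	// StorageClass.Parameters (e.g. "type": "ssd", "zone": "us-east-1b").
	Parameters map[string]string
}
```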
### PV Controller Changes
The persistent volume controller will be modified to implement the new
workflow described in this proposal. The changes will be limited to the
`provisionClaimOperation` method, which is responsible for invoking the
provisioner, and to favoring existing volumes before provisioning a new one.
## Examples
### AWS provisioners with distinct QoS
This example shows two storage classes, "aws-fast" and "aws-slow".
```yaml
apiVersion: v1
kind: StorageClass
metadata:
  name: aws-fast
provisioner: kubernetes.io/aws-ebs
parameters:
  zone: us-east-1b
  type: ssd
---
apiVersion: v1
kind: StorageClass
metadata:
  name: aws-slow
provisioner: kubernetes.io/aws-ebs
parameters:
  zone: us-east-1b
  type: spinning
```
# Additional Implementation Details
0. Annotation `volume.alpha.kubernetes.io/storage-class` is used instead of `claim.Spec.Class` and `volume.Spec.Class` during incubation.
1. `claim.Spec.Selector` and `claim.Spec.Class` are mutually exclusive for now (1.4). User can either match existing volumes with `Selector` XOR match existing volumes with `Class` and get dynamic provisioning by using `Class`. This simplifies initial PR and also provisioners. This limitation may be lifted in future releases.
# Cloud Providers
Since the `volume.alpha.kubernetes.io/storage-class` annotation is in use, a `StorageClass` must be defined to support provisioning. Unlike before, no default is assumed.

View File

@@ -1,268 +1 @@
## Abstract This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/volume-selectors.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/volume-selectors.md)
Real Kubernetes clusters have a variety of volumes which differ widely in
size, iops performance, retention policy, and other characteristics. A
mechanism is needed to enable administrators to describe the taxonomy of these
volumes, and for users to make claims on these volumes based on their
attributes within this taxonomy.
A label selector mechanism is proposed to enable flexible selection of volumes
by persistent volume claims.
## Motivation
Currently, users of persistent volumes have the ability to make claims on
those volumes based on some criteria such as the access modes the volume
supports and minimum resources offered by a volume. In an organization, there
are often more complex requirements for the storage volumes needed by
different groups of users. A mechanism is needed to model these different
types of volumes and to allow users to select those different types without
being intimately familiar with their underlying characteristics.
As an example, many cloud providers offer a range of performance
characteristics for storage, with higher performing storage being more
expensive. Cluster administrators want the ability to:
1. Invent a taxonomy of logical storage classes using the attributes
important to them
2. Allow users to make claims on volumes using these attributes
## Constraints and Assumptions
The proposed design should:
1. Deal with manually-created volumes
2. Not necessarily require users to know or understand the differences between
volumes (ie, Kubernetes should not dictate any particular set of
characteristics to administrators to think in terms of)
We will focus **only** on the barest mechanisms to describe and implement
label selectors in this proposal. We will address the following topics in
future proposals:
1. An extension resource or third party resource for storage classes
1. Dynamically provisioning new volumes based on storage class
## Use Cases
1. As a user, I want to be able to make a claim on a persistent volume by
specifying a label selector as well as the currently available attributes
### Use Case: Taxonomy of Persistent Volumes
Kubernetes offers volume types for a variety of storage systems. Within each
of those storage systems, there are numerous ways in which volume instances
may differ from one another: iops performance, retention policy, etc.
Administrators of real clusters typically need to manage a variety of
different volumes with different characteristics for different groups of
users.
Kubernetes should make it possible for administrators to flexibly model the
taxonomy of volumes in their clusters and to label volumes with their storage
class. This capability must be optional and fully backward-compatible with
the existing API.
Let's look at an example. This example is *purely fictitious* and the
taxonomies presented here are not a suggestion of any sort. In the case of
AWS EBS there are four different types of volume (in ascending order of cost):
1. Cold HDD
2. Throughput optimized HDD
3. General purpose SSD
4. Provisioned IOPS SSD
Currently, there is no way to distinguish between a group of 4 PVs where each
volume is of one of these different types. Administrators need the ability to
distinguish between instances of these types. An administrator might decide
to think of these volumes as follows:
1. Cold HDD - `tin`
2. Throughput optimized HDD - `bronze`
3. General purpose SSD - `silver`
4. Provisioned IOPS SSD - `gold`
This is not the only dimension that EBS volumes can differ in. Let's simplify
things and imagine that AWS has two availability zones, `east` and `west`. Our
administrators want to differentiate between volumes of the same type in these
two zones, so they create a taxonomy of volumes like so:
1. `tin-west`
2. `tin-east`
3. `bronze-west`
4. `bronze-east`
5. `silver-west`
6. `silver-east`
7. `gold-west`
8. `gold-east`
Another administrator of the same cluster might label things differently,
choosing to focus on the business role of volumes. Say that the data
warehouse department is the sole consumer of the cold HDD type, and the DB as
a service offering is the sole consumer of provisioned IOPS volumes. The
administrator might decide on the following taxonomy of volumes:
1. `warehouse-east`
2. `warehouse-west`
3. `dbaas-east`
4. `dbaas-west`
There are any number of ways an administrator may choose to distinguish
between volumes. Labels are used in Kubernetes to express the user-defined
properties of API objects and are a good fit to express this information for
volumes. In the examples above, administrators might differentiate between
the classes of volumes using the labels `business-unit`, `volume-type`, or
`region`.
Label selectors are used through the Kubernetes API to describe relationships
between API objects using flexible, user-defined criteria. It makes sense to
use the same mechanism with persistent volumes and storage claims to provide
the same functionality for these API objects.
## Proposed Design
We propose that:
1. A new field called `Selector` be added to the `PersistentVolumeClaimSpec`
type
2. The persistent volume controller be modified to account for this selector
when determining the volume to bind to a claim
### Persistent Volume Selector
Label selectors are used throughout the API to allow users to express
relationships in a flexible manner. The problem of selecting a volume to
match a claim fits perfectly within this metaphor. Adding a label selector to
`PersistentVolumeClaimSpec` will allow users to label their volumes with
criteria important to them and select volumes based on these criteria.
```go
// PersistentVolumeClaimSpec describes the common attributes of storage devices
// and allows a Source for provider-specific attributes
type PersistentVolumeClaimSpec struct {
	// Contains the types of access modes required
	AccessModes []PersistentVolumeAccessMode `json:"accessModes,omitempty"`

	// Selector is a selector which must be true for the claim to bind to a volume
	Selector *unversioned.Selector `json:"selector,omitempty"`

	// Resources represents the minimum resources required
	Resources ResourceRequirements `json:"resources,omitempty"`

	// VolumeName is the binding reference to the PersistentVolume backing this claim
	VolumeName string `json:"volumeName,omitempty"`
}
```
### Labeling volumes
Volumes can already be labeled:
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ebs-pv-1
  labels:
    ebs-volume-type: iops
    aws-availability-zone: us-east-1
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  awsElasticBlockStore:
    volumeID: vol-12345
    fsType: xfs
```
### Controller Changes
At the time of this writing, the various controllers for persistent volumes
are in the process of being refactored into a single controller (see
[kubernetes/24331](https://github.com/kubernetes/kubernetes/pull/24331)).
The resulting controller should be modified to use the new
`selector` field to match a claim to a volume. In order to
match a volume, all criteria must be satisfied; i.e., if a label selector is
specified on a claim, a volume must match both the label selector and any
specified access modes and resource requirements to be considered a match.
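As a rough sketch, the additional check reduces to a label-subset test (simplified here to `matchLabels` only; the real controller would use the generic label-selector machinery and also verify access modes and capacity, and the function name is illustrative):

```go
package persistentvolume

// claimSelectorMatches is an illustrative subset check covering only
// matchLabels; matchExpressions and the other binding criteria are omitted.
func claimSelectorMatches(claimMatchLabels, volumeLabels map[string]string) bool {
	for k, v := range claimMatchLabels {
		if volumeLabels[k] != v {
			return false
		}
	}
	return true
}
```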
## Examples
Let's take a look at a few examples, revisiting the taxonomy of EBS volumes and regions:
Volumes of the different types might be labeled as follows:
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ebs-pv-west
  labels:
    ebs-volume-type: iops-ssd
    aws-availability-zone: us-west-1
spec:
  capacity:
    storage: 150Gi
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  awsElasticBlockStore:
    volumeID: vol-23456
    fsType: xfs
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ebs-pv-east
  labels:
    ebs-volume-type: gp-ssd
    aws-availability-zone: us-east-1
spec:
  capacity:
    storage: 150Gi
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  awsElasticBlockStore:
    volumeID: vol-34567
    fsType: xfs
```
...claims on these volumes would look like:
```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ebs-claim-west
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  selector:
    matchLabels:
      ebs-volume-type: iops-ssd
      aws-availability-zone: us-west-1
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ebs-claim-east
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  selector:
    matchLabels:
      ebs-volume-type: gp-ssd
      aws-availability-zone: us-east-1
```

View File

@@ -1,482 +1 @@
## Abstract This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/volumes.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/volumes.md)
A proposal for sharing volumes between containers in a pod using a special supplemental group.
## Motivation
Kubernetes volumes should be usable regardless of the UID a container runs as. This concern cuts
across all volume types, so the system should be able to handle them in a generalized way to provide
uniform functionality across all volume types and lower the barrier to new plugins.
Goals of this design:
1. Enumerate the different use-cases for volume usage in pods
2. Define the desired goal state for ownership and permission management in Kubernetes
3. Describe the changes necessary to achieve desired state
## Constraints and Assumptions
1. When writing permissions in this proposal, `D` represents a don't-care value; example: `07D0`
represents permissions where the owner has `7` permissions, all has `0` permissions, and group
has a don't-care value
2. Read-write usability of a volume from a container is defined as one of:
1. The volume is owned by the container's effective UID and has permissions `07D0`
2. The volume is owned by the container's effective GID or one of its supplemental groups and
has permissions `0D70`
3. Volume plugins should not have to handle setting permissions on volumes
4. Preventing two containers within a pod from reading and writing to the same volume (by choosing
different container UIDs) is not something we intend to support today
5. We will not design to support multiple processes running in a single container as different
UIDs; use cases that require work by different UIDs should be divided into different pods for
each UID
## Current State Overview
### Kubernetes
Kubernetes volumes can be divided into two broad categories:
1. Unshared storage:
1. Volumes created by the kubelet on the host directory: empty directory, git repo, secret,
downward api. All volumes in this category delegate to `EmptyDir` for their underlying
storage. These volumes are created with ownership `root:root`.
2. Volumes based on network block devices: AWS EBS, iSCSI, RBD, etc, *when used exclusively
by a single pod*.
2. Shared storage:
1. `hostPath` is shared storage because it is necessarily used by a container and the host
2. Network file systems such as NFS, Glusterfs, Cephfs, etc. For these volumes, the ownership
is determined by the configuration of the shared storage system.
3. Block device based volumes in `ReadOnlyMany` or `ReadWriteMany` modes are shared because
they may be used simultaneously by multiple pods.
The `EmptyDir` volume was recently modified to create the volume directory with `0777` permissions
instead of `0750`, to support basic usability of that volume as a non-root UID.
### Docker
Docker recently added supplemental group support. This adds the ability to specify additional
groups that a container should be part of, and will be released with Docker 1.8.
There is a [proposal](https://github.com/docker/docker/pull/14632) to add a bind-mount flag to tell
Docker to change the ownership of a volume to the effective UID and GID of a container, but this has
not yet been accepted.
### rkt
rkt
[image manifests](https://github.com/appc/spec/blob/master/spec/aci.md#image-manifest-schema) can
specify users and groups, similarly to how a Docker image can. A rkt
[pod manifest](https://github.com/appc/spec/blob/master/spec/pods.md#pod-manifest-schema) can also
override the default user and group specified by the image manifest.
rkt does not currently support supplemental groups or changing the owning UID or
group of a volume, but it has been [requested](https://github.com/coreos/rkt/issues/1309).
## Use Cases
1. As a user, I want the system to set ownership and permissions on volumes correctly to enable
reads and writes with the following scenarios:
1. All containers running as root
2. All containers running as the same non-root user
3. Multiple containers running as a mix of root and non-root users
### All containers running as root
For volumes that only need to be used by root, no action needs to be taken to change ownership or
permissions, but setting the ownership based on the supplemental group shared by all containers in a
pod will also work. For situations where read-only access to a shared volume is required from one
or more containers, the `VolumeMount`s in those containers should have the `readOnly` field set.
### All containers running as a single non-root user
In use cases where a volume is used by a single non-root UID, the volume ownership and permissions
should be set to enable read/write access.
Currently, a non-root UID will not have permissions to write to any but an `EmptyDir` volume.
Today, users that need this case to work can:
1. Grant the container the necessary capabilities to `chown` and `chmod` the volume:
- `CAP_FOWNER`
- `CAP_CHOWN`
- `CAP_DAC_OVERRIDE`
2. Run a wrapper script that runs `chown` and `chmod` commands to set the desired ownership and
permissions on the volume before starting their main process
This workaround has significant drawbacks:
1. It grants powerful kernel capabilities to the code in the image and thus is insecure,
defeating the reason containers are run as non-root users
2. The user experience is poor; it requires changing the Dockerfile, adding a layer, or modifying the
container's command
Some cluster operators manage the ownership of shared storage volumes on the server side.
In this scenario, the UID of the container using the volume is known in advance. The ownership of
the volume is set to match the container's UID on the server side.
### Containers running as a mix of root and non-root users
If the list of UIDs that need to use a volume includes both root and non-root users, supplemental
groups can be applied to enable sharing volumes between containers. The ownership and permissions
`root:<supplemental group> 2770` will make a volume usable from both containers running as root and
running as a non-root UID and the supplemental group. The setgid bit is used to ensure that files
created in the volume will inherit the owning GID of the volume.
## Community Design Discussion
- [kubernetes/2630](https://github.com/kubernetes/kubernetes/issues/2630)
- [kubernetes/11319](https://github.com/kubernetes/kubernetes/issues/11319)
- [kubernetes/9384](https://github.com/kubernetes/kubernetes/pull/9384)
## Analysis
The system needs to be able to:
1. Model correctly which volumes require ownership management
1. Determine the correct ownership of each volume in a pod if required
1. Set the ownership and permissions on volumes when required
### Modeling whether a volume requires ownership management
#### Unshared storage: volumes derived from `EmptyDir`
Since Kubernetes creates `EmptyDir` volumes, it should ensure the ownership is set to enable the
volumes to be usable for all of the above scenarios.
#### Unshared storage: network block devices
Volume plugins based on network block devices such as AWS EBS and RBD can be treated the same way
as local volumes. Since inodes are written to these block devices in the same way as `EmptyDir`
volumes, permissions and ownership can be managed on the client side by the Kubelet when used
exclusively by one pod. When the volumes are used outside of a persistent volume, or with the
`ReadWriteOnce` mode, they are effectively unshared storage.
When used by multiple pods, there are many additional use-cases to analyze before we can be
confident that we can support ownership management robustly with these file systems. The right
design is one that makes it easy to experiment and develop support for ownership management with
volume plugins to enable developers and cluster operators to continue exploring these issues.
#### Shared storage: hostPath
The `hostPath` volume should only be used by effective-root users, and the permissions of paths
exposed into containers via hostPath volumes should always be managed by the cluster operator. If
the Kubelet managed the ownership for `hostPath` volumes, a user who could create a `hostPath`
volume could effect changes in the state of arbitrary paths within the host's filesystem. This
would be a severe security risk, so we will consider hostPath a corner case that the kubelet should
never perform ownership management for.
#### Shared storage
Ownership management of shared storage is a complex topic. Ownership for existing shared storage
will be managed externally from Kubernetes. For this case, our API should make it simple to express
whether a particular volume should have these concerns managed by Kubernetes.
We will not attempt to address the ownership and permissions concerns of new shared storage
in this proposal.
When a network block device is used as a persistent volume in `ReadWriteMany` or `ReadOnlyMany`
modes, it is shared storage, and thus outside the scope of this proposal.
#### Plugin API requirements
From the above, we know that some volume plugins will want ownership management from the Kubelet
and others will not. Plugins should be able to opt in to ownership management by the Kubelet. To
facilitate this, a method should be added to the volume plugin API (the `volume.Mounter` interface,
detailed below) that the Kubelet uses to determine whether to perform ownership management for a
volume.
### Determining correct ownership of a volume
Using a pod-level supplemental group to own volumes solves the problem for any combination of UIDs
and GIDs within a pod. Since this is the simplest approach that handles all use-cases, our solution
is built in terms of it.
Eventually, Kubernetes should allocate a unique group for each pod so that a pod's volumes are
usable by that pod's containers, but not by the containers of another pod. The supplemental group
used to share volumes must be unique across a multitenant cluster: if uniqueness were enforced only
at the host level, pods on one host could use shared filesystems meant for pods on another host.
Eventually, Kubernetes should integrate with external identity management systems to populate pod
specs with the right supplemental groups necessary to use shared volumes. In the interim until the
identity management story is far enough along to implement this type of integration, we will rely
on being able to set arbitrary groups. (Note: as of this writing, a PR is being prepared for
setting arbitrary supplemental groups).
An admission controller could handle allocating groups for each pod and setting the group in the
pod's security context.
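To illustrate the allocation step, a minimal sketch is shown below; `GroupAllocator` is hypothetical, the admission plugin wiring is omitted, and the sketch assumes the `PodSecurityContext.FSGroup` field proposed in this document:
```go
package fsgroupadmission

import "k8s.io/kubernetes/pkg/api"

// GroupAllocator is a hypothetical source of cluster-unique GIDs.
type GroupAllocator interface {
	AllocateNext() (int64, error)
}

// admitFSGroup assigns a supplemental group to a pod that does not already
// have one; an FSGroup set by the user is left untouched.
func admitFSGroup(pod *api.Pod, allocator GroupAllocator) error {
	if pod.Spec.SecurityContext == nil {
		pod.Spec.SecurityContext = &api.PodSecurityContext{}
	}
	if pod.Spec.SecurityContext.FSGroup != nil {
		return nil
	}
	gid, err := allocator.AllocateNext()
	if err != nil {
		return err
	}
	pod.Spec.SecurityContext.FSGroup = &gid
	return nil
}
```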
#### A note on the root group
Today, by default, all docker containers run in the root group (GID 0). Image authors who build
images to run under a range of UIDs rely on this: they set the group ownership of important paths to
the root group so that containers running with GID 0 *and* an arbitrary UID can read and write to
those paths normally.
It is important to note that the changes proposed here will not affect the primary GID of
containers in pods. Setting the `pod.Spec.SecurityContext.FSGroup` field will not
override the primary GID and should be safe to use in images that expect GID 0.
### Setting ownership and permissions on volumes
For `EmptyDir`-based volumes and unshared storage, `chown` and `chmod` on the node are sufficient to
set ownership and permissions. Shared storage is different because:
1. Shared storage may not live on the node a pod that uses it runs on
2. Shared storage may be externally managed
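For unshared storage that is mounted locally, such a pass could be as simple as the following sketch; the helper, the filesystem walk, and the chosen modes are illustrative assumptions rather than part of the proposal:
```go
package volumeutil

import (
	"os"
	"path/filepath"
)

// setVolumeGroup recursively hands ownership of a locally mounted, unshared
// volume to the given group, leaving the owning UID untouched.
func setVolumeGroup(volumePath string, gid int) error {
	return filepath.Walk(volumePath, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		// -1 leaves the owning UID unchanged; only the group is set.
		if err := os.Lchown(path, -1, gid); err != nil {
			return err
		}
		if info.IsDir() {
			// Directories get the setgid bit so new files inherit the group.
			return os.Chmod(path, os.FileMode(0770)|os.ModeSetgid)
		}
		return os.Chmod(path, 0660)
	})
}
```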
## Proposed design
Our design should minimize the ownership-handling code required in the Kubelet and volume plugins.
### API changes
We should not interfere with images that need to run as a particular UID or primary GID. A pod-level
supplemental group lets us express a group that all containers in a pod run with, in a way that is
orthogonal to the primary UID and GID of each container process.
```go
package api

type PodSecurityContext struct {
	// FSGroup is a supplemental group that all containers in a pod run under. This group will own
	// volumes that the Kubelet manages ownership for. If this is not specified, the Kubelet will
	// not set the group ownership of any volumes.
	FSGroup *int64 `json:"fsGroup,omitempty"`
}
```
The v1 API will be extended with the same field:
```go
package v1

type PodSecurityContext struct {
	// FSGroup is a supplemental group that all containers in a pod run under. This group will own
	// volumes that the Kubelet manages ownership for. If this is not specified, the Kubelet will
	// not set the group ownership of any volumes.
	FSGroup *int64 `json:"fsGroup,omitempty"`
}
```
The values that can be specified for the `pod.Spec.SecurityContext.FSGroup` field are governed by
[pod security policy](https://github.com/kubernetes/kubernetes/pull/7893).
#### API backward compatibility
Pods created by old clients will have the `pod.Spec.SecurityContext.FSGroup` field unset;
these pods will not have their volumes managed by the Kubelet. Old clients will not be able to set
or read the `pod.Spec.SecurityContext.FSGroup` field.
### Volume changes
The `volume.Mounter` interface should have a new method added that indicates whether the plugin
supports ownership management:
```go
package volume

type Mounter interface {
	// other methods omitted

	// SupportsOwnershipManagement indicates that this volume supports having ownership
	// and permissions managed by the Kubelet; if true, the caller may manipulate UID
	// or GID of this volume.
	SupportsOwnershipManagement() bool
}
```
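For example, a plugin that opts in simply returns `true` from its mounter, and one that opts out returns `false`; the types below are hypothetical stand-ins for the real plugin implementations:
```go
package volume

// emptyDirMounter is a hypothetical stand-in for the emptyDir plugin's mounter.
type emptyDirMounter struct{ /* other fields omitted */ }

// emptyDir volumes are created by the Kubelet, so managing their ownership is safe.
func (m *emptyDirMounter) SupportsOwnershipManagement() bool { return true }

// hostPathMounter is a hypothetical stand-in for the hostPath plugin's mounter.
type hostPathMounter struct{ /* other fields omitted */ }

// hostPath exposes arbitrary host paths; the Kubelet must never change their ownership.
func (m *hostPathMounter) SupportsOwnershipManagement() bool { return false }
```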
In the first round of work, only `hostPath`, `emptyDir`, and the plugins derived from `emptyDir` will
be tested for ownership management support:
| Plugin Name | SupportsOwnershipManagement |
|-------------------------|-------------------------------|
| `hostPath` | false |
| `emptyDir` | true |
| `gitRepo` | true |
| `secret` | true |
| `downwardAPI` | true |
| `gcePersistentDisk` | false |
| `awsElasticBlockStore` | false |
| `nfs` | false |
| `iscsi` | false |
| `glusterfs` | false |
| `persistentVolumeClaim` | depends on underlying volume and PV mode |
| `rbd` | false |
| `cinder` | false |
| `cephfs` | false |
Ultimately, the matrix is expected to look like:
| Plugin Name | SupportsOwnershipManagement |
|-------------------------|-------------------------------|
| `hostPath` | false |
| `emptyDir` | true |
| `gitRepo` | true |
| `secret` | true |
| `downwardAPI` | true |
| `gcePersistentDisk` | true |
| `awsElasticBlockStore` | true |
| `nfs` | false |
| `iscsi` | true |
| `glusterfs` | false |
| `persistentVolumeClaim` | depends on underlying volume and PV mode |
| `rbd` | true |
| `cinder` | false |
| `cephfs` | false |
### Kubelet changes
The Kubelet should be modified to perform ownership and label management when required for a volume.
For ownership management the criteria are:
1. The `pod.Spec.SecurityContext.FSGroup` field is populated
2. The volume mounter returns `true` from `SupportsOwnershipManagement`
Logic should be added to the `mountExternalVolumes` method that runs a local `chgrp` and `chmod` if
the pod-level supplemental group is set and the volume supports ownership management:
```go
package kubelet

type ChgrpRunner interface {
	Chgrp(path string, gid int) error
}

type ChmodRunner interface {
	Chmod(path string, mode os.FileMode) error
}

type Kubelet struct {
	// fields not related to volume ownership management are omitted
	chgrpRunner ChgrpRunner
	chmodRunner ChmodRunner
}

func (kl *Kubelet) mountExternalVolumes(pod *api.Pod) (kubecontainer.VolumeMap, error) {
	// FSGroup is a *int64; ownership is only managed when the field is set.
	var podFSGroup *int64
	if pod.Spec.SecurityContext != nil {
		podFSGroup = pod.Spec.SecurityContext.FSGroup
	}
	podFSGroupSet := podFSGroup != nil

	podVolumes := make(kubecontainer.VolumeMap)
	for i := range pod.Spec.Volumes {
		volSpec := &pod.Spec.Volumes[i]
		rootContext, err := kl.getRootDirContext()
		if err != nil {
			return nil, err
		}

		// Try to use a plugin for this volume.
		internal := volume.NewSpecFromVolume(volSpec)
		mounter, err := kl.newVolumeMounterFromPlugins(internal, pod, volume.VolumeOptions{RootContext: rootContext}, kl.mounter)
		if err != nil {
			glog.Errorf("Could not create volume mounter for pod %s: %v", pod.UID, err)
			return nil, err
		}
		if mounter == nil {
			return nil, errUnsupportedVolumeType
		}
		err = mounter.SetUp()
		if err != nil {
			return nil, err
		}

		if mounter.SupportsOwnershipManagement() && podFSGroupSet {
			err = kl.chgrpRunner.Chgrp(mounter.GetPath(), int(*podFSGroup))
			if err != nil {
				return nil, err
			}
			// 0770 plus the setgid bit: the FSGroup can use the volume, and
			// files created in it inherit the FSGroup.
			err = kl.chmodRunner.Chmod(mounter.GetPath(), os.FileMode(0770)|os.ModeSetgid)
			if err != nil {
				return nil, err
			}
		}

		podVolumes[volSpec.Name] = mounter
	}

	return podVolumes, nil
}
```
This allows the volume plugins to determine when they do and don't want this type of support from
the Kubelet, and allows the criteria each plugin uses to evolve without changing the Kubelet.
The docker runtime will be modified to set the supplemental group of each container based on the
`pod.Spec.SecurityContext.FSGroup` field. Theoretically, the `rkt` runtime could support this
feature in a similar way.
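A rough sketch of the runtime-side change is shown below; the `hostConfig` type is a simplified stand-in for the docker host configuration (which carries a list of additional groups for the container process), not the real client type:
```go
package dockertools

import "strconv"

// hostConfig is a simplified stand-in for the docker host configuration; the
// real type has many more fields. GroupAdd lists supplemental groups to apply
// to the container process.
type hostConfig struct {
	GroupAdd []string
}

// applyFSGroup appends the pod-level FSGroup as a supplemental group of the
// container, leaving the container's primary UID and GID untouched.
func applyFSGroup(fsGroup *int64, hc *hostConfig) {
	if fsGroup == nil {
		return
	}
	hc.GroupAdd = append(hc.GroupAdd, strconv.FormatInt(*fsGroup, 10))
}
```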
### Examples
#### EmptyDir
For a pod that has two containers sharing an `EmptyDir` volume:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  securityContext:
    fsGroup: 1001
  containers:
  - name: a
    securityContext:
      runAsUser: 1009
    volumeMounts:
    - mountPath: "/example/emptydir/a"
      name: empty-vol
  - name: b
    securityContext:
      runAsUser: 1010
    volumeMounts:
    - mountPath: "/example/emptydir/b"
      name: empty-vol
  volumes:
  - name: empty-vol
    emptyDir: {}
```
When the Kubelet runs this pod, the `empty-vol` volume will have ownership `root:1001` and permissions
`2770` (the setgid bit is set so that files created in the volume inherit the group). It will be usable
from both containers `a` and `b`.
#### HostPath
For a pod that uses a `hostPath` volume with containers running as different UIDs:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  securityContext:
    fsGroup: 1001
  containers:
  - name: a
    securityContext:
      runAsUser: 1009
    volumeMounts:
    - mountPath: "/example/hostpath/a"
      name: host-vol
  - name: b
    securityContext:
      runAsUser: 1010
    volumeMounts:
    - mountPath: "/example/hostpath/b"
      name: host-vol
  volumes:
  - name: host-vol
    hostPath:
      path: "/tmp/example-pod"
```
The cluster operator would need to manually `chgrp` and `chmod` `/tmp/example-pod` on the host in
order for the volume to be usable from the pod.