address issue #1488; clean up linewrap and some minor editing issues in the docs/design/* tree
Signed-off-by: mikebrow <brownwm@us.ibm.com>
commit 6bdc0bfdb7 (parent 4638f2f355)
@ -34,19 +34,59 @@ Documentation for other releases can be found at
# Kubernetes Design Overview

Kubernetes is a system for managing containerized applications across multiple hosts, providing basic mechanisms for deployment, maintenance, and scaling of applications.

Kubernetes establishes robust declarative primitives for maintaining the desired state requested by the user. We see these primitives as the main value added by Kubernetes. Self-healing mechanisms, such as auto-restarting, re-scheduling, and replicating containers, require active controllers, not just imperative orchestration.

Kubernetes is primarily targeted at applications composed of multiple containers, such as elastic, distributed micro-services. It is also designed to facilitate migration of non-containerized application stacks to Kubernetes. It therefore includes abstractions for grouping containers in both loosely coupled and tightly coupled formations, and provides ways for containers to find and communicate with each other in relatively familiar ways.

Kubernetes enables users to ask a cluster to run a set of containers. The system automatically chooses hosts to run those containers on. While Kubernetes's scheduler is currently very simple, we expect it to grow in sophistication over time. Scheduling is a policy-rich, topology-aware, workload-specific function that significantly impacts availability, performance, and capacity. The scheduler needs to take into account individual and collective resource requirements, quality of service requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference, deadlines, and so on. Workload-specific requirements will be exposed through the API as necessary.

Kubernetes is intended to run on a number of cloud providers, as well as on physical hosts.

A single Kubernetes cluster is not intended to span multiple availability zones. Instead, we recommend building a higher-level layer to replicate complete deployments of highly available applications across multiple zones (see [the multi-cluster doc](../admin/multi-cluster.md) and the [cluster federation proposal](../proposals/federation.md) for more details).

Finally, Kubernetes aspires to be an extensible, pluggable, building-block OSS platform and toolkit. Therefore, architecturally, we want Kubernetes to be built as a collection of pluggable components and layers, with the ability to use alternative schedulers, controllers, storage systems, and distribution mechanisms, and we're evolving its current code in that direction. Furthermore, we want others to be able to extend Kubernetes functionality, such as with higher-level PaaS functionality or multi-cluster layers, without modification of core Kubernetes source. Therefore, its API isn't just (or even necessarily mainly) targeted at end users, but at tool and extension developers. Its APIs are intended to serve as the foundation for an open ecosystem of tools, automation systems, and higher-level API layers. Consequently, there are no "internal" inter-component APIs. All APIs are visible and available, including the APIs used by the scheduler, the node controller, the replication-controller manager, Kubelet's API, etc. There's no glass to break -- in order to handle more complex use cases, one can just access the lower-level APIs in a fully transparent, composable manner.

For more about the Kubernetes architecture, see [architecture](architecture.md).
@ -34,23 +34,30 @@ Documentation for other releases can be found at
# K8s Identity and Access Management Sketch

This document suggests a direction for identity and access management in the Kubernetes system.


## Background

High level goals are:
- Have a plan for how identity, authentication, and authorization will fit into the API.
- Have a plan for partitioning resources within a cluster between independent organizational units.
- Ease integration with existing enterprise and hosted scenarios.

### Actors

Each of these can act as normal users or attackers.
- External Users: People who are accessing applications running on K8s (e.g. a web site served by a webserver running in a container on K8s), but who do not have K8s API access.
- K8s Users: People who access the K8s API (e.g. create K8s API objects like Pods).
- K8s Project Admins: People who manage access for some K8s Users.
- K8s Cluster Admins: People who control the machines, networks, or binaries that make up a K8s cluster.
- K8s Admin means K8s Cluster Admins and K8s Project Admins taken together.

### Threats
@ -58,22 +65,31 @@ Each of these can act as normal users or attackers.
Both intentional attacks and accidental use of privilege are concerns.

For both cases it may be useful to think about these categories differently:
- Application Path - attack by sending network messages from the internet to the IP/port of any application running on K8s. May exploit weakness in application or misconfiguration of K8s.
- K8s API Path - attack by sending network messages to any K8s API endpoint.
- Insider Path - attack on K8s system components. Attacker may have privileged access to networks, machines or K8s software and data. Software errors in K8s system components and administrator error are some types of threat in this category.

This document is primarily concerned with K8s API paths, and secondarily with Insider paths. The Application path also needs to be secure, but is not the focus of this document.

### Assets to protect

External User assets:
- Personal information like private messages, or images uploaded by External Users.
- Web server logs.

K8s User assets:
- External User assets of each K8s User.
- Things private to the K8s app, like:
  - credentials for accessing other services (docker private repos, storage services, facebook, etc)
  - SSL certificates for web servers
  - proprietary data and code
@ -82,38 +98,51 @@ K8s Cluster assets:
- Machine Certificates or secrets.
- The value of K8s cluster computing resources (cpu, memory, etc).

This document is primarily about protecting K8s User assets and K8s cluster assets from other K8s Users and K8s Project and Cluster Admins.

### Usage environments

Cluster in Small organization:
- K8s Admins may be the same people as K8s Users.
- Few K8s Admins.
- Prefer ease of use to fine-grained access control/precise accounting, etc.
- Product requirement that it be easy for a potential K8s Cluster Admin to try out setting up a simple cluster.

Cluster in Large organization:
- K8s Admins typically distinct people from K8s Users. May need to divide K8s Cluster Admin access by roles.
- K8s Users need to be protected from each other.
- Auditing of K8s User and K8s Admin actions important.
- Flexible accurate usage accounting and resource controls important.
- Lots of automated access to APIs.
- Need to integrate with existing enterprise directory, authentication, accounting, auditing, and security policy infrastructure.

Org-run cluster:
- Organization that runs K8s master components is same as the org that runs apps on K8s.
- Nodes may be on-premises VMs or physical machines; Cloud VMs; or a mix.

Hosted cluster:
- Offering K8s API as a service, or offering a PaaS or SaaS built on K8s.
- May already offer web services, and need to integrate with existing customer account concept, and existing authentication, accounting, auditing, and security policy infrastructure.
- May want to leverage K8s User accounts and accounting to manage their User accounts (not a priority to support this use case).
- Precise and accurate accounting of resources needed. Resource controls needed for hard limits (Users given limited slice of data) and soft limits (Users can grow up to some limit and then be expanded).

K8s ecosystem services:
- There may be companies that want to offer their existing services (Build, CI, A/B-test, release automation, etc) for use with K8s. There should be some story for this case.

Pods configs should be largely portable between Org-run and hosted configurations.


# Design
@ -123,65 +152,99 @@ Related discussion:
- http://issue.k8s.io/443

This doc describes two security profiles:
- Simple profile: like single-user mode. Make it easy to evaluate K8s without lots of configuring accounts and policies. Protects from unauthorized users, but does not partition authorized users.
- Enterprise profile: Provide mechanisms needed for large numbers of users. Defense in depth. Should integrate with existing enterprise security infrastructure.

K8s distribution should include templates of config, and documentation, for simple and enterprise profiles. System should be flexible enough for knowledgeable users to create intermediate profiles, but K8s developers should only reason about those two profiles, not a matrix.

Features in this doc are divided into "Initial Features" and "Improvements". Initial features would be candidates for version 1.00.

## Identity

### userAccount

K8s will have a `userAccount` API object (a rough sketch follows the list below).
- `userAccount` has a UID which is immutable. This is used to associate users with objects and to record actions in audit logs.
- `userAccount` has a name which is a string, human readable, and unique among userAccounts. It is used to refer to users in Policies, to ensure that the Policies are human readable. It can be changed only when there are no Policy objects or other objects which refer to that name. An email address is a suggested format for this field.
- `userAccount` is not related to the unix username of processes in Pods created by that userAccount.
- `userAccount` API objects can have labels.
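To make the shape of this object concrete, here is a minimal sketch of a `userAccount`-like type reflecting the properties listed above; the field names, the `Labels` map, and the example values are illustrative assumptions, not the actual K8s API.

```go
package main

import "fmt"

// userAccount is an illustrative sketch of the object described above:
// an immutable UID for audit logging, a human-readable unique name
// (an email address is a suggested format), optional labels, and a
// default namespace assumed when an API call does not specify one.
type userAccount struct {
	UID              string            // immutable; used in audit logs and object ownership
	Name             string            // human readable, unique, referenced by Policies
	Labels           map[string]string // e.g. group membership or roles
	DefaultNamespace string            // assumed when a call omits a namespace
}

func main() {
	u := userAccount{
		UID:              "7d1f9b2c",
		Name:             "alice@example.com",
		Labels:           map[string]string{"group": "payments", "role": "project-admin"},
		DefaultNamespace: "payments-dev",
	}
	fmt.Printf("%+v\n", u)
}
```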
The system may associate one or more Authentication Methods with a `userAccount` (but they are not formally part of the userAccount object). In a simple deployment, the authentication method for a user might be an authentication token which is verified by a K8s server. In a more complex deployment, the authentication might be delegated to another system which is trusted by the K8s API to authenticate users, but where the authentication details are unknown to K8s.

Initial Features:
- There is no superuser `userAccount`.
- `userAccount` objects are statically populated in the K8s API store by reading a config file. Only a K8s Cluster Admin can do this.
- `userAccount` can have a default `namespace`. If an API call does not specify a `namespace`, the default `namespace` for that caller is assumed.
- `userAccount` is global. A single human with access to multiple namespaces is recommended to only have one userAccount.

Improvements:
- Make `userAccount` part of a separate API group from core K8s objects like `pod`. Facilitates plugging in alternate Access Management.

Simple Profile:
- Single `userAccount`, used by all K8s Users and Project Admins. One access token shared by all.

Enterprise Profile:
- Every human user has their own `userAccount`.
- `userAccount`s have labels that indicate both membership in groups, and ability to act in certain roles.
- Each service using the API has its own `userAccount` too (e.g. `scheduler`, `repcontroller`).
- Automated jobs to denormalize the ldap group info into the local system list of users into the K8s userAccount file.

### Unix accounts

A `userAccount` is not a Unix user account. The fact that a pod is started by a `userAccount` does not mean that the processes in that pod's containers run as a Unix user with a corresponding name or identity.

Initially:
- The unix accounts available in a container, and used by the processes running in a container, are those that are provided by the combination of the base operating system and the Docker manifest.
- Kubernetes doesn't enforce any relation between `userAccount` and unix accounts.

Improvements:
- Kubelet allocates disjoint blocks of root-namespace uids for each container. This may provide some defense-in-depth against container escapes. (https://github.com/docker/docker/pull/4572)
  - requires docker to integrate user namespace support, and deciding what getpwnam() does for these uids.
- Any features that help users avoid use of privileged containers (http://issue.k8s.io/391).

### Namespaces

K8s will have a `namespace` API object. It is similar to a Google Compute Engine `project`. It provides a namespace for objects created by a group of people co-operating together, preventing name collisions with non-cooperating groups. It also serves as a reference point for authorization policies.

Namespaces are described in [namespaces.md](namespaces.md).
@ -192,20 +255,36 @@ In the Simple Profile:
- There is a single `namespace` used by the single user.

Namespaces versus userAccount versus Labels:
- `userAccount`s are intended for audit logging (both name and UID should be logged), and to define who has access to `namespace`s.
- `labels` (see [docs/user-guide/labels.md](../../docs/user-guide/labels.md)) should be used to distinguish pods, users, and other objects that cooperate towards a common goal but are different in some way, such as version, or responsibilities.
- `namespace`s prevent name collisions between uncoordinated groups of people, and provide a place to attach common policies for co-operating groups of people.


## Authentication

Goals for K8s authentication:
- Include a built-in authentication system with no configuration required to use in single-user mode, and little configuration required to add several user accounts, and no https proxy required.
- Allow for authentication to be handled by a system external to Kubernetes, to allow integration with existing enterprise authorization systems. The Kubernetes namespace itself should avoid taking contributions of multiple authorization schemes. Instead, a trusted proxy in front of the apiserver can be used to authenticate users.
  - For organizations whose security requirements only allow FIPS compliant implementations (e.g. apache) for authentication.
  - So the proxy can terminate SSL, and isolate the CA-signed certificate from less trusted, higher-touch APIserver.
  - For organizations that already have existing SaaS web services (e.g. storage, VMs) and want a common authentication portal.
- Avoid mixing authentication and authorization, so that authorization policies can be centrally managed, and to allow changes in authentication methods without affecting authorization code.

Initially:
- Tokens used to authenticate a user.
@ -213,9 +292,12 @@ Initially:
- Administrator utility generates tokens at cluster setup.
- OAuth2.0 Bearer tokens protocol, http://tools.ietf.org/html/rfc6750
- No scopes for tokens. Authorization happens in the API server.
- Tokens dynamically generated by apiserver to identify pods which are making API calls.
- Tokens checked in a module of the APIserver (a minimal sketch of such a check follows this list).
- Authentication in apiserver can be disabled by flag, to allow testing without authorization enabled, and to allow use of an authenticating proxy. In this mode, a query parameter or header added by the proxy will identify the caller.
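A minimal sketch of such a token-checking module, assuming a static token-to-user table; `tokenAuthenticator` and its method are hypothetical names, not the apiserver's real types.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// tokenAuthenticator is an illustrative stand-in for an apiserver
// authentication module: it maps static bearer tokens to userAccount names.
type tokenAuthenticator struct {
	tokens map[string]string // token -> userAccount name
}

// authenticate extracts an OAuth2.0 Bearer token (RFC 6750) from the request
// and returns the associated userAccount name, or an error if the token is
// missing or unknown.
func (a *tokenAuthenticator) authenticate(req *http.Request) (string, error) {
	header := req.Header.Get("Authorization")
	if !strings.HasPrefix(header, "Bearer ") {
		return "", fmt.Errorf("no bearer token presented")
	}
	token := strings.TrimPrefix(header, "Bearer ")
	user, ok := a.tokens[token]
	if !ok {
		return "", fmt.Errorf("unrecognized token")
	}
	return user, nil
}

func main() {
	auth := &tokenAuthenticator{tokens: map[string]string{"abc123": "alice@example.com"}}
	req, _ := http.NewRequest("GET", "/api/v1/pods", nil)
	req.Header.Set("Authorization", "Bearer abc123")
	user, err := auth.authenticate(req)
	fmt.Println(user, err) // alice@example.com <nil>
}
```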
Improvements:
- Refresh of tokens.
@ -228,54 +310,86 @@ To be considered for subsequent versions:
- http://www.ietf.org/proceedings/90/slides/slides-90-uta-0.pdf
- http://www.browserauth.net


## Authorization

K8s authorization should:
- Allow for a range of maturity levels, from single-user for those test driving the system, to integration with existing enterprise authorization systems.
- Allow for centralized management of users and policies. In some organizations, this will mean that the definition of users and access policies needs to reside on a system other than k8s and encompass other web services (such as a storage service).
- Allow processes running in K8s Pods to take on identity, and to allow narrow scoping of permissions for those identities in order to limit damage from software faults.
- Have Authorization Policies exposed as API objects so that a single config file can create or delete Pods, Replication Controllers, Services, and the identities and policies for those Pods and Replication Controllers.
- Be separate as much as practical from Authentication, to allow Authentication methods to change over time and space, without impacting Authorization policies.

K8s will implement a relatively simple [Attribute-Based Access Control](http://en.wikipedia.org/wiki/Attribute_Based_Access_Control) model. The model will be described in more detail in a forthcoming document. The model will:
- Be less complex than XACML.
- Be easily recognizable to those familiar with Amazon IAM Policies.
- Have a subset/aliases/defaults which allow it to be used in a way comfortable to those users more familiar with Role-Based Access Control.

Authorization policy is set by creating a set of Policy objects.
The API Server will be the Enforcement Point for Policy. For each API call that it receives, it will construct the Attributes needed to evaluate the policy (what user is making the call, what resource they are accessing, what they are trying to do to that resource, etc.) and pass those attributes to a Decision Point. The Decision Point code evaluates the Attributes against all the Policies and allows or denies the API call. The system will be modular enough that the Decision Point code can either be linked into the APIserver binary, or be another service that the apiserver calls for each Decision (with appropriate time-limited caching as needed for performance).

Policy objects may be applicable only to a single namespace or to all namespaces; K8s Project Admins would be able to create those as needed. Other Policy objects may be applicable to all namespaces; a K8s Cluster Admin might create those in order to authorize a new type of controller to be used by all namespaces, or to make a K8s User into a K8s Project Admin.
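As a rough illustration of the flow described above (not the actual apiserver code), the Decision Point can be modeled as an interface that the Enforcement Point calls with the request Attributes; all type and field names here are hypothetical.

```go
package main

import "fmt"

// attributes captures what the API server would extract from a request
// before asking the Decision Point for a verdict.
type attributes struct {
	User      string // which userAccount is making the call
	Namespace string // which namespace the resource lives in
	Resource  string // what kind of resource is being accessed
	Verb      string // what the caller is trying to do (get, create, delete, ...)
}

// decisionPoint evaluates attributes against the configured Policy objects.
// It could be linked into the apiserver binary or be a separate service.
type decisionPoint interface {
	Authorize(a attributes) (allowed bool, reason string)
}

// allowNamespaceAdmins is a toy policy evaluator: it allows any action by a
// user inside a namespace that is listed for that user.
type allowNamespaceAdmins struct {
	grants map[string][]string // user -> namespaces they may act in
}

func (p *allowNamespaceAdmins) Authorize(a attributes) (bool, string) {
	for _, ns := range p.grants[a.User] {
		if ns == a.Namespace {
			return true, ""
		}
	}
	return false, fmt.Sprintf("user %q has no policy for namespace %q", a.User, a.Namespace)
}

func main() {
	var dp decisionPoint = &allowNamespaceAdmins{
		grants: map[string][]string{"alice@example.com": {"payments-dev"}},
	}
	ok, reason := dp.Authorize(attributes{
		User: "alice@example.com", Namespace: "payments-dev", Resource: "pods", Verb: "create",
	})
	fmt.Println(ok, reason)
}
```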
## Accounting

The API should have a `quota` concept (see http://issue.k8s.io/442). A quota object relates a namespace (and optionally a label selector) to a maximum quantity of resources that may be used (see [resources design doc](resources.md)).
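A minimal sketch of the `quota` idea just described, relating a namespace (and an optional label selector) to hard per-resource maximums; the field names and units are illustrative assumptions, not the final API.

```go
package main

import "fmt"

// quota is an illustrative sketch: it ties a namespace (and optionally a
// label selector) to the maximum quantity of each resource that may be used.
type quota struct {
	Namespace     string
	LabelSelector map[string]string // optional; empty means "all objects in the namespace"
	Hard          map[string]int64  // e.g. "cpu" in millicores, "pods" as a count
}

// allows reports whether adding `request` units of `resource` on top of the
// currently tracked `used` amount stays within the quota.
func (q quota) allows(resource string, used, request int64) bool {
	max, tracked := q.Hard[resource]
	if !tracked {
		return true // resources without a hard limit are not constrained by this quota
	}
	return used+request <= max
}

func main() {
	q := quota{
		Namespace: "payments-dev",
		Hard:      map[string]int64{"cpu": 10000, "pods": 50}, // 10 cores, 50 pods
	}
	fmt.Println(q.allows("cpu", 9500, 250)) // true: 9750m <= 10000m
	fmt.Println(q.allows("pods", 50, 1))    // false: quota exhausted
}
```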
Initially:
- A `quota` object is immutable.
- For hosted K8s systems that do billing, Project is the recommended level for billing accounts.
- Every object that consumes resources should have a `namespace` so that resource usage stats are roll-up-able to `namespace`.
- K8s Cluster Admin sets quota objects by writing a config file.

Improvements:
- Allow one namespace to charge the quota for one or more other namespaces. This would be controlled by a policy which allows changing a billing_namespace= label on an object.
- Allow quota to be set by namespace owners for (namespace x label) combinations (e.g. let the "webserver" namespace use 100 cores, but to prevent accidents, don't allow "webserver" namespace and "instance=test" to use more than 10 cores).
- Tools to help write consistent quota config files based on number of nodes, historical namespace usages, QoS needs, etc.
- Way for K8s Cluster Admin to incrementally adjust Quota objects.

Simple profile:
- A single `namespace` with infinite resource limits.

Enterprise profile:
- Multiple namespaces each with their own limits.

Issues:
- Need for locking or "eventual consistency" when multiple apiserver goroutines are accessing the object store and handling pod creations.


## Audit Logging
@ -287,7 +401,8 @@ Initial implementation:
Improvements:
- API server does logging instead.
- Policies to drop logging for high rate trusted API calls, or by users performing audit or other sensitive functions.


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
@ -43,24 +43,30 @@ Documentation for other releases can be found at
## Background

High level goals:

* Enable an easy-to-use mechanism to provide admission control to the cluster.
* Enable a provider to support multiple admission control strategies or author their own.
* Ensure any rejected request can propagate errors back to the caller with why the request failed.

Authorization via policy is focused on answering if a user is authorized to perform an action.

Admission Control is focused on if the system will accept an authorized action.

Kubernetes may choose to dismiss an authorized action based on any number of admission control strategies.

This proposal documents the basic design, and describes how any number of admission control plug-ins could be injected.

Implementation of specific admission control strategies is handled in separate documents.

## kube-apiserver

The kube-apiserver takes the following OPTIONAL arguments to enable admission control:

| Option | Behavior |
| ------ | -------- |
@ -72,7 +78,8 @@ An **AdmissionControl** plug-in is an implementation of the following interface:
```go
package admission

// Attributes is an interface used by a plug-in to make an admission decision
// on an individual request.
type Attributes interface {
	GetNamespace() string
	GetKind() string
@ -88,8 +95,8 @@ type Interface interface {
}
```

A **plug-in** must be compiled with the binary, and is registered as an available option by providing a name and an implementation of admission.Interface.

```go
func init() {
@ -97,9 +104,12 @@ func init() {
}
```
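For illustration only: since the full `admission.Interface` method set is elided in this excerpt, the sketch below declares simplified stand-ins and assumes an `Admit(a Attributes) error`-style method; a real plug-in would implement the actual interface and register itself in `init()` as shown above.

```go
package main

import "fmt"

// The real admission.Attributes and admission.Interface are shown only
// partially above, so this sketch declares simplified stand-ins.
type attributes interface {
	GetNamespace() string
	GetKind() string
}

type admissionInterface interface {
	Admit(a attributes) error // assumed shape; the real method set is not shown in the excerpt
}

// denyLockedNamespace is a toy plug-in: it rejects any request that targets
// the "locked" namespace and admits everything else.
type denyLockedNamespace struct{}

func (denyLockedNamespace) Admit(a attributes) error {
	if a.GetNamespace() == "locked" {
		return fmt.Errorf("admission denied: namespace %q does not accept new %s objects",
			a.GetNamespace(), a.GetKind())
	}
	return nil
}

// simpleAttributes is a trivial attributes carrier used for the demo below.
type simpleAttributes struct{ namespace, kind string }

func (s simpleAttributes) GetNamespace() string { return s.namespace }
func (s simpleAttributes) GetKind() string      { return s.kind }

func main() {
	var plugin admissionInterface = denyLockedNamespace{}
	err := plugin.Admit(simpleAttributes{namespace: "locked", kind: "Pod"})
	fmt.Println(err)
}
```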
Invocation of admission control is handled by the **APIServer** and not individual **RESTStorage** implementations.

This design assumes that **Issue 297** is adopted, and as a consequence, the general framework of the APIServer request/response flow will ensure the following:

1. Incoming request
2. Authenticate user
@ -36,7 +36,8 @@ Documentation for other releases can be found at
## Background

This document proposes a system for enforcing resource requirements constraints as part of admission control.

## Use cases
@ -64,7 +65,8 @@ const (
	LimitTypeContainer LimitType = "Container"
)

// LimitRangeItem defines a min/max usage limit for any resource that matches
// on kind.
type LimitRangeItem struct {
	// Type of resource that this limit applies to.
	Type LimitType `json:"type,omitempty"`
@ -72,29 +74,38 @@ type LimitRangeItem struct {
	Max ResourceList `json:"max,omitempty"`
	// Min usage constraints on this kind by resource name.
	Min ResourceList `json:"min,omitempty"`
	// Default resource requirement limit value by resource name if resource limit
	// is omitted.
	Default ResourceList `json:"default,omitempty"`
	// DefaultRequest is the default resource requirement request value by
	// resource name if resource request is omitted.
	DefaultRequest ResourceList `json:"defaultRequest,omitempty"`
	// MaxLimitRequestRatio if specified, the named resource must have a request
	// and limit that are both non-zero where limit divided by request is less
	// than or equal to the enumerated value; this represents the max burst for
	// the named resource.
	MaxLimitRequestRatio ResourceList `json:"maxLimitRequestRatio,omitempty"`
}

// LimitRangeSpec defines a min/max usage limit for resources that match
// on kind.
type LimitRangeSpec struct {
	// Limits is the list of LimitRangeItem objects that are enforced.
	Limits []LimitRangeItem `json:"limits"`
}

// LimitRange sets resource usage limits for each kind of resource in a
// Namespace.
type LimitRange struct {
	TypeMeta `json:",inline"`
	// Standard object's metadata.
	// More info:
	// http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata
	ObjectMeta `json:"metadata,omitempty"`

	// Spec defines the limits enforced.
	// More info:
	// http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status
	Spec LimitRangeSpec `json:"spec,omitempty"`
}
@ -102,24 +113,29 @@ type LimitRange struct {
type LimitRangeList struct {
	TypeMeta `json:",inline"`
	// Standard list metadata.
	// More info:
	// http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds
	ListMeta `json:"metadata,omitempty"`

	// Items is a list of LimitRange objects.
	// More info:
	// http://releases.k8s.io/HEAD/docs/design/admission_control_limit_range.md
	Items []LimitRange `json:"items"`
}
```
### Validation

Validation of a **LimitRange** enforces that for a given named resource the following rules apply:

Min (if specified) <= DefaultRequest (if specified) <= Default (if specified) <= Max (if specified)
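A sketch of the rule above, treating quantities as plain integers for brevity (the real API uses richer quantity types); the helper names are illustrative.

```go
package main

import "fmt"

// limits holds the optional values for one named resource; nil means "not specified".
type limits struct {
	Min, DefaultRequest, Default, Max *int64
}

// validate enforces Min <= DefaultRequest <= Default <= Max, skipping any
// value that is not specified, as described above.
func validate(name string, l limits) error {
	ordered := []struct {
		label string
		value *int64
	}{
		{"min", l.Min},
		{"defaultRequest", l.DefaultRequest},
		{"default", l.Default},
		{"max", l.Max},
	}
	var prevLabel string
	var prev *int64
	for _, entry := range ordered {
		if entry.value == nil {
			continue
		}
		if prev != nil && *prev > *entry.value {
			return fmt.Errorf("%s: %s (%d) must be <= %s (%d)",
				name, prevLabel, *prev, entry.label, *entry.value)
		}
		prevLabel, prev = entry.label, entry.value
	}
	return nil
}

func main() {
	v := func(n int64) *int64 { return &n }
	fmt.Println(validate("cpu", limits{Min: v(100), Default: v(500), Max: v(1000)})) // <nil>
	fmt.Println(validate("memory", limits{DefaultRequest: v(512), Default: v(256)})) // error
}
```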
### Default Value Behavior

The following default value behaviors are applied to a LimitRange for a given named resource.

```
if LimitRangeItem.Default[resourceName] is undefined
@ -137,11 +153,14 @@ if LimitRangeItem.DefaultRequest[resourceName] is undefined
## AdmissionControl plugin: LimitRanger

The **LimitRanger** plug-in introspects all incoming pod requests and evaluates the constraints defined on a LimitRange.

If a constraint is not specified for an enumerated resource, it is not enforced or tracked.

To enable the plug-in and support for LimitRange, the kube-apiserver must be configured as follows:

```console
$ kube-apiserver --admission-control=LimitRanger
@ -158,7 +177,7 @@ Supported Resources:
Supported Constraints:

Per container, the following must hold true:

| Constraint | Behavior |
| ---------- | -------- |
@ -168,8 +187,10 @@ Per container, the following must hold true
Supported Defaults:

1. Default - if the named resource has no enumerated value, the Limit is equal to the Default.
2. DefaultRequest - if the named resource has no enumerated value, the Request is equal to the DefaultRequest (both rules are sketched in code below).
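A sketch of the two defaulting rules just listed, again using simplified integer quantities; the `container` type and helper are hypothetical stand-ins, not the plug-in's real code.

```go
package main

import "fmt"

// container is a simplified stand-in holding per-resource requests and limits.
type container struct {
	Requests, Limits map[string]int64
}

// applyDefaults fills in a missing limit from Default and a missing request
// from DefaultRequest for one named resource, mirroring the two rules above.
func applyDefaults(c *container, resource string, def, defRequest int64, hasDef, hasDefRequest bool) {
	if _, ok := c.Limits[resource]; !ok && hasDef {
		c.Limits[resource] = def // 1. Limit defaults to Default
	}
	if _, ok := c.Requests[resource]; !ok && hasDefRequest {
		c.Requests[resource] = defRequest // 2. Request defaults to DefaultRequest
	}
}

func main() {
	c := &container{Requests: map[string]int64{}, Limits: map[string]int64{}}
	applyDefaults(c, "cpu", 500, 250, true, true)
	fmt.Println(c.Requests["cpu"], c.Limits["cpu"]) // 250 500
}
```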
**Type: Pod**
@ -190,7 +211,8 @@ Across all containers in pod, the following must hold true
## Run-time configuration

The default ```LimitRange``` that is applied via Salt configuration will be updated as follows:

```
apiVersion: "v1"
@ -219,7 +241,8 @@ the following would happen.
1. The incoming container cpu would request 250m with a limit of 500m.
2. The incoming container memory would request 250Mi with a limit of 500Mi.
3. If the container is later resized, its cpu would be constrained to between .1 and 1 and the ratio of limit to request could not exceed 4.

<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[]()

@@ -36,7 +36,8 @@ Documentation for other releases can be found at

## Background

This document describes a system for enforcing hard resource usage limits per
namespace as part of admission control.

## Use cases

@@ -103,7 +104,7 @@ type ResourceQuotaList struct {

## Quota Tracked Resources

The following resources are supported by the quota system:

| Resource | Description |
| ------------ | ----------- |

@@ -116,16 +117,19 @@ The following resources are supported by the quota system.

| secrets | Total number of secrets |
| persistentvolumeclaims | Total number of persistent volume claims |

If a third-party wants to track additional resources, it must follow the
resource naming conventions prescribed by Kubernetes. This means the resource
must have a fully-qualified name (i.e. mycompany.org/shinynewresource)
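
For illustration (a hypothetical manifest, not prescribed by this document), a
quota that enumerates several of the tracked resources above could be declared
as follows; the name and values are placeholders:

```console
$ cat <<EOF | kubectl create -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota
spec:
  hard:
    cpu: "4"        # total requested cpu across the namespace
    memory: 8Gi     # total requested memory across the namespace
    pods: "10"      # total number of pods
    services: "5"   # total number of services
EOF
```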

## Resource Requirements: Requests vs Limits

If a resource supports the ability to distinguish between a request and a limit
for a resource, the quota tracking system will only cost the request value
against the quota usage. If a resource is tracked by quota, and no request value
is provided, the associated entity is rejected as part of admission.

For example, consider the following scenarios relative to tracking quota on
CPU:

| Pod | Container | Request CPU | Limit CPU | Result |
| --- | --------- | ----------- | --------- | ------ |

@@ -134,13 +138,14 @@ For an example, consider the following scenarios relative to tracking quota on C

| Y | C2 | none | 500m | The quota usage is incremented 500m since request will default to limit |
| Z | C3 | none | none | The pod is rejected since it does not enumerate a request. |

The rationale for accounting for the requested amount of a resource versus the
limit is the belief that a user should only be charged for what they are
scheduled against in the cluster. In addition, attempting to track usage against
actual usage, where request < actual < limit, is considered highly volatile.

As a consequence of this decision, the user is able to spread their usage of a
resource across multiple tiers of service. Let's demonstrate this via an
example with a 4 cpu quota.

The quota may be allocated as follows:

@@ -150,48 +155,62 @@ The quota may be allocated as follows:

| Y | C2 | 2 | 2 | Guaranteed | 2 |
| Z | C3 | 1 | 3 | Burstable | 1 |

It is possible that the pods may consume 9 cpu over a given time period
depending on the available cpu of the nodes that held pods X and Z, but since we
scheduled X and Z relative to the request, we only track the requesting value
against their allocated quota. If one wants to restrict the ratio between the
request and limit, it is encouraged that the user define a **LimitRange** with
**LimitRequestRatio** to control burst out behavior. This would, in effect, let
an administrator keep the difference between request and limit more in line with
tracked usage if desired.
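
A sketch of such a LimitRange follows; it assumes the ratio constraint is
expressed through the `maxLimitRequestRatio` field of the v1 API, and the name
and values are placeholders:

```console
$ cat <<EOF | kubectl create -f -
apiVersion: v1
kind: LimitRange
metadata:
  name: burst-control
spec:
  limits:
  - type: Container
    maxLimitRequestRatio:
      cpu: "2"    # limit may be at most 2x the request for cpu
EOF
```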

## Status API

A REST API endpoint to update the status section of the **ResourceQuota** is
exposed. It requires an atomic compare-and-swap in order to keep resource usage
tracking consistent.

## Resource Quota Controller

A resource quota controller monitors observed usage for tracked resources in the
**Namespace**.

If there is an observed difference between the current usage stats and the
current **ResourceQuota.Status**, the controller posts an update of the
currently observed usage metrics to the **ResourceQuota** via the /status
endpoint.

The resource quota controller is the only component capable of monitoring and
recording usage updates after a DELETE operation since admission control is
incapable of guaranteeing a DELETE request actually succeeded.

## AdmissionControl plugin: ResourceQuota

The **ResourceQuota** plug-in introspects all incoming admission requests.

To enable the plug-in and support for ResourceQuota, the kube-apiserver must be
configured as follows:

```
$ kube-apiserver --admission-control=ResourceQuota
```

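As a side note (not part of the original text), the flag takes a comma-separated
list, so this plug-in can be enabled together with LimitRanger; the ordering
shown is illustrative:

```console
$ kube-apiserver --admission-control=LimitRanger,ResourceQuota
```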

It makes decisions by evaluating the incoming object against all defined
**ResourceQuota.Status.Hard** resource limits in the request namespace. If
acceptance of the resource would cause the total usage of a named resource to
exceed its hard limit, the request is denied.

If the incoming request does not cause the total usage to exceed any of the
enumerated hard resource limits, the plug-in will post a
**ResourceQuota.Status** document to the server to atomically update the
observed usage based on the previously read **ResourceQuota.ResourceVersion**.
This keeps incremental usage atomically consistent, but does introduce a
bottleneck (intentionally) into the system.

To optimize system performance, it is encouraged that all resource quotas are
tracked on the same **ResourceQuota** document in a **Namespace**. As a result,
it is encouraged to cap the total number of **ResourceQuota** documents tracked
in a **Namespace** at 1.

## kubectl

@@ -199,7 +218,7 @@ kubectl is modified to support the **ResourceQuota** resource.

`kubectl describe` provides a human-readable output of quota.

For example:

```console
$ kubectl create -f docs/admin/resourcequota/namespace.yaml
```

@@ -34,49 +34,84 @@ Documentation for other releases can be found at

# Kubernetes architecture

A running Kubernetes cluster contains node agents (`kubelet`) and master
components (APIs, scheduler, etc), on top of a distributed storage solution.
This diagram shows our desired eventual state, though we're still working on a
few things, like making `kubelet` itself (all our components, really) run within
containers, and making the scheduler 100% pluggable.



## The Kubernetes Node

When looking at the architecture of the system, we'll break it down to services
that run on the worker node and services that compose the cluster-level control
plane.

The Kubernetes node has the services necessary to run application containers and
be managed from the master systems.

Each node runs Docker, of course. Docker takes care of the details of
downloading images and running containers.

### `kubelet`

The `kubelet` manages [pods](../user-guide/pods.md) and their containers, their
images, their volumes, etc.

### `kube-proxy`

Each node also runs a simple network proxy and load balancer (see the
[services FAQ](https://github.com/kubernetes/kubernetes/wiki/Services-FAQ) for
more details). This reflects `services` (see
[the services doc](../user-guide/services.md) for more details) as defined in
the Kubernetes API on each node and can do simple TCP and UDP stream forwarding
(round robin) across a set of backends.

Service endpoints are currently found via [DNS](../admin/dns.md) or through
environment variables (both
[Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) and
Kubernetes `{FOO}_SERVICE_HOST` and `{FOO}_SERVICE_PORT` variables are
supported). These variables resolve to ports managed by the service proxy.
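
To make the environment-variable mechanism concrete, here is a hypothetical look
from inside a container in a cluster that runs a service named `redis-master`;
the service name and the values shown are invented for illustration only:

```console
$ env | grep REDIS_MASTER_SERVICE
REDIS_MASTER_SERVICE_HOST=10.0.0.11
REDIS_MASTER_SERVICE_PORT=6379
```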

## The Kubernetes Control Plane

The Kubernetes control plane is split into a set of components. Currently they
all run on a single _master_ node, but that is expected to change soon in order
to support high-availability clusters. These components work together to provide
a unified view of the cluster.

### `etcd`

All persistent master state is stored in an instance of `etcd`. This provides a
great way to store configuration data reliably. With `watch` support,
coordinating components can be notified very quickly of changes.

### Kubernetes API Server

The apiserver serves up the [Kubernetes API](../api.md). It is intended to be a
CRUD-y server, with most/all business logic implemented in separate components
or in plug-ins. It mainly processes REST operations, validates them, and updates
the corresponding objects in `etcd` (and eventually other stores).
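
As a quick illustration of that REST surface (assuming access to the apiserver's
insecure local port, which defaulted to 8080 at the time), listing pods is a
single GET; the address is an assumption about your setup:

```console
$ curl http://127.0.0.1:8080/api/v1/namespaces/default/pods
```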

### Scheduler

The scheduler binds unscheduled pods to nodes via the `/binding` API. The
scheduler is pluggable, and we expect to support multiple cluster schedulers and
even user-provided schedulers in the future.
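
For a rough sketch of what that binding step looks like on the wire (this
request is illustrative only; it assumes the conventional pods/binding
subresource path, a pod named `mypod`, a node named `node-1`, and the insecure
local port):

```console
$ curl -X POST -H "Content-Type: application/json" \
    -d '{"apiVersion":"v1","kind":"Binding","metadata":{"name":"mypod"},"target":{"apiVersion":"v1","kind":"Node","name":"node-1"}}' \
    http://127.0.0.1:8080/api/v1/namespaces/default/pods/mypod/binding
```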

### Kubernetes Controller Manager Server

All other cluster-level functions are currently performed by the Controller
Manager. For instance, `Endpoints` objects are created and updated by the
endpoints controller, and nodes are discovered, managed, and monitored by the
node controller. These could eventually be split into separate components to
make them independently pluggable.

The [`replicationcontroller`](../user-guide/replication-controller.md) is a
mechanism that is layered on top of the simple [`pod`](../user-guide/pods.md)
API. We eventually plan to port it to a generic plug-in mechanism, once one is
implemented.

<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->

@@ -35,7 +35,7 @@ Documentation for other releases can be found at

# Peeking under the hood of Kubernetes on AWS

This document provides high-level insight into how Kubernetes works on AWS and
maps to AWS objects. We assume that you are familiar with AWS.

We encourage you to use [kube-up](../getting-started-guides/aws.md) to create
clusters on AWS. We recommend that you avoid manual configuration but are aware

@@ -72,7 +72,7 @@ By default on AWS:

* Instances run Ubuntu 15.04 (the official AMI). It includes a sufficiently
modern kernel that pairs well with Docker and doesn't require a
reboot. (The default SSH user is `ubuntu` for this and other ubuntu images.)
* Nodes use aufs instead of ext4 as the filesystem / container storage (mostly
because this is what Google Compute Engine uses).

@@ -81,35 +81,36 @@ kube-up.

### Storage

AWS supports persistent volumes by using [Elastic Block Store (EBS)](../user-guide/volumes.md#awselasticblockstore).
These can then be attached to pods that should store persistent data (e.g. if
you're running a database).

By default, nodes in AWS use [instance storage](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html)
unless you create pods with persistent volumes
[(EBS)](../user-guide/volumes.md#awselasticblockstore). In general, Kubernetes
containers do not have persistent storage unless you attach a persistent
volume, and so nodes on AWS use instance storage. Instance storage is cheaper,
often faster, and historically more reliable. Unless you can make do with
whatever space is left on your root partition, you must choose an instance type
that provides you with sufficient instance storage for your needs.

Note: The master uses a persistent volume ([etcd](architecture.md#etcd)) to
track its state. Similar to nodes, containers are mostly run against instance
storage, except that we repoint some important data onto the persistent volume.

The default storage driver for Docker images is aufs. Specifying btrfs (by
passing the environment variable `DOCKER_STORAGE=btrfs` to kube-up) is also a
good choice for a filesystem. btrfs is relatively reliable with Docker and has
improved its reliability with modern kernels. It can easily span multiple
volumes, which is particularly useful when we are using an instance type with
multiple ephemeral instance disks.
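
As a sketch of how that option is passed (the exact invocation may vary with
your environment; the provider variable and script path below are assumptions
based on the description above):

```console
$ export KUBERNETES_PROVIDER=aws
$ export DOCKER_STORAGE=btrfs
$ cluster/kube-up.sh
```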

### Auto Scaling group

Nodes (but not the master) are run in an
[Auto Scaling group](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/AutoScalingGroup.html)
on AWS. Currently auto-scaling (e.g. based on CPU) is not actually enabled
([#11935](http://issues.k8s.io/11935)). Instead, the Auto Scaling group means
that AWS will relaunch any nodes that are terminated.

We do not currently run the master in an AutoScalingGroup, but we should

@@ -117,13 +118,13 @@ We do not currently run the master in an AutoScalingGroup, but we should

### Networking

Kubernetes uses an IP-per-pod model. This means that a node, which runs many
pods, must have many IPs. AWS uses virtual private clouds (VPCs) and advanced
routing support so each pod is assigned a /24 CIDR. The assigned CIDR is then
configured to route to an instance in the VPC routing table.

It is also possible to use overlay networking on AWS, but that is not the
default configuration of the kube-up script.

### NodePort and LoadBalancer services

@@ -137,8 +138,8 @@ the nodes. This traffic reaches kube-proxy where it is then forwarded to the

pods.

ELB has some restrictions:

* ELB requires that all nodes listen on a single port,
* ELB acts as a forwarding proxy (i.e. the source IP is not preserved).

To work with these restrictions, in Kubernetes, [LoadBalancer
services](../user-guide/services.md#type-loadbalancer) are exposed as

@@ -146,18 +147,18 @@ services](../user-guide/services.md#type-loadbalancer) are exposed as

kube-proxy listens externally on the cluster-wide port that's assigned to
NodePort services and forwards traffic to the corresponding pods.

For example, if we configure a service of Type LoadBalancer with a
public port of 80 (a sample manifest follows the list):

* Kubernetes will assign a NodePort to the service (e.g. port 31234)
* ELB is configured to proxy traffic on the public port 80 to the NodePort
assigned to the service (in this example port 31234).
* Then any in-coming traffic that ELB forwards to the NodePort (31234)
is recognized by kube-proxy and sent to the correct pods for that service.
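
A minimal manifest for such a service might look like the following; the names
and target port are hypothetical, and the NodePort is chosen by Kubernetes
rather than specified here:

```console
$ cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Service
metadata:
  name: my-frontend
spec:
  type: LoadBalancer
  selector:
    app: my-frontend
  ports:
  - port: 80          # the public port exposed through ELB
    targetPort: 8080  # the port the pods listen on
EOF
$ kubectl describe service my-frontend   # shows the assigned NodePort and ELB address
```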

Note that we do not automatically open NodePort services in the AWS firewall
(although we do open LoadBalancer services). This is because we expect that
NodePort services are more of a building block for things like inter-cluster
services or for LoadBalancer. To consume a NodePort service externally, you
will likely have to open the port in the node security group
(`kubernetes-minion-<clusterid>`).

@@ -169,19 +170,19 @@ and one for the nodes called

[kubernetes-minion](../../cluster/aws/templates/iam/kubernetes-minion-policy.json).

The master is responsible for creating ELBs and configuring them, as well as
setting up advanced VPC routing. Currently it has blanket permissions on EC2,
along with rights to create and destroy ELBs.

The nodes do not need a lot of access to the AWS APIs. They need to download
a distribution file, and then are responsible for attaching and detaching EBS
volumes from themselves.

The node policy is relatively minimal. In 1.2 and later, nodes can retrieve ECR
authorization tokens, refresh them every 12 hours if needed, and fetch Docker
images from it, as long as the appropriate permissions are enabled. Those in
[AmazonEC2ContainerRegistryReadOnly](http://docs.aws.amazon.com/AmazonECR/latest/userguide/ecr_managed_policies.html#AmazonEC2ContainerRegistryReadOnly),
without write access, should suffice. The master policy is probably overly
permissive. The security conscious may want to lock-down the IAM policies
further ([#11936](http://issues.k8s.io/11936)).

We should make it easier to extend IAM permissions and also ensure that they

@@ -190,106 +191,101 @@ are correctly configured ([#14226](http://issues.k8s.io/14226)).

### Tagging

All AWS resources are tagged with a tag named "KubernetesCluster", with a value
that is the unique cluster-id. This tag is used to identify a particular
'instance' of Kubernetes, even if two clusters are deployed into the same VPC.
Resources are considered to belong to the same cluster if and only if they have
the same value in the tag named "KubernetesCluster". (The kube-up script is
not configured to create multiple clusters in the same VPC by default, but it
is possible to create another cluster in the same VPC.)

Within the AWS cloud provider logic, we filter requests to the AWS APIs to
match resources with our cluster tag. By filtering the requests, we ensure
that we see only our own AWS objects.

**Important:** If you choose not to use kube-up, you must pick a unique
cluster-id value, and ensure that all AWS resources have a tag with
`Name=KubernetesCluster,Value=<clusterid>`.
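
For instance, when creating resources by hand, the tag could be applied with the
AWS CLI roughly as follows; the resource id and cluster-id are placeholders:

```console
$ aws ec2 create-tags --resources subnet-0123456789abcdef0 \
    --tags Key=KubernetesCluster,Value=mycluster
```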

### AWS objects

The kube-up script does a number of things in AWS:

* Creates an S3 bucket (`AWS_S3_BUCKET`) and then copies the Kubernetes
distribution and the salt scripts into it. They are made world-readable and the
HTTP URLs are passed to instances; this is how Kubernetes code gets onto the
machines.
* Creates two IAM profiles based on templates in [cluster/aws/templates/iam](../../cluster/aws/templates/iam/):
  * `kubernetes-master` is used by the master.
  * `kubernetes-minion` is used by nodes.
* Creates an AWS SSH key named `kubernetes-<fingerprint>`. Fingerprint here is
the OpenSSH key fingerprint, so that multiple users can run the script with
different keys and their keys will not collide (with near-certainty). It will
use an existing key if one is found at `AWS_SSH_KEY`, otherwise it will create
one there. (With the default Ubuntu images, if you have to SSH in: the user is
`ubuntu` and that user can `sudo`).
* Creates a VPC for use with the cluster (with a CIDR of 172.20.0.0/16) and
enables the `dns-support` and `dns-hostnames` options.
* Creates an internet gateway for the VPC.
* Creates a route table for the VPC, with the internet gateway as the default
route.
* Creates a subnet (with a CIDR of 172.20.0.0/24) in the AZ `KUBE_AWS_ZONE`
(defaults to us-west-2a). Currently, each Kubernetes cluster runs in a
single AZ on AWS. However, there are two philosophies in discussion on how to
achieve High Availability (HA):
  * cluster-per-AZ: An independent cluster for each AZ, where each cluster
is entirely separate.
  * cross-AZ-clusters: A single cluster spans multiple AZs.
The debate is open here, where cluster-per-AZ is discussed as more robust but
cross-AZ-clusters are more convenient.
* Associates the subnet to the route table
* Creates security groups for the master (`kubernetes-master-<clusterid>`)
and the nodes (`kubernetes-minion-<clusterid>`).
* Configures security groups so that masters and nodes can communicate. This
includes intercommunication between masters and nodes, opening SSH publicly
for both masters and nodes, and opening port 443 on the master for the HTTPS
API endpoints.
* Creates an EBS volume for the master of size `MASTER_DISK_SIZE` and type
`MASTER_DISK_TYPE`.
* Launches a master with a fixed IP address (172.20.0.9) that is also
configured for the security group and all the necessary IAM credentials. An
instance script is used to pass vital configuration information to Salt. Note:
The hope is that over time we can reduce the amount of configuration
information that must be passed in this way.
* Once the instance is up, it attaches the EBS volume and sets up a manual
routing rule for the internal network range (`MASTER_IP_RANGE`, defaults to
10.246.0.0/24).
* For auto-scaling of the nodes, it creates a launch configuration and group.
The name for both is <*KUBE_AWS_INSTANCE_PREFIX*>-minion-group. The default
name is kubernetes-minion-group. The auto-scaling group has a min and max size
that are both set to NUM_NODES. You can change the size of the auto-scaling
group to add or remove the total number of nodes from within the AWS API or
Console. Each node self-configures, meaning that they come up; run Salt with
the stored configuration; connect to the master; are assigned an internal CIDR;
and then the master configures the route-table with the assigned CIDR. The
kube-up script performs a health-check on the nodes but it's a self-check that
is not required.

If attempting this configuration manually, it is recommended to follow along
with the kube-up script, being sure to tag everything with a tag with name
`KubernetesCluster` and value set to a unique cluster-id. Also, passing the
right configuration options to Salt when not using the script is tricky: the
plan here is to simplify this by having Kubernetes take on more node
configuration, and even potentially remove Salt altogether.

### Manual infrastructure creation

While this work is not yet complete, advanced users might choose to manually
create certain AWS objects while still making use of the kube-up script (to
configure Salt, for example). These objects can currently be manually created:

* Set the `AWS_S3_BUCKET` environment variable to use an existing S3 bucket.
* Set the `VPC_ID` environment variable to reuse an existing VPC.
* Set the `SUBNET_ID` environment variable to reuse an existing subnet (see the
combined example below).
* If your route table has a matching `KubernetesCluster` tag, it will be reused.
* If your security groups are appropriately named, they will be reused.
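
Putting those variables together, a manual-reuse invocation could look roughly
like this; all identifiers are placeholders and the exact set of variables you
need depends on your setup:

```console
$ export KUBERNETES_PROVIDER=aws
$ export AWS_S3_BUCKET=my-kube-artifacts
$ export VPC_ID=vpc-0123456789abcdef0
$ export SUBNET_ID=subnet-0123456789abcdef0
$ cluster/kube-up.sh
```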

Currently there is no way to do the following with kube-up:

* Use an existing AWS SSH key with an arbitrary name.
* Override the IAM credentials in a sensible way
([#14226](http://issues.k8s.io/14226)).
* Use different security group permissions.
* Configure your own auto-scaling groups.

@@ -312,8 +308,6 @@ Salt and Kubernetes from the S3 bucket, and then triggering Salt to actually

install Kubernetes.

<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

@@ -37,60 +37,122 @@ Documentation for other releases can be found at

## Overview

The term "clustering" refers to the process of having all members of the
Kubernetes cluster find and trust each other. There are multiple different ways
to achieve clustering with different security and usability profiles. This
document attempts to lay out the user experiences for clustering that Kubernetes
aims to address.

Once a cluster is established, the following is true:

1. **Master -> Node** The master needs to know which nodes can take work and
what their current status is wrt capacity.
    1. **Location** The master knows the name and location of all of the nodes
    in the cluster.
        * For the purposes of this doc, location and name should be enough
        information so that the master can open a TCP connection to the Node.
        Most probably we will make this either an IP address or a DNS name. It
        is going to be important to be consistent here (master must be able to
        reach kubelet on that DNS name) so that we can verify certificates
        appropriately.
    2. **Target AuthN** A way to securely talk to the kubelet on that node.
    Currently we call out to the kubelet over HTTP. This should be over HTTPS
    and the master should know what CA to trust for that node.
    3. **Caller AuthN/Z** This would be the master verifying itself (and
    permissions) when calling the node. Currently, this is only used to collect
    statistics as authorization isn't critical. This may change in the future
    though.
2. **Node -> Master** The nodes currently talk to the master to know which pods
have been assigned to them and to publish events.
    1. **Location** The nodes must know where the master is at.
    2. **Target AuthN** Since the master is assigning work to the nodes, it is
    critical that they verify whom they are talking to.
    3. **Caller AuthN/Z** The nodes publish events and so must be authenticated
    to the master. Ideally this authentication is specific to each node so that
    authorization can be narrowly scoped. The details of the work to run
    (including things like environment variables) might be considered sensitive
    and should be locked down also.

**Note:** While the description here refers to a singular Master, in the future
we should enable multiple Masters operating in an HA mode. While the "Master" is
currently the combination of the API Server, Scheduler and Controller Manager,
we will restrict ourselves to thinking about the main API and policy engine --
the API Server.

## Current Implementation

A central authority (generally the master) is responsible for determining the
set of machines which are members of the cluster. Calls to create and remove
worker nodes in the cluster are restricted to this single authority, and any
other requests to add or remove worker nodes are rejected. (1.i.)

Communication from the master to nodes is currently over HTTP and is not secured
or authenticated in any way. (1.ii, 1.iii.)

The location of the master is communicated out of band to the nodes. For GCE,
this is done via Salt. Other cluster instructions/scripts use other methods.
(2.i.)

Currently most communication from the node to the master is over HTTP. When it
is done over HTTPS there is currently no verification of the cert of the master.
(2.ii.)

Currently, the node/kubelet is authenticated to the master via a token shared
across all nodes. This token is distributed out of band (using Salt for GCE) and
is optional. If it is not present then the kubelet is unable to publish events
to the master. (2.iii.)

Our current mix of out of band communication doesn't meet all of our needs from
a security point of view and is difficult to set up and configure.

## Proposed Solution

The proposed solution will provide a range of options for setting up and
maintaining a secure Kubernetes cluster. We want to allow both for centrally
controlled systems (leveraging pre-existing trust and configuration systems) and
for more ad-hoc automagic systems that are incredibly easy to set up.

The building blocks of an easier solution:

* **Move to TLS** We will move to using TLS for all intra-cluster communication.
We will explicitly identify the trust chain (the set of trusted CAs) as opposed
to trusting the system CAs. We will also use client certificates for all AuthN.
* [optional] **API driven CA** Optionally, we will run a CA in the master that
will mint certificates for the nodes/kubelets. There will be pluggable policies
that will automatically approve certificate requests here as appropriate.
* **CA approval policy** This is a pluggable policy object that can
automatically approve CA signing requests. Stock policies will include
`always-reject`, `queue` and `insecure-always-approve`. With `queue` there would
be an API for evaluating and accepting/rejecting requests. Cloud providers could
implement a policy here that verifies other out of band information and
automatically approves/rejects based on other external factors.
* **Scoped Kubelet Accounts** These accounts are per-node and (optionally) give
a node permission to register itself.
  * To start with, we'd have the kubelets generate a cert/account in the form of
  `kubelet:<host>`. To start we would then hard code policy such that we give
  that particular account appropriate permissions. Over time, we can make the
  policy engine more generic.
* [optional] **Bootstrap API endpoint** This is a helper service hosted outside
of the Kubernetes cluster that helps with initial discovery of the master.
### Static Clustering
|
### Static Clustering
|
||||||
|
|
||||||
In this sequence diagram there is out of band admin entity that is creating all certificates and distributing them. It is also making sure that the kubelets know where to find the master. This provides for a lot of control but is more difficult to set up as lots of information must be communicated outside of Kubernetes.
|
In this sequence diagram, there is an out-of-band admin entity that is
creating all
|
||||||
|
certificates and distributing them. It is also making sure that the kubelets
|
||||||
|
know where to find the master. This provides for a lot of control but is more
|
||||||
|
difficult to set up as lots of information must be communicated outside of
|
||||||
|
Kubernetes.
|
||||||
|
|
||||||

|

|
||||||
|
|
||||||
### Dynamic Clustering
|
### Dynamic Clustering
|
||||||
|
|
||||||
This diagram dynamic clustering using the bootstrap API endpoint. That API endpoint is used to both find the location of the master and communicate the root CA for the master.
|
This diagram shows dynamic clustering using the bootstrap API endpoint. This
|
||||||
|
endpoint is used to both find the location of the master and communicate the
|
||||||
|
root CA for the master.
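
As a rough illustration, a bootstrap client could look something like the
sketch below; the endpoint URL and JSON response shape are assumptions for
illustration, not anything defined by this proposal.

```go
// A minimal sketch of a hypothetical bootstrap client.
package main

import (
	"crypto/x509"
	"encoding/json"
	"fmt"
	"net/http"
)

// bootstrapInfo is what the helper service is assumed to return.
type bootstrapInfo struct {
	MasterURL string `json:"masterURL"`
	RootCAPEM string `json:"rootCAPEM"`
}

// discoverMaster fetches the master location and builds a CA pool from the
// returned root certificate.
func discoverMaster(bootstrapURL string) (*bootstrapInfo, *x509.CertPool, error) {
	resp, err := http.Get(bootstrapURL)
	if err != nil {
		return nil, nil, err
	}
	defer resp.Body.Close()
	var info bootstrapInfo
	if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
		return nil, nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM([]byte(info.RootCAPEM)) {
		return nil, nil, fmt.Errorf("no usable CA certificates in response")
	}
	return &info, pool, nil
}

func main() {
	// Hypothetical endpoint; discovery of this URL is itself out of band.
	info, _, err := discoverMaster("https://bootstrap.example.com/cluster-info")
	if err != nil {
		fmt.Println("discovery failed:", err)
		return
	}
	fmt.Println("master at", info.MasterURL)
}
```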
|
||||||
|
|
||||||
This flow has the admin manually approving the kubelet signing requests. This is the `queue` policy defined above.This manual intervention could be replaced by code that can verify the signing requests via other means.
|
This flow has the admin manually approving the kubelet signing requests. This is
|
||||||
|
the `queue` policy defined above. This manual intervention could be replaced by
|
||||||
|
code that can verify the signing requests via other means.
|
||||||
|
|
||||||

|

|
||||||
|
|
||||||
|
@ -33,7 +33,8 @@ Documentation for other releases can be found at
|
|||||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||||
This directory contains diagrams for the clustering design doc.
|
This directory contains diagrams for the clustering design doc.
|
||||||
|
|
||||||
This depends on the `seqdiag` [utility](http://blockdiag.com/en/seqdiag/index.html). Assuming you have a non-borked python install, this should be installable with
|
This depends on the `seqdiag` [utility](http://blockdiag.com/en/seqdiag/index.html).
|
||||||
|
Assuming you have a non-borked python install, this should be installable with:
|
||||||
|
|
||||||
```sh
|
```sh
|
||||||
pip install seqdiag
|
pip install seqdiag
|
||||||
@ -43,7 +44,8 @@ Just call `make` to regenerate the diagrams.
|
|||||||
|
|
||||||
## Building with Docker
|
## Building with Docker
|
||||||
|
|
||||||
If you are on a Mac or your pip install is messed up, you can easily build with docker.
|
If you are on a Mac or your pip install is messed up, you can easily build with
|
||||||
|
docker:
|
||||||
|
|
||||||
```sh
|
```sh
|
||||||
make docker
|
make docker
|
||||||
@ -51,13 +53,18 @@ make docker
|
|||||||
|
|
||||||
The first run will be slow but things should be fast after that.
|
The first run will be slow but things should be fast after that.
|
||||||
|
|
||||||
To clean up the docker containers that are created (and other cruft that is left around) you can run `make docker-clean`.
|
To clean up the docker containers that are created (and other cruft that is left
|
||||||
|
around) you can run `make docker-clean`.
|
||||||
|
|
||||||
If you are using boot2docker and get warnings about clock skew (or if things aren't building for some reason) then you can fix that up with `make fix-clock-skew`.
|
If you are using boot2docker and get warnings about clock skew (or if things
|
||||||
|
aren't building for some reason) then you can fix that up with
|
||||||
|
`make fix-clock-skew`.
|
||||||
|
|
||||||
## Automatically rebuild on file changes
|
## Automatically rebuild on file changes
|
||||||
|
|
||||||
If you have the fswatch utility installed, you can have it monitor the file system and automatically rebuild when files have changed. Just do a `make watch`.
|
If you have the fswatch utility installed, you can have it monitor the file
|
||||||
|
system and automatically rebuild when files have changed. Just do a
|
||||||
|
`make watch`.
|
||||||
|
|
||||||
|
|
||||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||||
|
@ -36,14 +36,13 @@ Documentation for other releases can be found at
|
|||||||
|
|
||||||
## Abstract
|
## Abstract
|
||||||
|
|
||||||
This describes an approach for providing support for:
|
This document describes how to use Kubernetes to execute commands in containers,
|
||||||
|
with stdin/stdout/stderr streams attached, and how to implement port forwarding
|
||||||
- executing commands in containers, with stdin/stdout/stderr streams attached
|
to the containers.
|
||||||
- port forwarding to containers
|
|
||||||
|
|
||||||
## Background
|
## Background
|
||||||
|
|
||||||
There are several related issues/PRs:
|
See the following related issues/PRs:
|
||||||
|
|
||||||
- [Support attach](http://issue.k8s.io/1521)
|
- [Support attach](http://issue.k8s.io/1521)
|
||||||
- [Real container ssh](http://issue.k8s.io/1513)
|
- [Real container ssh](http://issue.k8s.io/1513)
|
||||||
@ -77,34 +76,39 @@ won't be able to work with this mechanism, unless adapters can be written.
|
|||||||
|
|
||||||
## Constraints and Assumptions
|
## Constraints and Assumptions
|
||||||
|
|
||||||
- SSH support is not currently in scope
|
- SSH support is not currently in scope.
|
||||||
- CGroup confinement is ultimately desired, but implementing that support is not currently in scope
|
- CGroup confinement is ultimately desired, but implementing that support is not
|
||||||
- SELinux confinement is ultimately desired, but implementing that support is not currently in scope
|
currently in scope.
|
||||||
|
- SELinux confinement is ultimately desired, but implementing that support is
|
||||||
|
not currently in scope.
|
||||||
|
|
||||||
## Use Cases
|
## Use Cases
|
||||||
|
|
||||||
- As a user of a Kubernetes cluster, I want to run arbitrary commands in a container, attaching my local stdin/stdout/stderr to the container
|
- A user of a Kubernetes cluster wants to run arbitrary commands in a
|
||||||
- As a user of a Kubernetes cluster, I want to be able to connect to local ports on my computer and have them forwarded to ports in the container
|
container with local stdin/stdout/stderr attached to the container.
|
||||||
|
- A user of a Kubernetes cluster wants to connect to local ports on their computer
|
||||||
|
and have them forwarded to ports in a container.
|
||||||
|
|
||||||
## Process Flow
|
## Process Flow
|
||||||
|
|
||||||
### Remote Command Execution Flow
|
### Remote Command Execution Flow
|
||||||
|
|
||||||
1. The client connects to the Kubernetes Master to initiate a remote command execution
|
1. The client connects to the Kubernetes Master to initiate a remote command
|
||||||
request
|
execution request.
|
||||||
2. The Master proxies the request to the Kubelet where the container lives
|
2. The Master proxies the request to the Kubelet where the container lives.
|
||||||
3. The Kubelet executes nsenter + the requested command and streams stdin/stdout/stderr back and forth between the client and the container
|
3. The Kubelet executes nsenter + the requested command and streams
|
||||||
|
stdin/stdout/stderr back and forth between the client and the container (a
sketch follows this list).
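
A rough sketch of step 3 from the Kubelet's side, assuming `nsenter` is
installed on the host and the container's init PID is already known; the
streaming plumbing through the Master proxy is omitted.

```go
// Sketch only: a real Kubelet streams over the API, not its own stdio.
package main

import (
	"os"
	"os/exec"
	"strconv"
)

// runInContainer enters the target container's namespaces and runs cmd,
// attaching the caller's stdin/stdout/stderr to the process.
func runInContainer(containerPID int, cmd []string) error {
	args := append([]string{
		"--target", strconv.Itoa(containerPID),
		"--mount", "--uts", "--ipc", "--net", "--pid",
		"--",
	}, cmd...)
	c := exec.Command("nsenter", args...)
	c.Stdin = os.Stdin
	c.Stdout = os.Stdout
	c.Stderr = os.Stderr
	return c.Run()
}

func main() {
	// Hypothetical PID; in practice the Kubelet looks this up from the runtime.
	_ = runInContainer(12345, []string{"sh", "-c", "hostname"})
}
```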
|
||||||
|
|
||||||
### Port Forwarding Flow
|
### Port Forwarding Flow
|
||||||
|
|
||||||
1. The client connects to the Kubernetes Master to initiate a remote command execution
|
1. The client connects to the Kubernetes Master to initiate a port forwarding
|
||||||
request
|
request.
|
||||||
2. The Master proxies the request to the Kubelet where the container lives
|
2. The Master proxies the request to the Kubelet where the container lives.
|
||||||
3. The client listens on each specified local port, awaiting local connections
|
3. The client listens on each specified local port, awaiting local connections.
|
||||||
4. The client connects to one of the local listening ports
|
4. The client connects to one of the local listening ports.
|
||||||
4. The client notifies the Kubelet of the new connection
|
5. The client notifies the Kubelet of the new connection.
|
||||||
5. The Kubelet executes nsenter + socat and streams data back and forth between the client and the port in the container
|
6. The Kubelet executes nsenter + socat and streams data back and forth between
|
||||||
|
the client and the port in the container (a sketch follows this list).
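
A rough sketch of the Kubelet-side bridging in step 6, assuming `nsenter` and
`socat` are installed on the host; the listen address, container PID, and port
are hypothetical placeholders.

```go
// Sketch only: bridges one local TCP connection into a container's netns.
package main

import (
	"log"
	"net"
	"os/exec"
	"strconv"
)

// forward bridges one accepted connection to containerPort inside the network
// namespace of containerPID.
func forward(conn net.Conn, containerPID, containerPort int) {
	defer conn.Close()
	c := exec.Command("nsenter",
		"--target", strconv.Itoa(containerPID), "--net", "--",
		"socat", "-", "TCP4:localhost:"+strconv.Itoa(containerPort))
	c.Stdin = conn
	c.Stdout = conn
	if err := c.Run(); err != nil {
		log.Printf("forward ended: %v", err)
	}
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:8080") // local listening port
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go forward(conn, 12345, 80) // hypothetical container PID and port
	}
}
```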
|
||||||
|
|
||||||
## Design Considerations
|
## Design Considerations
|
||||||
|
|
||||||
@ -177,7 +181,10 @@ functionality. We need to make sure that users are not allowed to execute
|
|||||||
remote commands or do port forwarding to containers they aren't allowed to
|
remote commands or do port forwarding to containers they aren't allowed to
|
||||||
access.
|
access.
|
||||||
|
|
||||||
Additional work is required to ensure that multiple command execution or port forwarding connections from different clients are not able to see each other's data. This can most likely be achieved via SELinux labeling and unique process contexts.
|
Additional work is required to ensure that multiple command execution or port
|
||||||
|
forwarding connections from different clients are not able to see each other's
|
||||||
|
data. This can most likely be achieved via SELinux labeling and unique process
|
||||||
|
contexts.
|
||||||
|
|
||||||
|
|
||||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||||
|
@ -36,8 +36,8 @@ Documentation for other releases can be found at
|
|||||||
|
|
||||||
## Abstract
|
## Abstract
|
||||||
|
|
||||||
The `ConfigMap` API resource stores data used for the configuration of applications deployed on
|
The `ConfigMap` API resource stores data used for the configuration of
|
||||||
Kubernetes.
|
applications deployed on Kubernetes.
|
||||||
|
|
||||||
The main focus of this resource is to:
|
The main focus of this resource is to:
|
||||||
|
|
||||||
@ -47,71 +47,74 @@ The main focus of this resource is to:
|
|||||||
|
|
||||||
## Motivation
|
## Motivation
|
||||||
|
|
||||||
A `Secret`-like API resource is needed to store configuration data that pods can consume.
|
A `Secret`-like API resource is needed to store configuration data that pods can
|
||||||
|
consume.
|
||||||
|
|
||||||
Goals of this design:
|
Goals of this design:
|
||||||
|
|
||||||
1. Describe a `ConfigMap` API resource
|
1. Describe a `ConfigMap` API resource.
|
||||||
2. Describe the semantics of consuming `ConfigMap` as environment variables
|
2. Describe the semantics of consuming `ConfigMap` as environment variables.
|
||||||
3. Describe the semantics of consuming `ConfigMap` as files in a volume
|
3. Describe the semantics of consuming `ConfigMap` as files in a volume.
|
||||||
|
|
||||||
## Use Cases
|
## Use Cases
|
||||||
|
|
||||||
1. As a user, I want to be able to consume configuration data as environment variables
|
1. As a user, I want to be able to consume configuration data as environment
|
||||||
2. As a user, I want to be able to consume configuration data as files in a volume
|
variables.
|
||||||
3. As a user, I want my view of configuration data in files to be eventually consistent with changes
|
2. As a user, I want to be able to consume configuration data as files in a
|
||||||
to the data
|
volume.
|
||||||
|
3. As a user, I want my view of configuration data in files to be eventually
|
||||||
|
consistent with changes to the data.
|
||||||
|
|
||||||
### Consuming `ConfigMap` as Environment Variables
|
### Consuming `ConfigMap` as Environment Variables
|
||||||
|
|
||||||
Many programs read their configuration from environment variables. `ConfigMap` should be possible
|
A series of events for consuming `ConfigMap` as environment variables:
|
||||||
to consume in environment variables. The rough series of events for consuming `ConfigMap` this way
|
|
||||||
is:
|
|
||||||
|
|
||||||
1. A `ConfigMap` object is created
|
1. Create a `ConfigMap` object.
|
||||||
2. A pod that consumes the configuration data via environment variables is created
|
2. Create a pod to consume the configuration data via environment variables.
|
||||||
3. The pod is scheduled onto a node
|
3. The pod is scheduled onto a node.
|
||||||
4. The kubelet retrieves the `ConfigMap` resource(s) referenced by the pod and starts the container
|
4. The Kubelet retrieves the `ConfigMap` resource(s) referenced by the pod and
|
||||||
processes with the appropriate data in environment variables
|
starts the container processes with the appropriate configuration data from
|
||||||
|
environment variables (a short sketch of the consuming side follows this list).
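
From the container's point of view, step 4 simply means the process finds its
configuration in the environment when it starts. A trivial sketch, using a
hypothetical variable name.

```go
// Sketch of a consumer process reading configuration from its environment.
package main

import (
	"fmt"
	"os"
)

func main() {
	// Assume the pod spec mapped a ConfigMap key to REDIS_MAX_MEMORY.
	maxMemory := os.Getenv("REDIS_MAX_MEMORY")
	if maxMemory == "" {
		maxMemory = "2mb" // fall back to a default when the key is absent
	}
	fmt.Println("configured max memory:", maxMemory)
}
```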
|
||||||
|
|
||||||
### Consuming `ConfigMap` in Volumes
|
### Consuming `ConfigMap` in Volumes
|
||||||
|
|
||||||
Many programs read their configuration from configuration files. `ConfigMap` should be possible
|
A series of events for consuming `ConfigMap` as configuration files in a volume:
|
||||||
to consume in a volume. The rough series of events for consuming `ConfigMap` this way
|
|
||||||
is:
|
|
||||||
|
|
||||||
1. A `ConfigMap` object is created
|
1. Create a `ConfigMap` object.
|
||||||
2. A new pod using the `ConfigMap` via the volume plugin is created
|
2. Create a new pod using the `ConfigMap` via a volume plugin.
|
||||||
3. The pod is scheduled onto a node
|
3. The pod is scheduled onto a node.
|
||||||
4. The Kubelet creates an instance of the volume plugin and calls its `Setup()` method
|
4. The Kubelet creates an instance of the volume plugin and calls its `Setup()`
|
||||||
5. The volume plugin retrieves the `ConfigMap` resource(s) referenced by the pod and projects
|
method.
|
||||||
the appropriate data into the volume
|
5. The volume plugin retrieves the `ConfigMap` resource(s) referenced by the pod
|
||||||
|
and projects the appropriate configuration data into the volume.
|
||||||
|
|
||||||
### Consuming `ConfigMap` Updates
|
### Consuming `ConfigMap` Updates
|
||||||
|
|
||||||
Any long-running system has configuration that is mutated over time. Changes made to configuration
|
Any long-running system has configuration that is mutated over time. Changes
|
||||||
data must be made visible to pods consuming data in volumes so that they can respond to those
|
made to configuration data must be made visible to pods consuming data in
|
||||||
changes.
|
volumes so that they can respond to those changes.
|
||||||
|
|
||||||
The `resourceVersion` of the `ConfigMap` object will be updated by the API server every time the
|
The `resourceVersion` of the `ConfigMap` object will be updated by the API
|
||||||
object is modified. After an update, modifications will be made visible to the consumer container:
|
server every time the object is modified. After an update, modifications will be
|
||||||
|
made visible to the consumer container:
|
||||||
|
|
||||||
1. A `ConfigMap` object is created
|
1. Create a `ConfigMap` object.
|
||||||
2. A new pod using the `ConfigMap` via the volume plugin is created
|
2. Create a new pod using the `ConfigMap` via the volume plugin.
|
||||||
3. The pod is scheduled onto a node
|
3. The pod is scheduled onto a node.
|
||||||
4. During the sync loop, the Kubelet creates an instance of the volume plugin and calls its
|
4. During the sync loop, the Kubelet creates an instance of the volume plugin
|
||||||
`Setup()` method
|
and calls its `Setup()` method.
|
||||||
5. The volume plugin retrieves the `ConfigMap` resource(s) referenced by the pod and projects
|
5. The volume plugin retrieves the `ConfigMap` resource(s) referenced by the pod
|
||||||
the appropriate data into the volume
|
and projects the appropriate data into the volume.
|
||||||
6. The `ConfigMap` referenced by the pod is updated
|
6. The `ConfigMap` referenced by the pod is updated.
|
||||||
7. During the next iteration of the `syncLoop`, the Kubelet creates an instance of the volume plugin
|
7. During the next iteration of the `syncLoop`, the Kubelet creates an instance
|
||||||
and calls its `Setup()` method
|
of the volume plugin and calls its `Setup()` method.
|
||||||
8. The volume plugin projects the updated data into the volume atomically
|
8. The volume plugin projects the updated data into the volume atomically.
|
||||||
|
|
||||||
It is the consuming pod's responsibility to make use of the updated data once it is made visible.
|
It is the consuming pod's responsibility to make use of the updated data once it
|
||||||
|
is made visible.
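
Step 8 depends on the projection being atomic, so a reader inside the container
never sees a half-written update. Below is a minimal sketch of one way to do
that with a versioned directory and a symlink swap; the directory and link
names are assumptions, not the plug-in's actual layout.

```go
// Sketch of atomic projection via a versioned directory and a symlink swap.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// project writes data into a fresh timestamped directory and then atomically
// repoints a "..data" symlink at it, so readers never observe a partial update.
func project(volumeDir string, data map[string]string) error {
	tsDir := filepath.Join(volumeDir, fmt.Sprintf("..%d", time.Now().UnixNano()))
	if err := os.MkdirAll(tsDir, 0755); err != nil {
		return err
	}
	for key, value := range data {
		if err := os.WriteFile(filepath.Join(tsDir, key), []byte(value), 0644); err != nil {
			return err
		}
	}
	tmpLink := filepath.Join(volumeDir, "..data_tmp")
	if err := os.Symlink(filepath.Base(tsDir), tmpLink); err != nil {
		return err
	}
	// Renaming a symlink over an existing symlink is atomic on POSIX systems.
	return os.Rename(tmpLink, filepath.Join(volumeDir, "..data"))
}

func main() {
	_ = project("/tmp/configmap-volume", map[string]string{"example.key": "value"})
}
```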
|
||||||
|
|
||||||
Because environment variables cannot be updated without restarting a container, configuration data
|
Because environment variables cannot be updated without restarting a container,
|
||||||
consumed in environment variables will not be updated.
|
configuration data consumed in environment variables will not be updated.
|
||||||
|
|
||||||
### Advantages
|
### Advantages
|
||||||
|
|
||||||
@ -133,8 +136,8 @@ type ConfigMap struct {
|
|||||||
TypeMeta `json:",inline"`
|
TypeMeta `json:",inline"`
|
||||||
ObjectMeta `json:"metadata,omitempty"`
|
ObjectMeta `json:"metadata,omitempty"`
|
||||||
|
|
||||||
// Data contains the configuration data. Each key must be a valid DNS_SUBDOMAIN or leading
|
// Data contains the configuration data. Each key must be a valid
|
||||||
// dot followed by valid DNS_SUBDOMAIN.
|
// DNS_SUBDOMAIN or leading dot followed by valid DNS_SUBDOMAIN.
|
||||||
Data map[string]string `json:"data,omitempty"`
|
Data map[string]string `json:"data,omitempty"`
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -146,7 +149,8 @@ type ConfigMapList struct {
|
|||||||
}
|
}
|
||||||
```
|
```
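
A small sketch of the key rule stated in the `Data` comment above, assuming the
usual lowercase DNS subdomain pattern; the real validation helpers live
elsewhere in the tree.

```go
// Sketch of the stated ConfigMap key rule: a DNS subdomain, optionally
// prefixed by a single leading dot.
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var dnsSubdomain = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)

// isValidConfigMapKey strips an optional leading dot and checks the remainder
// against the subdomain pattern and length limit.
func isValidConfigMapKey(key string) bool {
	key = strings.TrimPrefix(key, ".")
	return len(key) > 0 && len(key) <= 253 && dnsSubdomain.MatchString(key)
}

func main() {
	for _, k := range []string{"redis.conf", ".hidden.conf", "Bad_Key"} {
		fmt.Println(k, isValidConfigMapKey(k))
	}
}
```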
|
||||||
|
|
||||||
A `Registry` implementation for `ConfigMap` will be added to `pkg/registry/configmap`.
|
A `Registry` implementation for `ConfigMap` will be added to
|
||||||
|
`pkg/registry/configmap`.
|
||||||
|
|
||||||
### Environment Variables
|
### Environment Variables
|
||||||
|
|
||||||
@ -174,8 +178,8 @@ type ConfigMapSelector struct {
|
|||||||
|
|
||||||
### Volume Source
|
### Volume Source
|
||||||
|
|
||||||
A new `ConfigMapVolumeSource` type of volume source containing the `ConfigMap` object will be
|
A new `ConfigMapVolumeSource` type of volume source containing the `ConfigMap`
|
||||||
added to the `VolumeSource` struct in the API:
|
object will be added to the `VolumeSource` struct in the API:
|
||||||
|
|
||||||
```go
|
```go
|
||||||
package api
|
package api
|
||||||
@ -209,13 +213,14 @@ type KeyToPath struct {
|
|||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
**Note:** The update logic used in the downward API volume plug-in will be extracted and re-used in
|
**Note:** The update logic used in the downward API volume plug-in will be
|
||||||
the volume plug-in for `ConfigMap`.
|
extracted and re-used in the volume plug-in for `ConfigMap`.
|
||||||
|
|
||||||
### Changes to Secret
|
### Changes to Secret
|
||||||
|
|
||||||
We will update the Secret volume plugin to have a similar API to the new ConfigMap volume plugin.
|
We will update the Secret volume plugin to have a similar API to the new
|
||||||
The secret volume plugin will also begin updating secret content in the volume when secrets change.
|
`ConfigMap` volume plugin. The secret volume plugin will also begin updating
|
||||||
|
secret content in the volume when secrets change.
|
||||||
|
|
||||||
## Examples
|
## Examples
|
||||||
|
|
||||||
@ -281,7 +286,8 @@ spec:
|
|||||||
|
|
||||||
#### Consuming `ConfigMap` as Volumes
|
#### Consuming `ConfigMap` as Volumes
|
||||||
|
|
||||||
`redis-volume-config` is intended to be used as a volume containing a config file:
|
`redis-volume-config` is intended to be used as a volume containing a config
|
||||||
|
file:
|
||||||
|
|
||||||
```yaml
|
```yaml
|
||||||
apiVersion: extensions/v1beta1
|
apiVersion: extensions/v1beta1
|
||||||
@ -320,8 +326,8 @@ spec:
|
|||||||
|
|
||||||
## Future Improvements
|
## Future Improvements
|
||||||
|
|
||||||
In the future, we may add the ability to specify an init-container that can watch the volume
|
In the future, we may add the ability to specify an init-container that can
|
||||||
contents for updates and respond to changes when they occur.
|
watch the volume contents for updates and respond to changes when they occur.
|
||||||
|
|
||||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||||
[]()
|
[]()
|
||||||
|
@ -54,7 +54,7 @@ ideas.
|
|||||||
* **High availability:** continuing to be available and work correctly
|
* **High availability:** continuing to be available and work correctly
|
||||||
even if some components are down or uncontactable. This typically
|
even if some components are down or uncontactable. This typically
|
||||||
involves multiple replicas of critical services, and a reliable way
|
involves multiple replicas of critical services, and a reliable way
|
||||||
to find available replicas. Note that it's possible (but not
|
to find available replicas. Note that it's possible (but not
|
||||||
desirable) to have high
|
desirable) to have high
|
||||||
availability properties (e.g. multiple replicas) in the absence of
|
availability properties (e.g. multiple replicas) in the absence of
|
||||||
self-healing properties (e.g. if a replica fails, nothing replaces
|
self-healing properties (e.g. if a replica fails, nothing replaces
|
||||||
@ -109,11 +109,11 @@ ideas.
|
|||||||
|
|
||||||
## Relative Priorities
|
## Relative Priorities
|
||||||
|
|
||||||
1. **(Possibly manual) recovery from catastrophic failures:** having a Kubernetes cluster, and all
|
1. **(Possibly manual) recovery from catastrophic failures:** having a
|
||||||
applications running inside it, disappear forever perhaps is the worst
|
Kubernetes cluster, and all applications running inside it, disappear forever
|
||||||
possible failure mode. So it is critical that we be able to
|
perhaps is the worst possible failure mode. So it is critical that we be able to
|
||||||
recover the applications running inside a cluster from such
|
recover the applications running inside a cluster from such failures in some
|
||||||
failures in some well-bounded time period.
|
well-bounded time period.
|
||||||
1. In theory a cluster can be recovered by replaying all API calls
|
1. In theory a cluster can be recovered by replaying all API calls
|
||||||
that have ever been executed against it, in order, but most
|
that have ever been executed against it, in order, but most
|
||||||
often that state has been lost, and/or is scattered across
|
often that state has been lost, and/or is scattered across
|
||||||
@ -121,12 +121,12 @@ ideas.
|
|||||||
probably infeasible.
|
probably infeasible.
|
||||||
1. In theory a cluster can also be recovered to some relatively
|
1. In theory a cluster can also be recovered to some relatively
|
||||||
recent non-corrupt backup/snapshot of the disk(s) backing the
|
recent non-corrupt backup/snapshot of the disk(s) backing the
|
||||||
etcd cluster state. But we have no default consistent
|
etcd cluster state. But we have no default consistent
|
||||||
backup/snapshot, verification or restoration process. And we
|
backup/snapshot, verification or restoration process. And we
|
||||||
don't routinely test restoration, so even if we did routinely
|
don't routinely test restoration, so even if we did routinely
|
||||||
perform and verify backups, we have no hard evidence that we
|
perform and verify backups, we have no hard evidence that we
|
||||||
can in practice effectively recover from catastrophic cluster
|
can in practice effectively recover from catastrophic cluster
|
||||||
failure or data corruption by restoring from these backups. So
|
failure or data corruption by restoring from these backups. So
|
||||||
there's more work to be done here.
|
there's more work to be done here.
|
||||||
1. **Self-healing:** Most major cloud providers provide the ability to
|
1. **Self-healing:** Most major cloud providers provide the ability to
|
||||||
easily and automatically replace failed virtual machines within a
|
easily and automatically replace failed virtual machines within a
|
||||||
@ -144,7 +144,6 @@ ideas.
|
|||||||
addition](https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member)
|
addition](https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member)
|
||||||
or [backup and
|
or [backup and
|
||||||
recovery](https://github.com/coreos/etcd/blob/master/Documentation/admin_guide.md#disaster-recovery)).
|
recovery](https://github.com/coreos/etcd/blob/master/Documentation/admin_guide.md#disaster-recovery)).
|
||||||
|
|
||||||
1. and boot disks are either:
|
1. and boot disks are either:
|
||||||
1. truly persistent (i.e. remote persistent disks), or
|
1. truly persistent (i.e. remote persistent disks), or
|
||||||
1. reconstructible (e.g. using boot-from-snapshot,
|
1. reconstructible (e.g. using boot-from-snapshot,
|
||||||
@ -157,7 +156,7 @@ ideas.
|
|||||||
quorum members). In environments where cloud-assisted automatic
|
quorum members). In environments where cloud-assisted automatic
|
||||||
self-healing might be infeasible (e.g. on-premise bare-metal
|
self-healing might be infeasible (e.g. on-premise bare-metal
|
||||||
deployments), it also gives cluster administrators more time to
|
deployments), it also gives cluster administrators more time to
|
||||||
respond (e.g. replace/repair failed machines) without incurring
|
respond (e.g. replace/repair failed machines) without incurring
|
||||||
system downtime.
|
system downtime.
|
||||||
|
|
||||||
## Design and Status (as of December 2015)
|
## Design and Status (as of December 2015)
|
||||||
@ -174,7 +173,7 @@ ideas.
|
|||||||
|
|
||||||
Multiple stateless, self-hosted, self-healing API servers behind a HA
|
Multiple stateless, self-hosted, self-healing API servers behind a HA
|
||||||
load balancer, built out by the default "kube-up" automation on GCE,
|
load balancer, built out by the default "kube-up" automation on GCE,
|
||||||
AWS and basic bare metal (BBM). Note that the single-host approach of
|
AWS and basic bare metal (BBM). Note that the single-host approach of
|
||||||
having etcd listen only on localhost to ensure that only the API server can
|
having etcd listen only on localhost to ensure that only the API server can
|
||||||
connect to it will no longer work, so alternative security will be
|
connect to it will no longer work, so alternative security will be
|
||||||
needed in this regard (either using firewall rules, SSL certs, or
|
needed in this regard (either using firewall rules, SSL certs, or
|
||||||
@ -189,13 +188,13 @@ design doc.
|
|||||||
<td>
|
<td>
|
||||||
|
|
||||||
No scripted self-healing or HA on GCE, AWS or basic bare metal
|
No scripted self-healing or HA on GCE, AWS or basic bare metal
|
||||||
currently exists in the OSS distro. To be clear, "no self healing"
|
currently exists in the OSS distro. To be clear, "no self healing"
|
||||||
means that even if multiple e.g. API servers are provisioned for HA
|
means that even if multiple e.g. API servers are provisioned for HA
|
||||||
purposes, if they fail, nothing replaces them, so eventually the
|
purposes, if they fail, nothing replaces them, so eventually the
|
||||||
system will fail. Self-healing and HA can be set up
|
system will fail. Self-healing and HA can be set up
|
||||||
manually by following documented instructions, but this is not
|
manually by following documented instructions, but this is not
|
||||||
currently an automated process, and it is not tested as part of
|
currently an automated process, and it is not tested as part of
|
||||||
continuous integration. So it's probably safest to assume that it
|
continuous integration. So it's probably safest to assume that it
|
||||||
doesn't actually work in practice.
|
doesn't actually work in practice.
|
||||||
|
|
||||||
</td>
|
</td>
|
||||||
@ -205,8 +204,8 @@ doesn't actually work in practise.
|
|||||||
<td>
|
<td>
|
||||||
|
|
||||||
Multiple self-hosted, self healing warm standby stateless controller
|
Multiple self-hosted, self healing warm standby stateless controller
|
||||||
managers and schedulers with leader election and automatic failover of API server
|
managers and schedulers with leader election and automatic failover of API
|
||||||
clients, automatically installed by default "kube-up" automation.
|
server clients, automatically installed by default "kube-up" automation.
|
||||||
|
|
||||||
</td>
|
</td>
|
||||||
<td>As above.</td>
|
<td>As above.</td>
|
||||||
@ -218,47 +217,49 @@ clients, automatically installed by default "kube-up" automation.
|
|||||||
Multiple (3-5) etcd quorum members behind a load balancer with session
|
Multiple (3-5) etcd quorum members behind a load balancer with session
|
||||||
affinity (to prevent clients from being bounced from one to another).
|
affinity (to prevent clients from being bounced from one to another).
|
||||||
|
|
||||||
Regarding self-healing, if a node running etcd goes down, it is always necessary to do three
|
Regarding self-healing, if a node running etcd goes down, it is always necessary
|
||||||
things:
|
to do three things:
|
||||||
<ol>
|
<ol>
|
||||||
<li>allocate a new node (not necessary if running etcd as a pod, in
|
<li>allocate a new node (not necessary if running etcd as a pod, in
|
||||||
which case specific measures are required to prevent user pods from
|
which case specific measures are required to prevent user pods from
|
||||||
interfering with system pods, for example using node selectors as
|
interfering with system pods, for example using node selectors as
|
||||||
described in <A HREF=")
|
described in <A HREF="),
|
||||||
<li>start an etcd replica on that new node,
|
<li>start an etcd replica on that new node, and
|
||||||
<li>have the new replica recover the etcd state.
|
<li>have the new replica recover the etcd state.
|
||||||
</ol>
|
</ol>
|
||||||
In the case of local disk (which fails in concert with the machine), the etcd
|
In the case of local disk (which fails in concert with the machine), the etcd
|
||||||
state must be recovered from the other replicas. This is called <A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member">dynamic member
|
state must be recovered from the other replicas. This is called
|
||||||
addition</A>.
|
<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member">
|
||||||
In the case of remote persistent disk, the etcd state can be recovered
|
dynamic member addition</A>.
|
||||||
by attaching the remote persistent disk to the replacement node, thus
|
|
||||||
the state is recoverable even if all other replicas are down.
|
In the case of remote persistent disk, the etcd state can be recovered by
|
||||||
|
attaching the remote persistent disk to the replacement node, thus the state is
|
||||||
|
recoverable even if all other replicas are down.
|
||||||
|
|
||||||
There are also significant performance differences between local disks and remote
|
There are also significant performance differences between local disks and remote
|
||||||
persistent disks. For example, the <A HREF="https://cloud.google.com/compute/docs/disks/#comparison_of_disk_types">sustained throughput
|
persistent disks. For example, the
|
||||||
local disks in GCE is approximatley 20x that of remote disks</A>.
|
<A HREF="https://cloud.google.com/compute/docs/disks/#comparison_of_disk_types">
|
||||||
|
sustained throughput of local disks in GCE is approximately 20x that of remote
|
||||||
|
disks</A>.
|
||||||
|
|
||||||
Hence we suggest that self-healing be provided by remotely mounted persistent disks in
|
Hence we suggest that self-healing be provided by remotely mounted persistent
|
||||||
non-performance critical, single-zone cloud deployments. For
|
disks in non-performance critical, single-zone cloud deployments. For
|
||||||
performance critical installations, faster local SSD's should be used,
|
performance-critical installations, faster local SSDs should be used, in which
|
||||||
in which case remounting on node failure is not an option, so
|
case remounting on node failure is not an option, so
|
||||||
<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md ">etcd runtime configuration</A>
|
<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md ">
|
||||||
should be used to replace the failed machine. Similarly, for
|
etcd runtime configuration</A> should be used to replace the failed machine.
|
||||||
cross-zone self-healing, cloud persistent disks are zonal, so
|
Similarly, for cross-zone self-healing, cloud persistent disks are zonal, so
|
||||||
automatic
|
automatic <A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">
|
||||||
<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">runtime configuration</A>
|
runtime configuration</A> is required. Similarly, basic bare metal deployments
|
||||||
is required. Similarly, basic bare metal deployments cannot generally
|
cannot generally rely on remote persistent disks, so the same approach applies
|
||||||
rely on
|
there.
|
||||||
remote persistent disks, so the same approach applies there.
|
|
||||||
</td>
|
</td>
|
||||||
<td>
|
<td>
|
||||||
<A HREF="http://kubernetes.io/v1.1/docs/admin/high-availability.html">
|
<A HREF="http://kubernetes.io/v1.1/docs/admin/high-availability.html">
|
||||||
Somewhat vague instructions exist</A>
|
Somewhat vague instructions exist</A> on how to set some of this up manually in
|
||||||
on how to set some of this up manually in a self-hosted
|
a self-hosted configuration. But automatic bootstrapping and self-healing is not
|
||||||
configuration. But automatic bootstrapping and self-healing is not
|
described (and is not implemented for the non-PD cases). This all still needs to
|
||||||
described (and is not implemented for the non-PD cases). This all
|
be automated and continuously tested.
|
||||||
still needs to be automated and continuously tested.
|
|
||||||
</td>
|
</td>
|
||||||
</tr>
|
</tr>
|
||||||
</table>
|
</table>
|
||||||
|
@ -38,40 +38,68 @@ Documentation for other releases can be found at
|
|||||||
|
|
||||||
**Status**: Implemented.
|
**Status**: Implemented.
|
||||||
|
|
||||||
This document presents the design of the Kubernetes DaemonSet, describes use cases, and gives an overview of the code.
|
This document presents the design of the Kubernetes DaemonSet, describes use
|
||||||
|
cases, and gives an overview of the code.
|
||||||
|
|
||||||
## Motivation
|
## Motivation
|
||||||
|
|
||||||
Many users have requested for a way to run a daemon on every node in a Kubernetes cluster, or on a certain set of nodes in a cluster. This is essential for use cases such as building a sharded datastore, or running a logger on every node. In comes the DaemonSet, a way to conveniently create and manage daemon-like workloads in Kubernetes.
|
Many users have requested a way to run a daemon on every node in a
|
||||||
|
Kubernetes cluster, or on a certain set of nodes in a cluster. This is essential
|
||||||
|
for use cases such as building a sharded datastore, or running a logger on every
|
||||||
|
node. In comes the DaemonSet, a way to conveniently create and manage
|
||||||
|
daemon-like workloads in Kubernetes.
|
||||||
|
|
||||||
## Use Cases
|
## Use Cases
|
||||||
|
|
||||||
The DaemonSet can be used for user-specified system services, cluster-level applications with strong node ties, and Kubernetes node services. Below are example use cases in each category.
|
The DaemonSet can be used for user-specified system services, cluster-level
|
||||||
|
applications with strong node ties, and Kubernetes node services. Below are
|
||||||
|
example use cases in each category.
|
||||||
|
|
||||||
### User-Specified System Services:
|
### User-Specified System Services:
|
||||||
|
|
||||||
Logging: Some users want a way to collect statistics about nodes in a cluster and send those logs to an external database. For example, system administrators might want to know if their machines are performing as expected, if they need to add more machines to the cluster, or if they should switch cloud providers. The DaemonSet can be used to run a data collection service (for example fluentd) on every node and send the data to a service like ElasticSearch for analysis.
|
Logging: Some users want a way to collect statistics about nodes in a cluster
|
||||||
|
and send those logs to an external database. For example, system administrators
|
||||||
|
might want to know if their machines are performing as expected, if they need to
|
||||||
|
add more machines to the cluster, or if they should switch cloud providers. The
|
||||||
|
DaemonSet can be used to run a data collection service (for example fluentd) on
|
||||||
|
every node and send the data to a service like ElasticSearch for analysis.
|
||||||
|
|
||||||
### Cluster-Level Applications
|
### Cluster-Level Applications
|
||||||
|
|
||||||
Datastore: Users might want to implement a sharded datastore in their cluster. A few nodes in the cluster, labeled ‘app=datastore’, might be responsible for storing data shards, and pods running on these nodes might serve data. This architecture requires a way to bind pods to specific nodes, so it cannot be achieved using a Replication Controller. A DaemonSet is a convenient way to implement such a datastore.
|
Datastore: Users might want to implement a sharded datastore in their cluster. A
|
||||||
|
few nodes in the cluster, labeled ‘app=datastore’, might be responsible for
|
||||||
|
storing data shards, and pods running on these nodes might serve data. This
|
||||||
|
architecture requires a way to bind pods to specific nodes, so it cannot be
|
||||||
|
achieved using a Replication Controller. A DaemonSet is a convenient way to
|
||||||
|
implement such a datastore.
|
||||||
|
|
||||||
For other uses, see the related [feature request](https://issues.k8s.io/1518).
|
For other uses, see the related [feature request](https://issues.k8s.io/1518).
|
||||||
|
|
||||||
## Functionality
|
## Functionality
|
||||||
|
|
||||||
The DaemonSet supports standard API features:
|
The DaemonSet supports standard API features:
|
||||||
- create
|
- create
|
||||||
- The spec for DaemonSets has a pod template field.
|
- The spec for DaemonSets has a pod template field.
|
||||||
- Using the pod’s nodeSelector field, DaemonSets can be restricted to operate over nodes that have a certain label. For example, suppose that in a cluster some nodes are labeled ‘app=database’. You can use a DaemonSet to launch a datastore pod on exactly those nodes labeled ‘app=database’.
|
- Using the pod’s nodeSelector field, DaemonSets can be restricted to operate
|
||||||
- Using the pod's nodeName field, DaemonSets can be restricted to operate on a specified node.
|
over nodes that have a certain label. For example, suppose that in a cluster
|
||||||
- The PodTemplateSpec used by the DaemonSet is the same as the PodTemplateSpec used by the Replication Controller.
|
some nodes are labeled ‘app=database’. You can use a DaemonSet to launch a
|
||||||
- The initial implementation will not guarantee that DaemonSet pods are created on nodes before other pods.
|
datastore pod on exactly those nodes labeled ‘app=database’.
|
||||||
- The initial implementation of DaemonSet does not guarantee that DaemonSet pods show up on nodes (for example because of resource limitations of the node), but makes a best effort to launch DaemonSet pods (like Replication Controllers do with pods). Subsequent revisions might ensure that DaemonSet pods show up on nodes, preempting other pods if necessary.
|
- Using the pod's nodeName field, DaemonSets can be restricted to operate on a
|
||||||
- The DaemonSet controller adds an annotation "kubernetes.io/created-by: \<json API object reference\>"
|
specified node.
|
||||||
|
- The PodTemplateSpec used by the DaemonSet is the same as the PodTemplateSpec
|
||||||
|
used by the Replication Controller.
|
||||||
|
- The initial implementation will not guarantee that DaemonSet pods are
|
||||||
|
created on nodes before other pods.
|
||||||
|
- The initial implementation of DaemonSet does not guarantee that DaemonSet
|
||||||
|
pods show up on nodes (for example because of resource limitations of the node),
|
||||||
|
but makes a best effort to launch DaemonSet pods (like Replication Controllers
|
||||||
|
do with pods). Subsequent revisions might ensure that DaemonSet pods show up on
|
||||||
|
nodes, preempting other pods if necessary.
|
||||||
|
- The DaemonSet controller adds an annotation:
|
||||||
|
```"kubernetes.io/created-by: \<json API object reference\>"```
|
||||||
- YAML example:
|
- YAML example:
|
||||||
|
|
||||||
```YAML
|
```YAML
|
||||||
apiVersion: extensions/v1beta1
|
apiVersion: extensions/v1beta1
|
||||||
kind: DaemonSet
|
kind: DaemonSet
|
||||||
metadata:
|
metadata:
|
||||||
@ -94,42 +122,83 @@ The DaemonSet supports standard API features:
|
|||||||
name: main
|
name: main
|
||||||
```
|
```
|
||||||
|
|
||||||
- commands that get info
|
- commands that get info:
|
||||||
- get (e.g. kubectl get daemonsets)
|
- get (e.g. kubectl get daemonsets)
|
||||||
- describe
|
- describe
|
||||||
- Modifiers
|
- Modifiers:
|
||||||
- delete (if --cascade=true, then first the client turns down all the pods controlled by the DaemonSet (by setting the nodeSelector to a uuid pair that is unlikely to be set on any node); then it deletes the DaemonSet; then it deletes the pods)
|
- delete (if --cascade=true, then first the client turns down all the pods
|
||||||
|
controlled by the DaemonSet (by setting the nodeSelector to a uuid pair that is
|
||||||
|
unlikely to be set on any node); then it deletes the DaemonSet; then it deletes
|
||||||
|
the pods)
|
||||||
- label
|
- label
|
||||||
- annotate
|
- annotate
|
||||||
- update operations like patch and replace (only allowed to selector and to nodeSelector and nodeName of pod template)
|
- update operations like patch and replace (only allowed to selector and to
|
||||||
- DaemonSets have labels, so you could, for example, list all DaemonSets with certain labels (the same way you would for a Replication Controller).
|
nodeSelector and nodeName of pod template)
|
||||||
- In general, for all the supported features like get, describe, update, etc, the DaemonSet works in a similar way to the Replication Controller. However, note that the DaemonSet and the Replication Controller are different constructs.
|
- DaemonSets have labels, so you could, for example, list all DaemonSets
|
||||||
|
with certain labels (the same way you would for a Replication Controller).
|
||||||
|
|
||||||
|
In general, for all the supported features like get, describe, update, etc,
|
||||||
|
the DaemonSet works in a similar way to the Replication Controller. However,
|
||||||
|
note that the DaemonSet and the Replication Controller are different constructs.
|
||||||
|
|
||||||
### Persisting Pods
|
### Persisting Pods
|
||||||
|
|
||||||
- Ordinary liveness probes specified in the pod template work to keep pods created by a DaemonSet running.
|
- Ordinary liveness probes specified in the pod template work to keep pods
|
||||||
- If a daemon pod is killed or stopped, the DaemonSet will create a new replica of the daemon pod on the node.
|
created by a DaemonSet running.
|
||||||
|
- If a daemon pod is killed or stopped, the DaemonSet will create a new
|
||||||
|
replica of the daemon pod on the node.
|
||||||
|
|
||||||
### Cluster Mutations
|
### Cluster Mutations
|
||||||
|
|
||||||
- When a new node is added to the cluster, the DaemonSet controller starts daemon pods on the node for DaemonSets whose pod template nodeSelectors match the node’s labels.
|
- When a new node is added to the cluster, the DaemonSet controller starts
|
||||||
- Suppose the user launches a DaemonSet that runs a logging daemon on all nodes labeled “logger=fluentd”. If the user then adds the “logger=fluentd” label to a node (that did not initially have the label), the logging daemon will launch on the node. Additionally, if a user removes the label from a node, the logging daemon on that node will be killed.
|
daemon pods on the node for DaemonSets whose pod template nodeSelectors match
|
||||||
|
the node’s labels.
|
||||||
|
- Suppose the user launches a DaemonSet that runs a logging daemon on all
|
||||||
|
nodes labeled “logger=fluentd”. If the user then adds the “logger=fluentd” label
|
||||||
|
to a node (that did not initially have the label), the logging daemon will
|
||||||
|
launch on the node. Additionally, if a user removes the label from a node, the
|
||||||
|
logging daemon on that node will be killed.
|
||||||
|
|
||||||
## Alternatives Considered
|
## Alternatives Considered
|
||||||
|
|
||||||
We considered several alternatives, that were deemed inferior to the approach of creating a new DaemonSet abstraction.
|
We considered several alternatives that were deemed inferior to the approach of
|
||||||
|
creating a new DaemonSet abstraction.
|
||||||
|
|
||||||
One alternative is to include the daemon in the machine image. In this case it would run outside of Kubernetes proper, and thus not be monitored, health checked, usable as a service endpoint, easily upgradable, etc.
|
One alternative is to include the daemon in the machine image. In this case it
|
||||||
|
would run outside of Kubernetes proper, and thus not be monitored, health
|
||||||
|
checked, usable as a service endpoint, easily upgradable, etc.
|
||||||
|
|
||||||
A related alternative is to package daemons as static pods. This would address most of the problems described above, but they would still not be easily upgradable, and more generally could not be managed through the API server interface.
|
A related alternative is to package daemons as static pods. This would address
|
||||||
|
most of the problems described above, but they would still not be easily
|
||||||
|
upgradable, and more generally could not be managed through the API server
|
||||||
|
interface.
|
||||||
|
|
||||||
A third alternative is to generalize the Replication Controller. We would do something like: if you set the `replicas` field of the ReplicationConrollerSpec to -1, then it means "run exactly one replica on every node matching the nodeSelector in the pod template." The ReplicationController would pretend `replicas` had been set to some large number -- larger than the largest number of nodes ever expected in the cluster -- and would use some anti-affinity mechanism to ensure that no more than one Pod from the ReplicationController runs on any given node. There are two downsides to this approach. First, there would always be a large number of Pending pods in the scheduler (these will be scheduled onto new machines when they are added to the cluster). The second downside is more philosophical: DaemonSet and the Replication Controller are very different concepts. We believe that having small, targeted controllers for distinct purposes makes Kubernetes easier to understand and use, compared to having larger multi-functional controllers (see ["Convert ReplicationController to a plugin"](http://issues.k8s.io/3058) for some discussion of this topic).
|
A third alternative is to generalize the Replication Controller. We would do
|
||||||
|
something like: if you set the `replicas` field of the ReplicationControllerSpec
|
||||||
|
to -1, then it means "run exactly one replica on every node matching the
|
||||||
|
nodeSelector in the pod template." The ReplicationController would pretend
|
||||||
|
`replicas` had been set to some large number -- larger than the largest number
|
||||||
|
of nodes ever expected in the cluster -- and would use some anti-affinity
|
||||||
|
mechanism to ensure that no more than one Pod from the ReplicationController
|
||||||
|
runs on any given node. There are two downsides to this approach. First,
|
||||||
|
there would always be a large number of Pending pods in the scheduler (these
|
||||||
|
will be scheduled onto new machines when they are added to the cluster). The
|
||||||
|
second downside is more philosophical: DaemonSet and the Replication Controller
|
||||||
|
are very different concepts. We believe that having small, targeted controllers
|
||||||
|
for distinct purposes makes Kubernetes easier to understand and use, compared to
|
||||||
|
having larger multi-functional controllers (see
|
||||||
|
["Convert ReplicationController to a plugin"](http://issues.k8s.io/3058) for
|
||||||
|
some discussion of this topic).
|
||||||
|
|
||||||
## Design
|
## Design
|
||||||
|
|
||||||
#### Client
|
#### Client
|
||||||
|
|
||||||
- Add support for DaemonSet commands to kubectl and the client. Client code was added to client/unversioned. The main files in Kubectl that were modified are kubectl/describe.go and kubectl/stop.go, since for other calls like Get, Create, and Update, the client simply forwards the request to the backend via the REST API.
|
- Add support for DaemonSet commands to kubectl and the client. Client code was
|
||||||
|
added to client/unversioned. The main files in Kubectl that were modified are
|
||||||
|
kubectl/describe.go and kubectl/stop.go, since for other calls like Get, Create,
|
||||||
|
and Update, the client simply forwards the request to the backend via the REST
|
||||||
|
API.
|
||||||
|
|
||||||
#### Apiserver
|
#### Apiserver
|
||||||
|
|
||||||
@ -137,18 +206,29 @@ A third alternative is to generalize the Replication Controller. We would do som
|
|||||||
- REST API calls are handled in registry/daemon
|
- REST API calls are handled in registry/daemon
|
||||||
- In particular, the api server will add the object to etcd
|
- In particular, the api server will add the object to etcd
|
||||||
- DaemonManager listens for updates to etcd (using Framework.informer)
|
- DaemonManager listens for updates to etcd (using Framework.informer)
|
||||||
- API objects for DaemonSet were created in expapi/v1/types.go and expapi/v1/register.go
|
- API objects for DaemonSet were created in expapi/v1/types.go and
|
||||||
|
expapi/v1/register.go
|
||||||
- Validation code is in expapi/validation
|
- Validation code is in expapi/validation
|
||||||
|
|
||||||
#### Daemon Manager
|
#### Daemon Manager
|
||||||
|
|
||||||
- Creates new DaemonSets when requested. Launches the corresponding daemon pod on all nodes with labels matching the new DaemonSet’s selector.
|
- Creates new DaemonSets when requested. Launches the corresponding daemon pod
|
||||||
- Listens for addition of new nodes to the cluster, by setting up a framework.NewInformer that watches for the creation of Node API objects. When a new node is added, the daemon manager will loop through each DaemonSet. If the label of the node matches the selector of the DaemonSet, then the daemon manager will create the corresponding daemon pod in the new node.
|
on all nodes with labels matching the new DaemonSet’s selector.
|
||||||
- The daemon manager creates a pod on a node by sending a command to the API server, requesting for a pod to be bound to the node (the node will be specified via its hostname)
|
- Listens for addition of new nodes to the cluster, by setting up a
|
||||||
|
framework.NewInformer that watches for the creation of Node API objects. When a
|
||||||
|
new node is added, the daemon manager will loop through each DaemonSet. If the
|
||||||
|
label of the node matches the selector of the DaemonSet, then the daemon manager
|
||||||
|
will create the corresponding daemon pod in the new node (a sketch follows
this list).
|
||||||
|
- The daemon manager creates a pod on a node by sending a command to the API
|
||||||
|
server, requesting for a pod to be bound to the node (the node will be specified
|
||||||
|
via its hostname).
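
A rough sketch of the node-watch reaction described above, using hypothetical
types in place of the real informer machinery and API client.

```go
// Sketch only: the real controller uses shared informers and the API server.
package main

import "fmt"

type Node struct {
	Name   string
	Labels map[string]string
}

type DaemonSet struct {
	Name     string
	Selector map[string]string // node labels the daemon pods require
}

// matches reports whether every selector label is present on the node.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// onNodeAdd mirrors the manager's behavior: for each DaemonSet whose selector
// matches the new node, create (here: print) a daemon pod bound to that node.
func onNodeAdd(node Node, daemonSets []DaemonSet) {
	for _, ds := range daemonSets {
		if matches(ds.Selector, node.Labels) {
			fmt.Printf("create pod for DaemonSet %q bound to node %q\n", ds.Name, node.Name)
		}
	}
}

func main() {
	daemonSets := []DaemonSet{{Name: "fluentd", Selector: map[string]string{"logger": "fluentd"}}}
	onNodeAdd(Node{Name: "node-1", Labels: map[string]string{"logger": "fluentd"}}, daemonSets)
}
```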
|
||||||
|
|
||||||
#### Kubelet
|
#### Kubelet
|
||||||
|
|
||||||
- Does not need to be modified, but health checking will occur for the daemon pods and revive the pods if they are killed (we set the pod restartPolicy to Always). We reject DaemonSet objects with pod templates that don’t have restartPolicy set to Always.
|
- Does not need to be modified, but health checking will occur for the daemon
|
||||||
|
pods and revive the pods if they are killed (we set the pod restartPolicy to
|
||||||
|
Always). We reject DaemonSet objects with pod templates that don’t have
|
||||||
|
restartPolicy set to Always.
|
||||||
|
|
||||||
## Open Issues
|
## Open Issues
|
||||||
|
|
||||||
|
@ -34,33 +34,60 @@ Documentation for other releases can be found at
|
|||||||
|
|
||||||
# Enhance Pluggable Policy
|
# Enhance Pluggable Policy
|
||||||
|
|
||||||
While trying to develop an authorization plugin for Kubernetes, we found a few places where API extensions would ease development and add power. There are a few goals:
|
While trying to develop an authorization plugin for Kubernetes, we found a few
|
||||||
1. Provide an authorization plugin that can evaluate a .Authorize() call based on the full content of the request to RESTStorage. This includes information like the full verb, the content of creates and updates, and the names of resources being acted upon.
|
places where API extensions would ease development and add power. There are a
|
||||||
1. Provide a way to ask whether a user is permitted to take an action without running in process with the API Authorizer. For instance, a proxy for exec calls could ask whether a user can run the exec they are requesting.
|
few goals:
|
||||||
1. Provide a way to ask who can perform a given action on a given resource. This is useful for answering questions like, "who can create replication controllers in my namespace".
|
1. Provide an authorization plugin that can evaluate a .Authorize() call based
|
||||||
|
on the full content of the request to RESTStorage. This includes information
|
||||||
|
like the full verb, the content of creates and updates, and the names of
|
||||||
|
resources being acted upon.
|
||||||
|
1. Provide a way to ask whether a user is permitted to take an action without
|
||||||
|
running in process with the API Authorizer. For instance, a proxy for exec
|
||||||
|
calls could ask whether a user can run the exec they are requesting.
|
||||||
|
1. Provide a way to ask who can perform a given action on a given resource.
|
||||||
|
This is useful for answering questions like, "who can create replication
|
||||||
|
controllers in my namespace".
|
||||||
|
|
||||||
This proposal adds to and extends the existing API to so that authorizers may provide the functionality described above. It does not attempt to describe how the policies themselves can be expressed, that is up the authorization plugins themselves.
|
This proposal adds to and extends the existing API so that authorizers may
|
||||||
|
provide the functionality described above. It does not attempt to describe how
|
||||||
|
the policies themselves can be expressed, that is up the authorization plugins
|
||||||
|
themselves.
|
||||||
|
|
||||||
|
|
||||||
## Enhancements to existing Authorization interfaces
|
## Enhancements to existing Authorization interfaces
|
||||||
|
|
||||||
The existing Authorization interfaces are described here: [docs/admin/authorization.md](../admin/authorization.md). A couple additions will allow the development of an Authorizer that matches based on different rules than the existing implementation.
|
The existing Authorization interfaces are described
|
||||||
|
[here](../admin/authorization.md). A couple additions will allow the development
|
||||||
|
of an Authorizer that matches based on different rules than the existing
|
||||||
|
implementation.
|
||||||
|
|
||||||
### Request Attributes
|
### Request Attributes
|
||||||
|
|
||||||
The existing authorizer.Attributes only has 5 attributes (user, groups, isReadOnly, kind, and namespace). If we add more detailed verbs, content, and resource names, then Authorizer plugins will have the same level of information available to RESTStorage components in order to express more detailed policy. The replacement excerpt is below.
|
The existing authorizer.Attributes only has 5 attributes (user, groups,
|
||||||
|
isReadOnly, kind, and namespace). If we add more detailed verbs, content, and
|
||||||
|
resource names, then Authorizer plugins will have the same level of information
|
||||||
|
available to RESTStorage components in order to express more detailed policy.
|
||||||
|
The replacement excerpt is below.
|
||||||
|
|
||||||
An API request has the following attributes that can be considered for authorization:
|
An API request has the following attributes that can be considered for
|
||||||
- user - the user-string which a user was authenticated as. This is included in the Context.
|
authorization:
|
||||||
- groups - the groups to which the user belongs. This is included in the Context.
|
- user - the user-string which a user was authenticated as. This is included
|
||||||
- verb - string describing the requesting action. Today we have: get, list, watch, create, update, and delete. The old `readOnly` behavior is equivalent to allowing get, list, watch.
|
in the Context.
|
||||||
- namespace - the namespace of the object being access, or the empty string if the endpoint does not support namespaced objects. This is included in the Context.
|
- groups - the groups to which the user belongs. This is included in the
|
||||||
|
Context.
|
||||||
|
- verb - string describing the requesting action. Today we have: get, list,
|
||||||
|
watch, create, update, and delete. The old `readOnly` behavior is equivalent to
|
||||||
|
allowing get, list, watch.
|
||||||
|
- namespace - the namespace of the object being access, or the empty string if
|
||||||
|
the endpoint does not support namespaced objects. This is included in the
|
||||||
|
Context.
|
||||||
- resourceGroup - the API group of the resource being accessed
|
- resourceGroup - the API group of the resource being accessed
|
||||||
- resourceVersion - the API version of the resource being accessed
|
- resourceVersion - the API version of the resource being accessed
|
||||||
- resource - which resource is being accessed
|
- resource - which resource is being accessed
|
||||||
- applies only to the API endpoints, such as
|
- applies only to the API endpoints, such as `/api/v1beta1/pods`. For
|
||||||
`/api/v1beta1/pods`. For miscellaneous endpoints, like `/version`, the kind is the empty string.
|
miscellaneous endpoints, like `/version`, the kind is the empty string.
|
||||||
- resourceName - the name of the resource during a get, update, or delete action.
|
- resourceName - the name of the resource during a get, update, or delete
|
||||||
|
action.
|
||||||
- subresource - which subresource is being accessed
|
- subresource - which subresource is being accessed
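
To make that concrete, the sketch below shows one way the extended attribute set
could be surfaced to authorizer plugins. The method names are illustrative
assumptions for this document, not a final interface.

```go
// Illustrative sketch only: accessor methods mirroring the request attributes
// listed above. The exact method names are assumptions.
type Attributes interface {
	// GetUserName returns the user-string the request was authenticated as.
	GetUserName() string
	// GetGroups returns the groups the authenticated user belongs to.
	GetGroups() []string
	// GetVerb returns one of: get, list, watch, create, update, delete.
	GetVerb() string
	// GetNamespace returns the namespace of the object being accessed, or the
	// empty string when the endpoint is not namespaced.
	GetNamespace() string
	// GetResourceGroup and GetResourceVersion identify the API group and
	// version of the resource being accessed.
	GetResourceGroup() string
	GetResourceVersion() string
	// GetResource and GetSubresource identify what is being accessed.
	GetResource() string
	GetSubresource() string
	// GetResourceName returns the object name for get, update, and delete.
	GetResourceName() string
}
```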

A non-API request has 2 attributes:

### Authorizer Interface

The existing Authorizer interface is very simple, but there isn't a way to
provide details about allows, denies, or failures. The extended detail is useful
for UIs that want to describe why certain actions are allowed or disallowed. Not
all Authorizers will want to provide that information, but for those that do,
having that capability is useful. In addition, adding a `GetAllowedSubjects`
method that returns the users and groups that can perform a particular action
makes it possible to answer questions like, "who can see resources in my
namespace" (see [ResourceAccessReview](#ResourceAccessReview) further down).

```go
// OLD
type Authorizer interface {
	Authorize(a Attributes) error
}
```

```go
// NEW
// Authorizer provides the ability to determine if a particular user can perform
// a particular action
type Authorizer interface {
	// Authorize takes a Context (for namespace, user, and traceability) and
	// Attributes to make a policy determination.
	// reason is an optional return value that can describe why a policy decision
	// was made. Reasons are useful during debugging when trying to figure out
	// why a user or group has access to perform a particular action.
	Authorize(ctx api.Context, a Attributes) (allowed bool, reason string, evaluationError error)
}

// AuthorizerIntrospection is an optional interface that provides the ability to
// determine which users and groups can perform a particular action. This is
// useful for building caches of who can see what. For instance, "which
// namespaces can this user see?" That would allow someone to see only the
// namespaces they are allowed to view instead of having to choose between
// listing them all or listing none.
type AuthorizerIntrospection interface {
	// GetAllowedSubjects takes a Context (for namespace and traceability) and
	// Attributes to determine which users and groups are allowed to perform the
	// described action in the namespace. This API enables the ResourceBasedReview
	// requests below.
	GetAllowedSubjects(ctx api.Context, a Attributes) (users util.StringSet, groups util.StringSet, evaluationError error)
}
```

### SubjectAccessReviews

This set of APIs answers the question: can a user or group (the authenticated
user if none is specified) perform a given action? Given the Authorizer
interface (proposed or existing), this endpoint can be implemented generically
against any Authorizer by creating the correct Attributes and making an
.Authorize() call.
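
As a rough sketch of that generic implementation (the helper and type plumbing
here are illustrative assumptions, not part of the proposal), the handler only
needs to translate the review spec into Attributes and delegate to the
configured Authorizer:

```go
// Sketch: a generic subjectAccessReviews implementation on top of any
// Authorizer. attributesFrom is an illustrative helper that builds Attributes
// from the review's verb, resource, resourceName, namespace, user, and groups.
func subjectAccessReview(ctx api.Context, authz Authorizer, review SubjectAccessReview) (SubjectAccessReviewResponse, error) {
	attrs := attributesFrom(review)

	allowed, reason, err := authz.Authorize(ctx, attrs)
	if err != nil {
		return SubjectAccessReviewResponse{}, err
	}

	return SubjectAccessReviewResponse{Allowed: allowed, Reason: reason}, nil
}
```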

There are three different flavors:

1. `/apis/authorization.kubernetes.io/{version}/subjectAccessReviews` - this
checks to see if a specified user or group can perform a given action at the
cluster scope or across all namespaces. This is a highly privileged operation.
It allows a cluster-admin to inspect the rights of any person across the entire
cluster and against cluster level resources.
2. `/apis/authorization.kubernetes.io/{version}/personalSubjectAccessReviews` -
this checks to see if the current user (including his groups) can perform a
given action at any specified scope. This is an unprivileged operation. It
doesn't expose any information that a user couldn't discover simply by trying an
endpoint themselves.
3. `/apis/authorization.kubernetes.io/{version}/ns/{namespace}/localSubjectAccessReviews` -
this checks to see if a specified user or group can perform a given action in
**this** namespace. This is a moderately privileged operation. In a multi-tenant
environment, having a namespace scoped resource makes it very easy to reason
about the powers granted to a namespace admin. This gives a namespace admin
(someone able to manage permissions inside of one namespace, but not all
namespaces) the power to inspect whether a given user or group can manipulate
resources in his namespace.

SubjectAccessReview is a runtime.Object with associated RESTStorage that only
accepts creates. The caller POSTs a SubjectAccessReview to this URL and gets a
SubjectAccessReviewResponse back. Here is an example of a call and its
corresponding return:

```
// input

"apiVersion": "authorization.kubernetes.io/v1",
"allowed": true
}
```

PersonalSubjectAccessReview is a runtime.Object with associated RESTStorage that
only accepts creates. The caller POSTs a PersonalSubjectAccessReview to this URL
and gets a SubjectAccessReviewResponse back. Here is an example of a call and
its corresponding return:

```
// input
{
  "kind": "PersonalSubjectAccessReview",

  "apiVersion": "authorization.kubernetes.io/v1",
  "allowed": true
}
```

LocalSubjectAccessReview is a runtime.Object with associated RESTStorage that
only accepts creates. The caller POSTs a LocalSubjectAccessReview to this URL
and gets a LocalSubjectAccessReviewResponse back. Here is an example of a call
and its corresponding return:

```
// input

"namespace": "my-ns"
"allowed": true
}
```

The actual Go objects look like this:

```go
type AuthorizationAttributes struct {
	// Namespace is the namespace of the action being requested. Currently, there
	// is no distinction between no namespace and all namespaces
	Namespace string `json:"namespace" description:"namespace of the action being requested"`
	// Verb is one of: get, list, watch, create, update, delete
	Verb string `json:"verb" description:"one of get, list, watch, create, update, delete"`

	ResourceVersion string `json:"resourceVersion" description:"version of the resource being requested"`
	// Resource is one of the existing resource types
	Resource string `json:"resource" description:"one of the existing resource types"`
	// ResourceName is the name of the resource being requested for a "get" or
	// deleted for a "delete"
	ResourceName string `json:"resourceName" description:"name of the resource being requested for a get or delete"`
	// Subresource is one of the existing subresource types
	Subresource string `json:"subresource" description:"one of the existing subresources"`
}

// SubjectAccessReview is an object for requesting information about whether a
// user or group can perform an action
type SubjectAccessReview struct {
	kapi.TypeMeta `json:",inline"`

	Groups []string `json:"groups" description:"optional, list of groups to which the user belongs"`
}

// SubjectAccessReviewResponse describes whether or not a user or group can
// perform an action
type SubjectAccessReviewResponse struct {
	kapi.TypeMeta

	Reason string
}

// PersonalSubjectAccessReview is an object for requesting information about
// whether a user or group can perform an action
type PersonalSubjectAccessReview struct {
	kapi.TypeMeta `json:",inline"`

	AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"`
}

// PersonalSubjectAccessReviewResponse describes whether this user can perform
// an action
type PersonalSubjectAccessReviewResponse struct {
	kapi.TypeMeta

	Reason string
}

// LocalSubjectAccessReview is an object for requesting information about
// whether a user or group can perform an action
type LocalSubjectAccessReview struct {
	kapi.TypeMeta `json:",inline"`

	Groups []string `json:"groups" description:"optional, list of groups to which the user belongs"`
}

// LocalSubjectAccessReviewResponse describes whether or not a user or group can
// perform an action
type LocalSubjectAccessReviewResponse struct {
	kapi.TypeMeta
}
```

### ResourceAccessReview

This set of APIs answers the question: which users and groups can perform the
specified verb on the specified resourceKind? Given the Authorizer interface
described above, this endpoint can be implemented generically against any
Authorizer by calling the .GetAllowedSubjects() function.
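
As a rough sketch of that generic implementation (the helper here is an
illustrative assumption, not part of the proposal), the handler builds
Attributes from the review spec and asks the introspection interface who is
allowed:

```go
// Sketch: a generic resourceAccessReview implementation on top of the optional
// AuthorizerIntrospection interface described earlier. attributesFrom is an
// illustrative helper that builds Attributes from the review spec.
func resourceAccessReview(ctx api.Context, authz AuthorizerIntrospection, review ResourceAccessReview) (ResourceAccessReviewResponse, error) {
	attrs := attributesFrom(review)

	users, groups, err := authz.GetAllowedSubjects(ctx, attrs)
	if err != nil {
		return ResourceAccessReviewResponse{}, err
	}

	// util.StringSet.List() returns the members as a sorted []string.
	return ResourceAccessReviewResponse{Users: users.List(), Groups: groups.List()}, nil
}
```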

There are two different flavors:

1. `/apis/authorization.kubernetes.io/{version}/resourceAccessReview` - this
checks to see which users and groups can perform a given action at the cluster
scope or across all namespaces. This is a highly privileged operation. It allows
a cluster-admin to inspect the rights of all subjects across the entire cluster
and against cluster level resources.
2. `/apis/authorization.kubernetes.io/{version}/ns/{namespace}/localResourceAccessReviews` -
this checks to see which users and groups can perform a given action in **this**
namespace. This is a moderately privileged operation. In a multi-tenant
environment, having a namespace scoped resource makes it very easy to reason
about the powers granted to a namespace admin. This gives a namespace admin
(someone able to manage permissions inside of one namespace, but not all
namespaces) the power to inspect which users and groups can manipulate resources
in his namespace.

ResourceAccessReview is a runtime.Object with associated RESTStorage that only
accepts creates. The caller POSTs a ResourceAccessReview to this URL and gets a
ResourceAccessReviewResponse back. Here is an example of a call and its
corresponding return:

The actual Go objects look like this:

```go
// ResourceAccessReview is a means to request a list of which users and groups
// are authorized to perform the action specified by spec
type ResourceAccessReview struct {
	kapi.TypeMeta `json:",inline"`
}

type ResourceAccessReviewResponse struct {
	Groups []string
}

// LocalResourceAccessReview is a means to request a list of which users and
// groups are authorized to perform the action specified in a specific namespace
type LocalResourceAccessReview struct {
	kapi.TypeMeta `json:",inline"`
}

type LocalResourceAccessReviewResponse struct {
	// Groups is the list of groups who can perform the action
	Groups []string
}
```

Kubernetes components can get into a state where they generate tons of events.

The events can be categorized in one of two ways:

1. same - The event is identical to previous events except it varies only on
timestamp.
2. similar - The event is identical to previous events except it varies on
timestamp and message.

For example, when pulling a non-existing image, Kubelet will repeatedly generate
`image_not_existing` and `container_is_waiting` events until upstream components
correct the image. When this happens, the spam from the repeated events makes
the entire event mechanism useless. It also appears to cause memory pressure in
etcd (see [#3853](http://issue.k8s.io/3853)).

The goal is to introduce event counting to increment same events, and event
aggregation to collapse similar events.

## Proposal

Each binary that generates events (for example, `kubelet`) should keep track of
previously generated events so that it can collapse recurring events into a
single event instead of creating a new instance for each new event. In addition,
if many similar events are created, events should be aggregated into a single
event to reduce spam.

Event compression should be best effort (not guaranteed). That is, in the worst
case, `n` identical (minus timestamp) events may still result in `n` event
entries.

## Design

Instead of a single Timestamp, each event object
[contains](http://releases.k8s.io/HEAD/pkg/api/types.go#L1111) the following
fields:

* `FirstTimestamp unversioned.Time`
  * The date/time of the first occurrence of the event.
* `LastTimestamp unversioned.Time`
  * The date/time of the most recent occurrence of the event.
  * On first occurrence, this is equal to the FirstTimestamp.
* `Count int`
  * The number of occurrences of this event between FirstTimestamp and
LastTimestamp.
  * On first occurrence, this is 1.
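
In terms of the Event type itself, the compression-related part of the object
looks roughly like the excerpt below (field names follow the list above; all
other fields are omitted):

```go
// Illustrative excerpt: the fields every Event carries for compression.
type Event struct {
	// ... InvolvedObject, Reason, Message, Source, and other fields omitted ...

	// FirstTimestamp is the date/time of the first occurrence of the event.
	FirstTimestamp unversioned.Time
	// LastTimestamp is the date/time of the most recent occurrence; it equals
	// FirstTimestamp on the first occurrence.
	LastTimestamp unversioned.Time
	// Count is the number of occurrences between FirstTimestamp and
	// LastTimestamp; it is 1 on the first occurrence.
	Count int
}
```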

Each binary that generates events:

* Maintains a historical record of previously generated events:
  * Implemented with a
["Least Recently Used Cache"](https://github.com/golang/groupcache/blob/master/lru/lru.go)
in [`pkg/client/record/events_cache.go`](../../pkg/client/record/events_cache.go).
  * Implemented behind an `EventCorrelator` that manages two subcomponents:
`EventAggregator` and `EventLogger`.
  * The `EventCorrelator` observes all incoming events and lets each
subcomponent visit and modify the event in turn.
  * The `EventAggregator` runs an aggregation function over each event. This
function buckets each event based on an `aggregateKey` and identifies the event
uniquely with a `localKey` in that bucket (see the sketch after this list).
  * The default aggregation function groups similar events that differ only by
`event.Message`. Its `localKey` is `event.Message` and its aggregate key is
produced by joining:
    * `event.Source.Component`
    * `event.Source.Host`
    * `event.InvolvedObject.Kind`
    * `event.InvolvedObject.UID`
    * `event.InvolvedObject.APIVersion`
    * `event.Reason`
  * If the `EventAggregator` observes a similar event produced 10 times in a 10
minute window, it drops the event that was provided as input and creates a new
event that differs only on the message. The message denotes that this event is
used to group similar events that matched on reason. This aggregated `Event` is
then used in the event processing sequence.
  * The `EventLogger` observes the event out of `EventAggregation` and tracks
the number of times it has observed that event previously by incrementing a key
in a cache associated with that matching event.
    * The key in the cache is generated from the event object minus
timestamps/count/transient fields; specifically, the following event fields are
used to construct a unique key for an event:
      * `event.Source.Component`
      * `event.Source.Host`
      * `event.InvolvedObject.Kind`
      * `event.InvolvedObject.APIVersion`
      * `event.Reason`
      * `event.Message`
  * The LRU cache is capped at 4096 events for both `EventAggregator` and
`EventLogger`. That means if a component (e.g. kubelet) runs for a long period
of time and generates tons of unique events, the previously generated events
cache will not grow unchecked in memory. Instead, after 4096 unique events are
generated, the oldest events are evicted from the cache.
* When an event is generated, the previously generated events cache is checked
(see [`pkg/client/unversioned/record/event.go`](http://releases.k8s.io/HEAD/pkg/client/record/event.go)).
  * If the key for the new event matches the key for a previously generated
event (meaning all of the above fields match between the new event and some
previously generated event), then the event is considered to be a duplicate and
the existing event entry is updated in etcd:
    * The new PUT (update) event API is called to update the existing event
entry in etcd with the new last seen timestamp and count.
    * The event is also updated in the previously generated events cache with
an incremented count, an updated last seen timestamp, name, and new resource
version (all required to issue a future event update).
  * If the key for the new event does not match the key for any previously
generated event (meaning none of the above fields match between the new event
and any previously generated events), then the event is considered to be
new/unique and a new event entry is created in etcd:
    * The usual POST/create event API is called to create a new event entry in
etcd.
    * An entry for the event is also added to the previously generated events
cache.
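
A minimal sketch of the two keys described above, assuming a simple string join
(the helper name is illustrative; the actual implementation lives in
`pkg/client/record/events_cache.go` and may differ in detail):

```go
// Sketch: deriving the aggregation keys for an event. The field set follows
// the lists above.
func aggregateKeys(event *api.Event) (aggregateKey string, localKey string) {
	aggregateKey = strings.Join([]string{
		event.Source.Component,
		event.Source.Host,
		event.InvolvedObject.Kind,
		string(event.InvolvedObject.UID),
		event.InvolvedObject.APIVersion,
		event.Reason,
	}, "")
	// Similar events differ only by message, so the message is the local key
	// within the aggregate bucket.
	localKey = event.Message
	return aggregateKey, localKey
}
```

The `EventLogger` cache key is built the same way, except that it also includes
`event.Message`, so it only matches events that are identical apart from their
timestamps and count.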

## Issues/Risks

* Compression is not guaranteed, because each component keeps track of event
history in memory.
  * An application restart causes event history to be cleared, meaning event
history is not preserved across application restarts and compression will not
occur across component restarts.
  * Because an LRU cache is used to keep track of previously generated events,
if too many unique events are generated, old events will be evicted from the
cache, so events will only be compressed until they age out of the events cache,
at which point any new instance of the event will cause a new entry to be
created in etcd.

## Example

Sample kubectl output:

```console
FIRSTSEEN                         LASTSEEN                          COUNT   NAME                              KIND   SUBOBJECT   REASON      SOURCE         MESSAGE
Thu, 12 Feb 2015 01:13:20 +0000   Thu, 12 Feb 2015 01:13:20 +0000   1       kibana-logging-controller-gziey   Pod                scheduled   {scheduler }   Successfully assigned kibana-logging-controller-gziey to kubernetes-node-4.c.saad-dev-vms.internal
```

This demonstrates what would have been 20 separate entries (indicating
scheduling failure) collapsed/compressed down to 5 entries.

## Related Pull Requests/Issues

* Issue [#4073](http://issue.k8s.io/4073): Compress duplicate events.
* PR [#4157](http://issue.k8s.io/4157): Add "Update Event" to Kubernetes API.
* PR [#4206](http://issue.k8s.io/4206): Modify Event struct to allow compressing
multiple recurring events in to a single event.
* PR [#4306](http://issue.k8s.io/4306): Compress recurring events in to a single
event to optimize etcd storage.
* PR [#4444](http://pr.k8s.io/4444): Switch events history to use LRU cache
instead of map.

## Abstract

A proposal for the expansion of environment variables using a simple `$(var)`
syntax.

## Motivation

It is extremely common for users to need to compose environment variables or
pass arguments to their commands using the values of environment variables.
Kubernetes should provide a facility for the 80% cases in order to decrease
coupling and the use of workarounds.

## Goals

## Constraints and Assumptions

* This design should describe the simplest possible syntax to accomplish the
use-cases.
* Expansion syntax will not support more complicated shell-like behaviors such
as default values (viz: `$(VARIABLE_NAME:"default")`), inline substitution, etc.

## Use Cases

1. As a user, I want to compose new environment variables for a container using
a substitution syntax to reference other variables in the container's
environment and service environment variables.
1. As a user, I want to substitute environment variables into a container's
command.
1. As a user, I want to do the above without requiring the container's image to
have a shell.
1. As a user, I want to be able to specify a default value for a service
variable which may not exist.
1. As a user, I want to see an event associated with the pod if an expansion
fails (i.e., references variable names that cannot be expanded).

### Use Case: Composition of environment variables

Currently, containers are injected with docker-style environment variables for
the services in their pod's namespace. There are several variables for each
service, but users routinely need to compose URLs based on these variables
because there is not a variable for the exact format they need. Users should be
able to build new environment variables with the exact format they need.
Eventually, it should also be possible to turn off the automatic injection of
the docker-style variables into pods and let the users consume the exact
information they need via the downward API and composition.

#### Expanding expanded variables

It should be possible to reference a variable which is itself the result of an
expansion, if the referenced variable is declared in the container's environment
prior to the one referencing it. Put another way -- a container's environment is
expanded in order, and expanded variables are available to subsequent
expansions.

### Use Case: Variable expansion in command

Users frequently need to pass the values of environment variables to a
container's command. Currently, Kubernetes does not perform any expansion of
variables. The workaround is to invoke a shell in the container's command and
have the shell perform the substitution, or to write a wrapper script that sets
up the environment and runs the command. This has a number of drawbacks:

1. Solutions that require a shell are unfriendly to images that do not contain
a shell.
2. Wrapper scripts make it harder to use images as base images.
3. Wrapper scripts increase coupling to Kubernetes.

Users should be able to do the 80% case of variable expansion in command without
writing a wrapper script or adding a shell invocation to their containers'
commands.

### Use Case: Images without shells

The current workaround for variable expansion in a container's command requires
the container's image to have a shell. This is unfriendly to images that do not
contain a shell (`scratch` images, for example). Users should be able to perform
the other use-cases in this design without regard to the content of their
images.

### Use Case: See an event for incomplete expansions

It is possible that a container with incorrect variable values or command line
may continue to run for a long period of time, and that the end-user would have
no visual or obvious warning of the incorrect configuration. If the kubelet
creates an event when an expansion references a variable that cannot be
expanded, it will help users quickly detect problems with expansions.

## Design Considerations

### What features should be supported?

In order to limit complexity, we want to provide the right amount of
functionality so that the 80% cases can be realized and nothing more. We felt
that the essentials boiled down to:

1. Ability to perform direct expansion of variables in a string.
2. Ability to specify default values via a prioritized mapping function but
without support for defaults as a syntax-level feature.

### What should the syntax be?

The exact syntax for variable expansion has a large impact on how users perceive
and relate to the feature. We considered implementing a very restrictive subset
of the shell `${var}` syntax. This syntax is an attractive option on some level,
because many people are familiar with it. However, this syntax also has a large
number of lesser known features such as the ability to provide default values
for unset variables, perform inline substitution, etc.

In the interest of preventing conflation of the expansion feature in Kubernetes
with the shell feature, we chose a different syntax similar to the one in
Makefiles, `$(var)`. We also chose not to support the bare `$var` format, since
it is not required to implement the required use-cases.

Nested references, i.e., variable expansion within variable names, are not
supported.

#### How should unmatched references be treated?

Ideally, it should be extremely clear when a variable reference couldn't be
expanded. We decided the best experience for unmatched variable references would
be to have the entire reference, syntax included, show up in the output. As an
example, if the reference `$(VARIABLE_NAME)` cannot be expanded, then
`$(VARIABLE_NAME)` should be present in the output.

#### Escaping the operator

Although the `$(var)` syntax does overlap with the `$(command)` form of command
substitution supported by many shells, because unexpanded variables are present
verbatim in the output, we expect this will not present a problem to many users.
If there is a collision between a variable name and command substitution syntax,
the syntax can be escaped with the form `$$(VARIABLE_NAME)`, which will evaluate
to `$(VARIABLE_NAME)` whether `VARIABLE_NAME` can be expanded or not.

## Design

This design encompasses the variable expansion syntax and specification and the
changes needed to incorporate the expansion feature into the container's
environment and command.

### Syntax and expansion mechanics

This section describes the expansion syntax, evaluation of variable values, and
how unexpected or malformed inputs are handled.

#### Syntax

The inputs to the expansion feature are:

1. A utf-8 string (the input string) which may contain variable references.
2. A function (the mapping function) that maps the name of a variable to the
variable's value, of type `func(string) string`.

Variable references in the input string are indicated exclusively with the
syntax `$(<variable-name>)`. The syntax tokens are:

- `$`: the operator,
- `(`: the reference opener, and
- `)`: the reference closer.

The operator has no meaning unless accompanied by the reference opener and
closer tokens. The operator can be escaped using `$$`. One literal `$` will be
emitted for each `$$` in the input.

The reference opener and closer characters have no meaning when not part of a
variable reference. If a variable reference is malformed, viz: `$(VARIABLE_NAME`
without a closing expression, the operator and expression opening characters are
treated as ordinary characters without special meanings.
|
||||||
|
|
||||||
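To make the rules above concrete, the following is a small, self-contained sketch (illustrative only, not the proposed `third_party/golang/expansion` implementation) that applies the operator, escaping, and malformed-reference rules to a few inputs:

```go
package main

import (
	"fmt"
	"strings"
)

// expand applies the syntax rules described above: `$(NAME)` is replaced by
// mapping(NAME), `$$` emits one literal `$`, and malformed or bare operators
// are copied through verbatim.
func expand(input string, mapping func(string) string) string {
	var out strings.Builder
	for i := 0; i < len(input); i++ {
		if input[i] != '$' || i+1 >= len(input) {
			out.WriteByte(input[i])
			continue
		}
		switch input[i+1] {
		case '$': // escaped operator: `$$` -> `$`
			out.WriteByte('$')
			i++
		case '(': // candidate variable reference
			if end := strings.IndexByte(input[i+2:], ')'); end >= 0 {
				out.WriteString(mapping(input[i+2 : i+2+end]))
				i += 2 + end
			} else {
				out.WriteByte('$') // no closer: treat `$(` as ordinary text
			}
		default: // operator without an opener has no special meaning
			out.WriteByte('$')
		}
	}
	return out.String()
}

func main() {
	vars := map[string]string{"VAR_A": "a-value"}
	mapping := func(name string) string {
		if v, ok := vars[name]; ok {
			return v
		}
		return "$(" + name + ")" // unknown variables stay verbatim
	}
	fmt.Println(expand("foo-$(VAR_A)", mapping))   // foo-a-value
	fmt.Println(expand("$$(VAR_A)", mapping))      // $(VAR_A)
	fmt.Println(expand("$(UNKNOWN)-bar", mapping)) // $(UNKNOWN)-bar
	fmt.Println(expand("mal$(formed", mapping))    // mal$(formed
}
```
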
#### Scope and ordering of substitutions
|
#### Scope and ordering of substitutions
|
||||||
|
|
||||||
The scope in which variable references are expanded is defined by the mapping function. Within the
|
The scope in which variable references are expanded is defined by the mapping
|
||||||
mapping function, any arbitrary strategy may be used to determine the value of a variable name.
|
function. Within the mapping function, any arbitrary strategy may be used to
|
||||||
The most basic implementation of a mapping function is to use a `map[string]string` to look up the
|
determine the value of a variable name. The most basic implementation of a
|
||||||
value of a variable.
|
mapping function is to use a `map[string]string` to look up the value of a
|
||||||
|
variable.
|
||||||
|
|
||||||
In order to support default values for variables like service variables presented by the kubelet,
|
In order to support default values for variables like service variables
|
||||||
which may not be bound because the service that provides them does not yet exist, there should be a
|
presented by the kubelet, which may not be bound because the service that
|
||||||
mapping function that uses a list of `map[string]string` like:
|
provides them does not yet exist, there should be a mapping function that uses a
|
||||||
|
list of `map[string]string` like:
|
||||||
|
|
||||||
```go
|
```go
|
||||||
func MakeMappingFunc(maps ...map[string]string) func(string) string {
|
func MakeMappingFunc(maps ...map[string]string) func(string) string {
|
||||||
@ -235,38 +257,41 @@ mappingWithDefaults := MakeMappingFunc(serviceEnv, containerEnv)
|
|||||||
|
|
||||||
The necessary changes to implement this functionality are:
|
The necessary changes to implement this functionality are:
|
||||||
|
|
||||||
1. Add a new interface, `ObjectEventRecorder`, which is like the `EventRecorder` interface, but
|
1. Add a new interface, `ObjectEventRecorder`, which is like the
|
||||||
scoped to a single object, and a function that returns an `ObjectEventRecorder` given an
|
`EventRecorder` interface, but scoped to a single object, and a function that
|
||||||
`ObjectReference` and an `EventRecorder`
|
returns an `ObjectEventRecorder` given an `ObjectReference` and an
|
||||||
|
`EventRecorder`.
|
||||||
2. Introduce `third_party/golang/expansion` package that provides:
|
2. Introduce `third_party/golang/expansion` package that provides:
|
||||||
1. An `Expand(string, func(string) string) string` function
|
1. An `Expand(string, func(string) string) string` function.
|
||||||
2. A `MappingFuncFor(ObjectEventRecorder, ...map[string]string) func(string) string` function
|
2. A `MappingFuncFor(ObjectEventRecorder, ...map[string]string) func(string) string`
|
||||||
3. Make the kubelet expand environment correctly
|
function.
|
||||||
4. Make the kubelet expand command correctly
|
3. Make the kubelet expand environment correctly.
|
||||||
|
4. Make the kubelet expand command correctly.
|
||||||
|
|
||||||
#### Event Recording
|
#### Event Recording
|
||||||
|
|
||||||
In order to provide an event when an expansion references undefined variables, the mapping function
|
In order to provide an event when an expansion references undefined variables,
|
||||||
must be able to create an event. In order to facilitate this, we should create a new interface in
|
the mapping function must be able to create an event. In order to facilitate
|
||||||
the `api/client/record` package which is similar to `EventRecorder`, but scoped to a single object:
|
this, we should create a new interface in the `api/client/record` package which
|
||||||
|
is similar to `EventRecorder`, but scoped to a single object:
|
||||||
|
|
||||||
```go
|
```go
|
||||||
// ObjectEventRecorder knows how to record events about a single object.
|
// ObjectEventRecorder knows how to record events about a single object.
|
||||||
type ObjectEventRecorder interface {
|
type ObjectEventRecorder interface {
|
||||||
// Event constructs an event from the given information and puts it in the queue for sending.
|
// Event constructs an event from the given information and puts it in the queue for sending.
|
||||||
// 'reason' is the reason this event is generated. 'reason' should be short and unique; it will
|
// 'reason' is the reason this event is generated. 'reason' should be short and unique; it will
|
||||||
// be used to automate handling of events, so imagine people writing switch statements to
|
// be used to automate handling of events, so imagine people writing switch statements to
|
||||||
// handle them. You want to make that easy.
|
// handle them. You want to make that easy.
|
||||||
// 'message' is intended to be human readable.
|
// 'message' is intended to be human readable.
|
||||||
//
|
//
|
||||||
// The resulting event will be created in the same namespace as the reference object.
|
// The resulting event will be created in the same namespace as the reference object.
|
||||||
Event(reason, message string)
|
Event(reason, message string)
|
||||||
|
|
||||||
// Eventf is just like Event, but with Sprintf for the message field.
|
// Eventf is just like Event, but with Sprintf for the message field.
|
||||||
Eventf(reason, messageFmt string, args ...interface{})
|
Eventf(reason, messageFmt string, args ...interface{})
|
||||||
|
|
||||||
// PastEventf is just like Eventf, but with an option to specify the event's 'timestamp' field.
|
// PastEventf is just like Eventf, but with an option to specify the event's 'timestamp' field.
|
||||||
PastEventf(timestamp unversioned.Time, reason, messageFmt string, args ...interface{})
|
PastEventf(timestamp unversioned.Time, reason, messageFmt string, args ...interface{})
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
@ -275,16 +300,16 @@ and an `EventRecorder`:
|
|||||||
|
|
||||||
```go
|
```go
|
||||||
type objectRecorderImpl struct {
|
type objectRecorderImpl struct {
|
||||||
object runtime.Object
|
object runtime.Object
|
||||||
recorder EventRecorder
|
recorder EventRecorder
|
||||||
}
|
}
|
||||||
|
|
||||||
func (r *objectRecorderImpl) Event(reason, message string) {
|
func (r *objectRecorderImpl) Event(reason, message string) {
|
||||||
r.recorder.Event(r.object, reason, message)
|
r.recorder.Event(r.object, reason, message)
|
||||||
}
|
}
|
||||||
|
|
||||||
func ObjectEventRecorderFor(object runtime.Object, recorder EventRecorder) ObjectEventRecorder {
|
func ObjectEventRecorderFor(object runtime.Object, recorder EventRecorder) ObjectEventRecorder {
|
||||||
return &objectRecorderImpl{object, recorder}
|
return &objectRecorderImpl{object, recorder}
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
@ -299,28 +324,29 @@ The expansion package should provide two methods:
|
|||||||
// for the input is found. If no expansion is found for a key, an event
|
// for the input is found. If no expansion is found for a key, an event
|
||||||
// is raised on the given recorder.
|
// is raised on the given recorder.
|
||||||
func MappingFuncFor(recorder record.ObjectEventRecorder, context ...map[string]string) func(string) string {
|
func MappingFuncFor(recorder record.ObjectEventRecorder, context ...map[string]string) func(string) string {
|
||||||
// ...
|
// ...
|
||||||
}
|
}
|
||||||
|
|
||||||
// Expand replaces variable references in the input string according to
|
// Expand replaces variable references in the input string according to
|
||||||
// the expansion spec using the given mapping function to resolve the
|
// the expansion spec using the given mapping function to resolve the
|
||||||
// values of variables.
|
// values of variables.
|
||||||
func Expand(input string, mapping func(string) string) string {
|
func Expand(input string, mapping func(string) string) string {
|
||||||
// ...
|
// ...
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
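Taken together, a caller could wire these pieces up roughly as follows. This fragment is a sketch only: imports are omitted, and the `pod`, `recorder`, `containerEnv`, and `serviceEnv` variables are illustrative assumptions rather than part of this proposal.

```go
// Build a per-object recorder so that unresolved references raise events
// scoped to `pod` (sketch; names are placeholders).
objectRecorder := record.ObjectEventRecorderFor(pod, recorder)

// Build a mapping function over the supplied environments; how multiple
// maps are consulted is defined by MappingFuncFor itself.
mapping := expansion.MappingFuncFor(objectRecorder, containerEnv, serviceEnv)

// Per the spec above, references that cannot be resolved are left verbatim.
expanded := expansion.Expand("$(FOO_SERVICE_HOST):$(FOO_SERVICE_PORT)", mapping)
```

The kubelet changes below apply the same kind of mapping to each environment variable value and to the container command and args.
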
#### Kubelet changes
|
#### Kubelet changes
|
||||||
|
|
||||||
The Kubelet should be made to correctly expand variable references in a container's environment,
|
The Kubelet should be made to correctly expand variable references in a
|
||||||
command, and args. Changes will need to be made to:
|
container's environment, command, and args. Changes will need to be made to:
|
||||||
|
|
||||||
1. The `makeEnvironmentVariables` function in the kubelet; this is used by
|
1. The `makeEnvironmentVariables` function in the kubelet; this is used by
|
||||||
`GenerateRunContainerOptions`, which is used by both the docker and rkt container runtimes
|
`GenerateRunContainerOptions`, which is used by both the docker and rkt
|
||||||
2. The docker manager `setEntrypointAndCommand` func has to be changed to perform variable
|
container runtimes.
|
||||||
expansion
|
2. The docker manager `setEntrypointAndCommand` func has to be changed to
|
||||||
3. The rkt runtime should be made to support expansion in command and args when support for it is
|
perform variable expansion.
|
||||||
implemented
|
3. The rkt runtime should be made to support expansion in command and args
|
||||||
|
when support for it is implemented.
|
||||||
|
|
||||||
### Examples
|
### Examples
|
||||||
|
|
||||||
|
@ -34,59 +34,62 @@ Documentation for other releases can be found at
|
|||||||
|
|
||||||
# Adding custom resources to the Kubernetes API server
|
# Adding custom resources to the Kubernetes API server
|
||||||
|
|
||||||
This document describes the design for implementing the storage of custom API types in the Kubernetes API Server.
|
This document describes the design for implementing the storage of custom API
|
||||||
|
types in the Kubernetes API Server.
|
||||||
|
|
||||||
|
|
||||||
## Resource Model
|
## Resource Model
|
||||||
|
|
||||||
### The ThirdPartyResource
|
### The ThirdPartyResource
|
||||||
|
|
||||||
The `ThirdPartyResource` resource describes the multiple versions of a custom resource that the user wants to add
|
The `ThirdPartyResource` resource describes the multiple versions of a custom
|
||||||
to the Kubernetes API. `ThirdPartyResource` is a non-namespaced resource; attempting to place it in a namespace
|
resource that the user wants to add to the Kubernetes API. `ThirdPartyResource`
|
||||||
will return an error.
|
is a non-namespaced resource; attempting to place it in a namespace will return
|
||||||
|
an error.
|
||||||
|
|
||||||
Each `ThirdPartyResource` resource has the following:
|
Each `ThirdPartyResource` resource has the following:
|
||||||
* Standard Kubernetes object metadata.
|
* Standard Kubernetes object metadata.
|
||||||
* ResourceKind - The kind of the resources described by this third party resource.
|
* ResourceKind - The kind of the resources described by this third party
|
||||||
|
resource.
|
||||||
* Description - A free text description of the resource.
|
* Description - A free text description of the resource.
|
||||||
* APIGroup - An API group that this resource should be placed into.
|
* APIGroup - An API group that this resource should be placed into.
|
||||||
* Versions - One or more `Version` objects.
|
* Versions - One or more `Version` objects.
|
||||||
|
|
||||||
### The `Version` Object
|
### The `Version` Object
|
||||||
|
|
||||||
The `Version` object describes a single concrete version of a custom resource. The `Version` object currently
|
The `Version` object describes a single concrete version of a custom resource.
|
||||||
only specifies:
|
The `Version` object currently only specifies:
|
||||||
* The `Name` of the version.
|
* The `Name` of the version.
|
||||||
* The `APIGroup` this version should belong to.
|
* The `APIGroup` this version should belong to.
|
||||||
|
|
||||||
## Expectations about third party objects
|
## Expectations about third party objects
|
||||||
|
|
||||||
Every object that is added to a third-party Kubernetes object store is expected to contain Kubernetes
|
Every object that is added to a third-party Kubernetes object store is expected
|
||||||
compatible [object metadata](../devel/api-conventions.md#metadata). This requirement enables the
|
to contain Kubernetes compatible [object metadata](../devel/api-conventions.md#metadata).
|
||||||
Kubernetes API server to provide the following features:
|
This requirement enables the Kubernetes API server to provide the following
|
||||||
* Filtering lists of objects via label queries
|
features:
|
||||||
* `resourceVersion`-based optimistic concurrency via compare-and-swap
|
* Filtering lists of objects via label queries.
|
||||||
* Versioned storage
|
* `resourceVersion`-based optimistic concurrency via compare-and-swap.
|
||||||
* Event recording
|
* Versioned storage.
|
||||||
* Integration with basic `kubectl` command line tooling
|
* Event recording.
|
||||||
* Watch for resource changes
|
* Integration with basic `kubectl` command line tooling.
|
||||||
|
* Watch for resource changes.
|
||||||
|
|
||||||
The `Kind` for an instance of a third-party object (e.g. CronTab) below is expected to be
|
The `Kind` for an instance of a third-party object (e.g. CronTab) below is
|
||||||
programmatically convertible to the name of the resource using
|
expected to be programmatically convertible to the name of the resource using
|
||||||
the following conversion. Kinds are expected to be of the form `<CamelCaseKind>`, and the
|
the following conversion. Kinds are expected to be of the form
|
||||||
`APIVersion` for the object is expected to be `<api-group>/<api-version>`. To
|
`<CamelCaseKind>`, and the `APIVersion` for the object is expected to be
|
||||||
prevent collisions, it's expected that you'll use a fully qualified domain
|
`<api-group>/<api-version>`. To prevent collisions, it's expected that you'll
|
||||||
name for the API group, e.g. `example.com`.
|
use a fully qualified domain name for the API group, e.g. `example.com`.
|
||||||
|
|
||||||
For example `stable.example.com/v1`
|
For example `stable.example.com/v1`
|
||||||
|
|
||||||
'CamelCaseKind' is the specific type name.
|
'CamelCaseKind' is the specific type name.
|
||||||
|
|
||||||
To convert this into the `metadata.name` for the `ThirdPartyResource` resource instance,
|
To convert this into the `metadata.name` for the `ThirdPartyResource` resource
|
||||||
the `<domain-name>` is copied verbatim, the `CamelCaseKind` is
|
instance, the `<domain-name>` is copied verbatim, the `CamelCaseKind` is then
|
||||||
then converted
|
converted using '-' instead of capitalization ('camel-case'), with the first
|
||||||
using '-' instead of capitalization ('camel-case'), with the first character being assumed to be
|
character being assumed to be capitalized. In pseudo code:
|
||||||
capitalized. In pseudo code:
|
|
||||||
|
|
||||||
```go
|
```go
|
||||||
var result string
|
var result string
|
||||||
@ -98,17 +101,20 @@ for ix := range kindName {
|
|||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
As a concrete example, the resource named `camel-case-kind.example.com` defines resources of Kind `CamelCaseKind`, in
|
As a concrete example, the resource named `camel-case-kind.example.com` defines
|
||||||
the APIGroup with the prefix `example.com/...`.
|
resources of Kind `CamelCaseKind`, in the APIGroup with the prefix
|
||||||
|
`example.com/...`.
|
||||||
|
|
||||||
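A small, self-contained sketch of this conversion (illustrative only; the pseudo code above is the normative description):

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// kindToName converts a CamelCaseKind plus an API group domain into the
// metadata.name of the corresponding ThirdPartyResource instance.
func kindToName(kind, domain string) string {
	var b strings.Builder
	for i, r := range kind {
		if unicode.IsUpper(r) {
			if i > 0 {
				b.WriteByte('-') // '-' takes the place of capitalization
			}
			b.WriteRune(unicode.ToLower(r))
			continue
		}
		b.WriteRune(r)
	}
	return b.String() + "." + domain
}

func main() {
	fmt.Println(kindToName("CamelCaseKind", "example.com"))  // camel-case-kind.example.com
	fmt.Println(kindToName("CronTab", "stable.example.com")) // cron-tab.stable.example.com
}
```
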
The reason for this is to enable rapid lookup of a `ThirdPartyResource` object given the kind information.
|
The reason for this is to enable rapid lookup of a `ThirdPartyResource` object
|
||||||
This is also the reason why `ThirdPartyResource` is not namespaced.
|
given the kind information. This is also the reason why `ThirdPartyResource` is
|
||||||
|
not namespaced.
|
||||||
|
|
||||||
## Usage
|
## Usage
|
||||||
|
|
||||||
When a user creates a new `ThirdPartyResource`, the Kubernetes API Server reacts by creating a new, namespaced
|
When a user creates a new `ThirdPartyResource`, the Kubernetes API Server reacts
|
||||||
RESTful resource path. For now, non-namespaced objects are not supported. As with existing built-in objects,
|
by creating a new, namespaced RESTful resource path. For now, non-namespaced
|
||||||
deleting a namespace deletes all third party resources in that namespace.
|
objects are not supported. As with existing built-in objects, deleting a
|
||||||
|
namespace deletes all third party resources in that namespace.
|
||||||
|
|
||||||
For example, if a user creates:
|
For example, if a user creates:
|
||||||
|
|
||||||
@ -143,14 +149,15 @@ Now that this schema has been created, a user can `POST`:
|
|||||||
|
|
||||||
to: `/apis/stable.example.com/v1/namespaces/default/crontabs`
|
to: `/apis/stable.example.com/v1/namespaces/default/crontabs`
|
||||||
|
|
||||||
and the corresponding data will be stored into etcd by the APIServer, so that when the user issues:
|
and the corresponding data will be stored into etcd by the APIServer, so that
|
||||||
|
when the user issues:
|
||||||
|
|
||||||
```
|
```
|
||||||
GET /apis/stable.example.com/v1/namespaces/default/crontabs/my-new-cron-object`
|
GET /apis/stable.example.com/v1/namespaces/default/crontabs/my-new-cron-object`
|
||||||
```
|
```
|
||||||
|
|
||||||
And when they do that, they will get back the same data, but with additional Kubernetes metadata
|
And when they do that, they will get back the same data, but with additional
|
||||||
(e.g. `resourceVersion`, `createdTimestamp`) filled in.
|
Kubernetes metadata (e.g. `resourceVersion`, `createdTimestamp`) filled in.
|
||||||
|
|
||||||
Likewise, to list all resources, a user can issue:
|
Likewise, to list all resources, a user can issue:
|
||||||
|
|
||||||
@ -178,29 +185,35 @@ and get back:
|
|||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
Because all objects are expected to contain standard Kubernetes metadata fields, these
|
Because all objects are expected to contain standard Kubernetes metadata fields,
|
||||||
list operations can also use label queries to filter requests down to specific subsets.
|
these list operations can also use label queries to filter requests down to
|
||||||
|
specific subsets.
|
||||||
Likewise, clients can use watch endpoints to watch for changes to stored objects.
|
|
||||||
|
|
||||||
|
Likewise, clients can use watch endpoints to watch for changes to stored
|
||||||
|
objects.
|
||||||
|
|
||||||
## Storage
|
## Storage
|
||||||
|
|
||||||
In order to store custom user data in a versioned fashion inside of etcd, we need to also introduce a
|
In order to store custom user data in a versioned fashion inside of etcd, we
|
||||||
`Codec`-compatible object for persistent storage in etcd. This object is `ThirdPartyResourceData` and it contains:
|
need to also introduce a `Codec`-compatible object for persistent storage in
|
||||||
* Standard API Metadata
|
etcd. This object is `ThirdPartyResourceData` and it contains:
|
||||||
|
* Standard API Metadata.
|
||||||
* `Data`: The raw JSON data for this custom object.
|
* `Data`: The raw JSON data for this custom object.
|
||||||
|
|
||||||
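What this object might look like as a Go type is sketched below; only the `Data` field is specified above, and the embedded metadata types are an assumption following the convention of other API objects:

```go
// Sketch only: ThirdPartyResourceData pairs standard API metadata with the
// raw payload. The embedded metadata fields are assumptions, not part of
// this specification.
type ThirdPartyResourceData struct {
	unversioned.TypeMeta
	api.ObjectMeta

	// Data holds the raw JSON for the custom object, stored verbatim.
	Data []byte
}
```
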
### Storage key specification
|
### Storage key specification
|
||||||
|
|
||||||
Each custom object stored by the API server needs a custom key in storage; this is described below:
|
Each custom object stored by the API server needs a custom key in storage; this
|
||||||
|
is described below:
|
||||||
|
|
||||||
#### Definitions
|
#### Definitions
|
||||||
|
|
||||||
* `resource-namespace`: the namespace of the particular resource that is being stored
|
* `resource-namespace`: the namespace of the particular resource that is
|
||||||
|
being stored
|
||||||
* `resource-name`: the name of the particular resource being stored
|
* `resource-name`: the name of the particular resource being stored
|
||||||
* `third-party-resource-namespace`: the namespace of the `ThirdPartyResource` resource that represents the type for the specific instance being stored
|
* `third-party-resource-namespace`: the namespace of the `ThirdPartyResource`
|
||||||
* `third-party-resource-name`: the name of the `ThirdPartyResource` resource that represents the type for the specific instance being stored
|
resource that represents the type for the specific instance being stored
|
||||||
|
* `third-party-resource-name`: the name of the `ThirdPartyResource` resource
|
||||||
|
that represents the type for the specific instance being stored
|
||||||
|
|
||||||
#### Key
|
#### Key
|
||||||
|
|
||||||
|
@ -76,7 +76,7 @@ Documentation for other releases can be found at
|
|||||||
load balancers between the client and the serving Pod, failover
|
load balancers between the client and the serving Pod, failover
|
||||||
might be completely automatic (i.e. the client's end of the
|
might be completely automatic (i.e. the client's end of the
|
||||||
connection remains intact, and the client is completely
|
connection remains intact, and the client is completely
|
||||||
oblivious of the fail-over). This approach incurs network speed
|
oblivious of the fail-over). This approach incurs network speed
|
||||||
and cost penalties (by traversing possibly multiple load
|
and cost penalties (by traversing possibly multiple load
|
||||||
balancers), but requires zero smarts in clients, DNS libraries,
|
balancers), but requires zero smarts in clients, DNS libraries,
|
||||||
recursing DNS servers etc, as the IP address of the endpoint
|
recursing DNS servers etc, as the IP address of the endpoint
|
||||||
@ -102,17 +102,17 @@ Documentation for other releases can be found at
|
|||||||
A Kubernetes application configuration (e.g. for a Pod, Replication
|
A Kubernetes application configuration (e.g. for a Pod, Replication
|
||||||
Controller, Service etc) should be able to be successfully deployed
|
Controller, Service etc) should be able to be successfully deployed
|
||||||
into any Kubernetes Cluster or Ubernetes Federation of Clusters,
|
into any Kubernetes Cluster or Ubernetes Federation of Clusters,
|
||||||
without modification. More specifically, a typical configuration
|
without modification. More specifically, a typical configuration
|
||||||
should work correctly (although possibly not optimally) across any of
|
should work correctly (although possibly not optimally) across any of
|
||||||
the following environments:
|
the following environments:
|
||||||
|
|
||||||
1. A single Kubernetes Cluster on one cloud provider (e.g. Google
|
1. A single Kubernetes Cluster on one cloud provider (e.g. Google
|
||||||
Compute Engine, GCE)
|
Compute Engine, GCE).
|
||||||
1. A single Kubernetes Cluster on a different cloud provider
|
1. A single Kubernetes Cluster on a different cloud provider
|
||||||
(e.g. Amazon Web Services, AWS)
|
(e.g. Amazon Web Services, AWS).
|
||||||
1. A single Kubernetes Cluster on a non-cloud, on-premise data center
|
1. A single Kubernetes Cluster on a non-cloud, on-premise data center
|
||||||
1. A Federation of Kubernetes Clusters all on the same cloud provider
|
1. A Federation of Kubernetes Clusters all on the same cloud provider
|
||||||
(e.g. GCE)
|
(e.g. GCE).
|
||||||
1. A Federation of Kubernetes Clusters across multiple different cloud
|
1. A Federation of Kubernetes Clusters across multiple different cloud
|
||||||
providers and/or on-premise data centers (e.g. one cluster on
|
providers and/or on-premise data centers (e.g. one cluster on
|
||||||
GCE/GKE, one on AWS, and one on-premise).
|
GCE/GKE, one on AWS, and one on-premise).
|
||||||
@ -122,18 +122,18 @@ the following environments:
|
|||||||
It should be possible to explicitly opt out of portability across some
|
It should be possible to explicitly opt out of portability across some
|
||||||
subset of the above environments in order to take advantage of
|
subset of the above environments in order to take advantage of
|
||||||
non-portable load balancing and DNS features of one or more
|
non-portable load balancing and DNS features of one or more
|
||||||
environments. More specifically, for example:
|
environments. More specifically, for example:
|
||||||
|
|
||||||
1. For HTTP(S) applications running on GCE-only Federations,
|
1. For HTTP(S) applications running on GCE-only Federations,
|
||||||
[GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
|
[GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
|
||||||
should be usable. These provide single, static global IP addresses
|
should be usable. These provide single, static global IP addresses
|
||||||
which load balance and fail over globally (i.e. across both regions
|
which load balance and fail over globally (i.e. across both regions
|
||||||
and zones). These allow for really dumb clients, but they only
|
and zones). These allow for really dumb clients, but they only
|
||||||
work on GCE, and only for HTTP(S) traffic.
|
work on GCE, and only for HTTP(S) traffic.
|
||||||
1. For non-HTTP(S) applications running on GCE-only Federations within
|
1. For non-HTTP(S) applications running on GCE-only Federations within
|
||||||
a single region,
|
a single region,
|
||||||
[GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/)
|
[GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/)
|
||||||
should be usable. These provide TCP (i.e. both HTTP/S and
|
should be usable. These provide TCP (i.e. both HTTP/S and
|
||||||
non-HTTP/S) load balancing and failover, but only on GCE, and only
|
non-HTTP/S) load balancing and failover, but only on GCE, and only
|
||||||
within a single region.
|
within a single region.
|
||||||
[Google Cloud DNS](https://cloud.google.com/dns) can be used to
|
[Google Cloud DNS](https://cloud.google.com/dns) can be used to
|
||||||
@ -141,7 +141,7 @@ environments. More specifically, for example:
|
|||||||
providers and on-premise clusters, as it's plain DNS, IP only).
|
providers and on-premise clusters, as it's plain DNS, IP only).
|
||||||
1. For applications running on AWS-only Federations,
|
1. For applications running on AWS-only Federations,
|
||||||
[AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/)
|
[AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/)
|
||||||
should be usable. These provide both L7 (HTTP(S)) and L4 load
|
should be usable. These provide both L7 (HTTP(S)) and L4 load
|
||||||
balancing, but only within a single region, and only on AWS
|
balancing, but only within a single region, and only on AWS
|
||||||
([AWS Route 53 DNS service](https://aws.amazon.com/route53/) can be
|
([AWS Route 53 DNS service](https://aws.amazon.com/route53/) can be
|
||||||
used to load balance and fail over across multiple regions, and is
|
used to load balance and fail over across multiple regions, and is
|
||||||
@ -153,7 +153,7 @@ Ubernetes cross-cluster load balancing is built on top of the following:
|
|||||||
|
|
||||||
1. [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
|
1. [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
|
||||||
provide single, static global IP addresses which load balance and
|
provide single, static global IP addresses which load balance and
|
||||||
fail over globally (i.e. across both regions and zones). These
|
fail over globally (i.e. across both regions and zones). These
|
||||||
allow for really dumb clients, but they only work on GCE, and only
|
allow for really dumb clients, but they only work on GCE, and only
|
||||||
for HTTP(S) traffic.
|
for HTTP(S) traffic.
|
||||||
1. [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/)
|
1. [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/)
|
||||||
@ -170,7 +170,7 @@ Ubernetes cross-cluster load balancing is built on top of the following:
|
|||||||
doesn't provide any built-in geo-DNS, latency-based routing, health
|
doesn't provide any built-in geo-DNS, latency-based routing, health
|
||||||
checking, weighted round robin or other advanced capabilities.
|
checking, weighted round robin or other advanced capabilities.
|
||||||
It's plain old DNS. We would need to build all the aforementioned
|
It's plain old DNS. We would need to build all the aforementioned
|
||||||
on top of it. It can provide internal DNS services (i.e. serve RFC
|
on top of it. It can provide internal DNS services (i.e. serve RFC
|
||||||
1918 addresses).
|
1918 addresses).
|
||||||
1. [AWS Route 53 DNS service](https://aws.amazon.com/route53/) can
|
1. [AWS Route 53 DNS service](https://aws.amazon.com/route53/) can
|
||||||
be used to load balance and fail over across regions, and is also
|
be used to load balance and fail over across regions, and is also
|
||||||
@ -185,23 +185,24 @@ Ubernetes cross-cluster load balancing is built on top of the following:
|
|||||||
service IP which is load-balanced (currently simple round-robin)
|
service IP which is load-balanced (currently simple round-robin)
|
||||||
across the healthy pods comprising a service within a single
|
across the healthy pods comprising a service within a single
|
||||||
Kubernetes cluster.
|
Kubernetes cluster.
|
||||||
1. [Kubernetes Ingress](http://kubernetes.io/v1.1/docs/user-guide/ingress.html): A generic wrapper around cloud-provided L4 and L7 load balancing services, and roll-your-own load balancers run in pods, e.g. HA Proxy.
|
1. [Kubernetes Ingress](http://kubernetes.io/v1.1/docs/user-guide/ingress.html):
|
||||||
|
A generic wrapper around cloud-provided L4 and L7 load balancing services, and
|
||||||
|
roll-your-own load balancers run in pods, e.g. HA Proxy.
|
||||||
|
|
||||||
## Ubernetes API
|
## Ubernetes API
|
||||||
|
|
||||||
The Ubernetes API for load balancing should be compatible with the
|
The Ubernetes API for load balancing should be compatible with the equivalent
|
||||||
equivalent Kubernetes API, to ease porting of clients between
|
Kubernetes API, to ease porting of clients between Ubernetes and Kubernetes.
|
||||||
Ubernetes and Kubernetes. Further details below.
|
Further details below.
|
||||||
|
|
||||||
## Common Client Behavior
|
## Common Client Behavior
|
||||||
|
|
||||||
To be useful, our load balancing solution needs to work properly with
|
To be useful, our load balancing solution needs to work properly with real
|
||||||
real client applications. There are a few different classes of
|
client applications. There are a few different classes of those...
|
||||||
those...
|
|
||||||
|
|
||||||
### Browsers
|
### Browsers
|
||||||
|
|
||||||
These are the most common external clients. These are all well-written. See below.
|
These are the most common external clients. These are all well-written. See below.
|
||||||
|
|
||||||
### Well-written clients
|
### Well-written clients
|
||||||
|
|
||||||
@ -218,8 +219,8 @@ Examples:
|
|||||||
|
|
||||||
### Dumb clients
|
### Dumb clients
|
||||||
|
|
||||||
1. Don't do a DNS resolution every time they connect (or do cache
|
1. Don't do a DNS resolution every time they connect (or do cache beyond the
|
||||||
beyond the TTL).
|
TTL).
|
||||||
1. Do try multiple A records
|
1. Do try multiple A records
|
||||||
|
|
||||||
Examples:
|
Examples:
|
||||||
@ -237,34 +238,34 @@ Examples:
|
|||||||
|
|
||||||
### Dumbest clients
|
### Dumbest clients
|
||||||
|
|
||||||
1. Never do a DNS lookup - are pre-configured with a single (or
|
1. Never do a DNS lookup - are pre-configured with a single (or possibly
|
||||||
possibly multiple) fixed server IP(s). Nothing else matters.
|
multiple) fixed server IP(s). Nothing else matters.
|
||||||
|
|
||||||
## Architecture and Implementation
|
## Architecture and Implementation
|
||||||
|
|
||||||
### General control plane architecture
|
### General Control Plane Architecture
|
||||||
|
|
||||||
Each cluster hosts one or more Ubernetes master components (Ubernetes API servers, controller managers with leader election, and
|
Each cluster hosts one or more Ubernetes master components (Ubernetes API
|
||||||
etcd quorum members). This is documented in more detail in a
|
servers, controller managers with leader election, and etcd quorum members). This
|
||||||
[separate design doc: Kubernetes/Ubernetes Control Plane Resilience](https://docs.google.com/document/d/1jGcUVg9HDqQZdcgcFYlWMXXdZsplDdY6w3ZGJbU7lAw/edit#).
|
is documented in more detail in a separate design doc:
|
||||||
|
[Kubernetes/Ubernetes Control Plane Resilience](https://docs.google.com/document/d/1jGcUVg9HDqQZdcgcFYlWMXXdZsplDdY6w3ZGJbU7lAw/edit#).
|
||||||
|
|
||||||
In the description below, assume that 'n' clusters, named
|
In the description below, assume that 'n' clusters, named 'cluster-1'...
|
||||||
'cluster-1'... 'cluster-n' have been registered against an Ubernetes
|
'cluster-n' have been registered against an Ubernetes Federation "federation-1",
|
||||||
Federation "federation-1", each with their own set of Kubernetes API
|
each with their own set of Kubernetes API endpoints, so,
|
||||||
endpoints, so,
|
|
||||||
"[http://endpoint-1.cluster-1](http://endpoint-1.cluster-1),
|
"[http://endpoint-1.cluster-1](http://endpoint-1.cluster-1),
|
||||||
[http://endpoint-2.cluster-1](http://endpoint-2.cluster-1)
|
[http://endpoint-2.cluster-1](http://endpoint-2.cluster-1)
|
||||||
... [http://endpoint-m.cluster-n](http://endpoint-m.cluster-n) .
|
... [http://endpoint-m.cluster-n](http://endpoint-m.cluster-n) .
|
||||||
|
|
||||||
### Federated Services
|
### Federated Services
|
||||||
|
|
||||||
Ubernetes Services are pretty straight-forward. They're comprised of
|
Ubernetes Services are pretty straight-forward. They're comprised of multiple
|
||||||
multiple equivalent underlying Kubernetes Services, each with their
|
equivalent underlying Kubernetes Services, each with their own external
|
||||||
own external endpoint, and a load balancing mechanism across them.
|
endpoint, and a load balancing mechanism across them. Let's work through how
|
||||||
Let's work through how exactly that works in practice.
|
exactly that works in practice.
|
||||||
|
|
||||||
Our user creates the following Ubernetes Service (against an Ubernetes
|
Our user creates the following Ubernetes Service (against an Ubernetes API
|
||||||
API endpoint):
|
endpoint):
|
||||||
|
|
||||||
$ kubectl create -f my-service.yaml --context="federation-1"
|
$ kubectl create -f my-service.yaml --context="federation-1"
|
||||||
|
|
||||||
@ -290,9 +291,9 @@ where service.yaml contains the following:
|
|||||||
run: my-service
|
run: my-service
|
||||||
type: LoadBalancer
|
type: LoadBalancer
|
||||||
|
|
||||||
Ubernetes in turn creates one equivalent service (identical config to
|
Ubernetes in turn creates one equivalent service (identical config to the above)
|
||||||
the above) in each of the underlying Kubernetes clusters, each of
|
in each of the underlying Kubernetes clusters, each of which results in
|
||||||
which results in something like this:
|
something like this:
|
||||||
|
|
||||||
$ kubectl get -o yaml --context="cluster-1" service my-service
|
$ kubectl get -o yaml --context="cluster-1" service my-service
|
||||||
|
|
||||||
@ -329,9 +330,8 @@ which results in something like this:
|
|||||||
ingress:
|
ingress:
|
||||||
- ip: 104.197.117.10
|
- ip: 104.197.117.10
|
||||||
|
|
||||||
Similar services are created in `cluster-2` and `cluster-3`, each of
|
Similar services are created in `cluster-2` and `cluster-3`, each of which are
|
||||||
which are allocated their own `spec.clusterIP`, and
|
allocated their own `spec.clusterIP`, and `status.loadBalancer.ingress.ip`.
|
||||||
`status.loadBalancer.ingress.ip`.
|
|
||||||
|
|
||||||
In Ubernetes `federation-1`, the resulting federated service looks as follows:
|
In Ubernetes `federation-1`, the resulting federated service looks as follows:
|
||||||
|
|
||||||
@ -376,21 +376,21 @@ Note that the federated service:
|
|||||||
1. has no clusterIP (as it is cluster-independent)
|
1. has no clusterIP (as it is cluster-independent)
|
||||||
1. has a federation-wide load balancer hostname
|
1. has a federation-wide load balancer hostname
|
||||||
|
|
||||||
In addition to the set of underlying Kubernetes services (one per
|
In addition to the set of underlying Kubernetes services (one per cluster)
|
||||||
cluster) described above, Ubernetes has also created a DNS name
|
described above, Ubernetes has also created a DNS name (e.g. on
|
||||||
(e.g. on [Google Cloud DNS](https://cloud.google.com/dns) or
|
[Google Cloud DNS](https://cloud.google.com/dns) or
|
||||||
[AWS Route 53](https://aws.amazon.com/route53/), depending on
|
[AWS Route 53](https://aws.amazon.com/route53/), depending on configuration)
|
||||||
configuration) which provides load balancing across all of those
|
which provides load balancing across all of those services. For example, in a
|
||||||
services. For example, in a very basic configuration:
|
very basic configuration:
|
||||||
|
|
||||||
$ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
|
$ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
|
||||||
my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.117.10
|
my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.117.10
|
||||||
my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77
|
my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77
|
||||||
my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157
|
my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157
|
||||||
|
|
||||||
Each of the above IP addresses (which are just the external load
|
Each of the above IP addresses (which are just the external load balancer
|
||||||
balancer ingress IP's of each cluster service) is of course load
|
ingress IP's of each cluster service) is of course load balanced across the pods
|
||||||
balanced across the pods comprising the service in each cluster.
|
comprising the service in each cluster.
|
||||||
|
|
||||||
In a more sophisticated configuration (e.g. on GCE or GKE), Ubernetes
|
In a more sophisticated configuration (e.g. on GCE or GKE), Ubernetes
|
||||||
automatically creates a
|
automatically creates a
|
||||||
@ -411,23 +411,21 @@ for failover purposes:
|
|||||||
my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77
|
my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77
|
||||||
my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157
|
my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157
|
||||||
|
|
||||||
If Ubernetes Global Service Health Checking is enabled, multiple
|
If Ubernetes Global Service Health Checking is enabled, multiple service health
|
||||||
service health checkers running across the federated clusters
|
checkers running across the federated clusters collaborate to monitor the health
|
||||||
collaborate to monitor the health of the service endpoints, and
|
of the service endpoints, and automatically remove unhealthy endpoints from the
|
||||||
automatically remove unhealthy endpoints from the DNS record (e.g. a
|
DNS record (e.g. a majority quorum is required to vote a service endpoint
|
||||||
majority quorum is required to vote a service endpoint unhealthy, to
|
unhealthy, to avoid false positives due to individual health checker network
|
||||||
avoid false positives due to individual health checker network
|
|
||||||
isolation).
|
isolation).
|
||||||
|
|
||||||
### Federated Replication Controllers
|
### Federated Replication Controllers
|
||||||
|
|
||||||
So far we have a federated service defined, with a resolvable load
|
So far we have a federated service defined, with a resolvable load balancer
|
||||||
balancer hostname by which clients can reach it, but no pods serving
|
hostname by which clients can reach it, but no pods serving traffic directed
|
||||||
traffic directed there. So now we need a Federated Replication
|
there. So now we need a Federated Replication Controller. These are also fairly
|
||||||
Controller. These are also fairly straight-forward, being comprised
|
straight-forward, being comprised of multiple underlying Kubernetes Replication
|
||||||
of multiple underlying Kubernetes Replication Controllers which do the
|
Controllers which do the hard work of keeping the desired number of Pod replicas
|
||||||
hard work of keeping the desired number of Pod replicas alive in each
|
alive in each Kubernetes cluster.
|
||||||
Kubernetes cluster.
|
|
||||||
|
|
||||||
$ kubectl create -f my-service-rc.yaml --context="federation-1"
|
$ kubectl create -f my-service-rc.yaml --context="federation-1"
|
||||||
|
|
||||||
@ -495,54 +493,49 @@ something like this:
|
|||||||
status:
|
status:
|
||||||
replicas: 2
|
replicas: 2
|
||||||
|
|
||||||
The exact number of replicas created in each underlying cluster will
|
The exact number of replicas created in each underlying cluster will of course
|
||||||
of course depend on what scheduling policy is in force. In the above
|
depend on what scheduling policy is in force. In the above example, the
|
||||||
example, the scheduler created an equal number of replicas (2) in each
|
scheduler created an equal number of replicas (2) in each of the three
|
||||||
of the three underlying clusters, to make up the total of 6 replicas
|
underlying clusters, to make up the total of 6 replicas required. To handle
|
||||||
required. To handle entire cluster failures, various approaches are possible,
|
entire cluster failures, various approaches are possible, including:
|
||||||
including:
|
|
||||||
1. **simple overprovisioning**, such that sufficient replicas remain even if a
|
1. **simple overprovisioning**, such that sufficient replicas remain even if a
|
||||||
cluster fails. This wastes some resources, but is simple and
|
cluster fails. This wastes some resources, but is simple and reliable.
|
||||||
reliable.
|
|
||||||
2. **pod autoscaling**, where the replication controller in each
|
2. **pod autoscaling**, where the replication controller in each
|
||||||
cluster automatically and autonomously increases the number of
|
cluster automatically and autonomously increases the number of
|
||||||
replicas in its cluster in response to the additional traffic
|
replicas in its cluster in response to the additional traffic
|
||||||
diverted from the
|
diverted from the failed cluster. This saves resources and is relatively
|
||||||
failed cluster. This saves resources and is reatively simple,
|
simple, but there is some delay in the autoscaling.
|
||||||
but there is some delay in the autoscaling.
|
|
||||||
3. **federated replica migration**, where the Ubernetes Federation
|
3. **federated replica migration**, where the Ubernetes Federation
|
||||||
Control Plane detects the cluster failure and automatically
|
Control Plane detects the cluster failure and automatically
|
||||||
increases the replica count in the remaining clusters to make up
|
increases the replica count in the remaining clusters to make up
|
||||||
for the lost replicas in the failed cluster. This does not seem to
|
for the lost replicas in the failed cluster. This does not seem to
|
||||||
offer any benefits relative to pod autoscaling above, and is
|
offer any benefits relative to pod autoscaling above, and is
|
||||||
arguably more complex to implement, but we note it here as a
|
arguably more complex to implement, but we note it here as a
|
||||||
possibility.
|
possibility.
|
||||||
|
|
||||||
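For reference, an "equal spread" policy like the one in the example above (2 replicas in each of 3 clusters) can be sketched in a few lines; this is an illustration of the idea, not the scheduler's actual implementation:

```go
package main

import "fmt"

// equalSpread divides the requested replica count across the registered
// clusters, handing any remainder out one replica at a time.
func equalSpread(replicas int, clusters []string) map[string]int {
	out := make(map[string]int, len(clusters))
	if len(clusters) == 0 {
		return out
	}
	base, rem := replicas/len(clusters), replicas%len(clusters)
	for i, c := range clusters {
		out[c] = base
		if i < rem {
			out[c]++ // the first `rem` clusters absorb the remainder
		}
	}
	return out
}

func main() {
	fmt.Println(equalSpread(6, []string{"cluster-1", "cluster-2", "cluster-3"}))
	// map[cluster-1:2 cluster-2:2 cluster-3:2]
}
```
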
### Implementation Details
|
### Implementation Details
|
||||||
|
|
||||||
The implementation approach and architecture is very similar to
|
The implementation approach and architecture is very similar to Kubernetes, so
|
||||||
Kubernetes, so if you're familiar with how Kubernetes works, none of
|
if you're familiar with how Kubernetes works, none of what follows will be
|
||||||
what follows will be surprising. One additional design driver not
|
surprising. One additional design driver not present in Kubernetes is that
|
||||||
present in Kubernetes is that Ubernetes aims to be resilient to
|
Ubernetes aims to be resilient to individual cluster and availability zone
|
||||||
individual cluster and availability zone failures. So the control
|
failures. So the control plane spans multiple clusters. More specifically:
|
||||||
plane spans multiple clusters. More specifically:
|
|
||||||
|
|
||||||
+ Ubernetes runs its own distinct set of API servers (typically one
|
+ Ubernetes runs its own distinct set of API servers (typically one
|
||||||
or more per underlying Kubernetes cluster). These are completely
|
or more per underlying Kubernetes cluster). These are completely
|
||||||
distinct from the Kubernetes API servers for each of the underlying
|
distinct from the Kubernetes API servers for each of the underlying
|
||||||
clusters.
|
clusters.
|
||||||
+ Ubernetes runs its own distinct quorum-based metadata store (etcd,
|
+ Ubernetes runs its own distinct quorum-based metadata store (etcd,
|
||||||
by default). Approximately 1 quorum member runs in each underlying
|
by default). Approximately 1 quorum member runs in each underlying
|
||||||
cluster ("approximately" because we aim for an odd number of quorum
|
cluster ("approximately" because we aim for an odd number of quorum
|
||||||
members, and typically don't want more than 5 quorum members, even
|
members, and typically don't want more than 5 quorum members, even
|
||||||
if we have a larger number of federated clusters, so 2 clusters->3
|
if we have a larger number of federated clusters, so 2 clusters->3
|
||||||
quorum members, 3->3, 4->3, 5->5, 6->5, 7->5 etc).
|
quorum members, 3->3, 4->3, 5->5, 6->5, 7->5 etc).
|
||||||
|
|
||||||
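The quorum sizing rule sketched in the second bullet above (odd membership, at least 3 and at most 5 members) can be written down explicitly; this is purely illustrative:

```go
package main

import "fmt"

// quorumMembers returns an odd quorum size between 3 and 5 for a given
// number of federated clusters, matching the examples above.
func quorumMembers(clusters int) int {
	n := clusters
	if n%2 == 0 {
		n-- // keep the member count odd
	}
	if n < 3 {
		n = 3
	}
	if n > 5 {
		n = 5
	}
	return n
}

func main() {
	for c := 2; c <= 7; c++ {
		fmt.Printf("%d clusters -> %d quorum members\n", c, quorumMembers(c))
	}
}
```
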
Cluster Controllers in Ubernetes watch against the Ubernetes API
|
Cluster Controllers in Ubernetes watch against the Ubernetes API server/etcd
|
||||||
server/etcd state, and apply changes to the underlying kubernetes
|
state, and apply changes to the underlying kubernetes clusters accordingly. They
|
||||||
clusters accordingly. They also have the anti-entropy mechanism for
|
also have the anti-entropy mechanism for reconciling ubernetes "desired desired"
|
||||||
reconciling ubernetes "desired desired" state against kubernetes
|
state against kubernetes "actual desired" state.
|
||||||
"actual desired" state.
|
|
||||||
|
|
||||||
|
|
||||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||||
|
@ -71,7 +71,8 @@ unified view.
|
|||||||
|
|
||||||
Here are the functionality requirements derived from above use cases:
|
Here are the functionality requirements derived from above use cases:
|
||||||
|
|
||||||
+ Clients of the federation control plane API server can register and deregister clusters.
|
+ Clients of the federation control plane API server can register and deregister
|
||||||
|
clusters.
|
||||||
+ Workloads should be spread to different clusters according to the
|
+ Workloads should be spread to different clusters according to the
|
||||||
workload distribution policy.
|
workload distribution policy.
|
||||||
+ Pods are able to discover and connect to services hosted in other
|
+ Pods are able to discover and connect to services hosted in other
|
||||||
@ -90,7 +91,7 @@ Here are the functionality requirements derived from above use cases:
|
|||||||
It’s difficult to have a perfect design with one click that implements
|
It’s difficult to have a perfect design with one click that implements
|
||||||
all the above requirements. Therefore we will go with an iterative
|
all the above requirements. Therefore we will go with an iterative
|
||||||
approach to design and build the system. This document describes the
|
approach to design and build the system. This document describes the
|
||||||
phase one of the whole work. In phase one we will cover only the
|
phase one of the whole work. In phase one we will cover only the
|
||||||
following objectives:
|
following objectives:
|
||||||
|
|
||||||
+ Define the basic building blocks and API objects of control plane
|
+ Define the basic building blocks and API objects of control plane
|
||||||
@ -130,9 +131,9 @@ description of each module contained in above diagram.
|
|||||||
|
|
||||||
The API Server in the Ubernetes control plane works just like the API
|
The API Server in the Ubernetes control plane works just like the API
|
||||||
Server in K8S. It talks to a distributed key-value store to persist,
|
Server in K8S. It talks to a distributed key-value store to persist,
|
||||||
retrieve and watch API objects. This store is completely distinct
|
retrieve and watch API objects. This store is completely distinct
|
||||||
from the kubernetes key-value stores (etcd) in the underlying
|
from the kubernetes key-value stores (etcd) in the underlying
|
||||||
kubernetes clusters. We still use `etcd` as the distributed
|
kubernetes clusters. We still use `etcd` as the distributed
|
||||||
storage so customers don’t need to learn and manage a different
|
storage so customers don’t need to learn and manage a different
|
||||||
storage system, although it is envisaged that other storage systems
|
storage system, although it is envisaged that other storage systems
|
||||||
(consul, zookeeper) will probably be developed and supported over
|
(consul, zookeeper) will probably be developed and supported over
|
||||||
@ -141,16 +142,16 @@ time.
|
|||||||
## Ubernetes Scheduler
|
## Ubernetes Scheduler
|
||||||
|
|
||||||
The Ubernetes Scheduler schedules resources onto the underlying
|
The Ubernetes Scheduler schedules resources onto the underlying
|
||||||
Kubernetes clusters. For example it watches for unscheduled Ubernetes
|
Kubernetes clusters. For example it watches for unscheduled Ubernetes
|
||||||
replication controllers (those that have not yet been scheduled onto
|
replication controllers (those that have not yet been scheduled onto
|
||||||
underlying Kubernetes clusters) and performs the global scheduling
|
underlying Kubernetes clusters) and performs the global scheduling
|
||||||
work. For each unscheduled replication controller, it calls the policy
|
work. For each unscheduled replication controller, it calls the policy
|
||||||
engine to decide how to split workloads among clusters. It creates
|
engine to decide how to split workloads among clusters. It creates
|
||||||
Kubernetes Replication Controllers on one or more underlying clusters,
|
Kubernetes Replication Controllers on one or more underlying clusters,
|
||||||
and posts them back to `etcd` storage.
|
and posts them back to `etcd` storage.
|
||||||
|
|
||||||
One subtlety worth noting here is that the scheduling decision is
|
One subtlety worth noting here is that the scheduling decision is arrived at by
|
||||||
arrived at by combining the application-specific request from the user (which might
|
combining the application-specific request from the user (which might
|
||||||
include, for example, placement constraints), and the global policy specified
|
include, for example, placement constraints), and the global policy specified
|
||||||
by the federation administrator (for example, "prefer on-premise
|
by the federation administrator (for example, "prefer on-premise
|
||||||
clusters over AWS clusters" or "spread load equally across clusters").
|
clusters over AWS clusters" or "spread load equally across clusters").
|
||||||
@ -165,9 +166,9 @@ performs the following two kinds of work:
|
|||||||
corresponding API objects on the underlying K8S clusters.
|
corresponding API objects on the underlying K8S clusters.
|
||||||
1. It periodically retrieves the available resources metrics from the
|
1. It periodically retrieves the available resources metrics from the
|
||||||
underlying K8S cluster, and updates them as object status of the
|
underlying K8S cluster, and updates them as object status of the
|
||||||
`cluster` API object. An alternative design might be to run a pod
|
`cluster` API object. An alternative design might be to run a pod
|
||||||
in each underlying cluster that reports metrics for that cluster to
|
in each underlying cluster that reports metrics for that cluster to
|
||||||
the Ubernetes control plane. Which approach is better remains an
|
the Ubernetes control plane. Which approach is better remains an
|
||||||
open topic of discussion.
|
open topic of discussion.
|
||||||
|
|
||||||
## Ubernetes Service Controller
|
## Ubernetes Service Controller
|
||||||
@ -187,7 +188,7 @@ Cluster is a new first-class API object introduced in this design. For
|
|||||||
each registered K8S cluster there will be such an API resource in
|
each registered K8S cluster there will be such an API resource in
|
||||||
control plane. The way clients register or deregister a cluster is to
|
control plane. The way clients register or deregister a cluster is to
|
||||||
send corresponding REST requests to the following URL:
|
send corresponding REST requests to the following URL:
|
||||||
`/api/{$version}/clusters`. Because the control plane is behaving like a
|
`/api/{$version}/clusters`. Because the control plane is behaving like a
|
||||||
regular K8S client to the underlying clusters, the spec of a cluster
|
regular K8S client to the underlying clusters, the spec of a cluster
|
||||||
object contains necessary properties like K8S cluster address and
|
object contains necessary properties like K8S cluster address and
|
||||||
credentials. The status of a cluster API object will contain
|
credentials. The status of a cluster API object will contain
|
||||||
@ -294,7 +295,7 @@ $version.clusterStatus
|
|||||||
**For simplicity we didn’t introduce a separate “cluster metrics” API
|
**For simplicity we didn’t introduce a separate “cluster metrics” API
|
||||||
object here**. The cluster resource metrics are stored in cluster
|
object here**. The cluster resource metrics are stored in cluster
|
||||||
status section, just like what we did to nodes in K8S. In phase one it
|
status section, just like what we did to nodes in K8S. In phase one it
|
||||||
only contains available CPU resources and memory resources. The
|
only contains available CPU resources and memory resources. The
|
||||||
cluster controller will periodically poll the underlying cluster API
|
cluster controller will periodically poll the underlying cluster API
|
||||||
Server to get cluster capability. In phase one it gets the metrics by
|
Server to get cluster capability. In phase one it gets the metrics by
|
||||||
simply aggregating metrics from all nodes. In future we will improve
|
simply aggregating metrics from all nodes. In future we will improve
|
||||||
@ -315,7 +316,7 @@ Below is the state transition diagram.
|
|||||||
## Replication Controller

A global workload submitted to the control plane is represented as an
Ubernetes replication controller. When a replication controller is
submitted to the control plane, clients need a way to express its
requirements or preferences on clusters. Depending on the use case,
this may be complex. For example:

@ -327,7 +328,7 @@ cases it may be complex. For example:
(use case: workload )
+ Seventy percent of this workload should be scheduled to cluster Foo,
and thirty percent should be scheduled to cluster Bar (use case:
vendor lock-in avoidance). In phase one, we only introduce a
_clusterSelector_ field to filter acceptable clusters. By default
there is no such selector, which means any cluster is acceptable.

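The selector itself is not fully specified in phase one. As a rough
illustration only, here is a small Go sketch (assumed label-equality
semantics and hypothetical field names, not the Ubernetes API) of how a
_clusterSelector_ could filter the registered clusters:

```go
package main

import "fmt"

// Cluster is a hypothetical, simplified stand-in for the cluster API object
// described above; only the fields needed for selection are shown.
type Cluster struct {
	Name   string
	Labels map[string]string
}

// clusterSelector is assumed to be a simple label-equality selector, mirroring
// how Kubernetes label selectors work for pods and nodes.
type clusterSelector map[string]string

// matches reports whether every key/value pair in the selector is present on
// the cluster. An empty selector accepts any cluster (the default case).
func (s clusterSelector) matches(c Cluster) bool {
	for k, v := range s {
		if c.Labels[k] != v {
			return false
		}
	}
	return true
}

// filterAcceptable returns the clusters a replication controller may be
// scheduled onto, given its clusterSelector.
func filterAcceptable(clusters []Cluster, sel clusterSelector) []Cluster {
	var out []Cluster
	for _, c := range clusters {
		if sel.matches(c) {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	clusters := []Cluster{
		{Name: "foo", Labels: map[string]string{"provider": "gce"}},
		{Name: "bar", Labels: map[string]string{"provider": "aws"}},
	}
	sel := clusterSelector{"provider": "gce"}
	fmt.Println(filterAcceptable(clusters, sel)) // only cluster "foo" is acceptable
}
```
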
@ -376,7 +377,7 @@ clusters. How to handle this will be addressed after phase one.

The Service API object exposed by Ubernetes is similar to service
objects on Kubernetes. It defines the access to a group of pods. The
Ubernetes service controller will create corresponding Kubernetes
service objects on underlying clusters. These are detailed in a
separate design document: [Federated Services](federated-services.md).

## Pod

@ -389,7 +390,8 @@ order to keep the Ubernetes API compatible with the Kubernetes API.

## Scheduling

The below diagram shows how workloads are scheduled on the Ubernetes control
plane:

1. A replication controller is created by the client.
1. APIServer persists it into the storage.

@ -425,8 +427,8 @@ proposed solutions like resource reservation mechanisms.

This part has been included in the section “Federated Service” of the
document
“[Ubernetes Cross-cluster Load Balancing and Service Discovery Requirements and System Design](federated-services.md)”.
Please refer to that document for details.


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->

@ -36,33 +36,40 @@ Documentation for other releases can be found at

## Preface

This document briefly describes the design of the horizontal autoscaler for
pods. The autoscaler (implemented as a Kubernetes API resource and controller)
is responsible for dynamically controlling the number of replicas of some
collection (e.g. the pods of a ReplicationController) to meet some objective(s),
for example a target per-pod CPU utilization.

This design supersedes [autoscaling.md](http://releases.k8s.io/release-1.0/docs/proposals/autoscaling.md).

## Overview

The resource usage of a serving application usually varies over time: sometimes
the demand for the application rises, and sometimes it drops. In Kubernetes
version 1.0, a user can only manually set the number of serving pods. Our aim is
to provide a mechanism for the automatic adjustment of the number of pods based
on CPU utilization statistics (a future version will allow autoscaling based on
other resources/metrics).

## Scale Subresource

In Kubernetes version 1.1, we are introducing the Scale subresource and
implementing horizontal autoscaling of pods based on it. The Scale subresource
is supported for replication controllers and deployments. The Scale subresource
is a virtual resource (it does not correspond to an object stored in etcd). It
is only present in the API as an interface that a controller (in this case the
HorizontalPodAutoscaler) can use to dynamically scale the number of replicas
controlled by some other API object (currently ReplicationController and
Deployment) and to learn the current number of replicas. Scale is a subresource
of the API object that it serves as the interface for. The Scale subresource is
useful because whenever we introduce another type we want to autoscale, we just
need to implement the Scale subresource for it. The wider discussion regarding
Scale took place in issue
[#1629](https://github.com/kubernetes/kubernetes/issues/1629).

The Scale subresource is available in the API for a replication controller or
deployment under the following paths:

`apis/extensions/v1beta1/replicationcontrollers/myrc/scale`

@ -99,14 +106,15 @@ type ScaleStatus struct {
}
```

Writing to `ScaleSpec.Replicas` resizes the replication controller/deployment
associated with the given Scale subresource. `ScaleStatus.Replicas` reports how
many pods are currently running in the replication controller/deployment, and
`ScaleStatus.Selector` returns the selector for the pods.

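As an illustration of how a client might exercise this interface, here is a
minimal Go sketch that reads the current replica count through the Scale
subresource and then resizes by writing `Spec.Replicas` back. The server
address, the lack of authentication, and the trimmed-down structs (no
kind/apiVersion or metadata) are assumptions made to keep the example short; it
uses the path shown above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Trimmed-down stand-ins for the Scale fields described above. The real
// objects also carry kind/apiVersion and metadata.
type scaleSpec struct {
	Replicas int `json:"replicas"`
}

type scaleStatus struct {
	Replicas int               `json:"replicas"`
	Selector map[string]string `json:"selector,omitempty"`
}

type scale struct {
	Spec   scaleSpec   `json:"spec"`
	Status scaleStatus `json:"status"`
}

func main() {
	// Assumptions: an apiserver reachable on localhost:8080 without auth, and
	// an existing replication controller named "myrc".
	url := "http://localhost:8080/apis/extensions/v1beta1/replicationcontrollers/myrc/scale"

	// Read the current Scale: Status.Replicas is the observed pod count.
	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	var current scale
	if err := json.NewDecoder(resp.Body).Decode(&current); err != nil {
		panic(err)
	}
	resp.Body.Close()
	fmt.Println("currently running:", current.Status.Replicas)

	// Resize by writing Spec.Replicas back through the subresource.
	current.Spec.Replicas = 5
	body, err := json.Marshal(current)
	if err != nil {
		panic(err)
	}
	req, err := http.NewRequest("PUT", url, bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	putResp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	putResp.Body.Close()
}
```
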
## HorizontalPodAutoscaler Object

In Kubernetes version 1.1, we are introducing the HorizontalPodAutoscaler
object. It is accessible under:

`apis/extensions/v1beta1/horizontalpodautoscalers/myautoscaler`

@ -168,8 +176,9 @@ type HorizontalPodAutoscalerStatus struct {
```

`ScaleRef` is a reference to the Scale subresource.
`MinReplicas`, `MaxReplicas` and `CPUUtilization` define the autoscaler
configuration. We are also introducing the HorizontalPodAutoscalerList object to
enable listing all autoscalers in a namespace:

```go
// list of horizontal pod autoscaler objects.

@ -184,19 +193,22 @@ type HorizontalPodAutoscalerList struct {

## Autoscaling Algorithm

The autoscaler is implemented as a control loop. It periodically queries pods
described by `Status.PodSelector` of the Scale subresource, and collects their
CPU utilization. Then, it compares the arithmetic mean of the pods' CPU
utilization with the target defined in `Spec.CPUUtilization`, and adjusts the
replicas of the Scale if needed to match the target (preserving condition:
MinReplicas <= Replicas <= MaxReplicas).

The period of the autoscaler is controlled by the
`--horizontal-pod-autoscaler-sync-period` flag of the controller manager. The
default value is 30 seconds.

CPU utilization is the recent CPU usage of a pod (average across the last 1
minute) divided by the CPU requested by the pod. In Kubernetes version 1.1, CPU
usage is taken directly from Heapster. In the future, there will be an API on
the master for this purpose (see issue [#11951](https://github.com/kubernetes/kubernetes/issues/11951)).

The target number of pods is calculated from the following formula:

@ -204,66 +216,76 @@ The target number of pods is calculated from the following formula:

```
TargetNumOfPods = ceil(sum(CurrentPodsCPUUtilization) / Target)
```

Starting and stopping pods may introduce noise to the metric (for instance,
starting may temporarily increase CPU). So, after each action, the autoscaler
should wait some time for reliable data. Scale-up can only happen if there was
no rescaling within the last 3 minutes. Scale-down will wait for 5 minutes from
the last rescaling. Moreover, any scaling will only be made if
`avg(CurrentPodsConsumption) / Target` drops below 0.9 or increases above 1.1
(10% tolerance). Such an approach has two benefits (see the sketch after this
list):

* Autoscaler works in a conservative way. If new user load appears, it is
important for us to rapidly increase the number of pods, so that user requests
will not be rejected. Lowering the number of pods is not that urgent.

* Autoscaler avoids thrashing, i.e. it prevents rapid execution of conflicting
decisions if the load is not stable.

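To make the decision rule concrete, here is a minimal Go sketch of the
computation described above. The helper and its inputs are illustrative
assumptions, not the actual controller code; in particular, the 3-minute and
5-minute cooldowns and the Heapster metric collection are omitted.

```go
package main

import (
	"fmt"
	"math"
)

const tolerance = 0.1 // 10% tolerance band around the target

// desiredReplicas implements TargetNumOfPods = ceil(sum(utilization) / target),
// skips changes while the average stays within the tolerance band, and clamps
// the result to [min, max].
func desiredReplicas(podUtilization []float64, target float64, current, min, max int) int {
	if len(podUtilization) == 0 {
		return current
	}
	var sum float64
	for _, u := range podUtilization {
		sum += u
	}
	avg := sum / float64(len(podUtilization))

	// Within +/-10% of the target: do nothing (avoids thrashing).
	if ratio := avg / target; ratio > 1.0-tolerance && ratio < 1.0+tolerance {
		return current
	}

	desired := int(math.Ceil(sum / target))
	if desired < min {
		desired = min
	}
	if desired > max {
		desired = max
	}
	return desired
}

func main() {
	// Three pods averaging 60% utilization against a 30% target scale up to 6,
	// subject to MinReplicas=1 and MaxReplicas=10.
	fmt.Println(desiredReplicas([]float64{0.55, 0.60, 0.65}, 0.30, 3, 1, 10))
}
```
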
## Relative vs. absolute metrics

We chose values of the target metric to be relative (e.g. 90% of requested CPU
resource) rather than absolute (e.g. 0.6 core) for the following reason. If we
chose an absolute metric, the user would need to guarantee that the target is
lower than the request. Otherwise, overloaded pods may not be able to consume
more than the autoscaler's absolute target utilization, thereby preventing the
autoscaler from seeing high enough utilization to trigger it to scale up. This
may be especially troublesome when the user changes the requested resources for
a pod, because they would need to also change the autoscaler utilization
threshold. Therefore, we decided to choose a relative metric. For the user, it
is enough to set it to a value smaller than 100%, and further changes of
requested resources will not invalidate it.

## Support in kubectl

To make manipulation of the HorizontalPodAutoscaler object simpler, we added
support for creating/updating/deleting/listing of HorizontalPodAutoscaler to
kubectl. In addition, in the future, we are planning to add kubectl support for
the following use-cases:

* When creating a replication controller or deployment with
`kubectl create [-f]`, there should be a possibility to specify an additional
autoscaler object. (This should work out-of-the-box when creation of autoscaler
is supported by kubectl, as we may include multiple objects in the same config
file).
* *[future]* When running an image with `kubectl run`, there should be an
additional option to create an autoscaler for it.
* *[future]* We will add a new command `kubectl autoscale` that will allow for
easy creation of an autoscaler object for an already existing replication
controller/deployment.

## Next steps

We list here some features that are not supported in Kubernetes version 1.1.
However, we want to keep them in mind, as they will most probably be needed in
the future. Our design is in general compatible with them.

* *[future]* **Autoscale pods based on metrics different than CPU** (e.g.
memory, network traffic, qps). This includes scaling based on a
custom/application metric.
* *[future]* **Autoscale pods based on an aggregate metric.** The autoscaler,
instead of computing the average for a target metric across pods, will use a
single, external metric (e.g. a qps metric from the load balancer). The metric
will be aggregated while the target will remain per-pod (e.g. when observing
100 qps on the load balancer while the target is 20 qps per pod, the autoscaler
will set the number of replicas to 5).
* *[future]* **Autoscale pods based on multiple metrics.** If the target numbers
of pods for different metrics are different, choose the largest target number of
pods.
* *[future]* **Scale the number of pods starting from 0.** All pods can be
turned off, and then turned on when there is demand for them. When a request to
a service with no pods arrives, kube-proxy will generate an event for the
autoscaler to create a new pod. Discussed in issue [#3247](https://github.com/kubernetes/kubernetes/issues/3247).
* *[future]* **When scaling down, make a more educated decision about which pods
to kill.** E.g.: if two or more pods from the same replication controller are on
the same node, kill one of them. Discussed in issue [#4301](https://github.com/kubernetes/kubernetes/issues/4301).

@ -34,95 +34,111 @@ Documentation for other releases can be found at

# Identifiers and Names in Kubernetes

A summarization of the goals and recommendations for identifiers in Kubernetes.
Described in GitHub issue [#199](http://issue.k8s.io/199).


## Definitions

`UID`: A non-empty, opaque, system-generated value guaranteed to be unique in
time and space; intended to distinguish between historical occurrences of
similar entities.

`Name`: A non-empty string guaranteed to be unique within a given scope at a
particular time; used in resource URLs; provided by clients at creation time and
encouraged to be human friendly; intended to facilitate creation idempotence and
space-uniqueness of singleton objects, distinguish distinct entities, and
reference particular entities across operations.

[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) `label` (DNS_LABEL):
An alphanumeric (a-z, and 0-9) string, with a maximum length of 63 characters,
with the '-' character allowed anywhere except the first or last character,
suitable for use as a hostname or segment in a domain name.

[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) `subdomain` (DNS_SUBDOMAIN):
One or more lowercase rfc1035/rfc1123 labels separated by '.' with a maximum
length of 253 characters.

[rfc4122](http://www.ietf.org/rfc/rfc4122.txt) `universally unique identifier` (UUID):
A 128 bit generated value that is extremely unlikely to collide across time and
space and requires no central coordination.

[rfc6335](https://tools.ietf.org/rfc/rfc6335.txt) `port name` (IANA_SVC_NAME):
An alphanumeric (a-z, and 0-9) string, with a maximum length of 15 characters,
with the '-' character allowed anywhere except the first or the last character
or adjacent to another '-' character; it must contain at least one (a-z)
character.

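For illustration, these formats can be checked mechanically. The following Go
sketch is a plain reading of the definitions above; the patterns are
approximations for this document, not the apiserver's actual validation code.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// DNS_LABEL: a-z0-9, with '-' allowed except at the ends; max 63 characters.
	dnsLabel = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

	// IANA_SVC_NAME: a-z0-9 plus non-adjacent, non-terminal '-'; max 15
	// characters; must also contain at least one letter (checked separately).
	ianaSvcName = regexp.MustCompile(`^[a-z0-9]([a-z0-9]|-[a-z0-9])*$`)
	hasLetter   = regexp.MustCompile(`[a-z]`)
)

func isDNSLabel(s string) bool {
	return len(s) <= 63 && dnsLabel.MatchString(s)
}

// isDNSSubdomain: one or more DNS labels joined by '.', max 253 characters.
func isDNSSubdomain(s string) bool {
	if len(s) == 0 || len(s) > 253 {
		return false
	}
	for _, part := range strings.Split(s, ".") {
		if !isDNSLabel(part) {
			return false
		}
	}
	return true
}

func isIANASvcName(s string) bool {
	return len(s) <= 15 && ianaSvcName.MatchString(s) && hasLetter.MatchString(s)
}

func main() {
	fmt.Println(isDNSLabel("backend-x4eb1"))      // true
	fmt.Println(isDNSSubdomain("guestbook.user")) // true
	fmt.Println(isIANASvcName("http-alt"))        // true
	fmt.Println(isIANASvcName("-bad"))            // false
}
```
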
## Objectives for names and UIDs

1. Uniquely identify (via a UID) an object across space and time.
2. Uniquely name (via a name) an object across space.
3. Provide human-friendly names in API operations and/or configuration files.
4. Allow idempotent creation of API resources (#148) and enforcement of
space-uniqueness of singleton objects.
5. Allow DNS names to be automatically generated for some objects.

## General design

1. When an object is created via an API, a Name string (a DNS_SUBDOMAIN) must
be specified. Name must be non-empty and unique within the apiserver. This
enables idempotent and space-unique creation operations. Parts of the system
(e.g. replication controller) may join strings (e.g. a base name and a random
suffix) to create a unique Name. For situations where generating a name is
impractical, some or all objects may support a param to auto-generate a name.
Generating random names will defeat idempotency.
    * Examples: "guestbook.user", "backend-x4eb1"
2. When an object is created via an API, a Namespace string (a DNS_SUBDOMAIN?
format TBD via #1114) may be specified. Depending on the API receiver,
namespaces might be validated (e.g. apiserver might ensure that the namespace
actually exists). If a namespace is not specified, one will be assigned by the
API receiver. This assignment policy might vary across API receivers (e.g.
apiserver might have a default, kubelet might generate something semi-random).
    * Example: "api.k8s.example.com"
3. Upon acceptance of an object via an API, the object is assigned a UID
(a UUID). UID must be non-empty and unique across space and time.
    * Example: "01234567-89ab-cdef-0123-456789abcdef"

|
## Case study: Scheduling a pod
|
||||||
|
|
||||||
Pods can be placed onto a particular node in a number of ways. This case
|
Pods can be placed onto a particular node in a number of ways. This case study
|
||||||
study demonstrates how the above design can be applied to satisfy the
|
demonstrates how the above design can be applied to satisfy the objectives.
|
||||||
objectives.
|
|
||||||
|
|
||||||
### A pod scheduled by a user through the apiserver
|
### A pod scheduled by a user through the apiserver
|
||||||
|
|
||||||
1. A user submits a pod with Namespace="" and Name="guestbook" to the apiserver.
|
1. A user submits a pod with Namespace="" and Name="guestbook" to the apiserver.
|
||||||
|
|
||||||
2. The apiserver validates the input.
|
2. The apiserver validates the input.
|
||||||
1. A default Namespace is assigned.
|
1. A default Namespace is assigned.
|
||||||
2. The pod name must be space-unique within the Namespace.
|
2. The pod name must be space-unique within the Namespace.
|
||||||
3. Each container within the pod has a name which must be space-unique within the pod.
|
3. Each container within the pod has a name which must be space-unique within
|
||||||
|
the pod.
|
||||||
3. The pod is accepted.
|
3. The pod is accepted.
|
||||||
1. A new UID is assigned.
|
1. A new UID is assigned.
|
||||||
|
|
||||||
4. The pod is bound to a node.
|
4. The pod is bound to a node.
|
||||||
1. The kubelet on the node is passed the pod's UID, Namespace, and Name.
|
1. The kubelet on the node is passed the pod's UID, Namespace, and Name.
|
||||||
|
|
||||||
5. Kubelet validates the input.
|
5. Kubelet validates the input.
|
||||||
|
|
||||||
6. Kubelet runs the pod.
|
6. Kubelet runs the pod.
|
||||||
1. Each container is started up with enough metadata to distinguish the pod from whence it came.
|
1. Each container is started up with enough metadata to distinguish the pod
|
||||||
2. Each attempt to run a container is assigned a UID (a string) that is unique across time.
|
from whence it came.
|
||||||
* This may correspond to Docker's container ID.
|
2. Each attempt to run a container is assigned a UID (a string) that is
|
||||||
|
unique across time. * This may correspond to Docker's container ID.
|
||||||
|
|
||||||
### A pod placed by a config file on the node
|
### A pod placed by a config file on the node
|
||||||
|
|
||||||
1. A config file is stored on the node, containing a pod with UID="", Namespace="", and Name="cadvisor".
|
1. A config file is stored on the node, containing a pod with UID="",
|
||||||
|
Namespace="", and Name="cadvisor".
|
||||||
2. Kubelet validates the input.
|
2. Kubelet validates the input.
|
||||||
1. Since UID is not provided, kubelet generates one.
|
1. Since UID is not provided, kubelet generates one.
|
||||||
2. Since Namespace is not provided, kubelet generates one.
|
2. Since Namespace is not provided, kubelet generates one.
|
||||||
1. The generated namespace should be deterministic and cluster-unique for the source, such as a hash of the hostname and file path.
|
1. The generated namespace should be deterministic and cluster-unique for
|
||||||
|
the source, such as a hash of the hostname and file path.
|
||||||
* E.g. Namespace="file-f4231812554558a718a01ca942782d81"
|
* E.g. Namespace="file-f4231812554558a718a01ca942782d81"
|
||||||
|
|
||||||
3. Kubelet runs the pod.
|
3. Kubelet runs the pod.
|
||||||
1. Each container is started up with enough metadata to distinguish the pod from whence it came.
|
1. Each container is started up with enough metadata to distinguish the pod
|
||||||
2. Each attempt to run a container is assigned a UID (a string) that is unique across time.
|
from whence it came.
|
||||||
|
2. Each attempt to run a container is assigned a UID (a string) that is
|
||||||
|
unique across time.
|
||||||
1. This may correspond to Docker's container ID.
|
1. This may correspond to Docker's container ID.
|
||||||
|
|
||||||
|
|
||||||
|
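As an illustration of such a deterministic namespace, the following Go sketch
hashes the hostname and file path. The `file-` prefix, the use of md5, and the
example path are assumptions chosen only to match the shape of the example
value above, not a specification.

```go
package main

import (
	"crypto/md5"
	"fmt"
	"os"
)

// namespaceForFile derives a deterministic, cluster-unique namespace from the
// source of a file-based pod: the node's hostname plus the config file path.
func namespaceForFile(hostname, path string) string {
	sum := md5.Sum([]byte(hostname + ":" + path))
	return fmt.Sprintf("file-%x", sum[:])
}

func main() {
	hostname, _ := os.Hostname()
	// Illustrative path only.
	fmt.Println(namespaceForFile(hostname, "/etc/kubernetes/manifests/cadvisor.yaml"))
	// Output has the form file-<32 hex characters>, e.g.
	// file-f4231812554558a718a01ca942782d81
}
```
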
@ -53,64 +53,66 @@ a third way to run embarrassingly parallel programs, with a focus on
ease of use.

This new style of Job is called an *indexed job*, because each Pod of the Job
is specialized to work on a particular *index* from a fixed length array of work
items.

## Background

The Kubernetes [Job](../../docs/user-guide/jobs.md) already supports
the embarrassingly parallel use case through *workqueue jobs*.
While [workqueue jobs](../../docs/user-guide/jobs.md#job-patterns) are very
flexible, they can be difficult to use. They: (1) typically require running a
message queue or other database service, (2) typically require modifications
to existing binaries and images, and (3) subtle race conditions are easy to
overlook.

Users also have another option for parallel jobs: creating [multiple Job objects
from a template](hdocs/design/indexed-job.md#job-patterns). For small numbers of
Jobs, this is a fine choice. Labels make it easy to view and delete multiple Job
objects at once. But, that approach also has its drawbacks: (1) for large levels
of parallelism (hundreds or thousands of pods) this approach means that listing
all jobs presents too much information, (2) users want a single source of
information about the success or failure of what the user views as a single
logical process.

Indexed job provides a third option with better ease-of-use for common
use cases.

## Requirements

### User Requirements

- Users want an easy way to run a Pod to completion *for each* item within a
[work list](#example-use-cases).

- Users want to run these pods in parallel for speed, but to vary the level of
parallelism as needed, independent of the number of work items.

- Users want to do this without requiring changes to existing images,
or source-to-image pipelines.

- Users want a single object that encompasses the lifetime of the parallel
program. Deleting it should delete all dependent objects. It should report the
status of the overall process. Users should be able to wait for it to complete,
and can refer to it from other resource types, such as
[ScheduledJob](https://github.com/kubernetes/kubernetes/pull/11980).

### Example Use Cases

Here are several examples of *work lists*: lists of command lines that the user
wants to run, each line its own Pod. (Note that in practice, a work list may not
ever be written out in this form, but it exists in the mind of the Job creator,
and it is a useful way to talk about the intent of the user when discussing
alternatives for specifying Indexed Jobs).

Note that we will not have the user express their requirements in work list
form; it is just a format for presenting use cases. Subsequent discussion will
reference these work lists.

#### Work List 1

Process several files with the same program:

```
/usr/local/bin/process_file 12342.dat

@ -120,7 +122,7 @@ Process several files with the same program

#### Work List 2

Process a matrix (or image, etc) in rectangular blocks:

```
/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 0 --end_col 15

@ -131,7 +133,7 @@ Process a matrix (or image, etc) in rectangular blocks

#### Work List 3

Build a program at several different git commits:

```
HASH=3cab5cb4a git checkout $HASH && make clean && make VERSION=$HASH

@ -141,7 +143,7 @@ HASH=a8b5e34c5 git checkout $HASH && make clean && make VERSION=$HASH

#### Work List 4

Render several frames of a movie:

```
./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 1

@ -151,7 +153,8 @@ Render several frames of a movie.

#### Work List 5

Render several blocks of frames (Render blocks to avoid Pod startup overhead for
every frame):

```
./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 1 --frame-end 100

@ -167,57 +170,59 @@ Given a work list, like in the [work list examples](#work-list-examples),
the information from the work list needs to get into each Pod of the Job.

Users will typically not want to create a new image for each job they
run. They will want to use existing images. So, the image is not the place
for the work list.

A work list can be stored on networked storage, and mounted by pods of the job.
Also, as a shortcut, for small worklists, it can be included in an annotation on
the Job object, which is then exposed as a volume in the pod via the downward
API.

### What Varies Between Pods of a Job

Pods need to differ in some way to do something different. (They do not differ
in the work-queue style of Job, but that style has ease-of-use issues).

A general approach would be to allow pods to differ from each other in arbitrary
ways. For example, the Job object could have a list of PodSpecs to run.
However, this is so general that it provides little value. It would:

- make the Job Spec very verbose, especially for jobs with thousands of work
items
- Job becomes such a vague concept that it is hard to explain to users
- in practice, we do not see cases where many pods differ across many fields of
their specs and need to run as a group, with no ordering constraints
- CLIs and UIs need to support more options for creating Job
- it is useful for monitoring and accounting databases to aggregate data for
pods with the same controller; however, pods with very different Specs may not
make sense to aggregate
- profiling, debugging, accounting, auditing and monitoring tools cannot assume
common images/files, behaviors, provenance and so on between Pods of a Job

Also, variety has another cost. Pods which differ in ways that affect scheduling
(node constraints, resource requirements, labels) prevent the scheduler from
treating them as fungible, which is an important optimization for the scheduler.

Therefore, we will not allow Pods from the same Job to differ arbitrarily
(anyway, users can use multiple Job objects for that case). We will try to
allow as little as possible to differ between pods of the same Job, while still
allowing users to express common parallel patterns easily. For users who need to
run jobs which differ in other ways, they can create multiple Jobs, and manage
them as a group using labels.

From the above work lists, we see a need for Pods which differ in their command
lines, and in their environment variables. These work lists do not require the
pods to differ in other ways.

Experience in [similar systems](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf)
has shown this model to be applicable to a very broad range of problems, despite
this restriction.

Therefore we will allow pods in the same Job to differ **only** in the following
aspects:

- command line
- environment variables


### Composition of existing images

The docker image that is used in a job may not be maintained by the person

@ -230,9 +235,9 @@ This needs more thought.

### Running Ad-Hoc Jobs using kubectl

A user should be able to easily start an Indexed Job using `kubectl`. For
example, to run [work list 1](#work-list-1), a user should be able to type
something simple like:

```
kubectl run process-files --image=myfileprocessor \

@ -246,13 +251,16 @@ In the above example:
- `--restart=OnFailure` implies creating a job instead of a replicationController.
- Each pod's command line is `/usr/local/bin/process_file $F`.
- `--per-completion-env=` implies the job's `.spec.completions` is set to the
length of the argument array (3 in the example).
- `--per-completion-env=F=<values>` causes an env var `F` to be available in
the environment when the command line is evaluated.

How exactly this happens is discussed later in the doc: this is a sketch of the
user experience.

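To illustrate the semantics of the proposed flag (which does not exist in
kubectl today), here is a Go sketch of how a `--per-completion-env=F=<values>`
argument could map onto `.spec.completions` and per-index environment values.
The parsing and names are assumptions for this example only.

```go
package main

import (
	"fmt"
	"strings"
)

// perCompletionEnv splits "NAME=v1,v2,..." into the env var name, the per-index
// values, and the implied number of completions (the length of the value list).
func perCompletionEnv(flag string) (name string, values []string, completions int, err error) {
	parts := strings.SplitN(flag, "=", 2)
	if len(parts) != 2 {
		return "", nil, 0, fmt.Errorf("expected NAME=v1,v2,...: %q", flag)
	}
	name = parts[0]
	values = strings.Split(parts[1], ",")
	return name, values, len(values), nil
}

func main() {
	name, values, completions, err := perCompletionEnv("F=12342.dat,97283.dat,38732.dat")
	if err != nil {
		panic(err)
	}
	fmt.Println("completions:", completions)
	// The pod for completion index i would run with F=values[i-1], so its
	// command line /usr/local/bin/process_file $F expands differently per index.
	for i := 1; i <= completions; i++ {
		fmt.Printf("index %d: %s=%s\n", i, name, values[i-1])
	}
}
```
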
In practice, the list of files might be much longer and stored in a file on the
user's local host, like:

```
$ cat files-to-process.txt

@ -266,16 +274,27 @@ So, the user could specify instead: `--per-completion-env=F="$(cat files-to-proc

However, `kubectl` should also support a format like:
`--per-completion-env=F=@files-to-process.txt`.
That allows `kubectl` to parse the file, point out any syntax errors, and would
not run up against command line length limits (2MB is common, as low as 4kB is
POSIX compliant).

One case we do not try to handle is where the file of work is stored on a cloud
filesystem, and not accessible from the user's local host. Then we cannot easily
use indexed job, because we do not know the number of completions. The user
needs to copy the file locally first or use the Work-Queue style of Job (already
supported).

Another case we do not try to handle is where the input file does not exist yet
because this Job is to be run at a future time, or depends on another job. The
workflow and scheduled job proposals need to consider this case. For that case,
you could use an indexed job which runs a program which shards the input file
(map-reduce-style).

#### Multiple parameters

The user may also have multiple parameters, like in [work list 2](#work-list-2).
One way is to just list all the command lines already expanded, one per line, in
a file, like this:

```
$ cat matrix-commandlines.txt

@ -295,10 +314,12 @@ kubectl run process-matrix --image=my/matrix \
'eval "$COMMAND_LINE"'
```

However, this may have some subtleties with shell escaping. Also, it depends on
the user knowing all the correct arguments to the docker image being used (more
on this later).

Instead, kubectl should support multiple instances of the `--per-completion-env`
flag. For example, to implement work list 2, a user could do:

```
kubectl run process-matrix --image=my/matrix \

@ -313,8 +334,8 @@ kubectl run process-matrix --image=my/matrix \

### Composition With Workflows and ScheduledJob

A user should be able to create a job (Indexed or not) which runs at a specific
time (or times). For example:

```
$ kubectl run process-files --image=myfileprocessor \

@ -326,12 +347,16 @@ $ kubectl run process-files --image=myfileprocessor \
created "scheduledJob/process-files-37dt3"
```

Kubectl should build the same JobSpec, and then put it into a ScheduledJob
(#11980) and create that.

For [workflow type jobs](../../docs/user-guide/jobs.md#job-patterns), creating a
complete workflow from a single command line would be messy, because of the need
to specify all the arguments multiple times.

For that use case, the user could create a workflow message by hand. Or the user
could create a job template, and then make a workflow from the templates,
perhaps like this:

```
$ kubectl run process-files --image=myfileprocessor \

@ -357,17 +382,17 @@ created "workflow/process-and-merge"

### Completion Indexes

A JobSpec specifies the number of times a pod needs to complete successfully,
through the `job.Spec.Completions` field. The number of completions will be
equal to the number of work items in the work list.

Each pod that the job controller creates is intended to complete one work item
from the work list. Since a pod may fail, several pods may, serially, attempt to
complete the same index. Therefore, we call it a *completion index* (or just
*index*), but not a *pod index*.

For each completion index, in the range 1 to `.job.Spec.Completions`, the job
controller will create a pod with that index, and keep creating them on failure,
until each index is completed.

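The per-index retry behavior can be sketched as follows. This toy Go loop is
illustrative only: the real controller would create actual pods, run indexes in
parallel up to the job's parallelism, and react to pod status rather than a
function return value.

```go
package main

import "fmt"

// runJob drives every completion index from 1 to completions until it
// succeeds; distinct attempts may (serially) work on the same index.
func runJob(completions int, runPod func(index, attempt int) error) {
	for index := 1; index <= completions; index++ {
		for attempt := 1; ; attempt++ {
			if err := runPod(index, attempt); err == nil {
				break // this completion index is done
			}
			// On failure, another pod is created for the same index.
		}
	}
}

func main() {
	runJob(3, func(index, attempt int) error {
		fmt.Printf("pod for completion index %d, attempt %d\n", index, attempt)
		if index == 2 && attempt == 1 {
			return fmt.Errorf("simulated pod failure")
		}
		return nil
	})
}
```
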
A dense integer index, rather than a sparse string index (e.g. using just
`metadata.generate-name`), makes it easy to use the index to look up parameters

@ -375,9 +400,9 @@ in, for example, an array in shared storage.
|
|||||||
|
|
||||||
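To make the benefit of a dense index concrete, here is a minimal Go sketch of a
pod looking up its work item by completion index. The `INDEX` env var name, the
`/vol0/worklist.txt` location, and the one-item-per-line format are assumptions
made for this example, not part of the proposal; the 1-based range follows the
text above.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strconv"
)

func main() {
	// The completion index, e.g. exposed to the pod as an env var.
	idx, err := strconv.Atoi(os.Getenv("INDEX"))
	if err != nil {
		log.Fatalf("INDEX is not an integer: %v", err)
	}

	// Assumed work list on shared storage, one work item per line.
	f, err := os.Open("/vol0/worklist.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var items []string
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		items = append(items, scanner.Text())
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}

	// Indexes run from 1 to .job.Spec.Completions, per the proposal.
	if idx < 1 || idx > len(items) {
		log.Fatalf("index %d out of range (1..%d)", idx, len(items))
	}
	fmt.Println("work item for this pod:", items[idx-1])
}
```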
### Pod Identity and Template Substitution in Job Controller

The JobSpec contains a single pod template. When the job controller creates a
particular pod, it copies the pod template and modifies it in some way to make
that pod distinctive. Whatever is distinctive about that pod is its *identity*.

We consider several options.

@@ -387,45 +412,46 @@ The job controller substitutes only the *completion index* of the pod into the
pod template when creating it. The JSON it POSTs differs only in a single
field.

We would put the completion index as a stringified integer, into an annotation
of the pod. The user can extract it from the annotation into an env var via the
downward API, or put it in a file via a Downward API volume, and parse it
himself.

Once it is an environment variable in the pod (say `$INDEX`), then one of two
things can happen.

First, the main program can know how to map from an integer index to what it
needs to do. For example, from Work List 4 above:

```
./blender /vol1/mymodel.blend -o /vol2/frame_#### -f $INDEX
```

Second, a shell script can be prepended to the original command line which maps
the index to one or more string parameters. For example, to implement Work List
5 above, you could do:

```
/vol0/setupenv.sh && ./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start $START_FRAME --frame-end $END_FRAME
```

In the above example, `/vol0/setupenv.sh` is a shell script that reads `$INDEX`
and exports `$START_FRAME` and `$END_FRAME`.

The shell script could be part of the image, but more usefully, it could be
generated by a program and stuffed in an annotation or a configMap, and from
there added to a volume.

The first approach may require the user to modify an existing image (see next
section) to be able to accept an `$INDEX` env var or argument. The second
approach requires that the image have a shell. We think that together these two
options cover a wide range of use cases (though not all).

#### Multiple Substitution

In this option, the JobSpec is extended to include a list of values to
substitute, and which fields to substitute them into. For example, a worklist
like this:

```
FRUIT_COLOR=green process-fruit -a -b -c -f apple.txt --remove-seeds
@@ -433,7 +459,7 @@ FRUIT_COLOR=yellow process-fruit -a -b -c -f banana.txt
FRUIT_COLOR=red process-fruit -a -b -c -f cherry.txt --remove-pit
```

Can be broken down into a template like this, with three parameters:

```
<custom env var 1>; process-fruit -a -b -c <custom arg 1> <custom arg 2>
@@ -447,9 +473,8 @@ and a list of parameter tuples, like this:
("FRUIT_COLOR=red", "-f cherry.txt", "--remove-pit")
```

The JobSpec can be extended to hold a list of parameter tuples (which are more
easily expressed as a list of lists of individual parameters). For example:

```
apiVersion: extensions/v1beta1
@@ -477,42 +502,46 @@ spec:
- "red"
```

However, just providing custom env vars, and not arguments, is sufficient for
many use cases: parameters can be put into env vars, and then substituted on the
command line.

#### Comparison

The multiple substitution approach:

- Keeps the *per completion parameters* in the JobSpec.
- Drawback: makes the job spec large for jobs with thousands of completions. (But
for very large jobs, the work-queue style or another type of controller, such as
map-reduce or spark, may be a better fit.)
- Drawback: is a form of server-side templating, which we want in Kubernetes but
have not fully designed (see the [PetSets proposal](https://github.com/kubernetes/kubernetes/pull/18016/files?short_path=61f4179#diff-61f41798f4bced6e42e45731c1494cee)).

The index-only approach:

- Requires that the user keep the *per completion parameters* in a separate
storage, such as a configData or networked storage.
- Makes no changes to the JobSpec.
- Drawback: while in separate storage, they could be mutated, which would have
unexpected effects.
- Drawback: logic for using the index to look up parameters needs to be in the
Pod.
- Drawback: CLIs and UIs are limited to using the "index" as the identity of a
pod from a job. They cannot easily say, for example `repeated failures on the
pod processing banana.txt`.

The index-only approach relies on at least one of the following being true:

1. The image contains a shell and certain shell commands (not all images have
this).
1. The user directly consumes the index from annotations (file or env var) and
expands it to specific behavior in the main program.

Also, using the index-only approach from non-kubectl clients requires that they
mimic the script-generation step, or only use the second style.

#### Decision

It is decided to implement the Index-only approach now. Once the server-side
templating design is complete for Kubernetes, and we have feedback from users,
we can consider adding Multiple Substitution.

@@ -523,43 +552,42 @@ we can consider if Multiple Substitution.
No changes are made to the JobSpec.


The JobStatus is also not changed. The user can gauge the progress of the job by
the `.status.succeeded` count.


#### Job Spec Compatibility

A job spec written before this change will work exactly the same as before with
the new controller. The Pods it creates will have the same environment as
before. They will have a new annotation, but pods are expected to tolerate
unfamiliar annotations.

However, if the job controller version is reverted to a version before this
change, the jobs whose pod specs depend on the new annotation will fail.
This is okay for a Beta resource.

#### Job Controller Changes

The Job controller will maintain for each Job a data structure which
indicates the status of each completion index. We call this the
*scoreboard* for short. It is an array of length `.spec.completions`.
Elements of the array are `enum` type with possible values including
`complete`, `running`, and `notStarted`.

The scoreboard is stored in Job Controller memory for efficiency. In any
case, the Status can be reconstructed from watching pods of the job (such as on
a controller manager restart). The index of the pods can be extracted from the
pod annotation.

When the Job controller sees that the number of running pods is less than the
desired parallelism of the job, it finds the first index in the scoreboard with
value `notStarted`. It creates a pod with this creation index.

When it creates a pod with creation index `i`, it makes a copy of the
`.spec.template`, and sets
`.spec.template.metadata.annotations.[kubernetes.io/job/completion-index]` to
`i`. It does this in both the index-only and multiple-substitutions options.

Then it creates the pod.

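To pin down the bookkeeping described above, here is a minimal Go sketch of the
scoreboard and of the annotation step. The type names and helpers are
illustrative only; the annotation key is the one proposed in this design, and
nothing else here is claimed to match the actual controller code.

```go
package job

import "strconv"

// indexState is the per-index status tracked in the scoreboard.
type indexState int

const (
	notStarted indexState = iota
	running
	complete
)

// scoreboard has one entry per completion index; its length is
// .spec.completions.
type scoreboard []indexState

// firstNotStarted returns the array position of the first index that still
// needs a pod, or -1 if every index has at least been started.
func (s scoreboard) firstNotStarted() int {
	for i, st := range s {
		if st == notStarted {
			return i
		}
	}
	return -1
}

// completionIndexAnnotation is the annotation key proposed in this design.
const completionIndexAnnotation = "kubernetes.io/job/completion-index"

// withCompletionIndex copies a pod template's annotations and records the
// completion index as a stringified integer, as described above.
func withCompletionIndex(annotations map[string]string, index int) map[string]string {
	out := make(map[string]string, len(annotations)+1)
	for k, v := range annotations {
		out[k] = v
	}
	out[completionIndexAnnotation] = strconv.Itoa(index)
	return out
}
```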
@@ -571,8 +599,8 @@ When all entries in the scoreboard are `complete`, then the job is complete.

#### Downward API Changes

The downward API is changed to support extracting specific key names into a
single environment variable. So, the following would be supported:

```
kind: Pod
@@ -589,15 +617,16 @@ spec:

This requires kubelet changes.

Users who fail to upgrade their kubelets at the same time as they upgrade their
controller manager will see a failure for pods to run when they are created by
the controller. The Kubelet will send an event about failure to create the pod.
Running `kubectl describe job` will show many failed pods.


#### Kubectl Interface Changes

The `--completions` and `--completion-index-var-name` flags are added to
kubectl.

For example, this command:

@@ -621,8 +650,8 @@ Kubectl would create the following pod:


Kubectl will also support the `--per-completion-env` flag, as described
previously. For example, this command:

```
kubectl run say-fruit --image=busybox \
@@ -655,7 +684,7 @@ kubectl run say-fruit --image=busybox \
sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
```

will all run 3 pods in parallel. Index 0 pod will log:

```
Have a nice green apple
@@ -666,16 +695,20 @@ and so on.

Notes:

- `--per-completion-env=` is of form `KEY=VALUES` where `VALUES` is either a
quoted space separated list or `@` and the name of a text file containing a
list.
- `--per-completion-env=` can be specified several times, but all must have the
same length list.
- `--completions=N` with `N` equal to list length is implied.
- The flag `--completions=3` sets `job.spec.completions=3`.
- The flag `--completion-index-var-name=I` causes an env var to be created named
`I` in each pod, with the index in it.
- The flag `--restart=OnFailure` is implied by `--completions` or any
job-specific arguments. The user can also specify `--restart=Never` if they
desire but may not specify `--restart=Always` with job-related flags.
- Setting any of these flags in turn tells kubectl to create a Job, not a
replicationController.

#### How Kubectl Creates Job Specs.

@@ -850,14 +883,17 @@ configData/secret, and prevent the case where someone changes the
configData mid-job, and breaks things in a hard-to-debug way.



## Interactions with other features

#### Supporting Work Queue Jobs too

For Work Queue Jobs, completions has no meaning. Parallelism should be allowed
to be greater than it, and pods have no identity. So, the job controller should
not create a scoreboard in the JobStatus, just a count. Therefore, we need to
add one of the following to JobSpec (a sketch of the resulting check follows
the list):

- allow unset `.spec.completions` to indicate no scoreboard, and no index for
tasks (identical tasks).
- allow `.spec.completions=-1` to indicate the same.
- add `.spec.indexed` to job to indicate need for scoreboard.

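The following hedged Go sketch shows the first option, assuming
`.spec.completions` becomes a pointer so that "unset" is representable. The
struct is a simplified stand-in for the real JobSpec, used only for
illustration.

```go
package job

// jobSpec is a simplified stand-in for the real JobSpec; only the field
// relevant to this decision is shown, as a pointer so "unset" has a meaning.
type jobSpec struct {
	Completions *int
}

// needsScoreboard reports whether this is an indexed job: completions is set,
// so the controller tracks one entry per completion index. When it is unset
// (work-queue style), the controller only keeps a count of successes.
func needsScoreboard(spec jobSpec) bool {
	return spec.Completions != nil
}
```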
@@ -866,33 +902,31 @@ For Work Queue Jobs, completions has no meaning. Parallelism should be allowed
Since pods of the same job will not be created with different resources,
a vertical autoscaler will need to:

- if it has index-specific initial resource suggestions, suggest those at
admission time; it will need to understand indexes.
- mutate resource requests on already created pods based on usage trend or
previous container failures.
- modify the job template, affecting all indexes.

#### Comparison to PetSets


The *Index substitution-only* option corresponds roughly to PetSet Proposal 1b.
The `perCompletionArgs` approach is similar to PetSet Proposal 1e, but more
restrictive and thus less verbose.

It would be easier for users if Indexed Job and PetSet are similar where
possible. However, PetSet differs in several key respects:

- PetSet is for ones to tens of instances. Indexed job should work with tens of
thousands of instances.
- When you have few instances, you may want to give them pet names. When you
have many instances, integer indexes make more sense.
- When you have thousands of instances, storing the work-list in the JobSpec
is verbose. For PetSet, this is less of a problem.
- PetSets (apparently) need to differ in more fields than indexed Jobs.

This differs from PetSet in that PetSet uses names and not indexes. PetSet is
intended to support ones to tens of things.



<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->

@@ -38,21 +38,24 @@ Documentation for other releases can be found at

This document describes a new API resource, `MetadataPolicy`, that configures an
admission controller to take one or more actions based on an object's metadata.
Initially the metadata fields that the predicates can examine are labels and
annotations, and the actions are to add one or more labels and/or annotations,
or to reject creation/update of the object. In the future other actions might be
supported, such as applying an initializer.

The first use of `MetadataPolicy` will be to decide which scheduler should
schedule a pod in a [multi-scheduler](../proposals/multiple-schedulers.md)
Kubernetes system. In particular, the policy will add the scheduler name
annotation to a pod based on an annotation that is already on the pod that
indicates the QoS of the pod. (That annotation was presumably set by a simpler
admission controller that uses code, rather than configuration, to map the
resource requests and limits of a pod to QoS, and attaches the corresponding
annotation.)

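To make that first use concrete, here is a hedged Go sketch of the kind of
mapping such a policy would drive. The QoS annotation key is the one named in
the implementation plan below; the scheduler-name annotation key and the
particular QoS-to-scheduler mapping are illustrative assumptions, not part of
this proposal.

```go
package admission

// Annotation keys used in this sketch. The QoS key matches the implementation
// plan below; the scheduler-name key is an assumed placeholder.
const (
	qosAnnotation       = "scheduler.alpha.kubernetes.io/qos"
	schedulerAnnotation = "scheduler.alpha.kubernetes.io/scheduler-name" // assumption
)

// setSchedulerName sketches the policy's action: read the QoS annotation that
// an earlier admission controller attached, and record a scheduler name. In a
// real implementation the mapping would come from a MetadataPolicy object
// rather than being hard-coded here.
func setSchedulerName(annotations map[string]string) {
	switch annotations[qosAnnotation] {
	case "Guaranteed", "Burstable":
		annotations[schedulerAnnotation] = "default-scheduler" // assumed name
	case "BestEffort":
		annotations[schedulerAnnotation] = "besteffort-scheduler" // assumed name
	}
}
```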
We anticipate a number of other uses for `MetadataPolicy`, such as defaulting
for labels and annotations, prohibiting/requiring particular labels or
annotations, or choosing a scheduling policy within a scheduler. We do not
discuss them in this doc.


## API
@@ -126,7 +129,8 @@ type MetadataPolicyList struct {
## Implementation plan

1. Create `MetadataPolicy` API resource
1. Create admission controller that implements policies defined in
`MetadataPolicy`
1. Create admission controller that sets annotation
`scheduler.alpha.kubernetes.io/qos: <QoS>`
(where `QoS` is one of `Guaranteed, Burstable, BestEffort`)
@@ -134,30 +138,32 @@ based on pod's resource request and limit.

## Future work

Longer-term we will have QoS be set on create and update by the registry,
similar to `Pending` phase today, instead of having an admission controller
(that runs before the one that takes `MetadataPolicy` as input) do it.

We plan to eventually move from having an admission controller set the scheduler
name as a pod annotation, to using the initializer concept. In particular, the
scheduler will be an initializer, and the admission controller that decides
which scheduler to use will add the scheduler's name to the list of initializers
for the pod (presumably the scheduler will be the last initializer to run on
each pod). The admission controller would still be configured using the
`MetadataPolicy` described here, only the mechanism the admission controller
uses to record its decision of which scheduler to use would change.

## Related issues

The main issue for multiple schedulers is #11793. There was also a lot of
discussion in PRs #17197 and #17865.

We could use the approach described here to choose a scheduling policy within a
single scheduler, as opposed to choosing a scheduler, a desire mentioned in
#9920. Issue #17097 describes a scenario unrelated to scheduler-choosing where
`MetadataPolicy` could be used. Issue #17324 proposes to create a generalized
API for matching "claims" to "service classes"; matching a pod to a scheduler
would be one use for such an API.


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->

@@ -41,9 +41,11 @@ a logically named group.

## Motivation

A single cluster should be able to satisfy the needs of multiple user
communities.

Each user community wants to be able to work in isolation from other
communities.

Each user community has its own:

@@ -61,13 +63,16 @@ The Namespace provides a unique scope for:

## Use cases

1. As a cluster operator, I want to support multiple user communities on a
single cluster.
2. As a cluster operator, I want to delegate authority to partitions of the
cluster to trusted users in those communities.
3. As a cluster operator, I want to limit the amount of resources each
community can consume in order to limit the impact to other communities using
the cluster.
4. As a cluster user, I want to interact with resources that are pertinent to
my user community in isolation from what other user communities are doing on
the cluster.

## Design

@@ -91,20 +96,26 @@ A *Namespace* must exist prior to associating content with it.

A *Namespace* must not be deleted if there is content associated with it.

To associate a resource with a *Namespace* the following conditions must be
satisfied:

1. The resource's *Kind* must be registered as having *RESTScopeNamespace* with
the server
2. The resource's *TypeMeta.Namespace* field must have a value that references
an existing *Namespace*

The *Name* of a resource associated with a *Namespace* is unique to that *Kind*
in that *Namespace*.

It is intended to be used in resource URLs; provided by clients at creation
time, and encouraged to be human friendly; intended to facilitate idempotent
creation, space-uniqueness of singleton objects, distinguish distinct entities,
and reference particular entities across operations.

### Authorization

A *Namespace* provides an authorization scope for accessing content associated
with the *Namespace*.

See [Authorization plugins](../admin/authorization.md)

@@ -112,19 +123,21 @@ See [Authorization plugins](../admin/authorization.md)

A *Namespace* provides a scope to limit resource consumption.

A *LimitRange* defines min/max constraints on the amount of resources a single
entity can consume in a *Namespace*.

See [Admission control: Limit Range](admission_control_limit_range.md)

A *ResourceQuota* tracks aggregate usage of resources in the *Namespace* and
allows cluster operators to define *Hard* resource usage limits that a
*Namespace* may consume.

See [Admission control: Resource Quota](admission_control_resource_quota.md)

### Finalizers

Upon creation of a *Namespace*, the creator may provide a list of *Finalizer*
objects.

```go
type FinalizerName string
@@ -143,13 +156,14 @@ type NamespaceSpec struct {

A *FinalizerName* is a qualified name.

The API Server enforces that a *Namespace* can only be deleted from storage if
and only if its *Namespace.Spec.Finalizers* is empty.

A *finalize* operation is the only mechanism to modify the
*Namespace.Spec.Finalizers* field post creation.

Each *Namespace* created has *kubernetes* as an item in its list of initial
*Namespace.Spec.Finalizers* set by default.

### Phases

@@ -168,39 +182,48 @@ type NamespaceStatus struct {
}
```

A *Namespace* is in the **Active** phase if it does not have an
*ObjectMeta.DeletionTimestamp*.

A *Namespace* is in the **Terminating** phase if it has an
*ObjectMeta.DeletionTimestamp*.

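A minimal Go sketch of the rule just stated: the phase is a pure function of
the deletion timestamp. The struct below is a simplified stand-in for the real
ObjectMeta, used only for this illustration.

```go
package namespace

import "time"

// objectMeta is a simplified stand-in for the real ObjectMeta; only the field
// the phase rule depends on is shown.
type objectMeta struct {
	DeletionTimestamp *time.Time
}

// phase applies the rule above: no deletion timestamp means Active,
// otherwise Terminating.
func phase(meta objectMeta) string {
	if meta.DeletionTimestamp == nil {
		return "Active"
	}
	return "Terminating"
}
```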
**Active**

Upon creation, a *Namespace* goes in the *Active* phase. This means that content
may be associated with a namespace, and all normal interactions with the
namespace are allowed to occur in the cluster.

If a DELETE request occurs for a *Namespace*, the
*Namespace.ObjectMeta.DeletionTimestamp* is set to the current server time. A
*namespace controller* observes the change, and sets the
*Namespace.Status.Phase* to *Terminating*.

**Terminating**

A *namespace controller* watches for *Namespace* objects that have a
*Namespace.ObjectMeta.DeletionTimestamp* value set in order to know when to
initiate graceful termination of the *Namespace* associated content that is
known to the cluster.

The *namespace controller* enumerates each known resource type in that namespace
and deletes it one by one.

Admission control blocks creation of new resources in that namespace in order to
prevent a race-condition where the controller could believe all of a given
resource type had been deleted from the namespace, when in fact some other rogue
client agent had created new objects. Using admission control in this scenario
allows each of the registry implementations for the individual objects to not
need to take into account Namespace life-cycle.

Once all objects known to the *namespace controller* have been deleted, the
*namespace controller* executes a *finalize* operation on the namespace that
removes the *kubernetes* value from the *Namespace.Spec.Finalizers* list.

If the *namespace controller* sees a *Namespace* whose
*ObjectMeta.DeletionTimestamp* is set, and whose *Namespace.Spec.Finalizers*
list is empty, it will signal the server to permanently remove the *Namespace*
from storage by sending a final DELETE action to the API server.

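As a minimal Go sketch, the *finalize* step for the built-in controller amounts
to dropping the *kubernetes* entry from the finalizer list. The helper and the
plain string slice are simplifications for illustration, not the actual API
types.

```go
package namespace

// removeFinalizer returns the finalizer list with the named entry removed.
// The built-in namespace controller would call it with "kubernetes" once all
// content it knows about has been deleted.
func removeFinalizer(finalizers []string, name string) []string {
	var out []string
	for _, f := range finalizers {
		if f != name {
			out = append(out, f)
		}
	}
	return out
}

// For example, removeFinalizer([]string{"kubernetes", "openshift.com/origin"},
// "kubernetes") leaves only "openshift.com/origin", matching the walkthrough
// later in this document.
```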
### REST API

@@ -232,15 +255,18 @@ To interact with content associated with a Namespace:
| WATCH | GET | /api/{version}/watch/{resourceType} | Watch for changes to a {resourceType} across all namespaces |
| LIST | GET | /api/{version}/list/{resourceType} | List instances of {resourceType} across all namespaces |

The API server verifies the *Namespace* on resource creation matches the
*{namespace}* on the path.

The API server will associate a resource with a *Namespace* if not populated by
the end-user based on the *Namespace* context of the incoming request. If the
*Namespace* of the resource being created or updated does not match the
*Namespace* on the request, then the API server will reject the request.

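A hedged sketch of that check, with both namespaces reduced to plain strings
for illustration; the real handler works on full objects and request contexts.

```go
package apiserver

import "fmt"

// resolveNamespace defaults an unset object namespace to the namespace from
// the request path, and rejects a mismatch, mirroring the rules above.
func resolveNamespace(pathNamespace, objectNamespace string) (string, error) {
	if objectNamespace == "" {
		// Not populated by the end-user: associate it from the request context.
		return pathNamespace, nil
	}
	if objectNamespace != pathNamespace {
		return "", fmt.Errorf("namespace %q does not match namespace on request %q",
			objectNamespace, pathNamespace)
	}
	return objectNamespace, nil
}
```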
### Storage

A namespace provides a unique identifier space and therefore must be in the
storage path of a resource.

In etcd, we want to continue to support efficient WATCH across namespaces.

@@ -248,18 +274,19 @@ Resources that persist content in etcd will have storage paths as follows:

/{k8s_storage_prefix}/{resourceType}/{resource.Namespace}/{resource.Name}

This enables consumers to WATCH /registry/{resourceType} for changes across
namespaces of a particular {resourceType}.

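The path layout above maps directly onto key construction; here is a small Go
sketch, with "/registry" standing in for the configured {k8s_storage_prefix}.

```go
package storage

import "path"

// etcdKey builds the per-resource storage path described above. The
// "/registry" prefix is an assumption standing in for {k8s_storage_prefix}.
func etcdKey(resourceType, namespace, name string) string {
	return path.Join("/registry", resourceType, namespace, name)
}

// A cross-namespace WATCH then only needs the resource-type prefix:
// path.Join("/registry", "pods") yields "/registry/pods".
```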
### Kubelet

The kubelet will register pods it sources from a file or http source with a
namespace associated with the *cluster-id*.

### Example: OpenShift Origin managing a Kubernetes Namespace

In this example, we demonstrate how the design allows for agents built on-top of
Kubernetes that manage their own set of resource types associated with a
*Namespace* to take part in Namespace termination.

OpenShift creates a Namespace in Kubernetes

@@ -282,9 +309,10 @@ OpenShift creates a Namespace in Kubernetes
}
```

OpenShift then goes and creates a set of resources (pods, services, etc)
associated with the "development" namespace. It also creates its own set of
resources in its own storage associated with the "development" namespace unknown
to Kubernetes.

The user deletes the Namespace in Kubernetes, and the Namespace now has the
following state:

@@ -308,10 +336,10 @@ User deletes the Namespace in Kubernetes, and Namespace now has following state:
}
```

The Kubernetes *namespace controller* observes the namespace has a
*deletionTimestamp* and begins to terminate all of the content in the namespace
that it knows about. Upon success, it executes a *finalize* action that modifies
the *Namespace* by removing *kubernetes* from the list of finalizers:

```json
{
@@ -333,11 +361,11 @@ removing *kubernetes* from the list of finalizers:
}
```

OpenShift Origin has its own *namespace controller* that is observing cluster
state, and it observes the same namespace had a *deletionTimestamp* assigned to
it. It too will go and purge resources from its own storage that it manages
associated with that namespace. Upon completion, it executes a *finalize* action
and removes the reference to "openshift.com/origin" from the list of finalizers.

This results in the following state:

@@ -361,12 +389,14 @@ This results in the following state:
}
```

At this point, the Kubernetes *namespace controller* in its sync loop will see
that the namespace has a deletion timestamp and that its list of finalizers is
empty. As a result, it knows all content associated with that namespace has been
purged. It performs a final DELETE action to remove that Namespace from
storage.

At this point, all content associated with that Namespace, and the Namespace
itself, are gone.


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->

@ -44,9 +44,9 @@ There are 4 distinct networking problems to solve:
|
|||||||
## Model and motivation
|
## Model and motivation
|
||||||
|
|
||||||
Kubernetes deviates from the default Docker networking model (though as of
|
Kubernetes deviates from the default Docker networking model (though as of
|
||||||
Docker 1.8 their network plugins are getting closer). The goal is for each pod
|
Docker 1.8 their network plugins are getting closer). The goal is for each pod
|
||||||
to have an IP in a flat shared networking namespace that has full communication
|
to have an IP in a flat shared networking namespace that has full communication
|
||||||
with other physical computers and containers across the network. IP-per-pod
|
with other physical computers and containers across the network. IP-per-pod
|
||||||
creates a clean, backward-compatible model where pods can be treated much like
|
creates a clean, backward-compatible model where pods can be treated much like
|
||||||
VMs or physical hosts from the perspectives of port allocation, networking,
|
VMs or physical hosts from the perspectives of port allocation, networking,
|
||||||
naming, service discovery, load balancing, application configuration, and
|
naming, service discovery, load balancing, application configuration, and
|
||||||
@ -71,15 +71,15 @@ among other problems.
|
|||||||
All containers within a pod behave as if they are on the same host with regard
|
All containers within a pod behave as if they are on the same host with regard
|
||||||
to networking. They can all reach each other’s ports on localhost. This offers
|
to networking. They can all reach each other’s ports on localhost. This offers
|
||||||
simplicity (static ports know a priori), security (ports bound to localhost
|
simplicity (static ports know a priori), security (ports bound to localhost
|
||||||
are visible within the pod but never outside it), and performance. This also
|
are visible within the pod but never outside it), and performance. This also
|
||||||
reduces friction for applications moving from the world of uncontainerized apps
|
reduces friction for applications moving from the world of uncontainerized apps
|
||||||
on physical or virtual hosts. People running application stacks together on
|
on physical or virtual hosts. People running application stacks together on
|
||||||
the same host have already figured out how to make ports not conflict and have
|
the same host have already figured out how to make ports not conflict and have
|
||||||
arranged for clients to find them.
|
arranged for clients to find them.
|
||||||
|
|
||||||
The approach does reduce isolation between containers within a pod —
|
The approach does reduce isolation between containers within a pod —
|
||||||
ports could conflict, and there can be no container-private ports, but these
|
ports could conflict, and there can be no container-private ports, but these
|
||||||
seem to be relatively minor issues with plausible future workarounds. Besides,
|
seem to be relatively minor issues with plausible future workarounds. Besides,
|
||||||
the premise of pods is that containers within a pod share some resources
|
the premise of pods is that containers within a pod share some resources
|
||||||
(volumes, cpu, ram, etc.) and therefore expect and tolerate reduced isolation.
|
(volumes, cpu, ram, etc.) and therefore expect and tolerate reduced isolation.
|
||||||
Additionally, the user can control what containers belong to the same pod
|
Additionally, the user can control what containers belong to the same pod
|
||||||
@ -88,7 +88,7 @@ whereas, in general, they don't control what pods land together on a host.
|
|||||||
## Pod to pod
|
## Pod to pod
|
||||||
|
|
||||||
Because every pod gets a "real" (not machine-private) IP address, pods can
|
Because every pod gets a "real" (not machine-private) IP address, pods can
|
||||||
communicate without proxies or translations. The pod can use well-known port
|
communicate without proxies or translations. The pod can use well-known port
|
||||||
numbers and can avoid the use of higher-level service discovery systems like
|
numbers and can avoid the use of higher-level service discovery systems like
|
||||||
DNS-SD, Consul, or Etcd.
|
DNS-SD, Consul, or Etcd.
|
||||||
|
|
||||||
@ -98,7 +98,7 @@ each pod has its own IP address that other pods can know. By making IP addresses
|
|||||||
and ports the same both inside and outside the pods, we create a NAT-less, flat
|
and ports the same both inside and outside the pods, we create a NAT-less, flat
|
||||||
address space. Running "ip addr show" should work as expected. This would enable
|
address space. Running "ip addr show" should work as expected. This would enable
|
||||||
all existing naming/discovery mechanisms to work out of the box, including
|
all existing naming/discovery mechanisms to work out of the box, including
|
||||||
self-registration mechanisms and applications that distribute IP addresses. We
|
self-registration mechanisms and applications that distribute IP addresses. We
|
||||||
should be optimizing for inter-pod network communication. Within a pod,
|
should be optimizing for inter-pod network communication. Within a pod,
|
||||||
containers are more likely to use communication through volumes (e.g., tmpfs) or
|
containers are more likely to use communication through volumes (e.g., tmpfs) or
|
||||||
IPC.
|
IPC.
|
||||||
@ -141,7 +141,7 @@ gcloud compute routes add "${NODE_NAMES[$i]}" \
|
|||||||
--next-hop-instance-zone "${ZONE}" &
|
--next-hop-instance-zone "${ZONE}" &
|
||||||
```
|
```
|
||||||
|
|
||||||
GCE itself does not know anything about these IPs, though. This means that when
|
GCE itself does not know anything about these IPs, though. This means that when
|
||||||
a pod tries to egress beyond GCE's project the packets must be SNAT'ed
|
a pod tries to egress beyond GCE's project the packets must be SNAT'ed
|
||||||
(masqueraded) to the VM's IP, which GCE recognizes and allows.
|
(masqueraded) to the VM's IP, which GCE recognizes and allows.
|
||||||
|
|
||||||

@ -161,26 +161,26 @@ to serve the purpose outside of GCE.

## Pod to service

The [service](../user-guide/services.md) abstraction provides a way to group pods under a
common access policy (e.g. load-balanced). The implementation of this creates a
virtual IP which clients can access and which is transparently proxied to the
pods in a Service. Each node runs a kube-proxy process which programs
`iptables` rules to trap access to service IPs and redirect them to the correct
backends. This provides a highly-available load-balancing solution with low
performance overhead by balancing client traffic from a node on that same node.
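
The following is a conceptual sketch, not the actual kube-proxy code (which
expresses this logic as `iptables` rules): it shows the kind of mapping
involved, from a service's virtual IP and port to one of its backend pod
endpoints. The addresses and ports are invented for the example.

```go
package main

import (
	"fmt"
	"math/rand"
)

// endpoint is a backend pod address for a service.
type endpoint struct {
	ip   string
	port int
}

// serviceTable maps "virtualIP:port" to the current set of backend endpoints.
// kube-proxy maintains an equivalent mapping and expresses it as iptables rules.
var serviceTable = map[string][]endpoint{
	"10.0.0.10:80": {
		{ip: "10.244.1.7", port: 8080},
		{ip: "10.244.2.3", port: 8080},
	},
}

// redirect picks a backend for traffic addressed to a service virtual IP.
func redirect(vipAndPort string) (endpoint, bool) {
	backends, ok := serviceTable[vipAndPort]
	if !ok || len(backends) == 0 {
		return endpoint{}, false
	}
	// Simple random choice stands in for whatever balancing policy the proxy applies.
	return backends[rand.Intn(len(backends))], true
}

func main() {
	if ep, ok := redirect("10.0.0.10:80"); ok {
		fmt.Printf("traffic to 10.0.0.10:80 goes to %s:%d\n", ep.ip, ep.port)
	}
}
```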

## External to internal

So far the discussion has been about how to access a pod or service from within
the cluster. Accessing a pod from outside the cluster is a bit more tricky. We
want to offer highly-available, high-performance load balancing to target
Kubernetes Services. Most public cloud providers are simply not flexible enough
yet.

The way this is generally implemented is to set up external load balancers (e.g.
GCE's ForwardingRules or AWS's ELB) which target all nodes in a cluster. When
traffic arrives at a node it is recognized as being part of a particular Service
and routed to an appropriate backend Pod. This does mean that some traffic will
get double-bounced on the network. Once cloud providers have better offerings
we can take advantage of those.

## Challenges and future work

@ -207,7 +207,13 @@ External IP assignment would also simplify DNS support (see below).

### IPv6

IPv6 would be a nice option, also, but we can't depend on it yet. Docker support
is in progress: [Docker issue #2974](https://github.com/dotcloud/docker/issues/2974),
[Docker issue #6923](https://github.com/dotcloud/docker/issues/6923),
[Docker issue #6975](https://github.com/dotcloud/docker/issues/6975).
Additionally, direct IPv6 assignment to instances doesn't appear to be supported
by major cloud providers (e.g., AWS EC2, GCE) yet. We'd happily take pull
requests from people running Kubernetes on bare metal, though. :-)

@ -36,40 +36,41 @@ Documentation for other releases can be found at

## Introduction

This document proposes a new label selector representation, called
`NodeSelector`, that is similar in many ways to `LabelSelector`, but is a bit
more flexible and is intended to be used only for selecting nodes.

In addition, we propose to replace the `map[string]string` in `PodSpec` that the
scheduler currently uses as part of restricting the set of nodes onto which a
pod is eligible to schedule, with a field of type `Affinity` that contains one
or more affinity specifications. In this document we discuss `NodeAffinity`,
which contains one or more of the following:
* a field called `RequiredDuringSchedulingRequiredDuringExecution` that will be
represented by a `NodeSelector`, and thus generalizes the scheduling behavior of
the current `map[string]string` but still serves the purpose of restricting
the set of nodes onto which the pod can schedule. In addition, unlike the
behavior of the current `map[string]string`, when it becomes violated the system
will try to eventually evict the pod from its node.
* a field called `RequiredDuringSchedulingIgnoredDuringExecution` which is
identical to `RequiredDuringSchedulingRequiredDuringExecution` except that the
system may or may not try to eventually evict the pod from its node.
* a field called `PreferredDuringSchedulingIgnoredDuringExecution` that
specifies which nodes are preferred for scheduling among those that meet all
scheduling requirements.

(In practice, as discussed later, we will actually *add* the `Affinity` field
rather than replacing `map[string]string`, due to backward compatibility
requirements.)

The affinity specifications described above allow a pod to request various
properties that are inherent to nodes, for example "run this pod on a node with
an Intel CPU" or, in a multi-zone cluster, "run this pod on a node in zone Z."
([This issue](https://github.com/kubernetes/kubernetes/issues/9044) describes
some of the properties that a node might publish as labels, which affinity
expressions can match against.) They do *not* allow a pod to request to schedule
(or not schedule) on a node based on what other pods are running on the node.
That feature is called "inter-pod topological affinity/anti-affinity" and is
described [here](https://github.com/kubernetes/kubernetes/pull/18265).
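
For comparison, here is a minimal Go sketch of what the *current*
`map[string]string` mechanism can express for the "run this pod on a node in
zone Z" example; the `zone` and `cpu` label keys are assumptions used only for
illustration.

```go
package main

import "fmt"

// matches reports whether a node's labels satisfy a map[string]string
// selector: every requested key must be present with the requested value.
// This is the exact-match semantics that the proposed NodeSelector generalizes.
func matches(selector, nodeLabels map[string]string) bool {
	for k, v := range selector {
		if nodeLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	// "zone" and "cpu" are assumed label keys published by nodes.
	selector := map[string]string{"zone": "Z"}
	node := map[string]string{"zone": "Z", "cpu": "intel"}
	fmt.Println(matches(selector, node)) // true
}
```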

## API

@ -171,9 +172,9 @@ type PreferredSchedulingTerm struct {
}
```

Unfortunately, the name of the existing `map[string]string` field in PodSpec is
`NodeSelector` and we can't change it since this name is part of the API.
Hopefully this won't cause too much confusion.

## Examples

@ -186,81 +187,91 @@ cause too much confusion.

## Backward compatibility

When we add `Affinity` to PodSpec, we will deprecate, but not remove, the
current field in PodSpec:

```go
NodeSelector map[string]string `json:"nodeSelector,omitempty"`
```

Old versions of the scheduler will ignore the `Affinity` field. New versions of
the scheduler will apply their scheduling predicates to both `Affinity` and
`nodeSelector`, i.e. the pod can only schedule onto nodes that satisfy both sets
of requirements. We will not attempt to convert between `Affinity` and
`nodeSelector`.
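
A minimal sketch of that rule, with simplified stand-in types and hypothetical
helper predicates whose bodies are elided; it is meant only to illustrate that
both constraints are ANDed, not to reflect the real scheduler code.

```go
package sched

// Simplified stand-in types; the real definitions live in the Kubernetes API
// packages and in the API section of this document.
type Node struct{ Labels map[string]string }

type Affinity struct{} // would carry NodeAffinity as proposed above

type PodSpec struct {
	NodeSelector map[string]string
	Affinity     *Affinity
}

// nodeSelectorMatches and nodeAffinityMatches are hypothetical helpers; their
// bodies are elided because only the combination matters here. The first is
// the legacy exact-match check, the second evaluates the proposed Affinity.
func nodeSelectorMatches(sel map[string]string, n *Node) bool { return true }
func nodeAffinityMatches(a *Affinity, n *Node) bool           { return true }

// podFitsNode sketches the backward-compatibility rule: a new scheduler admits
// a node only if it satisfies *both* the legacy nodeSelector map and the new
// Affinity field.
func podFitsNode(spec PodSpec, n *Node) bool {
	return nodeSelectorMatches(spec.NodeSelector, n) &&
		nodeAffinityMatches(spec.Affinity, n)
}
```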

Old versions of non-scheduling clients will not know how to do anything
semantically meaningful with `Affinity`, but we don't expect that this will
cause a problem.

See [this comment](https://github.com/kubernetes/kubernetes/issues/341#issuecomment-140809259)
for more discussion.

Users should not start using `NodeAffinity` until the full implementation has
been in Kubelet and the master for enough binary versions that we feel
comfortable that we will not need to roll back either Kubelet or master to a
version that does not support them. Longer-term we will use a programmatic
approach to enforcing this (#4855).

## Implementation plan

1. Add the `Affinity` field to PodSpec and the `NodeAffinity`,
`PreferredDuringSchedulingIgnoredDuringExecution`, and
`RequiredDuringSchedulingIgnoredDuringExecution` types to the API.
2. Implement a scheduler predicate that takes
`RequiredDuringSchedulingIgnoredDuringExecution` into account.
3. Implement a scheduler priority function that takes
`PreferredDuringSchedulingIgnoredDuringExecution` into account.
4. At this point, the feature can be deployed and `PodSpec.NodeSelector` can be
marked as deprecated.
5. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to the API.
6. Modify the scheduler predicate from step 2 to also take
`RequiredDuringSchedulingRequiredDuringExecution` into account.
7. Add `RequiredDuringSchedulingRequiredDuringExecution` to Kubelet's admission
decision.
8. Implement code in Kubelet *or* the controllers that evicts a pod that no
longer satisfies `RequiredDuringSchedulingRequiredDuringExecution` (see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).

We assume Kubelet publishes labels describing the node's membership in all of
the relevant scheduling domains (e.g. node name, rack name, availability zone
name, etc.). See #9044.

## Extensibility

The design described here is the result of careful analysis of use cases, a
decade of experience with Borg at Google, and a review of similar features in
other open-source container orchestration systems. We believe that it properly
balances the goal of expressiveness against the goals of simplicity and
efficiency of implementation. However, we recognize that use cases may arise in
the future that cannot be expressed using the syntax described here. Although we
are not implementing an affinity-specific extensibility mechanism for a variety
of reasons (simplicity of the codebase, simplicity of cluster deployment, desire
for Kubernetes users to get a consistent experience, etc.), the regular
Kubernetes annotation mechanism can be used to add or replace affinity rules.
The way this would work is (a brief sketch of step 2 follows the list):

1. Define one or more annotations to describe the new affinity rule(s).
1. User (or an admission controller) attaches the annotation(s) to pods to
request the desired scheduling behavior. If the new rule(s) *replace* one or
more fields of `Affinity` then the user would omit those fields from `Affinity`;
if they are *additional rules*, then the user would fill in `Affinity` as well
as the annotation(s).
1. Scheduler takes the annotation(s) into account when scheduling.
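
A tiny, hypothetical sketch of step 2; the annotation key and JSON value are
invented for illustration and are not part of this proposal.

```go
package main

import "fmt"

func main() {
	// A pod's metadata annotations are just a string-to-string map. The key
	// "example.com/anti-affinity-extension" and its value format are made up
	// here purely for illustration; a real extension would define its own.
	annotations := map[string]string{
		"example.com/anti-affinity-extension": `{"avoidNodesRunningLabel": "workload=batch"}`,
	}
	fmt.Println(annotations)
}
```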

If some particular new syntax becomes popular, we would consider upstreaming it
by integrating it into the standard `Affinity`.

## Future work

Are there any other fields we should convert from `map[string]string` to
`NodeSelector`?

## Related issues

The review for this proposal is in #18261.

The main related issue is #341. Issue #367 is also related. Those issues
reference other related issues.

@ -34,43 +34,60 @@ Documentation for other releases can be found at

# Persistent Storage

This document proposes a model for managing persistent, cluster-scoped storage
for applications requiring long-lived data.

### Abstract

Two new API kinds:

A `PersistentVolume` (PV) is a storage resource provisioned by an administrator.
It is analogous to a node. See [Persistent Volume Guide](../user-guide/persistent-volumes/)
for how to use it.

A `PersistentVolumeClaim` (PVC) is a user's request for a persistent volume to
use in a pod. It is analogous to a pod.

One new system component:

`PersistentVolumeClaimBinder` is a singleton running in master that watches all
PersistentVolumeClaims in the system and binds them to the closest matching
available PersistentVolume. The volume manager watches the API for newly created
volumes to manage.

One new volume:

`PersistentVolumeClaimVolumeSource` references the user's PVC in the same
namespace. This volume finds the bound PV and mounts that volume for the pod. A
`PersistentVolumeClaimVolumeSource` is, essentially, a wrapper around another
type of volume that is owned by someone else (the system).

Kubernetes makes no guarantees at runtime that the underlying storage exists or
is available. High availability is left to the storage provider.

### Goals

* Allow administrators to describe available storage.
* Allow pod authors to discover and request persistent volumes to use with pods.
* Enforce security through access control lists and securing storage to the same
namespace as the pod volume.
* Enforce quotas through admission control.
* Enforce scheduler rules by resource counting.
* Ensure developers can rely on storage being available without being closely
bound to a particular disk, server, network, or storage device.

#### Describe available storage

Cluster administrators use the API to manage *PersistentVolumes*. A custom store
`NewPersistentVolumeOrderedIndex` will index volumes by access modes and sort by
storage capacity. The `PersistentVolumeClaimBinder` watches for new claims for
storage and binds them to an available volume by matching the volume's
characteristics (AccessModes and storage size) to the user's request.
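
A rough sketch of the indexing idea, not the actual
`NewPersistentVolumeOrderedIndex` implementation: volumes are bucketed by
access modes, each bucket is kept sorted by capacity, and a claim is matched to
the smallest volume that is still large enough. Types and fields are simplified
stand-ins.

```go
package main

import (
	"fmt"
	"sort"
)

// volume is a simplified stand-in for a PersistentVolume.
type volume struct {
	name        string
	accessModes string // e.g. "RWO"; the real type is a list of modes
	capacity    int64  // bytes
}

// orderedIndex buckets volumes by access modes, each bucket sorted by capacity.
type orderedIndex map[string][]volume

func buildIndex(volumes []volume) orderedIndex {
	idx := orderedIndex{}
	for _, v := range volumes {
		idx[v.accessModes] = append(idx[v.accessModes], v)
	}
	for modes := range idx {
		sort.Slice(idx[modes], func(i, j int) bool {
			return idx[modes][i].capacity < idx[modes][j].capacity
		})
	}
	return idx
}

// smallestFit returns the smallest volume with the requested access modes
// whose capacity is at least the requested size.
func (idx orderedIndex) smallestFit(modes string, size int64) (volume, bool) {
	for _, v := range idx[modes] {
		if v.capacity >= size {
			return v, true
		}
	}
	return volume{}, false
}

func main() {
	idx := buildIndex([]volume{
		{"pv0001", "RWO", 10 << 30},
		{"pv0002", "RWO", 5 << 30},
	})
	if v, ok := idx.smallestFit("RWO", 3<<30); ok {
		fmt.Println("claim would bind to", v.name)
	}
}
```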

PVs are system objects and, thus, have no namespace.

Many means of dynamic provisioning will eventually be implemented for various
storage types.


##### PersistentVolume API

@ -87,11 +104,15 @@ Many means of dynamic provisioning will be eventually be implemented for various

#### Request Storage

Kubernetes users request persistent storage for their pod by creating a
```PersistentVolumeClaim```. Their request for storage is described by their
requirements for resources and mount capabilities.

Requests for volumes are bound to available volumes by the volume manager, if a
suitable match is found. Requests for resources can go unfulfilled.

Users attach their claim to their pod using a new
```PersistentVolumeClaimVolumeSource``` volume source.


##### PersistentVolumeClaim API

@ -110,23 +131,31 @@ Users attach their claim to their pod using a new ```PersistentVolumeClaimVolume

#### Scheduling constraints

Scheduling constraints are to be handled similarly to pod resource constraints.
Pods will need to be annotated or decorated with the number of resources they
require on a node. Similarly, a node will need to list how many it has used or
available.

TBD


#### Events

The implementation of persistent storage will not require events to communicate
to the user the state of their claim. The CLI for bound claims contains a
reference to the backing persistent volume. This is always present in the API
and CLI, making an event to communicate the same unnecessary.

Events that communicate the state of a mounted volume are left to the volume
plugins.

### Example

#### Admin provisions storage

An administrator provisions storage by posting PVs to the API. Various ways to
automate this task can be scripted. Dynamic provisioning is a future feature
that can maintain levels of PVs.

```yaml
POST:

@ -152,7 +181,8 @@ pv0001 map[] 10737418240 RWO

#### Users request storage

A user requests storage by posting a PVC to the API. Their request contains the
AccessModes they wish their volume to have and the minimum size needed.

The user must be within a namespace to create PVCs.

@ -181,7 +211,10 @@ myclaim-1 map[] pending

#### Matching and binding

The ```PersistentVolumeClaimBinder``` attempts to find an available volume that
most closely matches the user's request. If one exists, they are bound by
putting a reference on the PV to the PVC. Requests can go unfulfilled if a
suitable match is not found.
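
Continuing in the same spirit as the earlier indexing sketch, binding itself
amounts to recording the relationship in both directions. This is a simplified
illustration with stand-in types, not the real API objects.

```go
package main

import "fmt"

// Simplified stand-ins for the real API objects.
type claimRef struct{ namespace, name string }

type persistentVolume struct {
	name     string
	claimRef *claimRef // nil while the volume is available
}

type persistentVolumeClaim struct {
	namespace, name string
	boundVolume     string // empty while the claim is pending
}

// bind records the relationship the binder maintains: the PV points at the
// claim and the claim remembers which volume backs it.
func bind(pv *persistentVolume, pvc *persistentVolumeClaim) {
	pv.claimRef = &claimRef{namespace: pvc.namespace, name: pvc.name}
	pvc.boundVolume = pv.name
}

func main() {
	pv := &persistentVolume{name: "pv0001"}
	pvc := &persistentVolumeClaim{namespace: "myns", name: "myclaim-1"}
	bind(pv, pvc)
	fmt.Printf("%s is bound to %s/%s\n", pv.name, pv.claimRef.namespace, pv.claimRef.name)
}
```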

```console
$ kubectl get pv

@ -198,9 +231,12 @@ myclaim-1 map[] Bound b16e91d6-c0ef-11e4-8

#### Claim usage

The claim holder can use their claim as a volume. The ```PersistentVolumeClaimVolumeSource``` knows to fetch the PV backing the claim
and mount its volume for a pod.

The claim holder owns the claim and its data for as long as the claim exists.
The pod using the claim can be deleted, but the claim remains in the user's
namespace. It can be used again and again by many pods.

```yaml
POST:

@ -233,9 +269,11 @@ When a claim holder is finished with their data, they can delete their claim.
$ kubectl delete pvc myclaim-1
```

The ```PersistentVolumeClaimBinder``` will reconcile this by removing the claim
reference from the PV and changing the PV's status to 'Released'.

Admins can script the recycling of released volumes. Future dynamic provisioners
will understand how a volume should be recycled.

@ -38,45 +38,48 @@ Documentation for other releases can be found at

NOTE: It is useful to read about [node affinity](nodeaffinity.md) first.

This document describes a proposal for specifying and implementing inter-pod
topological affinity and anti-affinity. By that we mean: rules that specify that
certain pods should be placed in the same topological domain (e.g. same node,
same rack, same zone, same power domain, etc.) as some other pods, or,
conversely, should *not* be placed in the same topological domain as some other
pods.

Here are a few example rules; we explain how to express them using the API
described in this doc later, in the section "Examples."
* Affinity
  * Co-locate the pods from a particular service or Job in the same availability
zone, without specifying which zone that should be.
  * Co-locate the pods from service S1 with pods from service S2 because S1 uses
S2 and thus it is useful to minimize the network latency between them.
Co-location might mean same nodes and/or same availability zone.
* Anti-affinity
  * Spread the pods of a service across nodes and/or availability zones, e.g. to
reduce correlated failures.
  * Give a pod "exclusive" access to a node to guarantee resource isolation --
it must never share the node with other pods.
  * Don't schedule the pods of a particular service on the same nodes as pods of
another service that are known to interfere with the performance of the pods of
the first service.

For both affinity and anti-affinity, there are three variants. Two variants have
the property of requiring the affinity/anti-affinity to be satisfied for the pod
to be allowed to schedule onto a node; the difference between them is that if
the condition ceases to be met later on at runtime, for one of them the system
will try to eventually evict the pod, while for the other the system may not try
to do so. The third variant simply provides scheduling-time *hints* that the
scheduler will try to satisfy but may not be able to. These three variants are
directly analogous to the three variants of [node affinity](nodeaffinity.md).

Note that this proposal is only about *inter-pod* topological affinity and
anti-affinity. There are other forms of topological affinity and anti-affinity.
For example, you can use [node affinity](nodeaffinity.md) to require (prefer)
that a set of pods all be scheduled in some specific zone Z. Node affinity is
not capable of expressing inter-pod dependencies, and conversely the API we
describe in this document is not capable of expressing node affinity rules. For
simplicity, we will use the terms "affinity" and "anti-affinity" to mean
"inter-pod topological affinity" and "inter-pod topological anti-affinity,"
respectively, in the remainder of this document.

## API

@ -90,28 +93,28 @@ The `Affinity` type is defined as follows

```go
type Affinity struct {
  PodAffinity *PodAffinity `json:"podAffinity,omitempty"`
  PodAntiAffinity *PodAntiAffinity `json:"podAntiAffinity,omitempty"`
}

type PodAffinity struct {
  // If the affinity requirements specified by this field are not met at
  // scheduling time, the pod will not be scheduled onto the node.
  // If the affinity requirements specified by this field cease to be met
  // at some point during pod execution (e.g. due to a pod label update), the
  // system will try to eventually evict the pod from its node.
  // When there are multiple elements, the lists of nodes corresponding to each
  // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
  RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
  // If the affinity requirements specified by this field are not met at
  // scheduling time, the pod will not be scheduled onto the node.
  // If the affinity requirements specified by this field cease to be met
  // at some point during pod execution (e.g. due to a pod label update), the
  // system may or may not try to eventually evict the pod from its node.
  // When there are multiple elements, the lists of nodes corresponding to each
  // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
  RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
  // The scheduler will prefer to schedule pods to nodes that satisfy
  // the affinity expressions specified by this field, but it may choose
  // a node that violates one or more of the expressions. The node that is
  // most preferred is the one with the greatest sum of weights, i.e.
@ -120,27 +123,27 @@ type PodAffinity struct {
  // compute a sum by iterating through the elements of this field and adding
  // "weight" to the sum if the node matches the corresponding MatchExpressions; the
  // node(s) with the highest sum are the most preferred.
  PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
}

type PodAntiAffinity struct {
  // If the anti-affinity requirements specified by this field are not met at
  // scheduling time, the pod will not be scheduled onto the node.
  // If the anti-affinity requirements specified by this field cease to be met
  // at some point during pod execution (e.g. due to a pod label update), the
  // system will try to eventually evict the pod from its node.
  // When there are multiple elements, the lists of nodes corresponding to each
  // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
  RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
  // If the anti-affinity requirements specified by this field are not met at
  // scheduling time, the pod will not be scheduled onto the node.
  // If the anti-affinity requirements specified by this field cease to be met
  // at some point during pod execution (e.g. due to a pod label update), the
  // system may or may not try to eventually evict the pod from its node.
  // When there are multiple elements, the lists of nodes corresponding to each
  // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
  RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
  // The scheduler will prefer to schedule pods to nodes that satisfy
  // the anti-affinity expressions specified by this field, but it may choose
  // a node that violates one or more of the expressions. The node that is
  // most preferred is the one with the greatest sum of weights, i.e.
@ -149,7 +152,7 @@ type PodAntiAffinity struct {
  // compute a sum by iterating through the elements of this field and adding
  // "weight" to the sum if the node matches the corresponding MatchExpressions; the
  // node(s) with the highest sum are the most preferred.
  PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
}

type WeightedPodAffinityTerm struct {
@ -159,23 +162,25 @@ type WeightedPodAffinityTerm struct {
}

type PodAffinityTerm struct {
  LabelSelector *LabelSelector `json:"labelSelector,omitempty"`
  // namespaces specifies which namespaces the LabelSelector applies to (matches against);
  // nil list means "this pod's namespace," empty list means "all namespaces"
  // The json tag here is not "omitempty" since we need to distinguish nil and empty.
  // See https://golang.org/pkg/encoding/json/#Marshal for more details.
  Namespaces []api.Namespace `json:"namespaces,omitempty"`
  // empty topology key is interpreted by the scheduler as "all topologies"
  TopologyKey string `json:"topologyKey,omitempty"`
}
```

Note that the `Namespaces` field is necessary because normal `LabelSelector` is
scoped to the pod's namespace, but we need to be able to match against all pods
globally.

To explain how this API works, let's say that the `PodSpec` of a pod `P` has an
`Affinity` that is configured as follows (note that we've omitted and collapsed
some fields for simplicity, but this should sufficiently convey the intent of
the design):

```go
PodAffinity {
@ -188,130 +193,160 @@ PodAntiAffinity {
}
```

Then when scheduling pod P, the scheduler:
* Can only schedule P onto nodes that are running pods that satisfy `P1`.
(Assumes all nodes have a label with key `node` and value specifying their node
name.)
* Should try to schedule P onto zones that are running pods that satisfy `P2`.
(Assumes all nodes have a label with key `zone` and value specifying their
zone.)
* Cannot schedule P onto any racks that are running pods that satisfy `P3`.
(Assumes all nodes have a label with key `rack` and value specifying their rack
name.)
* Should try not to schedule P onto any power domains that are running pods that
satisfy `P4`. (Assumes all nodes have a label with key `power` and value
specifying their power domain.)

When `RequiredDuringScheduling` has multiple elements, the requirements are
ANDed. For `PreferredDuringScheduling` the weights are added for the terms that
are satisfied for each node, and the node(s) with the highest weight(s) are the
most preferred.
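
A small sketch of these two combination rules; `termMatches` is a hypothetical
helper standing in for the real evaluation of a `PodAffinityTerm` against a
node's topology domain, and the types are simplified stand-ins for the API
above.

```go
package sched

// Simplified stand-in types, mirroring the API above in shape only.
type PodAffinityTerm struct{ /* LabelSelector, Namespaces, TopologyKey */ }

type WeightedPodAffinityTerm struct {
	Weight          int
	PodAffinityTerm PodAffinityTerm
}

type Node struct{ Name string }

// termMatches is hypothetical: it would check whether the node's topology
// domain contains a pod matching the term's selector and namespaces.
func termMatches(t PodAffinityTerm, n *Node) bool { return true }

// requiredSatisfied implements the AND rule: every RequiredDuringScheduling
// term must be satisfied for the node to be feasible.
func requiredSatisfied(terms []PodAffinityTerm, n *Node) bool {
	for _, t := range terms {
		if !termMatches(t, n) {
			return false
		}
	}
	return true
}

// preferenceScore implements the weight-sum rule: among feasible nodes, add
// the weight of every satisfied PreferredDuringScheduling term; higher wins.
func preferenceScore(terms []WeightedPodAffinityTerm, n *Node) int {
	score := 0
	for _, t := range terms {
		if termMatches(t.PodAffinityTerm, n) {
			score += t.Weight
		}
	}
	return score
}
```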

In reality there are two variants of `RequiredDuringScheduling`: one suffixed
with `RequiredDuringExecution` and one suffixed with `IgnoredDuringExecution`.
For the first variant, if the affinity/anti-affinity ceases to be met at some
point during pod execution (e.g. due to a pod label update), the system will try
to eventually evict the pod from its node. In the second variant, the system may
or may not try to eventually evict the pod from its node.

## A comment on symmetry

One thing that makes affinity and anti-affinity tricky is symmetry.

Imagine a cluster that is running pods from two services, S1 and S2. Imagine
that the pods of S1 have a RequiredDuringScheduling anti-affinity rule "do not
run me on nodes that are running pods from S2." It is not sufficient just to
check that there are no S2 pods on a node when you are scheduling a S1 pod. You
also need to ensure that there are no S1 pods on a node when you are scheduling
a S2 pod, *even though the S2 pod does not have any anti-affinity rules*.
Otherwise if an S1 pod schedules before an S2 pod, the S1 pod's
RequiredDuringScheduling anti-affinity rule can be violated by a later-arriving
S2 pod. More specifically, if S1 has the aforementioned RequiredDuringScheduling
anti-affinity rule, then (see the sketch after this list):
* if a node is empty, you can schedule S1 or S2 onto the node
* if a node is running S1 (S2), you cannot schedule S2 (S1) onto the node
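
A rough sketch of the symmetric check, assuming a hypothetical
`antiAffinityViolated(a, b)` helper that reports whether pod `a`'s
RequiredDuringScheduling anti-affinity rules forbid co-location with pod `b`.

```go
package sched

type Pod struct{ Name string }

// antiAffinityViolated is hypothetical: it reports whether pod a's
// RequiredDuringScheduling anti-affinity forbids sharing a topology domain
// with pod b. Its body is elided here.
func antiAffinityViolated(a, b *Pod) bool { return false }

// canSchedule captures the symmetry requirement: the incoming pod must not
// object to any pod already on the node, and no pod already on the node may
// object to the incoming pod, even if the incoming pod has no rules itself.
func canSchedule(incoming *Pod, podsOnNode []*Pod) bool {
	for _, existing := range podsOnNode {
		if antiAffinityViolated(incoming, existing) || antiAffinityViolated(existing, incoming) {
			return false
		}
	}
	return true
}
```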

Note that while RequiredDuringScheduling anti-affinity is symmetric,
RequiredDuringScheduling affinity is *not* symmetric. That is, if the pods of S1
have a RequiredDuringScheduling affinity rule "run me on nodes that are running
pods from S2," it is not required that there be S1 pods on a node in order to
schedule a S2 pod onto that node. More specifically, if S1 has the
aforementioned RequiredDuringScheduling affinity rule, then:
* if a node is empty, you can schedule S2 onto the node
* if a node is empty, you cannot schedule S1 onto the node
* if a node is running S2, you can schedule S1 onto the node
* if a node is running S1+S2 and S1 terminates, S2 continues running
* if a node is running S1+S2 and S2 terminates, the system terminates S1
(eventually)

However, although RequiredDuringScheduling affinity is not symmetric, there is
an implicit PreferredDuringScheduling affinity rule corresponding to every
RequiredDuringScheduling affinity rule: if the pods of S1 have a
RequiredDuringScheduling affinity rule "run me on nodes that are running pods
from S2" then it is not required that there be S1 pods on a node in order to
schedule a S2 pod onto that node, but it would be better if there are.

PreferredDuringScheduling is symmetric. If the pods of S1 had a
PreferredDuringScheduling anti-affinity rule "try not to run me on nodes that
are running pods from S2" then we would prefer to keep a S1 pod that we are
scheduling off of nodes that are running S2 pods, and also to keep a S2 pod that
we are scheduling off of nodes that are running S1 pods. Likewise if the pods of
S1 had a PreferredDuringScheduling affinity rule "try to run me on nodes that
are running pods from S2" then we would prefer to place a S1 pod that we are
scheduling onto a node that is running a S2 pod, and also to place a S2 pod that
we are scheduling onto a node that is running a S1 pod.

## Examples

Here are some examples of how you would express various affinity and
anti-affinity rules using the API we described.

### Affinity

In the examples below, the word "put" is intentionally ambiguous; the rules are
the same whether "put" means "must put" (RequiredDuringScheduling) or "try to
put" (PreferredDuringScheduling)--all that changes is which field the rule goes
into. Also, we only discuss scheduling-time, and ignore the execution-time.
Finally, some of the examples use "zone" and some use "node," just to make the
examples more interesting; any of the examples with "zone" will also work for
"node" if you change the `TopologyKey`, and vice-versa.

* **Put the pod in zone Z**:
Tricked you! It is not possible to express this using the API described here.
For this you should use node affinity.

* **Put the pod in a zone that is running at least one pod from service S**:
`{LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}`

* **Put the pod on a node that is already running a pod that requires a license
for software package P**: Assuming pods that require a license for software
package P have a label `{key=license, value=P}`:
`{LabelSelector: "license" In "P", TopologyKey: "node"}`

* **Put this pod in the same zone as other pods from its same service**:
Assuming pods from this pod's service have some label `{key=service, value=S}`:
|
||||||
`{LabelSelector: "service" In "S", TopologyKey: "zone"}`
|
`{LabelSelector: "service" In "S", TopologyKey: "zone"}`
|
||||||
|
|
||||||
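For concreteness, here is a minimal sketch of how that last rule could be populated, using a hand-rolled stand-in for the `PodAffinityTerm` shape this proposal describes; the struct and the map-based selector shorthand are illustrative assumptions, not the final API:

```go
package sketch

// PodAffinityTerm is an illustrative stand-in for the affinity term described
// in this proposal: a label selector over pods plus the topology domain
// ("node", "zone", ...) within which co-location is evaluated.
type PodAffinityTerm struct {
	MatchLabels map[string]string // shorthand for a full LabelSelector
	TopologyKey string
}

// sameZoneAsService expresses "put this pod in the same zone as other pods
// carrying the label service=<name>".
func sameZoneAsService(name string) PodAffinityTerm {
	return PodAffinityTerm{
		MatchLabels: map[string]string{"service": name},
		TopologyKey: "zone",
	}
}
```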
That last example (same zone as other pods from the same service) illustrates a small issue with this API when it is used with a scheduler that processes the pending queue one pod at a time, like the current Kubernetes scheduler. The RequiredDuringScheduling rule `{LabelSelector: "service" In "S", TopologyKey: "zone"}` only "works" once one pod from service S has been scheduled. But if all pods in service S have this RequiredDuringScheduling rule in their PodSpec, then the RequiredDuringScheduling rule will block the first pod of the service from ever scheduling, since it is only allowed to run in a zone with another pod from the same service. And of course that means none of the pods of the service will be able to schedule. This problem *only* applies to RequiredDuringScheduling affinity, not PreferredDuringScheduling affinity or any variant of anti-affinity. There are at least three ways to solve this problem:

* **short-term**: have the scheduler use a rule that if the RequiredDuringScheduling affinity requirement matches a pod's own labels, and there are no other such pods anywhere, then disregard the requirement (a sketch of this check follows this list). This approach has a corner case when running parallel schedulers that are allowed to schedule pods from the same replicated set (e.g. a single PodTemplate): both schedulers may try to schedule pods from the set at the same time and think there are no other pods from that set scheduled yet (e.g. they are trying to schedule the first two pods from the set), but by the time the second binding is committed, the first one has already been committed, leaving you with two pods running that do not respect their RequiredDuringScheduling affinity. There is no simple way to detect this "conflict" at scheduling time given the current system implementation.

* **longer-term**: when a controller creates pods from a PodTemplate, for exactly *one* of those pods, it should omit any RequiredDuringScheduling affinity rules that select the pods of that PodTemplate.

* **very long-term/speculative**: controllers could present the scheduler with a group of pods from the same PodTemplate as a single unit. This is similar to the first approach described above but avoids the corner case. No special logic is needed in the controllers. Moreover, this would allow the scheduler to do proper [gang scheduling](https://github.com/kubernetes/kubernetes/issues/16845) since it could receive an entire gang simultaneously as a single unit.
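A rough sketch of the short-term workaround's check, with simplified stand-in types (everything here is illustrative; the real scheduler would use its label-selector and pod-listing machinery):

```go
package sketch

// matches reports whether labels satisfy every key/value in the selector.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// disregardSelfAffinity implements the short-term workaround: if a
// RequiredDuringScheduling affinity term matches the pod's own labels and no
// existing pod anywhere matches it, the term is ignored so that the first pod
// of the set can schedule at all.
func disregardSelfAffinity(term, podLabels map[string]string, existingPodLabels []map[string]string) bool {
	if !matches(term, podLabels) {
		return false
	}
	for _, labels := range existingPodLabels {
		if matches(term, labels) {
			return false // some pod already satisfies the term; enforce it normally
		}
	}
	return true
}
```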
### Anti-affinity

As with the affinity examples, the examples here can be RequiredDuringScheduling or PreferredDuringScheduling anti-affinity, i.e. "don't" can be interpreted as "must not" or as "try not to" depending on whether the rule appears in `RequiredDuringScheduling` or `PreferredDuringScheduling`.

* **Spread the pods of this service S across nodes and zones**:
`{{LabelSelector: <selector that matches S's pods>, TopologyKey: "node"}, {LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}}`
(note that if this is specified as a RequiredDuringScheduling anti-affinity, then the first clause is redundant, since the second clause will force the scheduler to not put more than one pod from S in the same zone, and thus by definition it will not put more than one pod from S on the same node, assuming each node is in one zone. This rule is more useful as PreferredDuringScheduling anti-affinity, e.g. one might expect it to be common in [Ubernetes](../../docs/proposals/federation.md) clusters.)

* **Don't co-locate pods of this service with pods from service "evilService"**:
@ -323,25 +358,29 @@ This rule is more useful as PreferredDuringScheduling anti-affinity, e.g. one mi
* **Don't co-locate pods of this service with any other pods except other pods of this service**: Assuming pods from the service have some label `{key=service, value=S}`:
`{LabelSelector: "service" NotIn "S", TopologyKey: "node"}`
Note that this works because `"service" NotIn "S"` matches pods with no key "service" as well as pods with key "service" and a corresponding value that is not "S."
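The reason the `NotIn` form also covers unlabeled pods can be seen in a small matching sketch (an illustrative helper, not the real label-selector implementation):

```go
package sketch

// matchesNotIn reports whether a pod's labels satisfy "key NotIn values": it
// is true when the key is absent, and also true when the key is present with
// a value outside values. This is what makes the rule above apply to pods
// that carry no "service" label at all.
func matchesNotIn(podLabels map[string]string, key string, values []string) bool {
	v, ok := podLabels[key]
	if !ok {
		return true
	}
	for _, excluded := range values {
		if v == excluded {
			return false
		}
	}
	return true
}
```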
## Algorithm

An example algorithm a scheduler might use to implement affinity and anti-affinity rules is as follows. There are certainly more efficient ways to do it; this is just intended to demonstrate that the API's semantics are implementable.

Terminology definition: We say a pod P is "feasible" on a node N if P meets all of the scheduler predicates for scheduling P onto N. Note that this algorithm is only concerned with scheduling time, thus it makes no distinction between RequiredDuringExecution and IgnoredDuringExecution.

To make the algorithm slightly more readable, we use the term "HardPodAffinity" as shorthand for "RequiredDuringScheduling pod affinity" and "SoftPodAffinity" as shorthand for "PreferredDuringScheduling pod affinity." Analogously for "HardPodAntiAffinity" and "SoftPodAntiAffinity."

** TODO: Update this algorithm to take weight for SoftPod{Affinity,AntiAffinity} into account; currently it assumes all terms have weight 1. **

```
Z = the pod you are scheduling
@ -389,74 +428,81 @@ foreach node A of {N}
## Special considerations for RequiredDuringScheduling anti-affinity

In this section we discuss three issues with RequiredDuringScheduling anti-affinity: Denial of Service (DoS), co-existing with daemons, and determining which pod(s) to kill. See issue #18265 for additional discussion of these topics.

### Denial of Service

Without proper safeguards, a pod using RequiredDuringScheduling anti-affinity can intentionally or unintentionally cause various problems for other pods, due to the symmetry property of anti-affinity.

The most notable danger is the ability for a pod that arrives first to some topology domain to block all other pods from scheduling there by stating a conflict with all other pods. The standard approach to preventing resource hogging is quota, but simple resource quota cannot prevent this scenario because the pod may request very little resources. Addressing this using quota requires a quota scheme that charges based on "opportunity cost" rather than based simply on requested resources. For example, when handling a pod that expresses RequiredDuringScheduling anti-affinity for all pods using a "node" `TopologyKey` (i.e. exclusive access to a node), it could charge for the resources of the average or largest node in the cluster. Likewise if a pod expresses RequiredDuringScheduling anti-affinity for all pods using a "cluster" `TopologyKey`, it could charge for the resources of the entire cluster. If node affinity is used to constrain the pod to a particular topology domain, then the admission-time quota charging should take that into account (e.g. not charge for the average/largest machine if the PodSpec constrains the pod to a specific machine with a known size; instead charge for the size of the actual machine that the pod was constrained to). In all cases, once the pod is scheduled, the quota charge should be adjusted down to the actual amount of resources allocated (e.g. the size of the actual machine that was assigned, not the average/largest). If a cluster administrator wants to overcommit quota, for example to allow more than N pods across all users to request exclusive node access in a cluster with N nodes, then a priority/preemption scheme should be added so that the most important pods run when resource demand exceeds supply.
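As a rough illustration of the "opportunity cost" idea, an admission-time charger might look something like the sketch below; the types and the specific charging policy are assumptions for illustration, not something this proposal specifies.

```go
package sketch

// Capacity is a node's size in milli-CPUs and bytes of memory.
type Capacity struct {
	MilliCPU int64
	MemBytes int64
}

// chargeForExclusiveNode computes a quota charge for a pod whose
// RequiredDuringScheduling anti-affinity (against all pods, "node"
// TopologyKey) amounts to exclusive use of a node: rather than charging the
// pod's possibly tiny resource request, charge the largest per-resource
// capacity seen across the cluster's nodes. After scheduling, the charge
// would be adjusted down to the size of the node actually assigned.
func chargeForExclusiveNode(nodes []Capacity) Capacity {
	var charge Capacity
	for _, n := range nodes {
		if n.MilliCPU > charge.MilliCPU {
			charge.MilliCPU = n.MilliCPU
		}
		if n.MemBytes > charge.MemBytes {
			charge.MemBytes = n.MemBytes
		}
	}
	return charge
}
```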
An alternative approach, which is a bit of a blunt hammer, is to use a capability mechanism to restrict use of RequiredDuringScheduling anti-affinity to trusted users. A more complex capability mechanism might only restrict it when using a non-"node" TopologyKey.

Our initial implementation will use a variant of the capability approach, which requires no configuration: we will simply reject ALL requests, regardless of user, that specify "all namespaces" with non-"node" TopologyKey for RequiredDuringScheduling anti-affinity. This allows the "exclusive node" use case while prohibiting the more dangerous ones.
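A minimal sketch of that admission-time check (the types here are pared-down stand-ins; the real admission controller would examine the actual pod API object):

```go
package sketch

import "errors"

// antiAffinityTerm is a pared-down stand-in for a RequiredDuringScheduling
// anti-affinity term: whether it selects pods in all namespaces, and its
// topology key.
type antiAffinityTerm struct {
	AllNamespaces bool
	TopologyKey   string
}

// validateAntiAffinity enforces the rule described above: reject any
// RequiredDuringScheduling anti-affinity term that targets all namespaces
// with a TopologyKey other than "node", regardless of which user made the
// request.
func validateAntiAffinity(terms []antiAffinityTerm) error {
	for _, t := range terms {
		if t.AllNamespaces && t.TopologyKey != "node" {
			return errors.New(`anti-affinity over all namespaces requires the "node" topology key`)
		}
	}
	return nil
}
```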
A weaker variant of the DoS problem described above is a pod's ability to use anti-affinity to degrade the scheduling quality of another pod, but not completely block it from scheduling. For example, a set of pods S1 could use node affinity to request to schedule onto a set of nodes that some other set of pods S2 prefers to schedule onto. If the pods in S1 have RequiredDuringScheduling or even PreferredDuringScheduling pod anti-affinity for S2, then due to the symmetry property of anti-affinity, they can prevent the pods in S2 from scheduling onto their preferred nodes if they arrive first (for sure in the RequiredDuringScheduling case, and with some probability that depends on the weighting scheme for the PreferredDuringScheduling case). A very sophisticated priority and/or quota scheme could mitigate this, or alternatively we could eliminate the symmetry property of the implementation of PreferredDuringScheduling anti-affinity. Then only RequiredDuringScheduling anti-affinity could affect the scheduling quality of another pod, and as we described above, such pods could be charged quota for the full topology domain, thereby reducing the potential for abuse.

We won't try to address this issue in our initial implementation; we can consider one of the approaches mentioned above if it turns out to be a problem in practice.

### Co-existing with daemons

A cluster administrator may wish to allow pods that express anti-affinity against all pods to nonetheless co-exist with system daemon pods, such as those run by DaemonSet. In principle, we would like the specification for RequiredDuringScheduling inter-pod anti-affinity to allow "toleration" of one or more other pods (see #18263 for a more detailed explanation of the toleration concept). There are at least two ways to accomplish this:

* Scheduler special-cases the namespace(s) where daemons live, in the sense that it ignores pods in those namespaces when it is
@ -478,147 +524,168 @@ Our initial implementation will use the first approach.
### Determining which pod(s) to kill (for RequiredDuringSchedulingRequiredDuringExecution)

Because anti-affinity is symmetric, in the case of RequiredDuringSchedulingRequiredDuringExecution anti-affinity the system must determine which pod(s) to kill when a pod's labels are updated in such a way as to cause them to conflict with one or more other pods' RequiredDuringSchedulingRequiredDuringExecution anti-affinity rules. In the absence of a priority/preemption scheme, our rule will be that the pod with the anti-affinity rule that becomes violated should be the one killed. A pod should only specify constraints that apply to namespaces it trusts to not do malicious things. Once we have priority/preemption, we can change the rule to say that the lowest-priority pod(s) are killed until all RequiredDuringSchedulingRequiredDuringExecution anti-affinity is satisfied.
## Special considerations for RequiredDuringScheduling affinity

The DoS potential of RequiredDuringScheduling *anti-affinity* stemmed from its symmetry: if a pod P requests anti-affinity, P cannot schedule onto a node with conflicting pods, and pods that conflict with P cannot schedule onto the node once P has been scheduled there. The design we have described says that the symmetry property for RequiredDuringScheduling *affinity* is weaker: if a pod P says it can only schedule onto nodes running pod Q, this does not mean Q can only run on a node that is running P, but the scheduler will try to schedule Q onto a node that is running P (i.e. it treats the reverse direction as preferred). This raises the same scheduling quality concern as we mentioned at the end of the Denial of Service section above, and can be addressed in similar ways.

The nature of affinity (as opposed to anti-affinity) means that there is no issue of determining which pod(s) to kill when a pod's labels change: it is obviously the pod with the affinity rule that becomes violated that must be killed. (Killing a pod never "fixes" violation of an affinity rule; it can only "fix" violation of an anti-affinity rule.) However, affinity does have a different question related to killing: how long should the system wait before declaring that RequiredDuringSchedulingRequiredDuringExecution affinity is no longer met at runtime? For example, if a pod P has such an affinity for a pod Q and pod Q is temporarily killed so that it can be updated to a new binary version, should that trigger killing of P? More generally, how long should the system wait before declaring that P's affinity is violated? (Of course affinity is expressed in terms of label selectors, not for a specific pod, but the scenario is easier to describe using a concrete pod.) This is closely related to the concept of forgiveness (see issue #1574). In theory we could make this time duration configurable by the user on a per-pod basis, but for the first version of this feature we will make it a configurable property of whichever component does the killing, one that applies across all pods using the feature. Making it configurable by the user would require a nontrivial change to the API syntax (since the field would only apply to RequiredDuringSchedulingRequiredDuringExecution affinity).
## Implementation plan

1. Add the `Affinity` field to PodSpec and the `PodAffinity` and `PodAntiAffinity` types to the API along with all of their descendant types.
2. Implement a scheduler predicate that takes `RequiredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity into account (a rough sketch of such a predicate follows this list). Include a workaround for the issue described at the end of the Affinity subsection of the Examples section (can't schedule first pod).
3. Implement a scheduler priority function that takes `PreferredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity into account.
4. Implement an admission controller that rejects requests that specify "all namespaces" with non-"node" TopologyKey for `RequiredDuringScheduling` anti-affinity. This admission controller should be enabled by default.
5. Implement the recommended solution to the "co-existing with daemons" issue.
6. At this point, the feature can be deployed.
7. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to affinity and anti-affinity, and make sure the pieces of the system already implemented for `RequiredDuringSchedulingIgnoredDuringExecution` also take `RequiredDuringSchedulingRequiredDuringExecution` into account (e.g. the scheduler predicate, the quota mechanism, the "co-existing with daemons" solution).
8. Add `RequiredDuringSchedulingRequiredDuringExecution` for "node" `TopologyKey` to Kubelet's admission decision.
9. Implement code in Kubelet *or* the controllers that evicts a pod that no longer satisfies `RequiredDuringSchedulingRequiredDuringExecution`. If Kubelet, then only for "node" `TopologyKey`; if controller, then potentially for all `TopologyKey`s (see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)). Do so in a way that addresses the "determining which pod(s) to kill" issue.
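As a companion to step 2, here is a rough, simplified sketch of what such a predicate could look like. The types are stand-ins invented for illustration, it only handles the Required (hard) terms, and it omits the symmetry handling and the "first pod" workaround discussed earlier.

```go
package sketch

// Pod and Node are simplified stand-ins for the real API objects.
type Pod struct {
	Labels       map[string]string
	Affinity     []Term // RequiredDuringScheduling affinity terms
	AntiAffinity []Term // RequiredDuringScheduling anti-affinity terms
}

// Node carries the topology labels (e.g. "zone") published for the machine.
type Node struct {
	Labels map[string]string
}

// Term selects pods by exact-match labels within a topology domain.
type Term struct {
	MatchLabels map[string]string
	TopologyKey string
}

// matchesAll reports whether labels satisfy every key/value of the selector.
func matchesAll(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// sameDomain reports whether nodes a and b are in the same topology domain
// identified by key (e.g. the same "zone").
func sameDomain(a, b Node, key string) bool {
	return a.Labels[key] != "" && a.Labels[key] == b.Labels[key]
}

// feasible checks the Required (hard) affinity and anti-affinity terms of pod
// p against candidate node n, given the currently running pods and a lookup
// from each running pod to its node.
func feasible(p Pod, n Node, running []Pod, nodeOf func(Pod) Node) bool {
	for _, term := range p.Affinity {
		satisfied := false
		for _, q := range running {
			if matchesAll(term.MatchLabels, q.Labels) && sameDomain(n, nodeOf(q), term.TopologyKey) {
				satisfied = true
				break
			}
		}
		if !satisfied {
			return false
		}
	}
	for _, term := range p.AntiAffinity {
		for _, q := range running {
			if matchesAll(term.MatchLabels, q.Labels) && sameDomain(n, nodeOf(q), term.TopologyKey) {
				return false
			}
		}
	}
	return true
}
```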
We assume Kubelet publishes labels describing the node's membership in all of the relevant scheduling domains (e.g. node name, rack name, availability zone name, etc.). See #9044.

## Backward compatibility

Old versions of the scheduler will ignore `Affinity`.

Users should not start using `Affinity` until the full implementation has been in Kubelet and the master for enough binary versions that we feel comfortable that we will not need to roll back either Kubelet or master to a version that does not support them. Longer-term we will use a programmatic approach to enforcing this (#4855).
## Extensibility

The design described here is the result of careful analysis of use cases, a decade of experience with Borg at Google, and a review of similar features in other open-source container orchestration systems. We believe that it properly balances the goal of expressiveness against the goals of simplicity and efficiency of implementation. However, we recognize that use cases may arise in the future that cannot be expressed using the syntax described here. Although we are not implementing an affinity-specific extensibility mechanism for a variety of reasons (simplicity of the codebase, simplicity of cluster deployment, desire for Kubernetes users to get a consistent experience, etc.), the regular Kubernetes annotation mechanism can be used to add or replace affinity rules. The way this would work is:

1. Define one or more annotations to describe the new affinity rule(s).
1. User (or an admission controller) attaches the annotation(s) to pods to request the desired scheduling behavior. If the new rule(s) *replace* one or more fields of `Affinity` then the user would omit those fields from `Affinity`; if they are *additional rules*, then the user would fill in `Affinity` as well as the annotation(s).
1. Scheduler takes the annotation(s) into account when scheduling.
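For example, a scheduler extension could look for its own annotation alongside the standard `Affinity` field. The annotation key and payload below are invented purely for illustration and are not a proposed convention:

```go
package sketch

import "encoding/json"

// experimentalRuleAnnotation is a hypothetical annotation key used only for
// this example.
const experimentalRuleAnnotation = "example.org/experimental-affinity-rule"

// experimentalRule is whatever shape the new, not-yet-upstreamed rule takes.
type experimentalRule struct {
	Description string `json:"description"`
	TopologyKey string `json:"topologyKey"`
}

// ruleFromAnnotations extracts the experimental rule, if present, so a
// scheduler extension can take it into account in addition to (or instead of)
// fields of the standard Affinity.
func ruleFromAnnotations(annotations map[string]string) (*experimentalRule, error) {
	raw, ok := annotations[experimentalRuleAnnotation]
	if !ok {
		return nil, nil
	}
	var r experimentalRule
	if err := json.Unmarshal([]byte(raw), &r); err != nil {
		return nil, err
	}
	return &r, nil
}
```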
If some particular new syntax becomes popular, we would consider upstreaming it by integrating it into the standard `Affinity`.
## Future work and non-work

One can imagine that in the anti-affinity RequiredDuringScheduling case one might want to associate a number with the rule, for example "do not allow this pod to share a rack with more than three other pods (in total, or from the same service as the pod)." We could allow this to be specified by adding an integer `Limit` to `PodAffinityTerm` just for the `RequiredDuringScheduling` case. However, this flexibility complicates the system and we do not intend to implement it.

It is likely that the specification and implementation of pod anti-affinity can be unified with [taints and tolerations](taint-toleration-dedicated.md), and likewise that the specification and implementation of pod affinity can be unified with [node affinity](nodeaffinity.md). The basic idea is that pod labels would be "inherited" by the node, and pods would only be able to specify affinity and anti-affinity for a node's labels. Our main motivation for not unifying taints and tolerations with pod anti-affinity is that we foresee taints and tolerations as being a concept that only cluster administrators need to understand (and indeed in some setups taints and tolerations wouldn't even be directly manipulated by a cluster administrator; instead they would only be set by an admission controller that is implementing the administrator's high-level policy about different classes of special machines and the users who belong to the groups allowed to access them). Moreover, the concept of nodes "inheriting" labels from pods seems complicated; it seems conceptually simpler to separate rules involving relatively static properties of nodes from rules involving which other pods are running on the same node or larger topology domain.

Data/storage affinity is related to pod affinity, and is likely to draw on some of the ideas we have used for pod affinity. Today, data/storage affinity is expressed using node affinity, on the assumption that the pod knows which node(s) store(s) the data it wants. But a more flexible approach would allow the pod to name the data rather than the node.
## Related issues

The review for this proposal is in #18265.

The topic of affinity/anti-affinity has generated a lot of discussion. The main issue is #367 but #14484/#14485, #9560, #11369, #14543, #11707, #3945, #341, #1965, and #2906 all have additional discussion and use cases.

As the examples in this document have demonstrated, topological affinity is very useful in clusters that are spread across availability zones, e.g. to co-locate pods of a service in the same zone to avoid a wide-area network hop, or to spread pods across zones for failure tolerance. #17059, #13056, #13063, and #4235 are relevant.

Issue #15675 describes connection affinity, which is vaguely related.
@ -43,26 +43,57 @@ See also the [API conventions](../devel/api-conventions.md).
* All APIs should be declarative.
* API objects should be complementary and composable, not opaque wrappers.
* The control plane should be transparent -- there are no hidden internal APIs.
* The cost of API operations should be proportional to the number of objects intentionally operated upon. Therefore, common filtered lookups must be indexed. Beware of patterns of multiple API calls that would incur quadratic behavior.
* Object status must be 100% reconstructable by observation. Any history kept must be just an optimization and not required for correct operation.
* Cluster-wide invariants are difficult to enforce correctly. Try not to add them. If you must have them, don't enforce them atomically in master components; that is contention-prone and doesn't provide a recovery path in the case of a bug allowing the invariant to be violated. Instead, provide a series of checks to reduce the probability of a violation, and make every component involved able to recover from an invariant violation.
* Low-level APIs should be designed for control by higher-level systems. Higher-level APIs should be intent-oriented (think SLOs) rather than implementation-oriented (think control knobs).
## Control logic

* Functionality must be *level-based*, meaning the system must operate correctly given the desired state and the current/observed state, regardless of how many intermediate state updates may have been missed. Edge-triggered behavior must be just an optimization (a toy sketch of this loop shape follows this list).
* Assume an open world: continually verify assumptions and gracefully adapt to external events and/or actors. Example: we allow users to kill pods under control of a replication controller; it just replaces them.
* Do not define comprehensive state machines for objects with behaviors associated with state transitions and/or "assumed" states that cannot be ascertained by observation.
* Don't assume a component's decisions will not be overridden or rejected, nor that the component will always understand why. For example, etcd may reject writes. Kubelet may reject pods. The scheduler may not be able to schedule pods. Retry, but back off and/or make alternative decisions.
* Components should be self-healing. For example, if you must keep some state (e.g., a cache), the content needs to be periodically refreshed, so that if an item does get erroneously stored or a deletion event is missed, it will soon be fixed, ideally on timescales that are shorter than what will attract attention from humans.
* Component behavior should degrade gracefully. Prioritize actions so that the most important activities can continue to function even when overloaded and/or in states of partial failure.
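A toy sketch of the level-triggered loop shape behind the first bullet (not any actual Kubernetes controller; the functions are placeholders):

```go
package sketch

import "time"

// reconcile drives observed state toward desired state using only the two
// states themselves, so missed intermediate updates do not affect correctness.
func reconcile(desired, observed int, scaleTo func(int)) {
	if observed != desired {
		scaleTo(desired)
	}
}

// runLevelTriggered periodically re-reads state and reconciles; watch events
// would merely make this loop react faster, they are not required for
// correctness.
func runLevelTriggered(read func() (desired, observed int), scaleTo func(int), every time.Duration) {
	for range time.Tick(every) {
		d, o := read()
		reconcile(d, o, scaleTo)
	}
}
```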
## Architecture

* Only the apiserver should communicate with etcd/store, and not other components (scheduler, kubelet, etc.).
* Compromising a single node shouldn't compromise the cluster.
* Components should continue to do what they were last told in the absence of new instructions (e.g., due to network partition or component outage).
* All components should keep all relevant state in memory all the time. The apiserver should write through to etcd/store, other components should write through to the apiserver, and they should watch for updates made by other clients.
* Watch is preferred over polling.

## Extensibility
|
## Bootstrapping

* [Self-hosting](http://issue.k8s.io/246) of all components is a goal.
* Minimize the number of dependencies, particularly those required for steady-state operation.
* Stratify the dependencies that remain via principled layering.
* Break any circular dependencies by converting hard dependencies to soft dependencies.
* Also accept data from other components from another source, such as local files, which can then be manually populated at bootstrap time and then continuously updated once those other components are available.
* State should be rediscoverable and/or reconstructable.
* Make it easy to run temporary, bootstrap instances of all components in order to create the runtime state needed to run the components in the steady state; use a lock (master election for distributed components, file lock for local components like Kubelet) to coordinate handoff. We call this technique "pivoting".
* Have a solution to restart dead components. For distributed components, replication works well. For local components such as Kubelet, a process manager or even a simple shell loop works.

## Availability
@ -31,16 +31,19 @@ Documentation for other releases can be found at

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

**Note: this is a design doc, which describes features that have not been completely implemented. User documentation of the current state is [here](../user-guide/compute-resources.md). The tracking issue for implementation of this model is [#168](http://issue.k8s.io/168). Currently, both limits and requests of memory and cpu on containers (not pods) are supported. "memory" is in bytes and "cpu" is in milli-cores.**

# The Kubernetes resource model

To do good pod placement, Kubernetes needs to know how big pods are, as well as the sizes of the nodes onto which they are being placed. The definition of "how big" is given by the Kubernetes resource model — the subject of this document.

The resource model aims to be:

* simple, for common cases;
@ -50,43 +53,107 @@ The resource model aims to be:

## The resource model

A Kubernetes _resource_ is something that can be requested by, allocated to, or consumed by a pod or container. Examples include memory (RAM), CPU, disk-time, and network bandwidth.

Once resources on a node have been allocated to one pod, they should not be allocated to another until that pod is removed or exits. This means that Kubernetes schedulers should ensure that the sum of the resources allocated (requested and granted) to a node's pods never exceeds the usable capacity of that node. Testing whether a pod will fit on a node is called _feasibility checking_.

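A rough sketch of feasibility checking under this rule follows; the `ResourceList`, `Pod`, and `Node` types are simplified stand-ins rather than the real scheduler data structures.

```go
package feasibility

// ResourceList maps a resource typename (e.g., "cpu", "memory") to a quantity
// in that resource's internal base units (milli-KCUs for cpu, bytes for memory).
type ResourceList map[string]int64

type Pod struct {
	Request ResourceList // requested (and, once scheduled, granted) resources
}

type Node struct {
	Capacity ResourceList // total allocatable resources of the node
	Pods     []Pod        // pods already bound to the node
}

// Fits reports whether the candidate pod passes the feasibility check: the sum
// of all allocated requests plus the candidate's request must not exceed the
// node's capacity for any resource type, since over-commitment is not allowed.
func Fits(node Node, candidate Pod) bool {
	allocated := ResourceList{}
	for _, p := range node.Pods {
		for name, qty := range p.Request {
			allocated[name] += qty
		}
	}
	for name, qty := range candidate.Request {
		if allocated[name]+qty > node.Capacity[name] {
			return false
		}
	}
	return true
}
```
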
Note that the resource model currently prohibits over-committing resources; we will want to relax that restriction later.

### Resource types

All resources have a _type_ that is identified by their _typename_ (a string, e.g., "memory"). Several resource types are predefined by Kubernetes (a full list is below), although only two will be supported at first: CPU and memory. Users and system administrators can define their own resource types if they wish (e.g., Hadoop slots).

A fully-qualified resource typename is constructed from a DNS-style _subdomain_, followed by a slash `/`, followed by a name.
* The subdomain must conform to [RFC 1123](http://www.ietf.org/rfc/rfc1123.txt) (e.g., `kubernetes.io`, `example.com`).
* The name must be no more than 63 characters, consisting of upper- or lower-case alphanumeric characters, with the `-`, `_`, and `.` characters allowed anywhere except the first or last character.
* As a shorthand, any resource typename that does not start with a subdomain and a slash will automatically be prefixed with the built-in Kubernetes _namespace_, `kubernetes.io/`, in order to fully-qualify it. This namespace is reserved for code in the open source Kubernetes repository; as a result, all user typenames MUST be fully qualified, and cannot be created in this namespace.

Some example typenames include `memory` (which will be fully-qualified as `kubernetes.io/memory`), and `example.com/Shiny_New-Resource.Type`.

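The qualification shorthand and the name rules above can be expressed compactly in code. The following is an illustrative sketch, not the actual Kubernetes validation code; subdomain validation against RFC 1123 is omitted.

```go
package resourcetype

import (
	"fmt"
	"regexp"
	"strings"
)

// namePattern enforces the name rules: 1-63 characters, alphanumeric at the
// first and last positions, with '-', '_', and '.' allowed in between.
var namePattern = regexp.MustCompile(`^[A-Za-z0-9]([A-Za-z0-9._-]{0,61}[A-Za-z0-9])?$`)

// Qualify returns the fully-qualified form of a resource typename, applying
// the built-in kubernetes.io/ namespace when no subdomain is given.
func Qualify(typename string) string {
	if strings.Contains(typename, "/") {
		return typename
	}
	return "kubernetes.io/" + typename
}

// ValidateName checks only the name portion (after the slash) of a
// fully-qualified typename against the length and character constraints.
func ValidateName(qualified string) error {
	parts := strings.SplitN(qualified, "/", 2)
	if len(parts) != 2 {
		return fmt.Errorf("%q is not fully qualified", qualified)
	}
	if !namePattern.MatchString(parts[1]) {
		return fmt.Errorf("invalid resource name %q", parts[1])
	}
	return nil
}
```

With these helpers, `Qualify("memory")` yields `kubernetes.io/memory`, matching the example above.
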
For future reference, note that some resources, such as CPU and network bandwidth, are _compressible_, which means that their usage can potentially be throttled in a relatively benign manner. All other resources are _incompressible_, which means that any attempt to throttle them is likely to cause grief. This distinction will be important if a Kubernetes implementation supports over-committing of resources.

### Resource quantities

Initially, all Kubernetes resource types are _quantitative_, and have an associated _unit_ for quantities of the associated resource (e.g., bytes for memory, bytes per second for bandwidth, instances for software licences). The units will always be a resource type's natural base units (e.g., bytes, not MB), to avoid confusion between binary and decimal multipliers and the underlying unit multiplier (e.g., is memory measured in MiB, MB, or GB?).

Resource quantities can be added and subtracted: for example, a node has a fixed quantity of each resource type that can be allocated to pods/containers; once such an allocation has been made, the allocated resources cannot be made available to other pods/containers without over-committing the resources.

To make life easier for people, quantities can be represented externally as unadorned integers, or as fixed-point integers with one of these SI suffixes (E, P, T, G, M, K, m) or their power-of-two equivalents (Ei, Pi, Ti, Gi, Mi, Ki). For example, the following represent roughly the same value: 128974848, "129e6", "129M", "123Mi". Small quantities can be represented directly as decimals (e.g., 0.3), or using milli-units (e.g., "300m").
* "Externally" means in user interfaces, reports, graphs, and in JSON or YAML resource specifications that might be generated or read by people.
* Case is significant: "m" and "M" are not the same, so "k" is not a valid SI suffix. There are no power-of-two equivalents for SI suffixes that represent multipliers less than 1.
* These conventions only apply to resource quantities, not arbitrary values.

Internally (i.e., everywhere else), Kubernetes will represent resource quantities as integers so it can avoid problems with rounding errors, and will not use strings to represent numeric values. To achieve this, quantities that naturally have fractional parts (e.g., CPU seconds/second) will be scaled to integral numbers of milli-units (e.g., milli-CPUs) as soon as they are read in. Internal APIs, data structures, and protobufs will use these scaled integer units. Raw measurement data such as usage may still need to be tracked and calculated using floating point values, but internally they should be rescaled to avoid some values being in milli-units and some not. (A short sketch of this conversion appears after the list below.)
* Note that reading in a resource quantity and writing it out again may change the way its values are represented, and truncate precision (e.g., 1.0001 may become 1.000), so comparison and difference operations (e.g., by an updater) must be done on the internal representations.
* Avoiding milli-units in external representations has advantages for people who will use Kubernetes, but runs the risk of developers forgetting to rescale or accidentally using floating-point representations. That seems like the right choice. We will try to reduce the risk by providing libraries that automatically do the quantization for JSON/YAML inputs.

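As an illustration of the external-to-internal conversion described above, the sketch below parses a quantity string with one of the documented suffixes and scales it to integer milli-units using exact rational arithmetic. It is only a sketch of the idea, not the quantization library promised in the text, and it handles just a subset of the suffixes.

```go
package quantity

import (
	"fmt"
	"math/big"
	"strings"
)

// multipliers maps suffixes to their scale factors: SI suffixes use powers of
// 1000, power-of-two suffixes use powers of 1024, and "m" means 1/1000.
var multipliers = map[string]*big.Rat{
	"":   big.NewRat(1, 1),
	"m":  big.NewRat(1, 1000),
	"K":  big.NewRat(1000, 1),
	"M":  big.NewRat(1000*1000, 1),
	"G":  big.NewRat(1000*1000*1000, 1),
	"Ki": big.NewRat(1024, 1),
	"Mi": big.NewRat(1024*1024, 1),
	"Gi": big.NewRat(1024*1024*1024, 1),
	// T/P/E and Ti/Pi/Ei are omitted for brevity.
}

// ParseToMilli converts an external quantity string (e.g., "129M", "123Mi",
// "0.3", "300m") into integer milli-units, the internal representation.
func ParseToMilli(s string) (int64, error) {
	num := strings.TrimRight(s, "KMGTPEim")
	suffix := s[len(num):]
	mult, ok := multipliers[suffix]
	if !ok {
		return 0, fmt.Errorf("unknown suffix %q", suffix)
	}
	value, ok := new(big.Rat).SetString(num)
	if !ok {
		return 0, fmt.Errorf("invalid number %q", num)
	}
	value.Mul(value, mult)
	value.Mul(value, big.NewRat(1000, 1)) // scale to milli-units
	if !value.IsInt() {
		return 0, fmt.Errorf("%q is not an integral number of milli-units", s)
	}
	return value.Num().Int64(), nil
}
```

For example, `ParseToMilli("0.3")` and `ParseToMilli("300m")` both return 300, while `ParseToMilli("129M")` returns 129000000000 milli-units.
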
### Resource specifications

Both users and a number of system components, such as schedulers, (horizontal) auto-scalers, (vertical) auto-sizers, load balancers, and worker-pool managers, need to reason about resource requirements of workloads, resource capacities of nodes, and resource usage. Kubernetes distinguishes between specifications of *desired state*, aka the Spec, and representations of *current state*, aka the Status. Resource requirements and total node capacity fall into the specification category, while resource usage, characterizations derived from usage (e.g., maximum usage, histograms), and other resource demand signals (e.g., CPU load) clearly fall into the status category; these are discussed in the Appendix for now.

Resource requirements for a container or pod should have the following form:

@ -98,9 +165,24 @@ resourceRequirementSpec: [

Where:

* _request_ [optional]: the amount of resources being requested, or that were requested and have been allocated. Scheduler algorithms will use these quantities to test feasibility (whether a pod will fit onto a node). If a container (or pod) tries to use more resources than its _request_, any associated SLOs are voided — e.g., the program it is running may be throttled (compressible resource types), or the attempt may be denied. If _request_ is omitted for a container, it defaults to _limit_ if that is explicitly specified, otherwise to an implementation-defined value; this will always be 0 for a user-defined resource type. If _request_ is omitted for a pod, it defaults to the sum of the (explicit or implicit) _request_ values for the containers it encloses.

* _limit_ [optional]: an upper bound or cap on the maximum amount of resources that will be made available to a container or pod; if a container or pod uses more resources than its _limit_, it may be terminated. The _limit_ defaults to "unbounded"; in practice, this probably means the capacity of an enclosing container, pod, or node, but may result in non-deterministic behavior, especially for memory.

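A minimal sketch of the defaulting rules just described, using illustrative types rather than the real API structs: a container's request falls back to its limit when only the limit is explicit, and a pod's request defaults to the sum of its containers' (explicit or implicit) requests.

```go
package resourcespec

// Quantity is an internal integer quantity (milli-units for cpu, bytes for memory).
type Quantity = int64

type ResourceList map[string]Quantity

type ContainerSpec struct {
	Request ResourceList // may be nil or missing entries
	Limit   ResourceList // may be nil or missing entries
}

// EffectiveContainerRequest applies the container defaulting rule for one
// resource type: request defaults to limit if the limit is explicit, otherwise
// to an implementation-defined default (0 here, as for user-defined types).
func EffectiveContainerRequest(c ContainerSpec, resource string) Quantity {
	if r, ok := c.Request[resource]; ok {
		return r
	}
	if l, ok := c.Limit[resource]; ok {
		return l
	}
	return 0 // implementation-defined default
}

// EffectivePodRequest applies the pod defaulting rule: the sum of the
// (explicit or implicit) requests of the containers the pod encloses.
func EffectivePodRequest(containers []ContainerSpec, resource string) Quantity {
	var sum Quantity
	for _, c := range containers {
		sum += EffectiveContainerRequest(c, resource)
	}
	return sum
}
```
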
Total capacity for a node should have a similar structure:

@ -111,36 +193,66 @@ resourceCapacitySpec: [

Where:

* _total_: the total allocatable resources of a node. Initially, the resources at a given scope will bound the sum of the resources of its inner scopes.

#### Notes

* It is an error to specify the same resource type more than once in each list.
* It is an error for the _request_ or _limit_ values for a pod to be less than the sum of the (explicit or defaulted) values for the containers it encloses. (We may relax this later.)
* If multiple pods are running on the same node and attempting to use more resources than they have requested, the result is implementation-defined. For example: unallocated or unused resources might be spread equally across claimants, or the assignment might be weighted by the size of the original request, or as a function of limits, or priority, or the phase of the moon, perhaps modulated by the direction of the tide. Thus, although it's not mandatory to provide a _request_, it's probably a good idea. (Note that the _request_ could be filled in by an automated system that is observing actual usage and/or historical data.)
* Internally, the Kubernetes master can decide the defaulting behavior and the kubelet implementation may expect an absolute specification. For example, if the master decided that "the default is unbounded" it would pass 2^64 to the kubelet.

## Kubernetes-defined resource types

The following resource types are predefined ("reserved") by Kubernetes in the `kubernetes.io` namespace, and so cannot be used for user-defined resources. Note that the syntax of all resource types in the resource spec is deliberately similar, but some resource types (e.g., CPU) may receive significantly more support than simply tracking quantities in the schedulers and/or the Kubelet.

### Processor cycles

* Name: `cpu` (or `kubernetes.io/cpu`)
* Units: Kubernetes Compute Unit seconds/second (i.e., CPU cores normalized to a canonical "Kubernetes CPU")
* Internal representation: milli-KCUs
* Compressible? yes
* Qualities: this is a placeholder for the kind of thing that may be supported in the future — see [#147](http://issue.k8s.io/147)
* [future] `schedulingLatency`: as per lmctfy
* [future] `cpuConversionFactor`: property of a node: the speed of a CPU core on the node's processor divided by the speed of the canonical Kubernetes CPU (a floating point value; default = 1.0).

To reduce performance portability problems for pods, and to avoid worst-case provisioning behavior, the units of CPU will be normalized to a canonical "Kubernetes Compute Unit" (KCU, pronounced ˈko͝oko͞o), which will roughly be equivalent to a single CPU hyperthreaded core for some recent x86 processor. The normalization may be implementation-defined, although some reasonable defaults will be provided in the open-source Kubernetes code.

Note that requesting 2 KCU won't guarantee that precisely 2 physical cores will be allocated — control of aspects like this will be handled by resource _qualities_ (a future feature).

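To illustrate how the `cpuConversionFactor` property listed above could be used (it is a future feature, so the type and function here are hypothetical), a node's core count can be normalized into milli-KCUs like this:

```go
package kcu

// NodeCPUInfo is an illustrative description of a node's processor capacity.
type NodeCPUInfo struct {
	Cores               int     // hyperthreaded cores available on the node
	CPUConversionFactor float64 // node core speed / canonical Kubernetes CPU speed
}

// CapacityMilliKCU converts the node's raw core count into milli-KCUs, the
// internal unit for the cpu resource type. A node whose cores run at exactly
// the canonical speed (factor 1.0) reports 1000 milli-KCUs per core.
func CapacityMilliKCU(n NodeCPUInfo) int64 {
	return int64(float64(n.Cores) * n.CPUConversionFactor * 1000)
}
```

Under this sketch, a 4-core node whose cores are 1.25x the canonical Kubernetes CPU would report 5000 milli-KCUs.
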
### Memory

@ -149,15 +261,18 @@ Note that requesting 2 KCU won't guarantee that precisely 2 physical cores will
* Units: bytes
* Compressible? no (at least initially)

Precisely what "memory" means is implementation dependent, but the basic idea is to rely on the underlying `memcg` mechanisms, support, and definitions.

Note that most people will want to use power-of-two suffixes (Mi, Gi) for memory quantities rather than decimal ones: "64MiB" rather than "64MB".

## Resource metadata

A resource type may have an associated read-only ResourceType structure that contains metadata about the type. For example:

```yaml
resourceTypes: [
@ -172,7 +287,10 @@ resourceTypes: [
]
```

Kubernetes will provide ResourceType metadata for its predefined types. If no resource metadata can be found for a resource type, Kubernetes will assume that it is a quantified, incompressible resource that is not specified in milli-units, and has no default value.

The defined properties are as follows:

@ -188,13 +306,21 @@ The defined properties are as follows:

# Appendix: future extensions

The following are planned future extensions to the resource model, included here to encourage comments.

## Usage data

Because resource usage and related metrics change continuously, need to be tracked over time (i.e., historically), can be characterized in a variety of ways, and are fairly voluminous, we will not include usage in core API objects, such as [Pods](../user-guide/pods.md) and Nodes, but will provide separate APIs for accessing and managing that data. See the Appendix for possible representations of usage data, but the representation we'll use is TBD.

Singleton values for observed and predicted future usage will rapidly prove inadequate, so we will support the following structure for extended usage information:

```yaml
resourceStatus: [
@ -222,8 +348,12 @@ where a `<CPU-info>` or `<memory-info>` structure looks like this:
}
```

All parts of this structure are optional, although we strongly encourage including quantities for 50, 90, 95, 99, 99.5, and 99.9 percentiles. _[In practice, it will be important to include additional info such as the length of the time window over which the averages are calculated, the confidence level, and information-quality metrics such as the number of dropped or discarded data points.]_

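Internally, the `<CPU-info>`/`<memory-info>` structure shown in the YAML above might be modeled along the following lines. Since the representation is explicitly TBD, every field name here is illustrative only:

```go
package usage

// Sample characterizes observed (or predicted) usage of one resource type over
// some time window. All parts are optional in the external representation.
type Sample struct {
	Mean        int64            // average usage, in internal base units
	Max         int64            // peak usage observed in the window
	Percentiles map[string]int64 // e.g., "50", "90", "95", "99", "99.5", "99.9"
	WindowSecs  int64            // length of the averaging window, in seconds
	Dropped     int64            // information-quality metric: dropped/discarded data points
}

// ResourceStatus maps a resource typename to its observed and predicted usage.
type ResourceStatus struct {
	Observed  map[string]Sample
	Predicted map[string]Sample
}
```
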
## Future resource types

@ -245,7 +375,10 @@ and predicted
* Units: bytes
* Compressible? no

The amount of secondary storage space available to a container. The main target is local disk drives and SSDs, although this could also be used to qualify remotely-mounted volumes. Specifying whether a resource is a raw disk, an SSD, a disk array, or a file system fronting any of these, is left for future work.

### _[future] Storage time_

@ -254,7 +387,9 @@ The amount of secondary storage space available to a container. The main target
* Internal representation: milli-units
* Compressible? yes

This is the amount of time a container spends accessing disk, including actuator and transfer time. A standard disk drive provides 1.0 diskTime seconds per second.

### _[future] Storage operations_

@ -34,11 +34,26 @@ Documentation for other releases can be found at

# Scheduler extender

There are three ways to add new scheduling rules (predicates and priority functions) to Kubernetes: (1) by adding these rules to the scheduler and recompiling (described here: https://github.com/kubernetes/kubernetes/blob/master/docs/devel/scheduler.md), (2) implementing your own scheduler process that runs instead of, or alongside, the standard Kubernetes scheduler, and (3) implementing a "scheduler extender" process that the standard Kubernetes scheduler calls out to as a final pass when making scheduling decisions.

This document describes the third approach. This approach is needed for use cases where scheduling decisions need to be made on resources not directly managed by the standard Kubernetes scheduler. The extender helps make scheduling decisions based on such resources. (Note that the three approaches are not mutually exclusive.)

When scheduling a pod, the extender allows an external process to filter and prioritize nodes. Two separate http/https calls are issued to the extender, one for "filter" and one for "prioritize" actions. To use the extender, you must create a scheduler policy configuration file. The configuration specifies how to reach the extender, whether to use http or https, and the timeout.

```go
// Holds the parameters used to communicate with the extender. If a verb is unspecified/empty,
@ -94,7 +109,10 @@ A sample scheduler policy file with extender configuration:
}
```

Arguments passed to the FilterVerb endpoint on the extender are the set of nodes filtered through the k8s predicates and the pod. Arguments passed to the PrioritizeVerb endpoint on the extender are the set of nodes filtered through the k8s predicates and extender predicates and the pod.

```go
// ExtenderArgs represents the arguments needed by the extender to filter/prioritize
@ -107,9 +125,12 @@ type ExtenderArgs struct {
}
```

The "filter" call returns a list of nodes (api.NodeList). The "prioritize" call returns priorities for each node (schedulerapi.HostPriorityList).

The "filter" call may prune the set of nodes based on its predicates. Scores returned by the "prioritize" call are added to the k8s scores (computed through its priority functions) and used for final host selection.

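To make the extender protocol concrete, here is a minimal sketch of an extender process. It is illustrative only: the argument and result types are simplified local stand-ins for the ExtenderArgs, NodeList, and HostPriorityList structures mentioned above (not imports of the real Kubernetes packages), the port is arbitrary, and the handlers implement trivial logic: the filter keeps every node and the prioritizer scores them all equally.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Simplified stand-ins for the scheduler API types referenced in the text.
type Node struct {
	Name string `json:"name"`
}

type ExtenderArgs struct {
	PodName string `json:"podName"`
	Nodes   []Node `json:"nodes"`
}

type HostPriority struct {
	Host  string `json:"host"`
	Score int    `json:"score"`
}

// filter returns the subset of nodes the extender considers feasible; this
// sketch keeps all of them.
func filter(w http.ResponseWriter, r *http.Request) {
	var args ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	json.NewEncoder(w).Encode(args.Nodes)
}

// prioritize returns a score per node; these scores are added to the
// scheduler's own scores. This sketch scores everything equally.
func prioritize(w http.ResponseWriter, r *http.Request) {
	var args ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	priorities := make([]HostPriority, 0, len(args.Nodes))
	for _, n := range args.Nodes {
		priorities = append(priorities, HostPriority{Host: n.Name, Score: 0})
	}
	json.NewEncoder(w).Encode(priorities)
}

func main() {
	http.HandleFunc("/filter", filter)         // FilterVerb endpoint
	http.HandleFunc("/prioritize", prioritize) // PrioritizeVerb endpoint
	log.Fatal(http.ListenAndServe(":8888", nil))
}
```
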
Multiple extenders can be configured in the scheduler policy.

@ -34,15 +34,17 @@ Documentation for other releases can be found at

## Abstract

A proposal for the distribution of [secrets](../user-guide/secrets.md) (passwords, keys, etc.) to the Kubelet and to containers inside Kubernetes using a custom [volume](../user-guide/volumes.md#secrets) type. See the [secrets example](../user-guide/secrets/) for more information.

## Motivation

Secrets are needed in containers to access internal resources like the Kubernetes master or external resources such as git repositories, databases, etc. Users may also want behaviors in the kubelet that depend on secret data (credentials for image pull from a docker registry) associated with pods.

Goals of this design:

@ -52,114 +54,127 @@ Goals of this design:

## Constraints and Assumptions

* This design does not prescribe a method for storing secrets; storage of secrets should be pluggable to accommodate different use-cases
* Encryption of secret data and node security are orthogonal concerns
* It is assumed that node and master are secure and that compromising their security could also compromise secrets:
  * If a node is compromised, the only secrets that could potentially be exposed should be the secrets belonging to containers scheduled onto it
  * If the master is compromised, all secrets in the cluster may be exposed
* Secret rotation is an orthogonal concern, but it should be facilitated by this proposal
* A user who can consume a secret in a container can know the value of the secret; secrets must be provisioned judiciously

## Use Cases

1. As a user, I want to store secret artifacts for my applications and consume them securely in containers, so that I can keep the configuration for my applications separate from the images that use them:
    1. As a cluster operator, I want to allow a pod to access the Kubernetes master using a custom `.kubeconfig` file, so that I can securely reach the master
    2. As a cluster operator, I want to allow a pod to access a Docker registry using credentials from a `.dockercfg` file, so that containers can push images
    3. As a cluster operator, I want to allow a pod to access a git repository using SSH keys, so that I can push to and fetch from the repository
2. As a user, I want to allow containers to consume supplemental information about services such as username and password which should be kept secret, so that I can share secrets about a service amongst the containers in my application securely
3. As a user, I want to associate a pod with a `ServiceAccount` that consumes a secret and have the kubelet implement some reserved behaviors based on the types of secrets the service account consumes:
    1. Use credentials for a docker registry to pull the pod's docker image
    2. Present Kubernetes auth token to the pod or transparently decorate traffic between the pod and master service
4. As a user, I want to be able to indicate that a secret expires and for that secret's value to be rotated once it expires, so that the system can help me follow good practices

### Use-Case: Configuration artifacts

Many configuration files contain secrets intermixed with other configuration information. For example, a user's application may contain a properties file that contains database credentials, SaaS API tokens, etc. Users should be able to consume configuration artifacts in their containers and be able to control the path on the container's filesystem where the artifact will be presented.

### Use-Case: Metadata about services

Most pieces of information about how to use a service are secrets. For example, a service that provides a MySQL database needs to provide the username, password, and database name to consumers so that they can authenticate and use the correct database. Containers in pods consuming the MySQL service would also consume the secrets associated with the MySQL service.

### Use-Case: Secrets associated with service accounts

[Service Accounts](service_accounts.md) are proposed as a mechanism to decouple capabilities and security contexts from individual human users. A `ServiceAccount` contains references to some number of secrets. A `Pod` can specify that it is associated with a `ServiceAccount`. Secrets should have a `Type` field to allow the Kubelet and other system components to take action based on the secret's type.

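As an illustration of how a `Type` field could drive reserved behaviors in the kubelet, the sketch below dispatches on hypothetical type names; neither the struct shape nor the type strings are the final API, they simply mirror the auth-token and docker-registry examples that follow.

```go
package secrets

import "log"

// Secret is an illustrative shape for the proposed API object; only the
// fields needed for this sketch are shown.
type Secret struct {
	Name string
	Type string            // hypothetical, e.g. "kubernetes.io/kubernetes-auth" or "kubernetes.io/dockercfg"
	Data map[string][]byte // opaque secret payload
}

// handleSecretForPod shows how a kubelet could apply reserved behaviors based
// on the secret's type when starting a pod associated with a service account.
func handleSecretForPod(s Secret) {
	switch s.Type {
	case "kubernetes.io/kubernetes-auth":
		// Expose the token in a well-known file and/or configure kube-proxy
		// to decorate traffic to the kubernetes-master service.
		log.Printf("exposing auth token from secret %q to the pod", s.Name)
	case "kubernetes.io/dockercfg":
		// Use the registry credentials for the docker pull of the pod's image.
		log.Printf("using docker registry credentials from secret %q", s.Name)
	default:
		// Unrecognized types are simply mounted into the secret volume.
		log.Printf("mounting secret %q without special handling", s.Name)
	}
}
```
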
#### Example: service account consumes auth token secret

As an example, the service account proposal discusses service accounts consuming secrets which contain Kubernetes auth tokens. When a Kubelet starts a pod associated with a service account which consumes this type of secret, the Kubelet may take a number of actions:

1. Expose the secret in a `.kubernetes_auth` file in a well-known location in the container's file system
2. Configure that node's `kube-proxy` to decorate HTTP requests from that pod to the `kubernetes-master` service with the auth token, e.g. by adding a header to the request (see the [LOAS Daemon](http://issue.k8s.io/2209) proposal)

#### Example: service account consumes docker registry credentials

Another example use case is where a pod is associated with a secret containing docker registry credentials. The Kubelet could use these credentials for the docker pull to retrieve the image.

### Use-Case: Secret expiry and rotation

Rotation is considered a good practice for many types of secret data. It should be possible to express that a secret has an expiry date; this would make it possible to implement a system component that could regenerate expired secrets. As an example, consider a component that rotates expired secrets. The rotator could periodically regenerate the values for expired secrets of common types and update their expiry dates.

## Deferral: Consuming secrets as environment variables

Some images will expect to receive configuration items as environment variables instead of files. We should consider the best way to allow this; there are a few different options:

1. Force the user to adapt files into environment variables. Users can store secrets that need to be presented as environment variables in a format that is easy to consume from a shell:

        $ cat /etc/secrets/my-secret.txt
        export MY_SECRET_ENV=MY_SECRET_VALUE

    The user could `source` the file at `/etc/secrets/my-secret` prior to executing the command for the image, either inline in the command or in an init script.

2. Give secrets an attribute that allows users to express the intent that the platform should generate the above syntax in the file used to present a secret. The user could consume these files in the same manner as the above option.

3. Give secrets attributes that allow the user to express that the secret should be presented to the container as an environment variable. The container's environment would contain the desired values and the software in the container could use them without accommodation in the command or setup script.

For our initial work, we will treat all secrets as files to narrow the problem space. There will be a future proposal that handles exposing secrets as environment variables.

## Flow analysis of secret data with respect to the API server

@ -170,17 +185,19 @@ There are two fundamentally different use-cases for access to secrets:

### Use-Case: CRUD operations by owners

In use cases for CRUD operations, the user experience for secrets should be no different than for other API resources.

#### Data store backing the REST API

The data store backing the REST API should be pluggable because different cluster operators will have different preferences for the central store of secret data. Some possibilities for storage:

1. An etcd collection alongside the storage for other API resources
2. A collocated [HSM](http://en.wikipedia.org/wiki/Hardware_security_module)
3. A secrets server like [Vault](https://www.vaultproject.io/) or [Keywhiz](https://square.github.io/keywhiz/)
4. An external datastore such as an external etcd, RDBMS, etc.

#### Size limit for secrets

@ -188,101 +205,116 @@ have different preferences for the central store of secret data. Some possibili

There should be a size limit for secrets in order to:

1. Prevent DOS attacks against the API server
2. Allow kubelet implementations that prevent secret data from touching the node's filesystem

The size limit should satisfy the following conditions:

1. Large enough to store common artifact types (encryption keypairs, certificates, small configuration files)
2. Small enough to avoid large impact on node resource consumption (storage, RAM for tmpfs, etc)

To begin discussion, we propose an initial value for this size limit of **1MB**.

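A hedged sketch of how the API server could enforce such a limit at validation time; the function and types are illustrative rather than actual Kubernetes code, and the constant simply encodes the 1MB value proposed above.

```go
package validation

import "fmt"

// maxSecretSize is the proposed initial size limit for a secret (1MB).
const maxSecretSize = 1 * 1024 * 1024

// ValidateSecretSize rejects secrets whose total payload exceeds the limit,
// summing the sizes of all data entries in the secret.
func ValidateSecretSize(data map[string][]byte) error {
	total := 0
	for _, value := range data {
		total += len(value)
	}
	if total > maxSecretSize {
		return fmt.Errorf("secret data is %d bytes, which exceeds the %d byte limit", total, maxSecretSize)
	}
	return nil
}
```
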
#### Other limitations on secrets

Defining a policy for limitations on how a secret may be referenced by another API resource and how constraints should be applied throughout the cluster is tricky due to the number of variables involved:

1. Should there be a maximum number of secrets a pod can reference via a volume?
2. Should there be a maximum number of secrets a service account can reference?
3. Should there be a total maximum number of secrets a pod can reference via its own spec and its associated service account?
4. Should there be a total size limit on the amount of secret data consumed by a pod?
5. How will cluster operators want to be able to configure these limits?
6. How will these limits impact API server validations?
7. How will these limits affect scheduling?

For now, we will not implement validations around these limits. Cluster operators will decide how
|
For now, we will not implement validations around these limits. Cluster
|
||||||
much node storage is allocated to secrets. It will be the operator's responsibility to ensure that
|
operators will decide how much node storage is allocated to secrets. It will be
|
||||||
the allocated storage is sufficient for the workload scheduled onto a node.
|
the operator's responsibility to ensure that the allocated storage is sufficient
|
||||||
|
for the workload scheduled onto a node.
|
||||||
|
|
||||||
For now, kubelets will only attach secrets to api-sourced pods, and not file- or http-sourced
|
For now, kubelets will only attach secrets to api-sourced pods, and not file-
|
||||||
ones. Doing so would:
|
or http-sourced ones. Doing so would:
|
||||||
- confuse the secrets admission controller in the case of mirror pods.
|
- confuse the secrets admission controller in the case of mirror pods.
|
||||||
- create an apiserver-liveness dependency -- avoiding this dependency is a main reason to use non-api-source pods.
|
- create an apiserver-liveness dependency -- avoiding this dependency is a
|
||||||
|
main reason to use non-api-source pods.
|
||||||
|
|
||||||
### Use-Case: Kubelet read of secrets for node

The use-case where the kubelet reads secrets has several additional requirements:

1. Kubelets should only be able to receive secret data which is required by
pods scheduled onto the kubelet's node
2. Kubelets should have read-only access to secret data
3. Secret data should not be transmitted over the wire insecurely
4. Kubelets must ensure pods do not have access to each other's secrets

#### Read of secret data by the Kubelet

The Kubelet should only be allowed to read secrets which are consumed by pods
scheduled onto that Kubelet's node and their associated service accounts.
Authorization of the Kubelet to read this data would be delegated to an
authorization plugin and associated policy rule.

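To make the shape of such a policy rule concrete, here is a purely illustrative
sketch; the interface and types below are assumptions for this document, not the
actual Kubernetes authorization plugin API:

```go
// Illustrative only: a made-up policy check, not the real authorization plugin API.
package authz

// SecretReadRequest captures a kubelet asking to read one secret.
type SecretReadRequest struct {
	NodeName   string // kubelet identity, e.g. from its client credential
	Namespace  string
	SecretName string
}

// NodeSecretIndex answers "which secrets do pods scheduled onto this node need?"
type NodeSecretIndex interface {
	SecretsForNode(nodeName string) map[string]bool // keyed by "namespace/name"
}

// Authorize grants read-only access to a secret only if some pod (or service
// account) bound to the requesting node references it.
func Authorize(idx NodeSecretIndex, req SecretReadRequest) bool {
	needed := idx.SecretsForNode(req.NodeName)
	return needed[req.Namespace+"/"+req.SecretName]
}
```
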
#### Secret data on the node: data at rest

Consideration must be given to whether secret data should be allowed to be at
rest on the node:

1. If secret data is not allowed to be at rest, the size of secret data becomes
another draw on the node's RAM - should it affect scheduling?
2. If secret data is allowed to be at rest, should it be encrypted?
   1. If so, how should this be done?
   2. If not, what threats exist? What types of secret are appropriate to
   store this way?

For the sake of limiting complexity, we propose that initially secret data
should not be allowed to be at rest on a node; secret data should be stored on a
node-level tmpfs filesystem. This filesystem can be subdivided into directories
for use by the kubelet and by the volume plugin.

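A minimal sketch of how a kubelet might create such a size-capped, node-level
tmpfs on Linux; the mount path, mode, and flags here are assumptions for
illustration:

```go
// Sketch only: mount a size-capped tmpfs to hold secret data in RAM.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func mountSecretTmpfs(dir string, sizeMB int) error {
	if err := os.MkdirAll(dir, 0700); err != nil {
		return err
	}
	// "size=" caps RAM usage; nosuid/nodev reduce abuse of the mount.
	opts := fmt.Sprintf("size=%dm,mode=0700", sizeMB)
	return syscall.Mount("tmpfs", dir, "tmpfs", syscall.MS_NOSUID|syscall.MS_NODEV, opts)
}

func main() {
	if err := mountSecretTmpfs("/var/lib/kubelet/secrets", 64); err != nil {
		fmt.Fprintln(os.Stderr, "mount failed:", err)
	}
}
```
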
#### Secret data on the node: resource consumption

The Kubelet will be responsible for creating the per-node tmpfs file system for
secret storage. It is hard to make a prescriptive declaration about how much
storage is appropriate to reserve for secrets because different installations
will vary widely in available resources, desired pod to node density, overcommit
policy, and other operational dimensions. That being the case, we propose for
simplicity that the amount of secret storage be controlled by a new parameter to
the kubelet with a default value of **64MB**. It is the cluster operator's
responsibility to handle choosing the right storage size for their installation
and configuring their Kubelets correctly.

Configuring each Kubelet is not the ideal story for operator experience; it is
more intuitive that the cluster-wide storage size be readable from a central
configuration store like the one proposed in [#1553](http://issue.k8s.io/1553).
When such a store exists, the Kubelet could be modified to read this
configuration item from the store.

When the Kubelet is modified to advertise node resources (as proposed in
[#4441](http://issue.k8s.io/4441)), the capacity calculation
for available memory should factor in the potential size of the node-level tmpfs
in order to avoid memory overcommit on the node.

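As a back-of-the-envelope illustration of that capacity adjustment (the node
size below is hypothetical):

```go
// Sketch: memory advertised to the scheduler should leave room for the secret tmpfs.
package main

import "fmt"

func main() {
	const totalMemoryMB = 4096 // hypothetical node RAM
	const secretStorageMB = 64 // proposed default size of the secret tmpfs
	allocatableMB := totalMemoryMB - secretStorageMB
	fmt.Printf("advertise %d MB of memory instead of %d MB\n", allocatableMB, totalMemoryMB)
}
```
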
#### Secret data on the node: isolation

Every pod will have a [security context](security_context.md).
Secret data on the node should be isolated according to the security context of
the container. The Kubelet volume plugin API will be changed so that a volume
plugin receives the security context of a volume along with the volume spec.
This will allow volume plugins to implement setting the security context of
volumes they manage.

## Community work

Several proposals / upstream patches are notable as background for this
proposal:

1. [Docker vault proposal](https://github.com/docker/docker/issues/10310)
2. [Specification for image/container standardization based on volumes](https://github.com/docker/docker/issues/9277)

@@ -292,14 +324,15 @@ Several proposals / upstream patches are notable as background for this proposal

## Proposed Design

We propose a new `Secret` resource which is mounted into containers with a new
volume type. Secret volumes will be handled by a volume plugin that does the
actual work of fetching the secret and storing it. Secrets contain multiple
pieces of data that are presented as different files within the secret volume
(example: SSH key pair).

In order to remove the burden from the end user in specifying every file that a
secret consists of, it should be possible to mount all files provided by a
secret with a single `VolumeMount` entry in the container specification.

### Secret API Resource

@@ -331,27 +364,30 @@ const (

const MaxSecretSize = 1 * 1024 * 1024
```

A Secret can declare a type in order to provide type information to system
components that work with secrets. The default type is `opaque`, which
represents arbitrary user-owned data.

Secrets are validated against `MaxSecretSize`. The keys in the `Data` field must
be valid DNS subdomains.

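A rough sketch of that validation; the DNS-subdomain pattern here is a
simplified stand-in, not the exact rule the API server would use:

```go
// Sketch of Secret validation: total size bounded by MaxSecretSize and Data
// keys restricted to DNS-subdomain-like names. Simplified for illustration.
package validation

import (
	"fmt"
	"regexp"
)

const MaxSecretSize = 1 * 1024 * 1024

// Simplified DNS-1123 subdomain check; the real rule also caps key length.
var subdomainRE = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)

// ValidateSecretData rejects oversized secrets and malformed keys.
func ValidateSecretData(data map[string][]byte) error {
	total := 0
	for key, value := range data {
		if !subdomainRE.MatchString(key) {
			return fmt.Errorf("key %q is not a valid DNS subdomain", key)
		}
		total += len(value)
	}
	if total > MaxSecretSize {
		return fmt.Errorf("secret data exceeds %d bytes", MaxSecretSize)
	}
	return nil
}
```
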
A new REST API and registry interface will be added to accompany the `Secret`
resource. The default implementation of the registry will store `Secret`
information in etcd. Future registry implementations could store the `TypeMeta`
and `ObjectMeta` fields in etcd and store the secret data in another data store
entirely, or store the whole object in another data store.

#### Other validations related to secrets

Initially there will be no validations for the number of secrets a pod
references, or the number of secrets that can be associated with a service
account. These may be added in the future as the finer points of secrets and
resource allocation are fleshed out.

### Secret Volume Source

A new `SecretSource` type of volume source will be added to the `VolumeSource`
struct in the API:

```go
type VolumeSource struct {
@@ -366,19 +402,21 @@ type SecretSource struct {

}
```

Secret volume sources are validated to ensure that the specified object
reference actually points to an object of type `Secret`.

In the future, the `SecretSource` will be extended to allow:

1. Fine-grained control over which pieces of secret data are exposed in the
volume
2. The paths and filenames for how secret data are exposed

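For illustration only, a `SecretSource` that supports both extensions might look
roughly like the following; the field and type names are assumptions, not the
committed API:

```go
// Hypothetical sketch, not the committed API: a SecretSource that selects
// individual keys and controls where each one lands inside the volume.
package api

// ObjectReference is a simplified stand-in for the real reference type.
type ObjectReference struct {
	Namespace string
	Name      string
}

// KeyToPath maps one key of the secret's Data field to a file path.
type KeyToPath struct {
	Key  string // key in the secret's Data map
	Path string // relative file path inside the secret volume
}

type SecretSource struct {
	Target ObjectReference // must reference an object of kind Secret
	Items  []KeyToPath     // empty means "expose every key under its own name"
}
```
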
### Secret Volume Plugin

A new Kubelet volume plugin will be added to handle volumes with a secret
source. This plugin will require access to the API server to retrieve secret
data and therefore the volume `Host` interface will have to change to expose a
client interface:

```go
type Host interface {

@@ -394,36 +432,42 @@ The secret volume plugin will be responsible for:

1. Returning a `volume.Mounter` implementation from `NewMounter` that:
   1. Retrieves the secret data for the volume from the API server
   2. Places the secret data onto the container's filesystem
   3. Sets the correct security attributes for the volume based on the pod's
   `SecurityContext`
2. Returning a `volume.Unmounter` implementation from `NewUnmounter` that
cleans the volume from the container's filesystem

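Since the `Host` interface excerpt above is truncated by the diff, here is a
hypothetical sketch of the kind of client accessor it might gain; the method and
type names are assumptions for illustration:

```go
// Hypothetical sketch of the volume Host change: plugins gain a narrow,
// read-only way to fetch secret data from the API server.
package volume

// SecretGetter is a minimal client interface a secret plugin would need.
type SecretGetter interface {
	GetSecret(namespace, name string) (map[string][]byte, error)
}

type Host interface {
	// ...existing methods, such as the plugin data directory accessor, elided...

	// GetKubeClient exposes an API client to volume plugins that need one,
	// such as the secret volume plugin described in this proposal.
	GetKubeClient() SecretGetter
}
```
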
### Kubelet: Node-level secret storage

The Kubelet must be modified to accept a new parameter for the secret storage
size and to create a tmpfs file system of that size to store secret data. Rough
accounting of specific changes:

1. The Kubelet should have a new field added called `secretStorageSize`; units
are megabytes
2. `NewMainKubelet` should accept a value for secret storage size
3. The Kubelet server should have a new flag added for secret storage size
4. The Kubelet's `setupDataDirs` method should be changed to create the secret
storage

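A minimal sketch of how the server-side flag could be wired up; the flag name
and the surrounding options struct are assumptions, with the default taken from
the value proposed above:

```go
// Sketch: wiring a secret-storage-size flag into a hypothetical kubelet
// options struct using the standard flag package.
package main

import (
	"flag"
	"fmt"
)

type KubeletServer struct {
	SecretStorageSize int64 // megabytes of tmpfs reserved for secret storage
}

func (s *KubeletServer) addFlags(fs *flag.FlagSet) {
	fs.Int64Var(&s.SecretStorageSize, "secret-storage-size", 64,
		"Size, in MB, of the node-level tmpfs used for secret storage")
}

func main() {
	s := &KubeletServer{}
	s.addFlags(flag.CommandLine)
	flag.Parse()
	fmt.Println("secret storage size (MB):", s.SecretStorageSize)
}
```
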
### Kubelet: New behaviors for secrets associated with service accounts

For use-cases where the Kubelet's behavior is affected by the secrets associated
with a pod's `ServiceAccount`, the Kubelet will need to be changed. For example,
if secrets of type `docker-reg-auth` affect how the pod's images are pulled, the
Kubelet will need to be changed to accommodate this. Subsequent proposals can
address this on a type-by-type basis.

## Examples

For clarity, let's examine some detailed examples of some common use-cases in
terms of the suggested changes. All of these examples are assumed to be created
in a namespace called `example`.

### Use-Case: Pod with ssh keys

To create a pod that uses an ssh key stored as a secret, we first need to create
a secret:

```json
{

@@ -443,7 +487,8 @@ To create a pod that uses an ssh key stored as a secret, we first need to create

base64 strings. Newlines are not valid within these strings and must be
omitted.

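One way to produce those newline-free base64 values is with Go's standard
library; a small helper sketch (the file names below are just the ones used in
this example):

```go
// Sketch: base64-encode key material for the "data" map of a Secret manifest.
// base64.StdEncoding emits a single line with no newlines, as required above.
package main

import (
	"encoding/base64"
	"fmt"
	"os"
)

func main() {
	for _, path := range []string{"id-rsa", "id-rsa.pub"} {
		raw, err := os.ReadFile(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Printf("%s: %s\n", path, base64.StdEncoding.EncodeToString(raw))
	}
}
```
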
Now we can create a pod which references the secret with the ssh key and
consumes it in a volume:

```json
{

@@ -486,7 +531,8 @@ When the container's command runs, the pieces of the key will be available in:

    /etc/secret-volume/id-rsa.pub
    /etc/secret-volume/id-rsa

The container is then free to use the secret data to establish an ssh
connection.

### Use-Case: Pods with pod / test credentials

@@ -602,8 +648,9 @@ The pods:

}
```

The specs for the two pods differ only in the value of the object referred to by
the secret volume source. Both containers will have the following files present
on their filesystems:

    /etc/secret-volume/username
    /etc/secret-volume/password

@@ -34,37 +34,57 @@ Documentation for other releases can be found at

# Security in Kubernetes

Kubernetes should define a reasonable set of security best practices that allows
processes to be isolated from each other, from the cluster infrastructure, and
which preserves important boundaries between those who manage the cluster, and
those who use the cluster.

While Kubernetes today is not primarily a multi-tenant system, the long term
evolution of Kubernetes will increasingly rely on proper boundaries between
users and administrators. The code running on the cluster must be appropriately
isolated and secured to prevent malicious parties from affecting the entire
cluster.

## High Level Goals

1. Ensure a clear isolation between the container and the underlying host it
runs on
2. Limit the ability of the container to negatively impact the infrastructure
or other containers
3. [Principle of Least Privilege](http://en.wikipedia.org/wiki/Principle_of_least_privilege) -
ensure components are only authorized to perform the actions they need, and
limit the scope of a compromise by limiting the capabilities of individual
components
4. Reduce the number of systems that have to be hardened and secured by
defining clear boundaries between components
5. Allow users of the system to be cleanly separated from administrators
6. Allow administrative functions to be delegated to users where necessary
7. Allow applications to be run on the cluster that have "secret" data (keys,
certs, passwords) which is properly abstracted from "public" data.

## Use cases

### Roles

We define "user" as a unique identity accessing the Kubernetes API server, which
may be a human or an automated process. Human users fall into the following
categories:

1. k8s admin - administers a Kubernetes cluster and has access to the underlying
components of the system
2. k8s project administrator - administrates the security of a small subset of
the cluster
3. k8s developer - launches pods on a Kubernetes cluster and consumes cluster
resources

Automated process users fall into the following categories:

1. k8s container user - a user that processes running inside a container (on the
cluster) can use to access other cluster resources independent of the human
users attached to a project
2. k8s infrastructure user - the user that Kubernetes infrastructure components
use to perform cluster functions with clearly defined roles

### Description of roles

@@ -73,9 +93,11 @@ Automated process users fall into the following categories:

  * making some of their own images, and using some "community" docker images
  * know which pods need to talk to which other pods
  * decide which pods should share files with other pods, and which should not.
  * reason about application level security, such as containing the effects of a
  local-file-read exploit in a webserver pod.
  * do not often reason about operating system or organizational security.
  * are not necessarily comfortable reasoning about the security properties of a
  system at the level of detail of Linux Capabilities, SELinux, AppArmor, etc.

* Project Admins:
  * allocate identity and roles within a namespace

@@ -85,44 +107,81 @@ Automated process users fall into the following categories:

  * are less focused about application security

* Administrators:
  * are less focused on application security. Focused on operating system
  security.
  * protect the node from bad actors in containers, and properly-configured
  innocent containers from bad actors in other containers.
  * comfortable reasoning about the security properties of a system at the level
  of detail of Linux Capabilities, SELinux, AppArmor, etc.
  * decides who can use which Linux Capabilities, run privileged containers, use
  hostPath, etc.
    * e.g. a team that manages Ceph or a mysql server might be trusted to have
    raw access to storage devices in some organizations, but teams that develop
    the applications at higher layers would not.

## Proposed Design

A pod runs in a *security context* under a *service account* that is defined by
an administrator or project administrator, and the *secrets* a pod has access to
are limited by that *service account*.

1. The API should authenticate and authorize user actions [authn and authz](access.md)
2. All infrastructure components (kubelets, kube-proxies, controllers,
scheduler) should have an infrastructure user that they can authenticate with
and be authorized to perform only the functions they require against the API.
3. Most infrastructure components should use the API as a way of exchanging data
and changing the system, and only the API should have access to the underlying
data store (etcd)
4. When containers run on the cluster and need to talk to other containers or
the API server, they should be identified and authorized clearly as an
autonomous process via a [service account](service_accounts.md)
   1. If the user who started a long-lived process is removed from access to
   the cluster, the process should be able to continue without interruption
   2. If the user who started processes is removed from the cluster,
   administrators may wish to terminate their processes in bulk
   3. When containers run with a service account, the user that created /
   triggered the service account behavior must be associated with the
   container's action
5. When container processes run on the cluster, they should run in a
[security context](security_context.md) that isolates those processes via Linux
user security, user namespaces, and permissions.
   1. Administrators should be able to configure the cluster to automatically
   confine all container processes as a non-root, randomly assigned UID
   2. Administrators should be able to ensure that container processes within
   the same namespace are all assigned the same unix user UID
   3. Administrators should be able to limit which developers and project
   administrators have access to higher privilege actions
   4. Project administrators should be able to run pods within a namespace
   under different security contexts, and developers must be able to specify
   which of the available security contexts they may use
   5. Developers should be able to run their own images or images from the
   community and expect those images to run correctly
   6. Developers may need to ensure their images work within higher security
   requirements specified by administrators
   7. When available, Linux kernel user namespaces can be used to ensure 5.2
   and 5.4 are met.
   8. When application developers want to share filesystem data via distributed
   filesystems, the Unix user ids on those filesystems must be consistent across
   different container processes
6. Developers should be able to define [secrets](secrets.md) that are
automatically added to the containers when pods are run
   1. Secrets are files injected into the container whose values should not be
   displayed within a pod. Examples:
      1. An SSH private key for git cloning remote data
      2. A client certificate for accessing a remote system
      3. A private key and certificate for a web server
      4. A .kubeconfig file with embedded cert / token data for accessing the
      Kubernetes master
      5. A .dockercfg file for pulling images from a protected registry
   2. Developers should be able to define the pod spec so that a secret lands
   in a specific location
   3. Project administrators should be able to limit developers within a
   namespace from viewing or modifying secrets (anyone who can launch an
   arbitrary pod can view secrets)
   4. Secrets are generally not copied from one namespace to another when a
   developer's application definitions are copied

### Related design discussion

@@ -140,15 +199,52 @@ A pod runs in a *security context* under a *service account* that is defined by

### Isolate the data store from the nodes and supporting infrastructure

Access to the central data store (etcd) in Kubernetes allows an attacker to run
arbitrary containers on hosts, to gain access to any protected information
stored in either volumes or in pods (such as access tokens or shared secrets
provided as environment variables), to intercept and redirect traffic from
running services by inserting middlemen, or to simply delete the entire history
of the cluster.

As a general principle, access to the central data store should be restricted to
the components that need full control over the system and which can apply
appropriate authorization and authentication of change requests. In the future,
etcd may offer granular access control, but that granularity will require an
administrator to understand the schema of the data to properly apply security.
An administrator must be able to properly secure Kubernetes at a policy level,
rather than at an implementation level, and schema changes over time should not
risk unintended security leaks.

Both the Kubelet and Kube Proxy need information related to their specific roles -
for the Kubelet, the set of pods it should be running, and for the Proxy, the
set of services and endpoints to load balance. The Kubelet also needs to provide
information about running pods and historical termination data. The access
pattern for both Kubelet and Proxy to load their configuration is an efficient
"wait for changes" request over HTTP. It should be possible to limit the Kubelet
and Proxy to only access the information they need to perform their roles and no
more.

The controller manager for Replication Controllers and other future controllers
act on behalf of a user via delegation to perform automated maintenance on
Kubernetes resources. Their ability to access or modify resource state should be
strictly limited to their intended duties and they should be prevented from
accessing information not pertinent to their role. For example, a replication
controller needs only to create a copy of a known pod configuration, to
determine the running state of an existing pod, or to delete an existing pod
that it created - it does not need to know the contents or current state of a
pod, nor have access to any data in the pod's attached volumes.

The Kubernetes pod scheduler is responsible for reading data from the pod to fit
it onto a node in the cluster. At a minimum, it needs access to view the ID of a
pod (to craft the binding), its current state, any resource information
necessary to identify placement, and other data relevant to concerns like
anti-affinity, zone or region preference, or custom logic. It does not need the
ability to modify pods or see other resources, only to create bindings. It
should not need the ability to delete bindings unless the scheduler takes
control of relocating components on failed hosts (which could be implemented by
a separate component that can delete bindings but not create them). The
scheduler may need read access to user or project-container information to
determine preferential location (underspecified at this time).

<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->

@@ -36,41 +36,59 @@ Documentation for other releases can be found at

## Abstract

A security context is a set of constraints that are applied to a container in
order to achieve the following goals (from [security design](security.md)):

1. Ensure a clear isolation between container and the underlying host it runs
on
2. Limit the ability of the container to negatively impact the infrastructure
or other containers

## Background

The problem of securing containers in Kubernetes has come up
[before](http://issue.k8s.io/398) and the potential problems with container
security are [well known](http://opensource.com/business/14/7/docker-security-selinux).
Although it is not possible to completely isolate Docker containers from their
hosts, new features like [user namespaces](https://github.com/docker/libcontainer/pull/304)
make it possible to greatly reduce the attack surface.

## Motivation

### Container isolation

In order to improve container isolation from host and other containers running
on the host, containers should only be granted the access they need to perform
their work. To this end it should be possible to take advantage of Docker
features such as the ability to
[add or remove capabilities](https://docs.docker.com/reference/run/#runtime-privilege-linux-capabilities-and-lxc-configuration)
and [assign MCS labels](https://docs.docker.com/reference/run/#security-configuration)
to the container process.

Support for user namespaces has recently been
[merged](https://github.com/docker/libcontainer/pull/304) into Docker's
libcontainer project and should soon surface in Docker itself. It will make it
possible to assign a range of unprivileged uids and gids from the host to each
container, improving the isolation between host and container and between
containers.

### External integration with shared storage

In order to support external integration with shared storage, processes running
in a Kubernetes cluster should be able to be uniquely identified by their Unix
UID, such that a chain of ownership can be established. Processes in pods will
need to have consistent UID/GID/SELinux category labels in order to access
shared disks.

## Constraints and Assumptions

* It is out of the scope of this document to prescribe a specific set of
constraints to isolate containers from their host. Different use cases need
different settings.
* The concept of a security context should not be tied to a particular security
mechanism or platform (i.e. SELinux, AppArmor)
* Applying a different security context to a scope (namespace or pod) requires
a solution such as the one proposed for [service accounts](service_accounts.md).

## Use Cases

@@ -78,47 +96,51 @@ In order of increasing complexity, following are example use cases that would

be addressed with security contexts:

1. Kubernetes is used to run a single cloud application. In order to protect
nodes from containers:
   * All containers run as a single non-root user
   * Privileged containers are disabled
   * All containers run with a particular MCS label
   * Kernel capabilities like CHOWN and MKNOD are removed from containers

2. Just like case #1, except that I have more than one application running on
the Kubernetes cluster.
   * Each application is run in its own namespace to avoid name collisions
   * For each application a different uid and MCS label is used

3. Kubernetes is used as the base for a PAAS with multiple projects, each
project represented by a namespace.
   * Each namespace is associated with a range of uids/gids on the node that
   are mapped to uids/gids on containers using linux user namespaces.
   * Certain pods in each namespace have special privileges to perform system
   actions such as talking back to the server for deployment, run docker builds,
   etc.
   * External NFS storage is assigned to each namespace and permissions set
   using the range of uids/gids assigned to that namespace.

## Proposed Design

### Overview

A *security context* consists of a set of constraints that determine how a
container is secured before getting created and run. A security context resides
on the container and represents the runtime parameters that will be used to
create and run the container via container APIs. A *security context provider*
is passed to the Kubelet so it can have a chance to mutate Docker API calls in
order to apply the security context.

It is recommended that this design be implemented in two phases:

1. Implement the security context provider extension point in the Kubelet
so that a default security context can be applied on container run and creation.
2. Implement a security context structure that is part of a service account. The
default context provider can then be used to apply a security context based on
the service account associated with the pod.

### Security Context Provider

The Kubelet will have an interface that points to a `SecurityContextProvider`.
The `SecurityContextProvider` is invoked before creating and running a given
container:

```go
type SecurityContextProvider interface {
@@ -138,12 +160,14 @@ type SecurityContextProvider interface {

}
```

If the value of the SecurityContextProvider field on the Kubelet is nil, the
kubelet will create and run the container as it does today.

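A sketch of how the Kubelet might guard that call path; the method name below is
a hypothetical stand-in for the interface body elided by the diff above:

```go
// Hypothetical sketch: the Kubelet consults the provider only when one is set.
package kubelet

// ContainerConfig stands in for the Docker container configuration object.
type ContainerConfig struct{ User string }

type SecurityContextProvider interface {
	// ModifyContainerConfig is a placeholder name for the elided method(s)
	// that mutate the container configuration before creation.
	ModifyContainerConfig(podName, containerName string, config *ContainerConfig)
}

type Kubelet struct {
	securityContextProvider SecurityContextProvider
}

func (kl *Kubelet) applySecurityContext(podName, containerName string, config *ContainerConfig) {
	if kl.securityContextProvider == nil {
		return // no provider configured: behave exactly as today
	}
	kl.securityContextProvider.ModifyContainerConfig(podName, containerName, config)
}
```
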
### Security Context

A security context resides on the container and represents the runtime
parameters that will be used to create and run the container via container APIs.
Following is an example of an initial implementation:

```go
type Container struct {
@ -189,11 +213,12 @@ type SELinuxOptions struct {
|
|||||||
|
|
||||||
### Admission
|
### Admission
|
||||||
|
|
||||||
It is up to an admission plugin to determine if the security context is acceptable or not. At the
|
It is up to an admission plugin to determine if the security context is
|
||||||
time of writing, the admission control plugin for security contexts will only allow a context that
|
acceptable or not. At the time of writing, the admission control plugin for
|
||||||
has defined capabilities or privileged. Contexts that attempt to define a UID or SELinux options
|
security contexts will only allow a context that has defined capabilities or
|
||||||
will be denied by default. In the future the admission plugin will base this decision upon
|
privileged. Contexts that attempt to define a UID or SELinux options will be
|
||||||
configurable policies that reside within the [service account](http://pr.k8s.io/2297).
|
denied by default. In the future the admission plugin will base this decision
|
||||||
|
upon configurable policies that reside within the [service account](http://pr.k8s.io/2297).
|
||||||
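As a rough illustration of that admission rule, the check could look something
like the sketch below. The field names (`Privileged`, `Capabilities`,
`RunAsUser`, `SELinuxOptions`) are assumptions based on the elided
`SecurityContext` struct above, not its authoritative definition.

```go
// A hypothetical admission check: allow only capabilities and privileged;
// deny contexts that try to set a UID or SELinux options.
func validateSecurityContext(sc *SecurityContext) error {
	if sc == nil {
		return nil // no security context requested; nothing to admit
	}
	if sc.RunAsUser != nil || sc.SELinuxOptions != nil {
		return errors.New("security context may only set capabilities or privileged")
	}
	return nil
}
```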

<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->

@ -37,40 +37,64 @@ Design

# Goals

Make it really hard to accidentally create a job which has an overlapping
selector, while still making it possible to choose an arbitrary selector, and
without adding complex constraint solving to the APIserver.

# Use Cases

1. The user can leave all label and selector fields blank and the system will
fill in reasonable ones: non-overlappingness guaranteed.
2. The user can put on the pod template some labels that are useful to the
user, without reasoning about non-overlappingness. The system adds an
additional label to assure non-overlap.
3. If the user wants to reparent pods to a new job (very rare case) and knows
what they are doing, they can completely disable this behavior and specify an
explicit selector.
4. If a controller that makes jobs, like scheduled job, wants to use different
labels, such as the time and date of the run, it can do that.
5. If the user reads v1beta1 documentation or reuses v1beta1 Job definitions and
just changes the API group, the user should not automatically be allowed to
specify a selector, since this is very rarely what people want to do and is
error prone.
6. If the user downloads an existing job definition, e.g. with
`kubectl get jobs/old -o yaml` and tries to modify and post it, they should not
create an overlapping job.
7. If the user downloads an existing job definition, e.g. with
`kubectl get jobs/old -o yaml` and tries to modify and post it, and accidentally
copies the uniquifying label from the old one, then they should not get an error
from a label-key conflict, nor get erratic behavior.
8. If the user reads the swagger docs and sees the selector field, they should
not be able to set it without realizing the risks.
9. (Deferred requirement:) If the user wants to specify a preferred name for the
non-overlappingness key, they can pick a name.

# Proposed changes

## API

`extensions/v1beta1 Job` remains the same. `batch/v1 Job` changes as follows.

Field `job.spec.manualSelector` is added. It controls whether selectors are
automatically generated. In automatic mode, the user cannot make the mistake of
creating non-unique selectors. In manual mode, certain rare use cases are
supported.
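A minimal sketch of what the added field could look like on the Go type is
shown below; the surrounding fields are elided and the doc comment is our own,
so treat this as illustrative rather than the authoritative definition.

```go
type JobSpec struct {
	// ... existing fields (selector, template, etc.) elided ...

	// ManualSelector controls generation of pod labels and pod selectors.
	// When nil or false (automatic mode), the system generates a unique
	// selector and matching pod labels. Set to true only for the rare manual
	// use cases described in this document.
	ManualSelector *bool `json:"manualSelector,omitempty"`
}
```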
Validation is not changed. A selector must be provided, and it must select the
pod template.

Defaulting changes. Defaulting happens in one of two modes:

### Automatic Mode

- User does not specify `job.spec.selector`.
- User is probably unaware of the `job.spec.manualSelector` field and does not
think about it.
- User optionally puts labels on the pod template. The user does not think
about uniqueness, just labeling for the user's own reasons.
- Defaulting logic sets `job.spec.selector` to
`matchLabels["controller-uid"]="$UIDOFJOB"`
- Defaulting logic appends 2 labels to the `.spec.template.metadata.labels`.
  - The first label is controller-uid=$UIDOFJOB.
  - The second label is "job-name=$NAMEOFJOB".
@ -80,19 +104,30 @@ Defaulting changes. Defaulting happens in one of two modes:
- User means User or Controller for the rest of this list.
- User does specify `job.spec.selector`.
- User does specify `job.spec.manualSelector=true`
- User puts a unique label or label(s) on the pod template (required). The user
does think carefully about uniqueness.
- No defaulting of pod labels or the selector happens.

### Rationale

UID is better than Name in that:
- it allows cross-namespace control someday if we need it.
- it is unique across all kinds. `controller-name=foo` does not ensure
uniqueness across Kinds `job` vs `replicaSet`. Even `job-name=foo` has a
problem: you might have a `batch.Job` and a `snazzyjob.io/types.Job` -- the
latter cannot use label `job-name=foo`, though there is a temptation to do so.
- it uniquely identifies the controller across time. This prevents the case
where, for example, someone deletes a job via the REST API or client
(where cascade=false), leaving pods around. We don't want those to be picked up
unintentionally. It also prevents the case where a user looks at an old job that
finished but is not deleted, and tries to select its pods, and gets the wrong
impression that it is still running.

Job name is more user friendly. It is self-documenting.

Commands like `kubectl get pods -l job-name=myjob` should do exactly what is
wanted 99.9% of the time. Automated control loops should still use the
`controller-uid=` label.

Using both gets the benefits of both, at the cost of some label verbosity.

@ -102,11 +137,15 @@ users looking at a stored pod spec do not need to be aware of this field.

### Overriding Unique Labels

If the user does specify `job.spec.selector` then the user must also specify
`job.spec.manualSelector`. This ensures the user knows that what they are doing
is not the normal thing to do.

To prevent users from copying the `job.spec.manualSelector` flag from existing
jobs, it will be optional and default to false, which means when you GET an
existing job back that didn't use this feature, you don't even see the
`job.spec.manualSelector` flag, so you are not tempted to wonder if you should
fiddle with it.

## Job Controller

@ -114,8 +153,8 @@ No changes

## Kubectl

No required changes. Suggest moving SELECTOR to wide output of `kubectl get
jobs` since users do not write the selector.

## Docs

@ -124,42 +163,50 @@ Recommend `kubectl get jobs -l job-name=name` as the way to find pods of a job.

# Conversion

The following applies to Job, as well as to other types that adopt this pattern:

- Type `extensions/v1beta1` gets a field called `job.spec.autoSelector`.
- Both the internal type and the `batch/v1` type will get
`job.spec.manualSelector`.
- The fields `manualSelector` and `autoSelector` have opposite meanings.
- Each field defaults to false when unset, and so v1beta1 has a different
default than v1 and internal. This is intentional: we want new uses to default
to the less error-prone behavior, and we do not want to change the behavior of
v1beta1.

*Note*: since the internal default is changing, client library consumers that
create Jobs may need to add "job.spec.manualSelector=true" to keep working, or
switch to auto selectors.

Conversion is as follows (see the sketch after this list):
- `extensions/__internal` to `extensions/v1beta1`: the value of
`__internal.Spec.ManualSelector` is defaulted to false if nil, negated,
defaulted to nil if false, and written to `v1beta1.Spec.AutoSelector`.
- `extensions/v1beta1` to `extensions/__internal`: the value of
`v1beta1.Spec.AutoSelector` is defaulted to false if nil, negated, defaulted to
nil if false, and written to `__internal.Spec.ManualSelector`.
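Since both fields have a nil-able boolean representation, the same helper can
restate both rules; the sketch below assumes `*bool` fields and is only meant to
express the conversion rules above in code.

```go
// negateSelectorFlag implements "defaulted to false if nil, negated, defaulted
// to nil if false" and is used in both directions of the conversion
// (ManualSelector -> AutoSelector and AutoSelector -> ManualSelector).
func negateSelectorFlag(in *bool) *bool {
	value := false
	if in != nil {
		value = *in // default to false if nil
	}
	negated := !value
	if !negated {
		return nil // default to nil if false
	}
	return &negated
}
```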
This conversion gives the following properties:

1. Users that previously used v1beta1 do not start seeing a new field when they
get back objects.
2. Distinction between originally unset versus explicitly set to false is not
preserved (would have been nice to do so, but requires a more complicated
solution).
3. Users who only created v1beta1 examples or v1 examples will not ever see the
existence of either field.
4. Since v1beta1 is convertible to/from v1, the storage location (path in etcd)
does not need to change, allowing scriptable rollforward/rollback.

# Future Work

Follow this pattern for Deployments, ReplicaSet, DaemonSet when going to v1, if
it works well for Job.

Docs will be edited to show examples without a `job.spec.selector`.

We probably want as much as possible the same behavior for Job and
ReplicationController.


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->

@ -40,26 +40,30 @@ Processes in Pods may need to call the Kubernetes API. For example:
- scheduler
- replication controller
- node controller
- a map-reduce type framework which has a controller that then tries to make a
dynamically determined number of workers and watch them
- continuous build and push system
- monitoring system

They also may interact with services other than the Kubernetes API, such as:
- an image repository, such as docker -- both when the images are pulled to
start the containers, and for writing images in the case of pods that generate
images.
- accessing other cloud services, such as blob storage, in the context of a
large, integrated, cloud offering (hosted or private).
- accessing files in an NFS volume attached to the pod

## Design Overview

A service account binds together several things:
- a *name*, understood by users, and perhaps by peripheral systems, for an
identity
- a *principal* that can be authenticated and [authorized](../admin/authorization.md)
- a [security context](security_context.md), which defines the Linux
Capabilities, User IDs, Group IDs, and other capabilities and controls on
interaction with the file system and OS.
- a set of [secrets](secrets.md), which a container may use to access various
networked resources.

## Design Discussion

```go
@ -76,94 +80,119 @@ type ServiceAccount struct {
}
```

The name ServiceAccount is chosen because it is widely used already (e.g. by
Kerberos and LDAP) to refer to this type of account. Note that it has no
relation to Kubernetes Service objects.

The ServiceAccount object does not include any information that could not be
defined separately:
- username can be defined however users are defined.
- securityContext and secrets are only referenced and are created using the
REST API.

The purpose of the serviceAccount object is twofold:
- to bind usernames to securityContexts and secrets, so that the username can
be used to refer to them succinctly in contexts where explicitly naming
securityContexts and secrets would be inconvenient
- to provide an interface to simplify allocation of new securityContexts and
secrets.

These features are explained later.

### Names

From the standpoint of the Kubernetes API, a `user` is any principal which can
authenticate to the Kubernetes API. This includes a human running `kubectl` on
her desktop and a container in a Pod on a Node making API calls.

There is already a notion of a username in Kubernetes, which is populated into a
request context after authentication. However, there is no API object
representing a user. While this may evolve, it is expected that in mature
installations, the canonical storage of user identifiers will be handled by a
system external to Kubernetes.

Kubernetes does not dictate how to divide up the space of user identifier
strings. User names can be simple Unix-style short usernames (e.g. `alice`), or
may be qualified to allow for federated identity (`alice@example.com` vs
`alice@example.org`). Naming convention may distinguish service accounts from
user accounts (e.g. `alice@example.com` vs
`build-service-account-a3b7f0@foo-namespace.service-accounts.example.com`), but
Kubernetes does not require this.

Kubernetes also does not require that there be a distinction between human and
Pod users. It will be possible to set up a cluster where Alice the human talks
to the Kubernetes API as username `alice` and starts pods that also talk to the
API as user `alice` and write files to NFS as user `alice`. But, this is not
recommended.

Instead, it is recommended that Pods and Humans have distinct identities, and
reference implementations will make this distinction.

The distinction is useful for a number of reasons:
- the requirements for humans and automated processes are different:
  - Humans need a wide range of capabilities to do their daily activities.
Automated processes often have more narrowly-defined activities.
  - Humans may better tolerate the exceptional conditions created by expiration
of a token. Remembering to handle this in a program is more annoying. So,
either long-lasting credentials or automated rotation of credentials is needed.
  - A Human typically keeps credentials on a machine that is not part of the
cluster and so not subject to automatic management. A VM with a
role/service-account can have its credentials automatically managed.
- the identity of a Pod cannot in general be mapped to a single human.
  - If policy allows, it may be created by one human, and then updated by
another, and another, until its behavior cannot be attributed to a single human.

**TODO**: consider getting rid of separate serviceAccount object and just
rolling its parts into the SecurityContext or Pod Object.

The `secrets` field is a list of references to /secret objects that a process
started as that service account should have access to be able to assert that
role.

The secrets are not inline with the serviceAccount object. This way, most or
all users can have permission to `GET /serviceAccounts` so they can remind
themselves what serviceAccounts are available for use.

Nothing will prevent creation of a serviceAccount with two secrets of type
`SecretTypeKubernetesAuth`, or secrets of two different types. Kubelet and
client libraries will have some behavior, TBD, to handle the case of multiple
secrets of a given type (pick first or provide all and try each in order, etc).

When a serviceAccount and a matching secret exist, then a `User.Info` for the
serviceAccount and a `BearerToken` from the secret are added to the map of
tokens used by the authentication process in the apiserver, and similarly for
other types. (We might have some types that do not do anything on apiserver but
just get pushed to the kubelet.)

### Pods

The `PodSpec` is extended to have a `Pods.Spec.ServiceAccountUsername` field. If
this is unset, then a default value is chosen. If it is set, then the
corresponding value of `Pods.Spec.SecurityContext` is set by the Service Account
Finalizer (see below).
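A sketch of the proposed `PodSpec` extension is below; the JSON tags and field
placement are assumptions made for illustration, not the authoritative
definition.

```go
type PodSpec struct {
	// ... existing fields elided ...

	// ServiceAccountUsername names the service account whose securityContext
	// and secrets should be applied to this pod. If unset, a default is chosen.
	ServiceAccountUsername string `json:"serviceAccountUsername,omitempty"`

	// SecurityContext is filled in by the Service Account Finalizer (see below)
	// when it is not set explicitly.
	SecurityContext *SecurityContext `json:"securityContext,omitempty"`
}
```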
TBD: how policy limits which users can make pods with which service accounts.

### Authorization

Kubernetes API Authorization Policies refer to users. Pods created with a
`Pods.Spec.ServiceAccountUsername` typically get a `Secret` which allows them to
authenticate to the Kubernetes APIserver as a particular user. So any policy
that is desired can be applied to them.

A higher level workflow is needed to coordinate creation of serviceAccounts,
secrets and relevant policy objects. Users are free to extend Kubernetes to put
this business logic wherever is convenient for them, though the Service Account
Finalizer is one place where this can happen (see below).

### Kubelet

The kubelet will treat as "not ready to run" (needing a finalizer to act on it)
any Pod which has an empty SecurityContext.

The kubelet will set a default, restrictive, security context for any pods
created from non-Apiserver config sources (http, file).

Kubelet watches apiserver for secrets which are needed by pods bound to it.

@ -173,32 +202,41 @@ Kubelet watches apiserver for secrets which are needed by pods bound to it.

There are several ways to use Pods with SecurityContexts and Secrets.

One way is to explicitly specify the securityContext and all secrets of a Pod
when the pod is initially created, like this:

**TODO**: example of pod with explicit refs.

Another way is with the *Service Account Finalizer*, a plugin process which is
optional, and which handles business logic around service accounts.

The Service Account Finalizer watches Pods, Namespaces, and ServiceAccount
definitions.

First, if it finds pods which have a `Pod.Spec.ServiceAccountUsername` but no
`Pod.Spec.SecurityContext` set, then it copies in the referenced securityContext
and secrets references for the corresponding `serviceAccount`.
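That first responsibility might look roughly like the sketch below. The
stand-in pod spec type, the `ServiceAccount` fields, and the lookup function are
assumptions made to keep the sketch self-contained; the real types are defined
(or elided) elsewhere in this design.

```go
// Stand-in for the relevant pod fields; the real PodSpec is defined elsewhere.
type finalizerPodSpec struct {
	ServiceAccountUsername string
	SecurityContext        *SecurityContext
	Secrets                []ObjectReference
}

// finalizePod copies the securityContext and secret references from the named
// service account into a pod that has not been finalized yet.
func finalizePod(spec *finalizerPodSpec, lookup func(username string) (*ServiceAccount, error)) error {
	if spec.ServiceAccountUsername == "" || spec.SecurityContext != nil {
		return nil // nothing to do: no service account named, or already finalized
	}
	sa, err := lookup(spec.ServiceAccountUsername)
	if err != nil {
		return err
	}
	spec.SecurityContext = sa.SecurityContext
	spec.Secrets = append(spec.Secrets, sa.Secrets...)
	return nil
}
```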
Second, if ServiceAccount definitions change, it may take some actions.

**TODO**: decide what actions it takes when a serviceAccount definition changes.
Does it stop pods, or just allow someone to list ones that are out of spec? In
general, people may want to customize this?

Third, if a new namespace is created, it may create a new serviceAccount for
that namespace. This may include a new username (e.g.
`NAMESPACE-default-service-account@serviceaccounts.$CLUSTERID.kubernetes.io`),
a new securityContext, a newly generated secret to authenticate that
serviceAccount to the Kubernetes API, and default policies for that service
account.

**TODO**: more concrete example. What are typical default permissions for the
default service account (e.g. readonly access to services in the same namespace
and read-write access to events in that namespace?)

Finally, it may provide an interface to automate creation of new
serviceAccounts. In that case, the user may want to GET serviceAccounts to see
what has been created.


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->

@ -34,32 +34,47 @@ Documentation for other releases can be found at

## Simple rolling update

This is a lightweight design document for simple
[rolling update](../user-guide/kubectl/kubectl_rolling-update.md) in `kubectl`.

Complete execution flow can be found [here](#execution-details). See the
[example of rolling update](../user-guide/update-demo/) for more information.

### Lightweight rollout

Assume that we have a current replication controller named `foo` and it is
running image `image:v1`.

`kubectl rolling-update foo [foo-v2] --image=myimage:v2`

If the user doesn't specify a name for the 'next' replication controller, then
the 'next' replication controller is renamed to the name of the original
replication controller.

Obviously there is a race here, where if you kill the client between deleting
foo and creating the new version of 'foo' you might be surprised about what is
there, but I think that's ok. See [Recovery](#recovery) below.

If the user does specify a name for the 'next' replication controller, then the
'next' replication controller is retained with its existing name, and the old
'foo' replication controller is deleted. For the purposes of the rollout, we add
a unique-ifying label `kubernetes.io/deployment` to both the `foo` and
`foo-next` replication controllers. The value of that label is the hash of the
complete JSON representation of the `foo-next` or `foo` replication controller.
The name of this label can be overridden by the user with the
`--deployment-label-key` flag.

#### Recovery

If a rollout fails or is terminated in the middle, it is important that the user
be able to resume the roll out. To facilitate recovery in the case of a crash of
the updating process itself, we add the following annotations to each
replication controller in the `kubernetes.io/` annotation namespace (see the
sketch after this list):

* `desired-replicas` The desired number of replicas for this replication
controller (either N or zero)
* `update-partner` A pointer to the replication controller resource that is
the other half of this update (syntax `<name>`; the namespace is assumed to be
identical to the namespace of this replication controller.)
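The sketch below shows one way the client could stamp these annotations before
starting a rollout. Only the `desired-replicas` and `update-partner` names and
the `kubernetes.io/` namespace come from the text above; the fully-qualified
keys and the helper itself are illustrative assumptions.

```go
import "strconv"

const (
	desiredReplicasAnnotation = "kubernetes.io/desired-replicas"
	updatePartnerAnnotation   = "kubernetes.io/update-partner"
)

// annotateForRollout records enough state on a replication controller's
// annotations for an interrupted rolling update to be resumed later.
func annotateForRollout(annotations map[string]string, desiredReplicas int, partnerName string) {
	annotations[desiredReplicasAnnotation] = strconv.Itoa(desiredReplicas)
	// The partner is named only; its namespace is assumed to be identical.
	annotations[updatePartnerAnnotation] = partnerName
}
```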
Recovery is achieved by issuing the same command again:

@ -67,9 +82,12 @@ Recovery is achieved by issuing the same command again:

```
kubectl rolling-update foo [foo-v2] --image=myimage:v2
```

Whenever the rolling update command executes, the kubectl client looks for
replication controllers called `foo` and `foo-next`; if they exist, an attempt
is made to roll `foo` to `foo-next`. If `foo-next` does not exist, then it is
created, and the rollout is a new rollout. If `foo` doesn't exist, then it is
assumed that the rollout is nearly completed, and `foo-next` is renamed to
`foo`. Details of the execution flow are given below.


### Aborting a rollout

@ -82,22 +100,28 @@ This is really just semantic sugar for:

`kubectl rolling-update foo-v2 foo`

With the added detail that it moves the `desired-replicas` annotation from
`foo-v2` to `foo`.


### Execution Details

For the purposes of this example, assume that we are rolling from `foo` to
`foo-next` where the only change is an image update from `v1` to `v2`.

If the user doesn't specify a `foo-next` name, then it is discovered from the
`update-partner` annotation on `foo`; if that annotation doesn't exist,
`foo-next` is synthesized using the pattern
`<controller-name>-<hash-of-next-controller-JSON>`.
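As a concrete illustration of that naming pattern, something along the lines of
the sketch below would work; the particular hash function and truncation are
assumptions, not necessarily what `kubectl` uses.

```go
import (
	"crypto/sha256"
	"fmt"
)

// nextControllerName synthesizes a name of the form
// <controller-name>-<hash-of-next-controller-JSON>.
func nextControllerName(controllerName string, nextControllerJSON []byte) string {
	sum := sha256.Sum256(nextControllerJSON)
	return fmt.Sprintf("%s-%x", controllerName, sum[:8])
}
```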
#### Initialization

* If `foo` and `foo-next` do not exist:
  * Exit, and indicate an error to the user, that the specified controller
doesn't exist.
* If `foo` exists, but `foo-next` does not:
  * Create `foo-next`, populate it with the `v2` image, and set
`desired-replicas` to `foo.Spec.Replicas`
  * Goto Rollout
* If `foo-next` exists, but `foo` does not:
  * Assume that we are in the rename phase.
@ -105,7 +129,8 @@ then `foo-next` is synthesized using the pattern `<controller-name>-<hash-of-nex
* If both `foo` and `foo-next` exist:
  * Assume that we are in a partial rollout
  * If `foo-next` is missing the `desired-replicas` annotation
    * Populate the `desired-replicas` annotation to `foo-next` using the
current size of `foo`
  * Goto Rollout

#### Rollout

@ -125,11 +150,13 @@ then `foo-next` is synthesized using the pattern `<controller-name>-<hash-of-nex
#### Abort

* If `foo-next` doesn't exist
  * Exit and indicate to the user that they may want to simply do a new
rollout with the old version
* If `foo` doesn't exist
  * Exit and indicate not found to the user
* Otherwise, `foo-next` and `foo` both exist
  * Set `desired-replicas` annotation on `foo` to match the annotation on
`foo-next`
  * Goto Rollout with `foo` and `foo-next` trading places.


@ -36,75 +36,85 @@ Documentation for other releases can be found at

## Introduction

This document describes *taints* and *tolerations*, which constitute a generic
mechanism for restricting the set of pods that can use a node. We also describe
one concrete use case for the mechanism, namely to limit the set of users (or
more generally, authorization domains) who can access a set of nodes (a feature
we call *dedicated nodes*). There are many other uses--for example, a set of
nodes with a particular piece of hardware could be reserved for pods that
require that hardware, or a node could be marked as unschedulable when it is
being drained before shutdown, or a node could trigger evictions when it
experiences hardware or software problems or abnormal node configurations; see
issues #17190 and #3885 for more discussion.

## Taints, tolerations, and dedicated nodes

A *taint* is a new type that is part of the `NodeSpec`; when present, it
prevents pods from scheduling onto the node unless the pod *tolerates* the taint
(tolerations are listed in the `PodSpec`). Note that there are actually multiple
flavors of taints: taints that prevent scheduling on a node, taints that cause
the scheduler to try to avoid scheduling on a node but do not prevent it, taints
that prevent a pod from starting on Kubelet even if the pod's `NodeName` was
written directly (i.e. the pod did not go through the scheduler), and taints
that evict already-running pods.
[This comment](https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146002375)
has more background on these different scenarios. We will focus on the first
kind of taint in this doc, since it is the kind required for the "dedicated
nodes" use case.

Implementing dedicated nodes using taints and tolerations is straightforward: in
essence, a node that is dedicated to group A gets taint `dedicated=A` and the
pods belonging to group A get toleration `dedicated=A`. (The exact syntax and
semantics of taints and tolerations are described later in this doc.) This keeps
all pods except those belonging to group A off of the nodes. This approach
easily generalizes to pods that are allowed to schedule into multiple dedicated
node groups, and nodes that are a member of multiple dedicated node groups.
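To make the dedicated-nodes example concrete, the sketch below shows matching
taint and toleration values using the `Taint` type from the API section later in
this doc. The `Toleration` type and the `"NoSchedule"` effect value are
assumptions here, since their definitions fall in elided portions of the
document.

```go
// Toleration is a stand-in; the real definition appears alongside Taint in the
// API section (elided here).
type Toleration struct {
	Key    string
	Value  string
	Effect TaintEffect
}

var (
	// Placed in the NodeSpec of every node dedicated to group A.
	dedicatedToGroupA = Taint{Key: "dedicated", Value: "A", Effect: "NoSchedule"}

	// Added to the PodSpec of group A's pods, typically by an admission controller.
	toleratesGroupA = Toleration{Key: "dedicated", Value: "A", Effect: "NoSchedule"}
)
```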
Note that because tolerations are at the granularity of pods,
|
Note that because tolerations are at the granularity of pods, the mechanism is
|
||||||
the mechanism is very flexible -- any policy can be used to determine which tolerations
|
very flexible -- any policy can be used to determine which tolerations should be
|
||||||
should be placed on a pod. So the "group A" mentioned above could be all pods from a
|
placed on a pod. So the "group A" mentioned above could be all pods from a
|
||||||
particular namespace or set of namespaces, or all pods with some other arbitrary characteristic
|
particular namespace or set of namespaces, or all pods with some other arbitrary
|
||||||
in common. We expect that any real-world usage of taints and tolerations will employ an admission controller
|
characteristic in common. We expect that any real-world usage of taints and
|
||||||
to apply the tolerations. For example, to give all pods from namespace A access to dedicated
|
tolerations will employ an admission controller to apply the tolerations. For
|
||||||
node group A, an admission controller would add the corresponding toleration to all
|
example, to give all pods from namespace A access to dedicated node group A, an
|
||||||
admission controller would add the corresponding toleration to all pods from
namespace A. Or to give all pods that require GPUs access to GPU nodes, an
admission controller would add the toleration for GPU taints to pods that
request the GPU resource.

Everything that can be expressed using taints and tolerations can be expressed
using [node affinity](https://github.com/kubernetes/kubernetes/pull/18261), e.g.
in the example in the previous paragraph, you could put a label `dedicated=A` on
the set of dedicated nodes and a node affinity `dedicated NotIn A` on all pods
*not* belonging to group A. But it is cumbersome to express exclusion policies
using node affinity because every time you add a new type of restricted node,
all pods that aren't allowed to use those nodes need to start avoiding those
nodes using node affinity. This means the node affinity list can get quite long
in clusters with lots of different groups of special nodes (lots of dedicated
node groups, lots of different kinds of special hardware, etc.). Moreover, you
also need to update any Pending pods when you add new types of special nodes.
In contrast, with taints and tolerations, when you add a new type of special
node, "regular" pods are unaffected, and you just need to add the necessary
toleration to the pods you subsequently create that need to use the new type of
special nodes. To put it another way, with taints and tolerations, only pods
that use a set of special nodes need to know about those special nodes; with
the node affinity approach, pods that have no interest in those special nodes
need to know about all of the groups of special nodes.

One final comment: in practice, it is often desirable to not only keep "regular"
pods off of special nodes, but also to keep "special" pods off of regular nodes.
An example in the dedicated nodes case is to not only keep regular users off of
dedicated nodes, but also to keep dedicated users off of non-dedicated (shared)
nodes. In this case, the "non-dedicated" nodes can be modeled as their own
dedicated node group (for example, tainted as `dedicated=shared`), and pods that
are not given access to any dedicated nodes ("regular" pods) would be given a
toleration for `dedicated=shared`. (As mentioned earlier, we expect tolerations
will be added by an admission controller.) In this case taints/tolerations are
still better than node affinity because with taints/tolerations each pod only
needs one special "marking", versus in the node affinity case where every time
you add a dedicated node group (i.e. a new `dedicated=` value), you need to add
a new node affinity rule to all pods (including pending pods) except the ones
allowed to use that new dedicated node group.

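To make the dedicated/shared modeling above concrete, here is a minimal sketch
using the `Taint` and `Toleration` types proposed in the API section below. The
`dedicated=A` and `dedicated=shared` key/values are just the illustrative names
from this discussion, and the helper function is hypothetical, not part of the
proposal.

```go
// dedicatedNodeExample returns the taints that would be placed on nodes in
// dedicated group "A" and on the shared pool, together with the tolerations an
// admission controller might add to pods in namespace A and to "regular" pods.
func dedicatedNodeExample() (taints []Taint, namespaceATol, regularTol Toleration) {
    taints = []Taint{
        // Nodes dedicated to group A.
        {Key: "dedicated", Value: "A", Effect: TaintEffectNoScheduleNoAdmitNoExecute},
        // Shared (non-dedicated) nodes, modeled as their own dedicated group.
        {Key: "dedicated", Value: "shared", Effect: TaintEffectNoScheduleNoAdmitNoExecute},
    }
    // Pods from namespace A are allowed onto group A's dedicated nodes...
    namespaceATol = Toleration{
        Key:      "dedicated",
        Operator: TolerationOpEqual,
        Value:    "A",
        Effect:   TaintEffectNoScheduleNoAdmitNoExecute,
    }
    // ...while "regular" pods only tolerate the shared pool.
    regularTol = Toleration{
        Key:      "dedicated",
        Operator: TolerationOpEqual,
        Value:    "shared",
        Effect:   TaintEffectNoScheduleNoAdmitNoExecute,
    }
    return taints, namespaceATol, regularTol
}
```
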
## API

@ -112,56 +122,56 @@ except the ones allowed to use that new dedicated node group.
// The node this Taint is attached to has the effect "effect" on
// any pod that does not tolerate the Taint.
type Taint struct {
    Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
    Value string `json:"value,omitempty"`
    Effect TaintEffect `json:"effect"`
}

type TaintEffect string

const (
    // Do not allow new pods to schedule unless they tolerate the taint,
    // but allow all pods submitted to Kubelet without going through the scheduler
    // to start, and allow all already-running pods to continue running.
    // Enforced by the scheduler.
    TaintEffectNoSchedule TaintEffect = "NoSchedule"
    // Like TaintEffectNoSchedule, but the scheduler tries not to schedule
    // new pods onto the node, rather than prohibiting new pods from scheduling
    // onto the node. Enforced by the scheduler.
    TaintEffectPreferNoSchedule TaintEffect = "PreferNoSchedule"
    // Do not allow new pods to schedule unless they tolerate the taint,
    // do not allow pods to start on Kubelet unless they tolerate the taint,
    // but allow all already-running pods to continue running.
    // Enforced by the scheduler and Kubelet.
    TaintEffectNoScheduleNoAdmit TaintEffect = "NoScheduleNoAdmit"
    // Do not allow new pods to schedule unless they tolerate the taint,
    // do not allow pods to start on Kubelet unless they tolerate the taint,
    // and try to eventually evict any already-running pods that do not tolerate the taint.
    // Enforced by the scheduler and Kubelet.
    TaintEffectNoScheduleNoAdmitNoExecute TaintEffect = "NoScheduleNoAdmitNoExecute"
)

// The pod this Toleration is attached to tolerates any taint that matches
// the triple <key,value,effect> using the matching operator <operator>.
type Toleration struct {
    Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
    // operator represents a key's relationship to the value.
    // Valid operators are Exists and Equal. Defaults to Equal.
    // Exists is equivalent to wildcard for value, so that a pod can
    // tolerate all taints of a particular category.
    Operator TolerationOperator `json:"operator"`
    Value string `json:"value,omitempty"`
    Effect TaintEffect `json:"effect"`
    // TODO: For forgiveness (#1574), we'd eventually add at least a grace period
    // here, and possibly an occurrence threshold and period.
}

// A toleration operator is the set of operators that can be used in a toleration.
type TolerationOperator string

const (
    TolerationOpExists TolerationOperator = "Exists"
    TolerationOpEqual TolerationOperator = "Equal"
)

```
@ -169,18 +179,17 @@ const (
(See [this comment](https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146002375)
to understand the motivation for the various taint effects.)

We will add:

```go
// Multiple tolerations with the same key are allowed.
Tolerations []Toleration `json:"tolerations,omitempty"`
```

to `PodSpec`. A pod must tolerate all of a node's taints (except taints of type
TaintEffectPreferNoSchedule) in order to be able to schedule onto that node.

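To clarify the matching semantics, here is a minimal sketch of how the
scheduler's check could look. The function names are hypothetical and this is
not the proposed API, just an illustration of "tolerate all taints except
`TaintEffectPreferNoSchedule`" and of the `Exists`/`Equal` operator behavior
described above.

```go
// tolerationMatches reports whether a single toleration tolerates a taint:
// key and effect must match, and Exists acts as a wildcard for the value.
func tolerationMatches(tol Toleration, taint Taint) bool {
    if tol.Key != taint.Key || tol.Effect != taint.Effect {
        return false
    }
    switch tol.Operator {
    case TolerationOpExists:
        return true
    case TolerationOpEqual, "": // Equal is the default operator.
        return tol.Value == taint.Value
    default:
        return false
    }
}

// podToleratesNodeTaints reports whether a pod's tolerations cover every taint
// on a node, ignoring PreferNoSchedule taints, which only influence the
// priority function rather than filtering the node out.
func podToleratesNodeTaints(tolerations []Toleration, taints []Taint) bool {
    for _, taint := range taints {
        if taint.Effect == TaintEffectPreferNoSchedule {
            continue
        }
        tolerated := false
        for _, tol := range tolerations {
            if tolerationMatches(tol, taint) {
                tolerated = true
                break
            }
        }
        if !tolerated {
            return false
        }
    }
    return true
}
```
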
We will add:

```go
// Multiple taints with the same key are not allowed.
@ -201,30 +210,32 @@ Taints and tolerations are not scoped to namespace.
Using taints and tolerations to implement dedicated nodes requires these steps:

1. Add the API described above
1. Add a scheduler predicate function that respects taints and tolerations (for
TaintEffectNoSchedule) and a scheduler priority function that respects taints
and tolerations (for TaintEffectPreferNoSchedule).
1. Add code to the Kubelet to implement the "no admit" behavior of
TaintEffectNoScheduleNoAdmit and TaintEffectNoScheduleNoAdmitNoExecute.
1. Implement code in Kubelet that evicts a pod that no longer satisfies
TaintEffectNoScheduleNoAdmitNoExecute. In theory we could do this in the
controllers instead, but since taints might be used to enforce security
policies, it is better to do it in kubelet because kubelet can respond quickly
and can guarantee the rules will be applied to all pods. Eviction may need to
happen under a variety of circumstances: when a taint is added, when an
existing taint is updated, when a toleration is removed from a pod, or when a
toleration is modified on a pod.
1. Add a new `kubectl` command that adds/removes taints to/from nodes.
1. (This is the one step that is specific to dedicated nodes) Implement an
admission controller that adds tolerations to pods that are supposed to be
allowed to use dedicated nodes (for example, based on pod's namespace); a rough
sketch of such a controller follows this list.

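As a rough illustration of that last step (hypothetical names, not an actual
admission plugin), an admission controller for dedicated nodes could mutate
incoming pods along these lines, given a configured mapping from namespace to
the toleration that namespace is entitled to:

```go
// dedicatedNodeAdmission adds the toleration a namespace is entitled to (if
// any) to every pod created in that namespace. The namespace-to-toleration
// mapping would come from the admission controller's configuration.
type dedicatedNodeAdmission struct {
    tolerationsByNamespace map[string]Toleration
}

// Admit mutates the pod's tolerations; in a real admission controller this
// would be called from the plugin's admit hook for pod creation requests.
func (d *dedicatedNodeAdmission) Admit(namespace string, tolerations *[]Toleration) {
    tol, ok := d.tolerationsByNamespace[namespace]
    if !ok {
        return // namespace has no dedicated nodes; leave the pod unchanged
    }
    // Avoid adding a duplicate toleration for the same key/value/effect.
    for _, existing := range *tolerations {
        if existing.Key == tol.Key && existing.Value == tol.Value && existing.Effect == tol.Effect {
            return
        }
    }
    *tolerations = append(*tolerations, tol)
}
```
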
In the future one can imagine a generic policy configuration that configures an
admission controller to apply the appropriate tolerations to the desired class
of pods and taints to Nodes upon node creation. It could be used not just for
policies about dedicated nodes, but also other uses of taints and tolerations,
e.g. nodes that are restricted due to their hardware configuration.

The `kubectl` command to add and remove taints on nodes will be modeled after
`kubectl label`. Example usages:

```sh
# Update node 'foo' with a taint with key 'dedicated' and value 'special-user' and effect 'NoScheduleNoAdmitNoExecute'.
@ -258,36 +269,41 @@ to enumerate them by name.

## Future work

At present, the Kubernetes security model allows any user to add and remove any
taints and tolerations. Obviously this makes it impossible to securely enforce
rules like dedicated nodes. We need some mechanism that prevents regular users
from mutating the `Taints` field of `NodeSpec` (probably we want to prevent them
from mutating any fields of `NodeSpec`) and from mutating the `Tolerations`
field of their pods. #17549 is relevant.

Another security vulnerability arises if nodes are added to the cluster before
receiving their taint. Thus we need to ensure that a new node does not become
"Ready" until it has been configured with its taints. One way to do this is to
have an admission controller that adds the taint whenever a Node object is
created.

A quota policy may want to treat nodes differently based on what taints, if any,
they have. For example, if a particular namespace is only allowed to access
dedicated nodes, then it may be convenient to give the namespace unlimited
quota. (To use finite quota, you'd have to size the namespace's quota to the sum
of the sizes of the machines in the dedicated node group, and update it when
nodes are added/removed to/from the group.)

It's conceivable that taints and tolerations could be unified with
[pod anti-affinity](https://github.com/kubernetes/kubernetes/pull/18265).
We have chosen not to do this for the reasons described in the "Future work"
section of that doc.

## Backward compatibility

Old scheduler versions will ignore taints and tolerations. New scheduler
versions will respect them.

Users should not start using taints and tolerations until the full
implementation has been in Kubelet and the master for enough binary versions
that we feel comfortable that we will not need to roll back either Kubelet or
master to a version that does not support them. Longer-term we will use a
programmatic approach to enforcing this (#4855).

## Related issues

@ -38,7 +38,9 @@ Reference: [Semantic Versioning](http://semver.org)

Legend:

* **Kube X.Y.Z** refers to the version (git tag) of Kubernetes that is released.
This versions all components: apiserver, kubelet, kubectl, etc. (**X** is the
major version, **Y** is the minor version, and **Z** is the patch version.)
* **API vX[betaY]** refers to the version of the HTTP API.

## Release versioning
@ -46,43 +48,76 @@ Legend:
### Minor version scheme and timeline

* Kube X.Y.0-alpha.W, W > 0 (Branch: master)
  * Alpha releases are released roughly every two weeks directly from the
  master branch.
  * No cherrypick releases. If there is a critical bugfix, a new release from
  master can be created ahead of schedule.
* Kube X.Y.Z-beta.W (Branch: release-X.Y)
  * When master is feature-complete for Kube X.Y, we will cut the release-X.Y
  branch 2 weeks prior to the desired X.Y.0 date and cherrypick only PRs
  essential to X.Y.
  * This cut will be marked as X.Y.0-beta.0, and master will be revved to
  X.Y+1.0-alpha.0.
  * If we're not satisfied with X.Y.0-beta.0, we'll release other beta releases,
  (X.Y.0-beta.W | W > 0) as necessary.
* Kube X.Y.0 (Branch: release-X.Y)
  * Final release, cut from the release-X.Y branch cut two weeks prior.
  * X.Y.1-beta.0 will be tagged at the same commit on the same branch.
  * X.Y.0 occurs 3 to 4 months after X.(Y-1).0.
* Kube X.Y.Z, Z > 0 (Branch: release-X.Y)
  * [Patch releases](#patch-releases) are released as we cherrypick commits into
  the release-X.Y branch (which is at X.Y.Z-beta.W) as needed.
  * X.Y.Z is cut straight from the release-X.Y branch, and X.Y.Z+1-beta.0 is
  tagged on the followup commit that updates pkg/version/base.go with the beta
  version.
* Kube X.Y.Z, Z > 0 (Branch: release-X.Y.Z)
  * These are special and different in that the X.Y.Z tag is branched to isolate
  the emergency/critical fix from all other changes that have landed on the
  release branch since the previous tag.
  * Cut release-X.Y.Z branch to hold the isolated patch release
  * Tag release-X.Y.Z branch + fixes with X.Y.(Z+1)
  * Branched [patch releases](#patch-releases) are rarely needed but used for
  emergency/critical fixes to the latest release
  * See [#19849](https://issues.k8s.io/19849) tracking the work that is needed
  for this kind of release to be possible.

### Major version timeline

There is no mandated timeline for major versions. They only occur when we need
to start the clock on deprecating features. A given major version should be the
latest major version for at least one year from its original release date.

### CI and dev version scheme

* Continuous integration versions also exist, and are versioned off of alpha and
beta releases. X.Y.Z-alpha.W.C+aaaa is C commits after X.Y.Z-alpha.W, with an
additional +aaaa build suffix added; X.Y.Z-beta.W.C+bbbb is C commits after
X.Y.Z-beta.W, with an additional +bbbb build suffix added. Furthermore, builds
that are built off of a dirty build tree (during development, with things in
the tree that are not checked in) will have -dirty appended to the version.

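For illustration only, the CI/dev version strings described above could be
matched with a regular expression along these lines. This is a sketch to make
the format concrete, not the project's actual release tooling, and the pattern
is an assumption based solely on the description in the previous bullet.

```go
package main

import (
    "fmt"
    "regexp"
)

// ciVersion matches the CI/dev version scheme described above:
// X.Y.Z-alpha.W or X.Y.Z-beta.W, optionally followed by ".C+<build>" for a
// build C commits after the tag, and "-dirty" for builds from a dirty tree.
var ciVersion = regexp.MustCompile(
    `^\d+\.\d+\.\d+-(alpha|beta)\.\d+(\.\d+\+[0-9a-zA-Z]+)?(-dirty)?$`)

func main() {
    for _, v := range []string{
        "1.2.0-alpha.3",               // alpha release cut from master
        "1.2.0-alpha.3.45+c0ffee",     // CI build 45 commits after the alpha
        "1.2.1-beta.0.7+abc123-dirty", // CI build from a dirty working tree
    } {
        fmt.Printf("%-32s %v\n", v, ciVersion.MatchString(v))
    }
}
```
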
### Supported releases

We expect users to stay reasonably up-to-date with the versions of Kubernetes
they use in production, but understand that it may take time to upgrade.

We expect users to be running approximately the latest patch release of a given
minor release; we often include critical bug fixes in
[patch releases](#patch-release), and so encourage users to upgrade as soon as
possible. Furthermore, we expect to "support" three minor releases at a time.
"Support" means we expect users to be running that version in production, though
we may not port fixes back before the latest minor version. For example, when
v1.3 comes out, v1.0 will no longer be supported: basically, that means that the
reasonable response to the question "my v1.0 cluster isn't working" is "you
should probably upgrade it (and probably should have some time ago)". With
minor releases happening approximately every three months, that means a minor
release is supported for approximately nine months.

This does *not* mean that we expect to introduce breaking changes between v1.0
and v1.3, but it does mean that we probably won't have reasonable confidence in
clusters where some components are running at v1.0 and others running at v1.3.

This policy is in line with
[GKE's supported upgrades policy](https://cloud.google.com/container-engine/docs/clusters/upgrade).

## API versioning

@ -91,33 +126,74 @@ This policy is in line with [GKE's supported upgrades policy](https://cloud.goog
Here is an example major release cycle:

* **Kube 1.0 should have API v1 without v1beta\* API versions**
  * The last version of Kube before 1.0 (e.g. 0.14 or whatever it is) will have
  the stable v1 API. This enables you to migrate all your objects off of the
  beta versions of the API and allows us to remove those beta API versions in
  Kube 1.0 with no effect. There will be tooling to help you detect and migrate
  any v1beta\* data versions or calls to v1 before you do the upgrade.
* **Kube 1.x may have API v2beta***
  * The first incarnation of a new (backwards-incompatible) API in HEAD is
  v2beta1. By default this will be unregistered in apiserver, so it can change
  freely. Once it is available by default in apiserver (which may not happen for
  several minor releases), it cannot change ever again because we serialize
  objects in versioned form, and we always need to be able to deserialize any
  objects that are saved in etcd, even between alpha versions. If further
  changes to v2beta1 need to be made, v2beta2 is created, and so on, in
  subsequent 1.x versions.
* **Kube 1.y (where y is the last version of the 1.x series) must have final
API v2**
  * Before Kube 2.0 is cut, API v2 must be released in 1.x. This enables two
  things: (1) users can upgrade to API v2 when running Kube 1.x and then switch
  over to Kube 2.x transparently, and (2) in the Kube 2.0 release itself we can
  clean up and remove all API v2beta\* versions because no one should have
  v2beta\* objects left in their database. As mentioned above, tooling will
  exist to make sure there are no calls or references to a given API version
  anywhere inside someone's kube installation before someone upgrades.
  * Kube 2.0 must include the v1 API, but Kube 3.0 must include the v2 API only.
  It *may* include the v1 API as well if the burden is not high - this will be
  determined on a per-major-version basis.

#### Rationale for API v2 being complete before v2.0's release

It may seem a bit strange to complete the v2 API before v2.0 is released,
but *adding* a v2 API is not a breaking change. *Removing* the v2beta\*
APIs *is* a breaking change, which is what necessitates the major version bump.
There are other ways to do this, but having the major release be the fresh start
of that release's API without the baggage of its beta versions seems most
intuitive out of the available options.

## Patch releases

Patch releases are intended for critical bug fixes to the latest minor version,
such as addressing security vulnerabilities, fixes to problems affecting a large
number of users, severe problems with no workaround, and blockers for products
based on Kubernetes.

They should not contain miscellaneous feature additions or improvements, and
especially no incompatibilities should be introduced between patch versions of
the same minor version (or even major version).

Dependencies, such as Docker or Etcd, should also not be changed unless
absolutely necessary, and also just to fix critical bugs (so, at most patch
version changes, not new major nor minor versions).

## Upgrades

* Users can upgrade from any Kube 1.x release to any other Kube 1.x release as a
rolling upgrade across their cluster. (Rolling upgrade means being able to
upgrade the master first, then one node at a time. See #4855 for details.)
  * However, we do not recommend upgrading more than two minor releases at a
  time (see [Supported releases](#supported-releases)), and do not recommend
  running non-latest patch releases of a given minor release.
* No hard breaking changes over version boundaries.
  * For example, if a user is at Kube 1.x, we may require them to upgrade to
  Kube 1.x+y before upgrading to Kube 2.x. In other words, an upgrade across
  major versions (e.g. Kube 1.x to Kube 2.x) should effectively be a no-op and
  as graceful as an upgrade from Kube 1.x to Kube 1.x+1. But you can require
  someone to go from 1.x to 1.x+y before they go to 2.x.

There is a separate question of how to track the capabilities of a kubelet to
facilitate rolling upgrades. That is not addressed here.


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->