Automatic merge from submit-queue
Updaing QoS policy to be at the pod level
Quality of Service will be derived from an entire Pod Spec, instead of being derived from resource specifications of individual resources per-container.
A Pod is `Guaranteed` iff all its containers have limits == requests for all the first-class resources (cpu, memory as of now).
A Pod is `BestEffort` iff requests & limits are not specified for any resource across all containers.
A Pod is `Burstable` otherwise.
Note: Existing pods might be more susceptible to OOM Kills on the node due to this PR! To protect pods from being OOM killed on the node, set `limits` for all resources across all containers in a pod.
<!-- Reviewable:start -->
---
This change is [<img src="http://reviewable.k8s.io/review_button.svg" height="35" align="absmiddle" alt="Reviewable"/>](http://reviewable.k8s.io/reviews/kubernetes/kubernetes/14943)
<!-- Reviewable:end -->
Automatic merge from submit-queue
add CIDR allocator for NodeController
This PR:
* use pkg/controller/framework to watch nodes and reduce lists when allocate CIDR for node
* decouple the cidr allocation logic from monitoring status logic
<!-- Reviewable:start -->
---
This change is [<img src="http://reviewable.k8s.io/review_button.svg" height="35" align="absmiddle" alt="Reviewable"/>](http://reviewable.k8s.io/reviews/kubernetes/kubernetes/19242)
<!-- Reviewable:end -->
Automatic merge from submit-queue
Add 'kubectl set image'
```release-note
Add "kubectl set image" for easier updating container images (for pods or resources with pod templates).
```
**Usage:**
```
kubectl set image (-f FILENAME | TYPE NAME) CONTAINER_NAME_1=CONTAINER_IMAGE_1 ... CONTAINER_NAME_N=CONTAINER_IMAGE_N
```
**Example:**
```console
# Set a deployment's nginx container image to 'nginx:1.9.1', and its busybox container image to 'busybox'.
$ kubectl set image deployment/nginx busybox=busybox nginx=nginx:1.9.1
# Update all deployments' nginx container's image to 'nginx:1.9.1'
$ kubectl set image deployments nginx=nginx:1.9.1 --all
# Update image of all containers of daemonset abc to 'nginx:1.9.1'
$ kubectl set image daemonset abc *=nginx:1.9.1
# Print result (in yaml format) of updating nginx container image from local file, without hitting the server
$ kubectl set image -f path/to/file.yaml nginx=nginx:1.9.1 --local -o yaml
```
I abandoned the `--container=xxx --image=xxx` flags in the [deploy proposal](https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/deploy.md#kubectl-set) since it's much easier to use with just KEY=VALUE (CONTAINER_NAME=CONTAINER_IMAGE) pairs.
Ref #21648
@kubernetes/kubectl @bgrant0607 @kubernetes/sig-config
[]()
Automatic merge from submit-queue
kubelet: Don't attempt to apply the oom score if container exited already
Containers could terminate before kubelet applies the oom score. This is normal
and the function should not error out.
This addresses #25844 partially.
/cc @smarterclayton @Random-Liu
Automatic merge from submit-queue
Fixes panic on round tripper when TLS under a proxy
When under a proxy with a valid cert from a trusted authority, the `SpdyRoundTripper` will likely not have a `*tls.Config` (no cert verification nor `InsecureSkipVerify` happened), which will result in a panic. So we have to create a new `*tls.Config` to be able to create a TLS client right after. If `RootCAs` in that new config is nil, the system pool will be used.
@ncdc PTAL
[]()
Automatic merge from submit-queue
NodeController doesn't evict Pods if no Nodes are Ready
Fix#13412#24597
When NodeControllers don't see any Ready Node it goes into "network segmentation mode". In this mode it cancels all evictions and don't evict any Pods.
It leaves network segmentation mode when it sees at least one Ready Node. When leaving it resets all timers, so each Node has full grace period to reconnect to the cluster.
cc @lavalamp @davidopp @mml @wojtek-t @fgrzadkowski
The binder sorts all available volumes first, then it filters out volumes
that cannot be bound by processing each volume in a loop and then finds
the smallest matching volume by binary search.
So, if we process every available volume in a loop, we can also remember the
smallest matching one and save us potentially long sorting (and quick binary
search).
Check docker's pid file, then fallback to pidof when trying to determine the pid for docker. The
latest docker RPM for RHEL changes /usr/bin/docker from an executable to a shell script (to support
/usr/bin/docker-current and /usr/bin/docker-latest). The pidof check for docker fails in this case,
so we check /var/run/docker.pid first (the default location), and fallback to pidof if that fails.
When the controller binds a PV to PVC, it saves both objects to etcd.
However, there is still an old version of these objects in the controller
Informer cache. So, when a new PVC comes, the PV is still seen as available
and may get bound to the new PVC. This will be blocked by etcd, still, it
creates unnecessary traffic that slows everything down.
Also, we save bound PV/PVC as two transactions - we save PV/PVC.Spec first
and then .Status. The controller gets "PV/PVC.Spec updated" event from etcd
and tries to fix the Status, as it seems to the controller it's outdated.
This write again fails - there already is a correct version in etcd.
We can't influence the Informer cache, it is read-only to the controller.
To prevent these useless writes to etcd, this patch introduces second cache
in the controller, which holds latest and greatest version on PVs and PVCs.
It gets updated with events from etcd *and* after etcd confirms successful
save of PV/PVC modified by the controller.
The cache stores only *pointers* to PVs/PVCs, so in ideal case it shares the
actual object data with the informer cache. They will diverge only when
the controller modifies something and the informer cache did not get update
events yet.
Automatic merge from submit-queue
AWS: Move enforcement of attached AWS device limit from kubelet to scheduler
Limit of nr. of attached EBS volumes to a node is now enforced by scheduler. It can be adjusted by `KUBE_MAX_PD_VOLS` env. variable there. Therefore we don't need the same check in kubelet. If the system admin wants to attach more, we should allow it.
Kubelet limit is now 650 attached volumes ('ba'..'zz').
Note that the scheduler counts only *pods* assigned to a node. When a pod is deleted and a new pod is scheduled on a node, kubelet start (slowly) detaching the old volume and (slowly) attaching the new volume. Depending on AWS speed **it may happen that more than KUBE_MAX_PD_VOLS volumes are actually attached to a node for some time!** Kubelet will clean it up in few seconds / minutes (both attach/detach is quite slow).
Fixes#22994
Automatic merge from submit-queue
kube-apiserver options should be decoupled from impls
A few months ago we refactored options to keep it independent of the
implementations, so that it could be used in CLI tools to validate
config or to generate config, without pulling in the full dependency
tree of the master. This change restores that by separating
server_run_options.go back to its own package.
Also, options structs should never contain non-serializable types, which
storagebackend.Config was doing with runtime.Codec. Split the codec out.
Fix a typo on the name of the etcd2.go storage backend.
Finally, move DefaultStorageMediaType to server_run_options.
@nikhiljindal as per my comment in #24454, @liggitt because you and I
discussed this last time