This function does not actually attempt to connect to the docker daemon,
it just creates a client object that can be used to do so later. The old
name was confusing, as it implied that a failure to touch the docker daemon
could cause program termination (rather than just a failure to create the
client).
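For illustration, a minimal sketch using go-dockerclient (assumed here purely as an example client library) of the distinction: constructing the client is pure object creation, and the daemon is first touched by a later call such as Ping.
```go
package main

import (
	"log"

	docker "github.com/fsouza/go-dockerclient"
)

func main() {
	// NewClient only constructs the client object; it performs no I/O,
	// so it succeeds even when no docker daemon is running.
	client, err := docker.NewClient("unix:///var/run/docker.sock")
	if err != nil {
		log.Fatalf("invalid client configuration: %v", err) // e.g. a malformed endpoint
	}

	// Connectivity is only exercised by the first real call, such as Ping.
	if err := client.Ping(); err != nil {
		log.Printf("docker daemon unreachable: %v", err)
	}
}
```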
Automatic merge from submit-queue
Fix hang/websocket timeout when streaming container log with no content
When streaming and following a container log, no response headers are sent from the kubelet `containerLogs` endpoint until the first byte of content is written to the log. This propagates back to the API server, which also will not send response headers until it gets response headers from the kubelet. That includes upgrade headers, which means a websocket connection upgrade is not performed and can time out.
To recreate, create a busybox pod that runs `/bin/sh -c 'sleep 30 && echo foo && sleep 10'`
As soon as the pod starts, query the kubelet API:
```
curl -N -k -v 'https://<node>:10250/containerLogs/<ns>/<pod>/<container>?follow=true&limitBytes=100'
```
or the master API:
```
curl -N -k -v 'http://<master>:8080/api/v1/<ns>/pods/<pod>/log?follow=true&limitBytes=100'
```
In both cases, notice that the response headers are not sent until the first byte of log content is available.
This PR:
* does a 0-byte write prior to handing off to the container runtime stream copy. That commits the response header, even if the subsequent copy blocks waiting for the first byte of content from the log (see the sketch after this list).
* fixes a bug with the "ping" frame sent to websocket streams, which was not respecting the requested protocol (it was sending a binary frame to a websocket that requested a base64 text protocol)
* fixes a bug in the limitwriter, which was not propagating 0-length writes, even before the writer's limit was reached
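To illustrate the first bullet, here is a minimal net/http sketch (handler name and shape are hypothetical, not the actual kubelet code). In net/http a zero-length Write triggers an implicit WriteHeader; the sketch makes that explicit and flushes so the header actually reaches the client before the blocking copy begins:
```go
package streaming

import (
	"io"
	"net/http"
)

// serveLogs commits the response header before handing off to a copy
// that may block waiting for log content.
func serveLogs(w http.ResponseWriter, req *http.Request, logStream io.Reader) {
	w.WriteHeader(http.StatusOK)
	// Flush so the header reaches the client (and any proxy in between,
	// e.g. the API server) even though no body bytes exist yet.
	if flusher, ok := w.(http.Flusher); ok {
		flusher.Flush()
	}
	// This may now block on the first byte of log content, but the
	// upgrade/response headers have already been sent.
	io.Copy(w, logStream)
}
```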
Automatic merge from submit-queue
rkt: Force `rkt fetch` to fetch from remote to conform to the image pull policy.
Fix https://github.com/kubernetes/kubernetes/issues/27646
Use `--no-store` option for `rkt fetch` to force it to fetch from remote.
However, `--no-store` will fetch the remote image regardless of whether the content of the image has changed or not.
This causes a performance regression when the image tag is ':latest' and the image pull policy is 'always'.
The issue is tracked in https://github.com/coreos/rkt/issues/2937.
Automatic merge from submit-queue
Filter internal Kubernetes labels from Prometheus metrics
**What this PR does / why we need it**:
Kubernetes uses Docker labels as storage for some internal labels. The
majority of these labels are not meaningful metric labels and a few of
them are even harmful as they're not static and cause wrong aggregation
results.
This change provides a custom labels func to only attach meaningful
labels to cAdvisor exported metrics.
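Roughly, such a labels func could look like the following sketch (a standalone illustration; the real cAdvisor hook and its container-info types differ):
```go
package metrics

// containerPrometheusLabels keeps only meaningful, static labels on
// exported metrics (illustrative sketch, not the actual implementation).
func containerPrometheusLabels(name, image string, dockerLabels map[string]string) map[string]string {
	labels := map[string]string{
		"id":    name,
		"name":  name,
		"image": image,
	}
	// Map the internal Kubernetes Docker labels to short metric labels
	// and drop everything else (restart counts, hashes, etc. are not
	// static and would break aggregation).
	if v, ok := dockerLabels["io.kubernetes.container.name"]; ok {
		labels["container_name"] = v
	}
	if v, ok := dockerLabels["io.kubernetes.pod.name"]; ok {
		labels["pod_name"] = v
	}
	if v, ok := dockerLabels["io.kubernetes.pod.namespace"]; ok {
		labels["namespace"] = v
	}
	return labels
}
```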
**Which issue this PR fixes**
google/cadvisor#1312
**Special notes for your reviewer**:
Depends on google/cadvisor#1429. Once that is merged, I'll update the vendor update commit.
**Release note**:
```release-note
Remove environment variables and internal Kubernetes Docker labels from cAdvisor Prometheus metric labels.
Old behavior:
- environment variables explicitly whitelisted via --docker-env-metadata-whitelist were exported as `container_env_*=*`. The default is empty, so by default none were exported
- all docker labels were exported as `container_label_*=*`
New behavior:
- Only `container_name`, `pod_name`, `namespace`, `id`, `image`, and `name` labels are exposed
- no environment variables will be exposed ever via /metrics, even if whitelisted
```
---
Given that we have full control over the exported label set, I shortened the pod_name, pod_namespace, and container_name label names. Below is an example of the change (reformatted for readability).
```
# BEFORE
container_cpu_cfs_periods_total{
container_label_io_kubernetes_container_hash="5af8c3b4",
container_label_io_kubernetes_container_name="sync",
container_label_io_kubernetes_container_restartCount="1",
container_label_io_kubernetes_container_terminationMessagePath="/dev/termination-log",
container_label_io_kubernetes_pod_name="popularsearches-web-3165456836-2bfey",
container_label_io_kubernetes_pod_namespace="popularsearches",
container_label_io_kubernetes_pod_terminationGracePeriod="30",
container_label_io_kubernetes_pod_uid="6a291e48-47c4-11e6-84a4-c81f66bdf8bd",
id="/docker/68e1f15353921f4d6d4d998fa7293306c4ac828d04d1284e410ddaa75cf8cf25",
image="redacted.com/popularsearches:42-16-ba6bd88",
name="k8s_sync.5af8c3b4_popularsearches-web-3165456836-2bfey_popularsearches_6a291e48-47c4-11e6-84a4-c81f66bdf8bd_c02d3775"
} 72819
# AFTER
container_cpu_cfs_periods_total{
container_name="sync",
pod_name="popularsearches-web-3165456836-2bfey",
namespace="popularsearches",
id="/docker/68e1f15353921f4d6d4d998fa7293306c4ac828d04d1284e410ddaa75cf8cf25",
image="redacted.com/popularsearches:42-16-ba6bd88",
name="k8s_sync.5af8c3b4_popularsearches-web-3165456836-2bfey_popularsearches_6a291e48-47c4-11e6-84a4-c81f66bdf8bd_c02d3775"
} 72819
```
Feedback requested on:
* Label names. Other suggestions? Should we keep these very long ones?
* Do we need to export io.kubernetes.pod.uid? It makes working with the metrics a bit more complicated, and the pod name is already unique at any time (but not over time). The UID is also part of `name`.
As discussed with @timstclair, this should be added to v1.4 as the current labels are harmful.
PTAL @jimmidyson @fabxc @vishh
This refactor removes the legacy KubeletConfig object and adds a new
KubeletDeps object, which contains injected runtime objects and
separates them from static config. It also reduces NewMainKubelet to two
arguments: a KubeletConfiguration and a KubeletDeps.
Some mesos and kubemark code was affected by this change, and has been
modified accordingly.
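An illustrative shape of the split (types reduced to empty placeholders; the real structs carry many more fields):
```go
package kubelet

// KubeletConfiguration holds only static, serializable settings:
// sync intervals, ports, flag-driven options, ...
type KubeletConfiguration struct{}

// KubeletDeps holds only injected runtime objects: clients, cadvisor,
// cloud provider, mounter, volume plugins, ...
type KubeletDeps struct{}

type Kubelet struct{}

// NewMainKubelet now takes exactly two arguments.
func NewMainKubelet(kubeCfg *KubeletConfiguration, kubeDeps *KubeletDeps) (*Kubelet, error) {
	return &Kubelet{}, nil
}
```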
And a few final notes:
KubeletDeps:
KubeletDeps will be a temporary bin for things we might consider
"injected dependencies", until we have a better dependency injection
story for the Kubelet. We will have to discuss this eventually.
RunOnce:
We will likely not pull new KubeletConfiguration from the API server
when in runonce mode, so it doesn't make sense to make this something
that can be configured centrally. We will leave it as a flag-only option
for now. Additionally, it is increasingly looking like nobody actually uses the
Kubelet's runonce mode anymore, so it may be a candidate for deprecation
and removal.
Automatic merge from submit-queue
rkt: Improve support for privileged pods (pods whose containers are all privileged)
Fix https://github.com/kubernetes/kubernetes/issues/31100
This takes advantage of https://github.com/coreos/rkt/pull/2983 by appending the new `--all-run` insecure option to the `rkt run-prepared` command when all the containers are privileged. The pod then gets the privileged power it needs.
Automatic merge from submit-queue
Add sysctl support
Implementation of proposal https://github.com/kubernetes/kubernetes/pull/26057, feature https://github.com/kubernetes/features/issues/34
TODO:
- [x] change types.go
- [x] implement docker and rkt support
- [x] add e2e tests
- [x] decide whether we want apiserver validation
- ~~[ ] add documentation~~: api docs exist. The existing PodSecurityContext docs are very light and link back to the api docs anyway: 6684555ed9/docs/user-guide/security-context.md
- [x] change PodSecurityPolicy in types.go
- [x] write admission controller support for PodSecurityPolicy
- [x] write e2e test for PodSecurityPolicy
- [x] make sure we are compatible in the sense of https://github.com/kubernetes/kubernetes/blob/master/docs/devel/api_changes.md
- [x] test e2e with rkt: it only works with kubenet, not with the no-op network plugin. The latter has no sysctl support.
- ~~[ ] add RunC implementation~~ (~~if that is already in kube,~~ it isn't)
- [x] update whitelist
- [x] switch PSC fields to annotations
- [x] switch PSP fields to annotations
- [x] decide about `--experimental-whitelist-sysctl` flag to be additive or absolute
- [x] decide whether to add a sysctl node whitelist annotation
### Release notes:
```release-note
The pod annotation `security.alpha.kubernetes.io/sysctls` now allows customization of namespaced and well isolated kernel parameters (sysctls), starting with `kernel.shm_rmid_forced`, `net.ipv4.ip_local_port_range`, `net.ipv4.tcp_max_syn_backlog` and `net.ipv4.tcp_syncookies` for Kubernetes 1.4.
The pod annotation `security.alpha.kubernetes.io/unsafeSysctls` allows customization of namespaced sysctls where isolation is unclear. Unsafe sysctls must be enabled at-your-own-risk on the kubelet with the `--experimental-allowed-unsafe-sysctls` flag. Future versions will improve on resource isolation and more sysctls will be considered safe.
```
Automatic merge from submit-queue
Increase request timeout based on termination grace period
When terminationGracePeriodSeconds is set to > 2 minutes (which is
the default request timeout), ContainerStop() times out at 2 minutes.
We should check the timeout being passed in and bump up the
request timeout if needed.
Fixes #31219
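A sketch of the idea (the 30-second buffer is an assumption for illustration, not the exact value used):
```go
package dockertools

import "time"

const defaultRequestTimeout = 2 * time.Minute

// operationTimeout illustrates the fix: the docker client request
// timeout must exceed the pod's termination grace period, otherwise
// ContainerStop() is cut off at the 2-minute default.
func operationTimeout(gracePeriod time.Duration) time.Duration {
	if timeout := gracePeriod + 30*time.Second; timeout > defaultRequestTimeout {
		return timeout
	}
	return defaultRequestTimeout
}
```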
Automatic merge from submit-queue
Kubelet code move: volume / util
Addresses some odds and ends that I apparently missed earlier. Preparation for kubelet code-move ENDGAME.
cc @kubernetes/sig-node
Automatic merge from submit-queue
Kubelet: implement GetNetNS for new runtime api
Kubelet: implement GetNetNS for new runtime api.
CC @yujuhong @thockin @kubernetes/sig-node @kubernetes/sig-rktnetes
Automatic merge from submit-queue
Add kubelet --network-plugin-mtu flag for MTU selection
* Add network-plugin-mtu option which lets us pass down a MTU to a network provider (currently processed by kubenet)
* Add a test, and thus make sysctl testable
When terminationGracePeriodSeconds is set to > 2 minutes (which is
the default request timeout), ContainerStop() times out at 2 minutes.
We should check the timeout being passed in and bump up the
request timeout if needed.
Fixes #31219
Automatic merge from submit-queue
[kubelet test] Improve node status test debug info
I find that glog's `%v` output format doesn't print useful information for an `api.Node` object. The output of this line https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_node_status_test.go#L492
is
```
kubelet_node_status_test.go:491: expected
&TypeMeta{Kind:,APIVersion:,}
, got
&TypeMeta{Kind:,APIVersion:,}
```
- It's difficult for me to tell the difference between expected and got.
- I prefer to use `diff.ObjectDiff(expectedNode, updatedNode)` to output the debug information as it will point out the starting character of the different objects.
I think this line https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_node_status_test.go#L647 can use `diff.ObjectDiff()` as well.
The related issue is #30952
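For example, a test helper along these lines (hypothetical helper name) makes the mismatch visible:
```go
package kubelet

import (
	"reflect"
	"testing"

	"k8s.io/kubernetes/pkg/util/diff"
)

// assertNodesEqual prints a structural diff on mismatch instead of two
// indistinguishable %v dumps.
func assertNodesEqual(t *testing.T, expected, updated interface{}) {
	if !reflect.DeepEqual(expected, updated) {
		t.Errorf("unexpected node status:\n%s", diff.ObjectDiff(expected, updated))
	}
}
```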
MTU selection is difficult, and if there is a transport such as IPSEC in
use may be impossible. So we allow specification of the MTU with the
network-plugin-mtu flag, and we pass this down into the network
provider.
Currently implemented by kubenet.
Automatic merge from submit-queue
Kubelet: pass pod name/namespace/uid in new runtime API
First part of #30463.
Pass pod name/namespace/uid in new runtime API and change dockershim to build unique sandbox/container name based on them.
CC @yujuhong @euank @yifan-gu @kubernetes/sig-node
Automatic merge from submit-queue
remove duplicate code in updatePodCIDR
Since kl.runtimeState.podCIDR() is a synchronized method that must acquire and release a lock, we invoke it only once here.
Kubernetes uses Docker labels as storage for some internal labels. The
majority of these labels are not meaningful metric labels and a few of
them are even harmful as they're not static and cause wrong aggregation
results.
This change provides a custom labels func to only attach meaningful
labels to cAdvisor exported metrics.
Automatic merge from submit-queue
Kubelet: implement GetPods for new runtime API
Implement GetPods for kuberuntime. Part of #28789 .
CC @yujuhong @Random-Liu
Automatic merge from submit-queue
Kubelet: add --container-runtime-endpoint and --image-service-endpoint
Flag `--container-runtime-endpoint` (overrides `--container-runtime`) is introduced to identify the unix socket file of the remote runtime service. And flag `--image-service-endpoint` is introduced to identify the unix socket file of the image service.
This PR is part of #28789 Milestone 0.
CC @yujuhong @Random-Liu
Automatic merge from submit-queue
rkt: Support subPath volume mounts feature
So that at most one volume object will be created for every unique
host path. Also, the volume's name is a randomly generated UUID to avoid
collisions, since the mount point's name passed by the kubelet is not
guaranteed to be unique when 'subPath' is specified.
Should partially fix https://github.com/kubernetes/kubernetes/issues/26986
The non-existing host path creation issue is not touched here.
cc @kubernetes/sig-rktnetes
also cc @kubernetes/sig-node for the Mount name comments I added.
Automatic merge from submit-queue
kubelet status manager: Fix nil in error message due to var shadowing
Variable shadowing can cause this log message to print a nil:
```go
glog.Warningf("Failed to update status for pod %q: %v", format.Pod(pod), err)
```
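A self-contained illustration of the shadowing pitfall (names hypothetical):
```go
package main

import "fmt"

func update() error { return fmt.Errorf("update failed") }

func main() {
	var err error
	if err := update(); err != nil { // ":=" shadows the outer err
		// handle the failure here
	}
	// The outer err is still nil, so a log line referencing it prints
	// "<nil>" instead of the real failure.
	fmt.Printf("Failed to update status: %v\n", err)
}
```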
@kubernetes/rh-cluster-infra
The serviceAccountName is occasionally useful for clients running on
Kube that need to know who they are when talking to other components.
The nodeName is useful for PetSet or DaemonSet pods that need to make
calls back to the API to fetch info about their node.
Both fields are immutable, and cannot easily be retrieved in another
way.
Automatic merge from submit-queue
Fix image inspection and matching
An image string may or may not contain a hostname (e.g., "docker.io"). The same
applies to the RepoTags returned from an image inspection. To determine whether
the image docker pulled matches what the user asked for, we check whether either
string is a suffix of the other.
/cc @dims @dchen1107 @Random-Liu
This fixes #30710
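A sketch of the suffix check (function name illustrative):
```go
package dockertools

import "strings"

// isImageMatch: an image string may or may not carry a hostname prefix
// (e.g. "docker.io/"), so the pulled image and the requested image
// match when either string is a suffix of the other.
func isImageMatch(requested, inspected string) bool {
	return strings.HasSuffix(requested, inspected) || strings.HasSuffix(inspected, requested)
}
```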
Automatic merge from submit-queue
Always return command output for exec probes and kubelet RunInContainer
Always return command output for exec probes and kubelet RunInContainer, even if the command invocation returns nonzero.
When #24921 replaced RunInContainer with ExecInContainer, it introduced a change where an exec probe that failed no longer included the stdout/stderr from the probe in the event. For example, when running at log level 4, you see:
```
I0816 15:01:36.259826 29713 exec.go:38] Exec probe response: "Failed to access the status endpoint : HTTP Error 404: Not Found.\nHawkular metrics has only been running for 7\n seconds not aborting yet.\n"
```
But the event looks like this:
```
54s 22s 5 hawkular-metrics-hjme4 Pod spec.containers{hawkular-metrics} Warning Unhealthy {kubelet corbeau} Readiness probe failed:
```
Note the absence of the exec probe response after "Readiness probe failed". This PR restores the previous behavior.
cc @kubernetes/rh-cluster-infra @mwringe
xref https://github.com/openshift/origin/issues/10424
Automatic merge from submit-queue
Unblock iterative development on pod-level cgroups
In order to allow forward progress on this feature, it takes the commits from #28017 #29049 and then globally disables the flag that allows these features to be exercised in the kubelet. The flag can be re-added to the kubelet when it's actually ready.
/cc @vishh @dubstack @kubernetes/rh-cluster-infra
Automatic merge from submit-queue
dockertools: Don't use network plugin if net=host
I'm pretty sure this was just an oversight the first time around.
Before: `E0815 18:06:17.627468 976 docker_manager.go:350] NetworkPlugin kubenet failed on the status hook for pod 'sleep' - Unexpected command output Device "eth0" does not exist.`
After: No such logline is printed
The pod IP reported in `describe` is the same either way
cc @kubernetes/sig-node
Automatic merge from submit-queue
rkt: Do not error out when there are unrecognized lines in os-release
Also fix the error handling that could cause a panic.
cc @kubernetes/sig-rktnetes
So that at most one volume object will be created for every unique
host path. Also, the volume's name is a randomly generated UUID to avoid
collisions, since the mount point's name passed by the kubelet is not
guaranteed to be unique when 'subPath' is specified.
Automatic merge from submit-queue
kubelet/api: split RuntimeService interface
Splits `RuntimeService` interface into smaller interfaces
to make testing easier and delineate the responsibilities.
It's a non-breaking change for previous users of `api.RuntimeService`.
Automatic merge from submit-queue
Add Events for operation_executor to show status of mounts, failed/successful to show in describe events
Fixes #27590
@saad-ali @pmorie @erinboyd
After talking with @pmorie last week about the above issue, I decided to poke around and see if I could remedy it. The refactoring broke my previously merged UXP PRs that correctly showed failed mount errors in the describe events. I'm not sure I implemented this correctly, but it tested out and seems to be working; let me know what I missed or if this is not the correct approach.
```
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
2m 2m 1 {default-scheduler } Normal Scheduled Successfully assigned nfs-bb-pod1 to 127.0.0.1
44s 44s 1 {kubelet 127.0.0.1} Warning FailedMount Unable to mount volumes for pod "nfs-bb-pod1_default(a94f64f1-37c9-11e6-9aa5-52540073d346)": timeout expired waiting for volumes to attach/mount for pod "nfs-bb-pod1"/"default". list of unattached/unmounted volumes=[nfsvol]
44s 44s 1 {kubelet 127.0.0.1} Warning FailedSync Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "nfs-bb-pod1"/"default". list of unattached/unmounted volumes=[nfsvol]
38s 38s 1 {kubelet } Warning FailedMount Unable to mount volumes for pod "a94f64f1-37c9-11e6-9aa5-52540073d346": Mount failed: exit status 32
Mounting arguments: nfs1.rhs:/opt/data99 /var/lib/kubelet/pods/a94f64f1-37c9-11e6-9aa5-52540073d346/volumes/kubernetes.io~nfs/nfsvol nfs []
Output: mount.nfs: Connection timed out
Resolution hint: Check and make sure the NFS Server exists (ensure that correct IPAddress/Hostname was given) and is available/reachable.
Also make sure firewall ports are open on both client and NFS Server (2049 v4 and 2049, 20048 and 111 for v3).
Use commands telnet <nfs server> <port> and showmount <nfs server> to help test connectivity.
```
New flag --container-runtime-endpoint (overrides --container-runtime)
is introduced to kubelet which identifies the unix socket file of
the remote runtime service. And new flag --image-service-endpoint is
introduced to kubelet which identifies the unix socket file of the
image service.
An image string may or may not contain a hostname (e.g., "docker.io"). The same
applies to the RepoTags returned from an image inspection. To determine whether
the image docker pulled matches what the user asked for, we check whether either
string is a suffix of the other.
Automatic merge from submit-queue
Fix default resource limits (node allocatable) for downward api volumes and env vars
@kubernetes/rh-cluster-infra @pmorie @derekwaynecarr
Automatic merge from submit-queue
Add note: kubelet manages only k8s containers.
The kubelet wrote a log message when accessing a container that was not created by k8s, which could confuse users. That's why we added a note about it in the documentation and lowered the log level of the message to 5.
Here is example of the message:
```
> Apr 19 11:50:32 openshift-114.lab.sjc.redhat.com atomic-openshift-node[9551]:
I0419 11:50:32.194020 9600 docker.go:363]
Docker Container: /tiny_babbage is not managed by kubelet.
```
bug 1328441
Bugzilla link https://bugzilla.redhat.com/show_bug.cgi?id=1328441
Automatic merge from submit-queue
syncNetworkUtil in kubelet and fix loadbalancerSourceRange on GCE
fixes: #29997 #29039
@yujuhong Can you take a look at the kubelet part?
@girishkalele KUBE-MARK-DROP is the chain for dropping connections. Marked connections will be dropped in the INPUT/OUTPUT chains of the filter table. Let me know if this is good enough for your use case.
Automatic merge from submit-queue
Implement AppArmor Kubelet support
Includes PR https://github.com/kubernetes/kubernetes/pull/29812
Implements the Kubelet logic for AppArmor based on the alpha API proposed [here](https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/apparmor.md). Also adds an E2E test, and I ran manual tests.
Remaining work: PodSecurityPolicy support, profile loader daemon, documentation, (maybe) beta API.
/cc @jfrazelle @Amey-D @kubernetes/sig-node
*Note on release-note-none: I am implementing AppArmor over multiple PRs. I will submit a single release note once the implementation is done to cover all of them.*
This was already handled in most places. I think this is the only
remaining instance of it in the docker package.
This could lead to confusing results. E.g. if `networkPlugin` was cni,
it could lead to error logs about not getting network status for host
pods if eth0 didn't exist on the host.
Automatic merge from submit-queue
Fix image verification when hostname is present in image
Deal better with the situation where an image name contains
a hostname as well.
Fixes #30580
Automatic merge from submit-queue
Add volume reconstruct/cleanup logic in kubelet volume manager
Currently kubelet volume management works on the concept of desired
and actual worlds of state. The volume manager periodically compares the
two worlds and performs volume mount/unmount and/or attach/detach
operations. When the kubelet restarts, the caches of those two worlds are
gone. Although the desired world can be recovered through the apiserver,
the actual world cannot, which may leave some volumes uncleaned
if their information is deleted by the apiserver. This change adds
reconstruction of the actual world by reading the pod directories from
disk. The reconstructed volume information is added to both the desired
world and the actual world if it cannot be found in either. The rest of the
logic is the same as before: the desired world populator may clean up
a volume entry if it is no longer in the apiserver, and then the volume
manager will invoke unmount to clean it up.
Fixes https://github.com/kubernetes/kubernetes/issues/27653
Automatic merge from submit-queue
CRI: remove pod sandbox resources
The pod-level resources need further discussion. Remove it from CRI for now.
See the original discussion in #29871
Currently kubelet volume management works on the concept of desired
and actual worlds of state. The volume manager periodically compares the
two worlds and performs volume mount/unmount and/or attach/detach
operations. When the kubelet restarts, the caches of those two worlds are
gone. Although the desired world can be recovered through the apiserver,
the actual world cannot, which may leave some volumes uncleaned
if their information is deleted by the apiserver. This change adds
reconstruction of the actual world by reading the pod directories from
disk. The reconstructed volume information is added to both the desired
world and the actual world if it cannot be found in either. The rest of the
logic is the same as before: the desired world populator may clean up
a volume entry if it is no longer in the apiserver, and then the volume
manager will invoke unmount to clean it up.
Automatic merge from submit-queue
move syncNetworkConfig to Init for cni network plugin
start the syncNetworkConfig routine in `Init` instead of in probing. This fixes a bug where syncNetworkConfig ran periodically even when the `cni` network plugin was not in use.
Automatic merge from submit-queue
Remove resource specifications from CRI until further notice
See #29871 for the discussion issue.
cc @dchen1107 @vishh @yujuhong @euank @yifan-gu @feiskyer
Automatic merge from submit-queue
Validate SHA/Tag when checking docker images
The Docker API does not validate the tag/SHA; for example, all of the following
calls work for an alpine image with short SHA "4e38e38c8ce0":
echo -e "GET /images/alpine:4e38e38c8ce0/json HTTP/1.0\r\n" | nc -U /var/run/docker.sock
echo -e "GET /images/alpine:4e38e38c/json HTTP/1.0\r\n" | nc -U /var/run/docker.sock
echo -e "GET /images/alpine:4/json HTTP/1.0\r\n" | nc -U /var/run/docker.sock
So we should check the response from the Docker API and look for the tags or SHA explicitly.
Fixes #30355
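A sketch of what the explicit check can look like (simplified matching rules, illustrative function name):
```go
package dockertools

import "strings"

// matchImageTagOrSHA: after inspecting an image, verify that the
// reference the user asked for actually appears in the inspect
// response, instead of trusting docker's loose parsing of
// "image:short-sha" style strings.
func matchImageTagOrSHA(repoTags, repoDigests []string, requested string) bool {
	for _, tag := range repoTags {
		if tag == requested {
			return true
		}
	}
	for _, digest := range repoDigests {
		// accept a digest reference only when it lines up with a
		// known repo digest (the real check demands the full SHA)
		if strings.HasSuffix(digest, requested) {
			return true
		}
	}
	return false
}
```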
Automatic merge from submit-queue
Set pod state as "unknown" when CNI plugin fails
Before this change, CNI plugin failure didn't change anything in the pod status, so pods having containers without requested network were "running".
Fixes #29148
Automatic merge from submit-queue
Remove kubelet pkill dependency
Issue #26093 identified pkill as one of the kubelet's dependencies
that could be worked around. Build on the code introduced for pidof
and regexp for the process(es) we need to send a signal to.
Related to #26093
Automatic merge from submit-queue
[kubelet] Introduce --protect-kernel-defaults flag to make the tunable behaviour configurable
Let's make the default behaviour of kernel tuning configurable. The default behaviour (modifying the tunables) is kept as it has been so far.
The Docker API does not validate the tag/SHA; for example, all of the following
calls work for an alpine image with short SHA "4e38e38c8ce0":
echo -e "GET /images/alpine:4e38e38c8ce0/json HTTP/1.0\r\n" | nc -U /var/run/docker.sock
echo -e "GET /images/alpine:4e38e38c/json HTTP/1.0\r\n" | nc -U /var/run/docker.sock
echo -e "GET /images/alpine:4/json HTTP/1.0\r\n" | nc -U /var/run/docker.sock
So we should check the response from the Docker API and look for the
tags or SHA explicitly.
Fixes #30355
Issue #26093 identified pkill as one of the kubelet's dependencies
that could be worked around. Build on the code introduced for pidof
and regexp for the process(es) we need to send a signal to.
Related to #26093
Automatic merge from submit-queue
Remove kubelet dependency on pidof
Issue #26093 identified pidof as one of the kubelet's dependencies
that could be worked around. In this PR, we just look at /proc
to construct the list of pids we need for a specified process
instead of running the "pidof" executable.
Related to #26093
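A minimal sketch of the /proc-scanning approach (illustrative, not the exact kubelet code):
```go
package procfs

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"strconv"
)

// pidsOf replaces the "pidof" binary: walk /proc, and for every
// numeric directory read cmdline to see whether the process matches.
func pidsOf(name string) ([]int, error) {
	dirs, err := ioutil.ReadDir("/proc")
	if err != nil {
		return nil, err
	}
	var pids []int
	for _, d := range dirs {
		pid, err := strconv.Atoi(d.Name())
		if err != nil {
			continue // not a process directory
		}
		cmdline, err := ioutil.ReadFile(fmt.Sprintf("/proc/%d/cmdline", pid))
		if err != nil {
			continue // the process may have exited already
		}
		// cmdline args are NUL-separated; the first one is the binary.
		if args := bytes.Split(cmdline, []byte{0}); len(args) > 0 && bytes.Contains(args[0], []byte(name)) {
			pids = append(pids, pid)
		}
	}
	return pids, nil
}
```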
Issue #26093 identified pidof as one of the kubelet's dependencies
that could be worked around. In this PR, we just look at /proc
to construct the list of pids we need for a specified process
instead of running the "pidof" executable.
Related to #26093
Automatic merge from submit-queue
Kubelet: Add container ports label.
Addresses https://github.com/kubernetes/kubernetes/pull/30049#discussion_r73983952.
Add a container ports label; although we don't use it now, it will make it easier to switch to the new runtime interface in the future.
@yujuhong @feiskyer
Automatic merge from submit-queue
Add total inodes to kubelet summary api
Needed to support inode based eviction thresholds as a percentage.
/cc @ronnielai @vishh @kubernetes/rh-cluster-infra
Before this change, CNI plugin failure didn't change anything in
the pod status, so pods having containers without requested
network were "running".
Fixes #29148
Automatic merge from submit-queue
[kubelet] Auto-discover node IP if neither cloud provider exists and IP is not explicitly specified
One example where the earlier implementation failed is when running the kubelet on CoreOS (bare-metal), where the nameserver is set to `8.8.8.8`. The kubelet tries to look up the node name against Google DNS, which obviously fails, and the kubelet won't recover after that.
The workaround has been to set `--hostname-override` to an IP address, but it's quite annoying to try to make a multi-distro way of getting the IP in bash, for example. This way is much cleaner.
Refactored the function a little bit at the same time
@vishh @yujuhong @resouer @Random-Liu
Automatic merge from submit-queue
Verify volume.GetPath() never returns ""
Add a new helper method volume.GetPath(Mounter) instead of calling
the Mounter's GetPath() directly. Check whether GetPath() returns
"" and convert that into an error.
Fixes #23163
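A sketch of the helper's shape (the Mounter interface is reduced to the single method relevant here):
```go
package volume

import (
	"fmt"
	"reflect"
)

// Mounter is trimmed down to the one method this helper needs.
type Mounter interface {
	GetPath() string
}

// GetPath wraps the Mounter's own GetPath and turns an empty result
// into an explicit error; at this point only the mounter's type is
// known, so that is what gets reported.
func GetPath(mounter Mounter) (string, error) {
	path := mounter.GetPath()
	if path == "" {
		return "", fmt.Errorf("path is empty for mounter of type %v", reflect.TypeOf(mounter))
	}
	return path, nil
}
```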
Add a new helper method volume.GetPath(Mounter) instead of calling
the Mounter's GetPath() directly. Check whether GetPath() returns
"" and convert that into an error. At this point, we only have
information about the type of the Mounter, so let's log that if
there is a problem.
Fixes #23163
Automatic merge from submit-queue
Node disk pressure should induce image gc
If the node reports disk pressure, prior to evicting pods, the node should clean up unused images.
Automatic merge from submit-queue
Consolidating image pulling logic
Moving image puller logic into image manager by consolidating 2 pullers into one implementation.
Automatic merge from submit-queue
Kubelet: add kubeGenericRuntimeManager for new runtime API
Part of #28789. Add `kubeGenericRuntimeManager` for kubelet new runtime API #17048.
Note that:
- To facilitate code review, #28396 is split into a few small PRs. This is the first part.
- This PR also fixes some syntax errors in `api.proto`.
- This PR depends on #29811 (already merged).
CC @yujuhong @Random-Liu @kubernetes/sig-node
Automatic merge from submit-queue
Kubelet: add gRPC implementation of new runtime interface
Add gRPC implementation of new runtime interface.
CC @yujuhong @Random-Liu @kubernetes/sig-node
Automatic merge from submit-queue
Kubelet: add fake kube runtime
Add a new fake kube runtime with kubelet using the new runtime API.
CC @yujuhong @Random-Liu
Automatic merge from submit-queue
pods which cannot be admitted should return directly
If the pod cannot be admitted, the runPod(pod, retryDelay) code should not be run.
Automatic merge from submit-queue
Use the CNI bridge plugin to set hairpin mode
Following up this part of #23711:
> I'd like to wait until containernetworking/cni#175 lands and then just pass the request through to CNI.
The code here just
* passes the required setting down from kubenet to CNI
* disables `DockerManager` from doing hairpin-veth, if kubenet is in use
Note to test you need a very recent version of the CNI `bridge` plugin; the one brought in by #28799 should be OK.
Also relates to https://github.com/kubernetes/kubernetes/issues/19766#issuecomment-232722864
If the resource in the delete call does not exist, the runtime should
not return an error. This eliminates the need for kubelet to define a
resource "not found" error that every runtime has to return.
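A sketch of the idempotent-delete convention (all names hypothetical):
```go
package kuberuntime

import "errors"

var errNotFound = errors.New("not found")

type client interface {
	Remove(id string) error
}

type runtime struct{ c client }

// RemoveContainer treats a missing container as success, so callers
// never have to define or recognize a "not found" error themselves.
func (r *runtime) RemoveContainer(id string) error {
	if err := r.c.Remove(id); err != nil && err != errNotFound {
		return err
	}
	return nil // already gone counts as deleted
}
```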
Automatic merge from submit-queue
pkg/various: plug leaky time.New{Timer,Ticker}s
According to the documentation for Go package time, `time.Ticker` and
`time.Timer` are uncollectable by garbage collector finalizers. They
leak until otherwise stopped. This commit ensures that all remaining
instances are stopped upon departure from their relative scopes.
Similar efforts were incrementally done in #29439 and #29114.
```release-note
* pkg/various: plugged various time.Ticker and time.Timer leaks.
```
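The pattern being enforced, in miniature:
```go
package util

import "time"

func poll(stopCh <-chan struct{}) {
	ticker := time.NewTicker(time.Second)
	// A Ticker is not reclaimed by the garbage collector; without this
	// Stop it leaks for the life of the process.
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// do periodic work
		case <-stopCh:
			return
		}
	}
}
```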
Automatic merge from submit-queue
Initial support for pod eviction based on disk
This PR adds the following:
1. node reports disk pressure condition based on configured thresholds
1. scheduler does not place pods on nodes reporting disk pressure
1. kubelet will not admit any pod when it reports disk pressure
1. kubelet ranks pods for eviction when low on disk
1. kubelet evicts greediest pod
Follow-on PRs will need to handle:
1. integrate with new image gc PR (https://github.com/kubernetes/kubernetes/pull/27199)
1. container gc policy should always run (will not be launched from eviction, tbd who does that)
1. this means kill pod is fine for all eviction code paths since container gc will remove dead container
1. min reclaim support will just poll summary provider (derek will do follow-on)
1. need to know if imagefs is same device as rootfs from summary (derek follow-on)
/cc @vishh @kubernetes/sig-node
Automatic merge from submit-queue
Add a docker-shim package
Add a new docker integration with kubelet using the new runtime API.
This change adds the package with the skeleton and implements some of the basic operations.
This PR only implements a small set of functions. The rest of the functions will be implemented
in the followup PRs to keep the changes readable, and the reviewers sane.
Note: The first commit is from #28396, only the second commit is for review.
/cc @kubernetes/sig-node @feiskyer @Random-Liu
Automatic merge from submit-queue
Restrict log symlink to 256 characters
This fix can potentially cause conflicts in log file names. The current model of exporting log data is fundamentally broken. This PR does not attempt to fix all of the issues.
Automatic merge from submit-queue
Fix mount collision timeout issue
Short- or medium-term workaround for #29555. The root issue being fixed here is that the recent attach/detach work in the kubelet uses a unique volume name as a key that tracks the work that has to be done for each volume in a pod to attach/mount/umount/detach. However, the non-attachable volume plugins do not report unique names for themselves, which causes collisions when a single secret or configmap is mounted multiple times in a pod.
This is still a WIP -- I need to add a couple E2E tests that ensure that tests break in the future if there is a regression -- but posting for early review.
cc @kubernetes/sig-storage
Ultimately, I would like to refine this a bit further. A couple things I would like to change:
1. `GetUniqueVolumeName` should be a property ONLY of attachable volumes
2. I would like to see the kubelet apparatus for attach/mount/umount/detach handle non-attachable volumes specifically to avoid things like the `WaitForControllerAttach` call that has to be done for those volume types now
Add a new docker integration with kubelet using the new runtime API.
This change adds the package with some skeletons, and implements some
of the basic operations.
For non-attachable volumes, do not call GetVolumeName on the plugin and instead
generate a unique name based on the identity of the pod and the name of the volume
within the pod.
According to the documentation for Go package time, `time.Ticker` and
`time.Timer` are uncollectable by garbage collector finalizers. They
leak until otherwise stopped. This commit ensures that all remaining
instances are stopped upon departure from their relative scopes.
Automatic merge from submit-queue
network/cni: Unconditionally bring up `lo` interface
This is already done in kubenet. This specifically fixes an issue where a kubelet-managed network for the rkt runtime does not have an "UP" lo interface.
Fixes #28561
If this fix doesn't seem right, it could also be implemented by rkt effectively managing two "cni" network plugins, one for the user requested network, one for lo.
Followup CRs can improve unit testing further and then possibly remove the vendor directory logic (which seems like dead code)
cc @kubernetes/sig-rktnetes @kubernetes/sig-network @dcbw
Automatic merge from submit-queue
Extract kubelet node status into separate file
Extract kubelet node status management into a separate file as a continuation of the kubelet code simplification effort.
Automatic merge from submit-queue
Remove duplicate prometheus metrics
This was a relic from before Kubernetes set Docker labels properly. cAdvisor now properly exposes the Docker labels (e.g. `io.kubernetes.pod.name` as `io_kubernetes_pod_name`, etc.), so this is no longer required and actually results in unnecessary duplicate Prometheus labels.
Automatic merge from submit-queue
Syncing image pulling backoff logic
- Syncing the backoff logic in the parallel image puller and the sequential image puller to prepare for merging the two pullers into one.
- Moving image error definitions under kubelet/images
Automatic merge from submit-queue
Eviction manager needs to start as runtime dependent module
To support disk eviction, the eviction manager needs to know whether there is a dedicated device for the imagefs. In order to know that, we need to start the eviction manager after cadvisor. This refactors where the eviction manager is started.
/cc @kubernetes/sig-node @kubernetes/rh-cluster-infra @vishh @ronnielai
Automatic merge from submit-queue
Allow PVs to specify supplemental GIDs
Retry of https://github.com/kubernetes/kubernetes/pull/28691 . Adds a Kubelet helper function for getting extra supplemental groups
Automatic merge from submit-queue
Add parsing code in kubelet for eviction-minimum-reclaim
The kubelet parses the eviction-minimum-reclaim flag and validates it for correctness.
The first two commits are from https://github.com/kubernetes/kubernetes/pull/29329 which has already achieved LGTM.
Automatic merge from submit-queue
CRI: add LinuxUser to LinuxContainerConfig
Following discussion in https://github.com/kubernetes/kubernetes/pull/25899#discussion_r70996068
The Container Runtime Interface should provide runtimes with User information to run the container process as (OCI being one of them).
This patch introduces a new field `user` into `LinuxContainerConfig` structure. The `user` field introduces also a new type structure `LinuxUser` which consists of `uid`, `gid` and `additional_gids`.
The `LinuxUser` struct has been embedded into `LinuxContainerConfig` to leave space for future implementations which are not Linux-related (e.g. Windows may have a different representation of _Users_).
If you feel naming can be better we can probably move `LinuxUser` to `UnixUser` also.
/cc @mrunalp @vishh @euank @yujuhong
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
Automatic merge from submit-queue
Removing images with multiple tags
If an image has multiple tags, we need to remove all the tags in order to make docker image removal succeed.
#28491
Automatic merge from submit-queue
Fix incorrect if conditions
When the current condition `if inspect == nil && inspect.Config == nil && inspect.Config.Labels == nil` is true, the func containerAndPodFromLabels returns; otherwise it does not. Suppose `inspect != nil` but `inspect.Config == nil`: the condition is false, so the func doesn't return, and the subsequent `labels := inspect.Config.Labels` will panic.
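The corrected guard, sketched with placeholder types:
```go
package dockertools

type config struct{ Labels map[string]string }
type containerJSON struct{ Config *config }

// labelsFrom shows the corrected guard: short-circuit ORs, so the
// function returns before any nil pointer can be dereferenced.
func labelsFrom(inspect *containerJSON) map[string]string {
	if inspect == nil || inspect.Config == nil || inspect.Config.Labels == nil {
		return nil
	}
	return inspect.Config.Labels // safe: all three checked above
}
```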
Automatic merge from submit-queue
rkt: Don't return if the service file doesn't exist when killing the pod
Remove some unused logic. This also prevents KillPod() from failing
when the service file doesn't exist, e.g. when it has been removed by garbage
collection in a rare case:
1. There are already more than `gcPolicy.MaxContainers` containers running
on the host.
2. The new pod (A) starts to run but hasn't entered the 'RUNNING' state yet.
3. GC is triggered, sees that pod (A) is in an inactive state (not running),
and decides it needs to remove the pod to enforce `gcPolicy.MaxContainers`.
4. GC fails to remove the pod because `rkt rm` fails while the pod is running,
but it removes the service file anyway.
5. The follow-up KillPod() call fails because it cannot find the service file
on disk.
This is possible only when the pod has been in the prepared state for longer
than 1 min, which sounds like another issue.
cc @kubernetes/sig-rktnetes
Automatic merge from submit-queue
Allow mounts to run in parallel for non-attachable volumes
This PR:
* Fixes https://github.com/kubernetes/kubernetes/issues/28616
* Enables mount volume operations to run in parallel for non-attachable volume plugins.
* Enables unmount volume operations to run in parallel for all volume plugins.
* Renames `GoRoutineMap` to `GoroutineMap`, resolving a long outstanding request from @thockin: `"Goroutine" is a noun`
Automatic merge from submit-queue
ImagePuller refactoring
A plain refactoring
- Moving image pullers to a new pkg/kubelet/images directory
- Hiding image pullers inside the new ImageManager
The next step is to consolidate the logic of the serialized and the parallel image pullers inside ImageManager
xref: #25577
Automatic merge from submit-queue
Kubelet: Set PruneChildren when removing image.
This is a bug introduced when switching to engine-api. https://github.com/kubernetes/kubernetes/issues/23563.
When removing image, there is an option `noprune`:
```
If prune is true, ancestor images will each attempt to be deleted quietly.
```
In go-dockerclient, the default value of the option is ["noprune=false"](https://github.com/fsouza/go-dockerclient/blob/master/image.go#L171), which means that ancestor images should be also removed. This is the expected behaviour.
However in engine-api, the option is changed to `PruneChildren`, and the default value is `PruneChildren=false`, which means that ancestor images won't be removed.
This makes `ImageRemove` only remove the first layer of the image, which causes the image garbage collection not working as expected.
This should be fixed in 1.3.
And thanks to @ronnielai for finding the bug! :)
/cc @kubernetes/sig-node
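For illustration, setting the option explicitly against the current docker client API (the engine-api call of that era had a slightly different shape):
```go
package main

import (
	"context"
	"log"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		log.Fatal(err)
	}
	// PruneChildren must be set explicitly: unlike go-dockerclient's
	// noprune=false default, engine-api defaults to not removing
	// ancestor layers, which is what broke image garbage collection.
	if _, err := cli.ImageRemove(context.Background(), "alpine:latest",
		types.ImageRemoveOptions{PruneChildren: true}); err != nil {
		log.Fatal(err)
	}
}
```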
Automatic merge from submit-queue
docker_manager: Correct determineContainerIP args
This could result in the network plugin not retrieving the pod ip in a
call to SyncPod when using the `exec` network plugin.
The CNI and kubenet network plugins ignore the name/namespace arguments,
so they are not impacted by this bug.
I verified the second included test failed prior to correcting the
argument order.
Fixes #29161
cc @yujuhong
Automatic merge from submit-queue
Move ExtractPodBandwidthResources test into appropriate package
Found during #28511, this test is in the wrong package currently.
cc @kubernetes/sig-network
Allow mount volume operations to run in parallel for non-attachable
volume plugins.
Allow unmount volume operations to run in parallel for all volume
plugins.
Automatic merge from submit-queue
Delete redundant if condition
The case `containerStatus == nil` has already been checked just above. It's redundant here.
Automatic merge from submit-queue
Make kubelet continue cleanup when there is a noncritical error.
Fix https://github.com/kubernetes/kubernetes/issues/29078.
Even if there is an error when cleaning up the pod directory or bandwidth limits, the kubelet can continue cleaning up the rest.
However, when the runtime cache or the runtime returns an error, cleanup should fail, because the subsequent cleanup relies on the `runningPod`.
@yujuhong
/cc @kubernetes/sig-node
This could result in the network plugin not retrieving the pod ip in a
call to SyncPod when using the `exec` network plugin.
The CNI and kubenet network plugins ignore the name/namespace arguments,
so they are not impacted by this bug.
I verified the second included test failed prior to correcting the
argument order.
Fixes #29161
Automatic merge from submit-queue
Return (bool, error) in Authorizer.Authorize()
Before this change, the Authorize() method returned just an error, regardless of whether the user was unauthorized or whether there was some other unrelated error. Returning a boolean with the authorization decision, and an error (which should be unrelated to the authorization) separately, will make it easier to debug.
Fixes #27974
Automatic merge from submit-queue
Do not skip check for cgroup creation in the systemd mount
As soon as the libcontainer dependency is updated in #28410, we can skip the check for cgroup creation in the systemd mount, as the latest version of libcontainer should create cgroups in the systemd mount as well.
This is tied to the upstream issue: #27204
@vishh PTAL
Remove some unused logic. This also prevents KillPod() from failing
when the service file doesn't exist, e.g. when it has been removed by garbage
collection in a rare case:
1. There are already more than `gcPolicy.MaxContainers` containers running
on the host.
2. The new pod (A) starts to run but hasn't entered the 'RUNNING' state yet.
3. GC is triggered, sees that pod (A) is in an inactive state (not running),
and decides it needs to remove the pod to enforce `gcPolicy.MaxContainers`.
4. GC fails to remove the pod because `rkt rm` fails while the pod is running,
but it removes the service file anyway.
5. The follow-up KillPod() call fails because it cannot find the service file
on disk.
This is possible only when the pod has been in the prepared state for longer
than 1 min, which sounds like another issue.
Before this change, the Authorize() method returned just an error,
regardless of whether the user was unauthorized or whether there
was some other unrelated error. Returning a boolean with the
authorization decision, and an error (which should be unrelated to
the authorization) separately, will make it easier to debug.
Fixes #27974
Automatic merge from submit-queue
rkt: Copy the /etc/hosts /etc/resolv.conf into pod dir before mounting.
rkt: Copy the /etc/hosts /etc/resolv.conf into pod dir before mounting.
This enables the container to modify /etc/hosts and /etc/resolv.conf without changing the host's versions.
With this PR, we now match docker's behavior.
Fix https://github.com/kubernetes/kubernetes/issues/29022
cc @kubernetes/sig-rktnetes @quentin-m
This enables the container to modify /etc/hosts and /etc/resolv.conf
without changing the host's versions.
With this PR, we now match docker's behavior.
Automatic merge from submit-queue
Support terminal resizing for exec/attach/run
```release-note
Add support for terminal resizing for exec, attach, and run. Note that for Docker, exec sessions
inherit the environment from the primary process, so if the container was created with tty=false,
that means the exec session's TERM variable will default to "dumb". Users can override this by
setting TERM=xterm (or whatever is appropriate) to get the correct "smart" terminal behavior.
```
Fixes #13585
Add support for terminal resizing for exec, attach, and run. Note that for Docker, exec sessions
inherit the environment from the primary process, so if the container was created with tty=false,
that means the exec session's TERM variable will default to "dumb". Users can override this by
setting TERM=xterm (or whatever is appropriate) to get the correct "smart" terminal behavior.
Automatic merge from submit-queue
Generates DELETE pod update operations
fixes #27105
Generates DELETE pod update operations to make the code and log more intuitive.
1. main refactoring is in `kubelet/config`
2. kubelet will log if it received DELETE, just like other OPs
cc @Random-Liu :)
Automatic merge from submit-queue
Remove unnecessary calls to api.GetReference
These calls are unnecessary, can be removed. `Eventf` and others just call `GetReference` on the object they are passed.
cc @kubernetes/sig-node
Automatic merge from submit-queue
Remove no needed todo
ref #19645 #13418
Remove the comment about refactoring pod cleanup since we have agreed to keep it.
cc @yujuhong
Automatic merge from submit-queue
Optimizing the processing flow of HandlePodAdditions and canAdmitPod …
Optimizing the processing flow of the HandlePodAdditions and canAdmitPod methods. If the following loop body in the canAdmitPod method is removed, the detection speed improves, and the change is very small.
------
otherPods := []*api.Pod{}
for _, p := range pods {
if p != pod {
otherPods = append(otherPods, p)
}
}
------
Signed-off-by: Kevin Wang <wang.kanghua@zte.com.cn>
change the note for the canAdmitPod method.
Signed-off-by: Kevin Wang <wang.kanghua@zte.com.cn>
gofmt kubelet.go
Signed-off-by: Kevin Wang <wang.kanghua@zte.com.cn>
Automatic merge from submit-queue
Add meta field to predicate signature to avoid computing the same things multiple times
This PR only uses it to avoid computing QOS of a pod for every node from scratch.
Ref #28590
Automatic merge from submit-queue
RemoveContainer in Runtime interface
- Added a DeleteContainer method in Runtime interface
- Implemented DeleteContainer for docker
#28552
Automatic merge from submit-queue
Add checks in Create and Update Cgroup methods
This PR is connected to upstream issue for adding pod level cgroups in Kubernetes: #27204
Libcontainer currently doesn't support updates to parent devices cgroups. Until we get libcontainer to support skipping the devices cgroup, we will keep that logic on the kubelet side.
This PR includes:
1. Skip the devices cgroup when updating a cgroup. We only update the memory and cpu subsystems.
2. We explicitly pass all the cgroup paths that don't already exist to Apply()
3. Adds an AlreadyExists() method which is a utility function to check if all the subsystems of a cgroup already exist.
On cgroupManager.Update() we only call Set(), and on cgroupManager.Create() we only call Apply().
@vishh PTAL
Automatic merge from submit-queue
Cleanup third party (pt 2)
Move forked-and-hacked golang code to the forked/ directory. Remove ast/build/parse code that is now in stdlib. Remove unused shell2junit
Automatic merge from submit-queue
Reorganize volume controllers and manager
* Move both PV and attach/detach volume controllers to `controllers/volume` (closes #26222)
* Rename `kubelet/volume` to `kubelet/volumemanager`
* Add/update OWNER files
Automatic merge from submit-queue
Add additional testing scenarios for compute resource requests=0
I was asked about the qos tier of a pod that specified
`--requests=cpu=0,memory=0 --limits=cpu=100m,memory=1Gi`
and in just investigating current behavior, realized we should have an explicit test case to ensure that 0 values are preserved in defaulting passes, and that this is still a burstable pod (but the lowest for that tier as it relates to eviction)
/cc @vishh
This commit includes a proposal and a Go file to re-define the container
runtime interface.
Note that this is an experimental interface and is expected to go through
multiple revisions once developers start implementing against it. As stated in
the proposal, there are also individual issues to carry discussions of
specific features.
Automatic merge from submit-queue
[kubelet] Allow opting out of automatic cloud provider detection in kubelet. By default kubelet will auto-detect cloud providers
fixes #28231
Automatic merge from submit-queue
Track object modifications in fake clientset
Fake clientset is used by unit tests extensively but it has some
shortcomings:
- no filtering on namespace and name: tests that want to test objects in
multiple namespaces end up getting all objects from this clientset,
as it doesn't perform any filtering based on name and namespace;
- updates and deletes don't modify the clientset state, so some tests
can get unexpected results if they modify/delete objects using the
clientset;
- it's possible to insert multiple objects with the same
kind/name/namespace, which leads to confusing behavior: retrieval is
based on the insertion order, but anchors on the last added object as
long as no more objects are added.
This change updates the core.ObjectRetriever implementation to track object
adds, updates and deletes.
Some unit tests were depending on the previous (and somewhat incorrect)
behavior. These are fixed in the following few commits.
Automatic merge from submit-queue
[Refactor] Make QoS naming consistent across the codebase
@derekwaynecarr @vishh PTAL. Can one of you please attach a LGTM.
Automatic merge from submit-queue
Volume manager must verify containers terminated before deleting for ungracefully terminated pods
A pod is removed from volume manager (triggering unmount) when it is deleted from the kubelet pod manager. Kubelet deletes the pod from the pod manager as soon as it receives a delete pod request. As long as the graceful termination period is non-zero, this happens after kubelet has terminated all containers for the pod. However, when the graceful termination period for a pod is set to zero, the pod is deleted from the pod manager *before* its containers are terminated.
This can result in volumes getting unmounted from a pod before all containers have exited when graceful termination is set to zero.
This PR prevents that from happening by only deleting a volume from volume manager once it is deleted from the pod manager AND the kubelet containerRuntime status indicates all containers for the pod have exited. Because we do not want to call containerRuntime too frequently, we introduce a delay in the `findAndRemoveDeletedPods()` method to prevent it from executing more frequently than every two seconds.
Fixes https://github.com/kubernetes/kubernetes/issues/27691
Running test in tight loop to verify fix.
Automatic merge from submit-queue
Kubelet should mark VolumeInUse before checking if it is Attached
Kubelet should mark VolumeInUse before checking if it is Attached.
Controller should fetch fresh copy of node object before detach instead of relying on node informer cache.
Fixes #27836
Ensure that the kubelet marks VolumeInUse before checking if it is Attached.
Also ensure that the attach/detach controller always fetches a fresh
copy of the node object before detach (instead of relying on the node
informer cache).
Automatic merge from submit-queue
Add support for basic QoS and pod level cgroup management
This PR is a WIP and is tied to this upstream issue #27204
It adds support for creation,deletion and updates of cgroups in Kubernetes.
@vishh PTAL
Please note that the first commit is part of this PR: #27749
cc @kubernetes/sig-node
Signed-off-by: Buddha Prakash <buddhap@google.com>
Automatic merge from submit-queue
Refactor func canRunPod
After refactoring, we only need to check `if pod.Spec.SecurityContext == nil` once. The logic is a bit clearer.
Automatic merge from submit-queue
Fix pkg/kubelet unit tests fail on OSX
Use runtime.GOOS for the OperatingSystem instead of hardcoding it to linux.
Fixes #27730
Automatic merge from submit-queue
[Refactor] QOS to have QOS Class type for QoS classes
This PR adds a QOSClass type and initializes QOSclass constants for the three QoS classes.
It would be good to use this in all future QOS related features.
This would be good to have for the [Pod level cgroups isolation proposal](https://github.com/kubernetes/kubernetes/pull/26751) that I am working on as well.
@vishh PTAL
Signed-off-by: Buddha Prakash <buddhap@google.com>
Automatic merge from submit-queue
Kubelet can retrieve host IP even when apiserver has not been contacted
fixes https://github.com/kubernetes/kubernetes/issues/26590, fixes https://github.com/kubernetes/kubernetes/issues/6558
Right now the kubelet expects to get the hostIP from the kubelet's local nodeInfo cache. However, this will be empty if there is no api-server (or the apiServer has not yet been contacted).
In the case of static pods, this change means the downward api can now be used to populate hostIP.
Automatic merge from submit-queue
kubelet: Remove a couple lines of dead code
Presumably that code was added for debugging reasons and never removed. Hopefully.
If it's actually important and there's a good reason to do what looks like a no-op to get pause-the-world behaviour or whatever, I'd hope there'd be a comment.
cc @pwittrock
Automatic merge from submit-queue
Image GC logic should compensate for reserved blocks
Calculating the disk usage based on available bytes instead of usage bytes to account for reserved blocks in image GC
#27169
Automatic merge from submit-queue
rkt: Fix the 'privileged' check when stage1 annotation is provided.
Previously, when the stage1 annotation was provided, we only
checked whether the kubelet allows privileged containers, which is not
useful as that is a global setting.
Instead, we should check if the pod has explicitly set the privileged
security context to 'true'.
cc @kubernetes/sig-rktnetes @kubernetes/sig-node
Automatic merge from submit-queue
Bump minimum API version for docker to 1.21
The corresponding docker version is 1.9.x. Dropping support for docker 1.8.
/cc @kubernetes/sig-node
Previously, when the stage1 annotation was provided, we only
checked whether the kubelet allows privileged containers, which is not
useful as that is a global setting.
Instead, we should check if the pod has explicitly set the privileged
security context to 'true'.
Automatic merge from submit-queue
kubenet: Fix host port for rktnetes.
Because the rkt pod runs only after plugin.SetUpPod() is called,
getRunningPods() does not return the newly created pod, which
causes the hostport iptables rules to be missing for the new pod.
cc @dcbw @freehan
A follow up fix for https://github.com/kubernetes/kubernetes/pull/27878#issuecomment-227898936
Because the rkt pod runs only after plugin.SetUpPod() is called,
getRunningPods() does not return the newly created pod, which
causes the hostport iptables rules to be missing for the new pod.
Automatic merge from submit-queue
rkt: Refactor grace termination period.
Add the `TimeoutStopSec` service option to support graceful termination.
We found that graceful termination can be improved by adding a systemd service option.
cc @kubernetes/sig-rktnetes
Use the generic runtime method to get the netns path. Also
move reading the container IP address into cni (based off kubenet)
instead of having it in the Docker manager code. Both old and new
methods use nsenter and /sbin/ip and should be functionally
equivalent.
Automatic merge from submit-queue
Remove pod mutation for volumes annotated with supplemental groups
Removes the pod mutation added in #20490 -- partially resolves #27197 from the standpoint of making the feature inactive in 1.3. Our plan is to make this work correctly in 1.4.
@kubernetes/sig-storage
Automatic merge from submit-queue
Kubelet Volume Manager Wait For Attach Detach Controller and Backoff on Error
* Closes https://github.com/kubernetes/kubernetes/issues/27483
* Modified Attach/Detach controller to report `Node.Status.AttachedVolumes` on successful attach (unique volume name along with device path).
* Modified Kubelet Volume Manager wait for Attach/Detach controller to report success before proceeding with attach.
* Closes https://github.com/kubernetes/kubernetes/issues/27492
* Implemented an exponential backoff mechanism for the volume manager and attach/detach controller to prevent operations (attach/detach/mount/unmount/wait for controller attach/etc) from executing back to back unchecked.
* Closes https://github.com/kubernetes/kubernetes/issues/26679
* Modified volume `Attacher.WaitForAttach()` methods to use the device path reported by the Attach/Detach controller in `Node.Status.AttachedVolumes` instead of calling out to cloud providers.
Modify attach/detach controller to keep track of volumes to report
attached in Node VolumeToAttach status.
Modify kubelet volume manager to wait for volume to show up in Node
VolumeToAttach status.
Implement exponential backoff for errors in volume manager and attach
detach controller
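A minimal sketch of the per-operation exponential backoff idea (the durations and type names are illustrative, not the actual implementation):
```go
package backoff

import "time"

const (
	initialDelay = 500 * time.Millisecond
	maxDelay     = 2 * time.Minute
)

// operationBackoff tracks when a volume operation last failed and how long
// to wait before allowing it to run again.
type operationBackoff struct {
	lastAttempt time.Time
	delay       time.Duration
}

// ready reports whether enough time has passed to retry the operation.
func (b *operationBackoff) ready(now time.Time) bool {
	return now.Sub(b.lastAttempt) >= b.delay
}

// recordFailure notes a failed attempt and doubles the delay, capped at
// maxDelay, so repeated failures cannot execute back to back unchecked.
func (b *operationBackoff) recordFailure(now time.Time) {
	b.lastAttempt = now
	switch {
	case b.delay == 0:
		b.delay = initialDelay
	case b.delay*2 > maxDelay:
		b.delay = maxDelay
	default:
		b.delay *= 2
	}
}
```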
This enables rkt to use cached stage1 image instead of unpacking the
stage1 image every time for every pod.
After this change, users need to preload the stage1 images in order to
enable rkt to find the stage1 image with the name specified by this flag.
Automatic merge from submit-queue
Logging for OutOfDisk when file system info is not available
#26566
1. Add logging when file system info is not available.
2. Report OutOfDisk when file system info is not available.
Automatic merge from submit-queue
Filter seccomp profile path from malicious .. and /
Without this patch, with `localhost/<some-relative-path>` as the seccomp profile, one can load any file on the host, e.g. `localhost/../../../../dev/mem`, which is not healthy for the kubelet.
/cc @jfrazelle
Unit tests depend on https://github.com/kubernetes/kubernetes/pull/26710.
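A minimal sketch of this kind of sanitization (the function name is illustrative, not the actual kubelet code):
```go
package main

import (
	"fmt"
	"strings"
)

// validateProfilePath rejects localhost/ seccomp profile names that could
// escape the profile root via ".." segments or an absolute path.
func validateProfilePath(name string) error {
	if strings.HasPrefix(name, "/") || strings.Contains(name, "..") {
		return fmt.Errorf("invalid seccomp profile path %q", name)
	}
	return nil
}

func main() {
	fmt.Println(validateProfilePath("my-profile.json"))     // <nil>
	fmt.Println(validateProfilePath("../../../../dev/mem")) // error
}
```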
Automatic merge from submit-queue
kubelet/kubenet: split hostport handling into separate module
This pulls the hostport functionality of kubenet out into a separate module so that it can be more easily tested and potentially used from other code (maybe CNI, maybe downstream consumers like OpenShift, etc). Couldn't find a mock iptables so I wrote one, but I didn't look very hard.
@freehan @thockin @bprashanth
Automatic merge from submit-queue
Revert revert of downward api node defaults
Reverts the revert of https://github.com/kubernetes/kubernetes/pull/27439. Fixes #27062
@dchen1107 - who at Google can help debug why this caused issues with GKE infrastructure but not GCE merge queue?
/cc @wojtek-t @piosz @fgrzadkowski @eparis @pmorie
- improve restoreInternal implementation in iptables
- add SetStdin and SetStdout functions to Cmd interface
- modify kubelet/prober and some tests in order to work with Cmd interface
If the mount operation exceeds the timeout, it will return an error and the
pod worker will retry in the next sync (10s or less). Compared with the
original value (i.e., 10 minutes), this frees the pod worker sooner to process
pod updates, if there are any.
This commit adds a new volume manager in kubelet that synchronizes
volume mount/unmount (and attach/detach, if attach/detach controller
is not enabled).
This eliminates the race conditions between the pod creation loop
and the orphaned volumes loops. It also removes the unmount/detach
from the `syncPod()` path so volume clean up never blocks the
`syncPod` loop.
Automatic merge from submit-queue
Let kubelet log the DeletionTimestamp if it's not nil in update
This helps to debug if it's the kubelet to blame when a pod is not deleted.
Example output:
```
SyncLoop (UPDATE, "api"): "redis-master_default(c6782276-2dd4-11e6-b874-64510650ab1c):DeletionTimestamp=2016-06-08T23:58:12Z"
```
ref #26290
cc @Random-Liu
Automatic merge from submit-queue
Update reason_cache.go: the Get method operates on the LRU cache in a way that is not threadsafe
The reason cache wraps an LRU cache, and an LRU cache modifies its linked list even on a Get, so a write lock should be used for both reads and writes.
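A minimal sketch of the fix, assuming the groupcache LRU package; the reasonCache type here is an illustrative stand-in:
```go
package main

import (
	"sync"

	"github.com/golang/groupcache/lru"
)

type reasonCache struct {
	lock  sync.Mutex
	cache *lru.Cache
}

// Get must take the same exclusive lock as writers: an LRU Get promotes
// the entry to the front of the internal linked list, mutating state.
func (c *reasonCache) Get(key string) (interface{}, bool) {
	c.lock.Lock()
	defer c.lock.Unlock()
	return c.cache.Get(key)
}

func main() {
	c := &reasonCache{cache: lru.New(128)}
	c.cache.Add("pod/container", "ErrImagePull")
	if v, ok := c.Get("pod/container"); ok {
		_ = v
	}
}
```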
Automatic merge from submit-queue
Fix docker api version in kubelet
There are two variables `dockerv110APIVersion` and `dockerV110APIVersion` with
the same purpose, but different values. Remove the incorrect one and fix usage
in the file.
/cc @dchen1107 @Random-Liu
Automatic merge from submit-queue
Sets IgnoreUnknown=1 in CNI_ARGS
```release-note
release-note-none
```
K8 uses CNI_ARGS to pass pod namespace, name and infra container
id to the CNI network plugin. CNI logic will throw an error
if these args are not known to it, unless the user specifies
IgnoreUnknown as part of CNI_ARGS. This PR sets IgnoreUnknown=1
to prevent the CNI logic from erroring and blocking pod setup.
https://github.com/appc/cni/pull/158 https://github.com/appc/cni/issues/126
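For context, CNI_ARGS is a semicolon-separated list of KEY=value pairs in the plugin's environment; with this change it would look roughly like the following (the values are illustrative):
```
CNI_ARGS="IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=mypod;K8S_POD_INFRA_CONTAINER_ID=abc123"
```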
Since appc requires gid to be non-empty today (https://github.com/appc/spec/issues/623),
we have to error out when gid is empty instead of using the root gid.
Automatic merge from submit-queue
rkt: Replace 'journalctl' with rkt's GetLogs() API.
This replaces the `journalctl` shell-out with rkt's GetLogs() API.
Fixes #26997
To make this fully work, we need rkt to have this patch: https://github.com/coreos/rkt/pull/2763
cc @kubernetes/sig-node @euank @alban @iaguis @jonboulle
Automatic merge from submit-queue
rkt: Do not run rkt pod inside a pre-created netns when network plugin is no-op
This fixed a panic where the returned pod network status is nil. (Fix #26540)
Also this makes lkvm stage1 able to run inside a user defined network, where the network name needs to be 'rkt.kubernetes.io'. A temporary solution to solve the network issue for lkvm stage1.
Besides, I fixed minor issues such as passing the wrong pod UID when cleaning up the netns file.
/cc @euank @pskrzyns @jellonek @kubernetes/sig-node
I tested with no networkplugin locally, works fine.
As a reminder, we need to document this in the release. https://github.com/kubernetes/kubernetes/issues/26201
This fixed a panic where the returned pod network status is nil.
Also this makes lkvm stage1 able to run inside a user defined
network, where the network name needs to be 'rkt.kubernetes.io'.
Also fixed minor issues such as passing the wrong pod UID, ignoring
logging errors.
Automatic merge from submit-queue
rkt: Fix incomplete selinux context string when the option is partial.
Fix "EmptyDir" e2e tests failures caused by #https://github.com/kubernetes/kubernetes/pull/24901
As mentioned in https://github.com/kubernetes/kubernetes/pull/24901#discussion_r61372312
We should apply the selinux context of the rkt data directory (/var/lib/rkt) when users do not specify all the selinux options.
Due to my fault, the change was missed during rebase, thus causing the regression.
After applying this PR, the e2e tests passed.
```
$ go run hack/e2e.go -v -test --test_args="--ginkgo.dryRun=false --ginkgo.focus=EmptyDir"
...
Ran 19 of 313 Specs in 199.319 seconds
SUCCESS! -- 19 Passed | 0 Failed | 0 Pending | 294 Skipped PASS
```
BTW, the test is removed because the `--no-overlay=true` flag will only be there on non-coreos distros.
cc @euank @kubernetes/sig-node
Automatic merge from submit-queue
Preserve query strings in HTTP probes instead of escaping them
Fixes a problem reported on Slack by devth.
```release-note
* Allow the use of query strings and URI fragments in HTTP probes
```
This might also preserve fragments, for those crazy enough to pass them.
I am using url.Parse() on the path in order to get path/query/fragment
and also deliberately avoiding the addition of more fields to the API.
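A minimal sketch of that approach: parse the configured path so the query string (and fragment) survive instead of being escaped into the path:
```go
package main

import (
	"fmt"
	"net/url"
)

func main() {
	// The probe path as the user specified it, query string included.
	raw := "/healthz?verbose=true#status"
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	probe := url.URL{
		Scheme:   "http",
		Host:     "10.0.0.1:8080",
		Path:     u.Path,     // "/healthz"
		RawQuery: u.RawQuery, // "verbose=true" preserved, not escaped
		Fragment: u.Fragment, // "status"
	}
	fmt.Println(probe.String()) // http://10.0.0.1:8080/healthz?verbose=true#status
}
```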
Double slashes are not allowed in annotation keys. Moreover, using the 63
characters of the name component in an annotation key will shorten the space
for the container name.
Automatic merge from submit-queue
rkt: Wrap exec errors as utilexec.ExitError
This is needed by the exec prober to distinguish error types and exit
codes correctly. Without this, the exec prober used for liveness probes
doesn't identify errors correctly and restarts aren't triggered. Fixes #26456
An alternative, and preferable solution would be to use utilexec
everywhere, but that change is much more involved and should come at a
later date. Unfortunately, until that change is made, writing tests for
this is quite difficult.
cc @yifan-gu @sjpotter
Automatic merge from submit-queue
Add timeout for image pulling
Fix #26300.
With this PR, if image pulling makes no progress for *1 minute*, the operation will be cancelled. Docker reports progress for every 512kB block (See [here](3d13fddd2b/pkg/progress/progressreader.go (L32))), *512kB/min* means the throughput is *<= 8.5kB/s*, which should be kind of abnormal?
It's a little hard to write unit test for this, so I just manually tested it. If I set the `defaultImagePullingStuckTimeout` to 0s, and `defaultImagePullingProgressReportInterval` to 1s, image pulling will be cancelled.
```
E0601 18:48:29.026003 46185 kube_docker_client.go:274] Cancel pulling image "nginx:latest" because of no progress for 0, latest progress: "89732b811e7f: Pulling fs layer "
E0601 18:48:29.026308 46185 manager.go:2110] container start failed: ErrImagePull: net/http: request canceled
```
/cc @kubernetes/sig-node
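A minimal sketch of the stuck-pull detection described above (the timeout value and names are illustrative):
```go
package pullprogress

import (
	"context"
	"time"
)

const stuckTimeout = 1 * time.Minute

// monitorProgress cancels the image pull when no progress message arrives
// for stuckTimeout. progress receives one value per status update decoded
// from the docker pull stream; done is closed when the pull finishes.
func monitorProgress(cancel context.CancelFunc, progress <-chan string, done <-chan struct{}) {
	timer := time.NewTimer(stuckTimeout)
	defer timer.Stop()
	for {
		select {
		case <-progress:
			// Any progress resets the stuck timer.
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(stuckTimeout)
		case <-timer.C:
			cancel() // no progress for stuckTimeout: abort the pull
			return
		case <-done:
			return
		}
	}
}
```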
Automatic merge from submit-queue
Attach/Detach Controller Kubelet Changes
This PR contains changes to enable attach/detach controller proposed in #20262.
Specifically it:
* Introduces a new `enable-controller-attach-detach` kubelet flag to enable control by attach/detach controller. Default enabled.
* Removes all references `SafeToDetach` annotation from controller.
* Adds the new `VolumesInUse` field to the Node Status API object.
* Modifies the controller to use `VolumesInUse` instead of `SafeToDetach` annotation to gate detachment.
* Modifies kubelet to set `VolumesInUse` before Mount and after Unmount.
* There is a bug in the `node-problem-detector` binary that causes `VolumesInUse` to get reset to nil every 30 seconds. Issue https://github.com/kubernetes/node-problem-detector/issues/9#issuecomment-221770924 opened to fix that.
* There is a bug here in the mount/unmount code that prevents resetting `VolumesInUse` in some cases; this will be fixed by the mount/unmount refactor.
* Have controller process detaches before attaches so that volumes referenced by pods that are rescheduled to a different node are detached first.
* Fix misc bugs in controller.
* Modify GCE attacher to: remove retries, remove mutex, and not fail if volume is already attached or already detached.
Fixes#14642, #19953
```release-note
Kubernetes v1.3 introduces a new Attach/Detach Controller. This controller manages attaching and detaching volumes on-behalf of nodes that have the "volumes.kubernetes.io/controller-managed-attach-detach" annotation.
A kubelet flag, "enable-controller-attach-detach" (default true), controls whether a node sets the "controller-managed-attach-detach" or not.
```
This PR contains Kubelet changes to enable attach/detach controller control.
* It introduces a new "enable-controller-attach-detach" kubelet flag to
enable control by controller. Default enabled.
* It removes all references to the "SafeToDetach" annotation from controller.
* It adds the new VolumesInUse field to the Node Status API object.
* It modifies the controller to use VolumesInUse instead of SafeToDetach
annotation to gate detachment.
* There is a bug in node-problem-detector that causes VolumesInUse to
get reset every 30 seconds. Issue https://github.com/kubernetes/node-problem-detector/issues/9
opened to fix that.
Automatic merge from submit-queue
rkt: Get logs via syslog identifier
This change works around https://github.com/coreos/rkt/issues/2630
Without this change, logs cannot reliably be collected for containers
with short lifetimes.
With this change, logs cannot be collected on rkt versions v1.6.0 and
before.
I'd like to also bump the required rkt version, but I don't want to do that until there's a released version that can be pointed to (so the next rkt release).
I haven't added tests (which were missing) because this code will be removed if/when logs are retrieved via the API. I have run E2E tests with this merged in and verified the tests which previously failed no longer fail.
cc @yifan-gu
Automatic merge from submit-queue
rkt: Add pod selinux support.
Currently only pod-level selinux context is supported. Besides, when
running with selinux, we will not be able to use the overlay fs, see:
https://github.com/coreos/rkt/issues/1727#issuecomment-173203129.
cc @kubernetes/sig-node @alban @mjg59 @pmorie
With quotes, the service doesn't start on systemd 219, with an error
saying the path of the netns cannot be found.
This PR fixes the bug by removing the quotes surrounding the netns path.
Automatic merge from submit-queue
Kubelet: Cache image history to eliminate the performance regression
Fix https://github.com/kubernetes/kubernetes/issues/25057.
The image history operation takes almost 50% of cpu usage in the kubelet performance test. We should cache image history instead of getting it from the runtime every time.
This PR cached image history in imageStatsProvider and added unit test.
@yujuhong @vishh
/cc @kubernetes/sig-node
Mark v1.3 because this is a relatively significant performance regression.
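A minimal sketch of memoizing per-image history (types and names are illustrative stand-ins, not the actual imageStatsProvider code):
```go
package imagestats

import "sync"

// historyEntry stands in for the runtime's image history record.
type historyEntry struct {
	Created int64
	Size    int64
}

// imageHistoryCache memoizes per-image history so repeated stats
// collections don't query the runtime every time.
type imageHistoryCache struct {
	mu      sync.Mutex
	entries map[string][]historyEntry
	fetch   func(imageID string) ([]historyEntry, error)
}

func (c *imageHistoryCache) history(imageID string) ([]historyEntry, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if h, ok := c.entries[imageID]; ok {
		return h, nil // cache hit: image layers are immutable
	}
	h, err := c.fetch(imageID)
	if err != nil {
		return nil, err
	}
	c.entries[imageID] = h
	return h, nil
}
```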
Automatic merge from submit-queue
Do not call NewFlannelServer() unless flannel overlay is enabled
Ref: #26093
This makes it so the kubelet does not warn the user that iptables isn't in PATH when the user didn't enable the flannel overlay.
@vishh @freehan @bprashanth
Automatic merge from submit-queue
Add assert.NotNil for test case
I hardcoded the `DefaultInterfaceName` from `eth0` to `eth-k8sdefault` in release 1.2.0 in order to test my CNI plugins. When running the test, it panics and prints the wrongly formatted messages below.
In the test case `TestBuildSummary`, `containerInfoV2ToNetworkStats` will return `nil` if `DefaultInterfaceName` is not `eth0`. So maybe we should add `assert.NotNil` to the test case.
```
ok k8s.io/kubernetes/pkg/kubelet/server 0.591s
W0523 03:25:28.257074 2257 summary.go:311] Missing default interface "eth-k8sdefault" for s%!(EXTRA string=node:FooNode)
W0523 03:25:28.257322 2257 summary.go:311] Missing default interface "eth-k8sdefault" for s%!(EXTRA string=pod:test0_pod1)
W0523 03:25:28.257361 2257 summary.go:311] Missing default interface "eth-k8sdefault" for s%!(EXTRA string=pod:test0_pod0)
W0523 03:25:28.257419 2257 summary.go:311] Missing default interface "eth-k8sdefault" for s%!(EXTRA string=pod:test2_pod0)
--- FAIL: TestBuildSummary (0.00s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x471817]
goroutine 16 [running]:
testing.func·006()
/usr/src/go/src/testing/testing.go:441 +0x181
k8s.io/kubernetes/pkg/kubelet/server/stats.checkNetworkStats(0xc20806d3b0, 0x140bbc0, 0x4, 0x0, 0x0)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/server/stats/summary_test.go:296 +0xc07
k8s.io/kubernetes/pkg/kubelet/server/stats.TestBuildSummary(0xc20806d3b0)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/server/stats/summary_test.go:124 +0x11d2
testing.tRunner(0xc20806d3b0, 0x1e43180)
/usr/src/go/src/testing/testing.go:447 +0xbf
created by testing.RunTests
/usr/src/go/src/testing/testing.go:555 +0xa8b
```
Automatic merge from submit-queue
Various kubenet fixes (panics and bugs and cidrs, oh my)
This PR fixes the following issues:
1. Corrects an inverse error-check that prevented `shaper.Reset` from ever being called with a correct ip address
2. Fix an issue where `parseCIDR` would fail after a kubelet restart due to an IP being stored instead of a CIDR being stored in the cache.
3. Fix an issue where kubenet could panic in TearDownPod if it was called before SetUpPod (e.g. after a kubelet restart). Because of bug number 1, this didn't happen except in rare situations (see 2 for why such a rare situation might happen).
This adds a test, but more would definitely be useful.
The commits are also granular enough I could split this up more if desired.
I'm also not super-familiar with this code, so review and feedback would be welcome.
Testing done:
```
$ cat examples/egress/egress.yml
apiVersion: v1
kind: Pod
metadata:
  labels:
    name: egress
  name: egress-output
  annotations: {"kubernetes.io/ingress-bandwidth": "300k"}
spec:
  restartPolicy: Never
  containers:
  - name: egress
    image: busybox
    command: ["sh", "-c", "sleep 60"]
$ cat kubelet.log
...
Running: tc filter add dev cbr0 protocol ip parent 1:0 prio 1 u32 match ip dst 10.0.0.5/32 flowid 1:1
# setup
...
Running: tc filter del dev cbr0 parent 1: proto ip prio 1 handle 800::800 u32
# teardown
```
I also did various other bits of manual testing and logging to hunt down the panic and other issues, but don't have anything to paste for that
cc @dcbw @kubernetes/sig-network
Automatic merge from submit-queue
Add a NodeCondition "NetworkUnavailable" to prevent scheduling onto a node until the routes have been created
This is new version of #26267 (based on top of that one).
The new workflow is:
- we have a "NetworkNotReady" condition
- when the Kubelet creates a node, it sets the condition to "true"
- RouteController will set it to "false" when the route is created
- the scheduler schedules only on nodes that don't have the "NetworkNotReady==true" condition
@gmarek @bgrant0607 @zmerlynn @cjcullen @derekwaynecarr @danwinship @dcbw @lavalamp @vishh
Automatic merge from submit-queue
rkt: Fix panic in setting ReadOnlyRootFS
What the title says. I wish this method were broken out in a reasonably unit-testable way. Fixing this panic is more important for the moment though; testing will come in a later commit.
I observed the panic in a `./hack/local-up-cluster.sh` run with rkt as the container runtime.
This is also the panic that's failing our jenkins against master (see the kubelet.log output of a [recent run](https://console.cloud.google.com/m/cloudstorage/b/rktnetes-jenkins/o/logs/kubernetes-e2e-gce/1946/artifacts/jenkins-e2e-minion-group-qjh3/kubelet.log))
cc @tmrts @yifan-gu
Automatic merge from submit-queue
rkt: Pass through podIP
This is needed for the /etc/hosts mount and the downward API to work.
Furthermore, this is required for the reported `PodStatus` to be
correct.
The `Status` bit mostly worked prior to #25062, and this restores that
functionality in addition to the new functionality.
In retrospect, the regression in status is large enough the prior PR should have included at least some of this; my bad for not realizing the full implications there.
#25902 is needed for downwards api stuff, but either merge order is fine as neither will break badly by itself.
cc @yifan-gu @dcbw
Automatic merge from submit-queue
Stabilize map order in kubectl describe
Refs #25251.
Add `SortedResourceNames()` methods to map type aliases in order to achieve stable output order for `kubectl` descriptors.
This affects QoS classes, resource limits, and resource requests.
A few remarks:
1. I couldn't find map usages for described fields other than the ones mentioned above. Then again, I failed to identify those programmatically/systematically. Pointers given, I'd be happy to cover any gaps within this PR or along additional ones.
1. It's somewhat difficult to deterministically test a function that brings reliable ordering to Go maps due to their randomizing nature. None of the possibilities I came up with (rely on "probabilistic testing" against repeatedly created maps, add complexity through additional interfaces) seemed very appealing to me, so I went with testing my `sort.Interface` implementation and the changed logic in `kubectl.describeContainers()`.
1. It's apparently not possible to implement a single function that sorts any map's keys generically in Go without producing lots of boilerplate: a `map[<key type>]interface{}` is different from any other map type and thus requires explicit iteration on the caller site to convert back and forth. Unfortunately, this makes it hard to completely avoid code/test duplication.
Please let me know what you think.
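A minimal sketch of the `SortedResourceNames()` idea (the map type here is an illustrative stand-in for the described type aliases):
```go
package main

import (
	"fmt"
	"sort"
)

type ResourceList map[string]string

// SortedResourceNames returns the map's keys in lexical order so output
// built from the map is deterministic.
func (rl ResourceList) SortedResourceNames() []string {
	names := make([]string, 0, len(rl))
	for name := range rl {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	limits := ResourceList{"memory": "128Mi", "cpu": "250m"}
	for _, name := range limits.SortedResourceNames() {
		fmt.Printf("%s=%s\n", name, limits[name]) // always cpu, then memory
	}
}
```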
Automatic merge from submit-queue
Fix system container detection in kubelet on systemd
```release-note
Fix system container detection in kubelet on systemd.
This stops environments where CPU and Memory accounting are not enabled on the unit
that launched the kubelet or docker from reporting the root cgroup when
monitoring usage stats for those components.
```
Fixes https://github.com/kubernetes/kubernetes/issues/25909
/cc @kubernetes/sig-node @kubernetes/rh-cluster-infra @vishh @dchen1107
Automatic merge from submit-queue
rkt: Use volumes from RunContainerOptions
This replaces the previous creation of mounts from the `volumeGetter`
with mounts provided via RunContainerOptions.
This is motivated by the fact that the latter has a more complete set of
mounts (e.g. the `/etc/hosts` one created in kubelet.go in the case an IP is available).
This does not induce further e2e failures as far as I can tell.
cc @yifan-gu
The length of an IP can be 4 or 16, and even if 16 it can be a valid
ipv4 address. This check is the more-correct way to handle this, and it
also provides more granular error messages.
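A minimal sketch of the more-correct check described above, using the standard library:
```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// net.ParseIP stores IPv4 addresses in a 16-byte slice, so checking
	// len(ip) == 4 misclassifies perfectly valid IPv4 addresses.
	ip := net.ParseIP("10.0.0.5")
	fmt.Println(len(ip)) // 16
	if ip.To4() != nil {
		fmt.Println("valid IPv4 address")
	}
}
```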
Teardown can run before Setup when the kubelet is restarted... in that
case, the shaper was nil and thus calling the shaper resulted in a panic.
This fixes that by ensuring the shaper is always set... +1 level of
indirection and all that.
Before this change, the podCIDRs map contained both cidrs and ips
depending on which code path entered a container into it.
Specifically, SetUpPod would enter a CIDR while GetPodNetworkStatus
would enter an IP.
This normalizes both of them to always enter just IP addresses.
This also removes the now-redundant cidr parsing that was used to get
the ip before.
Automatic merge from submit-queue
Use read-only root filesystem capabilities of rkt
Propagates `api.Container.SecurityContext.ReadOnlyRootFileSystem` flag to rkt container runtime.
cc @yifan-gu
Fixes #23837
Automatic merge from submit-queue
Use docker containerInfo.LogPath and not manually constructed path
Since the containerInfo has the LogPath in it, let's use that and
not manually construct the path ourselves. This also makes the code
less prone to breaking if docker changes this path.
Fixes #23695
Automatic merge from submit-queue
Ensure that init containers are preserved during pruning
Pods with multiple init containers were getting the wrong containers
pruned. Fix an error message and add a test.
Fixes #26131
Automatic merge from submit-queue
rkt: Support alternate stage1's via annotation
This provides a basic implementation for setting a stage1 on a per-pod
basis via an annotation.
See discussion here for how this approach was arrived at: https://github.com/kubernetes/kubernetes/issues/23944#issuecomment-212653776
It's possible this feature should be gated behind additional knobs, such
as a kubelet flag to filter allowed stage1s, or a check akin to what
privileged gets in the apiserver.
Currently, it checks `AllowPrivileged`, as a means to let people disable
this feature, though overloading it as stage1 and privileged isn't
ideal.
Fixes #23944
Testing done (note, unfortunately done with some additional ./cluster changes merged in):
```
$ cat examples/stage1-fly/fly-me-to-the-moon.yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    name: exit
  name: exit-fast
  annotations: {"rkt.alpha.kubernetes.io/stage1-name-override": "coreos.com/rkt/stage1-fly:1.3.0"}
spec:
  restartPolicy: Never
  containers:
  - name: exit
    image: busybox
    command: ["sh", "-c", "ps aux"]
$ kubectl create -f examples/stage1-fly
$ ssh core@minion systemctl status -l --no-pager k8s_2f169b2e-c32a-49e9-a5fb-29ae1f6b4783.service
...
failed
...
May 04 23:33:03 minion rkt[2525]: stage0: error writing /etc/rkt-resolv.conf: open /var/lib/rkt/pods/run/2f169b2e-c32a-49e9-a5fb-29ae1f6b4783/stage1/rootfs/etc/rkt-resolv.conf: no such file or directory
...
# Restart kubelet with allow-privileged=false
$ kubectl create -f examples/stage1-fly
$ kubectl describe exit-fast
...
1m 19s 5 {kubelet euank-e2e-test-minion-dv3u} spec.containers{exit} Warning Failed Failed to create rkt container with error: cannot make "exit-fast_default(17050ce9-1252-11e6-a52a-42010af00002)": running a custom stage1 requires a privileged security context
....
```
Note as well that the "success" here is rkt spitting out an [error message](https://github.com/coreos/rkt/issues/2141) which indicates that the right stage1 was being used at least.
cc @yifan-gu @aaronlevy
Automatic merge from submit-queue
Downward API implementation for resources limits and requests
This is an implementation of Downward API for resources limits and requests, and it works with environment variables and volume plugin.
This is based on proposal https://github.com/kubernetes/kubernetes/pull/24051. This implementation follows API with magic keys approach as discussed in the proposal.
@kubernetes/rh-cluster-infra
Automatic merge from submit-queue
kubelet/cadvisor: Refactor cadvisor disk stat/usage interfaces.
basically
1) cadvisor struct will know what runtime the kubelet is, passed in via additional argument to New()
2) rename the cadvisor wrapper function DockerImagesFsInfo() to ImagesFsInfo() and have the linux implementation choose a label based on the runtime inside the cadvisor struct
2a) mock/fake/unsupported modified to take the same additional argument in New()
3) kubelet's wrapper for the cadvisor wrapper is renamed in parallel
4) make all tests use new interface
Automatic merge from submit-queue
Fix detection of docker cgroup on RHEL
Check docker's pid file, then fallback to pidof when trying to determine the pid for docker. The
latest docker RPM for RHEL changes /usr/bin/docker from an executable to a shell script (to support
/usr/bin/docker-current and /usr/bin/docker-latest). The pidof check for docker fails in this case,
so we check /var/run/docker.pid first (the default location), and fallback to pidof if that fails.
@kubernetes/sig-node @kubernetes/rh-cluster-infra
Automatic merge from submit-queue
Add support for limiting grace period during soft eviction
Adds eviction manager support in kubelet for max pod graceful termination period when a soft eviction is met.
```release-note
Kubelet evicts pods when available memory falls below configured eviction thresholds
```
/cc @vishh
Automatic merge from submit-queue
Add support for PersistentVolumeClaim in Attacher/Detacher interface
The attach/detach interface does not support volumes which are referenced through PVCs. This PR adds that support.
Automatic merge from submit-queue
Only expose top N images in `NodeStatus`
Fix #25209
Sort the images by size and only report the top 50 in node status.
cc @vishh
Automatic merge from submit-queue
Updating QoS policy to be at the pod level
Quality of Service will be derived from an entire Pod Spec, instead of being derived from resource specifications of individual resources per-container.
A Pod is `Guaranteed` iff all its containers have limits == requests for all the first-class resources (cpu, memory as of now).
A Pod is `BestEffort` iff requests & limits are not specified for any resource across all containers.
A Pod is `Burstable` otherwise.
Note: Existing pods might be more susceptible to OOM Kills on the node due to this PR! To protect pods from being OOM killed on the node, set `limits` for all resources across all containers in a pod.
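A minimal sketch of deriving the pod-level class from the rules above (the resource and container types are simplified stand-ins for the API types):
```go
package main

import "fmt"

type Resources struct {
	Requests map[string]string
	Limits   map[string]string
}

type Container struct {
	Resources Resources
}

// qosClass applies the three rules: Guaranteed iff limits == requests for
// cpu and memory in every container, BestEffort iff nothing is specified,
// Burstable otherwise.
func qosClass(containers []Container) string {
	guaranteed := true
	bestEffort := true
	for _, c := range containers {
		if len(c.Resources.Requests) > 0 || len(c.Resources.Limits) > 0 {
			bestEffort = false
		}
		for _, res := range []string{"cpu", "memory"} {
			if c.Resources.Limits[res] == "" ||
				c.Resources.Limits[res] != c.Resources.Requests[res] {
				guaranteed = false
			}
		}
	}
	switch {
	case guaranteed:
		return "Guaranteed"
	case bestEffort:
		return "BestEffort"
	default:
		return "Burstable"
	}
}

func main() {
	pod := []Container{{Resources: Resources{
		Requests: map[string]string{"cpu": "100m", "memory": "64Mi"},
		Limits:   map[string]string{"cpu": "100m", "memory": "64Mi"},
	}}}
	fmt.Println(qosClass(pod)) // Guaranteed
}
```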
Automatic merge from submit-queue
kubelet: Don't attempt to apply the oom score if container exited already
Containers could terminate before kubelet applies the oom score. This is normal
and the function should not error out.
This addresses #25844 partially.
/cc @smarterclayton @Random-Liu
Automatic merge from submit-queue
Make name validators return string slices
Part of the larger validation PR, broken out for easier review and merge. Builds on previous PRs in the series.
This patch adds the --exit-on-lock-contention flag, which must be used
in conjunction with the --lock-file flag. When provided, it causes the
kubelet to wait for inotify events for that lock file. When an 'open'
event is received, the kubelet will exit.
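A minimal sketch of that inotify wait using raw inotify via golang.org/x/sys/unix (the lock file path is illustrative; error handling abbreviated):
```go
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	const lockFile = "/var/run/kubelet.lock" // illustrative path

	fd, err := unix.InotifyInit1(0)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := unix.InotifyAddWatch(fd, lockFile, unix.IN_OPEN); err != nil {
		log.Fatal(err)
	}

	// Block until another process opens the lock file, then exit so the
	// newer instance can take over.
	buf := make([]byte, unix.SizeofInotifyEvent+unix.NAME_MAX+1)
	if _, err := unix.Read(fd, buf); err != nil {
		log.Fatal(err)
	}
	log.Fatal("lock file contended: exiting")
}
```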
When the IP isn't in the internal map, GetPodNetworkStatus() needs
to call the execer for the 'nsenter' program. That means the execer
needs to be !nil, which it wasn't before.
Automatic merge from submit-queue
Make IsValidLabelValue return error strings
Part of the larger validation PR, broken out for easier review and merge. Builds on previous PRs in the series.
Automatic merge from submit-queue
Add init containers to pods
This implements #1589 as per proposal #23666
Incorporates feedback on #1589, creates parallel structure for InitContainers and Containers, adds validation for InitContainers that requires name uniqueness, and comments on a number of implications of init containers.
This is a complete alpha implementation.
If a pod has not printed anything to stdout/stderr, it's expected
behaviour to get `-- No entries --`, even when requesting json output.
Prior to this change, a warning would be printed in such an occasion.
Automatic merge from submit-queue
Remove RunInContainer interface in Kubelet Runtime interface
According to #24689, we should merge RunInContainer and ExecInContainer in the container runtime interface.
@yujuhong @kubernetes/sig-node
Automatic merge from submit-queue
Make IsQualifiedName return error strings
Part of the larger validation PR, broken out for easier review and merge.
@lavalamp FYI, but I know you're swamped, too.
Automatic merge from submit-queue
Automatically add node labels beta.kubernetes.io/{os,arch}
Proposal: #17981
As discussed in #22623:
> @davidopp: #9044 says cloud provider but can also cover platform stuff.
Adds a label `beta.kubernetes.io/platform` to `kubelet` that informs about the os/arch it's running on.
Makes it easy to specify `nodeSelectors` for different arches in multi-arch clusters.
```console
$ kubectl get no --show-labels
NAME STATUS AGE LABELS
127.0.0.1 Ready 1m beta.kubernetes.io/platform=linux-amd64,kubernetes.io/hostname=127.0.0.1
$ kubectl describe no
Name: 127.0.0.1
Labels: beta.kubernetes.io/platform=linux-amd64,kubernetes.io/hostname=127.0.0.1
CreationTimestamp: Thu, 31 Mar 2016 20:39:15 +0300
```
@davidopp @vishh @fgrzadkowski @thockin @wojtek-t @ixdy @bgrant0607 @dchen1107 @preillyme
Automatic merge from submit-queue
Add IPv6 address support for pods - does NOT include services
This allows a container to have an IPv6 address only and extracts the address via nsenter and iproute2 or the docker client directly. An IPv6 address is now correctly reported when describing a pod.
@thockin @kubernetes/sig-network
Automatic merge from submit-queue
Add eviction-pressure-transition-period flag to kubelet
This PR does the following:
* add the new flag to control how often a node will go out of memory pressure or disk pressure conditions; see: https://github.com/kubernetes/kubernetes/pull/25282
* pass an `eviction.Config` into `kubelet` so we can group config
/cc @vishh
Automatic merge from submit-queue
PodWorkers UpdatePod takes options struct
First commit from https://github.com/kubernetes/kubernetes/pull/24843
Second commit:
The `PodWorkers.UpdatePod` operation is updated as follows:
* use options struct to pass arguments
* add a pod status func to allow override status
* add pod termination grace period if sync operation requires a kill pod
* add a call-back that is error aware
Third commit:
Add a `killPodNow` to kubelet that does a blocking kill pod call that properly integrates with pod workers.
The plan is to pass `killPodNow` as a function pointer into the out of resource killer.
```
// KillPodFunc kills a pod.
// The pod status is updated, and then it is killed with the specified grace period.
// This function must block until either the pod is killed or an error is encountered.
// Arguments:
// pod - the pod to kill
// status - the desired status to associate with the pod (i.e. why it's killed)
// gracePeriodOverride - the grace period override to use instead of what is on the pod spec
type KillPodFunc func(pod *api.Pod, status api.PodStatus, gracePeriodOverride *int64) error
```
You can see it being used here in the WIP out of resource killer PR.
1344f858fb (diff-92ff0f643237f29824b4929574f84609R277)
/cc @vishh @yujuhong @pmorie
Automatic merge from submit-queue
Add utility for kubelet to log resource lists consistently
This is a simple utility for logging resource lists with standardized output.
I find it useful when logging work in node eviction, similar to kubelet logging convention for pods in same package.
Automatic merge from submit-queue
WIP v0 NVIDIA GPU support
```release-note
* Alpha support for scheduling pods on machines with NVIDIA GPUs whose kubelets use the `--experimental-nvidia-gpus` flag, using the alpha.kubernetes.io/nvidia-gpu resource
```
Implements part of #24071 for #23587
I am not familiar with the scheduler enough to know what to do with the scores. Mostly punting for now.
Missing items from the implementation plan: limitranger, rkt support, kubectl
support and docs
cc @erictune @davidopp @dchen1107 @vishh @Hui-Zhi @gopinatht
Automatic merge from submit-queue
Add pod condition PodScheduled to detect situation when scheduler tried to schedule a Pod, but failed
Set the `PodScheduled` condition to `ConditionFalse` in `scheduleOne()` if scheduling failed and to `ConditionTrue` in the `/bind` subresource.
Ref #24404
@mml (as it seems to be related to "why pending" effort)
Automatic merge from submit-queue
kubenet try to retrieve ip inside pod net namespace
Kubenet currently stores the ips of pods inside a map. Kubelet gets the pod ip from kubenet during syncPod. If the kubelet restarts, all pods on the node lose their ips in podStatus. This PR adds logic to retrieve the pod IP from the pod netns.
cc: @yujuhong
Automatic merge from submit-queue
Kubelet: Add docker operation timeout
For #23563.
Based on #24748, only the last 2 commits are new.
This PR:
1) Add timeout for all docker operations.
2) Add docker operation timeout metrics
3) Cleanup kubelet stats and add runtime operation error and timeout rate monitoring.
4) Monitor runtime operation error and timeout rate in kubelet perf.
@yujuhong
/cc @gmarek Because of the metrics change.
/cc @kubernetes/sig-node
Automatic merge from submit-queue
Kubelet eviction flag parsers and tests
The first two commits are from https://github.com/kubernetes/kubernetes/pull/24559 that have achieved LGTM.
The last commit is the only part that is interesting; it adds the parsing logic to handle the flags, and reserves `pkg/kubelet/eviction` for eviction manager logic.
Automatic merge from submit-queue
kubelet: Remove redundant `Container.Created`
As far as I can tell, this has been supplanted by a) the `DockerJSON.CreatedAt` field and b) the
`ContainerStatus.CreatedAt`, where the first is used for creating the
second.
The `.Created` field was only written to as far as I can see.
cc @yifan-gu & @Random-Liu
Is there any reason we might want to keep this around?
Automatic merge from submit-queue
Abstract node side functionality of attachable plugins
- Create PhysicalAttacher interface to abstract MountDevice and
WaitForAttach.
- Create PhysicalDetacher interface to abstract WaitForDetach and
UnmountDevice.
- Expand unit tests to check that Attach, Detach, WaitForAttach,
WaitForDetach, MountDevice, and UnmountDevice get called where
appropriate.
Physical{Attacher,Detacher} are working titles; suggestions welcome. Some other thoughts:
- NodeSideAttacher or NodeAttacher.
- AttachWatcher
- Call this Attacher and call the Current Attacher CloudAttacher.
- DeviceMounter (although there are way too many things called Mounter right now :/)
This is to address: https://github.com/kubernetes/kubernetes/pull/21709#issuecomment-192035382
@saad-ali
Automatic merge from submit-queue
Add subPath to mount a child dir or file of a volumeMount
Allow users to specify a subPath in Container.volumeMounts so they can use a single volume for many mounts instead of creating many volumes. For instance, a user can now use a single PersistentVolume to store the Mysql database and the document root of an Apache server of a LAMP stack pod by mapping them to different subPaths in this single volume.
Also solves https://github.com/kubernetes/kubernetes/issues/20466.
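For illustration, a pod along these lines (names and images are illustrative) could share one volume between two containers via different subPaths:
```
apiVersion: v1
kind: Pod
metadata:
  name: lamp
spec:
  containers:
  - name: web
    image: httpd
    volumeMounts:
    - name: site-data
      mountPath: /var/www/html
      subPath: html
  - name: db
    image: mysql
    volumeMounts:
    - name: site-data
      mountPath: /var/lib/mysql
      subPath: mysql
  volumes:
  - name: site-data
    emptyDir: {}
```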
Automatic merge from submit-queue
PLEG: reinspect pods that failed prior inspections
Fix the following sequence of events:
1. relist call 1 successfully inspects a pod (just has infra container)
1. relist call 2 gets an error inspecting the same pod (has infra container and a transient
container that failed to create) and doesn't update the old/new pod records
1. relist calls 3+ don't inspect the pod any more (just has infra container so it doesn't look like
anything changed)
This change adds a new list that keeps track of pods that failed inspection and retries them the
next time relist is called. Without this change, a pod in this state would never be inspected again,
its entry in the status cache would never be updated, and the pod worker would never call syncPod
again because the most recent entry in the status cache has an error associated with it. Without
this change, pods in this state would be stuck Terminating forever, unless the user issued a
deletion with a grace period value of 0.
Fixes#24819
cc @kubernetes/rh-cluster-infra @kubernetes/sig-node
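A minimal sketch of the retry list described above (the pleg type and method names are illustrative stand-ins for the real PLEG):
```go
package pleg

// pleg is a simplified stand-in for the pod lifecycle event generator;
// inspect is whatever performs the pod inspection.
type pleg struct {
	podsToReinspect map[string]bool // pod UIDs whose last inspection failed
	inspect         func(uid string) error
}

// relist inspects the pods that look changed plus every pod whose previous
// inspection failed, so one transient inspection error can't leave a pod's
// cached status stale forever.
func (g *pleg) relist(changed []string) {
	for uid := range g.podsToReinspect {
		changed = append(changed, uid)
	}
	for _, uid := range changed {
		if err := g.inspect(uid); err != nil {
			g.podsToReinspect[uid] = true
			continue
		}
		delete(g.podsToReinspect, uid)
	}
}
```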
Automatic merge from submit-queue
Define interfaces for kubelet pod admission and eviction
There is too much code and logic in `kubelet.go` that makes it hard to test functions in discrete pieces.
I propose an interface that an internal module can implement that will let it make an admission decision for a pod. If folks are ok with the pattern, I want to move a) predicate checking, b) out of disk, and c) the eviction check that prevents best-effort pods from being admitted, into their own dedicated handlers that would be easier for us to mock test. We can then just write tests to ensure that the `Kubelet` calls a call-out, and we can write easier unit tests to ensure that dedicated handlers do the right thing.
The second interface I propose was a `PodEvictor` that is invoked in the main kubelet sync loop to know if pods should be pro-actively evicted from the machine. The current active deadline check should move into a simple evictor implementation, and I want to plug the out of resource killer code path as an implementation of the same interface.
@vishh @timothysc - if you guys can ack on this, I will add some unit testing to ensure we do the call-outs.
/cc @kubernetes/sig-node @kubernetes/rh-cluster-infra
Automatic merge from submit-queue
kubenet: fix up CNI bridge TX queue length if needed
CNI's bridge plugin mis-handles the TxQLen when creating the bridge,
leading to a zero-length TX queue. This doesn't typically cause
problems (since virtual interfaces don't have hard queue limits)
but when adding traffic shaping, some qdiscs pull their packet
limits from the TX queue length, leading to a packet limit of 0
in some cases. Until we can depend on a new enough version of
CNI, fix up the TX queue length internally.
Closes: https://github.com/kubernetes/kubernetes/issues/25092
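A minimal sketch of such a fix-up using the vishvananda/netlink package (the bridge name and queue length are illustrative):
```go
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	link, err := netlink.LinkByName("cbr0")
	if err != nil {
		log.Fatal(err)
	}
	// Some qdiscs derive their packet limit from the TX queue length, so a
	// zero-length queue breaks traffic shaping.
	if link.Attrs().TxQLen == 0 {
		if err := netlink.LinkSetTxQLen(link, 1000); err != nil {
			log.Fatal(err)
		}
	}
}
```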
Automatic merge from submit-queue
Remove nodeName from predicate signature.
With this approach, I'm getting an initial throughput (in an empty cluster) of ~95 pods/s in a 1000-node cluster,
which is a ~30% improvement.
@kubernetes/sig-scalability
Automatic merge from submit-queue
Kubelet: Cleanup with new engine api
Finish step 2 of #23563
This PR:
1) Cleanup go-dockerclient reference in the code.
2) Bump up the engine-api version.
3) Cleanup the code with new engine-api.
Fixes #24076.
Fixes #23809.
/cc @yujuhong
Automatic merge from submit-queue
Delete pod with uid as precondition.
Addressed https://github.com/kubernetes/kubernetes/issues/25169#issuecomment-217033202.
Fix #25169. Fix #24937
This PR change status manager to delete pods with uid as a precondition, so that kubelet won't delete pods with different uid but the same name and namespace accidentally.
/cc @yujuhong
CNI's bridge plugin mis-handles the TxQLen when creating the bridge,
leading to a zero-length TX queue. This doesn't typically cause
problems (since virtual interfaces don't have hard queue limits)
but when adding traffic shaping, some qdiscs pull their packet
limits from the TX queue length, leading to a packet limit of 0
in some cases. Until we can depend on a new enough version of
CNI, fix up the TX queue length internally.
- Expand Attacher/Detacher interfaces to break up work more
explicitly.
- Add arguments to all functions to avoid having implementers store
the data needed for operations.
- Expand unit tests to check that Attach, Detach, WaitForAttach,
WaitForDetach, MountDevice, and UnmountDevice get called where
appropriate.
Automatic merge from submit-queue
add mutex for kubenet
I saw a bunch of weird cases in the kubenet suite. For instance, SetUpPod returns successfully, but right after that, kubelet cannot retrieve the podIP from the podCIDR map.
cc: @dcbw @thockin
ref: #24211
Automatic merge from submit-queue
Promote Pod Hostname & Subdomain to fields (were annotations)
Deprecating the podHostName, subdomain and PodHostnames annotations and creating corresponding new fields for them on the PodSpec and Endpoints types.
Annotation doc: #22564
Annotation code: #20688
Automatic merge from submit-queue
Store node information in NodeInfo
This is significantly improving scheduler throughput.
On 1000-node cluster:
- empty cluster: ~70pods/s
- full cluster: ~45pods/s
Drop in throughput is mostly related to priority functions, which I will be looking into next (I already have one PR #24095, but we need a few more things first).
This is roughly a ~40% increase.
However, we still need a better understanding of the predicate functions, because in my opinion they should be even faster than they are now. I'm going to look into it next week.
@gmarek @hongchaodeng @xiang90
Automatic merge from submit-queue
Do not update cache with so much effort
Fixes: #24298
1. Remove the automatic update.
2. Every time, check if we can get a valid value from the cache; if not, get the value directly from the api.
cc @Random-Liu
Automatic merge from submit-queue
rkt: Add post-start hook support.
This adds a poll-and-timeout procedure after the pod is
started, to make sure the post-start hooks execute when the
container is actually running.
This is a temporary workaround for implementing post-hooks,
a long term solution is to use lifecycle event to trigger
those hooks, see https://github.com/kubernetes/kubernetes/issues/23084.
Also this fixes a bug of getting container ID for a non-running
container when running pre-stop hook.
cc @sjpotter @euank @kubernetes/sig-node
Automatic merge from submit-queue
Fix use of docker removed ParseRepositoryTag() function
Docker has removed the ParseRepositoryTag() function in
leading to failures using the kubernetes Go client API.
Failure:
```
../k8s.io/kubernetes/pkg/util/parsers/parsers.go:30: undefined: parsers.ParseRepositoryTag
```
Automatic merge from submit-queue
Collect and expose runtime's image storage usage via Kubelet's /stats/summary endpoint
This information is useful to users since docker images are typically not stored on the root filesystem.
Kubelet will also consume this feature in the future to decide if evicting images will help with disk usage on the nodes.
cc @kubernetes/sig-node
Docker has removed the ParseRepositoryTag() function in
leading to failures using the kubernetes Go client API.
Let's use github.com/docker/distribution reference.ParseNamed()
instead.
Failure:
../k8s.io/kubernetes/pkg/util/parsers/parsers.go:30: undefined: parsers.ParseRepositoryTag
Automatic merge from submit-queue
Refactor image related functions to use docker engine-api
ref #23563
Hope this can be of some help, cc @Random-Liu
If it's ok, I will add more work here.
Automatic merge from submit-queue
rkt: Return `FinishedAt` for pod
This is implemented via touching a file on stop as a hook in the systemd
unit. The ctime of this file is then used to get the `finishedAt` time
in the future.
In addition, this changes the `startedAt` and `createdAt` to use the api
server's results rather than the annotations it previously used.
It's possible we might want to move this into the api in the future.
Fixes#23887
I did the following manual testing:
```
$ cat ./examples/output/exit-output.yml
apiVersion: v1
kind: Pod
metadata:
  labels:
    name: exit
  name: exit-output
spec:
  restartPolicy: Never
  containers:
  - name: exit
    image: busybox
    command: ["sh", "-c", "echo Exiting in 60; sleep 60; echo goodbye"]
$ kubectl create -f ./examples/exit/exit-output.yaml
$ # wait
$ kubectl describe pod exit-output | grep State -A 4
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 19 Apr 2016 13:23:13 -0700
Finished: Tue, 19 Apr 2016 13:24:13 -0700
$ kubectl logs exit-output
Exiting in 60
goodbye
```
I double checked as well that the file at `/var/lib/kubelet/pods/$id/finished-$id` existed and looked as expected.
This is related to https://github.com/coreos/rkt/issues/1789#issuecomment-207111814 and follows https://github.com/kubernetes/kubernetes/pull/24367 + https://github.com/coreos/rkt/issues/2445
cc @jonboulle @iaguis @yifan-gu @kubernetes/sig-node
Automatic merge from submit-queue
Kubelet: Refactor all but image related functions in DockerInterface
For #23563.
Based on #23699 and #23844.
Only last 3 commits are new. This PR refactored all functions except image related functions, including:
* CreateExec
* StartExec
* InspectExec
* AttachToContainer
* Logs
* Info
* Version
@kubernetes/sig-node
Automatic merge from submit-queue
docker daemon complains SHM size must be greater than 0
Fixes https://github.com/kubernetes/kubernetes/issues/24588
I am hitting this on Fedora 23 w/ docker 1.9.1 using systemd cgroup-driver.
```
$ docker version
Client:
 Version:         1.9.1
 API version:     1.21
 Package version: docker-1.9.1-9.gitee06d03.fc23.x86_64
 Go version:      go1.5.3
 Git commit:      ee06d03/1.9.1
 Built:
 OS/Arch:         linux/amd64

Server:
 Version:         1.9.1
 API version:     1.21
 Package version: docker-1.9.1-9.gitee06d03.fc23.x86_64
 Go version:      go1.5.3
 Git commit:      ee06d03/1.9.1
 Built:
 OS/Arch:         linux/amd64
```
Not sure why I am the only one hitting it right now, but putting this out here for comment.
/cc @kubernetes/sig-node @kubernetes/rh-cluster-infra @smarterclayton
Automatic merge from submit-queue
Fix PullImage and add corresponding node e2e test
Fixes #24101. This is a bug introduced by #23506 (ref #23563).
The root cause of #24101 is described [here](https://github.com/kubernetes/kubernetes/issues/24101#issuecomment-208547623).
This PR
1) Fixes #24101 by decoding the messages returned during image pull, and returning an error if any of the messages contains an error.
2) Add a node e2e test to detect this kind of failure.
3) Move the presence check out of `ConformanceImage.Remove()` and `ConformanceImage.Pull()`, because sometimes we expect an error to occur in `PullImage()` and `RemoveImage()`, and even when it doesn't, the `Present()` check could still return an error and let the test pass.
@yujuhong @freehan @liangchenye
Also /cc @resouer, because he is doing the image related functions refactoring.
Automatic merge from submit-queue
kubenet: Load bridge netfilter module in Init().
This lets kubenet load the bridge netfilter module and set bridge-nf-call-iptables=1.
Fix #24018
Follow up PRs would be appreciate if we also load the module in the bridge plugin binary itself. Ref https://github.com/kubernetes/kubernetes/issues/24018#issuecomment-207682514
cc @kubernetes/sig-node @sjpotter @euank
Automatic merge from submit-queue
Expose SummaryProvider for reuse by other parts of kubelet
To support out of resource killing in the kubelet, we will introduce a new top-level module that will ensure node stability by checking if eviction thresholds have been met for memory and file-system usage on the node. In addition, it will then need information about pod memory and disk usage in order to make an eviction selection. Currently, this information is collected in `SummaryProvider` but it's hidden away and not available for re-use by other top-level modules of the kubelet. This initial refactor adds the ability to get summary stat information from the `ResourceAnalyzer` so it can be reused by other top-level modules.
I suspect we will further re-factor this area as code evolves, but this unblocks further progress on out-of-resource killing.
/cc @vishh @timothysc @kubernetes/sig-node @kubernetes/rh-cluster-infra
Automatic merge from submit-queue
Add memory available to summary stats provider
To support out of resource killing when low on memory, we want to let operators specify eviction thresholds based on available memory instead of memory usage for ease of use when working with heterogeneous nodes.
So for example, a valid eviction threshold would be the following:
* If node.memory.available < 200Mi for 30s, then evict pod(s)
For the node, `memory.availableBytes` is always known since the `memory.limit_in_bytes` is always known for root cgroup. For individual containers in pods, we only populate the `availableBytes` if the container was launched with a memory limit specified. When no memory limit is specified, the cgroupfs sets a value of 1 << 63 in the `memory.limit_in_bytes` so we look for a similar max value to handle unbounded limits, and ignore setting `memory.availableBytes`.
FYI @vishh @timstclair - as discussed on Slack.
/cc @kubernetes/sig-node @kubernetes/rh-cluster-infra
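A minimal sketch of the derivation described above (the unbounded-limit threshold and names are illustrative):
```go
package stats

// availableBytes derives memory.availableBytes from a cgroup's limit and
// usage. cgroupfs reports a huge value (on the order of 1<<63) in
// memory.limit_in_bytes when no limit is set, so anything near that is
// treated as unbounded and availableBytes is left unset.
func availableBytes(limit, usage uint64) (avail uint64, ok bool) {
	const unboundedThreshold = uint64(1) << 62
	if limit > unboundedThreshold {
		return 0, false // unbounded limit: do not populate availableBytes
	}
	if usage > limit {
		return 0, true
	}
	return limit - usage, true
}
```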
Automatic merge from submit-queue
Kubelet: Refactor container related functions in DockerInterface
For #23563.
Based on #23506, will rebase after #23506 is merged.
The last 4 commits of this PR are new.
This PR refactors all container lifecycle related functions in DockerInterface, including:
* ListContainers
* InspectContainer
* CreateContainer
* StartContainer
* StopContainer
* RemoveContainer
@kubernetes/sig-node
Automatic merge from submit-queue
rkt: Fix hostnetwork.
Mount the host's /etc/hosts and /etc/resolv.conf, and set the host's hostname
when running the pod in the host's network.
Fix #24235
cc @kubernetes/sig-node
Automatic merge from submit-queue
rkt: Use rkt pod's uuid as the systemd service file's name.
Previously, the service file's name was 'k8s_${POD_UID}.service',
which means we need to `systemctl daemon-reload` if we replace
the content of the service file (e.g. the pod is restarted).
However this makes the journal in the previous pod get disconnected.
This PR solves the issue by using the unique rkt uuid as the service
file's name. After the change, the service file's name will be:
'k8s_${rkt_uuid}.service'.
Fix#23691
Automatic merge from submit-queue
rkt: Update the directory path for saving auth config.
Since #23308 is merged, we now have a more stable way to determine where to store the auth configs.
cc @yujuhong @sjpotter
Automatic merge from submit-queue
Allow lazy binding in credential providers; don't use it in AWS yet
This is step one for cross-region ECR support and has no visible effects yet.
I'm not crazy about the name LazyProvide. Perhaps the interface method could
remain like that and the package method of the same name could become
LateBind(). I still don't understand why the credential provider has a
DockerConfigEntry that has the same fields but is distinct from
docker.AuthConfiguration. I had to write a converter now that we do that in
more than one place.
In step two, I'll add another intermediate, lazy provider for each AWS region,
whose empty LazyAuthConfiguration will have a refresh time of months or years.
Behind the scenes, it'll use an actual ecrProvider with the usual ~12 hour
credentials, that will get created (and later refreshed) only when kubelet is
attempting to pull an image. If we simply turned ecrProvider directly into a
lazy provider, we would bypass all the caching and get new credentials for
each image pulled.
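For a sense of what lazy binding could look like, a minimal Go sketch; the types here are hypothetical, not the actual credentialprovider package:
```
package lazycreds

import (
	"sync"
	"time"
)

// AuthConfig stands in for a registry credential (hypothetical type).
type AuthConfig struct{ Username, Password string }

// Provider stands in for a credential provider (hypothetical interface).
type Provider interface {
	Provide() AuthConfig
}

// lazyProvider constructs the real provider only on first use, then caches
// its credentials until ttl elapses, so the backing service (e.g. ECR) is
// only consulted when a pull actually needs fresh credentials.
type lazyProvider struct {
	mu      sync.Mutex
	newFunc func() Provider // builds the real provider, e.g. one per region
	ttl     time.Duration

	inner   Provider
	cached  AuthConfig
	expires time.Time
}

func (l *lazyProvider) Provide() AuthConfig {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inner == nil {
		// Late binding: the expensive provider exists only once an image
		// pull actually needs credentials.
		l.inner = l.newFunc()
	}
	if time.Now().After(l.expires) {
		l.cached = l.inner.Provide()
		l.expires = time.Now().Add(l.ttl)
	}
	return l.cached
}
```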
Automatic merge from submit-queue
Kubelet: Better-defined Container Waiting state
For issue #20478 and #21125.
This PR corrects the logic and adds a unit test for `ShouldContainerBeRestarted()`, cleans up `Waiting` state related code, and adds a unit test for `generateAPIPodStatus()`.
Fixes #20478. Fixes #17971.
@yujuhong
Mount the host's /etc/hosts and /etc/resolv.conf, and set the host's hostname
when running the pod in the host's network.
Besides, do not set the DNS flags when running in the host's network.
Automatic merge from submit-queue
rkt: Add pre-stop lifecycle hooks for rkt.
When a pod is being terminated, the pre-stop hooks of all the containers
will be run before the containers are stopped.
cc @yujuhong @Random-Liu @sjpotter
Automatic merge from submit-queue
Use OomScoreAdj in kubelet for newer docker api
fixes: #20121
Related: client side PR [pull 454](https://github.com/fsouza/go-dockerclient/pull/454)
Godeps has already been updated to `0099401a7342ad77e71ca9f9a57c5e72fb80f6b2`, which includes the client-side modification. But it seems too aggressive to upgrade kubelet's docker API version.
Automatic merge from submit-queue
Add godoc to kubelet/volumes.go
Noticed that `mountExternalVolumes`, of all things, was missing Godoc while working w/ @screeley44. Decided to add some tonight since I have been making noise about grokkability of the kubelet lately.
@kubernetes/sig-storage
Automatic merge from submit-queue
Move predicates into library
This PR tries to implement #12744
Any suggestions/ideas are welcome. @davidopp
Current state: the integration test fails if the podCount check is included in Kubelet.
DONE:
1. Refactor all predicates: predicates return fitOrNot (bool) and error (Error), where the latter is of type
PredicateFailureError or InsufficientResourceError. (For a violation of either MaxEBSVolumeCount or
MaxGCEPDVolumeCount, the same error type, ErrMaxVolumeCountExceeded, is returned.)
2. GeneralPredicates() is a predicate function which includes several other predicate functions (PodFitsResource,
PodFitsHost, PodFitsHostPort). It is registered as one of the predicates in DefaultAlgorithmProvider,
is also called in canAdmitPod() in Kubelet, and should be called by other components (like the rescheduler, etc.)
if necessary. See the discussion in issue #12744.
3. Remove the podNumber check from GeneralPredicates.
4. HostName is now verified in Kubelet's canAdmitPod(). Added TestHostNameConflicts in kubelet_test.go.
5. Add a getNodeAnyWay() method in Kubelet to get node information in standalone mode.
TODO:
1. Determine which predicates should be included in GeneralPredicates().
2. Separate GeneralPredicates() into:
a. GeneralPredicatesEvictPod() and
b. GeneralPredicatesNotEvictPod()
3. DaemonSet should use GeneralPredicates().
Automatic merge from submit-queue
Kubelet: Remove nsinit related code and bump up minimum docker apiversion
Docker has had native exec support since 1.3.x, so we no longer need this code.
As for the API version: because Kubernetes now supports docker 1.8.x - 1.10.x, we should bump up the minimum docker API version.
@yujuhong I checked the [changes](https://github.com/docker/engine-api/blob/master/types/versions/v1p20/types.go), we are not relying on any of those changes. So #23506 should work with docker 1.8.x+
Automatic merge from submit-queue
Kubelet: Start using the official docker engine-api
For #23563.
This is the **first step** in the roadmap of switching to docker [engine-api](https://github.com/docker/engine-api).
In this PR, I keep the old `DockerInterface` and implement it with the new engine-api.
With this approach, we could switch to engine-api with minimum change, so that we could:
* Test the engine-api without huge refactoring.
* Send following PRs to refactor functions in `DockerInterface` separately so as to avoid a huge change in one PR.
I've tested this PR locally, it passed all the node conformance test:
```
make test_e2e_node
Ran 19 of 19 Specs in 823.395 seconds
SUCCESS! -- 19 Passed | 0 Failed | 0 Pending | 0 Skipped PASS
Ginkgo ran 1 suite in 13m49.429979585s
Test Suite Passed
```
And it also passed the jenkins gce e2e test:
```
go run hack/e2e.go -test -v --test_args="--ginkgo.skip=\[Slow\]|\[Serial\]|\[Disruptive\]|\[Flaky\]|\[Feature:.+\]"
Ran 161 of 268 Specs in 4570.214 seconds
SUCCESS! -- 161 Passed | 0 Failed | 0 Pending | 107 Skipped PASS
Ginkgo ran 1 suite in 1h16m16.325934558s
Test Suite Passed
2016/03/25 15:12:42 e2e.go:196: Step 'Ginkgo tests' finished in 1h16m18.918754301s
```
I'm writing the design document, and will post the switching roadmap in an umbrella issue soon.
@kubernetes/sig-node
Automatic merge from submit-queue
Refactor streaming code to support interop testing
Refactor exec/attach/port forward client and server code to better
support interop testing of different client and server subprotocol
versions.
Fixes#16119
Allow network plugins to declare that they handle shaping and that
Kubernetes should not. This will first be used by openshift-sdn, which
handles shaping through OVS; without this change, kubelet logs a warning
when it notices the bandwidth annotations.
Automatic merge from submit-queue
rkt: bump rkt version to 1.2.1
Upon bumping the rkt version, `--hostname` is supported. Also, we now get the configs from the rkt API service, so `stage1-image` is deprecated.
cc @yujuhong @Random-Liu
Automatic merge from submit-queue
Make kubelet use an arch-specific pause image depending on GOARCH
Related to: #22876, #22683 and #15140
@ixdy @pwittrock @brendandburns @mikedanese @yujuhong @thockin @zmerlynn
Add GeneratePodHostNameAndDomain() to RuntimeHelper to
get the hostname of the pod from kubelet.
Also update the logging flag to change the journal match from
_HOSTNAME to _MACHINE_ID.
Using JSON makes this robust to an ENTRYPOINT/CMD that contains spaces.
Also removed the 'RemainAfterExit' option; originally this option was
useful when we implemented GetPods() via 'systemctl list-units'.
However, since we are using the rkt API service now, it's no longer needed.
The kubelet sync loop relies on getting one update as the signal that the
specific source is ready. This change ensures that we don't send multiple
updates (ADD, UPDATE) for the first batch of pods. This is required to prevent
the cleanup routine from killing pods prematurely.
This fixes an issue when using CNI where the hash of a Container object will differ between creation and change checks, due to the docker image exporting ports.
This enables the rkt runtime to set up versions during creation.
This fixes a kubelet nil-pointer panic that occurs when kubelet tries to get the
rkt versions but they are not set.
Use an array to store the pod IDs and use that to build the pod array with consistent ordering,
instead of map ordering, which is random and causes test flakes.
We can save a docker inspect in podInfraContainerChanged() because
it's only used within the useHostNetwork() block. We can also
consolidate some code in createPodInfraContainer() because if
the pod uses the host network, no network plugin will be involved.
Finally, in syncPodWithSyncResult() we can consolidate some
conditionals because both hairpin setup and getting the container
IP are only relevant when host networking is *not* being used.
More specifically, putting the dm.determineContainerIP() call
into the !useHostNetwork() block is OK since if no network plugin
was called to set the container up, it makes no sense to call
the network plugin to retrieve the IP address that it did not
handle. The CNI plugin even calls back into the docker manager
to GetContainerIP() which grabs the IP from docker, which will
always be "" for host networked containers anyway.
PLEG is responsible for listing the pods running on the node. If it's hung
due to a non-responsive container runtime or internal bugs, we should restart
kubelet.
cleanupTerminatedPods is responsible for checking whether a pod has been
terminated and force a status update to trigger the pod deletion. However, this
function is called in the periodic cleanup routine, which runs every 2 seconds.
In other words, it forces a status update for each non-running (and not yet
deleted in the apiserver) pod. When batch deleting tens of pods, the rate of
new updates surpasses what the status manager can handle, causing numerous
redundant requests (and the status channel to be full).
This change forces a status update only when detecting the DeletionTimestamp is
set for a terminated pod. Note that for other non-terminated pods, the pod
workers should be responsible for setting the correct status after killing all
the containers.
If the container status exists but the container hasn't been created yet, it won't have an ID.
Have the probe wait for a valid status if the container ID is not yet set; otherwise, you'll see the
following cryptic log message from runtime.go: invalid container ID: "".
bridge-nf-call-iptables appears to only be relevant when the containers are
attached to a Linux bridge, which is usually the case with default Kubernetes
setups, docker, and flannel. That ensures that the container traffic is
actually subject to the iptables rules since it traverses a Linux bridge
and bridged traffic is only subject to iptables when bridge-nf-call-iptables=1.
But with other networking solutions (like openshift-sdn) that don't use Linux
bridges, bridge-nf-call-iptables may not be relevant, because iptables is
invoked at other points not involving a Linux bridge.
The decision to set bridge-nf-call-iptables should be influenced by networking
plugins, so push the responsibility out to them. If no network plugin is
specified, fall back to the existing bridge-nf-call-iptables=1 behavior.
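For reference, the fallback amounts to a one-line sysctl write; a minimal Go sketch (the real kubelet goes through its own sysctl utilities):
```
package main

import "os"

// ensureBridgeNfCallIptables is equivalent to:
//   sysctl -w net.bridge.bridge-nf-call-iptables=1
// It fails if the br_netfilter module is not loaded.
func ensureBridgeNfCallIptables() error {
	return os.WriteFile("/proc/sys/net/bridge/bridge-nf-call-iptables",
		[]byte("1"), 0644)
}

func main() {
	if err := ensureBridgeNfCallIptables(); err != nil {
		panic(err)
	}
}
```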
Node controller is generating a huge amount of logging at v(3) that is
more appropriate for v(5). Split the log into two levels and ensure it
also ends up on one line (so grep works).
The pod manager generates a v(4) pod output on sync that always contains
a newline - since the size of the pod is so excessive in output, kick it
to v(5) for deep debugging (we're pretty happy with this loop).
When sending out a pod status update, kubelet GETs the pod from the apiserver,
terminates if the apiserver returns a "not found" error, and otherwise proceeds
to update.
Even after a pod has been deleted, there might still be queued-up updates for
the pod. This leads to expensive, unnecessary GET operations. The situation is
worse when there are batch creations/deletions of a significant number of pods
(e.g., E2E tests), leaving many updates in the queue.
This change checks whether a pod exists before GETting it from the apiserver
to avoid redundant GETs.
`--kubelet-cgroups` and `--system-cgroups` respectively.
Updated `--runtime-container` to `--runtime-cgroups`.
Cleaned up most of the kubelet code that consumes these flags to match
the flag name changes.
Signed-off-by: Vishnu kannan <vishnuk@google.com>
- pod_workers: pod syncing
- prober workers: container syncing
In order to synchronize the current state of Kubernetes objects (e.g. pods, containers, etc.),
periodic sync loops are run. When there are a lot of objects to synchronize,
the loops increase communication traffic. At some point, when all the traffic interferes,
the CPU usage curve hits the roof, causing 100% CPU utilization.
To distribute the traffic in time, some sync loops can jitter their period in each iteration
and help to flatten the curve.
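A minimal Go sketch of period jittering (an illustrative helper, not the actual kubelet code):
```
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitter returns a duration in [period, period*(1+maxFactor)), so loops
// that start together drift apart instead of syncing their traffic spikes.
func jitter(period time.Duration, maxFactor float64) time.Duration {
	return period + time.Duration(rand.Float64()*maxFactor*float64(period))
}

func main() {
	base := time.Second
	for i := 0; i < 3; i++ {
		d := jitter(base, 0.5) // up to 50% longer than the base period
		fmt.Printf("sleeping %v before next sync\n", d)
		time.Sleep(d)
		// ...perform the sync work here...
	}
}
```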
* Metrics will not be exposed until they are hooked up to a handler
* Metrics are not cached and expose a DoS vector; this must be fixed before release, or the stats should not be exposed through an API endpoint
Kubelet.cleanupOrphanedVolumes() compares the list of volumes mounted on a node
with the list of volumes that are required by pods scheduled on the node
("scheduled volumes").
Both lists should contain real volumes, i.e. when a pod uses a
PersistentVolumeClaim, the list must contain the name of the bound volume instead
of the name of the claim.
This change removes RuntimeCache in the pod workers and the syncPod() function.
Note that it doesn't deprecate RuntimeCache completely as other components
still rely on the cache.
If 'TerminationMessagePath' in the container spec is set, then
we will mount the termination message log into the container.
Also, in GetPodStatus, if the container exits and 'TerminationMessagePath'
is set, then the 'message' field in the container state will be populated.
Many users attempt to use 'kubectl logs' in order to find the logs
for a container, but receive no logs or an error telling them their
container is not running. The fix in this case is to run with '--previous',
but this does not match user expectations for the logs command.
This commit changes the behavior of the Kubelet to return the logs of
the currently running container or the previous running container unless
the user provides the "previous" flag. If the user specifies "follow"
the logs of the most recent container will be displayed, and if it is
a terminated container the logs will come to an end (the user can
repeatedly invoke 'kubectl logs --follow' and see the same output).
Clean up error messages in the kubelet log path to be consistent and
give users a more predictable experience.
Have the Kubelet return 400 on invalid requests
This updates the dockertools.dockerVersion to use a semantic versioning
library to more gracefully support engine versions which include
additional version fields.
Previously, go-dockerclient's APIVersion struct was used, which only
handles plain numeric x.y.z version strings. With #19675, the library
was now used on the Docker engine string; however, it is possible for the
engine string to include additional information for beta, rc, or
distro-specific builds.
This PR also enables the TestDockerRuntimeVersion test, which was
previously just a FIXME, and updates it to pass and be used to test the
version string that caused #20005.
This negates the need for fsouza/go-dockerclient#451, since even with
that change, if a user was running Docker 1.10.0-rc1, this would cause
the kubelet to report it as simply 1.10.0.
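To illustrate why a semver-aware parser matters here, a short Go sketch using the third-party github.com/blang/semver package, shown only as an example of such a library (not necessarily the one this PR adopts):
```
package main

import (
	"fmt"

	"github.com/blang/semver"
)

func main() {
	// Plain x.y.z parsing rejects or truncates these suffixes; a semver
	// parser keeps the pre-release fields, so 1.10.0-rc1 < 1.10.0.
	for _, s := range []string{"1.9.1", "1.10.0-rc1", "1.9.1-fc23"} {
		v, err := semver.Parse(s)
		if err != nil {
			fmt.Println(s, "->", err)
			continue
		}
		fmt.Println(s, "-> major:", v.Major, "pre:", v.Pre)
	}
}
```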
- Ignore the "not found" error on deletion.
- Recognize the "already exists" error on creation and check if the existing
pod meets requirement. If so, don't report an error.
- Immediately create a mirror pod after a successful deletion, if needed.
This addresses a TODO when collecting the node version information so it
will properly report the configured runtime and its version. Previously,
this was hardcoded to "docker://" and the docker version, and would show
"docker://1.9.1" even when the kubelet was configured to use rkt.
With this change, it will use the runtime's Type() and Version() data.
This also changes the container.Runtime interface to add an APIVersion()
method. This can be used when the runtime has separate versions for the
engine and the API, such as with Docker. The Docker minimum version
validation has been updated to use APIVersion(), and
DockerManager.Version() now returns the engine version.
Add a `/stats/summary` endpoint to the kubelet which will return an
empty Summary{} struct (json formatted), as a summary API
placeholder. Once the next cAdvisor release is vendored, the summary
API will be filled in.
This cache will be used to store the PodStatus of all pods/containers
visible on the node. This will eliminate the need for pod workers to query the
container runtime directly.
Add `kube-reserved` and `system-reserved` flags for configuration
reserved resources for usage outside of kubernetes pods. Allocatable is
provided by the Kubelet according to the formula:
```
Allocatable = Capacity - KubeReserved - SystemReserved
```
Also provides a method for estimating a reasonable default for
`KubeReserved`, but the current implementation is probably low and needs
more tuning.
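A tiny worked example of the formula, with hypothetical reservation values (not kubelet defaults):
```
package main

import "fmt"

func main() {
	const (
		capacityMi       = 8192 // node memory capacity, MiB (example value)
		kubeReservedMi   = 1024 // reserved for kubelet and runtime daemons
		systemReservedMi = 512  // reserved for OS system daemons
	)
	// Allocatable = Capacity - KubeReserved - SystemReserved
	allocatable := capacityMi - kubeReservedMi - systemReservedMi
	fmt.Printf("Allocatable = %d Mi\n", allocatable) // 6656 Mi
}
```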
Currently, PLEG would report an event if a container transitions from running to
exited between relistings. However, it would not report any event if a container
gets stopped and removed between relistings. The event will eventually be
handled when the pod syncs periodically, but this is undesirable. This change
ensures that we detect all such events.
Kubelet doesn't perform checkpointing and loses all its internal state after
restarts. It would then mistake pods from the api server as new pods and attempt
to go through the admission process. This may result in pods being rejected
even though they are running on the node (e.g., out of disk situation). This
change adds a condition to check whether the pod was seen before and categorize
such pods as updates. The change also removes freeze/unfreeze mechanism used to
work around such cases, since it is no longer needed and it stopped working
correctly ever since we switched to incremental updates.
Changes include:
- Moving stats serving & routes to pkg/kubelet/server/stats/handler.go
- Managing the routes with restful.WebService, rather than manual
parsing
- Misc cleanup
These changes will make adding the new routes for /stats/summary more
manageable.
There has been a recent regression causing kubelet to assume no containers are
running for the pod if kubelet has not seen the pod before. This would cause
all containers to be restarted after kubelet gets restarted. This change fixes
the bug.
Implement a flag that defines the frequency at which a node's out of
disk condition can change its status. Use this flag to suspend out of
disk status changes in the time period specified by the flag, after
the status is changed once.
Set the flag to 0 in e2e tests so that we can predictably test out of
disk node condition.
Also, use the util.Clock interface for all time-related functionality in
the kubelet. Calling time functions from the unversioned or time
packages, such as unversioned.Now() or time.Now(), makes it really hard
to test such code. It also makes the tests flaky and sometimes
unnecessarily slow due to the time.Sleep() calls used to simulate the
time elapsed. So use the util.Clock interface instead, which can be faked
in the tests.
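A minimal Go sketch of the clock-interface pattern (illustrative names, not the actual util.Clock API):
```
package clock

import "time"

// Clock abstracts time so tests can control it.
type Clock interface {
	Now() time.Time
}

// RealClock delegates to the time package.
type RealClock struct{}

func (RealClock) Now() time.Time { return time.Now() }

// FakeClock returns a settable time, so tests can advance the clock
// without calling time.Sleep().
type FakeClock struct{ T time.Time }

func (f *FakeClock) Now() time.Time       { return f.T }
func (f *FakeClock) Step(d time.Duration) { f.T = f.T.Add(d) }
```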
- Sample is now the toplevel struct, so all child structs have the same
timestamp
- Removed FilesystemStats. There are more discussions needed
wrt. volumes and disk accounting, so this will be added in a follow
up PR
- Removed Options. The most recent sample will be returned.
This API has been discussed ad nauseam across several forums, and this
API represents the latest conclusion. In summary, we will provide this
API as temporary solution for providing the new stats required for 1.2.
In the long term this API will be split into "essential" stats, which
will be provided by a first-party API served through the kubelet, and
"non-essential" (monitoring) stats, which will be provided by a 3rd
party API served from a pod.
Refactor Kubelet's server functionality into a server package. Most
notably, move pkg/kubelet/server.go into
pkg/kubelet/server/server.go. This will lead to better separation of
concerns and a more readable code hierarchy.
- Add a volume.MetricsProvider function to the Volume interface.
- Add volume.MetricsDu for providing metrics via executing "du".
- Add volume.MetricsNil for unsupported Volumes.
The logic doesn't apply to static pods as their corresponding mirror pod may
not have been created yet, or may be in the process of recreation. Deleting the
pod status immediately resets the version of the status for the static pod,
while the apiStatusVersion remains unchanged. This could lead to incorrect
versioning and hence stale pod status in the apiserver.
The formatting function is used often in logging. This improves readability
by shortening the length of the call. Also change the formatted string to
include the pod UID.
Before this change we have a mish-mash of ways to pass field names around for
error generation. Sometimes string fieldnames, sometimes .Prefix(), sometimes
neither, often wrong names or not indexed when it should be.
Instead of that mess, this is part one of a couple of commits that will make it
more strongly typed and hopefully encourage correct behavior. At least you
will have to think about field names, which is better than nothing.
It turned out to be really hard to do this incrementally.
If LowThresholdPercent > HighThresholdPercent, amountToFree at image_manager.go:208 is negative and image GC will not free space properly.
Justification:
1) LowThresholdPercent > HighThresholdPercent implies (LowThresholdPercent * capacity / 100) > (HighThresholdPercent * capacity / 100)
2) usage is at least (HighThresholdPercent * capacity / 100)
3) amountToFree = usage - (LowThresholdPercent * capacity / 100)
Combining 1), 2) and 3) implies amountToFree can be negative.
What happens if amountToFree is negative? In the freeSpace method, "for _, image := range images" loops at least once,
and if everything goes fine, "delete(im.imageRecords, image.id)" is executed.
The check "if spaceFreed >= bytesToFree" is then always true, as bytesToFree is negative
and spaceFreed is positive. The loop finishes, and so does image GC.
In the end, only the oldest image is deleted. In situations where there are a lot of dead containers,
each corresponding to a distinct image, the number of unused images can grow.
If two new images get pulled every 5 minutes, image GC will not work properly and will not free enough space.
Secondly, it will take a lot of time to free all unused images (hours, depending on the number of unused images).
This is an incorrect configuration. Image GC should report it and refuse to work.
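A small worked example of the arithmetic above, with hypothetical thresholds:
```
package main

import "fmt"

func main() {
	const capacity = 100 // GiB
	low, high := 90, 80  // misconfigured: low > high

	// GC triggers once usage reaches the high threshold:
	trigger := high * capacity / 100 // 80 GiB
	usage := 85                      // somewhere above the trigger

	amountToFree := usage - low*capacity/100 // 85 - 90 = -5
	fmt.Println("trigger:", trigger, "amountToFree:", amountToFree)
	// Any spaceFreed >= 0 already satisfies spaceFreed >= amountToFree,
	// so the free loop exits after deleting just the oldest image.
}
```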
All external types that are not int64 are now marked as int32,
including
IntOrString. Prober is now int32 (43 years should be enough of an initial
probe time for anyone).
Did not change the metadata fields for now.
Addresses a version skew issue where the last condition status is always
evaluated as the NodeReady status. As a workaround force the NodeReady
condition to be the last in the list of node conditions.
ref: https://github.com/kubernetes/kubernetes/issues/16961
The mapping of static pod <--> mirror pod UIDs was backwards in a couple
places. Fortunately, they canceled each other out. Fixed, and added a
test case.
Docker's health is checked separately from kubelet by the process monitoring
tool (e.g., supervisord). kubelet should not be killed when docker is down.
This change removes the docker health handler from kubelet's /healthz handler.
This feature is no longer useful now that pods don't sync as often. For batch
creations/deletions/syncs, the cache will be up-to-date for most pods since it
will be updated frequently. For other cases, continuing to update for two more
seconds doesn't usually help, as temporal locality doesn't hold across pod syncs.
Set resyncInterval to one minute now that we rely on the generic PLEG to trigger
pod syncs on container events. When there is an error during syncing, pod
workers need to wake up sooner to retry. Set the sync error backoff period to
10 seconds in this case.
This change introduces pod lifecycle event generator (PLEG), and adds a generic
PLEG. The generic PLEG relies on relisting to discover container events, and is
container-runtime-agnostic. Both docker and rkt are changed to use generic
PLEG.
+ Fix #14992
+ "When deploying a pod using an on-disk kubelet manifest (a la /etc/kubernetes/manifests), it appears that the network plugin setUpPod is notified of the new pod before the apiserver."
Add a helper method to set the container map and list at the same time, without
having to specify them separately. This reduces the effort required for
adding/modifying tests as well as making the code more concise.
Push status updates as soon as readiness state changes for containers,
rather than waiting for the sync loop to update the status. In
particular, this should help new containers to come online faster.
Additionally, consolidates prober test helpers into a single file.
This commit fixes getting the logs from complete/failed pods after
a kubelet restart by falling back to the api server in case we fail
to resolve the pod status using the status cache.
- PeriodSeconds - How often to probe
- SuccessThreshold - Number of successful probes to go from failure to success state
- FailureThreshold - Number of failing probes to go from success to failure state
This commit includes two changes in behavior:
1. InitialDelaySeconds now defaults to 10 seconds, rather than the
kubelet sync interval (although that also defaults to 10 seconds).
2. Prober only retries on probe error, not failure. To compensate, the
default FailureThreshold is set to the maxRetries, 3.
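A minimal Go sketch of the threshold state machine described above (illustrative, not the kubelet's actual prober code):
```
package main

import "fmt"

// prober flips a container's health only after successThreshold consecutive
// successes or failureThreshold consecutive failures.
type prober struct {
	successThreshold, failureThreshold int
	successes, failures                int
	healthy                            bool
}

func (p *prober) record(ok bool) {
	if ok {
		p.successes++
		p.failures = 0
		if !p.healthy && p.successes >= p.successThreshold {
			p.healthy = true
		}
	} else {
		p.failures++
		p.successes = 0
		if p.healthy && p.failures >= p.failureThreshold {
			p.healthy = false
		}
	}
}

func main() {
	p := &prober{successThreshold: 1, failureThreshold: 3, healthy: true}
	for _, ok := range []bool{false, false, false} {
		p.record(ok)
	}
	fmt.Println("healthy:", p.healthy) // false after 3 consecutive failures
}
```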
This lays the groundwork for simple multizone capabilities.
In a cloud environment, nodes are typically created by the kubelet
registering with the API server. When creating a new node, we now query
the cloudprovider to see if it can provide Zone information, and if so
we add some well-known labels to the Node we are creating.
- status.Manager always deals with the local (static) pod, but gets the
mirror pod when syncing
- This lets components like the probe workers ignore mirror pods
Now that kubelet checks sources seen correctly, there is no need to enforce the
initial order of pod updates and housekeeping. Use a ticker for housekeeping to
simplify the code.
Currently kubelet syncs all pods every 10s. This is not preferred because
* Some pods may have been sync'd recently.
* This may cause all the pods to be sync'd at once, causing undesirable
CPU spikes.
This PR replaces the global syncs with independent, periodic pod syncs. At the
end of syncing, each pod worker will enqueue itself with a future timestamp (
current time + sync interval), when it will be due for another sync.
* If the pod worker encounters a sync error, it may requeue with a different
timestamp to retry sooner.
* If a sync is triggered by the update channel (events or spec changes), the
pod worker would enqueue a new sync time.
This change is necessary for moving to long or no periodic sync period once pod
lifecycle event generator is completed. We will still rely on the mechanism to
requeue the pod on sync error.
This change also makes sure that if a sync does not succeed (either due to
real error or the per-container backoff mechanism), an error would be propagated
back to the pod worker, which is responsible for requeuing.
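A minimal Go sketch of the requeuing scheme (hypothetical queue API, not the actual pod worker code):
```
package worker

import "time"

// workItem says when a pod is next due for a sync.
type workItem struct {
	podUID string
	at     time.Time
}

type queue struct{ items []workItem }

func (q *queue) enqueue(uid string, at time.Time) {
	q.items = append(q.items, workItem{uid, at})
}

// syncPod is a stand-in for the real per-pod sync.
func syncPod(uid string) error { return nil }

// runOnce syncs one pod and requeues it for a future sync: a full interval
// out on success, or a shorter backoff on error.
func runOnce(q *queue, uid string, syncInterval, errorBackoff time.Duration) {
	if err := syncPod(uid); err != nil {
		q.enqueue(uid, time.Now().Add(errorBackoff)) // retry sooner
		return
	}
	q.enqueue(uid, time.Now().Add(syncInterval)) // due again later
}
```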
Relaxes the very very ancient restriction we put in place to keep the
original API surface area PR small. Better to be consistent with actual
expected use of tail.
- Fix deadlock when syncing deleted pods with full update channel
- Prevent sending stale updates to API server
- Don't delete cached status when sync fails (causes problems for prober)
Define a new out of disk node condition and use it to report when node
goes out of disk.
Make a copy of the loop range variable in the node listers so that it
is available outside the for loop.
Also update/implement unit tests.
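A minimal illustration of the loop-variable copy (a generic Go example, not the actual lister code):
```
package main

import "fmt"

type node struct{ name string }

func main() {
	nodes := []node{{"a"}, {"b"}, {"c"}}

	var ptrs []*node
	for _, n := range nodes {
		// Before Go 1.22 the range variable is reused each iteration, so
		// taking its address without this copy makes every stored pointer
		// refer to the last element.
		n := n
		ptrs = append(ptrs, &n)
	}
	for _, p := range ptrs {
		fmt.Println(p.name) // a, b, c — without the copy: c, c, c (pre-1.22)
	}
}
```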
Move port forward protocol name constant to a subpackage underneath
pkg/kubelet to avoid flags applicable to the kubelet leaking into
kubectl. Eventually, handlers for specific protocol versions will move
into the new subpackage as well.
Add streaming subprotocol negotiation for exec, attach, and port
forwarding. Restore previous (buggy) exec functionality as an
unspecified/unversioned subprotocol so newer kubectl clients can work
against 1.0.x kubelets.
This commit builds on previous work and creates an independent
worker for every liveness probe. Liveness probes behave largely the same
as readiness probes, so much of the code is shared by introducing a
probeType parameter to distinguish the type when it matters. The
circular dependency between the runtime and the prober is broken by
exposing a shared liveness ResultsManager, owned by the
kubelet. Finally, an Updates channel is introduced to the ResultsManager
so the kubelet can react to unhealthy containers immediately.
This enables more fine-grained control over the things we want to
output. Also by closing the stdout/stderr of the journalctl process
when user hits `Ctrl-C` after `kubectl logs $POD -f`, this enables
the journalctl process to exit.
In PodConfigNotificationIncremental PodConfig mode, when no pods are available
for a source, the Merge function correctly concludes that no ADD, UPDATE, or
REMOVE updates are to be sent to the kubelet. But as a consequence, the kubelet will
not mark that source as seen.
This is usually not a problem for the apiserver source. But it is a problem for
an empty "file" source, e.g. by passing an empty directory to the kubelet for
static pods. Then the file source will never be seen and the kubelet will stay
in its special not-all-sources-seen mode.