A number of race conditions exist when pods are terminated early in
their lifecycle, because components in the kubelet need to know "no
running containers" or "containers can't be started from now on" but
rely on outdated state.
Only the pod worker knows whether containers are being started for
a given pod, which is required to know when a pod is "terminated"
(no running containers, none coming). Move that responsibility and
the podKiller function into the pod workers, and have everything that
was killing pods go through the UpdatePod loop. Split syncPod into
three phases - setup, terminate containers, and cleanup pod - and
have transitions between those methods be visible to other
components. After this change, to kill a pod you tell the pod worker
to UpdatePod({UpdateType: SyncPodKill, Pod: pod}).
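A minimal sketch of what that call shape might look like from a caller's point of view; the `UpdatePodOptions` struct, constants, and interface below are illustrative stand-ins, not the kubelet's exact definitions:

```go
package podworkers

import v1 "k8s.io/api/core/v1"

// Illustrative types only; the kubelet's real UpdatePodOptions and update
// type constants may differ in detail.
type SyncPodType int

const (
	SyncPodCreate SyncPodType = iota
	SyncPodUpdate
	SyncPodSync
	SyncPodKill
)

type UpdatePodOptions struct {
	UpdateType SyncPodType
	Pod        *v1.Pod
}

// PodWorkers is the single component allowed to transition a pod's lifecycle.
type PodWorkers interface {
	UpdatePod(options UpdatePodOptions)
}

// Killing a pod is now just another update routed through the pod worker.
func requestPodKill(pw PodWorkers, pod *v1.Pod) {
	pw.UpdatePod(UpdatePodOptions{UpdateType: SyncPodKill, Pod: pod})
}
```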
Several places in the kubelet were incorrect about whether they
were handling terminating (should stop running, might have
containers) or terminated (no running containers) pods. The pod worker
exposes methods that allow other loops to know when to set up or tear
down resources based on the state of the pod - these methods remove
the possibility of race conditions by ensuring a single component is
responsible for knowing each pod's allowed state, while other components
simply check, by UID, whether a pod is still within that window.
Container removal no longer blocks final pod deletion in the API server
and is handled as background cleanup. Node shutdown no longer marks pods
as failed, since they can be restarted in the next step.
See https://docs.google.com/document/d/1Pic5TPntdJnYfIpBeZndDelM-AbS4FN9H2GTLFhoJ04/edit# for details
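As a rough illustration of the delegation pattern (the method names below are approximations of the pod worker's query API, not its exact signatures), a cleanup loop might look like this:

```go
package kubelet

import (
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/klog/v2"
)

// Illustrative interface: the pod worker exposes lifecycle queries keyed by
// pod UID; the method names here are approximations.
type podStateProvider interface {
	CouldHaveRunningContainers(uid types.UID) bool
	ShouldPodContentBeRemoved(uid types.UID) bool
}

// A cleanup loop delegates the "is this pod past the point of running
// containers?" decision to the pod worker instead of tracking its own state.
func maybeCleanupPod(p podStateProvider, uid types.UID) {
	if p.CouldHaveRunningContainers(uid) {
		klog.V(4).InfoS("Pod may still have running containers, skipping cleanup", "podUID", uid)
		return
	}
	if p.ShouldPodContentBeRemoved(uid) {
		klog.V(4).InfoS("Removing pod content", "podUID", uid)
	}
}
```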
This adds the `SeccompDefault` gate as a new alpha feature. Seccomp path
and field fallbacks are now passed to the helper functions, and unit
tests covering those code paths have been added as well.
Besides enabling the feature gate, the feature has to be enabled via the
`SeccompDefault` kubelet configuration option or its corresponding
`--seccomp-default` CLI flag.
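A hedged sketch of the fallback behaviour described above, assuming the feature simply resolves missing profiles to RuntimeDefault; the helper name and wiring are illustrative:

```go
package seccompdefault

import v1 "k8s.io/api/core/v1"

// effectiveSeccompProfile is an illustrative helper: explicit container and
// pod profiles still win, and only in their absence does the SeccompDefault
// feature change the fallback from Unconfined to RuntimeDefault.
func effectiveSeccompProfile(pod *v1.Pod, ctr *v1.Container, seccompDefaultEnabled bool) v1.SeccompProfileType {
	if ctr.SecurityContext != nil && ctr.SecurityContext.SeccompProfile != nil {
		return ctr.SecurityContext.SeccompProfile.Type
	}
	if pod.Spec.SecurityContext != nil && pod.Spec.SecurityContext.SeccompProfile != nil {
		return pod.Spec.SecurityContext.SeccompProfile.Type
	}
	if seccompDefaultEnabled {
		return v1.SeccompProfileTypeRuntimeDefault
	}
	return v1.SeccompProfileTypeUnconfined
}
```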
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Apply suggestions from code review
Co-authored-by: Paulo Gomes <pjbgf@linux.com>
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
We can set the container's cpuset.cpus during creation, so there is no
need to call update resources after the container has been created.
An additional side effect of the change is that the runc process
responsible for creating the container will run with the same CPU
affinity, because runc runs on the cpuset provided in the config.json
argument. This prevents undesirable interrupts on isolated CPUs.
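A minimal sketch of the idea in terms of the CRI types (the import path and surrounding wiring are illustrative, not the exact CPU manager code):

```go
package cpumanager

import (
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// Instead of creating the container and then calling UpdateContainerResources,
// the exclusive cpuset is written into the CRI ContainerConfig up front, so
// runc itself runs pinned to those CPUs.
func applyCPUSetAtCreation(cfg *runtimeapi.ContainerConfig, cpus string) {
	if cfg.Linux == nil {
		cfg.Linux = &runtimeapi.LinuxContainerConfig{}
	}
	if cfg.Linux.Resources == nil {
		cfg.Linux.Resources = &runtimeapi.LinuxContainerResources{}
	}
	// e.g. cpus = "2-3,6", as produced for a guaranteed pod's exclusive CPUs.
	cfg.Linux.Resources.CpusetCpus = cpus
}
```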
Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
This code can be called not only when a container is dead and restarted,
but also when it is started for the first time. For example, any pod with
an initContainer and containers will exhibit this behaviour. The reason is
that in that case, the "if createPodSandbox" path will return the
initContainers only, and on the next call to this function this code is
executed to start the containers for the first time.
In that case, it is wrong to log that the container is dead and will be
restarted, as it was never started. In fact, the restart count will not
be increased.
This commit just changes this to say that the container is not in the
desired state and should be started. In the end, the kubelet is a state
machine and that is all we really care about.
No tests are added, as the behaviour was correct and tests don't check
log messages.
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
As of now, the kubelet passes the security context to the container runtime
even if it contains options that are invalid for a particular OS. As a result,
the pod fails to come up on the node. This error is particularly pronounced on
Windows nodes, where the kubelet allows Linux-specific options like SELinux,
RunAsUser, etc., whereas the [documentation](https://kubernetes.io/docs/setup/production-environment/windows/intro-windows-in-kubernetes/#v1-container)
clearly states they are not supported. This PR ensures that the kubelet strips
the pod's security context options that don't make sense on the Windows OS.
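A rough sketch of the stripping idea, assuming it operates on the core/v1 SecurityContext before it is translated for the runtime; the helper name and the exact set of cleared fields are illustrative:

```go
package kuberuntime

import v1 "k8s.io/api/core/v1"

// dropLinuxOnlySecurityOptions is illustrative, not the exact kubelet code
// path: on Windows, Linux-only fields are cleared before the security
// context reaches the runtime, so it never receives options it cannot honor.
func dropLinuxOnlySecurityOptions(sc *v1.SecurityContext) {
	if sc == nil {
		return
	}
	sc.SELinuxOptions = nil // SELinux does not exist on Windows
	sc.RunAsUser = nil      // numeric UIDs/GIDs are a Linux concept
	sc.RunAsGroup = nil
	sc.Capabilities = nil   // Linux capabilities are unsupported on Windows
}
```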
The kubelet would attempt to create a new sandbox for a pod whose
RestartPolicy is OnFailure even after all containers succeeded. It caused
unnecessary CRI and CNI calls, confusing logs and conflicts between the
routine that creates the new sandbox and the routine that kills the Pod.
This patch checks the containers to start and stops creating a sandbox if
no container is supposed to start.
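A simplified sketch of that guard; the real logic lives in the runtime manager's computePodActions and also accounts for init containers and restart policy:

```go
package kuberuntime

// podActions is a pared-down stand-in for the runtime manager's real struct.
type podActions struct {
	CreateSandbox     bool
	ContainersToStart []int
}

func suppressUnneededSandbox(actions *podActions) {
	// If the sandbox would be recreated but nothing is supposed to start
	// (e.g. RestartPolicy=OnFailure and every container already succeeded),
	// skip the sandbox so we avoid pointless CRI/CNI calls.
	if actions.CreateSandbox && len(actions.ContainersToStart) == 0 {
		actions.CreateSandbox = false
	}
}
```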
These changes allow setting the FQDN as the hostname of pods that set
the new PodSpec field setHostnameAsFQDN to true. The new PodSpec field
was added in a related PR.
This is PART2 (last) of the changes to enable KEP #1797 and addresses #91036
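For illustration, a pod spec using the new field might be built like this (the values are arbitrary):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	setFQDN := true
	pod := &v1.Pod{
		Spec: v1.PodSpec{
			Hostname:  "busybox-1",
			Subdomain: "default-subdomain",
			// When true, the kubelet sets the pod's hostname to its FQDN
			// (hostname.subdomain.namespace.svc.<cluster-domain>) instead
			// of just the short hostname.
			SetHostnameAsFQDN: &setFQDN,
		},
	}
	fmt.Println(*pod.Spec.SetHostnameAsFQDN)
}
```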
The docker folks added a NumCPU implementation for Windows that
supported hot-plugging of CPUs. The implementation used
GetProcessAffinityMask to be able to check which CPUs are
active as well.
3707a76921
The golang "runtime" package has also been using GetProcessAffinityMask
since 1.6 beta1:
6410e67a1e
So we don't seem to need the sysinfo.NumCPU from docker/docker.
(Note that this PR is an effort to get away from dependencies on
docker/docker.)
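Since Go's runtime package already derives the CPU count from the process affinity mask on Windows, the standard library call is sufficient, e.g.:

```go
package main

import (
	"fmt"
	"runtime"
)

// runtime.NumCPU reports the number of CPUs usable by the current process,
// which on Windows reflects the process affinity mask, making the
// docker/docker sysinfo helper unnecessary.
func main() {
	fmt.Println("CPUs available to this process:", runtime.NumCPU())
}
```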
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
In https://github.com/kubernetes/kubernetes/pull/88372, we added the
ability to inject errors to the `FakeImageService`. Use this ability to
test the error paths executed by the `kubeGenericRuntimeManager` when
underlying `ImageService` calls fail.
I don't foresee this change having a huge impact, but it should set a
good precedent for test coverage, and should the failure case behavior
become more "interesting" or risky in the future, we will already have
the scaffolding in place with which we can expand the tests.
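A sketch of the testing pattern, using a hypothetical fake rather than the real FakeImageService API, to show how injected failures exercise the error paths:

```go
package kuberuntime_test

import (
	"errors"
	"testing"
)

// fakeImageService is a hypothetical stand-in for illustration only; the
// real tests use the FakeImageService from the CRI testing package and its
// error-injection hook.
type fakeImageService struct {
	injectedErr map[string]error // method name -> error to return
}

func (f *fakeImageService) PullImage(image string) (string, error) {
	if err := f.injectedErr["PullImage"]; err != nil {
		return "", err
	}
	return "sha256:deadbeef", nil
}

func TestPullImageError(t *testing.T) {
	fake := &fakeImageService{injectedErr: map[string]error{
		"PullImage": errors.New("registry unavailable"),
	}}
	// The manager under test would wrap this service; here we just assert
	// that the injected failure surfaces to the caller.
	if _, err := fake.PullImage("busybox"); err == nil {
		t.Fatal("expected injected error, got nil")
	}
}
```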
Unit test for updating container hugepage limit
Add warning message about ignoring case.
Update error handling about hugepage size requirements
Signed-off-by: sewon.oh <sewon.oh@samsung.com>
For Windows, CPU requests (Shares, Count, and Maximum) are mutually exclusive; however,
Kubernetes sends them all anyway in the pod spec.
When using dockershim this is not an issue, as Docker checks for this specific situation
here: 1bd184a4c2/daemon/daemon_windows.go (L87-L106)
However, when using CRI-Containerd, these pods fail to spawn with an error from hcsshim.
This PR intends to filter these values before they are sent to the CRI and not rely on the
runtime for it.
Related to: https://github.com/kubernetes/kubernetes/issues/84804
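A sketch of the filtering idea in terms of the CRI WindowsContainerResources type; the precedence chosen here is illustrative only:

```go
package kuberuntime

import (
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// Windows accepts only one of CpuCount, CpuMaximum, and CpuShares, so the
// others are cleared before calling the CRI. The precedence used here
// (count > maximum > shares) is an assumption for illustration.
func filterWindowsCPUFields(r *runtimeapi.WindowsContainerResources) {
	switch {
	case r.CpuCount > 0:
		r.CpuMaximum = 0
		r.CpuShares = 0
	case r.CpuMaximum > 0:
		r.CpuShares = 0
	}
}
```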
If Containerd is used on Windows, then we can also mount individual
files into containers (e.g.: termination-log files), which was not
possible with Docker.
This change checks whether the container runtime is containerd and, if it
is, also mounts the termination-log file.
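A minimal sketch of that check, with an illustrative helper name and runtime detection:

```go
package kubelet

import "runtime"

// shouldMountTerminationLog is illustrative; the real check lives in the
// kubelet's mount generation code.
func shouldMountTerminationLog(containerRuntimeName string) bool {
	if runtime.GOOS != "windows" {
		return true // single-file mounts have always worked on Linux
	}
	// On Windows, only containerd can mount individual files into containers;
	// Docker cannot, so the termination-log mount is skipped there.
	return containerRuntimeName == "containerd"
}
```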
Add host file write for podIPs. Also update tests, type checks, imports,
and the OpenAPI spec; remove import aliases and a todo; and address
review comments.
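As a rough illustration of writing all pod IPs into the managed hosts file (the helper name and formatting are assumptions, not the kubelet's exact code):

```go
package kubelet

import (
	"fmt"
	"strings"
)

// hostsEntriesForPod renders one hosts-file line per pod IP, e.g. for
// dual-stack pods that have both an IPv4 and an IPv6 address.
func hostsEntriesForPod(hostName, hostDomain string, podIPs []string) string {
	var b strings.Builder
	for _, ip := range podIPs {
		if hostDomain != "" {
			fmt.Fprintf(&b, "%s\t%s.%s\t%s\n", ip, hostName, hostDomain, hostName)
		} else {
			fmt.Fprintf(&b, "%s\t%s\n", ip, hostName)
		}
	}
	return b.String()
}
```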
This starts ephemeral containers prior to init containers so that
ephemeral containers will still be started when init containers fail to
start.
Also improves tests and comments based on review suggestions.
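A simplified sketch of the ordering change; the function names are illustrative, not the kubelet's actual sync code:

```go
package kuberuntime

// syncPodContainers shows the ordering only: ephemeral containers are
// started first and never block the pod, so a debug container can still
// come up when init containers are failing.
func syncPodContainers(startEphemeral, startInit, startRegular func() error) error {
	_ = startEphemeral() // failures here do not fail the pod sync
	if err := startInit(); err != nil {
		return err // init failure still halts regular containers, as before
	}
	return startRegular()
}
```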