Commit Graph

9495 Commits

Author SHA1 Message Date
Clayton Coleman
d7ee024cc5
kubelet: Make condition processing in one spot
The list of status conditions should be calculated all together,
this made review more complex. Readability only.
2021-07-19 17:56:22 -04:00
Clayton Coleman
c2a6d07b8f
kubelet: Avoid allocating multiple times during status
Noticed while reviewing this code path. We can assume the
temporary slice should be about the same size as it was previously.
2021-07-19 17:55:18 -04:00
Clayton Coleman
9efd40d72a kubelet: Preserve reason/message when phase changes
The Kubelet always clears reason and message in generateAPIPodStatus
even when the phase is unchanged. It is reasonable that we preserve
the previous values when the phase does not change, and clear it
when the phase does change.

When a pod is evicted, this ensurse that the eviction message and
reason are propagated even in the face of subsequent updates. It also
preserves the message and reason if components beyond the Kubelet
choose to set that value.

To preserve the value we need to know the old phase, which requires
a change to convertStatusToAPIStatus so that both methods have
access to it.
2021-07-19 17:54:55 -04:00
Davanum Srinivas
75748c185e
enable verify-golangci-lint.sh
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2021-07-14 08:53:33 -04:00
Davanum Srinivas
26cc8e40a8
fix deadcode issues
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2021-07-14 08:41:21 -04:00
Kubernetes Prow Robot
2da4d48e6d
Merge pull request #100567 from jingxu97/mar/mark
Mark volume mount as uncertain in case of volume expansion fails
2021-07-13 22:20:26 -07:00
Kubernetes Prow Robot
d6f2473d08
Merge pull request #103668 from smarterclayton/panic_in_pod_worker
kubelet: Prevent runtime-only pods from going into terminated phase
2021-07-13 17:42:26 -07:00
Clayton Coleman
de9cdab5ae
kubelet: Prevent runtime-only pods from going into terminated phase
If a pod is already in terminated and the housekeeping loop sees an
out of date cache entry for a running container, the pod worker
should ignore that running pod termination request. Once the worker
completes, a subsequent housekeeping invocation will then invoke
terminating because the worker is no longer processing any pod
with that UID.

This does leave the possibility of syncTerminatedPod being blocked
if a container in the pod is started after killPod successfully
completes but before syncTerminatedPod can exit successfully,
perhaps because the terminated flow (detach volumes) is blocked on
that running container. A future change will address that issue.
2021-07-13 15:41:49 -04:00
rarashid
bf2ae14501 Move feature flag to beta (but leave as false) and remove the feature flag from Kubelet 2021-07-13 14:25:44 -05:00
Kubernetes Prow Robot
04ef2b115d
Merge pull request #90216 from DataDog/nayef/fix-container-statuses-race
Avoid overwriting podStatus ContainerStatuses in convertToAPIContainerStatuses
2021-07-12 17:02:29 -07:00
Elana Hashman
642eff0c69
Rename NodeSwapEnabled flag to NodeSwap 2021-07-09 11:39:52 -07:00
Kubernetes Prow Robot
a6c2cd7d18
Merge pull request #103291 from wzshiming/fix/nodeshutdown-restart
Fix Data Race in nodeshutdown restart
2021-07-09 08:43:14 -07:00
Kubernetes Prow Robot
617064d732
Merge pull request #101432 from swatisehgal/smtaware
node: cpumanager: add options to reject non SMT-aligned workload
2021-07-08 21:04:53 -07:00
Kubernetes Prow Robot
83baa708df
Merge pull request #103429 from saschagrunert/metrics-test-fix
Fix resource metrics e2e test
2021-07-08 17:58:53 -07:00
Kubernetes Prow Robot
dab6f6a43d
Merge pull request #102344 from smarterclayton/keep_pod_worker
Prevent Kubelet from incorrectly interpreting "not yet started" pods as "ready to terminate pods" by unifying responsibility for pod lifecycle into pod worker
2021-07-08 16:48:53 -07:00
Jing Xu
0fa01c371c Mark volume mount as uncertain in case of volume expansion fails
should mark volume mount in actual state even if volume expansion fails so that
reconciler can tear down the volume when needed. To avoid pods start
using it, mark volume as uncertain instead of mounted.

Will add unit test after the logic is reviewed.

Change-Id: I5aebfa11ec93235a87af8f17bea7f7b1570b603d
2021-07-08 16:00:34 -07:00
Kubernetes Prow Robot
57716897eb
Merge pull request #103434 from perithompson/windows-etchostcreate-skip
Explicitly skip host file mounting for Windows when HostProcess pod
2021-07-08 15:36:53 -07:00
Francesco Romani
23abdab2b7 smtalign: propagate policy options to policies
Consume in the static policy the cpu manager policy options from
the cpumanager instance.
Validate in the none policy if any option is given, and fail if so -
this is almost surely a configuration mistake.

Add new cpumanager.Options type to hold the options and translate from
user arguments to flags.

Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-07-08 23:15:37 +02:00
Francesco Romani
6dcec345df smtalign: cm: factor out admission response
Introduce a new `admission` subpackage to factor out the responsability
to create `PodAdmitResult` objects. This enables resource manager
to report specific errors in Allocate() and to bubble up them
in the relevant fields of the `PodAdmitResult`.

To demonstrate the approach we refactor TopologyAffinityError as a
proper error.

Co-authored-by: Kevin Klues <kklues@nvidia.com>
Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-07-08 23:15:37 +02:00
Francesco Romani
c5cb263dcf smtalign: propagate policy options to cpumanager
The CPUManagerPolicyOptions received from the kubelet config/command line args
is propogated to the Container Manager.

We defer the consumption of the options to a later patch(set).

Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-07-08 23:15:35 +02:00
Francesco Romani
6dccad45b4 smtalign: add auto generated code
Files generate after running `make generated_files`.

Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-07-08 23:14:59 +02:00
Swati Sehgal
cc76a756e4 smtalign: add cpu-manager-policy-options flag in Kubelet
In this patch we enhance the kubelet configuration to support
cpuManagerPolicyOptions.

In order to introduce SMT-awareness in CPU Manager, we introduce a
new flag in Kubelet to allow the user to specify an additional flag
called `cpumanager-policy-options` to allow the user to modify the
behaviour of static policy to strictly guarantee allocation of whole
core.

Co-authored-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2021-07-08 23:14:59 +02:00
Kubernetes Prow Robot
4d78db54a5
Merge pull request #103580 from tkestack/fix-version-format
fix kubelet panic when DynamicKubeletConfig enabled
2021-07-08 14:02:24 -07:00
Kubernetes Prow Robot
a9d7526864
Merge pull request #102970 from tkestack/feature-memory-qos
Feature: Support memory qos with cgroups v2
2021-07-08 14:01:36 -07:00
Kubernetes Prow Robot
7c84064a4f
Merge pull request #99000 from verb/1.21-kubelet-metrics
Add kubelet metrics for ephemeral containers
2021-07-08 14:00:55 -07:00
Peri Thompson
8e2b728c68
Explicitly skip host file mounting for windows 2021-07-08 19:38:49 +01:00
Li Bo
79e230ea21 fix kubelet panic when DynamicKubeletConfig enabled 2021-07-08 16:20:51 +08:00
Li Bo
c3d9b10ca8 feature: support Memory QoS for cgroups v2 2021-07-08 09:26:46 +08:00
Kubernetes Prow Robot
36a7426aa5
Merge pull request #99144 from bart0sh/PR0094-promote-HugePageStorageMediumSize-to-GA
promote huge page storage medium size to GA
2021-07-07 18:09:05 -07:00
Kubernetes Prow Robot
ebbe63f116
Merge pull request #92863 from AkihiroSuda/rootless-pr
kubelet & kube-proxy: ignore sysctl errors and rlimit errors when running in UserNS (for rootless)
2021-07-07 18:08:53 -07:00
Kubernetes Prow Robot
8e56a34195
Merge pull request #102966 from SergeyKanzhelev/deprecateDynamicKubeletConfig
deprecate and disable by default DynamicKubeletConfig feature flag
2021-07-07 17:05:15 -07:00
Nayef Ghattas
bb3fe633b4 add test for triggering race condition 2021-07-07 20:17:22 +02:00
Nayef Ghattas
ab1807f2bc copy podStatus.ContainerStatuses before sorting it 2021-07-07 20:14:53 +02:00
Akihiro Suda
26e83ac4d4
kubelet: ignore /dev/kmsg error when running in userns
oomwatcher.NewWatcher returns "open /dev/kmsg: operation not permitted" error,
when running with sysctl value `kernel.dmesg_restrict=1`.

The error is negligible for KubeletInUserNamespace.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2021-07-07 14:23:31 +09:00
Akihiro Suda
dbe0155139
kubelet/cm: ignore sysctl error when running in userns
Errors during setting the following sysctl values are ignored:
- vm.overcommit_memory
- vm.panic_on_oom
- kernel.panic
- kernel.panic_on_oops
- kernel.keys.root_maxkeys
- kernel.keys.root_maxbytes

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2021-07-07 14:23:29 +09:00
Kubernetes Prow Robot
2547c5bb97
Merge pull request #103307 from aojea/kubelet_podIPs
podIPs order match node IP family preference (Downward API)
2021-07-06 22:11:20 -07:00
Kubernetes Prow Robot
561959f682
Merge pull request #102823 from ehashman/kep-2400-swap
Alpha node swap support
2021-07-06 22:11:11 -07:00
Antonio Ojea
a7469cf680 sort and filter exposed Pod IPs
runtimes may return an arbitrary number of Pod IPs, however, kubernetes
only takes into consideration the first one of each IP family.

The order of the IPs are the one defined by the Kubelet:
- default prefer IPv4
- if NodeIPs are defined, matching the first nodeIP family

PodIP is always the first IP of PodIPs.

The downward API must expose the same IPs and in the same order than
the pod.Status API object.
2021-07-07 00:15:31 +02:00
Elana Hashman
5584725605
Explicitly set LimitedSwap case with fallthrough 2021-07-06 13:50:09 -07:00
Clayton Coleman
3eadd1a9ea
Keep pod worker running until pod is truly complete
A number of race conditions exist when pods are terminated early in
their lifecycle because components in the kubelet need to know "no
running containers" or "containers can't be started from now on" but
were relying on outdated state.

Only the pod worker knows whether containers are being started for
a given pod, which is required to know when a pod is "terminated"
(no running containers, none coming). Move that responsibility and
podKiller function into the pod workers, and have everything that
was killing the pod go into the UpdatePod loop. Split syncPod into
three phases - setup, terminate containers, and cleanup pod - and
have transitions between those methods be visible to other
components. After this change, to kill a pod you tell the pod worker
to UpdatePod({UpdateType: SyncPodKill, Pod: pod}).

Several places in the kubelet were incorrect about whether they
were handling terminating (should stop running, might have
containers) or terminated (no running containers) pods. The pod worker
exposes methods that allow other loops to know when to set up or tear
down resources based on the state of the pod - these methods remove
the possibility of race conditions by ensuring a single component is
responsible for knowing each pod's allowed state and other components
simply delegate to checking whether they are in the window by UID.

Removing containers now no longer blocks final pod deletion in the
API server and are handled as background cleanup. Node shutdown
no longer marks pods as failed as they can be restarted in the
next step.

See https://docs.google.com/document/d/1Pic5TPntdJnYfIpBeZndDelM-AbS4FN9H2GTLFhoJ04/edit# for details
2021-07-06 15:55:22 -04:00
Kubernetes Prow Robot
eae87bfe7e
Merge pull request #103483 from odinuge/revert-102508-runc-1.0
Revert "Update runc to 1.0.0"
2021-07-06 10:42:56 -07:00
Artyom Lukianov
bb6d5b1f95 memory manager: provide unittests for init containers re-use
- provide tests for static policy allocation, when init containers
requested memory bigger than the memory requested by app containers
- provide tests for static policy allocation, when init containers
requested memory smaller than the memory requested by app containers
- provide tests to verify that init containers removed from the state
file once the app container started

Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
2021-07-05 20:52:25 +03:00
Artyom Lukianov
960da7895c memory manager: remove init containers once app container started
Remove init containers from the state file once the app container started,
it will release the memory allocated for the init container and can intense
the density of containers on the NUMA node in cases when the memory allocated
for init containers is bigger than the memory allocated for app containers.

Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
2021-07-05 20:52:25 +03:00
Artyom Lukianov
b965502c49 memory manager: re-use the memory allocated for init containers
The idea that during allocation phase we will:

- during call to `Allocate` and `GetTopologyHints`  we will take into account the init containers reusable memory,
which means that we will re-use the memory and update container memory blocks accordingly.
For example for the pod with two init containers that requested: 1Gi and 2Gi,
and app container that requested 4Gi, we can re-use 2Gi of memory.

Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
2021-07-05 20:52:25 +03:00
Odin Ugedal
61d88af9e4
Revert "Update runc to 1.0.0" 2021-07-05 14:03:04 +02:00
Sascha Grunert
2d0f99fba1
Fix resource metrics e2e test
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2021-07-05 11:16:05 +02:00
Sergey Kanzhelev
dffc2a60a2 deprecate and disable by default DynamicKubeletConfig feature flag 2021-07-02 23:53:11 +00:00
Kubernetes Prow Robot
659c7e709f
Merge pull request #99494 from enj/enj/i/not_after_ttl_hint
csr: add expirationSeconds field to control cert lifetime
2021-07-01 23:02:12 -07:00
Monis Khan
cd91e59f7c
csr: add expirationSeconds field to control cert lifetime
This change updates the CSR API to add a new, optional field called
expirationSeconds.  This field is a request to the signer for the
maximum duration the client wishes the cert to have.  The signer is
free to ignore this request based on its own internal policy.  The
signers built-in to KCM will honor this field if it is not set to a
value greater than --cluster-signing-duration.  The minimum allowed
value for this field is 600 seconds (ten minutes).

This change will help enforce safer durations for certificates in
the Kube ecosystem and will help related projects such as
cert-manager with their migration to the Kube CSR API.

Future enhancements may update the Kubelet to take advantage of this
field when it is configured in a way that can tolerate shorter
certificate lifespans with regular rotation.

Signed-off-by: Monis Khan <mok@vmware.com>
2021-07-01 23:38:15 -04:00
Kubernetes Prow Robot
062bc359ca
Merge pull request #102444 from sanwishe/resourceStartTime
Expose container start time in kubelet /metrics/resource endpoint
2021-07-01 14:27:51 -07:00