11471 Commits

Author SHA1 Message Date
carlory
3836d58744 fix handle terminating pvc when kubelet rebuild dsw
Signed-off-by: carlory <baofa.fan@daocloud.io>
2025-03-10 18:59:59 +08:00
Richa Banker
19ebee96b2 Add tests 2025-02-10 14:39:06 -08:00
Tim Allclair
4272f7016c Kubelet server handler cleanup 2025-02-06 11:04:01 -08:00
Kubernetes Prow Robot
d7fc7e30cb Merge pull request #129519 from kishen-v/automated-cherry-pick-of-#127422-upstream-release-1.31
Automated cherry pick of #127422: Fix Go vet errors for master golang
2025-01-22 11:10:37 -08:00
Aravindh Puthiyaparambil
c94919d68b kubelet: use env vars in node log query PS command
- Use environment variables to pass string arguments in the node log
  query PS command
- Split getLoggingCmd into getLoggingCmdEnv and getLoggingCmdArgs
  for better modularization
2025-01-13 14:46:05 -08:00
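
The commit above passes string arguments to the node log query PowerShell command through environment variables instead of splicing them into the command text. A minimal sketch of that idea follows. The helper names `buildLogQueryEnv` and `buildLogQueryArgs`, the variable name `kubelet_provider`, and the PowerShell command line are illustrative assumptions, not the kubelet's actual `getLoggingCmdEnv` and `getLoggingCmdArgs`.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// buildLogQueryEnv returns the environment entries that carry
// user-supplied strings. The variable name is hypothetical.
func buildLogQueryEnv(provider string) []string {
	return []string{"kubelet_provider=" + provider}
}

// buildLogQueryArgs returns a fixed command line. The script text never
// contains user input; it reads the value from the environment instead.
func buildLogQueryArgs() []string {
	return []string{
		"-NonInteractive",
		"-Command",
		"Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=$Env:kubelet_provider}",
	}
}

func main() {
	// The command is only constructed here, not executed.
	cmd := exec.Command("powershell", buildLogQueryArgs()...)
	cmd.Env = append(os.Environ(), buildLogQueryEnv("my'; Remove-Item x; '")...)
	fmt.Println(cmd.Args)
}
```

Because the script text is constant, a value containing quotes or semicolons stays data and cannot change the command that PowerShell parses.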
Abhishek Kr Srivastav
9d10ddb060 Fix Go vet errors for master golang
Co-authored-by: Rajalakshmi-Girish <rajalakshmi.girish1@ibm.com>
Co-authored-by: Abhishek Kr Srivastav <Abhishek.kr.srivastav@ibm.com>
2025-01-08 15:11:34 +05:30
carlory
04f5b20388 kubelet: Fix the volume manager not checking the device mount state in the actual state of the world before marking the volume as detached. This could leave a deleted pod stuck in the Terminating state. 2024-12-03 09:47:51 +08:00
Kubernetes Prow Robot
a8a78f0da6 Merge pull request #127212 from SergeyKanzhelev/automated-cherry-pick-of-#126543-upstream-release-1.31
Automated cherry pick of #126543: Restart the init container to not be stuck in created state
2024-09-09 23:15:07 +01:00
Kubernetes Prow Robot
939edc7c6b Merge pull request #127207 from SergeyKanzhelev/automated-cherry-pick-of-#126343-upstream-release-1.31
Automated cherry pick of #126343: Terminated pod should not be re-admitted
2024-09-09 22:05:54 +01:00
Gunju Kim
fc5d752394 Restart the init container to not be stuck in created state
The main sync loop should have created and started the container in one
step. If the init container is in the 'created' state, it's likely that
the container runtime failed to start it. To prevent the container from
getting stuck in the 'created' state, restart it.
2024-09-06 20:00:48 +00:00
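
The rule described in the commit above (an init container left in the 'created' state is treated as a failed start and restarted) can be sketched as a small decision function. All types and names here are hypothetical and are not the kubelet's own; the state strings are assumptions for illustration.

```go
package main

import "fmt"

// containerState is a hypothetical stand-in for a runtime-reported state.
type containerState string

const (
	stateCreated containerState = "created"
	stateRunning containerState = "running"
	stateExited  containerState = "exited"
)

// shouldRestartInitContainer reports whether an init container observed in
// the given state should be restarted. The sync loop is expected to create
// and start a container in one step, so a lingering "created" state suggests
// that the start failed.
func shouldRestartInitContainer(s containerState) bool {
	return s == stateCreated
}

func main() {
	for _, s := range []containerState{stateCreated, stateRunning, stateExited} {
		fmt.Printf("%s: restart=%v\n", s, shouldRestartInitContainer(s))
	}
}
```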
Sergey Kanzhelev
8a28b17c3a succeeded pod is being re-admitted 2024-09-06 18:27:57 +00:00
Gunju Kim
8469207728 Avoid SidecarContainers code path for non-sidecar pods
This fixes a regression in the SidecarContainers feature by minimizing
the impact of the new code path. Use the old code path for pods without
restartable init containers, and apply the new code path only to pods
with restartable init containers.
2024-09-06 16:37:09 +00:00
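
The gating described in the commit above can be sketched as follows: the new code path applies only when a pod has at least one restartable init container (an init container whose restart policy is `Always`). The `pod` and `container` types and the function names are simplified, hypothetical stand-ins rather than the Kubernetes API types.

```go
package main

import "fmt"

// container is a simplified, hypothetical stand-in for an init container spec.
type container struct {
	name          string
	restartPolicy string // "Always" marks a restartable (sidecar) init container
}

// pod is a simplified, hypothetical stand-in for a pod spec.
type pod struct {
	initContainers []container
}

// hasRestartableInitContainer reports whether any init container is restartable.
func hasRestartableInitContainer(p pod) bool {
	for _, c := range p.initContainers {
		if c.restartPolicy == "Always" {
			return true
		}
	}
	return false
}

// chooseCodePath returns which code path would handle the pod.
func chooseCodePath(p pod) string {
	if hasRestartableInitContainer(p) {
		return "sidecar"
	}
	return "legacy"
}

func main() {
	plain := pod{initContainers: []container{{name: "init"}}}
	sidecar := pod{initContainers: []container{{name: "proxy", restartPolicy: "Always"}}}
	fmt.Println(chooseCodePath(plain), chooseCodePath(sidecar))
}
```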
James Sturtevant
2454d8d4c3 Revert "fix: handle socket file detection on Windows"
This reverts commit 4060ee60c1.
2024-09-03 17:40:06 +00:00
Jordan Liggitt
d8da86b16d Switch DisableNodeKubeProxyVersion back to disabled-by-default
This is clearing a stable API field, so the 1 year from announcement to change period applies
2024-08-15 13:16:30 -04:00
Kubernetes Prow Robot
b5b21717ca Merge pull request #126427 from pacoxu/fix-TestUpdateAllocatedResourcesStatus
ignore order of containers status allocated resources
2024-07-29 15:54:07 -07:00
Sascha Grunert
50e430b3e9 Fix kubelet cadvisor stats runtime panic
Fixing a kubelet runtime panic when the runtime returns incomplete data:

```
E0729 08:17:47.260393    5218 panic.go:115] "Observed a panic" panic="runtime error: index out of range [0] with length 0" panicGoValue="runtime.boundsError{x:0, y:0, signed:true, code:0x0}" stacktrace=<
        goroutine 174 [running]:
        k8s.io/apimachinery/pkg/util/runtime.logPanic({0x33631e8, 0x4ddf5c0}, {0x2c9bfe0, 0xc000a563f0})
                k8s.io/apimachinery/pkg/util/runtime/runtime.go:107 +0xbc
        k8s.io/apimachinery/pkg/util/runtime.handleCrash({0x33631e8, 0x4ddf5c0}, {0x2c9bfe0, 0xc000a563f0}, {0x4ddf5c0, 0x0, 0x10000000043c9e5?})
                k8s.io/apimachinery/pkg/util/runtime/runtime.go:82 +0x5e
        k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000ae08c0?})
                k8s.io/apimachinery/pkg/util/runtime/runtime.go:59 +0x108
        panic({0x2c9bfe0?, 0xc000a563f0?})
                runtime/panic.go:785 +0x132
        k8s.io/kubernetes/pkg/kubelet/stats.(*cadvisorStatsProvider).ImageFsStats(0xc000535d10, {0x3363348, 0xc000afa330})
                k8s.io/kubernetes/pkg/kubelet/stats/cadvisor_stats_provider.go:277 +0xaba
        k8s.io/kubernetes/pkg/kubelet/images.(*realImageGCManager).GarbageCollect(0xc000a3c820, {0x33631e8?, 0x4ddf5c0?}, {0x0?, 0x0?, 0x4dbca20?})
                k8s.io/kubernetes/pkg/kubelet/images/image_gc_manager.go:354 +0x1d3
        k8s.io/kubernetes/pkg/kubelet.(*Kubelet).StartGarbageCollection.func2()
                k8s.io/kubernetes/pkg/kubelet/kubelet.go:1472 +0x58
        k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
                k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
        k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000add110, {0x3330380, 0xc000afa300}, 0x1, 0xc0000ac150)
                k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
        k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000add110, 0x45d964b800, 0x0, 0x1, 0xc0000ac150)
                k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
        k8s.io/apimachinery/pkg/util/wait.Until(...)
                k8s.io/apimachinery/pkg/util/wait/backoff.go:161
        created by k8s.io/kubernetes/pkg/kubelet.(*Kubelet).StartGarbageCollection in goroutine 1
                k8s.io/kubernetes/pkg/kubelet/kubelet.go:1470 +0x247
```

This commit fixes panics if:

- `len(imageStats.ImageFilesystems) == 0`
- `len(imageStats.ContainerFilesystems) == 0`
- `imageStats.ImageFilesystems[0].FsId == nil`
- `imageStats.ContainerFilesystems[0].FsId == nil`
- `imageStats.ImageFilesystems[0].UsedBytes == nil`
- `imageStats.ContainerFilesystems[0].UsedBytes == nil`

It also fixes the wrapped `nil` error for the check: `err != nil ||
imageStats == nil` in case that `imageStats == nil`.

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2024-07-29 14:13:47 +02:00
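
The checks listed in the commit above can be sketched as a validation step that runs before any element is indexed. The types below are simplified, hypothetical stand-ins for the runtime's image filesystem response, and `validateImageStats` is an illustrative name, not the function the commit changes.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified, hypothetical stand-ins for the runtime response types.
type fsID struct{ Mountpoint string }
type uint64Value struct{ Value uint64 }

type filesystemUsage struct {
	FsId      *fsID
	UsedBytes *uint64Value
}

type imageFsInfo struct {
	ImageFilesystems     []*filesystemUsage
	ContainerFilesystems []*filesystemUsage
}

// firstUsage returns the first entry only after checking that the slice is
// non-empty and that the pointer fields read later are non-nil.
func firstUsage(list []*filesystemUsage, kind string) (*filesystemUsage, error) {
	if len(list) == 0 {
		return nil, fmt.Errorf("missing %s filesystem info", kind)
	}
	fs := list[0]
	if fs == nil || fs.FsId == nil {
		return nil, fmt.Errorf("%s filesystem has no ID", kind)
	}
	if fs.UsedBytes == nil {
		return nil, fmt.Errorf("%s filesystem has no used bytes", kind)
	}
	return fs, nil
}

// validateImageStats handles the error case and the nil-result case
// separately, so a nil error is never wrapped into the returned message.
func validateImageStats(err error, stats *imageFsInfo) error {
	if err != nil {
		return fmt.Errorf("failed to get image stats: %w", err)
	}
	if stats == nil {
		return errors.New("image stats are nil")
	}
	if _, err := firstUsage(stats.ImageFilesystems, "image"); err != nil {
		return err
	}
	_, err = firstUsage(stats.ContainerFilesystems, "container")
	return err
}

func main() {
	fmt.Println(validateImageStats(nil, &imageFsInfo{}))
}
```

Returning an error for incomplete data lets the caller skip the garbage collection cycle instead of crashing on an out-of-range index.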
Paco Xu
78d3830d97 ignore order of containers status allocated resources 2024-07-29 16:48:00 +08:00
Kubernetes Prow Robot
e9d9a82839 Merge pull request #124101 from haircommander/process_stats-with-pid-fix
kubelet: fix PID based eviction
2024-07-25 11:59:57 -07:00
Kevin Hannon
3e642aee3f move container fs check so that we only check if system is split 2024-07-24 11:22:23 -04:00
carlory
c4851c64a0 remove volumeoptions from VolumePlugin and BlockVolumePlugin 2024-07-24 14:07:02 +08:00
Kubernetes Prow Robot
57d197fb89 Merge pull request #124430 from AllenXu93/fix-kubelet-restart-notReady
fix node notReady in first sync period after kubelet restart
2024-07-23 21:20:40 -07:00
Kubernetes Prow Robot
5af1710d90 Merge pull request #126243 from SergeyKanzhelev/devicePluginFailures
Implement resource health in pod status (KEP 4680)
2024-07-23 20:12:24 -07:00
Kubernetes Prow Robot
d97cf3a1eb Merge pull request #126303 from bart0sh/PR150-dra-refactor-checkpoint-upstream
DRA: refactor checkpointing
2024-07-23 18:01:53 -07:00
Sergey Kanzhelev
62f96d2748 set AllocatedResourcesStatus in the Pod Status 2024-07-24 00:29:35 +00:00
Kubernetes Prow Robot
fa4b8f32ac Merge pull request #125935 from gjkim42/fix-125880
Terminate restartable init containers ignoring not-started containers
2024-07-23 15:45:11 -07:00
Ed Bartosh
c0d922e786 DRA: Kubelet code cleanup 2024-07-24 00:27:52 +03:00
Ed Bartosh
59555c6a62 DRA: move dra/checkpont/* to dra/state/* 2024-07-24 00:12:10 +03:00
Ed Bartosh
35fbbc5cfd DRA: use crc32.ChecksumIEEE to calculate checkpoint checksum 2024-07-24 00:10:39 +03:00
Ed Bartosh
59daed75d6 DRA: refactor checkpointing
Co-authored-by: Kevin Klues <klueska@gmail.com>
2024-07-24 00:10:30 +03:00
Kubernetes Prow Robot
107f621462 Merge pull request #126108 from gnufied/changes-volume-recovery
Reduce state changes when expansion fails and mark certain failures as infeasible
2024-07-23 13:30:56 -07:00
Kubernetes Prow Robot
fbdfb9d8d9 Merge pull request #126031 from harche/kubelet_cgroupv1_arg
KEP-4569: Kubelet option to disable cgroup v1 support
2024-07-23 09:21:11 -07:00
Kubernetes Prow Robot
a4f9910c51 Merge pull request #126014 from PannagaRao/kep-ephemeral-storage-quota
pkg/volume/*: Enable quotas in user namespace
2024-07-23 09:21:02 -07:00
Kubernetes Prow Robot
d7194eb370 Merge pull request #124884 from carlory/report-event-when-kubelet-attach-failed
report an event to pod if kubelet does attach operation failed
2024-07-23 09:20:43 -07:00
Kubernetes Prow Robot
581a073dc4 Merge pull request #125663 from saschagrunert/oci-volumesource-kubelet
[KEP-4639] Add `ImageVolumeSource` implementation
2024-07-22 15:48:33 -07:00
Kubernetes Prow Robot
d21b17264e Merge pull request #125488 from pohly/dra-1.31
DRA for 1.31
2024-07-22 11:45:55 -07:00
Sascha Grunert
979863d15c Add ImageVolumeSource implementation
This patch adds the kubelet implementation of the image volume source
feature.

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2024-07-22 18:46:46 +02:00
Patrick Ohly
d11b58efe6 DRA kubelet: refactor gRPC call timeouts
Some of the E2E node tests were flaky. Their timeout apparently was chosen
under the assumption that kubelet would retry immediately after a failed gRPC
call, with a factor of 2 as safety margin. But according to
0449cef8fd,
kubelet has a different, higher retry period of 90 seconds, which was exactly
the test timeout. The test timeout has to be higher than that.

As the tests don't use the gRPC call timeout anymore, it can be made
private. While at it, the name and documentation get updated.
2024-07-22 18:09:34 +02:00
Patrick Ohly
877829aeaa DRA kubelet: adapt to v1alpha3 API
This adds the ability to select specific requests inside a claim for a
container.

NodePrepareResources is always called, even if the claim is not used by any
container. This could be useful for drivers where that call has some effect
other than injecting CDI device IDs into containers. It also ensures that
drivers can validate configs.

The pod resource API can no longer report a class for each claim because there
is no such 1:1 relationship anymore. Instead, that API reports claim,
API devices (with driver/pool/device as ID) and CDI device IDs. The kubelet
itself doesn't extract that information from the claim. Instead, it relies on
drivers to report this information when the claim gets prepared. This isolates
the kubelet from API changes.

Because of a faulty E2E test, kubelet was told to contact the wrong driver for
a claim. This was not visible in the kubelet log output. Now changes to the
claim info cache are getting logged. While at it, naming of variables and some
existing log output gets harmonized.

Co-authored-by: Oksana Baranova <oksana.baranova@intel.com>
Co-authored-by: Ed Bartosh <eduard.bartosh@intel.com>
2024-07-22 18:09:34 +02:00
Patrick Ohly
91d7882e86 DRA: new API for 1.31
This is a complete revamp of the original API. Some of the key
differences:
- refocused on structured parameters and allocating devices
- support for constraints across devices
- support for allocating "all" or a fixed amount
  of similar devices in a single request
- no class for ResourceClaims, instead individual
  device requests are associated with a mandatory
  DeviceClass

For the sake of simplicity, optional basic types (ints, strings) where the null
value is the default are represented as values in the API types. This makes Go
code simpler because it doesn't have to check for nil (consumers) and values
can be set directly (producers). The effect is that in protobuf, these fields
always get encoded because `opt` only has an effect for pointers.

The roundtrip test data for v1.29.0 and v1.30.0 changes because of the new
"request" field. This is considered acceptable because the entire `claims`
field in the pod spec is still alpha.

The implementation is complete enough to bring up the apiserver.
Adapting other components follows.
2024-07-22 18:09:34 +02:00
Itamar Holder
6c1f14c468 unit tests: exclude critical pods from swapping
Signed-off-by: Itamar Holder <iholder@redhat.com>
2024-07-22 17:56:52 +03:00
Itamar Holder
532cd5f84c Exclude critical pods from having swap access
Signed-off-by: Itamar Holder <iholder@redhat.com>
2024-07-22 17:56:52 +03:00
Peter Hunt
5fd7219cf4 kubelet/stats: fix pid stats for cadvisor stats provider
the process stats aren't correct coming from only the pod stats.
They need to be summed for all of the containers, as cadvisor
is only reading per pid (per container process)

Signed-off-by: Peter Hunt <pehunt@redhat.com>
2024-07-22 10:54:42 -04:00
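
The fix described in the commit above amounts to summing per-container process counts to obtain the pod-level figure. A minimal sketch, assuming a hypothetical `containerStats` type rather than the cadvisor or kubelet stats types:

```go
package main

import "fmt"

// containerStats is a hypothetical per-container record.
type containerStats struct {
	Name         string
	ProcessCount uint64
}

// podProcessCount sums the process counts of every container in the pod,
// since each container's figure covers only that container's processes.
func podProcessCount(containers []containerStats) uint64 {
	var total uint64
	for _, c := range containers {
		total += c.ProcessCount
	}
	return total
}

func main() {
	containers := []containerStats{
		{Name: "app", ProcessCount: 3},
		{Name: "sidecar", ProcessCount: 2},
	}
	fmt.Println(podProcessCount(containers)) // prints 5
}
```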
Kevin Hannon
7d8ba7849b priority pid tests should match on processes
pids 0
process should not be nonzero
2024-07-22 10:54:42 -04:00
David Porter
6e6b2b76a3 test: Update summary test to check for process count
The process count is expected to always be >= 1 for pods in the test.

Let's check it's >= 1, so we can catch issues if the process count is
not reported.

Signed-off-by: David Porter <david@porter.me>
Signed-off-by: Paco Xu <paco.xu@daocloud.io>
2024-07-22 10:54:42 -04:00
David Porter
f58b46cb97 fix process stats
Signed-off-by: David Porter <david@porter.me>
2024-07-22 10:54:42 -04:00
PannagaRamamanohara
d16fd6a915 pkg/volume: Use QuotaMonitoring in UserNamespace
Enable LocalStorageCapacityIsolationFSQuotaMonitoring
only when hostUsers in PodSpec is set to false.
Modify unit tests and e2e tests to verify

Signed-off-by: PannagaRamamanohara <pbhojara@redhat.com>
2024-07-22 09:43:57 -04:00
Patrick Ohly
b51d68bb87 DRA: bump API v1alpha2 -> v1alpha3
This is in preparation for revamping the resource.k8s.io completely. Because
there will be no support for transitioning from v1alpha2 to v1alpha3, the
roundtrip test data for that API in 1.29 and 1.30 gets removed.

Repeating the version in the import name of the API packages is not really
required. It was done for a while to support simpler grepping for usage of
alpha APIs, but there are better ways for that now. So during this transition,
"resourceapi" gets used instead of "resourcev1alpha3" and the version gets
dropped from informer and lister imports. The advantage is that the next bump
to v1beta1 will affect fewer source code lines.

Only source code where the version really matters (like API registration)
retains the versioned import.
2024-07-21 17:28:13 +02:00
Kubernetes Prow Robot
558c9536a1 Merge pull request #123678 from kinvolk/userns-use-kubelet-user-mappings
kubelet: Add logs for userns custom mappings parsing
2024-07-20 19:59:57 -07:00
Kubernetes Prow Robot
a8d354bf39 Merge pull request #126122 from HirazawaUi/remove-unused-options
kubelet: Remove unused run container options
2024-07-19 18:05:16 -07:00
Kubernetes Prow Robot
14b34fc255 Merge pull request #125834 from tallclair/log-cleanup
[kubelet] Cleanup incorrect log about static pod status change
2024-07-19 16:58:54 -07:00