Commit Graph

1408 Commits

Author SHA1 Message Date
Kubernetes Prow Robot
f2428d66cc Merge pull request #125163 from pohly/dra-kubelet-api-version-independent-no-rest-proxy
DRA: make kubelet independent of the resource.k8s.io API version
2024-07-18 17:47:48 -07:00
Kubernetes Prow Robot
5fc7032a0e Merge pull request #126156 from pohly/kubelet-test-enhancements
kubelet test enhancements
2024-07-18 14:50:54 -07:00
Patrick Ohly
7701a48bd6 dra kubelet: bump gRPC API to v1alpha4
The previous changes are an API break, therefore we need a new version.
2024-07-18 23:30:09 +02:00
Kubernetes Prow Robot
9196650533 Merge pull request #123819 from fakecore/fc/master
fix: handle socket file detection on Windows
2024-07-18 00:53:16 -07:00
Patrick Ohly
348f94ab55 DRA: read ResourceClaim in DRA drivers
This is the second and final step towards making kubelet independent of the
resource.k8s.io API versioning because it now doesn't need to copy structs
defined by that API from the driver to the API server.
2024-07-18 09:09:20 +02:00
Patrick Ohly
616a014347 DRA: move ResourceSlice publishing into DRA drivers
This is a first step towards making kubelet independent of the resource.k8s.io
API versioning because it now doesn't need to copy structs defined by that API
from the driver to the API server. The next step is removing the other
direction (reading ResourceClaim status and passing the resource handle to
drivers).

The drivers must get deployed so that they have their own connection to the API
server. Securing at least the writes via a validating admission policy should
be possible.

As before, the kubelet removes all ResourceSlices for its node at startup, then
DRA drivers recreate them if (and only if) they start up again. This ensures
that there are no orphaned ResourceSlices when a driver gets removed while the
kubelet was down.

While at it, logging gets cleaned up and updated to use structured, contextual
logging as much as possible. gRPC requests and streams now use a shared,
per-process request ID and streams also get logged.
2024-07-18 09:09:19 +02:00
Patrick Ohly
b9d00841a6 kubelet: improve checkpoint errors
Recording the expected and actual checksum in the error makes it possible to
provide that information, for example in a failed test like the ones for DRA.
Otherwise developers have to manually step through the test with a debugger to
figure out what the new checksum is.
2024-07-17 16:07:31 +02:00
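A minimal sketch of the idea behind b9d00841a6, assuming a hypothetical error type and an FNV-1a hash chosen only for illustration (the kubelet's own type names and hashing scheme may differ): the error carries both checksums, so a failing test can print them without a debugger session.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// CorruptCheckpointError is a hypothetical error type (an assumption, not the
// kubelet's own definition) that records both checksums.
type CorruptCheckpointError struct {
	Expected uint64 // checksum stored alongside the checkpoint
	Actual   uint64 // checksum computed from the data that was read
}

func (e *CorruptCheckpointError) Error() string {
	return fmt.Sprintf("checkpoint is corrupted: expected checksum %d, actual checksum %d",
		e.Expected, e.Actual)
}

// checksum hashes the payload with FNV-1a; this hash is an illustrative choice.
func checksum(data []byte) uint64 {
	h := fnv.New64a()
	h.Write(data)
	return h.Sum64()
}

// verify compares the stored checksum with one computed from the data and
// returns an error that carries both values on mismatch.
func verify(data []byte, stored uint64) error {
	if actual := checksum(data); actual != stored {
		return &CorruptCheckpointError{Expected: stored, Actual: actual}
	}
	return nil
}

func main() {
	data := []byte(`{"claims":[]}`)
	fmt.Println(verify(data, checksum(data)))   // matching checksum: no error
	fmt.Println(verify(data, checksum(data)+1)) // mismatch: both values are reported
}
```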
Kubernetes Prow Robot
2263f2d719 Merge pull request #124148 from cyclinder/add_flag_kubelet
kubelet: Add a TopologyManager policy option: max-allowable-numa-nodes
2024-07-15 19:27:16 -07:00
Kubernetes Prow Robot
3361895612 Merge pull request #123733 from Jeffwan/jiaxin/kep-4176-240305
KEP-4176: Add a new static policy SpreadPhysicalCPUsPreferredOption
2024-07-15 01:41:10 -07:00
Jiaxin Shan
6c85fd4ddd KEP-4176: Add static policy option to distribute cpus across cores
2024-07-12 11:52:51 -07:00
Kubernetes Prow Robot
2d4514e169 Merge pull request #125802 from mmorel-35/testifylint/len+empty
fix: enable empty and len rules from testifylint on pkg and staging package
2024-07-11 23:12:06 -07:00
Harshal Patil
68d317a8d1 Add a warning log, event and metric for cgroup version 1
Signed-off-by: Harshal Patil <harpatil@redhat.com>
2024-07-09 11:34:46 -04:00
cyclinder
87129c350a kubelet: Add a TopologyManager policy option: "max-allowable-numa-nodes"
Signed-off-by: cyclidner <kuocyclinder@gmail.com>
2024-07-09 22:26:24 +08:00
Matthieu MOREL
f014b754fb fix: enable empty and len rules from testifylint on pkg package
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>

Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
2024-07-06 23:15:43 +00:00
Kubernetes Prow Robot
7e1a5a0ea8 Merge pull request #125687 from bart0sh/PR146-DevicePluginCDIDevices-LockToDefault
kube_features: DevicePluginCDIDevices: LockToDefault
2024-07-01 17:07:41 -07:00
Kubernetes Prow Robot
34b8832edb Merge pull request #125631 from SergeyKanzhelev/logFailedAdmission
improve logging of pod admission denied
2024-06-28 19:36:20 -07:00
Kubernetes Prow Robot
16b7d5310a Merge pull request #125047 from zhanluxianshen/clean-typos-in-kubelet
clean typos logs in kubelet.
2024-06-28 16:48:24 -07:00
Kubernetes Prow Robot
ac9aec9f9b Merge pull request #125116 from pohly/dra-one-of-source
DRA: remove "source" indirection from v1 Pod API
2024-06-28 12:46:45 -07:00
Matthieu MOREL
0cde5f1e28 fix: enable bool-compare rule from testifylint linter (#125135)
* fix: enable bool-compare rule from testifylint linter

Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>

* Update hack/golangci.yaml.in

Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>

* Update golangci.yaml.in

* Update golangci-strict.yaml

* Update golangci.yaml.in

* Update golangci.yaml.in

* Update golangci.yaml.in

* Update golangci.yaml.in

* Update golangci.yaml

* Update golangci-hints.yaml

* Update golangci-strict.yaml

* Update golangci.yaml.in

* Update golangci.yaml

* Update mux_test.go

---------

Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
2024-06-28 10:58:05 -07:00
Kubernetes Prow Robot
bcadbfcc55 Merge pull request #125496 from harche/cgroup_imp
KEP-4569: Separate cgroup v1 and v2 manager implementations
2024-06-28 09:54:02 -07:00
Harshal Patil
79495a21a8 Separate cgroup v1 and v2 manager implementations
Signed-off-by: Harshal Patil <harpatil@redhat.com>
2024-06-28 07:49:43 -04:00
Patrick Ohly
bde9b64cdf DRA: remove "source" indirection from v1 Pod API
This makes the API nicer:

    resourceClaims:
    - name: with-template
      resourceClaimTemplateName: test-inline-claim-template
    - name: with-claim
      resourceClaimName: test-shared-claim

Previously, this was:

    resourceClaims:
    - name: with-template
      source:
        resourceClaimTemplateName: test-inline-claim-template
    - name: with-claim
      source:
        resourceClaimName: test-shared-claim

A more long-term benefit is that other, future alternatives
might not make sense under the "source" umbrella.

This is a breaking change. It's justified because DRA is still
alpha and will have several other API breaks in 1.31.
2024-06-27 17:53:24 +02:00
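The flattening described in bde9b64cdf can be sketched with Go types. The structs below are simplified stand-ins for the v1 Pod API types (reduced field sets, names prefixed with Old are assumptions), and flatten is a hypothetical helper, not part of Kubernetes.

```go
package main

import "fmt"

// OldClaimSource stands in for the "source" wrapper that the commit removes.
type OldClaimSource struct {
	ResourceClaimName         *string
	ResourceClaimTemplateName *string
}

// OldPodResourceClaim is the previous shape, with the extra indirection.
type OldPodResourceClaim struct {
	Name   string
	Source OldClaimSource
}

// PodResourceClaim is the flattened shape: both fields sit next to Name.
type PodResourceClaim struct {
	Name                      string
	ResourceClaimName         *string
	ResourceClaimTemplateName *string
}

// flatten is a hypothetical helper that lifts the two fields out of "source".
func flatten(old OldPodResourceClaim) PodResourceClaim {
	return PodResourceClaim{
		Name:                      old.Name,
		ResourceClaimName:         old.Source.ResourceClaimName,
		ResourceClaimTemplateName: old.Source.ResourceClaimTemplateName,
	}
}

func main() {
	tmpl := "test-inline-claim-template"
	old := OldPodResourceClaim{
		Name:   "with-template",
		Source: OldClaimSource{ResourceClaimTemplateName: &tmpl},
	}
	flat := flatten(old)
	fmt.Println(flat.Name, *flat.ResourceClaimTemplateName, flat.ResourceClaimName == nil)
}
```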
Ed Bartosh
f53991d111 kube_features: DevicePluginCDIDevices: LockToDefault
2024-06-25 16:14:48 +03:00
Sergey Kanzhelev
e8e2fda5c3 improve logging of pod admission denied
2024-06-21 17:46:49 +00:00
Stephen Kitt
3f36c83c68 Switch to stretchr/testify / mockery for mocks
testify is used throughout the codebase; this switches mocks from
gomock to testify with the help of mockery for code generation.

Handlers and mocks in test/utils/oidc are moved to a new package:
mockery operates package by package, and requires packages to build
correctly; test/utils/oidc/testserver.go relies on the mocks and fails
to build when they are removed. Moving the interface and mocks to a
different package allows mockery to process that package without
having to build testserver.go.

Signed-off-by: Stephen Kitt <skitt@redhat.com>
2024-06-20 19:42:53 +02:00
Kubernetes Prow Robot
e6616033cb Merge pull request #120844 from bzsuni/cleanup/sets/kubelet
[kubelet] Use a generic Set instead of a specified Set
2024-06-14 09:09:17 -07:00
Harshal Patil
966d304704 Report correct error after validating the root container
Signed-off-by: Harshal Patil <harpatil@redhat.com>
2024-06-11 16:42:59 -04:00
Kubernetes Prow Robot
a8d51f4f05 Use a generic Set instead of a specified Set in kubelet
Signed-off-by: bzsuni <bingzhe.sun@daocloud.io>
2024-06-04 14:25:43 +08:00
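For context on a8d51f4f05: a type-specific set needs one implementation per element type, while a generic set covers all comparable types with one. The sketch below is a stand-alone illustration written for this listing, not the implementation from k8s.io/apimachinery/pkg/util/sets; the names Set, NewSet, Insert, Has and Len are chosen here for the example.

```go
package main

import "fmt"

// Set is a minimal generic set keyed by any comparable type.
type Set[T comparable] map[T]struct{}

// NewSet builds a set from the given items; duplicates collapse.
func NewSet[T comparable](items ...T) Set[T] {
	s := make(Set[T], len(items))
	s.Insert(items...)
	return s
}

// Insert adds items to the set and returns it for chaining.
func (s Set[T]) Insert(items ...T) Set[T] {
	for _, item := range items {
		s[item] = struct{}{}
	}
	return s
}

// Has reports whether item is in the set.
func (s Set[T]) Has(item T) bool {
	_, ok := s[item]
	return ok
}

// Len returns the number of distinct items.
func (s Set[T]) Len() int {
	return len(s)
}

func main() {
	// The same type works for strings and ints without separate implementations.
	names := NewSet("cpu", "memory", "cpu")
	ids := NewSet(1, 2, 3)
	fmt.Println(names.Len(), names.Has("cpu"), ids.Has(4))
}
```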
Kubernetes Prow Robot
fad52aedfc Merge pull request #125086 from oxxenix/exponential-backoff
add exponential backoff in NodeResourceSlices controller
2024-05-28 02:46:43 -07:00
Oksana Baranova
c4ec24890e nodeResourceSlicesController: add exponential backoff
2024-05-27 23:12:53 +03:00
zhanluxianshen
e5c229fafa clean typos logs in kubelet.
2024-05-22 16:56:06 +08:00
Itamar Holder
a6b971f14b Use kubelet owned directories for mounting rather than /tmp
Signed-off-by: Itamar Holder <iholder@redhat.com>
2024-05-21 13:18:16 +03:00
Itamar Holder
74f29880bd Replace log entry by a warning event
Signed-off-by: Itamar Holder <iholder@redhat.com>
2024-05-21 13:18:16 +03:00
Itamar Holder
29535c0463 Warn if swap is enabled on the OS and tmpfs noswap is not supported
When the --fail-swap-on=false kubelet CLI argument
is provided, but tmpfs noswap is not supported
by the kernel, warn about the risk of memory-backed
volumes being swapped to disk

Signed-off-by: Itamar Holder <iholder@redhat.com>
2024-05-21 13:18:16 +03:00
Kubernetes Prow Robot
8352c09592 Merge pull request #124323 from bart0sh/PR142-dra-fix-cache-integrity
kubelet: DRA: fix cache integrity
2024-05-13 09:54:02 -07:00
Alvaro Aleman
6d0ac8c561 Use the generic/typed workqueue throughout
This change makes us use the generic workqueue throughout the project in
order to improve type safety and readability of the code.
2024-05-04 14:33:12 -04:00
Ed Bartosh
f24134d7b2 kubelet: DRA: add unit test for ClaimInfo and claimInfoCache
2024-05-03 13:30:31 +00:00
Ed Bartosh
6ce294558a kubelet: DRA: add stress test
The test calls the PrepareResources and UnprepareResources APIs in
parallel to help discover race conditions.
2024-05-03 13:30:29 +00:00
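The shape of such a stress test can be sketched as follows, assuming a hypothetical mutex-protected cache with prepare and unprepare methods in place of the real DRA manager. Many goroutines call both operations concurrently; running the program with `go run -race` lets the Go race detector flag unsynchronized access.

```go
package main

import (
	"fmt"
	"sync"
)

// claimCache is a hypothetical stand-in for the kubelet's claim info cache.
type claimCache struct {
	mu       sync.Mutex
	prepared map[string]int // claim key -> number of active preparations
}

// prepare records one more user of the claim.
func (c *claimCache) prepare(claim string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.prepared[claim]++
}

// unprepare drops one user and removes the entry when none are left.
func (c *claimCache) unprepare(claim string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.prepared[claim] > 0 {
		c.prepared[claim]--
	}
	if c.prepared[claim] == 0 {
		delete(c.prepared, claim)
	}
}

// size returns the number of claims that are still prepared.
func (c *claimCache) size() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.prepared)
}

// stress runs prepare/unprepare pairs from many goroutines on a few shared
// claims and returns how many entries remain afterwards.
func stress(workers int) int {
	cache := &claimCache{prepared: map[string]int{}}
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			claim := fmt.Sprintf("default/claim-%d", id%5)
			cache.prepare(claim)
			cache.unprepare(claim)
		}(i)
	}
	wg.Wait()
	return cache.size()
}

func main() {
	fmt.Println("claims left after stress run:", stress(100))
}
```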
Kevin Klues
86a18d5333 kubelet: DRA: update manager test to adhere to new claiminfo cache APIs
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2024-05-03 13:28:37 +00:00
Kevin Klues
805e7c3434 kubelet: DRA: remove check to set pluginName to DriverName if not in ResourceHandle
It has always been validated that a ResourceHandle MUST have DriverName set, so
this check is unnecessary.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2024-05-03 13:23:29 +00:00
Kevin Klues
f80be2728e kubelet: DRA: change key of claimInfo cache to "namespace/claimname"
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2024-05-03 13:23:29 +00:00
Kevin Klues
639e887631 kubelet: DRA: add a reconcile loop to unprepare claims for deleted pods
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2024-05-03 13:23:29 +00:00
Kevin Klues
a8931c6c25 kubelet: DRA: update locking/checkpoint semantics of the claimInfo cache
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2024-05-03 13:23:27 +00:00
Kubernetes Prow Robot
1fd835ce59 Merge pull request #123398 from ffromani/remove-legacy-checkpoint
node: devicemgr: remove obsolete pre-1.20 checkpoint file support
2024-04-29 14:46:53 -07:00
Marek Siarkowicz
3ee8178768 Cleanup defer from SetFeatureGateDuringTest function call
2024-04-24 20:25:29 +02:00
Patrick Ohly
77341f7595 DRA: remove support for v1alpha2 kubelet API
The v1alpha2 API is several releases old. No current drivers should still
depend on it.
2024-04-19 18:27:05 +02:00
Kubernetes Prow Robot
bbfd2145de Merge pull request #124091 from bitoku/dra-nil-check
kubelet: add nil check for Node(Un)PrepareResources.
2024-04-18 10:46:05 -07:00
Kubernetes Prow Robot
528cff12f6 Merge pull request #120969 from skitt/uber-go-mock
Switch from golang/mock to uber-go/mock
2024-04-17 23:59:24 -07:00
Francesco Romani
181fb0da51 node: devicemgr: remove obsolete pre-1.20 checkpoint file support
In commit 2f426fdba6 we added
compatibility (and tests) to deal with pre-1.20 checkpoint files.
We are now well past the end of support for pre-1.20 kubelets,
so we can get rid of this code.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2024-04-15 14:01:56 +02:00
Ayato Tokubi
d04f87abde add nil check for Node(Un)PrepareResources.
Signed-off-by: Ayato Tokubi <atokubi@redhat.com>
2024-04-04 23:24:25 +00:00