kubernetes

Author	SHA1	Message	Date
Kubernetes Submit Queue	a3f40dd8df	Merge pull request #60856 from jiayingz/race-fix Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Fixes the races around devicemanager Allocate() and endpoint deletion. There is a race in predicateAdmitHandler Admit() that getNodeAnyWayFunc() could get Node with non-zero deviceplugin resource allocatable for a non-existing endpoint. That race can happen when a device plugin fails, but is more likely when kubelet restarts as with the current registration model, there is a time gap between kubelet restart and device plugin re-registration. During this time window, even though devicemanager could have removed the resource initially during GetCapacity() call, Kubelet may overwrite the device plugin resource capacity/allocatable with the old value when node update from the API server comes in later. This could cause a pod to be started without proper device runtime config set. To solve this problem, introduce endpointStopGracePeriod. When a device plugin fails, don't immediately remove the endpoint but set stopTime in its endpoint. During kubelet restart, create endpoints with stopTime set for any checkpointed registered resource. The endpoint is considered to be in stopGracePeriod if its stoptime is set. This allows us to track what resources should be handled by devicemanager during the time gap. When an endpoint's stopGracePeriod expires, we remove the endpoint and its resource. This allows the resource to be exported through other channels (e.g., by directly updating node status through API server) if there is such use case. Currently endpointStopGracePeriod is set as 5 minutes. Given that an endpoint is no longer immediately removed upon disconnection, mark all its devices unhealthy so that we can signal the resource allocatable change to the scheduler to avoid scheduling more pods to the node. When a device plugin endpoint is in stopGracePeriod, pods requesting the corresponding resource will fail admission handler. Tested: Ran GPUDevicePlugin e2e_node test 100 times and all passed now. What this PR does / why we need it: Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes https://github.com/kubernetes/kubernetes/issues/60176 Special notes for your reviewer: Release note: ```release-note Fixes the races around devicemanager Allocate() and endpoint deletion. ```	2018-03-12 02:50:13 -07:00
Jiaying Zhang	5514a1f4dd	Fixes the races around devicemanager Allocate() and endpoint deletion. There is a race in predicateAdmitHandler Admit() that getNodeAnyWayFunc() could get Node with non-zero deviceplugin resource allocatable for a non-existing endpoint. That race can happen when a device plugin fails, but is more likely when kubelet restarts as with the current registration model, there is a time gap between kubelet restart and device plugin re-registration. During this time window, even though devicemanager could have removed the resource initially during GetCapacity() call, Kubelet may overwrite the device plugin resource capacity/allocatable with the old value when node update from the API server comes in later. This could cause a pod to be started without proper device runtime config set. To solve this problem, introduce endpointStopGracePeriod. When a device plugin fails, don't immediately remove the endpoint but set stopTime in its endpoint. During kubelet restart, create endpoints with stopTime set for any checkpointed registered resource. The endpoint is considered to be in stopGracePeriod if its stoptime is set. This allows us to track what resources should be handled by devicemanager during the time gap. When an endpoint's stopGracePeriod expires, we remove the endpoint and its resource. This allows the resource to be exported through other channels (e.g., by directly updating node status through API server) if there is such use case. Currently endpointStopGracePeriod is set as 5 minutes. Given that an endpoint is no longer immediately removed upon disconnection, mark all its devices unhealthy so that we can signal the resource allocatable change to the scheduler to avoid scheduling more pods to the node. When a device plugin endpoint is in stopGracePeriod, pods requesting the corresponding resource will fail admission handler.	2018-03-09 17:00:57 -08:00
Jing Xu	b2e744c620	Promote LocalStorageCapacityIsolation feature to beta The LocalStorageCapacityIsolation feature added a new resource type ResourceEphemeralStorage "ephemeral-storage" so that this resource can be allocated, limited, and consumed in the same way as CPU/memory. All the features related to resource management (resource request/limit, quota, limitrange) are available for local ephemeral storage. This local ephemeral storage represents the storage for the root file system, which will be consumed by containers' writable layer and logs. Some volumes such as emptyDir might also consume this storage.	2018-03-02 15:10:08 -08:00
Kubernetes Submit Queue	e31c8a2252	Merge pull request #60318 from jiayingz/api-change Automatic merge from submit-queue (batch tested with PRs 59159, 60318, 60079, 59371, 57415). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Made a couple API changes to deviceplugin/v1beta1 to avoid future incompatible API changes: - Add GetDevicePluginOptions rpc call. This is needed when we switch from Registration service to probe-based plugin watcher. - Change AllocateRequest and AllocateResponse to allow device requests from multiple containers in a pod. Currently only made mechanical change on the devicemanager and test code to cope with the API but still issues an Allocate call per container. We can modify the devicemanager in 1.11 to issue a single Allocate call per pod. The change will also facilitate incremental API change to communicate pod level information through Allocate rpc if there is such future need. What this PR does / why we need it: Made a couple API changes to deviceplugin/v1beta1 to avoid future incompatible API changes. Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes https://github.com/kubernetes/kubernetes/issues/59370 Special notes for your reviewer: Release note: ```release-note ```	2018-02-24 21:19:33 -08:00
Jiaying Zhang	07beac6004	Made a couple API changes to deviceplugin/v1beta1 to avoid future incompatible changes: - Add GetDevicePluginOptions rpc call. This is needed when we switch from Registration service to probe-based plugin watcher. - Change AllocateRequest and AllocateResponse to allow device requests from multiple containers in a pod. Currently only made mechanical change on the devicemanager and test code to cope with the API but still issues an Allocate call per container. We can modify the devicemanager in 1.11 to issue a single Allocate call per pod. The change will also facilitate incremental API change to communicate pod level information through Allocate rpc if there is such future need.	2018-02-23 16:15:09 -08:00
Kubernetes Submit Queue	d5aba0c6ca	Merge pull request #59088 from YuxiJin-tobeyjin/codeClean-merge-logfAndFailnow-to-fatalf Automatic merge from submit-queue (batch tested with PRs 60106, 59510, 60263, 60063, 59088). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. CodeClean, merge Logf And FailNow to Fatalf What this PR does / why we need it: Trivial changes to clean code, merge Logf And FailNow to Fatalf. Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes # Special notes for your reviewer: Release note: ```release-note "NONE" ```	2018-02-23 02:59:55 -08:00
Kubernetes Submit Queue	e8dd75f37d	Merge pull request #58282 from vikaschoudhary16/per-container-allocate Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Invoke preStart RPC call before container start, if desired by plugin What this PR does / why we need it: 1. Adds a new RPC `preStart` to device plugin API 2. Update `Register` RPC handling to receive a flag from the Device plugins as an indicator if kubelet should invoke `preStart` RPC before starting container. 3. Changes in device manager to invoke `preStart` before container start 4. Test case updates Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes #56943 #56307 Special notes for your reviewer: Release note: ```release-note None ``` /sig node /area hw-accelerators /cc @jiayingz @RenaudWasTaken @vishh @ScorpioCPH @sjenning @derekwaynecarr @jeremyeder @lichuqiang @tengqm	2018-02-21 13:07:26 -08:00
vikaschoudhary16	e64517cd74	Migrate deviceplugin api from v1alpha to v1beta1	2018-02-21 01:26:20 -05:00
vikaschoudhary16	defcab81d5	Invoke PreStart RPC call before container start, if desired by plugin Signed-off-by: vikaschoudhary16 <vichoudh@redhat.com>	2018-02-21 01:25:24 -05:00
ravisantoshgudimetla	a9a724d500	Test cases fix after path expansion	2018-02-20 14:23:09 -05:00
Kubernetes Submit Queue	96ec318718	Merge pull request #59842 from ixdy/update-rules_go-02-2018 Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Update bazelbuild/rules_go, kubernetes/repo-infra, and gazelle dependencies What this PR does / why we need it: updates our bazelbuild/rules_go dependency in order to bump everything to go1.9.4. I'm separating this effort into two separate PRs, since updating rules_go requires a large cleanup, removing an attribute from most build rules. Release note: ```release-note NONE ```	2018-02-19 22:23:05 -08:00
David Ashpole	960856f4e8	collect metrics on the /kubepods cgroup on-demand	2018-02-17 12:32:40 -08:00
Jeff Grafton	ef56a8d6bb	Autogenerated: hack/update-bazel.sh	2018-02-16 13:43:01 -08:00
David Ashpole	b259543985	collect ephemeral storage capacity on initialization	2018-02-15 17:33:22 -08:00
Kubernetes Submit Queue	58dcf3c533	Merge pull request #59489 from pohly/master-tmpdir Automatic merge from submit-queue (batch tested with PRs 59489, 59716). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. devicemanager testing: dynamically choose tmp dir This avoids the test issue #59488 that I was running into. I believe I have a reasonable explanation for the race condition in that issue (TLDR: it's probably part of the gRPC API and k8s can only avoid the issue until a proper solution gets worked out together with gRPC), therefore I suggest to merge this PR now both because it avoids the issue and because using fixed tmp directories is something that should be avoided anyway. /assign @jiayingz	2018-02-14 00:14:31 -08:00
Kubernetes Submit Queue	317853c90c	Merge pull request #59464 from dixudx/fix_all_typos Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. fix all the typos across the project What this PR does / why we need it: There are lots of typos across the project. We should avoid small PRs on fixing those annoying typos, which is time-consuming and low efficient. This PR does fix all the typos across the project currently. And with #59463, typos could be avoided when a new PR gets merged. Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes # Special notes for your reviewer: /sig testing /area test-infra /sig release /cc @ixdy /assign @fejta Release note: ```release-note None ```	2018-02-10 22:12:45 -08:00
Di Xu	48388fec7e	fix all the typos across the project	2018-02-11 11:04:14 +08:00
Patrick Ohly	0d828e061b	devicemanager testing: time out sooner Each individual step should not take longer than a second. Suggested by Vikas Choudhary (https://github.com/kubernetes/kubernetes/pull/59489#discussion_r167205672).	2018-02-09 20:51:54 +01:00
Kubernetes Submit Queue	76e6da25fa	Merge pull request #59481 from rojkov/dm-unittests Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. devicemanager: increase code coverage of endpoint's unit test Particularly cover the code path when an unhealthy device becomes healthy.	2018-02-09 10:35:22 -08:00
Patrick Ohly	1325c2f8be	devicemanager testing: dynamically choose tmp dir Hard-coding the tests to use /tmp/device_plugin for sockets is problematic because it prevents running tests in parallel on the same machine (perhaps because there are multiple developers, perhaps because testing is done independently on different code checkouts). /tmp/device_plugin also was not removed after testing. This is probably not that relevant. But more importantly, this change also fixes https://github.com/kubernetes/kubernetes/issues/59488. "make test" failed in TestDevicePluginReRegistration because something removed /tmp/device_plugin/device-plugin.sock while something else tried to connect to it: 2018/02/07 14:34:39 Starting to serve on /tmp/device_plugin/device-plugin.sock [pid 29568] connect(14, {sa_family=AF_UNIX, sun_path="/tmp/device_plugin/server.sock"}, 33) = 0 [pid 29568] unlinkat(AT_FDCWD, "/tmp/device_plugin/server.sock", 0) = 0 [pid 29568] unlinkat(AT_FDCWD, "/tmp/device_plugin/device-plugin.sock", 0) = 0 [pid 29568] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=29568, si_uid=1000} --- [pid 29568] connect(6, {sa_family=AF_UNIX, sun_path="/tmp/device_plugin/device-plugin.sock"}, 40) = -1 ENOENT (No such file or directory) E0207 14:34:39.961321 29568 endpoint.go:117] listAndWatch ended unexpectedly for device plugin mock with error rpc error: code = Unavailable desc = transport is closing strace: Process 29623 attached [pid 29574] connect(3, {sa_family=AF_UNIX, sun_path="/tmp/device_plugin/device-plugin.sock"}, 40) = -1 ENOENT (No such file or directory) [pid 29623] connect(3, {sa_family=AF_UNIX, sun_path="/tmp/device_plugin/device-plugin.sock"}, 40) = -1 ENOENT (No such file or directory) [pid 29574] connect(3, {sa_family=AF_UNIX, sun_path="/tmp/device_plugin/device-plugin.sock"}, 40) = -1 ENOENT (No such file or directory) E0207 14:34:49.961324 29568 endpoint.go:60] Can't create new endpoint with path /tmp/device_plugin/device-plugin.sock err failed to dial device plugin: context deadline exceeded E0207 14:34:49.961390 29568 manager.go:340] Failed to dial device plugin with request &RegisterRequest{Version:v1alpha2,Endpoint:device-plugin.sock,ResourceName:fake-domain/resource,}: failed to dial device plugin: context deadline exceeded panic: test timed out after 2m0s It's not entirely certain which code was to blame for this unlinkat() calls (perhaps some cleanup code from a previous test running in a goroutine?) but this no longer happened after switching to per-test socket directories.	2018-02-09 14:01:13 +01:00
Dmitry Rozhkov	3175a687a0	devicemanager: increase code coverage of endpoint's unit test Particularly cover the code path when an unhealthy device becomes healthy.	2018-02-07 12:29:48 +02:00
Lee Verberne	e10042d22f	Increment CRI version from v1alpha1 to v1alpha2 This also incorporates the version string into the package name so that incompatible versions will fail to connect. Arbitrary choices: - The proto3 package name is runtime.v1alpha2. The proto compiler normally translates this to a go package of "runtime_v1alpha2", but I renamed it to "v1alpha2" for consistency with existing packages. - kubelet/apis/cri is used as "internalapi". I left it alone and put the public "runtimeapi" in kubelet/apis/cri/runtime.	2018-02-07 09:06:26 +01:00
Kubernetes Submit Queue	056e9ecc43	Merge pull request #58941 from vikaschoudhary16/test-allocate Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Add unit test for endpoint allocate What this PR does / why we need it: Adds a unit test for covering `allocate` function at endpoint. Release note: ```release-note None ``` /kind testing /area hw-accelerators /cc @jiayingz @vishh @derekwaynecarr @RenaudWasTaken @resouer @ConnorDoyle	2018-02-06 17:19:41 -08:00
Kubernetes Submit Queue	c02b784b76	Merge pull request #58172 from NVIDIA/annotations Automatic merge from submit-queue (batch tested with PRs 58184, 59307, 58172). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Add annotations to the device plugin API What this PR does / why we need it: Which issue(s) this PR fixes : Related to #56649 but does not fix it This adds the ability for the device plugins to annotate containers. Product wise, this allows the NVIDIA device plugin to support CRI-O (which allows hooks through container annotations). Special notes for your reviewer: /area hw-accelerators /cc @vishh @jiayingz @vikaschoudhary16 I'm wondering if it would make sense to fire a blank call to `newContainerAnnotations` at the start of the deviceplugin to get Annotations that are forbidden. Current behavior is that any Annotations that conflicts with Kubelet will be overwritten by Kubelet. Release note: ```release-note NONE ```	2018-02-05 13:50:35 -08:00
Derek Carr	4afc0c8052	kubelet ignores hugepages if hugetlb is not enabled	2018-02-05 13:07:59 -05:00
vikaschoudhary16	abfb99645b	Add unit test for endpoint allocate	2018-02-05 00:53:07 -05:00
Renaud Gaubert	db537e5954	Add Annotations from the deviceplugin to the runtime	2018-02-03 19:53:20 +01:00
Kubernetes Submit Queue	c817765b0e	Merge pull request #58445 from hanxiaoshuai/typo Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. fix some typos in comments What this PR does / why we need it: Fixes # fix some typos in comments	2018-01-30 19:44:44 -08:00
YuxiJin-tobeyjin	af6b4e39c2	codeClean-merge-logfAndFailnow-to-fatalf	2018-01-31 11:39:31 +08:00
Kubernetes Submit Queue	bf111161b7	Merge pull request #57973 from dims/set-pids-limit-at-pod-level Automatic merge from submit-queue (batch tested with PRs 57973, 57990). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Set pids limit at pod level What this PR does / why we need it: Add a new Alpha Feature to set a maximum number of pids per Pod. This is to allow the use case where cluster administrators wish to limit the pids consumed per pod (example when running a CI system). By default, we do not set any maximum limit, If an administrator wants to enable this, they should enable `SupportPodPidsLimit=true` in the `--feature-gates=` parameter to kubelet and specify the limit using the `--pod-max-pids` parameter. The limit set is the total count of all processes running in all containers in the pod. Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes #43783 Special notes for your reviewer: Release note: ```release-note New alpha feature to limit the number of processes running in a pod. Cluster administrators will be able to place limits by using the new kubelet command line parameter --pod-max-pids. Note that since this is a alpha feature they will need to enable the "SupportPodPidsLimit" feature. ```	2018-01-25 18:29:31 -08:00
Connor Doyle	e5667cf426	Rename package deviceplugin => devicemanager.	2018-01-24 22:32:43 -08:00
Kubernetes Submit Queue	62616d79ad	Merge pull request #58053 from tianshapjq/nit-errUnsupportedVersion Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. typo of errUnsuportedVersion What this PR does / why we need it: typo of errUnsuportedVersion in pkg/kubelet/cm/deviceplugin/types.go Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes # Special notes for your reviewer: Release note: ```release-note ```NONE	2018-01-19 03:26:34 -08:00
Kubernetes Submit Queue	44d0ba29d3	Merge pull request #56960 from islinwb/remove_unused_code_ut_pkg Automatic merge from submit-queue (batch tested with PRs 53631, 56960). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Remove unused code in UT files in pkg/ What this PR does / why we need it: Remove unused code in UT files in pkg/ . Release note: ```release-note NONE ```	2018-01-18 02:41:29 -08:00
hangaoshuai	005f8c4926	fix some typos in comments	2018-01-18 17:07:51 +08:00
vikaschoudhary16	9c847fc4d6	Call Dial in blocking mode	2018-01-16 10:50:17 -05:00
linweibin	fa8afc1d39	Remove unused code in UT files in pkg/	2018-01-15 16:02:35 +08:00
Kubernetes Submit Queue	9007df35b9	Merge pull request #55921 from ScorpioCPH/fix-endpoint-ut Automatic merge from submit-queue (batch tested with PRs 58216, 58193, 53033, 58219, 55921). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Fix device plugin endpoint UT What this PR does / why we need it: Fix some issues in device plugin endpoint UT. Which issue(s) this PR fixes: Fixes #55920 Special notes for your reviewer: @jiayingz @RenaudWasTaken @lichuqiang PTAL. /sig node Release note: ```release-note None ```	2018-01-13 03:34:57 -08:00
Kubernetes Submit Queue	f2e46a2147	Merge pull request #57266 from vikaschoudhary16/unhealthy_device Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Handle Unhealthy devices Update node capacity with sum of both healthy and unhealthy devices. Node allocatable reflect only healthy devices. What this PR does / why we need it: Currently node capacity only reflects healthy devices. Unhealthy devices are ignored totally while updating node status. This PR handles unhealthy devices while updating node status. Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes #57241 Special notes for your reviewer: Release note: <!-- Write your release note: Handle Unhealthy devices ```release-note Handle Unhealthy devices ``` /cc @tengqm @ConnorDoyle @jiayingz @vishh @jeremyeder @sjenning @resouer @ScorpioCPH @lichuqiang @RenaudWasTaken @balajismaniam /sig node	2018-01-12 19:55:54 -08:00
Penghao Cen	b96c383ef7	Check grpc server ready properly	2018-01-13 05:47:49 +08:00
Penghao Cen	90bc1265cf	Fix endpoint not work issue	2018-01-12 20:09:07 +08:00
Davanum Srinivas	ecd6361ff0	Set pids limit at pod level Add a new Alpha Feature to set a maximum number of pids per Pod. This is to allow the use case where cluster administrators wish to limit the pids consumed per pod (for example, when running a CI system). By default, we do not set any maximum limit. If an administrator wants to enable this, they should set `SupportPodPidsLimit=true` in the `--feature-gates=` parameter to kubelet and specify the limit using the `--pod-max-pids` parameter. The limit set is the total count of all processes running in all containers in the pod.	2018-01-11 21:22:38 -05:00
Penghao Cen	671c4eb2b7	Add e2e test logic for device plugin	2018-01-11 14:41:45 +08:00
Penghao Cen	dc5384a139	Don't rewrite device health	2018-01-11 14:18:13 +08:00
tianshapjq	e8005face7	typo of errUnsuportedVersion	2018-01-10 15:47:11 +08:00
vikaschoudhary16	e9cf3f1ac4	Handle Unhealthy devices Update node capacity with sum of both healthy and unhealthy devices. Node allocatable reflects only healthy devices.	2018-01-09 11:38:48 -05:00
Jonathan Basseri	85c5862552	Fix scheduler refs in BUILD files. Update references to moved scheduler code.	2018-01-05 15:05:01 -08:00
Jonathan Basseri	30b89d830b	Move scheduler code out of plugin directory. This moves plugin/pkg/scheduler to pkg/scheduler and plugin/cmd/kube-scheduler to cmd/kube-scheduler. Bulk of the work was done with gomvpkg, except for kube-scheduler main package.	2018-01-05 15:05:01 -08:00
Kubernetes Submit Queue	4d215fd235	Merge pull request #56611 from tianshapjq/testcase-cgroup_manager_linux.go Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. new testcase to cgroup_manager_linux.go a new test case to adaptName(), for testing "cgroupManagerType != libcontainerSystemd"	2017-12-28 11:11:47 -08:00
Kubernetes Submit Queue	a4eb2f96d0	Merge pull request #57610 from vikaschoudhary16/remove-redundant-sleep Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Remove redundant sleep from ReRegistration unit test case /kind cleanup /sig node What this PR does / why we need it: Once upon a time, there was a race in the device plugin registration logic. At that time, [list()](`5cac9fc984/pkg/kubelet/deviceplugin/manager.go (L206)`) and [listAndWatch()](`5cac9fc984/pkg/kubelet/deviceplugin/manager.go (L224)`) used to be separate functions. Race was there for taking manager.mutex lock from two places. [One, from within the m.addEndpoint()](`5cac9fc984/pkg/kubelet/deviceplugin/manager.go (L214)`) and the [second, from within m.Devices()](`5cac9fc984/pkg/kubelet/deviceplugin/manager.go (L137)`). This race was making `TestDevicePluginReRegistration` flaky as explained below. ``` 1. p1.Register(socketName, testResourceName) 2. // Wait for the first callback to be issued. 3. <-callbackChan 4. devices := m.Devices() ``` * L#1 leads to eventually asynchronous invocation of m.addEndpoint(), let say thread1. * L#3 holds the test case execution till the [callback gets invoked](`5cac9fc984/pkg/kubelet/deviceplugin/endpoint.go (L108)`). This means test case execution waits on channel till the thread1 reaches the point where [e.list() call completes in the addEndpoint.](`5cac9fc984/pkg/kubelet/deviceplugin/manager.go (L206)`) * L#4 triggers a new thread. thread1 and this new thread are both racing for m.mutex.Lock(). Former, in the addEndpoint() and later one in the m.Devices(). If m.Devices wins the race, result is the test case failure because endpoint gets added in the manager only after taking mutex.Lock() in the addEndpoint(). To deal with this flake, we added `Sleep` between L#3 and L#4. `Sleep` was getting some extra time to addEndpoint() and thus making thread1 win the race each time. Above explained race scenario got fixed and merged sometime back in this PR: [Deviceplugin refactoring: merge func list and listwatch in endpoint into one](https://github.com/kubernetes/kubernetes/pull/52149) With the above PR, callback function is invoked from e.run() which makes sure that test case waits on channel till the endpoint is added and devices are updated Above explained race scenario does not exist now, therefore removing redundant sleeps from the test case. Tested: go test -race -count 500 k8s.io/kubernetes/pkg/kubelet/cm/deviceplugin -run TestDevicePluginReRegistration -timeout 5h Related #52616 #56026 Special notes for your reviewer: Release note: ```release-note None ``` /cc @vishh @derekwaynecarr @jiayingz @RenaudWasTaken @lichuqiang @ScorpioCPH @tengqm @mindprince @ConnorDoyle @jeremyeder	2017-12-27 14:53:21 -08:00
vikaschoudhary16	5d10dcd983	Remove redundant sleep from ReRegistration unit test case	2017-12-27 03:02:21 -05:00

309 Commits