Commit Graph

66 Commits

Author SHA1 Message Date
Abdullah Gharaibeh
d6522e0e74 rename framework pkg with schedulerframework for all instances under pkg/kubelet 2020-04-14 14:24:07 -04:00
Abdullah Gharaibeh
bed9b2f23b Cleanup obsolete NodeInfo methods 2020-04-12 18:13:46 -04:00
nolancon
cb9fdc49db Device Manager - Refactor allocatePodResources
- allocatePodResources logic altered to allow for container by container
device allocation.
- New type PodReusableDevices
- New field in devicemanager devicesToReuse
2020-02-27 07:24:34 +00:00
Kevin Klues
0b168f0243 Change devicemanager to implement HintProvider.Allocate()
This change will not work on its own. Higher level code needs to make
sure and call Allocate() before AddContainer is called. This is already
being done in cases when the TopologyManager feature gate is enabled (in
the PodAdmitHandler of the TopologyManager). However, we need to make
sure we add proper logic to call it in cases when the TopologyManager
feature gate is disabled.
2020-02-10 03:27:47 +00:00
Kevin Klues
a3f099ea4d Split devicemanager Allocate into two functions
Instead of having a single call for Allocate(), we now split this into two
functions Allocate() and UpdatePluginResources().

The semantics split across them:

// Allocate configures and assigns devices to a pod. From the requested
// device resources, Allocate will communicate with the owning device
// plugin to allow setup procedures to take place, and for the device
// plugin to provide runtime settings to use the device (environment
// variables, mount points and device files).
Allocate(pod *v1.Pod) error

// UpdatePluginResources updates node resources based on devices already
// allocated to pods. The node object is provided for the device manager to
// update the node capacity to reflect the currently available devices.
UpdatePluginResources(
    node *schedulernodeinfo.NodeInfo,
    attrs *lifecycle.PodAdmitAttributes) error

As we move to a model in which the TopologyManager is able to ensure
aligned allocations from the CPUManager, devicemanger, and any
other TopologManager HintProviders in the same synchronous loop, we will
need to be able to call Allocate() independently from an
UpdatePluginResources(). This commit makes that possible.
2020-02-10 03:27:47 +00:00
Takeaki Matsumoto
785fac6826 Make updateAllocatedDevices() as a public method and call it in
podresources api
2020-02-07 13:26:56 +09:00
danielqsj
1a9b121764 remove deprecated metrics of kubelet 2020-01-10 16:46:52 +08:00
Kubernetes Prow Robot
ad5d4c4705
Merge pull request #85706 from yutedz/per-node-dev
Remove nodes slice in loop of takeByTopology
2019-12-05 13:50:30 -08:00
Kubernetes Prow Robot
57b6b287d4
Merge pull request #85688 from yutedz/pods-to-rm
Reduce unnecessary Set in updateAllocatedDevices
2019-12-02 17:07:26 -08:00
Ted Yu
84a9803741 Log error when writing checkpoint fails 2019-11-29 19:47:17 -08:00
Ted Yu
6415fa765e Remove nodes slice in loop of takeByTopology 2019-11-29 12:12:22 -08:00
Ted Yu
86f3bc25e1 Reduce unnecessary Set in updateAllocatedDevices 2019-11-27 08:48:06 -08:00
Kubernetes Prow Robot
30e6238795
Merge pull request #85147 from yutedz/devmgr-rm-contents
Continue removing file in ManagerImpl#removeContents
2019-11-14 16:38:28 -08:00
Ted Yu
fb046f7787 Continue removing file in ManagerImpl#removeContents 2019-11-13 06:00:34 -08:00
Kubernetes Prow Robot
ed10b5b17f
Merge pull request #85047 from yutedz/dev-mgr-err-handling
Handle error return from allocatePodResources
2019-11-12 11:51:27 -08:00
Kubernetes Prow Robot
897ce3073c
Merge pull request #84533 from davidz627/fix/deprecatedPath
Remove plugin watching of deprecated directory and CSI v0 support in accordance with deprecation policy
2019-11-12 04:48:20 -08:00
David Zhu
802fe12803 Remove plugin watching of deprecated directory {kubelet_root_dir}/plugins and support for CSI V0 in accordance with deprecation announcement in https://v1-13.docs.kubernetes.io/docs/setup/release/notes/ 2019-11-11 11:42:58 -08:00
Ted Yu
db0f616974 Handle error return from allocatePodResources 2019-11-09 16:25:15 -08:00
yanghaichao12
5cbafba457 change directory permissions from 0755 to 0750 2019-11-04 17:04:37 +08:00
Kubernetes Prow Robot
3db6d3abcf
Merge pull request #83551 from dims/move-external-facing-kubelet-apis-to-staging
Move external facing kubelet apis to staging
2019-10-10 13:41:36 -07:00
Davanum Srinivas
f29d2272c8
fix gofmt and golint failures
Change-Id: I6535b506f50558b31663a13cd270b15023afa2c6
2019-10-06 18:43:17 -04:00
Davanum Srinivas
d30c489c54
Move pkg/kubelet/pluginregistration and deviceplugin
Change-Id: I06adcb43bd278b430ffad2010869e1524c8cc4ff
2019-10-06 15:28:38 -04:00
tanjunchen
de3cf23414 remove the repeat word in documents 2019-10-06 23:32:01 +08:00
Connor Doyle
e35301c19f Rename package socketmask to bitmask.
- As discussed in reviews and other public channels,
  this abstraction is used to represent numa nodes, not sockets.
- There is nothing inherently related to sockets in this package anyway.
2019-09-23 17:08:45 -07:00
Kevin Klues
cc567afaf0 Consume TopologyHints in the devicemanager 2019-08-29 08:22:50 -05:00
Louise Daly
9a118ceac4 Added stub support for Topology Manager to Device Manager
Co-authored-by: Conor Nolan <conor.nolan@intel.com>
Co-authored-by: Sreemanti Ghosh <sreemanti.ghosh@intel.com>
Co-authored-by: Kevin Klues <kklues@nvidia.com>
2019-08-29 07:45:43 -05:00
Tim Allclair
8a495cb5e4 Clean up error messages (ST1005) 2019-08-21 10:40:21 -07:00
Tara Gu
5e18554442 Implement plugin manager - a controller that manages plugin registration/unregistration 2019-05-30 19:00:59 -04:00
Richard Chen
c9f1b57b5b Reset extended resources only when node is recreated. 2019-05-21 14:16:54 -07:00
Kubernetes Prow Robot
e476a60ccb
Merge pull request #73241 from vikaschoudhary16/selinux-label
Add correct selinux label at plugin socket directory
2019-05-20 11:07:17 -07:00
vikaschoudhary16
58d1b4d564 Add correct selinux label at plugin socket directory 2019-05-18 12:35:17 +05:30
danielqsj
79a3eb816c rename latency to duration in metrics 2019-02-18 17:40:04 +08:00
danielqsj
9fd99a48f5 Change kubelet metrics to conform guideline 2019-02-18 14:01:58 +08:00
Jiaying Zhang
00b88c14b0 Checks whether we have cached runtime state before starting a container
that requests any device plugin resource. If not, re-issue Allocate
grpc calls. This allows us to handle the edge case that a pod got
assigned to a node even before it populates its extended resource
capacity.
2019-02-07 11:12:36 -08:00
Kubernetes Prow Robot
03b434c9d4
Merge pull request #58122 from tianshapjq/nit-int-is-enough
Len() is already int
2019-02-03 12:02:24 -08:00
Kubernetes Prow Robot
d88994cf9f
Merge pull request #71306 from ping035627/k8s-181121
fix some typos
2019-01-09 09:06:31 -08:00
yuexiao-wang
f3353c358d [scheduler cleanup phase 2]: Rename to
Signed-off-by: yuexiao-wang <wang.yuexiao@zte.com.cn>
2018-12-11 11:21:12 +08:00
saad-ali
a7c5582bba Permit use of deprecated dir in device plugin. 2018-11-21 18:37:31 -08:00
saad-ali
8f666d9e41 Modify kubelet watcher to support old versions
Modify kubelet plugin watcher to support older CSI drivers that use an
the old plugins directory for socket registration.
Also modify CSI plugin registration to support multiple versions of CSI
registering with the same name.
2018-11-21 18:37:31 -08:00
PingWang
9d541911bb fix some typos
Signed-off-by: PingWang <wang.ping5@zte.com.cn>

fix typo

Signed-off-by: PingWang <wang.ping5@zte.com.cn>
2018-11-22 08:27:14 +08:00
David Ashpole
630cb53f82 add kubelet grpc server for pod-resources service 2018-11-15 09:43:20 -08:00
Davanum Srinivas
954996e231
Move from glog to klog
- Move from the old github.com/golang/glog to k8s.io/klog
- klog as explicit InitFlags() so we add them as necessary
- we update the other repositories that we vendor that made a similar
change from glog to klog
  * github.com/kubernetes/repo-infra
  * k8s.io/gengo/
  * k8s.io/kube-openapi/
  * github.com/google/cadvisor
- Entirely remove all references to glog
- Fix some tests by explicit InitFlags in their init() methods

Change-Id: I92db545ff36fcec83afe98f550c9e630098b3135
2018-11-10 07:50:31 -05:00
Renaud Gaubert
8dd1d27c03 Updated the device manager pluginwatcher handler 2018-09-06 15:34:46 +02:00
Kubernetes Submit Queue
d017bebf6b
Merge pull request #67145 from jiayingz/reboot-fix
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Fail container start if its requested device plugin resource is unknown.

With the change, Kubelet device manager now checks whether it has cached option state for the requested device plugin resource to make sure the resource is in ready state when we start the container.



**What this PR does / why we need it**:

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes https://github.com/kubernetes/kubernetes/issues/67107

**Special notes for your reviewer**:

**Release note**:

```release-note
Fail container start if its requested device plugin resource hasn't registered after Kubelet restart.
```
2018-08-21 01:48:54 -07:00
tianshapjq
81081dc9e7 nits in manager.go 2018-08-15 08:16:04 +08:00
Jiaying Zhang
7b1ae66432 Fail container start if its requested device plugin resource doesn't
have cached option state to make sure the device plugin resource is
in ready state when we start the container.
2018-08-08 13:11:36 -07:00
hui luo
7101c17498 While reviewing devicemanager code, found
the caching layer on endpoint is redundant.

Here are the 3 related objects in picture:
devicemanager <-> endpoint <-> plugin

Plugin is the source of truth for devices
and device health status.

devicemanager maintain healthyDevices,
unhealthyDevices, allocatedDevices based on updates
from plugin.

So there is no point for endpoint caching devices,
this patch is removing this caching layer on endpoint,

Also removing the Manager.Devices() since i didn't
find any caller of this other than test, i am adding a
notification channel to facilitate testing,

If we need to get all devices from manager in future,
it just need to return healthyDevices + unhealthyDevices,
we don't have to call endpoint after all.

This patch makes code more readable, data model been simplified.
2018-07-29 21:07:14 -07:00
Kubernetes Submit Queue
32e38b6659
Merge pull request #58755 from vikaschoudhary16/probing-mode
Automatic merge from submit-queue (batch tested with PRs 58755, 66414). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Use probe based plugin watcher mechanism in Device Manager

**What this PR does / why we need it**:
Uses this probe based utility in the device plugin manager.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #56944 

**Notes For Reviewers**:
Changes are backward compatible and existing device plugins will continue to work. At the same time, any new plugins that has required support for probing model (Identity service implementation), will also work. 


**Release note**
```release-note
Add support kubelet plugin watcher in device manager.
```
/sig node
/area hw-accelerators
/cc /cc @jiayingz @RenaudWasTaken @vishh @ScorpioCPH @sjenning @derekwaynecarr @jeremyeder @lichuqiang @tengqm @saad-ali @chakri-nelluri @ConnorDoyle
2018-07-27 15:20:06 -07:00
bingshen.wbs
b1bdd043c4 fix kubelet npe on device plugin return zero container
Signed-off-by: bingshen.wbs <bingshen.wbs@alibaba-inc.com>
2018-07-25 10:15:30 +08:00
vikaschoudhary16
a5842503eb Use probe based plugin discovery mechanism in device manager 2018-07-17 04:02:31 -04:00