Commit Graph

91 Commits

Author SHA1 Message Date
ZhangKe10140699
186ddce07b Fix problem in updating VolumeAttached in node status 2022-08-02 19:01:57 +08:00
Jan Safranek
3b94ac228a Don't force detach volume from healthy nodes
6 minute force-deatch timeout should be used only for nodes that are not
healthy. 

In case a CSI driver is being upgraded or it's simply slow, NodeUnstage
can take more than 6 minutes. In that case, Pod is already deleted from the
API server and thus A/D controller will force-detach a mounted volume,
possibly corrupting the volume and breaking CSI - a CSI driver expects
NodeUnstage to succeed before Kubernetes can call ControllerUnpublish.
2022-06-24 12:51:41 +02:00
Ashutosh Kumar
c00975370a
Handle Non-graceful Node Shutdown (#108486)
Signed-off-by: Ashutosh Kumar <sonasingh46@gmail.com>

Co-authored-by: Ashutosh Kumar <sonasingh46@gmail.com>

Co-authored-by: xing-yang <xingyang105@gmail.com>
2022-03-26 09:23:21 -07:00
Kubernetes Prow Robot
66daef4aa7
Merge pull request #108167 from jfremy/fix-107973
Fix nodes volumesAttached status not being updated
2022-03-01 12:49:54 -08:00
Kubernetes Prow Robot
06e107081e
Merge pull request #104732 from mengjiao-liu/remove-flag-experimental-check-node-capabilities-before-mount
kubelet: Remove the deprecated flag `--experimental-check-node-capabilities-before-mount`
2022-02-24 07:56:30 -08:00
Jean-Francois Remy
e83184568d Add unit tests
- actual_state_of_world_test.go: test the new method GetVolumesToReportAttachedForNode
  for an existing node and a non-existing node
- node_status_updater_test.go: test UpdateNodeStatuses and UpdateNodeStatuses in nominal
  case with 2 nodes getting one volume each. Test UpdateNodeStatuses with the first call
  to node.patch failing but the following one succeeding
- add comment in node_status_updater.go
- fix log line in reconciler.go
- rename variable in actual_state_of_world.go
2022-02-22 12:21:58 -08:00
Jean-Francois Remy
f1717baaaa Fix nodes volumesAttached status not updated
The UpdateNodeStatuses code stops too early in case there is
an error when calling updateNodeStatus. It will return immediately
which means any remaining node won't have its update status put back
to true.

Looking at the call sites for UpdateNodeStatuses, it appears this is
not the only issue. If the lister call fails with anything but a Not Found
error, it's silently ignored which is wrong in the detach path.
Also the reconciler detach path calls UpdateNodeStatuses but the real intent
is to only update the node currently processed in the loop and not proceed
with the detach call if there is an error updating that specifi node volumesAttached
property. With the current implementation, it will not proceed if there is
an error updating another node (which is not completely bad but not ideal) and
worse it will proceed if there is a lister error on that node which means the
node volumesAttached property won't have been updated.

To fix those issues, introduce the following changes:
- [node_status_updater] introduce UpdateNodeStatusForNode which does what
  UpdateNodeStatuses does but only for the provided node
- [node_status_updater] if the node lister call fails for anything but a Not
  Found error, we will return an error, not ignore it
- [node_status_updater] if the update of a node volumesAttached properties fails
  we continue processing the other nodes
- [actual_state_of_world] introduce GetVolumesToReportAttachedForNode which
  does what GetVolumesToReportAttached but for the node whose name is provided
  it returns a bool which indicates if the node in question needs an update as
  well as the volumesAttached list. It is used by UpdateNodeStatusForNode
- [actual_state_of_world] use write lock in updateNodeStatusUpdateNeeded, we're
  modifying the map content
- [reconciler] use UpdateNodeStatusForNode in the detach loop
2022-02-22 12:20:53 -08:00
Patrick Ohly
9eaa2dc554 avoid klog Info calls without verbosity
In the following code pattern, the log message will get logged with v=0 in JSON
output although conceptually it has a higher verbosity:

   if klog.V(5).Enabled() {
       klog.Info("hello world")
   }

Having the actual verbosity in the JSON output is relevant, for example for
filtering out only the important info messages. The solution is to use
klog.V(5).Info or something similar.

Whether the outer if is necessary at all depends on how complex the parameters
are. The return value of klog.V can be captured in a variable and be used
multiple times to avoid the overhead for that function call and to avoid
repeating the verbosity level.
2022-01-12 07:48:36 +01:00
Mengjiao Liu
beda4cafb6 kubelet: Remove the deprecated flag --experimental-check-node-capabilities-before-mount 2022-01-06 11:47:11 +08:00
Jing Xu
69b9f9b1f0 Fix issue in node status updating VolumeAttached list
During volume detach, the following might happen in reconciler

1. Pod is deleting
2. remove volume from reportedAsAttached, so node status updater will
update volumeAttached list
3. detach failed due to some issue
4. volume is added back in reportedAsAttached
5. reconciler loops again the volume, remove volume from
reportedAsAttached
6. detach will not be trigged because exponential back off, detach call
will fail with exponential backoff error
7. another pod is added which using the same volume on the same node
8. reconciler loops and it will NOT try to tigger detach anymore

At this point, volume is still attached and in actual state, but
volumeAttached list in node status does not has this volume anymore, and
will block volume mount from kubelet.

The fix in first round is to add volume back into the volume list that
need to reported as attached at step 6 when detach call failed with
error (exponentical backoff). However this might has some performance
issue if detach fail for a while. During this time, volume will be keep
removing/adding back to node status which will cause a surge of API
calls.

So we changed to logic to check first whether operation is safe to retry which
means no pending operation or it is not in exponentical backoff time
period before calling detach. This way we can avoid keep removing/adding
volume from node status.

Change-Id: I5d4e760c880d72937d34b9d3e904ecad125f802e
2021-10-05 09:44:35 -07:00
Benjamin Elder
56e092e382 hack/update-bazel.sh 2021-02-28 15:17:29 -08:00
caodonghui
f435e24403 Remove deadcode 2021-02-23 17:58:47 +08:00
Kubernetes Prow Robot
8bf42039e6
Merge pull request #96552 from pandaamanda/klog_fmt
use klog.Info and klog.Warning when had no format
2021-01-15 17:57:43 -08:00
xiongzhongliang
90f4aeeea4 use klog.Info and klog.Warning when had no format 2020-11-14 00:55:06 +08:00
Cheng Xing
d9a629fe3a IsVolumeAttachedToNode() renamed to GetAttachState(), and returns 3 states instead of combining "uncertain" and "detached" into "false" 2020-10-29 13:24:51 -07:00
Cheng Xing
a61743b125 Fixes Attach Detach Controller reconciler race reading ActualStateOfWorld and operation pending states; fixes reconciler_test mock detach to account for multiple attaches on a node 2020-10-27 23:51:55 -07:00
Davanum Srinivas
07d88617e5
Run hack/update-vendor.sh
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2020-05-16 07:54:33 -04:00
Davanum Srinivas
442a69c3bd
switch over k/k to use klog v2
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2020-05-16 07:54:27 -04:00
Kubernetes Prow Robot
ef672c1c2d
Merge pull request #88678 from verult/slow-rxm-attach
Parallelize attach operations across different nodes for volumes that allow multi-attach
2020-03-06 13:17:21 -08:00
Cheng Xing
ef3d66b98b Parallelize attach operations across different nodes for volumes that allow multi-attach 2020-03-05 22:22:05 -08:00
taesun_lee
79680b5d9b Fix pkg/controller typos in some error messages, comments etc
- applied review results by LuisSanchez
- Co-Authored-By: Luis Sanchez <sanchezl@redhat.com>

genernal -> general
iniital -> initial
initalObjects -> initialObjects
intentionaly -> intentionally
inforer -> informer
anotother -> another
triger -> trigger
mutli -> multi
Verifyies -> Verifies
valume -> volume
unexpect -> unexpected
unfulfiled -> unfulfilled
implenets -> implements
assignement -> assignment
expectataions -> expectations
nexpected -> unexpected
boundSatsified -> boundSatisfied
externel -> external
calcuates -> calculates
workes -> workers
unitialized -> uninitialized
afater -> after
Espected -> Expected
nodeMontiorGracePeriod -> NodeMonitorGracePeriod
estimateGrracefulTermination -> estimateGracefulTermination
secondrary -> secondary
ShouldRunDaemonPodOnUnscheduableNode -> ShouldRunDaemonPodOnUnschedulableNode
rrror -> error
expectatitons -> expectations
foud -> found
epackage -> package
succesfulJobs -> successfulJobs
namesapce -> namespace
ConfigMapResynce -> ConfigMapResync
2020-02-27 00:15:33 +09:00
Jordan Liggitt
cd1059e3c4 Revert "Merge pull request #87258 from verult/slow-rxm-attach"
This reverts commit 15c3f1b119, reversing
changes made to 52d7614a8c.
2020-01-29 14:58:32 -05:00
Cheng Xing
c6a03fa5be Parallelize attach operations across different nodes for volumes that allow multi-attach 2020-01-27 15:02:25 -08:00
yuxiaobo
81e9f21f83 Correct spelling mistakes
Signed-off-by: yuxiaobo <yuxiaobogo@163.com>
2019-11-06 20:25:19 +08:00
danielqsj
657a1a1a34 change import alias of utils/strings 2019-01-30 10:44:09 +08:00
danielqsj
093328e57f migrate to k8s.io/utils/strings 2019-01-30 10:24:00 +08:00
Jing Xu
7bac6ca73a Address comments
This commit addressed the comment and also add a unit test.
2019-01-11 10:57:37 -08:00
Jing Xu
562d0fea53 Handle failed attach operation leave uncertain volume attach state
This commit adds the unit tests for the PR. It also includes some files
that are affected by the function name changes.
2018-11-19 17:21:49 -08:00
Davanum Srinivas
954996e231
Move from glog to klog
- Move from the old github.com/golang/glog to k8s.io/klog
- klog as explicit InitFlags() so we add them as necessary
- we update the other repositories that we vendor that made a similar
change from glog to klog
  * github.com/kubernetes/repo-infra
  * k8s.io/gengo/
  * k8s.io/kube-openapi/
  * github.com/google/cadvisor
- Entirely remove all references to glog
- Fix some tests by explicit InitFlags in their init() methods

Change-Id: I92db545ff36fcec83afe98f550c9e630098b3135
2018-11-10 07:50:31 -05:00
nikopen
6f2a45aefe Fix VMWare VM freezing bug by reverting #51066 2018-08-24 14:28:44 +02:00
Fabio Bertinatto
4ce2058ef6 Add more metrics for A/D Controller:
* Number of Volumes in ActualStateofWorld and DesiredStateofWorld
* Numer of times A/D Controller performs force detach
2018-08-15 10:01:57 +02:00
Jeff Grafton
23ceebac22 Run hack/update-bazel.sh 2018-06-22 16:22:57 -07:00
Michal Fojtik
97f546d249
volume: decrease memory allocations for debugging messages 2018-06-11 13:52:38 +02:00
Jeff Grafton
ef56a8d6bb Autogenerated: hack/update-bazel.sh 2018-02-16 13:43:01 -08:00
Di Xu
48388fec7e fix all the typos across the project 2018-02-11 11:04:14 +08:00
Kubernetes Submit Queue
7de1a8e0f5
Merge pull request #56288 from jsafrane/multiattach-pods
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Add list of pods that use a volume to multiattach events

So users knows what pods are blocking a volume and can realize their error.

**Release note**:

```release-note
NONE
```

UX:
* User can get one of following events, depending what other pod(s) are already using a volume and in which namespace they are:
```
Multi-Attach error for volume"volume-name" Volume is already exclusively attached to one node and can't be attached to another
Multi-Attach error for volume "volume-name" Volume is already used by pod(s) pod3 and 1 pod(s) in different namespaces
```

* controller-manager gets always full logs:
  * When the node where is the volume attached is known:
        ```
        Multi-Attach error for volume "volume-name" (UniqueName: "fake-plugin/volume-name") from node "node1" Volume is already used by pods ns2/pod2, ns1/pod3 on node node2, node3
        ```

  * When the node where is the volume attached is not known:
        ```
        Multi-Attach error for volume "volume-name" (UniqueName: "fake-plugin/volume-name") from node "node1" Volume is already exclusively attached to node node2 and can't be attached to another
        ```

/kind bug
/sig storage
/assign @gnufied
2018-01-25 05:31:34 -08:00
Jan Safranek
e46c886bf3 Add list of pods that use a volume to multiattach events
So users knows what pods are blocking a volume and can realize their error.
2018-01-24 13:22:03 +01:00
Jeff Grafton
efee0704c6 Autogenerate BUILD files 2017-12-23 13:12:11 -08:00
mtanino
8903e8cd85 BlockVolumesSupport: CRI, VolumeManager and OperationExecutor changes
This patch contains following changes.
- container runtime changes for adding block devices
- volumemanager changes
- operationexecutor changes
2017-11-20 14:10:26 -05:00
Jeff Grafton
aee5f457db update BUILD files 2017-10-15 18:18:13 -07:00
Hemant Kumar
68d417d7d8 Fix possibly flake in multiattach unit test
It is possible that by the time we check for multiattach
error on node, the reconciler loop may not have processed second
volume and hence we are going to retry for multiattach error
on node before giving up and marking the test as failed.
2017-10-12 16:27:54 -04:00
Hemant Kumar
67d4c40849 Fix spam of multiattach errors in event logs
We should be careful while generating multiattach errors.
We seem to be generating too many of them because old code
had minor bug.
2017-10-03 15:45:06 -04:00
Hemant Kumar
8edae9b3fc Always populate volume status from node 2017-09-12 09:03:42 -04:00
Balu Dontu
cfdff1ae46 Multi-Attach volume fix for vSphere 2017-08-21 18:06:29 -07:00
Jeff Grafton
a7f49c906d Use buildozer to delete licenses() rules except under third_party/ 2017-08-11 09:32:39 -07:00
Jeff Grafton
33276f06be Use buildozer to remove deprecated automanaged tags 2017-08-11 09:31:50 -07:00
Jacob Simpson
29c1b81d4c Scripted migration from clientset_generated to client-go. 2017-07-17 15:05:37 -07:00
Alexander Block
61275ad8d4 Fix flaky test Test_Run_OneVolumeAttachAndDetachMultipleNodesWithReadWriteMany
Only relying on the NewAttacher/Detacher call counts is not enough as they
happen in parallel to the testing/verification code and thus the actual
attaching/detaching may not be done yet, resulting in flaky test results.

Fixes #46244
2017-07-11 18:21:50 +02:00
Kubernetes Submit Queue
c662e1d7d8 Merge pull request #46949 from xingzhou/typo
Automatic merge from submit-queue

Fixed a comment typo

Typo fix

Fixed #48414 

**Release note**:
```
None
```
2017-07-03 11:33:36 -07:00
Chao Xu
60604f8818 run hack/update-all 2017-06-22 11:31:03 -07:00