A number of race conditions exist when pods are terminated early in
their lifecycle, because components in the kubelet need to know "no
running containers" or "containers can't be started from now on" but
rely on outdated state.
Only the pod worker knows whether containers are being started for
a given pod, which is exactly what is needed to know when a pod is
"terminated" (no running containers, none coming). Move that
responsibility and the podKiller function into the pod workers, and
route everything that was killing the pod through the UpdatePod loop.
Split syncPod into three phases - setup, terminate containers, and
cleanup pod - and make the transitions between those methods visible
to other components. After this change, to kill a pod you tell the
pod worker to UpdatePod({UpdateType: SyncPodKill, Pod: pod}).
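As a rough, self-contained sketch of that call shape (the types and
fields below are simplified stand-ins, not the kubelet's actual
definitions):
```go
package main

import "fmt"

// Simplified stand-ins for the pod worker's types.
type SyncPodType int

const (
	SyncPodSync SyncPodType = iota
	SyncPodUpdate
	SyncPodKill
)

type Pod struct{ UID, Name string }

type UpdatePodOptions struct {
	UpdateType SyncPodType
	Pod        *Pod
}

type podWorkers struct{}

// UpdatePod is the single entry point for all pod lifecycle changes,
// including termination.
func (w *podWorkers) UpdatePod(opts UpdatePodOptions) {
	if opts.UpdateType == SyncPodKill {
		fmt.Printf("terminating pod %s (%s)\n", opts.Pod.Name, opts.Pod.UID)
	}
}

func main() {
	w := &podWorkers{}
	// Anything that previously called the podKiller now does this:
	w.UpdatePod(UpdatePodOptions{UpdateType: SyncPodKill, Pod: &Pod{UID: "uid-1", Name: "nginx"}})
}
```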
Several places in the kubelet were incorrect about whether they
were handling terminating (should stop running, might still have
containers) or terminated (no running containers) pods. The pod
worker exposes methods that let other loops know when to set up or
tear down resources based on the state of the pod - these methods
remove the possibility of race conditions by ensuring that a single
component is responsible for knowing each pod's allowed state, while
other components simply check, by UID, whether a pod is inside the
relevant window.
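A minimal sketch of what such UID-keyed queries can look like (the
method names below mirror the description but are illustrative, not
the exact pod worker API):
```go
package main

import (
	"fmt"
	"sync"
)

// podState is a simplified stand-in for the pod worker's tracked state.
type podState int

const (
	stateSyncing     podState = iota // containers may still be started
	stateTerminating                 // should stop; may still have containers
	stateTerminated                  // no running containers, none coming
)

// podStates is owned by exactly one component (the pod worker);
// everyone else only queries it by UID.
type podStates struct {
	mu    sync.Mutex
	state map[string]podState
}

// CouldHaveRunningContainers reports whether per-pod resources must
// still be kept around (the setup window).
func (p *podStates) CouldHaveRunningContainers(uid string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.state[uid] != stateTerminated
}

// ShouldContainersBeTerminating reports whether new containers may no
// longer be started (the teardown window).
func (p *podStates) ShouldContainersBeTerminating(uid string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.state[uid] != stateSyncing
}

func main() {
	s := &podStates{state: map[string]podState{"uid-1": stateTerminating}}
	fmt.Println(s.CouldHaveRunningContainers("uid-1"))    // true
	fmt.Println(s.ShouldContainersBeTerminating("uid-1")) // true
}
```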
Container removal no longer blocks final pod deletion in the API
server and is handled as background cleanup. Node shutdown no longer
marks pods as failed, since they can be restarted in the next step.
See https://docs.google.com/document/d/1Pic5TPntdJnYfIpBeZndDelM-AbS4FN9H2GTLFhoJ04/edit# for details
fakeVolumeHost previously implemented both the KubeletVolumeHost and
AttachDetachVolumeHost interfaces. That design made it difficult to
test the CSIAttacher, since it behaves differently depending on which
type of VolumeHost is supplied.
This patch removes pkg/util/mount completely and replaces it with the
mount package now located at k8s.io/utils/mount. The code found at
k8s.io/utils/mount was moved there from pkg/util/mount, so the code is
identical, just no longer in-tree in k/k.
This patch moves fake.go to mount_fake.go and follows the principle
of always returning a concrete type rather than an interface. All
callers of "FakeMounter" are changed to use "NewFakeMounter()"
instead. The FakeMounter "Log" struct member is no longer exported
and is instead accessed through a new "GetLog()" method.
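A quick sketch of the new usage, assuming the k8s.io/utils/mount API
as described above:
```go
package main

import (
	"fmt"

	"k8s.io/utils/mount"
)

func main() {
	// Construction goes through NewFakeMounter(), which returns the
	// concrete *FakeMounter type.
	fake := mount.NewFakeMounter([]mount.MountPoint{})
	_ = fake.Mount("/dev/sdb", "/mnt/data", "ext4", nil)

	// The action log is no longer an exported field; read it via GetLog().
	for _, action := range fake.GetLog() {
		fmt.Printf("%s: %s -> %s\n", action.Action, action.Source, action.Target)
	}
}
```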
The `Get` method of the fake clientset returns an object that would never be returned by the real clientset. Reproducer:
```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

func main() {
	cm := &v1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Namespace: metav1.NamespaceSystem, Name: "cm"},
	}
	client := fake.NewSimpleClientset(cm)
	obj, err := client.CoreV1().ConfigMaps("").Get("", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("obj: %#v\n", obj)
}
```
Stored as `test.go` in the root directory of `github.com/kubernetes/kubernetes` (master HEAD) and ran:
```sh
$ go run test.go
obj: &v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"cm", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}
```
As you can see, a fake clientset seeded with a "cm" configmap is created. When getting the object back through the clientset, I intentionally set the object name to an empty string. I would expect an error saying configmap "" was not found; however, I get the "cm" configmap instead.
The reason lies in the implementation of the private `filterByNamespaceAndName` function [1]:
```go
func filterByNamespaceAndName(objs []runtime.Object, ns, name string) ([]runtime.Object, error) {
	var res []runtime.Object
	for _, obj := range objs {
		acc, err := meta.Accessor(obj)
		if err != nil {
			return nil, err
		}
		if ns != "" && acc.GetNamespace() != ns {
			continue
		}
		if name != "" && acc.GetName() != name {
			continue
		}
		res = append(res, obj)
	}
	return res, nil
}
```
When `name` is empty, the `name != "" && acc.GetName() != name` condition is false, and thus `obj` is considered a match.
[1] https://github.com/kubernetes/client-go/blob/master/testing/fixture.go#L481-L493
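For contrast, here is a tiny sketch of the semantics a caller would expect, where an empty name is an error rather than a wildcard (hypothetical helper, not the upstream fix):
```go
package main

import "fmt"

// getByName is a hypothetical lookup with the expected semantics:
// an empty name is rejected instead of matching everything.
func getByName(objs map[string]string, name string) (string, error) {
	if name == "" {
		return "", fmt.Errorf("resource name may not be empty")
	}
	v, ok := objs[name]
	if !ok {
		return "", fmt.Errorf("configmap %q not found", name)
	}
	return v, nil
}

func main() {
	objs := map[string]string{"cm": "some data"}
	_, err := getByName(objs, "")
	fmt.Println(err) // resource name may not be empty
}
```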
This patch moves the HostUtil functionality from the util/mount package
to the volume/util/hostutil package.
All `*NewHostUtil*` calls are changed to return concrete types instead
of interfaces.
All callers are changed to use the `*NewHostUtil*` methods instead of
directly instantiating the concrete types.
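A short sketch of the resulting call pattern (assuming the hostutil package path described above and a PathExists helper on the returned type):
```go
package main

import (
	"fmt"

	"k8s.io/kubernetes/pkg/volume/util/hostutil"
)

func main() {
	// NewHostUtil returns the concrete, OS-specific implementation
	// rather than an interface.
	hu := hostutil.NewHostUtil()

	exists, err := hu.PathExists("/mnt/data")
	if err != nil {
		panic(err)
	}
	fmt.Println("path exists:", exists)
}
```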
DesiredStateOfWorldPopulator should skip a volume that is not used in
any pod. "Used" means either mounted (via volumeMounts) or used as a
raw block device (via volumeDevices).
Especially when the block feature is disabled, a block volume must not
get into the DesiredStateOfWorld, because it would be formatted and
mounted there.
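A sketch of that "used" rule as a helper (hypothetical name; init containers omitted for brevity):
```go
package populator

import v1 "k8s.io/api/core/v1"

// podUsesVolume reports whether any container in the pod mounts the
// volume (volumeMounts) or maps it as a raw block device
// (volumeDevices). Volumes for which this returns false should be
// skipped by the populator.
func podUsesVolume(pod *v1.Pod, volumeName string) bool {
	for _, c := range pod.Spec.Containers {
		for _, m := range c.VolumeMounts {
			if m.Name == volumeName {
				return true // mounted as a filesystem
			}
		}
		for _, d := range c.VolumeDevices {
			if d.Name == volumeName {
				return true // used as a raw block device
			}
		}
	}
	return false
}
```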
This patch refactors pkg/util/mount to be more usable outside of
Kubernetes. This is done by refactoring mount.Interface to contain
only methods that are not Kubernetes-specific. Methods that are not
relevant to basic mount activities but still have OS-specific
implementations now live in a mount.HostUtils interface.
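Conceptually, the split looks like this (method sets abbreviated and illustrative, not the exact upstream interfaces):
```go
package mountsketch

// MountPoint and FileType stand in for the real package types.
type MountPoint struct{ Device, Path, Type string }
type FileType string

// Interface keeps only generic mount operations with no Kubernetes
// dependencies.
type Interface interface {
	Mount(source, target, fstype string, options []string) error
	Unmount(target string) error
	List() ([]MountPoint, error)
}

// HostUtils collects OS-specific helpers that are not basic mount
// activities.
type HostUtils interface {
	DeviceOpened(pathname string) (bool, error)
	GetFileType(pathname string) (FileType, error)
}
```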
simplify defer statement
**What this PR does / why we need it**:
simplify defer statement
kubelet: storage: teardown terminated pod volumes
This is a continuation of the work done in https://github.com/kubernetes/kubernetes/pull/36779
There really is no reason to keep volumes for terminated pods attached on the node. This PR extends the removal of volumes on the node from memory-backed (the current policy) to all volumes.
@pmorie raised a concern about the impact on debugging volume-related issues if terminated pod volumes are removed. To address this, the PR adds a `--keep-terminated-pod-volumes` flag to the kubelet and sets it for `hack/local-up-cluster.sh`.
For consideration in 1.6.
Fixes #35406
@derekwaynecarr @vishh @dashpole
```release-note
kubelet tears down pod volumes on pod termination rather than pod deletion
```