Automatic merge from submit-queue
vSphere Volume Plugin Implementation
This PR implements vSphere Volume plugin support in Kubernetes (ref. issue #23932).
Automatic merge from submit-queue
Add support for PersistentVolumeClaim in Attacher/Detacher interface
The attach detach interface does not support volumes which are referenced through PVCs. This PR adds that support
Automatic merge from submit-queue
Extend secrets volumes with path control
As per [1] this PR extends secrets mapped into volume with:
* key-to-path mapping the same way as is for configmap. E.g.
```
{
"apiVersion": "v1",
"kind": "Pod",
"metadata": {
"name": "mypod",
"namespace": "default"
},
"spec": {
"containers": [{
"name": "mypod",
"image": "redis",
"volumeMounts": [{
"name": "foo",
"mountPath": "/etc/foo",
"readOnly": true
}]
}],
"volumes": [{
"name": "foo",
"secret": {
"secretName": "mysecret",
"items": [{
"key": "username",
"path": "my-username"
}]
}
}]
}
}
```
Here the ``spec.volumes[0].secret.items`` added changing original target ``/etc/foo/username`` to ``/etc/foo/my-username``.
* secondly, refactoring ``pkg/volumes/secrets/secrets.go`` volume plugin to use ``AtomicWritter`` to project a secret into file.
[1] https://github.com/kubernetes/kubernetes/blob/master/docs/design/configmap.md#changes-to-secret
Automatic merge from submit-queue
volume recycler: Don't start a new recycler pod if one already exists.
Recycling is a long duration process and when the recycler controller is restarted in the meantime, it should not start a new recycler pod if there is one already running.
This means that the recycler pod must have deterministic name based on name of the recycled PV, we then get name conflicts when creating the pod.
Two things need to be changed:
- recycler controller and recycler plugins must pass the PV.Name to place, where the pod is created. This is most of the patch and it should be pretty straightforward.
- create recycler pod with deterministic name and check "already exists" error.
When at it, remove useless 'resourceVersion' argument and make log messages starting with lowercase.
There is an unit test to check the behavior + there is an e2e test that checks that regular recycling is not broken (it does not try to run two recycler pods in parallel as the recycler is single-threaded now).
Recycling is a long duration process and when the recycler controller is
restarted in the meantime, it should not start a new recycler pod if there is
one already running.
This means that the recycler pod must have deterministic name based on name
of the recycled PV, we then get name conflicts when creating the pod.
Two things need to be changed:
- recycler controller and recycler plugins must pass the PV.Name to place,
where the pod is created.
- create recycler pod with deterministic name and check "already exists" error.
When at it, remove useless 'resourceVersion' argument and make log messages
starting with lowercase.
Automatic merge from submit-queue
Refactor persistent volume controller
Here is complete persistent controller as designed in https://github.com/pmorie/pv-haxxz/blob/master/controller.go
It's feature complete and compatible with current binder/recycler/provisioner. No new features, it *should* be much more stable and predictable.
Testing
--
The unit test framework is quite complicated, still it was necessary to reach reasonable coverage (78% in `persistentvolume_controller.go`). The untested part are error cases, which are quite hard to test in reasonable way - sure, I can inject a VersionConflictError on any object update and check the error bubbles up to appropriate places, but the real test would be to run `syncClaim`/`syncVolume` again and check it recovers appropriately from the error in the next periodic sync. That's the hard part.
Organization
---
The PR starts with `rm -rf kubernetes/pkg/controller/persistentvolume`. I find it easier to read when I see only the new controller without old pieces scattered around.
[`types.go` from the old controller is reused to speed up matching a bit, the code looks solid and has 95% unit test coverage].
I tried to split the PR into smaller patches, let me know what you think.
~~TODO~~
--
* ~~Missing: provisioning, recycling~~.
* ~~Fix integration tests~~
* ~~Fix e2e tests~~
@kubernetes/sig-storage
<!-- Reviewable:start -->
---
This change is [<img src="http://reviewable.k8s.io/review_button.svg" height="35" align="absmiddle" alt="Reviewable"/>](http://reviewable.k8s.io/reviews/kubernetes/kubernetes/24331)
<!-- Reviewable:end -->
Fixes#15632
The key to path mapping allows pod to specify different name (thus location) of each secret.
At the same time refactor the volume plugin to use AtomicWritter to project secrets to files in a volume.
Update e2e Secrets test, the secret file permission has changed from 0444 to 0644
Remove TestPluginIdempotent as the AtomicWritter is responsible for secret creation
Automatic merge from submit-queue
Make IsQualifiedName return error strings
Part of the larger validation PR, broken out for easier review and merge.
@lavalamp FYI, but I know you're swamped, too.
Automatic merge from submit-queue
Use local disk for ConfigMap volume instead of tmpfs
So that ConfigMap volumes are counted against pod's storage quota.
@kubernetes/sig-node
cc @derekwaynecarr @vishh
Automatic merge from submit-queue
Abstract node side functionality of attachable plugins
- Create PhysicalAttacher interface to abstract MountDevice and
WaitForAttach.
- Create PhysicalDetacher interface to abstract WaitForDetach and
UnmountDevice.
- Expand unit tests to check that Attach, Detach, WaitForAttach,
WaitForDetach, MountDevice, and UnmountDevice get call where
appropriet.
Physical{Attacher,Detacher} are working titles suggestions welcome. Some other thoughts:
- NodeSideAttacher or NodeAttacher.
- AttachWatcher
- Call this Attacher and call the Current Attacher CloudAttacher.
- DeviceMounter (although there are way too many things called Mounter right now :/)
This is to address: https://github.com/kubernetes/kubernetes/pull/21709#issuecomment-192035382
@saad-ali
Automatic merge from submit-queue
Automatically Add Supplemental Groups from Volumes to Pods
This adds support for a "GID" annotation that one can add to their PVs. When this annotation is seen the kubelet automatically adds the given GID to the list of supplemental groups for the pod to which the PV is attached. This allows admins to create volumes and suggest a GID to use to access the volume. This is needed for volumes which do not support ownership management such as NFS.
@markturansky PTAL
- Expand Attacher/Detacher interfaces to break up work more
explicitly.
- Add arguments to all functions to avoid having implementers store
the data needed for operations.
- Expand unit tests to check that Attach, Detach, WaitForAttach,
WaitForDetach, MountDevice, and UnmountDevice get call where
appropriet.
Automatic merge from submit-queue
Rackspace improvements (OpenStack Cinder)
This adds PV support via Cinder on Rackspace clusters. Rackspace Cloud Block Storage is pretty much vanilla OpenStack Cinder, so there is no need for a separate Volume Plugin. Instead I refactored the Cinder/OpenStack interaction a bit (by introducing a CinderProvider Interface and moving the device path detection logic to the OpenStack part).
Right now this is limited to `AttachDisk` and `DetachDisk`. Creation and deletion of Block Storage is not in scope of this PR.
Also the `ExternalID` and `InstanceID` cloud provider methods have been implemented for Rackspace.
Automatic merge from submit-queue
Add mpio support for iscsi
This allows the iscsi volume to check if a iscsi device belongs to a mpio device
If it does belong to the device then we make sure we mount the mpio device instead of
the raw device.
The code is based on the current FibreChannel volume support for mpio
example
/dev/disk/by-path/iqn-example.com.2999 -> /dev/sde
Then we check
/sys/block/[dm-X]/slaves/xx
until we find the [dm-X] containing /dev/sde and mount it
Additional work that can be done in future
1. Add multiple portal support to iscsi
2. Move the FibreChannel volume provider to use the code that has been extracted
If it does belong to the device then we make sure we mount the mpio device instead of
the raw device.
Heuristics
Login into /dev/disk/by-path/iqn-example.com.2999 -> /dev/sde
Check if sde existsin in /sys/block/[dm-X]/slaves/xx
If it does mount /dev/[dm-x] which will look like /dev/mapper/mpiodevicename in mount
examples/iscsi has more details
Automatic merge from submit-queue
Additional go vet fixes
Mostly:
- pass lock by value
- bad syntax for struct tag value
- example functions not formatted properly
AWS has soft support limit for 40 attached EBS devices. Assuming there is just
one root device, use the rest for persistent volumes.
The devices will have name /dev/xvdba - /dev/xvdcm, leaving /dev/sda - /dev/sdz
to the system.
Also, add better error handling and propagate error
"Too many EBS volumes attached to node XYZ" to a pod.
In podSecurityPolicy:
1. Rename .seLinuxContext to .seLinux
2. Rename .seLinux.type to .seLinux.rule
3. Rename .runAsUser.type to .runAsUser.rule
4. Rename .seLinux.SELinuxOptions
1,2,3 as suggested by thockin in #22159.
I added 3 for consistency with 2.
Similar to #11543, the local hostname is not guaranteed to be the node
name, as the AWS cloud provider looks up node name using
`private-dns-name`. This value can be different such as when using a
private hosted zone.
The previous code uses GetHostName(), which fails in this case. Instead,
pass in an empty string so the aws cloud provider will use the cached
self instance to find the instance id.
Authors: @balooo, @dogan-sky, @jsravn
This is a first-aid bandage to let admission controller ignore persistent
volumes that are being provisioned right now and thus may not exist in
external cloud infrastructure yet.
Volume names have now format <cluster-name>-dynamic-<pv-name>.
pv-name is guaranteed to be unique in Kubernetes cluster, adding
<cluster-name> ensures we don't conflict with any running cluster
in the cloud project (kube-controller-manager --cluster-name=XXX).
'kubernetes' is the default cluster name.
* Metrics will not be expose until they are hooked up to a handler
* Metrics are not cached and expose a dos vector, this must be fixed before release or the stats should not be exposed through an api endpoint
We are (sadly) using a copy-and-paste of the GCE PD code for AWS EBS.
This code hasn't been updated in a while, and it seems that the GCE code
has some code to make volume mounting more robust that we should copy.
Add a mutex to guard SetUpAt() and TearDownAt() calls - they should not
run in parallel. There is a race in these calls when there are two pods
using the same volume, one of them is dying and the other one starting.
TearDownAt() checks that a volume is not needed by any pods and detaches the
volume. It does so by counting how many times is the volume mounted
(GetMountRefs() call below).
When SetUpAt() of the starting pod already attached the volume and did not mount
it yet, TearDownAt() of the dying pod will detach it - GetMountRefs() does not
count with this volume.
These two threads run in parallel:
dying pod.TearDownAt("myVolume") starting pod.SetUpAt("myVolume")
| |
| AttachDisk("myVolume")
refs, err := mount.GetMountRefs() |
Unmount("myDir") |
if refs == 1 { |
| | Mount("myVolume", "myDir")
| | |
| DetachDisk("myVolume") |
| start containers - OOPS! The volume is detached!
|
finish the pod cleanup
Also, add some logs to cinder plugin for easier debugging in the future, add
a test and update the fake mounter to know about bind mounts.
GCE disks don't have tags, we must encode the tags into Description field.
It's encoded as JSON, which is both human and machine readable:
description: '{"kubernetes.io/created-for/pv/name":"pv-gce-oxwts","kubernetes.io/created-for/pvc/name":"myclaim","kubernetes.io/created-for/pvc/namespace":"default"}'
We adapt the existing code to work across all zones in a region.
We require a feature-flag to enable Ubernetes-Lite
Reasons:
* There are some behavioural changes if users create volumes with
the same name in two zones.
* We don't want to make one API call per zone if we're not running
Ubernetes-Lite.
* Ubernetes-Lite is still experimental.
There isn't a parallel flag implemented for AWS, because at the moment
there would be no behaviour changes from this.
This synchronizes Cinder with AWS EBS code, where we already tag volumes with
claim.Namespace and claim.Name (and pv.Name, as suggested in separate PR).
- Add volume.MetricsProvider function to Volume interface.
- Add volume.MetricsDu for providing metrics via executing "du".
- Add volulme.MetricsNil for unsupported Volumes.
This enables use of software or hardware transports viz. be2iscsi,
bnx2i, cxgb3i, cxgb4i, qla4xx, iser and ocs. The default transport
(tcp) happens to be called "default".
Use of non-default transports changes the disk path to the following format:
/dev/disk/by-path/pci-<pci_id>-ip-<portal>-iscsi-<iqn>-lun-<lun_id>
Code comments currently claim the default iscsi mount path as
kubernetes.io/pod/iscsi/<portal>-iqn-<iqn>-lun-<id>, however actual
path being used is
kubernetes.io/iscsi/iscsi/<portal>-iqn-<iqn>-lun-<id>
This leads to ultimate path being similar to this :
kubernetes.io/iscsi/iscsi/...iqn-iqn...-lun-N
Both iscsi and iqn are repated twice for no reason, since "iqn" is
required by spec to be part of an iqn. This is also wrong on
multiple leves as actual allowed naming formats are :
iqn.2001-04.com.example:storage:diskarrays-sn-a8675309
eui.02004567A425678D
(RFC 3720 3.2.6.3)
and in the second case "iqn-eui" in the path would be misleading.
Change this to a more reasonable path of
kubernetes.io/iscsi/<portal>-<iqn>-lun-<id>
which also aligns up with how the /dev/by-path and sysfs entries
are created for iscsi devices on linux
* -- *
Update iSCSI README and sample json file
There seems to have been quite a skew in recent updates to these
files adding in wrong info or info that no longer lines up the
sample config with the README.
Fixed the following issues :
* Fix discrepancy in samples json using initiator iqn from previous
linked example as target iqn (which was just wrong)
* Generate sample output and README from the same json config provided.
* Remove recommendation to edit initiator name, this is not required
(open-iscsi warns against editing this manually and provides a utility
for the same)
* Update docker inspect command to one that works.
* Use separate LUNs for separate mount points instead of re-using.
remount was originally needed to ensure rw/ro is set correctly. There is no such need since mount is using exec interface
Signed-off-by: Huamin Chen <hchen@redhat.com>
Flocker [1] is an open-source container data volume manager for
Dockerized applications.
This PR adds a volume plugin for Flocker.
The plugin interfaces the Flocker Control Service REST API [2] to
attachment attach the volume to the pod.
Each kubelet host should run Flocker agents (Container Agent and Dataset
Agent).
The kubelet will also require environment variables that contain the
host and port of the Flocker Control Service. (see Flocker architecture
[3] for more).
- `FLOCKER_CONTROL_SERVICE_HOST`
- `FLOCKER_CONTROL_SERVICE_PORT`
The contribution introduces a new 'flocker' volume type to the API with
fields:
- `datasetName`: which indicates the name of the dataset in Flocker
added to metadata;
- `size`: a human-readable number that indicates the maximum size of the
requested dataset.
Full documentation can be found docs/user-guide/volumes.md and examples
can be found at the examples/ folder
[1] https://clusterhq.com/flocker/introduction/
[2] https://docs.clusterhq.com/en/1.3.1/reference/api.html
[3] https://docs.clusterhq.com/en/1.3.1/concepts/architecture.html
rbd: if rbd image is not formatted, format it to the designated filesystem type
rbd: update example README.md and include instructions to get base64 encoded Ceph secret
if rbd fails to lock image, unmap the image before exiting
Signed-off-by: Huamin Chen <hchen@redhat.com>
GlusterFS by default uses a log file based on the mountpoint path munged into a
file, i.e. `/mnt/foo/bar` becomes `/var/log/glusterfs/mnt-foo-bar.log`.
On certain Kubernetes environments this can result in a log file that exceeds
the 255 character length most filesystems impose on filenames causing the mount
to fail. Instead, use the `log-file` mount option to place the log file under
the kubelet plugin directory with a filename of our choosing keeping it fairly
persistent in the case of troubleshooting.
A lot of packages use StringSet, but they don't use anything else from
the util package. Moving StringSet into another package will shrink
their dependency trees significantly.
This code was originally added because the first mount call did not
respect the ro option. This no longer seems to be the cause so there
is no need to use remount.
Signed-off-by: Sami Wagiaalla <swagiaal@redhat.com>
The GCE PD plugin uses safe_format_and_mount found on standard GCE images:
https://github.com/GoogleCloudPlatform/compute-image-packages/blob/master/google-startup-scripts/usr/share/google/safe_format_and_mount
On custom images where this is not available pods fail to format and
mount GCE PDs. This patch uses linux utilities in a similar way to the
safe_format_and_mount script to format and mount the GCE PD and AWS EBC
devices. That is first attempt a mount. If mount fails try to use file to
investigate the device. If 'file' fails to get any information about
the device and simply returns "data" then assume the device is not
formatted and format it and attempt to mount it again.
Signed-off-by: Sami Wagiaalla <swagiaal@redhat.com>
IsLikelyNotMountPoint determines if a directory is not a mountpoint.
It is fast but not necessarily ALWAYS correct. If the path is in fact
a bind mount from one part of a mount to another it will not be detected.
mkdir /tmp/a /tmp/b; mount --bin /tmp/a /tmp/b; IsLikelyNotMountPoint("/tmp/b")
will return true. When in fact /tmp/b is a mount point. So this patch
renames the function and switches it from a positive to a negative (I
could think of a good positive name). This should make future users of
this function aware that it isn't quite perfect, but probably good
enough.
Make clients opt in to decoding objects that are stored
in the generic api.List object by invoking runtime.DecodeList()
with a set of schemes. Makes it easier to handle unknown
schema objects because decoding is in the control of the code.
Add runtime.Unstructured, which is a simple in memory
representation of an external object.
I have a pod, which exports a Gluster filesystem in non-default namespace.
When I try to use this FS as a GlusterfsVolumeSource in a 'client' pod
definition, Kubernetes looks for the appropriate endpoint in 'default'
namespace instead of the namespace where the client pod is being defined.
Standardize how our fakes are used so that a test case can use a
simpler mechanism for providing large, complex data sets, as well
as represent queries over time.
* Improper format specifier (e.g. %s for bools or %s for ints)
* More or less parameters than format specifiers
* Not calling a formatting function when it should have (e.g. Error() instead of Errorf())
Break up the monolithic volumes code in kubelet into very small individual
modules with a well-defined interface. Move them all into their own packages
and beef up testing along the way.
Adds GCEPersistentDisk volume struct
Adds gce-utils to attach disk to kubelet's VM.
Updates config to give compute-rw to every minion.
Adds GCEPersistentDisk to API
Adds ability to mount attached disks
Generalizes PD and adds tests.
PD now uses an pluggable API interface.
Unit Tests more cleanly separates TearDown and SetUp
Modify boilerplate hook to omit build tags
Adds Mounter interface; mount is now built by OS
TearDown() for PD now detaches disk on final refcount
Un-generalized PD; GCE calls moved to cloudprovider
Address comments.
Different information is needed to perform setup versus teardown. It
makes sense to separate these two interfaces since when we call teardown
from the reconciliation loop, we cannot rely on having the
information provided by the api definition of the volume.
Determines the set of active volumes versus the set of valid volumes
defined by the manifests. If there is an active volume that is not
defined in any of the manifests, deletes and cleans up that volume.
The volume package does not test enough use-cases.
Improve test coverage by adding additional tests and refactoring
current tests to use table testing.
This change introduces a new error var to make testing unsupported
volume type errors easier.
This change does not introduce any changes in behavior.
Adds EmptyDirectory to volumes. This represents a directory
on the host, given to a pod that should not persist beyond.
The current draft does not cleanup after itself.
Adds the framework for external volume mounts.
Currently supports bare host directory mounts.
Modifies the API to support host directory mounts from Volumes
instead of VolumeMounts.