`ioutil` has been deprecated by golang. All the code in `ioutil` just
forwards functionality to code in either the `io` or `os` packages.
See https://github.com/golang/go/pull/51961 for more info.
Signed-off-by: Jeff Widman <jeff@jeffwidman.com>
This change does a couple things to remove some cruft/unused functionality
in the Windows snapshotter, as well as add a way to specify the rootfs
size in bytes for a Windows container via a new field added in the CRI api in
k8s 1.24. Setting the rootfs/scratch volume size was assumed to be working
prior to this but turns out not to be the case.
Previously I'd added a change to pass any annotations in the containerd
snapshot form (containerd.io/snapshot/*) as labels for the containers
rootfs snapshot. This was added as a means for a client to be able to provide
containerd.io/snapshot/io.microsoft.container.storage.rootfs.size-gb as an
annotation and have that be translated to a label and ultimately set the
size for the scratch volume in Windows. However, this actually only worked if
interfacing with the CRI api directly (crictl) as Kubernetes itself will
fail to validate annotations that if split by "/" end up with > 2 parts,
which the snapshot labels will (containerd.io / snapshot / foobarbaz).
With this in mind, passing the annotations and filtering to
containerd.io/snapshot/* is moot, so I've removed this code in favor of
a new `snapshotterOpts()` function that will return platform specific
snapshotter options if ones exist. Now on Windows we can just check if
RootfsSizeInBytes is set on the WindowsContainerResources struct and
then return a snapshotter option that sets the right label.
So all in all this change:
- Gets rid of code to pass CRI annotations as labels down to snapshotters.
- Gets rid of the functionality to create a 1GB sized scratch disk if
the client provided a size < 20GB. This code is not used currently and
has a few logical shortcomings as it won't be able to create the disk
if a container is already running and using the same base layer. WCIFS
(driver that handles the unioning of windows container layers together)
holds open handles to some files that we need to delete to create the
1GB scratch disk is the underlying problem.
- Deprecates the containerd.io/snapshot/io.microsoft.container.storage.rootfs.size-gb
label in favor of a new containerd.io/snapshot/windows/rootfs.sizebytes label.
The previous label/annotation wasn't being used by us, and from a cursory
github search wasn't being used by anyone else either. Now that there is a CRI
field to specify the size, this should just be a field that users can set
on their pod specs and don't need to concern themselves with what it eventually
gets translated to, but non-CRI clients can still use the new label/deprecated
label as usual.
- Add test to cri integration suite to validate expanding the rootfs size.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
When upperdirLabel specified, overlay Update will throw tx closed error since Commit is invoked before GetInfo
Signed-off-by: cardy.tang <zuniorone@gmail.com>
Previouslty "Size" was reserved by protoc-gen-gogoctrd and user-generated
"Size" was automatically renamed to "Size_" to avoid conflicts.
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
This commit removes the following gogoproto extensions;
- gogoproto.nullable
- gogoproto.customename
- gogoproto.unmarshaller_all
- gogoproto.stringer_all
- gogoproto.sizer_all
- gogoproto.marshaler_all
- gogoproto.goproto_unregonized_all
- gogoproto.goproto_stringer_all
- gogoproto.goproto_getters_all
None of them are supported by Google's toolchain (see #6564).
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
While executing mke2fs, 'Not enough space to build proposed filesystem while setting up superblock' error is happend on Ubuntu20.04
Signed-off-by: Shinichi Morimoto <shnmorimoto@gmail.com>
This commit removes gogoproto.enumvalue_customname,
gogoproto.goproto_enum_prefix and gogoproto.enum_customname.
All of them make proto-generated Go code more idiomatic, but we already
don't use these enums in our external-surfacing types and they are anyway
not supported by Google's official toolchain (see #6564).
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
The directory created by `T.TempDir` is automatically removed when the
test and all its subtests complete.
Reference: https://pkg.go.dev/testing#T.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
Two xfs file systems with same UUID can not be mounted on the same system.
However devmapper snapshots will have same UUID as original filesystem.
This patch fixes the bug by mounting a xfs file system with "nouuid" option.
Signed-off-by: Henry Wang <henwang@amazon.com>
Add file system options for config file, so that user can use
non-default file system parameters for the fs type of choice
Using file system options in config file overwrites the default
options already being used.
Signed-off-by: Alakesh Haloi <alakeshh@amazon.com>
Disallow traversal into directories that may contain
unpacked or mounted image filesystems.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Signed-off-by: Samuel Karp <skarp@amazon.com>
The io/ioutil package has been deprecated as of Go 1.16, see
https://golang.org/doc/go1.16#ioutil. This commit replaces the existing
io/ioutil functions with their new definitions in io and os packages.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
ext4 file system was supported before. This adds support for xfs as
well. Containerd config file can have fs_type as an additional option
with possible values as "xfs" and "ext4" for now. In future other
fstype support can be added. A snapshot created from a committed
snapshot inherits the file system type of the parent. Any new snapshots
that has no parent is created with the file system type indicated in
config. If there is no config for file system type is found, then
ext4 is assumed. This allows users to use xfs as an optional file system
type.
Signed-off-by: Alakesh Haloi <alakeshh@amazon.com>
Go 1.15.7 contained a security fix for CVE-2021-3115, which allowed arbitrary
code to be executed at build time when using cgo on Windows. This issue also
affects Unix users who have “.” listed explicitly in their PATH and are running
“go get” outside of a module or with module mode disabled.
This issue is not limited to the go command itself, and can also affect binaries
that use `os.Command`, `os.LookPath`, etc.
From the related blogpost (ttps://blog.golang.org/path-security):
> Are your own programs affected?
>
> If you use exec.LookPath or exec.Command in your own programs, you only need to
> be concerned if you (or your users) run your program in a directory with untrusted
> contents. If so, then a subprocess could be started using an executable from dot
> instead of from a system directory. (Again, using an executable from dot happens
> always on Windows and only with uncommon PATH settings on Unix.)
>
> If you are concerned, then we’ve published the more restricted variant of os/exec
> as golang.org/x/sys/execabs. You can use it in your program by simply replacing
This patch replaces all uses of `os/exec` with `golang.org/x/sys/execabs`. While
some uses of `os/exec` should not be problematic (e.g. part of tests), it is
probably good to be consistent, in case code gets moved around.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Remove build tags which are already implied by the name of the file.
Ensures build tags are used consistently
Signed-off-by: Derek McGowan <derek@mcg.dev>
dmsetup does not discard blocks when removing a thin device. The unused blocks
are reused by the thin-pool, but will remain allocated in the underlying
device indefinitely. For loop device backed thin-pools, this results in
"lost" disk space in the underlying file system as the blocks remain allocated
in the loop device's backing file.
This change adds an option, discard_blocks, to the devmapper snapshotter which
causes the snapshotter to issue blkdiscard ioctls on the thin device before
removal. With this option enabled, loop device setups will see disk space
return to the underlying filesystem immediately on exiting a container.
Fixes#5691
Signed-off-by: Kern Walster <walster@amazon.com>
- View was somehow logging itself as "prepare"
- Cleanup should have its debug log as like other exported methods
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
It seems that something has shifted in an API, and vhd.DetachVhd is
returning "failed to open virtual disk: invalid argument" on Windows
Server LTSC 2019.
Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
A Scratch layer only contains a sandbox.vhdx, but to be used as a parent
layer, it must also contain the files on-disk.
Hence, we Export the layer from the sandbox.vhdx and Import it back into
itself, so that both data formats are present.
Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
If mkfs on device mapper thin pool fails, it will show pool status
as returned by dmsetup for enahnced error reporting.
Signed-off-by: Alakesh Haloi <alakeshh@amazon.com>
Before the change, the error on the caller-side (e.g. ctr) was
something like
> unpack: failed to prepare extraction snapshot "...": exit status 5:
> unknown
which was too cryptic.
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
When an error is returned here, unlike the other error returns in the function, nothing is done to mark the added device as faulty or remove it.
I have observed this causing future snapshot creations to continue to attempt to use the same ID (from the sequence) to create new devices
and get blocked because the device already exists because it was not rolled back here.
Hopefully fixes#5110
Signed-off-by: Jeremy Williams <ctrlaltdel121@gmail.com>
The "userxattr" option is needed for mounting overlayfs inside a user namespace with kernel >= 5.11.
The "userxattr" option is NOT needed for the initial user namespace (aka "the host").
Also, Ubuntu (since circa 2015) and Debian (since 10) with kernel < 5.11 can mount the overlayfs in a user namespace without the "userxattr" option.
The corresponding kernel commit: 2d2f2d7322ff43e0fe92bf8cccdc0b09449bf2e1
> ovl: user xattr
>
> Optionally allow using "user.overlay." namespace instead of "trusted.overlay."
> ...
> Disable redirect_dir and metacopy options, because these would allow privilege escalation through direct manipulation of the
> "user.overlay.redirect" or "user.overlay.metacopy" xattrs.
Fix issue 5060
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
This test has been flaky in GitHub Actions. This change logs the
values from devmapper to further investigate the issue.
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
btrfs plugin needs CGO support. However on riscv64, cgo
is only support on go1.16 (not released yet).
Instead of setting no_btrfs manually, adding a cgo tag tells
the compiler to skip it automatically.
Signed-off-by: Shengjing Zhu <zhsj@debian.org>
* Currently we rely on making the UVMs sandbox.vhdx in the shim itself instead of this being
made by the snapshotter itself. This change adds a label that affects whether to create the UVMs
scratch layer in the snapshotter itself.
* Adds container scratch size customization. Before adding the computestorage calls
(vendored in with https://github.com/containerd/containerd/pull/4859) there was no way to make a containers
or UVMs scratch size less than the default (20 for containers and 10 for the UVM).
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
Currently we would create a new disk and mount this into the LCOW UVM for every container but there
are certain scenarios where we'd rather just mount a single disk and then have every container share this one
storage space instead of every container having it's own xGB of space to play around with.
This is accomplished by just making a symlink to the disk that we'd like to share and then
using ref counting later on down the stack in hcsshim if we see that we've already mounted this
disk.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
No need to use the private losetup command line wrapper package.
The generic package provides the same functionality.
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
For LCOW currently we copy (or create) the scratch.vhdx for every single snapshot
so there ends up being a sandbox.vhdx in every directory seemingly unnecessarily. With the default scratch
size of 20GB the size on disk is about 17MB so there's a 17MB overhead per layer plus the time to copy the
file with every snapshot. Only the final sandbox.vhdx is actually used so this would be a nice little
optimization.
For WCOW we essentially do the exact same except copy the blank vhdx from the base layer.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
When running tests on any modern distro, this assumption will work. If
we need to make it work with kernels where we don't append this option
it will require some more involved changes.
Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com>
Previously there wwasn't a way to pass any labels to snapshotters as the wrapper
around WithNewSnapshot didn't have a parm to pass them in.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
Put the overlay plugin in a separate package to allow the overlay package to be
used without needing to import and initialize the plugin.
Signed-off-by: Derek McGowan <derek@mcg.dev>
This allows configuring the location of the overlayfs snapshotter by
adding the following in config.toml
```
[plugins]
[plugins.overlayfs]
root_path = "/custom_location"
```
This is useful to isolate disk i/o for overlayfs from the rest of
containerd and prevent containers saturating disk i/o from negatively
affecting containerd operations and cause timeouts.
Signed-off-by: Ashray Jain <ashrayj@palantir.com>
The rollback mechanism is implemented by calling deleteDevice() and
RemoveDevice(). But RemoveDevice() is internally calling
deleteDevice() as well.
Since a device will be deleted by first deleteDevice(),
RemoveDevice() always will see ENODATA. The specific error must be
ignored to remove the device's metadata correctly.
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
kernel version > 4.13rc1 support index=on feature, it will be failed
with EBUSY when trying to mount.
Related: https://github.com/moby/moby/pull/37993
Signed-off-by: Rudy Zhang <rudyflyzhang@gmail.com>
Dependencies may be switching to use the new `%w` formatting
option to wrap errors; switching to use `errors.Is()` makes
sure that we are still able to unwrap the error and detect the
underlying cause.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The issue beblow happens several times beforing the root
cause found:
1. A `fdisk -l` process has being hung up for a long time;
2. A image layer snapshot device is visiable to dmsetup, which
should *not* happen because it should be deactivated after
`Commit()`;
The backtrace of `fdisk` is always the same over time:
```bash
[<ffffffff810bbc6a>] io_schedule+0x2a/0x80
[<ffffffff81295a3f>] do_blockdev_direct_IO+0x1e9f/0x2f10
[<ffffffff81296aea>] __blockdev_direct_IO+0x3a/0x40
[<ffffffff81290e43>] blkdev_direct_IO+0x43/0x50
[<ffffffff811b8a14>] generic_file_read_iter+0x374/0x960
[<ffffffff81291ad5>] blkdev_read_iter+0x35/0x40
[<ffffffff8125229b>] new_sync_read+0xfb/0x240
[<ffffffff81252406>] __vfs_read+0x26/0x40
[<ffffffff81252b96>] vfs_read+0x96/0x130
[<ffffffff812540e5>] SyS_read+0x55/0xc0
[<ffffffff81003c04>] do_syscall_64+0x74/0x180
```
The root cause is, in Commit(), there's a race window between
`SuspendDevice()` and `DeactivateDevice()`, which may cause the
IOs of a process or command like `fdisk` on the "suspended" device
hang up forever. It has twofold:
1. The IOs suspends on the devices;
2. The device is in `Suspended` state, because it's deactivated with
`deferred` flag and without `force` flag;
So they cannot make progress.
One reproducer is:
1. enlarge the race window by putting sleep seconds there;
2. run `while true; do sudo fdisk -l; sleep 0.5; done` on one terminal;
3. and pull image on another terminal;
Fixes it by:
1. Resume the devices again after flushing IO by suspend;
2. Remove device without `deferred` flag;
Fix: #4234
Signed-off-by: Eric Ren <renzhen@linux.alibaba.com>