Background:
With current design, the content backend uses key-lock for long-lived
write transaction. If the content reference has been marked for write
transaction, the other requestes on the same reference will fail fast with
unavailable error. Since the metadata plugin is based on boltbd which
only supports single-writer, the content backend can't block or handle
the request too long. It requires the client to handle retry by itself,
like OpenWriter - backoff retry helper. But the maximum retry interval
can be up to 2 seconds. If there are several concurrent requestes fo the
same image, the waiters maybe wakeup at the same time and there is only
one waiter can continue. A lot of waiters will get into sleep and we will
take long time to finish all the pulling jobs and be worse if the image
has many more layers, which mentioned in issue #4937.
After fetching, containerd.Pull API allows several hanlers to commit
same ChainID snapshotter but only one can be done successfully. Since
unpack tar.gz is time-consuming job, it can impact the performance on
unpacking for same ChainID snapshotter in parallel.
For instance, the Request 2 doesn't need to prepare and commit, it
should just wait for Request 1 finish, which mentioned in pull
request #6318.
```text
Request 1 Request 2
Prepare
|
|
|
| Prepare
Commit |
|
|
|
Commit(failed on exist)
```
Both content backoff retry and unnecessary unpack impacts the performance.
Solution:
Introduced the duplicate suppression in fetch and unpack context. The
deplicate suppression uses key-mutex and single-waiter-notify to support
singleflight. The caller can use the duplicate suppression in different
PullImage handlers so that we can avoid unnecessary unpack and spin-lock
in OpenWriter.
Test Result:
Before enhancement:
```bash
➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20
crictl pull localhost:5000/redis:latest (x20) takes ...
real 1m6.172s
user 0m0.268s
sys 0m0.193s
docker pull localhost:5000/redis:latest (x20) takes ...
real 0m1.324s
user 0m0.441s
sys 0m0.316s
➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20
crictl pull localhost:5000/golang:latest (x20) takes ...
real 1m47.657s
user 0m0.284s
sys 0m0.224s
docker pull localhost:5000/golang:latest (x20) takes ...
real 0m6.381s
user 0m0.488s
sys 0m0.358s
```
With this enhancement:
```bash
➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20
crictl pull localhost:5000/redis:latest (x20) takes ...
real 0m1.140s
user 0m0.243s
sys 0m0.178s
docker pull localhost:5000/redis:latest (x20) takes ...
real 0m1.239s
user 0m0.463s
sys 0m0.275s
➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20
crictl pull localhost:5000/golang:latest (x20) takes ...
real 0m5.546s
user 0m0.217s
sys 0m0.219s
docker pull localhost:5000/golang:latest (x20) takes ...
real 0m6.090s
user 0m0.501s
sys 0m0.331s
```
Test Script:
localhost:5000/{redis|golang}:latest is equal to
docker.io/library/{redis|golang}:latest. The image is hold in local registry
service by `docker run -d -p 5000:5000 --name registry registry:2`.
```bash
image_name="${1}"
pull_times="${2:-10}"
cleanup() {
ctr image rmi "${image_name}"
ctr -n k8s.io image rmi "${image_name}"
crictl rmi "${image_name}"
docker rmi "${image_name}"
sleep 2
}
crictl_testing() {
for idx in $(seq 1 ${pull_times}); do
crictl pull "${image_name}" > /dev/null 2>&1 &
done
wait
}
docker_testing() {
for idx in $(seq 1 ${pull_times}); do
docker pull "${image_name}" > /dev/null 2>&1 &
done
wait
}
cleanup > /dev/null 2>&1
echo 3 > /proc/sys/vm/drop_caches
sleep 3
echo "crictl pull $image_name (x${pull_times}) takes ..."
time crictl_testing
echo
echo 3 > /proc/sys/vm/drop_caches
sleep 3
echo "docker pull $image_name (x${pull_times}) takes ..."
time docker_testing
```
Fixes: #4937Close: #4985Close: #6318
Signed-off-by: Wei Fu <fuweid89@gmail.com>
This commit upgrades github.com/containerd/typeurl to use typeurl.Any.
The interface hides gogo/protobuf/types.Any from containerd's Go client.
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
runc option --criu is now ignored (with a warning), and the option will be
removed entirely in a future release. Users who need a non- standard criu
binary should rely on the standard way of looking up binaries in $PATH.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Currently when handling 'container_path' elements in container mounts we simply call
filepath.Clean on those paths. However, filepath.Clean adds an extra '.' if the path is a
simple drive letter ('E:' or 'Z:' etc.). These type of paths cause failures (with incorrect
parameter error) when creating containers via hcsshim. This commit checks for such paths
and doesn't call filepath.Clean on them.
It also adds a new check to error out if the destination path is a C drive and moves the
dst path checks out of the named pipe condition.
Signed-off-by: Amit Barve <ambarve@microsoft.com>
The linter on platforms that have a hardcoded response complains about
"if xyz == nil" checks; ignore those.
Signed-off-by: Phil Estes <estesp@amazon.com>
The directory created by `T.TempDir` is automatically removed when the
test and all its subtests complete.
Reference: https://pkg.go.dev/testing#T.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
There's two mappings of hostpath to IDType and ID in the wild:
- dockershim and dockerd-cri (implicitly via docker) use class/ID
-- The only supported IDType in Docker is 'class'.
-- https://github.com/aarnaud/k8s-directx-device-plugin generates this form
- https://github.com/jterry75/cri (windows_port branch) uses IDType://ID
-- hcsshim's CRI test suite generates this form
`://` is much more easily distinguishable, so I've gone with that one as
the generic separator, with `class/` as a special-case.
Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
These unit tests don't check hugetlb. However by setting
TolerateMissingHugetlbController to false, these tests can't
be run on system without hugetlb (e.g. Debian buildd).
Signed-off-by: Shengjing Zhu <zhsj@debian.org>
full diff: https://github.com/emicklei/go-restful/compare/v2.9.5...v3.7.3
- Switch to using go modules
- Add check for wildcard to fix CORS filter
- Add check on writer to prevent compression of response twice
- Add OPTIONS shortcut WebService receiver
- Add Route metadata to request attributes or allow adding attributes to routes
- Add wroteHeader set
- Enable content encoding on Handle and ServeHTTP
- Feat: support google custom verb
- Feature: override list of method allowed without content-type
- Fix Allow header not set on '405: Method Not Allowed' responses
- Fix Go 1.15: conversion from int to string yields a string of one rune
- Fix WriteError return value
- Fix: use request/response resulting from filter chain
- handle path params with prefixes and suffixes
- HTTP response body was broken, if struct to be converted to JSON has boolean value
- List available representations in 406 body
- Support describing response headers
- Unwrap function in filter chain + remove unused dispatchWithFilters
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
We were not properly ignoring errors from
gorestrl.rdt.ContainerClassFromAnnotations() causing the config option
to be ineffective, in practice.
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
With the cgroupv2 configuration employed by Kubernetes, the pod cgroup (slice)
and container cgroup (scope) will both have the same memory limit applied. In
that situation, the kernel will consider an OOM event to be triggered by the
parent cgroup (slice), and increment 'oom' there. The child cgroup (scope) only
sees an oom_kill increment. Since we monitor child cgroups for oom events,
check the OOMKill field so that we don't miss events.
This is not visible when running containers through docker or ctr, because they
set the limits differently (only container level). An alternative would be to
not configure limits at the pod level - that way the container limit will be
hit and the OOM will be correctly generated. An interesting consequence is that
when spawning a pod with multiple containers, the oom events also work
correctly, because:
a) if one of the containers has no limit, the pod has no limit so OOM events in
another container report correctly.
b) if all of the containers have limits then the pod limit will be a sum of
container events, so a container will be able to hit its limit first.
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
When the cgroup is removed, EventChan is closed (this was pulled in by
8d69c041c5). This results in a nil error
being received. Don't log an error in that case but instead return.
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
The Linux kernel never sets the Inheritable capability flag to
anything other than empty. Non-empty values are always exclusively
set by userspace code.
[The kernel stopped defaulting this set of capability values to the
full set in 2000 after a privilege escalation with Capabilities
affecting Sendmail and others.]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Enabling this option effectively causes RDT class of a container to be a
soft requirement. If RDT support has not been enabled the RDT class
setting will not have any effect.
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
Use goresctrl for parsing container and pod annotations related to RDT.
In practice, from the users' point of view, this patchs adds support for
a container annotation and two separate pod annotations for controlling
the RDT class of containers.
Container annotation can be used by a CRI client:
"io.kubernetes.cri.rdt-class"
Pod annotations for specifying the RDT class in the K8s pod spec level:
"rdt.resources.beta.kubernetes.io/pod"
(pod-wide default for all containers within)
"rdt.resources.beta.kubernetes.io/container.<container_name>"
(container-specific overrides)
Annotations are intended as an intermediate step before the CRI API
supports RDT.
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
The ability to handle KVM based runtimes with SELinux has been added as
part of d715d00906.
However, that commit introduced some logic to check whether the
"container_kvm_t" label would or not be present in the system, and while
the intentions were good, there's two major issues with the approach:
1. Inspecting "/etc/selinux/targeted/contexts/customizable_types" is not
the way to go, as it doesn't list the "container_kvm_t" at all.
2. There's no need to check for the label, as if the label is invalid an
"Invalid Label" error will be returned and that's it.
With those two in mind, let's simplify the logic behind setting the
"container_kvm_t" label, removing all the unnecessary code.
Here's an output of VMM process running, considering:
* The state before this patch:
```
$ containerd --version
containerd github.com/containerd/containerd v1.6.0-beta.3-88-g7fa44fc98 7fa44fc98f
$ kubectl apply -f ~/simple-pod.yaml
pod/nginx created
$ ps -auxZ | grep cloud-hypervisor
system_u:system_r:container_runtime_t:s0 root 609717 4.0 0.5 2987512 83588 ? Sl 08:32 0:00 /usr/bin/cloud-hypervisor --api-socket /run/vc/vm/be9d5cbabf440510d58d89fc8a8e77c27e96ddc99709ecaf5ab94c6b6b0d4c89/clh-api.sock
```
* The state after this patch:
```
$ containerd --version
containerd github.com/containerd/containerd v1.6.0-beta.3-89-ga5f2113c9 a5f2113c9fc15b19b2c364caaedb99c22de4eb32
$ kubectl apply -f ~/simple-pod.yaml
pod/nginx created
$ ps -auxZ | grep cloud-hypervisor
system_u:system_r:container_kvm_t:s0:c638,c999 root 614842 14.0 0.5 2987512 83228 ? Sl 08:40 0:00 /usr/bin/cloud-hypervisor --api-socket /run/vc/vm/f8ff838afdbe0a546f6995fe9b08e0956d0d0cdfe749705d7ce4618695baa68c/clh-api.sock
```
Note, the tests were performed using the following configuration snippet:
```
[plugins]
[plugins.cri]
enable_selinux = true
[plugins.cri.containerd]
[plugins.cri.containerd.runtimes]
[plugins.cri.containerd.runtimes.kata]
runtime_type = "io.containerd.kata.v2"
privileged_without_host_devices = true
```
And using the following pod yaml:
```
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
runtimeClassName: kata
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
```
Fixes: #6371
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
CRI API has been updated to include a an optional `resources` field in the
LinuxPodSandboxConfig field, as part of the RunPodSandbox request.
Having sandbox level resource details at sandbox creation time will have
large benefits for sandboxed runtimes. In the case of Kata Containers,
for example, this'll allow for better support of SW/HW architectures
which don't allow for CPU/memory hotplug, and it'll allow for better
queue sizing for virtio devices associated with the sandbox (in the VM
case).
If this sandbox resource information is provided as part of the run
sandbox request, let's introduce a pattern where we will update the
pause container's runtiem spec to include this information in the
annotations field.
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
When containerd use this config:
```
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."localhost:5000"]
endpoint = ["http://localhost:5000"]
```
Due to the `newTransport` function does not initialize the `TLSClientConfig` field.
Then use `TLSClientConfig` to cause nil pointer dereference
Signed-off-by: wanglei <wllenyj@linux.alibaba.com>
These are simple metrics that allow users to view more fine grained metrics on
internal operations.
Signed-off-by: Michael Crosby <michael@thepasture.io>
See https://kep.k8s.io/2371
* Implement new CRI RPCs - `ListPodSandboxStats` and `PodSandboxStats`
* `ListPodSandboxStats` and `PodSandboxStats` which return stats about
pod sandbox. To obtain pod sandbox stats, underlying metrics are
read from the pod sandbox cgroup parent.
* Process info is obtained by calling into the underlying task
* Network stats are taken by looking up network metrics based on the
pod sandbox network namespace path
* Return more detailed stats for cpu and memory for existing container
stats. These metrics use the underlying task's metrics to obtain
stats.
Signed-off-by: David Porter <porterdavid@google.com>
This commit adds a flag that enable all devices whitelisting when
privileged_without_host_devices is already enabled.
Fixes#5679
Signed-off-by: Dat Nguyen <dnguyen7@atlassian.com>
This change ignore errors during container runtime due to large
image labels and instead outputs warning. This is necessary as certain
image building tools like buildpacks may have large labels in the images
which need not be passed to the container.
Signed-off-by: Sambhav Kothari <sambhavs.email@gmail.com>
This will allow running Windows Containers to have their resource
limits updated through containerd. The CPU resource limits support
has been added for Windows Server 20H2 and newer, on older versions
hcsshim will raise an Unimplemented error.
Signed-off-by: Claudiu Belu <cbelu@cloudbasesolutions.com>
In linux 5.14 and hopefully some backports, core scheduling allows processes to
be co scheduled within the same domain on SMT enabled systems.
The containerd impl sets the core sched domain when launching a shim. This
allows a clean way for each shim(container/pod) to be in its own domain and any
additional containers, (v2 pods) be be launched with the same domain as well as
any exec'd process added to the container.
kernel docs: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/core-scheduling.html
Signed-off-by: Michael Crosby <michael@thepasture.io>
Currently, there are few issues that preventing containers
with image volumes to properly start on Windows.
- Unlike the Linux implementation, the Container volume mount paths
were not created if they didn't exist. Those paths are now created.
- while copying the image volume contents to the container volume,
the layers were not properly deactivated, which means that the
container can't start since those layers are still open. The layers
are now properly deactivated, allowing the container to start.
- even if the above issue didn't exist, the Windows implementation of
mount/Mount.Mount deactivates the layers, which wouldn't allow us
to copy files from them. The layers are now deactivated after we've
copied the necessary files from them.
- the target argument of the Windows implementation of mount/Mount.Mount
was unused, which means that folder was always empty. We're now
symlinking the Layer Mount Path into the target folder.
- hcsshim needs its Container Mount Paths to be properly formated, to be
prefixed by C:. This was an issue for Volumes defined with Linux-like
paths (e.g.: /test_dir). filepath.Abs solves this issue.
Signed-off-by: Claudiu Belu <cbelu@cloudbasesolutions.com>
The io/ioutil package has been deprecated as of Go 1.16, see
https://golang.org/doc/go1.16#ioutil. This commit replaces the existing
io/ioutil functions with their new definitions in io and os packages.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
This fixes the TODO of this function and also expands on how the primary pod ip
is selected. This change allows the operator to prefer ipv4, ipv6, or retain the
ordering provided by the return results of the CNI plugins.
This makes it much more flexible for ops to configure containerd and how IPs are
set on the pod.
Signed-off-by: Michael Crosby <michael@thepasture.io>
After containerd restarts, it will try to recover its sandboxes,
containers, and images. If it detects a task in the Created or
Stopped state, it will be removed. This will cause the containerd
process it hang on Windows on the t.io.Wait() call.
Calling t.io.Close() beforehand will solve this issue.
Additionally, the same issue occurs when trying to stopp a sandbox
after containerd restarts. This will solve that case as well.
Signed-off-by: Claudiu Belu <cbelu@cloudbasesolutions.com>
The CRI-plugin subscribes the image event on k8s.io namespace. By
default, the image event is created by CRI-API. However, the image can
be downloaded by containerd API on k8s.io with the customized labels.
The CRI-plugin should use patch update for `io.cri-containerd.image`
label in this case.
Fixes: #5900
Signed-off-by: Wei Fu <fuweid89@gmail.com>
The Pod Sandbox can enter in a NotReady state if the task associated
with it no longer exists (it died, or it was killed). In this state,
the Pod network namespace could still be open, which means we can't
remove the sandbox, even if --force was used.
Signed-off-by: Claudiu Belu <cbelu@cloudbasesolutions.com>
With the introduction of Windows Server 2022, some images have been updated
to support WS2022 in their manifest list. This commit updates the test images
accordingly.
Signed-off-by: Adelina Tuvenie <atuvenie@cloudbasesolutions.com>
CRI container runtimes mount devices (set via kubernetes device plugins)
to containers by taking the host user/group IDs (uid/gid) to the
corresponding container device.
This triggers a problem when trying to run those containers with
non-zero (root uid/gid = 0) uid/gid set via runAsUser/runAsGroup:
the container process has no permission to use the device even when
its gid is permissive to non-root users because the container user
does not belong to that group.
It is possible to workaround the problem by manually adding the device
gid(s) to supplementalGroups. However, this is also problematic because
the device gid(s) may have different values depending on the workers'
distro/version in the cluster.
This patch suggests to take RunAsUser/RunAsGroup set via SecurityContext
as the device UID/GID, respectively. The feature must be enabled by
setting device_ownership_from_security_context runtime config value to
true (valid on Linux only).
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Go 1.15.7 contained a security fix for CVE-2021-3115, which allowed arbitrary
code to be executed at build time when using cgo on Windows. This issue also
affects Unix users who have “.” listed explicitly in their PATH and are running
“go get” outside of a module or with module mode disabled.
This issue is not limited to the go command itself, and can also affect binaries
that use `os.Command`, `os.LookPath`, etc.
From the related blogpost (ttps://blog.golang.org/path-security):
> Are your own programs affected?
>
> If you use exec.LookPath or exec.Command in your own programs, you only need to
> be concerned if you (or your users) run your program in a directory with untrusted
> contents. If so, then a subprocess could be started using an executable from dot
> instead of from a system directory. (Again, using an executable from dot happens
> always on Windows and only with uncommon PATH settings on Unix.)
>
> If you are concerned, then we’ve published the more restricted variant of os/exec
> as golang.org/x/sys/execabs. You can use it in your program by simply replacing
This patch replaces all uses of `os/exec` with `golang.org/x/sys/execabs`. While
some uses of `os/exec` should not be problematic (e.g. part of tests), it is
probably good to be consistent, in case code gets moved around.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
There was recent changes to cri to bring in a Windows section containing a
security context object to the pod config. Before this there was no way to specify
a user for the pod sandbox container to run as. In addition, the security context
is a field for field mirror of the Windows container version of it, so add the
ability to specify a GMSA credential spec for the pod sandbox container as well.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
Exclude the `security.selinux` xattr when copying content from layer
storage for image volumes. This allows for the already correct label
at the target location to be applied to the copied content, thus
enabling containers to write to volumes that they implicitly expect to be
able to write to.
- Fixescontainerd/containerd#5090
- See rancher/rke2#690
Signed-off-by: Jacob Blain Christen <jacob@rancher.com>
Refactor shim v2 to load and register plugins.
Update init shim interface to not require task service implementation on
returned service, but register as plugin if it is.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Remove build tags which are already implied by the name of the file.
Ensures build tags are used consistently
Signed-off-by: Derek McGowan <derek@mcg.dev>
In containerd 1.5.x, we introduced support for go modules by adding a
go.mod file in the root directory. This go.mod lists all the things
needed across the whole code base (with the exception of
integration/client which has its own go.mod). So when projects that
need to make calls to containerd API will pull in some code from
containerd/containerd, the `go mod` commands will add all the things
listed in the root go.mod to the projects go.mod file. This causes
some problems as the list of things needed to make a simple API call
is enormous. in effect, making a API call will pull everything that a
typical server needs as well as the root go.mod is all encompassing.
In general if we had smaller things folks could use, that will make it
easier by reducing the number of things that will end up in a consumers
go.mod file.
Now coming to a specific problem, the root containerd go.mod has various
k8s.io/* modules listed. Also kubernetes depends on containerd indirectly
via both moby/moby (working with docker maintainers seperately) and via
google/cadvisor. So when the kubernetes maintainers try to use latest
1.5.x containerd, they will see the kubernetes go.mod ending up depending
on the older version of kubernetes!
So if we can expose just the minimum things needed to make a client API
call then projects like cadvisor can adopt that instead of pulling in
the entire go.mod from containerd. Looking at the existing code in
cadvisor the minimum things needed would be the api/ directory from
containerd. Please see proof of concept here:
github.com/google/cadvisor/pull/2908
To enable that, in this PR, we add a go.mod file in api/ directory. we
split the Protobuild.yaml into two, one for just the things in api/
directory and the rest in the root directory. We adjust various targets
to build things correctly using `protobuild` and also ensure that we
end up with the same generated code as before as well. To ensure we
better take care of the various go.mod/go.sum files, we update the
existing `make vendor` and also add a new `make verify-vendor` that one
can run locally as well in the CI.
Ideally, we would have a `containerd/client` either as a standalone repo
or within `containerd/containerd` as a separate go module. but we will
start here to experiment with a standalone api go module first.
Also there are various follow ups we can do, for example @thaJeztah has
identified two tasks we could do after this PR lands:
github.com/containerd/containerd/pull/5716#discussion_r668821396
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
systemd uses SIGRTMIN+n signals, but containerd didn't support the signals
since Go's sys/unix doesn't support them.
This change introduces SIGRTMIN+n handling by utilizing moby/sys/signal.
Fixes#5402.
https://www.freedesktop.org/software/systemd/man/systemd.html#Signals
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
Before this change, for several of the services that `WithServices`
handles, only the grpc client is supported.
Now, for instance, one can use an `images.Store` directly instead of
only an `imagesapi.StoreSlient`.
Some of the methods have been renamed to satisfy the difference between
using a grpc `<Foo>Client` vs the main interface.
I did not see a good candidate for TaskService so have left that mostly
unchanged.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
CNI plugins that need to wait for network state to converge
may want to cancel waiting when a short lived pod is deleted.
However, there is a race between when kubelet asks the runtime
to create the sandbox for the pod, and when the plugin is able
request the pod object from the apiserver. It may be the case
that the plugin receives the new pod, rather than the pod
the sandbox request was initiated for.
Passing the pod UID to the plugin allows the plugin to check
whether the pod it gets from the apiserver is actually the
pod its sandbox request was started for.
Signed-off-by: Dan Williams <dcbw@redhat.com>
Similar to other deferred cleanup operations, teardownPodNetwork should
use a different context as the original context may have expired,
otherwise CNI wouldn't been invoked, leading to leak of network
resources, e.g. IP addresses.
Signed-off-by: Quan Tian <qtian@vmware.com>
This change splits the definition of pkg/cri/os.ResolveSymbolicLink by
platform (windows/!windows), and switches to an alternate implementation
for Windows. This aims to fix the issue described in containerd/containerd#5405.
The previous implementation which just called filepath.EvalSymlinks has
historically had issues on Windows. One of these issues we were able to
fix in Go, but EvalSymlinks's behavior is not well specified on
Windows, and there could easily be more issues in the future, so it
seems prudent to move to a separate implementation for Windows.
The new implementation uses the Windows GetFinalPathNameByHandle API,
which takes a handle to an open file or directory and some flags, and
returns the "real" name for the object. See comments in the code for
details on the implementation.
I have tested this change with a variety of mounts and everything seems
to work as expected. Functions that make incorrect assumptions on what a
Windows path can look like may have some trouble with the \\?\ path
syntax. For instance EvalSymlinks fails when given a \\?\UNC\ path. For
this reason, the resolvePath implementation modifies the returned path
to translate to the more common form (\\?\UNC\server\share ->
\\server\share).
Signed-off-by: Kevin Parsons <kevpar@microsoft.com>
This commit adds support for the PID namespace mode TARGET
when generating a container spec.
The container that is created will be sharing its PID namespace
with the target container that was specified by ID in the namespace
options.
Signed-off-by: Thomas Hartland <thomas.george.hartland@cern.ch>
It does not make sense to check if seccomp is supported by the kernel
more than once per runtime, so let's use sync.Once to speed it up.
A quick benchmark (old implementation, before this commit, after):
BenchmarkIsEnabledOld-4 37183 27971 ns/op
BenchmarkIsEnabled-4 1252161 947 ns/op
BenchmarkIsEnabledOnce-4 666274008 2.14 ns/op
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Current implementation of seccomp.IsEnabled (rooted in runc) is not
too good.
First, it parses the whole /proc/self/status, adding each key: value
pair into the map (lots of allocations and future work for garbage
collector), when using a single key from that map.
Second, the presence of "Seccomp" key in /proc/self/status merely means
that kernel option CONFIG_SECCOMP is set, but there is a need to _also_
check for CONFIG_SECCOMP_FILTER (the code for which exists but never
executed in case /proc/self/status has Seccomp key).
Replace all this with a single call to prctl; see the long comment in
the code for details.
While at it, improve the IsEnabled documentation.
NOTE historically, parsing /proc/self/status was added after a concern
was raised in https://github.com/opencontainers/runc/pull/471 that
prctl(PR_GET_SECCOMP, ...) can result in the calling process being
killed with SIGKILL. This is a valid concern, so the new code here
does not use PR_GET_SECCOMP at all.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Looks like we had our own copy of the "getDevices" code already, so use
that code (which also matches the code that's used to _generate_ the spec,
so a better match).
Moving the code to a separate file, I also noticed that the _unix and _linux
code was _exactly_ the same (baring some `//nolint:` comments), so also
removing the duplicated code.
With this patch applied, we removed the dependency on the libcontainer/devices
package (leaving only libcontainer/user).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Move `pkg/cri/opts.WithoutRunMount` function to `oci.WithoutRunMount`
so that it can be used without dependency on CRI.
Also add `oci.WithoutMounts(dests ...string)` for generality.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Currently image references end up being stored in a
random order due to the way maps are iterated through
in Go. This leads to inconsistent identifiers being
resolved when a single reference is needed to identify
an image and the ordering of the references is used for
the selection.
Sort references in a consistent and ranked manner,
from higher information formats to lower.
Note: A `name + tag` reference is considered higher
information than a `name + digest` reference since a
registry may be used to resolve the digest from a
`name + tag` reference.
Signed-off-by: Derek McGowan <derek@mcg.dev>
golang has enabled RFC 6555 Fast Fallback (aka HappyEyeballs)
by default in 1.12.
It means that if a host resolves to both IPv6 and IPv4,
it will try to connect to any of those addresses and use the
working connection.
However, the implementation uses go routines to start both connections in parallel,
and this has limitations when running inside a namespace, so we try to the connections
serially, trying IPv4 first for keeping the same behaviour.
xref https://github.com/golang/go/issues/44922
Signed-off-by: Antonio Ojea <aojea@redhat.com>
- process.Init#io could be nil
- Make sure CreateTaskRequest#Options is not empty before unmarshaling
Signed-off-by: Kazuyoshi Kato <katokazu@amazon.com>
This enables cases where devices exist in a subdirectory of /dev,
particularly where those device names are not portable across machines,
which makes it problematic to specify from a runtime such as cri.
Added this to `ctr` as well so I could test that the code at least
works.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This will be used instead of the cri registry config in the main config
toml.
---
Also pulls in changes from containerd/cri@d0b4eecbb3
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
pkg/cap has the full list of the caps (for UT, originally),
so we can drop dependency on github.com/syndtr/gocapability
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>