This is implemented by touching a file on stop, via a hook in the systemd
unit. The ctime of this file is then used to derive the `finishedAt` time
later.
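A minimal sketch of such a stop hook, assuming a hypothetical marker path (`ExecStopPost` is the real systemd directive that runs after the service's main process exits):
```
[Service]
ExecStart=/usr/bin/rkt run ...
# Hypothetical marker file; its ctime records when the pod stopped.
ExecStopPost=/usr/bin/touch /var/lib/kubelet/pods/<pod-uid>/finished
```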
In addition, this changes the `startedAt` and `createdAt` to use the api
server's results rather than the annotations it previously used.
It's possible we might want to move this into the api in the future.
Fixes #23887
Automatic merge from submit-queue
Kubelet: Better-defined Container Waiting state
For issue #20478 and #21125.
This PR corrects the logic and adds unit tests for `ShouldContainerBeRestarted()`, cleans up `Waiting` state related code, and adds a unit test for `generateAPIPodStatus()`.
Fixes #20478. Fixes #17971.
@yujuhong
Automatic merge from submit-queue
rkt: Add pre-stop lifecycle hooks for rkt.
When a pod is being terminated, the pre-stop hooks of all the containers
will be run before the containers are stopped.
cc @yujuhong @Random-Liu @sjpotter
Automatic merge from submit-queue
Move predicates into library
This PR tries to implement #12744
Any suggestions/ideas are welcome. @davidopp
current state: integration test fails if including podCount check in Kubelet.
DONE:
1. refactor all predicates: predicates return fitOrNot(bool) and error(Error) in which the latter is of type
PredicateFailureError or InsufficientResourceError. (For violation of either MaxEBSVolumeCount or
MaxGCEPDVolumeCount, the same error type, ErrMaxVolumeCountExceeded, is returned.)
2. GeneralPredicates() is a predicate function, which includes several other predicate functions (PodFitsResource,
PodFitsHost, PodFitsHostPort). It is registered as one of the predicates in DefaultAlgorithmProvider, and
is also called in canAdmitPod() in Kubelet and should be called by other components (like rescheduler, etc)
if necessary. See discussion in issue #12744
3. remove podNumber check from GeneralPredicates
4. HostName is now verified in Kubelet's canAdmitPod(). Added TestHostNameConflicts in kubelet_test.go
5. add getNodeAnyWay() method in Kubelet to get node information in standaloneMode
TODO:
1. determine which predicates should be included in GeneralPredicates()
2. separate GeneralPredicates() into:
a. GeneralPredicatesEvictPod() and
b. GeneralPredicatesNotEvictPod()
3. DaemonSet should use GeneralPredicates()
Automatic merge from submit-queue
Kubelet: Start using the official docker engine-api
For #23563.
This is the **first step** in the roadmap of switching to docker [engine-api](https://github.com/docker/engine-api).
In this PR, I keep the old `DockerInterface` and implement it with the new engine-api.
With this approach, we can switch to engine-api with minimal change, so that we can:
* Test the engine-api without huge refactoring.
* Send follow-up PRs to refactor functions in `DockerInterface` separately, so as to avoid one huge change in a single PR.
I've tested this PR locally, and it passed all the node conformance tests:
```
make test_e2e_node
Ran 19 of 19 Specs in 823.395 seconds
SUCCESS! -- 19 Passed | 0 Failed | 0 Pending | 0 Skipped PASS
Ginkgo ran 1 suite in 13m49.429979585s
Test Suite Passed
```
And it also passed the Jenkins GCE e2e test:
```
go run hack/e2e.go -test -v --test_args="--ginkgo.skip=\[Slow\]|\[Serial\]|\[Disruptive\]|\[Flaky\]|\[Feature:.+\]"
Ran 161 of 268 Specs in 4570.214 seconds
SUCCESS! -- 161 Passed | 0 Failed | 0 Pending | 107 Skipped PASS
Ginkgo ran 1 suite in 1h16m16.325934558s
Test Suite Passed
2016/03/25 15:12:42 e2e.go:196: Step 'Ginkgo tests' finished in 1h16m18.918754301s
```
I'm writing the design document, and will post the switching roadmap in an umbrella issue soon.
@kubernetes/sig-node
Allow network plugins to declare that they handle shaping and that
Kubernetes should not. This will first be used by openshift-sdn, which
handles shaping through OVS, but that currently triggers a warning when
the kubelet notices the bandwidth annotations.
Add GeneratePodHostNameAndDomain() to RuntimeHelper to
get the hostname of the pod from kubelet.
Also update the logging flag to change the journal match from
_HOSTNAME to _MACHINE_ID.
The kubelet sync loop relies on getting one update as the signal that the
specific source is ready. This change ensures that we don't send multiple
updates (ADD, UPDATE) for the first batch of pods. This is required to prevent
the cleanup routine from killing pods prematurely.
PLEG is responsible for listing the pods running on the node. If it's hung
due to non-responsive container runtime or internal bugs, we should restart
kubelet.
cleanupTerminatedPods is responsible for checking whether a pod has been
terminated and forcing a status update to trigger the pod deletion. However, this
function is called in the periodic cleanup routine, which runs every 2 seconds.
In other words, it forces a status update for each non-running (and not yet
deleted in the apiserver) pod. When batch deleting tens of pods, the rate of
new updates surpasses what the status manager can handle, causing numerous
redundant requests (and the status channel to be full).
This change forces a status update only when detecting the DeletionTimestamp is
set for a terminated pod. Note that for other non-terminated pods, the pod
workers should be responsible for setting the correct status after killing all
the containers.
* Metrics will not be exposed until they are hooked up to a handler
* Metrics are not cached and expose a DoS vector; this must be fixed before release, or the stats should not be exposed through an api endpoint
Kubelet.cleanupOrphanedVolumes() compares list of volumes mounted to a node
with list of volumes that are required by pods scheduled on the node
("scheduled volume").
Both lists should contain real volumes, i.e. when a pod uses
PersistentVolumeClaim, the list must contain name of the bound volume instead
of name of the claim.
This change removes RuntimeCache in the pod workers and the syncPod() function.
Note that it doesn't deprecate RuntimeCache completely as other components
still rely on the cache.
Many users attempt to use 'kubectl logs' in order to find the logs
for a container, but receive no logs or an error telling them their
container is not running. The fix in this case is to run with '--previous',
but this does not match user expectations for the logs command.
This commit changes the behavior of the Kubelet to return the logs of
the currently running container or the previous running container unless
the user provides the "previous" flag. If the user specifies "follow"
the logs of the most recent container will be displayed, and if it is
a terminated container the logs will come to an end (the user can
repeatedly invoke 'kubectl logs --follow' and see the same output).
Clean up error messages in the kubelet log path to be consistent and
give users a more predictable experience.
Have the Kubelet return 400 on invalid requests
- Ignore the "not found" error on deletion.
- Recognize the "already exists" error on creation and check if the existing
pod meets requirement. If so, don't report an error.
- Immediately create a mirror pod after a successful deletion, if needed.
This addresses a TODO when collecting the node version information so it
will properly report the configured runtime and its version. Previously,
this was hardcoded to "docker://" and the docker version, and would show
"docker://1.9.1" even when the kubelet was configured to use rkt.
With this change, it will use the runtime's Type() and Version() data.
This also changes the container.Runtime interface to add an APIVersion()
method. This can be used when the runtime has separate versions for the
engine and the API, such as with Docker. The Docker minimum version
validation has been updated to use APIVersion(), and
DockerManager.Version() now returns the engine version.
Add `kube-reserved` and `system-reserved` flags for configuring
reserved resources for usage outside of kubernetes pods. Allocatable is
provided by the Kubelet according to the formula:
```
Allocatable = Capacity - KubeReserved - SystemReserved
```
Also provides a method for estimating a reasonable default for
`KubeReserved`, but the current implementation is probably low and needs
more tuning.
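A minimal sketch of the allocatable computation given by the formula above (not the actual Kubelet code):
```go
// allocatable applies Allocatable = Capacity - KubeReserved - SystemReserved
// for a single resource, clamping at zero.
package main

import "fmt"

func allocatable(capacity, kubeReserved, systemReserved int64) int64 {
	a := capacity - kubeReserved - systemReserved
	if a < 0 {
		a = 0
	}
	return a
}

func main() {
	// e.g. an 8 GiB node with 512 MiB reserved for kube and 512 MiB for the system.
	fmt.Println(allocatable(8<<30, 512<<20, 512<<20))
}
```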
Kubelet doesn't perform checkpointing and loses all its internal states after
restarts. It'd then mistake pods from the api server as new pods and attempt
to go through the admission process. This may result in pods being rejected
even though they are running on the node (e.g., out of disk situation). This
change adds a condition to check whether the pod was seen before and categorize
such pods as updates. The change also removes freeze/unfreeze mechanism used to
work around such cases, since it is no longer needed and it stopped working
correctly ever since we switched to incremental updates.
There has been a recent regression causing kubelet to assume no containers are
running for the pod if kubelet has not seen the pod before. This would cause
all containers to be restarted after kubelet gets restarted. This change fixes
the bug.
Implement a flag that defines the frequency at which a node's out of
disk condition can change its status. Use this flag to suspend out of
disk status changes in the time period specified by the flag, after
the status is changed once.
Set the flag to 0 in e2e tests so that we can predictably test out of
disk node condition.
Also, use util.Clock interface for all time related functionality in
the kubelet. Calling time functions in unversioned package or time
package such as unversioned.Now() or time.Now() makes it really hard
to test such code. It also makes the tests flaky and sometimes
unnecessarily slow due to time.Sleep() calls used to simulate the
time elapsed. So use util.Clock interface instead which can be faked
in the tests.
Refactor Kubelet's server functionality into a server package. Most
notably, move pkg/kubelet/server.go into
pkg/kubelet/server/server.go. This will lead to better separation of
concerns and a more readable code hierarchy.
The formatting function is used often in logging. This improves the readability
by shortening the length of the call. Also change the formatted string to
include the pod UID.
Before this change we have a mish-mash of ways to pass field names around for
error generation. Sometimes string fieldnames, sometimes .Prefix(), sometimes
neither, often wrong names or not indexed when it should be.
Instead of that mess, this is part one of a couple of commits that will make it
more strongly typed and hopefully encourage correct behavior. At least you
will have to think about field names, which is better than nothing.
It turned out to be really hard to do this incrementally.
Addresses a version skew issue where the last condition status is always
evaluated as the NodeReady status. As a workaround force the NodeReady
condition to be the last in the list of node conditions.
ref: https://github.com/kubernetes/kubernetes/issues/16961
Set resyncInterval to one minute now that we rely on the generic pleg to trigger
pod syncs on container events. When there is an error during syncing, pod
workers need to wake up sooner to retry. Set the sync error backoff period to
10 seconds in this case.
This change introduces pod lifecycle event generator (PLEG), and adds a generic
PLEG. The generic PLEG relies on relisting to discover container events, and is
container-runtime-agnostic. Both docker and rkt are changed to use generic
PLEG.
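An illustrative sketch of relist-based event generation, with hypothetical names (not the actual generic PLEG code): diff the container states seen by two consecutive relists and emit events for the transitions.
```go
package pleg

type state string

const (
	running state = "running"
	exited  state = "exited"
)

type event struct {
	podID, containerID, kind string
}

// diff compares the previous and current relist results for one pod and
// emits the corresponding lifecycle events.
func diff(podID string, old, cur map[string]state) []event {
	var events []event
	for id, s := range cur {
		switch {
		case s == running && old[id] != running:
			events = append(events, event{podID, id, "ContainerStarted"})
		case s == exited && old[id] == running:
			events = append(events, event{podID, id, "ContainerDied"})
		}
	}
	return events
}
```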
+ Fix #14992
+ "When deploying a pod using an on-disk kubelet manifest (a la /etc/kubernetes/manifests), it appears that the network plugin setUpPod is notified of the new pod before the apiserver."
Push status updates as soon as readiness state changes for containers,
rather than waiting for the sync loop to update the status. In
particular, this should help new containers to come online faster.
Additionally, consolidates prober test helpers into a single file.
This commit fixes getting the logs from complete/failed pods after
a kubelet restart by falling back to the api server in case we fail
to resolve the pod status using the status cache.
- PeriodSeconds - How often to probe
- SuccessThreshold - Number of successful probes to go from failure to success state
- FailureThreshold - Number of failing probes to go from success to failure state
This commit includes two changes in behavior:
1. InitialDelaySeconds now defaults to 10 seconds, rather than the
kubelet sync interval (although that also defaults to 10 seconds).
2. Prober only retries on probe error, not failure. To compensate, the
default FailureThreshold is set to the maxRetries, 3.
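A hedged sketch of the threshold behavior described above (not the prober's actual code): a probe result only flips the reported state after SuccessThreshold consecutive successes or FailureThreshold consecutive failures.
```go
package main

import "fmt"

type prober struct {
	successThreshold, failureThreshold int
	successes, failures                int
	healthy                            bool
}

// record counts consecutive results and flips state at the thresholds.
func (p *prober) record(success bool) {
	if success {
		p.successes++
		p.failures = 0
		if p.successes >= p.successThreshold {
			p.healthy = true
		}
	} else {
		p.failures++
		p.successes = 0
		if p.failures >= p.failureThreshold {
			p.healthy = false
		}
	}
}

func main() {
	p := &prober{successThreshold: 1, failureThreshold: 3, healthy: true}
	for _, ok := range []bool{false, false, false} {
		p.record(ok)
	}
	fmt.Println(p.healthy) // false only after 3 consecutive failures
}
```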
This lays the groundwork for simple multizone capabilities.
In a cloud environment, nodes are typically created by the kubelet
registering with the API server. When creating a new node, we now query
the cloudprovider to see if it can provide Zone information, and if so
we add some well-known labels to the Node we are creating.
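A sketch of the labeling step; `cloudprovider.Zone` is the upstream struct, but the exact label keys shown here are an assumption from memory and may differ.
```go
package kubelet

import "k8s.io/kubernetes/pkg/cloudprovider"

// zoneLabels maps the cloudprovider's zone info to well-known node labels.
func zoneLabels(zone cloudprovider.Zone) map[string]string {
	return map[string]string{
		"failure-domain.beta.kubernetes.io/zone":   zone.FailureDomain,
		"failure-domain.beta.kubernetes.io/region": zone.Region,
	}
}
```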
- status.Manager always deals with the local (static) pod, but gets the
mirror pod when syncing
- This lets components like the probe workers ignore mirror pods
Now that kubelet correctly checks which sources have been seen, there is no need to enforce the
initial order of pod updates and housekeeping. Use a ticker for housekeeping to
simplify the code.
Currently kubelet syncs all pods every 10s. This is not preferred because
* Some pods may have been sync'd recently.
* This may cause all the pods to be sync'd at once, causing undesirable
CPU spikes.
This PR replaces the global syncs with independent, periodic pod syncs. At the
end of syncing, each pod worker will enqueue itself with a future timestamp (
current time + sync interval), when it will be due for another sync.
* If the pod worker encounters a sync error, it may requeue with a different
timestamp to retry sooner.
* If a sync is triggered by the update channel (events or spec changes), the
pod worker would enqueue a new sync time.
This change is necessary for moving to long or no periodic sync period once pod
lifecycle event generator is completed. We will still rely on the mechanism to
requeue the pod on sync error.
This change also makes sure that if a sync does not succeed (either due to
real error or the per-container backoff mechanism), an error would be propagated
back to the pod worker, which is responsible for requeuing.
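A minimal sketch of the requeuing described above (names hypothetical, not the actual pod worker code):
```go
package kubelet

import "time"

// workQueue tracks when each pod is next due for a sync.
type workQueue struct {
	due map[string]time.Time // pod UID -> next sync time
}

func newWorkQueue() *workQueue {
	return &workQueue{due: make(map[string]time.Time)}
}

func (q *workQueue) enqueue(uid string, delay time.Duration) {
	q.due[uid] = time.Now().Add(delay)
}

// afterSync requeues a pod once its worker finishes a sync attempt: errors
// retry on a short backoff, successes wait a full sync interval.
func (q *workQueue) afterSync(uid string, syncErr error, syncInterval, backoff time.Duration) {
	if syncErr != nil {
		q.enqueue(uid, backoff) // e.g. 10s
	} else {
		q.enqueue(uid, syncInterval) // e.g. 1m
	}
}
```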
Define a new out of disk node condition and use it to report when node
goes out of disk.
Make a copy of loop range clause variable in node listers so that it
is available outside the for loop.
Also update/implement unit tests.
This commit builds on previous work and creates an independent
worker for every liveness probe. Liveness probes behave largely the same
as readiness probes, so much of the code is shared by introducing a
probeType parameter to distinguish the type when it matters. The
circular dependency between the runtime and the prober is broken by
exposing a shared liveness ResultsManager, owned by the
kubelet. Finally, an Updates channel is introduced to the ResultsManager
so the kubelet can react to unhealthy containers immediately.
Change all references to the container ID in pkg/kubelet/... to the
strong type defined in pkg/kubelet/container: ContainerID
The motivation for this change is to make the format of the ID
unambiguous, specifically whether or not it includes the runtime
prefix (e.g. "docker://").
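A sketch of the strong ID type, modeled on pkg/kubelet/container's ContainerID: the runtime prefix is an explicit field rather than being encoded in a bare string.
```go
package container

type ContainerID struct {
	Type string // runtime name, e.g. "docker"
	ID   string // runtime-specific container identifier
}

// String renders the unambiguous prefixed form, e.g. "docker://abc123".
func (c ContainerID) String() string {
	return c.Type + "://" + c.ID
}
```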
The current implementation considers a source seen when it receives a SET at
kubelet/config/config.go. However, the main kubelet sync loop may not have
received the pod update from the source via the channel. This change ensures
that kubelet considers all sources ready only after the sync loop has
seen all of them.
Each container with a readiness probe has an individual goroutine which
handles periodic probing for that container. The results are cached, and
written to the status.Manager in the pod sync path.
The network configuration error message set while updating Kubelet status was
being written to the "reasons" slice. Write this message to the "messages"
slice instead.
Also remove "reasons" slice entirely since it is not used anywhere.
Now that kubelet has switched to incremental updates, it has complete
information of the pod update type (create, update, sync). This change pipes
this information to pod workers so that they don't have to derive the type
again.
Increase the supported controls on pod logging. Add validation to pod
log options. Ensure the Kubelet is using a consistent, structured way to
process pod log arguments.
Add ?sinceSeconds=<durationInSeconds>, ?sinceTime=<RFC3339>, ?timestamps=<bool>,
?tailLines=<number>, and ?limitBytes=<number>.
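For example, against the kubelet's container log endpoint (path shown for illustration):
```
GET /containerLogs/<namespace>/<pod>/<container>?tailLines=100&timestamps=true&sinceSeconds=3600
```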
In many cases clients may wish to view not ready addresses for endpoints
in order to do set membership prior to a pod being ready. For instance,
a pod that uses the service endpoints to connect to other pods under
the same service, but does not want to signal ready before it has
contacted at least a minimal number of other pods.
This is backwards compatible with old servers and clients. There is
an additional cost in size of endpoints before services ramp up, which
will add minor CPU and memory use for services that have a significant
number of pods which have not become ready.
The methods for registering a node and syncing node status to the apiserver
have grown large enough that it makes sense for them to live in a separate
place. This change adds a nodeManager to handle such interaction with the
apiserver.
This refactor is in preparation for moving more state handling to the
status manager. It will become the canonical cache for the latest
information on running containers and probe status, as part of the
prober refactoring.
1. Make the reason field of StatusReport objects in kubelet CamelCase.
2. Add a Message field to ContainerStateWaiting to describe details about the Reason.
3. Make the reason field of Events in kubelet CamelCase.
4. Update swagger, deep-copy, and so on.
A lot of packages use StringSet, but they don't use anything else from
the util package. Moving StringSet into another package will shrink
their dependency trees significantly.
Both GetNode and the cache.ListWatch listfunc in the
kubelet package call List unnecessarily.
GetNodeInfo is sufficient for GetNode and makes looping
through a list of nodes to check for a matching name
unnecessary.
resolves #13476
1. Add EventRecordQps and EventBurst parameters in kubelet.
2. If EventRecordQps and EventBurst are set, rate limit events in kubelet
with an independent ratelimiter as configured.
PR #13293 added a safety check to not remove a pod directory if the child
volumes directory is not empty. This logic is faulty because kubelet may have
directory structures podUID/volumes/volumeKind/volumeName. E.g.,
`056db95d-50ee-11e5-a2e4-42010af0ba1d/volumes/kubernetes.io~empty-dir/default-token-al3r2`
This change fixes that by properly listing all volumes under a pod.
Before, kubelet performed global cleanup tasks every iteration. After
PR #13003, kubelet performs the tasks on every sync interval (10 seconds).
This PR decouples the housekeeping period from the sync interval to ensure
that kubelet cleans up promptly, while not too often (no more than once every
minimum housekeeping period).
Allow the user to specify the resolver configuration file that is used
to determine the default DNS parameters. This defaults to the system's
/etc/resolv.conf.
Currently, whenever there is any update, kubelet would force all pod workers to
sync again, causing resource contention and hence performance degradation.
This commit flips kubelet to use incremental updates (as opposed to snapshots).
This allows us to know what pods have changed and send updates to those pod
workers only. The `SyncPods` function has been replaced with individual
handlers, each handling an operation (ADD, REMOVE, UPDATE). Pod workers are
still triggered periodically, and kubelet performs periodic cleanup as well.
This commit also spawns a new goroutine solely responsible for killing pods.
This is necessary because pod killing could hold up the sync loop for an
indefinitely long time, now that users can define the graceful termination
period in the container spec.
We chose to use podFullName (name_namespace) as key in the status manager
because mirror pod and static pod share the same status. This is no longer
needed because we do not store statuses for static pods anymore (we only
store statuses for their mirror pods). Also, previously, a few fixes were
merged to ensure statuses are cleaned up so that a new pod with the same
name would not reuse an old status.
This change cleans up the code by using UID as key so that the code would
become less brittle.
The sync loop should check for terminated pods that are no longer
running and clear them. The status loop should never write status
if the pod UID changes. Mirror pods should be deleted immediately
rather than gracefully.
Avoid TTL by deleting pods immediately when they aren't
scheduled, and letting the Kubelet delete them otherwise.
Ensure the Kubelet uses pod.Spec.TerminationGracePeriodSeconds
when no pod.DeletionGracePeriodSeconds is available.
Getting the public IP a container is supposed to use is O(hard),
and usually involves ugly gyrations in python or with interfaces.
Using the downward API means that the IP Kube is announcing to
other endpoints is also visible inside the container for pods to
identify themselves.
Eventually we would like to replace the all-encompassing SyncPods function with
more well-defined, smaller functions. This would not only help with the
readability and profiling of the code, it'd also set in motion the plans to
trigger pod workers individually based on the content of the pod updates.
This commit serves as the first step of that, while avoiding breaking all unit
tests by preserving the SyncPods function for the time being.
/runningpods returns a list of pods currently running on the kubelet. The list
is composed by examining the container runtime, and may be different from the
desired pods to run known by kubelet.
This is useful for tests to verify that pods are indeed deleted on nodes.
Add a new latency metric for the time from seeing the pod for the first time
to starting a pod worker for it.
Also, change PodStartLatency to include this initial processing latency.
Refactor GetNodeHostIP into pkg/util/node (instead of pkg/util to break import cycle).
Include internalIP in gce NodeAddresses. Remove NodeLegacyHostIP
Fixes #8569.
This requires the DNS server to be running kube2sky v1.6 or higher (part of
release 0.18). Users with older kube2sky MUST NOT update to this kubelet until
they upgrade DNS. Versions of kube2sky >= 1.6 support both old and new style
names. Old style names are deprecated and will be removed around the time of
kubernetes v1.0 release.
This commit wires together the graceful delete option for pods
on the Kubelet. When a pod is deleted on the API server, a
grace period is calculated that is based on the
Pod.Spec.TerminationGracePeriodSeconds, the user's provided grace
period, or a default. The grace period can only shrink once set.
The value provided by the user (or the default) is set onto metadata
as DeletionGracePeriod.
When the Kubelet sees a pod with DeletionTimestamp set, it uses the
value of ObjectMeta.GracePeriodSeconds as the grace period
sent to Docker. When updating status, if the pod has DeletionTimestamp
set and all containers are terminated, the Kubelet will update the
status one last time and then invoke Delete(pod, grace: 0) to
clean up the pod immediately.
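A sketch of the grace period selection described above, assuming the fields are plain `*int64` pointers (not the actual Kubelet code): the pod's deletion grace period wins once set, then the spec's value, then a default.
```go
package kubelet

// gracePeriod picks the effective termination grace period in seconds.
func gracePeriod(deletionGrace, specGrace *int64, defaultGrace int64) int64 {
	switch {
	case deletionGrace != nil:
		return *deletionGrace // set by the API server on delete
	case specGrace != nil:
		return *specGrace // Pod.Spec.TerminationGracePeriodSeconds
	default:
		return defaultGrace
	}
}
```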
Add support for pluggable Docker exec handlers. The default handler is
now Docker's native exec API call. The previous default, nsenter, can be
selected by passing --docker-exec-handler=nsenter when starting the
kubelet.
This generalizes the handling of containers in the
ContainerManager.
Also introduces the ability to determine how much
resources are reserved for those system containers.
The system container is a resource-only container which contains all
non-kernel processes that are not already part of a container. This will
allow monitoring of their resource usage and limiting it (eventually).
This patch substitutes the misleading reason "unknown" for the event
recording. For symmetry with kubelet's message "online" the conditions
Unknown and False are reported as "offline".
Signed-off-by: Federico Simoncelli <fsimonce@redhat.com>
- Delete nodes when they are no longer ready and don't exist in the
cloud provider.
- Label each node with its hostname.
- Add flag to skip node registration.
- Add a test for registering an existing node.
We recently changed `SyncPods` to filter out terminated pods at the beginning
for two reasons:
* performance: kubelet no longer keeps goroutines to check containers for
terminated pods.
* correctness: kubelet relies on inspecting dead containers to generate
pod status. Because dead containers may get garbage collected and
kubelet does not have checkpoints yet, syncing terminated pod could
lead to modifying the status of a terminated pod.
However, even though kubelet should not *sync* the terminated pods, it
should not attempt to remove the directories and volumes for such
pods as long as they have not been deleted. This change fixes aggressive
directory removal by passing all pods (including terminated pods) to the
cleanup functions.
Per-pod workers have sufficient knowledge to determine whether a pod has
exceeded the active deadline, and they set the status at the end of each sync.
Move the active deadline check to generatePodStatus so that per pod workers
can update the pod status directly. This eliminates the possibility of a race
condition where both SyncPods and the pod worker are updating the status, which
could lead to temporary erratic pod status behavior (pod phase: failed ->
running -> failed).
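A sketch of the active deadline check that moves into status generation (assuming `startTime` is the pod's recorded start time from the status manager):
```go
package kubelet

import "time"

// pastActiveDeadline reports whether a pod has run longer than its
// ActiveDeadlineSeconds, in which case the worker marks it failed.
func pastActiveDeadline(startTime time.Time, activeDeadlineSeconds *int64) bool {
	if activeDeadlineSeconds == nil || startTime.IsZero() {
		return false
	}
	deadline := time.Duration(*activeDeadlineSeconds) * time.Second
	return time.Since(startTime) >= deadline
}
```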
Pod statuses are periodically written to the status manager, and status
manager sets the start time of the pod. All non-status-modifying code should
perform cache lookup and should not attempt to generate pod status on its own.
Currently, kubelet doesn't filter out terminated pods before determining whether
a pod fits. This could lead to duplicated events for rejecting the pods. This
change fixes that.
This change also groups all related pod fitness checking functions into one
function to improve readability.
Kubelet will stop accepting new pods if it detects low disk space on root fs or fs holding docker images.
Running pods are not affected. low-diskspace-threshold-mb is used to configure the low diskspace threshold.
This change instructs kubelet to switch to using the Runtime interface. In order
to do it, the change moves the Prober instantiation to DockerManager.
Note that most of the tests in kubelet_test.go needs to be migrated to
dockertools. For now, we use type assertion to convert the Runtime interface to
DockerManager in most tests.
This change is part of the efforts to make DockerManager implement the Runtime
interface.
The change also modifies the interface slightly to work with existing
code, and aggregates the type converting functions to convert.go.
This change removes docker-specifc code in killUnwantedPods. It
also instructs the cleanup code to move away from interacting with
containers directly. They should always deal with the pod-level
abstraction if at all possible.
We must not clear the pod directory in killUnwantedPods(), volumes are still
mounted there at this time. There already is cleanupOrphanedPodDirs(),
called later in the SyncPods() sequence, which should remove these pod
directories.
This moves Docker-specific logic there and allows it to align with the
runtime API. There is still a pod infra container reference in the
function due to network plugins. We can handle this in the Kubelet since
we'll need to be explicit in stating that the network plugin will not
work in a non-Docker runtime.
Once a pod reaches a terminated state (whether failed or succeeded), it should
not transit out ever again. Currently, kubelet relies on examining the dead
containers to verify that the container has already been run. This is fine
in most cases, but if the dead containers were garbage collected, kubelet may
falsely conclude that the pod has never been run. It would then try to restart
all the containers.
This change eliminates most of such possibilities by pre-filtering out the pods
in the final states before sending updates to per-pod workers.
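A sketch of that pre-filtering step (assuming the era's `pkg/api` types): pods already in a terminal phase are never handed to per-pod workers again.
```go
package kubelet

import "k8s.io/kubernetes/pkg/api"

// filterTerminatedPods drops pods whose phase is final before dispatch.
func filterTerminatedPods(pods []*api.Pod) []*api.Pod {
	var active []*api.Pod
	for _, pod := range pods {
		if pod.Status.Phase == api.PodFailed || pod.Status.Phase == api.PodSucceeded {
			continue // terminal phases never transition out; do not sync
		}
		active = append(active, pod)
	}
	return active
}
```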
Remove GetDockerServerVersion() from DockerContainerCommandRunner interface,
replaced with runtime.Version(). Also added a Version type in runtime for version
comparison.
Kubelet kills unwanted pods in SyncPods, which directly impacts the latency of a
sync iteration. This change parallelizes the cleanup to lessen the effect.
Eventually, we should leverage per-pod workers for cleanup, with the exception
of truly orphaned pods.
Use go-dockerclient's APIVersion to check the minimum required Docker
version, as it contains methods for parsing the ApiVersion response from
the Docker daemon and for comparing 2 APIVersion objects.
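A hedged example of that comparison with fsouza/go-dockerclient's APIVersion:
```go
package main

import (
	"fmt"

	docker "github.com/fsouza/go-dockerclient"
)

func main() {
	// Parse the minimum required version and the daemon's reported version.
	minVersion, _ := docker.NewAPIVersion("1.18")
	daemon, _ := docker.NewAPIVersion("1.21")
	// GreaterThanOrEqualTo compares the dotted components numerically.
	fmt.Println(daemon.GreaterThanOrEqualTo(minVersion)) // true
}
```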
Remove kubelet.getPodInfraContainer().
Remove dockertools.RemoveContainerWithID().
Remove dockertools.FindContainersByPod().
Also replace the useless test with a test for GetPods().
Container creation/start failure cannot be reproduced by inspecting the
containers. This change caches such errors so that kubelet can retrieve them
later.
This change also extends FakeDockerClient to support setting error response
for a specific function.
If a static pod changes, delete the corresponding mirror pod. When kubelet
could not see the mirror pod from the API server update, it'd attempt to create a
new mirror pod with up-to-date specs.
We want to stop leaking more docker details into kubelet, and we also want to
consolidate some of the existing docker interfaces/structs. This change creates
DockerManager as the new home of some functions in dockertools/docker.go. It
also absorbs containerRunner. In addition, GetDockerPodStatus is renamed to
GetPodStatus with the entire pod passed to it so that it is similar to what
is defined in the container Runtime interface.
Eventually, DockerManager should implement the container Runtime interface, and
integrate DockerCache with a flag to turn on/off caching. Code in kubelet.go
should not be using docker client directly.
* Improper format specifier (e.g. %s for bools or %s for ints)
* More or less parameters than format specifiers
* Not calling a formatting function when it should have (e.g. Error() instead of Errorf())
This change cleans up the pod manager extensively so that
* Mirror pods are actually stored in the pod manager.
* Both (non-mirror) pods and mirror pods are indexed by UID and full name for
easy lookup and mapping. This is required for the next change to send
full pod along with the pod status update.
This change also renames mirrorManager as mirrorClient since it is merely a
client to contact the API server and create/delete mirror pods.
This change moves pod array and mirrorPods into podManager, along with all
methods accessing these internal pod storages. This is the first step of the
refactoring, and no function change is involved.
Functions Build/ParseDockerName now work with a struct instead of the long
list of arguments. This new struct is also reused in kubelet.go
instead of the auxiliary podContainer struct.
Kubelet supports retrieving stats for pods/containers with and without UID.
This does not always work for the static pods because users may get the UIDs of
the mirror pods from the API server, and use them to query Kubelet. In this
case, Kubelet would fail to locate the containers due to mismatched UIDs.
This change adds an internal mirror-to-static pod UID mapping and teaches all
public-facing functions to perform UID lookup before proceeding. This allows
users to use either mirror or static pod's UID to retrieve stats.
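A sketch of that translation, modeled loosely on the kubelet's pod manager but simplified (names are illustrative):
```go
package kubelet

import "k8s.io/kubernetes/pkg/types"

type podManager struct {
	// mirror pod UID -> static pod UID
	translationByUID map[types.UID]types.UID
}

// translatePodUID returns the static pod UID for a mirror pod UID, or the
// input unchanged when no mapping exists.
func (pm *podManager) translatePodUID(uid types.UID) types.UID {
	if static, ok := pm.translationByUID[uid]; ok {
		return static
	}
	return uid
}
```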
Per-pod worker syncs the pod and container status, and writes the pod status in
the pod status cache. Given that it already owns a copy of the pod, it can
bypass the additional pod lookup step completely. This change adds a new
generatePodStatusByPod() method to achieve this. In general, per-pod worker
should avoid accessing the internal pod array completely, as this may
lead to high contention.
This change also changes the return type of GetPodByFullName to reflect the
name, and consolidates GetPodByFullName() and GetPodByName().
Integrated the imageManager into the Kubelet; it applies the garbage
collection policy every 5 minutes. The default policy allows up to 90%
disk usage, after which images are garbage collected to bring limit back
down to 80%.
Fixes #157.
Currently, API server is not aware of the static pods (manifests from
sources other than the API server, e.g. file and http) at all. This is
inconvenient since users cannot check the static pods through kubectl.
It is also sub-optimal because scheduler is unaware of the resource
consumption by these static pods on the node.
This change syncs the information back to the API server by creating a
mirror pod via API server for each static pod.
- Kubelet creates containers for the static pod, as it would do
normally.
- If a mirror pod gets deleted, Kubelet will re-create one. The
containers are sync'd to the static pods, so they will not be
affected.
- If a static pod gets removed from the source (e.g. manifest file
removed from the directory), the orphaned mirror pod will be deleted.
Note that because events are associated with UID, and the mirror pod has
a different UID than the original static pod, the events will not be
shown for the mirror pod when running `kubectl describe pod
<mirror_pod>`.
These containers may be caused by a change in the Kubernetes naming
convention. The old containers are killed, the new ones started, but the
old ones are never GC'd. This change makes Kubelet GC all Kubernetes
containers, old and new.
Fixes #5372.
This will make it easier to start running the real cAdvisor alongside
Kubelet. This change is primarily no-op refactoring. The main behavioral
change is that we always create a cAdvisor interface and expect it to
always be available. When we make a request, if cAdvisor is not
connected the request fails with a connection error. This failure is
handled today as well.
The Kubelet assumes Docker is running during its execution and on
machine boot it is a race between Docker coming up and Kubelet calling
Docker. This PR waits for Docker to be up before the Kubelet begins
doing useful work. On timeout, Kubelet exits and expects to be
restarted.
There are two main goals for this change.
1. Fix the naming scheme in kubelet so that it accepts DNS subdomain
name/namespaces correctly (#4920). The design is discussed in #3453.
2. Prepare for syncing the static pods back to the apiserver(#4090). This
includes
- Eliminate the source component in the internal full pod name (#4922). Pods
no longer need sources as they will all be sync'd via apiserver.
- Changing the naming scheme for the static (file-, http-, and etcd-based)
pods such that they are distinguishable when syncing back to the apiserver.
The changes includes:
* name = <pod.Name>-<hostname>
* namespace = <cluster_namespace> (i.e. "default" for now).
* container_name = k8s_<container_name>.<hash_of_container>_<pod_name>_<namespace>_<uid>_<random>
Note that this is not backward-compatible, meaning the kubelet won't recognize
existing running containers using the old naming scheme.
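An illustrative sketch of assembling the new container name; the hash and the random suffix generation are elided, and the helper name is hypothetical.
```go
package dockertools

import "fmt"

// buildContainerName joins the naming scheme's fields with the separators
// described above: k8s_<container>.<hash>_<pod>_<namespace>_<uid>_<random>.
func buildContainerName(containerName string, hash uint64, podName, namespace, uid, random string) string {
	return fmt.Sprintf("k8s_%s.%x_%s_%s_%s_%s", containerName, hash, podName, namespace, uid, random)
}
```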
When a host port conflict is detected, kubelet should set the pod status to
fail. The failed status will then be polled by other components at a later time,
which allows replication controller to create a new pod if necessary.
To achieve this, this change stores the pod status information in a status map
upon the detection of a port conflict. GetPodStatus() consults this status map
before attempting to query docker. The entries in the status map will be removed
when the pod is no longer associated with the node.
Currently, kubelet silently ignores pods that caused host port conflict. This
commit surfaces the error by recording an event.
It also makes sure that kubelet iterates through the pods in the order of the
creation timestamp, which ensures that pods created later are ignored on
conflict.
During the kubelet's /healthz response, check whether the hostname used by
the master matches the hostname the kubelet knows itself by. If not, fail
the health check.
Signed-off-by: Sami Wagiaalla <swagiaal@redhat.com>
Setting the oom_score_adj of the PID of the POD container to -100 which is less
than the default of 0. This ensures that this PID is the last OOM victim
chosen by the kernel.
Fixes #3067.
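A sketch of applying the adjustment via procfs; in the kubelet the pod infra container's PID would be plugged in rather than the process's own.
```go
package main

import (
	"fmt"
	"os"
)

// applyOOMScoreAdj writes value to /proc/<pid>/oom_score_adj.
func applyOOMScoreAdj(pid, value int) error {
	path := fmt.Sprintf("/proc/%d/oom_score_adj", pid)
	return os.WriteFile(path, []byte(fmt.Sprintf("%d", value)), 0644)
}

func main() {
	// -100 makes the process a later OOM victim than the default of 0.
	if err := applyOOMScoreAdj(os.Getpid(), -100); err != nil {
		fmt.Println(err)
	}
}
```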