Automatic merge from submit-queue
Add hooks for cluster health detection
Separate out a function that decides whether a zone is healthy. First real commit toward preventing massive pod eviction.
Ref. #28832
cc @davidopp
Automatic merge from submit-queue
Change storeToNodeConditionLister to return []*api.Node instead of api.NodeList for performance
Currently, the copies made while creating api.NodeList are a significant part of the scheduler profile, and a number of them are made in places that are not parallelizable.
Ref #28590
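For illustration, a minimal self-contained sketch of the shape of this change (stand-in types, not the real lister): returning `[]*api.Node` hands out shared pointers, while building an `api.NodeList`-style value copies every node.

```go
package main

import "fmt"

// Node stands in for api.Node; the real struct is large, which is
// what makes value copies expensive on the scheduler's hot path.
type Node struct {
	Name   string
	Labels map[string]string
	// ... many more fields in the real api.Node
}

// listByValue models the old api.NodeList-style return: every call
// copies every Node value out of the cache.
func listByValue(cache []*Node) []Node {
	out := make([]Node, 0, len(cache))
	for _, n := range cache {
		out = append(out, *n) // full struct copy per node
	}
	return out
}

// listByPointer models the new []*api.Node return: only pointers are
// copied and the cached objects are shared (callers must not mutate).
func listByPointer(cache []*Node) []*Node {
	out := make([]*Node, 0, len(cache))
	out = append(out, cache...) // copies len(cache) pointers
	return out
}

func main() {
	cache := []*Node{{Name: "node-1"}, {Name: "node-2"}}
	fmt.Println(len(listByValue(cache)), len(listByPointer(cache)))
}
```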
Automatic merge from submit-queue
Add test case to TestPodFitsResources() of scheduler algorithm
File "plugin\pkg\scheduler\algorithm\predicates", function "TestPodFitsResources()", line 199, only provide test case "one resource cpu fits but memory not", it should add test case "one resource memory fits but cpu not".
Automatic merge from submit-queue
Optimize priorities in scheduler
Ref #28590
It's probably easier to review it commit by commit, since those changes are mostly independent of each other.
@davidopp - FYI
Automatic merge from submit-queue
Add meta field to predicate signature to avoid computing the same things multiple times
This PR only uses it to avoid recomputing the QoS of a pod from scratch for every node.
Ref #28590
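A minimal sketch of the pattern (illustrative names, not the real signature): compute per-pod values once into a metadata struct and hand it to every per-node predicate call.

```go
package main

import "fmt"

// predicateMetadata carries values that are identical for every node,
// so they can be computed once per scheduling cycle instead of once
// per node. Field names here are illustrative, not the real ones.
type predicateMetadata struct {
	podBestEffort bool
}

type Pod struct{ Requests, Limits int }
type Node struct{ Name string }

func isBestEffort(p *Pod) bool { return p.Requests == 0 && p.Limits == 0 }

// A predicate now receives the shared metadata alongside the pod/node.
type FitPredicate func(pod *Pod, meta *predicateMetadata, node *Node) bool

func noBestEffortOnBusyNode(pod *Pod, meta *predicateMetadata, node *Node) bool {
	// Uses the precomputed QoS instead of recomputing it per node.
	return !meta.podBestEffort
}

func main() {
	pod := &Pod{}
	meta := &predicateMetadata{podBestEffort: isBestEffort(pod)} // computed once
	for _, n := range []*Node{{"n1"}, {"n2"}} {
		fmt.Println(n.Name, noBestEffortOnBusyNode(pod, meta, n))
	}
}
```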
Automatic merge from submit-queue
scheduler: change phantom pod test from integration into unit test
This is an effort for #24440.
Why this PR?
- Integration tests are hard to debug. We could model the test as a unit test similar to [TestSchedulerForgetAssumedPodAfterDelete()](132ebb091a/plugin/pkg/scheduler/scheduler_test.go (L173)). Currently the test covers the expiring case; we can change that to delete.
- Add a test similar to TestSchedulerForgetAssumedPodAfterDelete() to test the phantom pod.
- Refactor the scheduler tests to share code between TestSchedulerNoPhantomPodAfterExpire() and TestSchedulerNoPhantomPodAfterDelete().
- Decouple the scheduler tests from scheduler events: do not use events.
Automatic merge from submit-queue
[Refactor] QOS to have QOS Class type for QoS classes
This PR adds a QOSClass type and initializes QOSClass constants for the three QoS classes.
It would be good to use this in all future QoS-related features.
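The three classes, roughly as a typed constant set (names follow the QoS classes above; the exact file layout is an assumption):

```go
package qos

// QOSClass gives the three QoS classes a named type instead of bare strings.
type QOSClass string

const (
	// Guaranteed: limits == requests for every resource in every container.
	Guaranteed QOSClass = "Guaranteed"
	// Burstable: requests set, but not matching limits everywhere.
	Burstable QOSClass = "Burstable"
	// BestEffort: neither requests nor limits set on any container.
	BestEffort QOSClass = "BestEffort"
)
```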
This would be good to have for the [Pod level cgroups isolation proposal](https://github.com/kubernetes/kubernetes/pull/26751) that I am working on as well.
@vishh PTAL
Signed-off-by: Buddha Prakash <buddhap@google.com>
Automatic merge from submit-queue
Fix #25606: Add a length check on "predicateFuncs" in generic_scheduler.go
Fix #25606
The PR adds a length check on "predicateFuncs" to the "findNodesThatFit" function of generic_scheduler.go.
In findNodesThatFit, if the length of the predicateFuncs parameter is 0, filtered can be set to nodes.Items directly, with no need to traverse nodes.Items.
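A self-contained sketch of the short-circuit (simplified types; the real findNodesThatFit takes more parameters):

```go
package main

import "fmt"

type Node struct{ Name string }
type FitPredicate func(Node) bool

// findNodesThatFit with the proposed short-circuit: when there are no
// predicates, return the node list as-is instead of traversing it.
func findNodesThatFit(nodes []Node, predicateFuncs map[string]FitPredicate) []Node {
	if len(predicateFuncs) == 0 {
		return nodes // no predicates: every node fits
	}
	filtered := []Node{}
	for _, n := range nodes {
		fits := true
		for _, pred := range predicateFuncs {
			if !pred(n) {
				fits = false
				break
			}
		}
		if fits {
			filtered = append(filtered, n)
		}
	}
	return filtered
}

func main() {
	nodes := []Node{{"n1"}, {"n2"}}
	fmt.Println(len(findNodesThatFit(nodes, nil))) // 2, without traversal
}
```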
When the kubelet starts a pod that refers to a non-existent PV, PVC or Node, it
should clearly show that the requested object does not exist.
The previous message "PersistentVolumeClaim 'default/ceph-claim-wm' is not in cache"
looks like a random kubelet hiccup, while "PersistentVolumeClaim
'default/ceph-claim-wm' not found" suggests that the object may not exist at
all and might be a user error.
Fixes #27523
Automatic merge from submit-queue
scheduler: remove unused random generator
Host selection in the scheduler has been changed to round-robin;
clean up the leftover random generator.
Automatic merge from submit-queue
Add a NodeCondition "NetworkUnavailable" to prevent scheduling onto a node until the routes have been created
This is a new version of #26267 (built on top of it).
The new workflow is:
- we have a "NetworkNotReady" condition
- when the Kubelet creates a node, it sets the condition to "true"
- the RouteController sets it to "false" once the route is created
- the scheduler schedules only onto nodes that do not have the "NetworkNotReady == true" condition
@gmarek @bgrant0607 @zmerlynn @cjcullen @derekwaynecarr @danwinship @dcbw @lavalamp @vishh
Automatic merge from submit-queue
Introduce node memory pressure condition to scheduler
Following the work done by @derekwaynecarr in https://github.com/kubernetes/kubernetes/pull/21274, this introduces a memory pressure predicate for the scheduler.
Missing:
* write unit tests
* test the implementation
At the moment this is a heads-up for further discussion of how the node's new memory pressure condition should be handled in the generic scheduler.
**Additional info**
* Based on [1], only best effort pods are subject to filtering.
* Based on [2], best effort pods are those pods "iff requests & limits are not specified for any resource across all containers".
[1] 542668cc79/docs/proposals/kubelet-eviction.md (scheduler)
[2] https://github.com/kubernetes/kubernetes/pull/14943
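A hedged sketch of the filtering rule described above (stand-in types; the real predicate works on api.Node conditions):

```go
package main

import "fmt"

type QOSClass string

const BestEffort QOSClass = "BestEffort"

type Pod struct{ QoS QOSClass }
type Node struct{ MemoryPressure bool }

// checkNodeMemoryPressure sketches the predicate described above: only
// best-effort pods are filtered out when the node reports memory
// pressure; guaranteed and burstable pods still fit.
func checkNodeMemoryPressure(pod *Pod, node *Node) bool {
	if pod.QoS != BestEffort {
		return true
	}
	return !node.MemoryPressure
}

func main() {
	node := &Node{MemoryPressure: true}
	fmt.Println(checkNodeMemoryPressure(&Pod{QoS: BestEffort}, node))  // false
	fmt.Println(checkNodeMemoryPressure(&Pod{QoS: "Burstable"}, node)) // true
}
```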
The PR adds a length check on "predicateFuncs" to the "findNodesThatFit" function of generic_scheduler.go: if predicateFuncs is empty, filtered can be set to nodes.Items directly, with no need to traverse nodes.Items.
It should also subtract the resource data only after the pod has been found among the node's pods; the node's pods may contain no corresponding pod, and in that case the node's resource data should not be reduced.
Automatic merge from submit-queue
Make IsValidLabelValue return error strings
Part of the larger validation PR, broken out for easier review and merge. Builds on previous PRs in the series.
Automatic merge from submit-queue
Improve fatal error description in plugins.go of scheduler
The PR adds more information to the fatal error messages in plugins.go of the scheduler.
Automatic merge from submit-queue
Make IsQualifiedName return error strings
Part of the larger validation PR, broken out for easier review and merge.
@lavalamp FYI, but I know you're swamped, too.
In RegisterCustomFitPredicate, when policy.Argument is nil and fitPredicateMap already contains policy.Name, it can return policy.Name directly; the subsequent operations are redundant.
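A minimal sketch of the suggested early return (simplified types; the real registration builds a predicate factory):

```go
package main

import "fmt"

type Policy struct {
	Name     string
	Argument *struct{} // stands in for the real predicate argument
}

var fitPredicateMap = map[string]bool{"PodFitsResources": true}

// RegisterCustomFitPredicate with the early return suggested above:
// a nil Argument plus an already-registered Name means there is
// nothing left to do.
func RegisterCustomFitPredicate(policy Policy) string {
	if policy.Argument == nil {
		if _, ok := fitPredicateMap[policy.Name]; ok {
			return policy.Name // already registered; later steps are redundant
		}
	}
	fitPredicateMap[policy.Name] = true // placeholder for the real registration
	return policy.Name
}

func main() {
	fmt.Println(RegisterCustomFitPredicate(Policy{Name: "PodFitsResources"}))
}
```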
Automatic merge from submit-queue
WIP v0 NVIDIA GPU support
```release-note
* Alpha support for scheduling pods on machines with NVIDIA GPUs whose kubelets use the `--experimental-nvidia-gpus` flag, using the alpha.kubernetes.io/nvidia-gpu resource
```
Implements part of #24071 for #23587
I am not familiar enough with the scheduler to know what to do with the scores. Mostly punting for now.
Missing items from the implementation plan: limitranger, rkt support, kubectl
support and docs
cc @erictune @davidopp @dchen1107 @vishh @Hui-Zhi @gopinatht
Automatic merge from submit-queue
Add pod condition PodScheduled to detect situation when scheduler tried to schedule a Pod, but failed
Set the `PodScheduled` condition to `ConditionFalse` in `scheduleOne()` if scheduling failed and to `ConditionTrue` in the `/bind` subresource.
Ref #24404
@mml (as it seems to be related to "why pending" effort)
Implements part of #24071
I am not familiar enough with the scheduler to know what to do with the scores. Punting for now.
Missing items from the implementation plan: limitranger, rkt support, kubectl
support and user docs
Automatic merge from submit-queue
add namespace index for cache
@wojtek-t
Implementing it this way keeps the change to lister.go small, but we would have to replace every `NewInformer()` with `NewIndexInformer()`, even where filtering by namespace is not wanted (e.g. gc_controller and scheduler). Any suggestions?
Automatic merge from submit-queue
Store node information in NodeInfo
This significantly improves scheduler throughput.
On 1000-node cluster:
- empty cluster: ~70pods/s
- full cluster: ~45pods/s
The drop in throughput is mostly related to the priority functions, which I will be looking into next (I already have PR #24095, but we need a few more things first).
This is roughly a ~40% increase.
However, we still need a better understanding of the predicate functions, because in my opinion they should be even faster than they are now. I'm going to look into it next week.
@gmarek @hongchaodeng @xiang90
Automatic merge from submit-queue
schedulercache: remove bind() from AssumePod
Since #24197 made bind() asynchronous, it is no longer needed here.
Submit this PR to clean it up.
DONE:
1. Refactor all predicates: each predicate returns fit-or-not (bool) and an error, where the error is of type
PredicateFailureError or InsufficientResourceError. (For a violation of either MaxEBSVolumeCount or
MaxGCEPDVolumeCount, one shared error type, ErrMaxVolumeCountExceeded, is returned.)
2. GeneralPredicates() is a predicate function that includes several other predicate functions (PodFitsResources,
PodFitsHost, PodFitsHostPorts). It is registered as one of the predicates in DefaultAlgorithmProvider, is
also called in canAdmitPod() in the Kubelet, and should be called by other components (like the rescheduler, etc.)
if necessary. See the discussion in issue #12744 and the sketch after the TODO list below.
3. Remove the podNumber check from GeneralPredicates.
4. HostName is now verified in the Kubelet's canAdmitPod(). Add TestHostNameConflicts in kubelet_test.go.
5. Add a getNodeAnyWay() method in the Kubelet to get node information in standaloneMode.
TODO:
1. determine which predicates should be included in GeneralPredicates()
2. separate GeneralPredicates() into:
a. GeneralPredicatesEvictPod() and
b. GeneralPredicatesNotEvictPod()
3. DaemonSet should use GeneralPredicates()
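A simplified, self-contained sketch of the shape of DONE item 2: predicates returning (bool, error), with GeneralPredicates() composing them (error types and resource accounting mostly elided):

```go
package main

import "fmt"

type Pod struct{}
type Node struct{ Name string }

// Each predicate returns fit-or-not plus an error describing the failure;
// InsufficientResourceError stands in for one of the two error kinds
// named in the list above.
type FitPredicate func(pod *Pod, node *Node) (bool, error)

type InsufficientResourceError struct{ Resource string }

func (e *InsufficientResourceError) Error() string {
	return "insufficient " + e.Resource
}

func podFitsResources(pod *Pod, node *Node) (bool, error) {
	return true, nil // resource accounting elided
}

func podFitsHost(pod *Pod, node *Node) (bool, error) {
	return true, nil // nodeName check elided
}

// GeneralPredicates composes the individual predicates so that the
// scheduler, the kubelet's canAdmitPod(), and other components can all
// run the same checks.
func GeneralPredicates(pod *Pod, node *Node) (bool, error) {
	for _, pred := range []FitPredicate{podFitsResources, podFitsHost} {
		if fit, err := pred(pod, node); !fit {
			return false, err
		}
	}
	return true, nil
}

func main() {
	fit, _ := GeneralPredicates(&Pod{}, &Node{Name: "n1"})
	fmt.Println(fit)
}
```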
AWS has a soft support limit of 40 attached EBS devices. Assuming there is just
one root device, use the rest for persistent volumes.
The devices will be named /dev/xvdba through /dev/xvdcm, leaving /dev/sda through /dev/sdz
to the system.
Also, add better error handling and propagate the error
"Too many EBS volumes attached to node XYZ" to the pod.
Instead of sorting hosts by score, find the max score and choose one of
the hosts with the max score directly. This saves a little time on sorting and
avoids extra copying of HostPriorityList entries.
A test had to be updated, as one of its cases relied on a
stable order from pre-sorting.
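A sketch of the one-pass selection (tie-breaking shown as random here; the scheduler's actual tie-breaking has its own rules):

```go
package main

import (
	"fmt"
	"math/rand"
)

type HostPriority struct {
	Host  string
	Score int
}

// selectHost finds the maximum score in one pass and picks one of the
// hosts holding it, instead of sorting the whole list. Assumes a
// non-empty priorities slice.
func selectHost(priorities []HostPriority) string {
	maxScore := priorities[0].Score
	best := []string{}
	for _, p := range priorities {
		if p.Score > maxScore {
			maxScore = p.Score
			best = best[:0] // a new maximum invalidates earlier candidates
		}
		if p.Score == maxScore {
			best = append(best, p.Host)
		}
	}
	return best[rand.Intn(len(best))]
}

func main() {
	fmt.Println(selectHost([]HostPriority{{"a", 1}, {"b", 3}, {"c", 3}}))
}
```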
For certain volume types (e.g. AWS EBS or GCE PD), only a limited
number of such volumes can be attached to a given node. This commit
introduces a predicate which allows cluster admins to cap
the maximum number of volumes of a particular type attached to a
given node.
The volume type is configurable by passing a pair of filter functions,
and the maximum number of such volumes is configurable to allow node
admins to reserve a certain number of volumes for system use.
By default, the predicate is exposed as MaxEBSVolumeCount and
MaxGCEPDVolumeCount (for AWS ElasticBlockStore and GCE PersistentDisk
volumes, respectively), each of which can be configured using the
`KUBE_MAX_PD_VOLS` environment variable.
Fixes #7835
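A self-contained sketch of the filter-function pair (illustrative types; the real predicate reads volumes from pod specs and the pods already on the node):

```go
package main

import "fmt"

type Volume struct{ EBSVolumeID string }
type Pod struct{ Volumes []Volume }

// VolumeFilter is the configurable pair described above: one function
// decides whether a volume counts against the limit, the other names
// the unique volume so duplicates are counted once.
type VolumeFilter struct {
	Matches    func(Volume) bool
	UniqueName func(Volume) string
}

// maxVolumeCountPredicate returns a predicate that rejects a pod when
// scheduling it would exceed maxVolumes matching volumes on the node.
func maxVolumeCountPredicate(filter VolumeFilter, maxVolumes int) func(pod *Pod, nodePods []*Pod) bool {
	count := func(pods []*Pod, seen map[string]bool) {
		for _, p := range pods {
			for _, v := range p.Volumes {
				if filter.Matches(v) {
					seen[filter.UniqueName(v)] = true
				}
			}
		}
	}
	return func(pod *Pod, nodePods []*Pod) bool {
		seen := map[string]bool{}
		count(nodePods, seen)
		count([]*Pod{pod}, seen)
		return len(seen) <= maxVolumes
	}
}

func main() {
	ebs := VolumeFilter{
		Matches:    func(v Volume) bool { return v.EBSVolumeID != "" },
		UniqueName: func(v Volume) string { return v.EBSVolumeID },
	}
	pred := maxVolumeCountPredicate(ebs, 1) // cap of 1, e.g. from KUBE_MAX_PD_VOLS
	onNode := []*Pod{{Volumes: []Volume{{EBSVolumeID: "vol-1"}}}}
	fmt.Println(pred(&Pod{Volumes: []Volume{{EBSVolumeID: "vol-2"}}}, onNode)) // false
}
```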
Combine the fields that will be used for content transformation
(content-type, codec, and group version) into a single struct in client,
and then pass that struct into the rest client and request. Set the
content-type when sending requests to the server, and accept the content
type as primary.
Will form the foundation for content-negotiation via the client.
Remove Codec from versionInterfaces in meta (RESTMapper is now agnostic
to codec and serialization). Register api/latest.Codecs as the codec
factory and use latest.Codecs.LegacyCodec(version) as an equivalent to
the previous codec.
For AWS EBS, a volume can only be attached to a node in the same AZ.
The scheduler must therefore detect if a volume is being attached to a
pod, and ensure that the pod is scheduled on a node in the same AZ as
the volume.
So that the scheduler need not query the cloud provider every time, and
to support decoupled operation (e.g. bare metal), we tag the volume with
our placement labels. This is done automatically by means of an
admission controller on AWS when a PersistentVolume is created backed by
an EBS volume.
Support for tagging GCE PVs will follow.
Pods that specify a volume directly (i.e. without using a
PersistentVolumeClaim) will not currently be scheduled correctly (i.e.
they will be scheduled without zone-awareness).
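A sketch of the scheduler-side zone check (the exact label key used by the admission controller is an assumption here):

```go
package main

import "fmt"

// zoneLabel stands in for the placement label mentioned above.
const zoneLabel = "failure-domain.beta.kubernetes.io/zone"

// volumeZoneMatches sketches the check: a node fits only if the zone
// label on the pod's volume matches the node's zone label.
func volumeZoneMatches(pvLabels, nodeLabels map[string]string) bool {
	zone, ok := pvLabels[zoneLabel]
	if !ok {
		return true // untagged volume: no zone constraint
	}
	return nodeLabels[zoneLabel] == zone
}

func main() {
	pv := map[string]string{zoneLabel: "us-east-1a"}
	fmt.Println(volumeZoneMatches(pv, map[string]string{zoneLabel: "us-east-1a"})) // true
	fmt.Println(volumeZoneMatches(pv, map[string]string{zoneLabel: "us-east-1b"})) // false
}
```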
1. Name the default scheduler `kube-scheduler`.
2. The default scheduler only schedules pods that meet one of the following conditions (see the sketch after this list):
- the pod has no "scheduler.alpha.kubernetes.io/name: <scheduler-name>" annotation
- the pod has the annotation "scheduler.alpha.kubernetes.io/name: kube-scheduler"
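A sketch of that dispatch rule (annotation key as above; the function name is illustrative):

```go
package main

import "fmt"

const schedulerAnnotation = "scheduler.alpha.kubernetes.io/name"

type Pod struct{ Annotations map[string]string }

// responsibleForPod sketches the rule: the default scheduler takes
// unannotated pods and pods explicitly naming kube-scheduler; pods
// naming another scheduler are left alone.
func responsibleForPod(pod *Pod, schedulerName string) bool {
	name, ok := pod.Annotations[schedulerAnnotation]
	if !ok {
		return schedulerName == "kube-scheduler" // default owns unannotated pods
	}
	return name == schedulerName
}

func main() {
	pod := &Pod{Annotations: map[string]string{schedulerAnnotation: "my-scheduler"}}
	fmt.Println(responsibleForPod(pod, "kube-scheduler"))    // false
	fmt.Println(responsibleForPod(&Pod{}, "kube-scheduler")) // true
}
```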
Update gofmt.
Update according to @david's review.
Run hack/test-integration.sh, hack/test-go.sh and local e2e.test.
To improve the throughput of the current scheduler, we can do
a simple optimization by calculating priorities in parallel.
This doubles the throughput of the density test, which has the default
config with 3 priority funcs (the spreading one does not actually
consume any computation time). This matches expectations.
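A minimal sketch of the parallelization (one goroutine per priority function; stand-in types):

```go
package main

import (
	"fmt"
	"sync"
)

type Node struct{ Name string }
type PriorityFunc func(node Node) int

// computePrioritiesInParallel runs each priority function in its own
// goroutine; scores[i][j] is the score of priority i on node j. The
// per-node loop inside one priority stays sequential, matching the
// simple optimization described above.
func computePrioritiesInParallel(nodes []Node, priorities []PriorityFunc) [][]int {
	scores := make([][]int, len(priorities))
	var wg sync.WaitGroup
	for i, prio := range priorities {
		wg.Add(1)
		go func(i int, prio PriorityFunc) {
			defer wg.Done()
			row := make([]int, len(nodes))
			for j, n := range nodes {
				row[j] = prio(n)
			}
			scores[i] = row // each goroutine writes a distinct index
		}(i, prio)
	}
	wg.Wait()
	return scores
}

func main() {
	nodes := []Node{{"n1"}, {"n2"}}
	constant := func(Node) int { return 1 }
	fmt.Println(computePrioritiesInParallel(nodes, []PriorityFunc{constant, constant}))
}
```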
The "Combined requested resources" message becomes excessive as
the cluster fills up, drop it down to V(2)
Put an explicit V(2) on the only other scheduler Infof call that didn't
have V specified already.
Set the out-of-disk node condition to unknown in the node controller if
the kubelet does not report its node condition for a long time. Update
the node controller unit tests.
Implement a node condition predicate function that checks whether a given
node satisfies the conditions defined by the predicate and, if it
does, uses that node for scheduling pods. The predicate function takes
both NodeReady and NodeOutOfDisk into consideration to determine whether a
node is fit for scheduling pods.
The predicate is then passed to the node lister in the scheduler factory
so that the node lister can run the predicate function on the nodes when
scheduling pods, thereby omitting nodes that do not satisfy the
predicate.
Also update the listers test.
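A sketch of such a predicate (a boolean condition map as a stand-in for api.NodeCondition):

```go
package main

import "fmt"

type ConditionType string

const (
	NodeReady     ConditionType = "Ready"
	NodeOutOfDisk ConditionType = "OutOfDisk"
)

type Node struct{ Conditions map[ConditionType]bool }

// nodeIsSchedulable sketches the predicate handed to the node lister:
// a node is fit for pods only when it is Ready and not OutOfDisk.
func nodeIsSchedulable(n Node) bool {
	return n.Conditions[NodeReady] && !n.Conditions[NodeOutOfDisk]
}

func main() {
	good := Node{Conditions: map[ConditionType]bool{NodeReady: true}}
	full := Node{Conditions: map[ConditionType]bool{NodeReady: true, NodeOutOfDisk: true}}
	fmt.Println(nodeIsSchedulable(good), nodeIsSchedulable(full)) // true false
}
```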
A lot of packages use StringSet, but they don't use anything else from
the util package. Moving StringSet into another package will shrink
their dependency trees significantly.
These messages are only useful if you want to debug a particular
scheduler assignment, and they are extremely verbose: they each print
out a line per host per assignment. Let's try to keep our log messages
linear in the number of assignments.
This commit wires together the graceful delete option for pods
on the Kubelet. When a pod is deleted on the API server, a
grace period is calculated based on
Pod.Spec.TerminationGracePeriodSeconds, the user's provided grace
period, or a default. The grace period can only shrink once set.
The value provided by the user (or the default) is set onto metadata
as DeletionGracePeriod.
When the Kubelet sees a pod with DeletionTimestamp set, it uses the
value of ObjectMeta.GracePeriodSeconds as the grace period
sent to Docker. When updating status, if the pod has DeletionTimestamp
set and all containers are terminated, the Kubelet will update the
status one last time and then invoke Delete(pod, grace: 0) to
clean up the pod immediately.
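A sketch of the grace-period selection and the shrink-only rule (names and the 30-second default here are illustrative):

```go
package main

import "fmt"

const defaultGracePeriodSeconds int64 = 30

// effectiveGracePeriod sketches the selection order described above:
// the caller-provided period wins, then the pod spec's
// TerminationGracePeriodSeconds, then a default.
func effectiveGracePeriod(userProvided, specPeriod *int64) int64 {
	if userProvided != nil {
		return *userProvided
	}
	if specPeriod != nil {
		return *specPeriod
	}
	return defaultGracePeriodSeconds
}

// shrinkGracePeriod enforces that an already-set period can only get smaller.
func shrinkGracePeriod(current, requested int64) int64 {
	if requested < current {
		return requested
	}
	return current // the grace period never grows once set
}

func main() {
	spec := int64(60)
	fmt.Println(effectiveGracePeriod(nil, &spec)) // 60
	fmt.Println(shrinkGracePeriod(60, 10))        // 10
	fmt.Println(shrinkGracePeriod(10, 60))        // 10
}
```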
Having no nodes in the cluster is unusual and is likely a test
environment, and when a pod is deleted there is no need to log
information about our inability to schedule it.
Allows POST to create a binding as a child. Also refactors internal
and v1beta3 Binding to be more generic (so that other resources can
support Bindings).
Revise our code to only call Request.Namespace() if a namespace
*should* be present. For root scoped resources, namespace should
be ignored. For namespaced resources, it is an error to have
Namespace=="".
Currently, the validation logic validates fields in an object and supplies default
values wherever applicable. This change factors defaulting out into a set of
defaulting callback functions for decoding (see #1502 for more discussion).
* This change is based on pull request 2587.
* Most defaulting has been migrated to defaults.go, where the defaulting
functions are added.
* validation_test.go and converter_test.go have been adapted to not test the
default values.
* Fixed all tests that create invalid objects in the absence of
defaulting logic.
Support namespacing in cache.Store by framing the interface functions
around interface{} and providing a key function to each Store implementation.
Implementation of a fix for #2294.
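A minimal sketch of the approach: a namespace-aware key function feeding a Store framed around interface{} (stand-in types, not the real cache package):

```go
package main

import "fmt"

// KeyFunc maps an object to its cache key, letting one Store
// implementation serve both namespaced and non-namespaced resources.
type KeyFunc func(obj interface{}) (string, error)

type Pod struct{ Namespace, Name string }

// namespaceKeyFunc derives "<namespace>/<name>" keys from the object.
func namespaceKeyFunc(obj interface{}) (string, error) {
	p, ok := obj.(*Pod)
	if !ok {
		return "", fmt.Errorf("unexpected object type %T", obj)
	}
	return p.Namespace + "/" + p.Name, nil
}

// Store keeps items under the keys its KeyFunc produces.
type Store struct {
	keyFunc KeyFunc
	items   map[string]interface{}
}

func (s *Store) Add(obj interface{}) error {
	key, err := s.keyFunc(obj)
	if err != nil {
		return err
	}
	s.items[key] = obj
	return nil
}

func main() {
	s := &Store{keyFunc: namespaceKeyFunc, items: map[string]interface{}{}}
	s.Add(&Pod{Namespace: "default", Name: "web"})
	fmt.Println(len(s.items)) // 1, keyed as "default/web"
}
```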
I would like to use these in kubelet and kube-proxy.
This is the minimal change to get them moved.
I will follow up with changes to make interfaces consistent
and add Listers for other resources.
RESTClient is an abstraction on top of arbitrary HTTP endpoints that
follow the Kubernetes API conventions. Refactored RESTClientFor so that
assumptions that are Kube specific happen outside of that method (so
others can reuse the RESTClient). Added more validation to client.New
to ensure clients give good input. Exposed APIVersion on RESTClient
as a method so that wrapper code (code that adds typed / structured
methods over rest endpoints like client.Client) can more easily make
decisions about what APIVersion it is running under.
The scheduler uses Reflector from pkg/client/cache and defines some
helper types around it. I'd like to use those helpers with
pkg/client/cache in kube-proxy and the kubelet too.
There are quite a few 'composite literal uses unkeyed fields' errors that I have kept out of this patch,
and there are a couple where vet just seems confused. These are the easiest ones.
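For reference, what vet is asking for (a self-contained demo; vet flags unkeyed literals of structs from other packages):

```go
package main

import "fmt"

type Probe struct {
	Path    string
	Port    int
	Retries int
}

func main() {
	// Unkeyed: for structs from another package, go vet reports
	// "composite literal uses unkeyed fields", and the meaning silently
	// changes if fields are ever reordered.
	bad := Probe{"/healthz", 8080, 3}

	// Keyed: the form vet wants; robust to field reordering and additions.
	good := Probe{Path: "/healthz", Port: 8080, Retries: 3}

	fmt.Println(bad == good) // true
}
```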
Allows us to define different watch versioning regimes in the future
as well as to encode information with the resource version.
This changes /watch/resources?resourceVersion=3 to start the watch at
4 instead of 3, which means clients can read a resource version and
then send it back to the server. Clients should no longer do math on
resource versions.
* Allows consumers to provide their own transports for common cases.
* Supports KUBE_API_VERSION on test cases for controlling which
api version they test against
* Provides a common flag registration method for CLIs that need
to connect to an API server (to avoid duplicating flags)
* Ensures errors are properly returned by the server
* Add a Context field to client.Config
Move a lot of common error logging into better buckets:
glog.Errorf() - Always an error
glog.Warningf() - Something unexpected, but probably not an error
glog.V(0) - Generally useful for this to ALWAYS be visible
to an operator
* Programmer errors
* Logging extra info about a panic
* CLI argument handling
glog.V(1) - A reasonable default log level if you don't want
verbosity
* Information about config (listening on X, watching Y)
* Errors that repeat frequently that relate to conditions
that can be corrected (pod detected as unhealthy)
glog.V(2) - Useful steady state information about the service
* Logging HTTP requests and their exit code
* System state changing (killing pod)
* Controller state change events (starting pods)
* Scheduler log messages
glog.V(3) - Extended information about changes
* More info about system state changes
glog.V(4) - Debug level verbosity (for now)
* Logging in particularly thorny parts of code where
you may want to come back later and check it
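A sketch of those buckets applied in code (assuming the glog package; the messages are illustrative):

```go
package main

import (
	"flag"

	"github.com/golang/glog"
)

// -v=N on the command line controls which V() lines are emitted.
func main() {
	flag.Parse()
	defer glog.Flush()

	glog.Errorf("failed to bind pod: %v", "timeout")             // always an error
	glog.Warningf("unexpected condition, continuing")            // unexpected, probably not an error
	glog.V(1).Infof("listening on %s", ":10251")                 // config info
	glog.V(2).Infof("scheduled pod %s to node %s", "p", "n")     // steady-state messages
	glog.V(3).Infof("node state changed: %s", "details")         // extended change info
	glog.V(4).Infof("entering tricky section with key %q", "k")  // debug verbosity
}
```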
* Defaults to v1beta1
* apiserver takes -storage_version which controls etcd storage version
and the version of the client used to connect to other apiservers
* Changed signature of client.New to add version parameter
* All controller code and component code prefers the oldest (most common)
server version
* Make Codec separate from Scheme
* Move EncodeOrDie off Scheme to take a Codec
* Make Copy work without a Codec
* Create a "latest" package that imports all versions and
sets global defaults for "most recent encoding"
* v1beta1 is the current "latest", v1beta2 exists
* Kill DefaultCodec, replace it with "latest.Codec"
* This updates the client and etcd to store the latest known version
* EmbeddedObject is per Scheme and per package now
* Move runtime.DefaultScheme to api.Scheme
* Split out WatchEvent since it's not an API object today, treat it
like a special object in api
* Kill DefaultResourceVersioner, instead place it on "latest" (as the
package that understands all packages)
* Move objDiff to runtime.ObjectDiff
Convert host:port and URLs passed to client.New() into the proper
values, and return an error if the value is invalid. Change CLI
to return an error if -master is invalid. Remove Client.rawRequest
which was not in use, and fix the involved tests. Add NewOrDie.
Preserves the behavior of the client to not auth when a non-https
URL is passed (although in the future this should be corrected).