# Kubernetes-Mesos Scheduler

Kubernetes on Mesos does not use the upstream scheduler binary, but replaces it
with its own Mesos framework scheduler. The following gives an overview of
the differences.

## Labels and Mesos Agent Attributes

The Kubernetes-Mesos scheduler takes [labels][1] into account: it matches the
labels specified in pod specs against the labels defined on nodes.

In addition to user-defined labels, [attributes of Mesos agents][2] are converted
into node labels by the scheduler, following the pattern

```yaml
k8s.mesosphere.io/attribute-<name>: value
```

For example, a Mesos agent attribute of `generation:2015` results in the node label

```yaml
k8s.mesosphere.io/attribute-generation: "2015"
```

which can be used to schedule pods onto nodes of generation 2015.

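
The conversion described above can be illustrated with a small Python sketch. It is not the Kubernetes-Mesos implementation: the helper names `attributes_to_labels` and `matches` are hypothetical, and the real scheduler may normalise attribute values differently. Only the label prefix is taken from the pattern shown above.

```python
# Hypothetical sketch, not the Kubernetes-Mesos implementation.
LABEL_PREFIX = "k8s.mesosphere.io/attribute-"  # prefix from the documented pattern


def attributes_to_labels(attributes):
    """Map Mesos agent attributes (name -> value) to node labels."""
    # Label values are strings, so numeric attribute values are converted.
    return {LABEL_PREFIX + name: str(value) for name, value in attributes.items()}


def matches(selector, node_labels):
    """Return True if every selector entry is present in the node labels."""
    return all(node_labels.get(key) == value for key, value in selector.items())


labels = attributes_to_labels({"generation": 2015})
print(labels)  # {'k8s.mesosphere.io/attribute-generation': '2015'}
print(matches({"k8s.mesosphere.io/attribute-generation": "2015"}, labels))  # True
```

On the pod side, such a label would typically be referenced from the pod's `nodeSelector`.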

**Note:** Node labels prefixed by `k8s.mesosphere.io` are managed by
Kubernetes-Mesos and should not be modified manually by users or admins. For
example, the Kubernetes-Mesos executor manages `k8s.mesosphere.io/attribute`
labels and will detect modified attributes and update the corresponding labels
when the mesos-slave is restarted.

## Resource Roles

A Mesos cluster can be statically partitioned using [resource roles][2]. Each
resource is assigned to such a role (`*` is the default role if none is explicitly
assigned on the mesos-slave command line). The Mesos master sends offers to
frameworks for `*` resources and, optionally, for one extra role that a
framework is assigned to. Currently only one such extra role per framework is
supported.

### Configuring Roles for the Scheduler

Every Mesos framework scheduler can choose among the offered `*` resources and
those of the extra role. The Kubernetes-Mesos scheduler supports this by setting
the framework roles on the scheduler command line, e.g.

```bash
$ km scheduler ... --mesos-roles="*,role1" ...
```

This tells the Kubernetes-Mesos scheduler to default to `*` resources
if a pod is not explicitly assigned to another role. Moreover, the extra role
`role1` is allowed, i.e. the Mesos master will send resources of role `role1`
to the Kubernetes scheduler.


Note the following restrictions and possibilities:
- Due to the restrictions of Mesos, only one extra role may be provided on the
  command line.
- It is allowed to pass only an extra role without the `*`, e.g. `--mesos-roles=role1`.
  This means that the scheduler does not consider `*` resources at all.
- It is allowed to pass the extra role first, e.g. `--mesos-roles=role1,*`.
  This means that `role1` is the default role for pods without a special role
  assignment (see below), while `*` resources are considered only for pods with
  an explicit `*` assignment.

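
The rules above can be summarised in a short Python sketch. The function `parse_mesos_roles` is hypothetical and only models the documented behaviour (first role is the default, at most one extra role besides `*`). It does not reproduce how `km scheduler` actually parses the flag.

```python
# Hypothetical sketch of the documented --mesos-roles rules.
def parse_mesos_roles(flag_value):
    """Split a --mesos-roles value into (default_role, allowed_roles)."""
    roles = [role.strip() for role in flag_value.split(",") if role.strip()]
    if not roles:
        raise ValueError("at least one role is required")
    extra = [role for role in roles if role != "*"]
    if len(extra) > 1:
        raise ValueError("only one extra role besides '*' is supported")
    # The first role in the list is the default for pods without a role assignment.
    return roles[0], roles


print(parse_mesos_roles("*,role1"))  # ('*', ['*', 'role1'])
print(parse_mesos_roles("role1"))    # ('role1', ['role1'])
print(parse_mesos_roles("role1,*"))  # ('role1', ['role1', '*'])
```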
### Specifying Roles for Pods

By default a pod is scheduled using resources of the role that comes first in
the list of scheduler roles.

A pod can opt out of this default behaviour using the `k8s.mesosphere.io/roles`
label:

```yaml
k8s.mesosphere.io/roles: role1,role2,role3
```

The format is a comma-separated list of allowed resource roles. The scheduler
will try to schedule the pod with `role1` resources first, using `role2`
resources if the former are not available, and finally falling back to `role3`
resources.

The `*` role may be specified in this list as well.

**Note:** An empty list means that no resource roles are allowed, which
makes the pod unschedulable.


For example:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: backend
  labels:
    k8s.mesosphere.io/roles: "*,prod,test,dev"
  namespace: prod
spec:
  ...
```

This `prod/backend` pod will be scheduled using resources from all four roles,
preferably using `*` resources, followed by `prod`, `test` and `dev`. If none
of those four roles provides enough resources, scheduling fails.


**Note:** The scheduler also allows mixing different roles in the following
sense: if a node provides `cpu` resources for the `*` role, but `mem` resources
only for the `prod` role, the pod above will be scheduled using `cpu(*)` and
`mem(prod)` resources.

**Note:** The scheduler might also mix roles within one resource type, i.e. it will
use as many `cpu`s of the `*` role as possible. If a pod requires even more
`cpu` resources (defined using the `pod.spec.containers[].resources.limits` property)
for successful scheduling, the scheduler will add resources from the `prod`, `test`
and `dev` roles, in this order, until the pod's resource requirements are satisfied.
E.g. a pod might be scheduled with 0.5 `cpu(*)`, 1.5 `cpu(prod)` and 1 `cpu(test)`
resources plus e.g. 2 GB `mem(prod)` resources.

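
The role-ordered fallback and the mixing within one resource type can be sketched in Python as follows. This is an illustrative model only: the function `allocate` and the offered amounts are assumptions chosen to reproduce the numbers in the note above, not the scheduler's actual code or data structures.

```python
# Hypothetical sketch of role-ordered resource mixing.
def allocate(requested, pod_roles, offered):
    """Take `requested` units of one resource from `offered`, in role order.

    Returns a {role: amount} mapping, or None if the allowed roles
    together cannot cover the request.
    """
    remaining = requested
    taken = {}
    for role in pod_roles:
        if remaining <= 0:
            break
        amount = min(remaining, offered.get(role, 0.0))
        if amount > 0:
            taken[role] = amount
            remaining -= amount
    return taken if remaining <= 0 else None


roles = "*,prod,test,dev".split(",")
cpu = allocate(3.0, roles, {"*": 0.5, "prod": 1.5, "test": 2.0, "dev": 4.0})
mem = allocate(2.0, roles, {"prod": 8.0})
print(cpu)  # {'*': 0.5, 'prod': 1.5, 'test': 1.0}
print(mem)  # {'prod': 2.0}
```

With an empty role list the loop never runs and the function returns `None`, which mirrors the unschedulable case described earlier.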
## Tuning

The scheduler configuration can be fine-tuned using an ini-style configuration file.
The filename is passed via `--scheduler-config` to the `km scheduler` command.

Be warned, though, that some of the settings are fairly low-level, and one has to
know the inner workings of k8sm to find sensible values. Moreover, these settings
may change or even disappear from version to version without further notice.

The following settings are the defaults:

```ini
[scheduler]
; duration an offer is viable, prior to being expired
offer-ttl = 5s

; duration an expired offer lingers in history
offer-linger-ttl = 2m

; duration between offer listener notifications
listener-delay = 1s

; size of the pod updates channel
updates-backlog = 2048

; interval we update the frameworkId stored in etcd
framework-id-refresh-interval = 30s

; wait this amount of time after initial registration before attempting
; implicit reconciliation
initial-implicit-reconciliation-delay = 15s

; interval in between internal task status checks/updates
explicit-reconciliation-max-backoff = 2m

; waiting period after attempting to cancel an ongoing reconciliation
explicit-reconciliation-abort-timeout = 30s

initial-pod-backoff = 1s
max-pod-backoff = 60s
http-handler-timeout = 10s
http-bind-interval = 5s
```

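
Before passing a customised file to `--scheduler-config`, it can be useful to check that it still has the expected shape. The following Python sketch does this with the standard `configparser` module. It is an assumption-laden illustration: the `parse_duration` helper is hypothetical and only handles a single `s`, `m` or `h` suffix, the overridden `offer-ttl` value is an arbitrary example, and the parser inside `km scheduler` remains authoritative.

```python
import configparser

# Seconds per unit; only these single-letter suffixes are handled in this sketch.
UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Convert a value such as '5s' or '2m' into seconds."""
    return float(text[:-1]) * UNITS[text[-1]]


# Example file content with one value changed from its default (hypothetical).
SAMPLE = """
[scheduler]
; duration an offer is viable, prior to being expired
offer-ttl = 10s
; duration an expired offer lingers in history
offer-linger-ttl = 2m
updates-backlog = 2048
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)
scheduler = config["scheduler"]

print(parse_duration(scheduler["offer-ttl"]))         # 10.0
print(parse_duration(scheduler["offer-linger-ttl"]))  # 120.0
print(scheduler.getint("updates-backlog"))            # 2048
```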
## Low-Level Scheduler Architecture

[1]: ../../../docs/user-guide/labels.md
[2]: http://mesos.apache.org/documentation/attributes-resources/