Without this, it fails after deployments were switched from
extensions to apps, with:
```
E0902 11:25:51.197420 1 reflector.go:283] github.com/kubernetes-incubator/cluster-proportional-autoscaler/pkg/autoscaler/k8sclient/k8sclient.go:96: Failed to watch *v1.Node: unknown (get nodes)
E0902 11:25:53.118490 1 reflector.go:283] github.com/kubernetes-incubator/cluster-proportional-autoscaler/pkg/autoscaler/k8sclient/k8sclient.go:96: Failed to watch *v1.Node: unknown (get nodes)
E0902 11:25:54.997493 1 reflector.go:283] github.com/kubernetes-incubator/cluster-proportional-autoscaler/pkg/autoscaler/k8sclient/k8sclient.go:96: Failed to watch *v1.Node: unknown (get nodes)
E0902 11:25:57.097423 1 reflector.go:283] github.com/kubernetes-incubator/cluster-proportional-autoscaler/pkg/autoscaler/k8sclient/k8sclient.go:96: Failed to watch *v1.Node: unknown (get nodes)
E0902 11:25:59.097417 1 reflector.go:283] github.com/kubernetes-incubator/cluster-proportional-autoscaler/pkg/autoscaler/k8sclient/k8sclient.go:96: Failed to watch *v1.Node: unknown (get nodes)
I0902 11:25:59.697325 1 k8sclient.go:221] Falling back to extensions/v1beta1, error using apps/v1: deployments.apps "calico-typha" is forbidden: User "system:serviceaccount:kube-system:typha-cpha" cannot get resource "deployments/scale" in API group "apps" in the namespace "kube-system"
E0902 11:25:59.699833 1 autoscaler_server.go:120] Update failure: the server could not find the requested resource
```
Ref. https://github.com/kubernetes/test-infra/pull/13709
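For illustration, here is a minimal sketch of the RBAC the errors point at, reconstructed from the log rather than copied from the referenced PR (names, verbs, and the omitted bindings are assumptions):
```sh
# Hypothetical reconstruction based on the errors above, not the actual manifest;
# the RoleBinding/ClusterRoleBinding to the typha-cpha ServiceAccount are omitted.
kubectl apply -f - <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: typha-cpha
rules:
- apiGroups: [""]
  resources: ["nodes"]              # "Failed to watch *v1.Node"
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: typha-cpha
  namespace: kube-system
rules:
- apiGroups: ["apps"]
  resources: ["deployments/scale"]  # cannot get "deployments/scale" in group "apps"
  verbs: ["get", "update"]
EOF
```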
This preference list is used to pick the preferred address field from the
k8s Node object. It was introduced in metrics-server 0.3 and changed the
default behaviour to use DNS names instead of IP addresses. It was merged
into k8s 1.12 and caused a breaking change by introducing a dependency on
DNS configuration.
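For illustration, the pre-0.3 IP-first behaviour can be restored through the `--kubelet-preferred-address-types` flag (the flag is real in metrics-server 0.3+; the patch below is a sketch that assumes the standard Deployment layout):
```sh
# Prefer node IPs over DNS names so kubelet scraping does not depend on
# cluster DNS configuration. Container index 0 is an assumption.
kubectl -n kube-system patch deployment metrics-server --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/args/-",
   "value": "--kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname"}
]'
```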
[stackdriver addon] Bump prometheus-to-sd to v0.5.0 to pick up security fixes.
[fluentd-gcp addon] Bump fluentd-gcp-scaler to v0.5.1 to pick up security fixes.
[fluentd-gcp addon] Bump event-exporter to v0.2.4 to pick up security fixes.
[fluentd-gcp addon] Bump prometheus-to-sd to v0.5.0 to pick up security fixes.
[metadata-proxy addon] Bump prometheus-to-sd to v0.5.0 to pick up security fixes.
Fix three issues with the fluentd-gcp liveness probe:
# STUCK_THRESHOLD_SECONDS was overridden by LIVENESS_THRESHOLD_SECONDS if defined
Probably a copy/paste issue introduced in edf1ffc074.
# `[[` is [a bashism](https://stackoverflow.com/a/47576482), and will always fail when called with `/bin/sh`
Introduced by a844523c20.
Given that we call the liveness probe with `/bin/sh`, we cannot use the
double-bracketed `[[` syntax for test, as it is not POSIX-compliant and
will throw an error.
Annoyingly, even though it prints an error, `sh` returns with exit code 0
in this case:
```bash
root@fluentd-7mprs:/# sh liveness.sh
liveness.sh: 8: liveness.sh: [[: not found
liveness.sh: 15: liveness.sh: [[: not found
root@fluentd-7mprs:/# echo $?
0
```
Which means the liveness probe is considered successful by Kubernetes,
despite failing to perform the check as intended. This is probably also
the reason why this bug wasn't reported sooner :)
Thankfully, the test in this case can just as easily be written in a
POSIX-compliant way, as it doesn't use any bash-specific features within
the `[[` block.
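A minimal sketch of the POSIX-compliant form (illustrative; the variable name follows the probe script, the exact expression may differ):
```sh
#!/bin/sh
# Bash-only form, breaks under /bin/sh:
#   if [[ -z "${STUCK_THRESHOLD_SECONDS}" ]]; then ...
# POSIX single-bracket form, works under any sh:
if [ -z "${STUCK_THRESHOLD_SECONDS}" ]; then
  STUCK_THRESHOLD_SECONDS=900
fi
```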
# Buffers are transient and cannot be relied upon for monitoring
Finally, after fixing the above issue, we started seeing the fluentd
containers being restarted very often, and found an issue with the
underlying logic of the liveness probe.
The probe checks that the pod is still alive by running the following
command:
`find /var/log/fluentd-buffers -type f -newer /tmp/marker-stuck -print -quit`
This checks if any _regular_ file exists under `/var/log/fluentd-buffers`
that is more recent than a predetermined time, and will return an empty
string otherwise.
The issue is that these buffers are temporary and volatile: they get created and
deleted constantly. Here is an example of running that check every second on a
running fluentd:
```
root@fluentd-eks-playground-jdc8m:/# LIVENESS_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-300};
root@fluentd-eks-playground-jdc8m:/# STUCK_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-900};
root@fluentd-eks-playground-jdc8m:/# touch -d "${STUCK_THRESHOLD_SECONDS} seconds ago" /tmp/marker-stuck;
root@fluentd-eks-playground-jdc8m:/# touch -d "${LIVENESS_THRESHOLD_SECONDS} seconds ago" /tmp/marker-liveness;
root@fluentd-eks-playground-jdc8m:/# while true; do date ; find /var/log/fluentd-buffers -type f -newer /tmp/marker-stuck -print -quit ; sleep 1 ; done
Fri Feb 22 10:52:57 UTC 2019
Fri Feb 22 10:52:58 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827964ccf4c7004103c3fa7c8533f85.log
Fri Feb 22 10:52:59 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827964ccf4c7004103c3fa7c8533f85.log
Fri Feb 22 10:53:00 UTC 2019
Fri Feb 22 10:53:01 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827964fb8b2eedcccd2763ea7775cc2.log
Fri Feb 22 10:53:02 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827964fb8b2eedcccd2763ea7775cc2.log
Fri Feb 22 10:53:03 UTC 2019
Fri Feb 22 10:53:04 UTC 2019
Fri Feb 22 10:53:05 UTC 2019
Fri Feb 22 10:53:06 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827965564883997b673d703af54848b.log
Fri Feb 22 10:53:07 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827965564883997b673d703af54848b.log
Fri Feb 22 10:53:08 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827965564883997b673d703af54848b.log
Fri Feb 22 10:53:09 UTC 2019
Fri Feb 22 10:53:10 UTC 2019
Fri Feb 22 10:53:11 UTC 2019
Fri Feb 22 10:53:12 UTC 2019
Fri Feb 22 10:53:13 UTC 2019
Fri Feb 22 10:53:14 UTC 2019
Fri Feb 22 10:53:15 UTC 2019
Fri Feb 22 10:53:16 UTC 2019
```
We can see buffers being created, then disappearing. The LivenessProbe running
under these conditions has a ~50% chance of failing, despite fluentd being
perfectly happy.
I believe that check is probably OK for fluentd installs that use large
numbers of buffers, in which case the liveness probe will be correct more
often than not, but fluentd installs that use buffering less intensively
will be negatively impacted by it.
My solution to fix this is to check the last-updated time of buffering
_folders_ within `/var/log/fluentd-buffers`. These _do_ get updated when
buffers are created, and do not get deleted as buffers are emptied,
making them the perfect candidate for our use.
Here's an example of the same check using `-type d` to match directories:
```
root@fluentd-eks-playground-jdc8m:/# while true; do date ; find /var/log/fluentd-buffers -type d -newer /tmp/marker-stuck -print -quit ; sleep 1 ; done
Fri Feb 22 10:57:51 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:52 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:53 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:54 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:55 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:56 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:57 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:58 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:59 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:58:00 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:58:01 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:58:02 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:58:03 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
```
And an example of the directory being updated as new buffers come in:
```
root@fluentd-eks-playground-jdc8m:/# ls -lah /var/log/fluentd-buffers/kubernetes.system.buffer
total 0
drwxr-xr-x 2 root root 6 Feb 22 11:17 .
drwxr-xr-x 3 root root 38 Feb 22 11:14 ..
root@fluentd-eks-playground-jdc8m:/# ls -lah /var/log/fluentd-buffers/kubernetes.system.buffer
total 16K
drwxr-xr-x 2 root root 224 Feb 22 11:18 .
drwxr-xr-x 3 root root 38 Feb 22 11:14 ..
-rw-r--r-- 1 root root 1.8K Feb 22 11:18 buffer.b58279be6e21e8b29fc333a7d50096ed0.log
-rw-r--r-- 1 root root 215 Feb 22 11:18 buffer.b58279be6e21e8b29fc333a7d50096ed0.log.meta
-rw-r--r-- 1 root root 429 Feb 22 11:18 buffer.b58279be6f09bdfe047a96486a525ece2.log
-rw-r--r-- 1 root root 195 Feb 22 11:18 buffer.b58279be6f09bdfe047a96486a525ece2.log.meta
root@fluentd-eks-playground-jdc8m:/# ls -lah /var/log/fluentd-buffers/kubernetes.system.buffer
total 0
drwxr-xr-x 2 root root 6 Feb 22 11:18 .
drwxr-xr-x 3 root root 38 Feb 22 11:14 ..
```
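Putting the three fixes together, a sketch of the repaired probe (paths and thresholds follow the snippets above; the real script has more steps):
```sh
#!/bin/sh
# Fix 1: STUCK gets its own default instead of inheriting LIVENESS's value.
LIVENESS_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-300}
STUCK_THRESHOLD_SECONDS=${STUCK_THRESHOLD_SECONDS:-900}

touch -d "${STUCK_THRESHOLD_SECONDS} seconds ago" /tmp/marker-stuck

# Fix 2: POSIX `[` instead of the bashism `[[`.
# Fix 3: look at buffer *directories* (-type d), not transient files.
if [ -z "$(find /var/log/fluentd-buffers -type d -newer /tmp/marker-stuck -print -quit)" ]; then
  # No buffer directory touched within the stuck threshold: fluentd looks stuck.
  exit 1
fi
exit 0
```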
Similar to `--no-negcache` on dnsmasq, this prevents issues for anything that polls DNS for orchestration, such as operators with StatefulSets. Negative caching can also be very confusing for users: a change they just made appears broken until the cache expires. This assumes that 5 seconds is a reasonable TTL and will still catch repeated AAAA negative responses. We could also set the denial cache size to zero, which should effectively disable it fully, like dnsmasq in kube-dns, but testing shows this approach seems to work well in our (albeit small) test clusters.
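For illustration, this corresponds to a `denial` TTL override in the CoreDNS `cache` plugin, roughly like the following Corefile stanza (the 9984 capacity is the plugin's documented default; surrounding plugins are shown for context only):
```
.:53 {
    errors
    kubernetes cluster.local in-addr.arpa ip6.arpa
    cache 30 {
        denial 9984 5
    }
    proxy . /etc/resolv.conf
}
```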
These DaemonSets support only Linux today, so this change updates the
specs to reflect that limitation. The labels have recently been promoted
to GA, but we use the beta labels for now, until the node/master version
skew problem no longer exists.
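A sketch of the selector using the beta label (the DaemonSet name is illustrative):
```sh
# Restrict scheduling to Linux nodes via the beta OS label; the GA label is
# kubernetes.io/os, which we avoid for now per the skew note above.
kubectl -n kube-system patch ds fluentd-gcp-v3.1.0 --type=merge -p '
spec:
  template:
    spec:
      nodeSelector:
        beta.kubernetes.io/os: linux
'
```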
The node_exporter CPU use is bursty, as it needs a bit of CPU at scrape time. Don't set a CPU limit to avoid collection stalls.
Set the request to 100m to more closely match the typical max core needs.
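Expressed with kubectl for illustration (DaemonSet name and namespace are assumptions; per kubectl's documented convention, setting a value to zero removes it):
```sh
# Request 100m of CPU and drop the CPU limit so scrape-time bursts are not throttled.
kubectl -n monitoring set resources ds node-exporter -c node-exporter \
  --requests=cpu=100m --limits=cpu=0
```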
Setting CPU_LIMITS to '1' fixes the following log line, which appears every 60 seconds:
Running: kubectl set resources -n kube-system ds fluentd-gcp-v3.1.0 -c fluentd-gcp --requests=cpu=100m,memory=200Mi --limits=cpu=1000m,memory=500Mi
error: info: {extensions v1beta1 daemonsets} "fluentd-gcp-v3.1.0" was not changed
This PR does not change the scaler's behaviour; pods are scaled correctly despite the error in the logs.
This change includes the yaml files and GCE startup script changes needed
to run this addon. It is disabled by default and can be enabled by setting
KUBE_ENABLE_NODELOCAL_DNS=true.
An IP address is required for the cache instance to listen on; the default
is the link-local address 169.254.25.10.
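A sketch of turning it on for a GCE cluster brought up with kube-up (the variable is the one added by this change):
```sh
# Enable the node-local DNS cache addon; it will listen on the link-local
# 169.254.25.10 by default, as described above.
export KUBE_ENABLE_NODELOCAL_DNS=true
cluster/kube-up.sh
```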
Addressed review comments and updated the image location.
Picked a different Prometheus port so the stats port is not the same as the CoreDNS deployment's.
Removed the nodelocaldns-ready label.
Set the memory limit to 30Mi.
- Move from the old github.com/golang/glog to k8s.io/klog
- klog has an explicit InitFlags(), so we add the flags as necessary
- We update the other repositories that we vendor that made a similar
  change from glog to klog:
  * github.com/kubernetes/repo-infra
  * k8s.io/gengo/
  * k8s.io/kube-openapi/
  * github.com/google/cadvisor
- Entirely remove all references to glog
- Fix some tests by explicitly calling InitFlags in their init() methods
Change-Id: I92db545ff36fcec83afe98f550c9e630098b3135
This fixes an issue where SRV records were incorrectly being compressed.
Also updated the kubedns version for kubeadm.
Upgrade to 1.14.12 with manifest support. Runs dnsmasq version 2.78.
Automatic merge from submit-queue (batch tested with PRs 68087, 68256, 64621, 68299, 68296).
Bump addon-manager to v8.7
**What this PR does / why we need it**:
Major changes:
- Support extra `--prune-whitelist` resources in kube-addon-manager.
- Update kubectl to v1.10.7.
Basically picking up https://github.com/kubernetes/kubernetes/pull/67743.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #NONE
**Special notes for your reviewer**:
/assign @Random-Liu @mikedanese
**Release note**:
```release-note
Bump addon-manager to v8.7
- Support extra `--prune-whitelist` resources in kube-addon-manager.
- Update kubectl to v1.10.7.
```
Automatic merge from submit-queue (batch tested with PRs 68161, 68023, 67909, 67955, 67731).
Register RuntimeClass CRD as an addon
**What this PR does / why we need it**:
Register the RuntimeClass CRD when the RuntimeClass feature gate is enabled. This is done through the addon manager.
This is an alternative approach to https://github.com/kubernetes/kubernetes/pull/67924
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
For https://github.com/kubernetes/features/issues/585
**Release note**:
Covered by #67737
```release-note
NONE
```
/sig node
/kind feature
/priority important-soon
/milestone v1.12
Automatic merge from submit-queue (batch tested with PRs 67691, 68147).
Bump versions of components with latest security patches.
**What this PR does / why we need it**:
Upgrade versions of monitoring components used on GCP, to include latest security patches.
**Release note**:
```release-note
[fluentd-gcp-scaler addon] Bump fluentd-gcp-scaler to 0.4 to pick up security fixes.
[prometheus-to-sd addon] Bump prometheus-to-sd to 0.3.1 to pick up security fixes, bug fixes and new features.
[event-exporter addon] Bump event-exporter to 0.2.3 to pick up security fixes.
```
Automatic merge from submit-queue.
cluster/addons: add labels to fluentd owner files
**What this PR does / why we need it**:
This PR adds SIG labels to the fluentd OWNERS files:
- cluster/addons/fluentd-elasticsearch/OWNERS
- cluster/addons/fluentd-gcp/OWNERS
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #
**Special notes for your reviewer**:
let me know if the labels need adjustment.
**Release note**:
```release-note
NONE
```
/assign @roberthbailey @mikedanese
/cc @timothysc
/sig gcp
/sig instrumentation
/kind cleanup
Automatic merge from submit-queue.
Support extra prune resources in kube-addon-manager.
The default prune whitelist resources in https://github.com/kubernetes/kubernetes/blob/master/pkg/kubectl/cmd/apply.go#L531 are sometimes not enough.
One example is that when we remove an admission webhook running as an addon pod, after we remove the addon yaml file, the admission webhook pod will be pruned, but the `MutatingWebhookConfiguration`/`ValidatingWebhookConfiguration` won't... If the webhook failure policy is `Fail`, this will break the cluster, and users can't create new pods anymore.
It would be good to at least make this configurable, so that users and vendors can configure it based on their requirements.
This PR keeps the default prune resource list exactly the same as before; it just makes it possible to add extra ones.
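The underlying kubectl mechanism looks like this (the `--prune-whitelist` flag is real; the label, path, and extra entry below are illustrative):
```sh
# With --prune-whitelist, kubectl prunes exactly the listed group/version/kind
# set, so a removed webhook addon also gets its MutatingWebhookConfiguration
# cleaned up instead of lingering and breaking admission.
kubectl apply --prune -f /etc/kubernetes/addons \
  -l addonmanager.kubernetes.io/mode=Reconcile \
  --prune-whitelist core/v1/ConfigMap \
  --prune-whitelist admissionregistration.k8s.io/v1beta1/MutatingWebhookConfiguration
```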
@dchen1107 @MrHohn @kubernetes/sig-cluster-lifecycle-pr-reviews @kubernetes/sig-gcp-pr-reviews
Signed-off-by: Lantao Liu <lantaol@google.com>
**Release note**:
```release-note
Support extra `--prune-whitelist` resources in kube-addon-manager.
```
Automatic merge from submit-queue (batch tested with PRs 68051, 68130, 67211, 68065, 68117).
Put fluentd back to host network
In the future we will want to monitor each system component that is deployed as a DaemonSet using only one instance of prometheus-to-sd (which will be deployed as a DaemonSet too), but for this we need all the system components to be part of the host network. There is no port collision created with this change.
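A sketch of the pod-spec change (DaemonSet name reused from earlier in this log; the field is standard Kubernetes):
```sh
# Put fluentd on the host network; port 31337 (see the release note below)
# is what it will occupy there.
kubectl -n kube-system patch ds fluentd-gcp-v3.1.0 --type=merge -p '
spec:
  template:
    spec:
      hostNetwork: true
'
```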
```release-note
Port 31337 will be used by fluentd
```
Automatic merge from submit-queue (batch tested with PRs 67756, 64149, 68076, 68131, 68120).
Update manifest and version for CoreDNS
**What this PR does / why we need it**:
Updates the manifest of CoreDNS and also bumps the version of CoreDNS to 1.2.2.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes https://github.com/kubernetes/kubernetes/issues/68020
**Special notes for your reviewer**:
**Release note**:
```release-note
CoreDNS is now v1.2.2 for Kubernetes 1.12
```
v0.3.0 is the latest version of metrics-server, and brings a number of
internal stability improvements as well as some bugfixes and features.
NB: this currently disables Kubelet auth entirely, since this setup
needs to work on GKE for the tests, and GKE doesn't support delegated
Kubelet auth yet. When that's rectified, we can switch this over to
use secure options.