Commit Graph

8105 Commits

Author SHA1 Message Date
Kubernetes Prow Robot
7a4496429d
Merge pull request #71252 from monotek/elasticsearch
updated elasticsearch to 6.6.1
2019-02-26 09:33:44 -08:00
Kubernetes Prow Robot
b8ddc7945b
Merge pull request #74522 from Pluies/master
Fix fluentd-gcp addon liveness probe
2019-02-26 06:38:24 -08:00
André Bauer
9e2d9cfbb0 changed es image repo
Signed-off-by: André Bauer <monotek23@gmail.com>
2019-02-26 09:09:21 +01:00
Kubernetes Prow Robot
3fb6e77770
Merge pull request #74549 from yujuhong/pause-image
GCE: switch to using e2eteam/pause:3.1 for pause containers
2019-02-25 23:12:00 -08:00
Kubernetes Prow Robot
0ff7e463ee
Merge pull request #73746 from mrbobbytables/kubemark-shellcheck
Fix shellcheck lint errors in Kubemark scripts
2019-02-25 17:25:13 -08:00
Kubernetes Prow Robot
3814176d42
Merge pull request #74455 from SataQiu/fix-shell-2019022302
Fix shellcheck lint errors in cluster and hack scripts
2019-02-25 15:15:19 -08:00
Yu-Ju Hong
b863655faa GCE: switch to using e2eteam/pause:3.1 for pause containers
Stop building pause images on node startup.
2019-02-25 14:36:49 -08:00
Kubernetes Prow Robot
2aacb77374
Merge pull request #74444 from pjh/gce-windows-no-defender
Disable Windows Defender on Windows nodes.
2019-02-25 13:54:42 -08:00
Bob Killen
9a4f4878f5
Fix shellcheck lint errors in cluster/kubemark/util.sh 2019-02-25 15:21:29 -05:00
Bob Killen
9a58913e8f
Fix shellcheck lint errors in cluster/kubemark/iks/config-default.sh 2019-02-25 15:21:25 -05:00
Bob Killen
ce4c85e3fd
Fix shellcheck lint errors in cluster/kubemark/gce/config-default.sh 2019-02-25 14:55:01 -05:00
Kubernetes Prow Robot
35a258d640
Merge pull request #73272 from danielqsj/juju
fix shellcheck in cluster/juju
2019-02-25 11:33:21 -08:00
Kubernetes Prow Robot
f288678cfa
Merge pull request #73261 from danielqsj/local
fix shellcheck in cluster/local
2019-02-25 11:33:11 -08:00
Florent Delannoy
e627474e8f Fix fluentd-gcp addon liveness probe
Fix three issues with the fluentd-gcp liveness probe:

h1. STUCK_THRESHOLD_SECONDS was overridden by LIVENESS_THRESHOLD_SECONDS
if defined

Probably a copy/paste issue introduced in edf1ffc074

h1. `[[` is [a bashism](https://stackoverflow.com/a/47576482), and will always failed when called with `/bin/sh`

Introduced by a844523c20

Given that we call the liveness probe with `/bin/sh`, we cannot use the
double-bracketed `[[` syntax for test, as it is not POSIX-compliant and
will throw an error.

Annoyingly, even through it prints an error, `sh` returns with exit code 0
in this case:

```bash
root@fluentd-7mprs:/# sh liveness.sh
liveness.sh: 8: liveness.sh: [[: not found
liveness.sh: 15: liveness.sh: [[: not found
root@fluentd-7mprs:/# echo $?
0
```

Which means the liveness probe is considered successful by Kubernetes,
despite failing to test things as it was intended. This is also
probably the reason why this bug wasn't reported sooner :)

Thankfully, the test in this case can just as easily be written as
POSIX-compliant as it doesn't use any bash-specific features within the
`[[` block.

h1. Buffers are transient and cannot be relied upon for monitoring

Finally, after fixing the above issue, we started seeing the fluentd
containers being restarted very often, and found an issue with the
underlying logic of the liveness probe.

The probe checks that the pod is still alive by running the following
command:

`find /var/log/fluentd-buffers -type f -newer /tmp/marker-stuck -print -quit`

This checks if any _regular_ file exists under `/var/log/fluentd-buffers`
that is more recent than a predetermined time, and will return an empty
string otherwise.

The issue is that these buffers are temporary and volatile, they get created and
deleted constantly. Here is an example of running that check every second on a
running fluentd:

```
root@fluentd-eks-playground-jdc8m:/# LIVENESS_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-300};
root@fluentd-eks-playground-jdc8m:/# STUCK_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-900};
root@fluentd-eks-playground-jdc8m:/# touch -d "${STUCK_THRESHOLD_SECONDS} seconds ago" /tmp/marker-stuck;
root@fluentd-eks-playground-jdc8m:/# touch -d "${LIVENESS_THRESHOLD_SECONDS} seconds ago" /tmp/marker-liveness;
root@fluentd-eks-playground-jdc8m:/# while true; do date ; find /var/log/fluentd-buffers -type f -newer /tmp/marker-stuck -print -quit ; sleep 1 ; done
Fri Feb 22 10:52:57 UTC 2019
Fri Feb 22 10:52:58 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827964ccf4c7004103c3fa7c8533f85.log
Fri Feb 22 10:52:59 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827964ccf4c7004103c3fa7c8533f85.log
Fri Feb 22 10:53:00 UTC 2019
Fri Feb 22 10:53:01 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827964fb8b2eedcccd2763ea7775cc2.log
Fri Feb 22 10:53:02 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827964fb8b2eedcccd2763ea7775cc2.log
Fri Feb 22 10:53:03 UTC 2019
Fri Feb 22 10:53:04 UTC 2019
Fri Feb 22 10:53:05 UTC 2019
Fri Feb 22 10:53:06 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827965564883997b673d703af54848b.log
Fri Feb 22 10:53:07 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827965564883997b673d703af54848b.log
Fri Feb 22 10:53:08 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827965564883997b673d703af54848b.log
Fri Feb 22 10:53:09 UTC 2019
Fri Feb 22 10:53:10 UTC 2019
Fri Feb 22 10:53:11 UTC 2019
Fri Feb 22 10:53:12 UTC 2019
Fri Feb 22 10:53:13 UTC 2019
Fri Feb 22 10:53:14 UTC 2019
Fri Feb 22 10:53:15 UTC 2019
Fri Feb 22 10:53:16 UTC 2019
```

We can see buffers being created, then disappearing. The LivenessProbe running
under these conditions has a ~50% chance of failing, despite fluentd being
perfectly happy.

I believe that check is probably ok for fluentd installs using large
amounts of buffers, in which case the liveness probe will be correct more
often than not, but fluentd installs that use buffering less intensively
will be negatively impacted by this.

My solution to fix this is to check the last updated time of buffering
_folders_ within `/var/log/fluentd_buffers`. These _do_ get updated when
buffers are created, and do not get deleted as buffers are emptied,
making them the perfect candidate for our use.

Here's an example with the `-d` flag for directories:
```
root@fluentd-eks-playground-jdc8m:/# while true; do date ; find /var/log/fluentd-buffers -type d -newer /tmp/marker-stuck -print -quit ; sleep 1 ; done
Fri Feb 22 10:57:51 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:52 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:53 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:54 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:55 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:56 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:57 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:58 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:59 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:58:00 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:58:01 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:58:02 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:58:03 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
```

And example of the directory being updated as new buffers come in:
```
root@fluentd-eks-playground-jdc8m:/# ls -lah /var/log/fluentd-buffers/kubernetes.system.buffer
total 0
drwxr-xr-x 2 root root  6 Feb 22 11:17 .
drwxr-xr-x 3 root root 38 Feb 22 11:14 ..
root@fluentd-eks-playground-jdc8m:/# ls -lah /var/log/fluentd-buffers/kubernetes.system.buffer
total 16K
drwxr-xr-x 2 root root  224 Feb 22 11:18 .
drwxr-xr-x 3 root root   38 Feb 22 11:14 ..
-rw-r--r-- 1 root root 1.8K Feb 22 11:18 buffer.b58279be6e21e8b29fc333a7d50096ed0.log
-rw-r--r-- 1 root root  215 Feb 22 11:18 buffer.b58279be6e21e8b29fc333a7d50096ed0.log.meta
-rw-r--r-- 1 root root  429 Feb 22 11:18 buffer.b58279be6f09bdfe047a96486a525ece2.log
-rw-r--r-- 1 root root  195 Feb 22 11:18 buffer.b58279be6f09bdfe047a96486a525ece2.log.meta
root@fluentd-eks-playground-jdc8m:/# ls -lah /var/log/fluentd-buffers/kubernetes.system.buffer
total 0
drwxr-xr-x 2 root root  6 Feb 22 11:18 .
drwxr-xr-x 3 root root 38 Feb 22 11:14 ..
```
2019-02-25 11:48:31 +00:00
André Bauer
2d15ffc9cc updated to 6.5.2
Signed-off-by: André Bauer <monotek23@gmail.com>
2019-02-25 10:56:50 +01:00
André Bauer
0c29ea1a2e Update es-statefulset.yaml 2019-02-25 10:55:23 +01:00
André Bauer
53a936c359 Update Makefile 2019-02-25 10:55:23 +01:00
André Bauer
0e44fa6359 updated elasticsearch to 6.5.0 2019-02-25 10:55:23 +01:00
danielqsj
7d051e1a75 update juju shell 2019-02-24 20:46:20 +08:00
danielqsj
7e655e8666 fix shellcheck in cluster/juju 2019-02-24 20:40:59 +08:00
danielqsj
f02a986081 add comments to shell function 2019-02-24 20:35:46 +08:00
SataQiu
9cda80e836 fix shellcheck lint errors in cluster and hack scripts 2019-02-24 11:15:35 +08:00
Kubernetes Prow Robot
1cf8001e53
Merge pull request #74449 from xichengliudui/fix190223
make more of the shell pass lints
2019-02-23 12:52:34 -08:00
Kubernetes Prow Robot
8993fbc543
Merge pull request #74328 from daixiang0/delete-blank
delete all duplicate empty blanks
2019-02-23 01:43:58 -08:00
Peter Hornyack
621df2cddb Disable Windows Defender on Windows test nodes. 2019-02-22 18:35:38 -08:00
Xiang Dai
36065c6dd7 delete all duplicate empty blanks
Signed-off-by: Xiang Dai <764524258@qq.com>
2019-02-23 10:28:04 +08:00
Jeff Grafton
7a938eb541 Create work-around wrappers for pure attribute on go_binary and go_test
This enables cgo when cross-compiling certain tests and binaries to
Linux, while disabling cgo for Windows and Darwin.
2019-02-22 18:08:34 -08:00
Kubernetes Prow Robot
46d7e9c784
Merge pull request #74381 from yujuhong/add-key
GCE: add EventLog registry key for docker
2019-02-22 17:58:44 -08:00
Kubernetes Prow Robot
fd7acc3275
Merge pull request #74065 from ixdy/break-up-test-tarball
break up the test tarball
2019-02-22 17:58:23 -08:00
Kubernetes Prow Robot
743f864310
Merge pull request #73819 from coffeepac/move-fluentd-es-images
Move fluentd es images
2019-02-22 17:58:12 -08:00
Patrick Christopher
1bd45ba6eb review updates 2019-02-22 10:00:10 -08:00
Aaron Crickenberger
2d868025be Promote spiffxp to approver for cluster/
Also add Katharine as reviewer
2019-02-22 09:25:31 -08:00
Kubernetes Prow Robot
c7ac532816
Merge pull request #74360 from pjh/gce-windows-cluster-ssh
Enable OpenSSH on Windows nodes in test clusters.
2019-02-21 20:51:23 -08:00
Kubernetes Prow Robot
125dc6c8ea
Merge pull request #74187 from xichengliudui/fixgolint0218
Fix shellcheck lint errors in cluster/addons/fluentd-elasticsearch/fl……uentd-es-image/run.sh
2019-02-21 20:51:13 -08:00
Yu-Ju Hong
40d0ae311c GCE: add EventLog registry key for docker 2019-02-21 17:47:32 -08:00
Kubernetes Prow Robot
042f9ed3af
Merge pull request #74093 from blakebarnett/lower-neg-cache-ttl
Lowers the default nodelocaldns denial cache TTL
2019-02-21 17:47:16 -08:00
Blake
46c299c1b1 Match default cache size of 10000
https://github.com/coredns/coredns/blob/master/plugin/cache/cache.go#L236
This gets rounded down to the nearest multiple of 256: 9984
2019-02-21 15:03:30 -08:00
Peter Hornyack
57ca6f007e Enable OpenSSH on Windows nodes in test clusters.
Also switches to the most recent 64-bit version of OpenSSH for Windows.

Tested:
PROJECT=${CLOUDSDK_CORE_PROJECT} KUBERNETES_SKIP_CONFIRM=y NUM_NODES=2 \
NUM_WINDOWS_NODES=2 KUBE_GCE_ENABLE_IP_ALIASES=true TEST_CLUSTER=true \
./cluster/kube-up.sh
2019-02-21 14:03:43 -08:00
Jeff Grafton
b360f95eb3 cleanup: we always need to download client and server tarballs 2019-02-21 13:17:58 -08:00
Jeff Grafton
56949c7834 Support split test tarballs in get-kube-binaries.sh 2019-02-21 13:17:58 -08:00
Peter Hornyack
6d78f2b666 Default to Windows Server version 1809 for Windows nodes.
Removes all references to 1803, including moving "win1803" directory to
just "windows". A single Windows directory suffices for now, if
necessary in the future we can shard it into directories for each
Windows version.

We've been running tests with Windows 1809 nodes for a couple days in
our fork without major problems:
https://testgrid.k8s.io/google-windows#windows-prototype&width=20.
Testing on Azure is already using 1809:
https://testgrid.k8s.io/sig-windows#Conformance%20acs-engine%20on%20Azure&width=20.
2019-02-21 09:44:44 -08:00
Kubernetes Prow Robot
f1de0b557c
Merge pull request #74324 from mtaufen/fix-windows
Fix hash if statement
2019-02-20 23:57:18 -08:00
xichengliudui
053332ad46 Fix shellcheck lint errors in cluster/addons/fluentd-elasticsearch/fluentd-es-image/run.sh
update pull request

update pull request

update pull request

update pull request

update pull request
2019-02-21 02:00:48 -05:00
Kubernetes Prow Robot
6c1f2077e5
Merge pull request #74192 from xichengliudui/fixshellcheck190218
make more of the shell pass lints
2019-02-20 21:41:25 -08:00
Kubernetes Prow Robot
054a676141
Merge pull request #74142 from javier-b-perez/master
GCE config.sh script should use headers for curl GCS apis
2019-02-20 21:41:12 -08:00
Michael Taufen
cf3ad9c655 Fix hash if statement 2019-02-20 16:56:00 -08:00
Kubernetes Prow Robot
f04ce297d6
Merge pull request #74100 from mtaufen/file-download-improvements
Retry downloads, respect URL list, validate tar hash
2019-02-20 11:34:06 -08:00
Michael Taufen
7ffe810f1d Retry downloads, respect URL list, validate tar hash 2019-02-20 08:52:46 -08:00
Kubernetes Prow Robot
f5989303b7
Merge pull request #74060 from SataQiu/fix-shellcheck-20190214
Fix shellcheck failures on kube-down.sh kubeadm.sh get-build.sh
2019-02-19 21:41:17 -08:00
Kubernetes Prow Robot
db7d930aab
Merge pull request #74109 from pjh/gce-windows-cluster-smoke-test
Detect ready pods correctly and untaint Windows nodes in smoke-test.
2019-02-19 19:57:40 -08:00