Automatic merge from submit-queue
make storage enablement, serialization, and location orthogonal
This allows a caller (command-line, config, code) to specify multiple separate pieces of config information regarding storage and have them properly composed at runtime. The information provided is exposed through interfaces to allow alternate implementations, which allows us to change the expression of the config moving forward. I also fixed up the types to be correct as I moved through.
The same options still exist, but they're composed slightly differently
1. specify target etcd servers per Group or per GroupResource
1. specify storage GroupVersions per Groups or per GroupResource
1. specify etcd prefixes per GroupVersion or per GroupResource
1. specify that multiple GroupResources share the same location in etcd
1. enable GroupResources by GroupVersion or by GroupResource whitelist or GroupResource blacklist
The `storage.Interface` is built per GroupResource by:
1. find the set of possible storage GroupResource based on the priority list of cohabitators
1. choose a GroupResource from the set by looking at which Groups have the resource enabled
1. find the target etcd server, etcd prefix, and storage encoding based on the GroupResource
The API server can have its resources separately enabled, but for now I've kept them linked.
@liggitt I think we need this (or something like it) to be able to go from config to these interfaces. Given another round of refactoring, we may be able to reshape these to be more forward driving.
@smarterclayton this is important for rebasing and for a seamless 1.2 to 1.3 migration for us.
Automatic merge from submit-queue
etcd3 store: provide compactor util
What's this PR?
- Provides a util to compact keys in etcd.
Reason:
We want to save the most recent 10 minutes event history. It should be more than enough for slow watchers. It is not number based, so it can tolerate event bursts too. We do not want to save longer since the current storage API cannot take advantage of the multi-version key yet. We might keep a longer history in the future.
Automatic merge from submit-queue
Bump up etcd dependency to fix data race
ref: https://github.com/kubernetes/kubernetes/pull/23694
What this PR does
- Bumping up the godep of etcd to fix data race in etcd watcher. Without this change, watcher PR builds will fail in race detection.
- Small changes to fix builds after upgrade
Automatic merge from submit-queue
Make etcd cache size configurable
Instead of the prior 50K limit, allow users to specify a more sensible size for their cluster.
I'm not sure what a sensible default is here. I'm still experimenting on my own clusters. 50 gives me a 270MB max footprint. 50K caused my apiserver to run out of memory as it exceeded >2GB. I believe that number is far too large for most people's use cases.
There are some other fundamental issues that I'm not addressing here:
- Old etcd items are cached and potentially never removed (it stores using modifiedIndex, and doesn't remove the old object when it gets updated)
- Cache isn't LRU, so there's no guarantee the cache remains hot. This makes its performance difficult to predict. More of an issue with a smaller cache size.
- 1.2 etcd entries seem to have a larger memory footprint (I never had an issue in 1.1, even though this cache existed there). I suspect that's due to image lists on the node status.
This is provided as a fix for #23323
In the e2e benchmarks, this timer is a significant source of garbage
and stale timers. Because the timer is not stopped after its use
in the select, it stays in the timer heap until it eventually fires
(5 seconds later). Under load, a lot of 5-second timers can pile up
before any start going away. The timer heap being large makes timer
operations take longer; the operations are O(log N) but N is still big.
The way to fix this in current versions of Go is to stop the underlying
timer explicitly, which this CL does for this one case.
There are many other places in the code that use the same idiom,
but those do not show up on profiles of the e2e server.
I am investigating changes for Go 1.7's runtime that would make
the old code behave like this new code transparently, so I don't
think it's worth updating any uses of the idiom that are not in
hot spots found with profiling.
Measuring 'LIST nodes' latency in milliseconds during e2e test
shows the benefit of this change.
Using Go 1.4.2:
BEFORE p50: 148±7 p90: 328±19 p99: 513±29 n: 10
AFTER p50: 151±8 p90: 339±19 p99: 479±20 n: 9
Using Go 1.6.0:
BEFORE p50: 141±9 p90: 383±32 p99: 604±44 n: 11
AFTER p50: 140±14 p90: 360±31 p99: 483±39 n: 10
We decided to remove this test, as there's no way to get an upper bound
on its running time. Etcd restart behavior should be tested in
integration or e2e tests.
Before this change we have a mish-mash of ways to pass field names around for
error generation. Sometimes string fieldnames, sometimes .Prefix(), sometimes
neither, often wrong names or not indexed when it should be.
Instead of that mess, this is part one of a couple of commits that will make it
more strongly typed and hopefully encourage correct behavior. At least you
will have to think about field names, which is better than nothing.
It turned out to be really hard to do this incrementally.
When etcd is down today we don't specifically handle the error involved,
which means clients get a generic 500 error. This commit adds a formal
error type internally for both WatchExpired and EtcdUnreachable, and
then converts them to api/errors before returning to the client. It also
upgrades the client to retry on any 429 or 5xx error that has a
Retry-After header, instead of just 429.
In combination, this allows the apiserver to exert backpressure on
controllers that are hotlooping. Picked 2 seconds by default, but we
could potentially ramp that up even further in a future iteration.