
Automatic merge from submit-queue (batch tested with PRs 44626, 45641)

Update Google Cloud DNS provider Rrset.Get(name) method to return a list, and change the `Rrset.List()` implementation to perform a paged walk

Some federated service e2e tests and a few ingress tests would become flaky after a few hundred runs. @csbell spent quite a lot of time debugging this and found that the flakiness was due to a bug in the federated service controller's deletion logic. Deleting a federated service object triggers logic in the controller to update the DNS records corresponding to that object. In the failed runs, this DNS record update logic returned an error, which in turn caused the controller to reschedule the operation. This led to an infinite retry-failure cycle that never gave the API server a chance to garbage collect the deleted service object.

A couple of days ago we started seeing a correlation between the number of resource records in a DNS managed zone and these test failures. If you look at the test runs before and after run 2900 in the test grid - https://k8s-testgrid.appspot.com/cluster-federation#gce - you will notice that the grid became super green at 2900. That's when I deleted all the dangling DNS records from the past runs.

After some investigation yesterday, we found that the `ResourceRecordSet.Get()` interface and its implementation, and the `ResourceRecordSet.List()` implementation, at least for Google Cloud DNS, were incorrect. This PR makes a minimal (read: least invasive) set of changes to the Google Cloud DNS provider implementation to fix these problems:

1. Modifies the DNS provider `Rrset.Get(name)` interface to return multiple records and updates the federated service controller accordingly. There can be multiple DNS resource records for a given name; they can vary by type, TTL, rrdata and a number of other parameters, so it is incorrect to return a single resource record for a given name. This change updates the Get interface to return multiple records for a given name and uses this list in the federated service controller to perform DNS operations.

2. Updates the Google Cloud DNS `List()` implementation to perform a paged walk of the lists and aggregate all the DNS records. The current `List()` implementation lists the DNS resource records in a given managed zone only once and returns that list; it neither performs a paged walk nor considers the `page_token` in the returned response. This change walks all the pages, aggregates the records in those pages, and returns the aggregated list. This is potentially dangerous, as it can blow up memory if there are a huge number of records in the given managed zone, but it is the best we can do without changing the provider interface too much. The next step is to define a new paged list interface and implement it. A sketch of the paged walk appears below.

**Release note**:

```release-note
NONE
```

/assign @csbell

cc @justinsb @shashidharatd @quinton-hoole @kubernetes/sig-federation-pr-reviews
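The following is a minimal Go sketch of the two ideas above, written directly against the `google.golang.org/api/dns/v1` client rather than the federation `dnsprovider` package that this PR actually changes; the function names, project, managed zone, and record name are illustrative placeholders, not code from the PR. It shows a `List()`-style paged walk that follows `NextPageToken` until every page has been aggregated, and a `Get(name)`-style lookup that returns every record set matching a name instead of assuming a single match.

```go
package main

import (
	"context"
	"fmt"
	"log"

	dns "google.golang.org/api/dns/v1"
)

// listAllRecordSets walks every page of the resourceRecordSets.list API,
// following NextPageToken and aggregating the results, rather than
// returning only the first page.
func listAllRecordSets(svc *dns.Service, project, zone string) ([]*dns.ResourceRecordSet, error) {
	var all []*dns.ResourceRecordSet
	pageToken := ""
	for {
		call := svc.ResourceRecordSets.List(project, zone)
		if pageToken != "" {
			call = call.PageToken(pageToken)
		}
		resp, err := call.Do()
		if err != nil {
			return nil, err
		}
		all = append(all, resp.Rrsets...)
		if resp.NextPageToken == "" {
			return all, nil
		}
		pageToken = resp.NextPageToken
	}
}

// getRecordSets returns every record set whose name matches. Records sharing
// a name can still differ in type, TTL and rrdatas, so returning a single
// record for a name would be incorrect.
func getRecordSets(svc *dns.Service, project, zone, name string) ([]*dns.ResourceRecordSet, error) {
	all, err := listAllRecordSets(svc, project, zone)
	if err != nil {
		return nil, err
	}
	var matches []*dns.ResourceRecordSet
	for _, rrset := range all {
		if rrset.Name == name {
			matches = append(matches, rrset)
		}
	}
	return matches, nil
}

func main() {
	ctx := context.Background()
	svc, err := dns.NewService(ctx) // uses Application Default Credentials
	if err != nil {
		log.Fatal(err)
	}
	// Placeholder project, managed zone and record name.
	rrsets, err := getRecordSets(svc, "my-project", "my-managed-zone", "myservice.example.com.")
	if err != nil {
		log.Fatal(err)
	}
	for _, rrset := range rrsets {
		fmt.Println(rrset.Name, rrset.Type, rrset.Ttl, rrset.Rrdatas)
	}
}
```

As noted in the PR description, aggregating every page into one slice keeps the existing provider interface intact at the cost of memory on very large zones; a true paged list interface is left as a follow-up.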
# Cluster Federation
Kubernetes Cluster Federation enables users to federate multiple Kubernetes clusters. Please see the user guide and the admin guide for more details about setting up and using the Cluster Federation.
## Building Kubernetes Cluster Federation
Please see the Kubernetes Development Guide for initial setup. Once you have the development environment set up as explained in that guide, you also need to install `jq`.
Building cluster federation artifacts should be as simple as running:

```sh
make build
```

You can specify the docker registry to tag the image using the `KUBE_REGISTRY` environment variable. Please make sure that you use the same value in all the subsequent commands.

To push the built docker images to the registry, run:

```sh
make push
```
To initialize the deployment (this pulls the installer images), run:

```sh
make init
```

To deploy the clusters and install the federation components, edit the `${KUBE_ROOT}/_output/federation/config.json` file to describe your clusters and run:

```sh
make deploy
```

To turn down the federation components and tear down the clusters, run:

```sh
make destroy
```
## Ideas for improvement
- Continue with the `destroy` phase even in the face of errors. The bash script sets `set -e` (errexit), which causes the script to exit at the very first error. This should be the default mode for deploying components, but not for destroying/cleanup.