The problem is that attachments are now done on the master, and we are
only caching the attachment map persistently for the local instance. So
there is now a race, because the attachment map is cleared every time.
Issue #29324
Automatic merge from submit-queue
AWS: More ELB attributes via service annotations
Replaces #25015 and addresses all of @justinsb's feedback therein. This is a new PR because I was unable to reopen #25015 to amend it.
I noticed recently that there is existing (but undocumented) precedent for the AWS cloud provider to manage ELB-specific load balancer configuration based on service annotations. In particular, one can _already_ designate an ELB as "internal" or enable PROXY protocol.
This PR extends this capability to the management of ELB attributes, which includes the following items:
* Access logs:
  * Enabled / disabled
  * Emit interval
  * S3 bucket name
  * S3 bucket prefix
* Connection draining:
  * Enabled / disabled
  * Timeout
* Connection:
  * Idle timeout
* Cross-zone load balancing:
  * Enabled / disabled
Some of these are possibly more useful than others. Use cases that immediately come to mind:
* Enabling cross-zone load balancing is potentially useful for "Ubernetes Light," or anyone otherwise attempting to spread worker nodes around multiple AZs.
* Increasing idle timeout is useful for the benefit of anyone dealing with long-running requests. An example I personally care about would be git pushes to Deis' builder component.
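As a rough sketch of how this looks from the user's side, a Service of type LoadBalancer would carry annotations along these lines. The exact key names are defined by this PR and are assumptions here, modeled on the existing `service.beta.kubernetes.io/aws-load-balancer-*` annotations:
```
package main

import "fmt"

// Hypothetical annotation keys (assumptions, not confirmed by this description),
// attached to a Service of type LoadBalancer to tune the underlying ELB attributes.
var elbAttributeAnnotations = map[string]string{
	"service.beta.kubernetes.io/aws-load-balancer-access-log-enabled":                "true",
	"service.beta.kubernetes.io/aws-load-balancer-access-log-emit-interval":          "60",
	"service.beta.kubernetes.io/aws-load-balancer-access-log-s3-bucket-name":         "my-elb-logs",
	"service.beta.kubernetes.io/aws-load-balancer-access-log-s3-bucket-prefix":       "prod/",
	"service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled":       "true",
	"service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout":       "60",
	"service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout":           "300",
	"service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled": "true",
}

func main() {
	for key, value := range elbAttributeAnnotations {
		fmt.Printf("%s: %s\n", key, value)
	}
}
```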
Automatic merge from submit-queue
AWS: Support HTTP->HTTP mode for ELB
**What this PR does / why we need it**:
Right now it is not possible to create an AWS ELB that listens for HTTP and where the backend pod also listens for HTTP.
I asked @justinsb in slack and he said that this seems to be an oversight, so I'd like to use this PR as a step towards solving this.
**Special notes for your reviewer**:
I've only added a simple unit test. Are any integration tests needed? I'm not familiar with the code base.
cc @therc
Automatic merge from submit-queue
Run goimports for the whole repo
While removing GOMAXPROCS and running goimports, I noticed quite a lot of other files also needed goimports formatting. Didn't commit `*.generated.go`, `*.deepcopy.go` or files in `vendor`.
This is more for testing if it builds.
The only strange thing here is the gopkg.in/gcfg.v1 => github.com/scalingdata/gcfg replace.
cc @jfrazelle @thockin
We have a few functions that predate aws-sdk-go, but they have natural
equivalents in aws-sdk-go. Document them as deprecated, and replace
the implementation with the equivalent in aws-sdk-go to make it obvious
that they are the same.
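As an illustrative sketch of the pattern (the actual helper names in the provider may differ), a pre-SDK helper is kept for compatibility, marked deprecated, and now simply delegates to its aws-sdk-go equivalent:
```
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
)

// orEmpty returns the string value of s, or "" if s is nil.
//
// Deprecated: use aws.StringValue instead. The wrapper only remains so existing
// call sites keep compiling; the implementation is now the SDK equivalent.
func orEmpty(s *string) string {
	return aws.StringValue(s)
}

func main() {
	id := "i-0123456789abcdef0"
	fmt.Printf("%q %q\n", orEmpty(&id), orEmpty(nil)) // "i-0123456789abcdef0" ""
}
```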
Automatic merge from submit-queue
AWS: Added experimental option to skip zone check
This pull request resolves #28380. In the vast majority of cases, it is appropriate to validate the AWS region against a known set of regions. However, there is the edge case where this is undesirable as Kubernetes may be deployed in an AWS-like environment where the region is not one of the known regions.
By adding the optional **DisableStrictZoneCheck = true** to the **[Global]** section in the aws.conf file (e.g. /etc/aws/aws.conf) one can bypass the region validation.
Automatic merge from submit-queue
AWS/GCE: Spread PetSet volume creation across zones, create GCE volumes in non-master zones
Long term we plan on integrating this into the scheduler, but in the
short term we use the volume name to place it onto a zone.
We hash the volume name so we don't bias to the first few zones.
If the volume name "looks like" a PetSet volume name (ending with
-<number>) then we use the number as an offset. In that case we hash
the base name.
Lots of comments describing the heuristics, how it fits together and the
limitations.
In particular, we can't guarantee correct volume placement if the set of
zones is changing between allocating volumes.
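A minimal sketch of the heuristic (not the exact Kubernetes implementation):
```
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
	"strings"
)

// chooseZoneForVolume hashes the volume name to pick a zone. If the name looks
// like a PetSet volume ("<base>-<number>"), only the base name is hashed and
// the number is used as an offset, so members of the set spread across zones.
func chooseZoneForVolume(volumeName string, zones []string) string {
	sort.Strings(zones) // stable ordering, so the same name always maps to the same zone

	base, offset := volumeName, uint32(0)
	if i := strings.LastIndex(volumeName, "-"); i >= 0 {
		if n, err := strconv.Atoi(volumeName[i+1:]); err == nil && n >= 0 {
			base, offset = volumeName[:i], uint32(n)
		}
	}

	h := fnv.New32a()
	h.Write([]byte(base)) // hashing the base avoids biasing every cluster toward the first zone
	return zones[(h.Sum32()+offset)%uint32(len(zones))]
}

func main() {
	zones := []string{"us-east-1a", "us-east-1b", "us-east-1c"}
	for _, name := range []string{"data-web-0", "data-web-1", "data-web-2"} {
		fmt.Println(name, "->", chooseZoneForVolume(name, zones))
	}
}
```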
Fixes #27256
Fixes #26268
Implements the second SSL ELB annotation, per #24978
service.beta.kubernetes.io/aws-load-balancer-ssl-ports=* (or e.g. https)
If not specified, all ports are secure (SSL or HTTPS).
Add ELB proxy protocol support via the annotation
"service.beta.kubernetes.io/aws-load-balancer-proxy-protocol". This
allows servers like Nginx and Haproxy to retrieve the real IP address of
a remote client.
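Putting the two annotations above together on a LoadBalancer Service, a minimal sketch (values are illustrative):
```
package main

import "fmt"

// Both keys appear in the commits above; the values are illustrative.
// "ssl-ports" limits which listener ports are treated as SSL/HTTPS (all ports
// are secure when it is omitted); "proxy-protocol" enables the PROXY protocol
// toward the backends so servers such as Nginx and HAProxy see the client IP.
var serviceAnnotations = map[string]string{
	"service.beta.kubernetes.io/aws-load-balancer-ssl-ports":      "https",
	"service.beta.kubernetes.io/aws-load-balancer-proxy-protocol": "*",
}

func main() {
	for key, value := range serviceAnnotations {
		fmt.Printf("%s: %s\n", key, value)
	}
}
```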
Automatic merge from submit-queue
AWS: Move enforcement of attached AWS device limit from kubelet to scheduler
The limit on the number of EBS volumes attached to a node is now enforced by the scheduler. It can be adjusted via the `KUBE_MAX_PD_VOLS` env. variable there. Therefore we don't need the same check in the kubelet. If the system admin wants to attach more, we should allow it.
The kubelet limit is now 650 attached volumes ('ba'..'zz').
Note that the scheduler counts only *pods* assigned to a node. When a pod is deleted and a new pod is scheduled on a node, the kubelet starts (slowly) detaching the old volume and (slowly) attaching the new volume. Depending on AWS speed **it may happen that more than KUBE_MAX_PD_VOLS volumes are actually attached to a node for some time!** The kubelet will clean this up in a few seconds / minutes (both attach and detach are quite slow).
Fixes #22994
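Conceptually the scheduler-side check is just a count of the EBS volumes the node's pods (plus the incoming pod) would use, compared against the limit; a sketch, not the actual predicate code:
```
package main

import (
	"fmt"
	"os"
	"strconv"
)

// maxEBSVolumes returns the per-node limit, honouring KUBE_MAX_PD_VOLS when set.
// The fallback of 39 is illustrative; the scheduler's actual default may differ.
func maxEBSVolumes() int {
	if v := os.Getenv("KUBE_MAX_PD_VOLS"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return 39
}

// fitsEBSVolumeLimit reports whether scheduling a pod that needs newVolumes onto
// a node whose pods already use the existing volumes stays within the limit.
// A volume is counted once even if several pods reference it.
func fitsEBSVolumeLimit(existing, newVolumes []string) bool {
	unique := map[string]bool{}
	for _, v := range existing {
		unique[v] = true
	}
	for _, v := range newVolumes {
		unique[v] = true
	}
	return len(unique) <= maxEBSVolumes()
}

func main() {
	fmt.Println(fitsEBSVolumeLimit([]string{"vol-1", "vol-2"}, []string{"vol-3"})) // true
}
```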
Automatic merge from submit-queue
AWS: Allow cross-region image pulling with ECR
Fixes #23298
Definitely should be in the release notes; should maybe get merged in 1.2 along with #23594 after some soaking. Documentation changes to follow.
cc @justinsb @erictune @rata @miguelfrde
This is step two. We now create long-lived, lazy ECR providers in all regions.
When first used, they will create the actual ECR providers doing the work
behind the scenes, namely talking to ECR in the region where the image lives,
rather than the one our instance is running in.
Also:
- moved the list of AWS regions out of the AWS cloudprovider and into the
credentialprovider, then exported it from there.
- improved logging
Behold, running in us-east-1:
```
aws_credentials.go:127] Creating ecrProvider for us-west-2
aws_credentials.go:63] AWS request: ecr:GetAuthorizationToken in us-west-2
aws_credentials.go:217] Adding credentials for user AWS in us-west-2
Successfully pulled image "123456789012.dkr.ecr.us-west-2.amazonaws.com/test:latest"
```
*"One small step for a pod, one giant leap for Kube-kind."*
Use constructor for ecrProvider
Rename package to "credentials" like golint requests
Don't wrap the lazy provider with a caching provider
Add immediate compile-time interface conformance checks for the interfaces
Added comments
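The compile-time conformance checks mentioned above are the usual Go idiom of assigning a typed value to a blank interface variable; a sketch with assumed names:
```
package main

// DockerConfigProvider is an assumed, simplified version of the interface the
// credential providers implement.
type DockerConfigProvider interface {
	Enabled() bool
}

type ecrProvider struct{}

func (p *ecrProvider) Enabled() bool { return true }

// Compile-time conformance check: if ecrProvider ever stops implementing
// DockerConfigProvider, the package fails to build at this line instead of
// failing later at runtime.
var _ DockerConfigProvider = &ecrProvider{}

func main() {}
```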
Automatic merge from submit-queue
AWS: Add support for ap-northeast-2 region (Seoul)
This PR does:
- Support AWS Seoul region: ap-northeast-2.
Currently, I cannot set up Kubernetes in the AWS Seoul region.
Error Messages:
>
> ip-10-0-0-50 core # docker logs 0697db
> I0419 07:57:44.569174 1 aws.go:466] Zone not specified in configuration file; querying AWS metadata service
> F0419 07:57:44.570380 1 controllermanager.go:279] Cloud provider could not be initialized: could not init cloud provider "aws": not a valid AWS zone (unknown region): ap-northeast-2a
The limit on the number of EBS volumes attached to a node is now enforced by the
scheduler. It can be adjusted via the KUBE_MAX_PD_VOLS env. variable there.
Therefore we don't need the same check in the kubelet. If the system admin wants to
attach more, we should allow it.
The kubelet limit is now 650 attached volumes ('ba'..'zz').
This is a better abstraction than passing in specific pieces of the
Service that each of the cloudproviders may or may not need. For
instance, many of the providers don't need a region, yet this is passed
in. Similarly, many of the providers want a string IP for the load
balancer, but a converted net.IP is passed in. Affinity is unused by
AWS. A provider change may also require adding a new parameter which has
an effect on all other cloud provider implementations.
Further, this will simplify adding provider specific load balancer
options, such as with labels or some other metadata. For example, we
could add labels for configuring the details of an AWS elastic load
balancer, such as idle timeout on connections, whether it is
internal or external, cross-zone load balancing, and so on.
Authors: @chbatey, @jsravn
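Schematically, with stand-in types and simplified signatures (not the real cloudprovider interface), the change looks like this:
```
package main

import "net"

// Stand-in types for illustration; the real ones live in the Kubernetes API packages.
type Service struct {
	Name        string
	Annotations map[string]string
	// ...spec fields elided
}
type ServicePort struct{ Port int32 }
type ServiceAffinity string

// Before (sketch): every provider receives the same pre-extracted fields,
// whether it needs them or not, and adding one more field touches every provider.
type loadBalancerBefore interface {
	EnsureLoadBalancer(name, region string, ip net.IP, ports []*ServicePort,
		hosts []string, affinity ServiceAffinity) error
}

// After (sketch): the provider receives the Service itself and reads only what
// it needs, including provider-specific annotations such as an ELB idle timeout.
type loadBalancerAfter interface {
	EnsureLoadBalancer(service *Service, hosts []string) error
}

func main() {}
```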
The previous logic was incorrect; if we saw two untagged security groups
before seeing the first tagged security group, we would incorrectly return an
error.
Fix #23339
AWS has a soft limit of 40 attached EBS devices. Assuming there is just
one root device, use the rest for persistent volumes.
The devices will have name /dev/xvdba - /dev/xvdcm, leaving /dev/sda - /dev/sdz
to the system.
Also, add better error handling and propagate error
"Too many EBS volumes attached to node XYZ" to a pod.
There are known issues with the attached-volume state cache that just aren't
possible to fix with the current interface.
Replace it with a map of the active attach jobs (that was the original
requirement, to avoid a nasty race condition).
This costs us an extra DescribeInstance call on attach/detach, but that
seems worth it if it ends this class of bugs.
Fix #15073
Either the ELB is slow to delete (in which case the bumped timeout will
help), or the security groups are otherwise blocked (in which case
logging them will help us track this down).
Fix #17626
We know the ELB call will fail, so we error out early rather than
hitting the API. Preserves rate limit quota, and also allows us to give
a more self-evident message.
Fix #21993
Now that we can't build an awsInstance from metadata, because of the
PrivateDnsName issue, we might as well simplify the arguments.
Create a 'placeholder' method though - newAWSInstanceFromMetadata - that
documents the desire to use metadata, shows how we would get it, but
links to the bug which explains why we can't use it.
Had to move other things around too to avoid a weird api ->
cloudprovider dependency.
Also adding fixes per code reviews.
(This is a squash of the previously approved commits)
This refactors #21431 to pull a lot of the code into cloudprovider so it
can be reused by AWS.
It also changes the name of the annotation to be non-GCE specific:
service.beta.kubernetes.io/load-balancer-source-ranges
Fix #21651
Fix the AWS subnet lookup that checks if a subnet is public, which was
missing a few cases:
- Subnets without explicit routing tables, which use the main VPC
routing table.
- Routing tables not tagged with KubernetesCluster. The filter for this
is now removed.
Like everything else AWS, we differentiate between k8s-owned security
groups and k8s-not-owned security groups using tags.
When we are setting up the ingress rule for ELBs, pick the security
group that is tagged over any others.
We continue to tolerate a single security group being untagged, but
having multiple security groups without tagging is now an error, as it
leads to undefined behaviour.
We also log at startup if the cluster tag is not defined.
Fix #21986
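A sketch of the selection rule with assumed types (and incorporating the fix described further up, so a tagged group wins even when several untagged groups are present):
```
package main

import (
	"errors"
	"fmt"
)

type securityGroup struct {
	ID   string
	Tags map[string]string // e.g. "KubernetesCluster" -> cluster name
}

// chooseSecurityGroup prefers a group tagged for this cluster, tolerates a
// single untagged group, and treats multiple untagged groups (with no tagged
// one) as an error, since picking one of them would be undefined behaviour.
func chooseSecurityGroup(groups []securityGroup, clusterName string) (string, error) {
	var tagged, untagged []string
	for _, g := range groups {
		switch g.Tags["KubernetesCluster"] {
		case clusterName:
			tagged = append(tagged, g.ID)
		case "":
			untagged = append(untagged, g.ID)
		}
	}
	switch {
	case len(tagged) > 0:
		return tagged[0], nil
	case len(untagged) == 1:
		return untagged[0], nil
	case len(untagged) > 1:
		return "", errors.New("multiple untagged security groups found; tag the one to use")
	default:
		return "", errors.New("no suitable security group found")
	}
}

func main() {
	groups := []securityGroup{
		{ID: "sg-111", Tags: map[string]string{}},
		{ID: "sg-222", Tags: map[string]string{"KubernetesCluster": "my-cluster"}},
	}
	fmt.Println(chooseSecurityGroup(groups, "my-cluster")) // sg-222 <nil>
}
```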
Add aws cloud config:
[global]
disableSecurityGroupIngress = true
The aws provider creates an inbound rule per load balancer on the node
security group. However, this can quickly run into the AWS security
group rule limit of 50.
This disables the automatic ingress creation. It requires that the user
has set up a rule that allows inbound traffic on kubelet ports from the
local VPC subnet (so load balancers can access it). E.g. `10.82.0.0/16
30000-32000`.
Limits: http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Appendix_Limits.html#vpc-limits-security-groups
Authors: @jsravn, @balooo
When finding an instance by node name in AWS, only retrieve running
instances. Otherwise, terminated instances from old nodes can show up with the
same tag when rebuilding nodes in the cluster.
Another improvement made is to filter instances by the node names
provided, rather than selecting all instances and filtering in code.
Authors: @jsravn, @chbatey, @balooo
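A sketch of the narrowed query using aws-sdk-go (the filter names are the standard EC2 ones; in the AWS provider, node names are the instances' private DNS names):
```
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// describeRunningNodes asks EC2 only for running instances whose private DNS
// name matches one of the requested nodes, instead of listing every instance
// and filtering in code.
func describeRunningNodes(client *ec2.EC2, nodeNames []string) (*ec2.DescribeInstancesOutput, error) {
	return client.DescribeInstances(&ec2.DescribeInstancesInput{
		Filters: []*ec2.Filter{
			{
				Name:   aws.String("instance-state-name"),
				Values: []*string{aws.String("running")}, // excludes terminated instances of rebuilt nodes
			},
			{
				Name:   aws.String("private-dns-name"),
				Values: aws.StringSlice(nodeNames),
			},
		},
	})
}

func main() {
	client := ec2.New(session.Must(session.NewSession()))
	out, err := describeRunningNodes(client, []string{"ip-10-0-0-50.ec2.internal"})
	fmt.Println(out, err)
}
```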
This applies a cross-request time delay when we observe
RequestLimitExceeded errors, unlike the default library behaviour which
only applies a *per-request* backoff.
Issue #12121
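A sketch of the shared (cross-request) delay, as opposed to the SDK's per-request backoff; the names and numbers here are illustrative:
```
package main

import (
	"strings"
	"sync"
	"time"
)

// crossRequestRetryDelay is shared by all AWS calls: once RequestLimitExceeded
// is observed, every subsequent request is delayed, not just retries of the
// request that was throttled.
type crossRequestRetryDelay struct {
	mu    sync.Mutex
	delay time.Duration
}

// BeforeRequest sleeps for the current shared delay, if any.
func (c *crossRequestRetryDelay) BeforeRequest() {
	c.mu.Lock()
	d := c.delay
	c.mu.Unlock()
	if d > 0 {
		time.Sleep(d)
	}
}

// AfterRequest grows the shared delay on throttling errors and decays it otherwise.
func (c *crossRequestRetryDelay) AfterRequest(err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if err != nil && strings.Contains(err.Error(), "RequestLimitExceeded") {
		if c.delay == 0 {
			c.delay = 50 * time.Millisecond
		} else if c.delay < 5*time.Second {
			c.delay *= 2
		}
	} else {
		c.delay /= 2
	}
}

func main() {
	d := &crossRequestRetryDelay{}
	d.BeforeRequest()
	d.AfterRequest(nil)
}
```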
In the AWS API (generally) we tag things we create, and then we filter
to find them. However, creation & tagging are typically two separate
calls. So there is a chance that we will create an object, but fail to
tag it.
We fix this (done here in the case of security groups, but we can do
this more generally) by retrieving the resource without a tag filter.
If the retrieved resource has the correct tags, great. If it has the
tags for another cluster, that's a problem, and we raise an error. If
it has no tags at all, we add the tags.
This only works where the resource is uniquely named (or we can
otherwise retrieve it uniquely). For security groups, the SG name comes
from the service UUID, so that's unique.
Fixes #11324
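A sketch of the recovery step described above (simplified; the real code works on EC2 security groups and their tags):
```
package main

import (
	"errors"
	"fmt"
)

// ensureClusterTags is called after looking the resource up by its unique name
// with no tag filter: accept it if it is already ours, repair missing tags if
// the earlier tagging call failed, and refuse to adopt another cluster's resource.
func ensureClusterTags(tags map[string]string, clusterName string, addTags func(map[string]string) error) error {
	if owner, ok := tags["KubernetesCluster"]; ok {
		if owner == clusterName {
			return nil // created and tagged previously; nothing to do
		}
		return fmt.Errorf("resource is tagged for another cluster: %q", owner)
	}
	if len(tags) == 0 {
		// Creation succeeded but tagging failed last time; finish the job now.
		return addTags(map[string]string{"KubernetesCluster": clusterName})
	}
	return errors.New("resource has unexpected tags; refusing to adopt it")
}

func main() {
	err := ensureClusterTags(map[string]string{}, "my-cluster", func(t map[string]string) error {
		fmt.Println("adding tags:", t)
		return nil
	})
	fmt.Println(err)
}
```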
This commit allows the AWS cloud provider plugin to work on EC2 instances
that do not have a public IP. The EC2 metadata service returns a 404 for the
'public-ipv4' endpoint for private instances, and the plugin was bubbling this
up as a fatal error.
We are (sadly) using a copy-and-paste of the GCE PD code for AWS EBS.
This code hasn't been updated in a while, and it seems that the GCE code
has some code to make volume mounting more robust that we should copy.
The ip permission method now checks for containment, not equality, so
the order of parameters matters. This change fixes
`removeSecurityGroupIngress` to pass in the removal permission first to
compare against the existing permission.
Change isEqualIPPermission to consider the entire list of security group
ids when checking if a security group id has already been added.
This is used for example when adding and removing ingress rules to the
cluster nodes from an elastic load balancer. Without this, once there
are multiple load balancers, the method as it stands incorrectly returns
false even if the security group id is in the list of group ids. This
causes a few problems: dangling security groups which fill up an
account's limit since they don't get removed, and inability to recreate
load balancers in certain situations (receiving an
InvalidPermission.Duplicate from AWS when adding the same security
group).
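A sketch of the containment check on a permission's security-group IDs (reduced stand-in types; the real structures come from the EC2 API):
```
package main

import "fmt"

// ipPermission is a reduced stand-in: only the set of granted security-group IDs matters here.
type ipPermission struct {
	GroupIDs []string
}

// containsPermission reports whether existing already covers every group ID in
// wanted. Because this is containment rather than equality, argument order
// matters: pass the permission being added or removed as wanted, and the
// permission currently attached to the node security group as existing.
func containsPermission(existing, wanted ipPermission) bool {
	have := map[string]bool{}
	for _, id := range existing.GroupIDs {
		have[id] = true
	}
	for _, id := range wanted.GroupIDs {
		if !have[id] {
			return false
		}
	}
	return true
}

func main() {
	node := ipPermission{GroupIDs: []string{"sg-elb-1", "sg-elb-2"}}
	oneELB := ipPermission{GroupIDs: []string{"sg-elb-2"}}
	fmt.Println(containsPermission(node, oneELB)) // true: sg-elb-2 is already granted
	fmt.Println(containsPermission(oneELB, node)) // false: order of arguments matters
}
```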
findInstancesByNodeNames was a simple loop around
findInstanceByNodeName, which made an EC2 API call for each call.
We've had trouble with this sort of behaviour hitting EC2 rate limits on
bigger clusters (e.g. #11979).
Instead, change this method to fetch _all_ the tagged EC2 instances, and
then loop through the local results. This is one API call (modulo
paging).
We are currently only using findInstancesByNodeNames for the load
balancer, where we attach every node, so we were fetching all but one of
the instances anyway.
Issue #11979
For AWS EBS, a volume can only be attached to a node in the same AZ.
The scheduler must therefore detect if a volume is being attached to a
pod, and ensure that the pod is scheduled on a node in the same AZ as
the volume.
So that the scheduler need not query the cloud provider every time, and
to support decoupled operation (e.g. bare metal) we tag the volume with
our placement labels. This is done automatically by means of an
admission controller on AWS when a PersistentVolume is created backed by
an EBS volume.
Support for tagging GCE PVs will follow.
Pods that specify a volume directly (i.e. without using a
PersistentVolumeClaim) will not currently be scheduled correctly (i.e.
they will be scheduled without zone-awareness).
The General Purpose SSD ('gp2') volume type is just slightly more expensive than
Magnetic ('standard' / default in AWS), while the performance gain is pretty
significant.
So far, the volumes were created only during testing, where the extra cost
won't make any difference. In future, we plan to introduce QoS classes, where
users could choose SSD/Magnetic depending on their use cases.
'gp2' is just the default volume type for a (hopefully) short period before these
QoS classes are implemented.
For some reason, MiB was used for the public functions and the AWS cloud provider
recalculated it to GiB. Let's expose what AWS really supports and not hide the
real allocation units.