This change adds 2 options for windows:
--forward-healthcheck-vip: If true forward service VIP for health check
port
--root-hnsendpoint-name: The name of the hns endpoint name for root
namespace attached to l2bridge, default is cbr0
When --forward-healthcheck-vip is set as true and winkernel is used,
kube-proxy will add an hns load balancer to forward health check request
that was sent to lb_vip:healthcheck_port to the node_ip:healthcheck_port.
Without this forwarding, the health check from google load balancer will
fail, and it will stop forwarding traffic to the windows node.
This change fixes the following 2 cases for service:
- `externalTrafficPolicy: Cluster` (default option): healthcheck_port is
10256 for all services. Without this fix, all traffic won't be directly
forwarded to windows node. It will always go through a linux node and
get forwarded to windows from there.
- `externalTrafficPolicy: Local`: different healthcheck_port for each
service that is configured as local. Without this fix, this feature
won't work on windows node at all. This feature preserves client ip
that tries to connect to their application running in windows pod.
Change-Id: If4513e72900101ef70d86b91155e56a1f8c79719
For each iptables-restore call, log the number of services, endpoints,
filter chains, filter rules, NAT chains, and NAT rules in the update
at V(2), in addition to logging the actual rules if V(9).
All of the tests used a localDetector that considered the pod IP range
to be 10.0.0.0/24, but lots of the tests used pod IPs in 10.180.0.0/16
or 10.0.1.0/24, meaning the generated iptables rules were somewhat
inconsistent. Fix this by expanding the localDetector's pod IP range
to 10.0.0.0/8. (Changing the pod IPs to all be in 10.0.0.0/24 instead
would be a much larger change since it would result in the SEP chain
names changing.)
Meanwhile, the different tests were also horribly inconsistent about
what values they used for other IPs, and some of them even used the
same IPs (or ports) for different things in the same test case. Fix
these all up and create a consistent set of IP assignments:
// Pod IPs: 10.0.0.0/8
// Service ClusterIPs: 172.30.0.0/16
// Node IPs: 192.168.0.0/24
// Local Node IP: 192.168.0.2
// Service ExternalIPs: 192.168.99.0/24
// LoadBalancer IPs: 1.2.3.4, 5.6.7.8, 9.10.11.12
// Non-cluster IPs: 203.0.113.0/24
// LB Source Range: 203.0.113.0/25
Only run assertIPTablesRuleJumps() on the expected output, not on the
actual output, since if there's a problem with the actual output, we'd
rather see it as the diff from the expected output.
In large clusters, the iptables-restore input will be tens of
thousands of lines long, and logging it at V(5) essentially means that
"kube-proxy -v=5" cannot be used in such clusters to see _other_
things that get logged at V(5), because logs will get rolled over far
too quickly. So bump the full-rules logging output down to V(9).
kube-proxy sets the sysctl net.ipv4.conf.all.route_localnet=1
so NodePort services can be accessed on the loopback addresses in
IPv4, but this may present security issues.
Leverage the --nodeport-addresses flag to opt-out of this feature,
if the list is not empty and none of the IP ranges contains an IPv4
loopback address this sysctl is not set.
In addition, add a warning to inform users about this behavior.
Just check that the actual IP:port of the filtered endpoints is
correct; using DeepEqual requires us to copy all the extra endpoint
fields (eg, ZoneHints, IsLocal) from endpoints to expectedEndpoints,
which just makes the test cases unnecessarily bigger.
The "node local endpoints, hints are ignored" test was not actually
enabling topology correctly, so it would have gotten the expected
result even if the code was wrong. (Which, FTR, it wasn't.)
When nodePortAddresses is not specified for kube-proxy, it tried to open
the node port for a NodePort service twice, triggered by IPv4ZeroCIDR
and IPv6ZeroCIDR separately. The first attempt would succeed and the
second one would always generate an error log like below:
"listen tcp4 :30522: bind: address already in use"
This patch fixes it by ensuring nodeAddresses of a proxier only contain
the addresses for its IP family.
The same code appeared twice, once for the SVC chain and once for the
XLB chain, with the only difference being that the XLB version had
more verbose comments.
Also, in the NodePort code, fix it to properly take advantage of the
fact that GetNodeAddresses() guarantees that if it returns a
"match-all" CIDR, then it doesn't return anything else. That also
makes it unnecessary to loop over the node addresses twice.
If you pass just an IP address to "-s" or "-d", the iptables command
will fill in the correct mask automatically.
Originally, the proxier was just hardcoding "/32" for all of these,
which was unnecessary but simple. But when IPv6 support was added, the
code was made more complicated to deal with the fact that the "/32"
needed to be "/128" in the IPv6 case, so it would parse the IPs to
figure out which family they were, which in turn involved adding some
checks in case the parsing fails (even though that "can't happen" and
the old code didn't check for invalid IPs, even though that would
break the iptables-restore if there had been any).
Anyway, all of that is unnecessary because we can just pass the IP
strings to iptables directly rather than parsing and unparsing them
first.
(The diff to proxier_test.go is just deleting "/32" everywhere.)
If GetNodeAddresses() fails (eg, because you passed the wrong CIDR to
`--nodeport-addresses`), then any NodePort services would end up with
only half a set of iptables rules. Fix it to just not output the
NodePort-specific parts in that case (in addition to logging an error
about the GetNodeAddresses() failure).
The iptables and ipvs proxiers both had a check that none of the
elements of svcInfo.LoadBalancerIPStrings() were "", but that was
already guaranteed by the svcInfo code. Drop the unnecessary checks
and remove a level of indentation.
* Fix regression in kube-proxy
Don't use a prepend() - that allocates. Instead, make Write() take
either strings or slices (I wish we could express that better).
* WIP: switch to intf
* WIP: less appends
* tests and ipvs