All of the tests used a localDetector that considered the pod IP range
to be 10.0.0.0/24, but lots of the tests used pod IPs in 10.180.0.0/16
or 10.0.1.0/24, meaning the generated iptables rules were somewhat
inconsistent. Fix this by expanding the localDetector's pod IP range
to 10.0.0.0/8. (Changing the pod IPs to all be in 10.0.0.0/24 instead
would be a much larger change since it would result in the SEP chain
names changing.)
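(For illustration only, not the proxier code: the local detector is essentially a CIDR containment check, so a pod IP like 10.180.0.1 only counts as "local" once the range is widened to 10.0.0.0/8.)

```go
package main

import (
	"fmt"
	"net"
)

// Why the detector CIDR and the test pod IPs need to agree: a pod IP outside
// the detector's range is treated as non-local and picks up masquerade rules
// it otherwise wouldn't.
func main() {
	_, oldRange, _ := net.ParseCIDR("10.0.0.0/24")
	_, newRange, _ := net.ParseCIDR("10.0.0.0/8")
	podIP := net.ParseIP("10.180.0.1")
	fmt.Println("local under 10.0.0.0/24:", oldRange.Contains(podIP)) // false
	fmt.Println("local under 10.0.0.0/8: ", newRange.Contains(podIP)) // true
}
```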
Meanwhile, the different tests were also horribly inconsistent about
what values they used for other IPs, and some of them even used the
same IPs (or ports) for different things in the same test case. Fix
these all up and create a consistent set of IP assignments:
// Pod IPs: 10.0.0.0/8
// Service ClusterIPs: 172.30.0.0/16
// Node IPs: 192.168.0.0/24
// Local Node IP: 192.168.0.2
// Service ExternalIPs: 192.168.99.0/24
// LoadBalancer IPs: 1.2.3.4, 5.6.7.8, 9.10.11.12
// Non-cluster IPs: 203.0.113.0/24
// LB Source Range: 203.0.113.0/25
Only run assertIPTablesRuleJumps() on the expected output, not on the
actual output, since if there's a problem with the actual output, we'd
rather see it as the diff from the expected output.
When nodePortAddresses is not specified for kube-proxy, it tries to open
the node port for a NodePort service twice, triggered by IPv4ZeroCIDR
and IPv6ZeroCIDR separately. The first attempt succeeds and the
second one always generates an error log like the one below:
"listen tcp4 :30522: bind: address already in use"
This patch fixes it by ensuring nodeAddresses of a proxier only contain
the addresses for its IP family.
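Roughly, the fix amounts to filtering the node addresses by family before opening ports; the helper below is hypothetical, for illustration only, not the actual kube-proxy code:

```go
package main

import (
	"fmt"
	"net"
)

// filterByFamily keeps only the addresses matching the proxier's IP family,
// so an IPv4 proxier never tries to bind an IPv6 node port (and vice versa).
func filterByFamily(addrs []string, wantIPv6 bool) []string {
	var out []string
	for _, a := range addrs {
		ip := net.ParseIP(a)
		if ip == nil {
			continue // skip unparseable entries
		}
		if (ip.To4() == nil) == wantIPv6 {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	nodeAddrs := []string{"192.168.0.2", "fd00::2"}
	fmt.Println(filterByFamily(nodeAddrs, false)) // [192.168.0.2]
	fmt.Println(filterByFamily(nodeAddrs, true))  // [fd00::2]
}
```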
The same code appeared twice, once for the SVC chain and once for the
XLB chain, with the only difference being that the XLB version had
more verbose comments.
If you pass just an IP address to "-s" or "-d", the iptables command
will fill in the correct mask automatically.
Originally, the proxier simply hardcoded "/32" for all of these, which
was unnecessary but simple. When IPv6 support was added, the code was
made more complicated to deal with the fact that the "/32" needed to be
"/128" in the IPv6 case: it would parse the IPs to figure out which
family they were, which in turn meant adding checks for the case where
parsing fails (even though that "can't happen", and the old code hadn't
checked for invalid IPs either, even though an invalid IP would have
broken the iptables-restore).
Anyway, all of that is unnecessary because we can just pass the IP
strings to iptables directly rather than parsing and unparsing them
first.
(The diff to proxier_test.go is just deleting "/32" everywhere.)
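A small before/after sketch, with a made-up helper name just for illustration:

```go
package main

import (
	"fmt"
	"net"
)

// Old approach (roughly): parse the IP just to decide between /32 and /128.
func destinationArgOld(ip string) string {
	parsed := net.ParseIP(ip)
	if parsed == nil {
		return "" // the "can't happen" error case the old code had to carry
	}
	if parsed.To4() != nil {
		return ip + "/32"
	}
	return ip + "/128"
}

// New approach: iptables fills in the full-length mask for a bare address,
// so the string can be passed straight through.
func destinationArgNew(ip string) string {
	return ip
}

func main() {
	fmt.Println("-d", destinationArgOld("172.30.0.41")) // -d 172.30.0.41/32
	fmt.Println("-d", destinationArgNew("172.30.0.41")) // -d 172.30.0.41
}
```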
The logic to detect stale endpoints was not taking endpoint readiness
into account.
We can have stale conntrack entries on UDP services for two reasons:
- an endpoint was receiving traffic and is removed or replaced
- a service was receiving traffic but not forwarding it, and then starts
to forward it
Add an e2e test to cover the regression.
Filter the allEndpoints list into readyEndpoints sooner, and set
"hasEndpoints" based (mostly) on readyEndpoints, not allEndpoints (so
that, eg, we correctly generate REJECT rules for services with no
_functioning_ endpoints, even if they have unusable terminating
endpoints).
Also, write out the endpoint chains at the top of the loop when we
iterate the endpoints for the first time, rather than copying some of
the data to another set of variables and then writing them out later.
And don't write out endpoint chains that won't be used
Also, generate affinity rules only for readyEndpoints rather than
allEndpoints, so affinity gets broken correctly when an endpoint
becomes unready.
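In rough outline (simplified types and variable names, not the proxier's actual code), the reordering amounts to:

```go
package main

import "fmt"

// endpoint is a simplified stand-in for the proxier's endpoint info type.
type endpoint struct {
	ip    string
	ready bool
}

func main() {
	allEndpoints := []endpoint{
		{ip: "10.0.1.1", ready: true},
		{ip: "10.0.1.2", ready: false}, // e.g. unready/terminating
	}

	// Filter into readyEndpoints up front, before writing any chains.
	var readyEndpoints []endpoint
	for _, ep := range allEndpoints {
		if ep.ready {
			readyEndpoints = append(readyEndpoints, ep)
		}
	}

	// hasEndpoints (and hence whether we emit a REJECT rule) is based on
	// the functioning endpoints, not on allEndpoints.
	hasEndpoints := len(readyEndpoints) > 0
	fmt.Println("hasEndpoints:", hasEndpoints)

	// Endpoint chains and affinity rules are generated only for
	// readyEndpoints, so affinity is broken when an endpoint goes unready.
	for _, ep := range readyEndpoints {
		fmt.Println("write endpoint chain for", ep.ip)
	}
}
```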
The external traffic policy terminating endpoints test was testing
LoadBalancer functionality against a NodePort service with no
nodePorts (or loadBalancer IPs). It managed to test what it wanted to
test, but it's kind of dubious (and we probably _shouldn't_ have been
generating the rules it was looking for since there was no way to
actually reach the XLB chains). So fix that.
Also make the terminating endpoints test use session affinity, to add
more testing for that. Also, remove the multiple copies of the
identical Service that is used for all of the test cases in that test.
Also add a "Cluster traffic policy and no source ranges" test to
TestOverallIPTablesRulesWithMultipleServices since we weren't really
testing either of those.
Also add a test of --masquerade-all.
When LocalDetector was implemented, this test was inadvertently broken
so that it no longer actually exercised the "no cluster CIDR" case (and
the old version of the unit test didn't check enough to notice this).
The original tests here were very shy about looking at the iptables
output, and just relied on checks like "make sure there's a jump to
table X that also includes string Y somewhere in it" and stuff like
that. Whereas the newer tests were just like, "eh, here's a wall of
text, make sure the iptables output is exactly that". Although the
latter looks messier in the code, it's more precise, and it's easier
to update correctly when you change the rules. So just make all of the
tests do a check on the full iptables output.
(Note that I didn't double-check any of the output; I'm just assuming
that the output of the current iptables proxy code is actually
correct...)
Also, don't hardcode the expected number of rules in the metrics
tests, so that there's one less thing to adjust when rules change.
Also, use t.Run() in one place to get more precise errors on failure.
The test was sorting the iptables output so as to not depend on the
order that services get processed in, but this meant it wasn't
checking the relative ordering of rules (and in fact, the ordering of
the rules in the "expected" string was wrong, in a way that would
break things if the rules had actually been generated in that order).
Add a more complicated sorting function that sorts services
alphabetically while preserving the ordering of rules within each
service.
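A minimal sketch of that kind of sort (the key extraction here is simplified; the real test keys on the actual KUBE-SVC-* chain names):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sortByService sorts iptables rules alphabetically by the service chain
// they belong to, while preserving the relative order of the rules within
// each service.
func sortByService(rules []string, keyOf func(string) string) []string {
	groups := map[string][]string{}
	var keys []string
	for _, r := range rules {
		k := keyOf(r)
		if _, seen := groups[k]; !seen {
			keys = append(keys, k)
		}
		groups[k] = append(groups[k], r) // preserves per-service rule order
	}
	sort.Strings(keys) // services sorted alphabetically
	var out []string
	for _, k := range keys {
		out = append(out, groups[k]...)
	}
	return out
}

func main() {
	rules := []string{
		"-A KUBE-SVC-B ... rule 1",
		"-A KUBE-SVC-B ... rule 2",
		"-A KUBE-SVC-A ... rule 1",
	}
	sorted := sortByService(rules, func(r string) string {
		return strings.Fields(r)[1] // key on the chain name
	})
	fmt.Println(strings.Join(sorted, "\n"))
}
```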
Because the proxy.Provider interface included
proxyconfig.EndpointsHandler, all the backends needed to
implement its methods. But iptables, ipvs, and winkernel implemented
them as no-ops, and metaproxier had an implementation that wouldn't
actually work (because it couldn't handle Services with no active
Endpoints).
Since Endpoints processing in kube-proxy is deprecated (and can't be
re-enabled unless you're using a backend that doesn't support
EndpointSlice), remove proxyconfig.EndpointsHandler from the
definition of proxy.Provider and drop all the useless implementations.
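Roughly the shape of the change, with heavily simplified stand-ins for the proxyconfig handler interfaces (the real interfaces have Add/Update/Delete/Synced methods):

```go
package main

import "fmt"

// Simplified stand-ins for the proxyconfig handler interfaces.
type EndpointsHandler interface{ OnEndpointsSynced() }
type EndpointSliceHandler interface{ OnEndpointSlicesSynced() }
type ServiceHandler interface{ OnServiceSynced() }

// Before (roughly): every backend had to satisfy EndpointsHandler,
// even if only with no-op methods.
type ProviderBefore interface {
	EndpointsHandler
	EndpointSliceHandler
	ServiceHandler
	Sync()
}

// After (roughly): EndpointsHandler is dropped from the interface, so the
// no-op implementations in iptables/ipvs/winkernel can be deleted.
type ProviderAfter interface {
	EndpointSliceHandler
	ServiceHandler
	Sync()
}

func main() { fmt.Println("interface sketch only") }
```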
The nat KUBE-SERVICES chain is called from the OUTPUT and PREROUTING stages. In
clusters with a large number of services, the nat KUBE-SERVICES chain is the largest
chain, with e.g. 33k rules. This patch aims to move the KubeMarkMasq rules from
the kubeServicesChain into the respective KUBE-SVC-* chains. This way, during
rule matching for each packet we won't have to traverse the MASQ rules of all services,
which accumulate in the KUBE-SERVICES and/or KUBE-NODEPORTS chains. Since the
jump to KUBE-MARK-MASQ ultimately sets the 0x400 mark for nodeIP SNAT, it should not
matter whether the jump is made from KUBE-SERVICES or the KUBE-SVC-* chains.
Specifically we change:
1) For ClusterIP svc, we move the KUBE-MARK-MASQ jump rule from KUBE-SERVICES
chain into KUBE-SVC-* chain.
2) For ExternalIP svc, we move the KUBE-MARK-MASQ jump rule in the case of
non-ServiceExternalTrafficPolicyTypeLocal from KUBE-SERVICES
chain into KUBE-SVC-* chain.
3) For NodePorts svc, we move the KUBE-MARK-MASQ jump rule in case of
non-ServiceExternalTrafficPolicyTypeLocal from KUBE-NODEPORTS chain to
KUBE-SVC-* chain.
4) For load-balancer svc, we don't change anything since it is already svc specific
due to creation of KUBE-FW-* chains per svc.
This would cut the rules per svc in KUBE-SERVICES and KUBE-NODEPORTS in half.
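As a rough before/after illustration (made-up chain hash, simplified match arguments; not actual iptables-save output):

```go
package main

import "fmt"

func main() {
	// Before: the KUBE-MARK-MASQ jump sits in the shared KUBE-SERVICES
	// chain, so every packet walks past one such rule per service.
	before := `-A KUBE-SERVICES ! -s 10.0.0.0/8 -d 172.30.0.41 -p tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 172.30.0.41 -p tcp --dport 80 -j KUBE-SVC-EXAMPLEHASH`

	// After: the jump moves into the per-service KUBE-SVC-* chain, so it is
	// only evaluated for packets that already matched that service.
	after := `-A KUBE-SERVICES -d 172.30.0.41 -p tcp --dport 80 -j KUBE-SVC-EXAMPLEHASH
-A KUBE-SVC-EXAMPLEHASH ! -s 10.0.0.0/8 -j KUBE-MARK-MASQ`

	fmt.Println("before:\n" + before + "\n\nafter:\n" + after)
}
```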
1. Add API definitions;
2. Add feature gate and drop the field when the feature gate is not on;
3. Set default values for the field;
4. Add API validation;
5. Add kube-proxy iptables and ipvs implementations;
6. Add tests.
Clear conntrack entries for UDP NodePorts;
this has to be done AFTER the iptables rules are programmed.
It can happen that traffic to the NodePort hits the host before
the iptables rules are programmed; this creates a stale entry
in conntrack that will blackhole the traffic, so we need to
clear it, but ONLY when the service has endpoints.
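A minimal ordering sketch; programRules and clearUDPConntrackForPort are hypothetical stand-ins for iptables-restore and the conntrack cleanup, not real kube-proxy helpers:

```go
package main

import "fmt"

func programRules(rules string) error { fmt.Println("iptables-restore"); return nil }

func clearUDPConntrackForPort(port int) error {
	fmt.Printf("clear conntrack entries for UDP :%d\n", port)
	return nil
}

// Rules first, conntrack flush second, and only for node ports whose
// service actually has endpoints.
func syncOnce(rules string, udpNodePortsWithEndpoints []int) error {
	if err := programRules(rules); err != nil {
		return err
	}
	for _, port := range udpNodePortsWithEndpoints {
		if err := clearUDPConntrackForPort(port); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	_ = syncOnce("*nat\n...\nCOMMIT", []int{30522})
}
```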
1. For iptables mode, add KUBE-NODEPORTS chain in filter table. Add
rules to allow healthcheck node port traffic.
2. For ipvs mode, add KUBE-NODE-PORT chain in filter table. Add
KUBE-HEALTH-CHECK-NODE-PORT ipset to allow traffic to healthcheck
node port.
* api: structure change
* api: defaulting, conversion, and validation
* [FIX] validation: auto remove second ip/family when service changes to SingleStack
* [FIX] api: defaulting, conversion, and validation
* api-server: clusterIPs alloc, printers, storage and strategy
* [FIX] clusterIPs default on read
* alloc: auto remove second ip/family when service changes to SingleStack
* api-server: repair loop handling for clusterIPs
* api-server: force kubernetes default service into single stack
* api-server: tie dualstack feature flag with endpoint feature flag
* controller-manager: feature flag, endpoint, and endpointSlice controllers handling multi family service
* [FIX] controller-manager: feature flag, endpoint, and endpointSlice controllers handling multi family service
* kube-proxy: feature-flag, utils, proxier, and meta proxier
* [FIX] kubeproxy: call both proxiers at the same time
* kubenet: remove forced pod IP sorting
* kubectl: modify describe to include ClusterIPs, IPFamilies, and IPFamilyPolicy
* e2e: fix tests that depend on IPFamily field AND add dual stack tests
* e2e: fix expected error message for ClusterIP immutability
* add integration tests for dualstack
The third phase of dual stack is a very complex change to the API;
basically, it introduces dual-stack Services. The main changes
(sketched below) are:
- It pluralizes the Service IPFamily field to IPFamilies,
and removes the singular field.
- It introduces a new field IPFamilyPolicyType that can take
3 values to express the "dual-stack(mad)ness" of the cluster:
SingleStack, PreferDualStack and RequireDualStack.
- It pluralizes ClusterIP to ClusterIPs.
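Roughly, the pluralized spec fields look like this (heavily simplified; the real types live in k8s.io/api/core/v1 with the usual tags and many more fields):

```go
package main

import "fmt"

// Simplified sketch of the pluralized Service spec fields.
type IPFamily string

type IPFamilyPolicyType string

const (
	IPv4Protocol IPFamily = "IPv4"
	IPv6Protocol IPFamily = "IPv6"
)

const (
	IPFamilyPolicySingleStack      IPFamilyPolicyType = "SingleStack"
	IPFamilyPolicyPreferDualStack  IPFamilyPolicyType = "PreferDualStack"
	IPFamilyPolicyRequireDualStack IPFamilyPolicyType = "RequireDualStack"
)

type ServiceSpec struct {
	ClusterIP      string              // kept for compatibility; first entry of ClusterIPs
	ClusterIPs     []string            // pluralized
	IPFamilies     []IPFamily          // pluralized; replaces the singular IPFamily field
	IPFamilyPolicy *IPFamilyPolicyType // SingleStack, PreferDualStack or RequireDualStack
}

func main() {
	policy := IPFamilyPolicyRequireDualStack
	spec := ServiceSpec{
		ClusterIP:      "172.30.0.41",
		ClusterIPs:     []string{"172.30.0.41", "fd00:30::41"},
		IPFamilies:     []IPFamily{IPv4Protocol, IPv6Protocol},
		IPFamilyPolicy: &policy,
	}
	fmt.Println(spec.ClusterIPs, spec.IPFamilies, *spec.IPFamilyPolicy)
}
```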
The goal is to add coverage to the services API operations,
taking into account the 6 different modes a cluster can have:
- single stack: IPv4 or IPv6 (as of today)
- dual stack: IPv4 only, IPv6 only, IPv4 - IPv6, IPv6 - IPv4
* [FIX] add integration tests for dualstack
* generated data
* generated files
Co-authored-by: Antonio Ojea <aojea@redhat.com>
In #56164, we had split the reject rules for services without endpoints
into the KUBE-EXTERNAL-SERVICES chain in order to avoid calling KUBE-SERVICES
from INPUT. However, in #74394 KUBE-SERVICES was re-added into INPUT.
As noted in #56164, the kernel is sensitive to the size of the INPUT chain. This
patch refrains from calling the KUBE-SERVICES chain from INPUT and FORWARD, and
instead adds the load balancer reject rule to the KUBE-EXTERNAL-SERVICES chain,
which will be called from INPUT and FORWARD.
Before this fix, a Service with a loadBalancerSourceRange value that
included a space would cause kube-proxy to crashloop. This updates
kube-proxy to trim any space from that field.
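A minimal sketch of the sanitization (the slice here just stands in for Service.Spec.LoadBalancerSourceRanges):

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// A stray space in an entry used to make the CIDR fail to parse and
	// kube-proxy crashloop; trimming each entry fixes it.
	sourceRanges := []string{" 203.0.113.0/25", "198.51.100.0/24 "}
	for i, r := range sourceRanges {
		sourceRanges[i] = strings.TrimSpace(r)
	}
	fmt.Println(sourceRanges) // [203.0.113.0/25 198.51.100.0/24]
}
```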
It seems that if you set the packet mark on a packet and then route
that packet through a kernel VXLAN interface, the VXLAN-encapsulated
packet will still have the mark from the original packet. Since our
NAT rules are based on the packet mark, this was causing us to
double-NAT some packets, which then triggered a kernel checksumming
bug. But even without the checksum bug, there are reasons to avoid
double-NATting, so fix the rules to unmark the packets before
masquerading them.
Errors from staticcheck:
pkg/proxy/healthcheck/proxier_health.go:55:2: field port is unused (U1000)
pkg/proxy/healthcheck/proxier_health.go:162:20: printf-style function with dynamic format string and no further arguments should use print-style function instead (SA1006)
pkg/proxy/healthcheck/service_health.go:166:20: printf-style function with dynamic format string and no further arguments should use print-style function instead (SA1006)
pkg/proxy/iptables/proxier.go:737:2: this value of args is never used (SA4006)
pkg/proxy/iptables/proxier.go:737:15: this result of append is never used, except maybe in other appends (SA4010)
pkg/proxy/iptables/proxier.go:1287:28: this result of append is never used, except maybe in other appends (SA4010)
pkg/proxy/userspace/proxysocket.go:293:3: this value of n is never used (SA4006)
pkg/proxy/winkernel/metrics.go:74:6: func sinceInMicroseconds is unused (U1000)
pkg/proxy/winkernel/metrics.go:79:6: func sinceInSeconds is unused (U1000)
pkg/proxy/winuserspace/proxier.go:94:2: field portMapMutex is unused (U1000)
pkg/proxy/winuserspace/proxier.go:118:2: field owner is unused (U1000)
pkg/proxy/winuserspace/proxier.go:119:2: field socket is unused (U1000)
pkg/proxy/winuserspace/proxysocket.go:620:4: this value of n is never used (SA4006)
This reverts commit 1ca0ffeaf2.
kube-proxy is not recreating the rules associated with the
KUBE-MARK-DROP chain, which is created by the kubelet.
It is preferable to avoid the dependency between the kubelet and
kube-proxy, and to have each of them handle its own rules.
Until now, iptables probabilities had 5 decimal places of granularity.
That meant that probabilities would start to repeat once a Service
had 319 or more endpoints.
This doubles the granularity to 10 decimal places, ensuring that
probabilities will not repeat until a Service reaches 100,223 endpoints.
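As a small worked illustration (this only shows the effect of the formatting width; the 319 and 100,223 endpoint thresholds above come from this change's description, not from this snippet):

```go
package main

import "fmt"

// kube-proxy spreads traffic over n endpoints with a chain of random-mode
// statistic rules whose probabilities are roughly 1/n, 1/(n-1), ..., 1/2.
func probability(remaining, decimals int) string {
	return fmt.Sprintf("%0.*f", decimals, 1.0/float64(remaining))
}

func main() {
	// At 5 decimal places, neighboring values start printing identically
	// around a few hundred endpoints; at 10 places they stay distinct.
	for _, n := range []int{2, 319, 320} {
		fmt.Printf("n=%d  5dp=%s  10dp=%s\n", n, probability(n, 5), probability(n, 10))
	}
}
```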
Kube-proxy runs two different health servers; one for monitoring the
health of kube-proxy itself, and one for monitoring the health of
specific services. Rename them to "ProxierHealthServer" and
"ServiceHealthServer" to make this clearer, and do a bit of API
cleanup too.
The detectStaleConnections function in kube-proxy is very expensive in
terms of CPU utilization. The results of this function are only actually
used for UDP ports. This adds a protocol attribute to ServicePortName to
make it simple to only run this function for UDP connections. For
clusters with primarily TCP connections this can improve kube-proxy
performance by 2x.
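A simplified sketch of the idea (the real ServicePortName embeds a NamespacedName and uses the v1.Protocol type):

```go
package main

import "fmt"

// ServicePortName with the added Protocol field (simplified).
type ServicePortName struct {
	Namespace string
	Name      string
	Port      string
	Protocol  string
}

func main() {
	staleServices := []ServicePortName{
		{Namespace: "ns1", Name: "svc1", Port: "p80", Protocol: "TCP"},
		{Namespace: "ns1", Name: "svc2", Port: "p53", Protocol: "UDP"},
	}
	for _, spn := range staleServices {
		// Stale conntrack entries only matter for connectionless protocols,
		// so the expensive stale-connection scan is skipped for TCP.
		if spn.Protocol != "UDP" {
			continue
		}
		fmt.Println("check stale conntrack entries for", spn.Namespace+"/"+spn.Name)
	}
}
```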
This should fix a bug that could break masters when the EndpointSlice
feature gate was enabled. This was all tied to how the apiserver creates
and manages its own services and endpoints (or in this case endpoint
slices). Consumers of endpoint slices also need to know about the
corresponding service. Previously we were trying to set an owner
reference here for this purpose, but that came with potential downsides
and increased complexity. This commit changes behavior of the apiserver
endpointslice integration to set the service name label instead of owner
references, and simplifies consumer logic to reference that (both are
set by the EndpointSlice controller).
Additionally, this should fix a bug with the EndpointSlice GenerateName
value that had previously been set with a "." as a suffix.
Work around Linux kernel bug that sometimes causes multiple flows to
get mapped to the same IP:PORT and consequently some suffer packet
drops.
Also made the same update in kubelet.
Also added cross-pointers between the two bodies of code, in comments.
Some day we should eliminate the duplicate code. But today is not
that day.
Currently the BaseServiceInfo struct implements the ServicePort interface, but
that interface is only used some of the time. All the fields of BaseServiceInfo
are exported, and sometimes the interface is used to access them and other times not.
I extended the ServicePort interface so that all relevant values can be accessed through
it, and unexported all the fields of BaseServiceInfo.
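A minimal sketch of the shape of the change (method set abbreviated, names simplified; the real accessors return richer types such as net.IP):

```go
package main

import "fmt"

// ServicePort is extended so every value callers need is reachable through
// the interface (only a couple of methods shown here).
type ServicePort interface {
	ClusterIP() string
	Port() int
}

// baseServiceInfo's fields are unexported, so callers must go through the
// ServicePort interface instead of poking at struct fields.
type baseServiceInfo struct {
	clusterIP string
	port      int
}

func (b *baseServiceInfo) ClusterIP() string { return b.clusterIP }
func (b *baseServiceInfo) Port() int         { return b.port }

func main() {
	var sp ServicePort = &baseServiceInfo{clusterIP: "172.30.0.41", port: 80}
	fmt.Println(sp.ClusterIP(), sp.Port())
}
```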
- Move from the old github.com/golang/glog to k8s.io/klog
- klog has an explicit InitFlags(), so we add the flags as necessary
- we update the other repositories that we vendor that made a similar
change from glog to klog
* github.com/kubernetes/repo-infra
* k8s.io/gengo/
* k8s.io/kube-openapi/
* github.com/google/cadvisor
- Entirely remove all references to glog
- Fix some tests by explicit InitFlags in their init() methods
Change-Id: I92db545ff36fcec83afe98f550c9e630098b3135
The requested Service Protocol is checked against the supported protocols of GCE Internal LB. The supported protocols are TCP and UDP.
SCTP is not supported by OpenStack LBaaS. If SCTP is requested in a Service with type=LoadBalancer, the request is rejected. Comment style is also corrected.
SCTP is not allowed for LoadBalancer Service and for HostPort. Kube-proxy can be configured not to start listening on the host port for SCTP: see the new SCTPUserSpaceNode parameter
Changed the vendored github.com/nokia/sctp to github.com/ishidawataru/sctp; i.e. from now on we use the upstream version.
Fixed netexec.go compilation. Fixed various test cases.
SCTP-related conformance tests removed. Netexec's pod definition and Dockerfile are updated to expose the new SCTP port (8082).
SCTP-related e2e test cases are removed, as the e2e test systems do not support SCTP.
SCTP-related firewall config is removed from cluster/gce/util.sh. The variable name sctp_addr is corrected to sctpAddr in pkg/proxy/ipvs/proxier.go.
cluster/gce/util.sh is copied from master
Automatic merge from submit-queue.
Split KUBE-SERVICES chain to re-shrink the INPUT chain
**What this PR does / why we need it**:
#43972 added an iptables rule "`-A INPUT -j KUBE-SERVICES`" to make NodePort ICMP rejection work. (Previously the KUBE-SERVICES chain was only run from OUTPUT, not INPUT.) #44547 extended that patch for ExternalIP rejection as well.
However, the KUBE-SERVICES chain may potentially have a very large number of ICMP reject rules for plain ClusterIP services (the ones that get run from OUTPUT), and it seems that for some reason the kernel is much more sensitive to the length of the INPUT chain than it is to the length of the OUTPUT chain. So a node that worked fine with kube 1.6 (when KUBE-SERVICES was only run from OUTPUT) might fall over with kube 1.7 (with KUBE-SERVICES being run from both INPUT and OUTPUT).
(Specifically, a node with about 5000 ClusterIP reject rules that ran fine with OpenShift 3.6 [kube 1.6] slowed almost to a complete halt with OpenShift 3.7 [kube 1.7].)
This PR fixes things by splitting out the "new" part of KUBE-SERVICES (NodePort and ExternalIP reject rules) into a separate KUBE-EXTERNAL-SERVICES chain run from INPUT, and moves KUBE-SERVICES back to being only run from OUTPUT. (So, yes, this assumes that you don't have 5000 NodePort/ExternalIP services, but, if you do, there's not much we can do, since those rules *have* to be run on the INPUT side.)
Oh, and I left in the code to clean up the "`-A INPUT -j KUBE-SERVICES`" rule even though we don't generate it any more, so it gets fixed on upgrade.
**Release note**:
```release-note
Reorganized iptables rules to fix a performance regression on clusters with thousands of services.
```
@kubernetes/sig-network-bugs @kubernetes/rh-networking