The original tests here were very shy about looking at the iptables
output, and just relied on checks like "make sure there's a jump to
table X that also includes string Y somewhere in it" and stuff like
that. Whereas the newer tests were just like, "eh, here's a wall of
text, make sure the iptables output is exactly that". Although the
latter looks messier in the code, it's more precise, and it's easier
to update correctly when you change the rules. So just make all of the
tests do a check on the full iptables output.
(Note that I didn't double-check any of the output; I'm just assuming
that the output of the current iptables proxy code is actually
correct...)
Also, don't hardcode the expected number of rules in the metrics
tests, so that there's one less thing to adjust when rules change.
Also, use t.Run() in one place to get more precise errors on failure.
The test was sorting the iptables output so as to not depend on the
order that services get processed in, but this meant it wasn't
checking the relative ordering of rules (and in fact, the ordering of
the rules in the "expected" string was wrong, in a way that would
break things if the rules had actually been generated in that order).
Add a more complicated sorting function that sorts services
alphabetically while preserving the ordering of rules within each
service.
Because the proxy.Provider interface included
proxyconfig.EndpointsHandler, all the backends needed to
implement its methods. But iptables, ipvs, and winkernel implemented
them as no-ops, and metaproxier had an implementation that wouldn't
actually work (because it couldn't handle Services with no active
Endpoints).
Since Endpoints processing in kube-proxy is deprecated (and can't be
re-enabled unless you're using a backend that doesn't support
EndpointSlice), remove proxyconfig.EndpointsHandler from the
definition of proxy.Provider and drop all the useless implementations.
The nat KUBE-SERVICES chain is called from OUTPUT and PREROUTING stages. In
clusters with large number of services, the nat-KUBE-SERVICES chain is the largest
chain with for eg: 33k rules. This patch aims to move the KubeMarkMasq rules from
the kubeServicesChain into the respective KUBE-SVC-* chains. This way during each
packet-rule matching we won't have to traverse the MASQ rules of all services which
get accumulated in the KUBE-SERVICES and/or KUBE-NODEPORTS chains. Since the
jump to KUBE-MARK-MASQ ultimately sets the 0x400 mark for nodeIP SNAT, it should not
matter whether the jump is made from KUBE-SERVICES or KUBE-SVC-* chains.
Specifically we change:
1) For ClusterIP svc, we move the KUBE-MARK-MASQ jump rule from KUBE-SERVICES
chain into KUBE-SVC-* chain.
2) For ExternalIP svc, we move the KUBE-MARK-MASQ jump rule in the case of
non-ServiceExternalTrafficPolicyTypeLocal from KUBE-SERVICES
chain into KUBE-SVC-* chain.
3) For NodePorts svc, we move the KUBE-MARK-MASQ jump rule in case of
non-ServiceExternalTrafficPolicyTypeLocal from KUBE-NODEPORTS chain to
KUBE-SVC-* chain.
4) For load-balancer svc, we don't change anything since it is already svc specific
due to creation of KUBE-FW-* chains per svc.
This would cut the rules per svc in KUBE-SERVICES and KUBE-NODEPORTS in half.
1. Do not describe port type in message because lp.String() already has the
information.
2. Remove duplicate error detail from event log.
Previous log is like this.
47s Warning listen tcp4 :30764: socket: too many open files node/127.0.0.1 can't open port "nodePort for default/temp-svc:834" (:30764/tcp4), skipping it: listen tcp4 :30764: socket: too many open files
[issue]
When creating a NodePort service with the kubectl create command, the NodePort
assignment may fail.
Failure to assign a NodePort can be simulated with the following malicious
command[1].
$ kubectl create service nodeport temp-svc --tcp=`python3 <<EOF
print("1", end="")
for i in range(2, 1026):
print("," + str(i), end="")
EOF
`
The command succeeds and shows following output.
service/temp-svc created
The service has been successfully generated and can also be referenced with the
get command.
$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
temp-svc NodePort 10.0.0.139 <none> 1:31335/TCP,2:32367/TCP,3:30263/TCP,(omitted),1023:31821/TCP,1024:32475/TCP,1025:30311/TCP 12s
The user does not recognize failure to assign a NodePort because
create/get/describe command does not show any error. This is the issue.
[solution]
Users can notice errors by looking at the kube-proxy logs, but it may be difficult to see the kube-proxy logs of all nodes.
E0327 08:50:10.216571 660960 proxier.go:1286] "can't open port, skipping this nodePort" err="listen tcp4 :30641: socket: too many open files" port="\"nodePort for default/temp-svc:744\" (:30641/tcp4)"
E0327 08:50:10.216611 660960 proxier.go:1286] "can't open port, skipping this nodePort" err="listen tcp4 :30827: socket: too many open files" port="\"nodePort for default/temp-svc:857\" (:30827/tcp4)"
...
E0327 08:50:10.217119 660960 proxier.go:1286] "can't open port, skipping this nodePort" err="listen tcp4 :32484: socket: too many open files" port="\"nodePort for default/temp-svc:805\" (:32484/tcp4)"
E0327 08:50:10.217293 660960 proxier.go:1612] "Failed to execute iptables-restore" err="pipe2: too many open files ()"
I0327 08:50:10.217341 660960 proxier.go:1615] "Closing local ports after iptables-restore failure"
So, this patch will fire an event when NodePort assignment fails.
In fact, when the externalIP assignment fails, it is also notified by event.
The event will be displayed like this.
$ kubectl get event
LAST SEEN TYPE REASON OBJECT MESSAGE
...
2s Warning listen tcp4 :31055: socket: too many open files node/127.0.0.1 can't open "nodePort for default/temp-svc:901" (:31055/tcp4), skipping this nodePort: listen tcp4 :31055: socket: too many open files
2s Warning listen tcp4 :31422: socket: too many open files node/127.0.0.1 can't open "nodePort for default/temp-svc:474" (:31422/tcp4), skipping this nodePort: listen tcp4 :31422: socket: too many open files
...
This PR fixes iptables and ipvs proxier.
Since userspace proxier does not seem to be affected by this issue, it is not fixed.
[1] Assume that fd limit is 1024(default).
$ ulimit -n
1024
1. Add API definitions;
2. Add feature gate and drops the field when feature gate is not on;
3. Set default values for the field;
4. Add API Validation
5. add kube-proxy iptables and ipvs implementations
6. add tests
Clear conntrack entries for UDP NodePorts,
this has to be done AFTER the iptables rules are programmed.
It can happen that traffic to the NodePort hits the host before
the iptables rules are programmed this will create an stale entry
in conntrack that will blackhole the traffic, so we need to
clear it ONLY when the service has endpoints.
1. For iptables mode, add KUBE-NODEPORTS chain in filter table. Add
rules to allow healthcheck node port traffic.
2. For ipvs mode, add KUBE-NODE-PORT chain in filter table. Add
KUBE-HEALTH-CHECK-NODE-PORT ipset to allow traffic to healthcheck
node port.