Just check that the actual IP:port of the filtered endpoints is
correct; using DeepEqual requires us to copy all the extra endpoint
fields (eg, ZoneHints, IsLocal) from endpoints to expectedEndpoints,
which just makes the test cases unnecessarily bigger.
The "node local endpoints, hints are ignored" test was not actually
enabling topology correctly, so it would have gotten the expected
result even if the code was wrong. (Which, FTR, it wasn't.)
When nodePortAddresses is not specified for kube-proxy, it tried to open
the node port for a NodePort service twice, triggered by IPv4ZeroCIDR
and IPv6ZeroCIDR separately. The first attempt would succeed and the
second one would always generate an error log like below:
"listen tcp4 :30522: bind: address already in use"
This patch fixes it by ensuring nodeAddresses of a proxier only contain
the addresses for its IP family.
The same code appeared twice, once for the SVC chain and once for the
XLB chain, with the only difference being that the XLB version had
more verbose comments.
Also, in the NodePort code, fix it to properly take advantage of the
fact that GetNodeAddresses() guarantees that if it returns a
"match-all" CIDR, then it doesn't return anything else. That also
makes it unnecessary to loop over the node addresses twice.
If you pass just an IP address to "-s" or "-d", the iptables command
will fill in the correct mask automatically.
Originally, the proxier was just hardcoding "/32" for all of these,
which was unnecessary but simple. But when IPv6 support was added, the
code was made more complicated to deal with the fact that the "/32"
needed to be "/128" in the IPv6 case, so it would parse the IPs to
figure out which family they were, which in turn involved adding some
checks in case the parsing fails (even though that "can't happen" and
the old code didn't check for invalid IPs, even though that would
break the iptables-restore if there had been any).
Anyway, all of that is unnecessary because we can just pass the IP
strings to iptables directly rather than parsing and unparsing them
first.
(The diff to proxier_test.go is just deleting "/32" everywhere.)
If GetNodeAddresses() fails (eg, because you passed the wrong CIDR to
`--nodeport-addresses`), then any NodePort services would end up with
only half a set of iptables rules. Fix it to just not output the
NodePort-specific parts in that case (in addition to logging an error
about the GetNodeAddresses() failure).
The iptables and ipvs proxiers both had a check that none of the
elements of svcInfo.LoadBalancerIPStrings() were "", but that was
already guaranteed by the svcInfo code. Drop the unnecessary checks
and remove a level of indentation.
* Fix regression in kube-proxy
Don't use a prepend() - that allocates. Instead, make Write() take
either strings or slices (I wish we could express that better).
* WIP: switch to intf
* WIP: less appends
* tests and ipvs
The logic to detect stale endpoints was not assuming the endpoint
readiness.
We can have stale entries on UDP services for 2 reasons:
- an endpoint was receiving traffic and is removed or replaced
- a service was receiving traffic but not forwarding it, and starts
to forward it.
Add an e2e test to cover the regression
Filter the allEndpoints list into readyEndpoints sooner, and set
"hasEndpoints" based (mostly) on readyEndpoints, not allEndpoints (so
that, eg, we correctly generate REJECT rules for services with no
_functioning_ endpoints, even if they have unusable terminating
endpoints).
Also, write out the endpoint chains at the top of the loop when we
iterate the endpoints for the first time, rather than copying some of
the data to another set of variables and then writing them out later.
And don't write out endpoint chains that won't be used
Also, generate affinity rules only for readyEndpoints rather than
allEndpoints, so affinity gets broken correctly when an endpoint
becomes unready.
The external traffic policy terminating endpoints test was testing
LoadBalancer functionality against a NodePort service with no
nodePorts (or loadBalancer IPs). It managed to test what it wanted to
test, but it's kind of dubious (and we probably _shouldn't_ have been
generating the rules it was looking for since there was no way to
actually reach the XLB chains). So fix that.
Also make the terminating endpoints test use session affinity, to add
more testing for that. Also, remove the multiple copies of the same
identical Service that is used for all of the test cases in that test.
Also add a "Cluster traffic policy and no source ranges" test to
TestOverallIPTablesRulesWithMultipleServices since we weren't really
testing either of those.
Also add a test of --masquerade-all.
The test got broken to not actually use "no cluster CIDR" when
LocalDetector was implemented (and the old version of the unit test
didn't check enough to actually notice this).
The original tests here were very shy about looking at the iptables
output, and just relied on checks like "make sure there's a jump to
table X that also includes string Y somewhere in it" and stuff like
that. Whereas the newer tests were just like, "eh, here's a wall of
text, make sure the iptables output is exactly that". Although the
latter looks messier in the code, it's more precise, and it's easier
to update correctly when you change the rules. So just make all of the
tests do a check on the full iptables output.
(Note that I didn't double-check any of the output; I'm just assuming
that the output of the current iptables proxy code is actually
correct...)
Also, don't hardcode the expected number of rules in the metrics
tests, so that there's one less thing to adjust when rules change.
Also, use t.Run() in one place to get more precise errors on failure.
The test was sorting the iptables output so as to not depend on the
order that services get processed in, but this meant it wasn't
checking the relative ordering of rules (and in fact, the ordering of
the rules in the "expected" string was wrong, in a way that would
break things if the rules had actually been generated in that order).
Add a more complicated sorting function that sorts services
alphabetically while preserving the ordering of rules within each
service.
proxy/winkernel/proxier.go was using format specifier with
structured logging pattern which is wrong. This commit removes
use of format specifiers to align with the pattern.
Signed-off-by: Umanga Chapagain <chapagainumanga@gmail.com>
Due to an incorrect version range definition in hcsshim for dualstack
support, the Windows kubeproxy had to define it's own version range logic
to check if dualstack was supported on the host. This was remedied in hcsshim
(https://github.com/microsoft/hcsshim/pull/1003) and this work has been vendored into
K8s as well (https://github.com/kubernetes/kubernetes/pull/104880). This
change simply makes use of the now correct version range to check if dualstack
is supported, and gets rid of the old custom logic.
Signed-off-by: Daniel Canter <dcanter@microsoft.com>
* Migrate to Structured Logs in `pkg/proxy/util`
* Minor fixes
* change key to cidr and remove namespace arg
* Update key from cidr to CIDR
Co-authored-by: JUN YANG <69306452+yangjunmyfm192085@users.noreply.github.com>
* Update key cidr to CIDR
Co-authored-by: JUN YANG <69306452+yangjunmyfm192085@users.noreply.github.com>
* Update key ip to IP
Co-authored-by: JUN YANG <69306452+yangjunmyfm192085@users.noreply.github.com>
* Update key ip to IP
Co-authored-by: JUN YANG <69306452+yangjunmyfm192085@users.noreply.github.com>
* Interchange svcNamespace and svcName
* Change first letter of all messages to capital
* Change key names in endpoints.go
* Change all keynames to lower bumby caps convention
Co-authored-by: JUN YANG <69306452+yangjunmyfm192085@users.noreply.github.com>
Because the proxy.Provider interface included
proxyconfig.EndpointsHandler, all the backends needed to
implement its methods. But iptables, ipvs, and winkernel implemented
them as no-ops, and metaproxier had an implementation that wouldn't
actually work (because it couldn't handle Services with no active
Endpoints).
Since Endpoints processing in kube-proxy is deprecated (and can't be
re-enabled unless you're using a backend that doesn't support
EndpointSlice), remove proxyconfig.EndpointsHandler from the
definition of proxy.Provider and drop all the useless implementations.
The iptables rule that matches kubeNodePortLocalSetSCTP must be inserted
before the one matches kubeNodePortSetSCTP, otherwise all SCTP traffic
would be masqueraded regardless of whether its ExternalTrafficPolicy is
Local or not.
To cover the case in tests, the patch adds rule order validation to
checkIptables.
* pkg/features: promote the ServiceInternalTrafficPolicy field to Beta and on by default
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
* pkg/api/service/testing: update Service test fixture functions to set internalTrafficPolicy=Cluster by default
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
* pkg/apis/core/validation: add more Service validation tests for internalTrafficPolicy
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
* pkg/registry/core/service/storage: fix failing Service REST storage tests to use internalTrafficPolicy: Cluster
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
* pkg/registry/core/service/storage: add two test cases for Service REST TestServiceRegistryInternalTrafficPolicyClusterThenLocal and TestServiceRegistryInternalTrafficPolicyLocalThenCluster
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
* pkg/registry/core/service: update strategy unit tests to expect default
internalTrafficPolicy=Cluster
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
* pkg/proxy/ipvs: fix unit test Test_EndpointSliceReadyAndTerminatingLocal to use internalTrafficPolicy=Cluster
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
* pkg/apis/core: update fuzzers to set Service internalTrafficPolicy field
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
* pkg/api/service/testing: refactor Service test fixtures to use Tweak funcs
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
The nat KUBE-SERVICES chain is called from OUTPUT and PREROUTING stages. In
clusters with large number of services, the nat-KUBE-SERVICES chain is the largest
chain with for eg: 33k rules. This patch aims to move the KubeMarkMasq rules from
the kubeServicesChain into the respective KUBE-SVC-* chains. This way during each
packet-rule matching we won't have to traverse the MASQ rules of all services which
get accumulated in the KUBE-SERVICES and/or KUBE-NODEPORTS chains. Since the
jump to KUBE-MARK-MASQ ultimately sets the 0x400 mark for nodeIP SNAT, it should not
matter whether the jump is made from KUBE-SERVICES or KUBE-SVC-* chains.
Specifically we change:
1) For ClusterIP svc, we move the KUBE-MARK-MASQ jump rule from KUBE-SERVICES
chain into KUBE-SVC-* chain.
2) For ExternalIP svc, we move the KUBE-MARK-MASQ jump rule in the case of
non-ServiceExternalTrafficPolicyTypeLocal from KUBE-SERVICES
chain into KUBE-SVC-* chain.
3) For NodePorts svc, we move the KUBE-MARK-MASQ jump rule in case of
non-ServiceExternalTrafficPolicyTypeLocal from KUBE-NODEPORTS chain to
KUBE-SVC-* chain.
4) For load-balancer svc, we don't change anything since it is already svc specific
due to creation of KUBE-FW-* chains per svc.
This would cut the rules per svc in KUBE-SERVICES and KUBE-NODEPORTS in half.
kube-proxy expose the metric network_programming_duration_seconds,
that is defined as the time it takes to program the network since
a a service or pod has changed. It uses an annotation on the endpoints
/endpointslices to calculate when the endpoint was created, however,
on restarts, kube-proxy process all the endpoints again, no matter
when those were generated, polluting the metrics.
To be safe, kube-proxy will estimate the latency only for those
endpoints that were generated after it started.
1. Do not describe port type in message because lp.String() already has the
information.
2. Remove duplicate error detail from event log.
Previous log is like this.
47s Warning listen tcp4 :30764: socket: too many open files node/127.0.0.1 can't open port "nodePort for default/temp-svc:834" (:30764/tcp4), skipping it: listen tcp4 :30764: socket: too many open files
[issue]
When creating a NodePort service with the kubectl create command, the NodePort
assignment may fail.
Failure to assign a NodePort can be simulated with the following malicious
command[1].
$ kubectl create service nodeport temp-svc --tcp=`python3 <<EOF
print("1", end="")
for i in range(2, 1026):
print("," + str(i), end="")
EOF
`
The command succeeds and shows following output.
service/temp-svc created
The service has been successfully generated and can also be referenced with the
get command.
$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
temp-svc NodePort 10.0.0.139 <none> 1:31335/TCP,2:32367/TCP,3:30263/TCP,(omitted),1023:31821/TCP,1024:32475/TCP,1025:30311/TCP 12s
The user does not recognize failure to assign a NodePort because
create/get/describe command does not show any error. This is the issue.
[solution]
Users can notice errors by looking at the kube-proxy logs, but it may be difficult to see the kube-proxy logs of all nodes.
E0327 08:50:10.216571 660960 proxier.go:1286] "can't open port, skipping this nodePort" err="listen tcp4 :30641: socket: too many open files" port="\"nodePort for default/temp-svc:744\" (:30641/tcp4)"
E0327 08:50:10.216611 660960 proxier.go:1286] "can't open port, skipping this nodePort" err="listen tcp4 :30827: socket: too many open files" port="\"nodePort for default/temp-svc:857\" (:30827/tcp4)"
...
E0327 08:50:10.217119 660960 proxier.go:1286] "can't open port, skipping this nodePort" err="listen tcp4 :32484: socket: too many open files" port="\"nodePort for default/temp-svc:805\" (:32484/tcp4)"
E0327 08:50:10.217293 660960 proxier.go:1612] "Failed to execute iptables-restore" err="pipe2: too many open files ()"
I0327 08:50:10.217341 660960 proxier.go:1615] "Closing local ports after iptables-restore failure"
So, this patch will fire an event when NodePort assignment fails.
In fact, when the externalIP assignment fails, it is also notified by event.
The event will be displayed like this.
$ kubectl get event
LAST SEEN TYPE REASON OBJECT MESSAGE
...
2s Warning listen tcp4 :31055: socket: too many open files node/127.0.0.1 can't open "nodePort for default/temp-svc:901" (:31055/tcp4), skipping this nodePort: listen tcp4 :31055: socket: too many open files
2s Warning listen tcp4 :31422: socket: too many open files node/127.0.0.1 can't open "nodePort for default/temp-svc:474" (:31422/tcp4), skipping this nodePort: listen tcp4 :31422: socket: too many open files
...
This PR fixes iptables and ipvs proxier.
Since userspace proxier does not seem to be affected by this issue, it is not fixed.
[1] Assume that fd limit is 1024(default).
$ ulimit -n
1024
1. Add API definitions;
2. Add feature gate and drops the field when feature gate is not on;
3. Set default values for the field;
4. Add API Validation
5. add kube-proxy iptables and ipvs implementations
6. add tests
Clear conntrack entries for UDP NodePorts,
this has to be done AFTER the iptables rules are programmed.
It can happen that traffic to the NodePort hits the host before
the iptables rules are programmed this will create an stale entry
in conntrack that will blackhole the traffic, so we need to
clear it ONLY when the service has endpoints.
1. For iptables mode, add KUBE-NODEPORTS chain in filter table. Add
rules to allow healthcheck node port traffic.
2. For ipvs mode, add KUBE-NODE-PORT chain in filter table. Add
KUBE-HEALTH-CHECK-NODE-PORT ipset to allow traffic to healthcheck
node port.
When running in ipvs mode, kube-proxy generated wrong iptables-restore
input because the chain names are hardcoded.
It also fixed a typo in method name.
Currently kube-proxy treat ExternalIPs differently depending on:
- the traffic origin
- if the ExternalIP is present or not in the system.
It also depends on the CNI implementation to
discriminate between local and non-local traffic.
Since the ExternalIP belongs to a Service, we can avoid the roundtrip
of sending outside the traffic originated in the cluster.
Also, we leverage the new LocalTrafficDetector to detect the local
traffic and not rely on the CNI implementations for this.
- Remove feature gate consideration from EndpointSlice validation
- Deprecate topology field, note that it will be removed in future
release
- Update kube-proxy to check for NodeName if feature gate is enabled
- Add comments indicating the feature gates that can be used to enable
alpha API fields
- Add comments explaining use of deprecated address type in tests
The tests for most functions have also been revised to check the errors
explicitly upon validating. This will properly catch occasions
where we should be returning multiple errors if more error occurs or
if just one block is failing.
Signed-off-by: Christopher M. Luciano <cmluciano@us.ibm.com>