Conntrack invalid packets may cause unexpected and subtle bugs
on esblished connections, because of that we install by default an
iptables rules that drops the packets with this conntrack state.
However, there are network scenarios, specially those that use multihoming
nodes, that may have legit traffic that is detected by conntrack as
invalid, hence these iptables rules are causing problems dropping this
traffic.
An alternative to solve the spurious problems caused by the invalid
connectrack packets is to set the sysctl nf_conntrack_tcp_be_liberal
option, but this is a system wide setting and we don't want kube-proxy
to be opinionated about the whole node networking configuration.
Kube-proxy will only install the DROP rules for invalid conntrack states
if the nf_conntrack_tcp_be_liberal is not set.
Change-Id: I5eb326931ed915f5ae74d210f0a375842b6a790e
Instead of using two metrics use just one metrics with multiple labels,
since the labels can only get 2 values, 200 or 503 there is no risk of
carindality explosion and are simple to represent in graphs.
Change-Id: I0e9cbd6ec2051de44d277d673dc20f02b96aa4d1
When I first wrote TestInternalExternalMasquerade, I put "FIXME"
comments in all of the cases that seemed wrong to me, most of which
got removed as we fixed the corner cases. But there were two cases
where we decided that the implemented behavior, though odd, was
correct, and those FIXMEs never got removed.
All the code to deal with enabling/disabling the feature gate is gone,
but some of the tests were still specifying "this test case assumes
PTE is enabled".
Remove "EndpointSlice" from some unit test names, because they don't
need to clarify that they use EndpointSlices now, because all of the
tests use EndpointSlices now.
Likewise, remove TestEndpointSliceE2E entirely; it was originally an
EndpointSlice version of one of the other tests, but the other test
uses EndpointSlices now too.
TL;DR: we want to start failing the LB HC if a node is tainted with ToBeDeletedByClusterAutoscaler.
This field might need refinement, but currently is deemed our best way of understanding if
a node is about to get deleted. We want to do this only for eTP:Cluster services.
The goal is to connection draining terminating nodes
Historically, IptablesRulesTotal could have been intepreted as either
"the total number of iptables rules kube-proxy is responsible for" or
"the number of iptables rules kube-proxy rewrote on the last sync".
Post-MinimizeIPTablesRestore, these are very different things (and
IptablesRulesTotal unintentionally became the latter).
Fix IptablesRulesTotal (sync_proxy_rules_iptables_total) to be "the
total number of iptables rules kube-proxy is responsible for" and add
IptablesRulesLastSync (sync_proxy_rules_iptables_last) to be "the
number of iptables rules kube-proxy rewrote on the last sync".
This required fixing a small bug in the metric, where it had
previously been counting the "-X" lines that had been passed to
iptables-restore to delete stale chains, rather than only counting the
actual rules.
getLocalDetector() used to pass a utiliptables.Interface to
NewDetectLocalByCIDR() so that NewDetectLocalByCIDR() could verify
that the passed-in CIDR was of the same family as the iptables
interface. It would make more sense for getLocalDetector() to verify
this itself and just *not call NewDetectLocalByCIDR* if the families
don't match, and that's what the code does now. So there's no longer
any need to pass the utiliptables.Interface to the local detector.
These don't belong in pkg/proxy/util; they involve a completely
unrelated definition of proxying.
Since each is only used from one place, just inline them at the
callers.
- add new header "X-Load-Balancing-Endpoint-Weight" returned from service health. Value of the header is number of local endpoints. Header can be used in weighted load balancing. Parsing header for number of endpoints is faster than unmarshalling json from the content body.
- add missing unit test for new and old headers returned from service health
Since kube-proxy in LocalModeNodeCIDR needs to obtain the PodCIDR
assigned to the node it watches for the Node object.
However, kube-proxy startup process requires to have these watches in
different places, that opens the possibility of having a race condition
if the same node is recreated and a different PodCIDR is assigned.
Initializing the second watch with the value obtained in the first one
allows us to detect this situation.
Change-Id: I6adeedb6914ad2afd3e0694dcab619c2a66135f8
Signed-off-by: Antonio Ojea <aojea@google.com>
Rather than having GetNodeAddresses() return a special magic value
indicating that it matches all IPs, add a separate method to check
that. (And have GetNodeAddresses() just return the IPs as expected
instead.)
Both proxies handle IPv4 and IPv6 nodeport addresses separately, but
GetNodeAddresses went out of its way to make that difficult. Fix that.
This commit does not change any externally-visible semantics, but it
makes the existing weird semantics more obvious. Specifically, if you
say "--nodeport-addresses 10.0.0.0/8,192.168.0.0/16", then the
dual-stack proxy code would have split that into a list of IPv4 CIDRs
(["10.0.0.0/8", "192.168.0.0/16"]) to pass to the IPv4 proxier, and a
list of IPv6 CIDRs ([]) to pass to the IPv6 proxier, and then the IPv6
proxier would say "well since the list of nodeport addresses is empty,
I'll listen on all IPv6 addresses", which probably isn't what you
meant, but that's what it did.
Rather than duplicating some of the KubeProxyConfiguration into
ProxyServer, just store the KubeProxyConfiguration itself so later
code can reference it directly.
For the fields that get platform-specific defaults (Mode,
DetectLocalMode), fill the defaults directly into the
KubeProxyConfiguration rather than keeping the original there and the
defaulted version in the ProxyServer.
Validate the --detect-local-mode value in the API object validation
rather than doing it separately later. Also, remove runtime checks and
unit tests for cases that would be blocked by validation
This was making my eyes bleed as I read over code.
I used the following in vim. I made them up on the fly, but they seemed
to pass manual inspection.
:g/},\n\s*{$/s//}, {/
:w
:g/{$\n\s*{$/s//{{/
:w
:g/^\(\s*\)},\n\1},$/s//}},/
:w
:g/^\(\s*\)},$\n\1}$/s//}}/
:w
This touches cases where FromInt() is used on numeric constants, or
values which are already int32s, or int variables which are defined
close by and can be changed to int32s with little impact.
Signed-off-by: Stephen Kitt <skitt@redhat.com>
Rather than duplicating some of the KubeProxyConfiguration into
ProxyServer, just store the KubeProxyConfiguration itself so later
code can reference it directly.
For the fields that get platform-specific defaults (Mode,
DetectLocalMode), fill the defaults directly into the
KubeProxyConfiguration rather than keeping the original there and the
defaulted version in the ProxyServer.
Validate the --detect-local-mode value in the API object validation
rather than doing it separately later. Also, remove runtime checks and
unit tests for cases that would be blocked by validation
The winkernel proxy was originally created by copying+pasting from the
iptables code, but some iptables-specific things were never removed
(and one function got left behind after its functionality was moved
into the shared proxy code).
Now that the endpoint update fields have names that make it clear that
they only contain UDP objects, it's obvious that the "protocol == UDP"
checks in the iptables and ipvs proxiers were no-ops, so remove them.
Rather than calling fp.deleteEndpointConnection() directly, set up the
proxy to have syncProxyRules() call it, so that we are testing it in
the way that it actually gets called.
Squash the IPv4 and IPv6 unit tests together so we don't need to
duplicate all that code. Fix a tiny bug in NewFakeProxier() found
while doing this...
The APIs talked about "stale services" and "stale endpoints", but the
thing that is actually "stale" is the conntrack entries, not the
services/endpoints. Fix the names to indicate what they actual keep
track of.
Also, all three fields (2 in the endpoints update object and 1 in the
service update object) are currently UDP-specific, but only the
service one made that clear. Fix that too.
This commit did not actually work; in between when it was first
written and tested, and when it merged, the code in
pkg/proxy/endpoints.go was changed to only add UDP endpoints to the
"stale endpoints"/"stale services" lists, and so checking for "either
UDP or SCTP" rather than just UDP when processing those lists had no
effect.
This reverts most of commit aa8521df66
(but leaves the changes related to
ipvs.IsRsGracefulTerminationNeeded() since that actually did have the
effect it meant to have).
Annotation
As part of this change, kube-proxy accepts any value for either
annotation that is not "disabled".
Change-Id: Idfc26eb4cc97ff062649dc52ed29823a64fc59a4
Today, the health check response to the load balancers asking Kube-proxy for
the status of ETP:Local services does not include the healthz state of Kube-
proxy. This means that Kube-proxy might indicate to load balancers that they
should forward traffic to the node in question, simply because the endpoint
is running on the node - this overlooks the fact that Kube-proxy might be
not-healthy and hasn't successfully written the rules enabling traffic to
reach the endpoint.
For some reason we were calculating the available nodeport IPs at the
top of syncProxyRules even though we didn't use them until the end.
(Well, the previous code avoided generating KUBE-NODEPORTS chain rules
if there were no node IPs available, but that case is considered an
error anyway, so there's no need to optimize it.)
(Also fix a stale `err` reference exposed by this move.)
With the flag, ipvs uses both source IP and source port (instead of
only source IP) to distribute new connections evently to endpoints
that avoids sending all connections from the same client (i.e. same
source IP) to one single endpoint.
User can explicitly set sessionAffinity in service spec to keep all
connections from a source IP to end up on the same endpoint if needed.
Change-Id: I42f950c0840ac06a4ee68a7bbdeab0fc5505c71f
In addition to actually updating their data from the provided list of
changes, EndpointsMap.Update() and ServicePortMap.Update() return a
struct with some information about things that changed because of that
update (eg services with stale conntrack entries).
For some reason, they were also returning information about
HealthCheckNodePorts, but they were returning *static* information
based on the current (post-Update) state of the map, not information
about what had *changed* in the update. Since this doesn't match how
the other data in the struct is used (and since there's no reason to
have the data only be returned when you call Update() anyway) , split
it out.
ContainsIPv4Loopback() claimed that "::/0" contains IPv4 loopback IPs
(on the theory that listening on "::/0" will listen on "0.0.0.0/0" as
well and thus include IPv4 loopback). But its sole caller (the
iptables proxier) doesn't use listen() to accept connections, so this
theory was completely mistaken; if you passed, eg,
`--nodeport-addresses 192.168.0.0/0,::/0`, then it would not create
any rule that accepted nodeport connections on 127.0.0.1, but it would
nonetheless end up setting route_localnet=1 because
ContainsIPv4Loopback() claimed it needed to. Fix this.
Add names to the tests and use t.Run() (rather than having them just
be numbered, with number 9 mistakenly being used twice thus throwing
off all the later numbers...)
Remove unnecessary FakeNetwork element from the testCases struct since
it's always the same. Remove the expectedErr value since a non-nil
error is expected if and only if the returned set is nil, and there's
no reason to test the exact text of the error message.
Fix weird IPv6 subnet sizes.
Change the dual-stack tests to (a) actually have dual-stack interface
addrs, and (b) use a routable IPv6 address, not just localhost (given
that we never actually want to use IPv6 localhost for nodeports).
The unit tests were broken with MinimizeIPTablesRestore enabled
because syncProxyRules() assumed that needFullSync would be set on the
first (post-setInitialized()) run, but the unit tests didn't ensure
that.
(In fact, there was a race condition in the real Proxier case as well;
theoretically syncProxyRules() could be run by the
BoundedFrequencyRunner after OnServiceSynced() called setInitialized()
but before it called forceSyncProxyRules(), thus causing the first
real sync to try to do a partial sync and fail. This is now fixed as
well.)
This is an ugly-but-simple rewrite (particularly involving having to
rewrite "single Endpoints with multiple Subsets" as "multiple
EndpointSlices"). Can be cleaned up more later...
The slice code sorts the results slightly differently from the old
code in two cases, and it was simpler to just reorder the expectations
rather than fixing the comparison code. But other than that, the
expected results are exactly the same as before.