NFTables proxy will now drop traffic directed towards unallocated
ClusterIPs and reject traffic directed towards invalid ports of
Cluster IPs.
Signed-off-by: Daman Arora <aroradaman@gmail.com>
And use the fake interface in the unit tests, removing the dependency
on setting up FakeExec stuff when conntrack cleanup will be invoked.
Also, remove the isIPv6 argument to CleanStaleEntries, because it can
be inferred from the other args.
In the original version of "MinimizeIPTablesRestore", we skipped the
bottom half of the sync loop when we weren't re-syncing a service, so
certain things that couldn't be skipped had to be done in the top
half. But the code was later changed to always run through the whole
loop body (just not necessarily writing out rules in the bottom half),
so we can reorganize things now to put some related bits of code back
together.
(In particular, this also resolves the fact that we were accidentally
adding the endpoint chains to activeNATChains twice.)
Also change activeNATChains to be a proper generic Set type.
BaseEndpointInfo's fields, unlike BaseServicePortInfo's, were all
exported, which then required adding "Get" before some of the function
names in Endpoint so they wouldn't conflict.
Fix that, now that the iptables and ipvs unit tests don't need to be
able to construct BaseEndpointInfos by hand.
A new --init-only flag is added tha makes kube-proxy perform
configuration that requires privileged mode and exit. It is
intended to be executed in a privileged initContainer, while
the main container may run with a stricter securityContext
The use of "Endpoint" vs "Endpoints" in these type names is tricky
because it doesn't always make sense to use the same singular/plural
convention as the corresonding service-related types, since often the
service-related type is referring to a single service while the
endpoint-related type is referring to multiple endpoint IPs.
The "endpointsInfo" types in the iptables and winkernel proxiers are
now "endpointInfo" because they describe a single endpoint IP (and
wrap proxy.BaseEndpointInfo).
"UpdateEndpointMapResult" is now "UpdateEndpointsMapResult", because
it is the result of EndpointsMap.Update (and it's clearly correct for
EndpointsMap to have plural "Endpoints" because it's a map to an array
of proxy.Endpoint objects.)
"EndpointChangeTracker" is now "EndpointsChangeTracker" because it
tracks changes to the full set of endpoints for a particular service
(and the new name matches the existing "endpointsChange" type and
"Proxier.endpointsChanges" fields.)
Conntrack invalid packets may cause unexpected and subtle bugs
on esblished connections, because of that we install by default an
iptables rules that drops the packets with this conntrack state.
However, there are network scenarios, specially those that use multihoming
nodes, that may have legit traffic that is detected by conntrack as
invalid, hence these iptables rules are causing problems dropping this
traffic.
An alternative to solve the spurious problems caused by the invalid
connectrack packets is to set the sysctl nf_conntrack_tcp_be_liberal
option, but this is a system wide setting and we don't want kube-proxy
to be opinionated about the whole node networking configuration.
Kube-proxy will only install the DROP rules for invalid conntrack states
if the nf_conntrack_tcp_be_liberal is not set.
Change-Id: I5eb326931ed915f5ae74d210f0a375842b6a790e
TL;DR: we want to start failing the LB HC if a node is tainted with ToBeDeletedByClusterAutoscaler.
This field might need refinement, but currently is deemed our best way of understanding if
a node is about to get deleted. We want to do this only for eTP:Cluster services.
The goal is to connection draining terminating nodes
Historically, IptablesRulesTotal could have been intepreted as either
"the total number of iptables rules kube-proxy is responsible for" or
"the number of iptables rules kube-proxy rewrote on the last sync".
Post-MinimizeIPTablesRestore, these are very different things (and
IptablesRulesTotal unintentionally became the latter).
Fix IptablesRulesTotal (sync_proxy_rules_iptables_total) to be "the
total number of iptables rules kube-proxy is responsible for" and add
IptablesRulesLastSync (sync_proxy_rules_iptables_last) to be "the
number of iptables rules kube-proxy rewrote on the last sync".
This required fixing a small bug in the metric, where it had
previously been counting the "-X" lines that had been passed to
iptables-restore to delete stale chains, rather than only counting the
actual rules.
Rather than having GetNodeAddresses() return a special magic value
indicating that it matches all IPs, add a separate method to check
that. (And have GetNodeAddresses() just return the IPs as expected
instead.)
Both proxies handle IPv4 and IPv6 nodeport addresses separately, but
GetNodeAddresses went out of its way to make that difficult. Fix that.
This commit does not change any externally-visible semantics, but it
makes the existing weird semantics more obvious. Specifically, if you
say "--nodeport-addresses 10.0.0.0/8,192.168.0.0/16", then the
dual-stack proxy code would have split that into a list of IPv4 CIDRs
(["10.0.0.0/8", "192.168.0.0/16"]) to pass to the IPv4 proxier, and a
list of IPv6 CIDRs ([]) to pass to the IPv6 proxier, and then the IPv6
proxier would say "well since the list of nodeport addresses is empty,
I'll listen on all IPv6 addresses", which probably isn't what you
meant, but that's what it did.
Now that the endpoint update fields have names that make it clear that
they only contain UDP objects, it's obvious that the "protocol == UDP"
checks in the iptables and ipvs proxiers were no-ops, so remove them.
The APIs talked about "stale services" and "stale endpoints", but the
thing that is actually "stale" is the conntrack entries, not the
services/endpoints. Fix the names to indicate what they actual keep
track of.
Also, all three fields (2 in the endpoints update object and 1 in the
service update object) are currently UDP-specific, but only the
service one made that clear. Fix that too.
This commit did not actually work; in between when it was first
written and tested, and when it merged, the code in
pkg/proxy/endpoints.go was changed to only add UDP endpoints to the
"stale endpoints"/"stale services" lists, and so checking for "either
UDP or SCTP" rather than just UDP when processing those lists had no
effect.
This reverts most of commit aa8521df66
(but leaves the changes related to
ipvs.IsRsGracefulTerminationNeeded() since that actually did have the
effect it meant to have).