Previously, the firewall-check chain was linked to the input, forward, and
output hooks, but not to the prerouting hook. By the time LoadBalancer
traffic arrived at the input or forward hook, it had already been DNATed to
the endpoint IP and port, so the firewall-check chain didn't take effect and
traffic from outside LoadBalancerSourceRanges was not dropped.
This was not caught by the unit tests because they sorted the chains by
priority only, when the hook should also have been taken into account.
This commit links the firewall-check chain to the prerouting hook and unlinks
it from the input and forward hooks, so that the traffic is filtered before
DNAT. The priorities of the filter chains are updated from "DNATPriority-1"
to "DNATPriority-10" to leave room for third parties to insert their own
chains between them.
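
As a rough sketch of the new linkage (not the proxier's literal code: the
real proxier dispatches to firewall-check from its base chains, and the
table name below is a placeholder), registering a filter chain at the
prerouting hook just before DNAT with the knftables package looks roughly
like this:

    package main

    import (
        "context"

        "sigs.k8s.io/knftables"
    )

    func main() {
        // Placeholder table name; the real proxier manages its own table.
        nft, err := knftables.New(knftables.IPv4Family, "example-proxy")
        if err != nil {
            panic(err)
        }

        tx := nft.NewTransaction()
        tx.Add(&knftables.Table{})
        // Hook the firewall check into prerouting so it sees the original
        // LoadBalancer destination, at "dstnat - 10" so third parties can
        // insert their own chains between it and DNAT.
        tx.Add(&knftables.Chain{
            Name:     "firewall-check",
            Type:     knftables.PtrTo(knftables.FilterType),
            Hook:     knftables.PtrTo(knftables.PreroutingHook),
            Priority: knftables.PtrTo(knftables.DNATPriority + "-10"),
        })
        if err := nft.Run(context.Background(), tx); err != nil {
            panic(err)
        }
    }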
Signed-off-by: Quan Tian <qtian@vmware.com>
The nftables implementation used concatenation of ranges when creating the
"firewall-allow" set, but that feature is not available before kernel 5.6.
As a result, the nftables mode couldn't run on earlier kernels, even though
5.4 is still widely used.
An alternative to concatenation of ranges is to create a separate firewall
chain for every service port that needs firewalling, and to jump to the
service's firewall chain from the common firewall chain via a verdict map
(vmap) rule.
Renaming from "firewall" to "firewall-ips" is required when changing the
set to the map to support existing clusters to upgrade, otherwise it
would fail to create the map. Besides, "firewall-ips" corresponds to the
"service-ips" map, later we can add use "firewall-nodeports" if it's
determined that NodePort traffic should be subject to
LoadBalancerSourceRanges.
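
A hedged sketch of the per-service approach, assuming the knftables API; the
chain name, addresses, and source range below are illustrative, not the
proxier's actual values:

    package sketch

    import (
        "context"

        "sigs.k8s.io/knftables"
    )

    // addFirewall shows the shape of the vmap-based dispatch: a shared
    // verdict map plus one firewall chain per service port, so no
    // concatenation-of-ranges support is required in the kernel.
    func addFirewall(ctx context.Context, nft knftables.Interface) error {
        tx := nft.NewTransaction()

        // One verdict map for all firewalled service IPs/ports.
        tx.Add(&knftables.Map{
            Name: "firewall-ips",
            Type: "ipv4_addr . inet_proto . inet_service : verdict",
        })

        // The common firewall chain dispatches through the map.
        tx.Add(&knftables.Chain{Name: "firewall-check"})
        tx.Add(&knftables.Rule{
            Chain: "firewall-check",
            Rule:  "ip daddr . meta l4proto . th dport vmap @firewall-ips",
        })

        // Per-service chain: plain ranges only, which work before 5.6.
        tx.Add(&knftables.Chain{Name: "firewall-EXAMPLE-SVC"})
        tx.Add(&knftables.Rule{
            Chain: "firewall-EXAMPLE-SVC",
            Rule:  "ip saddr != { 203.0.113.0/24 } drop",
        })

        // Map the service's LoadBalancer IP/port to its firewall chain.
        tx.Add(&knftables.Element{
            Map:   "firewall-ips",
            Key:   []string{"192.0.2.10", "tcp", "80"},
            Value: []string{"goto firewall-EXAMPLE-SVC"},
        })

        return nft.Run(ctx, tx)
    }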
Signed-off-by: Quan Tian <qtian@vmware.com>
In some cases a chain could change from stale back to active, but once it
had been added to staleChains it would still be deleted. When the proxier
tried to delete a previously stale but currently active chain, the deletion
would fail and produce errors, though it caused no real problem thanks to
the kernel's validation.
This commit removes a chain from staleChains when it becomes active again.
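
A minimal sketch of the bookkeeping change, with chain names as plain
strings and the rest of the proxier's state omitted:

    package main

    import "fmt"

    func main() {
        // Chains scheduled for deletion on a later sync.
        staleChains := map[string]bool{}

        markStale := func(chain string) { staleChains[chain] = true }

        // The fix: if a chain becomes active again before it is deleted,
        // drop it from staleChains so the proxier doesn't later try to
        // delete a chain that is still referenced.
        markActive := func(chain string) { delete(staleChains, chain) }

        markStale("KUBE-SEP-EXAMPLE")
        markActive("KUBE-SEP-EXAMPLE") // service came back before deletion

        for chain := range staleChains {
            fmt.Println("deleting stale chain", chain)
        }
    }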
Signed-off-by: Quan Tian <qtian@vmware.com>
In particular, fix the description of ServiceChangeTracker.Update's
return value, and point out that it's different from
EndpointsChangeTracker.EndpointSliceUpdate's (though fortunately, in a
way that doesn't matter for any existing code).
EndpointSliceCache was using the name "endpointInfo" to refer to two
different data types (most egregiously in addEndpoints(), which had a
variable named `endpoint` of type `*endpointInfo` and a variable named
`endpointInfo` of type `Endpoint`).
Continue using "endpointInfo" in places that refer to proxy.Endpoint /
BaseEndpointInfo, since that's consistent with other code, but rename
the local "cache of the Endpoints field of an EndpointSlice" type from
"endpointInfo" to "endpointData". Likewise, rename endpointSliceInfo
to endpointSliceData, for consistency.
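
A sketch of the resulting naming, with illustrative fields (the real cache
types carry more data than shown here):

    package proxy

    // endpointData caches the parsed Endpoints field of an EndpointSlice
    // (formerly also called "endpointInfo", which collided with the name
    // used for proxy.Endpoint / BaseEndpointInfo values).
    type endpointData struct {
        addresses []string
        ready     bool
    }

    // endpointSliceData (formerly endpointSliceInfo) caches one
    // EndpointSlice's worth of endpointData.
    type endpointSliceData struct {
        endpoints []*endpointData
    }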
Put the ServiceChangeTracker and EndpointsChangeTracker definitions at
the top of the files, and put the ServicePortMap and EndpointsMap
definitions before their methods.
(No code changes.)
Move the ServicePort/BaseServicePortInfo types to serviceport.go.
Move the Endpoint/BaseEndpointInfo types to endpoint.go.
To avoid confusion with the new filenames, rename service.go to
servicechangetracker.go and endpoints.go to endpointschangetracker.go.
(No code changes; this just moves some code from types.go and
services.go to serviceport.go, and some code from types.go and
endpoints.go to endpoint.go.)
(This would become an error rather than a warning once we try to move
this code to another file.)
Also rename an "ok" variable to "exists" since that what it really
means.
ServicePortMap.merge had a giant comment explaining its return value,
but nothing ever used that return value.
ServicePort had an InternalTrafficPolicy() method, but nothing used it
(because it was redundant with InternalPolicyLocal()).
ServicePortMap.Update() and EndpointsMap.Update() were just tiny
wrappers around the corresponding apply() methods, which had no other
callers. So squash them together.
(Also fix the variable naming in ServicePortMap.Update() to match
other methods.)
safeIpset was a wrapper for thread-safely sharing an ipset.IPSet, but
this was unnecessary because ipset.IPSet is just a wrapper around exec
anyway and doesn't need any locking.
In the original version of "MinimizeIPTablesRestore", we skipped the
bottom half of the sync loop when we weren't re-syncing a service, so
certain things that couldn't be skipped had to be done in the top
half. But the code was later changed to always run through the whole
loop body (just not necessarily writing out rules in the bottom half),
so we can reorganize things now to put some related bits of code back
together.
(In particular, this also resolves the fact that we were accidentally
adding the endpoint chains to activeNATChains twice.)
Also change activeNATChains to be a proper generic Set type.
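
A sketch of the set change, assuming k8s.io/apimachinery's generic sets
package; the chain names are placeholders:

    package main

    import (
        "fmt"

        "k8s.io/apimachinery/pkg/util/sets"
        utiliptables "k8s.io/kubernetes/pkg/util/iptables"
    )

    func main() {
        // Previously a map[utiliptables.Chain]bool used as a set, with the
        // endpoint chains inserted twice; a generic set records each chain
        // once and reads more clearly.
        activeNATChains := sets.New[utiliptables.Chain]()
        activeNATChains.Insert("KUBE-SVC-EXAMPLE", "KUBE-SEP-EXAMPLE")

        // Any pre-existing chain not marked active gets cleaned up.
        for _, chain := range []utiliptables.Chain{"KUBE-SVC-EXAMPLE", "KUBE-SVC-STALE"} {
            if !activeNATChains.Has(chain) {
                fmt.Printf("would delete stale chain %s\n", chain)
            }
        }
    }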
kubemark's proxy mode exists to test how kube-proxy affects the load
on the apiserver, not how it affects the load on the node. There's no
need to generate fake iptables commands, because that all happens
entirely independently of the api watchers.
And update most of the comments to refer to "nftables" rather than
"iptables" (even though it doesn't actually do any nftables updating
at this point).
For now the proxy also internally creates a
utiliptablestesting.FakeIPTables to keep the existing sync code
compiling.
(It is confusing, but allowed, to have distinct "KUBE-SERVICES" chains
in "nat" and "filter" in iptables, but in nftables the "type nat" and
"type filter" chains end up in the same table, so we'll need different
names for the two.)
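
For illustration only (this commit still drives a FakeIPTables and writes no
nftables rules), distinct names in a single nftables table would look
roughly like this with knftables; the names shown are placeholders:

    package sketch

    import "sigs.k8s.io/knftables"

    // addServiceChains shows why the iptables-style shared name can't be
    // kept: both base chains live in the same nftables table, so the NAT
    // and filter service chains need distinct names.
    func addServiceChains(tx *knftables.Transaction) {
        tx.Add(&knftables.Chain{
            Name:     "services", // NAT-side service dispatch (placeholder)
            Type:     knftables.PtrTo(knftables.NATType),
            Hook:     knftables.PtrTo(knftables.PreroutingHook),
            Priority: knftables.PtrTo(knftables.DNATPriority),
        })
        tx.Add(&knftables.Chain{
            Name:     "filter-services", // filter-side checks (placeholder)
            Type:     knftables.PtrTo(knftables.FilterType),
            Hook:     knftables.PtrTo(knftables.InputHook),
            Priority: knftables.PtrTo(knftables.FilterPriority),
        })
    }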
Change the svcPortInfo and endpointInfo fields to string rather than
utiliptables.Chain, with various fixups following from that.
Also use a proper set for activeNATChains, and fix the capitalization
of endpointInfo.chainName.