"iptables-save" takes several seconds to run on machines with lots of
iptables rules, and we only use its result to figure out which chains
are no longer referenced by any rules. While it makes things less
confusing if we delete unused chains immediately, it's not actually
_necessary_ since they never get called during packet processing. So
in large clusters, make it so we only clean up chains periodically
rather than on every sync.
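A minimal sketch of the periodic-cleanup gating this describes, assuming a simple time-based check inside the proxier's sync loop; the type and field names here are hypothetical, not the actual kube-proxy code:

    // Hypothetical illustration: run stale-chain cleanup (which needs a slow
    // iptables-save) at most once per interval instead of on every sync.
    package main

    import (
        "fmt"
        "time"
    )

    type chainCleaner struct {
        interval    time.Duration
        lastCleanup time.Time
    }

    // shouldCleanStaleChains reports whether enough time has passed since the
    // previous cleanup to justify another iptables-save pass, and records the
    // cleanup time if so.
    func (c *chainCleaner) shouldCleanStaleChains(now time.Time) bool {
        if now.Sub(c.lastCleanup) < c.interval {
            return false
        }
        c.lastCleanup = now
        return true
    }

    func main() {
        c := &chainCleaner{interval: time.Hour}
        fmt.Println(c.shouldCleanStaleChains(time.Now())) // true on the first sync
        fmt.Println(c.shouldCleanStaleChains(time.Now())) // false until an hour passes
    }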
We don't need to parse out the counter values from the iptables-save
output (since they are always 0 for the chains we care about). Just
parse the chain names themselves.
Also, all of the callers of GetChainLines() pass it input that
contains only a single table, so just assume that, rather than
carefully parsing only a single table's worth of the input.
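A rough sketch of the simplified parsing this describes, assuming the input is iptables-save output for a single table; this is illustrative only, not the actual GetChainLines implementation:

    // Given iptables-save output for a single table, collect just the chain
    // names from the ":CHAIN ..." declaration lines, ignoring the policy and
    // counter fields.
    package main

    import (
        "bufio"
        "fmt"
        "strings"
    )

    func chainNames(saveOutput string) []string {
        var chains []string
        scanner := bufio.NewScanner(strings.NewReader(saveOutput))
        for scanner.Scan() {
            line := scanner.Text()
            if !strings.HasPrefix(line, ":") {
                continue // not a chain declaration
            }
            // ":KUBE-SERVICES - [0:0]" -> "KUBE-SERVICES"
            fields := strings.Fields(line[1:])
            if len(fields) > 0 {
                chains = append(chains, fields[0])
            }
        }
        return chains
    }

    func main() {
        out := `*nat
    :PREROUTING ACCEPT [0:0]
    :KUBE-SERVICES - [0:0]
    -A PREROUTING -j KUBE-SERVICES
    COMMIT
    `
        fmt.Println(chainNames(out)) // [PREROUTING KUBE-SERVICES]
    }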
The iptables and ipvs proxies have code to try to preserve certain
iptables counters when modifying chains via iptables-restore, but the
counters in question only actually exist for the built-in chains (e.g.
INPUT, FORWARD, PREROUTING, etc), which we never modify via
iptables-restore (and in fact, *can't* safely modify via
iptables-restore), so we are really just doing a lot of unnecessary
work to copy the constant string "[0:0]" over from iptables-save
output to iptables-restore input. So stop doing that.
Also fix a confused error message when iptables-save fails.
The ipvs proxier was figuring out LoadBalancerSourceRanges matches in
the nat table and using KUBE-MARK-DROP to mark unmatched packets to be
dropped later. But with ipvs, unlike with iptables, DNAT happens after
the packet is "delivered" to the dummy interface, so the packet will
still be unmodified when it first reaches the filter table, so there's
no reason to split the work between the nat and filter tables; we can
just do it all from the filter table and call DROP directly.
Before:
- KUBE-LOAD-BALANCER (in nat) uses kubeLoadBalancerFWSet to match LB
traffic for services using LoadBalancerSourceRanges, and sends it
to KUBE-FIREWALL.
- KUBE-FIREWALL uses kubeLoadBalancerSourceCIDRSet and
kubeLoadBalancerSourceIPSet to match allowed source/dest combos
and calls "-j RETURN".
- All remaining traffic that doesn't escape KUBE-FIREWALL is sent to
KUBE-MARK-DROP.
- Traffic sent to KUBE-MARK-DROP later gets dropped by chains in
filter created by kubelet.
After (see the sketch after this list):
- All INPUT and FORWARD traffic gets routed to KUBE-PROXY-FIREWALL
(in filter). (We don't use "KUBE-FIREWALL" any more because
there's already a chain in filter by that name that belongs to
kubelet.)
- KUBE-PROXY-FIREWALL sends traffic matching kubeLoadbalancerFWSet
to KUBE-SOURCE-RANGES-FIREWALL
- KUBE-SOURCE-RANGES-FIREWALL uses kubeLoadBalancerSourceCIDRSet and
kubeLoadBalancerSourceIPSet to match allowed source/dest combos
and calls "-j RETURN".
- All remaining traffic that doesn't escape
KUBE-SOURCE-RANGES-FIREWALL is dropped (directly via "-j DROP").
- (KUBE-LOAD-BALANCER in nat is now used only to set up masquerading)
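A rough sketch of what the resulting filter-table rules could look like, expressed as the kind of iptables-restore input the proxier writes; the ipset names and match options here are simplified placeholders rather than the proxier's exact output:

    package main

    import "fmt"

    // Illustrative "After" layout: INPUT/FORWARD traffic is steered into
    // KUBE-PROXY-FIREWALL, LB traffic for services with source ranges goes to
    // KUBE-SOURCE-RANGES-FIREWALL, allowed combos RETURN, and everything else
    // is dropped directly, with no KUBE-MARK-DROP round trip.
    const filterRules = `*filter
    :KUBE-PROXY-FIREWALL - [0:0]
    :KUBE-SOURCE-RANGES-FIREWALL - [0:0]
    -A INPUT -j KUBE-PROXY-FIREWALL
    -A FORWARD -j KUBE-PROXY-FIREWALL
    -A KUBE-PROXY-FIREWALL -m set --match-set KUBE-LOAD-BALANCER-FW dst,dst -j KUBE-SOURCE-RANGES-FIREWALL
    -A KUBE-SOURCE-RANGES-FIREWALL -m set --match-set KUBE-LOAD-BALANCER-SOURCE-CIDR dst,dst,src -j RETURN
    -A KUBE-SOURCE-RANGES-FIREWALL -j DROP
    COMMIT
    `

    func main() {
        fmt.Print(filterRules)
    }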
kube-proxy generates iptables rules to forward traffic from Services to Endpoints.
kube-proxy uses iptables-restore to configure the rules atomically; however,
this has the downside that a large number of rules takes a long time to be
processed, causing disruption.
There are different parameters that influence the number of rules generated:
- ServiceType
- Number of Services
- Number of Endpoints per Service
This test will fail when the number of rules changes, so the person
modifying the code gets feedback about the performance impact of their
changes. It also runs test cases with different numbers of rules to check
whether the number of rules grows linearly.
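A hypothetical sketch of the kind of check this test performs: count the rules in the generated iptables-restore output and compare them against an expected baseline, so a change in rule count shows up immediately. The helper and the numbers below are made up for illustration:

    package main

    import (
        "fmt"
        "strings"
    )

    // countRules returns the number of "-A ..." lines in iptables-restore input.
    func countRules(restoreInput string) int {
        n := 0
        for _, line := range strings.Split(restoreInput, "\n") {
            if strings.HasPrefix(line, "-A ") {
                n++
            }
        }
        return n
    }

    func main() {
        restoreInput := "*nat\n:KUBE-SERVICES - [0:0]\n-A KUBE-SERVICES -j RETURN\nCOMMIT\n"
        const expected = 1 // illustrative baseline, not a real kube-proxy number
        if got := countRules(restoreInput); got != expected {
            fmt.Printf("rule count changed: expected %d, got %d\n", expected, got)
        } else {
            fmt.Println("rule count unchanged")
        }
    }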
kubeLoadbalancerFWSet was the only LoadBalancer-related identifier
with a lowercase "b", so fix that.
Rename TestLoadBalanceSourceRanges to TestLoadBalancerSourceRanges to
match the field name (and the iptables proxier test).
FakeIPTables barely implemented any of the iptables interface, and the
main part that it did implement, it implemented incorrectly. Fix it (see
the sketch after this list):
- Implement EnsureChain, DeleteChain, EnsureRule, and DeleteRule, not
just SaveInto/Restore/RestoreAll.
- Restore/RestoreAll now correctly merge the provided state with the
existing state, rather than simply overwriting it.
- SaveInto now returns the table that was requested, rather than just
echoing back the most recent Restore/RestoreAll input.
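A much-simplified sketch of the merge semantics described above (not the real FakeIPTables): chains named in a restore are replaced, chains that aren't mentioned are left alone, and ensuring a chain is idempotent:

    package main

    import "fmt"

    type fakeTable map[string][]string // chain name -> rules in that chain

    type fakeIPTables struct {
        tables map[string]fakeTable
    }

    func newFake() *fakeIPTables {
        return &fakeIPTables{tables: map[string]fakeTable{}}
    }

    // ensureChain creates the table and chain if they don't already exist.
    func (f *fakeIPTables) ensureChain(table, chain string) {
        t, ok := f.tables[table]
        if !ok {
            t = fakeTable{}
            f.tables[table] = t
        }
        if _, ok := t[chain]; !ok {
            t[chain] = nil
        }
    }

    // restore merges newState into the existing table state: chains named in
    // newState are replaced, chains not mentioned are left untouched.
    func (f *fakeIPTables) restore(table string, newState fakeTable) {
        t, ok := f.tables[table]
        if !ok {
            t = fakeTable{}
            f.tables[table] = t
        }
        for chain, rules := range newState {
            t[chain] = rules
        }
    }

    func main() {
        f := newFake()
        f.ensureChain("filter", "KUBE-FORWARD")
        f.restore("filter", fakeTable{"KUBE-PROXY-FIREWALL": {"-j DROP"}})
        // KUBE-FORWARD is still present even though the restore didn't mention it.
        fmt.Println(len(f.tables["filter"])) // 2
    }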
Sort the ":CHAINNAME" lines in the same order as the "-A CHAINNAME"
lines (meaning, KUBE-NODEPORTS and KUBE-SERVICES come first).
(This will simplify IPTablesDump because it won't need to keep track
of the declaration order and the rule order separately.)
There were previously some strange iptables-rule-parsing functions
that were only used by two unit tests in pkg/proxy/ipvs. Get rid of
them and replace them with some much better iptables-rule-parsing
functions.
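For illustration, parsing helpers along these lines might split an appended rule into its chain and arguments; this is a sketch of the general idea, not the actual functions added here:

    package main

    import (
        "fmt"
        "strings"
    )

    type parsedRule struct {
        chain string
        args  []string
    }

    // parseRule turns an "-A CHAIN ..." line into its chain name and the
    // remaining arguments. (A real parser would also handle quoted values,
    // such as "-m comment --comment ...".)
    func parseRule(line string) (parsedRule, error) {
        fields := strings.Fields(line)
        if len(fields) < 2 || fields[0] != "-A" {
            return parsedRule{}, fmt.Errorf("not an append rule: %q", line)
        }
        return parsedRule{chain: fields[1], args: fields[2:]}, nil
    }

    func main() {
        r, err := parseRule("-A KUBE-SERVICES -d 172.30.0.41/32 -p tcp -m tcp --dport 80 -j KUBE-SVC-XPGD46QRK7WJZT7O")
        if err != nil {
            panic(err)
        }
        fmt.Println(r.chain, r.args)
    }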
The various loops in the LoadBalancer rule section were mis-nested
such that if a service had multiple LoadBalancer IPs, we would write
out the firewall rules multiple times (and the allowFromNode rule for
the second and later IPs would end up being written after the "else
DROP" rule from the first IP).
The LoadBalancer rules change if the node IP is in one of the
LoadBalancerSourceRange subnets, so make sure to set nodeIP on the
fake proxier so we can test this, and add a second source range to
TestLoadBalancer containing the node IP. (This changes the result of
one flow test that previously expected that node-to-LB would be
dropped.)
Resolved issues with proxy rules taking a long time to sync on Windows by
caching HNS data. In particular, the following HNS data is cached for the
duration of each syncProxyRules run (see the sketch after this list):
* HNS endpoints
* HNS load balancers
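A hypothetical sketch of the caching shape (the real code talks to the Windows HNS APIs; the types and query functions below are stand-ins): the expensive list calls happen once at the start of syncProxyRules, and later lookups are answered from the cache:

    package main

    import "fmt"

    type hnsEndpoint struct{ id, ip string }
    type hnsLoadBalancer struct{ id string }

    type hnsCache struct {
        endpointsByIP map[string]*hnsEndpoint
        loadBalancers map[string]*hnsLoadBalancer
    }

    // refresh issues the (expensive) HNS list calls once per sync and indexes
    // the results for fast lookup during the rest of the sync.
    func refresh(listEndpoints func() []*hnsEndpoint, listLBs func() []*hnsLoadBalancer) *hnsCache {
        c := &hnsCache{
            endpointsByIP: map[string]*hnsEndpoint{},
            loadBalancers: map[string]*hnsLoadBalancer{},
        }
        for _, ep := range listEndpoints() {
            c.endpointsByIP[ep.ip] = ep
        }
        for _, lb := range listLBs() {
            c.loadBalancers[lb.id] = lb
        }
        return c
    }

    func main() {
        cache := refresh(
            func() []*hnsEndpoint { return []*hnsEndpoint{{id: "ep1", ip: "10.0.0.5"}} },
            func() []*hnsLoadBalancer { return []*hnsLoadBalancer{{id: "lb1"}} },
        )
        // During the rest of syncProxyRules, lookups hit the cache, not HNS.
        fmt.Println(cache.endpointsByIP["10.0.0.5"].id)
    }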
Add TestInternalExternalMasquerade, which tests whether various
packets are considered internal or external for purposes of traffic
policy, and whether they get masqueraded, with and without
--masquerade-all, with and without a working LocalTrafficDetector.
(This extends and replaces the old TestMasqueradeAll.)
Add a new framework for testing how particular packets would be handled
by a given set of iptables rules (e.g., "assert that a packet from
10.180.0.2 to 172.30.0.41:80 gets NATted to 10.180.0.1:80 without being
masqueraded"). Add tests using this framework to all of the existing unit
tests.
This makes it easier to tell whether a given code change has any
effect on behavior, without having to carefully examine the diffs to
the generated iptables rules.
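For illustration, a packet-flow assertion in this style might look roughly like the following; the struct and field names are invented here, not the actual test helpers:

    package main

    import "fmt"

    // packetFlowCase describes one packet and what we expect the rules to do
    // with it.
    type packetFlowCase struct {
        sourceIP string
        destIP   string
        destPort int

        expectedOutput string // e.g. the endpoint the packet is DNATted to
        expectMasq     bool
    }

    func main() {
        tc := packetFlowCase{
            sourceIP:       "10.180.0.2",
            destIP:         "172.30.0.41",
            destPort:       80,
            expectedOutput: "10.180.0.1:80",
            expectMasq:     false,
        }
        // A real test would trace the packet through the generated iptables
        // rules and compare the result against these expectations.
        fmt.Printf("%s -> %s:%d should DNAT to %s (masquerade=%v)\n",
            tc.sourceIP, tc.destIP, tc.destPort, tc.expectedOutput, tc.expectMasq)
    }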
We originally had one HealthCheckNodePort test that used
assertIPTablesRulesEqual() and one that didn't, but later I went
through and made all the tests use assertIPTablesRulesEqual() and
didn't notice that this resulted in there now being two
nearly-identical HealthCheckNodePort tests.
When cleaning up iptables rules and ipsets used by kube-proxy in IPVS
mode, the iptables chain KUBE-NODE-PORT needs to be deleted before the
ipset KUBE-HEALTH-CHECK-NODE-PORT can be removed. Therefore, this change
adds deletion of the iptables chain KUBE-NODE-PORT.
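A hypothetical sketch of the ordering constraint: the chain that references the ipset has to go away before the ipset can be destroyed. The helper signatures below are stand-ins for the real iptables/ipset interfaces:

    package main

    import "fmt"

    // cleanupLeftovers deletes the chain first; otherwise destroying the set
    // fails because the kernel still sees it referenced by an iptables rule.
    func cleanupLeftovers(deleteChain func(chain string) error, destroySet func(set string) error) error {
        if err := deleteChain("KUBE-NODE-PORT"); err != nil {
            return err
        }
        return destroySet("KUBE-HEALTH-CHECK-NODE-PORT")
    }

    func main() {
        err := cleanupLeftovers(
            func(chain string) error { fmt.Println("deleting chain", chain); return nil },
            func(set string) error { fmt.Println("destroying ipset", set); return nil },
        )
        if err != nil {
            fmt.Println("cleanup failed:", err)
        }
    }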