This is an ugly-but-simple rewrite (particularly involving having to
rewrite "single Endpoints with multiple Subsets" as "multiple
EndpointSlices"). Can be cleaned up more later...
The slice code sorts the results slightly differently from the old
code in two cases, and it was simpler to just reorder the expectations
rather than fixing the comparison code. But other than that, the
expected results are exactly the same as before.
This exposed a bug in the EndpointSlice tracking code, which is that
we didn't properly reset the "last change time" when a slice was
deleted. (This means kube-proxy would report an erroneous value in the
"endpoint programming time" metric if a service was added/updated,
then deleted before kube-proxy processed the add/update, then later
added again.)
In the dual-stack case, iptables.NewDualStackProxier and
ipvs.NewDualStackProxier filtered the nodeport addresses values by IP
family before creating the single-stack proxiers. But in the
single-stack case, the kube-proxy startup code just passed the value
to the single-stack proxiers without validation, so they had to
re-check it themselves. Fix that.
TestEndpointsToEndpointsMap tested code that only ran when using
Endpoints tracking rather than EndpointSlice tracking--which is to
say, never, any more. (TestEndpointsMapFromESC in
endpointslicecache_test.go is an equivalent EndpointSlice test.)
This rule was mistakenly added to kubelet even though it only applies
to kube-proxy's traffic. We do not want to remove it from kubelet yet
because other components may be depending on it for security, but we
should make kube-proxy output its own rule rather than depending on
kubelet.
Some of the chains kube-proxy creates are also created by kubelet; we
need to ensure that those chains exist but we should not delete them
in CleanupLeftovers().
Server.Serve() always returns a non-nil error. If it exits because the
http server is closed, it will return ErrServerClosed. To differentiate
the error, this patch changes to close the http server instead of the
listener when the Service is deleted. Closing http server automatically
closes the listener, there is no need to cache the listeners any more.
Signed-off-by: Quan Tian <qtian@vmware.com>
Handle https://github.com/moby/ipvs/issues/27
A work-around was already in place, but a segv would occur
when the bug is fixed. That will not happen now.
To update the scheduler without node reboot now works.
The address for the probe VS is now 198.51.100.0 which is
reseved for documentation, please see rfc5737. The comment
about this is extended.
To check kernel modules is a bad way to check functionality.
This commit just removes the checks and makes it possible to
use a statically linked kernel.
Minimal updates to unit-tests are made.
We had a test that creating a Service with an SCTP port would create
an iptables rule with "-p sctp" in it, which let us test that
kube-proxy was doing vaguely the right thing with SCTP even if the e2e
environment didn't have SCTP support. But this would really make much
more sense as a unit test.
We currently invoke /sbin/iptables 24 times on each syncProxyRules
before calling iptables-restore. Since even trivial iptables
invocations are slow on hosts with lots of iptables rules, this adds a
lot of time to each sync. Since these checks are expected to be a
no-op 99% of the time, skip them on partial syncs.
Currently, there are some unit tests that are failing on Windows due to
various reasons:
- paths not properly joined (filepath.Join should be used).
- Proxy Mode IPVS not supported on Windows.
- DeadlineExceeded can occur when trying to read data from an UDP
socket. This can be used to detect whether the port was closed or not.
- In Windows, with long file name support enabled, file names can have
up to 32,767 characters. In this case, the error
windows.ERROR_FILENAME_EXCED_RANGE will be encountered instead.
- files not closed, which means that they cannot be removed / renamed.
- time.Now() is not as precise on Windows, which means that 2
consecutive calls may return the same timestamp.
- path.Base() will return the same path. filepath.Base() should be used
instead.
- path.Join() will always join the paths with a / instead of the OS
specific separator. filepath.Join() should be used instead.
Kube/proxy, in NodeCIDR local detector mode, uses the node.Spec.PodCIDRs
field to build the Services iptables rules.
The Node object depends on the kubelet, but if kube-proxy runs as a
static pods or as a standalone binary, it is not possible to guarantee
that the values obtained at bootsrap are valid, causing traffic outages.
Kube-proxy has to react on node changes to avoid this problems, it
simply restarts if detect that the node PodCIDRs have changed.
In case that the Node has been deleted, kube-proxy will only log an
error and keep working, since it may break graceful shutdowns of the
node.
Some of the unit tests cannot pass on Windows due to various reasons:
- fsnotify does not have a Windows implementation.
- Proxy Mode IPVS not supported on Windows.
- Seccomp not supported on Windows.
- VolumeMode=Block is not supported on Windows.
- iSCSI volumes are mounted differently on Windows, and iscsiadm is a
Linux utility.
iptables-restore requires that if you change any rule in a chain, you
have to rewrite the entire chain. But if you avoid mentioning a chain
at all, it will leave it untouched. Take advantage of this by not
rewriting the SVC, SVL, EXT, FW, and SEP chains for services that have
not changed since the last sync, which should drastically cut down on
the size of each iptables-restore in large clusters.
Level 4 is mean for debug operations.
The default level use to be level 2, on clusters with a lot of
Services this means that the kube-proxy will generate a lot of
noise on the logs, with the performance penalty associated to
it.
Update the IPVS proxier to have a bool `initialSync` which is set to
true when a new proxier is initialized and then set to false on all
syncs. This lets us run startup-only logic, which subsequently lets us
update the realserver only when needed and avoiding any expensive
operations.
Signed-off-by: Sanskar Jaiswal <jaiswalsanskar078@gmail.com>
If the user passes "--proxy-mode ipvs", and it is not possible to use
IPVS, then error out rather than falling back to iptables.
There was never any good reason to be doing fallback; this was
presumably erroneously added to parallel the iptables-to-userspace
fallback (which only existed because we had wanted iptables to be the
default but not all systems could support it).
In particular, if the user passed configuration options for ipvs, then
they presumably *didn't* pass configuration options for iptables, and
so even if the iptables proxy is able to run, it is likely to be
misconfigured.
Back when iptables was first made the default, there were
theoretically some users who wouldn't have been able to support it due
to having an old /sbin/iptables. But kube-proxy no longer does the
things that didn't work with old iptables, and we removed that check a
long time ago. There is also a check for a new-enough kernel version,
but it's checking for a feature which was added in kernel 3.6, and no
one could possibly be running Kubernetes with a kernel that old. So
the fallback code now never actually falls back, so it should just be
removed.
- Run hack/update-codegen.sh
- Run hack/update-generated-device-plugin.sh
- Run hack/update-generated-protobuf.sh
- Run hack/update-generated-runtime.sh
- Run hack/update-generated-swagger-docs.sh
- Run hack/update-openapi-spec.sh
- Run hack/update-gofmt.sh
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
The proxies watch node labels for topology changes, but node labels
can change in bursts especially in larger clusters. This causes
pressure on all proxies because they can't filter the events, since
the topology could match on any label.
Change node event handling to queue the request rather than immediately
syncing. The sync runner can already handle short bursts which shouldn't
change behavior for most cases.
Signed-off-by: Dan Williams <dcbw@redhat.com>
Part of reorganizing the syncProxyRules loop to do:
1. figure out what chains are needed, mark them in activeNATChains
2. write servicePort jump rules to KUBE-SERVICES/KUBE-NODEPORTS
3. write servicePort-specific chains (SVC, SVL, EXT, FW, SEP)
This moves the FW chain creation to the end (rather than having it in
the middle of adding the jump rules for the LB IPs).
Part of reorganizing the syncProxyRules loop to do:
1. figure out what chains are needed, mark them in activeNATChains
2. write servicePort jump rules to KUBE-SERVICES/KUBE-NODEPORTS
3. write servicePort-specific chains (SVC, SVL, EXT, FW, SEP)
This fixes the jump rules for internal traffic. Previously we were
handling "jumping from kubeServices to internalTrafficChain" and
"adding masquerade rules to internalTrafficChain" in the same place.
Part of reorganizing the syncProxyRules loop to do:
1. figure out what chains are needed, mark them in activeNATChains
2. write servicePort jump rules to KUBE-SERVICES/KUBE-NODEPORTS
3. write servicePort-specific chains (SVC, SVL, EXT, FW, SEP)
This fixes the handling of the EXT chain.
Part of reorganizing the syncProxyRules loop to do:
1. figure out what chains are needed, mark them in activeNATChains
2. write servicePort jump rules to KUBE-SERVICES/KUBE-NODEPORTS
3. write servicePort-specific chains (SVC, SVL, EXT, FW, SEP)
This fixes the handling of the SVC and SVL chains. We were already
filling them in at the end of the loop; this fixes it to create them
at the bottom of the loop as well.
Part of reorganizing the syncProxyRules loop to do:
1. figure out what chains are needed, mark them in activeNATChains
2. write servicePort jump rules to KUBE-SERVICES/KUBE-NODEPORTS
3. write servicePort-specific chains (SVC, SVL, EXT, FW, SEP)
This fixes the handling of the endpoint chains. Previously they were
handled entirely at the top of the loop. Now we record which ones are
in use at the top but don't create them and fill them in until the
bottom.
We figure out early on whether we're going to end up outputting no
endpoints, so update the metrics then.
(Also remove a redundant feature gate check; svcInfo already checks
the ServiceInternalTrafficPolicy feature gate itself and so
svcInfo.InternalPolicyLocal() will always return false if the gate is
not enabled.)
Rather than marking packets to be dropped in the "nat" table and then
dropping them from the "filter" table later, just use rules in
"filter" to drop the packets we don't like directly.
Re-sync the rules from TestOverallIPTablesRulesWithMultipleServices to
make sure we're testing all the right kinds of rules. Remove a
duplicate copy of the KUBE-MARK-MASQ and KUBE-POSTROUTING rules.
Update the "REJECT" test to use the new svc6 from
TestOverallIPTablesRulesWithMultipleServices. (Previously it had used
a modified version of TOIPTRWMS's svc3.)
svc2b was using the same ClusterIP as svc3; change it and rename the
service to svc5 to make everything clearer.
Move the test of LoadBalancerSourceRanges from svc2 to svc5, so that
svc2 tests the rules for dropping packets due to
externalTrafficPolicy, and svc5 tests the rules for dropping packets
due to LoadBalancerSourceRanges, rather than having them both mixed
together in svc2.
Add svc6 with no endpoints.
"iptables-save" takes several seconds to run on machines with lots of
iptables rules, and we only use its result to figure out which chains
are no longer referenced by any rules. While it makes things less
confusing if we delete unused chains immediately, it's not actually
_necessary_ since they never get called during packet processing. So
in large clusters, make it so we only clean up chains periodically
rather than on every sync.
We don't need to parse out the counter values from the iptables-save
output (since they are always 0 for the chains we care about). Just
parse the chain names themselves.
Also, all of the callers of GetChainLines() pass it input that
contains only a single table, so just assume that, rather than
carefully parsing only a single table's worth of the input.
The iptables and ipvs proxies have code to try to preserve certain
iptables counters when modifying chains via iptables-restore, but the
counters in question only actually exist for the built-in chains (eg
INPUT, FORWARD, PREROUTING, etc), which we never modify via
iptables-restore (and in fact, *can't* safely modify via
iptables-restore), so we are really just doing a lot of unnecessary
work to copy the constant string "[0:0]" over from iptables-save
output to iptables-restore input. So stop doing that.
Also fix a confused error message when iptables-save fails.
The ipvs proxier was figuring out LoadBalancerSourceRanges matches in
the nat table and using KUBE-MARK-DROP to mark unmatched packets to be
dropped later. But with ipvs, unlike with iptables, DNAT happens after
the packet is "delivered" to the dummy interface, so the packet will
still be unmodified when it reaches the filter table (the first time)
so there's no reason to split the work between the nat and filter
tables; we can just do it all from the filter table and call DROP
directly.
Before:
- KUBE-LOAD-BALANCER (in nat) uses kubeLoadBalancerFWSet to match LB
traffic for services using LoadBalancerSourceRanges, and sends it
to KUBE-FIREWALL.
- KUBE-FIREWALL uses kubeLoadBalancerSourceCIDRSet and
kubeLoadBalancerSourceIPSet to match allowed source/dest combos
and calls "-j RETURN".
- All remaining traffic that doesn't escape KUBE-FIREWALL is sent to
KUBE-MARK-DROP.
- Traffic sent to KUBE-MARK-DROP later gets dropped by chains in
filter created by kubelet.
After:
- All INPUT and FORWARD traffic gets routed to KUBE-PROXY-FIREWALL
(in filter). (We don't use "KUBE-FIREWALL" any more because
there's already a chain in filter by that name that belongs to
kubelet.)
- KUBE-PROXY-FIREWALL sends traffic matching kubeLoadbalancerFWSet
to KUBE-SOURCE-RANGES-FIREWALL
- KUBE-SOURCE-RANGES-FIREWALL uses kubeLoadBalancerSourceCIDRSet and
kubeLoadBalancerSourceIPSet to match allowed source/dest combos
and calls "-j RETURN".
- All remaining traffic that doesn't escape
KUBE-SOURCE-RANGES-FIREWALL is dropped (directly via "-j DROP").
- (KUBE-LOAD-BALANCER in nat is now used only to set up masquerading)
kube-proxy generates iptables rules to forward traffic from Services to Endpoints
kube-proxy uses iptables-restore to configure the rules atomically, however,
this has the downside that large number of rules take a long time to be processed,
causing disruption.
There are different parameters than influence the number of rules generated:
- ServiceType
- Number of Services
- Number of Endpoints per Service
This test will fail when the number of rules change, so the person
that is modifying the code can have feedback about the performance impact
on their changes. It also runs multiple number of rules test cases to check
if the number of rules grows linearly.
kubeLoadbalancerFWSet was the only LoadBalancer-related identifier
with a lowercase "b", so fix that.
rename TestLoadBalanceSourceRanges to TestLoadBalancerSourceRanges to
match the field name (and the iptables proxier test).
Signed-off-by: gkarthiks <github.gkarthiks@gmail.com>
refactor: svc port name variable #108806
Signed-off-by: gkarthiks <github.gkarthiks@gmail.com>
refactor: rename struct for service port information to servicePortInfo and fields for more redability
Signed-off-by: gkarthiks <github.gkarthiks@gmail.com>
fix: drop chain rule
Signed-off-by: gkarthiks <github.gkarthiks@gmail.com>
FakeIPTables barely implemented any of the iptables interface, and the
main part that it did implement, it implemented incorrectly. Fix it:
- Implement EnsureChain, DeleteChain, EnsureRule, and DeleteRule, not
just SaveInto/Restore/RestoreAll.
- Restore/RestoreAll now correctly merge the provided state with the
existing state, rather than simply overwriting it.
- SaveInto now returns the table that was requested, rather than just
echoing back the Restore/RestoreAll.
Sort the ":CHAINNAME" lines in the same order as the "-A CHAINNAME"
lines (meaning, KUBE-NODEPORTS and KUBE-SERVICES come first).
(This will simplify IPTablesDump because it won't need to keep track
of the declaration order and the rule order separately.)
There were previously some strange iptables-rule-parsing functions
that were only used by two unit tests in pkg/proxy/ipvs. Get rid of
them and replace them with some much better iptables-rule-parsing
functions.
The various loops in the LoadBalancer rule section were mis-nested
such that if a service had multiple LoadBalancer IPs, we would write
out the firewall rules multiple times (and the allowFromNode rule for
the second and later IPs would end up being written after the "else
DROP" rule from the first IP).
The LoadBalancer rules change if the node IP is in one of the
LoadBalancerSourceRange subnets, so make sure to set nodeIP on the
fake proxier so we can test this, and add a second source range to
TestLoadBalancer containing the node IP. (This changes the result of
one flow test that previously expected that node-to-LB would be
dropped.)
Resolved issues with proxy rules taking a long time to be synced on Windows, by caching HNS data.
In particular, the following HNS data will be cached for the context of syncProxyRules:
* HNS endpoints
* HNS load balancers
Add TestInternalExternalMasquerade, which tests whether various
packets are considered internal or external for purposes of traffic
policy, and whether they get masqueraded, with and without
--masquerade-all, with and without a working LocalTrafficDetector.
(This extends and replaces the old TestMasqueradeAll.)
Add a new framework for testing out how particular packets would be
handled by a given set of iptables rules. (eg, "assert that a packet
from 10.180.0.2 to 172.30.0.41:80 gets NATted to 10.180.0.1:80 without
being masqueraded"). Add tests using this to all of the existing unit
tests.
This makes it easier to tell whether a given code change has any
effect on behavior, without having to carefully examine the diffs to
the generated iptables rules.