- Remove feature gate consideration from EndpointSlice validation
- Deprecate topology field, noting that it will be removed in a future
release
- Update kube-proxy to check for NodeName if feature gate is enabled
- Add comments indicating the feature gates that can be used to enable
alpha API fields
- Add comments explaining use of deprecated address type in tests
The tests for most functions have also been revised to check the errors
explicitly upon validation. This properly catches cases where we should be
returning multiple errors, whether more than one error occurs or just one
block is failing.
Signed-off-by: Christopher M. Luciano <cmluciano@us.ibm.com>
This commit revises validateProxyNodePortAddress and
validateExcludeCIDRS to report the exact CIDR that is
invalid within the array of strings. Previously we would just return
the whole block of addresses; now we identify the exact address
within the block to eliminate confusion. I also removed the break from
validateProxyNodePortAddress so that we can report on all addresses that
may be invalid.
The tests for each function have also been revised to check the errors
explicitly upon validation. This also properly catches cases where we
should return multiple errors when more than one CIDR is invalid.
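A minimal sketch of the per-entry reporting pattern described above (the helper name and field path are illustrative, not the actual kube-proxy validation code):
    import (
        "net"

        "k8s.io/apimachinery/pkg/util/validation/field"
    )

    // validateCIDRList returns one error per invalid CIDR instead of rejecting
    // the whole block and stopping at the first failure.
    func validateCIDRList(cidrs []string, fldPath *field.Path) field.ErrorList {
        allErrs := field.ErrorList{}
        for i, cidr := range cidrs {
            if _, _, err := net.ParseCIDR(cidr); err != nil {
                // No break here: keep scanning so every bad entry is reported.
                allErrs = append(allErrs, field.Invalid(fldPath.Index(i), cidr, "must be a valid CIDR"))
            }
        }
        return allErrs
    }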
Signed-off-by: Christopher M. Luciano <cmluciano@us.ibm.com>
A previous PR (#71573) intended to clear conntrack entries on endpoint
changes when using nodeport, by introducing a dedicated function to
remove the stale conntrack entry on the node port and allow traffic to
resume. In doing so, it introduced a nodeport-specific bug where the
conntrack entries related to the ClusterIP do not get cleaned if an
endpoint is changed (issue #96174). We fix this by doing the ClusterIP
cleanup in all cases.
* api: structure change
* api: defaulting, conversion, and validation
* [FIX] validation: auto remove second ip/family when service changes to SingleStack
* [FIX] api: defaulting, conversion, and validation
* api-server: clusterIPs alloc, printers, storage and strategy
* [FIX] clusterIPs default on read
* alloc: auto remove second ip/family when service changes to SingleStack
* api-server: repair loop handling for clusterIPs
* api-server: force kubernetes default service into single stack
* api-server: tie dualstack feature flag with endpoint feature flag
* controller-manager: feature flag, endpoint, and endpointSlice controllers handling multi family service
* [FIX] controller-manager: feature flag, endpoint, and endpointSlice controllers handling multi family service
* kube-proxy: feature-flag, utils, proxier, and meta proxier
* [FIX] kubeproxy: call both proxiers at the same time
* kubenet: remove forced pod IP sorting
* kubectl: modify describe to include ClusterIPs, IPFamilies, and IPFamilyPolicy
* e2e: fix tests that depend on the IPFamily field AND add dual stack tests
* e2e: fix expected error message for ClusterIP immutability
* add integration tests for dualstack
The third phase of dual stack is a very complex change in the API;
basically, it introduces dual-stack Services. The main changes are:
- It pluralizes the Service IPFamily field to IPFamilies,
and removes the singular field.
- It introduces a new field, IPFamilyPolicyType, that can take
3 values to express the "dual-stack(mad)ness" of the cluster:
SingleStack, PreferDualStack and RequireDualStack.
- It pluralizes ClusterIP to ClusterIPs (see the sketch below).
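As a rough illustration of the pluralized fields, using the core v1 Go types (the exact policy type name may differ at this phase of the API; the IPs and port are made up), a RequireDualStack Service could look like:
    // Assumes: v1 "k8s.io/api/core/v1"
    policy := v1.IPFamilyPolicyRequireDualStack
    svc := &v1.Service{
        Spec: v1.ServiceSpec{
            IPFamilyPolicy: &policy,
            IPFamilies:     []v1.IPFamily{v1.IPv4Protocol, v1.IPv6Protocol},
            // Populated by the allocator: one ClusterIP per requested family.
            ClusterIPs: []string{"10.0.0.10", "fd00::10"},
            Ports:      []v1.ServicePort{{Port: 80, Protocol: v1.ProtocolTCP}},
        },
    }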
The goal is to add coverage to the services API operations,
taking into account the 6 different modes a cluster can have:
- single stack: IPv4 or IPv6 (as of today)
- dual stack: IPv4 only, IPv6 only, IPv4 - IPv6, IPv6 - IPv4
* [FIX] add integration tests for dualstack
* generated data
* generated files
Co-authored-by: Antonio Ojea <aojea@redhat.com>
In #56164, we split the reject rules for services with no endpoints into
the KUBE-EXTERNAL-SERVICES chain in order to avoid calling KUBE-SERVICES
from INPUT. However, in #74394 KUBE-SERVICES was re-added to INPUT.
As noted in #56164, the kernel is sensitive to the size of the INPUT chain.
This patch refrains from calling the KUBE-SERVICES chain from INPUT and
FORWARD, and instead adds the load balancer reject rule to the
KUBE-EXTERNAL-SERVICES chain, which is called from INPUT and FORWARD.
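A rough sketch of the resulting jump rules using the utiliptables helpers (ipt is an assumed utiliptables.Interface; the match and comment arguments are illustrative):
    // Assumes: utiliptables "k8s.io/kubernetes/pkg/util/iptables"
    const kubeExternalServicesChain utiliptables.Chain = "KUBE-EXTERNAL-SERVICES"

    for _, chain := range []utiliptables.Chain{utiliptables.ChainInput, utiliptables.ChainForward} {
        args := []string{
            "-m", "conntrack", "--ctstate", "NEW",
            "-m", "comment", "--comment", "kubernetes externally-visible service portals",
            "-j", string(kubeExternalServicesChain),
        }
        if _, err := ipt.EnsureRule(utiliptables.Prepend, utiliptables.TableFilter, chain, args...); err != nil {
            klog.Errorf("failed to ensure that chain %s jumps to %s: %v", chain, kubeExternalServicesChain, err)
        }
    }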
The provided DialContext wraps existing clients' DialContext in an attempt to
preserve any existing timeout configuration. In some cases, we may replace
infinite timeouts with golang defaults.
- scaleio: tcp connect/keepalive values changed from 0/15 to 30/30
- storageos: no change
Before this fix, a Service with a loadBalancerSourceRange value that
included a space would cause kube-proxy to crashloop. This updates
kube-proxy to trim any space from that field.
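A minimal sketch of the trimming, applied where the source ranges are read from the Service spec (variable names illustrative):
    // Assumes: "strings"
    sourceRanges := make([]string, 0, len(service.Spec.LoadBalancerSourceRanges))
    for _, sourceRange := range service.Spec.LoadBalancerSourceRanges {
        // A stray space (e.g. "10.0.0.0/8, 192.168.0.0/16") would otherwise
        // fail CIDR parsing and crashloop kube-proxy.
        sourceRanges = append(sourceRanges, strings.TrimSpace(sourceRange))
    }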
Currently kube-proxy defaults the min-sync-period for
iptables to 0. However, as explained by Dan Winship,
"With minSyncPeriod: 0, you run iptables-restore 100 times.
With minSyncPeriod: 1s , you run iptables-restore once.
With minSyncPeriod: 10s , you also run iptables-restore once,
but you might have to wait 10 seconds first"
Masquerade the traffic that loops back to the originator
before it hits the kubernetes-specific postrouting rules
Signed-off-by: Antonio Ojea <antonio.ojea.garcia@gmail.com>
When dual-stack is enabled, kube-proxy infers the Service IP family from
the ClusterIP, because the ipFamily field is going to be deprecated.
Since kube-proxy skips headless and ExternalName services, we
can safely obtain the IPFamily from the ClusterIP field.
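A minimal sketch of that inference (the helper name is illustrative; utilnet is k8s.io/utils/net):
    // Headless ("None") and ExternalName services are skipped by kube-proxy,
    // so ClusterIP is always a real IP by the time we get here.
    func ipFamilyOf(svc *v1.Service) v1.IPFamily {
        if utilnet.IsIPv6String(svc.Spec.ClusterIP) {
            return v1.IPv6Protocol
        }
        return v1.IPv4Protocol
    }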
Signed-off-by: Antonio Ojea <antonio.ojea.garcia@gmail.com>
Instead of receiving the service name and namespace, we
can obtain them from the service object directly.
Signed-off-by: Antonio Ojea <antonio.ojea.garcia@gmail.com>
It seems that if you set the packet mark on a packet and then route
that packet through a kernel VXLAN interface, the VXLAN-encapsulated
packet will still have the mark from the original packet. Since our
NAT rules are based on the packet mark, this was causing us to
double-NAT some packets, which then triggered a kernel checksumming
bug. But even without the checksum bug, there are reasons to avoid
double-NATting, so fix the rules to unmark the packets before
masquerading them.
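A rough sketch of the resulting KUBE-POSTROUTING ordering, written in the proxier's rule-building style (writeLine, natRules, and mark are assumed names; the exact arguments are illustrative):
    // Return early for packets that do not carry the masquerade mark.
    writeLine(natRules, "-A", "KUBE-POSTROUTING",
        "-m", "mark", "!", "--mark", mark, "-j", "RETURN")
    // Clear the mark first, so it cannot survive onto a VXLAN-encapsulated
    // packet and trigger a second NAT pass.
    writeLine(natRules, "-A", "KUBE-POSTROUTING",
        "-j", "MARK", "--xor-mark", mark)
    // Only then masquerade.
    writeLine(natRules, "-A", "KUBE-POSTROUTING",
        "-m", "comment", "--comment", `"kubernetes service traffic requiring SNAT"`,
        "-j", "MASQUERADE")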
Fixes two small issues with the metric added in #90175:
1. Bump the timestamp on initial informer sync. Otherwise it remains 0 if
restarting kube-proxy in a quiescent cluster, which isn't quite right.
2. Bump the timestamp even if no healthz server is specified.
This adds a metric, kubeproxy_sync_proxy_rules_last_queued_timestamp,
that captures the last time a change was queued to be applied to the
proxy. This matches the healthz logic, which fails if a pending change
is stale.
This allows us to write alerts that mirror healthz.
Signed-off-by: Casey Callendrello <cdc@redhat.com>
This builds on previous work but only sets the sysctlConnReuse value
if the kernel is known to be above 4.19. To avoid calling GetKernelVersion
twice, I store the value from the CanUseIPVS method and then check the
version constraint at the time of the expected sysctl call.
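A sketch of the guard, assuming the kernel version string was cached by the CanUseIPVS check (cachedKernelVersion, sysctl, sysctlConnReuse, and connReuseValue are illustrative names):
    // Assumes: version "k8s.io/apimachinery/pkg/util/version"
    minKernel := version.MustParseGeneric("4.19")
    kernelVersion, err := version.ParseGeneric(cachedKernelVersion)
    if err != nil {
        klog.Errorf("unable to parse kernel version %q: %v", cachedKernelVersion, err)
    } else if kernelVersion.AtLeast(minKernel) {
        // Only on sufficiently new kernels is it safe to touch conn_reuse_mode.
        if err := sysctl.SetSysctl(sysctlConnReuse, connReuseValue); err != nil {
            klog.Errorf("failed to set %s: %v", sysctlConnReuse, err)
        }
    }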
Signed-off-by: Christopher M. Luciano <cmluciano@us.ibm.com>
The kube-proxy meta proxier implementation tries to get the IPFamily
from the endpoints, but if the endpoints don't contain an IP
address it logs a warning.
This causes services without endpoints to keep flooding the logs
with warnings.
We now log these errors at verbosity level 4 instead of as a warning.
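In other words, the change is the log level of the call, something like (message text illustrative):
    // Before: visible by default and spammy for endpoint-less services.
    // klog.Warningf("failed to determine the IP family of endpoints for service %s", svcName)
    // After: only visible when running with -v=4 or higher.
    klog.V(4).Infof("failed to determine the IP family of endpoints for service %s", svcName)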
This allows the proxier to cache local addresses instead of fetching all
local addresses every time in IsLocalIP.
Signed-off-by: Andrew Sy Kim <kiman@vmware.com>
This avoids fetching all local network interfaces every time we sync an
external IP. For clusters with many external IPs this gets really
expensive. This change caches all local addresses once per sync.
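A minimal sketch of the caching (package usage shown in comments; externalIP stands in for whichever IP is being checked):
    // Assumes: "net" and sets "k8s.io/apimachinery/pkg/util/sets"
    // Build the set once at the start of the sync loop...
    addrs, err := net.InterfaceAddrs()
    if err != nil {
        klog.Errorf("error listing local addresses: %v", err)
    }
    localAddrs := sets.NewString()
    for _, addr := range addrs {
        if ipNet, ok := addr.(*net.IPNet); ok {
            localAddrs.Insert(ipNet.IP.String())
        }
    }
    // ...then each external IP becomes a cheap set lookup instead of a fresh
    // interface enumeration.
    isLocal := localAddrs.Has(externalIP)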
Signed-off-by: Andrew Sy Kim <kiman@vmware.com>
kube-proxy, if it is configured with an IP family, filters out the
incorrect IP version of the services.
This commit fixes a bug caused by not filtering out the IPs in the
LoadBalancer Status Ingress field.
kube-proxy was not validating the clusterCIDRs correctly: if dual-stack is
enabled it MAY have one or more clusterCIDRs, and if it has two CIDRs they
must include at least one of each IP family.
It also fixes a bug where validation was not taking the global state of
the feature gates into account.
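A rough sketch of the dual-stack check (clusterCIDR and errs are assumed surrounding variables; netutils is k8s.io/utils/net):
    // Assumes: "fmt", "strings", netutils "k8s.io/utils/net"
    cidrs := strings.Split(clusterCIDR, ",")
    if len(cidrs) == 2 {
        hasV4, hasV6 := false, false
        for _, cidr := range cidrs {
            if netutils.IsIPv6CIDRString(strings.TrimSpace(cidr)) {
                hasV6 = true
            } else {
                hasV4 = true
            }
        }
        if !hasV4 || !hasV6 {
            errs = append(errs, fmt.Errorf("clusterCIDR %q must contain one CIDR of each IP family when two CIDRs are given", clusterCIDR))
        }
    }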
This creates a new EndpointSliceProxying feature gate to cover EndpointSlice
consumption (kube-proxy) and allow the existing EndpointSlice feature gate to
focus on EndpointSlice production only. Along with that addition, this enables
the EndpointSlice feature gate by default, now only affecting the controller.
The rationale here is that it's really difficult to guarantee all EndpointSlices
are created in a cluster upgrade process before kube-proxy attempts to consume
them. Although masters are generally upgraded before nodes, and in most cases,
the controller would have enough time to create EndpointSlices before a new node
with kube-proxy spun up, there are plenty of edge cases where that might not be
the case. The primary limitation on EndpointSlice creation is the API rate limit
of 20 QPS. In clusters with a lot of endpoints and/or a lot of other API
requests, it could be difficult to create all the EndpointSlices before a new
node with kube-proxy targeting EndpointSlices spun up.
Separating this into 2 feature gates allows for a more gradual rollout with the
EndpointSlice controller being enabled by default in 1.18, and EndpointSlices
for kube-proxy being enabled by default in the next release.
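Roughly, consumption and production are now gated independently:
    // Assumes: utilfeature "k8s.io/apiserver/pkg/util/feature" and
    //          "k8s.io/kubernetes/pkg/features"
    if utilfeature.DefaultFeatureGate.Enabled(features.EndpointSliceProxying) {
        // kube-proxy consumes EndpointSlices (off by default in 1.18).
    }
    if utilfeature.DefaultFeatureGate.Enabled(features.EndpointSlice) {
        // the controller produces EndpointSlices (on by default in 1.18).
    }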
Errors from staticcheck:
pkg/proxy/healthcheck/proxier_health.go:55:2: field port is unused (U1000)
pkg/proxy/healthcheck/proxier_health.go:162:20: printf-style function with dynamic format string and no further arguments should use print-style function instead (SA1006)
pkg/proxy/healthcheck/service_health.go:166:20: printf-style function with dynamic format string and no further arguments should use print-style function instead (SA1006)
pkg/proxy/iptables/proxier.go:737:2: this value of args is never used (SA4006)
pkg/proxy/iptables/proxier.go:737:15: this result of append is never used, except maybe in other appends (SA4010)
pkg/proxy/iptables/proxier.go:1287:28: this result of append is never used, except maybe in other appends (SA4010)
pkg/proxy/userspace/proxysocket.go:293:3: this value of n is never used (SA4006)
pkg/proxy/winkernel/metrics.go:74:6: func sinceInMicroseconds is unused (U1000)
pkg/proxy/winkernel/metrics.go:79:6: func sinceInSeconds is unused (U1000)
pkg/proxy/winuserspace/proxier.go:94:2: field portMapMutex is unused (U1000)
pkg/proxy/winuserspace/proxier.go:118:2: field owner is unused (U1000)
pkg/proxy/winuserspace/proxier.go:119:2: field socket is unused (U1000)
pkg/proxy/winuserspace/proxysocket.go:620:4: this value of n is never used (SA4006)
This reverts commit 1ca0ffeaf2.
kube-proxy no longer recreates the rules associated with the
KUBE-MARK-DROP chain, which is created by the kubelet.
It is preferable to avoid the dependency between the kubelet and
kube-proxy and to have each of them handle their own rules.
This includes IPv4 and IPv6 address types and IPVS dual stack support.
Importantly this ensures that EndpointSlices with a FQDN address type
are not processed by kube-proxy.
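A minimal sketch of the address-type filtering (using the discovery API types; surrounding logic omitted):
    // Assumes: discovery "k8s.io/api/discovery/v1beta1"
    switch endpointSlice.AddressType {
    case discovery.AddressTypeIPv4, discovery.AddressTypeIPv6:
        // supported: feed into the proxier's endpoint change tracker
    case discovery.AddressTypeFQDN:
        // FQDN slices carry no usable IPs for kube-proxy; skip them entirely
        return
    }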
Computing EndpointChanges is a relatively expensive operation for
kube-proxy when Endpoint Slices are used. This had been computed on
every EndpointSlice update which became quite inefficient at high levels
of scale when multiple EndpointSlice update events would be triggered
before a syncProxyRules call.
Profiling results showed that computing this on each update could
consume ~80% of total kube-proxy CPU utilization at high levels of
scale. This change reduced that to as little as 3% of total kube-proxy
utilization at high levels of scale.
It's worth noting that the difference is minimal when there is a 1:1
relationship between EndpointSlice updates and proxier syncs. This is
primarily beneficial when there are many EndpointSlice updates between
proxier sync loops.
Until now, iptables probabilities had 5 decimal places of granularity.
That meant that probabilities would start to repeat once a Service
had 319 or more endpoints.
This doubles the granularity to 10 decimal places, ensuring that
probabilities will not repeat until a Service reaches 100,223 endpoints.
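Each endpoint rule is emitted with probability 1/(number of remaining endpoints); the change is essentially the formatting width, roughly:
    // Assumes: "fmt"
    // With 5 decimal places adjacent probabilities start to collide once a
    // Service has 319+ endpoints; 10 places pushes that past 100,000 endpoints.
    func probability(remaining int) string {
        return fmt.Sprintf("%0.10f", 1.0/float64(remaining))
    }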
The proxy healthz server assumed that kube-proxy would regularly call
UpdateTimestamp() even when nothing changed, but that's no longer
true. Fix it to only report unhealthiness when updates have been
received from the apiserver but not promptly pushed out to
iptables/ipvs.
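A sketch of the intended check (illustrative, not the exact implementation):
    // Assumes: "time"
    // Unhealthy only when an update was queued, has not yet been applied, and
    // has been pending longer than the health timeout.
    func isHealthy(lastQueued, lastUpdated time.Time, timeout time.Duration) bool {
        if lastQueued.IsZero() || !lastQueued.After(lastUpdated) {
            return true // nothing pending to push out to iptables/ipvs
        }
        return time.Since(lastQueued) < timeout
    }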
Kube-proxy runs two different health servers; one for monitoring the
health of kube-proxy itself, and one for monitoring the health of
specific services. Rename them to "ProxierHealthServer" and
"ServiceHealthServer" to make this clearer, and do a bit of API
cleanup too.
The detectStaleConnections function in kube-proxy is very expensive in
terms of CPU utilization. The results of this function are only actually
used for UDP ports. This adds a protocol attribute to ServicePortName to
make it simple to only run this function for UDP connections. For
clusters with primarily TCP connections this can improve kube-proxy
performance by 2x.
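Roughly, with the protocol now carried on ServicePortName, the loop can skip non-UDP ports (variable names illustrative):
    // Assumes: v1 "k8s.io/api/core/v1"
    if svcPortName.Protocol != v1.ProtocolUDP {
        // Stale conntrack entries only matter for UDP; skip the expensive
        // stale-connection detection for TCP (and SCTP) ports.
        continue
    }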
The .IP() call that was previously used for sorting resulted in a call
to netutil to parse an IP out of an IP:Port string. This was very slow
and resulted in this sort taking up ~50% of total CPU util for
kube-proxy.
Kubelet and kube-proxy both had loops to ensure that their iptables
rules didn't get deleted, by repeatedly recreating them. But on
systems with lots of iptables rules (i.e., thousands of services), this
can be very slow (and thus might end up holding the iptables lock for
several seconds, blocking other operations, etc).
The specific threat that they need to worry about is
firewall-management commands that flush *all* dynamic iptables rules.
So add a new iptables.Monitor() function that handles this by creating
iptables-flush canaries and only triggering a full rule reload after
noticing that someone has deleted those chains.
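A rough usage sketch (the canary chain name, table list, poll interval, and resyncAllRules callback are illustrative):
    // Assumes: utiliptables "k8s.io/kubernetes/pkg/util/iptables",
    //          "time", wait "k8s.io/apimachinery/pkg/util/wait"
    go ipt.Monitor(
        utiliptables.Chain("KUBE-PROXY-CANARY"),
        []utiliptables.Table{utiliptables.TableMangle, utiliptables.TableNAT, utiliptables.TableFilter},
        resyncAllRules, // full rule reload, only invoked if the canaries disappear
        time.Hour,      // how often to check that the canaries still exist
        wait.NeverStop,
    )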
This should fix a bug that could break masters when the EndpointSlice
feature gate was enabled. This was all tied to how the apiserver creates
and manages its own services and endpoints (or in this case endpoint
slices). Consumers of endpoint slices also need to know about the
corresponding service. Previously we were trying to set an owner
reference here for this purpose, but that came with potential downsides
and increased complexity. This commit changes behavior of the apiserver
endpointslice integration to set the service name label instead of owner
references, and simplifies consumer logic to reference that (both are
set by the EndpointSlice controller).
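Consumers can now look up the owning Service via the well-known label instead of an owner reference, roughly:
    // Assumes: discovery "k8s.io/api/discovery/v1beta1"
    serviceName, ok := endpointSlice.Labels[discovery.LabelServiceName]
    if !ok || serviceName == "" {
        // Slice is not (yet) associated with a Service; ignore it.
        return
    }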
Additionally, this should fix a bug with the EndpointSlice GenerateName
value that had previously been set with a "." as a suffix.
Work around a Linux kernel bug that sometimes causes multiple flows to
get mapped to the same IP:PORT and consequently some suffer packet
drops.
Also made the same update in kubelet.
Also added cross-pointers between the two bodies of code, in comments.
Some day we should eliminate the duplicate code. But today is not
that day.
Kube-proxy will add ipset entries for all node IPs for an SCTP nodeport service. This solves the problem 'SCTP nodeport service is not working for all IPs present in the node when ipvs is enabled. It is working only for node's InternalIP.'
As mentioned in issue #80061, in the iptables lock contention case
we can see an increasing rate of iptables restore failures because it
needs to grab the iptables file lock.
The failure metric can provide administrators with more insight.
Metrics will be collected in the kube-proxy iptables and ipvs modes.
Signed-off-by: Hui Luo <luoh@vmware.com>
Kube-proxy's iptables mode used to care whether utiliptables's
EnsureRule was able to use "iptables -C" or if it had to implement it
hackily using "iptables-save". But that became irrelevant when
kube-proxy was reimplemented using "iptables-restore", and no one ever
noticed. So remove that check.
The ipvs `getProxyMode` test fails on Mac as `utilipvs.GetRequiredIPVSMods`
tries to reach `/proc/sys/kernel/osrelease` to find the version of the running
Linux kernel. The Linux kernel version is used to determine the list of
required kernel modules for ipvs.
Logic to determine kernel version is moved to GetKernelVersion
method in LinuxKernelHandler which implements ipvs.KernelHandler.
Mock KernelHandler is used in the test cases.
Reading and parsing the file is converted to a Go function instead of execing cut.
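The in-Go read is essentially the following (error message illustrative):
    import (
        "fmt"
        "io/ioutil"
        "strings"
    )

    // getKernelVersion reads /proc/sys/kernel/osrelease directly instead of
    // shelling out to cut, so it can be mocked via KernelHandler in tests.
    func getKernelVersion() (string, error) {
        data, err := ioutil.ReadFile("/proc/sys/kernel/osrelease")
        if err != nil {
            return "", fmt.Errorf("error reading osrelease: %v", err)
        }
        return strings.TrimSpace(string(data)), nil
    }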
Computing all node IPs twice would always happen when no node port
addresses were explicitly set. The GetNodeAddresses call would return
two zero CIDRs (IPv4 and IPv6) and we would then retrieve all node IPs
twice because the loop wouldn't break after the first time.
Also, it is possible for the user to set explicit node port addresses
including both a zero and a non-zero CIDR, but this wouldn't make sense
for nodeIPs since the zero CIDR would already cause nodeIPs to include
all IPs on the node.
Previously the same IP addresses would be computed for each nodePort
service, and this could be CPU intensive for a large number of nodePort
services with a large number of IP addresses on the node.
refs https://github.com/kubernetes/perf-tests/issues/640
We have too fine a bucket granularity for lower latencies, at the cost of
the higher latencies (7+ minutes). This is causing spikes in the SLI
calculated based on those metrics.
I don't have a strong opinion about the actual values - these seemed to
better match our needs. But let's have a discussion about them.
Values:
0.015 s
0.030 s
0.060 s
0.120 s
0.240 s
0.480 s
0.960 s
1.920 s
3.840 s
7.680 s
15.360 s
30.720 s
61.440 s
122.880 s
245.760 s
491.520 s
983.040 s
1966.080 s
3932.160 s
7864.320 s
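For reference, these are a doubling series starting at 15ms, i.e. they could be generated as (assuming the component-base metrics helpers):
    // Assumes: metrics "k8s.io/component-base/metrics"
    // 20 buckets doubling from 0.015s up to 7864.32s.
    buckets := metrics.ExponentialBuckets(0.015, 2, 20)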
Currently the BaseServiceInfo struct implements the ServicePort interface,
but that interface is only used some of the time. All the elements of
BaseServiceInfo are exported, and sometimes the interface is used to access
them and other times not.
I extended the ServicePort interface so that all relevant values can be
accessed through it, and unexported all the elements of BaseServiceInfo.
This adds some useful metrics around pending changes and last successful
sync time.
The goal is for administrators to be able to alert on proxies that, for
whatever reason, are quite stale.
Signed-off-by: Casey Callendrello <cdc@redhat.com>