Clear conntrack entries for UDP NodePorts.
This has to be done AFTER the iptables rules are programmed.
Traffic to the NodePort can hit the host before the iptables rules
are programmed, which creates a stale conntrack entry that
blackholes the traffic, so we need to clear it ONLY when the
service has endpoints.
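The selection rule above can be sketched as follows. This is a minimal illustration, not the kube-proxy code: the `service` type and `udpNodePortsToClear` are hypothetical names, and the actual conntrack flush is left out.

```go
package main

import "fmt"

// service is a hypothetical, simplified view of a proxied Service port.
type service struct {
	Protocol  string // "TCP", "UDP", ...
	NodePort  int    // 0 means the Service has no NodePort
	Endpoints int    // number of endpoints backing the Service
}

// udpNodePortsToClear returns the NodePorts whose conntrack entries should be
// flushed. It is meant to be called only after the iptables rules have been
// written, and it skips services without endpoints.
func udpNodePortsToClear(svcs []service) []int {
	var ports []int
	for _, s := range svcs {
		if s.Protocol == "UDP" && s.NodePort != 0 && s.Endpoints > 0 {
			ports = append(ports, s.NodePort)
		}
	}
	return ports
}

func main() {
	svcs := []service{
		{Protocol: "UDP", NodePort: 30053, Endpoints: 2},
		{Protocol: "UDP", NodePort: 30054, Endpoints: 0}, // no endpoints: left alone
		{Protocol: "TCP", NodePort: 30080, Endpoints: 1}, // not UDP: left alone
	}
	fmt.Println(udpNodePortsToClear(svcs))
}
```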
1. For iptables mode, add KUBE-NODEPORTS chain in filter table. Add
rules to allow healthcheck node port traffic.
2. For ipvs mode, add KUBE-NODE-PORT chain in filter table. Add
KUBE-HEALTH-CHECK-NODE-PORT ipset to allow traffic to healthcheck
node port.
When running in ipvs mode, kube-proxy generated incorrect iptables-restore
input because the chain names were hardcoded.
This also fixes a typo in a method name.
Currently kube-proxy treats ExternalIPs differently depending on:
- the traffic origin
- whether or not the ExternalIP is present in the system.
It also depends on the CNI implementation to
discriminate between local and non-local traffic.
Since the ExternalIP belongs to a Service, we can avoid the round trip
of sending traffic that originated in the cluster outside of it.
Also, we leverage the new LocalTrafficDetector to detect the local
traffic and do not rely on the CNI implementations for this.
- Remove feature gate consideration from EndpointSlice validation
- Deprecate topology field, note that it will be removed in future
release
- Update kube-proxy to check for NodeName if feature gate is enabled
- Add comments indicating the feature gates that can be used to enable
alpha API fields
- Add comments explaining use of deprecated address type in tests
The tests for most functions have also been revised to check the errors
explicitly upon validating. This properly catches cases where we
should return multiple errors because more than one error occurs, as
well as cases where just one block is failing.
Signed-off-by: Christopher M. Luciano <cmluciano@us.ibm.com>
This commit revises validateProxyNodePortAddress and
validateExcludeCIDRS to report on the exact CIDR that is
invalid within the array of strings. Previously we would just return
the whole block of addresses; now we identify the exact address
within the block to eliminate confusion. I also removed the break from
validateProxyNodePortAddress so that we can report on all addresses that
may not be valid.
The tests for each function have also been revised to check the errors
explicitly upon validating. This also properly catches cases where we
should return multiple errors because more than one CIDR is invalid.
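The per-entry reporting described above can be sketched like this. The helper `validateCIDRs` and its error format are hypothetical and only illustrate the idea of naming the exact invalid entry and continuing past it; they are not the functions named in this commit.

```go
package main

import (
	"fmt"
	"net"
)

// validateCIDRs checks every entry and returns one error per invalid CIDR,
// naming the index and the offending value instead of the whole slice.
func validateCIDRs(field string, cidrs []string) []error {
	var errs []error
	for i, c := range cidrs {
		if _, _, err := net.ParseCIDR(c); err != nil {
			// No break here: keep going so every bad entry is reported.
			errs = append(errs, fmt.Errorf("%s[%d]: invalid CIDR %q", field, i, c))
		}
	}
	return errs
}

func main() {
	in := []string{"10.0.0.0/8", "bad", "fd00::/64", "1.2.3.4"}
	for _, err := range validateCIDRs("nodePortAddresses", in) {
		fmt.Println(err)
	}
}
```

A test can then assert on the number of errors and on each message, rather than only on whether validation failed.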
Signed-off-by: Christopher M. Luciano <cmluciano@us.ibm.com>
A previous PR (#71573) intended to clear conntrack entries on endpoint
changes when using NodePort by introducing a dedicated function to
remove the stale conntrack entry on the node port and allow traffic to
resume. In doing so, it introduced a NodePort-specific bug where the
conntrack entries related to the ClusterIP do not get cleaned if the
endpoint is changed (issue #96174). We fix this by doing ClusterIP
cleanup in all cases.
* api: structure change
* api: defaulting, conversion, and validation
* [FIX] validation: auto remove second ip/family when service changes to SingleStack
* [FIX] api: defaulting, conversion, and validation
* api-server: clusterIPs alloc, printers, storage and strategy
* [FIX] clusterIPs default on read
* alloc: auto remove second ip/family when service changes to SingleStack
* api-server: repair loop handling for clusterIPs
* api-server: force kubernetes default service into single stack
* api-server: tie dualstack feature flag with endpoint feature flag
* controller-manager: feature flag, endpoint, and endpointSlice controllers handling multi family service
* [FIX] controller-manager: feature flag, endpoint, and endpointSlice controllers handling multi family service
* kube-proxy: feature-flag, utils, proxier, and meta proxier
* [FIX] kubeproxy: call both proxier at the same time
* kubenet: remove forced pod IP sorting
* kubectl: modify describe to include ClusterIPs, IPFamilies, and IPFamilyPolicy
* e2e: fix tests that depend on IPFamily field AND add dual stack tests
* e2e: fix expected error message for ClusterIP immutability
* add integration tests for dualstack
The third phase of dual stack is a very complex change in the API;
basically, it introduces Dual Stack services. The main changes are:
- It pluralizes the Service IPFamily field to IPFamilies,
and removes the singular field.
- It introduces a new field IPFamilyPolicyType that can take
3 values to express the "dual-stack(mad)ness" of the cluster:
SingleStack, PreferDualStack and RequireDualStack
- It pluralizes ClusterIP to ClusterIPs.
The goal is to add coverage to the services API operations,
taking into account the 6 different modes a cluster can have:
- single stack: IPv4 or IPv6 (as of today)
- dual stack: IPv4 only, IPv6 only, IPv4 - IPv6, IPv6 - IPv4
* [FIX] add integration tests for dualstack
* generated data
* generated files
Co-authored-by: Antonio Ojea <aojea@redhat.com>
In #56164, we split the reject rules for services without endpoints
into the KUBE-EXTERNAL-SERVICES chain in order to avoid calling KUBE-SERVICES
from INPUT. However, in #74394 KUBE-SERVICES was re-added to INPUT.
As noted in #56164, the kernel is sensitive to the size of the INPUT chain. This
patch refrains from calling the KUBE-SERVICES chain from INPUT and FORWARD and
instead adds the lb reject rule to the KUBE-EXTERNAL-SERVICES chain, which is
called from INPUT and FORWARD.
The provided DialContext wraps existing clients' DialContext in an attempt to
preserve any existing timeout configuration. In some cases, we may replace
infinite timeouts with golang defaults.
- scaleio: tcp connect/keepalive values changed from 0/15 to 30/30
- storageos: no change
Before this fix, a Service with a loadBalancerSourceRange value that
included a space would cause kube-proxy to crashloop. This updates
kube-proxy to trim any space from that field.
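A minimal sketch of the trimming behaviour, assuming the source ranges arrive as a slice of strings. `parseSourceRanges` is a hypothetical helper, not the kube-proxy function.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// parseSourceRanges trims surrounding whitespace from each entry before
// parsing, so " 10.0.0.0/8" is accepted instead of producing a parse error.
func parseSourceRanges(ranges []string) ([]*net.IPNet, error) {
	out := make([]*net.IPNet, 0, len(ranges))
	for _, r := range ranges {
		_, ipnet, err := net.ParseCIDR(strings.TrimSpace(r))
		if err != nil {
			return nil, fmt.Errorf("invalid source range %q: %w", r, err)
		}
		out = append(out, ipnet)
	}
	return out, nil
}

func main() {
	nets, err := parseSourceRanges([]string{" 10.0.0.0/8", "192.168.1.0/24 "})
	fmt.Println(nets, err)
}
```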
Currently kube-proxy defaults the min-sync-period for
iptables to 0. However, as explained by Dan Winship,
"With minSyncPeriod: 0, you run iptables-restore 100 times.
With minSyncPeriod: 1s , you run iptables-restore once.
With minSyncPeriod: 10s , you also run iptables-restore once,
but you might have to wait 10 seconds first"
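The effect described in the quote can be reproduced with a deliberately simplified coalescing model. This is an assumption made for illustration, not the scheduler kube-proxy uses: a sync is taken to run minSyncPeriod after the first pending change and to pick up every change that arrived by then. `burst` and `countSyncs` are hypothetical names, and all times are in milliseconds.

```go
package main

import "fmt"

// burst returns n change timestamps (in ms), one every spacingMs.
func burst(n, spacingMs int) []int {
	times := make([]int, n)
	for i := range times {
		times[i] = i * spacingMs
	}
	return times
}

// countSyncs counts how many syncs run for the given sorted change times
// under the simplified model described above.
func countSyncs(changes []int, minSyncPeriodMs int) int {
	syncs := 0
	i := 0
	for i < len(changes) {
		// A sync is scheduled minSyncPeriod after the first pending change...
		runAt := changes[i] + minSyncPeriodMs
		// ...and applies every change that has arrived by then.
		for i < len(changes) && changes[i] <= runAt {
			i++
		}
		syncs++
	}
	return syncs
}

func main() {
	changes := burst(100, 10) // 100 changes spread over roughly one second
	for _, period := range []int{0, 1000, 10000} {
		fmt.Printf("minSyncPeriod=%dms syncs=%d\n", period, countSyncs(changes, period))
	}
}
```

Under this model the 1s and 10s settings both collapse the burst into a single sync; the only difference is how long the first change waits before it is applied.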
Masquerade the traffic that loops back to the originator
before it hits the kubernetes-specific postrouting rules
Signed-off-by: Antonio Ojea <antonio.ojea.garcia@gmail.com>
In dual-stack mode, kube-proxy infers the service IP family from
the ClusterIP because the ipFamily field is going to be deprecated.
Since kube-proxy skips headless and ExternalName services, we
can safely obtain the IPFamily from the ClusterIP field.
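Inferring the family from the ClusterIP can be sketched as below. `ipFamilyOf` is a hypothetical helper that returns plain strings; the real code works with the API's IP family type.

```go
package main

import (
	"fmt"
	"net"
)

// ipFamilyOf returns "IPv4" or "IPv6" for a parsable ClusterIP, and "" when
// the value is not an IP (for example "None" on a headless service).
func ipFamilyOf(clusterIP string) string {
	ip := net.ParseIP(clusterIP)
	switch {
	case ip == nil:
		return ""
	case ip.To4() != nil:
		return "IPv4"
	default:
		return "IPv6"
	}
}

func main() {
	for _, ip := range []string{"10.96.0.1", "fd00::1", "None"} {
		fmt.Printf("%q -> %q\n", ip, ipFamilyOf(ip))
	}
}
```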
Signed-off-by: Antonio Ojea <antonio.ojea.garcia@gmail.com>
Instead of receiving the service name and namespace, we
can obtain them from the service object directly.
Signed-off-by: Antonio Ojea <antonio.ojea.garcia@gmail.com>
It seems that if you set the packet mark on a packet and then route
that packet through a kernel VXLAN interface, the VXLAN-encapsulated
packet will still have the mark from the original packet. Since our
NAT rules are based on the packet mark, this was causing us to
double-NAT some packets, which then triggered a kernel checksumming
bug. But even without the checksum bug, there are reasons to avoid
double-NATting, so fix the rules to unmark the packets before
masquerading them.
Fixes two small issues with the metric added in #90175:
1. Bump the timestamp on initial informer sync. Otherwise it remains 0 if
restarting kube-proxy in a quiescent cluster, which isn't quite right.
2. Bump the timestamp even if no healthz server is specified.
This adds a metric, kubeproxy_sync_proxy_rules_last_queued_timestamp,
that captures the last time a change was queued to be applied to the
proxy. This matches the healthz logic, which fails if a pending change
is stale.
This allows us to write alerts that mirror healthz.
Signed-off-by: Casey Callendrello <cdc@redhat.com>
This builds on previous work but only sets the sysctlConnReuse value
if the kernel is known to be above 4.19. To avoid calling GetKernelVersion
twice, I store the value from the CanUseIPVS method and then check the version
constraint at time of expected sysctl call.
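The version gate can be sketched as follows. `kernelAtLeast` is a hypothetical helper written for this example, and treating 4.19 itself as supported is an assumption; the commit relies on the existing kernel version utilities rather than hand-rolled parsing.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// kernelAtLeast reports whether a kernel release string such as
// "5.4.0-42-generic" is at or above major.minor.
func kernelAtLeast(release string, major, minor int) (bool, error) {
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return false, fmt.Errorf("unexpected kernel release %q", release)
	}
	maj, err := strconv.Atoi(parts[0])
	if err != nil {
		return false, fmt.Errorf("bad major version in %q: %w", release, err)
	}
	// Keep only the leading digits of the minor part ("19-rc1" -> "19").
	digits := 0
	for digits < len(parts[1]) && parts[1][digits] >= '0' && parts[1][digits] <= '9' {
		digits++
	}
	mnr, err := strconv.Atoi(parts[1][:digits])
	if err != nil {
		return false, fmt.Errorf("bad minor version in %q: %w", release, err)
	}
	return maj > major || (maj == major && mnr >= minor), nil
}

func main() {
	// Look the release up once and reuse it at the point of the sysctl call.
	release := "5.4.0-42-generic" // example value; a real caller reads it from the host
	ok, err := kernelAtLeast(release, 4, 19)
	fmt.Println(ok, err)
}
```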
Signed-off-by: Christopher M. Luciano <cmluciano@us.ibm.com>
The kube-proxy metaproxier implementation tries to get the IPFamily
from the endpoints, but if the endpoints don't contain an IP
address it logs a Warning.
As a result, services without endpoints keep flooding the logs
with warnings.
We now log these errors at verbosity level 4 instead of as a Warning.
This allows the proxier to cache local addresses instead of fetching all
local addresses every time in IsLocalIP.
Signed-off-by: Andrew Sy Kim <kiman@vmware.com>
This avoids fetching all local network interfaces every time we sync an
external IP. For clusters with many external IPs this gets really
expensive. This change caches all local addresses once per sync.
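The caching idea can be sketched with the standard library alone. `localAddrSet` and `isLocal` are hypothetical names; the point is that the interface lookup happens once per sync and each external IP check becomes a map lookup.

```go
package main

import (
	"fmt"
	"net"
)

// localAddrSet collects the addresses of all local interfaces once, so that
// later lookups do not have to enumerate the interfaces again.
func localAddrSet() (map[string]struct{}, error) {
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		return nil, err
	}
	set := make(map[string]struct{}, len(addrs))
	for _, a := range addrs {
		if ipnet, ok := a.(*net.IPNet); ok {
			set[ipnet.IP.String()] = struct{}{}
		}
	}
	return set, nil
}

// isLocal checks an IP against the cached set.
func isLocal(set map[string]struct{}, ip string) bool {
	parsed := net.ParseIP(ip)
	if parsed == nil {
		return false
	}
	_, ok := set[parsed.String()]
	return ok
}

func main() {
	set, err := localAddrSet() // once per sync
	if err != nil {
		fmt.Println("could not list local addresses:", err)
		return
	}
	for _, ip := range []string{"127.0.0.1", "203.0.113.10"} {
		fmt.Println(ip, isLocal(set, ip)) // map lookup per external IP
	}
}
```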
Signed-off-by: Andrew Sy Kim <kiman@vmware.com>
kube-proxy, if configured with an IP family, filters out the
service IPs that belong to the other IP version.
This commit fixes a bug caused by not filtering out the IPs in the
LoadBalancer Status Ingress field.
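A minimal sketch of the filtering, assuming the ingress IPs are available as strings. `filterByFamily` is a hypothetical helper and does not mirror the kube-proxy types.

```go
package main

import (
	"fmt"
	"net"
)

// filterByFamily keeps only the IPs that match the proxier's IP family.
// Entries that do not parse as an IP are dropped.
func filterByFamily(ips []string, wantIPv6 bool) []string {
	var out []string
	for _, s := range ips {
		ip := net.ParseIP(s)
		if ip == nil {
			continue
		}
		if isIPv6 := ip.To4() == nil; isIPv6 == wantIPv6 {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	ingress := []string{"10.0.0.1", "fd00::1", "192.0.2.7"}
	fmt.Println("IPv4 proxier:", filterByFamily(ingress, false))
	fmt.Println("IPv6 proxier:", filterByFamily(ingress, true))
}
```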
kube-proxy was not validating the clusterCIDRs correctly. In
dual-stack mode it MAY have 1 or more clusterCIDRs. If it has 2 CIDRs,
there must be at least one of each IP family.
It also fixes a bug where validation was not taking into account
the global state of the feature gates.
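One reading of the rule above, sketched under stated assumptions: the value is a comma-separated string, at most two CIDRs are accepted, and two CIDRs are valid only in dual-stack mode with one per family. `validateClusterCIDRs` and its error messages are hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// validateClusterCIDRs validates a comma-separated clusterCIDR value.
// dualStack stands in for the state of the dual-stack feature gate.
func validateClusterCIDRs(value string, dualStack bool) error {
	cidrs := strings.Split(value, ",")
	if !dualStack && len(cidrs) > 1 {
		return fmt.Errorf("only one CIDR is allowed when dual-stack is disabled")
	}
	if len(cidrs) > 2 {
		return fmt.Errorf("at most two CIDRs are allowed, got %d", len(cidrs))
	}
	var v4, v6 int
	for _, c := range cidrs {
		ip, _, err := net.ParseCIDR(c)
		if err != nil {
			return fmt.Errorf("invalid CIDR %q: %w", c, err)
		}
		if ip.To4() != nil {
			v4++
		} else {
			v6++
		}
	}
	if len(cidrs) == 2 && (v4 != 1 || v6 != 1) {
		return fmt.Errorf("two CIDRs must be one IPv4 and one IPv6")
	}
	return nil
}

func main() {
	for _, v := range []string{"10.0.0.0/16,fd00::/48", "10.0.0.0/16,10.1.0.0/16"} {
		fmt.Printf("%s -> %v\n", v, validateClusterCIDRs(v, true))
	}
}
```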
This creates a new EndpointSliceProxying feature gate to cover EndpointSlice
consumption (kube-proxy) and allow the existing EndpointSlice feature gate to
focus on EndpointSlice production only. Along with that addition, this enables
the EndpointSlice feature gate by default, now only affecting the controller.
The rationale here is that it's really difficult to guarantee all EndpointSlices
are created in a cluster upgrade process before kube-proxy attempts to consume
them. Although masters are generally upgraded before nodes, and in most cases,
the controller would have enough time to create EndpointSlices before a new node
with kube-proxy spun up, there are plenty of edge cases where that might not be
the case. The primary limitation on EndpointSlice creation is the API rate limit
of 20QPS. In clusters with a lot of endpoints and/or with a lot of other API
requests, it could be difficult to create all the EndpointSlices before a new
node with kube-proxy targeting EndpointSlices spun up.
Separating this into 2 feature gates allows for a more gradual rollout with the
EndpointSlice controller being enabled by default in 1.18, and EndpointSlices
for kube-proxy being enabled by default in the next release.
Errors from staticcheck:
pkg/proxy/healthcheck/proxier_health.go:55:2: field port is unused (U1000)
pkg/proxy/healthcheck/proxier_health.go:162:20: printf-style function with dynamic format string and no further arguments should use print-style function instead (SA1006)
pkg/proxy/healthcheck/service_health.go:166:20: printf-style function with dynamic format string and no further arguments should use print-style function instead (SA1006)
pkg/proxy/iptables/proxier.go:737:2: this value of args is never used (SA4006)
pkg/proxy/iptables/proxier.go:737:15: this result of append is never used, except maybe in other appends (SA4010)
pkg/proxy/iptables/proxier.go:1287:28: this result of append is never used, except maybe in other appends (SA4010)
pkg/proxy/userspace/proxysocket.go:293:3: this value of n is never used (SA4006)
pkg/proxy/winkernel/metrics.go:74:6: func sinceInMicroseconds is unused (U1000)
pkg/proxy/winkernel/metrics.go:79:6: func sinceInSeconds is unused (U1000)
pkg/proxy/winuserspace/proxier.go:94:2: field portMapMutex is unused (U1000)
pkg/proxy/winuserspace/proxier.go:118:2: field owner is unused (U1000)
pkg/proxy/winuserspace/proxier.go:119:2: field socket is unused (U1000)
pkg/proxy/winuserspace/proxysocket.go:620:4: this value of n is never used (SA4006)
This reverts commit 1ca0ffeaf2.
kube-proxy is not recreating the rules associated with the
KUBE-MARK-DROP chain, which is created by the kubelet.
It is preferable to avoid the dependency between the kubelet and
kube-proxy and to have each of them handle its own rules.
This includes IPv4 and IPv6 address types and IPVS dual stack support.
Importantly this ensures that EndpointSlices with a FQDN address type
are not processed by kube-proxy.
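The address type check can be sketched as below. The `endpointSlice` struct is a hypothetical stand-in for the API type, and the deprecated "IP" address type mentioned earlier is deliberately not modelled here.

```go
package main

import "fmt"

// endpointSlice is a simplified stand-in for the EndpointSlice API object.
type endpointSlice struct {
	Name        string
	AddressType string // "IPv4", "IPv6" or "FQDN"
}

// shouldProcess reports whether the proxy should consume the slice.
// Only IP-based address types are accepted; FQDN slices are skipped.
func shouldProcess(s endpointSlice) bool {
	switch s.AddressType {
	case "IPv4", "IPv6":
		return true
	default:
		return false
	}
}

func main() {
	slices := []endpointSlice{
		{Name: "web-abc", AddressType: "IPv4"},
		{Name: "web-def", AddressType: "IPv6"},
		{Name: "ext-ghi", AddressType: "FQDN"},
	}
	for _, s := range slices {
		fmt.Printf("%s (%s): process=%t\n", s.Name, s.AddressType, shouldProcess(s))
	}
}
```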