Clear conntrack entries for UDP NodePorts,
this has to be done AFTER the iptables rules are programmed.
It can happen that traffic to the NodePort hits the host before
the iptables rules are programmed this will create an stale entry
in conntrack that will blackhole the traffic, so we need to
clear it ONLY when the service has endpoints.
1. For iptables mode, add KUBE-NODEPORTS chain in filter table. Add
rules to allow healthcheck node port traffic.
2. For ipvs mode, add KUBE-NODE-PORT chain in filter table. Add
KUBE-HEALTH-CHECK-NODE-PORT ipset to allow traffic to healthcheck
node port.
Currently kube-proxy treat ExternalIPs differently depending on:
- the traffic origin
- if the ExternalIP is present or not in the system.
It also depends on the CNI implementation to
discriminate between local and non-local traffic.
Since the ExternalIP belongs to a Service, we can avoid the roundtrip
of sending outside the traffic originated in the cluster.
Also, we leverage the new LocalTrafficDetector to detect the local
traffic and not rely on the CNI implementations for this.
A previous PR (#71573) intended to clear conntrack entry on endpoint
changes when using nodeport by introducing a dedicated function to
remove the stale conntrack entry on the node port and allow traffic to
resume. By doing so, it has introduced a nodeport specific bug where the
conntrack entries related to the ClusterIP does not get clean if
endpoint is changed (issue #96174). We fix by doing ClusterIP cleanup in
all cases.
* api: structure change
* api: defaulting, conversion, and validation
* [FIX] validation: auto remove second ip/family when service changes to SingleStack
* [FIX] api: defaulting, conversion, and validation
* api-server: clusterIPs alloc, printers, storage and strategy
* [FIX] clusterIPs default on read
* alloc: auto remove second ip/family when service changes to SingleStack
* api-server: repair loop handling for clusterIPs
* api-server: force kubernetes default service into single stack
* api-server: tie dualstack feature flag with endpoint feature flag
* controller-manager: feature flag, endpoint, and endpointSlice controllers handling multi family service
* [FIX] controller-manager: feature flag, endpoint, and endpointSlicecontrollers handling multi family service
* kube-proxy: feature-flag, utils, proxier, and meta proxier
* [FIX] kubeproxy: call both proxier at the same time
* kubenet: remove forced pod IP sorting
* kubectl: modify describe to include ClusterIPs, IPFamilies, and IPFamilyPolicy
* e2e: fix tests that depends on IPFamily field AND add dual stack tests
* e2e: fix expected error message for ClusterIP immutability
* add integration tests for dualstack
the third phase of dual stack is a very complex change in the API,
basically it introduces Dual Stack services. Main changes are:
- It pluralizes the Service IPFamily field to IPFamilies,
and removes the singular field.
- It introduces a new field IPFamilyPolicyType that can take
3 values to express the "dual-stack(mad)ness" of the cluster:
SingleStack, PreferDualStack and RequireDualStack
- It pluralizes ClusterIP to ClusterIPs.
The goal is to add coverage to the services API operations,
taking into account the 6 different modes a cluster can have:
- single stack: IP4 or IPv6 (as of today)
- dual stack: IPv4 only, IPv6 only, IPv4 - IPv6, IPv6 - IPv4
* [FIX] add integration tests for dualstack
* generated data
* generated files
Co-authored-by: Antonio Ojea <aojea@redhat.com>
In #56164, we had split the reject rules for non-ep existing services
into KUBE-EXTERNAL-SERVICES chain in order to avoid calling KUBE-SERVICES
from INPUT. However in #74394 KUBE-SERVICES was re-added into INPUT.
As noted in #56164, kernel is sensitive to the size of INPUT chain. This
patch refrains from calling the KUBE-SERVICES chain from INPUT and FORWARD,
instead adds the lb reject rule to the KUBE-EXTERNAL-SERVICES chain which will be
called from INPUT and FORWARD.
Before this fix, a Service with a loadBalancerSourceRange value that
included a space would cause kube-proxy to crashloop. This updates
kube-proxy to trim any space from that field.
It seems that if you set the packet mark on a packet and then route
that packet through a kernel VXLAN interface, the VXLAN-encapsulated
packet will still have the mark from the original packet. Since our
NAT rules are based on the packet mark, this was causing us to
double-NAT some packets, which then triggered a kernel checksumming
bug. But even without the checksum bug, there are reasons to avoid
double-NATting, so fix the rules to unmark the packets before
masquerading them.
Fixes two small issues with the metric added in #90175:
1. Bump the timestamp on initial informer sync. Otherwise it remains 0 if
restarting kube-proxy in a quiescent cluster, which isn't quite right.
2. Bump the timestamp even if no healthz server is specified.
This adds a metric, kubeproxy_sync_proxy_rules_last_queued_timestamp,
that captures the last time a change was queued to be applied to the
proxy. This matches the healthz logic, which fails if a pending change
is stale.
This allows us to write alerts that mirror healthz.
Signed-off-by: Casey Callendrello <cdc@redhat.com>
This allows the proxier to cache local addresses instead of fetching all
local addresses every time in IsLocalIP.
Signed-off-by: Andrew Sy Kim <kiman@vmware.com>
This avoids fetching all local network interfaces everytime we sync an
external IP. For clusters with many external IPs this gets really
expensive. This change caches all local addresses once per sync.
Signed-off-by: Andrew Sy Kim <kiman@vmware.com>
This creates a new EndpointSliceProxying feature gate to cover EndpointSlice
consumption (kube-proxy) and allow the existing EndpointSlice feature gate to
focus on EndpointSlice production only. Along with that addition, this enables
the EndpointSlice feature gate by default, now only affecting the controller.
The rationale here is that it's really difficult to guarantee all EndpointSlices
are created in a cluster upgrade process before kube-proxy attempts to consume
them. Although masters are generally upgraded before nodes, and in most cases,
the controller would have enough time to create EndpointSlices before a new node
with kube-proxy spun up, there are plenty of edge cases where that might not be
the case. The primary limitation on EndpointSlice creation is the API rate limit
of 20QPS. In clusters with a lot of endpoints and/or with a lot of other API
requests, it could be difficult to create all the EndpointSlices before a new
node with kube-proxy targeting EndpointSlices spun up.
Separating this into 2 feature gates allows for a more gradual rollout with the
EndpointSlice controller being enabled by default in 1.18, and EndpointSlices
for kube-proxy being enabled by default in the next release.
Errors from staticcheck:
pkg/proxy/healthcheck/proxier_health.go:55:2: field port is unused (U1000)
pkg/proxy/healthcheck/proxier_health.go:162:20: printf-style function with dynamic format string and no further arguments should use print-style function instead (SA1006)
pkg/proxy/healthcheck/service_health.go:166:20: printf-style function with dynamic format string and no further arguments should use print-style function instead (SA1006)
pkg/proxy/iptables/proxier.go:737:2: this value of args is never used (SA4006)
pkg/proxy/iptables/proxier.go:737:15: this result of append is never used, except maybe in other appends (SA4010)
pkg/proxy/iptables/proxier.go:1287:28: this result of append is never used, except maybe in other appends (SA4010)
pkg/proxy/userspace/proxysocket.go:293:3: this value of n is never used (SA4006)
pkg/proxy/winkernel/metrics.go:74:6: func sinceInMicroseconds is unused (U1000)
pkg/proxy/winkernel/metrics.go:79:6: func sinceInSeconds is unused (U1000)
pkg/proxy/winuserspace/proxier.go:94:2: field portMapMutex is unused (U1000)
pkg/proxy/winuserspace/proxier.go:118:2: field owner is unused (U1000)
pkg/proxy/winuserspace/proxier.go:119:2: field socket is unused (U1000)
pkg/proxy/winuserspace/proxysocket.go:620:4: this value of n is never used (SA4006)
This reverts commit 1ca0ffeaf2.
kube-proxy is not recreating the rules associated to the
KUBE-MARK-DROP chain, that is created by the kubelet.
Is preferrable avoid the dependency between the kubelet and
kube-proxy and that each of them handle their own rules.
Until now, iptables probabilities had 5 decimal places of granularity.
That meant that probabilities would start to repeat once a Service
had 319 or more endpoints.
This doubles the granularity to 10 decimal places, ensuring that
probabilities will not repeat until a Service reaches 100,223 endpoints.
The proxy healthz server assumed that kube-proxy would regularly call
UpdateTimestamp() even when nothing changed, but that's no longer
true. Fix it to only report unhealthiness when updates have been
received from the apiserver but not promptly pushed out to
iptables/ipvs.
Kube-proxy runs two different health servers; one for monitoring the
health of kube-proxy itself, and one for monitoring the health of
specific services. Rename them to "ProxierHealthServer" and
"ServiceHealthServer" to make this clearer, and do a bit of API
cleanup too.
Kubelet and kube-proxy both had loops to ensure that their iptables
rules didn't get deleted, by repeatedly recreating them. But on
systems with lots of iptables rules (ie, thousands of services), this
can be very slow (and thus might end up holding the iptables lock for
several seconds, blocking other operations, etc).
The specific threat that they need to worry about is
firewall-management commands that flush *all* dynamic iptables rules.
So add a new iptables.Monitor() function that handles this by creating
iptables-flush canaries and only triggering a full rule reload after
noticing that someone has deleted those chains.
This should fix a bug that could break masters when the EndpointSlice
feature gate was enabled. This was all tied to how the apiserver creates
and manages it's own services and endpoints (or in this case endpoint
slices). Consumers of endpoint slices also need to know about the
corresponding service. Previously we were trying to set an owner
reference here for this purpose, but that came with potential downsides
and increased complexity. This commit changes behavior of the apiserver
endpointslice integration to set the service name label instead of owner
references, and simplifies consumer logic to reference that (both are
set by the EndpointSlice controller).
Additionally, this should fix a bug with the EndpointSlice GenerateName
value that had previously been set with a "." as a suffix.
Work around Linux kernel bug that sometimes causes multiple flows to
get mapped to the same IP:PORT and consequently some suffer packet
drops.
Also made the same update in kubelet.
Also added cross-pointers between the two bodies of code, in comments.
Some day we should eliminate the duplicate code. But today is not
that day.
As mentioned in issue #80061, in iptables lock contention case,
we can see increasing rate of iptables restore failures because it
need to grab iptables file lock.
The failure metric can provide administrators more insight
Metrics will be collected in kube-proxy iptables and ipvs modes
Signed-off-by: Hui Luo <luoh@vmware.com>
Kube-proxy's iptables mode used to care whether utiliptables's
EnsureRule was able to use "iptables -C" or if it had to implement it
hackily using "iptables-save". But that became irrelevant when
kube-proxy was reimplemented using "iptables-restore", and no one ever
noticed. So remove that check.
Currently the BaseServiceInfo struct implements the ServicePort interface, but
only uses that interface sometimes. All the elements of BaseServiceInfo are exported
and sometimes the interface is used to access them and othertimes not
I extended the ServicePort interface so that all relevent values can be accessed through
it and unexported all the elements of BaseServiceInfo
This adds some useful metrics around pending changes and last successful
sync time.
The goal is for administrators to be able to alert on proxies that, for
whatever reason, are quite stale.
Signed-off-by: Casey Callendrello <cdc@redhat.com>
As part of the endpoint creation process when going from 0 -> 1 conntrack entries
are cleared. This is to prevent an existing conntrack entry from preventing traffic
to the service. Currently the system ignores the existance of the services external IP
addresses, which exposes that errant behavior
This adds the externalIP addresses of udp services to the list of conntrack entries that
get cleared. Allowing traffic to flow
Signed-off-by: Jacob Tanenbaum <jtanenba@redhat.com>
We REJECT every other case. Close this FIXME.
To get this to work in all cases, we have to process service in
filter.INPUT, since LB IPS might be manged as local addresses.
When using NodePort to connect to an endpoint using UDP, if the endpoint is deleted on
restoration of the endpoint traffic does not flow. This happens because conntrack holds
the state of the connection and the proxy does not correctly clear the conntrack entry
for the stale endpoint.
Introduced a new function to conntrack ClearEntriesForPortNAT that uses the endpointIP
and NodePort to remove the stale conntrack entry and allow traffic to resume when
the endpoint is restored.
Signed-off-by: Jacob Tanenbaum <jtanenba@redhat.com>
- Move from the old github.com/golang/glog to k8s.io/klog
- klog as explicit InitFlags() so we add them as necessary
- we update the other repositories that we vendor that made a similar
change from glog to klog
* github.com/kubernetes/repo-infra
* k8s.io/gengo/
* k8s.io/kube-openapi/
* github.com/google/cadvisor
- Entirely remove all references to glog
- Fix some tests by explicit InitFlags in their init() methods
Change-Id: I92db545ff36fcec83afe98f550c9e630098b3135
This will allow for kube-proxy to be run without `privileged` and
with only adding the capability `NET_ADMIN`.
Signed-off-by: Jess Frazelle <acidburn@microsoft.com>
The requested Service Protocol is checked against the supported protocols of GCE Internal LB. The supported protocols are TCP and UDP.
SCTP is not supported by OpenStack LBaaS. If SCTP is requested in a Service with type=LoadBalancer, the request is rejected. Comment style is also corrected.
SCTP is not allowed for LoadBalancer Service and for HostPort. Kube-proxy can be configured not to start listening on the host port for SCTP: see the new SCTPUserSpaceNode parameter
changed the vendor github.com/nokia/sctp to github.com/ishidawataru/sctp. I.e. from now on we use the upstream version.
netexec.go compilation fixed. Various test cases fixed
SCTP related conformance tests removed. Netexec's pod definition and Dockerfile are updated to expose the new SCTP port(8082)
SCTP related e2e test cases are removed as the e2e test systems do not support SCTP
sctp related firewall config is removed from cluster/gce/util.sh. Variable name sctp_addr is corrected to sctpAddr in pkg/proxy/ipvs/proxier.go
cluster/gce/util.sh is copied from master