This makes the "destination" policy model clearer. All external
destination captures now jump to the XLB chain, which is the main place
that masquerading is done (removing it from most other places).
This is simpler to trace - XLB *always* exists (as long as you have an
external exposure) and never gets bypassed.
No functional changes; mostly whitespace.
Make assertIPTablesRulesEqual() *not* sort the `expected` value - make
the test cases all be pre-sorted. This will make followup commits
cleaner.
Make the test output cleaner when the comparison fails.
Use dedent everywhere for easier reading.
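As an illustration of the intended test style, here is a minimal,
self-contained sketch (the rules shown are made up, and the inline
comparison is a stand-in for assertIPTablesRulesEqual(), not the real
helper):

    package main

    import (
        "fmt"
        "strings"

        "github.com/lithammer/dedent"
    )

    // Sketch of the pattern: the expected iptables output is written as an
    // indented raw string for readability, dedent-ed before use, and kept
    // pre-sorted so the comparison helper no longer has to sort it.
    func main() {
        expected := dedent.Dedent(`
            *filter
            :KUBE-EXTERNAL-SERVICES - [0:0]
            :KUBE-FORWARD - [0:0]
            :KUBE-NODEPORTS - [0:0]
            COMMIT
            `)

        actual := "*filter\n:KUBE-EXTERNAL-SERVICES - [0:0]\n:KUBE-FORWARD - [0:0]\n:KUBE-NODEPORTS - [0:0]\nCOMMIT\n"

        if strings.TrimSpace(expected) != strings.TrimSpace(actual) {
            fmt.Printf("expected:\n%s\ngot:\n%s\n", expected, actual)
        } else {
            fmt.Println("rules match")
        }
    }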
Fix internal and external traffic policy to be handled separately (so
that, in particular, services with Local internal traffic policy and
Cluster external traffic policy no longer behave as though they had
Local external traffic policy as well).
Additionally, traffic to an `internalTrafficPolicy: Local` service on
a node with no local endpoints is now dropped rather than rejected
(which, as in the external case, may prevent traffic from being lost
while endpoints are in flux).
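To illustrate the separation, a hedged, standalone sketch (none of
these names come from the kube-proxy code):

    package main

    import "fmt"

    // Standalone sketch: internal and external traffic are evaluated against
    // their *own* policy, so internalTrafficPolicy: Local no longer changes
    // how external traffic is handled. With a Local policy and no local
    // endpoints, the sketch "drops" instead of rejecting.
    type trafficPolicy string

    const (
        clusterPolicy trafficPolicy = "Cluster"
        localPolicy   trafficPolicy = "Local"
    )

    func selectEndpoints(p trafficPolicy, all, local []string) ([]string, string) {
        if p == clusterPolicy {
            return all, "use all endpoints"
        }
        if len(local) == 0 {
            return nil, "DROP (no local endpoints)"
        }
        return local, "use local endpoints only"
    }

    func main() {
        all := []string{"10.0.1.1:80", "10.0.2.1:80"}
        var local []string // no endpoints on this node

        // Service with internalTrafficPolicy: Local, externalTrafficPolicy: Cluster.
        _, internalAction := selectEndpoints(localPolicy, all, local)
        _, externalAction := selectEndpoints(clusterPolicy, all, local)
        fmt.Println("internal traffic:", internalAction) // DROP (no local endpoints)
        fmt.Println("external traffic:", externalAction) // use all endpoints
    }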
Now the XLB chain _only_ implements the "short-circuit local
connections to the SVC chain" rule, and the actual endpoint selection
happens in the SVL chain.
Though not quite implemented yet, this will eventually also mean that
"SVC" stands for "Service, Cluster traffic policy" while "SVL" stands
for "Service, Local traffic policy".
This commit adds the framework needed for the new local detection
modes BridgeInterface and InterfaceNamePrefix.
Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
This PR introduces two new modes for detecting
local traffic in a cluster (see the sketch after this list).
1) detectLocalByBridgeInterface: takes a bridge name as its argument;
all traffic whose originating interface is that bridge is considered
local pod traffic.
2) detectLocalByInterfaceNamePrefix: takes an interface name prefix as
its argument; all traffic whose originating interface name starts with
that prefix is considered local pod traffic.
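As a rough sketch of what these modes reduce to (the type, constructor,
and method names below are illustrative, not kube-proxy's actual API),
both boil down to an iptables match on the input interface:

    package main

    import "fmt"

    // Illustrative sketch: both new modes reduce to an iptables
    // "-i <interface>" match on the originating interface. The names below
    // are hypothetical, not kube-proxy's real types.
    type localDetector struct {
        iface string // exact bridge name, or a name prefix followed by "+"
    }

    // Traffic arriving on this bridge is considered local pod traffic.
    func detectLocalByBridgeInterface(bridge string) localDetector {
        return localDetector{iface: bridge}
    }

    // Traffic arriving on any interface whose name starts with this prefix
    // is considered local pod traffic ("+" is iptables' prefix wildcard).
    func detectLocalByInterfaceNamePrefix(prefix string) localDetector {
        return localDetector{iface: prefix + "+"}
    }

    // ifLocal returns the iptables match arguments meaning "traffic is local".
    func (d localDetector) ifLocal() []string {
        return []string{"-i", d.iface}
    }

    func main() {
        fmt.Println(detectLocalByBridgeInterface("kube-bridge").ifLocal()) // [-i kube-bridge]
        fmt.Println(detectLocalByInterfaceNamePrefix("veth").ifLocal())    // [-i veth+]
    }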
Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
Rather than lazily computing and then caching the endpoint chain name
because we don't have the right information at construct time, just
pass the right information at construct time and compute the chain
name then.
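For illustration, a hedged sketch of computing the chain name eagerly
in the constructor once the service and endpoint information are both
available (the hashing and prefix details here are assumptions for the
example, not the exact production scheme):

    package main

    import (
        "crypto/sha256"
        "encoding/base32"
        "fmt"
    )

    // Illustrative sketch: with the service port name and endpoint string
    // both available at construction time, the chain name can be computed
    // once up front instead of being lazily computed and cached later.
    func endpointChainName(servicePortName, protocol, endpoint string) string {
        hash := sha256.Sum256([]byte(servicePortName + protocol + endpoint))
        encoded := base32.StdEncoding.EncodeToString(hash[:])
        return "KUBE-SEP-" + encoded[:16]
    }

    type endpointInfo struct {
        endpoint  string
        chainName string // computed eagerly in the constructor
    }

    func newEndpointInfo(servicePortName, protocol, endpoint string) *endpointInfo {
        return &endpointInfo{
            endpoint:  endpoint,
            chainName: endpointChainName(servicePortName, protocol, endpoint),
        }
    }

    func main() {
        ep := newEndpointInfo("ns1/svc1:p80", "tcp", "10.0.1.1:80")
        fmt.Println(ep.chainName)
    }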
Now that we don't have to always append all of the iptables args into
a single array, there's no longer any reason for LocalTrafficDetector
to take in a set of args to prepend to its own output, and not much
point in having it write out the "-j CHAIN" itself either.
This change adds two options for Windows:
--forward-healthcheck-vip: If true, forward the service VIP for the
health check port.
--root-hnsendpoint-name: The name of the HNS endpoint for the root
namespace attached to the l2bridge; the default is cbr0.
When --forward-healthcheck-vip is true and the winkernel proxier is
used, kube-proxy adds an HNS load balancer that forwards health check
requests sent to lb_vip:healthcheck_port to node_ip:healthcheck_port.
Without this forwarding, health checks from the Google load balancer
fail, and it stops forwarding traffic to the Windows node.
This change fixes the following two cases for services:
- `externalTrafficPolicy: Cluster` (the default): healthcheck_port is
10256 for all services. Without this fix, traffic is never forwarded
directly to a Windows node; it always goes through a Linux node and is
forwarded to Windows from there.
- `externalTrafficPolicy: Local`: each such service has its own
healthcheck_port. Without this fix, this feature does not work on
Windows nodes at all. This feature preserves the client IP of
connections to applications running in Windows pods.
Change-Id: If4513e72900101ef70d86b91155e56a1f8c79719
For each iptables-restore call, log the number of services, endpoints,
filter chains, filter rules, NAT chains, and NAT rules in the update
at V(2), in addition to logging the actual rules at V(9).
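A minimal sketch of this kind of verbosity-gated logging with klog
(the message wording and key names are illustrative):

    package main

    import (
        "flag"

        "k8s.io/klog/v2"
    )

    // Illustrative sketch: a short per-sync summary at -v=2, and the full
    // (potentially huge) rule dump only at -v=9. The counts and message
    // wording are made up for the example.
    func logRestoreSummary(numServices, numEndpoints, natChains, natRules int, rules string) {
        klog.V(2).InfoS("Reloading service iptables data",
            "numServices", numServices,
            "numEndpoints", numEndpoints,
            "numNATChains", natChains,
            "numNATRules", natRules,
        )
        klog.V(9).InfoS("Restoring iptables", "rules", rules)
    }

    func main() {
        klog.InitFlags(nil)
        flag.Set("v", "2")
        flag.Parse()
        logRestoreSummary(3, 7, 12, 40, "*nat\n...\nCOMMIT\n")
    }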
All of the tests used a localDetector that considered the pod IP range
to be 10.0.0.0/24, but lots of the tests used pod IPs in 10.180.0.0/16
or 10.0.1.0/24, meaning the generated iptables rules were somewhat
inconsistent. Fix this by expanding the localDetector's pod IP range
to 10.0.0.0/8. (Changing the pod IPs to all be in 10.0.0.0/24 instead
would be a much larger change since it would result in the SEP chain
names changing.)
Meanwhile, the different tests were also horribly inconsistent about
what values they used for other IPs, and some of them even used the
same IPs (or ports) for different things in the same test case. Fix
these all up and create a consistent set of IP assignments:
// Pod IPs: 10.0.0.0/8
// Service ClusterIPs: 172.30.0.0/16
// Node IPs: 192.168.0.0/24
// Local Node IP: 192.168.0.2
// Service ExternalIPs: 192.168.99.0/24
// LoadBalancer IPs: 1.2.3.4, 5.6.7.8, 9.10.11.12
// Non-cluster IPs: 203.0.113.0/24
// LB Source Range: 203.0.113.0/25
Only run assertIPTablesRuleJumps() on the expected output, not on the
actual output, since if there's a problem with the actual output, we'd
rather see it as the diff from the expected output.
In large clusters, the iptables-restore input will be tens of
thousands of lines long, and logging it at V(5) essentially means that
"kube-proxy -v=5" cannot be used in such clusters to see _other_
things that get logged at V(5), because logs will get rolled over far
too quickly. So bump the full-rules logging output down to V(9).
kube-proxy sets the sysctl net.ipv4.conf.all.route_localnet=1 so that
NodePort services can be accessed on IPv4 loopback addresses, but this
may present security issues.
Leverage the --nodeport-addresses flag to opt out of this behavior: if
the list is not empty and none of the IP ranges contains an IPv4
loopback address, the sysctl is not set.
In addition, add a warning to inform users about this behavior.
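A hedged sketch of the opt-out check described above (standalone, not
the actual kube-proxy helper):

    package main

    import (
        "fmt"
        "net"
    )

    // Illustrative sketch: route_localnet only needs to be set when
    // NodePorts must be reachable on IPv4 loopback addresses. If the
    // --nodeport-addresses list is non-empty and none of its ranges
    // overlaps 127.0.0.0/8, the sysctl can be skipped.
    func needsRouteLocalnet(nodePortAddresses []string) (bool, error) {
        if len(nodePortAddresses) == 0 {
            // No restriction configured: NodePorts listen everywhere,
            // including loopback, so the sysctl is still needed.
            return true, nil
        }
        _, loopback, _ := net.ParseCIDR("127.0.0.0/8")
        for _, cidr := range nodePortAddresses {
            ip, ipNet, err := net.ParseCIDR(cidr)
            if err != nil {
                return false, err
            }
            if ip.To4() == nil {
                continue // IPv6 ranges cannot contain IPv4 loopback
            }
            // CIDR overlap check: one range must contain the other's base address.
            if loopback.Contains(ip) || ipNet.Contains(net.ParseIP("127.0.0.1")) {
                return true, nil
            }
        }
        return false, nil
    }

    func main() {
        fmt.Println(needsRouteLocalnet([]string{"192.168.0.0/24"})) // false <nil>
        fmt.Println(needsRouteLocalnet([]string{"127.0.0.1/32"}))   // true <nil>
        fmt.Println(needsRouteLocalnet(nil))                        // true <nil>
    }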