内核网络参数

nf_conntrack

nf_conntrack是Linux内核连接跟踪的模块,常用在iptables中,比如

-A INPUT -m state --state RELATED,ESTABLISHED  -j RETURN
-A INPUT -m state --state INVALID -j DROP

可以通过cat /proc/net/nf_conntrack来查看当前跟踪的连接信息,这些信息以哈希形式(用链地址法处理冲突)存在内存中,并且每条记录大约占300B空间。

nf_conntrack相关的内核参数有三个:

  • nf_conntrack_max:连接跟踪表的大小,建议根据内存计算该值CONNTRACK_MAX = RAMSIZE (in bytes) / 16384 / (x / 32),并满足nf_conntrack_max=4*nf_conntrack_buckets,默认262144

  • nf_conntrack_buckets:哈希表的大小,(nf_conntrack_max/nf_conntrack_buckets就是每条哈希记录链表的长度),默认65536

  • nf_conntrack_tcp_timeout_established:tcp会话的超时时间,默认是432000 (5天)

比如,对64G内存的机器,推荐配置:

net.netfilter.nf_conntrack_max=4194304
net.netfilter.nf_conntrack_tcp_timeout_established=300
net.netfilter.nf_conntrack_buckets=1048576

bridge-nf

bridge-nf使得netfilter可以对Linux网桥上的IPv4/ARP/IPv6包过滤。比如,设置net.bridge.bridge-nf-call-iptables=1后,二层的网桥在转发包时也会被iptables的FORWARD规则所过滤,这样有时会出现L3层的iptables rules去过滤L2的帧的问题(见这里)。

常用的选项包括

  • net.bridge.bridge-nf-call-arptables:是否在arptables的FORWARD中过滤网桥的ARP包

  • net.bridge.bridge-nf-call-ip6tables:是否在ip6tables链中过滤IPv6包

  • net.bridge.bridge-nf-call-iptables:是否在iptables链中过滤IPv4包

  • net.bridge.bridge-nf-filter-vlan-tagged:是否在iptables/arptables中过滤打了vlan标签的包

当然,也可以通过/sys/devices/virtual/net/<bridge-name>/bridge/nf_call_iptables来设置,但要注意内核是取两者中大的生效。

有时,可能只希望部分网桥禁止bridge-nf,而其他网桥都开启(比如CNI网络插件中一般要求bridge-nf-call-iptables选项开启,而有时又希望禁止某个网桥的bridge-nf),这时可以改用iptables的方法:

iptables -t raw -I PREROUTING -i <bridge-name> -j NOTRACK

反向路径过滤

反向路径过滤可用于防止数据包从一接口传入,又从另一不同的接口传出(这有时被称为 “非对称路由” )。除非必要,否则最好将其关闭,因为它可防止来自子网络的用户采用 IP 地址欺骗手段,并减少 DDoS (分布式拒绝服务)攻击的机会。

通过 rp_filter 选项启用反向路径过滤,比如 sysctl -w net.ipv4.conf.default.rp_filter=INTEGER。支持三种选项:

  • 0 ——未进行源验证。

  • 1 ——处于如 RFC3704 所定义的严格模式。

  • 2 ——处于如 RFC3704 所定义的松散模式。

可以通过 net.ipv4.interface.rp_filter可实现对每一网络接口设置的覆盖。

TCP相关

ARP相关

ARP回收

  • gc_stale_time 每次检查neighbour记录的有效性的周期。当neighbour记录失效时,将在给它发送数据前再解析一次。缺省值是60秒。

  • gc_thresh1 存在于ARP高速缓存中的最少记录数,如果少于这个数,垃圾收集器将不会运行。缺省值是128。

  • gc_thresh2 存在 ARP 高速缓存中的最多的记录软限制。垃圾收集器在开始收集前,允许记录数超过这个数字 5 秒。缺省值是 512。

  • gc_thresh3 保存在 ARP 高速缓存中的最多记录的硬限制,一旦高速缓存中的数目高于此,垃圾收集器将马上运行。缺省值是1024。

比如可以增大为

net.ipv4.neigh.default.gc_thresh1=1024
net.ipv4.neigh.default.gc_thresh2=4096
net.ipv4.neigh.default.gc_thresh3=8192

ARP过滤

arp_filter - BOOLEAN

1 - Allows you to have multiple network interfaces on the same
subnet, and have the ARPs for each interface be answered
based on whether or not the kernel would route a packet from
the ARP'd IP out that interface (therefore you must use source
based routing for this to work). In other words it allows control
of which cards (usually 1) will respond to an arp request.

0 - (default) The kernel can respond to arp requests with addresses
from other interfaces. This may seem wrong but it usually makes
sense, because it increases the chance of successful communication.
IP addresses are owned by the complete host on Linux, not by
particular interfaces. Only for more complex setups like load-
balancing, does this behaviour cause problems.

arp_filter for the interface will be enabled if at least one of
conf/{all,interface}/arp_filter is set to TRUE,
it will be disabled otherwise

arp_announce - INTEGER

Define different restriction levels for announcing the local
source IP address from IP packets in ARP requests sent on
interface:
0 - (default) Use any local address, configured on any interface
1 - Try to avoid local addresses that are not in the target's
subnet for this interface. This mode is useful when target
hosts reachable via this interface require the source IP
address in ARP requests to be part of their logical network
configured on the receiving interface. When we generate the
request we will check all our subnets that include the
target IP and will preserve the source address if it is from
such subnet. If there is no such subnet we select source
address according to the rules for level 2.
2 - Always use the best local address for this target.
In this mode we ignore the source address in the IP packet
and try to select local address that we prefer for talks with
the target host. Such local address is selected by looking
for primary IP addresses on all our subnets on the outgoing
interface that include the target IP address. If no suitable
local address is found we select the first local address
we have on the outgoing interface or on all other interfaces,
with the hope we will receive reply for our request and
even sometimes no matter the source IP address we announce.

The max value from conf/{all,interface}/arp_announce is used.

Increasing the restriction level gives more chance for
receiving answer from the resolved target while decreasing
the level announces more valid sender's information.

arp_ignore - INTEGER

Define different modes for sending replies in response to
received ARP requests that resolve local target IP addresses:
0 - (default): reply for any local target IP address, configured
on any interface
1 - reply only if the target IP address is local address
configured on the incoming interface
2 - reply only if the target IP address is local address
configured on the incoming interface and both with the
sender's IP address are part from same subnet on this interface
3 - do not reply for local addresses configured with scope host,
only resolutions for global and link addresses are replied
4-7 - reserved
8 - do not reply for all local addresses

The max value from conf/{all,interface}/arp_ignore is used
when ARP request is received on the {interface}

arp_notify - BOOLEAN

Define mode for notification of address and device changes.
0 - (default): do nothing
1 - Generate gratuitous arp requests when device is brought up
    or hardware address changes.

arp_accept - BOOLEAN

Define behavior for gratuitous ARP frames who's IP is not
already present in the ARP table:
0 - don't create new entries in the ARP table
1 - create new entries in the ARP table

Both replies and requests type gratuitous arp will trigger the
ARP table to be updated, if this setting is on.

If the ARP table already contains the IP address of the
gratuitous arp frame, the arp table will be updated regardless
if this setting is on or off.

参考文档

最后更新于