Linux TCP GSO and TSO Implementation

—— lvyilong316

(Note: kernel version: Linux 2.6.32)


Concepts

TSO (TCP Segmentation Offload):
a technique that offloads the segmentation of large packets to the NIC, reducing CPU load.
Its essence is deferred segmentation.

GSO (Generic Segmentation Offload):
GSO controls whether the protocol stack defers segmentation. Right before handing data to the NIC, the stack checks whether the NIC supports TSO; if it does, the NIC performs the segmentation, otherwise the stack segments the data itself and then passes it to the driver.
If TSO is enabled, GSO is enabled automatically.

The combinations of GSO and TSO behave as follows:

- GSO on, TSO on: the stack defers segmentation and hands the large packet directly to the NIC, which segments it in hardware.

- GSO on, TSO off: the stack defers segmentation and only performs it right before handing the data to the NIC.

- GSO off, TSO on: same as GSO on, TSO on.

- GSO off, TSO off: segmentation is not deferred; tcp_sendmsg sends MSS-sized packets directly.

Enabling GSO/TSO

The driver enables GSO by default when it registers a network device: NETIF_F_GSO.

The driver sets TSO according to whether the NIC hardware supports it: NETIF_F_TSO.

GSO/TSO can be switched on and off with ethtool -K.

#define NETIF_F_SOFT_FEATURES           (NETIF_F_GSO | NETIF_F_GRO)

int register_netdevice(struct net_device *dev)
{
        ...
        /* Transfer changeable features to wanted_features and enable
         * software offloads (GSO and GRO).
         */
        dev->hw_features |= NETIF_F_SOFT_FEATURES;
        dev->features |= NETIF_F_SOFT_FEATURES;         // GRO/GSO enabled by default
        dev->wanted_features = dev->features & dev->hw_features;
        ...
}

static int ixgbe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
{
        ...
        netdev->features = NETIF_F_SG |
                           NETIF_F_TSO |
                           NETIF_F_TSO6 |
                           NETIF_F_RXHASH |
                           NETIF_F_RXCSUM |
                           NETIF_F_HW_CSUM;
        register_netdev(netdev);
        ...
}
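
In addition to the ethtool -K command line, GSO/TSO can be queried and toggled programmatically through the SIOCETHTOOL ioctl. Below is a minimal user-space sketch, not part of the kernel code above; the interface name "eth0" is an assumption, and TSO uses the analogous ETHTOOL_GTSO/ETHTOOL_STSO commands.

/* Minimal user-space sketch: query and toggle GSO via the SIOCETHTOOL ioctl.
 * Assumes an interface named "eth0"; equivalent to "ethtool -k/-K eth0 gso on". */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
        struct ifreq ifr;
        struct ethtool_value eval;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* any socket works as an ioctl handle */

        if (fd < 0)
                return 1;
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
        ifr.ifr_data = (char *)&eval;

        eval.cmd = ETHTOOL_GGSO;                   /* read current GSO state */
        if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
                printf("gso: %s\n", eval.data ? "on" : "off");

        eval.cmd = ETHTOOL_SGSO;                   /* turn GSO on (needs CAP_NET_ADMIN) */
        eval.data = 1;
        if (ioctl(fd, SIOCETHTOOL, &ifr) != 0)
                perror("ETHTOOL_SGSO");

        close(fd);
        return 0;
}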


Whether to defer segmentation

From the above we know that whether GSO/TSO is enabled is recorded in dev->features. The device is tied to a route, so once the route lookup is done the configuration can be cached in the sock.

For example, both tcp_v4_connect and tcp_v4_syn_recv_sock call sk_setup_caps to set up the GSO/TSO configuration.

Note that as long as GSO is enabled, NETIF_F_TSO is set in the sock's route caps even if the hardware does not support TSO, so sk_can_gso(sk) returns true whenever either GSO or TSO is enabled.

sk_setup_caps

#define NETIF_F_GSO_SOFTWARE            (NETIF_F_TSO | NETIF_F_TSO_ECN | NETIF_F_TSO6)

#define NETIF_F_TSO                     (SKB_GSO_TCPV4 << NETIF_F_GSO_SHIFT)

void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
{
        __sk_dst_set(sk, dst);
        sk->sk_route_caps = dst->dev->features;
        if (sk->sk_route_caps & NETIF_F_GSO)    /* GSO is enabled by default */
                sk->sk_route_caps |= NETIF_F_GSO_SOFTWARE;  /* turn on TSO */
        if (sk_can_gso(sk)) {   /* true for TCP */
                if (dst->header_len) {
                        sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
                } else {
                        sk->sk_route_caps |= NETIF_F_SG | NETIF_F_HW_CSUM;
                        sk->sk_gso_max_size = dst->dev->gso_max_size;  /* GSO_MAX_SIZE = 65536 */
                }
        }
}

As shown above, if the device has GSO enabled, the sock always turns on the TSO flag. Note that this is unrelated to whether the hardware actually enables TSO; hardware TSO depends on the hardware's own capabilities. Next, look at the logic of sk_can_gso.

sk_can_gso

static inline int sk_can_gso(const struct sock *sk)
{
        /* For TCP, sk->sk_gso_type = SKB_GSO_TCPV4 is set in tcp_v4_connect */
        return net_gso_ok(sk->sk_route_caps, sk->sk_gso_type);
}

net_gso_ok

static inline int net_gso_ok(int features, int gso_type)
{
        int feature = gso_type << NETIF_F_GSO_SHIFT;
        return (features & feature) == feature;
}

Since, for TCP, sk->sk_route_caps has the bit corresponding to SKB_GSO_TCPV4 (i.e. NETIF_F_TSO) set in sk_setup_caps, the whole sk_can_gso check evaluates to true.
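
To make the bit arithmetic concrete, here is a small stand-alone sketch that replays the net_gso_ok check in user space. The constant values are assumptions copied from the 2.6.32 headers (NETIF_F_GSO = 0x800, NETIF_F_GSO_SHIFT = 16, SKB_GSO_TCPV4 = 1), and only the TCPv4 bit of NETIF_F_GSO_SOFTWARE is modeled.

/* Stand-alone sketch of the net_gso_ok() check. Constant values are taken
 * from the 2.6.32 headers and hard-coded here for illustration only. */
#include <stdio.h>

#define NETIF_F_GSO             0x800           /* software GSO, set by register_netdevice */
#define NETIF_F_GSO_SHIFT       16
#define SKB_GSO_TCPV4           1
#define NETIF_F_TSO             (SKB_GSO_TCPV4 << NETIF_F_GSO_SHIFT)   /* 0x10000 */
#define NETIF_F_GSO_SOFTWARE    NETIF_F_TSO     /* TSO_ECN/TSO6 omitted for brevity */

static int net_gso_ok(int features, int gso_type)
{
        int feature = gso_type << NETIF_F_GSO_SHIFT;
        return (features & feature) == feature;
}

int main(void)
{
        int sk_route_caps = NETIF_F_GSO;        /* copied from dev->features; no hardware TSO bit */

        printf("before sk_setup_caps: %d\n", net_gso_ok(sk_route_caps, SKB_GSO_TCPV4)); /* 0 */

        if (sk_route_caps & NETIF_F_GSO)        /* what sk_setup_caps does */
                sk_route_caps |= NETIF_F_GSO_SOFTWARE;

        printf("after  sk_setup_caps: %d\n", net_gso_ok(sk_route_caps, SKB_GSO_TCPV4)); /* 1 */
        return 0;
}

The check prints 0 before the NETIF_F_GSO_SOFTWARE bits are OR'ed in and 1 afterwards, which is exactly why sk_can_gso(sk) is true even when the hardware itself has no TSO support.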


GSO packet length

Only for urgent data, or when neither GSO nor TSO is enabled, is sending not deferred; in that case the current MSS is used directly.
With GSO enabled, tcp_send_mss returns the real MSS and the GSO size of a single skb, which is an integer multiple of the MSS.

tcp_send_mss

static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
{
        int mss_now;

        mss_now = tcp_current_mss(sk);  /* determine the current mss from ip options, SACKs and pmtu */
        *size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));

        return mss_now;
}

 

tcp_xmit_size_goal

 1 static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now, int large_allowed)
 2 {
 3     struct tcp_sock *tp = tcp_sk(sk);
 4     u32 xmit_size_goal, old_size_goal;
 5 
 6     xmit_size_goal = mss_now;
 7     /*这里large_allowed表示是否是紧急数据*/
 8     if (large_allowed && sk_can_gso(sk)) {  /*如果不是紧急数据且支持GSO*/
 9         xmit_size_goal = ((sk->sk_gso_max_size - 1) -
10                   inet_csk(sk)->icsk_af_ops->net_header_len -
11                   inet_csk(sk)->icsk_ext_hdr_len -
12                   tp->tcp_header_len);/*xmit_size_goal为gso最大分段大小减去tcp和ip头部长度*/
13 
14         xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);/*最多达到收到的最大rwnd窗口通告的一半*/
15 
16         /* We try hard to avoid divides here */
17         old_size_goal = tp->xmit_size_goal_segs * mss_now;
18 
19         if (likely(old_size_goal <= xmit_size_goal &&
20                old_size_goal + mss_now > xmit_size_goal)) {
21             xmit_size_goal = old_size_goal; /*使用老的xmit_size*/
22         } else {
23             tp->xmit_size_goal_segs = xmit_size_goal / mss_now;
24             xmit_size_goal = tp->xmit_size_goal_segs * mss_now; /*使用新的xmit_size*/
25         }
26     }
27 
28     return max(xmit_size_goal, mss_now);
29 }
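
As a concrete illustration of the rounding above (the numbers are assumptions: MSS 1448, gso_max_size 65536, a 20-byte IPv4 header, a 32-byte TCP header with timestamps, and a receive window large enough that tcp_bound_to_half_wnd does not clamp the goal), the sketch below paraphrases the calculation; it is not kernel code.

/* Paraphrase of the size_goal rounding in tcp_xmit_size_goal() with assumed
 * numbers: the goal starts from gso_max_size minus headers and is then
 * rounded down to a whole number of MSS-sized segments. */
#include <stdio.h>

int main(void)
{
        unsigned int gso_max_size   = 65536; /* sk->sk_gso_max_size (GSO_MAX_SIZE) */
        unsigned int net_header_len = 20;    /* IPv4 header, no options (assumed) */
        unsigned int ext_hdr_len    = 0;
        unsigned int tcp_header_len = 32;    /* TCP header + timestamp option (assumed) */
        unsigned int mss_now        = 1448;

        unsigned int goal = (gso_max_size - 1) - net_header_len
                            - ext_hdr_len - tcp_header_len;      /* 65483 */
        /* tcp_bound_to_half_wnd() would clamp this to half of the largest
         * advertised receive window; assume the window is large enough here. */
        unsigned int segs = goal / mss_now;                       /* 45 */
        unsigned int size_goal = segs * mss_now;                  /* 65160 */

        printf("size_goal = %u bytes = %u x mss\n", size_goal, segs);
        return 0;
}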

tcp_sendmsg

After the application calls send(), tcp_sendmsg tries to accumulate up to size_goal bytes of data in the same skb, and then tcp_push hands these packets to tcp_write_xmit to be sent.

  1 int tcp_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg, size_t size)
  2 {
  3     struct sock *sk = sock->sk;
  4     struct iovec *iov;
  5     struct tcp_sock *tp = tcp_sk(sk);
  6     struct sk_buff *skb;
  7     int iovlen, flags;
  8     int mss_now, size_goal;
  9     int err, copied;
 10     long timeo;
 11 
 12     lock_sock(sk);
 13     TCP_CHECK_TIMER(sk);
 14 
 15     flags = msg->msg_flags;
 16     timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
 17 
 18     /* Wait for a connection to finish. */
 19     if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT))
 20         if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
 21             goto out_err;
 22 
 23     /* This should be in poll */
 24     clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
 25     /* size_goal表示GSO支持的大小,为mss的整数倍,不支持GSO时则和mss相等 */
 26     mss_now = tcp_send_mss(sk, &size_goal, flags);/*返回值mss_now为真实mss*/
 27 
 28     /* Ok commence sending. */
 29     iovlen = msg->msg_iovlen;
 30     iov = msg->msg_iov;
 31     copied = 0;
 32 
 33     err = -EPIPE;
 34     if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
 35         goto out_err;
 36 
 37     while (--iovlen >= 0) {
 38         size_t seglen = iov->iov_len;
 39         unsigned char __user *from = iov->iov_base;
 40 
 41         iov++;
 42 
 43         while (seglen > 0) {
 44             int copy = 0;
 45             int max = size_goal; /*每个skb中填充的数据长度初始化为size_goal*/
 46             /* 从sk->sk_write_queue中取出队尾的skb,因为这个skb可能还没有被填满 */
 47             skb = tcp_write_queue_tail(sk);
 48             if (tcp_send_head(sk)) { /*如果之前还有未发送的数据*/
 49                 if (skb->ip_summed == CHECKSUM_NONE)  /*比如路由变更,之前的不支持TSO,现在的支持了*/
 50                     max = mss_now; /*上一个不支持GSO的skb,继续不支持*/
 51                 copy = max - skb->len; /*copy为每次想skb中拷贝的数据长度*/
 52             }
 53            /*copy<=0表示不能合并到之前skb做GSO*/
 54             if (copy <= 0) {
 55 new_segment:
 56                 /* Allocate new segment. If the interface is SG,
 57                  * allocate skb fitting to single page.
 58                  */
 59                  /* 内存不足,需要等待 */
 60                 if (!sk_stream_memory_free(sk))
 61                     goto wait_for_sndbuf;
 62                 /* 分配新的skb */
 63                 skb = sk_stream_alloc_skb(sk, select_size(sk),
 64                         sk->sk_allocation);
 65                 if (!skb)
 66                     goto wait_for_memory;
 67 
 68                 /*
 69                  * Check whether we can use HW checksum.
 70                  */
 71                 /*如果硬件支持checksum,则将skb->ip_summed设置为CHECKSUM_PARTIAL,表示由硬件计算校验和*/
 72                 if (sk->sk_route_caps & NETIF_F_ALL_CSUM)
 73                     skb->ip_summed = CHECKSUM_PARTIAL;
 74                 /*将skb加入sk->sk_write_queue队尾, 同时去掉skb的TCP_NAGLE_PUSH标记*/
 75                 skb_entail(sk, skb);
 76                 copy = size_goal;  /*这里将每次copy的大小设置为size_goal,即GSO支持的大小*/
 77                 max = size_goal;
 78             }
 79 
 80             /* Try to append data to the end of skb. */
 81             if (copy > seglen)
 82                 copy = seglen;
 83 
 84             /* Where to copy to? */
 85             if (skb_tailroom(skb) > 0) { /*如果skb的线性区还有空间,则先填充skb的线性区*/
 86                 /* We have some space in skb head. Superb! */
 87                 if (copy > skb_tailroom(skb))
 88                     copy = skb_tailroom(skb);
 89                 if ((err = skb_add_data(skb, from, copy)) != 0) /*copy用户态数据到skb线性区*/
 90                     goto do_fault;
 91             } else {  /*否则尝试向SG的frags中拷贝*/
 92                 int merge = 0;
 93                 int i = skb_shinfo(skb)->nr_frags;
 94                 struct page *page = TCP_PAGE(sk);
 95                 int off = TCP_OFF(sk);
 96 
 97                 if (skb_can_coalesce(skb, i, page, off) &&
 98                     off != PAGE_SIZE) {/*pfrag->page和frags[i-1]是否使用相同页,并且page_offset相同*/
 99                     /* We can extend the last page
100                      * fragment. */
101                     merge = 1; /*说明和之前frags中是同一个page,需要merge*/
102                 } else if (i == MAX_SKB_FRAGS ||
103                        (!i && !(sk->sk_route_caps & NETIF_F_SG))) {
104                     /* Need to add new fragment and cannot
105                      * do this because interface is non-SG,
106                      * or because all the page slots are
107                      * busy. */
108                      /*如果设备不支持SG,或者非线性区frags已经达到最大,则创建新的skb分段*/
109                     tcp_mark_push(tp, skb); /*标记push flag*/
110                     goto new_segment;
111                 } else if (page) {
112                     if (off == PAGE_SIZE) {
113                         put_page(page); /*增加page引用计数*/
114                         TCP_PAGE(sk) = page = NULL;
115                         off = 0;
116                     }
117                 } else
118                     off = 0;
119 
120                 if (copy > PAGE_SIZE - off)
121                     copy = PAGE_SIZE - off;
122 
123                 if (!sk_wmem_schedule(sk, copy))
124                     goto wait_for_memory;
125 
126                 if (!page) {
127                     /* Allocate new cache page. */
128                     if (!(page = sk_stream_alloc_page(sk)))
129                         goto wait_for_memory;
130                 }
131 
132                 /* Time to copy data. We are close to
133                  * the end! */
134                 err = skb_copy_to_page(sk, from, skb, page, off, copy); /*拷贝数据到page中*/
135                 if (err) {
136                     /* If this page was new, give it to the
137                      * socket so it does not get leaked.
138                      */
139                     if (!TCP_PAGE(sk)) {
140                         TCP_PAGE(sk) = page;
141                         TCP_OFF(sk) = 0;
142                     }
143                     goto do_error;
144                 }
145 
146                 /* Update the skb. */
147                 if (merge) { /*pfrag和frags[i - 1]是相同的*/
148                     skb_shinfo(skb)->frags[i - 1].size += copy;
149                 } else {
150                     skb_fill_page_desc(skb, i, page, off, copy);
151                     if (TCP_PAGE(sk)) {
152                         get_page(page);
153                     } else if (off + copy < PAGE_SIZE) {
154                         get_page(page);
155                         TCP_PAGE(sk) = page;
156                     }
157                 }
158 
159                 TCP_OFF(sk) = off + copy;
160             }
161 
162             if (!copied)
163                 TCP_SKB_CB(skb)->flags &= ~TCPCB_FLAG_PSH;
164 
165             tp->write_seq += copy;
166             TCP_SKB_CB(skb)->end_seq += copy;
167             skb_shinfo(skb)->gso_segs = 0; /*清零tso分段数,让tcp_write_xmit去计算*/
168 
169             from += copy;
170             copied += copy;
171             if ((seglen -= copy) == 0 && iovlen == 0)
172                 goto out;
173             /* 还有数据没copy,并且没有达到最大可拷贝的大小(注意这里max之前被赋值为size_goal,即GSO支持的大小), 尝试往该skb继续添加数据*/
174             if (skb->len < max || (flags & MSG_OOB))
175                 continue;
176             /*下面的逻辑就是:还有数据没copy,但是当前skb已经满了,所以可以发送了(但不是一定要发送)*/
177             if (forced_push(tp)) { /*超过最大窗口的一半没有设置push了*/
178                 tcp_mark_push(tp, skb); /*设置push标记,更新pushed_seq*/
179                 __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH); /*调用tcp_write_xmit马上发送*/
180             } else if (skb == tcp_send_head(sk)) /*第一个包,直接发送*/
181                 tcp_push_one(sk, mss_now);
182             continue; /*说明发送队列前面还有skb等待发送,且距离之前push的包还不是非常久*/
183 
184 wait_for_sndbuf:
185             set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
186 wait_for_memory:
187             if (copied)/*先把copied的发出去再等内存*/
188                 tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
189             /*阻塞等待内存*/
190             if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
191                 goto do_error;
192 
193             mss_now = tcp_send_mss(sk, &size_goal, flags);
194         }
195     }
196 
197 out:
198     if (copied) /*所有数据都放到发送队列中了,调用tcp_push发送*/
199         tcp_push(sk, flags, mss_now, tp->nonagle);
200     TCP_CHECK_TIMER(sk);
201     release_sock(sk);
202     return copied;
203 
204 do_fault:
205     if (!skb->len) {
206         tcp_unlink_write_queue(skb, sk);
207         /* It is the one place in all of TCP, except connection
208          * reset, where we can be unlinking the send_head.
209          */
210         tcp_check_send_head(sk, skb);
211         sk_wmem_free_skb(sk, skb);
212     }
213 
214 do_error:
215     if (copied)
216         goto out;
217 out_err:
218     err = sk_stream_error(sk, flags, err);
219     TCP_CHECK_TIMER(sk);
220     release_sock(sk);
221     return err;
222 }

   
In the end tcp_push is called to send the skbs, and tcp_push in turn calls tcp_write_xmit. tcp_sendmsg has already packed the data into skbs according to the maximum GSO size, and tcp_write_xmit finally sends these GSO packets. tcp_write_xmit checks the current congestion window, the Nagle test and the TSO deferral test to decide whether all, some, or none of the skbs can be sent; if only part of an skb can be sent, tso_fragment is called to split it. The data is finally transmitted via tcp_transmit_skb. If the send window is not the limiting factor, an skb can carry up to the maximum GSO size of data.

tcp_write_xmit

 1 static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 2               int push_one, gfp_t gfp)
 3 {
 4     struct tcp_sock *tp = tcp_sk(sk);
 5     struct sk_buff *skb;
 6     unsigned int tso_segs, sent_pkts;
 7     int cwnd_quota;
 8     int result;
 9 
10     sent_pkts = 0;
11 
12     if (!push_one) {
13         /* Do MTU probing. */
14         result = tcp_mtu_probe(sk);
15         if (!result) {
16             return 0;
17         } else if (result > 0) {
18             sent_pkts = 1;
19         }
20     }
21     /*遍历发送队列*/
22     while ((skb = tcp_send_head(sk))) {
23         unsigned int limit;
24 
25         tso_segs = tcp_init_tso_segs(sk, skb, mss_now); /*skb->len/mss,重新设置tcp_gso_segs,因为在tcp_sendmsg中被清零了*/
26         BUG_ON(!tso_segs);
27 
28         cwnd_quota = tcp_cwnd_test(tp, skb);
29         if (!cwnd_quota)
30             break;
31 
32         if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now)))
33             break;
34 
35         if (tso_segs == 1) {  /*tso_segs=1表示无需tso分段*/
36             /* 根据nagle算法,计算是否需要推迟发送数据 */
37             if (unlikely(!tcp_nagle_test(tp, skb, mss_now,
38                              (tcp_skb_is_last(sk, skb) ? /*last skb就直接发送*/
39                               nonagle : TCP_NAGLE_PUSH))))
40                 break;
41         } else {/*有多个tso分段*/
42             if (!push_one /*push所有skb*/
43                 && tcp_tso_should_defer(sk, skb))/*/如果发送窗口剩余不多,并且预计下一个ack将很快到来(意味着可用窗口会增加),则推迟发送*/
44                 break;
45         }
46         /*下面的逻辑是:不用推迟发送,马上发送的情况*/
47         limit = mss_now;
48 /*由于tso_segs被设置为skb->len/mss_now,所以开启gso时一定大于1*/
49         if (tso_segs > 1 && !tcp_urg_mode(tp)) /*tso分段大于1且非urg模式*/
50             limit = tcp_mss_split_point(sk, skb, mss_now, cwnd_quota);/*返回当前skb中可以发送的数据大小,通过mss和cwnd*/
51         /* 当skb的长度大于限制时,需要调用tso_fragment分片,如果分段失败则暂不发送 */
52         if (skb->len > limit &&
53             unlikely(tso_fragment(sk, skb, limit, mss_now))) /*/按limit切割成多个skb*/
54             break;
55 
56         TCP_SKB_CB(skb)->when = tcp_time_stamp;
57         /*发送,如果包被qdisc丢了,则退出循环,不继续发送了*/
58         if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
59             break;
60 
61         /* Advance the send_head.  This one is sent out.
62          * This call will increment packets_out.
63          */
64          /*更新sk_send_head和packets_out*/
65         tcp_event_new_data_sent(sk, skb);
66 
67         tcp_minshall_update(tp, mss_now, skb);
68         sent_pkts++;
69 
70         if (push_one)
71             break;
72     }
73 
74     if (likely(sent_pkts)) {
75         tcp_cwnd_validate(sk);
76         return 0;
77     }
78     return !tp->packets_out && tcp_send_head(sk);
79 }

   
tcp_init_tso_segs, which fills in the skb's GSO information, is analyzed below. We can see that tcp_write_xmit calls tso_fragment to do the "TCP segmentation", and the condition for splitting is skb->len > limit. The key is the value of limit: when tso_segs > 1, i.e. when GSO is enabled, limit is obtained from tcp_mss_split_point and is roughly min(skb->len, window), the largest amount the send window allows; when GSO is not enabled, limit is simply the current MSS.
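
As a rough model (a deliberate simplification of tcp_mss_split_point that ignores the mid-queue fast path and urgent mode; the numbers are made up), limit is essentially the smaller of the congestion-window budget and the remaining send window, rounded down to a whole number of MSS:

/* Simplified model of how tcp_write_xmit derives 'limit' when GSO is on:
 * roughly min(cwnd budget, remaining send window, skb->len), rounded down
 * to a multiple of mss. Numbers are made up for illustration. */
#include <stdio.h>

static unsigned int min_u32(unsigned int a, unsigned int b)
{
        return a < b ? a : b;
}

int main(void)
{
        unsigned int mss_now    = 1448;
        unsigned int skb_len    = 65160;            /* one size_goal-sized skb */
        unsigned int cwnd_quota = 10;               /* segments allowed by the congestion window */
        unsigned int wnd_space  = 8000;             /* bytes left in the peer's receive window */

        unsigned int limit = min_u32(cwnd_quota * mss_now,          /* 14480 */
                                     min_u32(wnd_space, skb_len));  /* 8000  */
        limit -= limit % mss_now;                                   /* 7240 = 5 * mss */

        if (skb_len > limit)
                printf("tso_fragment: split off %u bytes, %u bytes stay queued\n",
                       limit, skb_len - limit);
        return 0;
}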

tcp_init_tso_segs

 1 static int tcp_init_tso_segs(struct sock *sk, struct sk_buff *skb,
 2                  unsigned int mss_now)
 3 {
 4     int tso_segs = tcp_skb_pcount(skb); /*skb_shinfo(skb)->gso_seg之前被初始化为0*/
 5 
 6     if (!tso_segs || (tso_segs > 1 && tcp_skb_mss(skb) != mss_now)) {
 7         tcp_set_skb_tso_segs(sk, skb, mss_now);
 8         tso_segs = tcp_skb_pcount(skb);
 9     }
10     return tso_segs;
11 }
12 
13 static void tcp_set_skb_tso_segs(struct sock *sk, struct sk_buff *skb,
14                  unsigned int mss_now)
15 {
16     /* Make sure we own this skb before messing gso_size/gso_segs */
17     WARN_ON_ONCE(skb_cloned(skb));
18 
19     if (skb->len <= mss_now || !sk_can_gso(sk) ||
20         skb->ip_summed == CHECKSUM_NONE) {/*不支持gso的情况*/
21         /* Avoid the costly divide in the normal
22          * non-TSO case.
23          */
24         skb_shinfo(skb)->gso_segs = 1;
25         skb_shinfo(skb)->gso_size = 0;
26         skb_shinfo(skb)->gso_type = 0;
27     } else {
28         skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss_now); /*被设置为skb->len/mss_now*/
29         skb_shinfo(skb)->gso_size = mss_now;   /*注意mss_now为真实的mss,这里保存以供gso分段使用*/
30         skb_shinfo(skb)->gso_type = sk->sk_gso_type;
31     }
32 }
33     
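
For a sense of scale (values assumed): with skb->len = 65160 and mss_now = 1448, tcp_set_skb_tso_segs stores gso_size = 1448 and gso_segs = DIV_ROUND_UP(65160, 1448) = 45, which the GSO code or the NIC later uses to cut the skb back into MSS-sized segments. A one-line sketch of the rounding:

/* DIV_ROUND_UP as used for gso_segs; values assumed for illustration */
#include <stdio.h>

#define DIV_ROUND_UP(n, d)  (((n) + (d) - 1) / (d))

int main(void)
{
        unsigned int skb_len = 65160, mss_now = 1448;
        printf("gso_segs = %u, gso_size = %u\n",
               DIV_ROUND_UP(skb_len, mss_now), mss_now);   /* 45, 1448 */
        return 0;
}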

 

tcp_write_xmit finally calls ip_queue_xmit to send the skb, and the packet enters the IP layer.


IP fragmentation, TCP segmentation, GSO, and TSO

The logic from this point on is the GSO path analyzed in an earlier article. The figure below summarizes how IP fragmentation, TCP segmentation, GSO, and TSO relate to each other across the protocol stack.

(Figure: IP fragmentation, TCP segmentation, GSO and TSO in the send path)

GSO的数码包长度

对火急数据包或GSO/TSO都不打开的景色,才不会延迟发送, 暗中同意使用当前MSS
开启GSO后,tcp_send_mss重回mss和单个skb的GSO大小,为mss的平头倍。

tcp_send_mss

 1 static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
 2 
 3 {
 4 
 5        int mss_now;
 6 
 7  
 8 
 9        mss_now = tcp_current_mss(sk);/*通过ip option,SACKs及pmtu确定当前的mss*/
10 
11        *size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
12 
13  
14 
15        return mss_now;
16 
17 }

 

tcp_xmit_size_goal

 1 static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now, int large_allowed)
 2 {
 3     struct tcp_sock *tp = tcp_sk(sk);
 4     u32 xmit_size_goal, old_size_goal;
 5 
 6     xmit_size_goal = mss_now;
 7     /*这里large_allowed表示是否是紧急数据*/
 8     if (large_allowed && sk_can_gso(sk)) {  /*如果不是紧急数据且支持GSO*/
 9         xmit_size_goal = ((sk->sk_gso_max_size - 1) -
10                   inet_csk(sk)->icsk_af_ops->net_header_len -
11                   inet_csk(sk)->icsk_ext_hdr_len -
12                   tp->tcp_header_len);/*xmit_size_goal为gso最大分段大小减去tcp和ip头部长度*/
13 
14         xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);/*最多达到收到的最大rwnd窗口通告的一半*/
15 
16         /* We try hard to avoid divides here */
17         old_size_goal = tp->xmit_size_goal_segs * mss_now;
18 
19         if (likely(old_size_goal <= xmit_size_goal &&
20                old_size_goal + mss_now > xmit_size_goal)) {
21             xmit_size_goal = old_size_goal; /*使用老的xmit_size*/
22         } else {
23             tp->xmit_size_goal_segs = xmit_size_goal / mss_now;
24             xmit_size_goal = tp->xmit_size_goal_segs * mss_now; /*使用新的xmit_size*/
25         }
26     }
27 
28     return max(xmit_size_goal, mss_now);
29 }

tcp_sendmsg

应用程序send()数据后,会在tcp_sendmsg中品尝在同1个skb,保存size_goal大小的数额,然后再经过tcp_push把那些包通过tcp_write_xmit发出去

  1 int tcp_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg, size_t size)
  2 {
  3     struct sock *sk = sock->sk;
  4     struct iovec *iov;
  5     struct tcp_sock *tp = tcp_sk(sk);
  6     struct sk_buff *skb;
  7     int iovlen, flags;
  8     int mss_now, size_goal;
  9     int err, copied;
 10     long timeo;
 11 
 12     lock_sock(sk);
 13     TCP_CHECK_TIMER(sk);
 14 
 15     flags = msg->msg_flags;
 16     timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
 17 
 18     /* Wait for a connection to finish. */
 19     if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT))
 20         if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
 21             goto out_err;
 22 
 23     /* This should be in poll */
 24     clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
 25     /* size_goal表示GSO支持的大小,为mss的整数倍,不支持GSO时则和mss相等 */
 26     mss_now = tcp_send_mss(sk, &size_goal, flags);/*返回值mss_now为真实mss*/
 27 
 28     /* Ok commence sending. */
 29     iovlen = msg->msg_iovlen;
 30     iov = msg->msg_iov;
 31     copied = 0;
 32 
 33     err = -EPIPE;
 34     if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
 35         goto out_err;
 36 
 37     while (--iovlen >= 0) {
 38         size_t seglen = iov->iov_len;
 39         unsigned char __user *from = iov->iov_base;
 40 
 41         iov++;
 42 
 43         while (seglen > 0) {
 44             int copy = 0;
 45             int max = size_goal; /*每个skb中填充的数据长度初始化为size_goal*/
 46             /* 从sk->sk_write_queue中取出队尾的skb,因为这个skb可能还没有被填满 */
 47             skb = tcp_write_queue_tail(sk);
 48             if (tcp_send_head(sk)) { /*如果之前还有未发送的数据*/
 49                 if (skb->ip_summed == CHECKSUM_NONE)  /*比如路由变更,之前的不支持TSO,现在的支持了*/
 50                     max = mss_now; /*上一个不支持GSO的skb,继续不支持*/
 51                 copy = max - skb->len; /*copy为每次想skb中拷贝的数据长度*/
 52             }
 53            /*copy<=0表示不能合并到之前skb做GSO*/
 54             if (copy <= 0) {
 55 new_segment:
 56                 /* Allocate new segment. If the interface is SG,
 57                  * allocate skb fitting to single page.
 58                  */
 59                  /* 内存不足,需要等待 */
 60                 if (!sk_stream_memory_free(sk))
 61                     goto wait_for_sndbuf;
 62                 /* 分配新的skb */
 63                 skb = sk_stream_alloc_skb(sk, select_size(sk),
 64                         sk->sk_allocation);
 65                 if (!skb)
 66                     goto wait_for_memory;
 67 
 68                 /*
 69                  * Check whether we can use HW checksum.
 70                  */
 71                 /*如果硬件支持checksum,则将skb->ip_summed设置为CHECKSUM_PARTIAL,表示由硬件计算校验和*/
 72                 if (sk->sk_route_caps & NETIF_F_ALL_CSUM)
 73                     skb->ip_summed = CHECKSUM_PARTIAL;
 74                 /*将skb加入sk->sk_write_queue队尾, 同时去掉skb的TCP_NAGLE_PUSH标记*/
 75                 skb_entail(sk, skb);
 76                 copy = size_goal;  /*这里将每次copy的大小设置为size_goal,即GSO支持的大小*/
 77                 max = size_goal;
 78             }
 79 
 80             /* Try to append data to the end of skb. */
 81             if (copy > seglen)
 82                 copy = seglen;
 83 
 84             /* Where to copy to? */
 85             if (skb_tailroom(skb) > 0) { /*如果skb的线性区还有空间,则先填充skb的线性区*/
 86                 /* We have some space in skb head. Superb! */
 87                 if (copy > skb_tailroom(skb))
 88                     copy = skb_tailroom(skb);
 89                 if ((err = skb_add_data(skb, from, copy)) != 0) /*copy用户态数据到skb线性区*/
 90                     goto do_fault;
 91             } else {  /*否则尝试向SG的frags中拷贝*/
 92                 int merge = 0;
 93                 int i = skb_shinfo(skb)->nr_frags;
 94                 struct page *page = TCP_PAGE(sk);
 95                 int off = TCP_OFF(sk);
 96 
 97                 if (skb_can_coalesce(skb, i, page, off) &&
 98                     off != PAGE_SIZE) {/*pfrag->page和frags[i-1]是否使用相同页,并且page_offset相同*/
 99                     /* We can extend the last page
100                      * fragment. */
101                     merge = 1; /*说明和之前frags中是同一个page,需要merge*/
102                 } else if (i == MAX_SKB_FRAGS ||
103                        (!i && !(sk->sk_route_caps & NETIF_F_SG))) {
104                     /* Need to add new fragment and cannot
105                      * do this because interface is non-SG,
106                      * or because all the page slots are
107                      * busy. */
108                      /*如果设备不支持SG,或者非线性区frags已经达到最大,则创建新的skb分段*/
109                     tcp_mark_push(tp, skb); /*标记push flag*/
110                     goto new_segment;
111                 } else if (page) {
112                     if (off == PAGE_SIZE) {
113                         put_page(page); /*增加page引用计数*/
114                         TCP_PAGE(sk) = page = NULL;
115                         off = 0;
116                     }
117                 } else
118                     off = 0;
119 
120                 if (copy > PAGE_SIZE - off)
121                     copy = PAGE_SIZE - off;
122 
123                 if (!sk_wmem_schedule(sk, copy))
124                     goto wait_for_memory;
125 
126                 if (!page) {
127                     /* Allocate new cache page. */
128                     if (!(page = sk_stream_alloc_page(sk)))
129                         goto wait_for_memory;
130                 }
131 
132                 /* Time to copy data. We are close to
133                  * the end! */
134                 err = skb_copy_to_page(sk, from, skb, page, off, copy); /*拷贝数据到page中*/
135                 if (err) {
136                     /* If this page was new, give it to the
137                      * socket so it does not get leaked.
138                      */
139                     if (!TCP_PAGE(sk)) {
140                         TCP_PAGE(sk) = page;
141                         TCP_OFF(sk) = 0;
142                     }
143                     goto do_error;
144                 }
145 
146                 /* Update the skb. */
147                 if (merge) { /*pfrag和frags[i - 1]是相同的*/
148                     skb_shinfo(skb)->frags[i - 1].size += copy;
149                 } else {
150                     skb_fill_page_desc(skb, i, page, off, copy);
151                     if (TCP_PAGE(sk)) {
152                         get_page(page);
153                     } else if (off + copy < PAGE_SIZE) {
154                         get_page(page);
155                         TCP_PAGE(sk) = page;
156                     }
157                 }
158 
159                 TCP_OFF(sk) = off + copy;
160             }
161 
162             if (!copied)
163                 TCP_SKB_CB(skb)->flags &= ~TCPCB_FLAG_PSH;
164 
165             tp->write_seq += copy;
166             TCP_SKB_CB(skb)->end_seq += copy;
167             skb_shinfo(skb)->gso_segs = 0; /*清零tso分段数,让tcp_write_xmit去计算*/
168 
169             from += copy;
170             copied += copy;
171             if ((seglen -= copy) == 0 && iovlen == 0)
172                 goto out;
173             /* 还有数据没copy,并且没有达到最大可拷贝的大小(注意这里max之前被赋值为size_goal,即GSO支持的大小), 尝试往该skb继续添加数据*/
174             if (skb->len < max || (flags & MSG_OOB))
175                 continue;
176             /*下面的逻辑就是:还有数据没copy,但是当前skb已经满了,所以可以发送了(但不是一定要发送)*/
177             if (forced_push(tp)) { /*超过最大窗口的一半没有设置push了*/
178                 tcp_mark_push(tp, skb); /*设置push标记,更新pushed_seq*/
179                 __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH); /*调用tcp_write_xmit马上发送*/
180             } else if (skb == tcp_send_head(sk)) /*第一个包,直接发送*/
181                 tcp_push_one(sk, mss_now);
182             continue; /*说明发送队列前面还有skb等待发送,且距离之前push的包还不是非常久*/
183 
184 wait_for_sndbuf:
185             set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
186 wait_for_memory:
187             if (copied)/*先把copied的发出去再等内存*/
188                 tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
189             /*阻塞等待内存*/
190             if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
191                 goto do_error;
192 
193             mss_now = tcp_send_mss(sk, &size_goal, flags);
194         }
195     }
196 
197 out:
198     if (copied) /*所有数据都放到发送队列中了,调用tcp_push发送*/
199         tcp_push(sk, flags, mss_now, tp->nonagle);
200     TCP_CHECK_TIMER(sk);
201     release_sock(sk);
202     return copied;
203 
204 do_fault:
205     if (!skb->len) {
206         tcp_unlink_write_queue(skb, sk);
207         /* It is the one place in all of TCP, except connection
208          * reset, where we can be unlinking the send_head.
209          */
210         tcp_check_send_head(sk, skb);
211         sk_wmem_free_skb(sk, skb);
212     }
213 
214 do_error:
215     if (copied)
216         goto out;
217 out_err:
218     err = sk_stream_error(sk, flags, err);
219     TCP_CHECK_TIMER(sk);
220     release_sock(sk);
221     return err;
222 }

   
 最终会调用tcp_push发送skb,而tcp_push又会调用tcp_write_xmit。tcp_sendmsg已经把多少根据GSO最大的size,放到3个个的skb中,
最后调用tcp_write_xmit发送那些GSO包。tcp_write_xmit会检查当前的围堵窗口,还有nagle测试,tsq检查来控制是还是不是能发送全数可能部分的skb,
假设只好发送一部分,则须要调用tso_fragment做切分。最终通过tcp_transmit_skb发送,
即便发送窗口没有落成限制,skb中存放的多寡将完结GSO最大值。

tcp_write_xmit

 1 static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 2               int push_one, gfp_t gfp)
 3 {
 4     struct tcp_sock *tp = tcp_sk(sk);
 5     struct sk_buff *skb;
 6     unsigned int tso_segs, sent_pkts;
 7     int cwnd_quota;
 8     int result;
 9 
10     sent_pkts = 0;
11 
12     if (!push_one) {
13         /* Do MTU probing. */
14         result = tcp_mtu_probe(sk);
15         if (!result) {
16             return 0;
17         } else if (result > 0) {
18             sent_pkts = 1;
19         }
20     }
21     /*遍历发送队列*/
22     while ((skb = tcp_send_head(sk))) {
23         unsigned int limit;
24 
25         tso_segs = tcp_init_tso_segs(sk, skb, mss_now); /*skb->len/mss,重新设置tcp_gso_segs,因为在tcp_sendmsg中被清零了*/
26         BUG_ON(!tso_segs);
27 
28         cwnd_quota = tcp_cwnd_test(tp, skb);
29         if (!cwnd_quota)
30             break;
31 
32         if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now)))
33             break;
34 
35         if (tso_segs == 1) {  /*tso_segs=1表示无需tso分段*/
36             /* 根据nagle算法,计算是否需要推迟发送数据 */
37             if (unlikely(!tcp_nagle_test(tp, skb, mss_now,
38                              (tcp_skb_is_last(sk, skb) ? /*last skb就直接发送*/
39                               nonagle : TCP_NAGLE_PUSH))))
40                 break;
41         } else {/*有多个tso分段*/
42             if (!push_one /*push所有skb*/
43                 && tcp_tso_should_defer(sk, skb))/*/如果发送窗口剩余不多,并且预计下一个ack将很快到来(意味着可用窗口会增加),则推迟发送*/
44                 break;
45         }
46         /*下面的逻辑是:不用推迟发送,马上发送的情况*/
47         limit = mss_now;
48 /*由于tso_segs被设置为skb->len/mss_now,所以开启gso时一定大于1*/
49         if (tso_segs > 1 && !tcp_urg_mode(tp)) /*tso分段大于1且非urg模式*/
50             limit = tcp_mss_split_point(sk, skb, mss_now, cwnd_quota);/*返回当前skb中可以发送的数据大小,通过mss和cwnd*/
51         /* 当skb的长度大于限制时,需要调用tso_fragment分片,如果分段失败则暂不发送 */
52         if (skb->len > limit &&
53             unlikely(tso_fragment(sk, skb, limit, mss_now))) /*/按limit切割成多个skb*/
54             break;
55 
56         TCP_SKB_CB(skb)->when = tcp_time_stamp;
57         /*发送,如果包被qdisc丢了,则退出循环,不继续发送了*/
58         if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
59             break;
60 
61         /* Advance the send_head.  This one is sent out.
62          * This call will increment packets_out.
63          */
64          /*更新sk_send_head和packets_out*/
65         tcp_event_new_data_sent(sk, skb);
66 
67         tcp_minshall_update(tp, mss_now, skb);
68         sent_pkts++;
69 
70         if (push_one)
71             break;
72     }
73 
74     if (likely(sent_pkts)) {
75         tcp_cwnd_validate(sk);
76         return 0;
77     }
78     return !tp->packets_out && tcp_send_head(sk);
79 }

   
Here tcp_init_tso_segs sets up the skb's GSO information (analyzed below). Note that tcp_write_xmit calls tso_fragment to do the "TCP segmentation", and the condition for splitting is skb->len > limit. The key is therefore the value of limit: when tso_segs > 1, i.e. when GSO is enabled, limit comes from tcp_mss_split_point and is essentially min(skb->len, window), the most the send window allows; when GSO is not enabled, limit is simply the current MSS.
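The sketch below is a simplified, user-space model of that limit computation. It only illustrates the min(skb->len, window, cwnd) idea; it is not the kernel's tcp_mss_split_point (which also aligns the result to a multiple of mss_now), and every number in main() is an assumed example value:

    #include <stdio.h>

    /* Simplified model of how tcp_write_xmit derives "limit": with GSO the cap
     * is whatever the send window and the congestion window allow; without GSO
     * it is just the MSS. Illustrative sketch only, not the kernel function. */
    static unsigned int split_limit(unsigned int skb_len, unsigned int mss,
                                    unsigned int send_wnd_left,
                                    unsigned int cwnd_quota, int gso_enabled)
    {
        unsigned int limit = mss;

        if (gso_enabled) {
            unsigned int cwnd_bytes = cwnd_quota * mss;

            limit = skb_len;
            if (limit > send_wnd_left)
                limit = send_wnd_left;
            if (limit > cwnd_bytes)
                limit = cwnd_bytes;
        }
        return limit;
    }

    int main(void)
    {
        /* Assumed example values. */
        unsigned int skb_len = 40000, mss = 1448;
        unsigned int wnd = 30000, cwnd_quota = 10;
        unsigned int limit = split_limit(skb_len, mss, wnd, cwnd_quota, 1);

        printf("limit = %u bytes, tso_fragment needed: %s\n",
               limit, skb_len > limit ? "yes" : "no");
        return 0;
    }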

tcp_init_tso_segs

 1 static int tcp_init_tso_segs(struct sock *sk, struct sk_buff *skb,
 2                  unsigned int mss_now)
 3 {
 4     int tso_segs = tcp_skb_pcount(skb); /*skb_shinfo(skb)->gso_seg之前被初始化为0*/
 5 
 6     if (!tso_segs || (tso_segs > 1 && tcp_skb_mss(skb) != mss_now)) {
 7         tcp_set_skb_tso_segs(sk, skb, mss_now);
 8         tso_segs = tcp_skb_pcount(skb);
 9     }
10     return tso_segs;
11 }
12 
13 static void tcp_set_skb_tso_segs(struct sock *sk, struct sk_buff *skb,
14                  unsigned int mss_now)
15 {
16     /* Make sure we own this skb before messing gso_size/gso_segs */
17     WARN_ON_ONCE(skb_cloned(skb));
18 
19     if (skb->len <= mss_now || !sk_can_gso(sk) ||
20         skb->ip_summed == CHECKSUM_NONE) {/*不支持gso的情况*/
21         /* Avoid the costly divide in the normal
22          * non-TSO case.
23          */
24         skb_shinfo(skb)->gso_segs = 1;
25         skb_shinfo(skb)->gso_size = 0;
26         skb_shinfo(skb)->gso_type = 0;
27     } else {
28         skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss_now); /*被设置为skb->len/mss_now*/
29         skb_shinfo(skb)->gso_size = mss_now;   /*注意mss_now为真实的mss,这里保存以供gso分段使用*/
30         skb_shinfo(skb)->gso_type = sk->sk_gso_type;
31     }
32 }
33     
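As a quick sanity check on the bookkeeping above, this small user-space sketch reproduces what tcp_set_skb_tso_segs records in skb_shared_info for one skb; the 20000-byte payload and the 1448-byte MSS are assumed example values:

    #include <stdio.h>

    /* Same rounding-up division as the kernel's DIV_ROUND_UP macro. */
    #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

    int main(void)
    {
        unsigned int skb_len = 20000;   /* assumed payload queued in one skb */
        unsigned int mss_now = 1448;    /* assumed current MSS */

        if (skb_len <= mss_now) {
            /* non-GSO case: a single segment, gso_size stays 0 */
            printf("gso_segs = 1, gso_size = 0\n");
        } else {
            /* GSO case: record how many MSS-sized segments this skb stands for */
            printf("gso_segs = %u, gso_size = %u\n",
                   DIV_ROUND_UP(skb_len, mss_now), mss_now);
        }
        return 0;
    }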

 

tcp_write_xmit eventually calls ip_queue_xmit to hand the skb to the IP layer.

IP fragmentation, TCP segmentation, GSO and TSO

From here on, the logic is the GSO path analyzed in an earlier article. The figure below shows how IP fragmentation, TCP segmentation, GSO and TSO relate to one another across the protocol stack.

(Figure: where IP fragmentation, TCP segmentation, GSO and TSO take effect in the transmit path)
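Whether that final split is performed by the NIC (TSO) or by the GSO software layer just before the driver depends on the device's feature bits. As a practical aside, these can be inspected from user space through the ethtool ioctl; the sketch below is a minimal example (the default device name eth0 is an assumption, and error handling is kept to a minimum):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <linux/ethtool.h>
    #include <linux/sockios.h>

    int main(int argc, char **argv)
    {
        const char *dev = (argc > 1) ? argv[1] : "eth0"; /* assumed device name */
        struct ifreq ifr;
        struct ethtool_value eval;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        if (fd < 0) {
            perror("socket");
            return 1;
        }
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, dev, IFNAMSIZ - 1);
        ifr.ifr_data = (char *)&eval;

        eval.cmd = ETHTOOL_GGSO;                 /* query generic segmentation offload */
        if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
            printf("GSO: %s\n", eval.data ? "on" : "off");

        eval.cmd = ETHTOOL_GTSO;                 /* query TCP segmentation offload */
        if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
            printf("TSO: %s\n", eval.data ? "on" : "off");

        close(fd);
        return 0;
    }

This is essentially what `ethtool -k <dev>` reports in its generic-segmentation-offload and tcp-segmentation-offload lines.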

 


