
Path MTU Discovery (PMTUD)

Whenever a host sends a packet, it is preferable that the packet reach the destination unfragmented. However, the largest datagram that can be forwarded without fragmentation depends on the MTU of each hop along the path. The minimum MTU along the path is the path MTU (PMTU), and it is the largest datagram size that can traverse the path unfragmented.

When generating a packet, the host initially uses the MTU of its outgoing interface as the packet size and sets the DF bit.

Path MTU discovery uses the Don’t Fragment (DF) bit in the IP header to dynamically discover the PMTU of a path.

The basic idea is that a source host initially assumes that the PMTU of a path is the (known) MTU of its first hop, and sends all datagrams on that path with the DF bit set. If any of the datagrams are too large to be forwarded without fragmentation by some router along the path, that router will discard them and return ICMP Destination Unreachable messages with a code meaning “fragmentation needed and DF set”. Upon receipt of such a “Datagram Too Big” message, the source host reduces its assumed PMTU for the path.
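To make the idea concrete, here is a minimal Python sketch of that logic (the function and the example path are illustrative assumptions, not a real host implementation):

MIN_PMTU = 68  # a host must never assume a PMTU below 68 octets

def discover_pmtu(first_hop_mtu, link_mtus):
    """Return the PMTU of a path made of the given link MTUs."""
    pmtu = first_hop_mtu                 # initially assume the first-hop MTU
    for mtu in link_mtus:
        if pmtu > mtu:                   # too big for this link: the router would drop
            pmtu = max(mtu, MIN_PMTU)    # the packet and report its next-hop MTU
    return pmtu

print(discover_pmtu(1500, [1500, 1000, 1500]))   # -> 1000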

A host MUST never reduce its estimate of the Path MTU below 68 octets.

PMTUD is only supported by TCP and UDP; other protocols do not support it. If PMTUD is enabled on a host, and it almost always is, all TCP or UDP packets from the host will have the DF bit set.

Today we will discuss PMTUD with TCP.

What is MSS (Maximum Segment Size)?

  • TCP MSS is the maximum size of the TCP payload.
  • The Maximum Segment Size is the largest “chunk” of data that TCP will send to the other end. When a connection is established, each end announces its MSS value.
  • The MSS value does not account for TCP options; when options are used, they consume part of the data space of the segment. The MSS excludes the IP header and the TCP header.
  • The IP header without options is 20 bytes, and the TCP header without options is 20 bytes.
  • The TCP SYN packet carries the MSS option. Each side of the TCP session sends its MSS to the other.
  • The sending host then limits its maximum TCP segment size to the MSS value received from the peer.
  • RFC 879 defines the default IP datagram size as 576 bytes, and hence the default TCP MSS as 536 bytes. If PMTUD is enabled, the MSS can be bigger. The initial MSS is sent to the peer:

INITIAL MSS = MTU – sizeof(TCP header) – sizeof(IP header)

  • The maximum segment size used to send a packet is:

SEND MAX SEGMENT SIZE = MIN (MSS received from peer, MTU – sizeof(TCP header) – sizeof(IP header))

  • A host doing PMTU Discovery must obey the rule that it cannot send IP datagrams larger than 576 octets unless it has permission from the receiver. For TCP connections, this means that a host must not send datagrams larger than 40 octets plus the Maximum Segment Size (MSS) sent by its peer.
  • If one end does not send its MSS option, the other end assumes a default of 536 bytes.
  • The TCP MSS is automatically calculated by subtracting 40 bytes (20 bytes IP + 20 bytes TCP) from the MTU.
  • MSS is not negotiated during the TCP three-way handshake; each side simply announces its value, and both devices end up using the minimum of the two. A small calculation sketch follows this list.
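Here is a minimal sketch of the MSS arithmetic above, assuming the usual 20-byte IP and 20-byte TCP headers (the function names are only for illustration):

IP_HEADER = 20    # bytes, without options
TCP_HEADER = 20   # bytes, without options

def initial_mss(mtu):
    # INITIAL MSS = MTU - sizeof(TCP header) - sizeof(IP header)
    return mtu - TCP_HEADER - IP_HEADER

def send_mss(local_mtu, peer_mss=536):
    # use the smaller of what fits on the local link and what the peer
    # announced; 536 is assumed when no MSS option was received
    return min(initial_mss(local_mtu), peer_mss)

print(initial_mss(1500))      # -> 1460
print(send_mss(1500, 1460))   # -> 1460
print(send_mss(1500))         # peer sent no MSS option -> 536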

Path MTU Discovery

  • The purpose of PMTU Discovery is to use network capacity as efficiently as possible.
  • This is done by using an MTU or MSS for the packets that is as large as possible. This MTU can vary according to the path the packets take through the network.
  • The TCP layer should track and store the MSS per connection.
  • PMTUD must regularly check whether it can increase the MSS; this check is typically driven by a timer. If the larger packet fails, the increase does not take effect. All packets are sent out with the DF bit set when PMTUD is enabled.
  • A router with PMTU Discovery enabled does not send out periodic probes to check the MTU along the path; PMTUD works only on regular TCP traffic.
  • If there is no TCP traffic for a while with packets at or close to the maximum MTU size of any link along the path, the MSS may have been increased to a value that is too large for the path, but this goes undetected until such large packets are actually sent.
  • The MSS is capped at the MTU value of the outgoing interface minus the headers. So if PMTUD is enabled, the MSS can increase each time to the next plateau, but will be capped at the IP MTU of the interface. In IOS the MSS is also capped at the receiver window value.
  • If a router cannot forward a packet because it is too large for the outgoing link, it drops the packet and sends an ICMP message Type 3 Code 4 back to the sender. An ICMP 3/4 means an ICMP message with type 3 and code 4. The message carries the next-hop MTU of the router, so that the sender can lower the MSS to that value and send the packet again.
  • This could happen a few times until the minimum MTU along the path is discovered. When an ICMP 3/4 does not provide the next-hop MTU, the sender will try the next lower value. This should only happen if the router sending the ICMP Type 3 Code 4 message does not fill in the next-hop MTU field.
  • The intermediate router sends an ICMP Destination Unreachable message to the source if the received datagram exceeds the MTU of the next-hop network and the DF bit is set.
  • The message will have a code indicating “fragmentation needed and DF set”. The router MUST include the MTU of that next-hop network in the low-order 16 bits of the ICMP header field that is labelled “unused” in the ICMP specification. The high-order 16 bits remain unused, and MUST be set to zero. Thus, the message has the following format (a small parsing sketch follows this list):

   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |   Type = 3    |   Code = 4    |           Checksum            |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |           unused = 0          |         Next-Hop MTU          |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |      Internet Header + 64 bits of Original Datagram Data      |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

(ICMP Destination Unreachable message format, per RFC 1191)

  • This field will never contain a value less than 68, since every router “must be able to forward a datagram of 68 octets without fragmentation”.
  • The PMTU is associated with a specific path, identified by source address, destination address and ToS. This association is stored in the routing table; the host must create and cache a per-host route for every active destination if a per-host route is not already available.
  • PMTU Discovery only creates or changes entries for per-host routes. If a per-host route for the path does not exist, one is created (almost as if a per-host ICMP Redirect were being processed; the new route uses the same first-hop router as the current route).
  • The ip tcp path-mtu-discovery command is used to enable TCP path MTU discovery for TCP connections initiated by the router (BGP and Telnet, for example).
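As an illustration of the message format shown above, the following Python sketch extracts the next-hop MTU from the fixed 8-byte ICMP header (the sample bytes are fabricated for the example and the checksum is not validated):

import struct

def next_hop_mtu(icmp_header):
    # Per RFC 1191 the next-hop MTU sits in the low-order 16 bits of the
    # field that the original ICMP specification labels "unused".
    icmp_type, code, checksum, unused, mtu = struct.unpack("!BBHHH", icmp_header[:8])
    if icmp_type == 3 and code == 4:
        return mtu        # 0 means the router did not fill the field in
    return None

# Fabricated example: Type 3, Code 4, checksum 0, next-hop MTU 1000
sample = struct.pack("!BBHHH", 3, 4, 0, 0, 1000)
print(next_hop_mtu(sample))   # -> 1000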

Detailed lab approach to check MTU and MSS behavior:

[Lab topology: R4 (Et0/1, 10.0.45.2/30, AS 200) connected through SW (Gi0/0 and Gi0/1, MTU 1000) to R5 (Et0/0, 10.0.45.1/30, AS 100), with eBGP between R4 and R5]

Configuration:

R4:

interface Loopback0

ip address 4.4.4.4 255.255.255.255

interface Ethernet0/1

ip address 10.0.45.2 255.255.255.252

router bgp 200

bgp log-neighbor-changes

redistribute connected

neighbor 10.0.45.1 remote-as 100

R5:

interface Loopback0

ip address 5.5.5.5 255.255.255.255

interface Ethernet0/0

ip address 10.0.45.1 255.255.255.252

router bgp 100

bgp log-neighbor-changes

redistribute connected

neighbor 10.0.45.2 remote-as 200

SW:

interface GigabitEthernet0/0

mtu 1000

interface GigabitEthernet0/1

mtu 1000

BGP Summary

R5#sh ip bgp su

BGP router identifier 5.5.5.5, local AS number 100

BGP table version is 516, main routing table version 516

515 network entries using 127720 bytes of memory

516 path entries using 70176 bytes of memory

2/2 BGP path/bestpath attribute entries using 544 bytes of memory

1 BGP AS-PATH entries using 24 bytes of memory

0 BGP route-map cache entries using 0 bytes of memory

0 BGP filter-list cache entries using 0 bytes of memory

BGP using 198464 total bytes of memory

BGP activity 515/0 prefixes, 516/0 paths, scan interval 60 secs

Neighbor        V           AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd

10.0.45.2       4          200      21      20      516    0    0 00:15:34        3

R5#

 

R4#sh ip bgp su

BGP router identifier 4.4.4.4, local AS number 200

BGP table version is 516, main routing table version 516

515 network entries using 127720 bytes of memory

516 path entries using 70176 bytes of memory

2/2 BGP path/bestpath attribute entries using 544 bytes of memory

1 BGP AS-PATH entries using 24 bytes of memory

0 BGP route-map cache entries using 0 bytes of memory

0 BGP filter-list cache entries using 0 bytes of memory

BGP using 198464 total bytes of memory

BGP activity 515/0 prefixes, 516/0 paths, scan interval 60 secs

Neighbor        V           AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd

10.0.45.1       4          100      21      22      516    0    0 00:16:29      513

 

  • Basically, the initial TCP negotiation between R4 and R5 will use an MSS equal to (IP MTU – 40 bytes of IP and TCP headers), with the DF bit set.
  • In our case the IP MTU of the interface is set to 1500, which results in an MSS of 1460.

R4#sh int e0/1

Ethernet0/1 is up, line protocol is up

Hardware is AmdP2, address is aabb.cc00.4110 (bia aabb.cc00.4110)

Internet address is 10.0.45.2/30

MTU 1500 bytes, BW 10000 Kbit/sec, DLY 1000 usec,

reliability 255/255, txload 1/255, rxload 1/255

Encapsulation ARPA, loopback not set

Keepalive set (10 sec)

ARP type: ARPA, ARP Timeout 04:00:00

Last input 00:00:00, output 00:00:00, output hang never

Last clearing of “show interface” counters never

Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0

Queueing strategy: fifo

Output queue: 0/40 (size/max)

5 minute input rate 0 bits/sec, 0 packets/sec

5 minute output rate 0 bits/sec, 0 packets/sec

702 packets input, 50933 bytes, 0 no buffer

Received 653 broadcasts (0 IP multicasts)

0 runts, 0 giants, 0 throttles

0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored

0 input packets with dribble condition detected

215 packets output, 23214 bytes, 0 underruns

0 output errors, 0 collisions, 1 interface resets

38 unknown protocol drops

0 babbles, 0 late collision, 0 deferred

0 lost carrier, 0 no carrier

0 output buffer failures, 0 output buffers swapped out

R4#

 

R5#sh int e0/0

Ethernet0/0 is up, line protocol is up

Hardware is AmdP2, address is aabb.cc00.5100 (bia aabb.cc00.5100)

Internet address is 10.0.45.1/30

MTU 1500 bytes, BW 10000 Kbit/sec, DLY 1000 usec,

reliability 255/255, txload 1/255, rxload 1/255

Encapsulation ARPA, loopback not set

Keepalive set (10 sec)

ARP type: ARPA, ARP Timeout 04:00:00

Last input 00:00:01, output 00:00:05, output hang never

Last clearing of “show interface” counters never

Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0

Queueing strategy: fifo

Output queue: 0/40 (size/max)

5 minute input rate 0 bits/sec, 0 packets/sec

5 minute output rate 0 bits/sec, 0 packets/sec

735 packets input, 50893 bytes, 0 no buffer

Received 686 broadcasts (0 IP multicasts)

0 runts, 0 giants, 0 throttles

0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored

0 input packets with dribble condition detected

223 packets output, 26547 bytes, 0 underruns

0 output errors, 0 collisions, 1 interface resets

40 unknown protocol drops

0 babbles, 0 late collision, 0 deferred

0 lost carrier, 0 no carrier

0 output buffer failures, 0 output buffers swapped out

R5#

 

  • During the initial negotiation the packets are small, so the BGP session moves to Established without a problem.

R5#sh ip bgp neighbors | i Data

Datagrams (max data segment is 1460 bytes)

MSS Value is 1460 initially.

  • The MSS used by both BGP peers is 1460. The first update sent after the BGP Open by R4 will be an IP packet of 1500 bytes.
  • When R4 sends a BGP Update message using the MSS value of 1460, the intermediate SW, configured with MTU 1000 on both interfaces, drops the packet and sends back an ICMP 3/4 (Destination Unreachable, fragmentation needed). This message carries the MTU of the next hop.

 

If Path MTU Discovery is enabled:

R4#sh ip bg neighbors  | i tcp

Transport(tcp) path-mtu-discovery is enabled

  • The source host sets the DF bit in the IP header to indicate that the packet must not be fragmented in transit.
  • The intermediate switch drops these large packets because they exceed the MTU of the outgoing interface and it is not allowed to fragment them due to the DF bit.
  • The intermediate switch sends an ICMP “Fragmentation Needed and DF set” message back to the source host.
  • The ICMP Fragmentation Needed message also contains the recommended MTU value.
  • After the TCP negotiation, the BGP packets (BGP UPDATE packets) are sent with the DF bit set.
  • As the intermediate switch interface MTU is 1000, it drops the packet, resulting in an ICMP error message from the intermediate device with 1000 as the next-hop MTU.

Now the MSS is reduced to 1000 – 40 = 960; a small sketch of this recalculation follows the output below.

R5#sh ip bgp neighbors | i data

Datagrams (max data segment is 960 bytes):
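A minimal sketch of the recalculation the sender performs when the ICMP 3/4 arrives, using the lab values (the function name is illustrative):

IP_TCP_OVERHEAD = 40   # 20-byte IP header + 20-byte TCP header

def adjust_mss(current_mss, icmp_next_hop_mtu):
    # on receipt of ICMP Type 3 Code 4, cap the MSS at the reported
    # next-hop MTU minus the IP and TCP headers, never increasing it
    return min(current_mss, icmp_next_hop_mtu - IP_TCP_OVERHEAD)

print(adjust_mss(1460, 1000))   # -> 960, matching the output above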


  • When some intermediate device does not forward the ICMP message end to end, Path MTU Discovery will not be successful. This may result in BGP session flaps.
  • BGP sends Update messages based on the MSS value calculated by TCP.

If PMTUD is disabled:

If Path MTU Discovery (PMTUD) is not enabled and the destination is remote (not on the same interface/subnet), the BGP MSS value defaults to 536 bytes, as defined in RFC 879.

This may result in inefficient use of the network, sending segments of 536 bytes on a link capable of handling around 9000 bytes.

So if a huge number of updates is exchanged between the two routers at an MSS of 536 bytes, convergence suffers and the network is used inefficiently.

The reason is that the interface could carry almost three times that much data per packet, but the updates have to be broken into chunks of 536 bytes. If the TCP destination is on the same interface/subnet (the non-multihop eBGP case), the MSS value is calculated based on the outgoing interface IP MTU setting.
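To put rough numbers on that inefficiency, here is a quick sketch comparing how many TCP segments a given amount of BGP update data needs at MSS 536 versus larger MSS values (the 1 MB figure is an assumption for illustration only):

import math

def segments_needed(update_bytes, mss):
    # each TCP segment carries at most mss bytes of BGP data
    return math.ceil(update_bytes / mss)

update_bytes = 1_000_000                     # assumed amount of update data
print(segments_needed(update_bytes, 536))    # -> 1866 segments at the RFC 879 default
print(segments_needed(update_bytes, 1460))   # -> 685 segments with a 1500-byte MTU
print(segments_needed(update_bytes, 8960))   # -> 112 segments on a 9000-byte MTU link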

WORKAROUNDS:

  1. Adjusting the TCP MSS value to work around the intermediate MTU issue

This is another way to address the issue, though not an optimal one:

Configure “ip tcp mss 600” so that the update messages are sent with an MSS of 600, in case an intermediate router has a lower MTU than the routers where the TCP connection is formed.

The BGP Update packets are now sent as 600 bytes of data + 20 bytes IP + 20 bytes TCP + 14 bytes Ethernet frame = 654 bytes.
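A quick check of that arithmetic (the 14 bytes are the Ethernet II header; the FCS is not counted here):

MSS = 600        # configured with "ip tcp mss 600"
IP_HDR = 20
TCP_HDR = 20
ETH_HDR = 14     # Ethernet II header, FCS excluded

print(MSS + TCP_HDR + IP_HDR + ETH_HDR)   # -> 654 bytes on the wire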


R5#sh ip bgp neighbors | i dat

Datagrams (max data segment is 600 bytes)

 

  2. Disable PMTUD for BGP

Disable PMTUD for BGP on one neighbor with “neighbor x.x.x.x transport path-mtu-discovery disable”, or disable TCP path MTU discovery globally (which affects all TCP sessions terminating on this router).

With PMTUD disabled, the MSS sent by both sides is 536.

  3. Clearing the DF bit

You could configure the router to clear the DF bit on the packets. This is not good for performance, because the router will have to fragment the larger packets. The second disadvantage is that it has to be done through policy-based routing.

Example of clearing the DF bit in a route-map used by policy-based routing:

 

interface Serial2/0

ip address 10.1.3.3 255.255.255.0

ip policy route-map clearing-df-bit

 

access-list 100 permit ip host 10.100.1.4 host 10.100.1.1

access-list 100 permit ip host 10.100.1.1 host 10.100.1.4

 

route-map clearing-df-bit permit 10

match ip address 100

set ip df 0
