Receiving 14 Million network packets per second
A very high-level flow diagram of a packet's journey, from arriving at the hardware (the Network Interface Card) of your machine to reaching your application, can be shown as -
The flow diagram can be divided into two parts.
1. Moving the packet from the physical layer (Network Card) to the application (kernel space processing)
In Linux (and in general), only the kernel can access the privileged sections of memory (hardware buffers, etc.), hence the kernel takes the data from the NIC buffer and, after some processing (described in the next section), delivers the packet to the application.
2. Processing the client request (user space processing)
When a packet has arrived at the application, it is now the application's responsibility to process it (e.g. send the file that the client requested, write to a file, etc.) and to manage concurrent connections. We won't be discussing this part in this blog.
So, we will only be talking about kernel space processing, i.e. what exactly the kernel does to move a packet from the hardware to the application, and how many packets per second it can deliver to the application.
Here are some numbers-
We now have 10 Gbit/sec Network Interface Cards (NICs), which means they can accept 10 billion bits per second; that translates to around 14.88 million packets per second (Mpps), taking the smallest packet size of 64 bytes plus 20 bytes of per-packet overhead on the wire.
But as we have seen in the diagram above (image 1), this doesn't mean that your application can accept around 14.88 Mpps: those packets need to be moved from the NIC to your application, and a single-core 2.3 GHz CPU can only process around 1.4 Mpps, roughly 10 times lower than the rate at which the NIC is receiving packets! [1]
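For reference, here is the arithmetic behind that 14.88 Mpps figure: the smallest Ethernet frame is 64 bytes, and each frame also costs roughly 20 extra bytes on the wire (preamble plus inter-frame gap). A tiny sketch of the calculation in C:

```c
#include <stdio.h>

int main(void) {
    double link_rate_bps  = 10e9;   /* 10 Gbit/sec NIC             */
    double frame_bytes    = 64.0;   /* smallest Ethernet frame     */
    double overhead_bytes = 20.0;   /* preamble + inter-frame gap  */

    /* Each minimal frame occupies (64 + 20) * 8 bits on the wire. */
    double bits_per_frame = (frame_bytes + overhead_bytes) * 8.0;

    double pps = link_rate_bps / bits_per_frame;
    printf("max packet rate: %.2f Mpps\n", pps / 1e6);  /* ~14.88 Mpps */
    return 0;
}
```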
Now the question is -
- Why does it take so much time (compared to the rate at which the NIC receives packets) to move a packet, and what improvements can be made?
Note that I said a single-core CPU; hence, in this blog we will only talk about the drawbacks and improvements on a single core. In a future blog, we will talk about multi-core environments.
Basic familiarity with the OSI model will help you understand this much better; if you are not familiar with it, please go through Hussein Nasser's video or any other introductory video.
Here is a quick recap of some OS-related concepts that will be used frequently in this blog.
Kernel space and user space -
- To guarantee that some memory regions can only be accessed by privileged code, we need logical isolation between user space and kernel space.
- In a Linux system based on 32-bit Architecture, user-space address space corresponds to the lower 3GB of virtual space and the upper 1 GB belongs to kernel space.
- Note that it does not mean the kernel uses that much physical memory, only that it has that portion of address space available to map whatever physical memory it wishes.
- Kernel space is flagged in the page tables as exclusive to privileged code; hence, if user code tries to access it, a fault is raised.
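As a quick illustration of that isolation, here is a deliberately crashing sketch (the kernel-range address below is just an example value for 64-bit Linux, not a real object): when an ordinary user program touches an address that belongs to kernel space, the page-table protection kicks in and the process gets a fault instead of the data.

```c
#include <stdio.h>

int main(void) {
    /* An address in the kernel half of the virtual address space
     * (example value for 64-bit Linux; nothing meaningful is mapped
     * here from this process's point of view). */
    volatile unsigned long *kernel_addr =
        (volatile unsigned long *)0xffff888000000000UL;

    printf("about to touch kernel space from user mode...\n");
    unsigned long value = *kernel_addr;   /* faults; delivered as SIGSEGV */
    printf("never reached: %lu\n", value);
    return 0;
}
```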
Kernel mode and user mode -
- In the end, every program is translated into a set of machine instructions that the CPU fetches and executes.
- When an instruction executes on the CPU, there is a flag that defines whether it is allowed to perform privileged tasks (i.e. whether it can access kernel-space memory such as the process page tables, hardware buffers, etc.).
- By default, all programs run in unprivileged mode, so they can't access memory in kernel space; instead, they make a system call, which switches the mode from user mode to kernel mode, and the kernel runs the privileged instructions on behalf of the process.
What exactly is the cost of switching from kernel mode to user mode or vice versa?
To understand this, let's have a look at a running process -
- A process is just a set of instructions to be run on the processor. Each process has a stack that is used to manage the execution of that process. The Stack Pointer (SP) points to the top of the stack, which stores the address of the instruction being executed; once its execution completes, the SP points to the next instruction (the new top of the stack).
- To execute privileged instructions, the user program makes a system call.
- Every process also has a separate stack, called the kernel stack, that is used for managing privileged instructions for that process.
So switching from kernel mode to user mode (or vice versa) essentially means moving the Stack Pointer from the top of the user-space stack to the top of the kernel stack of the process (or vice versa), so that the CPU can load the instructions from that stack (some details are skipped here; for exact details please refer to the kernel stack management and essentials of micro-controllers blogs).
I couldn't find an exact estimate, but it takes around 90 ns (90 * 10^-9 seconds) [1].
Note: switching from kernel mode to user mode is different from moving data from kernel space to user space (where data is copied from one memory location to another and the cost depends on the data size).
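If you want a rough feel for this cost on your own machine, one simple (and admittedly crude) sketch is to time a cheap system call such as getpid() in a tight loop; the number you get bundles the mode switch with the rest of the system call path, and it varies with the CPU, kernel version, and mitigations, so treat it only as a ballpark.

```c
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    const long iterations = 1000000;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++) {
        /* Invoke the raw system call so every iteration really
         * crosses from user mode into kernel mode and back. */
        syscall(SYS_getpid);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9
                      + (end.tv_nsec - start.tv_nsec);
    printf("avg user<->kernel round trip: %.1f ns\n",
           elapsed_ns / iterations);
    return 0;
}
```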
Let’s get started -
Have a look at the Packet flow from NIC card to Application.
We can divide this into three steps (from bottom to top in image 4) -
1. Device driver processing
- When a packet arrives at the Network Interface Card of the machine, the card receives the packet and the packet is transferred into the NIC memory through DMA (direct memory access i.e. no CPU involvement needed).
- Since the kernel needs to know the location of the currently available packets in memory, the NIC also maintains a circular ring buffer which stores the descriptors of the packets (i.e. buffer length, physical memory address, etc.).
- Now an interrupt is generated to let the kernel know that a packet has arrived and that it should start processing it.
2. Kernel space processing-
- The interrupt is received and the proper interrupt handler to process the packet is invoked.
- Kernel picks the packet from the NIC memory.
- Since the packet is currently in a NIC-specific format, another data structure such as mbuf or sk_buff (mbuf in FreeBSD, sk_buff in Linux; sk_buff shown in image 5) is created to represent it in an OS-specific, device-independent form.
- The next step is IP header removal, packet validation, etc. (Layer 3 processing), so those functions are called.
- Now it’s time for protocol-specific processing (i.e. Layer 4 processing, so if it’s a TCP packet then those specific functions are called).
- Finally, the packet is copied to user space so that the application can access it (a minimal sketch of this conventional receive path, as seen from the application, follows this list).
3. User space processing -
- As stated at the start of the blog, this processing is application-specific and is not part of the discussion in this blog.
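Before moving to the bottlenecks, here is a minimal sketch of what this conventional path looks like from the application side, using a Linux AF_PACKET raw socket (this is my own illustration, not code from the Netmap paper; it needs root privileges and error handling is trimmed). Every recvfrom() below is a system call, and every packet is copied out of a kernel sk_buff into the user-space buffer:

```c
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/if_ether.h>

int main(void) {
    /* Raw layer-2 socket: the kernel still runs its normal receive
     * path (driver -> sk_buff -> socket queue) before we see data. */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    unsigned char frame[2048];
    for (;;) {
        /* One system call per packet, plus one copy from kernel
         * space into 'frame' -- exactly the costs discussed below. */
        ssize_t len = recvfrom(fd, frame, sizeof(frame), 0, NULL, NULL);
        if (len < 0) { perror("recvfrom"); break; }
        printf("received %zd-byte frame\n", len);
    }
    close(fd);
    return 0;
}
```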
Key bottleneck steps in kernel space processing and what solutions have been proposed for them.
1. One transformation of the packet into a new data structure, i.e. from a raw packet to sk_buff/mbuf, and one memory copy to move the data from kernel-space memory to user-space memory (look at image 6 and its description).
- The first copy (actually a shadow copy, since a different representation is created) is done to convert the raw packet into a device-independent format, sk_buff/mbuf, and it can be avoided if we have a common representation.
- Since the user application can't access the memory where the packet currently resides (either the Buffers or the mbufs in image 6), the packet must be copied from kernel space into user space; but if we could place the packet buffers in a shared space (user space) and give the user application protected access, this copy could also be avoided.
2. A data structure is allocated and de-allocated for every packet.
Memory is allocated and de-allocated for every packet; having a pre-allocated pool of data structures would remove this overhead.
3. Too many system calls (i.e. requests to switch from user mode to kernel mode).
If we can give the application protected access to the packet buffer, no system call needs to be made by the application to read/write a packet from/to that memory location.
But we can't avoid system calls completely (details on this in the Netmap section); still, there will be far fewer of them now, and if we make them in batches (i.e. one system call for k packets) the cost is amortized (see the recvmmsg() sketch below for an existing example of batching).
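Batching is not unique to kernel-bypass frameworks; as a point of reference, Linux already offers recvmmsg(), which receives many packets with a single user-to-kernel transition (it amortizes the system-call cost, though each packet is still copied into user space). A minimal sketch, assuming a UDP socket on an arbitrary port and with error handling trimmed:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH 32

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port   = htons(9000),
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    char bufs[BATCH][2048];
    struct iovec iov[BATCH];
    struct mmsghdr msgs[BATCH];
    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < BATCH; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len  = sizeof(bufs[i]);
        msgs[i].msg_hdr.msg_iov    = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    /* One system call can return up to BATCH datagrams, so the
     * user/kernel switching cost is amortized across the batch. */
    int n = recvmmsg(fd, msgs, BATCH, MSG_WAITFORONE, NULL);
    for (int i = 0; i < n; i++)
        printf("packet %d: %u bytes\n", i, msgs[i].msg_len);
    return 0;
}
```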
Can we achieve all the above improvements while still using the TCP/IP stack of Linux?
Note: What I mean by the TCP/IP stack of Linux here is, libraries in Linux that are used for layer 3 and layer 4 processing.
The answer is simply no, because -
- We want to represent the packet in a common format, which the Linux stack can't understand.
- The Linux TCP/IP stack doesn't have batch-processing logic (batch processing is way more important than it sounds; look at image 9).
So, the solution is to bypass the kernel! And here comes the term Kernel Bypass Networking!
Kernel bypass networking is not as simple as it sounds! As mentioned above, since we can't use the TCP/IP stack of Linux, one has to write one's own TCP processing library. It has taken decades to make the TCP/IP stack of Linux as robust and secure as it is today!
But what if an application just wants the raw network packet? For example, firewall applications, network monitoring applications, etc. only care about layer 2 of the packet; no TCP processing library is needed for them!
Yes, for applications that need raw network packets, kernel bypass networking is much more achievable, and there are some great open-source solutions available for it.
Netmap -
Netmap is a framework that provides user Applications direct access to the raw network packets on FreeBSD and Linux[4]. The best part about Netmap is that it doesn’t require any specific type of Network card.
Netmap improves performance at all three bottleneck steps discussed above.
1. It keeps the packets in a shared space (user space), to which the user application has protected read and write access. Hence no copy of data from kernel space to user space or vice versa is needed. Packets are stored in a lightweight format called pkt_buf (look at image 7).
The pkt_bufs shown in image 7 form a circular queue (called the packet buffer); protected access for the user application means that the application can only access the part of the queue (look at image 8) where it needs to read/write a packet (giving access to the whole queue could cause memory corruption by the user application).
2. Packets are represented in a device-independent format called pkt_buf, so there is no need to create another data structure like sk_buff/mbuf (look at image 6). A pool of pkt_bufs is pre-allocated, hence there is no overhead of allocating/de-allocating a data structure for every packet.
3. The number of system calls is reduced, and the ones that remain are made in batches, which amortizes the cost.
- Since the application now has protected access to the shared buffer location (look at image 8), it does not need to make a system call to read/write a packet.
- But we can't avoid system calls completely. For example, once a packet has been read/written by the user application, the positions from which the NIC reads/writes need to be updated so that the NIC knows the updated index from which to pick up/write a new packet while sending/receiving!
Hence, once the user application has read/written a packet to the shared queue, it makes a system call so that the kernel can update the buffer descriptors (look at image 7 carefully: the NIC ring, which stores the buffer descriptors, i.e. physical memory addresses and other metadata used for reading/writing packets, is still in kernel space, so only the kernel can access it). A minimal sketch of such a receive loop follows this list.
Some other system calls also can't be avoided (I am skipping their explanation to avoid too much complexity; please refer to [1] for more details); all these system calls are made in batches, which amortizes the cost.
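To make this concrete, here is a minimal sketch of a Netmap receive loop using the classic nm_open()/NETMAP_RXRING helper API ("netmap:eth0" is a placeholder interface name, error handling is trimmed, and details vary across Netmap versions, so treat this as an illustration of the flow rather than reference code):

```c
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <poll.h>
#include <stdio.h>

int main(void) {
    /* nm_open() mmap()s the shared packet buffers and rings into this
     * process, so no per-packet copy into user space is needed. */
    struct nm_desc *d = nm_open("netmap:eth0", NULL, 0, NULL);
    if (d == NULL) { fprintf(stderr, "nm_open failed\n"); return 1; }

    struct pollfd pfd = { .fd = d->fd, .events = POLLIN };
    for (;;) {
        /* One blocking system call synchronizes the rings with the NIC;
         * when it returns, a whole batch of packets is ready. */
        poll(&pfd, 1, -1);

        for (int ri = d->first_rx_ring; ri <= d->last_rx_ring; ri++) {
            struct netmap_ring *ring = NETMAP_RXRING(d->nifp, ri);
            while (!nm_ring_empty(ring)) {
                struct netmap_slot *slot = &ring->slot[ring->cur];
                char *pkt = NETMAP_BUF(ring, slot->buf_idx);

                /* 'pkt' points straight into the shared buffer:
                 * no sk_buff and no copy into user space. */
                printf("got %u-byte packet\n", (unsigned)slot->len);
                (void)pkt;

                /* Advance cur/head so the next sync tells the kernel
                 * (and the NIC) that this slot can be reused. */
                ring->head = ring->cur = nm_ring_next(ring, ring->cur);
            }
        }
    }
    /* not reached in this sketch */
    nm_close(d);
    return 0;
}
```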
Netmap Performance numbers -
With batch processing of 8 or more packets (image 9), we get around 14.88 Mpps (million packets per second) on a single core with a clock speed of 900 MHz (image 10).
And that is enough to saturate the 10 Gbit/sec NIC!
Limitations of Kernel bypass Networking
A very interesting blog by Cloudflare [2] summarizes some great points on why we even use the kernel's TCP/IP stack; one of the most important reasons is that it allows a hardware resource (the NIC) to be used by multiple processes, since the kernel takes care of sharing it.
Now, since we are bypassing the kernel, we lose this resource-sharing capability, and hence you need a dedicated NIC for the consuming application.
One trick to get the best of both worlds is partial kernel bypass, which is possible when you have multiple cores: some of the cores can access the packets bypassing the kernel, while the other cores can use the Linux TCP/IP stack! [3]
Conclusion
We discussed how the kernel moves a packet from the NIC to the application, which bottleneck steps prevent it from delivering packets at the rate they are received at the NIC, and finally what improvements have been made to saturate the line rate of a 10 Gbit/sec NIC for raw network packets using a single-core 900 MHz CPU.
Stay tuned for a new blog to discuss the performance on a multi-core CPU setup.
I hope you enjoyed this article, feel free to drop any questions/suggestions in the comments.
Special thanks to Manas Pradhan for reviewing the draft and for all the suggestions.
References
1. Netmap: a fast Network I/O framework
2. Why do we use the Linux kernel's TCP/IP stack (Cloudflare blog)
3. Partial kernel bypass networking stack
4. Netmap
5. Montazerolghaem, Ahmadreza, Hosseini Seno, Seyed Amin, Yaghmaee, Mohammad-H., & Tashtarian, Farzad (2016). Overload mitigation mechanism for VoIP networks: A transport layer approach based on resource management. Transactions on Emerging Telecommunications Technologies, 27. doi:10.1002/ett.3038