0x8086

Wednesday, 5 December 2012

This is going to be a short post. There are two things I need to do frequently with the manycore machines that I have:
- switch CPU cores on/off to perform experiments in a more controlled environment
- scale their frequency to match the required performance

For the first, there is a simple trick. All CPU cores in Linux show up in sysfs.

So one just has to echo 1 (for ON) or 0 (for OFF) into the online file. This is a sysfs file, and writing to it triggers an action inside the kernel; in our case, switching off the CPU.

#echo 0 > /sys/devices/system/cpu/cpu0/online

For this to work, your kernel must support dynamic hotplugging of CPUs. Mind that CPU0 has a special status and you cannot switch it off. Also, Linux is smart enough that when it ends up running on a single core, it switches to uniprocessor (UP) code.
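The on/off trick can be wrapped in a small script. A minimal sketch, assuming the standard sysfs hotplug layout (run it as root against the real path; cpu0 exposes no online file, so it is skipped naturally):

```shell
#!/bin/sh
# set_cpus STATE [BASE] - write STATE (1 = online, 0 = offline) into every
# cpuN/online file under BASE. BASE defaults to the real sysfs path; a second
# argument is only there to make the function easy to try out elsewhere.
set_cpus() {
    state="$1"
    base="${2:-/sys/devices/system/cpu}"
    for f in "$base"/cpu[0-9]*/online; do
        # cpu0 has no online file, so the glob/-e test skips it
        [ -e "$f" ] && echo "$state" > "$f"
    done
    return 0
}

# e.g. switch off everything except cpu0, then bring it all back:
#   set_cpus 0
#   set_cpus 1
```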

For the second, you need the matching processor power-state driver. Most modern Intel processors can work with the acpi_cpufreq.ko P-state driver. For other driver options, check your kernel's CPU frequency scaling (cpufreq) configuration.
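Once a cpufreq driver such as acpi_cpufreq is loaded, frequency scaling also happens through sysfs. A hedged sketch, assuming the "userspace" governor is available (run as root; 1600000 kHz is just an example value):

```shell
#!/bin/sh
# set_freq FREQ [BASE] - pin every core under BASE to FREQ (in kHz) via the
# standard cpufreq sysfs files. BASE defaults to the real sysfs path; the
# second argument just makes the function easy to exercise elsewhere.
set_freq() {
    freq="$1"
    base="${2:-/sys/devices/system/cpu}"
    for d in "$base"/cpu[0-9]*/cpufreq; do
        [ -d "$d" ] || continue
        echo userspace > "$d/scaling_governor"   # fixed-frequency governor
        echo "$freq"   > "$d/scaling_setspeed"
    done
    return 0
}

# e.g. pin all cores to 1.6 GHz, then verify:
#   set_freq 1600000
#   cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
```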

Tuesday, 13 December 2011

This is the second part of the series about RDMA. In the first part I talked about RDMA's history, evolution, and current status.

As I mentioned in the last post (link), iWARP is RDMA on top of IP-based networks; it can use Ethernet as the L2 technology but is not necessarily limited to it. SoftiWARP is a pure software implementation of the iWARP protocol. In the next section I will give a quick overview of the technical details of the iWARP protocol and describe how it is implemented in SoftiWARP.

Technical Details: The key advantages offered by RDMA are zero-copy networking and removal of the OS/application from the fast data path. RDMA achieves this by pinning the user buffer pages involved and marking all network segments with buffer-identifier and offset information. Hence each received packet can be placed independently and immediately by identifying its position in the user buffer. However, for the NIC to do this, it must first process the IP, TCP, and RDMA headers. Traditional Ethernet NICs can only process Ethernet headers and have no idea about higher-layer protocols such as IP and TCP.

TCP-socket-based network communication is the de-facto standard for data exchange on the Internet. Processing TCP headers in hardware (aka stateful offload) is a risky business (see Mogul '03, HotOS) and has met with fierce opposition from the community. That is why there is no out-of-the-box support for RDMA in Linux; it requires some patching or knowledge on the part of the users. For example, the port-space collision between the in-kernel TCP stack and the offloaded stack in the NIC is still an unresolved issue. NICs that can process IP, TCP, and RDMA headers in hardware are called RDMA NICs, or RNICs. RDMA traffic cannot be mixed with normal socket-based TCP traffic, as it carries additional headers and information that enable an RNIC to place each segment directly in the user buffer.

SoftiWARP is a pure software implementation of the iWARP protocol on top of an unmodified Linux kernel. It enables an ordinary NIC without RDMA capabilities to handle RDMA traffic in software. It is wire-compatible with an RNIC, hence you can use it in a mixed setup. SoftiWARP is just another RDMA provider inside the OFED stack and consists of a kernel driver and a user-space library. It uses in-kernel TCP sockets for data transmission. Some more details about its transmit and receive paths:

Transmission Path: SoftiWARP uses per-core kernel threads to perform data transmission on behalf of user processes. When a user process posts a transmission request (post syscall), the request is handed over to the TCP socket if it is small enough to fit into the socket buffer; otherwise data is pushed to the socket in non-blocking mode until it hits -EAGAIN. At that point the post syscall returns and the QP is put on a wait work queue. The application is now free to do anything else. Each QP also registers a write-space callback in order to be notified when there is more free space in the socket buffer. Upon receiving such a notification, it schedules the kernel thread to push data on behalf of the user process. The kernel thread pushes data until it hits -EAGAIN and then moves on to the next QP. This process is repeated until the complete user buffer is transmitted, and the user process is notified asynchronously in the end about the successful transmission. Depending upon the data transmission semantics, send or sendpage (zero-copy) can be used. For example, a read response transmission is always zero-copy, using tcp_sendpage.

Receive Path: The SoftiWARP receive path is very simple. Each QP registers a socket callback function (sk->sk_data_ready) which is called at the end of the netstack's TCP processing (from the end of tcp_rcv_established). In the SoftiWARP code this function is siw_qp_llp_data_ready(). It processes the RDMA header, locates the pinned user buffers, calculates the buffer offset, and, after checking access permissions, copies the data by calling skb_copy_bits().

Setup on Debian 6.0: In this section I will outline how to install OFED and SoftiWARP from scratch on a Debian 6.0 machine. I like to install things from source, so that later you can check and see what is happening inside the code. I have a freshly installed Debian system with a vanilla kernel (2.6.36.2). Nothing fancy here. Make sure you compile in (from make menuconfig) -> Device Drivers -> InfiniBand Support -> InfiniBand userspace MAD support and InfiniBand userspace access (verbs and CM).

Step 1: Install the OFED environment. This consists of installing librdmacm and libibverbs. I will install them from source.

#apt-get source libibverbs

# cd libibverbs-1.1.3

#./configure

#make

#make install

Same steps for librdmacm. Now they should be installed in /usr/local/lib. If required, include this in your LD_LIBRARY_PATH by putting the following in your .bashrc file:

LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
export LD_LIBRARY_PATH

Step 2: Install libsiw (the user-space driver for the SoftiWARP device). You should have autoconf, automake, and libtool installed. Same steps, but get the source from git:

# cd 'your directory of choice'

#git clone git://www.gitorious.org/softiwarp/userlib.git

#cd userlib

#./autogen.sh

#./autogen.sh (again)

#./configure

#make

#make install

Step 3: Get the kernel driver compiled. Nothing fancy here. Everything should go without any problem.

# cd 'your directory of choice'

#git clone git://www.gitorious.org/softiwarp/kernel.git

# cd kernel/softiwarp

#make

I do not recommend installing the module and tainting your kernel. It might be handy to write a shell script that does insmod from this build location. See step 6.
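Such a helper could look like the sketch below, where the build path and the siw.ko module name are assumptions about how the SoftiWARP tree builds:

```shell
#!/bin/sh
# load_siw [MODULE] - insmod the SoftiWARP module straight from the build
# tree instead of installing it into /lib/modules. The default path and the
# module name (siw.ko) are assumptions; adjust to your checkout. Run as root.
load_siw() {
    mod="${1:-./kernel/softiwarp/siw.ko}"
    if [ -f "$mod" ]; then
        insmod "$mod"
    else
        echo "module not found: $mod" >&2
        return 1
    fi
}

# usage:
#   load_siw ~/src/softiwarp/kernel/softiwarp/siw.ko
```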

Run ldconfig on the newly installed system, so it learns about the new libs. Some common issues:

- "libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1"

In some peculiar cases, due to a wrong installation, the driver file is not properly located. For example, on my system all driver files are in the /usr/local/etc/libibverbs.d/ directory. A driver file is nothing special, just a simple file that tells libibverbs the name of the driver (and hence the library file name). I have the following files in my /usr/local/etc/libibverbs.d/:

-rw-r--r-- 1 root staff 13 Aug 30 11:38 cxgb4.driver
-rw-r--r-- 1 root staff 11 Aug 30 12:31 siw.driver

cxgb4.driver is for the Chelsio T4 RNIC, and siw.driver is for SoftiWARP. Inside the file there is nothing fancy; siw.driver contains one line of text:

driver siw

Check the strace log to see whether it finds and tries to open this file for the device. If the file is missing, just create it yourself.

- Permission denied errors such as: "rping -s CMA: unable to open RDMA device Segmentation fault"
are usually related to missing udev rules. In that case only root can access the RDMA devices. Also, on some Debian systems 50-udev.rules contains some RDMA-related rules. Delete them!

- For more detailed debugging, try using strace with -f, something like $strace -f rping -s
It will give you tons of detail about what the system is doing: which files it opens, and which one failed, leading to the failure of the RDMA program. It is also useful for checking misconfigured library paths, as you can see whether ld checks all of them or not.

Tuesday, 22 November 2011

Remote Direct Memory Access, or RDMA, is a cool technology which enables applications to read and write remote memory directly from a NIC (of course, after some sort of setup). Think of it as networked DMA. People often associate RDMA with InfiniBand, which is fair, but there is a subtle difference. RDMA has its roots in the Virtual Interface Architecture (VIA), which was essentially developed to support user-level, fast, low-latency, zero-copy networks. However, VIA was an abstraction, not a concrete implementation. InfiniBand was one of the first concrete implementations of VIA and led to the development of concrete RDMA stacks. In the beginning, however, InfiniBand itself was badly fragmented. This was back in the late 90s and early 2000s.

Now fast forward to 2007. InfiniBand was a commercial success and had found an easy way into the HPC community, thanks to its ultra-low latency under stringent performance demands. It is also popular in other high-end data-intensive appliances. But what about the commodity stuff which runs mostly on IP-based networks, like data centers? Enter iWARP, or Internet Wide Area RDMA Protocol (don't ask me why it is called iWARP). It defines RDMA semantics on top of IP-based networks, which lets it run on widely popular commodity interconnect technologies such as Ethernet/IP/TCP.

Today there is another stack making a lot of buzz: RoCE (pronounced "Rocky"), or RDMA over Converged Ethernet. The main line of argument here is that since the L2 layer (Converged Enhanced Ethernet, CEE) is lossless, there is no need for the complicated IP and TCP stacks. RoCE puts RDMA semantics directly on top of Ethernet frames.

The point I am trying to make here is that the RDMA specification itself is just a set of abstractions and semantics; it is totally up to the developer of the stack how to implement it. There are also numerous proprietary implementations of RDMA around. And just as with the low-level stuff, there is no final word on higher-level stuff such as user-level APIs and libraries. Early on this led to very bad fragmentation of RDMA userspace: every InfiniBand vendor (in those days there was just one RDMA implementation, which was IB) seemed to have its own user-space libraries and access mechanisms for its RDMA hardware. These days the situation is much more coherent: the OpenFabrics Alliance (OFA) distribution of user-space libraries and APIs seems to be the de-facto RDMA standard (although certainly not the only one). They provide kernel-level support for RDMA and user-level libraries. Their distribution is called the OFA Enterprise Distribution, or OFED.

Since RDMA was initially developed for InfiniBand (IB), for historical reasons much of the RDMA code base and its abbreviations still use _ib_ or _IB_. But RDMA is most certainly not tied to that. There should be a clear separation between the RDMA concept and the transport (e.g. InfiniBand, iWARP, RoCE) which implements it. Also, the term RDMA is used today as an umbrella term which includes fancier operations beyond trivial remote memory reads and writes. Not all of these operations may be available on every RDMA transport implementation, but there are ways applications can probe for a transport's capabilities.

Another important aspect of this discussion is understanding how to write transport-agnostic RDMA code, considering that there is now more than one RDMA transport out there. RDMA transports differ in how they initiate and manage connections. To hide this complexity, OFA has developed the RDMA connection manager (distributed as librdmacm). The standard RDMA library (which, as you might have guessed, is called libibverbs; btw, "verbs" is nothing but a fancy name for API calls) also contains connection management calls which are IB-specific and do not make much sense for iWARP (running on TCP/IP). So legacy IB code has to be rewritten to link against librdmacm in order to become transport-agnostic. To one's surprise, quite a bit of code out there (including some benchmarks inside OFED) is IB-specific and will not run on iWARP.

With the standard OFED development environment, one just has to do:

gcc your_rdma_app.c -lrdmacm

I will soon write about how to set up an RDMA development environment entirely in software. No need for any fancy hardware, and one can see RDMA in action!

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

Saturday, 6 August 2011

In my experiments, I have found that using jumbo frames (9K MTU) eclipses many of the gains from other improvements in the network stack, such as GRO, LRO, interrupt coalescing, etc. Because of the large MTU size, the per-packet overhead is very small, which renders the other receive-side optimizations a little less effective.

Now I have started experimenting with the standard MTU size (1500 bytes) on 10 GbE. With a 9K MTU, I can easily get to the line speed of ~9850-9900 Mbps with 4K or bigger message sizes. But with a 1500-byte MTU, I cannot get past some odd ~9400 Mbps, even with a large message size of 1 MB. It is not CPU bound; both the rx- and tx-side CPUs were less than 100% loaded. Upon further investigation and calculation I understood that ~9400 Mbps is the theoretical application data limit on 10 GbE with a 1500-byte MTU. Let's break it down point by point.

So for every 1500 bytes transmitted on the wire, Ethernet transmits an additional 38 bytes (preamble and start-of-frame: 8, header: 14, FCS: 4, inter-frame gap: 12). And within the 1500-byte payload, apart from user data, we have 52 (20 + 32) bytes of IP and TCP headers (20 bytes of IP plus 32 bytes of TCP, including the timestamp option). So the net efficiency of the stack becomes

(1500 - 52) / (1500 + 38) = 0.9414, or 94.14%, which is exactly what you get as the end-to-end application data rate. This is called "protocol overhead". With jumbo frames we have the same calculation but with a 9000-byte MTU, so (9000 - 52) / (9000 + 38) = 0.9900, or 99.0%.

In these calculations I have ignored the VLAN extension to Ethernet, which adds another (optional) 4 bytes to the Ethernet frame, as well as the various optional TCP/IP headers. Any additional headers would just increase the protocol overhead.
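The efficiency formula is easy to keep around as a one-liner, assuming the same 52 header bytes and 38 bytes of per-frame Ethernet overhead as above:

```shell
#!/bin/sh
# eff MTU - application-data efficiency of a TCP/IP/Ethernet link for a given
# MTU, assuming 52 bytes of TCP/IP headers per segment and 38 bytes of
# per-frame Ethernet overhead (preamble, header, FCS, inter-frame gap).
eff() { awk -v mtu="$1" 'BEGIN { printf "%.4f\n", (mtu - 52) / (mtu + 38) }'; }

eff 1500   # -> 0.9415
eff 9000   # -> 0.9900
```

(printf rounds the last digit, hence 0.9415 here versus the truncated 94.14% in the text.)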

Thursday, 30 June 2011

I needed to do some profiling using netcat and time, and it turns out that there is more to those commands than meets the eye ;)

Case netcat: netcat is a super awesome networking utility. I wanted to test how long it takes to transfer a large file from a cold cache start. So on the server side I did

nc -v -l 5001

and then on the client side I had

nc -v ip 5001 < file_name

and this is perfectly sane. It worked like a charm. But then I moved to another pair of boxes (btw, both systems are running Debian testing Wheezy; two pairs of boxes in total). On the new pair, when I start nc in listen mode I get:

5001: inverse host lookup failed: Unknown host

listening on [any] 47022 ...

This is not what I wanted. Additionally, the file transfer does not terminate properly, which was essential to my benchmarking. After a couple of hours of staring at the nc code, I later realized that there are a couple of variants around, notably nc.traditional and nc.openbsd. Things just work fine with the openbsd version. The man page is written for the openbsd version (the file transfer example above is copied from the man page). So I installed the openbsd version with apt-get install netcat-openbsd, which updated the nc link in /bin to point to it. Since then things have been back to normal. I still don't know what is missing (could not find out either) or what the exact difference between the two is, but for now examples from the world of man pages make sense again!

Case time: time is another useful command which can give you a lot of useful information about process stats. But apparently, the version which is built into bash is badly out of sync with the man page. From what I gathered, there are two versions of time: one built into bash, the other /usr/bin/time. When you just run time, the built-in one gets invoked; it ignores all the parameters and even complains about them. This is not what you would expect after reading the man page of time. For example, I was trying:

$time -f "%P" ls

-bash: -f: command not found

It even refuses to accept the command. Then after a while I figured out the difference between the two versions:

$ /usr/bin/time -f "%P" ls

. ..

0%

It worked perfectly! Again, the world of magnets and miracles, a.k.a. man pages, started to make sense again.

Wednesday, 4 May 2011

Lately I have been doing a lot of TCP performance analysis under different configuration settings. TCP is a very complex protocol and has plenty of knobs you can play with. Linux's TCP implementation itself is messy (in the most positive sense) enough, and requires quite a bit of know-how and tool expertise. I am playing with:
- 1GbE and 10GbE
- variable send and recv user buffer sizes (what is passed to the send/recv calls)
- variable send and recv socket buffer sizes (what is passed to the setsockopt call as SO_SNDBUF and SO_RCVBUF)
- Different MTU sizes (for now just sticking with 1500 and 9000 bytes)
- Different interrupt coalescing and offloading settings (primarily LRO and GRO)
And to make matters worse, I have 2 pairs of machines with different generations of CPUs and memory bandwidth.

tcpdump is an excellent tool which gives basic information about TCP behavior on the wire, showing all the standard information that can be extracted from a TCP header. The primary limitation I felt was that it does not give any peek into Linux's implementation (which, I guess, it is not supposed to do either). It also has a non-negligible measurement overhead (I will post numbers soon). When collecting snapshots, it is sometimes desirable to see some internal details about how a particular OS (here Linux) sees a connection. So two options come to the rescue:

b) Use tcp_probe.ko to hook into TCP stream processing inside the kernel. The main advantage of this approach is that it allows you to selectively recompile the kernel module, without having to recompile the whole kernel, to export something peculiar. A quick tutorial:

Step 1: Insert tcp_probe.ko (if you are going to selectively recompile the module, then I recommend going to linux/net/ipv4/ and doing insmod tcp_probe.ko instead of modprobe tcp_probe). At insertion time it takes two parameters: port, and full or not. port is the port you want to see activity on (or 0 for all); it can be either the source or the destination port. full selects whether to log only when the congestion window changes, or to log everything. Complete logging "can be" expensive (but I don't find it that much). I prefer complete logging.

On directly end-to-end connected hosts (switch-less) you do not have many packet losses, and the window grows quickly. Hence you get a straight line, or no samples at all, in the log. So I prefer using full logging.
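Once you have a capture (e.g. cat /proc/net/tcpprobe > /tmp/tcpprobe.log as root), the interesting columns can be pulled out for plotting. A sketch, assuming the stock tcpprobe line layout, where the timestamp is column 1 and snd_cwnd is column 7; double-check the columns against your kernel's tcp_probe.c:

```shell
#!/bin/sh
# extract_cwnd LOGFILE - print "timestamp snd_cwnd" pairs from a tcp_probe
# capture, assuming the stock line layout:
#   time src dst length snd_nxt snd_una snd_cwnd ssthresh snd_wnd srtt
extract_cwnd() { awk '{ print $1, $7 }' "$1"; }

# usage: extract_cwnd /tmp/tcpprobe.log > cwnd.dat   # then plot cwnd.dat
```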

Now, how to change it to export some custom stuff from inside the kernel? Simple.

Step #2: Work out what to export from the kernel in http://lxr.linux.no/linux+v2.6.38/net/ipv4/tcp_probe.c#L91. In this function (jtcp_rcv_established) you have access to all the cool stuff inside the Linux kernel: struct sock, struct sk_buff, and struct tcp_sock. Grab whatever you want to export and save it to the log. For example, to export the total number of retransmissions, just add p->total_retrans = tp->total_retrans;

Thursday, 28 April 2011

Often, working in GUI mode, I cannot help overflowing the "standard" good practice of sticking to 80-character columns. So gvim comes to the rescue: set this up in your ~/.gvimrc or ~/.vimrc file and it will highlight the text which overflows, as a visual aid:
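One common way to do this (an assumption on my part, not necessarily the exact snippet; the colors are just an example, and colorcolumn needs Vim 7.3 or newer):

```vim
" Highlight everything past column 80.
highlight OverLength ctermbg=red guibg=#592929
match OverLength /\%81v.\+/

" On Vim 7.3+, a colored marker column works too:
"   set colorcolumn=81
```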