We are benchmarking the performance of some QNX drivers for a Cortex-A7 based system. For the serial UART driver, the performance numbers for transmitting a large chunk of bytes are slightly on the low side. We find that there is some delay introduced by our driver in the total time taken to transfer the bytes. Here are the numbers we see while transferring 40000 bytes.

How are you getting data to your serial driver? Are you reading the 40K bytes from disk or are they already in RAM and ready to pass to the driver? If they are not in RAM you might consider that the disk I/O could be a factor.

The 40K byte array is located in DDR and is allocated statically as a global buffer in the test application. The test application calls write() to transfer the bytes using the file descriptor. The default ibuf and obuf sizes were 2k; I tried increasing them to 10k but the numbers didn't change. Will give it a try with 32k and 40k.

One more observation: the delay increases/decreases as a function of baud rate, so at 57600 baud the delay is larger than at 115200 baud. In the ideal scenario the time taken to transmit 40k bytes at 57600 should be around 5.7 secs; we observe 6.9 seconds with our driver.

You should check that the driver is making proper use of the UART FIFO. The point in the FIFO at which an interrupt occurs is programmable, and the driver's default may be tuned to minimize latency instead of maximizing throughput.

Make sure that all unused tty features are turned off.

Do not use putc() or putchar(). Instead use fwrite(), or better, try the unbuffered open()/write() calls.

That said, it might help to know how QNX does things differently than Linux. With Linux the write buffer is probably passed to the driver as a pointer, so data can be loaded directly into the hardware. With QNX the data is passed to the driver in a message. My intuition is that at 115200 baud on a 600 MHz processor this might be an issue, but your report that the delay is worse at 57600 baud indicates otherwise.

How are you connected to your board? Do you have a keyboard + monitor directly connected to it or are you sending/receiving data+keystrokes from a serial connection (ie remote debugging)? If it's the latter then it could be that connection to your board that slows things down slightly.

Our target board is connected to the host through an FTDI chip over a serial-over-USB connection. We are running the test application on the target, which transmits the 40k bytes continuously to the host machine.

The FIFO is enabled, and with different watermark level settings the performance does not show any difference.

As suggested, we increased the ibuf/obuf size to 10k; with this change we did see some difference in the measured performance.

For all the time measurements we are using the time shell utility provided by ksh. We run the performance test application from the shell with the following command to measure the time taken by the test app, e.g.: time uart-perf

One interesting observation is that if we use the clock_gettime() API to measure the time taken by the write() call to transmit the buffer, the time taken is about 2.6 seconds. Below is the code snippet from the test application we are using to measure the performance:

    if( clock_gettime( CLOCK_REALTIME, &start ) == -1 ) {
        perror( "clock gettime" );
        return EXIT_FAILURE;
    }

    write( fd, gTxBuf, 40000 );

    if( clock_gettime( CLOCK_REALTIME, &stop ) == -1 ) {
        perror( "clock gettime" );
        return EXIT_FAILURE;
    }

    accum = ( stop.tv_sec - start.tv_sec )
          + (double)( stop.tv_nsec - start.tv_nsec ) / (double)BILLION;
    printf( "\n\n\n Time taken is %lf\n", accum );

Measured this way, the time taken is about 2.6 seconds, but the time shell command continues to report 3.45 seconds as the time taken to execute the test application. This difference of about 0.9 seconds is what we are not able to account for. The delay is dependent on the baud rate: it increases if we reduce the baud rate.

To rule out the possibility of the delay being caused by the test application, we introduced a delay of 500 ms instead of the write() call; with that change the time shell utility value and the clock_gettime() value matched almost exactly.

Interestingly, a similar test carried out on Linux gives the same values from the time shell command and the clock_gettime() API, with a minimal delta.

I'm suspicious about using the "time" command here. The obvious question is how "time" measures CPU time vs. real time. Though there may be one, I've never seen a QNX interface that provides the CPU time used by a process. There are a number of complications in what this would mean. For example, a process could have multiple threads on a multi-core system and use more CPU time than real time. Also with QNX there is the question of how to account for a server providing service. On a Linux system, to do a system call the user application jumps into the kernel, but I think the meter continues to run for the user. On QNX this can't happen.

Here are two simple ways to figure out if this is the issue. Write a simple program that does the following:

get starting time
run test-program
get ending time
report total time

Or: increase the amount of data you send in the test by a factor of 10, then use a stopwatch.

I hope you understand that when you do a "time myProgram" from the shell, the time reported is the total execution time of myProgram. This includes the time needed to load the program into memory, the time needed to execute the program, and then the time to exit.

If your program resides on a slow physical device (USB, floppy) or even a medium-speed physical device (hard drive), it's going to report slower times than a fast physical device (SSD, or better yet a RAM drive). Is it possible that your Linux machine has much faster load times for your program than your board?

Also, is your program printing anything to the screen? That might add more time.

The clock_gettime() call is the most accurate way to show how long just the write() call took. Since it reports 2.6 seconds, I'd say that's how long it's taking. The other 0.9 seconds is probably load time from the physical medium. You could verify whether this is the case by creating a RAM drive:
1) Include devb-ram in your boot image.
2) Create a 5 MB RAM drive with 'devb-ram capacity=10000 &'.
3) Mount the RAM drive with 'mount -t qnx4 /dev/hd1t77 /fs/ram'.
4) Copy your program to the RAM drive: 'cp myProgram /fs/ram'.
5) Run from the RAM drive with 'time /fs/ram/myProgram'.