Performance problems on SCM-i.MX6Q (as compared to TI OMAP4460)

I am working on a bare-metal project which involves running a real-time protocol encode/decode application. The project originally used TI's OMAP4460 device (containing a dual-core ARM Cortex-A9 clocked at 1.2 GHz), but for a number of reasons (mostly hardware related) we have moved to the SCM-i.MX6Q. I am evaluating the performance of the SCM-i.MX6Q using an off-the-shelf QWKS-SCMIMX6 development board, comparing it to the OMAP.

Our evaluation runs some sample protocol encode/decode routines on arrays of data in memory (so it does no external I/O). It all runs on a single ARM core, the others being disabled. We only have access to the object libraries for this evaluation code (provided by a partner company in the project). These are built using TI Code Composer, and are in fact the same object code whether I run on the OMAP or the SCM-i.MX6Q (i.e. I can link exactly the same libraries into my OMAP project as into my SCM-i.MX6Q project). In both cases the device initialisations have come from the standard U-Boot sources for each device type (clock and memory configurations, DCD configuration, etc.).

In the SCM-i.MX6Q the ARM is clocked at 800 MHz and in the OMAP at 1200 MHz; the OMAP also uses the same PoP LPDDR2 RAM as the SCM-i.MX6Q, so running the same code in the same circumstances I would expect roughly two thirds of the OMAP's performance on the SCM-i.MX6Q. Unfortunately the performance difference I see is huge:

OMAP4460 Performance:

Overall decoding and encoding finished: 16757906 = 16.757 ms

NXP Performance:

Overall decoding and encoding finished: 136581353 = 136.581 ms

The OMAP is more than 8 times faster! These timings are taken using the internal ARM performance counter. The test routines run exclusively on the processor with nothing else running and interrupts disabled, so it is pure "number crunching". The test is very processor and memory intensive.

I have the L1 I/D caches enabled, the L2 cache is enabled, and the MMU is configured to map all of the LPDDR2 RAM addresses as cacheable (TTB_ENTRY_SUPERSEC_NORM equ 0x55C06). The clock settings appear to match what I see if I boot Linux but stop in U-Boot and display the clock settings. Using the same display code from U-Boot built into my project (after my initialisation of the hardware) I see the clock settings below, which match what I see when U-Boot does the initialisation.

Clock Settings:

PLL_SYS        792 MHz
PLL_BUS        528 MHz
PLL_OTG        480 MHz
PLL_NET         50 MHz
ARMCLK      792000 kHz
IPG          66000 kHz
UART         80000 kHz
CSPI         60000 kHz
AHB         132000 kHz
AXI         198000 kHz
DDR         396000 kHz
USDHC1      198000 kHz
USDHC2      198000 kHz
USDHC3      198000 kHz
USDHC4      198000 kHz
EMI SLOW     99000 kHz
IPG PERCLK   66000 kHz

I am beginning to think that the problem has something to do with the L1 cache in the SCM-i.MX6Q. If I do not enable the L1 cache on the SCM-i.MX6Q I see only a small difference in performance, whereas doing the same in my test on the OMAP makes a huge difference (the OMAP's encode/decode time becomes 85 ms). Is there something I am missing about configuring the L1 cache which is different from the ARM in the OMAP?

Clearly I am using different build environments: Code Composer for the OMAP and IAR Embedded Workbench for the SCM-i.MX6Q. So, just in case the different C libraries were the cause of the problem, I have tried building my project for the SCM-i.MX6Q using TI's C library instead of the one provided by IAR. It makes no difference at all to the timings.

I have been investigating this problem for some time now and am really running out of ideas as to why there is such a large difference in performance. Either there is some device configuration I have overlooked, or there really is a big difference in the architecture between these two devices which is beyond my control. Any help would be much appreciated!

I have used the DCD data from the U-Boot file imximage_csm_lpddr2.cfg and created my own initialisation table from it; this is then processed by my startup code (running in OCRAM) rather than being done by the Boot ROM directly. It uses the DCD data with CONFIG_INTERLEAVING_MODE defined and with neither CONFIG_SCM_LPDDR2_512MB nor CONFIG_SCM_LPDDR2_2GB defined (so I get the data for 1 GB). I have attached my boot_DCD.c initialisation code to the original post above.

Unfortunately, since we do not own the object libraries used in the test, it is difficult for me to make my application available to you.

I'm not sure what you mean by the "bufferable flag in L1". I am currently configuring the page tables to map the whole of the LPDDR2 using 16 MB supersections, with the value 0x55C06 in every page table entry.

My page table entries actually use a different way of defining the cache control which is not shown in your table: there is one more encoding using the TEX field, as follows.

TEX  C  B
1BB  A  A    Cached memory

BB = Outer policy
AA = Inner policy

See Table 6-3, which then defines:

BB or AA bits   Cache policy
b00             Noncacheable
b01             Write-Back cached, Write Allocate
b10             Write-Through cached, No Write Allocate
b11             Write-Back cached, No Write Allocate

So my entry of 0x55C06 defines both inner and outer as "Write-Back cached, Write Allocate". This is also what I used on the OMAP.

I tried changing to 0x50C0E, which sets TEX=b000, C=1, B=1, and it makes no difference to the speed. If I change to 0x50C0A (TEX=b000, C=1, B=0, i.e. write-through) then my test runs at about half the speed (roughly 270 ms).