
Performance Tuning on the Blackfin Processor

3.
Introduction
The first level of optimization is provided by the compiler.
The remaining portion of the optimization comes from techniques performed at the "system" level:
- Memory management
- DMA management
- Interrupt management
The purpose of this presentation is to help you understand how some of the system aspects of your application can be managed to tune performance.

6.
The World Leader in High Performance Signal Processing Solutions Creating a Framework

7.
The Importance of Creating a "Framework"
Framework definition: the software infrastructure that moves code and data within an embedded system.
Building this early on in your development can pay big dividends.
There are three categories of frameworks that we see consistently from Blackfin developers.

9.
Processing on the Fly
Can't afford to wait until large video buffers are filled in external memory.
Instead, data can be brought into on-chip memory immediately and processed line by line.
This approach can lead to quick results in decision-based systems.
The processor core can directly access lines of video in on-chip memory.
Software must ensure the active video frame buffer is not overwritten until processing on the current frame is complete.
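A minimal sketch of this line-by-line, double-buffered flow in C. The dma_queue_line_rx(), dma_wait_line_done(), and detect_obstacle() helpers are hypothetical placeholders for your actual DMA descriptor setup and per-line algorithm; only the ping-pong buffering structure is the point here.

```c
#include <stdint.h>

#define LINE_PIXELS     720
#define LINES_PER_FRAME 480

/* Two line buffers in L1 data memory: one filled by DMA while the core
 * processes the other (ping-pong / double buffering).                    */
static uint8_t line_buf[2][LINE_PIXELS];

extern void dma_queue_line_rx(uint8_t *dst, unsigned bytes); /* hypothetical */
extern void dma_wait_line_done(void);                        /* hypothetical */
extern int  detect_obstacle(const uint8_t *line, unsigned pixels);

int process_frame(void)
{
    int warning = 0;
    unsigned fill = 0;                          /* buffer the DMA is filling */

    dma_queue_line_rx(line_buf[fill], LINE_PIXELS);

    for (unsigned line = 0; line < LINES_PER_FRAME; line++) {
        dma_wait_line_done();                   /* line in buf 'fill' ready  */
        unsigned ready = fill;

        if (line + 1 < LINES_PER_FRAME) {       /* start the next line now   */
            fill ^= 1u;
            dma_queue_line_rx(line_buf[fill], LINE_PIXELS);
        }

        /* Core works on the completed line while DMA fills the other one. */
        warning |= detect_obstacle(line_buf[ready], LINE_PIXELS);
    }
    return warning;                             /* collision-warning flag    */
}
```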

10.
Example of a "Processing-on-the-Fly" Framework: Collision Avoidance
[Diagram: 720 x 480 video is brought in by DMA, one line at a time, into L1 memory, where the core has single-cycle access and issues a collision warning; frame rate 33 msec, line rate 63 µsec.]
Data is processed one line at a time instead of waiting for an entire frame to be collected.
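The frame and line rates in the diagram can be sanity-checked with simple arithmetic (assuming 30 frames/s and roughly 525 line periods of NTSC-style timing, which is an assumption here): 1/30 s ≈ 33 ms per frame, and 33 ms ÷ 525 ≈ 63 µs per line.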

11.
"Programming Ease Rules"
Strives to achieve the simplest programming model at the expense of some performance.
Focus is on time-to-market.
Optimization isn't as important as time-to-market; it can always be revisited later.
Provides a nice path for upgrades!
Easier to develop for both novices and experts.

13.
"Performance Rules"
Bandwidth-efficient: strives to attain the best performance, even if the programming model is more complex.
Might include targeted assembly routines.
Every aspect of data flow is carefully planned.
Allows use of less expensive processors because the device is "right-sized" for the application.
Caveats:
- Might not leave enough room for extra features or upgrades
- Harder to reuse

20.
The World Leader in High Performance Signal Processing Solutions Cache

21.
Configuring Internal Instruction Memory as Cache
[Diagram: foo1(), foo2(), foo3() in external memory are fetched through a 4-way set-associative L1 instruction cache into the core.]
Example: L1 instruction memory configured as a 4-way set-associative cache.
Instructions are brought into internal memory, where single-cycle performance can be achieved.
Instruction cache provides several key benefits that increase performance:
- Usually provides the highest-bandwidth path into the core
- For linear code, the next instructions will be on the way into the core after each cache miss
- The most recently used instructions are the least likely to be replaced
- Critical items can be locked in cache, by line
- Instructions in cache can execute in a single cycle

22.
Configuring Internal Data Memory as Cache
[Diagram: volatile data arrives from a peripheral via DMA, while static data in external memory is read through the data cache into the core.]
Example: data brought in from a peripheral.
Data cache also provides a way to increase performance:
- Usually provides the highest-bandwidth path into the core
- For linear data, the next data elements will be on the way into the core after each cache miss
- The write-through option keeps "source" memory up to date: data is written to source memory every time it is modified
- The write-back option can further improve performance: data is written to source memory only when it is replaced in the cache
"Volatile" buffers must be managed to ensure coherency between DMA and cache.

23.
Write-Back vs. Write-Through
Write-back is usually 10-15% more efficient, but it is algorithm dependent.
Write-through is better when coherency between more than one resource is required.
Make sure you try both options when all of your peripherals are running.

24.
The World Leader in High Performance Signal Processing Solutions DMA

25.
Why Use a DMA Controller?
The DMA controller runs independently of the core.
The core should only have to set up the DMA and respond to interrupts.
Core processor cycles are available for processing data.
DMA can allow creative data movement and "filtering", saving potential core passes to re-arrange the data.
[Diagram: a descriptor list directs the DMA controller to move data from a peripheral (data source) into a buffer in internal memory; the core runs an interrupt handler when the transfer completes.]
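A minimal sketch of setting up one receive DMA channel so the core only configures the transfer and then returns to useful work. The pDMA_* pointers and the DMAEN/WNR/WDSIZE_16/DI_EN masks mimic the names in the ADI cdefBFxxx.h / defBFxxx.h headers; the placeholder definitions below are assumptions to keep the sketch self-contained, so use the real headers for your derivative in an actual project.

```c
#include <stdint.h>

extern void * volatile   *pDMA_START_ADDR;   /* stands in for e.g. pDMA1_START_ADDR */
extern volatile uint16_t *pDMA_X_COUNT;
extern volatile uint16_t *pDMA_X_MODIFY;
extern volatile uint16_t *pDMA_CONFIG;

#define DMAEN     0x0001u   /* assumed: enable channel                 */
#define WNR       0x0002u   /* assumed: DMA writes to memory (receive) */
#define WDSIZE_16 0x0004u   /* assumed: 16-bit element size            */
#define DI_EN     0x0080u   /* assumed: interrupt when transfer done   */

static int16_t rx_buffer[256];               /* destination in internal memory */

void start_rx_dma(void)
{
    *pDMA_START_ADDR = rx_buffer;            /* where the data lands           */
    *pDMA_X_COUNT    = 256;                  /* number of elements             */
    *pDMA_X_MODIFY   = sizeof(int16_t);      /* byte stride between elements   */

    /* Kick off the transfer; the completion interrupt (DI_EN) tells the
     * core when rx_buffer is ready to be processed.                      */
    *pDMA_CONFIG = (uint16_t)(DMAEN | WNR | WDSIZE_16 | DI_EN);
}
```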

27.
DMA and Data Cache Coherence
Coherency between the data cache and a buffer filled via DMA must be maintained.
The cache must be "invalidated" to ensure "old" data is not used when processing the most recent buffer.
Invalidate buffer addresses or the actual cache lines, depending on the size of the buffer.
Interrupts can be used to indicate when it is safe to invalidate the buffer for the next processing interval.
This often provides a simpler programming model (with less of a performance increase) than a pure DMA model.
[Diagram: data brought in from a peripheral alternates between volatile buffer0 and volatile buffer1; on the "new buffer" interrupt, the core invalidates the cache lines associated with that buffer and then processes it.]
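A sketch of the invalidate step, assuming a 32-byte cache line and a FLUSHINV_LINE() placeholder that you would map to the Blackfin FLUSHINV instruction through your toolchain's inline assembly or intrinsic; both the line size and the macro name are assumptions to verify against your part and compiler.

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 32u   /* assumed data-cache line size */

/* Placeholder: replace with the FLUSHINV instruction for one cache line,
 * issued via your toolchain's inline asm or intrinsic.                    */
#define FLUSHINV_LINE(p) ((void)(p))

/* Invalidate the data-cache lines covering a DMA-filled buffer before the
 * core reads it, so stale cached data is not used.                        */
static void invalidate_buffer(const void *buf, size_t bytes)
{
    uintptr_t addr = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE_BYTES - 1u);
    uintptr_t end  = (uintptr_t)buf + bytes;

    for (; addr < end; addr += CACHE_LINE_BYTES)
        FLUSHINV_LINE((void *)addr);          /* line-by-line invalidate */

    /* Follow with the synchronization (e.g. ssync) your part requires
     * before the core reads the buffer.                                  */
}
```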

29.
Is the data volatile or static?
- Static: Will the buffers fit into internal memory?
  - Yes: map to internal memory; single-cycle access is achieved.
  - No: map to external memory and turn the data cache on; the desired performance is achieved.
- Volatile: map to cacheable memory locations and partition the data. Is DMA part of the programming model?
  - No: turn the data cache on; the desired performance is achieved.
  - Yes: Is the buffer larger than the cache size?
    - No: invalidate using the "invalidate" instruction before the read.
    - Yes: invalidate with direct cache-line access before the read.
Programming effort increases as you move across the chart.

33.
Instruction and Data Fetches
In a single core clock cycle, the processor can perform one 64-bit instruction fetch and either:
- Two 32-bit data fetches, or
- One 32-bit data fetch and one 32-bit data store
The DMA controller can also be running in parallel with the core without any "cycle stealing".

38.
The World Leader in High Performance Signal Processing Solutions Benchmarks

39.
Important Benchmarks
Core accesses to SDRAM take longer than accesses made by the DMA controller.
For example, Blackfin processors with a 16-bit external bus behave as follows:
- 16-bit core reads take 8 system clock (SCLK) cycles
- 32-bit core reads take 9 SCLK cycles
- 16-bit DMA reads take ~1 SCLK cycle
- 16-bit DMA writes take ~1 SCLK cycle
Bottom line: data is most efficiently moved with the DMA controllers!
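A quick worked comparison using the figures above (sketch arithmetic only): moving a 1024-element buffer of 16-bit data with core reads costs roughly 1024 × 8 = 8192 SCLK cycles, while the same buffer moved by the DMA controller costs roughly 1024 × 1 ≈ 1024 SCLK cycles, about an 8x difference before any other bus traffic is considered.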

40.
The World Leader in High Performance Signal Processing Solutions Managing Shared Resources

41.
External Memory Interface: ADSP-BF561
The core has priority to the external bus unless the DMA is urgent.*
You are able to program this so that the DMA has a higher priority than the core.
Core A is higher priority than Core B (the priority is programmable).
* "Urgent" DMA implies data will be lost if the access isn't granted.

42.
Priority of Core Accesses and the DMA Controller at the External Bus
A bit within the EBIU_AMGCTL register can be used to change the priority between core accesses and the DMA controller.
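A hedged sketch of flipping that priority bit; the pointer and bit names below (pEBIU_AMGCTL, CDPRIO) follow the ADI headers, but the bit position is an assumption to verify against the defBFxxx.h header for your derivative.

```c
#include <stdint.h>

extern volatile uint16_t *pEBIU_AMGCTL;   /* from cdefBFxxx.h in a real project */
#define CDPRIO 0x0100u                    /* assumed bit position; check defBFxxx.h */

void give_dma_priority_over_core(void)
{
    *pEBIU_AMGCTL |= CDPRIO;              /* DMA wins over the core at the EBIU */
}
```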

43.
What Is an "Urgent" Condition?
- When the DMA FIFO is empty and the peripheral is transmitting
- When the DMA FIFO is full and the peripheral has a sample to send
[Diagram: in both cases the DMA FIFO sits between the peripheral and external memory.]

45.
The World Leader in High Performance Signal Processing Solutions Tuning Performance

46.
Managing External Memory
Transfers in the same direction are more efficient than intermixed accesses in different directions.
[Diagram: the core and the DMA controller share the external bus to external memory.]
Group transfers in the same direction to reduce the number of turn-arounds.

47.
Improving Performance
DMA traffic control improves system performance when multiple DMAs are ongoing (a typical system).
Multiple DMA accesses can be done in the same "direction", for example into SDRAM or out of SDRAM.
This makes more efficient use of SDRAM.

48.
DMA Traffic Control
DEB_Traffic_Period has the biggest impact on improving performance.
The correct value is application dependent, but if 3 or fewer DMA channels are active at any one time, a larger value (15) of DEB_Traffic_Period is usually better. When more than 3 channels are active, a value closer to the middle of the range (4 to 7) is usually better.
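As an illustration, a hedged sketch of programming this value; the register name (DMA_TC_PER) and the position of the DEB_Traffic_Period field (assumed here to be the low 4 bits) follow the ADI documentation for some derivatives but must be checked against your part's header before use.

```c
#include <stdint.h>

extern volatile uint16_t *pDMA_TC_PER;        /* from cdefBFxxx.h in a real project */

/* e.g. pass 15 when 3 or fewer DMA channels are active, 4..7 otherwise. */
void set_deb_traffic_period(uint16_t period)
{
    /* DEB_Traffic_Period is assumed to occupy the low 4 bits of DMA_TC_PER. */
    *pDMA_TC_PER = (uint16_t)((*pDMA_TC_PER & ~0x000Fu) | (period & 0x000Fu));
}
```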

49.
The World Leader in High Performance Signal Processing Solutions Priority of DMAs

50.
Priority of DMA Channels
Each DMA controller has multiple channels.
If more than one DMA channel tries to access the controller, the highest-priority channel wins.
The priority of the DMA channels is programmable.
When more than one DMA controller is present, the priority of arbitration between them is programmable.
The DMA queue manager provided with System Services should be used.
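One way the channel priority is "programmable" at the register level: lower-numbered channels win arbitration, and a peripheral can be moved to a lower-numbered (higher-priority) channel through its DMAx_PERIPHERAL_MAP register. The sketch below assumes the PMAP field occupies the upper 4 bits of that register; verify the pointer name and field position in the header for your derivative, and prefer letting the System Services DMA manager handle this when you use it.

```c
#include <stdint.h>

extern volatile uint16_t *pDMA0_PERIPHERAL_MAP;   /* from cdefBFxxx.h in a real project */

/* Assign the peripheral identified by pmap_id (assumed 4-bit code) to
 * DMA channel 0, the highest-priority channel.                          */
void map_peripheral_to_channel0(uint16_t pmap_id)
{
    *pDMA0_PERIPHERAL_MAP = (uint16_t)((*pDMA0_PERIPHERAL_MAP & 0x0FFFu)
                                       | (uint16_t)(pmap_id << 12));
}
```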

52.
Common Mistakes with Interrupts
Spending too much time in an interrupt service routine prevents other critical code from executing.
From an architecture standpoint, interrupts are disabled once an interrupt is serviced.
Higher-priority interrupts are re-enabled once the return address (the RETI register) is saved to the stack.
It is important to understand your application's real-time budget:
- How long is the processor spending in each ISR?
- Are interrupts nested?
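A sketch of keeping an ISR short while allowing nesting: on Blackfin, interrupts stay globally disabled inside a handler until RETI is saved to the stack, and the VisualDSP++ run-time's EX_REENTRANT_HANDLER macro (name assumed here; verify it in <sys/exception.h> for your toolchain version) generates that prologue so higher-priority interrupts can preempt the handler.

```c
#include <sys/exception.h>        /* VisualDSP++ run-time exception/interrupt macros */

volatile int frame_ready;         /* flag polled by the main loop */

/* Re-entrant handler: RETI is pushed on entry, re-enabling higher-priority
 * interrupts while this routine runs.                                       */
EX_REENTRANT_HANDLER(ppi_frame_isr)
{
    /* Do the minimum here: acknowledge the interrupt source (part-specific,
     * not shown) and set a flag; the heavy processing runs in the main loop
     * so neither lower- nor higher-priority events are starved.             */
    frame_ready = 1;
}
```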

53.
Programmer Options
Program the most important interrupts as the highest priority.
Use nesting to ensure higher-priority interrupts are not locked out by lower-priority events.
Use the callback manager provided with System Services: interrupts are serviced quickly and higher-priority interrupts are not locked out.

54.
The World Leader in High Performance Signal Processing Solutions An Example

60.
Bandwidth Used by DMA
- Input data stream: 1 MB/s
- Reference data in: 30 MB/s
- In-loop filter data in: 15 MB/s
- Reference data out: 15 MB/s
- In-loop filter data out: 15 MB/s
- Video data out: 27 MB/s
Calculations are based on 30 frames per second.
Be careful not to simply add up the bandwidths! Shared resources, bus turnaround, and concurrent activity all need to be considered.
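A back-of-the-envelope check (the 133 MHz SCLK and 16-bit SDRAM bus below are assumptions, not figures from the example): the streams sum to 1 + 30 + 15 + 15 + 15 + 27 = 103 MB/s, against a theoretical peak of roughly 133 MHz × 2 bytes ≈ 266 MB/s. On paper that leaves headroom, but read/write turn-arounds, refresh cycles, page misses, and core accesses to the same SDRAM all eat into it, which is why the streams must be scheduled and grouped by direction rather than simply summed.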

61.
Summary
The compiler provides the first level of optimization.
There are some straightforward steps you can take up front in your development that will save you time.
Blackfin processors have lots of features that help you achieve your desired performance level.
Thank you