Design Issues in VLSI Implementation of Image Processing Hardware Accelerators
Methodology and Implementation

Hongtu Jiang
Lund 2007

Department of Electroscience
Lund University
Box 118, S-221 00 LUND, SWEDEN

This thesis is set in Computer Modern 10pt with the LaTeX documentation system on 100 gr Colortech+™ paper.
No. 66
ISSN 1402-8662
© Hongtu Jiang, January 2007

Abstract

With the increasing capacity of today's hardware systems enabled by technology scaling, image processing algorithms of substantially higher complexity can be implemented on a single chip with real-time performance. Combined with the demand for low power consumption or larger resolution seen in many applications, such as mobile devices and HDTV, new design methodologies and hardware architectures are constantly called for to bridge the gap between designers' productivity and what the technology can offer.

This thesis addresses several issues commonly encountered in the implementation of real-time image processing systems. Two implementations are presented, each focusing on different design issues in hardware design for image processing systems.

In the first part, a real-time video surveillance system is presented by combining five papers. The segmentation unit is part of a real-time automated video surveillance system developed at the department, aimed at tracking people in an indoor environment. Alternative segmentation algorithms are elaborated, and various modifications to the selected segmentation algorithm are made for potential hardware efficiency. To address the memory bandwidth issue, which is identified as the bottleneck of the segmentation unit, combined memory bandwidth reduction schemes based on pixel locality and wordlength reduction are utilized, resulting in an over 70% memory bandwidth reduction. Together with morphology, labeling and tracking units developed by two other Ph.D. students, the whole surveillance system is prototyped on a Xilinx Virtex-II Pro VP30 FPGA, with real-time performance at a frame rate of 25 fps with a resolution of 320 × 240.

For the second part, two papers are extended to discuss issues of controller design and the implementation of control-intensive algorithms. To avoid the tedious and error-prone procedure of hand-coding FSMs in VHDL, a controller synthesis tool is modified to automate a controller design flow from a C-like control algorithm specification to a controller implementation in VHDL. To address issues of memory bandwidth as well as power consumption, a three-level memory hierarchy is implemented, reducing the off-chip memory bandwidth from N² accesses per clock cycle to only one per pixel operation. Furthermore, a potential power consumption reduction of over 2.5 times can be obtained with the architecture. Together with a controller synthesized from the developed tool, a real-time image convolution system is implemented on a Xilinx Virtex-E FPGA platform.

Contents

Abstract
Contents
Preface
Acknowledgment
List of Acronyms

General Introduction
1 Overview
  1.1 Thesis Contributions
  1.2 Thesis Outline
2 Hardware Implementation Technologies
  2.1 ASIC vs. FPGA
  2.2 Image Sensors
  2.3 Memory Technology
  2.4 Power Consumption in Digital CMOS Technology

Hardware Accelerator Design of an Automated Video Surveillance System
1 Segmentation
  1.1 Introduction
  1.2 Alternative Video Segmentation Algorithms
  1.3 Algorithm Modifications
  1.4 Hardware Implementation of Segmentation Unit
  1.5 System Integration and FPGA Prototype
  1.6 Results
  1.7 Conclusions
2 System Integration of Automated Video Surveillance System
  2.1 Introduction
  2.2 Segmentation
  2.3 Morphology
  2.4 Labeling
  2.5 Feature Extraction
  2.6 Tracking
  2.7 Results
  2.8 Conclusions

Controller Synthesis in Real-time Image Convolution Hardware Accelerator Design
1 Introduction
  1.1 Motivation
  1.2 FSM Encoding
  1.3 Architecture Optimization
  1.4 Memories and Address Processing Unit
  1.5 Conclusion
2 Controller Synthesis in Image Convolution Hardware Accelerator Design
  2.1 Introduction
  2.2 Two Dimensional Image Convolution
  2.3 Controller Synthesis
  2.4 Results
  2.5 Conclusions

Preface

This thesis summarizes my academic work in the digital ASIC group at the Department of Electroscience, Lund University, for the Ph.D. degree in circuit design. The main contribution of the thesis is derived from the following publications:

H. Jiang and V. Öwall, "FPGA Implementation of Real-time Image Convolutions with Three Level of Memory Hierarchy," in IEEE Conference on Field Programmable Technology (ICFPT), Tokyo, Japan, 2003.

H. Jiang and V. Öwall, "FPGA Implementation of Controller-Datapath Pair in Custom Image Processor Design," in IEEE International Symposium on Circuits and Systems (ISCAS), Vancouver, Canada, 2004.

H. Jiang, H. Ardö and V. Öwall, "Hardware Accelerator Design for Video Segmentation with Multi-modal Background Modeling," in IEEE International Symposium on Circuits and Systems (ISCAS), Kobe, Japan, 2005.

F. Kristensen, H. Hedberg, H. Jiang, P. Nilsson and V. Öwall, "Hardware Aspects of a Real-Time Surveillance System," in European Conference on Computer Vision (ECCV), Graz, Austria, 2006.

H. Jiang, H. Ardö and V. Öwall, "Real-time Video Segmentation with VGA Resolution and Memory Bandwidth Reduction," in IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Sydney, Australia, 2006.

H. Jiang, H. Ardö and V. Öwall, "VLSI Architecture for a Video Segmentation Embedded System with Algorithm Optimization and Low Memory Bandwidth," to be submitted to IEEE Transactions on Circuits and Systems for Video Technology, February 2007.

F. Kristensen, H. Hedberg, H. Jiang, P. Nilsson and V. Öwall, "Working title: Hardware Aspects of a Real-Time Surveillance System," to be submitted to Springer Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, February 2007.

Acknowledgment

First of all, I would like to thank my supervisor, Viktor Öwall, for all the help and encouragement during all these years of my graduate study. His knowledge, inspiration and efforts to explain things clearly have had a deep impact on my research work in digital IC design. I cannot imagine the completion of my thesis work without his help. I would also like to thank him for his eternal attempts to make me ambitious, and his strong intentions to get me addicted to sushi and beer when we were traveling in Canada and Japan.

I would also like to thank Thomas for all the fruitful discussions that were of great help to the project developments. I learned lots of practical design techniques during our discussions. I would also like to thank Anders for the suggestions on my project work when I started here. Many thanks to Zhan, who gave me inspiration and deep insights into digital IC design and many other things.

I am also grateful to Fredrik for reading parts of the thesis, and to Matthias, Hugo, Joachim, Erik, Henrik, Johan, Deepak, Martin and Peter for the many interesting conversations and help. I really enjoy working in the group.

I would like to extend my gratitude to the colleagues and friends at the department. I would like to thank Jiren for introducing me here, Kittichai for all the enjoyable conversations and help, Erik for helping me with the computers and all kinds of relevant or irrelevant interesting topics, and Pia, Elsbieta and Lars for their assistance in many matters.

I would also like to thank the Vinnova Competence Center for Circuit Design (CCCD) for financing the projects, Xilinx for donating FPGA boards, and Axis for their network camera and expertise in image processing. Thanks to the Department of Mathematics for their input on the video surveillance project, especially Håkan Ardö, who gave me much precious advice.

Finally, I would like to thank my parents, my sister and my nephew, who support me all the time, and my wife Alisa and our daughter Qinqin, who bring me love and lots of joy.

List of Acronyms

ADC     Analog-to-Digital Converter
ASIC    Application-Specific Integrated Circuit
Bps     Bytes per second
CCD     Charge Coupled Device
CCTV    Closed Circuit Television
CFA     Color Filter Array
CIF     Common Intermediate Format
CMOS    Complementary Metal Oxide Semiconductor
CORDIC  Coordinate Rotation Digital Computer
DAC     Digital-to-Analog Converter
DCM     Digital Clock Manager
DCT     Discrete Cosine Transform
DWT     Discrete Wavelet Transform
DDR     Double Data Rate
DSP     Digital Signal Processor
FD      Frame Difference
FIFO    First In, First Out
FIR     Finite Impulse Response
FPGA    Field Programmable Gate Array
fps     frames per second
FSM     Finite State Machine
HDL     Hardware Description Language
IIR     Infinite Impulse Response
IP      Intellectual Property
kbps    kilobits per second
KDE     Kernel Density Estimation
LPF     Linear Predictive Filter
LSB     Least Significant Bit
LUT     Lookup Table
Mbps    Megabits per second
MCU     Micro-Controller Unit
MoG     Mixture of Gaussian
MSB     Most Significant Bit
PPC     PowerPC
P&R     Place and Route
RAM     Random-Access Memory
RISC    Reduced Instruction Set Computer
PCB     Printed Circuit Board
ROM     Read-Only Memory
SRAM    Static Random-Access Memory
SDRAM   Synchronous Dynamic Random-Access Memory
SoC     System on Chip
VGA     Video Graphics Array
VHDL    VHSIC Hardware Description Language
VLSI    Very Large-Scale Integration

General Introduction

Chapter 1
Overview

Imaging and video applications are among the fastest growing sectors of the market today. Typical application areas include
e.g. medical imaging, HDTV, digital cameras, set-top boxes, machine vision and security surveillance. As the evolution of these applications progresses, the demands for technology innovation tend to grow rapidly over the years. Driven by the consumer electronics market, new emerging standards along with increasing requirements on system performance impose great challenges on today's imaging and video product development. To meet the constantly growing system performance requirements, measured in, e.g., resolution, throughput, robustness, power consumption and digital convergence (where a wide range of terminal devices must process multimedia data streams including video, audio, GPS, cellular, etc.), new design methodologies and hardware accelerator architectures are constantly called for in the hardware implementation of such systems with real-time processing power. This thesis deals with several design issues normally encountered in hardware implementations of such image processing systems.

1.1 Thesis Contributions

In this thesis, implementation issues are elaborated regarding transforming image processing algorithms into hardware realizations in an efficient way. With the major concern being to address memory bottlenecks, which are common to most image applications, architectural considerations as well as design methodology constitute the main scope of the thesis research work. Two implementations are contributed in the thesis for the design of image processing accelerators:

In the first implementation, a real-time video segmentation unit is implemented on a Xilinx FPGA platform. The segmentation unit is part of a real-time embedded video surveillance system developed at the department, aimed at tracking people in an indoor environment. Alternative segmentation algorithms are elaborated, and an algorithm based on the Mixture of Gaussian approach is selected based on the trade-offs between segmentation quality and computational complexity. For hardware implementation, memory bottlenecks
are addressed with combined memory bandwidth reduction schemes. Modifications to the original video segmentation algorithm are made to increase hardware efficiency.

In the second implementation, a synthesis tool is modified to automate a controller design flow from a control algorithm specification to a VHDL implementation. The modified tool is utilized in the implementation of a real-time image convolution accelerator, which is prototyped on a Xilinx FPGA. An architecture with three levels of memory hierarchy is developed in the image convolution accelerator to address issues of memory bandwidth and power consumption.

1.2 Thesis Outline

The thesis is structured into three parts. The introduction covers topics concerning a range of technologies used in the hardware implementation of a typical image processing system, e.g. image sensors, signal processing units, memory technologies and displays. Comparisons are made between the various technologies regarding performance, area, power consumption, cost, etc. Following the introduction are two parts covering implementations by the author with different design goals.

Part I

The design and implementation of a real-time video surveillance system is presented in this part. Details of the video segmentation implementation, from algorithm evaluation to architecture and hardware design, are elaborated. Novel ideas on how off-chip memory bandwidth can be reduced by utilizing pixel locality and a wordlength reduction scheme are shown. Modifications to the existing Mixture of Gaussian (MoG) algorithm [1] are proposed for potential hardware efficiency. The implementation of the segmentation unit is based on:

H. Jiang, H. Ardö and V. Öwall, "Hardware Accelerator Design for Video Segmentation with Multi-modal Background Modeling," in IEEE International Symposium on Circuits and Systems (ISCAS), Kobe, Japan, 2005.

H. Jiang, H. Ardö and V. Öwall, "Real-time Video Segmentation with VGA Resolution and Memory Bandwidth Reduction," in IEEE International Conference on Advanced Video and
Signal Based Surveillance (AVSS), Sydney, Australia, 2006.

H. Jiang, H. Ardö and V. Öwall, "VLSI Architecture for a Video Segmentation Embedded System with Algorithm Optimization and Low Memory Bandwidth," to be submitted to IEEE Transactions on Circuits and Systems for Video Technology, February 2007.

The second chapter of this part is dedicated to the system integration of the segmentation unit into a complete tracking system, which includes segmentation, morphology, labeling and tracking. The complete system is implemented on a Xilinx Virtex-II Pro FPGA platform with real-time performance at a resolution of 320 × 240. The implementation of the complete embedded tracking system is based on:

F. Kristensen, H. Hedberg, H. Jiang, P. Nilsson and V. Öwall, "Hardware Aspects of a Real-Time Surveillance System," in European Conference on Computer Vision (ECCV), Graz, Austria, 2006.

F. Kristensen, H. Hedberg, H. Jiang, P. Nilsson and V. Öwall, "Working title: Hardware Aspects of a Real-Time Surveillance System," to be submitted to Springer Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, February 2007.

The author's contribution is the segmentation part.

Part II

Controller design automation with a modified controller synthesis tool is discussed in the context of implementing control-intensive image processing systems. For signal processing systems of increasing complexity, hand-coding FSMs in VHDL becomes a tedious and error-prone task. To ease the implementation and verification of complicated FSMs, a controller synthesis tool is needed. In this part, various aspects of FSM structures and implementations are explored. Details of design flows with the developed tool are presented. In the second chapter, the controller synthesizer is applied to the implementation of a real-time image convolution hardware accelerator. In addition, an architecture with three levels of memory hierarchy is developed in the image convolution hardware. It is shown how power consumption as well as memory bandwidth can be saved by
utilizing memory hierarchies. Such an architecture can be generalized to implement different image processing functions like morphology, DCT or other block-based sliding window filtering operations. In the implementation, the power consumption due to memory operations is reduced by over 2.5 times, and the off-chip memory access is reduced from N² pixels per clock cycle to only one pixel per operation, where N is the size of the sliding window. The whole part is based on:

H. Jiang and V. Öwall, "FPGA Implementation of Controller-Datapath Pair in Custom Image Processor Design," in IEEE International Symposium on Circuits and Systems (ISCAS), Vancouver, Canada, 2004.

H. Jiang and V. Öwall, "FPGA Implementation of Real-time Image Convolutions with Three Level of Memory Hierarchy," in IEEE Conference on Field Programmable Technology (ICFPT), Tokyo, Japan, 2003.

Chapter 2
Hardware Implementation Technologies

The construction of a typical real-time imaging or video embedded system is usually an integration of a range of electronic devices, e.g. an image acquisition device, signal processing units, memories, and a display. Driven by the market demand for faster, smarter, smaller and more interconnected products, designers are under ever greater pressure to select the appropriate technology for each one of these devices among the many alternatives. Trade-offs are constantly made concerning e.g. cost, speed, power, and configurability. In this chapter, a brief overview of the alternative technologies is given, along with elaborations on the plus and minus sides of each technology, which motivates the decisions made in selecting the right architecture for each of the devices used in the projects.

2.1 ASIC vs. FPGA

The core devices of a real-time embedded system are composed of one or several signal processing units implemented with different technologies such as Micro-Controller Units (MCUs), Application-Specific Signal Processors (ASSPs), General-Purpose Processors (GPPs/RISCs), Field
Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs). A comparison of the areas in which each of these technologies prevails is made in [2], which is somewhat biased towards DSPs. This is shown in Table 2.1. No perfect technology exists that is competent in all areas. For a balanced embedded system design, a combination of some of the alternative technologies is a necessity. In general, an embedded system design is initiated with hardware/software partitioning, once the original specifications are settled under the various system requirements.

Table 2.1: Comparisons of different types of signal processing units. Sources are from [2].

        Performance  Price       Power      Flexibility  Time to market
  ASIC  Excellent    Excellent¹  Good       Poor         Fair
  FPGA  Excellent    Poor        Fair       Excellent    Good
  DSP   Excellent    Excellent   Excellent  Excellent    Good
  RISC  Good         Fair        Fair       Excellent    Excellent
  MCU   Fair         Excellent   Fair       Excellent    Excellent

  ¹ Unit price for volume production.

The partitioning is carried out either by a heuristic approach or by some kind of optimization algorithm, e.g. simulated annealing [3] or tabu search [4]. Software is executed in processors (DSPs, MCUs, ASSPs, GPPs/RISCs) for features and flexibility, while dedicated hardware is used for the parts of the algorithm that are critical regarding timing constraints. With the main focus of the thesis being on the blocks that need to be accelerated and optimized by custom hardware for better performance and power, only ASIC and FPGA implementation technologies are discussed in the following sections.

With the full freedom to customize the hardware down to the very last bit of logic, both ASICs and FPGAs can achieve much better system performance compared to the other technologies. However, as they differ in the inner structure of their logic building blocks, they possess quite different metrics in areas such as speed, power, unit cost, logic integration, etc. In general, designs implemented with ASIC technology are optimized by utilizing a rich spectrum of logic cells of varied sizes and strengths, along with
dedicated interconnection. In contrast, FPGAs, with the aim of full flexibility, are composed of programmable logic components and programmable interconnects. A typical structure of an FPGA is illustrated in Figure 2.1. Figures 2.2 and 2.3 show the details of the programmable logic components and interconnects. Logic blocks can be formed on site by programming the look-up tables and the configuration SRAMs which control the routing resources. The programmability of FPGAs comes at the cost of speed, power, size, and cost, which is discussed in detail in the following. Table 2.2 gives a comparison between ASICs and FPGAs.

Figure 2.1: A conceptual FPGA structure with configurable logic blocks, configurable routing and I/O blocks.

Figure 2.2: Simplified programmable logic elements (4-input look-up table, multiplexer and flip-flop) in a typical FPGA architecture.

Figure 2.3: Configurable routing resources controlled by SRAMs.

Table 2.2: Comparisons between ASICs and FPGAs.

                                    ASICs  FPGAs
  Clock speed                       High   Low
  Power                             Low    High
  Unit cost with volume production  Low    High
  Logic integration                 High   Low
  Flexibility                       Low    High
  Back-end design effort            High   Low
  Integrated features               Low    High

Speed: In terms of maximum achievable clock frequency, ASICs are typically much faster than an FPGA given the same manufacturing process technology. This is mainly due to the interconnect architecture within FPGAs. To ensure programmability, many FPGA devices utilize pass transistors to connect different logic cells dynamically, see Figure 2.3. These active routing resources add significant delays to signal paths. Furthermore, the length of each wire is fixed to short, medium, or long types. No further optimization can be exploited on the wire length even when two logic elements are very close to each other. The situation can get even worse when high logic utilization is encountered, in which case it is difficult to find an appropriate route
within certain regions. As a result, physically adjacent logic elements do not necessarily get a short signal path. In contrast, ASICs have the facility to utilize optimally buffered wires implemented in many metal layers, which can even route over logic cells. Another contributor to FPGA speed degradation lies in the logic granularity. To achieve programmability, look-up tables are used, which usually have a fixed number of inputs. Any logic function with slightly more input variables will take up additional look-up tables, which in turn introduces additional routing and delay. On the contrary, in ASICs, usually with a rich spectrum of logic gate types of varying functionality and drive strength (e.g. over 500 types for the UMC 0.13 µm technology used at the department), logic functions can be finely tuned during the synthesis process to meet a tighter timing constraint.

Power: The active routing in FPGA devices does not only increase signal path delays, it also introduces extra capacitance. Combined with the large capacitances caused by the fixed interconnection wire lengths, the capacitance of an FPGA signal path is in general several times larger than that of an ASIC. Substantial power is dissipated by the signal switching that drives such signal paths. In addition, FPGAs have pre-made dedicated clock routing resources, which are connected to all the flip-flops on an FPGA in the same clock domain. The capacitance of a flip-flop will contribute to the total switching power even when it is not used. Furthermore, the extra SRAMs used to program the look-up tables and wires also consume static power.

Logic density: The logic density of an FPGA is usually much lower compared to ASICs. Active routing devices take up substantial chip area. Look-up tables waste logic resources when they are not fully used, which is also true for the flip-flops following each look-up table. Due to the relatively low logic density, around 1/3 of the large ASIC designs on the market could not fit into one single FPGA [5]. Low logic density
increases the cost per unit chip area, which makes ASIC design preferable for industry designs in mass production.

Despite all the above drawbacks, FPGA implementation also comes with quite a few advantages, which serve as the motivation in this thesis work.

Verification ease: Due to its flexibility, an FPGA can be re-programmed as required when a design flaw is spotted. This is extremely useful for video projects, since algorithms for video applications usually need to be verified over a long time period to observe long-term effects. Computer simulations are inherently slow: it can take a computer weeks to simulate a video sequence lasting only several minutes. Besides, an FPGA platform is also highly portable compared to a computer, which makes it more feasible to use in heterogeneous environments for system robustness verification.

Design facility: Modern FPGAs come with integrated IP blocks for design ease. Most importantly, microprocessors are shipped with certain FPGAs, e.g. hard PowerPC and soft MicroBlaze processor cores on Virtex-II Pro and later versions of Xilinx FPGAs. This gives great benefit to hardware/software co-design, which is essential in the presented video surveillance project. Algorithms such as feature extraction and tracking are more suitable for software implementation. With the facilitation of various FPGA tools, the interaction between software and hardware can be verified easily on an FPGA platform. Minor changes in hardware/software partitioning are easier and more viable compared to ASICs.

Minimum-effort back-end design: The FPGA design flow eliminates the complex and time-consuming floor planning, place and route, timing analysis, and mask/re-spin stages of a project, since the design logic is synthesized to be placed onto an already verified, characterized FPGA device. This gives hardware designers more time to concentrate on the architecture and logic design tasks.

From the discussions above, FPGAs are
selected as our implementation technology due to their fair performance and all the flexibilities and facilities.

2.2 Image Sensors

An image sensor is a device that converts light intensity to an electronic signal. Image sensors are widely used in digital cameras and other imaging devices. The two most commonly used sensor technologies are based on Charge Coupled Device (CCD) or Complementary Metal Oxide Semiconductor (CMOS) sensors. Descriptions and comparisons of the two technologies are briefly discussed in the following, based on [6–8]. A summary of the two sensor types is given in Table 2.3.

Table 2.3: Image sensor technology comparisons: CCD vs. CMOS.

                 CCD       CMOS
  Dynamic range  High      Moderate
  Speed          Moderate  High
  Windowing      Limited   Extensive
  Cost           High      Low
  Uniformity     High      Low to moderate
  System noise   Low       High

Both devices are composed of an array of fundamental light-sensitive elements called photodiodes, which excite electrons (charges) when light with enough photons strikes them. In theory, the transformation from photons to electrons is linear, so that one photon would release one electron. In general, this is not the case in the real world; typical image sensors intended for digital cameras will release less than one electron per photon. The photodiode measures the light intensity by accumulating the incident light for a short period of time (the integration time), until enough charge is gathered and ready to be read out. While CCD and CMOS sensors are quite similar in this basic photodiode structure, they mainly differ in the way these charges are processed, e.g. the readout procedure, signal amplification, and AD conversion. The inner structures of the two devices are illustrated in Figures 2.4 and 2.5. CCD sensors read out charges in a row-wise manner: the charges on each row are coupled to the row above, so when the charges are moved down to the row below, new charges from the row above fill the current position, hence the name Charge Coupled Device. The CCD shifts one row at a time to the readout registers, where
the charges are shifted out serially through a charge-to-voltage converter. The signal coming out of the chip is a weak analog signal; therefore an extra off-chip amplifier and AD converter are needed. In contrast, CMOS sensors integrate a separate charge-to-voltage converter, amplifier, noise corrector and AD converter into each photosite, so the charges are directly transformed, amplified and digitized to digital signals at each site. Row and column decoders can also be added to select each individual pixel for readout, since the sensor is manufactured in the same standard CMOS process as mainstream logic and memory devices.

Figure 2.4: A typical CCD image sensor architecture.

Figure 2.5: A typical CMOS image sensor architecture.

With the varied inner structures of the two sensor types, each technology has unique strengths but also weaknesses in one area or the other, which are described in the following:

Cost: CMOS sensors in general come at a low price at the system level, since auxiliary circuits such as the oscillator, timing circuits, amplifier and AD converter can be integrated onto the sensor chip itself. With CCD sensors, this functionality has to be implemented on a separate Printed Circuit Board (PCB), which results in a higher cost. At the chip level, although a CMOS sensor can be manufactured using a foundry process technology that is also capable of producing other circuits in volume, the cost of the chip is not considerably lower than a CCD. This is due to the fact that a special, lower-volume, optically adapted mixed-signal process has to be used to meet the requirement of good electro-optical performance [6].

Image quality: The image quality can be measured in many ways:

Noise level: CMOS sensors in general have
a higher level of noise due to the extra circuits introduced. This can be compensated to some extent by extra noise correction circuits; however, this could also increase the processing time between frames.

Uniformity: CMOS sensors use a separate amplifier for each pixel, and the offset and gain of these amplifiers can vary due to wafer process variations. As a result, the same light intensity can be interpreted as different values. CCD sensors, which use one off-chip amplifier for every pixel, excel in uniformity.

Light sensitivity: CMOS sensors are less sensitive to light, due to the fact that part of each pixel site is used not for sensing light but for processing. The percentage of a pixel used for light sensing is called the fill factor, which is illustrated in Figure 2.6. In general, CCD sensors have a fill factor of 100%, while CMOS sensors have much less, e.g. 30%–60% [9]. Such a drawback can possibly be partially compensated for by adjusting the integration time of each pixel.

Figure 2.6: The fill factor refers to the percentage of a photosite that is sensitive to light. If circuits cover 25% of each photosite, the sensor is said to have a fill factor of 75%. The higher the fill factor, the more sensitive the sensor.

Speed and power: In general, a CMOS sensor is faster and consumes less power compared to a CCD. By moving the auxiliary circuits on-chip, parasitic capacitance is reduced, which increases speed and at the same time reduces power.

Windowing: The extra row and column decoders in CMOS sensors enable data readout from arbitrary positions. This can be useful if only a portion of the pixel array is of interest. Reading out data at different resolutions is made easy on a CMOS sensor, without having to discard pixels outside the active window as on a CCD sensor.

2.3 Memory Technology

As a deduction from Moore's law, the performance of processors has been increasing by roughly 60% each year due to technology scaling. This has never been the case for memory chips. In terms of access time, memory performance has only managed to increase by less than 10% per year [10, 11]. The performance gap between processors and memories has already become a bottleneck of today's hardware system design. With these different growth rates, the situation will get even worse in the future, until it reaches a point where a further increase in processor speed yields little or no performance boost for the whole system, a phenomenon called "hitting the memory wall" after the much-cited article [12] by W. Wulf et al. on the processor-memory gap. The traditional way of bridging the gap is by introducing a hierarchy of cache levels, while many new approaches are under investigation, e.g. [13–15]. For a better understanding of today's memory issues, topics regarding memory technology are covered in the following section.

In general, memory technology can be categorized into two types, namely Read-Only Memory (ROM) and Random-Access Memory (RAM). Due to its read-only nature, a ROM is generally made up of a hardwired architecture where a transistor is placed in a memory cell depending on the intended content of the cell. The use of a ROM is limited to storing fixed information, e.g. look-up tables and micro-code. Many variant technologies exist to provide at least one-time programmability, e.g. PROM, EPROM, EEPROM and FLASH. RAMs, on the other hand, with both read and write access, are widely used in hardware. Basically, RAMs come in two types: Static RAM (SRAM) and Dynamic RAM (DRAM). A typical 6-transistor SRAM cell is shown in Figure 2.7, while 1-transistor and 3-transistor DRAM cells are shown in Figure 2.8.

Figure 2.7: An SRAM cell architecture with 6 transistors.

Figure 2.8: DRAM cell architectures with 1 or 3 transistors: (a) a 3-transistor DRAM cell structure; (b) a 1-transistor DRAM cell structure.

As shown in the figures, a static RAM holds its data in a positive feedback loop with two cascaded inverters. The value will be stored for as
long as power is supplied to the circuit. This is in contrast to a DRAM, which holds its value on a capacitor. Due to leakage, the charge on the capacitor will disappear after a period of time. To keep the value, the capacitor has to be refreshed constantly. With their respective strengths and weaknesses incurred by their inner structures, SRAMs and DRAMs are used in quite different applications. A brief comparison of the two technologies is made in the following:

Density Each DRAM cell is made up of fewer transistors than an SRAM cell, which makes it possible to integrate many more memory cells in the same chip area. For the same reason, the cost of DRAMs is much lower.

Speed In general, DRAMs are relatively slow compared to SRAMs. One reason is that the high-density structure leads to large cell arrays with high word and bit line capacitance. Another reason lies in the complicated read and write cycles with their latencies. Given the capacity, the address signals are multiplexed into row and column addresses due to the limited number of pins, potentially degrading performance. Furthermore, DRAMs need to be refreshed constantly, and during the refresh period no read or write accesses are possible.

Special IC process Integrating denser cells requires modifications to the manufacturing process [16], which makes DRAMs difficult to integrate with standard logic circuits. In general, DRAMs are manufactured on separate chips.

From these properties, DRAMs are generally used as system memory placed off-chip due to their density and cost, while SRAMs are placed on-chip with standard logic circuits, working as L1 and L2 caches due to their speed and ease of integration.

2.3.1 Synchronous DRAM

To overcome the shortcomings of traditional DRAMs, new technologies have evolved over the years, e.g. Fast Page Mode DRAM (FPM), Extended Data Out DRAM (EDO) and Synchronous DRAM (SDRAM). Good overviews can be found in many sources, e.g. [17, 18]. SDRAM gained its popularity for several reasons:

• By introducing clock signals, memory buses are made
synchronous to processors. As a result, the commands issued to the memory are pipelined, so that a new operation is executed without waiting for the completion of the previous ones. Besides, memory controller design is made somewhat easier, since timing parameters are measured in clock cycles instead of physical timing data.

• SDRAM supports burst memory access to an entire row of data. Synchronous to the bus clock, the data can be read out sequentially without stalling. No column access signals are needed during a burst read; the length of the burst access is set by a mode register, which is a new feature in SDRAMs. Burst data access increases memory bandwidth substantially if the data needed by the processor are stored successively in a row.

• SDRAM utilizes bank interleaving to minimize the extra time introduced by e.g. precharge and refresh. The memory space of an SDRAM is divided into several banks (usually two or four). While one of the banks is being accessed, the other banks remain ready to be accessed. When there is a request to access another bank, it can take place immediately, without having to wait for the current bank to complete. A continuous data flow can be obtained in such cases.

2.3.1.1 Double Data Rate Synchronous DRAM

To further improve the bandwidth of an SDRAM, Double Data Rate SDRAM (DDR) was developed with doubled memory bandwidth. Using a 2n-prefetch technique, two bits are fetched from the memory array simultaneously into the I/O buffer through two separate pipelines, from which they are sent onto the bus sequentially on both the rising and falling edges of the clock. However, the benefit is limited to situations where the multiple accesses target the same row. In addition to the double data rate, the bus signaling technology is changed to the 2.5 V Stub Series Terminated Logic 2 (SSTL2) standard [19], which consumes less power. Data strobe signals are also introduced for better synchronization of the data signals to memory controllers.
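As an illustration of the doubled transfer rate, the peak bandwidth arithmetic can be sketched as follows. The 100 MHz clock and 16-bit bus width are assumed example figures, not the parameters of any particular device.

```python
# Illustrative peak-bandwidth arithmetic for single-data-rate vs. DDR
# transfers.  A DDR device moves one data word on each clock edge
# (2n prefetch), i.e. two words per cycle instead of one.

def peak_bandwidth(bus_clock_hz: float, bus_width_bits: int, ddr: bool) -> float:
    """Return the peak transfer rate in bytes per second."""
    words_per_cycle = 2 if ddr else 1
    return bus_clock_hz * words_per_cycle * bus_width_bits / 8

if __name__ == "__main__":
    f, w = 100e6, 16
    print(peak_bandwidth(f, w, ddr=False) / 1e6)  # 200.0 MB/s
    print(peak_bandwidth(f, w, ddr=True) / 1e6)   # 400.0 MB/s
```

In practice the sustained bandwidth is lower, since precharge, refresh and row activation cycles steal bus time; the burst and bank-interleaving features above exist precisely to keep the bus close to this peak.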
2.3.2 DDR Controller Design on a Xilinx Virtex-II Pro FPGA

With the high data bandwidth and complicated timing parameters of a DDR SDRAM, the design of a DDR interface can be challenging. A DDR SDRAM works synchronously at a clock frequency of 100 MHz or above. Clock signals, together with data and command signals, are transferred between the memory and processor chips through PCB signal traces. Making sure that all data and command signals are valid at the right time in relation to the clock is a nontrivial task. Many factors contribute to the total signal uncertainty, e.g. PCB layout skew, package skew, clock phase offset, clock tree skew and clock duty-cycle distortion.

Figure 2.9: Two DCMs are used to synchronize operations between an off-chip DDR SDRAM and the on-chip memory controller. The external DCM sends off-chip clocks to the DDR SDRAM, while the internal DCM is used for sending data off-chip or capturing data from the DDR SDRAM.

In the following, the timing closure of the DDR controller design for the implementation of the video surveillance unit is described. The memory interface is implemented on a Xilinx Virtex-II Pro VP30 platform FPGA with a working frequency of 100 MHz.

According to the standard, data are transferred between a DDR and a processor (an FPGA in our implementation) with a bidirectional data strobe signal (DQS). The strobe is issued by the memory controller during a write operation, center-aligned with the data. During a read operation, the DDR sends the strobe together with the data, edge-aligned with respect to each other. To synchronize the operations between the FPGA and the DDR SDRAM, two Digital Clock Managers (DCM) are used, as shown in figure 2.9.

A DCM is a special device in many Xilinx FPGA
platforms that provides many functionalities related to clock management, e.g. a delay-locked loop (DLL), digital frequency synthesis and digital phase shifting. By using the clock signal fed back from the dedicated clock tree, the clock signal referenced internally by each flip-flop inside the FPGA is kept in phase with the off-chip clock source. In figure 2.9, the external DCM generates the clock signals (clk0 and clk180) that go off-chip to the DDR SDRAM through double data rate flip-flops (FDDR). An FDDR updates its output on the rising edges of both of its input clock signals; thus the clock signals to the DDR can be driven by an FDDR instead of directly by an internal clock signal. The internal DCM generates the clock signals that are used internally by all flip-flops in the memory controller. To align the two clock domains, both are aligned to the original clock source (the signal driven by IBUFG). The alignment of the external DCM is implemented using an off-chip PCB feedback trace that is designed to have the same length as the clock signal trace from the FPGA to the DDR SDRAM. Thus the clock signal arriving at the DDR SDRAM is assumed to be in phase with the external feedback signal that arrives at the external DCM. As the internal clock signals referenced by all flip-flops in the memory controller are also aligned to the original clock signal driven by IBUFG, through an internal feedback loop, the clock in the memory controller is aligned to the clock signal arriving at the DDR SDRAM clock pin.

During a read operation, data are transferred from the off-chip DDR on both edges of the clock, in an edge-aligned manner. To register the data in the memory controller, 90° and 270° phase-shifted clock signals are used, so that the capturing edges fall in the center of the read data. This is shown in figure 2.9.

In practice, the internal and external clock signals are not entirely in phase with each other, due to skew from many sources. According to the Xilinx datasheets [20-22], the worst-case skew on a Xilinx Virtex-II Pro device can
result in leading and trailing uncertainties of 880 ps and 540 ps, respectively, in a read data window, as shown in figure 2.10.

Figure 2.10: DDR read capture data valid window, showing the 880 ps leading and 540 ps trailing edge uncertainties and the resulting margins around the delayed CLK90 capture edge.

The internal DCM is phase shifted by 1 ns to take advantage of the differing leading and trailing uncertainties, so that the margin of the valid data window is improved, see figure 2.10.

The timing of the data write operation, on the other hand, is a minor problem, since the clock and data signals generated within the FPGA propagate through similar logic and trace delays.

2.4 Power Consumption in Digital CMOS Technology

Minimization of power consumption has been one of the major concerns in the design of embedded systems, due to one of the following two distinctive reasons:

• The increasing system complexity of portable devices leads to more power consumed by more integrated functionality and sophistication, e.g. the multimedia applications on mobile phones such as digital video broadcasting (DVB) and digital cameras, and higher-data-rate wireless communication with emerging technologies such as WiMAX/802.16. This shortens battery life significantly.

• Reliability and cost issues regarding heat dissipation in the manufacturing of non-portable high-end applications. High power consumption requires expensive packaging and cooling techniques, given that insufficient cooling leads to high operating temperatures, which tend to exacerbate several silicon failure mechanisms.

The first reason is especially pressing for battery-driven system design. While battery capacity has increased by only 30% over the last 30 years, with 30 to 40% expected over the next 5 years from new battery technologies [23], e.g. rechargeable lithium or polymer cells, the computational power of digital integrated circuits has increased by several orders of magnitude. To bridge
the gap, new approaches must be developed to handle power consumption in mobile applications.

2.4.1 Sources of power dissipation

Three major sources contribute to the total power dissipation of digital CMOS circuits, which can be formulated as

P_tot = P_dyn + P_dp + P_stat,  (2.1)

where P_dyn is the dynamic dissipation due to charging and discharging load capacitances, P_dp is the power consumption caused by the direct path between VDD and GND during the finite slope of the input signal, and P_stat is the static power caused by leakage currents. Traditionally, the power consumed by the capacitive load has been the dominant factor. This is no longer the case in designs with deep sub-micron technologies, since leakage current increases exponentially with threshold scaling in each new technology generation [24]. In a 130 nm technology, leakage can account for 10% to 30% of the total power in active mode, and dominates in standby [25]. With 90 nm and 65 nm technologies, leakage can reach more than 50%. Power dissipation due to the direct path, on the other hand, is usually of minor importance, and can be minimized by certain techniques, e.g. supply voltage scaling [26]. With the focus of this thesis being on architecture exploration, switching power is briefly discussed in the following.

2.4.1.1 Switching Power Reduction Schemes

Power consumption due to signal switching activity can be calculated as [16]

P_switch = P_{0→1} C_L V_DD^2 f,  (2.2)

where P_{0→1} is the probability that an output transition 0 → 1 occurs, C_L is the load capacitance of the driving cell, V_DD is the supply voltage, and f is the clock frequency. From the equation, a power minimization strategy can be carried out by constraining any of the factors. This is especially effective for supply voltage reduction, since the power dissipation decreases quadratically with V_DD.

Table 2.4: Power savings at different levels of design abstraction.

Technique                      Savings
Architectural/Logic Changes    45%
Clock Gating                   8%
Low Power Synthesis            15%
Voltage Reduction              32%
Table 2.5: Core power consumption contributions from different parts of a logic core [36].

Component                  Percentage
PLLs/Macros                7.21%
Clocks                     52.13%
Standard Cells             6.72%
Interconnect               5.97%
RAMs (including leakage)   16.94%
Logic Leakage              11.04%

Power minimization techniques can be applied at all levels of design abstraction, ranging from software down to chip layout. In [27-34], comprehensive overviews of various power reduction techniques are given, with suggestions for minimizing power consumption at every level of a circuit design. In [35], a survey gives an overview of the power savings that can generally be achieved at each design level; the experimental results are given in Table 2.4. From the table, it is seen that the most efficient ways of lowering power consumption are to work either at the high architectural level or at the low transistor level. In [36], the contributions to the total power consumption from the different blocks of a design are given, as shown in Table 2.5. From that table, it can be seen that the clock net and memory accesses contribute over 50% of the total power consumption in the logic core. In the following section, example power reduction schemes are discussed, covering only power minimization at the high architectural level.

2.4.2 Pipelining and Parallel Architectures

Power consumption can be reduced by using pipelining or parallel architectures. According to [37], a first-order estimate of the delay of a logic path can be calculated as

t_d ∝ V_DD / (V_DD − V_t)^α.  (2.3)

In a pipelined architecture, the calculation paths of a design are broken up by pipeline registers. This effectively reduces t_d in the critical path, so V_DD can be lowered while the same clock frequency is maintained. As stated above, power consumption is reduced by lowering V_DD, since it has a quadratic effect on power dissipation. The same principle applies to a parallel architecture.
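The voltage-scaling argument can be made concrete with a small numerical sketch. The threshold voltage, the exponent α and the nominal supply below are assumed illustrative values, not figures from this design.

```python
# A numerical sketch (illustrative parameters) of why pipelining enables
# voltage scaling: splitting the critical path in two gives each unit of
# logic a 2x delay budget, so VDD can be lowered until the first-order
# delay model t_d ~ VDD/(VDD - Vt)^alpha doubles.  Switching power then
# drops roughly as (VDD_new/VDD_nom)^2 at the same clock frequency.

V_T, ALPHA = 0.4, 1.3      # assumed threshold voltage and delay exponent
V_NOM = 1.2                # assumed nominal supply voltage

def delay(vdd: float) -> float:
    """First-order logic-path delay, up to a constant factor (eq. 2.3)."""
    return vdd / (vdd - V_T) ** ALPHA

def scaled_vdd(slowdown: float) -> float:
    """Bisect for the supply voltage at which delay(v) = slowdown * delay(V_NOM)."""
    target = slowdown * delay(V_NOM)
    lo, hi = V_T + 1e-6, V_NOM     # delay is monotonically decreasing in VDD here
    for _ in range(100):
        mid = (lo + hi) / 2
        if delay(mid) > target:
            lo = mid               # too slow: raise the voltage
        else:
            hi = mid               # fast enough: try a lower voltage
    return (lo + hi) / 2

v_new = scaled_vdd(2.0)                  # two pipeline stages -> 2x delay budget
power_ratio = (v_new / V_NOM) ** 2       # quadratic VDD dependence of eq. 2.2
```

With these example parameters the supply can drop to roughly 0.7 V, cutting switching power to well under half, which is the effect the table's 32% figure for voltage reduction reflects at the system level.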
With the hardware duplicated several times, the throughput of a design increases proportionally. Alternatively, a design can achieve lower power consumption by slowing down the clock frequency of each duplicate. The same throughput is maintained, while the supply voltage can be reduced.

Bibliography

[1] C. Stauffer and W. Grimson, "Adaptive background mixture models for real-time tracking," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1999.

[2] L. Adams. (2002, November) Choosing the right architecture for real-time signal processing designs. [Online]. Available: http://focus.ti.com/lit/an/spra879/spra879.pdf

[3] P. Eles, Z. Peng, K. Kuchcinski, and A. Doboli, "System level hardware/software partitioning based on simulated annealing and tabu search," Springer Design Automation for Embedded Systems, vol. 2, pp. 5-32, January 1997.

[4] T. Wiangtong, P. Y. Cheung, and W. Luk, "Tabu search with intensification strategy for functional partitioning in hardware-software codesign," in Proc. of the 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 02), California, USA, April 2002, pp. 297-298.

[5] J. Gallagher. (2006, January) ASIC prototyping using off-the-shelf FPGA boards: How to save months of verification time and tens of thousands of dollars. [Online]. Available: http://www.synplicity.com/literature/whitepapers/pdf/protowp06.pdf

[6] D. Litwiller. (2001, January) CCD vs. CMOS: Facts and fiction. [Online]. Available: http://www.dalsa.com/shared/content/PhotonicsSpectraCCDvsCMOSLitwiller.pdf

[7] ——. (2005, August) CMOS vs. CCD: Maturing technologies, maturing markets. [Online]. Available: http://www.dalsa.com/shared/content/pdfs/CCDvsCMOSLitwiller2005.pdf

[8] A. E. Gamal and H. Eltoukhy, "CMOS image sensors," IEEE Circuits and Devices Magazine, vol. 21, pp. 6-20, May-June 2005.

[9] D. Scansen. CMOS challenges CCD for image-sensing lead. [Online]. Available: http://www.eetindia.com/articles/2005oct/b/2005oct17stechoptta.pdf

[10] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Third Edition. Morgan
Kaufmann, 2002.

[11] N. R. Mahapatra and B. Venkatrao, "The processor-memory bottleneck: Problems and solutions," Tech. Rep. [Online]. Available: http://www.acm.org/crossroads/xrds5-3/pmgap.html

[12] W. A. Wulf and S. A. McKee, "Hitting the memory wall: Implications of the obvious," Computer Architecture News, vol. 23, pp. 20-24, March 1995.

[13] "The Berkeley intelligent RAM (IRAM) project," Tech. Rep. [Online]. Available: http://iram.cs.berkeley.edu/

[14] C. C. Liu, I. Ganusov, M. Burtscher, and S. Tiwari, "Bridging the processor memory performance gap with 3D IC technology," IEEE Design and Test of Computers, vol. 22, pp. 556-564, November 2005.

[15] "PUMA2, proactively uniform memory access architecture," Tech. Rep. [Online]. Available: http://www.ece.cmu.edu/puma2/

[16] J. M. Rabaey, A. Chandrakasan, and B. Nikolić, Digital Integrated Circuits: A Design Perspective, Second Edition. Prentice Hall, 2003.

[17] T.-G. Hwang, "Semiconductor memories for IT era," in Proc. of IEEE International Solid-State Circuits Conference (ISSCC), California, USA, February 2002, pp. 24-27.

[18] (2005) Memory technology evolution. [Online]. Available: http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00266863/c00266863.pdf

[19] [Online]. Available: http://download.micron.com/pdf/misc/sstl2spec.pdf

[20] M. George. (2006, December) Memory interface application notes overview. [Online]. Available: http://www.xilinx.com/bvdocs/appnotes/xapp802.pdf

[21] N. Gupta and M. George. (2004, May) Creating high-speed memory interfaces with Virtex-II and Virtex-II Pro FPGAs. [Online]. Available: http://www.xilinx.com/bvdocs/appnotes/xapp688.pdf

[22] N. Gupta. (2005, January) Interfacing Virtex-II devices with DDR SDRAM memories for performance to 167 MHz. [Online]. Available: http://www.xilinx.com/support/software/memory/protected/XAPP758c.pdf

[23] W. L. Goh, S. S. Rofail, and K.-S. Yeo. Low-power design: An overview. [Online]. Available: http://www.informit.com/articles/article.asp?p=27212&rl=1

[24] G. E. Moore, "No exponential is forever: But "Forever" can be delayed!" in Proc. of IEEE
International Solid-State Circuits Conference (ISSCC), California, USA, February 2003, pp. 20-23.

[25] B. Chatterjee, M. Sachdev, S. Hsu, R. Krishnamurthy, and S. Borkar, "Effectiveness and scaling trends of leakage control techniques for sub-130nm CMOS technologies," in Proc. of International Symposium on Low Power Electronics and Design (ISLPED), California, USA, August 2003, pp. 122-127.

[26] T. Olsson, "Distributed clocking and clock generation in digital CMOS SoC ASICs," Ph.D. dissertation, Lund University, Lund, 2004.

[27] J. M. Rabaey and M. Pedram, Low Power Design Methodologies. Springer, 1995.

[28] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-power CMOS digital design," IEEE Journal of Solid-State Circuits, vol. 27, pp. 473-484, April 1992.

[29] D. Garrett, M. Stan, and A. Dean, "Challenges in clock gating for a low power ASIC methodology," in Proc. of International Symposium on Low Power Electronics and Design, California, USA, August 1999, pp. 176-181.

[30] Y. J. Yeh, S. Y. Kuo, and J. Y. Jou, "Converter-free multiple-voltage scaling techniques for low-power CMOS digital design," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 20, pp. 172-176, January 2001.

[31] T. Kuroda and M. Hamada, "Low-power CMOS digital design with dual embedded adaptive power supplies," IEEE Journal of Solid-State Circuits, vol. 35, pp. 652-655, April 2000.

[32] A. Garcia, W. Burleson, and J. L. Danger, "Low power digital design in FPGAs: a study of pipeline architectures implemented in a FPGA using a low supply voltage to reduce power consumption," in Proc. IEEE International Symposium on Circuits and Systems (ISCAS), Geneva, Switzerland, May 2000, pp. 561-564.

[33] P. Brennan, A. Dean, S. Kenyon, and S. Ventrone, "Low power methodology and design techniques for processor design," in Proc. International Symposium on Low Power Electronics and Design (ISLPED), California, USA, August 1998, pp. 268-273.

[34] L. Benini, G. D. Micheli, and E. Macii, "Designing low-power circuits: practical recipes," IEEE Circuits and Systems
Magazine, vol. 1, pp. 6-25, 2001.

[35] F. G. Wolff, M. J. Knieser, D. J. Weyer, and C. A. Papachristou, "High-level low power FPGA design methodology," in Proc. IEEE National Aerospace and Electronics Conference (NAECON), Ohio, USA, October 2000, pp. 554-559.

[36] S. GadelRab, D. Bond, and D. Reynolds, "Fight the power: Power reduction ideas for ASIC designers and tool providers," in Proc. of SNUG Conference, California, USA, 2005.

[37] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. John Wiley & Sons, 1999.

Hardware Accelerator Design of an Automated Video Surveillance System

Chapter 1

Segmentation

1.1 Introduction

The use of video surveillance systems is omnipresent in the modern world, in both civilian and military contexts, e.g. traffic control, security monitoring and antiterrorism. While traditional Closed Circuit TV (CCTV) based surveillance systems put heavy demands on human operators, there is an increasing need for automated video surveillance systems. By building a self-contained video surveillance system capable of automatic information extraction and processing, various events can be detected automatically, and alarms can be triggered in the presence of abnormality. Thereby, the volume of data presented to security personnel is reduced substantially. Besides, automated video surveillance better handles complex, cluttered or camouflaged scenes. A video feed for surveillance personnel to monitor after the system has announced an event supports improved vigilance and increases the probability of incident detection.

Crucial to most such automated video surveillance systems is the quality of the video segmentation, which is the process of extracting objects of interest (foreground) from an irrelevant background scene. The foreground information, often composed of moving objects, is passed on to later analysis units, where objects are tracked and their activities are analyzed. To perform video segmentation, a so-called background subtraction technique is usually applied. With a
reference frame containing a pure background scene being maintained for all pixel locations, foreground objects are extracted by thresholding the difference between the current video frame and the background frame. In the following sections, a range of background subtraction algorithms is reviewed, along with discussions of their performance and computational complexity. Based on these discussions, trade-offs are made, with a specific algorithm based on Mixture of Gaussians (MoG) selected as the baseline algorithm for hardware implementation. The algorithm is subjected to modifications to better fit implementation on an embedded platform.

Figure 1.1: Video segmentation results with the frame difference approach. Different threshold values are tested in the indoor environment in our lab: (a) indoor environment in the lab; (b) TH = 5; (c) TH = 10; (d) TH = 20.

1.2 Alternative Video Segmentation Algorithms

1.2.1 Frame Difference

Background/foreground detection can be achieved by simply observing the difference between the pixels of two adjacent frames. By setting a threshold value, a pixel is identified as foreground if the difference is higher than the threshold value, or as background otherwise. The simplicity of the algorithm comes at the cost of segmentation quality. In general, bigger regions are detected as foreground than the actual moving parts. The approach also fails to detect the inner pixels of a large, uniformly-colored moving object, a problem known as the aperture effect [1]. In addition, setting a global threshold value is problematic, since the segmentation is sensitive to light intensity. Figure 1.1 shows segmentation results for a video sequence taken in our lab, where three people are moving in front of a camera. From these figures, it can be seen that with a lower threshold value, more details of the moving objects are revealed.
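The thresholded frame difference just described can be sketched in a few lines. The 8-bit test frames and the threshold value below are illustrative assumptions, not data from the experiments.

```python
# A minimal sketch of frame-difference segmentation: a pixel is marked
# foreground when |F_t - F_{t-1}| exceeds a global threshold TH.
import numpy as np

def frame_difference(curr: np.ndarray, prev: np.ndarray, th: int) -> np.ndarray:
    """Return a boolean foreground mask, one entry per pixel."""
    # Widen to a signed type so the subtraction of uint8 frames cannot wrap.
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    return diff > th

prev = np.zeros((4, 4), dtype=np.uint8)   # assumed static background
curr = prev.copy()
curr[1:3, 1:3] = 40                        # a small "moving object"
mask = frame_difference(curr, prev, th=20)
```

The per-pixel cost is exactly the adder and comparator mentioned below, which is what makes the scheme attractive for hardware despite its segmentation quality.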
However, this comes with substantial noise that could overwhelm the segmented objects, e.g. the leftmost person in figure 1.1(b). On the other hand, increasing the threshold value reduces the noise level, at the cost of fewer details being detected, to the point where almost whole objects are missing, e.g. the leftmost person in figure 1.1(d). In general, the inner parts of all objects are left undetected, due to their uniform colors, which result in only minor value changes over frames. Despite the segmentation quality, the frame difference approach suits hardware implementation well. The computational complexity as well as the memory requirements are rather low. With a memory size of only one video frame and minor hardware for the calculation, e.g. an adder and a comparator, it is still found as part of many video surveillance systems today [1-4].

1.2.2 Median Filter

While the frame difference approach uses the previous frame as the background reference frame, it is inherently unreliable and sensitive to noise, with moving objects contained in the reference frame and varying illumination noise over frames. An alternative approach to obtaining a background frame is to use median filters. A median filter has traditionally been used in spatial image filtering to remove noise [5]. The basic idea of the noise reduction lies in the fact that a pixel corrupted by noise makes a sharp transition in the spatial domain. By checking the surrounding pixels centered at the pixel in question, the middle value is selected to replace the center pixel. In this way, the pixel in question is forced to look like its neighbors, and the distinctive pixel value corrupted by noise is replaced. Inspired by this, median filters can be used to model background pixels with reduced noise deviation by filtering pixel values in the time domain. They are used in many applications [6-8], with the median filtering process carried out over the previous n frames, e.g. 50-200 frames in [6]. To avoid foreground pixel values being mixed
into the background, the number of frames has to be large enough that more than half of the pixel values belong to the background. The principle is illustrated in figure 1.2, where the numbers of both foreground and background pixels in a frame buffer are shown. Due to various noise sources, a pixel value will not stay at exactly the same value over frames; thus histograms are used to represent both the foreground and the background pixels. Consider the case when the number of background pixels exceeds the number of foreground pixels by only one. The median value will lie right at the foot of the background histogram. With more background pixels filled into the buffer, the value moves towards the peak of the background histogram. Under the previous assumption that no foreground pixel stays in the scene for more than half the size of the buffer, the median value will move back and forth along the background histogram, representing the background pixel value for the current frame.

Using buffers to store the previous n frames is costly in memory usage. In certain situations, the number of buffered frames would have to increase substantially, e.g. when slowly moving objects with uniformly colored surfaces are present in the scene, or when foreground objects stop for a while before moving on to another location. The calculation complexity is also proportional to the number of buffers: to find the median value, it is necessary to sort all the values in the frame buffer in numerical order, which is costly in hardware for a large number of frame buffers.

1.2.3 Selective Running Average

A similar alternative to median filtering is to use the average instead of the median value over the previous n frames. Noise distortions to a background pixel over frames can be neutralized by taking the mean value of the pixel samples collected over time. To avoid memory requirements similar to those of the median filtering approach, a running average can be utilized, which takes the form

B_t = (1 − α)B_{t−1} + αF_t,  (1.1)

where α is the learning rate, and F and B are the current frame and
background frame formed by the mean value of each pixel, respectively. With such an approach, only one frame of mean values needs to be stored in memory. The averaging operation is carried out by incorporating a small portion of each new frame into the mean values at a time, using the learning factor α. At the same time, the same portion of the current mean value is discarded. Depending on the value of α, the averaging can be fast or slow. For background modeling, a fast learning factor could result in foreground pixels being quickly incorporated into the background, limiting its usage to certain situations, e.g. an initialization phase with only the background scene present.

Figure 1.2: Foreground and background pixel histograms: with more pixels in the buffer falling within the background, the median value moves towards the center of the background distribution.

To avoid foreground pixels being mixed into the background updating process, a selective running average can be applied, as shown in the following equations:

B_t = (1 − α)B_{t−1} + αF_t  if F_t ⊂ background,  (1.2)
B_t = B_{t−1}  if F_t ⊂ foreground.  (1.3)

With the foreground/background distinction performed before the background frame updating process, more recent "clean" background pixels contribute to the new mean value, which makes the background modeling more accurate. The selective running average method is used in many applications, e.g. [9, 10], and forms the basis of other algorithms with much higher complexity, e.g. the Mixture of Gaussians (MoG) discussed in the following sections. The merit of the approach lies in its relatively low hardware complexity: only simple multiplications and additions are needed to update the mean value for each pixel. Together with the low memory requirement of storing only one frame of mean values, the selective running average fits hardware implementation well.
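The update rules of equations (1.1)-(1.3) can be sketched as follows. The learning rate and the threshold used for the foreground/background decision are assumed values for illustration only.

```python
# A minimal sketch of the selective running average: the background mean
# is updated by eq. (1.1) only where the pixel was classified as
# background, and held (eq. 1.3) where it was classified as foreground.
import numpy as np

ALPHA, TH = 0.05, 25   # assumed learning rate and decision threshold

def update_background(bg: np.ndarray, frame: np.ndarray):
    """Return (foreground mask, updated background mean) for one frame."""
    fg = np.abs(frame - bg) > TH                                  # classify first
    new_bg = np.where(fg, bg, (1 - ALPHA) * bg + ALPHA * frame)   # selective update
    return fg, new_bg

bg = np.full((2, 2), 100.0)
frame = np.array([[102.0, 100.0],
                  [100.0, 200.0]])       # bottom-right pixel: foreground
fg, bg = update_background(bg, frame)
```

Note that the classification precedes the update, so the bottom-right mean stays untouched while the slightly noisy top-left pixel is blended in, which is exactly the "clean background" property argued for above.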
Acting on virtually the same principles as a mean filter, the selective running average achieves segmentation results similar to those of the median filtering approach.

1.2.4 Linear Predictive Filter

To estimate the current background more accurately, linear predictive filters have been developed for background modeling in several works [11-15]. The problem with taking the median or mean of the past pixel samples lies in the fact that it does not reflect the uncertainty (variance) of how a background pixel value can drift from its mean value. Without this information, the foreground/background distinction has to be made in a heuristic way. An alternative approach is to predict the current background pixel value from its recent history. Compared to mean and median values, a predicted value can more accurately represent the true value of the current background pixel, which effectively decreases the uncertainty of the variation of a background pixel. As a result, a tighter threshold value can be selected to achieve more precise segmentation, with a better chance of avoiding the camouflage problem, where foreground and background hold similar pixel values.

Toyama et al. [11] use a one-step Wiener filter to predict a background value based on its recent history of values. In their approach, a linear estimate of the current background value is calculated as

\hat{B}_t = \sum_{k=1}^{N} \alpha_k I_{t-k},  (1.4)

where \hat{B}_t is the current background estimate, I_{t-k} is one of the history values of the pixel, and \alpha_k is a prediction coefficient. The coefficients are calculated to minimize the mean square estimation error, formulated as

E[e_t^2] = E[(B_t − \hat{B}_t)^2].  (1.5)

According to the procedure described in [16], the coefficients can be obtained by solving the following set of linear equations:

\sum_{k=1}^{p} \alpha_k \sum_t I_{t-k} I_{t-i} = −\sum_t I_t I_{t-i},  1 ≤ i ≤ p.  (1.6)

The estimates of the coefficients and the pixel predictions are calculated recursively for each frame. In [11], a pixel value with a deviation of more than 4.0 × \sqrt{E[e_t^2]} is
considered a foreground pixel. In total, 50 past values are used in [11] for each pixel to calculate 30 coefficients.

Wiener filters are also expensive in computation and memory requirements. N frame buffers are needed to store the history of frames. Background pixel prediction and coefficient updating are very costly, since a set of linear equations must be solved to obtain the values: p multiplications and p − 1 additions are needed for the prediction, plus the solution of a linear system of order p.

An alternative approach to linear prediction is to use Kalman filters. Basic Kalman filter theory can be found in much of the literature, e.g. [12, 13, 15], and Kalman filters are widely used in background subtraction applications, e.g. [13-15]. The current background pixel value is predicted by recursive computation from the previous estimate and the new input data. A brief formulation of the theory is given below according to [13], while a detailed description of Kalman filters can be found in [12].

A Kalman filter provides an optimal estimate of the state of the process x_t, by minimizing the variance of the estimation error between the estimated outputs and the measurements. The definition of the state can vary between applications, e.g. the estimated value of the background pixel and its temporal derivative in [15]. Kalman filtering is performed in essentially two steps, prediction and correction. In the prediction step, the current state of the system is predicted from the previous state as

\hat{x}_t^- = A\hat{x}_{t-1},  (1.7)

where A is the state transition matrix, \hat{x}_{t-1} is the previous state estimate, and \hat{x}_t^- is the estimate of the current state before correction. The correction step then minimizes the difference between the measurement and the estimated state value, I_t − H\hat{x}_t^-, where I_t is the current observation and H is the transition matrix that maps the state to the measurements. The variance of this difference is propagated as

P_t^- = AP_{t-1}A^T + Q_t,  (1.8)

where Q_t represents the process
noise,Pt−1is the previous estimation errorvariance and P−tis the estimation of error variance based on current predictionstate value.With a ﬁlter gain factor calculated byKt=P−tCTCP−tCT+Rt,(1.9)where Rtrepresents the variances of measurement noise and C is the transitionmatrix that maps the state to the measurement.The corrected state estimationbecomesˆxt= ˆxt−1+Kt(It−Hx−t),(1.10)and the variance after correction is reduced toPt= (1 −KtC)P−t.(1.11)40 CHAPTER 1.SEGMENTATIONRidder et al.[15] use both background pixel intensity value and its temporalderivative Btand B′tas the state value:ˆxt=

\hat{x}_t = \begin{pmatrix} B_t \\ B_t' \end{pmatrix},    (1.12)

and the parameters are selected as follows:

A = \begin{pmatrix} 1 & 0.7 \\ 0 & 0.7 \end{pmatrix} \quad \text{and} \quad H = \begin{pmatrix} 1 & 0 \end{pmatrix}.    (1.13)

The gain factor K_t varies between a slow adaptation rate \alpha_1 and a fast adaptation rate \alpha_2, depending on whether the current observation is a background pixel or not:

K_t = \begin{cases} \alpha_1 & \text{if } I_t \text{ is a foreground pixel,} \\ \alpha_2 & \text{if } I_t \text{ is a background pixel.} \end{cases}    (1.15)
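The predict/correct recursion of equations (1.7)–(1.15) can be sketched for a single pixel as follows. This is a minimal illustration, not the thesis implementation: the A and H matrices follow equation (1.13), while the adaptation rates, the foreground threshold and the use of one scalar gain for both state components are assumed values chosen for illustration.

```python
# Sketch of Kalman-style background estimation for a single pixel, in the
# spirit of Ridder et al. [15].  The state holds the background intensity
# estimate and its temporal derivative; A and H follow equation (1.13).
# The adaptation rates, the foreground threshold and the scalar gain for
# both state components are illustrative assumptions.

A = [[1.0, 0.7],
     [0.0, 0.7]]                      # state transition matrix, eq. (1.13)
ALPHA_SLOW, ALPHA_FAST = 0.01, 0.2    # slow/fast adaptation rates (assumed)
THRESHOLD = 20.0                      # innovation above this => foreground (assumed)

def kalman_background_step(state, observation):
    """One predict/correct step; returns (new_state, is_foreground)."""
    b, db = state
    # Prediction, eq. (1.7): x^- = A x
    pred_b = A[0][0] * b + A[0][1] * db
    pred_db = A[1][0] * b + A[1][1] * db
    # Innovation I_t - H x^- with H = (1 0), eq. (1.13)
    innovation = observation - pred_b
    is_foreground = abs(innovation) > THRESHOLD
    # Gain switches between slow (foreground) and fast (background)
    # adaptation, eq. (1.15)
    k = ALPHA_SLOW if is_foreground else ALPHA_FAST
    # Correction, eq. (1.10), applied to both state components
    return (pred_b + k * innovation, pred_db + k * innovation), is_foreground
```

With these assumed rates, a small innovation (slow illumination drift) is absorbed quickly into the background estimate, while a large innovation flags the pixel as foreground and barely disturbs the estimate.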
The Kalman filtering approach is efficient for hardware implementation: only three matrix multiplications of size 2 are needed per pixel, and the memory requirement is low, since only one frame of estimated background pixel values has to be stored. The linear predictive approach is reported to achieve better results than many other algorithms, e.g. the median or mean filtering approaches, especially in dealing with the camouflage problem [11], where foreground pixels whose color is similar to that of the background pixels go undetected.

1.2.5 Mixture of Gaussians

So far, predictive methods have been discussed, which model the background scene as a time series and develop a linear dynamical model to recover the current input based on past observations. By minimizing the variance between the predicted value and past observations, the estimated background pixel adapts to situations where its value varies slowly over time. While this class of algorithms may work well for quasi-static background scenes with slow lighting changes, it fails to deal with multi-modal situations, which are discussed in detail in the following sections. Instead of utilizing the order of incoming observations to predict the current background value, a Gaussian distribution can be used to model a static background value by accounting for the noise introduced by small illumination changes, camera jitter and surface texture. In [17], three Gaussians are used to model background scenes for traffic surveillance, based on the hypothesis that each pixel contains the color of either the road, the shadows or the vehicles. Stauffer et al. [18] generalized the idea by extending the number of Gaussians per pixel to deal with multi-modal background environments, which are quite common both indoors and outdoors. A multi-modal background is caused by repetitive background object motion, e.g. swaying trees or the flickering of a monitor. As a pixel lying in a region where repetitive motion
occurs will generally contain two or more background colors, the RGB value of that specific pixel forms several distinct distributions in the RGB color space. The idea of a multi-modal distribution is illustrated in figure 1.3. The typical indoor environment in figure 1.3(a) consists of static background objects, which are stationary all the time; a pixel value at any location stays within one single distribution over time. This is in contrast to the outdoor environment in figure 1.3(c), where quasi-static background objects, e.g. the swaying leaves of a tree, are present in the scene. Pixel values from these regions contain multiple background colors over time, e.g. the color of the leaves, the color of the house, or something in between.

In multi-modal environments, the values of quasi-static background pixels tend to jump between different distributions, which can be modeled by fitting a separate Gaussian to each distribution. The idea of Mixture of Gaussians (MoG) is quite popular, and many variants [19–24] have been developed based on it.

1.2.5.1 Algorithm Formulation

The Stauffer-Grimson algorithm is formulated as modeling a pixel process with a mixture of Gaussian distributions. A pixel process is defined as the recent history of values of a pixel obtained from a number of consecutive frames. For a static background pixel process, the values form pixel clusters rather than identical points when they are plotted in the RGB color space. This is due to variations caused by many factors, e.g. surface texture, illumination fluctuations or camera noise. To model such a background pixel process, a Gaussian distribution can be used, with a mean equal to the average background color and variances accounting for the value fluctuations. More complicated background pixel processes appear when a pixel contains more than one background object surface, e.g. a background pixel of a road that is covered by the leaves of a tree from time to time. In such cases, a mixture of Gaussian distributions is necessary to model the multi-modal background distribution.

Figure 1.3: Background pixel distributions taken in different environments possess different properties in the RGB color space: (a) a typical indoor environment taken in the staircase; (b) a pixel value sampled over time in the indoor environment contains uni-modal pixel distributions; (c) a dynamic outdoor environment containing swaying trees; (d) a pixel value sampled over time in a region that contains leaves of a tree generally becomes multi-modal in the RGB color space.

Formally, the Stauffer-Grimson algorithm addresses background modeling as follows. Each pixel is represented by a set of Gaussian distributions k ∈ {1, 2, ..., K}, where the number of distributions K is assumed to be constant (usually between 3 and 7). Some of the K distributions correspond to background objects and the rest are regarded as foreground. Each of the Gaussians in the mixture is weighted by a parameter \omega_k, which represents the probability of the current observation belonging to that distribution, thus

\sum_{k=1}^{K} \omega_k = 1.    (1.16)

The probability of the current pixel value X being in distribution k is calculated as

f(X|k) = \frac{1}{(2\pi)^{n/2} |\Sigma_k|^{1/2}} \, e^{-\frac{1}{2}(X-\mu_k)^T \Sigma_k^{-1} (X-\mu_k)},    (1.17)

where \mu_k is the mean and \Sigma_k is the covariance matrix of the kth distribution.

Figure 1.4: Three Gaussian distributions with mean and variance {80, 20}, {100, 5} and {200, 10}, respectively, and prior weights {0.2, 0.2, 0.6}. The probability of a new pixel observation belonging to one of the distributions can be seen as a sum of the three Gaussian distributions [25].

Thus, the probability of a pixel belonging to the mixture is the sum of the probabilities of belonging to each of the Gaussian distributions, which is illustrated in figure 1.4 [25].
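As an illustration, the three one-dimensional distributions given in the caption of figure 1.4 can be evaluated with a short sketch. This is a hedged example of equation (1.17) in one dimension, not the thesis implementation; each component is treated as a 1-D Gaussian with the listed mean, variance and prior weight.

```python
# Evaluating the mixture of figure 1.4: three 1-D Gaussian components with
# the means, variances and prior weights from the figure caption.  This is
# an illustration of equation (1.17) in the scalar case, not the thesis
# implementation.
import math

def gaussian_pdf(x, mean, var):
    """1-D Gaussian density, the scalar case of equation (1.17)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(x, weights, means, variances):
    """Weighted sum of the K component densities, with weights per eq. (1.16)."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Parameters from the caption of figure 1.4:
weights = [0.2, 0.2, 0.6]
means = [80.0, 100.0, 200.0]
variances = [20.0, 5.0, 10.0]
```

Sweeping `mixture_pdf` over the intensity range traces the summed curve of figure 1.4; the component at intensity 200 dominates because it carries prior weight 0.6.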
A further assumption is usually made that the different color components are independent of each other, so that the covariance matrix becomes diagonal, which is more convenient for calculations such as matrix inversion. Stauffer et al. go even further and assume that the variances are identical, implying for example that deviations in the red, green and blue dimensions of the color space have the same statistics. While such simplifications reduce the computational complexity, they have certain side effects, which are discussed in the following sections.

The most general solution to the foreground segmentation problem can be briefly formulated as follows: at each sample time t, the most likely distribution k is estimated from the observation X, along with a procedure for demarcating the foreground states from the background states. This is done as follows. A match is defined as an incoming pixel lying within J times the standard deviation of the distribution center, where J is selected as 2.5 in [18]. Mathematically, the portion of the Gaussian distributions belonging to the background is determined by

B = \arg\min_b \left( \sum_{k=1}^{b} \omega_k > H \right),    (1.18)

where H is a predefined parameter and \omega_k is the weight of distribution k. If a match is found, the matched distribution is updated as

\omega_{k,t} = (1 - \alpha)\,\omega_{k,t-1} + \alpha,    (1.19)
\mu_t = (1 - \rho)\,\mu_{t-1} + \rho X_t,    (1.20)
\sigma_t^2 = (1 - \rho)\,\sigma_{t-1}^2 + \rho (X_t - \mu_t)^T (X_t - \mu_t),    (1.21)

where \mu, \sigma^2 are the mean and variance, \alpha, \rho are the learning factors and X_t is the incoming RGB value. The mean, variance and weight factors are updated frame by frame. For the unmatched distributions, the weight is updated according to

\omega_{k,t} = (1 - \alpha)\,\omega_{k,t-1},    (1.22)

while the mean and the variance remain the same. If none of the distributions is matched, the one with the lowest weight is replaced by a new distribution with the incoming pixel value as its mean, a low weight and a large variance.

1.2.6 Kernel Density Model

In [26], it was discovered that the histogram of a dynamic background in an outdoor environment covers a wide spectrum of gray levels (or intensity levels of the different color components), and that all these variations occur within a short period of time, e.g. 30 seconds. Modeling such a dynamic background scene with a limited number of Gaussian distributions is not feasible.

In order to adapt quickly to the most recent information in an image sequence, a kernel density function can be used for background modeling, which only uses a recent history of past values to distinguish foreground from background pixels. Given a history of past values x_1, x_2, ..., x_N, a kernel density function can be formulated as follows. The probability of a new observation having the value x_t can be calculated using the density function

\Pr(x_t) = \frac{1}{N} \sum_{i=1}^{N} K(x_t - x_i).    (1.23)

What this equation indicates is that a new background observation can be predicted by the combination of its recent past history samples. If K is chosen to be a Gaussian distribution, the density estimate becomes

\Pr(x_t) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \, e^{-\frac{1}{2}(x_t - x_i)^T \Sigma^{-1} (x_t - x_i)}.    (1.24)

Under similar assumptions as for the mixture of Gaussians approach, if the different color components are independent of
each other, the covariance matrix \Sigma becomes

\Sigma = \begin{pmatrix} \delta_1^2 & 0 & 0 \\ 0 & \delta_2^2 & 0 \\ 0 & 0 & \delta_3^2 \end{pmatrix}    (1.25)

and the density estimate reduces to

\Pr(x_t) = \frac{1}{N} \sum_{i=1}^{N} \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\delta_j^2}} \, e^{-\frac{1}{2} \frac{(x_{t_j} - x_{i_j})^2}{\delta_j^2}}.    (1.26)

From this definition of the probability estimate, foreground/background classification can be carried out by checking the probability value against a threshold: if \Pr(x_t) < th, the new observation cannot be predicted from its past history and is thus recognized as a foreground pixel. Kernel density estimation generalizes the idea of the Gaussian mixture model, in that each of the N past samples is considered a Gaussian distribution by itself; thus it can also handle multi-modal background scenarios. Since the probability calculation only depends on the N past values, the algorithm adapts quickly to dynamic background scenes.

Regarding hardware implementation complexity, the kernel density model needs to store N past frames, which makes it a memory-intensive task. The calculation of the probability in equation (1.26) is also costly. In [26], a look-up table is suggested to store precalculated values for each x_t − x_i, which further increases the memory requirements.

1.2.7 Summary

A wide range of segmentation algorithms has been discussed, each with its own robustness to different situations and each with different computational

Table 1.1: Algorithm Comparison.

                        FD      Median   LPF      MoG      KDE
Algorithm performance   fast    fast     medium   medium   slow
Memory requirement